A Bison Tutorial: Do We Shift or Reduce?

HOWTOs

by Jack Dennon

on February 7, 2002

A shift-reduce conflict is the result of an ambiguity in the grammatical specification of a language, in our case, a programming language. The terms "shift" and "reduce" are explained in the course of this article.

For an example of grammatical specification, turn to page 201 of the book Principles of Compiler Design by Aho and Ullman and consider their two-function calculator that can add (+) and multiply (*). Here we have numbered the grammar rules that define the syntax for that calculator.

expr : expr '+' term
| term
term : term '*' factor
| factor
factor: '(' expr ')'
| id

An "id" is a numeric data value that has been entered by the user. When the user enters the number 16, for example, that creates a token type of id with a token value or semantic value of 16. A factor can be an id, that is, a number, or (|) it can be an expression enclosed in parentheses. A term can be a factor, or it can be a term followed by an asterisk followed by a factor. An expression can be a term, or it can be an expression followed by a plus sign followed by a term.

These six rules define the grammar of a two-function calculator designed to process input strings such as

16 + 5 * 10 \n

When the calculator finds the newline character, we will arrange for it to print the result of the indicated operations.

The Parsing Table

Grammar rules are the input data for the compiler--in our case, Bison--that creates a parsing table from the grammar rules. The parsing table defines a state machine that says what to do next at each step, as we work our way through an input string such as

id * id + id \n

where id represents some number that the user has entered. The newline character (\n) represents the end of the input stream.

In the parsing table below, E means expr, T means term and F stands for factor. A blank or empty space in the parse table represents an error exit. Column headings in the Action portion of the table are called terminal symbols. A terminal symbol is like an atom; it cannot be subdivided. Terminal symbols are also called tokens. Column headings in the Goto portion are called nonterminal symbols. A nonterminal is like a molecule; it is a grouping of terminal and nonterminal symbols.

State               Action                     Goto
_____   _______________________________     ___________      
        id    +     *     (     )    \n     E    T    F

    0   s5               s4                 1    2    3
    1        s6                     acc
    2        r2    s7          r2    r2
    3        r4    r4          r4    r4
    4   s5               s4                 8    2    3
    5        r6    r6          r6    r6
    6   s5               s4                      9    3
    7   s5               s4                          10
    8        s6               s11
    9        r1   s7           r1    r1
   10        r3   r3           r3    r3
   11        r5   r5           r5    r5

At the outset, we set state 0 on our stack and look ahead at the input stream. State 0 in our parsing table, with a look-ahead token of id, says s5, meaning shift and cover the stack with 5. To shift means to move the next input token onto the stack. The first two steps show what happens when we do that.

     Stack                                Input stream

     0                                    id * id + id \n 
     0 id 5                                  * id + id \n

Consulting the parse table for State 5 with a look-ahead of *, we find it says r6, meaning that we are to reduce using grammar rule 6. Reduce means apply the grammar rule to the items at the top of the stack. Reducing according to grammar rule 6, we replace id 5 on the stack with F and have

     0 F                                      * id + id \n

Now there is no state at the top of the stack; this tells us to look at the Goto table for state 0, column F, where we find that we are to cover the stack with state 3. This gives us

     0 F 3                                    * id + id \n

State 3 with a look-ahead of * says to reduce using rule 4. Rule 4 says replace F 3 with T and have

    0 T                                       * id + id \n

This stack tells us to look at the Goto table for state 0 at column T. There we find we are to cover the stack with state 2.

    0 T 2                                     * id + id \n

State 2 with a look-ahead token of * says shift and cover the stack with 7, giving us

    0 T 2 * 7                                    id + id \n

State 7 with a look-ahead of id says shift and cover the stack with 5.

    0 T 2 * 7 id 5                                  + id \n

State 5 with a look-ahead of + says reduce using rule 6. Rule 6 says replace id 5 with F.

    0 T 2 * 7 F                                      + id \n

This stack now tells us to look at the Goto table for state 7 at column F. There we find that we are to cover the stack with state 10.

    0 T 2 * 7 F 10                                   + id \n

State 10 with a look-ahead token of + says to reduce using rule 3. Rule 3 says to replace T * F with the product T, that is, remove two values from the stack, multiply them together and put the product back on the stack.

    0 T                                               + id \n

This stack tells us to look at the Goto table for state 0 at column T. There we find we are to cover the stack with state 2.

    0 T 2                                              + id \n

State 2 with a look-ahead of + says to reduce using rule 2. Rule 2 says to replace T 2 with E.

   0 E                                                 + id \n

This stack tells us to look at the Goto table for state 0 at column E. There we find we are to cover the stack with state 1.

   0 E 1                                               + id \n

State 1 with a look-ahead of + says shift and cover the stack with state 6.

   0 E 1 + 6                                             id \n

State 6 with a look-ahead of id says shift and cover the stack with state 5.

   0 E 1 + 6 id 5                                           \n

State 5 with a look-ahead of \n says reduce using rule 6. Rule 6 says replace id 5 with F.

   0 E 1 + 6 F                                              \n

This stack tells us to look at the Goto table for state 6 at column F. There we see that we are to cover the stack with state 3.

   0 E 1 + 6 F 3                                            \n

State 3 with a look-ahead of \n says reduce using rule 4. Rule 4 says replace F 3 with T.

   0 E 1 + 6 T                                              \n

This stack tells us to look at the Goto table for state 6 at column T. There we find we are to cover the stack with state 9.

   0 E 1 + 6 T 9                                            \n

State 9 with a look-ahead of \n says to reduce using rule 1. Rule 1 tells us to replace E + T with E, that is, remove two numbers from the stack, add them together and then put the sum on the stack.

   0 E                                                      \n

This stack tells us to look at the Goto table for state 0 at column E. There we see that we are to cover the stack with state 1.

   0 E 1                                                    \n

State 1 with a look-ahead of \n tells us to accept the result, we are done; it is time to print the result.

Tedious, isn't it? All this for a grammar that is almost too rudimentary to be interesting. The fundamental steps, however, are the same, no matter how elaborate the grammar may be.

The grammar for this two-function calculator was specifically designed to give multiplication (*) a higher precedence than addition (+). To prove that it does, try manually parsing the input string

id + id * id \n

such as we did above. The calculator should evaluate this input as though it had been written id + (id * id)\n.

A shift-reduce conflict occurs, therefore, when the grammar rules are ambiguous such that the parsing table created from those rules has more than one answer for the question, what do we do next?

Rules of precedence provide one way to resolve some types of ambiguities. Aho and Ullman discuss such matters at some length, and they also discuss a number of techniques for creating parsing tables for a given set of grammar rules. But rather than discuss more theory, let's get down to the nitty-gritty.

An Actual Bison Grammar File

The following text is an actual Bison grammar file for a two-function calculator patterned after the previous example.

Each rule in the grammar file can have associated with it an action, which is C code enclosed in braces. When the parser reduces using a grammar rule, it executes the action associated with that rule.

This real grammar file requires two additional nonterminal symbols that did not appear in our earlier example. The nonterminal named line is defined as an expression followed by a newline, and its associated action is used to print out the value of the expression. The purpose of line is to get the answer printed.

For the case where the user just presses Enter, the production for <line> has been defined such that it can also be nothing followed by a newline. The calculator will work after we define this one new nonterminal. It will work until the user enters a newline alone, then it will get a parse error. So what we do is define yet another nonterminal to catch the result of entering only a newline.

The nonterminal symbol <input> defines our start symbol; it is either a line or it can be "nothing," to catch nothing followed by newline. In either case, we accept the parse as complete and error-free.

/* twofunc2.y - a two function calculator */
%{
#define YYSTYPE double  /* yyparse() stack type */
%}

/* BISON Declarations */
%token ID

/* Grammar follows */
%%
input:          /* empty string */
        | input line
        ;

line:     '\n'
        | expr '\n'             { printf("\t%.10g\n",$1); }
        ;

expr:     expr '+' term         { $$ = $1 + $3; }
        | term
        ;
term:     term '*' factor       { $$ = $1 * $3; }
        | factor
        ;
factor:   '(' expr ')'          { $$ = $2; }
        | ID
        ;

%%
/*--------------------------------------------------------*/
/* Additional C code */

/* Error processor for yyparse */
#include <stdio.h>

int yyerror(char *s)    /* called by yyparse on error */
{
        printf("%s\n",s);
        return(0);
}

/*--------------------------------------------------------*/

/* Lexical analyzer returns a double floating point
   number on the stack and the token ID, or the ASCII
   character read if not a number. Skips all white space
   and returns 0 for EOF. */

#include &ltctype.h>

int yylex(void)
{
        int c;
        
        /* slip white space */
        while ( (c=getchar()) == ' ' || c == '\t')
                ;
                
        /* process numbers */
        if (c == '.' || isdigit(c)) {
                ungetc(c,stdin);
                scanf("%lf",&yylval);
                return(ID);
        }
        
        /* check EOF */
        if (c == EOF)
                return 0;
                
        /* return single character as its own token type */
        return c;
}
/*--------------------------------------------------------*/

/* The controlling function */

int main(void)
{
        yyparse();
        exit(0);
}
/*--------------------------------------------------------*/

Layout of a Bison Grammar File

All Bison grammar files have the form of the example above. The two-character strings %% , %{ and %} are punctuation marks that appear in every Bison grammar file to separate the sections. The general form of all Bison grammar files can be summarized as follows:

        %{
        C declarations
        %}
        Bison declarations
        %%
        Grammar rules
        %%
        Additional C code

From the Bison Manual:

The C declarations may define types and variables used in the actions. You can also use preprocessor commands to define macros used there, and use

email: jdennon@seasurf.net

Load Disqus comments