lex and yacc: Tools Worth Knowing

Today, computers can talk and they can listen—but how often do they do what you want?
Basic Coding of lex and yacc Programs

Both tools are coded in a similar manner. There are three sections in each program: declarations, rules and user routines. Each is separated by a line containing only the two characters %%.

For yacc, the declarations section contains the tokens it can use while parsing the input. Each has a unique value greater than 256, and the set of tokens is introduced via %token at the beginning of the line. lex can use the declarations section to define aliases for items that it must recognize while looking for tokens to pass to yacc.

For example, lex needs to know about white space which, while not used in identifying tokens, must be accounted for in some way. Similarly, mathematical symbols such as + or = must be recognized. These are needed in the interpretation of the algebraic statement coded by the user.

Within the rules section, yacc holds its parsing intelligence. This is the section that contains the grammar rules referred to in the English sentence example earlier. In fact, the coding used earlier is typical of a yacc grammar: items to be recognized are separated from the means to recognize them by a colon (:), and alternative means of recognition are separated from each other via a pipe (|) symbol.

lex uses the rules section to contain the regular expressions that allow it to identify tokens to pass to yacc. These expressions may be the aliases from the declaration section, regular expressions, or some combination.

The last section contains C code which may be invoked as each of the tools processes its input.

One requirement is that the yacc tokens be known to the lex program. This is accomplished by including the following statement:

#include "y.tab.h"

in the lex declarations section and creating it when compiling the yacc program code.

Compilation is accomplished in the following way:

yacc -d yacc.file -create 'y.tab.c and y.tab.h'
flex flex.file -create 'lex.yy.c'

The -d option on yacc's command line creates the y.tab.h file needed by lex.yy.c.

How lex and yacc were employed in Log Analysis

To successfully interpret the user's desired process, the program needs to know which well logs were available for processing. This information is available in the ASCII file selected by the user. This text file contains a one-to-one correspondence between curve description and data values. A very small subset of a typical file is shown in Listing 1.

Listing 1

As can be seen, there are several sections including well information (includes some hole parameters), curve information (notes which curves are in the file) and “A” which holds the recorded data values. Each is introduced with a tilde (~). Because the format of the file is fixed by convention, these are easily parsed, and needed values are stored for subsequent processing.

As written for the client, the program is a Motif application. The user selected the file to be processed; it was read in its entirety and numeric text items were converted to double-precision values.

Besides allowing file and curve merging and editing, there is a menu item for curve processing. Upon selecting this menu item, a dialog box is presented containing a list of available curves and arithmetic operations. The user selects curve names, numeric constants and operations which in turn are displayed as an algebraic operation on a text input line. When satisfied with the mathematical operation, the user clicks OK and the lex and yacc magic occurs. The result is stored as a separate curve and can be included in subsequent operations.

lex processed the incoming algebraic statement with the code shown in Listing 2.

Listing 2

Between lines 1 and 16 are declarations to be used in the program generated by lex. In particular, you will notice the inclusion of the header file y.tab.h which contains the following definitions:

#define INTEGER 257
#define FLOAT 258
#define DOUBLE 259
#define NUMBER 260
#define VARIABLE 261
#define EQUAL 262
#define LPAREN 263
#define RPAREN 264
#define PLUS 265
#define MINUS 266
#define TIMES 267
#define DIVIDE 268
#define RAISE 269
#define LHS 270

These definitions are used by both lex and yacc to describe the items yacc expects to receive from lex. They are generated by statements 73 to 77 of the yacc source which will be examined shortly.

From lines 17 to 31 of the lex listing are declarations which amount to aliases for common items that we wish lex to recognize. For example, we declare DIGIT to be any single numeric between 0 and 9 on line 21. Doing this allows us to declare INT (an integer) to be one or more DIGIT's.

Lines 33 to 90 contain the rules by which lex interprets incoming text items. For example, on line 34 we recognize an equal sign (=) and return the token EQUAL to the caller. In y.tab.h, EQUAL is defined to be 262.

As you can see, the lex rules simply recognize text items and then advise the caller what was seen in the input stream.

Listing 3

yacc interprets the token stream passed to it by lex with the following code, only a subset of which is shown in Listing 3. The code for the yacc routine (with the calling subroutine do_arithmetic and its accessory functions) was in excess of 900 lines. For those interested, it is available for your perusal from SSC's public FTP site. Listing 3 is a sample indicating what needed to be done.

Like the lex routine, yacc begins with lines to be included in the output code. Programs written for graphical user interfaces sit in a loop waiting for the user to click on something. When the user's needs are so indicated, the GUI-based program calls a function to perform the required action. These “called functions” are popularly called callbacks. In this program, one of the callbacks was do_arithmetic, which in turn called the yacc routine, which in its turn called the lex routine.

In Listing 3, do_arithmetic is described in the first few lines, and a portion of the code may be seen in lines 428 to 532. They are shown only to give some indication of what was being accomplished.

yacc does the work with its rules section beginning at line 79, and ending at line 426. Although too long to be included completely, you can see that an equation is defined to be something called an lhs (left hand side) EQUAL rhs (right hand side) at line 80. Looking down the listing, you will see that an equation may also be described by an expr (expression). When either of these are encountered, yacc pops a key from an internal stack created by a function called push (see near line 557) and then causes a log curve to be returned to the caller by calling another function called get_curve (not shown here, but included with the yacc and lex code).

Between lines 118 and 139, you can see how some of the tokens yacc expects are processed when recognized. The complete listing has many more.