lex and yacc: Tools Worth Knowing

Today, computers can talk and they can listen—but how often do they do what you want?
The Tools

The program lex is a lexical analyzer. Lexical analysis is the recognition of words in a language. As used in this particular application, lex, or more specifically flex, is used to recognize characters forming the names of log curves, arithmetic operators and algebraic groupings.

flex is a particular example of the lexical analysis programs available for UNIX systems and is the implementation delivered with Linux distributions. The original implementation was done by Mike Lesk and Eric Schmidt at Bell Laboratories and is documented in the book lex & yacc by John R. Levine, Tony Mason & Doug Brown, O'Reilly & Associates, Inc., 1992.

yacc is a language parser. It accepts word items and, given a list of rules describing how these items form larger entities, deduces which items or combinations of items satisfy a rule. This can also be thought of as grammatical analysis. Once a rule is satisfied, a programmer's code is applied to the result.

In the case of English, the words in a sentence can be assigned grammatical types such as noun, verb, adjective and so on. Particular combinations of words form more complex units and these in turn can be described as complete sentences.

For example, the sentence “The lazy dog lay in the sun,” is composed of an article “the”, a preposition “in”, adjective “lazy”, nouns “dog, sun” and a verb “lay”. Combinations of these grammatical items form more complex entities such as noun phrases “The lazy dog” and “in the sun”. The first noun phrase is the subject of the sentence, and the second, in combination with the verb, forms the predicate. Together they form a complete sentence.

It is possible to form parsers for the English language, although given English's many idiosyncrasies, yacc may prove to be inadequate for the task. It may also be that the yacc programmer may become exhausted in trying to describe all the nuances of the language.

yacc was originally developed to provide a mechanism to develop compilers, but it could just as easily be used to create interpreters. For example, BASIC is often an interpreted language and could easily be described by a yacc grammar. Once yacc understood a particular line of BASIC code, it could cause the execution of the equivalent instructions in the native language of the host computer.

Some Linux distributions provide a choice of yacc programs. One can install either (or both) Berkeley yacc or the GNU bison program. You'll probably find them in /usr/bin. They are quite similar; bison was originally derived from yacc, although there has been some divergence over the years.

The combination of lex, yacc and some programmer's C code provides a complete means to interpret and act upon a user's wishes. The lex program uses its regular expression interpretation capability to recognize strings of characters as words or tokens. (The term “words” is used loosely to describe any recognized string of characters.) Once a token is identified, it is passed to yacc where the various rules are applied until some combination of tokens form a recognizable structure to which yacc applies some pre-programmed C code.

How The Tools Are Used

As indicated, lex uses regular expressions to recognize strings of characters as items of interest. Regular expressions are composed of special characters which describe acceptable combinations of characters.

For example, regular expressions often use the character . (period) to indicate that any character except a newline (\n) is acceptable.

Similarly, the characters [ and ] (square brackets) are used to indicate acceptance of any of the characters enclosed within them or within the range of characters described between them. For example, the expression [abc] says to accept any of the characters a, b or c; the expression [a-c] says the same thing. A more complicated example might be [a-zA-Z0-9] which says to accept any alphanumeric character.

For a complete description of lex regular expression syntax, see lex & yacc by Levine, Mason and Brown (O'Reilly, 1992).

Once a regular expression matches the text stream being interpreted by lex, code created by the programmer is executed. When used with yacc, this code generally amounts to passing an indication of what was just recognized to yacc for further processing. The indication is a token that yacc knows about, and in fact, these are defined in the yacc portion of the analyzer/parser program so that they are common to both lex and yacc.

Also as indicated, yacc uses a grammar description to decode the meaning of the tokens that lex passes to it. As tokens are passed to yacc, it applies its rules until a single token, or some sequence of tokens, becomes a recognizable structure.

Before a programmer's C code is executed, though, yacc may require several structures or token-structure combinations to be recognized. For example, using our sample sentence, our rules might look like the following:

sentence  :  subject + predicate
{...execute some C code...}
subject       :  noun
              |  noun_phrase
predicate     :  verb + noun_phrase
noun_phrase   :  preposition + adjective + noun
              |  adjective + noun

The first rule says that a sentence is made up of two parts: a subject followed by a predicate. If that rule is satisfied, then the C code between the curly brackets will be executed. To satisfy the first rule, yacc has to recognize both a subject and a predicate. The subsequent rules help yacc to do just that.

For example, the second rule says that a subject is recognized when either a noun or a noun phrase is recognized. A noun is the smallest unit that yacc deals with, and in the yacc grammar, a noun is a token that yacc will want to have lex recognize. Thus, somewhere in the yacc program, a token will be defined (probably called NOUN) that lex and yacc will use to communicate the fact that a noun has been interpreted. How this is done we will see shortly.

Notice that a noun phrase is also used to create the predicate. If a verb is recognized and it is followed by a noun phrase, the predicate is identified. If only the noun phrase is identified, then the subject has been identified.

The example cited is not in yacc syntax, but is meant to provide understanding. Very detailed examples may be found in the references.

You may be wondering how the yacc parser actually works. yacc works as a finite-state machine, and it has a stack (think of this as a memory of what has been seen, as it tries to deduce what the incoming stream of tokens represents).

A finite-state machine records its current condition as each recognizable item is interpreted. For example, as a noun phrase is being interpreted, it moves from state 3 when it receives a preposition to state 4 when the adjective is interpreted and finally to state 5 when the noun is recognized. When the entire phrase has been recognized, it switches to another state, perhaps 37, to note that fact. Please do not attach any particular meaning to the numbers used in this example. They have been chosen arbitrarily to demonstrate how yacc progresses as it interprets the tokens received from lex. You should conclude only that to reach state 5 (noun phrase), yacc must progress through several preceding states, each of which might lead to another final state, depending on the grammar yacc is using.

In other words, given its current state, yacc requests from lex the next token (if it needs it) and places onto the stack its new state. In doing so, it may push the new state onto the stack (as when interpreting the noun phrase), or pop the old state off the stack, replacing it with a new state (as when the noun phrase is completely recognized). These actions are called “shift” and “reduce” and describe pushing and popping states to and from the stack.

When the sentence is finally recognized, yacc accepts it and returns to the calling program (the main program which invoked yacc and indirectly lex). For a complete description of how a yacc parser works, see Inside Xenix by Christopher Morgan, Howard W. Sams and Co., 1986. This reference describes yacc grammars and how yacc parses its input in exquisite detail.