A Preview on Lex and Yacc

What Do Lex and Yacc Do?

Lex (Lexical Analyser) and Yacc (Yet Another Compiler Compiler) are two tools that help you write programs that process input. This covers everything from grep-like programs to compilers. (Lex and Yacc also have GNU derivatives called "flex" and "bison" that for our purposes work identically.) The task of processing text is basically split as follows: Lex scans and breaks the input into tokens (chunks or words), while Yacc defines a grammar that gives the legal sequences of these tokens (i.e. parses them). Both tools work from a specification file that you provide, and from it each generates C source code.

A Lex Example

%{
/* a sample bit of code */
%}

ws    [ \t]
nonws [^ \t\n]

%%

	int cc = 0, wc = 0, lc = 0;

{nonws}+    cc += yyleng; ++wc;
{ws}+       cc += yyleng;
\n          ++lc; ++cc;
<<EOF>>  {
    printf( "%8d %8d %8d\n", lc, wc, cc );
    yyterminate();
    }

All this lex specification does is count characters, words and lines. The first section (the part before the "%%" marker) holds the definitions. Any material placed between the "%{" and "%}" markers is copied directly into the final C program, so you can write any legal C code here. You can include comments in the lex specification outside of these markers, but they must be indented or Lex will attempt to read them as lex rules.

The definitions section also helps us make our later code more concise. Each definition is a name followed by a regular expression. So we have defined "ws" (whitespace) as a blank or tab character, and "nonws" (non-whitespace) as anything but a blank, tab or newline.

The next section is the rules. First some variables are initialised: "cc" is the character count, "wc" the word count and "lc" the line count. Each rule is then a pattern (a regular expression), followed by whitespace and then an action which tells Lex what to do when that pattern is matched. So if one or more "nonws" characters are read, "cc" is increased by "yyleng" and "wc" is increased by 1. If an end-of-line is encountered, "lc" and "cc" are increased by 1. The final rule says that when the end of file is encountered, the C code in braces is fired up, which prints out the values of the variables. Note that an action can be any legal C code.

So what's "yyleng"? yyleng is a lex global variable that always holds the length of the token you just read in. Lex globals always begin with "yy". The other special symbols used in the regular expressions should be obvious: "<<EOF>>" end-of-file, "\t" tab, "." any single character, "[ ]" match any single character listed within the brackets, "[^ ]" match any single character not listed within the brackets, "\n" newline.
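
As a quick, hedged illustration (a made-up rule, not part of the word counter above), yyleng can be used in an action like:

[0-9]+      printf( "saw a number %d digits long\n", yyleng );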

After the rules section there can be another section called "user subroutines", which is any legal C code to be copied in after the Lex-generated code. In this case it isn't strictly necessary, since it's obvious when our input is finished (at the end of file), but it would be entirely equivalent to append this code:

%%

int main(void)
{
   yylex();
   return 0;
}

"yylex" is of course the scanning routine generated by Lex. This final section is seperated from the rules section by, yup, "%%".

Compiling Lex

... will naturally depend on the vagaries of your local system and software versions. Lex spec files are usually named to end in ".l", so a typical compile cycle would be:

> lex myfirst.l
> cc lex.yy.c -ll

where "ll" is the lex library.

Other patterns and variables

We can of course have other patterns recognized like :

  • [a-zA-Z]+ : one or more letters, each either lower case (a to z) or upper case (A to Z).
  • [^abc] : anything but a, b or c.
  • . : any single character but newline.
  • (wxy) : groups symbols together for other operations.
  • (01)? : zero or one occurrence of 01.
  • | : or, to allow multiple patterns and choices.
  • "xyz" : match xyz exactly.
  • "^xyz" : match xyz at beginning of line.
  • "xyz$" : match xyz at end of line.

If you want to fire up several lines of code as an action, enclose them in braces. The Lex global "yytext" always contains the text of the token you just read in.
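
For example (a hedged sketch with a made-up pattern), a multi-statement action wrapped in braces might look like:

[a-zA-Z]+  {
    printf( "read the word \"%s\"\n", yytext );
    printf( "it was %d characters long\n", yyleng );
    }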

Building Symbol Tables

Often what we'd like to do is read the input and build a list of names as we go along - like a list of constants or variables. For example :

    /* definitions section */
    
%{

enum {
   LOOKUP = 0,
   NODE,
   INPUT,
   FINAL,
   START
};

short state;

short add_word     ( short type, char *word );
short lookup_word  ( char *word );
%}

%%

    /* rules section */
    
\n        { state = LOOKUP; }

^defnode  { state = NODE; } 
^definput { state = INPUT; } 
^deffinal { state = FINAL; } 
^defstart { state = START; } 

[a-zA-Z]+ {
    if (state != LOOKUP)
       add_word(state,yytext);
    else
    {
        switch(lookup_word(yytext))
        {
           /* read in name and do something */
           
           case NODE: ...
           case INPUT : ...
           .
           .
           .
           /* what if it's mispelt or undeclared? */
           
           default : printf ("%s: Unknown token!\n", yytext);
        }
    }
}

.     ;     /* anything else - just ignore, action is null */

%%

           /* down here we would have our "main()" and    */
           /* the functions "add_word" and "lookup_word"  */
           /* that would maintain lists of words and      */
           /* search through them. As long as they do     */
           /* their job it doesn't matter what they look  */
           /* like. You can do that stuff right?          */

So this means that you could have an input file like :

defnode alpha beta gamma
definput able baker
defnode zeta ...

which means you can read in the list of tokens you're looking for.
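
For completeness, here is a minimal sketch of what "add_word" and "lookup_word" might look like, using a simple linked list. The struct layout and names are just one possibility, and it assumes the enum from the definitions section above (so it would live in the user subroutines section of the same spec):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct word {
    char *name;                         /* the word itself */
    short type;                         /* NODE, INPUT, FINAL or START */
    struct word *next;
};

static struct word *word_list = NULL;   /* head of the list */

short lookup_word ( char *word )
{
    struct word *w;

    for ( w = word_list; w != NULL; w = w->next )
        if ( strcmp(w->name, word) == 0 )
            return w->type;             /* found it: return its type */
    return LOOKUP;                      /* not found */
}

short add_word ( short type, char *word )
{
    struct word *w;

    if ( lookup_word(word) != LOOKUP ) {
        printf( "%s: already defined\n", word );
        return 0;
    }
    w = malloc( sizeof(struct word) );
    w->name = strdup( word );
    w->type = type;
    w->next = word_list;                /* push onto the front */
    word_list = w;
    return 1;
}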

Yakkity Yacc

So why do we need Yacc? In a lot of cases we don't: the grammars are simple enough that lex and a bit of judicious C code can handle them. But once your grammar gets complicated, it helps to use Lex to clean up your input and hand the relevant tokens to Yacc, which will parse them into a meaningful form.

So how do we get Lex and Yacc to work together? Firstly, they both have to agree on what the tokens are, which we arrange by letting yacc create a file "y.tab.h" that contains the token definitions. This will look like:

/* y.tab.h - file of token defns */

#define NODE    101
#define INPUT   102
#define FINAL   103
#define START   104
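
You don't write this file by hand: running yacc with the -d flag (shown below) generates it from the %token declarations in your grammar file, along the lines of:

%token NODE INPUT FINAL START

(The exact numbers yacc picks may differ from those shown, but that doesn't matter as long as both Lex and Yacc include the same y.tab.h.)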

Note that tokens are never defined as equal to 0 - Yacc reserves 0 for the logical end of input. In any event, our Lex scanner now just chunks and checks the input before passing it on to Yacc, and it looks like:

%{

#include "y.tab.h"    /* we need to know what the tokens are */

#define LOOKUP   0    /* this is OK, Yacc reserves 0, not Lex */

short state;

short add_word     ( short type, char *word );
short lookup_word  ( char *word );
%}

%%
    
\n        { state = LOOKUP; }

^defnode  { state = NODE; } 
^definput { state = INPUT; } 
^deffinal { state = FINAL; } 
^defstart { state = START; } 

[a-zA-Z]+ {
    if (state != LOOKUP)
       add_word(state,yytext);
    else
    {
        switch(lookup_word(yytext))
        {
           /* read in name and if OK pass to Yacc */
           
           case NODE:
              return NODE;
              break;
           case INPUT :
              return INPUT;
              break;
           .
           .
           .
           /* what if it's mispelt or undeclared? */
           
           default : printf ("%s: Unknown token!\n", yytext);
        }
    }
}

\.\n    { state = LOOKUP; return 0; }  /* new rule ! */

.     ;     /* anything else - just ignore, action is null */

%%

           /* same as before here ... */

So there are 3 main differences: (1) the token definitions are in another file so Yacc and Lex can agree on them, (2) our scanner doesn't try to do anything with the tokens - it just checks that they're legit and passes them on to Yacc, or complains if they're not, (3) we have a new rule that recognises the end of a sentence (a period at the end of a line) and signals Yacc that it's the end of input by returning 0. To make it clear: when you use Yacc and Lex together, the Yacc parser calls the Lex scanner whenever it wants the next token, and whatever Lex returns goes straight to Yacc. We don't return anything from the rule that defines new words, because Yacc doesn't need to know about the definitions, only that the words looked up later are legal.
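
If you want to watch the token stream before involving Yacc at all, one purely illustrative trick is to drive the scanner yourself with a throwaway main in the user subroutines section:

int main(void)
{
   int tok;

   /* yylex() returns the next token code, or 0 at the end of input */
   while ( (tok = yylex()) != 0 )
      printf( "got token number %d\n", tok );
   return 0;
}

Each number printed will be one of the values defined in "y.tab.h".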

So What's the Yacc Parser Look Like?

Uh, right. Well, that depends on what you want to do with your input, but let's say we want something that reads in English words and checks that they form a sentence:

%{
/* mystuff.y - my Yacc parser */

#include <stdio.h>
%}

     /* need to know what tokens to create! */
     
%token NOUN PRONOUN VERB ADVERB ADJECTIVE PREPOSITION CONJUNCTION

%%

sentence:  subject VERB object { printf("Sentence is valid.\n"); }
   ;
   
subject :  NOUN
   |       PRONOUN
   ;
   
object :   NOUN
   ;
   
%%

extern FILE *yyin;

int main(void)
{
   while( !feof(yyin) )
      yyparse();
   return 0;
}

int yyerror ( char *s )
{
   fprintf(stderr, "%s\n", s);
   return 0;
}

No prizes for working this out - it's sectioned up just like a lex file, with identical rules for including comments and code. "%token" gives a list of the tokens (terminals!) the parser can expect to be handed to it from lex. Convention suggests that token names are upper case and the Yacc constructs (non-terminals!) are lower case. The rules are fairly simple to work out: each is a name, followed by a colon, followed by the sequence of tokens or non-terminals it is made up of, and an optional action between braces "{ ... }". We can use the "|" (or) symbol to give alternatives. Looking at the last section, we can see that the Yacc routine "yyparse" is called repeatedly, once per sentence, since lex returns 0 at the end of each sentence and Yacc treats 0 as the end of input. "yyerror" is called by Yacc whenever it can't handle something. yyin is of course a global pointing to our input file. And yacc definitions files usually end in ".y".
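
One useful pattern the example doesn't show is recursion, which is how Yacc grammars handle lists of things. A hedged sketch (extending the "object" rule is just one possible choice):

object :   NOUN
   |       object CONJUNCTION NOUN
   ;

This says an object is a noun, or an object followed by a conjunction and another noun, so a single noun or any conjunction-separated list of nouns will parse. Yacc prefers left recursion like this (the rule referring to itself first) because it keeps the parser's stack shallow.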

How Do I Run Yacc?

Like I said above, this depends on your local setup. But:

> lex mystuff.l             // makes "lex.yy.c"
> yacc -d mystuff.y         // makes "y.tab.c" and "y.tab.h"
> cc -c lex.yy.c y.tab.c    // compile C files
> cc lex.yy.o y.tab.o -ll   // link object and library files
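
The GNU equivalents follow the same shape (bison's -y flag asks for yacc-compatible file names):

> flex mystuff.l
> bison -y -d mystuff.y     // makes "y.tab.c" and "y.tab.h"
> cc -c lex.yy.c y.tab.c
> cc lex.yy.o y.tab.o -lfl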

How About Another Lex Example?

Sure. Here's an enhanced word counter :

/* Somewhat faster still: potentially match a lot of text with each rule */

ws    [ \t]
nonws [^ \t\n]
word  {ws}*{nonws}+
words {word}{ws}+

%%
	int cc = 0, wc = 0, lc = 0;

{word}{ws}*		cc += yyleng; ++wc;
{word}{ws}*\n		cc += yyleng; ++wc; ++lc;
{words}{word}{ws}*	cc += yyleng; wc += 2;
{words}{2}{word}{ws}*	cc += yyleng; wc += 3;
{words}{3}{word}{ws}*	cc += yyleng; wc += 4;

{ws}+			cc += yyleng;

\n+			cc += yyleng; lc += yyleng;

<<EOF>>		{
		printf( "%8d %8d %8d\n", lc, wc, cc );
		yyterminate();
		}
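
Compiling and running it follows the same cycle as before (the file name "mycount.l" is just an assumption):

> lex mycount.l
> cc lex.yy.c -ll
> a.out < some_file

and the three numbers printed are the line, word and character counts, in the same order "wc" uses.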

Last revised on August 02, 2001 20:46:04 +0300.