The yacc tool _________________________________________________________________ Most Unix systems, including ours, also supply yacc, a program generator that creates a parser -- that is, a program that recognizes complex syntactic constructions and performs specified actions when such constructions are completed. In a compiler, the action usually involves adding a component to a data structure that internally represents the structure of a program; in an interpreter, the action might be to execute an instruction or evaluate an expression that the parser recognizes. A parser constructed by yacc relies on the presence of a lexical analyzer to supply it with tokens; instead of invoking getchar() or some similar function to pick up input one character at a time, a yacc-derived parser invokes yylex() to pick up input one token at a time, and recognizes patterns in the token stream. The lexical analyzer can be one that is produced by lex -- lex and yacc were designed by the same people and intended to work cooperatively -- or it may simply be a function hand-coded by the programmer. In any case, it should be implemented as a function named yylex() that returns an integer value representing the next token in the input, or zero to indicate that the end of the input has been reached, and sets a global variable yylval of type int to indicate which particular instance of some token-category (``identifier,'' ``constant,'' etc.) was found. The yacc program allows the programmer to specify the syntactic constructions she is looking for by writing grammar rules. A grammar rule consists of the name of a syntactic category -- the type of syntactic construction that yacc is supposed to recognize -- followed by a colon, followed by an analysis, or possible structure, for items of that category. The analysis is simply a sequence of zero or more category names, token names, and token literals; the idea is that an instance of the syntactic category to the left of the colon can consist of a substructure for each of the category names to the right of the colon, together with specific tokens for each of the token names and token literals to the right of the colon. For instance, in the rule STATEMENT : iftoken '(' EXPRESSION ')' STATEMENT STATEMENT and EXPRESSION are category names, iftoken is a token name, and '(' and ')' are token literals. The rule says that a statement can be made up of an ``if-token'' (presumably a token corresponding to the keyword if or some equivalent string of characters), a left parenthesis, a substructure of the EXPRESSION category, a right parenthesis, and a substructure of the STATEMENT category. Note that the analysis can include a reference to the same category that is being analyzed, so that syntactic categories can be defined recursively. (Of course, this means that if the process is to terminate, there must also be some other analysis for the STATEMENT category that does not involve recursion.) The heart of an input file for yacc consists of a sequence of one or more such grammar rules. It is common for one syntactic category to have two or more analyses, and in such cases the rules are usually combined by replacing the left-hand side and colon of all but the first analysis with a vertical bar: STATEMENT : iftoken '(' EXPRESSION ')' STATEMENT | whiletoken '(' EXPRESSION ')' STATEMENT | fortoken '(' EXPRESSION ';' EXPRESSION ';' EXPRESSION ') STATEMENT | EXPRESSION ';' ; A rule may be accompanied by an action, to be taken when the recognition of an instance of the given syntactic category has been completed. The action is specified as a compound statement in C, enclosed in braces. The action may include invocation of user-defined functions, and indeed any kind of operation that can be expressed in C. (If no action is provided for a given rule, a semicolon should be placed at the end of the last alternative, as in the example just shown, to indicate the end of the rule.) In addition, each of the category names, token names, and token literals that occur in a rule has an integer value, and these values can be referred to in actions. An action can set the value of the syntactic category name that is being analyzed by executing a C assignment statement that has the special symbol $$ on its left-hand side. ($$ is simply shorthand for ``the value of the category name to the left of the colon.) An action can use the values of the names on the right-hand side of the rule by placing the special symbols $1, $2, $3, $4, ... in the code as if they were expressions of type int; $1 refers to the value of the first category name, token name, or token literal on the right-hand side of the rule, $2 to the second, and so on. For instance, in the rule EXPRESSION : '(' EXPRESSION ')' { $$ = $2; } the action ensures that the value associated with a parenthesized EXPRESSION construction is the same as the value associated with the unparenthesized EXPRESSION. The value of a category name that appears on the right-hand side of the rule is presumably set by an action in some other rule. The value of a token name or token literal is whatever was contained in the global variable yylval at the time the token was released by the lexical analyzer. In addition to grammar rules, a yacc input file may contain definitions that specify names for tokens and give precedence and associativity information about operators occurring in expressions. The most common kinds of definitions begin with one of the following indicators (starting in column 1): * %token -- the names that follow this indicator are understood to be token names. If such a name is followed by an integer, that integer is understood to be the int that identifies the token; otherwise, int values are supplied sequentially, beginning with 257. * %start -- the category name that follows is the ``start symbol'' for the parser, that is, the syntactic category that describes the entire input file (typically, PROGRAM). * %left -- indicates that the token name or literal that follows it is an operator that associates to the left. * %right -- indicates that the token name or literal that follows it is an operator that associates to the right. * %nonassoc -- indicates that the token name or literal that follows it is a non-associative operator that should not occur twice in a row within an expression. To be precise, a yacc input file consists of zero or more definition lines, a line containing just the character pair %%, and one or more rules. An additional line containing just the character pair %% can be appended at the end of the rules; it is optional. As in lex, one can also insert C code in the definition section of a yacc input file by preceding it with a line containing only the character pair %{ and following it with a line containing only the character pair %}. This is the conventional way of including global variable declarations and #include directives. Similarly, if there is a second %% line in the yacc input file, anything that follows it is assumed to be a programmer-supplied declaration and is inserted at the end of the output program, again as a global declaration. The parsing function created by yacc is called yyparse(); it takes no arguments and returns an integer -- 0 for a successful parse of the entire input file, a conventional error code otherwise. The parser can be built to stand alone (in which case the parser actions will include any output that the parser is to generate), but it is more usual for the programmer to create her own main program that invokes yyparse() at the appropriate point. Different implementations of yacc give different names to the output file. The most common name is y.tab.c. On our system, however, the recommended version of yacc is the one distributed by the Free Software Foundation, /usr/local/bin/bison, which chooses the name of its output file by replacing the terminal .y in the name of its input file with .tab.c. Here is the sequence of commands one would give to generate a lexical analyzer from a lex input file proj.l, a parser from a yacc input file proj.y, and then to incorporate the lexical analyzer and parser into a program proj.c that invokes yyparse(): flex proj.l gcc -ansi -c lex.yy.c bison proj.y gcc -ansi -c proj.tab.c gcc -ansi -c proj.c gcc -ansi -o proj proj.o proj.tab.o lex.yy.o -ly -ll The -ly option links in the yacc support library, and the -ll option links in the lex support library. An extended example of the use of lex and yacc to construct a compiler for a (very small) programming language can be found on MathLAN in the directory /u2/stone/courses/software-design/examples/yacc. Read the opening comment in the file Strans.c for an overview of the project and a definition of the S programming language. _________________________________________________________________