Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt

Bell Laboratories
Murray Hill, New Jersey 07974

ABSTRACT

Lex helps write programs whose control flow is directed by
instances of regular expressions in the input stream. It is well
suited for editor-script type transformations and for segmenting
input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding
program fragments. The table is translated to a program which reads
an input stream, copying it to an output stream and partitioning the
input into strings which match the given expressions. As each such
string is recognized the corresponding program fragment is executed.
The recognition of the expressions is performed by a deterministic
finite automaton generated by Lex. The program fragments written by
the user are executed in the order in which the corresponding regular
expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous
specifications and choose the longest match possible at each input
point. If necessary, substantial lookahead is performed on the input,
but the input stream will be backed up to the end of the current
partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which
can be translated automatically to portable Fortran. It is available
on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual,
however, will only discuss generating analyzers in C on the UNIX
system, which is the only supported form of Lex under UNIX Version 7.
Lex is designed to simplify interfacing with Yacc, for those with
access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of
character input streams.
It accepts a high-level, problem oriented specification for character
string matching, and produces a program in a general purpose language
which recognizes regular expressions. The regular expressions are
specified by the user in the source specifications given to Lex. The
Lex written code recognizes these expressions in an input stream and
partitions the input stream into strings matching the expressions. At
the boundaries between strings program sections provided by the user
are executed. The Lex source file associates the regular expressions
and the program fragments. As each expression appears in the input to
the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching
needed to complete his tasks, possibly including code written by
other generators. The program that recognizes the expressions is
generated in the general purpose programming language employed for
the user's program fragments. Thus, a high level expression language
is provided to write the string expressions to be matched while the
user's freedom to write actions is unimpaired. This avoids forcing
the user who wishes to use a string manipulation language for input
analysis to write processing programs in the same and often
inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a
new language feature which can be added to different programming
languages, called ``host languages.'' Just as general purpose
languages can produce code to run on different computer hardware, Lex
can write code in different host languages. The host language is used
for the output code generated by Lex and also for the program
fragments added by the user. Compatible run-time libraries for the
different host languages are also provided. This makes Lex adaptable
to different environments and different users.
Each application may be directed to the combination of hardware and
host language appropriate to the task, the user's background, and the
properties of local implementations. At present, the only supported
host language is C, although Fortran (in the form of Ratfor [2]) has
been available in the past. Lex itself exists on UNIX, GCOS, and
OS/370; but the code generated by Lex may be taken anywhere the
appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this
memo) into the host general-purpose language; the generated program
is named yylex. The yylex program will recognize expressions in a
stream (called input in this memo) and perform the specified actions
for each expression as it is detected. See Figure 1.

                    +-------+
          Source -> |  Lex  | -> yylex
                    +-------+

                    +-------+
          Input ->  | yylex | -> Output
                    +-------+

                 An overview of Lex
                      Figure 1

For a trivial example, consider a program to delete from the input
all blanks or tabs at the ends of lines.

     %%
     [ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark
the beginning of the rules, and one rule. This rule contains a
regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance
with the C language convention) just prior to the end of a line. The
brackets indicate the character class made of blank and tab; the +
indicates ``one or more ...''; and the $ indicates ``end of line,''
as in QED. No action is specified, so the program generated by Lex
(yylex) will ignore these characters. Everything else will be copied.
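
For comparison, the effect of this one-rule program can be sketched
by hand in C. This is only an illustration of the behavior described
above, not the code Lex generates; the function name
strip_trailing_blanks and its string-to-string interface are
assumptions made for the example.

```c
#include <stddef.h>

/*
 * Hypothetical hand-written equivalent of the rule  [ \t]+$  ;
 * Copies in to out, dropping any run of blanks or tabs that is
 * immediately followed by a newline. Everything else is copied.
 * out must have room for strlen(in) + 1 characters.
 * Returns the number of characters written (excluding the NUL).
 */
size_t strip_trailing_blanks(const char *in, char *out)
{
    size_t n = 0;

    while (*in != '\0') {
        if (*in == ' ' || *in == '\t') {
            const char *run = in;          /* start of the blank/tab run */

            while (*in == ' ' || *in == '\t')
                in++;
            if (*in == '\n')
                continue;                  /* run ends the line: drop it */
            while (run < in)               /* otherwise copy it unchanged */
                out[n++] = *run++;
        } else {
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Given the input "a \t\nb  c \n", the sketch produces "a\nb  c\n": the
blanks inside the second line survive, since they are not followed by
a newline. The amount of bookkeeping needed even for this small case
is the work the single Lex rule replaces.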
To change any remaining string of blanks or tabs to a single blank,
add another rule:

     %%
     [ \t]+$   ;
     [ \t]+    printf(" ");

The finite automaton generated for this source will scan for both
rules at once, observing at the termination of the string of blanks
or tabs whether or not there is a newline character, and executing
the desired rule action. The first rule matches all strings of blanks
or tabs at the end of lines, and the second rule all remaining
strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and
statistics gathering on a lexical level. Lex can also be used with a
parser generator to perform the lexical analysis phase; it is
particularly easy to interface Lex and Yacc [3]. Lex programs
recognize only regular expressions; Yacc writes parsers that accept a
large class of context free grammars, but require a lower level
analyzer to recognize input tokens. Thus, a combination of Lex and
Yacc is often appropriate. When used as a preprocessor for a later
parser generator, Lex is used to partition the input stream, and the
parser generator assigns structure to the resulting pieces. The flow
of control in such a case (which might be the first half of a
compiler, for example) is shown in Figure 2. Additional programs,
written by other generators or by hand, can be added easily to
programs written by Lex.

               lexical          grammar
                rules            rules
                  |                |
                  v                v
             +---------+      +---------+
             |   Lex   |      |  Yacc   |
             +---------+      +---------+
                  |                |
                  v                v
             +---------+      +---------+
    Input -> |  yylex  |  ->  | yyparse | -> Parsed input
             +---------+      +---------+

                     Lex with Yacc
                       Figure 2

Yacc users will realize that the name yylex is what Yacc expects its
lexical analyzer to be named, so that the use of this name by Lex
simplifies interfacing.

Lex generates a deterministic finite automaton from the regular
expressions in the source [4]. The automaton is interpreted, rather
than compiled, in order to save space.
The result is still a fast analyzer. In particular, the time taken by
a Lex program to recognize and partition an input stream is
proportional to the length of the input. The number of Lex rules or
the complexity of the rules is not important in determining speed,
unless rules which include forward context require a significant
amount of rescanning. What does increase with the number and
complexity of rules is the size of the finite automaton, and
therefore the size of the program generated by Lex.

In the program written by Lex, the user's fragments (representing the
actions to be performed as each regular expression is found) are
gathered as cases of a switch. The automaton interpreter directs the
control flow. Opportunity is provided for the user to insert either
declarations or additional statements in the routine containing the
actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of
one character lookahead. For example, if there are two rules, one
looking for ab and another for abcdefg, and the input stream is
abcdefh, Lex will recognize ab and leave the input pointer just
before "cd...". Such backup is more costly than the processing of
simpler languages.

2. Lex Source.

The general format of Lex source is:

     {definitions}
     %%
     {rules}
     %%
     {user subroutines}

where the definitions and the user subroutines are often omitted. The
second %% is optional, but the first is required to mark the
beginning of the rules. The absolute minimum Lex program is thus

     %%

(no definitions, no rules) which translates into a program which
copies the input to the output unchanged.

In the outline of Lex programs shown above, the rules represent the
user's control decisions; they are a table, in which the left column
contains regular expressions (see section 3) and the right column
contains actions, program fragments to be executed when the
expressions are recognized. Thus an individual rule might appear

     integer   printf("found keyword INT");

to look for the string integer in the input stream and print the
message ``found keyword INT'' whenever it appears. In this example
the host procedural language is C and the C library function printf
is used to print the string. The end of the expression is indicated
by the first blank or tab character. If the action is merely a single
C expression, it can just be given on the right side of the line; if
it is compound, or takes more than a line, it should be enclosed in
braces. As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling. Lex rules
such as

     colour      printf("color");
     mechanise   printf("mechanize");
     petrol      printf("gas");

would be a start. These rules are not quite enough, since the word
petroleum would become gaseum; a way of dealing with this will be
described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in
QED [5]. A regular expression specifies a set of strings to be
matched. It contains text characters (which match the corresponding
characters in the strings being compared) and operator characters
(which specify repetitions, choices, and other features). The letters
of the alphabet and the digits are always text characters; thus the
regular expression

     integer

matches the string integer wherever it appears and the expression

     a57D

looks for the string a57D.

Operators.
The operator characters are

     " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be
used. The quotation mark operator (") indicates that whatever is
contained between a pair of quotes is to be taken as text characters.
Thus

     xyz"++"

matches the string xyz++ when it appears. Note that a part of a
string may be quoted. It is harmless but unnecessary to quote an
ordinary text character; the expression

     "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric
character being used as a text character, the user can avoid
remembering the list above of current operator characters, and is
safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by
preceding it with \ as in

     xyz\+\+

which is another, less readable, equivalent of the above expressions.
Another use of the quoting mechanism is to get a blank into an
expression; normally, as explained above, blanks or tabs end a rule.
Any blank character not contained within [] (see below) must be
quoted. Several normal C escapes with \ are recognized: \n is
newline, \t is tab, and \b is backspace. To enter \ itself, use \\.
Since newline is illegal in an expression, \n must be used; it is not
required to escape tab and backspace. Every character but blank, tab,
newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the
operator pair []. The construction [abc] matches a single character,
which may be a, b, or c. Within square brackets, most operator
meanings are ignored. Only three characters are special: these are
\ - and ^. The - character indicates ranges. For example,

     [a-z0-9<>_]

indicates the character class containing all the lower case letters,
the digits, the angle brackets, and underline. Ranges may be given in
either order.
Using - between any pair of characters which are not both upper case
letters, both lower case letters, or both digits is implementation
dependent and will get a warning message. (E.g., [0-z] in ASCII is
many more characters than it is in EBCDIC). If it is desired to
include the character - in a character class, it should be first or
last; thus

     [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first
character after the left bracket; it indicates that the resulting
string is to be complemented with respect to the computer character
set. Thus

     [^abc]

matches all characters except a, b, or c, including all special or
control characters; or

     [^a-zA-Z]

is any character which is not a letter. The \ character provides the
usual escapes within character class brackets.

Arbitrary character. To match almost any character, the operator
character

     .

is the class of all characters except newline. Escaping into octal is
possible although non-portable:

     [\40-\176]

matches all printable characters in the ASCII character set, from
octal 40 (blank) to octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of
an expression. Thus

     ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the
operators * and +.

     a*

is any number of consecutive a characters, including zero; while

     a+

is one or more instances of a. For example,

     [a-z]+

is all strings of lower case letters. And

     [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic
character. This is a typical expression for recognizing identifiers
in computer languages.

Alternation and Grouping. The operator | indicates alternation:

     (ab|cd)

matches either ab or cd.
Note that parentheses are used for grouping, although they are not
necessary on the outside level;

     ab|cd

would have sufficed. Parentheses can be used for more complex
expressions:

     (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc,
abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding
context. The two simplest operators for this are ^ and $. If the
first character of an expression is ^, the expression will only be
matched at the beginning of a line (after a newline character, or at
the beginning of the input stream). This can never conflict with the
other meaning of ^, complementation of character classes, since that
only applies within the [] operators. If the very last character is
$, the expression will only be matched at the end of a line (when
immediately followed by newline). The latter operator is a special
case of the / operator character, which indicates trailing context.
The expression

     ab/cd

matches the string ab, but only if followed by cd. Thus

     ab$

is the same as

     ab/\n

Left context is handled in Lex by start conditions as explained in
section 10. If a rule is only to be executed when the Lex automaton
interpreter is in start condition x, the rule should be prefixed by

     <x>

using the angle bracket operator characters. If we considered ``being
at the beginning of a line'' to be start condition ONE, then the ^
operator would be equivalent to

     <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either
repetitions (if they enclose numbers) or definition expansion (if
they enclose a name). For example

     {digit}

looks for a predefined string named digit and inserts it at that
point in the expression. The definitions are given in the first part
of the Lex input, before the rules.
In contrast,

     a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source
segments.

4. Lex Actions.

When an expression written as above is matched, Lex executes the
corresponding action. This section describes some features of Lex
which aid in writing actions. Note that there is a default action,
which consists of copying the input to the output. This is performed
on all strings not otherwise matched. Thus the Lex user who wishes to
absorb the entire input, without producing any output, must provide
rules to match everything. When Lex is being used with Yacc, this is
the normal situation. One may consider that actions are what is done
instead of copying the input to the output; thus, in general, a rule
which merely copies can be omitted. Also, a character combination
which is omitted from the rules and which appears as input is likely
to be printed on the output, thus calling attention to the gap in the
rules.

One of the simplest things that can be done is to ignore the input.
Specifying a C null statement, ; as an action causes this result. A
frequent rule is

     [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline)
to be ignored.

Another easy way to avoid writing actions is the action character |,
which indicates that the action for this rule is the action for the
next rule. The previous example could also have been written

     " "    |
     "\t"   |
     "\n"   ;

with the same result, although in different style. The quotes around
\n and \t are not required.

In more complex actions, the user will often want to know the actual
text that matched some expression like [a-z]+. Lex leaves this text
in an external character array named yytext. Thus, to print the name
found, a rule like

     [a-z]+   printf("%s", yytext);

will print the string in yytext.
The C function printf accepts a format argument and data to be
printed; in this case, the format is ``print string'' (% indicating
data conversion, and s indicating string type), and the data are the
characters in yytext. So this just places the matched string on the
output. This action is so common that it may be written as ECHO:

     [a-z]+   ECHO;

is the same as the above. Since the default action is just to print
the characters found, one might ask why give a rule, like this one,
which merely specifies the default action? Such rules are often
required to avoid matching some other rule which is not desired. For
example, if there is a rule which matches read it will normally match
the instances of read contained in bread or readjust; to avoid this,
a rule of the form [a-z]+ is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been
found; hence Lex also provides a count yyleng of the number of
characters matched. To count both the number of words and the number
of characters in words in the input, the user might write

     [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words
recognized. The last character in the string matched can be accessed
by

     yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized
the correct span of characters. Two routines are provided to aid with
this situation. First, yymore() can be called to indicate that the
next input expression recognized is to be tacked on to the end of
this input. Normally, the next input string would overwrite the
current entry in yytext. Second, yyless(n) may be called to indicate
that not all the characters matched by the currently successful
expression are wanted right now. The argument n indicates the number
of characters in yytext to be retained.
Further characters previously matched are returned to the input. This
provides the same sort of lookahead offered by the / operator, but in
a different form.

Example: Consider a language which defines a string as a set of
characters between quotation (") marks, and provides that to include
a " in a string it must be preceded by a \. The regular expression
which matches that is somewhat confusing, so that it might be
preferable to write

     \"[^"]*   {
               if (yytext[yyleng-1] == '\\')
                    yymore();
               else
                    ... normal user processing
               }

which will, when faced with a string such as "abc\"def" first match
the five characters "abc\; then the call to yymore() will cause the
next part of the string, "def, to be tacked on the end. Note that the
final quote terminating the string should be picked up in the code
labeled ``normal processing''.

The function yyless() might be used to reprocess text in various
circumstances. Consider the C problem of distinguishing the ambiguity
of ``=-a''. Suppose it is desired to treat this as ``=- a'' but print
a message. A rule might be

     =-[a-zA-Z]   {
                  printf("Op (=-) ambiguous\n");
                  yyless(yyleng-1);
                  ... action for =- ...
                  }

which prints a message, returns the letter after the operator to the
input stream, and treats the operator as ``=-''. Alternatively it
might be desired to treat this as ``= -a''. To do this, just return
the minus sign as well as the letter to the input:

     =-[a-zA-Z]   {
                  printf("Op (=-) ambiguous\n");
                  yyless(yyleng-2);
                  ... action for = ...
                  }

will perform the other interpretation. Note that the expressions for
the two cases might more easily be written

     =-/[A-Za-z]

in the first case and

     =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is
not necessary to recognize the whole identifier to observe the
ambiguity. The possibility of ``=-3'', however, makes

     =-/[^ \t\n]

a still better rule.
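
The yymore() example in this section can be mimicked in plain C to
make the accumulation visible. The sketch below is an illustration
only: scan_quoted is a hypothetical helper, not a Lex routine, and
the local variable len merely plays the role of yyleng. Each pass of
the outer loop corresponds to one match of \"[^"]*, and looping again
corresponds to calling yymore().

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for the rule  \"[^"]*  combined with yymore().
 * src must begin at an opening quote. The string text (opening quote
 * included, closing quote excluded, escapes left as written) is
 * copied to out. Returns the number of characters consumed from src,
 * closing quote included, or 0 on error.
 */
size_t scan_quoted(const char *src, char *out, size_t outsize)
{
    size_t i = 0;      /* read position in src */
    size_t len = 0;    /* plays the role of yyleng */

    if (src[0] != '"')
        return 0;

    for (;;) {
        /* One match of \"[^"]* : a quote, then all non-quote characters. */
        do {
            if (len + 1 >= outsize)
                return 0;              /* out is too small */
            out[len++] = src[i++];
        } while (src[i] != '"' && src[i] != '\0');

        if (src[i] == '\0')
            return 0;                  /* unterminated string */

        /* Same test as yytext[yyleng-1] == '\\' in the rule: if the
           quote ahead is escaped, keep accumulating (yymore). */
        if (out[len - 1] != '\\')
            break;
    }
    out[len] = '\0';
    return i + 1;                      /* also consume the closing quote */
}
```

On the input "abc\"def" the first pass collects the five characters
"abc\ and the second pass appends "def, as in the description above.
Like the original rule, the sketch is misled by a string whose last
character is an escaped backslash.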

In addition to these routines, Lex also permits access to the I/O
routines it uses. They are:

1)   input() which returns the next input character;

2)   output(c) which writes the character c on the output; and

3)   unput(c) pushes the character c back onto the input stream to
     be read later by input().

By default these routines are provided as macro definitions, but the
user can override them and supply private versions. These routines
define the relationship between external files and internal
characters, and must all be retained or modified consistently. They
may be redefined, to cause input or output to be transmitted to or
from strange places, including other programs or internal memory; but
the character set used must be consistent in all routines; a value of
zero returned by input must mean end of file; and the relationship
between unput and input must be retained or the Lex lookahead will
not work. Lex does not look ahead at all if it does not have to, but
every rule ending in + * ? or $ or containing / implies lookahead.
Lookahead is also necessary to match an expression that is a prefix
of another expression. See below for a discussion of the character
set used by Lex. The standard Lex library imposes a 100 character
limit on backup.

Another Lex library routine that the user will sometimes want to
redefine is yywrap() which is called whenever Lex reaches an
end-of-file. If yywrap returns a 1, Lex continues with the normal
wrapup on end of input. Sometimes, however, it is convenient to
arrange for more input to arrive from a new source. In this case, the
user should provide a yywrap which arranges for new input and returns
0. This instructs Lex to continue processing. The default yywrap
always returns 1.

This routine is also a convenient place to print tables, summaries,
etc. at the end of a program.
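
The contract among input, unput and yywrap can be illustrated with a
small self-contained model in C. Everything here is an assumption
made for the example: the names my_input, my_unput, my_yywrap and
read_all are hypothetical stand-ins rather than the Lex library
routines, and the input comes from in-memory strings instead of
files. The driver read_all plays the part of the generated analyzer,
consulting my_yywrap whenever my_input returns 0.

```c
#include <stddef.h>

/* Hypothetical input sources: two in-memory strings read in turn. */
static const char *sources[] = { "first ", "second", NULL };
static size_t cur_source = 0;      /* index of the string being read */
static size_t cur_pos = 0;         /* read position within it */

#define PUSHBACK_MAX 100           /* mirrors the 100 character backup limit */
static char pushback[PUSHBACK_MAX];
static int npushed = 0;

/* Stand-in for input(): next character, or 0 at end of the current source. */
int my_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (sources[cur_source] == NULL || sources[cur_source][cur_pos] == '\0')
        return 0;
    return (unsigned char)sources[cur_source][cur_pos++];
}

/* Stand-in for unput(c): the character is returned by the next my_input(). */
void my_unput(int c)
{
    if (npushed < PUSHBACK_MAX)
        pushback[npushed++] = (char)c;
}

/* Stand-in for yywrap(): 0 if more input was arranged, 1 to wrap up. */
int my_yywrap(void)
{
    if (sources[cur_source] != NULL && sources[cur_source + 1] != NULL) {
        cur_source++;
        cur_pos = 0;
        return 0;
    }
    return 1;
}

/* Driver standing in for the analyzer: copies all input into out. */
size_t read_all(char *out, size_t outsize)
{
    size_t n = 0;

    if (outsize == 0)
        return 0;
    for (;;) {
        int c = my_input();

        if (c == 0) {
            if (my_yywrap())
                break;             /* no further input: normal wrapup */
            continue;              /* new source arranged: keep reading */
        }
        if (n + 1 < outsize)
            out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

Calling my_unput('>') and then read_all yields ">first second": the
pushed-back character is seen first, and the end of the first string
is bridged by my_yywrap returning 0. Note that a zero byte in the
data would be mistaken for end of input, which is the limitation on
nulls discussed in the text.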
Note that it is not possible to write a normal rule which recognizes
end-of-file; the only access to this condition is through yywrap. In
fact, unless a private version of input() is supplied a file
containing nulls cannot be handled, since a value of 0 returned by
input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one
expression can match the current input, Lex chooses as follows:

1)   The longest match is preferred.

2)   Among rules which matched the same number of characters, the
     rule given first is preferred.

Thus, suppose the rules

     integer   keyword action ...;
     [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as
an identifier, because [a-z]+ matches 8 characters while integer
matches only 7. If the input is integer, both rules match 7
characters, and the keyword rule is selected because it was given
first. Anything shorter (e.g. int) will not match the expression
integer and so the identifier interpretation is used.

The principle of preferring the longest match makes rules containing
expressions like .* dangerous. For example,

     '.*'

might seem a good way of recognizing a string in single quotes. But
it is an invitation for the program to read far ahead, looking for a
distant single quote. Presented with the input

     'first' quoted string here, 'second' here

the above expression will match

     'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

     '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences
of errors like this are mitigated by the fact that the . operator
will not match newline. Thus expressions like .* stop on the current
line.
Don't try to defeat this with expressions like [.\n]+ or equivalents;
the Lex generated program will try to read the entire input file,
causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not
searching for all possible matches of each expression. This means
that each character is accounted for once and only once. For example,
suppose it is desired to count occurrences of both she and he in an
input text. Some Lex rules to do this might be

     she   s++;
     he    h++;
     \n    |
     .     ;

where the last two rules ignore everything besides he and she.
Remember that . does not include newline. Since she includes he, Lex
will normally not recognize the instances of he included in she,
since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action
REJECT means ``go do the next alternative.'' It causes whatever rule
was second choice after the current rule to be executed. The position
of the input pointer is adjusted accordingly. Suppose the user really
wants to count the included instances of he:

     she   {s++; REJECT;}
     he    {h++; REJECT;}
     \n    |
     .     ;

these rules are one way of changing the previous example to do just
that. After counting each expression, it is rejected; whenever
appropriate, the other expression will then be counted. In this
example, of course, the user could note that she includes he but not
vice versa, and omit the REJECT action on he; in other cases,
however, it would not be possible a priori to tell which input
characters were in both classes.

Consider the two rules

     a[bc]+   { ... ; REJECT;}
     a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the
second matches. The input string accb matches the first rule for four
characters and then the second rule for three characters. In
contrast, the input accd agrees with the second rule for four
characters and then the first rule for three.
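
The difference between partitioning and the REJECT variant of the
she/he rules can be sketched in C. This is a hand-written
illustration under stated assumptions, not what Lex generates: the
struct counts type and the two function names are hypothetical, and
the input is taken from a string. The first function consumes each
match, as the rules without REJECT do; the second advances one
character at a time, which is the net effect of rejecting every match
and falling through to the single-character rules.

```c
#include <string.h>

struct counts {
    int she;
    int he;
};

/* Partitioning, as in the rules without REJECT: the longest match at
   each position is counted and its characters are then gone. */
struct counts count_partitioned(const char *text)
{
    struct counts c = { 0, 0 };
    const char *p = text;

    while (*p != '\0') {
        if (strncmp(p, "she", 3) == 0) {
            c.she++;
            p += 3;
        } else if (strncmp(p, "he", 2) == 0) {
            c.he++;
            p += 2;
        } else {
            p++;                   /* the \n and . rules: ignore one character */
        }
    }
    return c;
}

/* Effect of the REJECT rules: every match is counted, then rejected,
   so scanning resumes one character further on. */
struct counts count_overlapping(const char *text)
{
    struct counts c = { 0, 0 };
    const char *p;

    for (p = text; *p != '\0'; p++) {
        if (strncmp(p, "she", 3) == 0)
            c.she++;
        if (strncmp(p, "he", 2) == 0)
            c.he++;
    }
    return c;
}
```

For the text "she he then", count_partitioned reports one she and two
he, while count_overlapping reports one she and three he, the extra
one being the he inside she.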

In general, REJECT is useful whenever the purpose of Lex is not to
partition the input stream but to detect all examples of some items
in the input, and the instances of these items may overlap or include
each other. Suppose a digram table of the input is desired; normally
the digrams overlap, that is the word the is considered to contain
both th and he. Assuming a two-dimensional array named digram to be
incremented, the appropriate source is

     %%
     [a-z][a-z]   {
                  digram[yytext[0]][yytext[1]]++;
                  REJECT;
                  }
     .            ;
     \n           ;

where the REJECT is necessary to pick up a letter pair beginning at
every character, rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

     {definitions}
     %%
     {rules}
     %%
     {user routines}

So far only the rules have been described. The user needs additional
options, though, to define variables for use in his program and for
use by Lex. These can go either in the definitions section or in the
rules section.

Remember that Lex is turning the rules into a program. Any source not
intercepted by Lex is copied into the generated program. There are
three classes of such things.

1)   Any line which is not part of a Lex rule or action which begins
     with a blank or tab is copied into the Lex generated program.
     Such source input prior to the first %% delimiter will be
     external to any function in the code; if it appears immediately
     after the first %%, it appears in an appropriate place for
     declarations in the function written by Lex which contains the
     actions. This material must look like program fragments, and
     should precede the first Lex rule.

     As a side effect of the above, lines which begin with a blank
     or tab, and which contain a comment, are passed through to the
     generated program. This can be used to include comments in
     either the Lex source or the generated code.
     The comments should follow the host language convention.

2)   Anything included between lines containing only %{ and %} is
     copied out as above. The delimiters are discarded. This format
     permits entering text like preprocessor statements that must
     begin in column 1, or copying lines that do not look like
     programs.

3)   Anything after the second %% delimiter, regardless of formats,
     etc., is copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter.
Any line in this section not contained between %{ and %}, and
beginning in column 1, is assumed to define Lex substitution strings.
The format of such lines is

     name translation

and it causes the string given as a translation to be associated with
the name. The name and translation must be separated by at least one
blank or tab, and the name must begin with a letter. The translation
can then be called out by the {name} syntax in a rule. Using {D} for
the digits and {E} for an exponent field, for example, might
abbreviate rules to recognize numbers:

     D                   [0-9]
     E                   [DEde][-+]?{D}+
     %%
     {D}+                printf("integer");
     {D}+"."{D}*({E})?   |
     {D}*"."{D}+({E})?   |
     {D}+{E}

Note the first two rules for real numbers; both require a decimal
point and contain an optional exponent field, but the first requires
at least one digit before the decimal point and the second requires
at least one digit after the decimal point. To correctly handle the
problem posed by a Fortran expression such as 35.EQ.I, which does not
contain a real number, a context-sensitive rule such as

     [0-9]+/"."EQ   printf("integer");

could be used in addition to the normal rule for integers.

The definitions section may also contain other commands, including
the selection of a host language, a character set table, a list of
start conditions, or adjustments to the default size of arrays within
Lex itself for larger source programs.
These possibilities are discussed below under ``Summary of Source Format,'' section 12.

7. Usage.

There are two steps in compiling a Lex source program. First, the Lex source must be turned into a generated program in the host general purpose language. Then this program must be compiled and loaded, usually with a library of Lex subroutines. The generated program is on a file named lex.yy.c. The I/O library is defined in terms of the C standard library [6].

The C programs generated by Lex are slightly different on OS/370, because the OS compiler is less powerful than the UNIX or GCOS compilers, and does less at compile time. C programs generated on GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an appropriate set of commands is

    lex source
    cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later execution. To use Lex with Yacc see below. Although the default Lex I/O routines use the C standard library, the Lex automata themselves do not do so; if private versions of input, output and unput are given, the library can be avoided.

8. Lex and Yacc.

If you want to use Lex with Yacc, note that what Lex writes is a program named yylex(), the name required by Yacc for its analyzer. Normally, the default main program on the Lex library calls this routine, but if Yacc is loaded, and its main program is used, Yacc will call yylex(). In this case each Lex rule should end with

    return(token);

where the appropriate token value is returned. An easy way to get access to Yacc's names for tokens is to compile the Lex output file as part of the Yacc output file by placing the line

    # include "lex.yy.c"

in the last section of Yacc input.
Supposing the grammar to be named ``good'' and the lexical rules to be named ``better'' the UNIX command sequence can just be:

    yacc good
    lex better
    cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a main program which invokes the Yacc parser. The generations of Lex and Yacc programs can be done in either order.

9. Examples.

As a trivial problem, consider copying an input file while adding 3 to every positive number divisible by 7. Here is a suitable Lex source program

    %%
            int k;
    [0-9]+  {
            k = atoi(yytext);
            if (k%7 == 0)
                    printf("%d", k+3);
            else
                    printf("%d",k);
            }

to do just that. The rule [0-9]+ recognizes strings of digits; atoi converts the digits to binary and stores the result in k. The operator % (remainder) is used to check whether k is divisible by 7; if it is, it is incremented by 3 as it is written out. It may be objected that this program will alter such input items as 49.63 or X7. Furthermore, it increments the absolute value of all negative numbers divisible by 7. To avoid this, just add a few more rules after the active one, as here:

    %%
                            int k;
    -?[0-9]+                {
                            k = atoi(yytext);
                            printf("%d", k%7 == 0 ? k+3 : k);
                            }
    -?[0-9.]+               ECHO;
    [A-Za-z][A-Za-z0-9]+    ECHO;

Numerical strings containing a ``.'' or preceded by a letter will be picked up by one of the last two rules, and not changed. The if-else has been replaced by a C conditional expression to save space; the form a?b:c means ``if a then b else c''.

For an example of statistics gathering, here is a program which histograms the lengths of words, where a word is defined as a string of letters.

            int lengs[100];
    %%
    [a-z]+  lengs[yyleng]++;
    .       |
    \n      ;
    %%
    yywrap()
    {
    int i;
    printf("Length  No. words\n");
    for(i=0; i<100; i++)
            if (lengs[i] > 0)
                    printf("%5d%10d\n",i,lengs[i]);
    return(1);
    }

This program accumulates the histogram, while producing no output.
At the end of the input it prints the table. The final statement return(1); indicates that Lex is to perform wrapup. If yywrap returns zero (false) it implies that further input is available and the program is to continue reading and processing. To provide a yywrap that never returns true causes an infinite loop.

As a larger example, here are some parts of a program written by N. L. Schryer to convert double precision Fortran to single precision Fortran. Because Fortran does not distinguish upper and lower case letters, this routine begins by defining a set of classes including both cases of each letter:

    a       [aA]
    b       [bB]
    c       [cC]
    ...
    z       [zZ]

An additional class recognizes white space:

    W       [ \t]*

The first rule changes ``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.

    {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n}        {
            printf(yytext[0]=='d'? "real" : "REAL");
            }

Care is taken throughout this program to preserve the case (upper or lower) of the original program. The conditional operator is used to select the proper form of the keyword. The next rule copies continuation card indications to avoid confusing them with constants:

    ^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks. It is interpreted as ``beginning of line, then five blanks, then anything but blank or zero.'' Note the two different meanings of ^. There follow some rules to change double precision constants to ordinary floating constants.

    [0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
    [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
    "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
            /* convert constants */
            for(p=yytext; *p != 0; p++)
                    {
                    if (*p == 'd' || *p == 'D')
                            *p += 'e' - 'd';
                    }
            ECHO;
            }

After the floating point constant is recognized, it is scanned by the for loop to find the letter d or D. The program then adds 'e'-'d', which converts it to the next letter of the alphabet. The modified constant, now single-precision, is written out again.
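The character arithmetic in that action can be tried outside Lex. The following is only a sketch: the function name to_single is hypothetical and not part of Lex, the string argument stands in for yytext, and the assignment operator is written with the current += spelling. It also assumes a character set such as ASCII in which e directly follows d in both cases.

```c
/* Hypothetical helper, not part of Lex: rewrite the exponent letter of a
   double precision constant in place, as the action does to yytext. */
void to_single(char *s)
{
    char *p;

    for (p = s; *p != 0; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';    /* 'd' becomes 'e', 'D' becomes 'E' */
}
```

Because the same offset is added in both cases, the case of the exponent letter is preserved, which matches the care taken elsewhere in the program.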
There follows a series of names which must be respelled to remove their initial d. By using the array yytext the same action suffices for all the names (only a sample of a rather long list is given here).

    {d}{s}{i}{n}            |
    {d}{c}{o}{s}            |
    {d}{s}{q}{r}{t}         |
    {d}{a}{t}{a}{n}         |
    ...
    {d}{f}{l}{o}{a}{t}      printf("%s",yytext+1);

Another list of names must have initial d changed to initial a:

    {d}{l}{o}{g}            |
    {d}{l}{o}{g}10          |
    {d}{m}{i}{n}1           |
    {d}{m}{a}{x}1           {
                            yytext[0] += 'a' - 'd';
                            ECHO;
                            }

And one routine must have initial d changed to initial r:

    {d}1{m}{a}{c}{h}        {
                            yytext[0] += 'r' - 'd';
                            ECHO;
                            }

To avoid such names as dsinx being detected as instances of dsin, some final rules pick up longer words as identifiers and copy some surviving characters:

    [A-Za-z][A-Za-z0-9]*    |
    [0-9]+                  |
    \n                      |
    .                       ECHO;

Note that this program is not complete; it does not deal with the spacing problems in Fortran or with the use of keywords as identifiers.

10. Left Context Sensitivity.

Sometimes it is desirable to have several sets of lexical rules to be applied at different times in the input. For example, a compiler preprocessor might distinguish preprocessor statements and analyze them differently from ordinary statements. This requires sensitivity to prior context, and there are several ways of handling such problems. The ^ operator, for example, is a prior context operator, recognizing immediately preceding left context just as $ recognizes immediately following right context. Adjacent left context could be extended, to produce a facility similar to that for adjacent right context, but it is unlikely to be as useful, since often the relevant left context appeared some time earlier, such as at the beginning of a line.
This section describes three means of dealing with different environments: a simple use of flags, when only a few rules change from one environment to another, the use of start conditions on rules, and the possibility of making multiple lexical analyzers all run together. In each case, there are rules which recognize the need to change the environment in which the following input text is analyzed, and set some parameter to reflect the change. This may be a flag explicitly tested by the user's action code; such a flag is the simplest way of dealing with the problem, since Lex is not involved at all. It may be more convenient, however, to have Lex remember the flags as initial conditions on the rules. Any rule may be associated with a start condition. It will only be recognized when Lex is in that start condition. The current start condition may be changed at any time. Finally, if the sets of rules for the different environments are very dissimilar, clarity may be best achieved by writing several distinct lexical analyzers, and switching from one to another as desired.

Consider the following problem: copy the input to the output, changing the word magic to first on every line which began with the letter a, changing magic to second on every line which began with the letter b, and changing magic to third on every line which began with the letter c. All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a flag:

            int flag;
    %%
    ^a      {flag = 'a'; ECHO;}
    ^b      {flag = 'b'; ECHO;}
    ^c      {flag = 'c'; ECHO;}
    \n      {flag =  0 ; ECHO;}
    magic   {
            switch (flag)
            {
            case 'a': printf("first"); break;
            case 'b': printf("second"); break;
            case 'c': printf("third"); break;
            default: ECHO; break;
            }
            }

should be adequate.
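To make the behavior of the flag concrete, the same logic can be sketched in plain C with no Lex involved. This is an illustration only: the function magic_filter and the at_bol variable are hypothetical names, the caller is assumed to supply an output buffer large enough for the expanded text, and the sketch works on a string in memory rather than on the input stream.

```c
#include <string.h>

/* Hypothetical stand-in for the Lex rules above: copy in to out, replacing
   "magic" according to the first letter of the current line.
   The caller must make out large enough for the expanded text. */
void magic_filter(const char *in, char *out)
{
    int flag = 0;       /* same role as flag in the Lex source */
    int at_bol = 1;     /* nonzero at the beginning of a line */

    while (*in != 0) {
        if (at_bol && (*in == 'a' || *in == 'b' || *in == 'c')) {
            flag = *in;                 /* rules ^a, ^b, ^c */
            *out++ = *in++;
            at_bol = 0;
        } else if (*in == '\n') {
            flag = 0;                   /* rule \n */
            *out++ = *in++;
            at_bol = 1;
        } else if (strncmp(in, "magic", 5) == 0) {
            const char *rep = flag == 'a' ? "first"
                            : flag == 'b' ? "second"
                            : flag == 'c' ? "third" : "magic";
            strcpy(out, rep);           /* rule magic */
            out += strlen(rep);
            in += 5;
            at_bol = 0;
        } else {
            *out++ = *in++;             /* anything else is copied */
            at_bol = 0;
        }
    }
    *out = 0;
}
```

Given the line ``b magic'' the sketch writes ``b second'', while a line beginning with any other letter is copied unchanged. As in the Lex version, the flag is cleared only by a newline.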
To handle the same problem with start conditions, each start condition must be introduced to Lex in the definitions section with a line reading

    %Start name1 name2 ...

where the conditions may be named in any order. The word Start may be abbreviated to s or S. The conditions may be referenced at the head of a rule with the <> brackets:

    <name1>expression

is a rule which is only recognized when Lex is in the start condition name1. To enter a start condition, execute the action statement

    BEGIN name1;

which changes the start condition to name1. To resume the normal state,

    BEGIN 0;

resets the initial condition of the Lex automaton interpreter. A rule may be active in several start conditions:

    <name1,name2,name3>

is a legal prefix. Any rule not beginning with the <> prefix operator is always active.

The same example as before can be written:

    %START AA BB CC
    %%
    ^a          {ECHO; BEGIN AA;}
    ^b          {ECHO; BEGIN BB;}
    ^c          {ECHO; BEGIN CC;}
    \n          {ECHO; BEGIN 0;}
    <AA>magic   printf("first");
    <BB>magic   printf("second");
    <CC>magic   printf("third");

where the logic is exactly the same as in the previous method of handling the problem, but Lex does the work rather than the user's code.

11. Character Set.

The programs generated by Lex handle character I/O only through the routines input, output, and unput. Thus the character representation provided in these routines is accepted by Lex and employed to return values in yytext. For internal use a character is represented as a small integer which, if the standard library is used, has a value equal to the integer value of the bit pattern representing the character on the host computer. Normally, the letter a is represented as the same form as the character constant 'a'. If this interpretation is changed, by providing I/O routines which translate the characters, Lex must be told about it, by giving a translation table.
This table must be in the definitions section, and must be bracketed by lines containing only ``%T''. The table contains lines of the form

    {integer} {character string}

which indicate the value associated with each character. Thus the next example

    %T
     1      Aa
     2      Bb
    ...
    26      Zz
    27      \n
    28      +
    29      -
    30      0
    31      1
    ...
    39      9
    %T

    Sample character table.

maps the lower and upper case letters together into the integers 1 through 26, newline into 27, + and - into 28 and 29, and the digits into 30 through 39. Note the escape for newline. If a table is supplied, every character that is to appear either in the rules or in any valid input must be included in the table. No character may be assigned the number 0, and no character may be assigned a bigger number than the size of the hardware character set.

12. Summary of Source Format.

The general form of a Lex source file is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

The definitions section contains a combination of

1)  Definitions, in the form ``name space translation''.

2)  Included code, in the form ``space code''.

3)  Included code, in the form

        %{
        code
        %}

4)  Start conditions, given in the form

        %S name1 name2 ...

5)  Character set tables, in the form

        %T
        number space character-string
        ...
        %T

6)  Changes to internal array sizes, in the form

        %x nnn

    where nnn is a decimal integer representing an array size and x selects the parameter as follows:

        Letter      Parameter
        p           positions
        n           states
        e           tree nodes
        a           transitions
        k           packed character classes
        o           output array size

Lines in the rules section have the form ``expression action'' where the action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

    x        the character "x"
    "x"      an "x", even if x is an operator.
    \x       an "x", even if x is an operator.
    [xy]     the character x or y.
    [x-z]    the characters x, y or z.
    [^x]     any character but x.
    .        any character but newline.
    ^x       an x at the beginning of a line.
    <y>x     an x when Lex is in start condition y.
    x$       an x at the end of a line.
    x?       an optional x.
    x*       0,1,2, ... instances of x.
    x+       1,2,3, ... instances of x.
    x|y      an x or a y.
    (x)      an x.
    x/y      an x but only if followed by y.
    {xx}     the translation of xx from the definitions section.
    x{m,n}   m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of the tables when converted to deterministic machines; fortunately, they are rare.

REJECT does not rescan the input; instead it remembers the results of the previous scan. This means that if a rule with trailing context is found, and REJECT executed, the user must not have used unput to change the characters forthcoming from the input stream. This is the only restriction on the user's ability to manipulate the not-yet-processed input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on Yacc and the inside on Aho's string matching routines. Therefore, both S. C. Johnson and A. V. Aho are really originators of much of Lex, as well as debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and debugged by Eric Schmidt.

M. E. Lesk and E. Schmidt
MH-1274-MEL-unix

15. References.

1.  B. W. Kernighan and D. M. Ritchie, The C Programming Language, Prentice-Hall, N. J. (1978).

2.  B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran, Software - Practice and Experience, 5, pp. 395-406 (1975).

3.  S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.

4.  A. V. Aho and M. J.
    Corasick, Efficient String Matching: An Aid to Bibliographic Search, Comm. ACM 18, 333-340 (1975).

5.  B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor, Computing Science Technical Report No. 5, 1972, Bell Laboratories, Murray Hill, NJ 07974.

6.  D. M. Ritchie, private communication. See also M. E. Lesk, The Portable C Library, Computing Science Technical Report No. 31, Bell Laboratories, Murray Hill, NJ 07974.