Lex - A Lexical Analyzer Generator


                 M. E. Lesk and E. Schmidt

                     Bell Laboratories
               Murray Hill, New Jersey 07974


                          _A_B_S_T_R_A_C_T


          Lex helps write programs whose  control  flow
     is directed by instances of regular expressions in
     the input stream.  It is well suited  for  editor-
     script  type  transformations  and  for segmenting
     input in preparation for a parsing routine.

          Lex source is a table of regular  expressions
     and corresponding program fragments.  The table is
     translated to  a  program  which  reads  an  input
     stream,  copying it to an output stream and parti-
     tioning the input into  strings  which  match  the
     given  expressions.  As each such string is recog-
     nized the corresponding program fragment  is  exe-
     cuted.  The recognition of the expressions is per-
     formed by a deterministic  finite  automaton  gen-
     erated  by  Lex.  The program fragments written by
     the user are executed in the order  in  which  the
     corresponding  regular  expressions  occur  in the
     input stream.

          The lexical analysis  programs  written  with
     Lex accept ambiguous specifications and choose the
     longest match possible at each  input  point.   If
     necessary,  substantial  lookahead is performed on
     the input, but the input stream will be backed  up
     to  the  end of the current partition, so that the
     user has general freedom to manipulate it.

          Lex can generate analyzers  in  either  C  or
     Ratfor,   a   language  which  can  be  translated
     automatically to portable Fortran.  It  is  avail-
     able  on  the PDP-11 UNIX, Honeywell GCOS, and IBM
     OS systems.  This manual, however, will only  dis-
     cuss generating analyzers in C on the UNIX system,
     which is the only supported form of Lex under UNIX
     Version  7.  Lex is designed to simplify interfac-
     ing with Yacc,  for  those  with  access  to  this
     compiler-compiler system.




_1.  _I_n_t_r_o_d_u_c_t_i_o_n.

     Lex is a  program  gen-
erator  designed for lexical
processing   of    character
input streams.  It accepts a
high-level, problem oriented
specification  for character
string  matching,  and  pro-
duces a program in a general
purpose    language    which
recognizes  regular  expres-
sions.  The regular  expres-
sions  are  specified by the
user in the source  specifi-
cations  given  to Lex.  The
Lex written code  recognizes
these   expressions   in  an
input stream and  partitions
the    input   stream   into
strings matching the expres-
sions.   At  the  boundaries
between strings program sec-
tions  provided  by the user
are   executed.    The   Lex
source  file  associates the
regular expressions and  the
program  fragments.  As each
expression  appears  in  the
input to the program written
by  Lex,  the  corresponding
fragment is executed.

     The user  supplies  the
additional    code    beyond
expression  matching  needed
to  complete his tasks, pos-
sibly including code written
by  other  generators.   The
program that recognizes  the
expressions  is generated in
the general purpose program-
ming  language  employed for
the  user's  program   frag-
ments.   Thus,  a high level
expression language is  pro-
vided  to  write  the string
expressions  to  be  matched
while  the user's freedom to
write actions is unimpaired.
This avoids forcing the user
who wishes to use  a  string
manipulation   language  for
input  analysis   to   write
777777777777777777777777777777777777777777777777777777                                processing  programs  in the
                                same and often inappropriate
                                string handling language.

                                     Lex is not  a  complete
                                language,  but rather a gen-
                                erator  representing  a  new
                                language  feature  which can
                                be added to  different  pro-
                                gramming  languages,  called
                                ``host languages.'' Just  as
                                general   purpose  languages
                                can produce code to  run  on
                                different computer hardware,
                                Lex can write code  in  dif-
                                ferent  host languages.  The
                                host language  is  used  for
                                the output code generated by
                                Lex and also for the program
                                fragments added by the user.
                                Compatible          run-time
                                libraries  for the different
                                host languages are also pro-
                                vided.    This   makes   Lex
                                adaptable    to    different
                                environments  and  different
                                users.  Each application may
                                be  directed to the combina-
                                tion of  hardware  and  host
                                language  appropriate to the
                                task, the user's background,
                                and  the properties of local
                                implementations.          At
                                present,  the only supported
                                host language is C, although
                                Fortran (in the form of
                                Ratfor [2]) has been available
                                in  the  past.   Lex  itself
                                exists on  UNIX,  GCOS,  and
                                OS/370;  but  the  code gen-
                                erated by Lex may  be  taken
                                anywhere   the   appropriate
                                compilers exist.

                                     Lex  turns  the  user's
                                expressions    and   actions
                                (called source in this memo)
                                into the host general-purpose
                                language; the generated
                                program is named yylex.  The
                                yylex program will recognize
                                expressions in a stream
                                (called input in this memo)
                                and perform the
specified actions  for  each
expression    as    it    is
detected.  See Figure 1.
8           _______
Source -> |  Lex  |  -> yylex
8           _______


8           _______
Input ->  | yylex | -> Output
8           _______

     An overview of Lex
          Figure 1

     For a trivial  example,
consider a program to delete
from the input all blanks or
tabs at the ends of lines.
        %%
        [ \t]+$   ;
is  all  that  is  required.
The  program  contains  a %%
delimiter to mark the begin-
ning  of  the rules, and one
rule.  This rule contains  a
regular   expression   which
matches    one    or    more
instances  of the characters
blank or tab (written \t for
visibility,   in  accordance
with the C language  conven-
tion)  just prior to the end
of  a  line.   The  brackets
indicate the character class
made of blank and tab; the +
indicates   ``one   or  more
...''; and the  $  indicates
``end  of line,'' as in QED.
No action is  specified,  so
the program generated by Lex
(yylex)  will  ignore  these
characters.  Everything else
will be copied.   To  change
any   remaining   string  of
blanks or tabs to  a  single
blank, add another rule:
   %%
   [ \t]+$   ;
   [ \t]+    printf(" ");
The  finite  automaton  gen-
erated  for this source will
scan for both rules at once,
observing at the termination
of the string of  blanks  or
tabs whether or not there is
777777777777777777777777777777777777777777777777777777                                a  newline  character,   and
                                executing  the  desired rule
                                action.   The   first   rule
                                matches   all   strings   of
                                blanks or tabs at the end of
                                lines,  and  the second rule
                                all  remaining  strings   of
                                blanks or tabs.
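
For readers who want to see the effect of these two rules outside Lex, the following is a minimal hand-written C sketch. It is not the code Lex generates; the function name squeeze_blanks and its buffer convention are assumptions made for this illustration. Like the rules above, it takes the longest run of blanks and tabs, discards the run if a newline follows it, and otherwise replaces it with one blank.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not produced by Lex. Applies the effect of the two
 * rules to the NUL-terminated string `in`, writing the result to `out`,
 * which must hold at least strlen(in) + 1 bytes. */
void squeeze_blanks(const char *in, char *out)
{
    size_t i = 0, o = 0;

    while (in[i] != '\0') {
        if (in[i] == ' ' || in[i] == '\t') {
            /* Longest match: consume the whole run of blanks and tabs. */
            while (in[i] == ' ' || in[i] == '\t')
                i++;
            /* Rule [ \t]+$ : a run followed by newline is dropped. */
            if (in[i] == '\n')
                continue;
            /* Rule [ \t]+ : any other run becomes a single blank. */
            out[o++] = ' ';
        } else {
            /* Default action: copy everything else unchanged. */
            out[o++] = in[i++];
        }
    }
    out[o] = '\0';
}
```

A caller would pass an input line and a buffer of the same size. A run of blanks at the very end of the input, with no newline after it, is treated by this sketch as an ordinary run.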

                                     Lex can be  used  alone
                                for  simple transformations,
                                or for analysis and  statis-
                                tics  gathering on a lexical
                                level.  Lex can also be used
                                with  a  parser generator to
                                perform the lexical analysis
                                phase;  it  is  particularly
                                easy to  interface  Lex  and
                                Yacc   [3].    Lex  programs
                                recognize    only    regular
                                expressions;   Yacc   writes
                                parsers that accept a  large
                                class  of context free gram-
                                mars, but  require  a  lower
                                level  analyzer to recognize
                                input tokens.  Thus, a  com-
                                bination  of Lex and Yacc is
                                often   appropriate.    When
                                used as a preprocessor for a
                                later parser generator,  Lex
                                is  used  to  partition  the
                                input stream, and the parser
                                generator  assigns structure
                                to  the  resulting   pieces.
                                The  flow of control in such
                                a case (which might  be  the
                                first  half  of  a compiler,
                                for  example)  is  shown  in
                                Figure  2.   Additional pro-
                                grams, written by other gen-
                                erators  or  by hand, can be
                                added  easily  to   programs
                                written by Lex.

                                           lexical        grammar
                                            rules          rules
                                              |              |
                                              v              v
                                         +---------+    +---------+
                                         |   Lex   |    |  Yacc   |
                                         +---------+    +---------+
                                              |              |
                                              v              v
                                         +---------+    +---------+
                                Input -> |  yylex  | -> | yyparse | -> Parsed input
                                         +---------+    +---------+

                                               Lex with Yacc
                                                  Figure 2

Yacc users will realize that
the  name yylex is what Yacc
expects its lexical analyzer
to be named, so that the use
of this name by Lex  simpli-
fies interfacing.

     Lex generates a  deter-
ministic   finite  automaton
from the regular expressions
in   the  source  [4].   The
automaton  is   interpreted,
rather   than  compiled,  in
order to  save  space.   The
result   is   still  a  fast
analyzer.   In   particular,
the time taken by a Lex pro-
gram to recognize and parti-
tion an input stream is pro-
portional to the  length  of
the  input.   The  number of
Lex rules or the  complexity
of  the  rules is not impor-
tant in  determining  speed,
unless  rules  which include
forward  context  require  a
significant  amount  of  re-
scanning.      What     does
increase with the number and
complexity of rules  is  the
size  of  the finite automa-
ton, and therefore the  size
of  the program generated by
Lex.

     In the program  written
by Lex, the user's fragments
(representing the actions to
be performed as each regular
expression  is  found)   are
gathered   as   cases  of  a
switch.     The    automaton
interpreter directs the con-
trol flow.   Opportunity  is
provided  for  the  user  to
insert  either  declarations
or  additional statements in
the routine  containing  the
actions,  or  to add subrou-
tines  outside  this  action
routine.

     Lex is not  limited  to
source    which    can    be
777777777777777777777777777777777777777777777777777777                                interpreted on the basis  of
                                one   character   lookahead.
                                For example,  if  there  are
                                two  rules,  one looking for
                                _a_b and another for  _a_b_c_d_e_f_g,
                                and   the  input  stream  is
                                _a_b_c_d_e_f_h, Lex will  recognize
                                _a_b   and   leave  the  input
                                pointer just before _c_d. .  .
                                Such  backup  is more costly
                                than   the   processing   of
                                simpler languages.
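
The lookahead and backup just described can be sketched in C. This is only an illustration restricted to literal strings, not the table-driven automaton that Lex builds; the name longest_match and its parameters are assumptions. The scan continues while some pattern could still match, remembers the end of the last complete match, and returns that position, which is where the input pointer would be left.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch. Returns the length of the longest pattern in `pats`
 * that matches at the start of `input`, or 0 if none does. If `scanned` is
 * not NULL it receives the number of input characters examined, which may
 * exceed the returned length when backup was needed. */
size_t longest_match(const char *input, const char *const pats[],
                     size_t npats, size_t *scanned)
{
    size_t pos = 0;          /* characters examined so far */
    size_t last_accept = 0;  /* end of the longest complete match seen */

    for (;;) {
        int viable = 0;      /* can some longer pattern still match? */
        for (size_t k = 0; k < npats; k++) {
            size_t len = strlen(pats[k]);
            if (len >= pos && strncmp(input, pats[k], pos) == 0) {
                if (len == pos)
                    last_accept = pos;   /* complete match ends here */
                else
                    viable = 1;          /* still a proper prefix */
            }
        }
        if (!viable || input[pos] == '\0')
            break;
        pos++;
    }
    if (scanned != NULL)
        *scanned = pos;
    return last_accept;      /* input pointer backs up to this point */
}
```

With the patterns ab and abcdefg and the input abcdefh, the sketch examines seven characters before the longer pattern fails, then reports a match of length two.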

                                _2.  _L_e_x _S_o_u_r_c_e.

                                     The general  format  of
                                Lex source is:
                                     {definitions}
                                     %%
                                     {rules}
                                     %%
                                     {user subroutines}
                                where  the  definitions  and
                                the   user  subroutines  are
                                often omitted.   The  second
                                %%   is  optional,  but  the
                                first is  required  to  mark
                                the  beginning of the rules.
                                The  absolute  minimum   Lex
                                program is thus
                                             %%
                                (no definitions,  no  rules)
                                which translates into a pro-
                                gram which copies the  input
                                to the output unchanged.

                                     In the outline  of  Lex
                                programs  shown  above,  the
                                rules represent  the  user's
                                control  decisions; they are
                                a table, in which  the  left
                                column    contains   regular
                                expressions (see section  3)
                                and  the  right  column
                                contains actions, program
                                fragments to be executed when
                                the expressions are
                                recognized.  Thus an individual
                                rule might appear
                                integer   printf("found keyword INT");
                                to  look  for   the   string
                                integer  in the input stream
                                and   print   the    message
                                ``found     keyword    INT''
whenever  it  appears.    In
this  example  the host
procedural language is C and
the   C   library   function
printf is used to print  the
string.    The  end  of  the
expression is  indicated  by
the first blank or tab char-
acter.   If  the  action  is
merely  a  single  C expres-
sion, it can just  be  given
on  the  right  side  of the
line; if it is compound,  or
takes  more  than a line, it
should   be   enclosed    in
braces.   As a slightly more
useful example,  suppose  it
is   desired   to  change  a
number of words from British
to  American  spelling.  Lex
rules such as
colour      printf("color");
mechanise   printf("mechanize");
petrol      printf("gas");
would  be  a  start.   These
rules  are not quite enough,
since  the  word   petroleum
would  become  gaseum; a way
of dealing with this will be
described later.
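
The petroleum problem is easy to reproduce with a plain substring substitution, as in the C sketch below. The function replace_all is a hypothetical helper written for this illustration and is not part of Lex; it replaces a word wherever the letters occur, with no regard for what follows them, which is also how the rules above behave.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Copies `in` to `out`, replacing every occurrence of
 * `from` with `to`. The caller must make `out` large enough for the result. */
void replace_all(const char *in, const char *from, const char *to, char *out)
{
    size_t flen = strlen(from), tlen = strlen(to), o = 0;

    while (*in != '\0') {
        if (flen > 0 && strncmp(in, from, flen) == 0) {
            memcpy(out + o, to, tlen);   /* emit the replacement */
            o += tlen;
            in += flen;                  /* skip the matched text */
        } else {
            out[o++] = *in++;            /* copy unmatched characters */
        }
    }
    out[o] = '\0';
}
```

Calling replace_all on the text "petrol and petroleum" with from "petrol" and to "gas" yields "gas and gaseum".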

_3.  _L_e_x _R_e_g_u_l_a_r _E_x_p_r_e_s_s_i_o_n_s.

     The definitions of reg-
ular  expressions  are  very
similar to those in QED [5].
A  regular expression speci-
fies a set of strings to  be
matched.   It  contains text
characters (which match  the
corresponding  characters in
the strings being  compared)
and    operator   characters
(which specify  repetitions,
choices,      and      other
features).  The  letters  of
the  alphabet and the digits
are always text  characters;
thus the regular expression
         integer
matches the  string  integer
wherever  it appears and the
expression
            a57D
777777777777777777777777777777777777777777777777777777                                looks for the string _a_5_7_D.

                                     _O_p_e_r_a_t_o_r_s.  The  opera-
                                tor characters are
                                " \ [ ] ^ - ? . * + | ( ) $ / { } % < >
                                and if they are to  be  used
                                as   text   characters,   an
                                escape should be used.   The
                                quotation  mark operator (")
                                indicates that  whatever  is
                                contained  between a pair of
                                quotes is  to  be  taken  as
                                text characters.  Thus
                                          xyz"++"
                                matches  the  string   xyz++
                                when  it appears.  Note that
                                a part of a  string  may  be
                                quoted.   It is harmless but
                                unnecessary  to   quote   an
                                ordinary text character; the
                                expression
                                          "xyz++"
                                is  the  same  as  the   one
                                above.    Thus   by  quoting
                                every non-alphanumeric char-
                                acter  being  used as a text
                                character,  the   user   can
                                avoid  remembering  the list
                                above  of  current  operator
                                characters,   and   is  safe
                                should further extensions to
                                Lex lengthen the list.

                                     An  operator  character
                                may  also  be  turned into a
                                text character by  preceding
                                it with \ as in
                                          xyz\+\+
                                which is another, less read-
                                able,   equivalent   of  the
                                above expressions.   Another
                                use of the quoting mechanism
                                is to get a  blank  into  an
                                expression;   normally,   as
                                explained above,  blanks  or
                                tabs  end a rule.  Any blank
                                character   not    contained
                                within  []  (see below) must
                                be quoted.  Several normal C
                                escapes  with  \  are recog-
                                nized: \n is newline, \t  is
                                tab,  and  \b  is backspace.
                                To enter \ itself,  use  \\.
                                Since  newline is illegal in
an expression,  \n  must  be
used;  it is not required to
escape  tab  and  backspace.
Every  character  but blank,
tab, newline  and  the  list
above is always a text char-
acter.

     _C_h_a_r_a_c_t_e_r      _c_l_a_s_s_e_s.
Classes of characters can be
specified using the operator
pair  [].   The construction
[_a_b_c] matches a single char-
acter, which may be _a, _b, or
_c.  Within square  brackets,
most  operator  meanings are
ignored.  Only three charac-
ters  are special: these are
\ - and ^.  The -  character
indicates ranges.  For exam-
ple,
        [a-z0-9<>_]
indicates   the    character
class   containing  all  the
lower  case   letters,   the
digits,  the angle brackets,
and underline.   Ranges  may
be  given  in  either order.
Using - between any pair  of
characters   which  are  not
both  upper  case   letters,
both  lower case letters, or
both digits  is  implementa-
tion  dependent and will get
a warning  message.   (E.g.,
[0-z]  in ASCII is many more
characters  than  it  is  in
EBCDIC).   If  it is desired
to include the  character  -
in  a  character  class,  it
should  be  first  or  last;
thus
          [-+0-9]
matches all the  digits  and
the two signs.
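
As a rough C analogue of the two classes shown above, the sketch below tests a character for membership. The function names are assumptions for illustration. The members are listed explicitly rather than compared as ranges, so the result does not depend on the ordering of the machine's character set, which is the concern raised above for ASCII versus EBCDIC.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: 1 if `ch` belongs to the class [a-z0-9<>_], else 0. */
int in_example_class(char ch)
{
    static const char members[] =
        "abcdefghijklmnopqrstuvwxyz" "0123456789" "<>_";
    /* strchr would also find the terminating NUL, so exclude it. */
    return ch != '\0' && strchr(members, ch) != NULL;
}

/* Hypothetical helper: 1 if `ch` belongs to the class [-+0-9], else 0. */
int in_sign_or_digit(char ch)
{
    static const char members[] = "-+0123456789";
    return ch != '\0' && strchr(members, ch) != NULL;
}
```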

     In  character  classes,
the  ^  operator must appear
as the first character after
the  left  bracket; it indi-
cates  that  the   resulting
string is to be complemented
with respect to the computer
character set.  Thus
777777777777777777777777777777777777777777777777777777                                           [^abc]
                                matches    all    characters
                                except a, b, or c, including
                                all special or control char-
                                acters; or
                                         [^a-zA-Z]
                                is any  character  which  is
                                not a letter.  The \ charac-
                                ter   provides   the   usual
                                escapes   within   character
                                class brackets.

                                     _A_r_b_i_t_r_a_r_y    _c_h_a_r_a_c_t_e_r.
                                To  match almost any charac-
                                ter, the operator character
                                             .
                                is the class of all  charac-
                                ters except newline.  Escap-
                                ing into octal  is  possible
                                although non-portable:
                                         [\40-\176]
                                matches all printable  char-
                                acters  in the ASCII charac-
                                ter  set,  from   octal   40
                                (blank)    to    octal   176
                                (tilde).

                                     _O_p_t_i_o_n_a_l   _e_x_p_r_e_s_s_i_o_n_s.
                                The operator ?  indicates an
                                optional   element   of   an
                                expression.  Thus
                                            ab?c
                                matches either ac or abc.

                                     _R_e_p_e_a_t_e_d   _e_x_p_r_e_s_s_i_o_n_s.
                                Repetitions  of  classes are
                                indicated by the operators *
                                and +.
                                             _a*
                                is any number of consecutive
                                _a    characters,   including
                                zero; while
                                             a+
                                is one or more instances  of
                                _a.  For example,
                                           [a-z]+
                                is all strings of lower case
                                letters.  And
                                    [A-Za-z][A-Za-z0-9]*
                                indicates  all  alphanumeric
                                strings   with   a   leading
                                alphabetic character.   This
                                is  a typical expression for
                                recognizing  identifiers  in
computer languages.

     _A_l_t_e_r_n_a_t_i_o_n _a_n_d  _G_r_o_u_p_-
_i_n_g.   The  operator | indi-
cates alternation:
          (ab|cd)
matches  either  ab  or  cd.
Note  that  parentheses  are
used for grouping,  although
they  are  not  necessary on
the outside level;
           ab|cd
would     have     sufficed.
Parentheses  can be used for
more complex expressions:
       (ab|cd+)?(ef)*
matches  such   strings   as
_a_b_e_f_e_f,   _e_f_e_f_e_f,  _c_d_e_f,  or
_c_d_d_d; but not _a_b_c, _a_b_c_d,  or
_a_b_c_d_e_f.

     _C_o_n_t_e_x_t    _s_e_n_s_i_t_i_v_i_t_y.
Lex  will  recognize a small
amount of  surrounding  con-
text.    The   two  simplest
operators for this are ^ and
$.   If  the first character
of an expression is  ^,  the
expression   will   only  be
matched at the beginning  of
a   line  (after  a  newline
character, or at the  begin-
ning  of  the input stream).
This can never conflict with
the other meaning of ^, com-
plementation  of   character
classes,   since  that  only
applies within the [] opera-
tors.    If  the  very  last
character is $, the  expres-
sion will only be matched at
the  end  of  a  line  (when
immediately followed by new-
line).  The latter  operator
is  a  special case of the /
operator  character,   which
indicates  trailing context.
The expression
           ab/cd
matches the string  ab,  but
only   if  followed  by  cd.
Thus
            ab$
is the same as
777777777777777777777777777777777777777777777777777777                                           ab/\n
                                Left context is  handled  in
                                Lex  by  _s_t_a_r_t _c_o_n_d_i_t_i_o_n_s as
                                explained in section 10.  If
                                a  rule  is  only to be exe-
                                cuted when the Lex automaton
                                interpreter is in start con-
                                dition _x, the rule should be
                                prefixed by
                                            <x>
                                using  the   angle   bracket
                                operator  characters.  If we
                                considered  ``being  at  the
                                beginning  of a line'' to be
                                start  condition  ONE,  then
                                the   ^  operator  would  be
                                equivalent to
                                           <ONE>
                                Start     conditions     are
                                explained more fully later.
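
The trailing context operator / described earlier in this section can be illustrated for literal strings with the C sketch below. The function name and parameters are assumptions, and this is not how Lex implements the operator. The point it shows is that the context must be present for the match to succeed, but only the text before the / is consumed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. If `input` starts with `pat` immediately followed by
 * `ctx`, returns strlen(pat), the number of characters consumed; the context
 * itself is left in the input. Returns 0 if there is no match. */
size_t match_with_trailing_context(const char *input, const char *pat,
                                   const char *ctx)
{
    size_t plen = strlen(pat), clen = strlen(ctx);

    if (strncmp(input, pat, plen) != 0)
        return 0;                        /* pattern itself is absent */
    if (strncmp(input + plen, ctx, clen) != 0)
        return 0;                        /* required context is absent */
    return plen;                         /* consume the pattern only */
}
```

For ab/cd, the input abcd gives a match of length two and abce gives none. Passing "\n" as the context corresponds to ab$.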

                                     _R_e_p_e_t_i_t_i_o_n_s _a_n_d _D_e_f_i_n_i_-
                                _t_i_o_n_s.    The  operators  {}
                                specify  either  repetitions
                                (if they enclose numbers) or
                                definition   expansion   (if
                                they  enclose  a name).  For
                                example
                                          {digit}
                                looks   for   a   predefined
                                string   named   digit   and
                                inserts it at that point  in
                                the expression.  The defini-
                                tions are given in the first
                                part   of   the  Lex  input,
                                before the rules.   In  con-
                                trast,
                                           a{1,5}
                                looks for 1 to 5 occurrences
                                of _a.

                                     Finally, initial  %  is
                                special, being the separator
                                for Lex source segments.

                                _4.  _L_e_x _A_c_t_i_o_n_s.

                                     When   an    expression
                                written as above is matched,
                                Lex executes the correspond-
                                ing  action.   This  section
                                describes some  features  of
                                Lex  which  aid  in  writing
                                actions.  Note that there is
a default action, which
consists of copying the input
to the output.  This is per-
formed on  all  strings  not
otherwise matched.  Thus the
Lex  user  who   wishes   to
absorb   the  entire  input,
without producing  any  out-
put,  must  provide rules to
match everything.  When  Lex
is  being  used  with  Yacc,
this is  the  normal  situa-
tion.  One may consider that
actions  are  what  is  done
instead of copying the input
to the output; thus, in gen-
eral,  a  rule  which merely
copies   can   be   omitted.
Also,  a  character combina-
tion which is  omitted  from
the  rules and which appears
as input  is  likely  to  be
printed  on the output, thus
calling attention to the gap
in the rules.

     One  of  the   simplest
things  that  can be done is
to   ignore    the    input.
Specifying  a  C null state-
ment, ; as an action  causes
this   result.   A  frequent
rule is
        [ \t\n]   ;
which causes the three spac-
ing  characters (blank, tab,
and newline) to be ignored.

     Another  easy  way   to
avoid writing actions is the
action  character  |,  which
indicates  that  the  action
for this rule is the  action
for the next rule.  The pre-
vious  example  could   also
have been written
          " "       |
          "\t"      |
          "\n"      ;
with   the   same    result,
although in different style.
The quotes around \n and  \t
are not required.

777777777777777777777777777777777777777777777777777777                                     In     more     complex
                                actions, the user will often
                                want to know the actual text
                                that matched some expression
                                like  [_a-_z]+.   Lex   leaves
                                this  text  in  an  external
                                character    array     named
                                _y_y_t_e_x_t.   Thus, to print the
                                name found, a rule like
                                [a-z]+   printf("%s", yytext);
                                will  print  the  string  in
                                _y_y_t_e_x_t.    The   C  function
                                _p_r_i_n_t_f  accepts   a   format
                                argument   and  data  to  be
                                printed; in this  case,  the
                                format  is  ``print string''
                                (% indicating  data  conver-
                                sion,   and   _s   indicating
                                string type), and  the  data
                                are    the   characters   in
                                _y_y_t_e_x_t.  So this just places
                                the  matched  string  on the
                                output.  This action  is  so
                                common  that it may be writ-
                                ten as ECHO:
                                       [a-z]+   ECHO;
                                is the same  as  the  above.
                                Since  the default action is
                                just to print the characters
                                found,  one  might  ask  why
                                give a rule, like this  one,
                                which  merely  specifies the
                                default action?  Such  rules
                                are  often required to avoid
                                matching  some  other   rule
                                which  is  not desired.  For
                                example, if there is a  rule
                                which  matches  read it will
                                normally match the instances
                                of  read  contained in bread
                                or readjust; to avoid  this,
                                a rule of the form [a-z]+ is
                                needed.  This  is  explained
                                further below.

                                     Sometimes  it  is  more
                                convenient  to  know the end
                                of  what  has  been   found;
                                hence  Lex  also  provides a
                                count yyleng of  the  number
                                of  characters  matched.  To
                                count  both  the  number  of
                                words   and  the  number  of
                                characters in words  in  the
input, the user might write
[a-zA-Z]+   {words++; chars += yyleng;}
which accumulates  in  chars
the  number of characters in
the words  recognized.   The
last character in the string
matched can be accessed by
      yytext[yyleng-1]

     Occasionally,   a   Lex
action  may  decide  that  a
rule has not recognized  the
correct  span of characters.
Two routines are provided to
aid   with  this  situation.
First,   yymore()   can   be
called  to indicate that the
next input expression recog-
nized  is to be tacked on to
the end of this input.  Nor-
mally, the next input string
would overwrite the  current
entry  in  yytext.   Second,
yyless(n)  may be called  to
indicate  that  not  all the
characters  matched  by  the
currently successful expres-
sion are wanted  right  now.
The argument n indicates the
number  of   characters   in
yytext   to   be   retained.
Further  characters   previ-
ously  matched  are returned
to the input.  This provides
the  same  sort of lookahead
offered by the  /  operator,
but in a different form.

     _E_x_a_m_p_l_e:   Consider   a
language   which  defines  a
string as a set  of  charac-
ters  between  quotation (")
marks, and provides that  to
include  a  " in a string it
must be  preceded  by  a  \.
The regular expression which
matches  that  is   somewhat
confusing,  so that it might
be preferable to write
\"[^"]*   {
          if (yytext[yyleng-1] == '\\')
               yymore();
          else
               ... normal user processing
          }
                                which will, when faced  with
                                a  string such as "abc\"def"
                                first match the five charac-
                                ters "abc\; then the call to
                                yymore() will cause the next
                                part of the string, "def, to
                                be tacked on the end.   Note
                                that  the  final  quote ter-
                                minating the  string  should
                                be  picked  up  in  the code
                                labeled  ``normal   process-
                                ing''.
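
The accumulation performed by yymore() in this example can be
imitated in plain C.  What follows is only a sketch under stated
assumptions: scan_string_body is a hypothetical helper, not a Lex
routine, and it works on a complete string held in memory rather
than on the Lex input stream.  Each pass of the outer loop plays
the part of one match of \"[^"]*, and going around the loop again
plays the part of yymore().

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Lex).  Returns the number of
 * characters that would end up in yytext for a quoted string that
 * starts at s[0], excluding the terminating quote.  Returns 0 if
 * s does not start with a quote.
 */
size_t scan_string_body(const char *s)
{
    size_t len = 0;

    assert(s != NULL);
    while (s[len] == '"') {
        len++;                              /* \"    : the quote itself    */
        while (s[len] != '\0' && s[len] != '"')
            len++;                          /* [^"]* : run without a quote */
        if (s[len - 1] != '\\')
            break;                          /* else branch: normal processing */
        /* text ends in a backslash: behave like yymore() and extend it */
    }
    return len;
}
```

For the input "abc\"def" the sketch reports nine characters and
leaves the terminating quote unread, as described above.  Like the
rule it imitates, it treats any quote preceded by a backslash as
escaped, so a string whose last character is an escaped backslash
would be misread.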

                                     The  function  yyless()
                                might  be  used to reprocess
                                text   in    various    cir-
                                cumstances.   Consider the C
                                problem  of   distinguishing
                                the  ambiguity  of  ``=-a''.
                                Suppose  it  is  desired  to
                                treat  this  as ``=- a'' but
                                print  a  message.   A  rule
                                might be
                                =-[a-zA-Z]   {
                                             printf("Op (=-) ambiguous\n");
                                             yyless(yyleng-1);
                                             ... action for =- ...
                                             }
                                which  prints   a   message,
                                returns the letter after the
                                operator   to   the    input
                                stream,   and   treats   the
                                operator as ``=-''.   Alter-
                                natively it might be desired
                                to treat this as ``=   -a''.
                                To  do this, just return the
                                minus sign as  well  as  the
                                letter to the input:
                                =-[a-zA-Z]   {
                                             printf("Op (=-) ambiguous\n");
                                             yyless(yyleng-2);
                                             ... action for = ...
                                             }
                                will   perform   the   other
                                interpretation.   Note  that
                                the expressions for the  two
                                cases  might  more easily be
                                written
                                       =-/[A-Za-z]
                                in the first case and
                                        =/-[A-Za-z]
                                in  the  second;  no  backup
                                would  be  required  in  the
rule  action.   It  is   not
necessary  to  recognize the
whole identifier to  observe
the  ambiguity.   The possi-
bility of ``=-3'',  however,
makes
        =-/[^ \t\n]
a still better rule.

     In  addition  to  these
routines,  Lex  also permits
access to the  I/O  routines
it uses.  They are:

1)   input()  which  returns
     the  next input charac-
     ter;

2)   output(c) which  writes
     the  character c on the
     output; and

3)   unput(c)  which  pushes
     the  character  c  back
     onto  the  input stream
     to  be  read  later  by
     input().

By  default  these  routines
are provided as macro defin-
itions,  but  the  user  can
override   them  and  supply
private   versions.    These
routines  define  the  rela-
tionship  between   external
files  and  internal charac-
ters,  and   must   all   be
retained  or  modified  con-
sistently.   They   may   be
redefined, to cause input or
output to be transmitted  to
or   from   strange  places,
including other programs  or
internal   memory;  but  the
character set used  must  be
consistent  in all routines;
a value of zero returned  by
_i_n_p_u_t must mean end of file;
and the relationship between
_u_n_p_u_t   and  _i_n_p_u_t  must  be
retained or  the  Lex  look-
ahead  will  not  work.  Lex
does not look ahead  at  all
if  it does not have to, but
every rule ending in +  *  ?
777777777777777777777777777777777777777777777777777777                                or $ or containing / implies
                                lookahead.    Lookahead   is
                                also  necessary  to match an
                                expression that is a  prefix
                                of  another expression.  See
                                below for  a  discussion  of
                                the  character  set  used by
                                Lex.    The   standard   Lex
                                library  imposes a 100 char-
                                acter limit on backup.
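
As an illustration of private versions of these routines, the
sketch below reads from internal memory.  The names mem_set_source,
mem_input and mem_unput are hypothetical; in an actual Lex source
the routines would be named input and unput, would replace the
default macros, and would be accompanied by a matching output.
The sketch keeps to the requirements stated above: a value of zero
means end of file, and a character given to mem_unput is the next
one returned by mem_input.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: same size as the 100 character backup limit cited above. */
#define PUSHBACK_MAX 100

static const char *src = "";          /* hypothetical in-memory source */
static char pushback[PUSHBACK_MAX];   /* stack of characters pushed back */
static size_t npushed = 0;

/* Select the text to be read and discard any pushed-back characters. */
void mem_set_source(const char *text)
{
    assert(text != NULL);
    src = text;
    npushed = 0;
}

/* Return the next character, or 0 at end of input. */
int mem_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Push c back so that it is the next character mem_input returns. */
void mem_unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}
```

Because the pushed-back characters are kept on a stack, characters
returned to the input in reverse order of reading come back in
their original order.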

                                     Another   Lex   library
                                routine  that  the user will
                                sometimes want  to  redefine
                                is  _y_y_w_r_a_p() which is called
                                whenever  Lex   reaches   an
                                end-of-file.     If   _y_y_w_r_a_p
                                returns a 1,  Lex  continues
                                with  the  normal  wrapup on
                                end  of  input.   Sometimes,
                                however, it is convenient to
                                arrange for  more  input  to
                                arrive  from  a  new source.
                                In  this  case,   the   user
                                should   provide   a  _y_y_w_r_a_p
                                which arranges for new input
                                and    returns    0.    This
                                instructs  Lex  to  continue
                                processing.    The   default
                                _y_y_w_r_a_p always returns 1.

                                     This routine is also  a
                                convenient  place  to  print
                                tables, summaries,  etc.  at
                                the  end of a program.  Note
                                that it is not  possible  to
                                write  a  normal  rule which
                                recognizes end-of-file;  the
                                only  access  to this condi-
                                tion is through yywrap.   In
                                fact,  unless a private ver-
                                sion of input() is  supplied
                                a file containing nulls can-
                                not  be  handled,  since   a
                                value of 0 returned by input
                                is taken to be end-of-file.
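
A yywrap that arranges for more input might be organized along the
following lines.  This is a sketch only: the list pending and the
variable current_source are hypothetical stand-ins for whatever
mechanism actually supplies the input (reopening the input file,
for instance), which is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list of further inputs, held in memory for simplicity. */
static const char *const pending[] = { "second source", "third source" };
static const size_t npending = sizeof pending / sizeof pending[0];
static size_t next_pending = 0;

/* Hypothetical: the text that the input routine is currently reading. */
const char *current_source = "first source";

/* Called at end of input: 0 means continue reading, 1 means wrap up. */
int yywrap(void)
{
    assert(next_pending <= npending);
    if (next_pending < npending) {
        current_source = pending[next_pending++];
        return 0;               /* new input arranged: keep processing */
    }
    return 1;                   /* nothing left: normal wrapup */
}
```

The routine returns 0 once for each further source and 1 on every
call after the list is exhausted, so processing eventually ends.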

                                _5.  _A_m_b_i_g_u_o_u_s _S_o_u_r_c_e _R_u_l_e_s.

                                     Lex can handle  ambigu-
                                ous   specifications.   When
                                more than one expression can
                                match the current input, Lex
                                chooses as follows:

1)   The  longest  match  is
     preferred.

2)   Among    rules    which
     matched the same number
     of characters, the rule
     given   first  is  pre-
     ferred.

Thus, suppose the rules
integer   keyword action ...;
[a-z]+    identifier action ...;
to be given in  that  order.
If the input is _i_n_t_e_g_e_r_s, it
is taken as  an  identifier,
because   [_a-_z]+  matches  8
characters   while   _i_n_t_e_g_e_r
matches   only  7.   If  the
input is _i_n_t_e_g_e_r, both rules
match  7 characters, and the
keyword  rule  is   selected
because  it was given first.
Anything shorter (e.g.  _i_n_t)
will  not  match the expres-
sion  _i_n_t_e_g_e_r  and  so   the
identifier interpretation is
used.
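
The two selection principles can be spelled out in C for this pair
of rules.  The sketch below is an illustration, not the automaton
Lex generates: the function names and the RULE_ constants are
hypothetical, each rule is tried separately, and the letters a to
z are assumed to be contiguous in the character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { RULE_NONE = -1, RULE_KEYWORD = 0, RULE_IDENTIFIER = 1 };

/* Length matched at the start of s by the rule  integer  (0 if none). */
static size_t match_keyword(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Length matched at the start of s by the rule  [a-z]+  (0 if none). */
static size_t match_identifier(const char *s)
{
    size_t n = 0;

    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Pick the rule for the text at s; the matched length goes in *len. */
int choose_rule(const char *s, size_t *len)
{
    size_t lens[2];
    int best = RULE_NONE;
    int i;

    assert(s != NULL && len != NULL);
    lens[RULE_KEYWORD] = match_keyword(s);       /* rule given first  */
    lens[RULE_IDENTIFIER] = match_identifier(s); /* rule given second */
    *len = 0;
    for (i = 0; i < 2; i++) {
        if (lens[i] > *len) {   /* strictly longer wins; a tie keeps */
            *len = lens[i];     /* the rule that was given first     */
            best = i;
        }
    }
    return best;
}
```

With this sketch, integers selects the identifier rule with length
8, integer selects the keyword rule with length 7, and int selects
the identifier rule with length 3.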

     The    principle     of
preferring the longest match
makes    rules    containing
expressions      like     .*
dangerous.  For example,
            '.*'
might seem  a  good  way  of
recognizing a string in sin-
gle quotes.  But  it  is  an
invitation  for  the program
to read far  ahead,  looking
for  a distant single quote.
Presented with the input
'first' quoted string here, 'second' here
the  above  expression  will
match
'first' quoted string here, 'second'
which is probably  not  what
was  wanted.   A better rule
is of the form
         '[^'\n]*'
which, on the  above  input,
will   stop  after  'first'.
The consequences  of  errors
like  this  are mitigated by
the   fact   that   the    .
777777777777777777777777777777777777777777777777777777                                operator will not match new-
                                line.  Thus expressions like
                                .* stop on the current line.
                                Don't  try  to  defeat  this
                                with expressions like [._\_n]+
                                or equivalents; the Lex gen-
                                erated  program  will try to
                                read the entire input  file,
                                causing    internal   buffer
                                overflows.

                                     Note that Lex  is  nor-
                                mally partitioning the input
                                stream,  not  searching  for
                                all possible matches of each
                                expression.  This means that
                                each  character is accounted
                                for once and only once.  For
                                example,   suppose   it   is
                                desired to count occurrences
                                of  both  she  and  he in an
                                input text.  Some Lex  rules
                                to do this might be
                                         she   s++;
                                         he    h++;
                                         \n    |
                                         .     ;
                                where  the  last  two  rules
                                ignore everything besides he
                                and she.   Remember  that  .
                                does  not  include  newline.
                                Since she includes  he,  Lex
                                will  normally not recognize
                                the instances of he included
                                in  she,  since  once it has
                                passed a she  those  charac-
                                ters are gone.

                                     Sometimes   the    user
                                would  like to override this
                                choice.  The  action  REJECT
                                means   ``go   do  the  next
                                alternative.''   It   causes
                                whatever   rule  was  second
                                choice  after  the   current
                                rule  to  be  executed.  The
                                position   of   the    input
                                pointer  is adjusted accord-
                                ingly.   Suppose  the   user
                                really  wants  to  count the
                                included instances of he:
                                    she   {s++; REJECT;}
                                    he    {h++; REJECT;}
                                    \n    |
                                    .     ;
these rules are one  way  of
changing  the previous exam-
ple to do just that.   After
counting each expression, it
is    rejected;     whenever
appropriate,    the    other
expression  will   then   be
counted.   In  this example,
of course,  the  user  could
note  that  she  includes he
but not vice versa, and omit
the  REJECT action on he; in
other  cases,  however,   it
would   not  be  possible  a
priori to tell  which  input
characters   were   in  both
classes.
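
The difference between the two sets of rules can be imitated in C.
The sketch below is not what Lex generates; count_she_he and
struct counts are hypothetical names, and the text is taken from a
string in memory.  Without REJECT a matched word is consumed
whole.  With REJECT on both rules the second choice at each
position turns out to be the one-character rule, so the scan
advances a single character at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct counts { int s; int h; };   /* s counts she, h counts he */

/* reject == 0 imitates the plain rules, nonzero the rules with REJECT. */
struct counts count_she_he(const char *text, int reject)
{
    struct counts c = { 0, 0 };
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0') {
        size_t step = 1;                 /* the  .  and  \n  rules */
        if (strncmp(text + i, "she", 3) == 0) {
            c.s++;
            if (!reject)
                step = 3;                /* the whole match is consumed */
        } else if (strncmp(text + i, "he", 2) == 0) {
            c.h++;
            if (!reject)
                step = 2;
        }
        i += step;
    }
    return c;
}
```

For the text "she said he" the plain rules give one she and one
he, while the rules with REJECT give one she and two instances of
he, the extra one being the he inside she.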

     Consider the two rules
 a[bc]+   { ... ; REJECT;}
 a[cd]+   { ... ; REJECT;}
If the input is _a_b, only the
first  rule  matches, and on
_a_d only the second  matches.
The    input   string   _a_c_c_b
matches the first  rule  for
four characters and then the
second rule for three  char-
acters.   In  contrast,  the
input _a_c_c_d agrees  with  the
second rule for four charac-
ters and then the first rule
for three.

     In general,  REJECT  is
useful  whenever the purpose
of Lex is not  to  partition
the   input  stream  but  to
detect all examples of  some
items  in the input, and the
instances of these items may
overlap   or   include  each
other.   Suppose  a   digram
table   of   the   input  is
desired;    normally     the
digrams overlap, that is the
word _t_h_e  is  considered  to
contain   both  _t_h  and  _h_e.
Assuming  a  two-dimensional
array  named  _d_i_g_r_a_m  to  be
incremented, the appropriate
source is
                                %%
                                [a-z][a-z]   {
                                             digram[yytext[0]][yytext[1]]++;
                                             REJECT;
                                             }
                                .            ;
                                \n           ;
                                where the REJECT  is  neces-
                                sary  to  pick  up  a letter
                                pair  beginning   at   every
                                character,  rather  than  at
                                every other character.
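
The effect of this source, a count of the letter pair beginning at
every character, can be checked against a direct C loop.  The
sketch below makes two assumptions that differ from the Lex source
above: the table is 26 by 26 and indexed from zero rather than by
the raw character values, and the letters a to z are assumed to be
contiguous.  The name count_digrams is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: only lower-case pairs are counted, indexed 0..25. */
static long digram[26][26];

static int is_lower_letter(char c)
{
    return c >= 'a' && c <= 'z';
}

/* Count the pair starting at every position, as REJECT arranges. */
void count_digrams(const char *text)
{
    size_t i;

    assert(text != NULL);
    for (i = 0; text[i] != '\0' && text[i + 1] != '\0'; i++) {
        if (is_lower_letter(text[i]) && is_lower_letter(text[i + 1]))
            digram[text[i] - 'a'][text[i + 1] - 'a']++;
    }
}
```

Given the word ``the'' the loop records both th and he, which is
the overlapping behavior that the REJECT in the Lex source is
there to obtain.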

                                _6.  _L_e_x _S_o_u_r_c_e _D_e_f_i_n_i_t_i_o_n_s.

                                     Remember the format  of
                                the Lex source:
                                      {definitions}
                                      %%
                                      {rules}
                                      %%
                                      {user routines}
                                So far only the  rules  have
                                been  described.   The  user
                                needs  additional   options,
                                though,  to define variables
                                for use in his  program  and
                                for  use  by Lex.  These can
                                go either in the definitions
                                section or in the rules sec-
                                tion.

                                     Remember  that  Lex  is
                                turning  the  rules  into  a
                                program.   Any  source   not
                                intercepted by Lex is copied
                                into the generated  program.
                                There  are  three classes of
                                such things.

                                1)   Any line which  is  not
                                     part  of  a Lex rule or
                                     action   which   begins
                                     with  a blank or tab is
                                     copied  into  the   Lex
                                     generated      program.
                                     Such source input prior
                                     to  the first %% delim-
                                     iter will  be  external
                                     to  any function in the
                                     code;  if  it   appears
                                     immediately  after  the
                                     first %%, it appears in
                                     an   appropriate  place
                                     for declarations in the
     function written by Lex
     which   contains    the
     actions.  This material
     must look like  program
     fragments,  and  should
     precede the  first  Lex
     rule.

     As a side effect of the
     above,    lines   which
     begin with a  blank  or
     tab,  and which contain
     a comment,  are  passed
     through   to  the  gen-
     erated  program.   This
     can  be used to include
     comments in either  the
     Lex  source or the gen-
     erated code.  The  com-
     ments should follow the
     host  language  conven-
     tion.

2)   Anything       included
     between  lines contain-
     ing only %{ and  %}  is
     copied  out  as  above.
     The delimiters are dis-
     carded.    This  format
     permits  entering  text
     like       preprocessor
     statements  that   must
     begin  in  column 1, or
     copying lines  that  do
     not look like programs.

3)   Anything   after    the
     second  %%   delimiter,
     regardless of  formats,
     etc.,   is  copied  out
     after the Lex output.

     Definitions    intended
for Lex are given before the
first  %%  delimiter.    Any
line  in  this  section  not
contained between %{ and %},
and beginning in column 1, is
assumed to define  Lex  sub-
stitution strings.  The for-
mat of such lines is
    name translation
and  it  causes  the  string
given as a translation to be
777777777777777777777777777777777777777777777777777777                                associated  with  the  name.
                                The   name  and  translation
                                must  be  separated  by   at
                                least  one blank or tab, and
                                the name must begin  with  a
                                letter.  The translation can
                                then be called  out  by  the
                                {name}  syntax  in  a  rule.
                                Using {D} for the digits and
                                {E}  for  an exponent field,
                                for example, might  abbrevi-
                                ate   rules   to   recognize
                                numbers:
                                D                   [0-9]
                                E                   [DEde][-+]?{D}+
                                %%
                                {D}+                printf("integer");
                                {D}+"."{D}*({E})?   |
                                {D}*"."{D}+({E})?   |
                                {D}+{E}             printf("real");
                                Note the first two rules for
                                real numbers; both require a
                                decimal point and contain an
                                optional exponent field, but
                                the first requires at  least
                                one digit before the decimal
                                point   and    the    second
                                requires  at least one digit
                                after the decimal point.  To
                                correctly handle the problem
                                posed by a  Fortran  expres-
                                sion  such as 35.EQ.I, which
                                does  not  contain  a   real
                                number,  a context-sensitive
                                rule such as
                                [0-9]+/"."EQ   printf("integer");
                                could be used in addition to
                                the    normal    rule    for
                                integers.
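
The three forms of real number, and the use of the D and E
definitions inside them, can be traced in a small C classifier.
This is a sketch with hypothetical names (classify, digits,
exponent, enum num_kind), and it differs from the Lex rules in one
respect that should be kept in mind: it judges a complete string,
whereas Lex looks for the longest match at the current input
position.  It does not attempt the Fortran 35.EQ.I case.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum num_kind { NOT_NUMBER, INTEGER, REAL };

/* {D}* : number of digits at the start of s. */
static size_t digits(const char *s)
{
    size_t n = 0;

    while (s[n] >= '0' && s[n] <= '9')
        n++;
    return n;
}

/* {E} : length of an exponent field at the start of s, or 0 if none. */
static size_t exponent(const char *s)
{
    size_t n, d;

    if (s[0] == '\0' || strchr("DEde", s[0]) == NULL)
        return 0;
    n = 1;
    if (s[n] == '-' || s[n] == '+')
        n++;
    d = digits(s + n);
    return d > 0 ? n + d : 0;          /* {D}+ is required */
}

enum num_kind classify(const char *s)
{
    size_t before, after = 0, e, pos;
    int dot;

    assert(s != NULL);
    before = digits(s);
    pos = before;
    dot = (s[pos] == '.');
    if (dot) {
        pos++;
        after = digits(s + pos);
        pos += after;
    }
    e = exponent(s + pos);
    pos += e;
    if (s[pos] != '\0')
        return NOT_NUMBER;             /* something is left over */
    if (!dot)                          /* {D}+  or  {D}+{E} */
        return before == 0 ? NOT_NUMBER : (e > 0 ? REAL : INTEGER);
    /* with a point, a digit is needed on at least one side of it */
    return (before > 0 || after > 0) ? REAL : NOT_NUMBER;
}
```

Under these assumptions 35 is classified as an integer; 35., .5,
3e5 and 3.5E-10 as real; and a lone point or a bare exponent such
as e5 as neither.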

                                     The definitions section
                                may  also contain other com-
                                mands, including the  selec-
                                tion  of  a host language, a
                                character set table, a  list
                                of   start   conditions,  or
                                adjustments to  the  default
                                size  of  arrays  within Lex
                                itself  for  larger   source
                                programs.   These possibili-
                                ties  are  discussed   below
                                under  ``Summary  of  Source
                                Format,'' section 12.


7.  Usage.

     There are two steps  in
compiling  a Lex source pro-
gram.  First, the Lex source
must  be  turned into a gen-
erated program in  the  host
general   purpose  language.
Then this  program  must  be
compiled and loaded, usually
with a library of  Lex  sub-
routines.    The   generated
program is on a  file  named
_l_e_x._y_y._c.   The  I/O library
is defined in terms of the C
standard library [6].

     The  C  programs   gen-
erated  by  Lex are slightly
different on OS/370, because
the   OS  compiler  is  less
powerful than  the  UNIX  or
GCOS   compilers,  and  does
less  at  compile  time.   C
programs  generated  on GCOS
and UNIX are the same.

     _U_N_I_X.  The  library  is
accessed  by the loader flag
-_l_l.  So an appropriate  set
of commands is
     lex source cc  lex.yy.c
     -ll
The  resulting  program   is
placed  on  the  usual  file
_a._o_u_t for  later  execution.
To  use  Lex  with  Yacc see
below.  Although the default
Lex  I/O  routines use the C
standard  library,  the  Lex
automata  themselves  do not
do so; if  private  versions
of  input,  output and unput
are given, the  library  can
be avoided.

_8.  _L_e_x _a_n_d _Y_a_c_c.

     If you want to use  Lex
with  Yacc,  note  that what
Lex  writes  is  a   program
named   yylex(),   the  name
required  by  Yacc  for  its
analyzer.    Normally,   the
                                default main program on  the
                                Lex  library calls this rou-
                                tine, but if Yacc is loaded,
                                and   its  main  program  is
                                used,   Yacc    will    call
                                yylex().   In this case each
                                Lex rule should end with
                                       return(token);
                                where the appropriate  token
                                value  is returned.  An easy
                                way to get access to  Yacc's
                                names  for tokens is to com-
                                pile the Lex output file  as
                                part of the Yacc output file
                                by placing the line
                                    # include "lex.yy.c"
                                in the last section of  Yacc
                                input.   Supposing the gram-
                                mar to be named ``good'' and
                                the   lexical  rules  to  be
                                named  ``better''  the  UNIX
                                command  sequence  can  just
                                be:
                                     yacc good
                                     lex better
                                     cc y.tab.c -ly -ll
                                The   Yacc   library   (-ly)
                                should  be loaded before the
                                Lex  library,  to  obtain  a
                                main  program  which invokes
                                the Yacc parser.   The  gen-
                                erations  of  Lex  and  Yacc
                                programs  can  be  done   in
                                either order.

                                _9.  _E_x_a_m_p_l_e_s.

                                     As a  trivial  problem,
                                consider  copying  an  input
                                file while adding 3 to every
                                positive number divisible by
                                7.  Here is a  suitable  Lex
                                source program
                                %%
                                         int k;
                                [0-9]+   {
                                         k = atoi(yytext);
                                         if (k%7 == 0)
                                              printf("%d", k+3);
                                         else
                                              printf("%d",k);
                                         }
                                to do just that.   The  rule
                                [0-9]+ recognizes strings of
digits;  atoi  converts  the
digits  to binary and stores
the result in k.  The opera-
tor % (remainder) is used to
check whether k is divisible
by 7; if it is, it is incre-
mented by 3 as it is written
out.   It  may  be  objected
that this program will alter
such input items as 49.63 or
X7.  Furthermore, it  incre-
ments  the absolute value of
all negative numbers divisi-
ble  by  7.   To avoid this,
just add a  few  more  rules
after  the  active  one,  as
here:
%%
                       int k;
-?[0-9]+               {
                       k = atoi(yytext);
                       printf("%d",
                         k%7 == 0 ? k+3 : k);
                       }
-?[0-9.]+              ECHO;
[A-Za-z][A-Za-z0-9]+   ECHO;
Numerical strings containing
a  ``.''  or  preceded  by a
letter will be picked up  by
one  of  the last two rules,
and   not   changed.     The
_i_f-_e_l_s_e has been replaced by
a C  conditional  expression
to   save  space;  the  form
_a?_b:_c means ``if  _a  then  _b
else _c''.

     For   an   example   of
statistics  gathering,  here
is a  program  which  histo-
grams  the lengths of words,
where a word is defined as a
string of letters.
         int lengs[100];
%%
[a-z]+   lengs[yyleng]++;
.        |
\n       ;
%%
yywrap()
{
int i;
printf("Length  No. words\n");
for(i=0; i<100; i++)
777777777777777777777777777777777777777777777777777777                                     if (lengs[i] > 0)
                                          printf("%5d%10d\n",i,lengs[i]);
                                return(1);
                                }
                                This program accumulates the
                                histogram,  while  producing
                                no output.  At  the  end  of
                                the   input  it  prints  the
                                table.  The final statement
                                return(1); indicates that
                                Lex is to perform wrapup.
                                If yywrap returns zero
                                (false) it implies that
                                further input is available
                                and the program is to
                                continue reading and
                                processing.  Providing a
                                yywrap that never returns
                                true causes an infinite
                                loop.
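
The counting done by the rules can be imitated in plain C, which
may help when checking the expected table by hand.  This is a
sketch under stated assumptions, not code produced by Lex: the
function histogram and the constant MAXLEN are hypothetical
names, and a word is taken to be a run of lower-case letters, as
in the rule [a-z]+.  Unlike the action lengs[yyleng]++, the
sketch ignores words of MAXLEN or more letters rather than
indexing past the end of the array.

```c
#include <ctype.h>
#include <stddef.h>

#define MAXLEN 100   /* same size as the lengs array in the Lex source */

/* Hypothetical stand-in for the scanner: adds to lengs[n] the number
   of words of length n found in the string s.  The caller is expected
   to have zeroed lengs beforehand. */
void histogram(const char *s, int lengs[MAXLEN])
{
    size_t len = 0;                  /* length of the word being read */

    for (;; s++) {
        if (islower((unsigned char)*s)) {
            len++;                   /* still inside a word */
            continue;
        }
        if (len > 0 && len < MAXLEN)
            lengs[len]++;            /* a word has just ended */
        len = 0;
        if (*s == '\0')
            break;                   /* end of input */
    }
}
```

For the input "the cat sat on a mat" this leaves 1 in lengs[1],
1 in lengs[2] and 4 in lengs[3].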

                                     As  a  larger  example,
                                here  are  some  parts  of a
                                program  written  by  N.  L.
                                Schryer  to  convert  double
                                precision Fortran to  single
                                precision  Fortran.  Because
                                Fortran does not distinguish
                                upper    and    lower   case
                                letters, this routine begins
                                by defining a set of classes
                                including both cases of each
                                letter:
                                         a     [aA]
                                         b     [bB]
                                         c     [cC]
                                         ...
                                         z     [zZ]
                                An additional  class  recog-
                                nizes white space:
                                         W   [ \t]*
                                The first rule changes
                                ``double precision'' to
                                ``real'', or ``DOUBLE
                                PRECISION'' to ``REAL''.
                                {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                                     printf(yytext[0]=='d'? "real" : "REAL");
                                     }
                                Care  is  taken   throughout
                                this program to preserve the
                                case (upper or lower) of the
                                original  program.  The con-
                                ditional operator is used to
                                select  the  proper  form of
                                the keyword.  The next  rule
copies   continuation   card
indications to avoid confus-
ing them with constants:
   ^"     "[^ 0]   ECHO;
In the  regular  expression,
the   quotes   surround  the
blanks.  It  is  interpreted
as ``beginning of line, then
five blanks,  then  anything
but  blank  or  zero.'' Note
the two  different  meanings
of  ^.   There  follow  some
rules to change double  pre-
cision constants to ordinary
floating constants.
[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     |
[0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
"."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
     /* convert constants */
     for(p=yytext; *p != 0; p++)
          {
          if (*p == 'd' || *p == 'D')
               *p += 'e' - 'd';
          }
     ECHO;
     }
After  the  floating   point
constant  is  recognized, it
is scanned by the for loop
to find the letter d or D.
The program then adds
'e'-'d', which converts it
to the next letter of the
alphabet.  The modified
constant, now
single-precision, is written
out again.  There follows a
series of names which must
be respelled to remove their
initial d.  By using the
array yytext the
same action suffices for all
the  names (only a sample of
a rather long list is  given
here).
{d}{s}{i}{n}         |
{d}{c}{o}{s}         |
{d}{s}{q}{r}{t}      |
{d}{a}{t}{a}{n}      |
...
{d}{f}{l}{o}{a}{t}   printf("%s",yytext+1);
Another list of  names  must
have  initial  _d  changed to
initial _a:
{d}{l}{o}{g}     |
{d}{l}{o}{g}10   |
777777777777777777777777777777777777777777777777777777                                {d}{m}{i}{n}1    |
                                {d}{m}{a}{x}1    {
                                                 yytext[0] =+ 'a' - 'd';
                                                 ECHO;
                                                 }
                                And one routine must have
                                initial d changed to initial
                                r:
                                {d}1{m}{a}{c}{h}   {yytext[0] += 'r' - 'd';
                                                   ECHO;
                                                   }

                                To avoid such names as dsinx
                                being detected as instances
                                of dsin, some final rules
                                pick   up  longer  words  as
                                identifiers  and  copy  some
                                surviving characters:
                                [A-Za-z][A-Za-z0-9]*   |
                                [0-9]+                 |
                                \n                     |
                                .                      ECHO;
                                Note that  this  program  is
                                not  complete;  it  does not
                                deal with the spacing  prob-
                                lems  in Fortran or with the
                                use of keywords as  identif-
                                iers.
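
The character arithmetic used in these actions can be checked in
isolation.  The C fragment below is a sketch only; the function
names to_single_constant, respell_initial and drop_initial are
hypothetical and are not part of the program described above.
It assumes, as the actions do, a character set such as ASCII in
which the offset between two lower-case letters equals the
offset between the corresponding upper-case letters, so that one
expression preserves the case of the original text.

```c
/* Rewrite the exponent letter of a double precision constant in
   place: 'd' becomes 'e' and 'D' becomes 'E'. */
void to_single_constant(char *text)
{
    char *p;

    for (p = text; *p != '\0'; p++)
        if (*p == 'd' || *p == 'D')
            *p += 'e' - 'd';
}

/* Change an initial d or D to another letter of the same case;
   to_lower is the lower-case form of the new letter. */
void respell_initial(char *name, char to_lower)
{
    name[0] += to_lower - 'd';
}

/* Drop the initial letter, as printf("%s",yytext+1) does. */
const char *drop_initial(const char *name)
{
    return name + 1;
}
```

With these definitions "1.5d3" becomes "1.5e3", "2D-4" becomes
"2E-4", "DLOG" respelled with 'a' becomes "ALOG", and
drop_initial applied to "dsin" yields "sin".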
                                _1_0.   _L_e_f_t  _C_o_n_t_e_x_t   _S_e_n_s_i_-
                                _t_i_v_i_t_y.

                                     Sometimes it is  desir-
                                able to have several sets of
                                lexical rules to be  applied
                                at  different  times  in the
                                input.  For example, a  com-
                                piler   preprocessor   might
                                distinguish     preprocessor
                                statements  and analyze them
                                differently  from   ordinary
                                statements.   This  requires
                                sensitivity  to  prior  con-
                                text,  and there are several
                                ways of handling such  prob-
                                lems.   The  ^ operator, for
                                example, is a prior  context
                                operator,        recognizing
                                immediately  preceding  left
                                context just as $ recognizes
                                immediately following  right
                                context.  Adjacent left con-
                                text could be  extended,  to
                                produce  a  facility similar
                                to that for  adjacent  right
                                context,  but it is unlikely
to be as useful, since often
the  relevant  left  context
appeared some time  earlier,
such  as at the beginning of
a line.

     This section  describes
three  means of dealing with
different  environments:   a
simple  use  of  flags, when
only a few rules change from
one  environment to another,
the use of _s_t_a_r_t  _c_o_n_d_i_t_i_o_n_s
on  rules, and the possibil-
ity of making multiple lexi-
cal    analyzers   all   run
together.   In  each   case,
there are rules which recog-
nize the need to change  the
environment   in  which  the
following  input   text   is
analyzed,   and   set   some
parameter  to  reflect   the
change.   This may be a flag
explicitly  tested  by   the
user's  action  code; such a
flag is the simplest way  of
dealing  with  the  problem,
since Lex is not involved at
all.   It  may  be more con-
venient,  however,  to  have
Lex  remember  the  flags as
initial  conditions  on  the
rules.    Any  rule  may  be
associated with a start con-
dition.   It  will  only  be
recognized when  Lex  is  in
that  start  condition.  The
current start condition  may
be   changed  at  any  time.
Finally,  if  the  sets   of
rules   for   the  different
environments are  very  dis-
similar, clarity may be best
achieved by writing  several
distinct  lexical analyzers,
and switching  from  one  to
another as desired.

     Consider the  following
problem:  copy  the input to
the  output,  changing   the
word _m_a_g_i_c to _f_i_r_s_t on every
line which  began  with  the
777777777777777777777777777777777777777777777777777777                                letter  _a, changing _m_a_g_i_c to
                                _s_e_c_o_n_d on every  line  which
                                began with the letter _b, and
                                changing _m_a_g_i_c to  _t_h_i_r_d  on
                                every  line which began with
                                the  letter  _c.   All  other
                                words  and  all  other lines
                                are left unchanged.

                                     These rules are so sim-
                                ple  that the easiest way to
                                do this job is with a flag:
                                        int flag;
                                %%
                                ^a      {flag = 'a'; ECHO;}
                                ^b      {flag = 'b'; ECHO;}
                                ^c      {flag = 'c'; ECHO;}
                                \n      {flag =  0 ; ECHO;}
                                magic   {
                                        switch (flag)
                                        {
                                        case 'a': printf("first"); break;
                                        case 'b': printf("second"); break;
                                        case 'c': printf("third"); break;
                                        default: ECHO; break;
                                        }
                                        }
                                should be adequate.

                                     To  handle   the   same
                                problem  with  start  condi-
                                tions, each start  condition
                                must be introduced to Lex in
                                the definitions section with
                                a line reading
                                  %Start   name1 name2 ...
                                where the conditions may  be
                                named  in  any  order.   The
                                word _S_t_a_r_t may  be  abbrevi-
                                ated  to _s or _S.  The condi-
                                tions may be  referenced  at
                                the  head of a rule with the
                                <> brackets:
                                     <name1>expression
                                is  a  rule  which  is  only
                                recognized  when  Lex  is in
                                the start condition name1.
                                To  enter a start condition,
                                execute the action statement
                                        BEGIN name1;
                                which changes the start
                                condition to name1.  To resume
                                the normal state,
                                          BEGIN 0;
resets the initial condition
of  the Lex automaton inter-
preter.   A  rule   may   be
active in several start con-
ditions:
    <name1,name2,name3>
is a legal prefix.  Any rule
not  beginning  with  the <>
prefix  operator  is  always
active.

     The  same  example   as
before can be written:
%START AA BB CC
%%
^a                {ECHO; BEGIN AA;}
^b                {ECHO; BEGIN BB;}
^c                {ECHO; BEGIN CC;}
\n                {ECHO; BEGIN 0;}
<AA>magic         printf("first");
<BB>magic         printf("second");
<CC>magic         printf("third");
where the logic  is  exactly
the  same as in the previous
method of handling the prob-
lem,  but  Lex does the work
rather than the user's code.
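
What the start conditions accomplish can be imitated in plain C
by keeping the current condition in a variable.  The fragment
below is a sketch of the behavior only, not the code that Lex
generates; the function translate, the type enum start and the
name INITIAL (standing for condition 0) are assumptions made for
illustration.  It copies a string, replacing magic according to
the letter that began the current line.

```c
#include <stddef.h>
#include <string.h>

enum start { INITIAL = 0, AA, BB, CC };  /* INITIAL plays the role of 0 */

/* Copy in to out (capacity outsz), replacing "magic" by the word
   selected by the first letter of the line.  Output that does not
   fit is silently dropped. */
void translate(const char *in, char *out, size_t outsz)
{
    static const char letters[] = "abc";
    static const char *const word[] = { "magic", "first", "second", "third" };
    enum start cond = INITIAL;
    int bol = 1;                         /* at beginning of line? */
    size_t n = 0;

    if (outsz == 0)
        return;
    while (*in != '\0') {
        const char *piece = in;          /* default action: ECHO one char */
        size_t len = 1, used = 1;
        const char *q = strchr(letters, *in);

        if (bol && q != NULL) {          /* ^a, ^b, ^c: BEGIN AA, BB, CC */
            cond = (enum start)(AA + (q - letters));
        } else if (*in == '\n') {        /* \n: BEGIN 0 */
            cond = INITIAL;
        } else if (strncmp(in, "magic", 5) == 0) {
            piece = word[cond];          /* <AA>magic, <BB>magic, ... */
            len = strlen(piece);
            used = 5;
        }
        bol = (*in == '\n');
        if (n + len >= outsz)
            break;
        memcpy(out + n, piece, len);
        n += len;
        in += used;
    }
    out[n] = '\0';
}
```

Given the four lines "a magic", "b magic", "c magic" and
"d magic", the sketch produces "a first", "b second", "c third"
and "d magic".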

_1_1.  _C_h_a_r_a_c_t_e_r _S_e_t.

     The programs  generated
by  Lex handle character I/O
only  through  the  routines
_i_n_p_u_t,  _o_u_t_p_u_t,  and  _u_n_p_u_t.
Thus the character represen-
tation   provided  in  these
routines is accepted by  Lex
and   employed   to   return
values   in   _y_y_t_e_x_t.    For
internal  use a character is
represented   as   a   small
integer  which, if the stan-
dard library is used, has  a
value  equal  to the integer
value  of  the  bit  pattern
representing  the  character
on the host computer.   Nor-
mally,   the   letter  _a  is
represented as the same form
as  the  character  constant
'_a'.  If this interpretation
is changed, by providing I/O
routines which translate the
characters, Lex must be told
777777777777777777777777777777777777777777777777777777                                about it, by giving a trans-
                                lation  table.   This  table
                                must be in  the  definitions
                                section,  and must be brack-
                                eted  by  lines   containing
                                only ``%T''.  The table con-
                                tains lines of the form
                                {integer} {character string}
                                which  indicate  the   value
                                associated with each charac-
                                ter.  Thus the next example
                                          %T
                                           1    Aa
                                           2    Bb
                                          ...
                                          26    Zz
                                          27    \n
                                          28    +
                                          29    -
                                          30    0
                                          31    1
                                          ...
                                          39    9
                                          %T

                                  Sample character table.
                                maps  the  lower  and  upper
                                case  letters  together into
                                the integers 1  through  26,
                                newline  into  27,  +  and -
                                into  28  and  29,  and  the
                                digits  into  30 through 39.
                                Note the escape for newline.
                                If   a  table  is  supplied,
                                every character that  is  to
                                appear  either  in the rules
                                or in any valid  input  must
                                be  included  in  the table.
                                No character may be assigned
                                the number 0, and no charac-
                                ter may be assigned a bigger
                                number  than the size of the
                                hardware character set.
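
The mapping described by the sample table can be written out as
a C array, which is one way to check that a table obeys the
restrictions just stated.  This is a sketch only: the function
build_table and the constant NCHARS are hypothetical, and an
8-bit hardware character set is assumed.  Characters that do not
appear in the table are left with the value 0, which the rules
above reserve.

```c
#include <string.h>

#define NCHARS 256   /* assumption: 8-bit hardware character set */

/* Fill code[] with the values given in the sample character table. */
void build_table(int code[NCHARS])
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char digits[] = "0123456789";
    int i;

    memset(code, 0, NCHARS * sizeof code[0]);
    for (i = 0; i < 26; i++) {           /* both cases share 1..26 */
        code[(unsigned char)lower[i]] = i + 1;
        code[(unsigned char)upper[i]] = i + 1;
    }
    code['\n'] = 27;
    code['+'] = 28;
    code['-'] = 29;
    for (i = 0; i < 10; i++)             /* digits occupy 30..39 */
        code[(unsigned char)digits[i]] = 30 + i;
}
```

After the call, both 'A' and 'a' map to 1, newline maps to 27,
and '9' maps to 39; a blank, which is absent from the table,
remains 0 and so could not appear in the rules or in valid
input.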

                                _1_2.  _S_u_m_m_a_r_y _o_f _S_o_u_r_c_e  _F_o_r_-
                                _m_a_t.

                                     The general form  of  a
                                Lex source file is:
                                     {definitions}
                                     %%
                                     {rules}
                                     %%
                                     {user subroutines}

The definitions section con-
tains a combination of

1)   Definitions,   in   the
     form    ``name    space
     translation''.

2)   Included code,  in  the
     form ``space code''.

3)   Included code,  in  the
     form
              %{
              code
              %}
4)   Start conditions, given
     in the form
       %S name1 name2 ...
5)   Character  set  tables,
     in the form
  %T
  number space character-string
  ...
  %T
6)   Changes   to   internal
     array   sizes,  in  the
     form
             %_x  _n_n_n
     where _n_n_n is a  decimal
     integer representing an
     array   size   and    _x
     selects  the  parameter
     as follows:
Letter          Parameter
  p      positions
  n      states
  e      tree nodes
  a      transitions
  k      packed character classes
  o      output array size

Lines in the  rules  section
have  the  form ``expression
action''  where  the  action
may be continued on succeed-
ing lines by using braces to
delimit it.

     Regular expressions  in
Lex use the following opera-
tors:
x        the character "x"
"x"      an "x", even if x is an operator.
\x       an "x", even if x is an operator.
777777777777777777777777777777777777777777777777777777                                [xy]     the character x or y.
                                [x-z]    the characters x, y or z.
                                [^x]     any character but x.
                                .        any character but newline.
                                ^x       an x at the beginning of a line.
                                <y>x     an x when Lex is in start condition y.
                                x$       an x at the end of a line.
                                x?       an optional x.
                                x*       0,1,2, ... instances of x.
                                x+       1,2,3, ... instances of x.
                                x|y      an x or a y.
                                (x)      an x.
                                x/y      an x but only if followed by y.
                                {xx}     the translation of xx from the
                                         definitions section.
                                x{m,n}   _m through _n occurrences of x

                                _1_3.  _C_a_v_e_a_t_s _a_n_d _B_u_g_s.

                                     There are  pathological
                                expressions   which  produce
                                exponential  growth  of  the
                                tables   when  converted  to
                                deterministic machines; for-
                                tunately, they are rare.

                                     REJECT does not  rescan
                                the    input;   instead   it
                                remembers the results of the
                                previous  scan.   This means
                                that if a rule with trailing
                                context is found, and REJECT
                                executed, the user must  not
                                have  used  _u_n_p_u_t  to change
                                the  characters  forthcoming
                                from the input stream.  This
                                is the only  restriction  on
                                the  user's ability to mani-
                                pulate the not-yet-processed
                                input.

                                _1_4.  _A_c_k_n_o_w_l_e_d_g_m_e_n_t_s.

                                     As  should  be  obvious
                                from  the above, the outside
                                of Lex is patterned on  Yacc
                                and   the  inside  on  Aho's
                                string  matching   routines.
                                Therefore,  both S. C. John-
                                son and A. V. Aho are really
                                originators  of much of Lex,
                                as well as debuggers of  it.
                                Many thanks are due to both.

     The code of the current
version of Lex was designed,
written,  and  debugged   by
Eric Schmidt.


                   M. E. Lesk and E. Schmidt
MH-1274-MEL-unix

_1_5.  _R_e_f_e_r_e_n_c_e_s.

1.   B. W. Kernighan and D.
     M. Ritchie, The C
     Programming Language,
     Prentice-Hall, N. J.
     (1978).

2.   B. W. Kernighan,
     Ratfor: A Preprocessor
     for a Rational Fortran,
     Software - Practice and
     Experience, 5, pp.
     395-406 (1975).

3.   S. C. Johnson, Yacc:
     Yet Another Compiler
     Compiler, Computing
     Science Technical
     Report No. 32, 1975,
     Bell Laboratories,
     Murray Hill, NJ 07974.

4.   A. V. Aho and M. J.
     Corasick, Efficient
     String Matching: An Aid
     to Bibliographic
     Search, Comm. ACM 18,
     333-340 (1975).

5.   B. W. Kernighan, D. M.
     Ritchie and K. L.
     Thompson, QED Text
     Editor, Computing
     Science Technical
     Report No. 5, 1972,
     Bell Laboratories,
     Murray Hill, NJ 07974.

6.   D. M. Ritchie, private
     communication.  See
     also M. E. Lesk, The
     Portable C Library,
     Computing Science
     Technical Report No.
     31, Bell Laboratories,
     Murray Hill, NJ 07974.