A Tour Through the Portable C Compiler


                       S. C. Johnson


_I_n_t_r_o_d_u_c_t_i_o_n

     A C compiler has been implemented that has proved to be
quite  portable,  serving  as  the  basis for C compilers on
roughly a dozen machines, including the Honeywell 6000,  IBM
370,  and Interdata 8/32.  The compiler is highly compatible
with the C language standard.   Kernighan  Ritchie  Prentice
1978

     Among the goals of this compiler are portability,  high
reliability,  and the use of state-of-the-art techniques and
tools wherever practical.  Although the  efficiency  of  the
compiling  process  is  not  a primary goal, the compiler is
efficient enough, and produces good enough code, to serve as
a production compiler.

     The language implemented is highly compatible with  the
current  PDP-11  version of C.  Moreover, roughly 75% of the
compiler, including nearly all the  syntactic  and  semantic
routines,  is machine independent.  The compiler also serves
as the major portion of the program  _l_i_n_t,  described  else-
where.  Johnson Lint Program Checker

     A number of earlier attempts to make portable compilers
are worth noting.  While on CO-OP assignment to Bell Labs in
1973, Alan Snyder wrote a portable C compiler which was  the
basis  of  his  Master's Thesis at M.I.T.  Snyder Portable C
Compiler This compiler was very slow  and  complicated,  and
contained  a  number of rather serious implementation diffi-
culties; nevertheless, a number of Snyder's ideas appear  in
this work.

     Most earlier portable  compilers,  including  Snyder's,
have proceeded by defining an intermediate language, perhaps
based on three-address code or code for a stack machine, and
writing  a machine independent program to translate from the
source code to this  intermediate  code.   The  intermediate
code  is then read by a second pass, and interpreted or com-
piled.  This approach is elegant, and has a number of advan-
tages,  especially if the target machine is far removed from
the host.  It suffers from some disadvantages as well.  Some


                       August 3, 1987


                           - 2 -


constructions,  like  initialization and subroutine prologs,
are difficult or expensive to express in a machine  indepen-
dent  way that still allows them to be easily adapted to the
target assemblers.  Most of these approaches require a  sym-
bol  table  to  be constructed in the second (machine depen-
dent)  pass,  and/or  require  powerful  target  assemblers.
Also,  many  conversion operators may be generated that have
no effect on a given machine, but may be  needed  on  others
(for  example,  pointer  to  pointer  conversions usually do
nothing in C, but must be generated because there  are  some
machines where they are significant).

     For these reasons, the first pass of the portable  com-
piler is not entirely machine independent.  It contains some
machine dependent features, such as initialization,  subrou-
tine  prolog  and  epilog,  certain storage allocation func-
tions, code for the _s_w_i_t_c_h statement, and code to throw  out
unneeded conversion operators.

     As a crude measure of the degree of  portability  actu-
ally achieved, the Interdata 8/32 C compiler has roughly 600
machine dependent lines of source out of 4600 in Pass 1, and
1000  out of 3400 in Pass 2.  In total, 1600 out of 8000, or
20%, of the total source is machine dependent (12%  in  Pass
1,  30%  in  Pass  2).  These percentages can be expected to
rise slightly as the compiler is tuned.  The  percentage  of
machine-dependent code for the IBM is 22%, for the Honeywell
25%.  If the assembler format and structure  were  the  same
for  all  these  machines, perhaps another 5-10% of the code
would become machine independent.

     These figures are  sufficiently  misleading  as  to  be
almost  meaningless.  A large fraction of the machine depen-
dent code can be  converted  in  a  straightforward,  almost
mechanical  way.  On the other hand, a certain amount of the
code requres hard intellectual effort to convert, since  the
algorithms  embodied  in this part of the code are typically
complicated and machine dependent.

     To summarize, however, if you need a C compiler written
for  a  machine with a reasonable architecture, the compiler
is already three quarters finished!

_O_v_e_r_v_i_e_w

     This paper discusses the structure and organization  of
the  portable  compiler.  The intent is to give the big pic-
ture, rather than discussing the  details  of  a  particular
machine  implementation.   After a brief overview and a dis-
cussion of the source file structure,  the  paper  describes
the major data structures, and then delves more closely into
the two passes.  Some of the theoretical work on  which  the
compiler  is  based, and its application to the compiler, is
discussed elsewhere.  johnson portable theory  practice  One


                       August 3, 1987


                           - 3 -


of  the major design issues in any C compiler, the design of
the calling sequence and stack frame, is the  subject  of  a
separate memorandum.  johnson lesk ritchie calling sequence

     The compiler consists of two passes, _p_a_s_s_1  and  _p_a_s_s_2,
that together turn C source code into assembler code for the
target machine.  The two passes are preceded by a preproces-
sor,  that  handles the #_d_e_f_i_n_e and #_i_n_c_l_u_d_e statements, and
related features (e.g.,  #_i_f_d_e_f,  etc.).   It  is  a  nearly
machine  independent  program,  and will not be further dis-
cussed here.

     The output of the preprocessor is a text file  that  is
read as the standard input of the first pass.  This produces
as standard output another text file that becomes the  stan-
dard input of the second pass.  The second pass produces, as
standard output, the desired assembler language source code.
The preprocessor and the two passes all write error messages
on the standard error file.  Thus the compiler itself  makes
few  demands  on  the  I/O  library  support,  aiding in the
bootstrapping process.

     Although the compiler is divided into two passes,  this
represents historical accident more than deep necessity.  In
fact, the compiler can optionally be  loaded  so  that  both
passes  operate  in  the  same  program.   This ``one pass''
operation eliminates the overhead of reading and writing the
intermediate file, so the compiler operates about 30% faster
in this mode.  It also occupies about 30%  more  space  than
the larger of the two component passes.

     Because the compiler is fundamentally structured as two
passes,  even  when  loaded  as one, this document primarily
describes the two pass version.

     The first pass does the lexical analysis, parsing,  and
symbol  table  maintenance.   It also constructs parse trees
for expressions, and keeps track of the types of  the  nodes
in  these  trees.  Additional code is devoted to initializa-
tion.  Machine dependent portions of the first pass serve to
generate  subroutine prologs and epilogs, code for switches,
and code for branches, label definitions,  alignment  opera-
tions, changes of location counter, etc.

     The intermediate file is a  text  file  organized  into
lines.   Lines beginning with a right parenthesis are copied
by the second pass directly to its  output  file,  with  the
parenthesis  stripped  off.   Thus, when the first pass pro-
duces assembly code, such as subroutine prologs, etc.,  each
line  is  prefaced with a right parenthesis; the second pass
passes these lines to through to the assembler.

     The major job done by the second pass is generation  of
code  for  expressions.  The expression parse trees produced


                       August 3, 1987


                           - 4 -


in the first pass are written onto the intermediate file  in
Polish  Prefix form: first, there is a line beginning with a
period, followed by the source file line number and name  on
which the expression appeared (for debugging purposes).  The
successive lines represent the nodes of the p