A Tour Through the Portable C Compiler

S. C. Johnson
AT&T Bell Laboratories

Donn Seeley
Department of Computer Science
University of Utah

_A_B_S_T_R_A_C_T

Since its introduction, the Portable C Compiler has become the standard UNIX C compiler for many machines. Three quarters or more of the code in the compiler is machine independent, and much of the rest can be generated easily using knowledge of the target architecture. This paper describes the structure and organization of the compiler and tries to further simplify the job of the compiler porter.

This document originally appeared with the Seventh Edition of UNIX, and has been revised and extended for publication with the Fourth Berkeley Software Distribution. The new material covers changes which have been made in the compiler since the Seventh Edition, and includes some discussion of secondary topics which were thought to be of interest in future ports of the compiler.

Revised April, 1986

_I_n_t_r_o_d_u_c_t_i_o_n

A C compiler has been implemented that has proved to be quite portable, serving as the basis for C compilers on roughly a dozen machines, including the DEC VAX, Honeywell 6000, IBM 370, and Interdata 8/32. The compiler is highly compatible with the C language standard. [ Kernighan Ritchie Prentice 1978 ]

September 27, 1987

Among the goals of this compiler are portability, high reliability, and the use of state-of-the-art techniques and tools wherever practical. Although the efficiency of the compiling process is not a primary goal, the compiler is efficient enough, and produces good enough code, to serve as a production compiler. The language implemented is highly compatible with the current PDP-11 version of C. Moreover, roughly 75% of the compiler, including nearly all the syntactic and semantic routines, is machine independent. The compiler also serves as the major portion of the program _l_i_n_t, described elsewhere.
[ Johnson Lint Program Checker ]

A number of earlier attempts to make portable compilers are worth noting. While on CO-OP assignment to Bell Labs in 1973, Alan Snyder wrote a portable C compiler which was the basis of his Master's Thesis at M.I.T. [ Snyder Portable C Compiler ] This compiler was very slow and complicated, and contained a number of rather serious implementation difficulties; nevertheless, a number of Snyder's ideas appear in this work.

Most earlier portable compilers, including Snyder's, have proceeded by defining an intermediate language, perhaps based on three-address code or code for a stack machine, and writing a machine independent program to translate from the source code to this intermediate code. The intermediate code is then read by a second pass, and interpreted or compiled. This approach is elegant, and has a number of advantages, especially if the target machine is far removed from the host. It suffers from some disadvantages as well. Some constructions, like initialization and subroutine prologs, are difficult or expensive to express in a machine independent way that still allows them to be easily adapted to the target assemblers. Most of these approaches require a symbol table to be constructed in the second (machine dependent) pass, and/or require powerful target assemblers. Also, many conversion operators may be generated that have no effect on a given machine, but may be needed on others (for example, pointer to pointer conversions usually do nothing in C, but must be generated because there are some machines where they are significant).

For these reasons, the first pass of the portable compiler is not entirely machine independent. It contains some machine dependent features, such as initialization, subroutine prolog and epilog, certain storage allocation functions, code for the _s_w_i_t_c_h statement, and code to throw out unneeded conversion operators.
As a crude measure of the degree of portability actually achieved, the Interdata 8/32 C compiler has roughly 600 machine dependent lines of source out of 4600 in Pass 1, and 1000 out of 3400 in Pass 2. In total, 1600 out of 8000, or 20%, of the total source is machine dependent (12% in Pass 1, 30% in Pass 2). These percentages can be expected to rise slightly as the compiler is tuned. The percentage of machine dependent code for the IBM is 22%, for the Honeywell 25%. If the assembler format and structure were the same for all these machines, perhaps another 5-10% of the code would become machine independent.

These figures are sufficiently misleading as to be almost meaningless. A large fraction of the machine dependent code can be converted in a straightforward, almost mechanical way. On the other hand, a certain amount of the code requires hard intellectual effort to convert, since the algorithms embodied in this part of the code are typically complicated and machine dependent. To summarize, however, if you need a C compiler written for a machine with a reasonable architecture, the compiler is already three quarters finished!

_O_v_e_r_v_i_e_w

This paper discusses the structure and organization of the portable compiler. The intent is to give the big picture, rather than discussing the details of a particular machine implementation. After a brief overview and a discussion of the source file structure, the paper describes the major data structures, and then delves more closely into the two passes. Some of the theoretical work on which the compiler is based, and its application to the compiler, is discussed elsewhere. [ johnson portable theory practice ] One of the major design issues in any C compiler, the design of the calling sequence and stack frame, is the subject of a separate memorandum.
[ johnson lesk ritchie calling sequence ]

The compiler consists of two passes, _p_a_s_s_1 and _p_a_s_s_2, that together turn C source code into assembler code for the target machine. The two passes are preceded by a preprocessor, that handles the #_d_e_f_i_n_e and #_i_n_c_l_u_d_e statements, and related features (e.g., #_i_f_d_e_f, etc.). The two passes may optionally be followed by a machine dependent code improver.

The output of the preprocessor is a text file that is read as the standard input of the first pass. This produces as standard output another text file that becomes the standard input of the second pass. The second pass produces, as standard output, the desired assembler language source code. The code improver, if used, converts the assembler code to more effective code, and the result is passed to the assembler. The preprocessor and the two passes all write error messages on the standard error file. Thus the compiler itself makes few demands on the I/O library support, aiding in the bootstrapping process.

The division of the compiler into two passes is somewhat artificial. The compiler can optionally be loaded so that both passes operate in the same program. This ``one pass'' operation eliminates the overhead of reading and writing the intermediate file, so the compiler operates about 30% faster in this mode. It also occupies about 30% more space than the larger of the two component passes. This ``one pass'' compiler is the standard version on machines with large address spaces, such as the VAX. Because the compiler is fundamentally structured as two passes, even when loaded as one, this document primarily describes the two pass version.

The first pass does the lexical analysis, parsing, and symbol table maintenance. It also constructs parse trees for expressions, and keeps track of the types of the nodes in these trees. Additional code is devoted to initialization.
Machine dependent portions of the first pass serve to generate subroutine prologs and epilogs, code for switches, and code for branches, label definitions, alignment operations, changes of location counter, etc.

The intermediate file is a text file organized into lines. Lines beginning with a right parenthesis are copied by the second pass directly to its output file, with the parenthesis stripped off. Thus, when the first pass produces assembly code, such as subroutine prologs, etc., each line is prefaced with a right parenthesis; the second pass passes these lines through to the assembler.

The major job done by the second pass is generation of code for expressions. The expression parse trees produced in the first pass are written onto the intermediate file in Polish prefix form: first, there is a line beginning with a period, followed by the source file line number and name on which the expression appeared (for debugging purposes). The successive lines represent the nodes of the parse tree, one node per line. Each line contains the node number, type, and any values (e.g., values of constants) that may appear in the node. Lines representing nodes with descendants are immediately followed by the left subtree of descendants, then the right. Since the number of descendants of any node is completely determined by the node number, there is no need to mark the end of the tree.

There are only two other line types in the intermediate file. Lines beginning with a left square bracket (`[') represent the beginning of blocks (delimited by { ... } in the C source); lines beginning with right square brackets (`]') represent the end of blocks. The remainder of these lines tell how much stack space, and how many register variables, are currently in use.

Thus, the second pass reads the intermediate files, copies the `)' lines, makes note of the information in the `[' and `]' lines, and devotes most of its effort to the `.'
lines and their associated expression trees, turning them into assembly code to evaluate the expressions.

In the one pass version of the compiler, the expression trees contain information useful to both logical passes. Instead of writing the trees onto an intermediate file, each tree is transformed in place into an acceptable form for the code generator. The code generator then writes the result of compiling this tree onto the standard output. Instead of `[' and `]' lines in the intermediate file, the information is passed directly to the second pass routines. Assembly code produced by the first pass is simply written out, without the need for `)' at the head of each line.

_T_h_e _S_o_u_r_c_e _F_i_l_e_s

The compiler source consists of 25 source files. Several header files contain information which is needed across various source modules. _M_a_n_i_f_e_s_t._h has declarations for node types, type manipulation macros and other macros, and some global data definitions. _M_a_c_d_e_f_s._h has machine dependent definitions, such as the size and alignment of the various data representations. _C_o_n_f_i_g._h defines symbols which control the configuration of the compiler, including such things as the sizes of various tables and whether the compiler is ``one pass''. The compiler conditionally includes another file, _o_n_e_p_a_s_s._h, which contains definitions which are particular to a ``one pass'' compiler. _N_d_u._h defines the basic tree building structure which is used throughout the compiler to construct expression trees. _M_a_n_i_f_e_s_t._h includes a file of opcode and type definitions named _p_c_c_l_o_c_a_l._h; this file is automatically generated from a header file specific to the C compiler named _l_o_c_a_l_d_e_f_s._h and a public header file /_u_s_r/_i_n_c_l_u_d_e/_p_c_c._h. Another file, _p_c_c_t_o_k_e_n_s, is generated in a similar way and contains token definitions for the compiler's Yacc [ Johnson Yacc Compiler cstr ] grammar.
Two machine independent header files, _p_a_s_s_1._h and _p_a_s_s_2._h, contain the data structure and manifest definitions for the first and second passes, respectively. In the second pass, a machine dependent header file, _m_a_c_2_d_e_f_s._h, contains declarations of register names, etc.

_C_o_m_m_o_n._c contains machine independent routines used in both passes. These include routines for allocating and freeing trees, walking over trees, printing debugging information, and printing error messages. This file can be compiled in two flavors, one for pass 1 and one for pass 2, depending on what conditional compilation symbol is used.

Entire sections of this document are devoted to the detailed structure of the passes. For the moment, we just give a brief description of the files.

The first pass is obtained by compiling and loading _c_g_r_a_m._y, _c_o_d_e._c, _c_o_m_m_o_n._c, _l_o_c_a_l._c, _o_p_t_i_m._c, _p_f_t_n._c, _s_c_a_n._c, _s_t_a_b._c, _t_r_e_e_s._c and _x_d_e_f_s._c. _S_c_a_n._c is the lexical analyzer, which provides tokens to the bottom-up parser which is defined by the Yacc grammar _c_g_r_a_m._y. _X_d_e_f_s._c is a short file of external definitions. _P_f_t_n._c maintains the symbol table, and does initialization. _T_r_e_e_s._c builds the expression trees, and computes the node types. _O_p_t_i_m._c does some machine independent optimizations on the expression trees. _C_o_m_m_o_n._c contains service routines common to the two passes of the compiler. All the above files are machine independent. The files _l_o_c_a_l._c and _c_o_d_e._c contain machine dependent code for generating subroutine prologs, switch code, and the like. _S_t_a_b._c contains machine dependent code for producing external symbol table information which can drive a symbolic debugger.

The second pass is produced by compiling and loading _a_l_l_o._c, _c_o_m_m_o_n._c, _l_o_c_a_l_2._c, _m_a_t_c_h._c, _o_r_d_e_r._c, _r_e_a_d_e_r._c and _t_a_b_l_e._c.
_R_e_a_d_e_r._c reads the intermediate file, and controls the major logic of the code generation. _A_l_l_o._c keeps track of busy and free registers. _M_a_t_c_h._c controls the matching of code templates to subtrees of the expression tree to be compiled. _C_o_m_m_o_n._c defines certain service routines, as in the first pass. The above files are machine independent. _O_r_d_e_r._c controls the machine dependent details of the code generation strategy. _L_o_c_a_l_2._c has many small machine dependent routines, and tables of opcodes, register types, etc. _T_a_b_l_e._c has the code template tables, which are also clearly machine dependent.

_D_a_t_a _S_t_r_u_c_t_u_r_e _C_o_n_s_i_d_e_r_a_t_i_o_n_s

This section discusses the node numbers, type words, and expression trees, used throughout both passes of the compiler.

The file _m_a_n_i_f_e_s_t._h defines those symbols used throughout both passes. The intent is to use the same symbol name (e.g., MINUS) for the given operator throughout the lexical analysis, parsing, tree building, and code generation phases. _M_a_n_i_f_e_s_t._h obtains some of its definitions from two other header files, _l_o_c_a_l_d_e_f_s._h and _p_c_c._h. _L_o_c_a_l_d_e_f_s._h contains definitions for operator symbols which are specific to the C compiler. _P_c_c._h contains definitions for operators and types which may be used by other compilers to communicate with a portable code generator based on pass 2; this code generator will be described later.

A token like MINUS may be seen in the lexical analyzer before it is known whether it is a unary or binary operator; clearly, it is necessary to know this by the time the parse tree is constructed. Thus, an operator (really a macro) called UNARY is provided, so that MINUS and UNARY MINUS are both distinct node numbers. Similarly, many binary operators exist in an assignment form (for example, -=), and the operator ASG may be applied to such node names to generate new ones, e.g.
ASG MINUS.

It is frequently desirable to know if a node represents a leaf (no descendants), a unary operator (one descendant) or a binary operator (two descendants). The macro _o_p_t_y_p_e(_o) returns one of the manifest constants LTYPE, UTYPE, or BITYPE, respectively, depending on the node number _o. Similarly, _a_s_g_o_p(_o) returns true if _o is an assignment operator number (=, +=, etc.), and _l_o_g_o_p(_o) returns true if _o is a relational or logical (&&, ||, or !) operator.

C has a rich typing structure, with a potentially infinite number of types. To begin with, there are the basic types: CHAR, SHORT, INT, LONG, the unsigned versions known as UCHAR, USHORT, UNSIGNED, ULONG, and FLOAT, DOUBLE, and finally STRTY (a structure), UNIONTY, and ENUMTY. Then, there are three operators that can be applied to types to make others: if _t is a type, we may potentially have types _p_o_i_n_t_e_r _t_o _t, _f_u_n_c_t_i_o_n _r_e_t_u_r_n_i_n_g _t, and _a_r_r_a_y _o_f _t'_s generated from _t. Thus, an arbitrary type in C consists of a basic type, and zero or more of these operators.

In the compiler, a type is represented by an unsigned integer; the rightmost four bits hold the basic type, and the remaining bits are divided into two-bit fields, containing 0 (no operator), or one of the three operators described above. The modifiers are read right to left in the word, starting with the two-bit field adjacent to the basic type, until a field with 0 in it is reached. The macros PTR, FTN, and ARY represent the _p_o_i_n_t_e_r _t_o, _f_u_n_c_t_i_o_n _r_e_t_u_r_n_i_n_g, and _a_r_r_a_y _o_f operators. The macro values are shifted so that they align with the first two-bit field; thus PTR+INT represents the type for an integer pointer, and

	ARY + (PTR<<2) + (FTN<<4) + DOUBLE

represents the type of an array of pointers to functions returning doubles.

The type words are ordinarily manipulated by macros. If _t is a type word, _B_T_Y_P_E(_t) gives the basic type.
_I_S_P_T_R(_t), _I_S_A_R_Y(_t), and _I_S_F_T_N(_t) ask if an object of this type is a pointer, array, or a function, respectively. _M_O_D_T_Y_P_E(_t,_b) sets the basic type of _t to _b. _D_E_C_R_E_F(_t) gives the type resulting from removing the first operator from _t. Thus, if _t is a pointer to _t', a function returning _t', or an array of _t', then _D_E_C_R_E_F(_t) would equal _t'. _I_N_C_R_E_F(_t) gives the type representing a pointer to _t. Finally, there are operators for dealing with the unsigned types. _I_S_U_N_S_I_G_N_E_D(_t) returns true if _t is one of the four basic unsigned types; in this case, _D_E_U_N_S_I_G_N(_t) gives the associated `signed' type. Similarly, _U_N_S_I_G_N_A_B_L_E(_t) returns true if _t is one of the four basic types that could become unsigned, and _E_N_U_N_S_I_G_N(_t) returns the unsigned analogue of _t in this case.

The other important global data structure is that of expression trees. The actual shapes of the nodes are given in _n_d_u._h. The information stored for each pass is not quite the same; in the first pass, nodes contain dimension and size information, while in the second pass nodes contain register allocation information. Nevertheless, all nodes contain fields called _o_p, containing the node number, and _t_y_p_e, containing the type word. A function called _t_a_l_l_o_c() returns a pointer to a new tree node. To free a node, its _o_p field need merely be set to FREE. The other fields in the node will remain intact at least until the next allocation.

Nodes representing binary operators contain fields, _l_e_f_t and _r_i_g_h_t, that contain pointers to the left and right descendants. Unary operator nodes have the _l_e_f_t field, and a value field called _r_v_a_l. Leaf nodes, with no descendants, have two value fields: _l_v_a_l and _r_v_a_l.

At appropriate times, the function _t_c_h_e_c_k() can be called, to check that there are no busy nodes remaining.
This is used as a compiler consistency check. The function _t_c_o_p_y(_p) takes a pointer _p that points to an expression tree, and returns a pointer to a disjoint copy of the tree. The function _w_a_l_k_f(_p,_f) performs a postorder walk of the tree pointed to by _p, and applies the function _f to each node. The function _f_w_a_l_k(_p,_f,_d) does a preorder walk of the tree pointed to by _p. At each node, it calls a function _f, passing to it the node pointer, a value passed down from its ancestor, and two pointers to values to be passed down to the left and right descendants (if any). The value _d is the value passed down to the root. _F_w_a_l_k is used for a number of tree labeling and debugging activities.

The other major data structure, the symbol table, exists only in pass one, and will be discussed later.

_P_a_s_s _O_n_e

The first pass does lexical analysis, parsing, symbol table maintenance, tree building, optimization, and a number of machine dependent things. This pass is largely machine independent, and the machine independent sections can be pretty successfully ignored. Thus, they will be only sketched here.

_L_e_x_i_c_a_l _A_n_a_l_y_s_i_s

The lexical analyzer is a conceptually simple routine that reads the input and returns the tokens of the C language as it encounters them: names, constants, operators, and keywords. The conceptual simplicity of this job is confounded a bit by several other simple jobs that unfortunately must go on simultaneously. These include

-	Keeping track of the current filename and line number, and occasionally setting this information as the result of preprocessor control lines.

-	Skipping comments.

-	Properly dealing with octal, decimal, hex, floating point, and character constants, as well as character strings.

To achieve speed, the program maintains several tables that are indexed into by character value, to tell the lexical analyzer what to do next.
To achieve portability, these tables must be initialized each time the compiler is run, in order that the table entries reflect the local character set values.

_P_a_r_s_i_n_g

As mentioned above, the parser is generated by Yacc from the grammar _c_g_r_a_m._y. The grammar is relatively readable, but contains some unusual features that are worth comment.

Perhaps the strangest feature of the grammar is the treatment of declarations. The problem is to keep track of the basic type and the storage class while interpreting the various stars, brackets, and parentheses that may surround a given name. The entire declaration mechanism must be recursive, since declarations may appear within declarations of structures and unions, or even within a _s_i_z_e_o_f construction inside a dimension in another declaration!

There are some difficulties in using a bottom-up parser, such as produced by Yacc, to handle constructions where a lot of left context information must be kept around. The problem is that the original PDP-11 compiler is top-down in implementation, and some of the semantics of C reflect this. In a top-down parser, the input rules are restricted somewhat, but one can naturally associate temporary storage with a rule at a very early stage in the recognition of that rule. In a bottom-up parser, there is more freedom in the specification of rules, but it is more difficult to know what rule is being matched until the entire rule is seen. The parser described by _c_g_r_a_m._y makes effective use of the bottom-up parsing mechanism in some places (notably the treatment of expressions), but struggles against the restrictions in others. The usual result is that it is necessary to run a stack of values ``on the side'', independent of the Yacc value stack, in order to be able to store and access information deep within inner constructions, where the relationship of the rules being recognized to the total picture is not yet clear.
In the case of declarations, the attribute information (type, etc.) for a declaration is carefully kept immediately to the left of the declarator (that part of the declaration involving the name). In this way, when it is time to declare the name, the name and the type information can be quickly brought together. The ``$0'' mechanism of Yacc is used to accomplish this. The result is not pretty, but it works. The storage class information changes more slowly, so it is kept in an external variable, and stacked if necessary. Some of the grammar could be considerably cleaned up by using some more recent features of Yacc, notably actions within rules and the ability to return multiple values for actions.

A stack is also used to keep track of the current location to be branched to when a _b_r_e_a_k or _c_o_n_t_i_n_u_e statement is processed.

This use of external stacks dates from the time when Yacc did not permit values to be structures. Some, or most, of this use of external stacks could be eliminated by redoing the grammar to use the mechanisms now provided. There are some areas, however, particularly the processing of structure, union, and enumeration declarations, function prologs, and switch statement processing, where having all the affected data together in an array speeds later processing; in this case, use of external storage seems essential.

The _c_g_r_a_m._y file also contains some small functions used as utility functions in the parser. These include routines for saving case values and labels in processing switches, and stacking and popping values on the external stack described above.

_S_t_o_r_a_g_e _C_l_a_s_s_e_s

C has a finite, but fairly extensive, number of storage classes available. One of the compiler design decisions was to process the storage class information totally in the first pass; by the second pass, this information must have been totally dealt with.
This means that all of the storage allocation must take place in the first pass, so that references to automatics and parameters can be turned into references to cells lying a certain number of bytes offset from certain machine registers. Much of this transformation is machine dependent, and strongly depends on the storage class.

The classes include EXTERN (for externally declared, but not defined variables), EXTDEF (for external definitions), and similar distinctions for USTATIC and STATIC, UFORTRAN and FORTRAN (for fortran functions) and ULABEL and LABEL. The storage classes REGISTER and AUTO are obvious, as are STNAME, UNAME, and ENAME (for structure, union, and enumeration tags), and the associated MOS, MOU, and MOE (for the members). TYPEDEF is treated as a storage class as well. There are two special storage classes: PARAM and SNULL. SNULL is used to distinguish the case where no explicit storage class has been given; before an entry is made in the symbol table the true storage class is discovered. Similarly, PARAM is used for the temporary entry in the symbol table made before the declaration of function parameters is completed.

The most complexity in the storage class process comes from bit fields. A separate storage class is kept for each width of bit field; a _k-bit field has storage class _k plus FIELD. This enables the size to be quickly recovered from the storage class.

_S_y_m_b_o_l _T_a_b_l_e _M_a_i_n_t_e_n_a_n_c_e

The symbol table routines do far more than simply enter names into the symbol table; considerable semantic processing and checking is done as well. For example, if a new declaration comes in, it must be checked to see if there is a previous declaration of the same symbol. If there is, there are many cases. The declarations may agree and be compatible (for example, an extern declaration can appear twice) in which case the new declaration is ignored.
The new declaration may add information (such as an explicit array dimension) to an already present declaration. The new declaration may be different, but still correct (for example, an extern declaration of something may be entered, and then later the definition may be seen). The new declaration may be incompatible, but appear in an inner block; in this case, the old declaration is carefully hidden away, and the new one comes into force until the block is left. Finally, the declarations may be incompatible, and an error message must be produced.

A number of other factors make for additional complexity. The type declared by the user is not always the type entered into the symbol table (for example, if a formal parameter to a function is declared to be an array, C requires that this be changed into a pointer before entry in the symbol table). Moreover, there are various kinds of illegal types that may be declared which are difficult to check for syntactically (for example, a function returning an array). Finally, there is a strange feature in C that requires structure tag names and member names for structures and unions to be taken from a different logical symbol table than ordinary identifiers. Keeping track of which kind of name is involved is a bit of a struggle (consider typedef names used within structure declarations, for example).

The symbol table handling routines have been rewritten a number of times to extend features, improve performance, and fix bugs. They address the above problems with reasonable effectiveness but a singular lack of grace.

When a name is read in the input, it is hashed, and the routine _l_o_o_k_u_p is called, together with a flag which tells which symbol table should be searched (actually, both symbol tables are stored in one, and a flag is used to distinguish individual entries).
If the name is found, _l_o_o_k_u_p returns the index to the entry found; otherwise, it makes a new entry, marks it UNDEF (undefined), and returns the index of the new entry. This index is stored in the _r_v_a_l field of a NAME node.

When a declaration is being parsed, this NAME node is made part of a tree with UNARY MUL nodes for each *, LB nodes for each array descriptor (the right descendant has the dimension), and UNARY CALL nodes for each function descriptor. This tree is passed to the routine _t_y_m_e_r_g_e, along with the attribute type of the whole declaration; this routine collapses the tree to a single node, by calling _t_y_r_e_d_u_c_e, and then modifies the type to reflect the overall type of the declaration.

Dimension and size information is stored in a table called _d_i_m_t_a_b. To properly describe a type in C, one needs not just the type information but also size information (for structures and enumerations) and dimension information (for arrays). Sizes and offsets are dealt with in the compiler by giving the associated indices into _d_i_m_t_a_b. _T_y_m_e_r_g_e and _t_y_r_e_d_u_c_e call _d_s_t_a_s_h to put the discovered dimensions away into the _d_i_m_t_a_b array. _T_y_m_e_r_g_e returns a pointer to a single node that contains the symbol table index in its _r_v_a_l field, and the size and dimension indices in fields _c_s_i_z and _c_d_i_m, respectively. This information is properly considered part of the type in the first pass, and is carried around at all times.

To enter an element into the symbol table, the routine _d_e_f_i_d is called; it is handed a storage class, and a pointer to the node produced by _t_y_m_e_r_g_e. _D_e_f_i_d calls _f_i_x_t_y_p_e, which adjusts and checks the given type depending on the storage class, and converts null types appropriately. It then calls _f_i_x_c_l_a_s_s, which does a similar job for the storage class; it is here, for example, that register declarations are either allowed or changed to auto.
The new declaration is now compared against an older one, if present, and several pages of validity checks performed. If the definitions are compatible, with possibly some added information, the processing is straightforward. If the definitions differ, the block levels of the current and the old declaration are compared. The current block level is kept in _b_l_e_v_e_l, an external variable; the old declaration level is kept in the symbol table. Block level 0 is for external declarations, 1 is for arguments to functions, and 2 and above are blocks within a function. If the current block level is the same as the old declaration, an error results. If the current block level is higher, the new declaration overrides the old. This is done by marking the old symbol table entry ``hidden'', and making a new entry, marked ``hiding''. _L_o_o_k_u_p will skip over hidden entries. When a block is left, the symbol table is searched, and any entries defined in that block are destroyed; if they hid other entries, the old entries are ``unhidden''.

This nice block structure is warped a bit because labels do not follow the block structure rules (one can do a _g_o_t_o into a block, for example); default definitions of functions in inner blocks also persist clear out to the outermost scope. This implies that cleaning up the symbol table after block exit is more subtle than it might first seem.

For successful new definitions, _d_e_f_i_d also initializes a ``general purpose'' field, _o_f_f_s_e_t, in the symbol table. It contains the stack offset for automatics and parameters, the register number for register variables, the bit offset into the structure for structure members, and the internal label number for static variables and labels. The offset field is set by _f_a_l_l_o_c for bit fields, and _d_c_l_s_t_r_u_c_t for structures and unions.
The symbol table entry itself thus contains the name, type word, size and dimension offsets, offset value, and declaration block level. It also has a field of flags, describing what symbol table the name is in, and whether the entry is hidden, or hides another. Finally, a field gives the line number of the last use, or of the definition, of the name. This is used mainly for diagnostics, but is useful to _l_i_n_t as well. In some special cases, there is more than the above amount of information kept for the use of the compiler. This is especially true with structures; for use in initialization, structure declarations must have access to a list of the members of the structure. This list is also kept in _d_i_m_t_a_b. Because a structure can be mentioned long before the members are known, it is necessary to have another level of indirection in the table. The two words following the _c_s_i_z entry in _d_i_m_t_a_b are used to hold the alignment of the structure, and the index in _d_i_m_t_a_b of the list of members. This list contains the symbol table indices for the structure members, terminated by a -1.

_T_r_e_e _B_u_i_l_d_i_n_g

The portable compiler transforms expressions into expression trees. As the parser recognizes each rule making up an expression, it calls _b_u_i_l_d_t_r_e_e which is given an operator number, and pointers to the left and right descendants. _B_u_i_l_d_t_r_e_e first examines the left and right descendants, and, if they are both constants, and the operator is appropriate, simply does the constant computation at compile time, and returns the result as a constant. Otherwise, _b_u_i_l_d_t_r_e_e allocates a node for the head of the tree, attaches the descendants to it, and ensures that conversion operators are generated if needed, and that the type of the new node is consistent with the types of the operands.
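The constant-folding shortcut in _b_u_i_l_d_t_r_e_e can be sketched as follows; the node layout and the restriction to a single operator are simplifications for illustration, not the compiler's actual code.

```c
#include <stdlib.h>

enum op { ICON, PLUS };               /* a tiny subset of the operators */

struct node {
    enum op op;
    long lval;                        /* constant value when op == ICON */
    struct node *left, *right;
};

struct node *icon(long v)             /* make an integer constant node */
{
    struct node *p = malloc(sizeof *p);
    p->op = ICON; p->lval = v; p->left = p->right = NULL;
    return p;
}

/* Sketch of buildtree: if both descendants are constants and the
 * operator permits, do the arithmetic now and return a constant. */
struct node *buildtree(enum op op, struct node *l, struct node *r)
{
    if (op == PLUS && l->op == ICON && r->op == ICON) {
        l->lval += r->lval;           /* fold at compile time */
        free(r);
        return l;
    }
    struct node *p = malloc(sizeof *p);
    p->op = op; p->lval = 0; p->left = l; p->right = r;
    return p;
}
```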
There is also a considerable amount of semantic complexity here; many combinations of types are illegal, and the portable compiler makes a strong effort to check the legality of expression types completely. This is done both for _l_i_n_t purposes, and to prevent such semantic errors from being passed through to the code generator. The heart of _b_u_i_l_d_t_r_e_e is a large table, accessed by the routine _o_p_a_c_t. This routine maps the types of the left and right operands into a rather smaller set of descriptors, and then accesses a table (actually encoded in a switch statement) which for each operator and pair of types causes an action to be returned. The actions are logical or's of a number of separate actions, which may be carried out by _b_u_i_l_d_t_r_e_e. These component actions may include checking the left side to ensure that it is an lvalue (can be stored into), applying a type conversion to the left or right operand, setting the type of the new node to the type of the left or right operand, calling various routines to balance the types of the left and right operands, and suppressing the ordinary conversion of arrays and function operands to pointers. An important operation is OTHER, which causes some special code to be invoked in _b_u_i_l_d_t_r_e_e, to handle issues which are unique to a particular operator. Examples of this are structure and union reference (actually handled by the routine _s_t_r_e_f), the building of NAME, ICON, STRING and FCON (floating point constant) nodes, unary * and &, structure assignment, and calls. In the case of unary * and &, _b_u_i_l_d_t_r_e_e will cancel a * applied to a tree, the top node of which is &, and conversely. Another special operation is PUN; this causes the compiler to check for type mismatches, such as intermixing pointers and integers. The treatment of conversion operators is a rather strange area of the compiler (and of C!).
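The flavor of _o_p_a_c_t's action encoding can be suggested by a sketch. The bit names and the mapping below are invented for illustration; the real table covers every operator and a full set of type descriptors.

```c
/* Illustrative action bits, or'ed together as in opact's encoding. */
#define PUN   01      /* check for type punning                  */
#define TYPL  02      /* result takes the type of the left side  */
#define TYPR  04      /* result takes the type of the right side */
#define NCVTR 010     /* do not convert the right operand        */

enum ty { TINT, TPTR };   /* a two-descriptor world */

/* A stand-in for the opact table, here only for a + operator. */
int opact_plus(enum ty l, enum ty r)
{
    if (l == TPTR && r == TINT) return TYPL | NCVTR;
    if (l == TINT && r == TPTR) return TYPR;
    if (l == TPTR && r == TPTR) return PUN;   /* pointer + pointer */
    return TYPL;
}
```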
The introduction of type casts only confounded this situation. Most of the conversion operators are generated by calls to _t_y_m_a_t_c_h and _p_t_m_a_t_c_h, both of which are given a tree, and asked to make the operands agree in type. _P_t_m_a_t_c_h treats the case where one of the operands is a pointer; _t_y_m_a_t_c_h treats all other cases. Where these routines have decided on the proper type for an operand, they call _m_a_k_e_t_y, which is handed a tree, and a type word, dimension offset, and size offset. If necessary, it inserts a conversion operation to make the types correct. Conversion operations are never inserted on the left side of assignment operators, however. There are two conversion operators used: PCONV, if the conversion is to a non-basic type (usually a pointer), and SCONV, if the conversion is to a basic type (scalar). To allow for maximum flexibility, every node produced by _b_u_i_l_d_t_r_e_e is given to a machine dependent routine, _c_l_o_c_a_l, immediately after it is produced. This is to allow more or less immediate rewriting of those nodes which must be adapted for the local machine. The conversion operations are given to _c_l_o_c_a_l as well; on most machines, many of these conversions do nothing, and should be thrown away (being careful to retain the type). If this operation is done too early, however, later calls to _b_u_i_l_d_t_r_e_e may get confused about the correct type of the subtrees; thus _c_l_o_c_a_l is given the conversion operations only after the entire tree is built. This topic will be dealt with in more detail later.

_I_n_i_t_i_a_l_i_z_a_t_i_o_n

Initialization is one of the messier areas in the portable compiler. The only consolation is that most of the mess takes place in the machine independent part, where it may be safely ignored by the implementor of the compiler for a particular machine.
The basic problem is that the semantics of initialization really calls for a co-routine structure; one collection of programs reading constants from the input stream, while another, independent set of programs places these constants into the appropriate spots in memory. The dramatic differences in the local assemblers also come to the fore here. The parsing problems are dealt with by keeping a rather extensive stack containing the current state of the initialization; the assembler problems are dealt with by having a fair number of machine dependent routines. The stack contains the symbol table number, type, dimension index, and size index for the current identifier being initialized. Another entry has the offset, in bits, of the beginning of the current identifier. Another entry keeps track of how many elements have been seen, if the current identifier is an array. Still another entry keeps track of the current member of a structure being initialized. Finally, there is an entry containing flags which keep track of the current state of the initialization process (e.g., tell if a `}' has been seen for the current identifier). When an initialization begins, the routine _b_e_g_i_n_i_t is called; it handles the alignment restrictions, if any, and calls _i_n_s_t_k to create the stack entry. This is done by first making an entry on the top of the stack for the item being initialized. If the top entry is an array, another entry is made on the stack for the first element. If the top entry is a structure, another entry is made on the stack for the first member of the structure. This continues until the top element of the stack is a scalar. _I_n_s_t_k then returns, and the parser begins collecting initializers. When a constant is obtained, the routine _d_o_i_n_i_t is called; it examines the stack, and does whatever is necessary to assign the current constant to the scalar on the top of the stack.
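The descent performed by _i_n_s_t_k can be sketched with a toy type representation; the structure here (a `kind' plus a pointer to the first element or member) is an invention for illustration, not the compiler's own.

```c
enum kind { SCALAR, ARRAY, STRUCT };

struct ty {
    enum kind k;
    struct ty *inner;     /* element type, or first member */
};

#define NSTACK 16
struct ty *initstack[NSTACK];
int depth;

/* Sketch of instk: stack an entry for the item being initialized, then
 * for its first element or member, until the top entry is a scalar. */
void instk(struct ty *t)
{
    for (;;) {
        initstack[depth++] = t;
        if (t->k == SCALAR)
            return;                   /* ready to accept a constant */
        t = t->inner;
    }
}
```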
_G_o_t_s_c_a_l is then called, which rearranges the stack so that the next scalar to be initialized gets placed on top of the stack. This process continues until the end of the initializers; _e_n_d_i_n_i_t cleans up. If a `{' or `}' is encountered in the string of initializers, it is handled by calling _i_l_b_r_a_c_e or _i_r_b_r_a_c_e, respectively. A central issue is the treatment of the ``holes'' that arise as a result of alignment restrictions or explicit requests for holes in bit fields. There is a global variable, _i_n_o_f_f, which contains the current offset in the initialization (all offsets in the first pass of the compiler are in bits). _D_o_i_n_i_t figures out from the top entry on the stack the expected bit offset of the next identifier; it calls the machine dependent routine _i_n_f_o_r_c_e which, in a machine dependent way, forces the assembler to set aside space if need be so that the next scalar seen will go into the appropriate bit offset position. The scalar itself is passed to one of the machine dependent routines _f_i_n_c_o_d_e (for floating point initialization), _i_n_c_o_d_e (for fields, and other initializations less than an int in size), and _c_i_n_i_t (for all other initializations). The size is passed to all these routines, and it is up to the machine dependent routines to ensure that the initializer occupies exactly the right size. Character strings represent a bit of an exception. If a character string is seen as the initializer for a pointer, the characters making up the string must be put out under a different location counter. When the lexical analyzer sees the quote at the head of a character string, it returns the token STRING, but does not do anything with the contents. The parser calls _g_e_t_s_t_r, which sets up the appropriate location counters and flags, and calls _l_x_s_t_r to read and process the contents of the string.
If the string is being used to initialize a character array, _l_x_s_t_r calls _p_u_t_b_y_t_e, which in effect simulates _d_o_i_n_i_t for each character read. If the string is used to initialize a character pointer, _l_x_s_t_r calls a machine dependent routine, _b_y_c_o_d_e, which stashes away each character. The pointer to this string is then returned, and processed normally by _d_o_i_n_i_t. The null at the end of the string is treated as if it were read explicitly by _l_x_s_t_r.

_S_t_a_t_e_m_e_n_t_s

The first pass addresses four main areas: declarations, expressions, initialization, and statements. The statement processing is relatively simple; most of it is carried out in the parser directly. Most of the logic is concerned with allocating label numbers, defining the labels, and branching appropriately. An external symbol, _r_e_a_c_h_e_d, is 1 if a statement can be reached, 0 otherwise; this is used to do a bit of simple flow analysis as the program is being parsed, and also to avoid generating the subroutine return sequence if the subroutine cannot ``fall through'' the last statement. Conditional branches are handled by generating an expression node, CBRANCH, whose left descendant is the conditional expression and the right descendant is an ICON node containing the internal label number to be branched to. For efficiency, the semantics are that the label is gone to if the condition is _f_a_l_s_e. The switch statement is compiled by collecting the case entries, and an indication as to whether there is a default case; an internal label number is generated for each of these, and remembered in a big array. The expression comprising the value to be switched on is compiled when the switch keyword is encountered, but the expression tree is headed by a special node, FORCE, which tells the code generator to put the expression value into a special distinguished register (this same mechanism is used for processing the return statement).
When the end of the switch block is reached, the array containing the case values is sorted, and checked for duplicate entries (an error); if all is correct, the machine dependent routine _g_e_n_s_w_i_t_c_h is called, with this array of labels and values in increasing order. _G_e_n_s_w_i_t_c_h can assume that the value to be tested is already in the register which is the usual integer return value register.

_O_p_t_i_m_i_z_a_t_i_o_n

There is a machine independent file, _o_p_t_i_m._c, which contains a relatively short optimization routine, _o_p_t_i_m. Actually the word optimization is something of a misnomer; the results are not optimum, only improved, and the routine is in fact not optional; it must be called for proper operation of the compiler. _O_p_t_i_m is called after an expression tree is built, but before the code generator is called. The essential part of its job is to call _c_l_o_c_a_l on the conversion operators. On most machines, the treatment of & is also essential: by this time in the processing, the only node which is a legal descendant of & is NAME. (Possible descendants of * have been eliminated by _b_u_i_l_d_t_r_e_e.) The address of a static name is, almost by definition, a constant, and can be represented by an ICON node on most machines (provided that the loader has enough power). Unfortunately, this is not universally true; on some machines, such as the IBM 370, the issue of addressability rears its ugly head; thus, before turning a NAME node into an ICON node, the machine dependent function _a_n_d_a_b_l_e is called. The optimization attempts of _o_p_t_i_m are quite limited. It is primarily concerned with improving the behavior of the compiler with operations one of whose arguments is a constant. In the simplest case, the constant is placed on the right if the operation is commutative.
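That simplest rearrangement might be sketched as follows, using an invented node layout:

```c
enum op { ICON, NAME, PLUS, MUL };

struct node { enum op op; struct node *left, *right; };

/* Sketch of optim's operand swap: for a commutative operator with a
 * constant left operand, exchange the operands so that the constant
 * ends up on the right. */
void commute(struct node *p)
{
    if ((p->op == PLUS || p->op == MUL) && p->left->op == ICON) {
        struct node *t = p->left;
        p->left = p->right;
        p->right = t;
    }
}
```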
The compiler also makes a limited search for expressions such as ( _x + _a ) + _b where _a and _b are constants, and attempts to combine _a and _b at compile time. A number of special cases are also examined; additions of 0 and multiplications by 1 are removed, although the correct processing of these cases to get the type of the resulting tree correct is decidedly nontrivial. In some cases, the addition or multiplication must be replaced by a conversion operator to keep the types from becoming fouled up. In cases where a relational operation is being done and one operand is a constant, the operands are permuted and the operator altered, if necessary, to put the constant on the right. Finally, multiplications by a power of 2 are changed to shifts.

_M_a_c_h_i_n_e _D_e_p_e_n_d_e_n_t _S_t_u_f_f

A number of the first pass machine dependent routines have been discussed above. In general, the routines are short, and easy to adapt from machine to machine. The two exceptions to this general rule are _c_l_o_c_a_l and the function prolog and epilog generation routines, _b_f_c_o_d_e and _e_f_c_o_d_e. _C_l_o_c_a_l has the job of rewriting, if appropriate and desirable, the nodes constructed by _b_u_i_l_d_t_r_e_e. There are two major areas where this is important: NAME nodes and conversion operations. In the case of NAME nodes, _c_l_o_c_a_l must rewrite the NAME node to reflect the actual physical location of the name in the machine. In effect, the NAME node must be examined, the symbol table entry found (through the _r_v_a_l field of the node), and, based on the storage class of the node, the tree must be rewritten. Automatic variables and parameters are typically rewritten by treating the reference to the variable as a structure reference, off the register which holds the stack or argument pointer; the _s_t_r_e_f routine is set up to be called in this way, and to build the appropriate tree.
In the most general case, the tree consists of a unary * node, whose descendant is a + node, with the stack or argument register as left operand, and a constant offset as right operand. In the case of LABEL and internal static nodes, the _r_v_a_l field is rewritten to be the negative of the internal label number; a negative _r_v_a_l field is taken to be an internal label number. Finally, a name of class REGISTER must be converted into a REG node, and the _r_v_a_l field replaced by the register number. In fact, this part of the _c_l_o_c_a_l routine is nearly machine independent; only for machines with addressability problems (IBM 370 again!) does it have to be noticeably different. The conversion operator treatment is rather tricky. It is necessary to handle the application of conversion operators to constants in _c_l_o_c_a_l, in order that all constant expressions can have their values known at compile time. In extreme cases, this may mean that some simulation of the arithmetic of the target machine might have to be done in a cross-compiler. In the most common case, conversions from pointer to pointer do nothing. For some machines, however, conversion from byte pointer to short or long pointer might require a shift or rotate operation, which would have to be generated here. The extension of the portable compiler to machines where the size of a pointer depends on its type would be straightforward, but has not yet been done. Another machine dependent issue in the first pass is the generation of external ``symbol table'' information. This sort of symbol table is used by programs such as symbolic debuggers to relate object code back to source code. Symbol table routines are provided in the file _s_t_a_b._c, which is included in the machine dependent sources for the first pass. The symbol table routines insert assembly code containing assembly pseudo-ops directly into the instruction stream generated by the compiler.
There are two basic kinds of symbol table operations. The simplest operation is the generation of a source line number; this serves to map an address in an executable image into a line in a source file so that a debugger can find the source code corresponding to the instructions being executed. The routine _p_s_l_i_n_e is called by the scanner to emit source line numbers when a nonempty source line is seen. The other variety of symbol table operation is the generation of type and address information about C symbols. This is done through the _o_u_t_s_t_a_b routine, which is normally called using the FIXDEF macro in the monster _d_e_f_i_d routine in _p_f_t_n._c that enters symbols into the compiler's internal symbol table. Yet another major machine dependent issue involves function prolog and epilog generation. The hard part here is the design of the stack frame and calling sequence; this design issue is discussed elsewhere. [ Johnson Lesk Ritchie calling sequence ] The routine _b_f_c_o_d_e is called with the number of arguments the function is defined with, and an array containing the symbol table indices of the declared parameters. _B_f_c_o_d_e must generate the code to establish the new stack frame, save the return address and previous stack pointer value on the stack, and save whatever registers are to be used for register variables. The stack size and the number of register variables are not known when _b_f_c_o_d_e is called, so these numbers must be referred to by assembler constants, which are defined when they are known (usually in the second pass, after all register variables, automatics, and temporaries have been seen). The final job is to find those parameters which may have been declared register, and generate the code to initialize the register with the value passed on the stack.
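The shape of a _b_f_c_o_d_e implementation might look like the sketch below, written for an imaginary machine with registers fp and sp; the opcodes, register names, and the assembler constant .F1 are all invented here. The real routine printf's directly to the output stream; this version writes into a buffer so that the result can be inspected.

```c
#include <stdio.h>

/* Sketch of a function prolog on a hypothetical machine. The frame size
 * is not known yet, so it is referred to by the assembler constant .F1,
 * which is defined later, as the text explains. */
int bfcode(char *buf, size_t n)
{
    return snprintf(buf, n,
        "\tpush\tfp\n"          /* save the previous frame pointer */
        "\tmov\tsp,fp\n"        /* establish the new stack frame   */
        "\tsub\t$.F1,sp\n");    /* .F1: frame size, defined later  */
}
```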
Once again, for most machines, the general logic of _b_f_c_o_d_e remains the same, but the contents of the _p_r_i_n_t_f calls in it will change from machine to machine. _E_f_c_o_d_e is rather simpler, having just to generate the default return at the end of a function. This may be nontrivial in the case of a function returning a structure or union, however. There seems to be no really good place to discuss structures and unions, but this is as good a place as any. The C language now supports structure assignment, and the passing of structures as arguments to functions, and the receiving of structures back from functions. This was added rather late to C, and thus to the portable compiler. Consequently, it fits in less well than the older features. Moreover, most of the burden of making these features work is placed on the machine dependent code. There are both conceptual and practical problems. Conceptually, the compiler is structured around the idea that to compute something, you put it into a register and work on it. This notion causes a bit of trouble on some machines (e.g., machines with 3-address opcodes), but matches many machines quite well. Unfortunately, this notion breaks down with structures. The closest that one can come is to keep the addresses of the structures in registers. The actual code sequences used to move structures vary from the trivial (a multiple byte move) to the horrible (a function call), and are very machine dependent. The practical problem is more painful. When a function returning a structure is called, this function has to have some place to put the structure value. If it places it on the stack, it has difficulty popping its stack frame. If it places the value in a static temporary, the routine fails to be reentrant. The most logically consistent way of implementing this is for the caller to pass in a pointer to a spot where the called function should put the value before returning.
This is relatively straightforward, although a bit tedious, to implement, but means that the caller must have properly declared the function type, even if the value is never used. On some machines, such as the Interdata 8/32, the return value simply overlays the argument region (which on the 8/32 is part of the caller's stack frame). The caller takes care of leaving enough room if the returned value is larger than the arguments. This also assumes that the caller declares the function properly. The PDP-11 and the VAX have stack hardware which is used in function calls and returns; this makes it very inconvenient to use either of the above mechanisms. In these machines, a static area within the called function is allocated, and the function return value is copied into it on return; the function returns the address of that region. This is simple to implement, but is non-reentrant. However, the function can now be called as a subroutine without being properly declared, without the disaster which would otherwise ensue. No matter what choice is taken, the convention is that the function actually returns the address of the return structure value. In building expression trees, the portable compiler takes a bit for granted about structures. It assumes that functions returning structures actually return a pointer to the structure, and it assumes that a reference to a structure is actually a reference to its address. The structure assignment operator is rebuilt so that the left operand is the structure being assigned to, but the right operand is the address of the structure being assigned; this makes it easier to deal with _a = _b = _c and similar constructions. There are four special tree nodes associated with these operations: STASG (structure assignment), STARG (structure argument to a function call), and STCALL and UNARY STCALL (calls of a function with nonzero and zero arguments, respectively).
These four nodes are unique in that the size and alignment information, which can be determined by the type for all other objects in C, must be known to carry out these operations; special fields are set aside in these nodes to contain this information, and special intermediate code is used to transmit this information.

_F_i_r_s_t _P_a_s_s _S_u_m_m_a_r_y

There are many other issues which have been ignored here, partly to justify the title ``tour'', and partially because they have seemed to cause little trouble. There are some debugging flags which may be turned on, by giving the compiler's first pass the argument -_X[flags] Some of the more interesting flags are -Xd for the defining and freeing of symbols, -Xi for initialization comments, and -Xb for various comments about the building of trees. In many cases, repeating the flag more than once gives more information; thus, -Xddd gives more information than -Xd. In the two pass version of the compiler, the flags should not be set when the output is sent to the second pass, since the debugging output and the intermediate code both go onto the standard output. We turn now to consideration of the second pass.

_P_a_s_s _T_w_o

Code generation is far less well understood than parsing or lexical analysis, and for this reason the second pass is far harder to discuss in a file by file manner. A great deal of the difficulty is in understanding the issues and the strategies employed to meet them. Any particular function is likely to be reasonably straightforward. Thus, this part of the paper will concentrate a good deal on the broader aspects of strategy in the code generator, and will not get too intimate with the details.

_O_v_e_r_v_i_e_w

It is difficult to organize a code generator to be flexible enough to generate code for a large number of machines, and still be efficient for any one of them. Flexibility is also important when it comes time to tune the code generator to improve the output code quality.
On the other hand, too much flexibility can lead to semantically incorrect code, and potentially a combinatorial explosion in the number of cases to be considered in the compiler. One goal of the code generator is to have a high degree of correctness. It is very desirable to have the compiler detect its own inability to generate correct code, rather than to produce incorrect code. This goal is achieved by having a simple model of the job to be done (e.g., an expression tree) and a simple model of the machine state (e.g., which registers are free). The act of generating an instruction performs a transformation on the tree and the machine state; hopefully, the tree eventually gets reduced to a single node. If each of these instruction/transformation pairs is correct, and if the machine state model really represents the actual machine, and if the transformations reduce the input tree to the desired single node, then the output code will be correct. For most real machines, there is no definitive theory of code generation that encompasses all the C operators. Thus the selection of which instruction/transformations to generate, and in what order, will have a heuristic flavor. If, for some expression tree, no transformation applies, or, more seriously, if the heuristics select a sequence of instruction/transformations that do not in fact reduce the tree, the compiler will report its inability to generate code, and abort. A major part of the code generator is concerned with the model and the transformations. Most of this is machine independent, or depends only on simple tables. The flexibility comes from the heuristics that guide the transformations of the trees, the selection of subgoals, and the ordering of the computation.

_T_h_e _M_a_c_h_i_n_e _M_o_d_e_l

The machine is assumed to have a number of registers, of at most two different types: _A and _B.
Within each register class, there may be scratch (temporary) registers and dedicated registers (e.g., register variables, the stack pointer, etc.). Requests to allocate and free registers involve only the temporary registers. Each of the registers in the machine is given a name and a number in the _m_a_c_2_d_e_f_s._h file; the numbers are used as indices into various tables that describe the registers, so they should be kept small. One such table is the _r_s_t_a_t_u_s table on file _l_o_c_a_l_2._c. This table is indexed by register number, and contains expressions made up from manifest constants describing the register types: SAREG for dedicated AREG's, SAREG|STAREG for scratch AREG's, and SBREG and SBREG|STBREG similarly for BREG's. There are macros that access this information: _i_s_b_r_e_g(_r) returns true if register number _r is a BREG, and _i_s_t_r_e_g(_r) returns true if register number _r is a temporary AREG or BREG. Another table, _r_n_a_m_e_s, contains the register names; this is used when putting out assembler code and diagnostics. The usage of registers is kept track of by an array called _b_u_s_y. _B_u_s_y[_r] is the number of uses of register _r in the current tree being processed. The allocation and freeing of registers will be discussed later as part of the code generation algorithm.

_G_e_n_e_r_a_l _O_r_g_a_n_i_z_a_t_i_o_n

As mentioned above, the second pass reads lines from the intermediate file, copying through to the output unchanged any lines that begin with a `)', and making note of the information about stack usage and register allocation contained on lines beginning with `]' and `['. The expression trees, whose beginning is indicated by a line beginning with `.', are read and rebuilt into trees. If the compiler is loaded as one pass, the expression trees are immediately available to the code generator. The actual code generation is done by a hierarchy of routines.
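The register description tables introduced under the machine model might, for an imaginary eight-register machine, look like this; the bit values are invented, and only the names follow the text.

```c
/* Register status bits in the style of mac2defs.h; values invented. */
#define SAREG  01     /* A register                */
#define STAREG 02     /* scratch (temporary) A reg */
#define SBREG  04     /* B register                */
#define STBREG 010    /* scratch (temporary) B reg */

/* A hypothetical machine: registers 0-3 are scratch AREGs, 4 and 5 are
 * dedicated AREGs (say, register variables), 6 and 7 are scratch BREGs. */
int rstatus[] = {
    SAREG|STAREG, SAREG|STAREG, SAREG|STAREG, SAREG|STAREG,
    SAREG,        SAREG,
    SBREG|STBREG, SBREG|STBREG,
};

#define isbreg(r) (rstatus[r] & SBREG)
#define istreg(r) (rstatus[r] & (STAREG|STBREG))

int busy[8];          /* busy[r]: uses of register r in the current tree */
```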
The routine _d_e_l_a_y is first given the tree; it attempts to delay some postfix ++ and -- computations that might reasonably be done after the smoke clears. It also attempts to handle comma (`,') operators by computing the left side expression first, and then rewriting the tree to eliminate the operator. _D_e_l_a_y calls _c_o_d_g_e_n to control the actual code generation process. _C_o_d_g_e_n takes as arguments a pointer to the expression tree, and a second argument that, for socio-historical reasons, is called a _c_o_o_k_i_e. The cookie describes a set of goals that would be acceptable for the code generation: these are assigned to individual bits, so they may be logically or'ed together to form a large number of possible goals. Among the possible goals are FOREFF (compute for side effects only; don't worry about the value), INTEMP (compute and store value into a temporary location in memory), INAREG (compute into an A register), INTAREG (compute into a scratch A register), INBREG and INTBREG similarly, FORCC (compute for condition codes), and FORARG (compute it as a function argument; e.g., stack it if appropriate). _C_o_d_g_e_n first canonicalizes the tree by calling _c_a_n_o_n. This routine looks for certain transformations that might now be applicable to the tree. One, which is very common and very powerful, is to fold together an indirection operator (UNARY MUL) and a register (REG); in most machines, this combination is addressable directly, and so is similar to a NAME in its behavior. The UNARY MUL and REG are folded together to make another node type called OREG. In fact, in many machines it is possible to directly address not just the cell pointed to by a register, but also cells differing by a constant offset from the cell pointed to by the register.
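The OREG fold performed by _c_a_n_o_n can be sketched as below. The node layout is invented, and the machine dependent _n_o_t_o_f_f check is taken to accept every offset here.

```c
enum op { REG, OREG, ICON, UMUL, PLUS };   /* UMUL: unary * */

struct node {
    enum op op;
    int rval;             /* register number for REG and OREG         */
    long lval;            /* constant offset for OREG, value for ICON */
    struct node *left, *right;
};

/* Sketch of the fold: *(reg) becomes OREG(reg, 0), and *(reg + const)
 * becomes OREG(reg, const). A real canon would first ask the machine
 * dependent notoff() whether the offset is acceptable. */
void oregfold(struct node *p)
{
    if (p->op != UMUL)
        return;
    struct node *q = p->left;
    if (q->op == REG) {
        p->op = OREG; p->rval = q->rval; p->lval = 0;
    } else if (q->op == PLUS && q->left->op == REG && q->right->op == ICON) {
        p->op = OREG; p->rval = q->left->rval; p->lval = q->right->lval;
    }
}
```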
_C_a_n_o_n also looks for such cases, calling the machine dependent routine _n_o_t_o_f_f to decide if the offset is acceptable (for example, in the IBM 370 the offset must be between 0 and 4095 bytes). Another optimization is to replace bit field operations by shifts and masks if the operation involves extracting the field. Finally, a machine dependent routine, _s_u_c_o_m_p, is called that computes the Sethi-Ullman numbers for the tree (see below). After the tree is canonicalized, _c_o_d_g_e_n calls the routine _s_t_o_r_e whose job is to select a subtree of the tree to be computed and (usually) stored before beginning the computation of the full tree. _S_t_o_r_e must return a tree that can be computed without need for any temporary storage locations. In effect, the only store operations generated while processing the subtree must be as a response to explicit assignment operators in the tree. This division of the job marks one of the more significant, and successful, departures from most other compilers. It means that the code generator can operate under the assumption that there are enough registers to do its job, without worrying about temporary storage. If a store into a temporary appears in the output, it is always as a direct result of logic in the _s_t_o_r_e routine; this makes debugging easier. One consequence of this organization is that code is not generated by a treewalk. There are theoretical results that support this decision. [ Aho Johnson Optimal Code Trees jacm ] It may be desirable to compute several subtrees and store them before tackling the whole tree; if a subtree is to be stored, this is known before the code generation for the subtree is begun, and the subtree is computed when all scratch registers are available.
The _store_ routine decides what subtrees, if any, should be stored by making use of numbers, called _Sethi-Ullman numbers_, that give, for each subtree of an expression tree, the minimum number of scratch registers required to compile the subtree, without any stores into temporaries. [ Sethi Ullman jacm 1970 ] These numbers are computed by the machine-dependent routine _sucomp_, called by _canon_. The basic notion is that, knowing the Sethi-Ullman numbers for the descendants of a node, and knowing the operator of the node and some information about the machine, the Sethi-Ullman number of the node itself can be computed. If the Sethi-Ullman number for a tree exceeds the number of scratch registers available, some subtree must be stored. Unfortunately, the theory behind the Sethi-Ullman numbers applies only to uselessly simple machines and operators. For the rich set of C operators, and for machines with asymmetric registers, register pairs, different kinds of registers, and exceptional forms of addressing, the theory cannot be applied directly. The basic idea of estimation is a good one, however, and well worth applying; the application, especially when the compiler comes to be tuned for high code quality, goes beyond the park of theory into the swamp of heuristics. This topic will be taken up again later, when more of the compiler structure has been described.

After examining the Sethi-Ullman numbers, _store_ selects a subtree, if any, to be stored, and returns the subtree and the associated cookie in the external variables _stotree_ and _stocook_. If a subtree has been selected, or if the whole tree is ready to be processed, the routine _order_ is called, with a tree and cookie. _Order_ generates code for trees that do not require temporary locations.
_Order_ may make recursive calls on itself, and, in some cases, on _codgen_; for example, when processing the operators &&, ||, and comma (`,'), that have a left to right evaluation, it is incorrect for _store_ to examine the right operand for subtrees to be stored. In these cases, _order_ will call _codgen_ recursively when it is permissible to work on the right operand. A similar issue arises with the ? : operator.

The _order_ routine works by matching the current tree with a set of code templates. If a template is discovered that will match the current tree and cookie, the associated assembly language statement or statements are generated. The tree is then rewritten, as specified by the template, to represent the effect of the output instruction(s). If no template match is found, first an attempt is made to find a match with a different cookie; for example, in order to compute an expression with cookie INTEMP (store into a temporary storage location), it is usually necessary to compute the expression into a scratch register first. If all attempts to match the tree fail, the heuristic part of the algorithm becomes dominant. Control is typically given to one of a number of machine-dependent routines that may in turn recursively call _order_ to achieve a subgoal of the computation (for example, one of the arguments may be computed into a temporary register). After this subgoal has been achieved, the process begins again with the modified tree. If the machine-dependent heuristics are unable to reduce the tree further, a number of default rewriting rules may be considered appropriate. For example, if the left operand of a + is a scratch register, the + can be replaced by a += operator; the tree may then match a template.

To close this introduction, we will discuss the steps in compiling code for the expression

	a += b

where _a_ and _b_ are static variables.
To begin with, the whole expression tree is examined with cookie FOREFF, and no match is found. Search with other cookies is equally fruitless, so an attempt at rewriting is made. Suppose we are dealing with the Interdata 8/32 for the moment. It is recognized that the left hand and right hand sides of the += operator are addressable, and in particular the left hand side has no side effects, so it is permissible to rewrite this as

	a = a + b

and this is done. No match is found on this tree either, so a machine dependent rewrite is done; it is recognized that the left hand side of the assignment is addressable, but the right hand side is not in a register, so _order_ is called recursively, being asked to put the right hand side of the assignment into a register. This invocation of _order_ searches the tree for a match, and fails. The machine dependent rule for + notices that the right hand operand is addressable; it decides to put the left operand into a scratch register. Another recursive call to _order_ is made, with the tree consisting solely of the leaf _a_, and the cookie asking that the value be placed into a scratch register. This now matches a template, and a load instruction is emitted. The node consisting of _a_ is rewritten in place to represent the register into which _a_ is loaded, and this third call to _order_ returns. The second call to _order_ now finds that it has the tree

	reg + b

to consider. Once again, there is no match, but the default rewriting rule rewrites the + as a += operator, since the left operand is a scratch register. When this is done, there is a match: in fact,

	reg += b

simply describes the effect of the add instruction on a typical machine. After the add is emitted, the tree is rewritten to consist merely of the register node, since the result of the add is now in the register.
This agrees with the cookie passed to the second invocation of _order_, so this invocation terminates, returning to the first level. The original tree has now become

	a = reg

which matches a template for the store instruction. The store is output, and the tree rewritten to become just a single register node. At this point, since the top level call to _order_ was interested only in side effects, the call to _order_ returns, and the code generation is completed; we have generated a load, add, and store, as might have been expected.

The effect of machine architecture on this is considerable. For example, on the Honeywell 6000, the machine dependent heuristics recognize that there is an ``add to storage'' instruction, so the strategy is quite different; _b_ is loaded into a register, and then an add to storage instruction generated to add this register into _a_. The transformations, involving as they do the semantics of C, are largely machine independent. The decisions as to when to use them, however, are almost totally machine dependent.

Having given a broad outline of the code generation process, we shall next consider the heart of it: the templates. This leads naturally into discussions of template matching and register allocation, and finally a discussion of the machine dependent interfaces and strategies.

The Templates

The templates describe the effect of the target machine instructions on the model of computation around which the compiler is organized. In effect, each template has five logical sections, and represents an assertion of the form: _if_ we have a subtree of a given shape (1), and we have a goal (cookie) or goals to achieve (2), and we have sufficient free resources (3), _then_ we may emit an instruction or instructions (4), and rewrite the subtree in a particular manner (5), and the rewritten tree will achieve the desired goals. These five sections will be discussed in more detail later.
First, we give an example of a template:

	ASG PLUS,	INAREG,
		SAREG,	TINT,
		SNAME,	TINT,
			0,	RLEFT,
			"	add	AL,AR\n",

The top line specifies the operator (+=) and the cookie (compute the value of the subtree into an AREG). The second and third lines specify the left and right descendants, respectively, of the += operator. The left descendant must be a REG node, representing an A register, and have integer type, while the right side must be a NAME node, and also have integer type. The fourth line contains the resource requirements (no scratch registers or temporaries needed), and the rewriting rule (replace the subtree by the left descendant). Finally, the quoted string on the last line represents the output to the assembler: lower case letters, tabs, spaces, etc. are copied _verbatim_ to the output; upper case letters trigger various macro-like expansions. Thus, _AL_ would expand into the Address form of the Left operand - presumably the register number. Similarly, _AR_ would expand into the name of the right operand. The _add_ instruction of the last section might well be emitted by this template.

In principle, it would be possible to make separate templates for all legal combinations of operators, cookies, types, and shapes. In practice, the number of combinations is very large. Thus, a considerable amount of mechanism is present to permit a large number of subtrees to be matched by a single template. Most of the shape and type specifiers are individual bits, and can be logically or'ed together. There are a number of special descriptors for matching classes of operators. The cookies can also be combined.
As an example of the kind of template that really arises in practice, the actual template for the Interdata 8/32 that subsumes the above example is:

	ASG OPSIMP,	INAREG|FORCC,
		SAREG,	TINT|TUNSIGNED|TPOINT,
		SAREG|SNAME|SOREG|SCON,	TINT|TUNSIGNED|TPOINT,
			0,	RLEFT|RESCC,
			"	OI	AL,AR\n",

Here, OPSIMP represents the operators +, -, |, &, and ^. The _OI_ macro in the output string expands into the appropriate Integer Opcode for the operator. The left and right sides can be integers, unsigned, or pointer types. The right side can be, in addition to a name, a register, a memory location whose address is given by a register and displacement (OREG), or a constant. Finally, these instructions set the condition codes, and so can be used in condition contexts: the cookie and rewriting rules reflect this.

The Template Matching Algorithm

The heart of the second pass is the template matching algorithm, in the routine _match_. _Match_ is called with a tree and a cookie; it attempts to match the given tree against some template that will transform it according to one of the goals given in the cookie. If a match is successful, the transformation is applied; _expand_ is called to generate the assembly code, and then _reclaim_ rewrites the tree, and reclaims the resources, such as registers, that might have become free as a result of the generated code.

This part of the compiler is among the most time critical. There is a spectrum of implementation techniques available for doing this matching. The most naive algorithm simply looks at the templates one by one. This can be considerably improved upon by restricting the search for an acceptable template. It would be possible to do better than this if the templates were given to a separate program that ate them and generated a template matching subroutine. This would make maintenance of the compiler much more complicated, however, so this has not been done.
The matching algorithm is actually carried out by restricting the range in the table that must be searched for each opcode. This introduces a number of complications, however, and needs a bit of sympathetic help from the person constructing the compiler in order to obtain best results. The exact tuning of this algorithm continues; it is best to consult the code and comments in _match_ for the latest version.

In order to match a template to a tree, it is necessary to match not only the cookie and the operator of the root, but also the types and shapes of the left and right descendants (if any) of the tree. A convention is established here that is carried out throughout the second pass of the compiler. If a node represents a unary operator, the single descendant is always the ``left'' descendant. If a node represents a unary operator or a leaf node (no descendants), the ``right'' descendant is taken by convention to be the node itself. This enables templates to easily match leaves and conversion operators, for example, without any additional mechanism in the matching program.

The type matching is straightforward; it is possible to specify any combination of basic types, general pointers, and pointers to one or more of the basic types. The shape matching is somewhat more complicated, but still pretty simple. Templates have a collection of possible operand shapes on which the opcode might match. In the simplest case, an _add_ operation might be able to add to either a register variable or a scratch register, and might be able (with appropriate help from the assembler) to add an integer constant (ICON), a static memory cell (NAME), or a stack location (OREG).

It is usually attractive to specify a number of such shapes, and distinguish between them when the assembler output is produced. It is possible to describe the union of many elementary shapes such as ICON, NAME, OREG, AREG or BREG (both scratch and register forms), etc.
To handle at least the simple forms of indirection, one can also match some more complicated forms of trees: STARNM and STARREG can match more complicated trees headed by an indirection operator, and SFLD can match certain trees headed by a FLD operator. These patterns call machine dependent routines that match the patterns of interest on a given machine. The shape SWADD may be used to recognize NAME or OREG nodes that lie on word boundaries: this may be of some importance on word addressed machines. Finally, there are some special shapes: these may not be used in conjunction with the other shapes, but may be defined and extended in machine dependent ways. The special shapes SZERO, SONE, and SMONE are predefined and match constants 0, 1, and -1, respectively; others are easy to add and match by using the machine dependent routine _special_.

When a template has been found that matches the root of the tree, the cookie, and the shapes and types of the descendants, there is still one bar to a total match: the template may call for some resources (for example, a scratch register). The routine _allo_ is called, and it attempts to allocate the resources. If it cannot, the match fails; no resources are allocated. If successful, the allocated resources are given numbers 1, 2, etc. for later reference when the assembly code is generated. The routines _expand_ and _reclaim_ are then called. The _match_ routine then returns a special value, MDONE. If no match was found, the value MNOPE is returned; this is a signal to the caller to try more cookie values, or attempt a rewriting rule.

_Match_ is also used to select rewriting rules, although the way of doing this is pretty straightforward. A special cookie, FORREW, is used to ask _match_ to search for a rewriting rule. The rewriting rules are keyed to various opcodes; most are carried out in _order_.
Since the question of when to rewrite is one of the key issues in code generation, it will be taken up again later.

Register Allocation

The register allocation routines, and the allocation strategy, play a central role in the correctness of the code generation algorithm. If there are bugs in the Sethi-Ullman computation that cause the number of needed registers to be underestimated, the compiler may run out of scratch registers; it is essential that the allocator keep track of those registers that are free and busy, in order to detect such conditions.

Allocation of registers takes place as the result of a template match; the routine _allo_ is called with a word describing the number of A registers, B registers, and temporary locations needed. The allocation of temporary locations on the stack is relatively straightforward, and will not be further covered; the bookkeeping is a bit tricky, but conceptually trivial, and requests for temporary space on the stack will never fail.

Register allocation is less straightforward. The two major complications are _pairing_ and _sharing_. In many machines, some operations (such as multiplication and division), and/or some types (such as longs or double precision) require even/odd pairs of registers. Operations of the first type are exceptionally difficult to deal with in the compiler; in fact, their theoretical properties are rather bad as well. [ Aho Johnson Ullman Multiregister ] The second issue is dealt with rather more successfully; a machine dependent function called _szty(t)_ is called that returns 1 or 2, depending on the number of A registers required to hold an object of type _t_. If _szty_ returns 2, an even/odd pair of A registers is allocated for each request. As part of its duties, the routine _usable_ finds usable register pairs for various operations.
This task is not as easy as it sounds; it does not suffice to merely use _szty_ on the expression tree, since there are situations in which a register pair temporary is needed even though the result of the expression requires only one register. This can occur with assignment operator expressions which have int type but a double right hand side, or with relational expressions where one operand is float and the other double.

The other issue, sharing, is more subtle, but important for good code quality. When registers are allocated, it is possible to reuse registers that hold address information, and use them to contain the values computed or accessed. For example, on the IBM 360, if register 2 has a pointer to an integer in it, we may load the integer into register 2 itself by saying:

	L	2,0(2)

If register 2 had a byte pointer, however, the sequence for loading a character involves clearing the target register first, and then inserting the desired character:

	SR	3,3
	IC	3,0(2)

In the first case, if register 3 were used as the target, it would lead to a larger number of registers used for the expression than were required; the compiler would generate inefficient code. On the other hand, if register 2 were used as the target in the second case, the code would simply be wrong. In the first case, register 2 can be _shared_ while in the second, it cannot.

In the specification of the register needs in the templates, it is possible to indicate whether required scratch registers may be shared with possible registers on the left or the right of the input tree. In order that a register be shared, it must be scratch, and it must be used only once, on the appropriate side of the tree being compiled.

The _allo_ routine thus has a bit more to do than meets the eye; it calls _freereg_ to obtain a free register for each A and B register request.
_Freereg_ makes multiple calls on the routine _usable_ to decide if a given register can be used to satisfy a given need. _Usable_ calls _shareit_ if the register is busy, but might be shared. Finally, _shareit_ calls _ushare_ to decide if the desired register is actually in the appropriate subtree, and can be shared.

Just to add additional complexity, on some machines (such as the IBM 370) it is possible to have ``double indexing'' forms of addressing; these are represented by OREG's with the base and index registers encoded into the register field. While the register allocation and deallocation _per se_ is not made more difficult by this phenomenon, the code itself is somewhat more complex.

Having allocated the registers and expanded the assembly language, it is time to reclaim the resources; the routine _reclaim_ does this. Many operations produce more than one result. For example, many arithmetic operations may produce a value in a register, and also set the condition codes. Assignment operations may leave results both in a register and in memory. _Reclaim_ is passed three parameters; the tree and cookie that were matched, and the rewriting field of the template. The rewriting field allows the specification of possible results; the tree is rewritten to reflect the results of the operation. If the tree was computed for side effects only (FOREFF), the tree is freed, and all resources in it reclaimed. If the tree was computed for condition codes, the resources are also freed, and the tree replaced by a special node type, FORCC. Otherwise, the value may be found in the left argument of the root, the right argument of the root, or one of the temporary resources allocated. In these cases, first the resources of the tree, and the newly allocated resources, are freed; then the resources needed by the result are made busy again.
The final result must always match the shape of the input cookie; otherwise, the compiler error ``cannot reclaim'' is generated. There are some machine dependent ways of preferring results in registers or memory when there are multiple results matching multiple goals in the cookie.

_Reclaim_ also implements, in a curious way, C's ``usual arithmetic conversions''. When a value is generated into a temporary register, _reclaim_ decides what the type and size of the result will be. Unless automatic conversion is specifically suppressed in the code template with the T macro, _reclaim_ converts char and short results to int, unsigned char and unsigned short results to unsigned int, and float into double (for double-only floating point arithmetic). This conversion is a simple type pun; no instructions for converting the value are actually emitted. This implies that registers must always contain a value that is at least as wide as a register, which greatly restricts the range of possible templates.

The Machine Dependent Interface

The files _order.c_, _local2.c_, and _table.c_, as well as the header file _mac2defs_, represent the machine dependent portion of the second pass. The machine dependent portion can be roughly divided into two: the easy portion and the hard portion. The easy portion tells the compiler the names of the registers, and arranges that the compiler generate the proper assembler formats, opcode names, location counters, etc. The hard portion involves the Sethi-Ullman computation, the rewriting rules, and, to some extent, the templates. It is hard because there are no real algorithms that apply; most of this portion is based on heuristics. This section discusses the easy portion; the next several sections will discuss the hard portion.

If the compiler is adapted from a compiler for a machine of similar architecture, the easy part is indeed easy.
In _mac2defs_, the register numbers are defined, as well as various parameters for the stack frame, and various macros that describe the machine architecture. If double indexing is to be permitted, for example, the symbol R2REGS is defined. Also, a number of macros that are involved in function call processing, especially for unusual function call mechanisms, are defined here.

In _local2.c_, a large number of simple functions are defined. These do things such as write out opcodes, register names, and address forms for the assembler. Part of the function call code is defined here; that is nontrivial to design, but typically rather straightforward to implement. Among the easy routines in _order.c_ are routines for generating a created label, defining a label, and generating the arguments of a function call. These routines tend to have a local effect, and depend in a fairly straightforward way on the target assembler and the design decisions already made about the compiler. Thus they will not be further treated here.

The Rewriting Rules

When a tree fails to match any template, it becomes a candidate for rewriting. Before the tree is rewritten, the machine dependent routine _nextcook_ is called with the tree and the cookie; it suggests another cookie that might be a better candidate for the matching of the tree. If all else fails, the templates are searched with the cookie FORREW, to look for a rewriting rule. The rewriting rules are of two kinds; for most of the common operators, there are machine dependent rewriting rules that may be applied; these are handled by machine dependent functions that are called and given the tree to be computed. These routines may recursively call _order_ or _codgen_ to cause certain subgoals to be achieved; if they actually call for some alteration of the tree, they return 1, and the code generation algorithm recanonicalizes and tries again.
If these routines choose not to deal with the tree, the default rewriting rules are applied.

The assignment operators, when rewritten, call the routine _setasg_. This is assumed to rewrite the tree at least to the point where there are no side effects in the left hand side. If there is still no template match, a default rewriting is done that causes an expression such as

	a += b

to be rewritten as

	a = a + b

This is a useful default for certain mixtures of strange types (for example, when _a_ is a bit field and _b_ a character) that otherwise might need separate table entries.

Simple assignment, structure assignment, and all forms of calls are handled completely by the machine dependent routines. For historical reasons, the routines generating the calls return 1 on failure, 0 on success, unlike the other routines.

The machine dependent routine _setbin_ handles binary operators; it too must do most of the job. In particular, when it returns 0, it must do so with the left hand side in a temporary register. The default rewriting rule in this case is to convert the binary operator into the associated assignment operator; since the left hand side is assumed to be a temporary register, this preserves the semantics and often allows a considerable saving in the template table.

The increment and decrement operators may be dealt with by the machine dependent routine _setincr_. If this routine chooses not to deal with the tree, the rewriting rule replaces

	x ++

by

	( (x += 1) - 1)

which preserves the semantics. Once again, this is not too attractive for the most common cases, but can generate close to optimal code when the type of x is unusual.

Finally, the indirection (UNARY MUL) operator is also handled in a special way. The machine dependent routine _offstar_ is extremely important for the efficient generation of code.
_Offstar_ is called with a tree that is the direct descendant of a UNARY MUL node; its job is to transform this tree so that the combination of UNARY MUL with the transformed tree becomes addressable. On most machines, _offstar_ can simply compute the tree into an A or B register, depending on the architecture, and then _canon_ will make the resulting tree into an OREG. On many machines, _offstar_ can profitably choose to do less work than computing its entire argument into a register. For example, if the target machine supports OREG's with a constant offset from a register, and _offstar_ is called with a tree of the form

	expr + const

where _const_ is a constant, then _offstar_ need only compute _expr_ into the appropriate form of register. On machines that support double indexing, _offstar_ may have even more choice as to how to proceed. The proper tuning of _offstar_, which is not typically too difficult, should be one of the first tries at optimization attempted by the compiler writer.

The Sethi-Ullman Computation

The heart of the heuristics is the computation of the Sethi-Ullman numbers. This computation is closely linked with the rewriting rules and the templates. As mentioned before, the Sethi-Ullman numbers are expected to estimate the number of scratch registers needed to compute the subtrees without using any stores. However, the original theory does not apply to real machines. For one thing, the theory assumes that all registers are interchangeable. Real machines have general purpose, floating point, and index registers, register pairs, etc. The theory also does not account for side effects; this rules out various forms of pathology that arise from assignment and assignment operators. Condition codes are also undreamed of.
Finally, the influence of types, conversions, and the various addressability restrictions and extensions of real machines are also ignored.

Nevertheless, for a ``useless'' theory, the basic insight of Sethi and Ullman is amazingly useful in a real compiler. The notion that one should attempt to estimate the resource needs of trees before starting the code generation provides a natural means of splitting the code generation problem, and provides a bit of redundancy and self checking in the compiler. Moreover, if writing the Sethi-Ullman routines is hard, describing, writing, and debugging the alternative (routines that attempt to free up registers by stores into temporaries ``on the fly'') is even worse. Nevertheless, it should be clearly understood that these routines exist in a realm where there is no ``right'' way to write them; it is an art, the realm of heuristics, and, consequently, a major source of bugs in the compiler. Often, the early, crude versions of these routines give little trouble; only after the compiler is actually working and the code quality is being improved do serious problems have to be faced. Having a simple, regular machine architecture is worth quite a lot at this time.

The major problems arise from asymmetries in the registers: register pairs, having different kinds of registers, and the related problem of needing more than one register (frequently a pair) to store certain data types (such as longs or doubles). There appears to be no general way of treating this problem; solutions have to be fudged for each machine where the problem arises. On the Honeywell 66, for example, there are only two general purpose registers, so a need for a pair is the same as the need for two registers. On the IBM 370, the register pair (0,1) is used to do multiplications and divisions; registers 0 and 1 are not generally considered part of the scratch registers, and so do not require allocation explicitly.
On the Interdata 8/32, after much consideration, the decision was made not to try to deal with the register pair issue; operations such as multiplication and division that required pairs were simply assumed to take all of the scratch registers. Several weeks of effort had failed to produce an algorithm that seemed to have much chance of running successfully without inordinate debugging effort. The difficulty of this issue should not be minimized; it represents one of the main intellectual efforts in porting the compiler. Nevertheless, this problem has been fudged with a degree of success on nearly a dozen machines, so the compiler writer should not abandon hope.

The Sethi-Ullman computations interact with the rest of the compiler in a number of rather subtle ways. As already discussed, the _store_ routine uses the Sethi-Ullman numbers to decide which subtrees are too difficult to compute in registers, and must be stored. There are also subtle interactions between the rewriting routines and the Sethi-Ullman numbers. Suppose we have a tree such as

	A - B

where _A_ and _B_ are expressions; suppose further that _B_ takes two registers, and _A_ one. It is possible to compute the full expression in two registers by first computing _B_, and then, using the scratch register used by _B_, but not containing the answer, compute _A_. The subtraction can then be done, computing the expression. (Note that this assumes a number of things, not the least of which are register-to-register subtraction operators and symmetric registers.) If the machine dependent routine _setbin_, however, is not prepared to recognize this case and compute the more difficult side of the expression first, the Sethi-Ullman number must be set to three. Thus, the Sethi-Ullman number for a tree should represent the code that the machine dependent routines are actually willing to generate. The interaction can go the other way.
If we take an expression such as *(_p + _i), where _p is a pointer and _i an integer, this can probably be done in one register on most machines. Thus, its Sethi-Ullman number would probably be set to one. If double indexing is possible in the machine, a possible way of computing the expression is to load both _p and _i into registers, and then use double indexing. This would use two scratch registers; in such a case, it is possible that the scratch registers might be unobtainable, or might make some other part of the computation run out of registers. The usual solution is to cause _o_f_f_s_t_a_r to ignore opportunities for double indexing that would tie up more scratch registers than the Sethi-Ullman number had reserved.

In summary, the Sethi-Ullman computation represents much of the craftsmanship and artistry in any application of the portable compiler. It is also a frequent source of bugs. Algorithms are available that will produce nearly optimal code for specialized machines, but unfortunately most existing machines are far removed from these ideals. The best way of proceeding in practice is to start with a compiler for a machine similar to the target, and proceed very carefully.

_R_e_g_i_s_t_e_r _A_l_l_o_c_a_t_i_o_n

After the Sethi-Ullman numbers are computed, _o_r_d_e_r calls a routine, _r_a_l_l_o, that does register allocation, if appropriate. This routine does relatively little, in general; this is especially true if the target machine is fairly regular. There are a few cases where it is assumed that the result of a computation takes place in a particular register; switch and function return are the two major places. The expression tree has a field, _r_a_l_l, that may be filled with a register number; this is taken to be a preferred register, and the first temporary register allocated by a template match will be this preferred one, if it is free. If not, no particular action is taken; this is just a heuristic.
If no register preference is present, the field contains NOPREF. In some cases, the result must be placed in a given register, no matter what. The register number is placed in _r_a_l_l, and the mask MUSTDO is logically or'ed in with it. In this case, if the subtree is requested in a register, and comes back in a register other than the demanded one, it is moved by calling the routine _r_m_o_v_e. If the target register for this move is busy, it is a compiler error. Note that this mechanism is the only one that will ever cause a register-to-register move between scratch registers (unless such a move is buried in the depths of some template). This simplifies debugging. In some cases, there is a rather strange interaction between the register allocation and the Sethi-Ullman number; if there is an operator or situation requiring a particular register, the allocator and the Sethi-Ullman computation must conspire to ensure that the target register is not being used by some intermediate result of some far-removed computation. This is most easily done by making the special operation take all of the free registers, preventing any other partially-computed results from cluttering up the works.

_T_e_m_p_l_a_t_e _S_h_o_r_t_c_u_t_s

Some operations are just too hard or too clumsy to be implemented in code templates on a particular architecture. One way to handle such operations is to replace them with function calls. The intermediate file reading code in _r_e_a_d_e_r._c contains a call to an implementation dependent macro MYREADER; this can be defined to call various routines which walk the code tree and perform transformations. On the VAX, for example, unsigned division and remainder operations are far too complex to encode in a template. The routine _h_a_r_d_o_p_s is called from a tree walk in _m_y_r_e_a_d_e_r to detect these operations and replace them with calls to the C runtime functions _u_d_i_v and _u_r_e_m.
(There are complementary functions _a_u_d_i_v and _a_u_r_e_m which are provided as support for unsigned assignment operator expressions; they are different from _u_d_i_v and _u_r_e_m because the left hand side of an assignment operator expression must be evaluated only once.) Note that arithmetic support routines are always expensive; the compiler makes an effort to notice common operations such as unsigned division by a constant power of two and generates optimal code for these inline.

Another escape involves the routine _z_z_z_c_o_d_e. This function is called from _e_x_p_a_n_d to process template macros which start with the character _Z. On the VAX, many complex code generation problems are swept under the rug into _z_z_z_c_o_d_e. Scalar type conversions are a particularly annoying issue; they are primarily handled using the macro ZA. Rather than creating a template for each possible conversion and result, which would be tedious and complex given C's many scalar types, this macro allows the compiler to take shortcuts. Tough conversions such as unsigned into double are easily handled using special code under ZA. One convention which makes scalar conversions somewhat more difficult than they might otherwise be is the strict requirement that values in registers must have a type that is as wide or wider than a single register. This convention is used primarily to implement the ``usual arithmetic conversions'' of C, but it can get in the way when converting between (say) a char value and an unsigned short. A routine named _c_o_l_l_a_p_s_i_b_l_e is used to determine whether one operation or two is needed to produce a register-width result.

Another convenient macro is ZP. This macro is used to generate an appropriate conditional test after a comparison.
This makes it possible to avoid a profusion of template entries which essentially duplicate each other, one entry for each type of test multiplied by the number of different comparison conditions. A related macro, ZN, is used to normalize the result of a relational test by producing an integer 0 or 1.

The macro ZS does the unlovely job of generating code for structure assignments. It tests the size of the structure to see what VAX instruction can be used to move it, and is capable of emitting a block move instruction for large structures. On other architectures this macro could be used to generate a function call to a block copy routine.

The macro ZG was recently introduced to handle the thorny issue of assignment operator expressions which have an integral left hand side and a floating point right hand side. These expressions are passed to the code generator without the usual type balancing so that good code can be generated for them. Older versions of the portable compiler computed these expressions with integer arithmetic; with the ZG operator, the current compiler can convert the left hand side to the appropriate floating type, compute the expression with floating point arithmetic, convert the result back to integral type and store it in the left hand side. These operations are performed by recursive calls to _z_z_z_c_o_d_e and other routines related to _e_x_p_a_n_d.

An assortment of other macros finish the job of interpreting code templates. Among the more interesting ones: ZC produces the number of words pushed on the argument stack, which is useful for function calls; ZD and ZE produce constant increment and decrement operations; ZL and ZR produce the assembler letter code (l, w or b) corresponding to the size and type of the left and right operand respectively.

_S_h_a_r_e_d _C_o_d_e

The _l_i_n_t utility shares sources with the portable compiler.
_L_i_n_t uses all of the machine independent pass 1 sources, and adds its own set of ``machine dependent'' routines, contained mostly in _l_i_n_t._c. _L_i_n_t uses a private intermediate file format and a private pass 2 whose source is _l_p_a_s_s_2._c. Several modifications were made to the C scanner in _s_c_a_n._c, conditionally compiled with the symbol LINT, in order to support _l_i_n_t's convention of passing ``pragma'' information inside special comments. A few other minor modifications were also made, _e._g. to skip over _a_s_m statements.

The _f_7_7 and _p_c compilers use a code generator which shares sources with pass 2 of the portable compiler. This code generator is very similar to pass 2 but uses a different intermediate file format. Three source files are needed in addition to the pass 2 sources. _f_o_r_t._c is a machine independent source file which contains a pass 2 main routine that replaces the equivalent routine in _r_e_a_d_e_r._c, together with several routines for reading the binary intermediate file. _f_o_r_t._c includes the machine dependent file _f_o_r_t._h, which defines two trivial label generation routines. A header file /_u_s_r/_i_n_c_l_u_d_e/_p_c_c._h defines opcode and type symbols which are needed to provide a standard intermediate file format; this file is also included by the Fortran and Pascal compilers. The creation of this header file made it necessary to make some changes in the way the portable C compiler is built. These changes were made with the aim of minimizing the number of lines changed in the original sources. Macro symbols in _p_c_c._h are flagged with a unique prefix to avoid symbol name collisions in the Fortran and Pascal compilers, which have their own internal opcode and type symbols.
A _s_e_d(1) script is used to strip these prefixes, producing an include file named _p_c_c_l_o_c_a_l._h which is specific to the portable C compiler and contains opcode symbols which are compatible with the original opcode symbols. A similar _s_e_d script is used to produce a file of Yacc tokens for the C grammar.

A number of changes to existing source files were made to accommodate the Fortran-style pass 2. These changes are conditionally compiled using the symbol FORT. Many changes were needed to implement single-precision arithmetic; other changes concern such things as the avoidance of floating point move instructions, which on the VAX can cause floating point faults when a datum is not a normalized floating point value. In earlier implementations of the Fortran-style pass 2 there were a number of stub files which served only to define the symbol FORT in a particular source file; these files have been removed for 4.3BSD in favor of a new compilation strategy which yields up to three different objects from a single source file, depending on what compilation control symbols are defined for that file.

The Fortran-style pass 2 uses a Polish Postfix intermediate file. The file is in binary format, and is logically divided into a stream of 32-bit records. Each record consists of an (_o_p_c_o_d_e, _v_a_l_u_e, _t_y_p_e) triple, possibly followed inline by more descriptive information. The _o_p_c_o_d_e and _t_y_p_e are selected from the list in _p_c_c._h; the type encodes a basic type, around which may be wrapped type modifiers such as ``pointer to'' or ``array of'' to produce more complex types. The function of the _v_a_l_u_e parameter depends on the opcode; it may be used for a flag, a register number or the value of a constant, or it may be unused. The optional inline data is often a null-terminated string, but it may also be a binary offset from a register or from a symbolic constant; sometimes both a string and an offset appear.
Here are a few samples of intermediate file records and their interpretation:

ICON  (int, flag=0; binary=5)
      the integer constant 5
NAME  (char, flag=1; binary=1, string="foo")
      a character*1 element in a Fortran common block _f_o_o at offset 1
OREG  (char, reg=11; offset=1, string="v.2-v.1")
      the second element of a Fortran character*1 array, expressed as an offset from a static base register
PLUS  (float)
      a single precision add
FTEXT (size=2; string=".text 0")
      an inline assembler directive of length 2 (32-bit records)

_C_o_m_p_i_l_e_r _B_u_g_s

The portable compiler has an excellent record of generating correct code. The requirement for reasonable cooperation between the register allocation, Sethi-Ullman computation, rewriting rules, and templates builds quite a bit of redundancy into the compiling process. The effect of this is that, in a surprisingly short time, the compiler will start generating correct code for those programs that it can compile. The hard part of the job then becomes finding and eliminating those situations where the compiler refuses to compile a program because it knows it cannot do it right. For example, a template may simply be missing; this may either give a compiler error of the form ``no match for op ...'', or cause the compiler to go into an infinite loop applying various rewriting rules. The compiler has a variable, _n_r_e_c_u_r, that is set to 0 at the beginning of an expression, and incremented at key spots in the compilation process; if this parameter gets too large, the compiler decides that it is in a loop, and aborts. Loops are also characteristic of botches in the machine-dependent rewriting rules.
Bad Sethi-Ullman computations usually cause the scratch registers to run out; this often means that the Sethi-Ullman number was underestimated, so _s_t_o_r_e did not store something it should have; alternatively, it can mean that the rewriting rules were not smart enough to find the sequence that _s_u_c_o_m_p assumed would be used.

The best approach when a compiler error is detected involves several stages. First, try to get a small example program that steps on the bug. Second, turn on various debugging flags in the code generator, and follow the tree through the process of being matched and rewritten. Some flags of interest are -e, which prints the expression tree, -r, which gives information about the allocation of registers, -a, which gives information about the performance of _r_a_l_l_o, and -o, which gives information about the behavior of _o_r_d_e_r. This technique should allow most bugs to be found relatively quickly.

Unfortunately, finding the bug is usually not enough; it must also be fixed! The difficulty arises because a fix to the particular bug of interest tends to break other code that already works. Regression tests, tests that compare the performance of a new compiler against the performance of an older one, are very valuable in preventing major catastrophes.

_C_o_m_p_i_l_e_r _E_x_t_e_n_s_i_o_n_s

The portable C compiler makes a few extensions to the language described by Ritchie.

_S_i_n_g_l_e _p_r_e_c_i_s_i_o_n _a_r_i_t_h_m_e_t_i_c. ``All floating arithmetic in C is carried out in double-precision; whenever a _f_l_o_a_t appears in an expression it is lengthened to _d_o_u_b_l_e by zero-padding its fraction.'' Dennis Ritchie. [ kernighan ritchie prentice 1978 ] Programmers who would like to use C to write numerical applications often shy away from it because C programs cannot perform single precision arithmetic.
On machines such as the VAX which can cleanly support arithmetic on two (or more) sizes of floating point values, programs which can take advantage of single precision arithmetic will run faster. A very popular proposal for the ANSI C standard states that implementations may perform single precision computations with single precision arithmetic; some actual C implementations already do this, and now the Berkeley compiler joins them.

The changes are implemented in the compiler with a set of conditional compilation directives based on the symbol SPRECC; thus two compilers are generated, one with only double precision arithmetic and one with both double and single precision arithmetic. The _c_c program uses a flag -_f to select the single/double version of the compiler (/_l_i_b/_s_c_c_o_m) instead of the default double only version (/_l_i_b/_c_c_o_m). It is expected that at some time in the future the double only compiler will be retired and the single/double compiler will become the default.

There are a few implementation details of the single/double compiler which will be of interest to users and compiler porters. To maintain compatibility with functions compiled by the double only compiler, single precision actual arguments are still coerced to double precision, and formal arguments which are declared single precision are still ``really'' double precision. This may change if function prototypes of the sort proposed for the ANSI C standard are eventually adopted. Floating point constants are now classified into single precision and double precision types. The precision of a constant is determined from context; if a floating constant appears in an arithmetic expression with a single precision value, the constant is treated as having single precision type and the arithmetic expression is computed using single precision arithmetic.
Remarkably little code in the compiler needed to be changed to implement the single/double compiler. In many cases the changes overlapped with special cases which are used for the Fortran-style pass 2 (/_l_i_b/_f_1). Most of the single precision changes were implemented by Sam Leffler.

_P_r_e_p_r_o_c_e_s_s_o_r _e_x_t_e_n_s_i_o_n_s. The portable C compiler is normally distributed with a macro preprocessor written by J. F. Reiser. This preprocessor implements the features described in Ritchie's reference manual; it removes comments, expands macro definitions and removes or inserts code based on conditional compilation directives. Two interesting extensions are provided by this version of the preprocessor:

o  When comments are removed, no white space is necessarily substituted; this has the effect of _r_e_t_o_k_e_n_i_z_i_n_g code, since the PCC will reanalyze the input. Macros can thus create new tokens by clever use of comments. For example, the macro definition ``#define foo(a,b) a/**/b'' creates a macro _f_o_o which concatenates its two arguments, forming a new token.

o  Macro bodies are analyzed for macro arguments without regard to the boundaries of string or character constants. The definition ``#define bar(a) "a\n"'' creates a macro which returns the literal form of its argument embedded in a string with a newline appended.

These extensions are not portable to a number of other C preprocessors. They may be replaced in the future by corresponding ANSI C features, when the ANSI C standard has been formalized.

_S_u_m_m_a_r_y _a_n_d _C_o_n_c_l_u_s_i_o_n

The portable compiler has been a useful tool for providing C capability on a large number of diverse machines, and for testing a number of theoretical constructs in a practical setting. It has many blemishes, both in style and functionality. It has been applied to many more machines than first anticipated, of a much wider range than originally dreamed of.
Its use has also spread much faster than expected, leaving parts of the compiler still somewhat raw in shape.

On the theoretical side, there is some hope that the skeleton of the _s_u_c_o_m_p routine could be generated for many machines directly from the templates; this would give a considerable boost to the portability and correctness of the compiler, but might affect tunability and code quality. There is also room for more optimization, both within _o_p_t_i_m and in the form of a portable ``peephole'' optimizer.

On the practical, development side, the compiler could probably be sped up and made smaller without doing too much violence to its basic structure. Parts of the compiler deserve to be rewritten; the initialization code, register allocation, and parser are prime candidates. Doing some or all of the parsing with a recursive descent parser might save enough space and time to be worthwhile; it would certainly ease the problem of moving the compiler to an environment where _Y_a_c_c is not already present.

_A_c_k_n_o_w_l_e_d_g_e_m_e_n_t_s

I would like to thank the many people who have sympathetically, and even enthusiastically, helped me grapple with what has been a frustrating program to write, test, and install. D. M. Ritchie and E. N. Pinson provided needed early encouragement and philosophical guidance; M. E. Lesk, R. Muha, T. G. Peterson, G. Riddle, L. Rosler, R. W. Mitze, B. R. Rowland, S. I. Feldman, and T. B. London have all contributed ideas, gripes, and all, at one time or another, climbed ``into the pits'' with me to help debug. Without their help this effort would not have been possible; with it, it was often kind of fun.

S. C. Johnson

Many people have contributed fixes and improvements to the current Berkeley version of the compiler.
A number of really valuable fixes were contributed by Ralph Campbell, Sam Leffler, Kirk McKusick, Arthur Olsen, Donn Seeley, Don Speck and Chris Torek, but most of the bugs were spotted by the legions of virtuous C programmers who were kind enough to let us know that the compiler was broken and when the heck were we going to get it fixed? Thank you all.

Donn Seeley