A Tour Through the Portable C Compiler S. C. Johnson _I_n_t_r_o_d_u_c_t_i_o_n A C compiler has been implemented that has proved to be quite portable, serving as the basis for C compilers on roughly a dozen machines, including the Honeywell 6000, IBM 370, and Interdata 8/32. The compiler is highly compatible with the C language standard. Kernighan Ritchie Prentice 1978 Among the goals of this compiler are portability, high reliability, and the use of state-of-the-art techniques and tools wherever practical. Although the efficiency of the compiling process is not a primary goal, the compiler is efficient enough, and produces good enough code, to serve as a production compiler. The language implemented is highly compatible with the current PDP-11 version of C. Moreover, roughly 75% of the compiler, including nearly all the syntactic and semantic routines, is machine independent. The compiler also serves as the major portion of the program _l_i_n_t, described else- where. Johnson Lint Program Checker A number of earlier attempts to make portable compilers are worth noting. While on CO-OP assignment to Bell Labs in 1973, Alan Snyder wrote a portable C compiler which was the basis of his Master's Thesis at M.I.T. Snyder Portable C Compiler This compiler was very slow and complicated, and contained a number of rather serious implementation diffi- culties; nevertheless, a number of Snyder's ideas appear in this work. Most earlier portable compilers, including Snyder's, have proceeded by defining an intermediate language, perhaps based on three-address code or code for a stack machine, and writing a machine independent program to translate from the source code to this intermediate code. The intermediate code is then read by a second pass, and interpreted or com- piled. This approach is elegant, and has a number of advan- tages, especially if the target machine is far removed from the host. It suffers from some disadvantages as well. Some August 3, 1987 - 2 - constructions, like initialization and subroutine prologs, are difficult or expensive to express in a machine indepen- dent way that still allows them to be easily adapted to the target assemblers. Most of these approaches require a sym- bol table to be constructed in the second (machine depen- dent) pass, and/or require powerful target assemblers. Also, many conversion operators may be generated that have no effect on a given machine, but may be needed on others (for example, pointer to pointer conversions usually do nothing in C, but must be generated because there are some machines where they are significant). For these reasons, the first pass of the portable com- piler is not entirely machine independent. It contains some machine dependent features, such as initialization, subrou- tine prolog and epilog, certain storage allocation func- tions, code for the _s_w_i_t_c_h statement, and code to throw out unneeded conversion operators. As a crude measure of the degree of portability actu- ally achieved, the Interdata 8/32 C compiler has roughly 600 machine dependent lines of source out of 4600 in Pass 1, and 1000 out of 3400 in Pass 2. In total, 1600 out of 8000, or 20%, of the total source is machine dependent (12% in Pass 1, 30% in Pass 2). These percentages can be expected to rise slightly as the compiler is tuned. The percentage of machine-dependent code for the IBM is 22%, for the Honeywell 25%. If the assembler format and structure were the same for all these machines, perhaps another 5-10% of the code would become machine independent. These figures are sufficiently misleading as to be almost meaningless. A large fraction of the machine depen- dent code can be converted in a straightforward, almost mechanical way. On the other hand, a certain amount of the code requres hard intellectual effort to convert, since the algorithms embodied in this part of the code are typically complicated and machine dependent. To summarize, however, if you need a C compiler written for a machine with a reasonable architecture, the compiler is already three quarters finished! _O_v_e_r_v_i_e_w This paper discusses the structure and organization of the portable compiler. The intent is to give the big pic- ture, rather than discussing the details of a particular machine implementation. After a brief overview and a dis- cussion of the source file structure, the paper describes the major data structures, and then delves more closely into the two passes. Some of the theoretical work on which the compiler is based, and its application to the compiler, is discussed elsewhere. johnson portable theory practice One August 3, 1987 - 3 - of the major design issues in any C compiler, the design of the calling sequence and stack frame, is the subject of a separate memorandum. johnson lesk ritchie calling sequence The compiler consists of two passes, _p_a_s_s_1 and _p_a_s_s_2, that together turn C source code into assembler code for the target machine. The two passes are preceded by a preproces- sor, that handles the #_d_e_f_i_n_e and #_i_n_c_l_u_d_e statements, and related features (e.g., #_i_f_d_e_f, etc.). It is a nearly machine independent program, and will not be further dis- cussed here. The output of the preprocessor is a text file that is read as the standard input of the first pass. This produces as standard output another text file that becomes the stan- dard input of the second pass. The second pass produces, as standard output, the desired assembler language source code. The preprocessor and the two passes all write error messages on the standard error file. Thus the compiler itself makes few demands on the I/O library support, aiding in the bootstrapping process. Although the compiler is divided into two passes, this represents historical accident more than deep necessity. In fact, the compiler can optionally be loaded so that both passes operate in the same program. This ``one pass'' operation eliminates the overhead of reading and writing the intermediate file, so the compiler operates about 30% faster in this mode. It also occupies about 30% more space than the larger of the two component passes. Because the compiler is fundamentally structured as two passes, even when loaded as one, this document primarily describes the two pass version. The first pass does the lexical analysis, parsing, and symbol table maintenance. It also constructs parse trees for expressions, and keeps track of the types of the nodes in these trees. Additional code is devoted to initializa- tion. Machine dependent portions of the first pass serve to generate subroutine prologs and epilogs, code for switches, and code for branches, label definitions, alignment opera- tions, changes of location counter, etc. The intermediate file is a text file organized into lines. Lines beginning with a right parenthesis are copied by the second pass directly to its output file, with the parenthesis stripped off. Thus, when the first pass pro- duces assembly code, such as subroutine prologs, etc., each line is prefaced with a right parenthesis; the second pass passes these lines to through to the assembler. The major job done by the second pass is generation of code for expressions. The expression parse trees produced August 3, 1987 - 4 - in the first pass are written onto the intermediate file in Polish Prefix form: first, there is a line beginning with a period, followed by the source file line number and name on which the expression appeared (for debugging purposes). The successive lines represent the nodes of the p