9 Bell Laboratories Murray Hill, New Jersey 07974 9 Computing Science Technical Report No. 69 Some Applications of Inverted Indexes on the UNIX System M. E. Lesk June 21, 1978 Some Applications of Inverted Indexes on the UNIX System M. E. Lesk Bell Laboratories Murray Hill, New Jersey 07974 _A_B_S_T_R_A_C_T I. Some Applications of Inverted Indexes - Over- view This memorandum describes a set of programs which make inverted indexes to UNIX* text files, and their application to retrieving and formatting citations for documents prepared using _t_r_o_f_f. These indexing and searching programs make keyword indexes to volumes of material too large for linear searching. Searches for combinations of single words can be performed quickly. The programs are divided into two phases. The first makes an index from the original data; the second searches the index and retrieves items. Both of these phases are further divided into two parts to separate the data-dependent and algorithm depen- dent code. The major current application of these pro- grams is the _t_r_o_f_f preprocessor _r_e_f_e_r. A list of 4300 references is maintained on line, containing primarily papers written and cited by local authors. Whenever one of these references is required in a paper, a few words from the title or author list will retrieve it, and the user need not bother to re-enter the exact citation. Alter- natively, authors can use their own lists of papers. This memorandum is of interest to those who are interested in facilities for searching large but relatively unchanging text files on the UNIX system, and those who are interested in handling bibliographic citations with UNIX _t_r_o_f_f. II. Updating Publication Lists - 2 - This section is a brief note describing the auxiliary programs for managing the updating pro- cessing. It is written to aid clerical users in maintaining lists of references. Primarily, the programs described permit a large amount of indi- vidual control over the content of publication lists while retaining the usefulness of the files to other users. III. Manual Pages This section contains the pages from the UNIX programmer's manual for the _l_o_o_k_a_l_l, _p_u_b_i_n_d_e_x, and _r_e_f_e_r commands. It is useful for reference. ______________________________ * UNIX is a Trademark of Bell Laboratories. June 21, 1978 Some Applications of Inverted Indexes on the UNIX System M. E. Lesk Bell Laboratories Murray Hill, New Jersey 07974 _1. _I_n_t_r_o_d_u_c_t_i_o_n. The UNIX* system has many utilities (e.g. _g_r_e_p, _a_w_k, _l_e_x, _e_g_r_e_p, _f_g_r_e_p, ...) to search through files of text, but most of them are based on a linear scan through the entire file, using some deterministic automaton. This memorandum discusses a program which uses inverted indexes [ %A D. Knuth %T The Art of Computer Programming: Vol. 3, Sorting and Searching %I Addison-Wesley %C Reading, Mass. %D 1977 %O See section 6.5. ] and can thus be used on much larger data bases. As with any indexing system, of course, there are some disadvantages; once an index is made, the files that have been indexed can not be changed without remaking the index. Thus applications are restricted to those making many searches of relatively stable data. Furthermore, these pro- grams depend on hashing, and can only search for exact matches of whole keywords. It is not possible to look for arithmetic or logical expressions (e.g. ``date greater than 1970'') or for regular expression searching such as that in _l_e_x. [ lex lesk cstr ] Currently there are two uses of this software, the _r_e_f_e_r preprocessor to format references, and the _l_o_o_k_a_l_l command to search through all text files on the UNIX system. The remaining sections of this memorandum discuss the searching programs and their uses. Section 2 explains the operation of the searching algorithm and describes the data collected for use with the _l_o_o_k_a_l_l command. The more impor- tant application, _r_e_f_e_r has a user's description in section 3. Section 4 goes into more detail on reference files for the benefit of those who wish to add references to data bases or write new _t_r_o_f_f macros for use with _r_e_f_e_r. The options to make _r_e_f_e_r collect identical citations, or other- wise relocate and adjust references, are described in __________________________ * UNIX is a Trademark of Bell Laboratories. - 2 - section 5. The UNIX manual sections for _r_e_f_e_r, _l_o_o_k_a_l_l, and associated commands are attached as appendices. _2. _S_e_a_r_c_h_i_n_g. The indexing and searching process is divided into two phases, each made of two parts. These are shown below. A. Construct the index. (1) Find keys -- turn the input files into a sequence of tags and keys, where each tag identifies a dis- tinct item in the input and the keys for each such item are the strings under which it is to be indexed. (2) Hash and sort -- prepare a set of inverted indexes from which, given a set of keys, the appropriate item tags can be found quickly. B. Retrieve an item in response to a query. (3) Search -- Given some keys, look through the files prepared by the hashing and sorting facility and derive the appropriate tags. (4) Deliver -- Given the tags, find the original items. This completes the searching process. The first phase, making the index, is presumably done rela- tively infrequently. It should, of course, be done whenever the data being indexed change. In contrast, the second phase, retrieving items, is presumably done often, and must be rapid. An effort is made to separate code which depends on the data being handled from code which depends on the searching procedure. The search algorithm is involved only in steps (2) and (3), while knowledge of the actual data files is needed only by steps (1) and (4). Thus it is easy to adapt to different data files or different search algorithms. To start with, it is necessary to have some way of selecting or generating keys from input files. For dealing with files that are basically English, we have a key-making program which automatically selects words and passes them to the hashing and sorting program (step 2). The format used has one line for each input item, arranged as follows: name:start,length (tab) key1 key2 key3 ... where _n_a_m_e is the file name, _s_t_a_r_t is the starting byte number, and _l_e_n_g_t_h is the number of bytes in the entry. - 3 - These lines are the only input used to make the index. The first field (the file name, byte position, and byte count) is the tag of the item and can be used to retrieve it quickly. Normally, an item is either a whole file or a sec- tion of a file delimited by blank lines. After the tab, the second field contains the keys. The keys, if selected by the automatic program, are any alphanumeric strings which are not among the 100 most frequent words in English and which are not entirely numeric (except for four-digit numbers beginning 19, which are accepted as dates). Keys are truncated to six characters and converted to lower case. Some selection is needed if the original items are very large. We normally just take the first _n keys, with _n less than 100 or so; this replaces any attempt at intelligent selection. One file in our system is a complete English dictionary; it would presumably be retrieved for all queries. To generate an inverted index to the list of record tags and keys, the keys are hashed and sorted to produce an index. What is wanted, ideally, is a series of lists show- ing the tags associated with each key. To condense this, what is actually produced is a list showing the tags associ- ated with each hash code, and thus with some set of keys. To speed up access and further save space, a set of three or possibly four files is produced. These files are: center; c c lI l. File Contents entry Pointers to posting file for each hash code posting Lists of tag pointers for each hash code tag Tags for each item key Keys for each item (optional) The posting file comprises the real data: it contains a sequence of lists of items posted under each hash code. To speed up searching, the entry file is an array of pointers into the posting file, one per potential hash code. Furth- ermore, the items in the lists in the posting file are not referred to by their complete tag, but just by an address in the tag file, which gives the complete tags. The key file is optional and contains a copy of the keys used in the indexing. The searching process starts with a query, containing several keys. The goal is to obtain all items which were indexed under these keys. The query keys are hashed, and the pointers in the entry file used to access the lists in the posting file. These lists are addresses in the tag file of documents posted under the hash codes derived from the query. The common items from all lists are determined; this must include the items indexed by every key, but may also contain some items which are false drops, since items refer- enced by the correct hash codes need not actually have con- tained the correct keys. Normally, if there are several keys in the query, there are not likely to be many false - 4 - drops in the final combined list even though each hash code is somewhat ambiguous. The actual tags are then obtained from the tag file, and to guard against the possibility that an item has false-dropped on some hash code in the query, the original items are normally obtained from the delivery program (4) and the query keys checked against them by string comparison. Usually, therefore, the check for bad drops is made against the original file. However, if the key derivation procedure is complex, it may be preferable to check against the keys fed to program (2). In this case the optional key file which contains the keys associated with each item is generated, and the item tag is supplemented by a string ;start,length which indicates the starting byte number in the key file and the length of the string of keys for each item. This file is not usually necessary with the present key-selection pro- gram, since the keys always appear in the original document. There is also an option (-C_n) for coordination level searching. This retrieves items which match all but _n of the query keys. The items are retrieved in the order of the number of keys that they match. Of course, _n must be less than the number of query keys (nothing is retrieved unless it matches at least one key). As an example, consider one set of 4377 references, comprising 660,000 bytes. This included 51,000 keys, of which 5,900 were distinct keys. The hash table is kept full to save space (at the expense of time); 995 of 997 possible hash codes were used. The total set of index files (no key file) included 171,000 bytes, about 26% of the original file size. It took 8 minutes of processor time to hash, sort, and write the index. To search for a single query with the resulting index took 1.9 seconds of processor time, while to find the same paper with a sequential linear search using _g_r_e_p (reading all of the tags and keys) took 12.3 seconds of processor time. We have also used this software to index all of the English stored on our UNIX system. This is the index searched by the _l_o_o_k_a_l_l command. On a typical day there were 29,000 files in our user file system, containing about 152,000,000 bytes. Of these 5,300 files, containing 32,000,000 bytes (about 21%) were English text. The total number of `words' (determined mechanically) was 5,100,000. Of these 227,000 were selected as keys; 19,000 were dis- tinct, hashing to 4,900 (of 5,000 possible) different hash codes. The resulting inverted file indexes used 845,000 bytes, or about 2.6% of the size of the original files. The particularly small indexes are caused by the fact that keys - 5 - are taken from only the first 50 non-common words of some very long input files. Even this large _l_o_o_k_a_l_l index can be searched quickly. For example, to find this document by looking for the keys ``lesk inverted indexes'' required 1.7 seconds of processor time and system time. By comparison, just to search the 800,000 byte dictionary (smaller than even the inverted indexes, let alone the 32,000,000 bytes of text files) with _g_r_e_p takes 29 seconds of processor time. The _l_o_o_k_a_l_l pro- gram is thus useful when looking for a document which you believe is stored on-line, but do not know where. For exam- ple, many memos from the Computing Science Research Center are in its UNIX file system, but it is often difficult to guess where a particular memo might be (it might have several authors, each with many directories, and have been worked on by a secretary with yet more directories). Instructions for the use of the _l_o_o_k_a_l_l command are given in the manual section, shown in the appendix to this memoran- dum. The only indexes maintained routinely are those of pub- lication lists and all English files. To make other indexes, the programs for making keys, sorting them, search- ing the indexes, and delivering answers must be used. Since they are usually invoked as parts of higher-level commands, they are not in the default command directory, but are available to any user in the directory /_u_s_r/_l_i_b/_r_e_f_e_r. Three programs are of interest: _m_k_e_y, which isolates keys from input files; _i_n_v, which makes an index from a set of keys; and _h_u_n_t, which searches the index and delivers the items. Note that the two parts of the retrieval phase are combined into one program, to avoid the excessive system work and delay which would result from running these as separate processes. These three commands have a large number of options to adapt to different kinds of input. The user not interested in the detailed description that now follows may skip to section 3, which describes the _r_e_f_e_r program, a packaged-up version of these tools specifically oriented towards format- ting references. _M_a_k_e _K_e_y_s. The program _m_k_e_y is the key-making program corresponding to step (1) in phase A. Normally, it reads its input from the file names given as arguments, and if there are no arguments it reads from the standard input. It assumes that blank lines in the input delimit separate items, for each of which a different line of keys should be generated. The lines of keys are written on the standard output. Keys are any alphanumeric string in the input not among the most frequent words in English and not entirely numeric (except that all-numeric strings are acceptable if they are between 1900 and 1999). In the output, keys are - 6 - translated to lower case, and truncated to six characters in length; any associated punctuation is removed. The follow- ing flag arguments are recognized by _m_k_e_y: center; lB lw(4i). -c _n_a_m_e _T{ _N_a_m_e _o_f _f_i_l_e _o_f _c_o_m_m_o_n _w_o_r_d_s; _d_e_f_a_u_l_t _i_s /_u_s_r/_l_i_b/_e_i_g_n. _T} -_f _n_a_m_e _T{ _R_e_a_d _a _l_i_s_t _o_f _f_i_l_e_s _f_r_o_m _n_a_m_e _a_n_d _t_a_k_e _e_a_c_h _a_s _a_n _i_n_p_u_t _a_r_g_u_m_e_n_t. _T} -_i _c_h_a_r_s _T{ _I_g_n_o_r_e _a_l_l _l_i_n_e_s _w_h_i_c_h _b_e_g_i_n _w_i_t_h `%' _f_o_l_- _l_o_w_e_d _b_y _a_n_y _c_h_a_r_a_c_t_e_r _i_n _c_h_a_r_s. _T} -_k_n _T{ _U_s_e _a_t _m_o_s_t _n _k_e_y_s _p_e_r _i_n_p_u_t _i_t_e_m. _T} -_l_n _T{ _I_g_n_o_r_e _i_t_e_m_s _s_h_o_r_t_e_r _t_h_a_n _n _l_e_t_t_e_r_s _l_o_n_g. _T} -_n_m _T{ _I_g_n_o_r_e _a_s _a _k_e_y _a_n_y _w_o_r_d _i_n _t_h_e _f_i_r_s_t _m _w_o_r_d_s _o_f _t_h_e _l_i_s_t _o_f _c_o_m_m_o_n _E_n_g_l_i_s_h _w_o_r_d_s. _T_h_e _d_e_f_a_u_l_t _i_s _1_0_0. _T} -_s _T{ _R_e_m_o_v_e _t_h_e _l_a_b_e_l_s (_f_i_l_e:_s_t_a_r_t,_l_e_n_g_t_h) _f_r_o_m _t_h_e _o_u_t_p_u_t; _j_u_s_t _g_i_v_e _t_h_e _k_e_y_s. _U_s_e_d _w_h_e_n _s_e_a_r_c_h_i_n_g _r_a_t_h_e_r _t_h_a_n _i_n_d_e_x_i_n_g. _T} -_w _T{ _E_a_c_h _w_h_o_l_e _f_i_l_e _i_s _a _s_e_p_a_r_a_t_e _i_t_e_m; _b_l_a_n_k _l_i_n_e_s _i_n _f_i_l_e_s _a_r_e _i_r_r_e_l_e_v_a_n_t. _T} The normal arguments for indexing references are the defaults, which are -_c /_u_s_r/_l_i_b/_e_i_g_n, -_n_1_0_0, and -_l_3. For searching, the -_s option is also needed. When the big _l_o_o_k_a_l_l index of all English files is run, the options are -_w, -_k_5_0, and -_f (_f_i_l_e_l_i_s_t). When running on textual input, the _m_k_e_y program processes about 1000 English words per pro- cessor second. Unless the -_k option is used (and the input files are long enough for it to take effect) the output of _m_k_e_y is comparable in size to its input. _H_a_s_h _a_n_d _i_n_v_e_r_t. The _i_n_v program computes the hash codes and writes the inverted files. It reads the output of _m_k_e_y and writes the set of files described earlier in this section. It expects one argument, which is used as the base name for the three (or four) files to be written. Assuming an argument of _I_n_d_e_x (the default) the entry file is named _I_n_d_e_x._i_a, the posting file _I_n_d_e_x._i_b, the tag file _I_n_d_e_x._i_c, and the key file (if present) _I_n_d_e_x._i_d. The _i_n_v program recognizes the following options: center; lB lw(4i). -a T{ Append the new keys to a previ- ous set of inverted files, making new files if there is no old set using the same base name. T} -d T{ Write the optional key file. This is needed when you can not check for false drops by looking for the keys in the original inputs, i.e. when the key derivation procedure is compli- cated and the output keys are not words from the input files. T} -h_n _T{ _T_h_e _h_a_s_h _t_a_b_l_e _s_i_z_e _i_s _n (_d_e_f_a_u_l_t _9_9_7); _n _s_h_o_u_l_d _b_e _p_r_i_m_e. _M_a_k_i_n_g _n bigger saves search time and spends disk space. T} -i[u] _n_a_m_e _T{ _T_a_k_e _i_n_p_u_t _f_r_o_m _f_i_l_e _n_a_m_e, _i_n_s_t_e_a_d _o_f _t_h_e _s_t_a_n_d_a_r_d _i_n_p_u_t; _i_f _u _i_s _p_r_e_s_e_n_t _n_a_m_e _i_s _u_n_l_i_n_k_e_d _w_h_e_n _t_h_e _s_o_r_t _i_s _s_t_a_r_t_e_d. _U_s_i_n_g _t_h_i_s _o_p_t_i_o_n _p_e_r_m_i_t_s _t_h_e _s_o_r_t _s_c_r_a_t_c_h _s_p_a_c_e _t_o _o_v_e_r_l_a_p _t_h_e _d_i_s_k _s_p_a_c_e _u_s_e_d _f_o_r _i_n_p_u_t _k_e_y_s. _T} -_n _T{ _M_a_k_e _a _c_o_m_p_l_e_t_e_l_y _n_e_w _s_e_t _o_f _i_n_v_e_r_t_e_d _f_i_l_e_s, _i_g_n_o_r_i_n_g _p_r_e_v_i_o_u_s _f_i_l_e_s. _T} -_p _T{ - 7 - _P_i_p_e _i_n_t_o _t_h_e _s_o_r_t _p_r_o_g_r_a_m, _r_a_t_h_e_r _t_h_a_n _w_r_i_t_i_n_g _a _t_e_m_p_o_r_a_r_y _i_n_p_u_t _f_i_l_e. _T_h_i_s _s_a_v_e_s _d_i_s_k _s_p_a_c_e _a_n_d _s_p_e_n_d_s _p_r_o_c_e_s_s_o_r _t_i_m_e. _T} -_v _T{ _V_e_r_b_o_s_e _m_o_d_e; _p_r_i_n_t _a _s_u_m_m_a_r_y _o_f _t_h_e _n_u_m_b_e_r _o_f _k_e_y_s _w_h_i_c_h _f_i_n_i_s_h_e_d _i_n_d_e_x_i_n_g. _T} About half the time used in _i_n_v is in the contained sort. Assuming the sort is roughly linear, however, a guess at the total timing for _i_n_v is 250 keys per second. The space used is usually of more importance: the entry file uses four bytes per possible hash (note the -_h option), and the tag file around 15-20 bytes per item indexed. Roughly, the posting file contains one item for each key instance and one item for each possible hash code; the items are two bytes long if the tag file is less than 65336 bytes long, and the items are four bytes wide if the tag file is greater than 65536 bytes long. To minimize storage, the hash tables should be over-full; for most of the files indexed in this way, there is no other real choice, since the _e_n_t_r_y file must fit in memory. _S_e_a_r_c_h_i_n_g _a_n_d _R_e_t_r_i_e_v_i_n_g. The _h_u_n_t program retrieves items from an index. It combines, as mentioned above, the two parts of phase (B): search and delivery. The reason why it is efficient to combine delivery and search is partly to avoid starting unnecessary processes, and partly because the delivery operation must be a part of the search operation in any case. Because of the hashing, the search part takes place in two stages: first items are retrieved which have the right hash codes associated with them, and then the actual items are inspected to determine false drops, i.e. to determine if anything with the right hash codes doesn't really have the right keys. Since the original item is retrieved to check on false drops, it is efficient to present it immediately, rather than only giving the tag as output and later retrieving the item again. If there were a separate key file, this argument would not apply, but separate key files are not common. Input to _h_u_n_t is taken from the standard input, one query per line. Each query should be in _m_k_e_y -_s output for- mat; all lower case, no punctuation. The _h_u_n_t program takes one argument which specifies the base name of the index files to be searched. Only one set of index files can be searched at a time, although many text files may be indexed as a group, of course. If one of the text files has been changed since the index, that file is searched with _f_g_r_e_p; this may occasionally slow down the searching, and care should be taken to avoid having many out of date files. The following option arguments are recognized by _h_u_n_t: center; lB lw(4i). -a T{ Give all output; ignore checking for false drops. T} -C_n _T{ _C_o_o_r_d_i_n_a_t_i_o_n _l_e_v_e_l _n; _r_e_t_r_i_e_v_e _i_t_e_m_s _w_i_t_h _n_o_t _m_o_r_e _t_h_a_n _n _t_e_r_m_s _o_f _t_h_e _i_n_p_u_t _m_i_s_s_i_n_g; - 8 - _d_e_f_a_u_l_t _C_0, _i_m_p_l_y_i_n_g _t_h_a_t _e_a_c_h _s_e_a_r_c_h _t_e_r_m _m_u_s_t _b_e _i_n _t_h_e _o_u_t_p_u_t _i_t_e_m_s. _T} -_F[_y_n_d] T{ ``-Fy'' gives the text of all the items found; ``-Fn'' suppresses them. ``-F_d'' where _d is an integer gives the text of the first _d items. The default is -_F_y. T} -g T{ Do not use _f_g_r_e_p to search files changed since the index was made; print an error comment instead. T} -i _s_t_r_i_n_g _T{ _T_a_k_e _s_t_r_i_n_g _a_s _i_n_p_u_t, _i_n_s_t_e_a_d _o_f _r_e_a_d_i_n_g _t_h_e _s_t_a_n_d_a_r_d _i_n_p_u_t. _T} -_l _n _T{ _T_h_e _m_a_x_i_m_u_m _l_e_n_g_t_h _o_f _i_n_t_e_r_n_a_l _l_i_s_t_s _o_f _c_a_n_d_i_d_a_t_e _i_t_e_m_s _i_s _n; _d_e_f_a_u_l_t _1_0_0_0. _T} -_o _s_t_r_i_n_g _T{ _P_u_t _t_e_x_t _o_u_t_p_u_t (``-_F_y'') _i_n _s_t_r_i_n_g; _o_f _u_s_e _o_n_l_y _w_h_e_n _i_n_v_o_k_e_d _f_r_o_m _a_n_o_t_h_e_r _p_r_o_g_r_a_m. _T} -_p _T{ _P_r_i_n_t _h_a_s_h _c_o_d_e _f_r_e_q_u_e_n_c_i_e_s; _m_o_s_t_l_y _f_o_r _u_s_e _i_n _o_p_t_i_m_i_z_i_n_g _h_a_s_h _t_a_b_l_e _s_i_z_e_s. _T} -_T[_y_n_d] T{ ``-Ty'' gives the tags of the items found; ``-Tn'' suppresses them. ``-T_d'' where _d is an integer gives the first _d tags. The default is -_T_n. T} -t _s_t_r_i_n_g _T{ _P_u_t _t_a_g _o_u_t_p_u_t (``-_T_y'') _i_n _s_t_r_i_n_g; _o_f _u_s_e _o_n_l_y _w_h_e_n _i_n_v_o_k_e_d _f_r_o_m _a_n_o_t_h_e_r _p_r_o_g_r_a_m. _T} The timing of _h_u_n_t is complex. Normally the hash table is overfull, so that there will be many false drops on any single term; but a multi-term query will have few false drops on all terms. Thus if a query is underspecified (one search term) many potential items will be examined and dis- carded as false drops, wasting time. If the query is over- specified (a dozen search terms) many keys will be examined only to verify that the single item under consideration has that key posted. The variation of search time with number of keys is shown in the table below. Queries of varying length were constructed to retrieve a particular document from the file of references. In the sequence to the left, search terms were chosen so as to select the desired paper as quickly as possible. In the sequence on the right, terms were chosen inefficiently, so that the query did not uniquely select the desired document until four keys had been used. The same document was the target in each case, and the final set of eight keys are also identical; the differences at five, six and seven keys are produced by measurement error, not by the slightly different key lists. center; c s s s5 | c s s s cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 cp8 cp8 cp8 cp8 | cp8 cp8 cp8 cp8 n n n n | n n n n . Efficient Keys Inefficient Keys No. keys Total drops Retrieved Search time No. keys Total drops RetrievedSearch time (incl. false) Documents (seconds) (incl. false) Documents (seconds) 1 15 3 1.27 1 68 55 5.96 2 1 1 0.11 2 29 29 2.72 3 1 1 0.14 3 8 8 0.95 4 1 1 0.17 4 1 1 0.18 5 1 1 0.19 5 1 1 0.21 6 1 1 0.23 6 1 1 0.22 7 1 1 0.27 7 1 1 0.26 - 9 - 8 1 1 0.29 8 1 1 0.29 As would be expected, the optimal search is achieved when the query just specifies the answer; however, overspecifica- tion is quite cheap. Roughly, the time required by _h_u_n_t can be approximated as 30 milliseconds per search key plus 75 milliseconds per dropped document (whether it is a false drop or a real answer). In general, overspecification can be recommended; it protects the user against additions to the data base which turn previously uniquely-answered queries into ambiguous queries. The careful reader will have noted an enormous discrepancy between these times and the earlier quoted time of around 1.9 seconds for a search. The times here are purely for the search and retrieval: they are measured by running many searches through a single invocation of the _h_u_n_t program alone. Usually, the UNIX command processor (the shell) must start both the _m_k_e_y and _h_u_n_t processes for each query, and arrange for the output of _m_k_e_y to be fed to the _h_u_n_t program. This adds a fixed overhead of about 1.7 seconds of processor time to any single search. Further- more, remember that all these times are processor times: on a typical morning on our PDP 11/70 system, with about one dozen people logged on, to obtain 1 second of processor time for the search program took between 2 and 12 seconds of real time, with a median of 3.9 seconds and a mean of 4.8 seconds. Thus, although the work involved in a single search may be only 200 milliseconds, after you add the 1.7 seconds of startup processor time and then assume a 4:1 elapsed/processor time ratio, it will be 8 seconds before any response is printed. _3. _S_e_l_e_c_t_i_n_g _a_n_d _F_o_r_m_a_t_t_i_n_g _R_e_f_e_r_e_n_c_e_s _f_o_r _T_R_O_F_F The major application of the retrieval software is _r_e_f_e_r, which is a _t_r_o_f_f preprocessor like _e_q_n. [ kernighan cherry acm 1975 ] It scans its input looking for items of the form .[ imprecise citation .] where an imprecise citation is merely a string of words found in the relevant bibliographic citation. This is translated into a properly formatted reference. If the imprecise citation does not correctly identify a single paper (either selecting no papers or too many) a message is given. The data base of citations searched may be tailored to each system, and individual users may specify their own citation files. On our system, the default data base is accumulated from the publication lists of the members of our organization, plus about half a dozen personal - 10 - bibliographies that were collected. The present total is about 4300 citations, but this increases steadily. Even now, the data base covers a large fraction of local cita- tions. For example, the reference for the _e_q_n paper above was specified as ... preprocessor like .I eqn. .[ kernighan cherry acm 1975 .] It scans its input looking for items ... This paper was itself printed using _r_e_f_e_r. The above input text was processed by _r_e_f_e_r as well as _t_b_l and _t_r_o_f_f by the command _r_e_f_e_r _m_e_m_o-_f_i_l_e | _t_b_l | _t_r_o_f_f -_m_s and the reference was automatically translated into a correct citation to the ACM paper on mathematical typeset- ting. The procedure to use to place a reference in a paper using _r_e_f_e_r is as follows. First, use the _l_o_o_k_b_i_b command to check that the paper is in the data base and to find out what keys are necessary to retrieve it. This is done by typing _l_o_o_k_b_i_b and then typing some potential queries until a suitable query is found. For example, had one started to find the _e_q_n paper shown above by presenting the query $ lookbib kernighan cherry (EOT) _l_o_o_k_b_i_b would have found several items; experimentation would quickly have shown that the query given above is ade- quate. Overspecifying the query is of course harmless; it is even desirable, since it decreases the risk that a docu- ment added to the publication data base in the future will be retrieved in addition to the intended document. The extra time taken by even a grossly overspecified query is quite small. A particularly careful reader may have noticed that ``acm'' does not appear in the printed citation; we have supplemented some of the data base items with extra keywords, such as common abbreviations for journals or other sources, to aid in searching. If the reference is in the data base, the query that retrieved it can be inserted in the text, between .[ and .] - 11 - brackets. If it is not in the data base, it can be typed into a private file of references, using the format dis- cussed in the next section, and then the -_p option used to search this private file. Such a command might read (if the private references are called _m_y_f_i_l_e) _r_e_f_e_r -_p _m_y_f_i_l_e _d_o_c_u_m_e_n_t | _t_b_l | _e_q_n | _t_r_o_f_f -_m_s . . . where _t_b_l and/or _e_q_n could be omitted if not needed. The use of the -_m_s macros [ lesk typing documents unix gcos ] or some other macro package, however, is essential. _R_e_f_e_r only generates the data for the references; exact formatting is done by some macro package, and if none is supplied the references will not be printed. By default, the references are numbered sequentially, and the -_m_s macros format references as footnotes at the bottom of the page. This memorandum is an example of that style. Other possibilities are discussed in section 5 below. _4. _R_e_f_e_r_e_n_c_e _F_i_l_e_s. A reference file is a set of bibliographic references usable with _r_e_f_e_r. It can be indexed using the software described in section 2 for fast searching. What _r_e_f_e_r does is to read the input document stream, looking for imprecise citation references. It then searches through reference files to find the full citations, and inserts them into the document. The format of the full citation is arranged to make it convenient for a macro package, such as the -_m_s mac- ros, to format the reference for printing. Since the format of the final reference is determined by the desired style of output, which is determined by the macros used, _r_e_f_e_r avoids forcing any kind of reference appearance. All it does is define a set of string registers which contain the basic information about the reference; and provide a macro call which is expanded by the macro package to format the refer- ence. It is the responsibility of the final macro package to see that the reference is actually printed; if no macros are used, and the output of _r_e_f_e_r fed untranslated to _t_r_o_f_f, nothing at all will be printed. The strings defined by _r_e_f_e_r are taken directly from the files of references, which are in the following format. The references should be separated by blank lines. Each reference is a sequence of lines beginning with % and fol- lowed by a key-letter. The remainder of that line, and suc- cessive lines until the next line beginning with %, contain the information specified by the key-letter. In general, _r_e_f_e_r does not interpret the information, but merely presents it to the macro package for final formatting. A user with a separate macro package, for example, can add new key-letters or use the existing ones for other purposes - 12 - without bothering _r_e_f_e_r. The meaning of the key-letters given below, in particu- lar, is that assigned by the -_m_s macros. Not all informa- tion, obviously, is used with each citation. For example, if a document is both an internal memorandum and a journal article, the macros ignore the memorandum version and cite only the journal article. Some kinds of information are not used at all in printing the reference; if a user does not like finding references by specifying title or author key- words, and prefers to add specific keywords to the citation, a field is available which is searched but not printed (K). The key letters currently recognized by _r_e_f_e_r and -_m_s, with the kind of information implied, are: center; c c6 c c c l c l. Key Information specified Key Information specified A Author's name N Issue number B Title of book containing item O Other information C City of publication P Page(s) of article D Date R Technical report reference E Editor of book containing item T Title G Government (NTIS) ordering number V Volume number I Issuer (publish- er) J Journal name K Keys (for searching) X or L Label Y or M Memorandum label Z Information not used by _r_e_f_e_r For example, a sample reference could be typed as: %T Bounds on the Complexity of the Maximal Common Subsequence Problem %Z ctr127 %A A. V. Aho %A D. S. Hirschberg %A J. D. Ullman %J J. ACM %V 23 %N 1 %P 1-12 %M abcd-78 %D Jan. 1976 Order is irrelevant, except that authors are shown in the order given. The output of _r_e_f_e_r is a stream of string definitions, one for each of the fields of each reference, as shown below. - 13 - .]- .ds [A authors' names ... .ds [T title ... .ds [J journal ... ... .][ type-number The _r_e_f_e_r program, in general, does not concern itself with the significance of the strings. The different fields are treated identically by _r_e_f_e_r, except that the X, Y and Z fields are ignored (see the -_i option of _m_k_e_y) in indexing and searching. All _r_e_f_e_r does is select the appropriate citation, based on the keys. The macro package must arrange the strings so as to produce an appropriately formatted citation. In this process, it uses the convention that the `T' field is the title, the `J' field the journal, and so forth. The _r_e_f_e_r program does arrange the citation to simplify the macro package's job, however. The special macro .]- precedes the string definitions and the special macro .][ follows. These are changed from the input .[ and .] so that running the same file through _r_e_f_e_r again is harmless. The .]- macro can be used by the macro package to initialize. The .][ macro, which should be used to print the reference, is given an argument _t_y_p_e-_n_u_m_b_e_r to indicate the kind of reference, as follows: center; c c n l. Value Kind of reference 1 Journal article 2 Book 3 Article within book 4 Technical report 5 Bell Labs technical memorandum 0 Other The type is determined by the presence or absence of partic- ular fields in the citation (a journal article must have a `J' field, a book must have an `I' field, and so forth). To a small extent, this violates the above rule that _r_e_f_e_r does not concern itself with the contents of the citation; how- ever, the classification of the citation in _t_r_o_f_f macros would require a relatively expensive and obscure program. Any macro writer may, of course, preserve consistency by ignoring the argument to the .][ macro. The reference is flagged in the text with the sequence \*([.number\*(.] where _n_u_m_b_e_r is the footnote number. The strings [. and .] should be used by the macro package to format the reference flag in the text. These strings can be replaced for a par- ticular footnote, as described in section 5. The footnote number (or other signal) is available to the reference macro .][ as the string register [_F. To simplify dealing with a - 14 - text reference that occurs at the end of a sentence, _r_e_f_e_r treats a reference which follows a period in a special way. The period is removed, and the reference is preceded by a call for the string <. and followed by a call for the string >. For example, if a reference follows ``end.'' it will appear as end\*(<.\*([.number\*(.]\*(>. where _n_u_m_b_e_r is the footnote number. The macro package should turn either the string >. or <. into a period and delete the other one. This permits the output to have either the form ``end[31].'' or ``end.8319'' as the macro package wishes. Note that in one case the period precedes the number and in the other it follows the number. In some cases users wish to suspend the searching, and merely use the reference macro formatting. That is, the user doesn't want to provide a search key between .[ and .] brackets, but merely the reference lines for the appropriate document. Alternatively, the user can wish to add a few fields to those in the reference as in the standard file, or override some fields. Altering or replacing fields, or sup- plying whole references, is easily done by inserting lines beginning with %; any such line is taken as direct input to the reference processor rather than keys to be searched. Thus .[ key1 key2 key3 ... %Q New format item %R Override report name .] makes the indicates changes to the result of searching for the keys. All of the search keys must be given before the first % line. If no search keys are provided, an entire citation can be provided in-line in the text. For example, if the _e_q_n paper citation were to be inserted in this way, rather than by searching for it in the data base, the input would read - 15 - ... preprocessor like .I eqn. .[ %A B. W. Kernighan %A L. L. Cherry %T A System for Typesetting Mathematics %J Comm. ACM %V 18 %N 3 %P 151-157 %D March 1975 .] It scans its input looking for items ... This would produce a citation of the same appearance as that resulting from the file search. As shown, fields are normally turned into _t_r_o_f_f strings. Sometimes users would rather have them defined as macros, so that other _t_r_o_f_f commands can be placed into the data. When this is necessary, simply double the control character % in the data. Thus the input .[ %V 23 %%M Bell Laboratories, Murray Hill, N.J. 07974 .] is processed by _r_e_f_e_r into .ds [V 23 .de [M Bell Laboratories, Murray Hill, N.J. 07974 .. The information after %%_M is defined as a macro to be invoked by .[_M while the information after %_V is turned into a string to be invoked by _\*([_V. At present -_m_s expects all information as strings. _5. _C_o_l_l_e_c_t_i_n_g _R_e_f_e_r_e_n_c_e_s _a_n_d _o_t_h_e_r _R_e_f_e_r _O_p_t_i_o_n_s Normally, the combination of _r_e_f_e_r and -_m_s formats out- put as _t_r_o_f_f footnotes which are consecutively numbered and placed at the bottom of the page. However, options exist to place the references at the end; to arrange references alphabetically by senior author; and to indicate references by strings in the text of the form [Name1975a] rather than - 16 - by number. Whenever references are not placed at the bottom of a page identical references are coalesced. For example, the -_e option to _r_e_f_e_r specifies that references are to be collected; in this case they are output whenever the sequence .[ $LIST$ .] is encountered. Thus, to place references at the end of a paper, the user would run _r_e_f_e_r with the -_e option and place the above $LIST$ commands after the last line of the text. _R_e_f_e_r will then move all the references to that point. To aid in formatting the collected references, _r_e_f_e_r writes the references preceded by the line .]< and followed by the line .]> to invoke special macros before and after the references. Another possible option to _r_e_f_e_r is the -_s option to specify sorting of references. The default, of course, is to list references in the order presented. The -_s option implies the -_e option, and thus requires a .[ $LIST$ .] entry to call out the reference list. The -_s option may be followed by a string of letters, numbers, and `+' signs indicating how the references are to be sorted. The sort is done using the fields whose key-letters are in the string as sorting keys; the numbers indicate how many of the fields are to be considered, with `+' taken as a large number. Thus the default is -_s_A_D meaning ``Sort on senior author, then date.'' To sort on all authors and then title, specify -_s_A+_T. And to sort on two authors and then the journal, write -_s_A_2_J. Other options to _r_e_f_e_r change the signal or label inserted in the text for each reference. Normally these are just sequential numbers, and their exact placement (within brackets, as superscripts, etc.) is determined by the macro package. The -_l option replaces reference numbers by strings composed of the senior author's last name, the date, and a disambiguating letter. If a number follows the _l as in -_l_3 only that many letters of the last name are used in - 17 - the label string. To abbreviate the date as well the form -l_m,_n shortens the last name to the first _m letters and the date to the last _n digits. For example, the option -_l_3,_2 would refer to the _e_q_n paper (reference 3) by the signal _K_e_r_7_5_a, since it is the first cited reference by Kernighan in 1975. A user wishing to specify particular labels for a private bibliography may use the -_k option. Specifying -k_x causes the field _x to be used as a label. The default is L. If this field ends in -, that character is replaced by a sequence letter; otherwise the field is used exactly as given. If none of the _r_e_f_e_r-produced signals are desired, the -_b option entirely suppresses automatic text signals. If the user wishes to override the -_m_s treatment of the reference signal (which is normally to enclose the number in brackets in _n_r_o_f_f and make it a superscript in _t_r_o_f_f) this can be done easily. If the lines .[ or .] contain anything following these characters, the remainders of these lines are used to surround the reference signal, instead of the default. Thus, for example, to say ``See reference (2).'' and avoid ``See reference.829'' the input might appear See reference .[ ( imprecise citation ... .]). Note that blanks are significant in this construction. If a permanent change is desired in the style of reference sig- nals, however, it is probably easier to redefine the strings [. and .] (which are used to bracket each signal) than to change each citation. Although normally _r_e_f_e_r limits itself to retrieving the data for the reference, and leaves to a macro package the job of arranging that data as required by the local format, there are two special options for rearrangements that can not be done by macro packages. The -_c option puts fields into all upper case (CAPS-SMALL CAPS in _t_r_o_f_f output). The key-letters indicated what information is to be translated to upper case follow the _c, so that -_c_A_J means that authors' names and journals are to be in caps. The -_a option writes the names of authors last name first, that is _A. _D. _H_a_l_l, _J_r. is written as _H_a_l_l, _A. _D. _J_r. The citation form of the _J_o_u_r_n_a_l _o_f _t_h_e _A_C_M, for example, would require both -_c_A and -_a options. This produces authors' names in the style _K_E_R_- _N_I_G_H_A_N, _B. _W. _A_N_D _C_H_E_R_R_Y, _L. _L. for the previous example. The -_a option may be followed by a number to indicate how many author names should be reversed; -_a_1 (without any -_c option) would produce _K_e_r_n_i_g_h_a_n, _B. _W. _a_n_d _L. _L. _C_h_e_r_r_y, for - 18 - example. Finally, there is also the previously-mentioned -_p option to let the user specify a private file of references to be searched before the public files. Note that _r_e_f_e_r does not insist on a previously made index for these files. If a file is named which contains reference data but is not indexed, it will be searched (more slowly) by _r_e_f_e_r using _f_g_r_e_p. In this way it is easy for users to keep small files of new references, which can later be added to the public data bases.