9                     Bell Laboratories
               Murray Hill, New Jersey 07974


9         Computing Science Technical Report No. 69

     Some Applications of Inverted Indexes on the UNIX
                           System

                         M. E. Lesk


June 21, 1978



                          _A_B_S_T_R_A_C_T


     I. Some Applications of Inverted Indexes - Overview

          This memorandum describes a set  of  programs
     which  make  inverted indexes to UNIX* text files,
     and their application to retrieving and formatting
     citations for documents prepared using troff.

          These indexing and  searching  programs  make
     keyword  indexes  to volumes of material too large
     for linear searching.  Searches  for  combinations
     of  single  words  can  be performed quickly.  The
     programs are divided into two phases.   The  first
     makes  an index from the original data; the second
     searches the index and retrieves items.   Both  of
     these phases are further divided into two parts to
     separate the data-dependent and  algorithm  depen-
     dent code.

          The major current application of these programs
     is the troff preprocessor refer. A list of
     4300 references is maintained on line,  containing
     primarily   papers  written  and  cited  by  local
     authors.  Whenever  one  of  these  references  is
     required in a paper, a few words from the title or
     author list will retrieve it, and  the  user  need
     not bother to re-enter the exact citation.  Alter-
     natively, authors  can  use  their  own  lists  of
     papers.

          This memorandum is intended for those who need
     facilities for searching large but relatively
     unchanging text files on the UNIX system, and for
     those who handle bibliographic citations with UNIX
     troff.

     II. Updating Publication Lists


          This section is a brief note  describing  the
     auxiliary  programs for managing the updating pro-
     cessing.  It is written to aid clerical  users  in
     maintaining  lists  of references.  Primarily, the
     programs described permit a large amount of  indi-
     vidual  control  over  the  content of publication
     lists while retaining the usefulness of the  files
     to other users.

     III. Manual Pages

          This section contains the pages from the UNIX
     programmer's manual for the lookall, pubindex, and
     refer commands. It is useful for reference.

     ______________________________
     * UNIX is a Trademark of Bell Laboratories.




_1.  _I_n_t_r_o_d_u_c_t_i_o_n.

     The UNIX* system has many utilities (e.g. grep, awk,
lex, egrep, fgrep, ...) to search through files of text, but
most of them are based on a linear scan through the entire
file, using some deterministic automaton. This memorandum
discusses a program which uses inverted indexes [D. Knuth,
The Art of Computer Programming: Vol. 3, Sorting and
Searching, Addison-Wesley, Reading, Mass., 1977; see section
6.5] and can thus be used on much larger data bases.

     As with any indexing system, of course, there are  some
disadvantages;  once  an  index is made, the files that have
been indexed can not be changed without remaking the  index.
Thus  applications  are  restricted  to  those  making  many
searches of relatively stable data.  Furthermore, these pro-
grams  depend  on  hashing,  and  can  only search for exact
matches of whole keywords.  It is not possible to  look  for
arithmetic  or logical expressions (e.g. ``date greater than
1970'') or for regular expression searching such as that  in
_l_e_x.  [ lex lesk cstr ]

     Currently there are two uses of this software: the
refer preprocessor to format references, and the lookall
command to search through all text files on the UNIX system.

     The remaining sections of this memorandum  discuss  the
searching  programs  and their uses.  Section 2 explains the
operation of the searching algorithm and describes the  data
collected for use with the lookall command. The more impor-
tant application, refer, has a user's description in section
3. Section 4 goes into more detail on reference files for
the benefit of those who wish to add references to data
bases or write new troff macros for use with refer. The
options to make refer collect identical citations, or other-
wise relocate and adjust references, are described in
section 5. The UNIX manual sections for refer, lookall, and
associated commands are attached as appendices.

_2.  _S_e_a_r_c_h_i_n_g.

     The indexing and searching process is divided into  two
phases, each made of two parts.  These are shown below.

A.   Construct the index.

     (1)  Find keys -- turn the input files into a  sequence
          of tags and keys, where each tag identifies a dis-
          tinct item in the input and the keys for each such
          item  are  the  strings  under  which  it is to be
          indexed.

     (2)  Hash and sort -- prepare a set of inverted indexes
          from  which,  given a set of keys, the appropriate
          item tags can be found quickly.

B.   Retrieve an item in response to a query.

     (3)  Search -- Given some keys, look through the  files
          prepared  by  the hashing and sorting facility and
          derive the appropriate tags.

     (4)  Deliver --  Given  the  tags,  find  the  original
          items.  This completes the searching process.

The first phase, making the index, is presumably done  rela-
tively infrequently.  It should, of course, be done whenever
the data being indexed  change.   In  contrast,  the  second
phase,  retrieving items, is presumably done often, and must
be rapid.

     An effort is made to separate code which depends on the
data  being handled from code which depends on the searching
procedure.  The search algorithm is involved only  in  steps
(2)  and  (3),  while  knowledge of the actual data files is
needed only by steps (1) and (4).  Thus it is easy to  adapt
to different data files or different search algorithms.

     To start with, it is necessary  to  have  some  way  of
selecting  or generating keys from input files.  For dealing
with files that are basically English, we have a  key-making
program which automatically selects words and passes them to
the hashing and sorting program (step 2).  The  format  used
has one line for each input item, arranged as follows:

        name:start,length (tab) key1 key2 key3 ...

where _n_a_m_e is the file name,  _s_t_a_r_t  is  the  starting  byte
number, and _l_e_n_g_t_h is the number of bytes in the entry.


     These lines are the only input used to make the  index.
The  first  field  (the  file  name, byte position, and byte
count) is the tag of the item and can be used to retrieve it
quickly.  Normally, an item is either a whole file or a sec-
tion of a file delimited by blank lines.  After the tab, the
second  field  contains  the keys.  The keys, if selected by
the automatic program, are any  alphanumeric  strings  which
are  not  among  the  100 most frequent words in English and
which  are  not  entirely  numeric  (except  for  four-digit
numbers  beginning  19,  which are accepted as dates).  Keys
are truncated to six characters and converted to lower case.
Some  selection  is  needed  if  the original items are very
large. We normally just take the first n keys, with n less
than  100  or  so;  this replaces any attempt at intelligent
selection. One file in our system is a complete English
dictionary; without such a limit it would presumably be
retrieved for all queries.
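
     The key selection rules just described can be sketched in a few
lines of Python. This is only an illustration, not the key-making
program itself: the names select_keys and key_line and the short
COMMON_WORDS set are hypothetical stand-ins (the real list of common
words is kept in a file), and whether repeated words are emitted more
than once is an assumption.

```python
import re

# Hypothetical stand-in for the list of common English words; the real
# program reads a longer list from a file.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "that"}


def select_keys(text, max_keys=100, common=COMMON_WORDS):
    """Return the keys for one item, in order of appearance."""
    keys = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        word = word.lower()                      # keys are lower case
        if word in common:
            continue                             # skip frequent words
        # Reject all-numeric strings unless they look like a 19xx date.
        if word.isdigit() and not (len(word) == 4 and word.startswith("19")):
            continue
        keys.append(word[:6])                    # truncate to six characters
        if len(keys) >= max_keys:                # crude cap, as in the text
            break
    return keys


def key_line(name, start, text, max_keys=100):
    """Format one line as name:start,length<TAB>key1 key2 ..."""
    length = len(text.encode())                  # length in bytes
    return "%s:%d,%d\t%s" % (name, start, length,
                             " ".join(select_keys(text, max_keys)))


print(key_line("refs", 0, "Some Applications of Inverted Indexes, 1978, 42"))
```

     The cap max_keys plays the role of the limit n on the number of
keys taken from each item.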

     To generate an inverted index to  the  list  of  record
tags  and keys, the keys are hashed and sorted to produce an
index.  What is wanted, ideally, is a series of lists  show-
ing  the  tags  associated with each key.  To condense this,
what is actually produced is a list showing the tags associ-
ated  with  each  hash code, and thus with some set of keys.
To speed up access and further save space, a set of three or
possibly four files is produced.  These files are:

        File      Contents
        entry     Pointers to posting file for each hash code
        posting   Lists of tag pointers for each hash code
        tag       Tags for each item
        key       Keys for each item (optional)

The posting file comprises the  real  data:  it  contains  a
sequence  of lists of items posted under each hash code.  To
speed up searching, the entry file is an array  of  pointers
into  the posting file, one per potential hash code.  Furth-
ermore, the items in the lists in the posting file  are  not
referred to by their complete tag, but just by an address in
the tag file, which gives the complete tags.  The  key  file
is  optional  and  contains  a  copy of the keys used in the
indexing.
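
     A minimal in-memory sketch of these files, in Python, may make
the layout easier to picture. It is written under several assumptions:
the hash function is a toy one (the function actually used is not
given here), the names hash_key and build_index are hypothetical, and
the end of each posting list is found from the next entry pointer
rather than from whatever marker the real posting file uses.

```python
def hash_key(key, table_size=997):
    """Toy string hash; an assumption, not the program's own function."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h


def build_index(key_lines, table_size=997):
    """Build list analogues of the entry, posting and tag files.

    key_lines: iterable of "tag<TAB>key1 key2 ..." strings.
    Returns (entry, posting, tags):
      tags    -- complete tags, one per item
      posting -- item numbers (positions in tags), grouped by hash code
      entry   -- entry[h] is where the list for hash code h starts in
                 posting; entry[h + 1] is where it ends
    """
    tags = []
    pairs = []                                   # (hash code, item number)
    for line in key_lines:
        tag, _, keys = line.partition("\t")
        item = len(tags)
        tags.append(tag)
        for key in keys.split():
            pairs.append((hash_key(key, table_size), item))
    pairs.sort()                                 # the "sort" half of step (2)

    entry, posting = [], []
    i = 0
    for h in range(table_size):
        entry.append(len(posting))
        while i < len(pairs) and pairs[i][0] == h:
            posting.append(pairs[i][1])
            i += 1
    entry.append(len(posting))                   # end marker for the last list
    return entry, posting, tags
```

     Looking up a hash code h is then a slice, posting[entry[h]:entry[h + 1]],
and each number in the slice is translated to a full tag through tags.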

     The searching process starts with a  query,  containing
several  keys.   The  goal is to obtain all items which were
indexed under these keys.  The query keys  are  hashed,  and
the  pointers  in the entry file used to access the lists in
the posting file.  These lists are addresses in the tag file
of  documents  posted  under the hash codes derived from the
query.  The common items from all lists are determined; this
must  include  the  items indexed by every key, but may also
contain some items which are false drops, since items refer-
enced  by the correct hash codes need not actually have con-
tained the correct keys.  Normally,  if  there  are  several
keys in the query, there are not likely to be many false
drops in the final combined list even though each hash  code
is  somewhat  ambiguous.   The actual tags are then obtained
from the tag file, and to guard against the possibility that
an  item  has  false-dropped on some hash code in the query,
the original items are normally obtained from  the  delivery
program  (4)  and  the  query  keys  checked against them by
string comparison.
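
     The two-stage search, intersecting the lists for each hash code
and then discarding false drops by string comparison, might be
sketched as follows. The code is illustrative only: items are held in
memory rather than fetched by tag, the hash function is a toy, and the
names index_items and search are hypothetical.

```python
def hash_key(key, table_size):
    """Toy string hash; an assumption, not the program's own function."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % table_size
    return h


def index_items(items, table_size):
    """Map each hash code to the set of item numbers posted under it."""
    postings = {}
    for number, text in enumerate(items):
        for word in text.lower().split():
            code = hash_key(word[:6], table_size)
            postings.setdefault(code, set()).add(number)
    return postings


def search(query, items, postings, table_size):
    """Return (candidates, answers) for a query of several keys.

    candidates: items posted under every hash code of the query
    answers:    candidates that really contain every query key
    """
    keys = [word[:6] for word in query.lower().split()]
    if not keys:
        return [], []
    lists = [postings.get(hash_key(k, table_size), set()) for k in keys]
    candidates = sorted(set.intersection(*lists))
    answers = []
    for number in candidates:
        item_keys = {word[:6] for word in items[number].lower().split()}
        if all(k in item_keys for k in keys):    # check for false drops
            answers.append(number)
    return candidates, answers
```

     With a deliberately tiny hash table every item collides, so the
candidate list is long and the final check does all the work; with a
larger table the candidate list is already close to the answer.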

     Usually, therefore, the check for  bad  drops  is  made
against  the  original file.  However, if the key derivation
procedure is complex, it may be preferable to check  against
the  keys fed to program (2).  In this case the optional key
file which contains the keys associated with  each  item  is
generated, and the item tag is supplemented by a string

        ;start,length

which indicates the starting byte number in the key file and
the  length  of the string of keys for each item.  This file
is not usually necessary with the present key-selection pro-
gram, since the keys always appear in the original document.

     There is also an option (-Cn) for coordination level
searching. This retrieves items which match all but n of
the query keys. The items are retrieved in the order of the
number of keys that they match. Of course, n must be less
than the number of query keys (nothing is retrieved unless
it matches at least one key).
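
     Coordination level searching can be illustrated with a short
Python sketch. Hashing is left out for clarity, each item is
represented simply by its set of keys, and the function name
coordination_search is hypothetical.

```python
def coordination_search(query_keys, item_keys, n_missing=0):
    """Return item numbers matching all but at most n_missing query keys.

    item_keys: list of key sets, one per item.
    Items matching more keys come first; ties keep item order.
    """
    wanted = set(query_keys)
    if not 0 <= n_missing < len(wanted):
        raise ValueError("n must be less than the number of query keys")
    hits = {}
    for number, keys in enumerate(item_keys):
        matched = len(wanted & set(keys))
        if matched >= len(wanted) - n_missing:
            hits[number] = matched
    return sorted(hits, key=lambda number: (-hits[number], number))
```

     With n_missing equal to 0 this reduces to the ordinary search in
which every query key must be present.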

     As an example, consider one  set  of  4377  references,
comprising  660,000  bytes.   This  included 51,000 keys, of
which 5,900 were distinct keys.  The hash table is kept full
to  save space (at the expense of time); 995 of 997 possible
hash codes were used.  The total set of index files (no  key
file) included 171,000 bytes, about 26% of the original file
size.  It took 8 minutes of processor time  to  hash,  sort,
and  write the index.  To search for a single query with the
resulting index took 1.9 seconds of processor time, while to
find  the  same  paper with a sequential linear search using
_g_r_e_p (reading all of the tags and keys) took 12.3 seconds of
processor time.

     We have also used this software to  index  all  of  the
English  stored  on  our  UNIX  system.   This  is the index
searched by the lookall command. On a typical day there
were 29,000 files in our user file system, containing about
152,000,000 bytes. Of these, 5,300 files, containing
32,000,000 bytes (about 21%), were English text. The total
number of `words' (determined mechanically) was 5,100,000.
Of these, 227,000 were selected as keys; 19,000 were dis-
tinct, hashing to 4,900 (of 5,000 possible) different hash
codes.   The  resulting  inverted  file indexes used 845,000
bytes, or about 2.6% of the size of the original files.  The
particularly small indexes are caused by the fact that keys
are taken from only the first 50 non-common  words  of  some
very long input files.

     Even this large lookall index can be searched quickly.
For  example,  to find this document by looking for the keys
``lesk inverted indexes'' required 1.7 seconds of  processor
time  and  system  time.   By comparison, just to search the
800,000 byte dictionary  (smaller  than  even  the  inverted
indexes,  let alone the 32,000,000 bytes of text files) with
_g_r_e_p takes 29 seconds of processor time.  The  _l_o_o_k_a_l_l  pro-
gram  is  thus  useful when looking for a document which you
believe is stored on-line, but do not know where.  For exam-
ple,  many  memos from the Computing Science Research Center
are in its UNIX file system, but it is  often  difficult  to
guess  where  a  particular  memo  might  be  (it might have
several authors, each with many directories, and  have  been
worked  on  by  a  secretary  with  yet  more  directories).
Instructions for the use of the lookall command are given in
the  manual  section, shown in the appendix to this memoran-
dum.

     The only indexes maintained routinely are those of pub-
lication  lists  and  all  English  files.   To  make  other
indexes, the programs for making keys, sorting them, search-
ing the indexes, and delivering answers must be used.  Since
they are usually invoked as parts of higher-level  commands,
they  are  not  in  the  default  command directory, but are
available to any user in the directory /usr/lib/refer.
Three programs are of interest: mkey, which isolates keys
from input files; inv, which makes an index from a set of
keys; and hunt, which searches the index and delivers the
items.  Note that the two parts of the retrieval  phase  are
combined  into  one  program,  to avoid the excessive system
work and delay which would  result  from  running  these  as
separate processes.

     These three commands have a large number of options  to
adapt  to different kinds of input.  The user not interested
in the detailed description that now  follows  may  skip  to
section 3, which describes the refer program, a packaged-up
version of these tools specifically oriented towards format-
ting references.

     _M_a_k_e _K_e_y_s.  The program _m_k_e_y is the key-making  program
corresponding  to  step  (1) in phase A.  Normally, it reads
its input from the file names given  as  arguments,  and  if
there are no arguments it reads from the standard input.  It
assumes that blank  lines  in  the  input  delimit  separate
items,  for each of which a different line of keys should be
generated.  The lines of keys are written  on  the  standard
output.   Keys  are any alphanumeric string in the input not
among the most frequent words in English  and  not  entirely
numeric  (except  that all-numeric strings are acceptable if
they are between 1900 and 1999). In the output, keys are
translated to lower case, and truncated to six characters in
length; any associated punctuation is removed. The follow-
ing flag arguments are recognized by mkey:

     -c name   Name of file of common words; default is
               /usr/lib/eign.
     -f name   Read a list of files from name and take each as
               an input argument.
     -i chars  Ignore all lines which begin with `%' followed
               by any character in chars.
     -kn       Use at most n keys per input item.
     -ln       Ignore items shorter than n letters long.
     -nm       Ignore as a key any word in the first m words of
               the list of common English words. The default
               is 100.
     -s        Remove the labels (file:start,length) from the
               output; just give the keys. Used when searching
               rather than indexing.
     -w        Each whole file is a separate item; blank lines
               in files are irrelevant.


     The normal arguments for indexing  references  are  the
defaults, which are -c /usr/lib/eign, -n100, and -l3. For
searching, the -s option is also needed. When the big
lookall index of all English files is run, the options are
-w, -k50, and -f (filelist). When running on textual input,
the mkey program processes about 1000 English words per pro-
cessor second. Unless the -k option is used (and the input
files are long enough for it to take effect) the output of
mkey is comparable in size to its input.

     _H_a_s_h _a_n_d _i_n_v_e_r_t.  The _i_n_v  program  computes  the  hash
codes and writes the inverted files.  It reads the output of
_m_k_e_y and writes the set of files described earlier  in  this
section.  It expects one argument, which is used as the base
name for the three (or four) files to be written.   Assuming
an  argument  of _I_n_d_e_x (the default) the entry file is named
_I_n_d_e_x._i_a, the posting file _I_n_d_e_x._i_b, the tag file  _I_n_d_e_x._i_c,
and  the  key  file  (if present) _I_n_d_e_x._i_d.  The _i_n_v program
recognizes the following options:

     -a        Append the new keys to a previous set of
               inverted files, making new files if there is no
               old set using the same base name.
     -d        Write the optional key file. This is needed when
               you can not check for false drops by looking for
               the keys in the original inputs, i.e. when the
               key derivation procedure is complicated and the
               output keys are not words from the input files.
     -hn       The hash table size is n (default 997); n should
               be prime. Making n bigger saves search time and
               spends disk space.
     -i[u] name
               Take input from file name, instead of the
               standard input; if u is present name is unlinked
               when the sort is started. Using this option
               permits the sort scratch space to overlap the
               disk space used for input keys.
     -n        Make a completely new set of inverted files,
               ignoring previous files.
     -p        Pipe into the sort program, rather than writing
               a temporary input file. This saves disk space
               and spends processor time.
     -v        Verbose mode; print a summary of the number of
               keys which finished indexing.


     About half the time used in inv is in the contained
sort. Assuming the sort is roughly linear, however, a guess
at the total timing for inv is 250 keys per second. The
space used is usually of more importance: the entry file
uses four bytes per possible hash (note the -h option), and
the tag file around 15-20 bytes per item indexed. Roughly,
the posting file contains one item for each key instance and
one item for each possible hash code; the items are two
bytes long if the tag file is less than 65536 bytes long,
and the items are four bytes wide if the tag file is greater
than 65536 bytes long. To minimize storage, the hash tables
should be over-full; for most of the files indexed in this
way, there is no other real choice, since the entry file
must fit in memory.
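
     These rules of thumb can be turned into a rough size estimate.
The Python sketch below is an illustration under stated assumptions:
the function name estimate_index_bytes is hypothetical, 18 bytes per
tag is simply a value picked from the 15-20 byte range, and the
example figures are invented rather than measured.

```python
def estimate_index_bytes(items, key_instances, hash_size=997,
                         tag_bytes_per_item=18):
    """Rough sizes of the entry, posting and tag files, in bytes."""
    entry = 4 * hash_size                        # four bytes per hash code
    tag = tag_bytes_per_item * items             # assumed average tag size
    # Two-byte posting items suffice while tag file addresses fit in
    # 16 bits; a tag file of exactly 65536 bytes is treated as large.
    width = 2 if tag < 65536 else 4
    posting = width * (key_instances + hash_size)
    return {"entry": entry, "posting": posting, "tag": tag,
            "total": entry + posting + tag}


# Invented example: 1000 items with ten keys each.
print(estimate_index_bytes(items=1000, key_instances=10000))
```

     The estimate is only as good as the rules of thumb behind it;
actual index sizes depend on details of the file layout that are not
modelled here.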

     _S_e_a_r_c_h_i_n_g _a_n_d _R_e_t_r_i_e_v_i_n_g.  The _h_u_n_t  program  retrieves
items  from  an index.  It combines, as mentioned above, the
two parts of phase (B): search and delivery.  The reason why
it  is efficient to combine delivery and search is partly to
avoid starting unnecessary processes, and partly because the
delivery operation must be a part of the search operation in
any case.  Because of the hashing,  the  search  part  takes
place  in  two  stages: first items are retrieved which have
the right hash codes associated  with  them,  and  then  the
actual  items  are  inspected to determine false drops, i.e.
to determine if anything with the right hash  codes  doesn't
really  have  the  right  keys.   Since the original item is
retrieved to check  on  false  drops,  it  is  efficient  to
present  it  immediately, rather than only giving the tag as
output and later retrieving the item again.  If there were a
separate  key  file,  this  argument  would  not  apply, but
separate key files are not common.

     Input to hunt is taken from the standard input, one
query per line. Each query should be in mkey -s output for-
mat: all lower case, no punctuation. The hunt program takes
one argument which specifies the base name of the index
files to be searched. Only one set of index files can be
searched at a time, although many text files may be indexed
as a group, of course. If one of the text files has been
changed since the index was made, that file is searched with
fgrep; this may occasionally slow down the searching, and
care should be taken to avoid having many out of date files.
The following option arguments are recognized by hunt:

     -a        Give all output; ignore checking for false
               drops.
     -Cn       Coordination level n; retrieve items with not
               more than n terms of the input missing; default
               C0, implying that each search term must be in
               the output items.
     -F[ynd]   ``-Fy'' gives the text of all the items found;
               ``-Fn'' suppresses them. ``-Fd'' where d is an
               integer gives the text of the first d items.
               The default is -Fy.
     -g        Do not use fgrep to search files changed since
               the index was made; print an error comment
               instead.
     -i string Take string as input, instead of reading the
               standard input.
     -l n      The maximum length of internal lists of
               candidate items is n; default 1000.
     -o string Put text output (``-Fy'') in string; of use only
               when invoked from another program.
     -p        Print hash code frequencies; mostly for use in
               optimizing hash table sizes.
     -T[ynd]   ``-Ty'' gives the tags of the items found;
               ``-Tn'' suppresses them. ``-Td'' where d is an
               integer gives the first d tags. The default is
               -Tn.
     -t string Put tag output (``-Ty'') in string; of use only
               when invoked from another program.


     The timing of hunt is complex. Normally the hash table
is  overfull,  so that there will be many false drops on any
single term; but a multi-term  query  will  have  few  false
drops  on all terms.  Thus if a query is underspecified (one
search term) many potential items will be examined and  dis-
carded  as false drops, wasting time.  If the query is over-
specified (a dozen search terms) many keys will be  examined
only  to verify that the single item under consideration has
that key posted.  The variation of search time  with  number
of  keys  is  shown  in the table below.  Queries of varying
length were constructed to retrieve  a  particular  document
from the file of references. In the sequence on the left,
search terms were chosen so as to select the desired paper
as quickly as possible. In the sequence on the right, terms
were chosen inefficiently, so that the query did not
uniquely select the desired document until four keys had
been used. The same document was the target in each case,
and the final sets of eight keys are also identical; the
differences at five, six and seven keys are produced by
measurement error, not by the slightly different key lists.

          Efficient Keys            |         Inefficient Keys
No.   Total    Retrieved  Search    | No.   Total    Retrieved  Search
keys  drops    documents  time      | keys  drops    documents  time
      (incl.              (seconds) |       (incl.              (seconds)
      false)                        |       false)
 1     15        3        1.27      |  1     68       55        5.96
 2      1        1        0.11      |  2     29       29        2.72
 3      1        1        0.14      |  3      8        8        0.95
 4      1        1        0.17      |  4      1        1        0.18
 5      1        1        0.19      |  5      1        1        0.21
 6      1        1        0.23      |  6      1        1        0.22
 7      1        1        0.27      |  7      1        1        0.26
 8      1        1        0.29      |  8      1        1        0.29

As would be expected, the optimal search  is  achieved  when
the query just specifies the answer; however, overspecifica-
tion is quite cheap. Roughly, the time required by hunt can
be  approximated  as  30 milliseconds per search key plus 75
milliseconds per dropped document (whether  it  is  a  false
drop  or  a real answer).  In general, overspecification can
be recommended; it protects the user  against  additions  to
the   data  base  which  turn  previously  uniquely-answered
queries into ambiguous queries.
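
     The approximation above is easy to try against a few of the
measured rows. The Python sketch below is illustrative: the function
name hunt_seconds is hypothetical, and the rows are copied from the
efficient-keys half of the table of measurements.

```python
def hunt_seconds(search_keys, dropped_documents):
    """Approximate processor time: 30 ms per key plus 75 ms per drop."""
    return 0.030 * search_keys + 0.075 * dropped_documents


# (number of keys, total drops, measured seconds), efficient keys.
MEASURED = [(1, 15, 1.27), (2, 1, 0.11), (4, 1, 0.17), (8, 1, 0.29)]

for keys, drops, measured in MEASURED:
    print("%d keys, %2d drops: model %.2f s, measured %.2f s"
          % (keys, drops, hunt_seconds(keys, drops), measured))
```

     The model is a rough guide only; it tracks the measurements to
within a few hundredths of a second for well-specified queries and
somewhat less closely when many documents are dropped.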

     The  careful  reader  will  have  noted   an   enormous
discrepancy  between these times and the earlier quoted time
of around 1.9 seconds for a  search.   The  times  here  are
purely  for  the  search and retrieval: they are measured by
running many searches through a  single  invocation  of  the
_h_u_n_t  program  alone.   Usually,  the UNIX command processor
(the shell) must start both the _m_k_e_y and _h_u_n_t processes  for
each  query, and arrange for the output of _m_k_e_y to be fed to
the _h_u_n_t program.  This adds a fixed overhead of  about  1.7
seconds  of  processor  time to any single search.  Further-
more, remember that all these times are processor times:  on
a  typical  morning  on our PDP 11/70 system, with about one
dozen people logged on, to obtain 1 second of processor time
for the search program took between 2 and 12 seconds of real
time, with a median  of  3.9  seconds  and  a  mean  of  4.8
seconds.   Thus,  although  the  work  involved  in a single
search may be only 200 milliseconds, after you add  the  1.7
seconds  of  startup  processor  time  and then assume a 4:1
elapsed/processor time ratio, it will be  8  seconds  before
any response is printed.
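
     The arithmetic behind the figure of 8 seconds can be written out
explicitly. In this small Python sketch the function name
response_seconds is hypothetical and the defaults are the round
figures quoted in the paragraph above.

```python
def response_seconds(search_cpu=0.2, startup_cpu=1.7, elapsed_per_cpu=4.0):
    """Elapsed time before a response, from processor time and load."""
    return (search_cpu + startup_cpu) * elapsed_per_cpu


# (0.2 + 1.7) * 4 is about 7.6, which the text rounds to 8 seconds.
print("%.1f seconds" % response_seconds())
```

     Nearly all of the delay comes from the fixed startup cost, so
making the search itself faster would change the result very little.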

_3.  _S_e_l_e_c_t_i_n_g _a_n_d _F_o_r_m_a_t_t_i_n_g _R_e_f_e_r_e_n_c_e_s _f_o_r _T_R_O_F_F

     The major application  of  the  retrieval  software  is
_r_e_f_e_r,  which is a _t_r_o_f_f preprocessor like _e_q_n.  [ kernighan
cherry acm 1975 ] It scans its input looking  for  items  of
the form

        .[
        imprecise citation
        .]

where an imprecise citation is  merely  a  string  of  words
found  in  the  relevant  bibliographic  citation.   This is
translated into a  properly  formatted  reference.   If  the
imprecise  citation  does  not  correctly  identify a single
paper (either selecting no papers or too many) a message  is
given.   The data base of citations searched may be tailored
to each system, and individual users may specify  their  own
citation  files.   On  our  system, the default data base is
accumulated from the publication lists of the members of our
organization, plus about half a dozen personal
bibliographies that were collected.  The  present  total  is
about  4300  citations,  but  this increases steadily.  Even
now, the data base covers a large fraction  of  local  cita-
tions.

     For example, the reference for the eqn paper above was
specified as

        ...
        preprocessor like
        .I eqn.
        .[
        kernighan cherry acm 1975
        .]
        It scans its input looking for items
        ...

This paper was itself printed using refer. The above input
text was processed by refer as well as tbl and troff by the
command

        refer memo-file | tbl | troff -ms

and  the  reference  was  automatically  translated  into  a
correct  citation  to the ACM paper on mathematical typeset-
ting.

     The procedure for placing a reference in a paper using
_r_e_f_e_r is as follows.  First, use the _l_o_o_k_b_i_b command
to check that the paper is in the data base and to find  out
what  keys  are  necessary  to retrieve it.  This is done by
typing _l_o_o_k_b_i_b and then typing some potential queries  until
a  suitable query is found.  For example, had one started to
find the _e_q_n paper shown above by presenting the query

                $ lookbib
                kernighan cherry
                (EOT)

_l_o_o_k_b_i_b would  have  found  several  items;  experimentation
would  quickly have shown that the query given above is ade-
quate.  Overspecifying the query is of course  harmless;  it
is  even desirable, since it decreases the risk that a docu-
ment added to the publication data base in the  future  will
be  retrieved  in  addition  to  the intended document.  The
extra time taken by even a grossly  overspecified  query  is
quite small.  A particularly careful reader may have noticed
that ``acm'' does not appear in  the  printed  citation;  we
have  supplemented  some  of  the data base items with extra
keywords, such as common abbreviations for journals or other
sources, to aid in searching.
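
     The effect of adding words to a query can be sketched
in a few lines of Python.  This is only an illustration of
the matching rule (a citation is retrieved when it contains
every word of the query); the real programs consult the
index described in section 2 rather than scanning linearly,
and the names words, search and CITATIONS are invented for
the example.

```python
import re

def words(text):
    """Return the set of lower-cased alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(query, citations):
    """Return every citation that contains all of the query words."""
    wanted = words(query)
    return [c for c in citations if wanted <= words(c)]

# Two citations quoted elsewhere in this memorandum.
CITATIONS = [
    "%A B. W. Kernighan\n%A L. L. Cherry\n"
    "%T A System for Typesetting Mathematics\n"
    "%J Comm. ACM\n%V 18\n%N 3\n%P 151-157\n%D March 1975",
    "%T Bounds on the Complexity of the Maximal Common Subsequence Problem\n"
    "%A A. V. Aho\n%A D. S. Hirschberg\n%A J. D. Ullman\n"
    "%J J. ACM\n%V 23\n%N 1\n%P 1-12\n%D Jan. 1976",
]

print(len(search("acm", CITATIONS)))                        # 2
print(len(search("kernighan cherry acm 1975", CITATIONS)))  # 1
```

Each added word can only shrink the set of matches, which is
why an overspecified query is safe.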

     If the reference is in the data base,  the  query  that
retrieved it can be inserted in the text, between .[ and .]
brackets.  If it is not in the data base, it  can  be  typed
into  a  private  file  of references, using the format dis-
cussed in the next section, and then the -_p option  used  to
search this private file.  Such a command might read (if the
private references are called _m_y_f_i_l_e)

        _r_e_f_e_r -_p _m_y_f_i_l_e _d_o_c_u_m_e_n_t | _t_b_l | _e_q_n | _t_r_o_f_f -_m_s . . .

where _t_b_l and/or _e_q_n could be omitted if  not  needed.   The
use of the -_m_s macros [ lesk typing documents unix gcos ] or
some other macro package, however, is essential.  _R_e_f_e_r only
generates  the  data for the references; exact formatting is
done by some macro package, and  if  none  is  supplied  the
references will not be printed.

     By default, the references are  numbered  sequentially,
and  the  -_m_s  macros  format references as footnotes at the
bottom of the page.  This memorandum is an example  of  that
style.   Other  possibilities  are  discussed  in  section 5
below.

_4.  _R_e_f_e_r_e_n_c_e _F_i_l_e_s.

     A reference file is a set of  bibliographic  references
usable  with  _r_e_f_e_r.   It  can be indexed using the software
described in section 2 for fast searching.  What _r_e_f_e_r  does
is  to read the input document stream, looking for imprecise
citation references.  It  then  searches  through  reference
files  to find the full citations, and inserts them into the
document.  The format of the full citation  is  arranged  to
make it convenient for a macro package, such as the -_m_s mac-
ros, to format the reference for printing.  Since the format
of the final reference is determined by the desired style of
output, which is determined by the macros used, _r_e_f_e_r avoids
forcing  any  kind  of reference appearance.  All it does is
define a set of string registers  which  contain  the  basic
information  about  the  reference; and provide a macro call
which is expanded by the macro package to format the  refer-
ence.   It  is the responsibility of the final macro package
to see that the reference is actually printed; if no  macros
are used, and the output of _r_e_f_e_r fed untranslated to _t_r_o_f_f,
nothing at all will be printed.

     The strings defined by _r_e_f_e_r are  taken  directly  from
the  files of references, which are in the following format.
The references should be separated  by  blank  lines.   Each
reference  is  a sequence of lines beginning with % and fol-
lowed by a key-letter.  The remainder of that line, and suc-
cessive  lines until the next line beginning with %, contain
the information specified by the  key-letter.   In  general,
_r_e_f_e_r   does  not  interpret  the  information,  but  merely
presents it to the macro package for  final  formatting.   A
user with a separate macro package, for example, can add new
key-letters or use the existing ones for other purposes
without bothering _r_e_f_e_r.

     The meaning of the key-letters given below, in particu-
lar,  is  that assigned by the -_m_s macros.  Not all informa-
tion, obviously, is used with each citation.   For  example,
if  a  document is both an internal memorandum and a journal
article, the macros ignore the memorandum version  and  cite
only the journal article.  Some kinds of information are not
used at all in printing the reference; if a  user  does  not
like  finding  references by specifying title or author key-
words, and prefers to add specific keywords to the citation,
a field is available which is searched but not printed (K).

     The key letters currently recognized by _r_e_f_e_r and  -_m_s,
with the kind of information implied, are:

        Key  Information specified              Key  Information specified
        A    Author's name                      N    Issue number
        B    Title of book containing item      O    Other information
        C    City of publication                P    Page(s) of article
        D    Date                               R    Technical report reference
        E    Editor of book containing item     T    Title
        G    Government (NTIS) ordering number  V    Volume number
        I    Issuer (publisher)
        J    Journal name
        K    Keys (for searching)               X    or
        L    Label                              Y    or
        M    Memorandum label                   Z    Information not used by _r_e_f_e_r

For example, a sample reference could be typed as:

        %T Bounds on the Complexity of the Maximal
        Common Subsequence Problem
        %Z ctr127
        %A A. V. Aho
        %A D. S. Hirschberg
        %A J. D. Ullman
        %J J. ACM
        %V 23
        %N 1
        %P 1-12
        %M abcd-78
        %D Jan. 1976

Order is irrelevant, except that authors are  shown  in  the
order  given.   The  output  of  _r_e_f_e_r is a stream of string
definitions, one for each of the fields of  each  reference,
as shown below.


        .]-
        .ds [A authors' names ...
        .ds [T title ...
        .ds [J journal ...
        ...
        .][ type-number

The _r_e_f_e_r program, in general, does not concern itself  with
the  significance  of the strings.  The different fields are
treated identically by _r_e_f_e_r, except that the  X,  Y  and  Z
fields  are  ignored (see the -_i option of _m_k_e_y) in indexing
and searching.  All _r_e_f_e_r does  is  select  the  appropriate
citation, based on the keys.  The macro package must arrange
the strings so as  to  produce  an  appropriately  formatted
citation.   In this process, it uses the convention that the
`T' field is the title, the `J' field the  journal,  and  so
forth.
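
     The file format described above is simple enough to
read with a short program.  The following Python sketch is
not part of _r_e_f_e_r; the function name parse_references is
invented here, and it assumes that every field line starts
with % in column one and that continuation lines are joined
to the preceding field with a single blank.

```python
def parse_references(text):
    """Split a reference file into records.

    Each record is a list of (key_letter, value) pairs kept in file
    order, so that repeated fields such as %A stay in the order given.
    """
    records, current = [], []
    for line in text.splitlines():
        if not line.strip():                  # a blank line ends a reference
            if current:
                records.append(current)
                current = []
        elif line.startswith("%") and len(line) > 1:
            current.append((line[1], line[2:].strip()))
        elif current:                         # continuation of the last field
            key, value = current[-1]
            current[-1] = (key, (value + " " + line.strip()).strip())
    if current:
        records.append(current)
    return records

SAMPLE = """%T Bounds on the Complexity of the Maximal
Common Subsequence Problem
%Z ctr127
%A A. V. Aho
%A D. S. Hirschberg
%A J. D. Ullman
%J J. ACM
%V 23
%N 1
%P 1-12
%M abcd-78
%D Jan. 1976
"""

records = parse_references(SAMPLE)
authors = [value for key, value in records[0] if key == "A"]
print(authors)
```

Keeping the pairs in a list rather than a dictionary
preserves the author order, which is the only ordering that
matters.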

     The _r_e_f_e_r program does arrange the citation to simplify
the  macro  package's  job,  however.  The special macro .]-
precedes the string definitions and the  special  macro  .][
follows.  These are changed from the input .[ and .] so that
running the same file through _r_e_f_e_r again is harmless.   The
.]-  macro  can  be used by the macro package to initialize.
The .][ macro, which should be used to print the  reference,
is  given  an  argument  _t_y_p_e-_n_u_m_b_e_r to indicate the kind of
reference, as follows:

        Value  Kind of reference
        1      Journal article
        2      Book
        3      Article within book
        4      Technical report
        5      Bell Labs technical memorandum
        0      Other

The type is determined by the presence or absence of partic-
ular  fields  in the citation (a journal article must have a
`J' field, a book must have an `I' field, and so forth).  To
a small extent, this violates the above rule that _r_e_f_e_r does
not concern itself with the contents of the  citation;  how-
ever,  the  classification  of  the citation in _t_r_o_f_f macros
would require a relatively expensive  and  obscure  program.
Any  macro  writer  may,  of course, preserve consistency by
ignoring the argument to the .][ macro.
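
     A classification of this kind might look like the
following Python sketch.  Only the J and I rules are stated
above; the remaining tests, and the order in which they are
tried, are assumptions made for the example and may not
match what _r_e_f_e_r actually does.  The function name
type_number is likewise invented.

```python
def type_number(fields):
    """Guess the type-number from the set of key-letters present.

    The precedence used here is an assumption, not taken from refer.
    """
    if "J" in fields:                  # journal article
        return 1
    if "B" in fields:                  # article within a book
        return 3
    if "R" in fields or "G" in fields: # technical report
        return 4
    if "I" in fields:                  # book (has an issuer)
        return 2
    if "M" in fields:                  # technical memorandum
        return 5
    return 0                           # other

# The sample reference above has both J and M fields.
print(type_number(set("TZAJVNPMD")))
```

Testing J first means that a document which is both a
memorandum and a journal article is classed as a journal
article.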

     The reference is flagged in the text with the sequence

        \*([.number\*(.]

where _n_u_m_b_e_r is the footnote number.  The strings [.  and .]
should  be used by the macro package to format the reference
flag in the text.  These strings can be replaced for a  par-
ticular  footnote,  as described in section 5.  The footnote
number (or other signal) is available to the reference macro
.][  as  the string register [_F.  To simplify dealing with a
text reference that occurs at the end of a  sentence,  _r_e_f_e_r
treats  a reference which follows a period in a special way.
The period is removed, and the reference is  preceded  by  a
call  for  the  string  <.   and  followed by a call for the
string >.  For example, if a reference follows  ``end.''  it
will appear as

        end\*(<.\*([.number\*(.]\*(>.

where _n_u_m_b_e_r is the  footnote  number.   The  macro  package
should  turn  either the string >.  or <.  into a period and
delete the other one.   This  permits  the  output  to  have
either the form ``end[31].'' or ``end.'' followed by a
superscript 31, as the macro
package wishes.  Note that in one case the  period  precedes
the number and in the other it follows the number.
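
     The rewriting of the text just before a reference can
be expressed as a small function.  This Python sketch (the
name flag_reference is invented) reproduces only the
transformation described above; it ignores everything else
_r_e_f_e_r does to the line.

```python
def flag_reference(text, number):
    """Append the reference signal to the text preceding a citation.

    A trailing period is removed and re-expressed through the
    strings <. and >. so the macro package can place it.
    """
    signal = r"\*([." + str(number) + r"\*(.]"
    if text.endswith("."):
        return text[:-1] + r"\*(<." + signal + r"\*(>."
    return text + signal

print(flag_reference("end.", 31))
print(flag_reference("end", 31))
```

The first call prints end\*(<.\*([.31\*(.]\*(>. and the
second prints end\*([.31\*(.] with no period strings.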

     In some cases users wish to suspend the searching,  and
merely  use  the  reference  macro formatting.  That is, the
user doesn't want to provide a search key between .[ and  .]
brackets, but merely the reference lines for the appropriate
document.  Alternatively, the user may wish to add a few
fields to those of the reference in the standard file, or to
override some fields.  Altering or replacing fields, or supplying
whole references, is easily done by inserting lines
beginning with %; any such line is taken as direct input  to
the  reference  processor  rather  than keys to be searched.
Thus

        .[
        key1 key2 key3 ...
        %Q New format item
        %R Override report name
        .]

makes the indicated changes to the result of searching for
the  keys.   All of the search keys must be given before the
first % line.
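
     The rule that keys come first and % lines follow can be
sketched as below.  The function name split_citation is
invented, and the treatment of a non-% line after the first
% line as a continuation of that field is an assumption
based on the reference file format of this section.

```python
def split_citation(lines):
    """Separate the lines between .[ and .] into keys and field lines."""
    keys, fields = [], []
    for line in lines:
        if line.startswith("%"):
            fields.append(line)          # direct input to the reference
        elif fields:
            fields[-1] += "\n" + line    # continuation of the last field
        else:
            keys.extend(line.split())    # still in the search keys
    return keys, fields

keys, fields = split_citation([
    "key1 key2 key3",
    "%Q New format item",
    "%R Override report name",
])
print(keys)
print(fields)
```

An empty list of keys corresponds to the case where the
whole citation is supplied in-line and no search is made.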

     If no search keys are provided, an entire citation  can
be  provided  in-line  in the text.  For example, if the _e_q_n
paper citation were to be inserted in this way, rather  than
by searching for it in the data base, the input would read


        ...
        preprocessor like
        .I eqn.
        .[
        %A B. W. Kernighan
        %A L. L. Cherry
        %T A System for Typesetting Mathematics
        %J Comm. ACM
        %V 18
        %N 3
        %P 151-157
        %D March 1975
        .]
        It scans its input looking for items
        ...

This would produce a citation of the same appearance as that
resulting from the file search.

     As  shown,  fields  are  normally  turned  into   _t_r_o_f_f
strings.   Sometimes users would rather have them defined as
macros, so that other _t_r_o_f_f commands can be placed into  the
data.   When  this  is  necessary, simply double the control
character % in the data.  Thus the input

        .[
        %V 23
        %%M
        Bell Laboratories,
        Murray Hill, N.J. 07974
        .]

is processed by _r_e_f_e_r into

        .ds [V 23
        .de [M
        Bell Laboratories,
        Murray Hill, N.J. 07974
        ..

The information after %%_M  is  defined  as  a  macro  to  be
invoked by .[_M while the information after %_V is turned into
a string to be invoked by _\*([_V.  At present -_m_s expects all
information as strings.
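
     The difference between % and %% fields can be shown
with a short Python sketch that turns field lines into the
definitions printed above.  The name emit_fields is
invented, and joining the continuation lines of a string
field with a blank is an assumption; only the .ds and .de
output forms come from the example above.

```python
def emit_fields(lines):
    """Translate %X lines into .ds strings and %%X lines into .de macros."""
    out = []
    in_macro = False
    for line in lines:
        if line.startswith("%%") and len(line) > 2:
            if in_macro:
                out.append("..")             # close the previous macro
            out.append(".de [" + line[2])
            rest = line[3:].strip()
            if rest:
                out.append(rest)
            in_macro = True
        elif line.startswith("%") and len(line) > 1:
            if in_macro:
                out.append("..")
                in_macro = False
            out.append(".ds [" + line[1] + " " + line[2:].strip())
        elif in_macro:
            out.append(line)                 # macro body, copied as is
        elif out:
            out[-1] += " " + line.strip()    # continuation of a string
    if in_macro:
        out.append("..")
    return out

for text in emit_fields(["%V 23", "%%M", "Bell Laboratories,",
                         "Murray Hill, N.J. 07974"]):
    print(text)
```

Run on the input shown above, the sketch prints the same
five lines as the _r_e_f_e_r output.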

_5.  _C_o_l_l_e_c_t_i_n_g _R_e_f_e_r_e_n_c_e_s _a_n_d _o_t_h_e_r _R_e_f_e_r _O_p_t_i_o_n_s

     Normally, the combination of _r_e_f_e_r and -_m_s formats out-
put  as _t_r_o_f_f footnotes which are consecutively numbered and
placed at the bottom of the page.  However, options exist to
place  the  references  at  the  end;  to arrange references
alphabetically by senior author; and to indicate  references
by strings in the text of the form [Name1975a] rather than
by number.  Whenever references are not placed at the bottom
of a page, identical references are coalesced.

     For example, the -_e  option  to  _r_e_f_e_r  specifies  that
references are to be collected; in this case they are output
whenever the sequence

        .[
        $LIST$
        .]

is encountered.  Thus, to place references at the end  of  a
paper, the user would run _r_e_f_e_r with the -_e option and place
the above $LIST$ commands after the last line of  the  text.
_R_e_f_e_r  will  then move all the references to that point.  To
aid in formatting the collected references, _r_e_f_e_r writes the
references preceded by the line

        .]<

and followed by the line

        .]>

to invoke special macros before and after the references.

     Another possible option to _r_e_f_e_r is the  -_s  option  to
specify  sorting  of references.  The default, of course, is
to list references in the order presented.   The  -_s  option
implies the -_e option, and thus requires a

        .[
        $LIST$
        .]

entry to call out the reference list.  The -_s option may  be
followed  by  a  string  of  letters, numbers, and `+' signs
indicating how the references are to be sorted.  The sort is
done using the fields whose key-letters are in the string as
sorting keys; the numbers indicate how many  of  the  fields
are  to  be  considered,  with  `+' taken as a large number.
Thus the default is -_s_A_D meaning ``Sort  on  senior  author,
then date.''  To sort on all authors and then title, specify
-_s_A+_T.  And to sort on two authors  and  then  the  journal,
write -_s_A_2_J.

     Other options to  _r_e_f_e_r  change  the  signal  or  label
inserted in the text for each reference.  Normally these are
just sequential numbers, and their exact  placement  (within
brackets,  as superscripts, etc.) is determined by the macro
package.   The  -_l  option  replaces  reference  numbers  by
strings composed of the senior author's last name, the date,
and a disambiguating letter.  If a number follows the  _l  as
in  -_l_3  only that many letters of the last name are used in


                           - 17 -


the label string.  To abbreviate the date as well, the form
-_l_m,_n shortens the last name to the first _m letters and the
date to the last _n digits.  For example,  the  option  -_l_3,_2
would  refer  to  the  _e_q_n paper (reference 3) by the signal
_K_e_r_7_5_a, since it is the first cited reference  by  Kernighan
in 1975.
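
     The construction of such a label can be sketched in
Python as follows.  The function name make_label and the
dictionary of counts are invented for the example.  The
sketch assumes that the last name is the final word of the
author field and that the date digits are simply the digits
found in the date field; it should be called once for each
distinct reference, since identical references are
coalesced.

```python
import re

def make_label(author, date, counts, m=None, n=None):
    """Build a label from senior author, date and a sequence letter.

    counts remembers how many labels share each name-and-date stem.
    """
    last = author.split()[-1]                 # assumed: last word is surname
    digits = "".join(re.findall(r"\d", date))
    if m is not None:
        last = last[:m]                       # first m letters of the name
    if n is not None:
        digits = digits[-n:]                  # last n digits of the date
    stem = last + digits
    counts[stem] = counts.get(stem, 0) + 1
    return stem + chr(ord("a") + counts[stem] - 1)

counts = {}
print(make_label("B. W. Kernighan", "March 1975", counts, 3, 2))  # Ker75a
```

A second, different Kernighan paper of 1975 labelled with
the same counts would receive the letter b.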

     A user wishing  to  specify  particular  labels  for  a
private  bibliography may use the -_k option.  Specifying -k_x
causes the field _x to be used as a label.  The default is L.
If  this  field  ends  in -, that character is replaced by a
sequence letter; otherwise the  field  is  used  exactly  as
given.

     If none of the _r_e_f_e_r-produced signals are desired,  the
-_b option entirely suppresses automatic text signals.

     If the user wishes to override the -_m_s treatment of the
reference signal (which is normally to enclose the number in
brackets in _n_r_o_f_f and make it a superscript in  _t_r_o_f_f)  this
can  be done easily.  If the lines .[ or .] contain anything
following these characters, the remainders  of  these  lines
are  used  to  surround the reference signal, instead of the
default.  Thus, for example, to say ``See reference (2).''
and avoid ``See reference.'' followed by a superscript 2,
the input might appear

        See reference
        .[ (
        imprecise citation ...
        .]).

Note that blanks are significant in this construction.  If a
permanent  change  is desired in the style of reference sig-
nals, however, it is probably easier to redefine the strings
[.   and  .] (which are used to bracket each signal) than to
change each citation.

     Although normally _r_e_f_e_r limits itself to retrieving the
data  for  the  reference, and leaves to a macro package the
job of arranging that data as required by the local  format,
there  are  two  special options for rearrangements that can
not be done by macro packages.  The -_c  option  puts  fields
into  all upper case (CAPS-SMALL CAPS in _t_r_o_f_f output).  The
key-letters indicating which information is to be translated
to upper case follow the _c, so that -_c_A_J means that authors'
names and journals are to be in caps.  The -_a option  writes
the  names  of  authors last name first, that is _A. _D. _H_a_l_l,
_J_r.  is written as _H_a_l_l, _A. _D. _J_r.  The citation form of the
_J_o_u_r_n_a_l  _o_f _t_h_e _A_C_M, for example, would require both -_c_A and
-_a options.  This produces authors' names in the style  _K_E_R_-
_N_I_G_H_A_N,  _B.  _W.  _A_N_D _C_H_E_R_R_Y, _L. _L. for the previous example.
The -_a option may be followed by a number  to  indicate  how
many  author  names  should be reversed; -_a_1 (without any -_c
option) would produce _K_e_r_n_i_g_h_a_n, _B. _W. _a_n_d _L. _L. _C_h_e_r_r_y, for
example.
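
     The two rearrangements can be imitated in Python as
below.  This is a sketch under stated assumptions, not the
code in _r_e_f_e_r: the names reverse_name and format_authors are
invented, the surname is taken to be the last word before
any comma, and the joining of the names with ``and'' is
left out.

```python
def reverse_name(name):
    """Write a name last name first: 'A. D. Hall, Jr.' -> 'Hall, A. D. Jr.'"""
    suffix = ""
    if "," in name:
        name, suffix = [part.strip() for part in name.split(",", 1)]
    parts = name.split()
    if len(parts) < 2:
        return name                       # nothing to reverse
    result = parts[-1] + ", " + " ".join(parts[:-1])
    return result + (" " + suffix if suffix else "")

def format_authors(authors, reverse=None, caps=False):
    """Reverse the first `reverse` names (all if None); optionally capitalize."""
    count = len(authors) if reverse is None else reverse
    names = [reverse_name(a) if i < count else a
             for i, a in enumerate(authors)]
    if caps:
        names = [n.upper() for n in names]
    return names

AUTHORS = ["B. W. Kernighan", "L. L. Cherry"]
print(format_authors(AUTHORS, reverse=1))
print(format_authors(AUTHORS, caps=True))
```

The first call corresponds to -_a_1 alone and the second to
-_c_A together with -_a.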

     Finally, there  is  also  the  previously-mentioned  -_p
option  to let the user specify a private file of references
to be searched before the public  files.   Note  that  _r_e_f_e_r
does  not insist on a previously made index for these files.
If a file is named which contains reference data but is  not
indexed,  it  will  be searched (more slowly) by _r_e_f_e_r using
_f_g_r_e_p.  In this way it is easy for users to keep small files
of  new  references,  which can later be added to the public
data bases.