Writing Tools - The STYLE and DICTION Programs


                        L. L. Cherry

                     Bell Laboratories
               Murray Hill, New Jersey 07974


                        W. Vesterman

                     Livingston College
                     Rutgers University


                          _A_B_S_T_R_A_C_T

          Text processing systems are now in heavy  use
     in  many companies to format documents.  With many
     documents stored on line, it has  become  possible
     to use computers to study writing style itself and
     to help writers produce better  written  and  more
     readable  prose.  The system of programs described
     here is an initial  step  toward  such  help.   It
     includes programs and a data base designed to pro-
     duce a stylistic profile of writing  at  the  word
     and  sentence  level.   The  system measures read-
     ability, sentence and word length, sentence  type,
     word usage, and sentence openers.  It also locates
     common examples of wordy phrasing and bad diction.
     The  system  is useful for evaluating a document's
     style, locating sentences that may be difficult to
     read  or excessively wordy, and determining a par-
     ticular writer's style over several documents.


_1.  _I_n_t_r_o_d_u_c_t_i_o_n

     Computers  have  become  important  in   the   document
preparation  process,  with  programs  to check for spelling
errors and to format  documents.   As  the  amount  of  text
stored on line increases, it becomes feasible and attractive
to study writing style and to attempt to help the writer  in
producing  readable  documents.  The system of writing tools
described here is a first step toward such help.  The system
includes  programs  and a data base to analyze writing style
at the word and sentence level.  We use the  term  ``style''
in this paper to describe the results of a writer's particu-
lar choices  among  individual  words  and  sentence  forms.
Although   many   judgements   of   style   are  subjective,


                     September 2, 1987


particularly those of word choice, there are some  objective
measures  that experts agree lead to good style.  Three pro-
grams have been written to measure some of  the  objectively
definable  characteristics  of writing style and to identify
some commonly misused or unnecessary  phrases.   Although  a
document  that  conforms  to  the  stylistic  rules  is  not
guaranteed to be coherent and readable,  one  that  violates
all  of  the  rules  is likely to be difficult or tedious to
read.  The program STYLE  calculates  readability,  sentence
length  variability,  sentence type, word usage and sentence
openers at a rate  of  about  400  words  per  second  on  a
PDP11/70 running the UNIX*  Operating  System.   It  assumes
that the sentences are well-formed, i. e. that each sentence
has a verb and that the subject and verb  agree  in  number.
DICTION  identifies  phrases  that  are  either bad usage or
unnecessarily wordy.  EXPLAIN acts as a  thesaurus  for  the
phrases found by DICTION.  Sections 2, 3, and 4 describe the
programs; Section 5 gives the results on a cross-section  of
technical  documents; Section 6 discusses accuracy and prob-
lems; Section 7 gives implementation details.

_2.  _S_T_Y_L_E

     The program STYLE reads a document and prints a summary
of  readability  indices,  sentence  length  and  type, word
usage, and sentence openers.  It may also be used to  locate
all  sentences  in a document longer than a given length, of
readability index higher than a given number, those contain-
ing  a  passive  verb, or those beginning with an expletive.
STYLE is based  on  the  system  for  finding  English  word
classes  or  parts  of speech, PARTS [1].  PARTS is a set of
programs that uses a small dictionary (about 350 words)  and
suffix  rules  to  partially  assign word classes to English
text.  It then uses experimentally  derived  rules  of  word
order  to  assign word classes to all words in the text with
an accuracy of about 95%.  Because PARTS uses only  a  small
dictionary  and  general  rules,  it works on text about any
subject, from physics to psychology.   Style  measures  have
been  built  into the output phase of the programs that make
up PARTS.  Some of the measures are simple counters  of  the
word classes found by PARTS; many are more complicated.  For
example, the verb count is the total number of verb phrases.
This includes phrases like:

        has been going
        was only going
        to go

each of which counts as one verb.  Figure 1  shows  the
output of STYLE run on a paper by Kernighan and Mashey about
the UNIX programming environment [2].  As the example shows,
__________________________
* UNIX is a Trademark of Bell Laboratories.


programming environment
readability grades:
        (Kincaid) 12.3  (auto) 12.8  (Coleman-Liau) 11.8  (Flesch) 13.5 (46.3)
sentence info:
        no. sent 335 no. wds 7419
        av sent leng 22.1 av word leng 4.91
        no. questions 0 no. imperatives 0
        no. nonfunc wds 4362  58.8%   av leng 6.38
        short sent (<17) 35% (118) long sent (>32)  16% (55)
        longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
sentence types:
        simple  34% (114) complex  32% (108)
        compound  12% (41) compound-complex  21% (72)
word usage:
        verb types as % of total verbs
        tobe  45% (373) aux  16% (133) inf  14% (114)
        passives as % of non-inf verbs  20% (144)
        types as % of total
        prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
        noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
        nominalizations   2 % (155)
sentence beginnings:
        subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot  67%
        prep  12% (39) adv   9% (31)
        verb   0% (1)  sub_conj   6% (20) conj   1% (5)
        expletives   4% (13)


                          Figure 1

STYLE output is in five parts.  After a brief discussion  of
sentences, we will describe the parts in order.

_2._1.  _W_h_a_t _i_s _a _s_e_n_t_e_n_c_e?

     Readers of documents have little trouble deciding where
the sentences end.  People don't even have to stop and think
about uses of the  character  ``.''  in  constructions  like
1.25,  A. J. Jones, Ph.D., i. e., or etc. .  When a computer
reads a document, finding the end of  sentences  is  not  as
easy.  First we must throw away the printer's marks and for-
matting commands that litter  the  text  in  computer  form.
Then STYLE defines a sentence as a string of words ending in
one of:

         . ! ? /.

The end marker ``/.'' may be used to indicate an  imperative
sentence.   Imperative  sentences that are not so marked are
not  identified  as  imperative.   STYLE  properly   handles
numbers  with embedded decimal points and commas, strings of
letters and numbers with embedded decimal  points  used  for
naming  computer  file  names,  and the common abbreviations
listed in Appendix 1.  Numbers that end sentences, like  the
preceding  sentence, cause a sentence break if the next word
begins with a capital letter.  Initials only  cause  a  sen-
tence  break  if  the next word begins with a capital and is
found in the dictionary of function words used by PARTS.  So
the string

        J. D. JONES

does not cause a break, but the string

         ... system H.  The ...

does.  With these rules most sentences  are  broken  at  the
proper place, although occasionally either two sentences are
called one or a fragment is called a sentence.  More on this
later.
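
     As a rough illustration, the boundary rules above can be
sketched in Python.  This is not the STYLE implementation.  The
sets ABBREVIATIONS and FUNCTION_WORDS are small hypothetical
stand-ins for the Appendix 1 list and the PARTS function-word
dictionary, and the sketch assumes the formatting commands have
already been stripped from the text.

```python
import re

# Hypothetical stand-ins: the real abbreviation list (Appendix 1) and the
# PARTS function-word dictionary are not reproduced here.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "ph.d.", "dr.", "mr."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "in", "of", "and", "but"}


def is_break(token, next_token):
    """Decide whether `token` ends a sentence, given the token that follows."""
    if token.endswith(("!", "?", "/.")):
        return True
    if not token.endswith("."):
        return False                      # "1.25" or "a.out" has no final period
    if next_token is None:
        return True                       # end of text
    if token.lower() in ABBREVIATIONS:
        return False
    next_word = next_token.lstrip("\"'([").rstrip(",;:\"')]")
    next_is_capitalized = next_word[:1].isupper()
    core = token[:-1]
    if re.fullmatch(r"[A-Za-z]", core):   # an initial such as "J." or "H."
        return next_is_capitalized and next_word.lower() in FUNCTION_WORDS
    if re.fullmatch(r"[\d.,]+", core):    # a number that may end a sentence
        return next_is_capitalized
    return True


def split_sentences(text):
    """Split whitespace-separated text into sentences using is_break()."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if is_break(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these stand-in lists, ``J. D. Jones'' stays inside one
sentence, while ``system H.  The'' produces a break because
``the'' is a function word.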

_2._2.  _R_e_a_d_a_b_i_l_i_t_y _G_r_a_d_e_s

     The first section of STYLE output consists of four read-
ability indices.  As Klare points out in [3], readability
indices may be used to estimate the reading skills needed by
the  reader  to  understand  a  document.   The  readability
indices reported by STYLE are based on measures of  sentence
and  word  lengths.   Although  the  indices may not measure
whether the document is coherent and well organized, experi-
ence  has  shown  that high indices seem to be indicators of
stylistic difficulty.  Documents with  short  sentences  and
short  words  have low scores; those with long sentences and
many polysyllabic words have high scores.   The  4  formulae
reported  are  Kincaid  Formula  [4],  Automated Readability
Index [5], Coleman-Liau Formula [6] and a normalized version
of  Flesch  Reading  Ease  Score  [7].   The formulae differ
because they  were experimentally  derived  using  different
texts  and subject groups.  We will discuss each of the for-
mulae briefly; for a more  detailed  discussion  the  reader
should see [3].

     The Kincaid Formula, given by:

    _R_e_a_d_i_n_g___G_r_a_d_e_=11.8_*_s_y_l___p_e_r___w_d_+_.39_*_w_d_s___p_e_r___s_e_n_t_-15.59

was based on Navy training manuals that ranged in difficulty
from 5.5 to 16.3 in reading grade level.  The score reported
by this formula tends to  be  in  the  mid-range  of  the  4
scores.   Because  it  is  based  on  adult training manuals
rather than school book text, this formula is  probably  the
best one to apply to technical documents.

     The Automated Readability Index (ARI),  based  on  text
from grades 0 to 7, was derived to be easy to automate.  The
formula is:

    _R_e_a_d_i_n_g___G_r_a_d_e_=4.71_*_l_e_t___p_e_r___w_d_+_.5_*_w_d_s___p_e_r___s_e_n_t_-21.43

ARI tends to produce scores that are higher than Kincaid and
Coleman-Liau but are usually slightly lower than Flesch.


     The Coleman-Liau Formula, based on text ranging in dif-
ficulty from .4 to 16.3, is:

   _R_e_a_d_i_n_g___G_r_a_d_e_=5.89_*_l_e_t___p_e_r___w_d_-_.3_*_s_e_n_t___p_e_r__100___w_d_s_-15.8

Of the four formulae this one usually gives the lowest grade
when applied to technical documents.

     The last formula, the Flesch  Reading  Ease  Score,  is
based  on  grade  school  text covering grades 3 to 12.  The
formula, given by:

  _R_e_a_d_i_n_g___S_c_o_r_e_=206.835_-84.6_*_s_y_l___p_e_r___w_d_-1.015_*_w_d_s___p_e_r___s_e_n_t

is usually reported in the range 0 (very difficult)  to  100
(very  easy).   The  score reported by STYLE is scaled to be
comparable to the other formulas, except  that  the  maximum
grade level reported is set to 17.  The Flesch score is usu-
ally the highest of the 4 scores on technical documents.
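
     The four formulae can be written directly as functions.  The
following Python sketch uses the coefficients given above; the
function names are illustrative, and it assumes that the ``av
word leng'' printed in Figure 1 is letters per word.  The scaling
that STYLE applies to the Flesch score is not specified here, so
only the raw Reading Ease Score is computed.

```python
def kincaid(syl_per_wd, wds_per_sent):
    """Kincaid reading grade."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


def automated_readability_index(let_per_wd, wds_per_sent):
    """Automated Readability Index (ARI) reading grade."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


def coleman_liau(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Raw Flesch Reading Ease Score (0 = very difficult, 100 = very easy)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Averages printed in Figure 1 (already rounded by STYLE).
av_word_leng = 4.91
av_sent_leng = 22.1

ari = automated_readability_index(av_word_leng, av_sent_leng)
cl = coleman_liau(av_word_leng, 100 / av_sent_leng)
print(f"ARI {ari:.1f}, Coleman-Liau {cl:.1f}")
```

Run on the rounded averages from Figure 1, this prints ARI 12.7
and Coleman-Liau 11.8, against the 12.8 and 11.8 that STYLE
reports; the small gap in ARI may come from rounding in the
printed averages.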

     Coke [8] found that the Kincaid Formula is probably the
best  predictor for technical documents; both ARI and Flesch
tend to overestimate the difficulty; Coleman-Liau tends to
underestimate.   On  text  in the range of grades 7 to 9 the
four formulas tend to be about the same.  On easy  text  the
Coleman-Liau  formula is probably preferred since it is rea-
sonably accurate at the lower grades  and  it  is  safer  to
present  text  that  is  a little too easy than a little too
hard.

     If a document has particularly difficult technical con-
tent,  especially if it includes a lot of mathematics, it is
probably best to make the text very easy to read, that is, to
lower the readability index by shortening the sentences and
words.  This will allow the reader  to  concentrate  on  the
technical  content  and  not  the  long sentences.  The user
should remember that  these  indices  are  estimators;  they
should  not be taken as absolute numbers.  STYLE called with
``-r number'' will print all  sentences  with  an  Automated
Readability Index equal to or greater than ``number''.

_2._3.  _S_e_n_t_e_n_c_e _l_e_n_g_t_h _a_n_d _s_t_r_u_c_t_u_r_e

     The next two sections of STYLE output  deal  with  sen-
tence  length  and  structure.   Almost all books on writing
style or  effective  writing  emphasize  the  importance  of
variety  in  sentence length and structure for good writing.
Ewing's first rule in discussing style in the  book  Writing
for Results [9] is:

        ``Vary the sentence structure and length of your sentences.''

Leggett,  Mead  and  Charvat  break  this  rule  into  3  in
Prentice-Hall Handbook for Writers [10] as follows:


        ``34a. Avoid the overuse of short simple sentences.''
        ``34b. Avoid the overuse of long compound sentences.''
        ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''

Although experts agree that these rules are  important,  not
all  writers  follow  them.  Sample technical documents have
been found with almost no sentence length or type  variabil-
ity.   One  document had 90% of its sentences about the same
length as the average; another was made up  almost  entirely
of simple sentences (80%).

     The  output  sections  labeled  ``sentence  info''  and
``sentence  types'' give both length and structure measures.
STYLE reports on the number and average length of both  sen-
tences  and  words,  and  number of questions and imperative
sentences (those ending in ``/.'').  The  measures  of  non-
function  words  are an attempt to look at the content words
in the document.  In English non-function words  are  nouns,
adjectives, adverbs, and non-auxiliary verbs; function words
are  prepositions,  conjunctions,  articles,  and  auxiliary
verbs.   Since  most  function words are short, they tend to
lower the average word length.  The average length  of  non-
function  words  may  be a more useful measure for comparing
word choice of different writers than the total average word
length.  The percentages of short and long sentences measure
sentence length variability.  Short sentences are  those  at
least  5  words  less  than  the average; long sentences are
those at least 10 words longer than the  average.   Last  in
the  sentence information section is the length and location
of the longest and shortest sentences.   If  the  flag  ``-l
number'' is used, STYLE will print all sentences longer than
``number''.
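
     The short and long sentence percentages follow directly from
the definition above.  The Python sketch below takes a list of
sentence lengths in words; the function name is illustrative, and
the treatment of sentences falling exactly on a threshold is an
assumption (the labels ``<17'' and ``>32'' in Figure 1 suggest
that STYLE may round the average first).

```python
def length_variability(sentence_lengths):
    """Return (average, percent short, percent long) for word counts.

    Short: at least 5 words below the average.
    Long:  at least 10 words above the average.
    """
    count = len(sentence_lengths)
    average = sum(sentence_lengths) / count
    short = sum(1 for n in sentence_lengths if n <= average - 5)
    long_ = sum(1 for n in sentence_lengths if n >= average + 10)
    return average, 100 * short / count, 100 * long_ / count


average, pct_short, pct_long = length_variability([5, 10, 20, 20, 20, 45])
print(f"av sent leng {average:.1f}  short {pct_short:.0f}%  long {pct_long:.0f}%")
```

For comparison, the counts in Figure 1 give 118 of 335 sentences
(35%) as short and 55 of 335 (16%) as long.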

     Because of the difficulties in dealing  with  the  many
uses  of  commas  and conjunctions in English, sentence type
definitions vary slightly from those of standard  textbooks,
but still measure the same constructional activity.

1.   A simple sentence has one verb and no dependent clause.

2.   A complex sentence has one independent clause  and  one
     dependent  clause,  each  with  one verb.  Complex sen-
     tences are found by identifying sentences that  contain
     either  a subordinate conjunction or a clause beginning
     with words like ``that''  or  ``who''.   The  preceding
     sentence has such a clause.

3.   A compound sentence has  more  than  one  verb  and  no
     dependent  clause.   Sentences joined by ``;'' are also
     counted as compound.

4.   A compound-complex sentence has either  several  depen-
     dent  clauses  or  one  dependent clause and a compound
     verb in either the dependent or independent clause.
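
     The four definitions can be summarized as a small decision
rule.  The Python sketch below assumes that the number of verb
phrases and the number of dependent clauses in a sentence have
already been found (in STYLE that work is done by PARTS); the
function name is illustrative, and treating two or more dependent
clauses as ``several'' is an assumption.

```python
def sentence_type(verb_count, dependent_clause_count):
    """Classify a sentence from its verb-phrase and dependent-clause counts."""
    if dependent_clause_count == 0:
        # No subordination: one verb is simple, more than one is compound.
        return "simple" if verb_count <= 1 else "compound"
    if dependent_clause_count == 1 and verb_count <= 2:
        # One independent and one dependent clause, each with one verb.
        return "complex"
    # Several dependent clauses, or one dependent clause plus a compound verb.
    return "compound-complex"


for verbs, clauses in [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2)]:
    print(verbs, clauses, sentence_type(verbs, clauses))
```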

     Even using these broader definitions, simple  sentences
dominate  many  of  the  technical  documents that have been
tested, but the example in Figure 1 shows  variety  in  both
sentence structure and sentence length.

_2._4.  _W_o_r_d _U_s_a_g_e

     The word usage measures are an attempt to identify some
other  constructional  features of writing style.  There are
many different ways in English to say the same  thing.   The
constructions  differ  from  one  another in the form of the
words used.  The following  sentences  all  convey  approxi-
mately the same meaning but differ in word usage:

        The cxio program is used to perform all communication between the systems.
        The cxio program performs all communications between the systems.
        The cxio program is used to communicate between the systems.
        The cxio program communicates between the systems.
        All communication between the systems is performed by the cxio program.

The  distribution of the parts of speech and verb  construc-
tions  helps  identify  overuse of particular constructions.
Although the measures used by STYLE are crude, they do point
out  problem areas.  For each category, STYLE reports a per-
centage and a raw count.  In addition to looking at the per-
centage,  the  user  may  find  it useful to compare the raw
count with the number of sentences.  If,  for  example,  the
number  of infinitives is almost equal to the number of sen-
tences, then many of the sentences in the document are  con-
structed  like the first and third in the preceding example.
The user may want to transform some of these sentences  into
another  form.   Some  of the implications of the word usage
measures are discussed below.
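
     The comparison of raw counts with the number of sentences is
simple arithmetic.  As a small illustration, the Python sketch
below divides the verb counts printed in Figure 1 by the sentence
count from the same figure; the variable names are illustrative.

```python
# Counts taken from the Figure 1 output.
sentences = 335
counts = {"tobe": 373, "aux": 133, "inf": 114, "passive": 144}

# Occurrences of each construction per sentence.
per_sentence = {name: n / sentences for name, n in counts.items()}

for name, ratio in per_sentence.items():
    print(f"{name:8s}{ratio:.2f} per sentence")
```

For the document in Figure 1 this gives about 0.34 infinitives
per sentence, well below the one-per-sentence level that would
suggest overuse of the ``is used to'' construction.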

_V_e_r_b_s are measured in  several  different  ways  to  try  to
     determine  what  types  of  verb constructions are most
     frequent in the document.  Technical writing  tends  to
     contain many passive verb constructions and other usage
     of the verb ``to be''.  The category of  verbs  labeled
     ``tobe''  measures  both  passives and sentences of the
     form:

             _s_u_b_j_e_c_t _t_o_b_e _p_r_e_d_i_c_a_t_e

     In counting verbs, whole verb phrases  are  counted  as
     one  verb.  Verb phrases containing auxiliary verbs are
     counted in the  category  ``aux''.   The  verb  phrases
     counted  here  are  those  whose  tense  is  not simple
     present or simple past.  It might eventually be  useful
     to  do  more  detailed  measures of verb tense or mood.
     Infinitives are listed  as  ``inf''.   The  percentages
     reported  for  these  three categories are based on the
     total number of verb phrases found.   These  categories
     are  not  mutually  exclusive;  they  cannot  be added,
     since, for example, ``to  be  going''  counts  as  both
     ``tobe'' and ``inf''.  Use of these three types of verb
     constructions varies significantly among authors.


     STYLE reports passive verbs as a percentage of the fin-
     ite  verbs  in  the  document.   Most  style books warn
     against the overuse of passive verbs.  Coleman [11] has
     shown  that  sentences  with active verbs are easier to
     learn than those  with  passive  verbs.   Although  the
     inverted  object-subject  order  of  the  passive voice
     seems to emphasize the  object,  Coleman's  experiments
     showed  that there is little difference in retention by
     word position. He also showed that the direct object of
     an active verb is retained better than the subject of a
     passive verb.  These experiments support the advice  of
     the  style  books suggesting that writers should try to
     use active verbs wherever possible.   The  flag  ``-p''
     causes  STYLE to print all sentences containing passive
     verbs.


_P_r_o_n_o_u_n_s add cohesiveness and connectivity to a document  by
     providing  back-reference.  They are often a short-hand
     notation for something previously mentioned, and there-
     fore  connect  the sentence containing the pronoun with
     the word to which the pronoun refers.   Although  there
     are  other  mechanisms  for such connections, documents
     with no pronouns tend to be wordy and  to  have  little
     connectivity.

_A_d_v_e_r_b_s can provide transition between sentences  and  order
     in  time  and  space.   In  performing these functions,
     adverbs,  like  pronouns,  provide   connectivity   and
     cohesiveness.

_C_o_n_j_u_n_c_t_i_o_n_s provide parallelism in a document by connecting
     two or more equal units.  These units may be whole sen-
     tences, verb phrases, nouns,  adjectives,  or  preposi-
     tional phrases.  The compound and compound-complex sen-
     tences reported under sentence type are parallel struc-
     tures.  Other uses of parallel structures are indicated
     by the degree that the number of conjunctions  reported
     under  word  usage  exceeds the compound sentence meas-
     ures.

_N_o_u_n_s _a_n_d _A_d_j_e_c_t_i_v_e_s. A ratio of nouns  to  adjectives  near
     unity  may  indicate  the  over-use of modifiers.  Some
     technical writers qualify every noun with one  or  more
     adjectives.  Qualifiers in phrases like ``simple linear
     single-link network model'' often lend  more  obscurity
     than precision to a text.


_N_o_m_i_n_a_l_i_z_a_t_i_o_n_s are verbs  that  are  changed  to  nouns  by
     adding   one   of   the  suffixes  ``ment'',  ``ance'',
     ``ence'', or  ``ion''.   Examples  are  accomplishment,
     admittance, adherence, and abbreviation.  When a writer
     transforms a nominalized sentence to a  non-nominalized
     sentence,  she/he  increases  the  effectiveness of the
     sentence in several ways.  The noun becomes  an  active
     verb  and frequently one complicated clause becomes two
     shorter clauses.  For example,

             Their inclusion of this provision is admission of the importance of the system.
             When they included this provision, they admitted the importance of the system.

     Coleman  found  that  the  transformed  sentences  were
     easier  to learn, even when the transformation produced
     sentences  that  were  slightly  longer,  provided  the
     transformation  broke one clause into two.  Writers who
     find their document contains many  nominalizations  may
     want  to  transform some of the sentences to use active
     verbs.
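
     The suffix test for nominalizations is easy to approximate.
The Python sketch below flags every word ending in one of the
four suffixes named above.  It is cruder than STYLE, which works
from the word classes assigned by PARTS: a suffix test alone also
flags words such as ``cement'' or ``station'', so the results
should be read as candidates only.  The function name is
illustrative.

```python
import re

SUFFIXES = ("ment", "ance", "ence", "ion")


def nominalization_candidates(text):
    """Return the words in `text` that end in a nominalizing suffix."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [w for w in words if w.endswith(SUFFIXES)]


before = ("Their inclusion of this provision is admission of the "
          "importance of the system.")
after = ("When they included this provision, they admitted the "
         "importance of the system.")
print(nominalization_candidates(before))
print(nominalization_candidates(after))
```

On the example pair above, the sketch finds four candidates in
the first sentence and two in the transformed one.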

_2._5.  _S_e_n_t_e_n_c_e _o_p_e_n_e_r_s

     Another agreed-upon principle of style  is  variety  in
sentence openers.  Because STYLE determines the type of sen-
tence opener by looking at the part of speech of  the  first
word  in the sentence, the sentences counted under the head-
ing ``subject opener'' may not all  really  begin  with  the
subject.   However,  a large percentage of sentences in this
category  still  indicates  lack  of  variety  in   sentence
openers.   Other  sentence  opener  measures  help  the user
determine if there are  transitions  between  sentences  and
where the subordination occurs.  Adverbs and conjunctions at
the beginning of sentences  are  mechanisms  for  transition
between  sentences.  A pronoun at the beginning shows a link
to something  previously  mentioned  and  indicates  connec-
tivity.

     The location of subordination can be determined by com-
paring  the number of sentences that begin with a subordina-
tor with the number of sentences with complex  clauses.   If
few  sentences  start with subordinate conjunctions then the
subordination is embedded or at the end of the complex  sen-
tences.   For  variety the writer may want to transform some
sentences to have leading subordination.

     The last category of openers, expletives,  is  commonly
overworked  in  technical writing.  Expletives are the words
``it'' and ``there'', usually with the verb  ``to  be'',  in
constructions where the subject follows the verb.  For exam-
ple,

        There are three streets used by the traffic.
        There are too many users on this system.


This construction tends to emphasize the object rather  than
the  subject  of  the  sentence.  The flag ``-e'' will cause
STYLE to print all sentences that begin with an expletive.
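
     A crude test for expletive openers can be written without
any part-of-speech information.  The Python sketch below is an
approximation, not the method STYLE uses: it flags a sentence
when its first word is ``it'' or ``there'' and its second word is
a form of ``to be''.  The TO_BE set and the function name are
assumptions, and the test will also flag sentences in which
``it'' is an ordinary pronoun.

```python
import re

TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}


def begins_with_expletive(sentence):
    """True if the sentence opens with "it" or "there" plus a form of "to be"."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    return len(words) >= 2 and words[0] in {"it", "there"} and words[1] in TO_BE


for s in ["There are three streets used by the traffic.",
          "There are too many users on this system.",
          "The traffic uses three streets."]:
    print(begins_with_expletive(s), s)
```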

_3.  _D_I_C_T_I_O_N

     The program DICTION prints all sentences in a  document
containing  phrases  that  are  either frequently misused or
indicate wordiness.  The  program,  an  extension  of  Aho's
FGREP [12] string matching program, takes as input a file of
phrases or patterns to be matched and a file of text  to  be
searched.   A  data  base of about 450 phrases has been com-
piled  as  a  default  pattern  file  for  DICTION.   Before
attempting  to  locate  phrases, the program maps upper case
letters to lower case and substitutes  blanks  for  punctua-
tion.  Sentence boundaries were deemed less critical in DIC-
TION than in STYLE, so abbreviations and other uses  of  the
character ``.'' are not treated specially.  DICTION brackets
all pattern matches in a sentence with the characters ``[''
and ``]''.  Although many of the phrases in the default data
base are correct in some contexts, in others  they  indicate
wordiness.   Some  examples  of  the  phrases  and suggested
alternatives are:


        Phrase                  Alternative
        a large number of       many
        arrive at a decision    decide
        collect together        collect
        for this reason         so
        pertaining to           about
        through the use of      by or with
        utilize                 use
        with the exception of   except

Appendix 2 contains a complete list  of  the  default  file.
Some of the entries are short forms of problem phrases.  For
example, the phrase ``the fact'' is found in all of the fol-
lowing  and  is sufficient to point out the wordiness to the
user:


        Phrase                                  Alternative
        accounted for by the fact that          caused by
        an example of this is the fact that     thus
        based on the fact that                  because
        despite the fact that                   although
        due to the fact that                    because
        in light of the fact that               because
        in view of the fact that                since
        notwithstanding the fact that           although


Entries in Appendix 2 preceded by  ``~''  are  not  matched.
See Section 7 for details on the use of ``~''.

     The user may supply her/his own pattern file  with  the
flag  ``-f patfile''.  In this case the default file will be
loaded first, followed by the  user  file.   This  mechanism
allows  users  to suppress patterns contained in the default
file or to include their own pet peeves that are not in  the
default file.  The flag ``-n'' will exclude the default file
altogether.  In constructing a pattern file,  blanks  should
be  used before and after each phrase to avoid matching sub-
strings in words.  For example, to find all  occurrences  of
the word ``the'', the pattern `` the '' should be used.  The
blanks cause only the word ``the'' to be matched and not the
string  ``the''  in  words like there, other, and therefore.
One side effect of surrounding the words with blanks is that
when  two  phrases occur without intervening words, only the
first will be matched.
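
     The matching behavior described in this section can be
imitated in a few lines of Python.  The sketch below is not the
FGREP-based implementation.  It maps the text to lower case,
replaces punctuation with blanks, and scans left to right for
blank-delimited patterns; collapsing runs of blanks and trying
the longest pattern first are assumptions made here, and the
function names are illustrative.  Because a match consumes its
trailing blank, a second phrase that follows immediately is not
found, which reproduces the side effect noted above.

```python
import string

# Translation table that turns every punctuation character into a blank.
_PUNCT_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lower-case the text, blank out punctuation, pad with single blanks."""
    return " " + " ".join(text.lower().translate(_PUNCT_TO_BLANK).split()) + " "


def bracket_matches(sentence, patterns):
    """Return the normalized sentence with each pattern match in brackets.

    Each pattern is expected to carry a leading and a trailing blank,
    for example " the ".
    """
    text = normalize(sentence)
    ordered = sorted(patterns, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        hit = next((p for p in ordered if text.startswith(p, i)), None)
        if hit is None:
            out.append(text[i])
            i += 1
        else:
            out.append(" [" + hit.strip() + "] ")
            i += len(hit)          # the trailing blank is consumed here
    return "".join(out).strip()


patterns = [" utilize ", " a large number of ", " the "]
print(bracket_matches("We utilize the tool.", [" utilize "]))
print(bracket_matches("There is the other.", patterns))
print(bracket_matches("They utilize a large number of disks.", patterns))
```

In the last call only ``utilize'' is bracketed, since ``a large
number of'' follows it with no intervening word.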

_4.  _E_X_P_L_A_I_N

     The last program, EXPLAIN, is an interactive  thesaurus
for  phrases  found  by  DICTION.  The user types one of the
phrases bracketed by DICTION and EXPLAIN responds with  sug-
gested  substitutions  for  the phrase that will improve the
diction of the document.
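
     In outline, such a thesaurus is a table lookup.  The Python
sketch below is only an illustration of that idea: the table
holds the eight sample phrases and alternatives listed in
Section 3 rather than the full data base, and the function name
and fallback message are assumptions.

```python
# Sample phrase/alternative pairs from the table in Section 3.
SUGGESTIONS = {
    "a large number of": "many",
    "arrive at a decision": "decide",
    "collect together": "collect",
    "for this reason": "so",
    "pertaining to": "about",
    "through the use of": "by or with",
    "utilize": "use",
    "with the exception of": "except",
}


def explain(phrase):
    """Return the suggested substitution for a phrase bracketed by DICTION."""
    key = " ".join(phrase.lower().split())
    return SUGGESTIONS.get(key, "no suggestion recorded")


print(explain("utilize"))
print(explain("With the  exception of"))
```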

_5.  _R_e_s_u_l_t_s

_5._1.  _S_T_Y_L_E

     To get baseline  statistics  and  check  the  program's
accuracy,  we  ran  STYLE  on 20 technical documents.  There
were a total of 3287 sentences in the sample.  The  shortest
document  was  67 sentences long; the longest 339 sentences.
The documents  covered  a  wide  range  of  subject  matter,
including   theoretical   computing,   physics,  psychology,
engineering, and affirmative  action.   Table  1  gives  the
range,  median,  and standard deviation of the various style


                          Table 1
         Text Statistics on 20 Technical Documents

                                                                      standard
variable                                  minimum  maximum     mean  deviation
------------------------------------------------------------------------------
readability       Kincaid                     9.5     16.9     13.3        2.2
                  automated                   9.0     17.4     13.3        2.5
                  Cole-Liau                  10.0     16.0     12.7        1.8
                  Flesch                      8.9     17.0     14.4        2.2

sentence info.    av sent length             15.5     30.3     21.6        4.0
                  av word length             4.61     5.63     5.08        .29
                  av nonfunction length      5.72     7.30     6.52        .45
                  short sent                  23%      46%      33%        5.9
                  long sent                    7%      20%      14%        2.9

sentence types    simple                      31%      71%      49%       11.4
                  complex                     19%      50%      33%        8.3
                  compound                     2%      14%       7%        3.3
                  compound-complex             2%      19%      10%        4.8

verb types        tobe                        26%      64%    44.7%       10.3
                  auxiliary                   10%      40%      21%        8.7
                  infinitives                  8%      24%    15.1%        4.8
                  passives                    12%      50%      29%        9.3

word usage        prepositions              10.1%    15.0%    12.3%        1.6
                  conjunctions               1.8%     4.8%     3.4%         .9
                  adverbs                    1.2%     5.0%     3.4%        1.0
                  nouns                     23.6%    31.6%    27.8%        1.7
                  adjectives                15.4%    27.1%    21.1%        3.4
                  pronouns                   1.2%     8.4%     2.5%        1.1
                  nominalizations              2%       5%     3.3%         .8

sentence openers  prepositions                 6%      19%      12%        3.4
                  adverbs                      0%      20%       9%        4.6
                  subject                     56%      85%      70%        8.0
                  verbs                        0%       4%       1%        1.0
                  subordinating conj           1%      12%       5%        2.7
                  conjunctions                 0%       4%       0%        1.5
                  expletives                   0%       6%       2%        1.7


measures.  As you will note, most of the measurements have a
fairly wide range of values across the sample documents.

     As a comparison, Table 2 gives the median  results  for
two  different  technical authors, a sample of instructional
material, and a sample of the Federalist  Papers.   The  two
authors show similar styles, although author 2 uses somewhat
shorter sentences and longer words than author 1.  Author  1
uses  all  types of sentences, while author 2 prefers simple
and complex  sentences,  using  few  compound  or  compound-
complex sentences.  The other major difference in the styles
of these authors is the location of subordination.  Author 1
seems  to  prefer  embedded or trailing subordination, while
author 2 begins many sentences with the subordinate  clause.
The documents tested for both authors 1 and 2 were technical
documents, written for a technical audience.   The  instruc-
tional  documents,  which are written for craftspeople, vary
surprisingly little from the  two  technical  samples.   The
sentences  and  words  are a little longer, and they contain
many passive and auxiliary verbs, few adverbs, and almost no
pronouns.   The instructional documents contain many impera-
tive sentences, so there are many sentences with verb
openers.  The sample of Federalist Papers contrasts with the
other samples in almost every way.

_5._2.  _D_I_C_T_I_O_N

     In the few weeks that DICTION  has  been  available  to
users, about 35,000 sentences have been run with about 5,000
string matches.  The authors using the program seem to  make
the  suggested  changes  about 50-75% of the time.  To date,
almost 200 of the 450 strings in the default file have  been
matched.   Although  most  of  these  phrases  are valid and
correct in some contexts, the 50-75% change  rate  seems  to
show  that the phrases are used much more often than concise
diction warrants.

_6.  _A_c_c_u_r_a_c_y

_6._1.  _S_e_n_t_e_n_c_e _I_d_e_n_t_i_f_i_c_a_t_i_o_n

     The correctness of the STYLE output on the  20-document
sample  was checked in detail.  STYLE misidentified 129 sen-
tence fragments as sentences and incorrectly joined  two  or
more  sentences  75  times in the 3287-sentence sample.  The
problems were usually caused by nonstandard formatting  com-
mands, unknown abbreviations, or lists of non-sentences.  An
impossibly long sentence reported as the longest sentence in
a  document  is  usually the result of a long list of  non-
sentences.
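
     The counts above can be turned into rough error  rates.
The  short  Python  sketch below is illustrative only; using
the 3287-sentence total as the denominator for  both  counts
is  an assumption, since the text does not say how the rates
should be normalized.

```python
# Counts quoted in the text for the 20-document sample.
SENTENCES = 3287                # sentences in the sample
FRAGMENTS_AS_SENTENCES = 129    # fragments wrongly reported as sentences
JOINED_SENTENCES = 75           # times two or more sentences were joined


def rate(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Assumption: both kinds of error are measured against the same total.
fragment_rate = rate(FRAGMENTS_AS_SENTENCES, SENTENCES)
join_rate = rate(JOINED_SENTENCES, SENTENCES)
combined_rate = rate(FRAGMENTS_AS_SENTENCES + JOINED_SENTENCES, SENTENCES)

print(f"fragments reported as sentences: {fragment_rate:.1f}%")
print(f"incorrectly joined sentences:    {join_rate:.1f}%")
print(f"combined:                        {combined_rate:.1f}%")
```

Under that assumption, roughly six  sentence  boundaries  in
every hundred are affected by one of the two problems.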

_6._2.  _S_e_n_t_e_n_c_e _T_y_p_e_s

                                     Table 2
                        Text Statistics on Single Authors

                  variable                  author 1  author 2     inst.       FED
----------------------------------------------------------------------------------
readability       Kincaid                     11.0      10.3      10.8      16.3
                  automated                   11.0      10.3      11.9      17.8
                  Coleman-Liau                 9.3      10.1      10.2      12.3
                  Flesch                      10.3      10.7      10.1      15.0
----------------------------------------------------------------------------------
sentence info     av sent length             22.64     19.61     22.78     31.85
                  av word length              4.47      4.66      4.65      4.95
                  av nonfunction length       5.64      5.92      6.04      6.87
                  short sent                   35%       43%       35%       40%
                  long sent                    18%       15%       16%       21%
----------------------------------------------------------------------------------
sentence types    simple                       36%       43%       40%       31%
                  complex                      34%       41%       37%       34%
                  compound                     13%        7%        4%       10%
                  compound-complex             16%        8%       14%       25%
----------------------------------------------------------------------------------
verb type         tobe                         42%       43%       45%       37%
                  auxiliary                    17%       19%       32%       32%
                  infinitives                  17%       15%       12%       21%
                  passives                     20%       19%       36%       20%
----------------------------------------------------------------------------------
word usage        prepositions               10.0%     10.8%     12.3%     15.9%
                  conjunctions                3.2%      2.4%      3.9%      3.4%
                  adverbs                    5.05%      4.6%      3.5%      3.7%
                  nouns                      27.7%     26.5%     29.1%     24.9%
                  adjectives                 17.0%     19.0%     15.4%     12.4%
                  pronouns                    5.3%      4.3%      2.1%      6.5%
                  nominalizations               1%        2%        2%        3%
----------------------------------------------------------------------------------
sentence openers  prepositions                 11%       14%        6%        5%
                  adverbs                       9%        9%        6%        4%
                  subject                      65%       59%       54%       66%
                  verb                          3%        2%       14%        2%
                  subordinating conj            8%       14%       11%        3%
                  conjunction                   1%        0%        0%        3%
                  expletives                    3%        3%        0%        3%


     STYLE correctly identified sentence type  on  86.5%  of
the  sentences  in the sample.  The type distribution of the
sentences was 52.5% simple, 29.9% complex, 8.5% compound and
9%  compound-complex.   The  program  reported 49.5% simple,
31.9%  complex,  8%  compound  and  10.4%  compound-complex.
Looking  at  the  errors  on  the  individual documents, the
number of simple sentences was under-reported  by  about  4%
and  the  complex and compound-complex were over-reported by
3% and 2%, respectively.  The  following  matrix  shows  the
program's output vs. the actual sentence type.


                                      Program Results
                             simple   complex  compound  comp-complex
Actual     simple              1566       132        49            17
Sentence   complex               47       892         6            65
Type       compound              40         6       207            23
           comp-complex           0        52         5           249
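
     The matrix can be summarized mechanically.  The  Python
sketch  below  pools  all  of the entries and recomputes the
agreement and the two type distributions.  It is a check  on
the  arithmetic, not the procedure used for the paper: the
pooled figures differ slightly from the percentages  quoted
in  the  text (the entries sum to 3356 sentences and give a
pooled agreement of about 86.8%), which  would  be  expected
if  the  quoted  figures were computed per document and then
combined; that explanation is an assumption.

```python
# Rows are the actual sentence types, columns are the types the program
# reported, both in the order simple, complex, compound, comp-complex.
TYPES = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]


def summarize(matrix):
    """Return (total, correct, row sums, column sums) for a square matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
    actual = [sum(row) for row in matrix]                    # true type counts
    reported = [sum(col) for col in zip(*matrix)]            # program's counts
    return total, correct, actual, reported


total, correct, actual, reported = summarize(MATRIX)
print(f"pooled agreement: {100 * correct / total:.1f}%")
for name, a, r in zip(TYPES, actual, reported):
    print(f"{name:13s} actual {100 * a / total:5.1f}%   "
          f"reported {100 * r / total:5.1f}%")
```

The off-diagonal entries show where the errors  concentrate:
the  largest  single  confusion  is  simple sentences being
reported as complex (132 cases).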


     The system's inability  to  find  imperative  sentences
seems to have little effect on most of the style statistics.
A document with half of its sentences  imperative  was  run,
with  and  without  the  imperative end marker.  The results
were identical except for the expected errors of not finding
verbs  as sentence openers, not counting the imperative sen-
tences, and a slight difference (1%) in the number of  nouns
and adjectives reported.

_6._3.  _W_o_r_d _U_s_a_g_e

     The accuracy of identifying word types reflects that of
PARTS,  which  is  about 95% correct.  The largest source of
confusion is between nouns and adjectives.  The verb  counts
were  checked  on  about 20 sentences from each document and
found to be about 98% correct.

_7.  _T_e_c_h_n_i_c_a_l _D_e_t_a_i_l_s

_7._1.  _F_i_n_d_i_n_g _S_e_n_t_e_n_c_e_s

     The formatting commands embedded in the  text  increase
the  difficulty  of  finding  sentences.   Not all text in a
document is in sentence form; there  are  headings,  tables,
equations  and  lists, for example.  Headings like ``Finding
Sentences'' above should be discarded, not attached  to  the
next  sentence.   However,  since  many of the documents are
formatted to be  phototypeset,  and  contain  font  changes,
which  usually  operate  on  the most important words in the
document, discarding all formatting commands is not correct.
To  improve  the  programs'  ability  to find sentence boun-
daries, the deformatting  program,  DEROFF  [13],  has  been
given  some knowledge of the formatting packages used on the
UNIX operating system.  DEROFF will now do the following:

1.   Suppress  all  formatting  macros  that  are  used  for
     titles, headings, author's name, etc.

2.   Suppress the arguments to the macros for titles,  head-
     ings, author's name, etc.

3.   Suppress displays, tables, footnotes and text  that  is
     centered or in no-fill mode.

4.   Substitute a place holder for equations and  check  for
     hidden  end  markers.   The  place  holder is necessary
     because many  typists  and  authors  use  the  equation
     setter  to  change  fonts on important words.  For this
     reason, header files containing the definition  of  the
     EQN delimiters must also be included as input to STYLE.
     End markers are often hidden when an  equation  ends  a
     sentence  and the period is typed inside the EQN delim-
     iters.

5.   Add a "." after lists.  If the flag -ml is  also  used,
     all  lists  are  suppressed.   This  is a separate flag
     because of the variety of  ways  the  list  macros  are
     used.   Often,  lists  are  sentences  that  should  be
     included in the analysis.  The user must determine  how
     lists are used in the document to be analyzed.

     Both STYLE and DICTION call DEROFF before they look  at
the  text.  The user should supply the -ml flag if the docu-
ment contains many lists of  non-sentences  that  should  be
skipped.
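
     The kind of filtering described above can be  sketched
in  a  few  lines of Python.  This is a much-simplified and
hypothetical stand-in, not DEROFF itself: the  macro  names
are common -ms requests chosen for illustration, the place
holder string is invented, and EQN delimiters,  hidden  end
markers and the -ml list handling are not modeled.

```python
# Hypothetical, simplified stand-in for DEROFF (assumed -ms macro names).
HEADING_MACROS = {".TL", ".AU", ".AI", ".SH", ".NH"}   # titles, authors, headings
BLOCK_PAIRS = {                                        # start macro -> end macro
    ".DS": ".DE",   # displays
    ".TS": ".TE",   # tables
    ".FS": ".FE",   # footnotes
    ".nf": ".fi",   # no-fill text
}
EQUATION_PLACEHOLDER = "EQN"   # invented place holder for an equation


def strip_formatting(lines):
    """Return only the lines that should be analyzed as running text."""
    kept = []
    skip_until = None     # end macro of the block currently being suppressed
    in_heading = False    # True while skipping the text of a title or heading
    for line in lines:
        macro = line.split()[0] if line.startswith(".") else None
        if skip_until is not None:
            if macro == skip_until:
                skip_until = None
            continue
        if macro in BLOCK_PAIRS:
            skip_until = BLOCK_PAIRS[macro]
        elif macro == ".EQ":
            kept.append(EQUATION_PLACEHOLDER)   # keep a token where the equation was
            skip_until = ".EN"
        elif macro in HEADING_MACROS:
            in_heading = True                   # suppress the macro and its text
        elif macro is not None:
            in_heading = False                  # any other macro ends a heading
        elif not in_heading:
            kept.append(line)
    return kept


sample = [
    ".TL", "Writing Tools", ".AU", "L. L. Cherry",
    ".SH", "Finding Sentences", ".PP",
    "The formatting commands increase the difficulty.",
    ".DS", "a display line", ".DE",
    "Not all text is in sentence form.",
    ".EQ", "x sup 2", ".EN", "ends the paragraph.",
]
cleaned = strip_formatting(sample)
print("\n".join(cleaned))
```

Keeping a place holder for the equation, rather  than  drop-
ping  it,  preserves  the word position when the equation is
really an emphasized word in the middle of a sentence.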

_7._2.  _D_e_t_a_i_l_s _o_f _D_I_C_T_I_O_N

     The program DICTION is based  on  the  string  matching
program  FGREP.   FGREP takes as input a file of patterns to
be matched and a file to be searched and outputs  each  line
that  contains  any  of  the  patterns with no indication of
which pattern was matched.  The following changes have  been
made to FGREP:

1.   The basic unit that DICTION operates on is  a  sentence
     rather than a line.  Each sentence that contains one of
     the patterns is output.

2.   Upper case letters are mapped to lower case.

3.   Punctuation is replaced by blanks.

4.   All pattern matches in the sentence are found and  sur-
     rounded with ``['' and ``]''.

5.   A method for suppressing a string match has been added.
     Any pattern that begins with ``~'' will not be matched.
     Because the matching algorithm finds the  longest  sub-
     string, the suppression of a match allows words in some
     correct contexts not to be matched while  allowing  the
     word  in another context to be found.  For example, the
     word ``which'' is often  incorrectly  used  instead  of
     ``that'' in restrictive clauses.  However, ``which'' is
     usually correct  when  preceded  by  a  preposition  or
     ``,''.   The  default pattern file suppresses the match
     of the common prepositions or a double  blank  followed
     by  ``which''  and  therefore  matches only the suspect
     uses.  The  double  blank  accounts  for  the  replaced
     comma.
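
     The five changes listed above can be illustrated with a
short  Python  sketch.   It is not the DICTION source: it
replaces FGREP's matching machinery with a  plain  left-to-
right  scan  that  takes the longest pattern at each posi-
tion, and it assumes that patterns carry a leading blank as
a  word  boundary,  as the double blank in the suppression
pattern for ``which'' suggests.  The function names and the
four sample patterns are chosen for illustration only.

```python
import string


def normalize(sentence):
    """Map upper case to lower case and punctuation to blanks (same length)."""
    chars = []
    for ch in sentence:
        if ch in string.punctuation:
            chars.append(" ")
        elif ch in string.ascii_uppercase:
            chars.append(ch.lower())
        else:
            chars.append(ch)
    return "".join(chars)


def mark_sentence(sentence, patterns):
    """Bracket every flagged pattern; patterns starting with '~' suppress."""
    flagged = [p for p in patterns if not p.startswith("~")]
    suppressed = [p[1:] for p in patterns if p.startswith("~")]
    text = " " + normalize(sentence) + " "    # pad so boundary blanks exist
    spans = []
    i = 0
    while i < len(text):
        best, flag_it = "", False
        # Longest pattern starting here wins; a suppressor wins a tie.
        for group, is_flag in ((suppressed, False), (flagged, True)):
            for p in group:
                if len(p) > len(best) and text.startswith(p, i):
                    best, flag_it = p, is_flag
        if not best:
            i += 1
            continue
        if flag_it:
            start, end = i, i + len(best)
            while start < end and text[start] == " ":    # trim boundary blanks
                start += 1
            while end > start and text[end - 1] == " ":
                end -= 1
            if start < end:
                spans.append((start - 1, end - 1))       # undo the padding offset
        i += len(best)                                   # skip past the match
    out, prev = [], 0
    for start, end in spans:
        out.append(sentence[prev:start])
        out.append("[" + sentence[start:end] + "]")
        prev = end
    out.append(sentence[prev:])
    return "".join(out)


PATTERNS = [" which", "~  which", "~ in which", " in order to"]
suspect = "The file which holds patterns is read in order to build the table."
accepted = "The table, which is sorted, lists the file in which patterns are kept."
print(mark_sentence(suspect, PATTERNS))
print(mark_sentence(accepted, PATTERNS))
```

In the second sentence the comma becomes a blank, so  the
double-blank  suppressor  consumes  ``, which'' before the
shorter flagged pattern can match, and the preposition sup-
pressor  does  the  same  for  ``in which''; the sentence
comes back unmarked.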

_8.  _C_o_n_c_l_u_s_i_o_n_s

     A system of writing tools  that  measure  some  of  the
objective   characteristics   of   writing  style  has  been
developed.  The tools are sufficiently general that they may
be  applied to documents on any subject with equal accuracy.
Although the measurements are only of the surface  structure
of  the  text, they do point out problem areas.  In addition
to helping writers produce better documents, these  programs
may  be  useful for studying the writing process and finding
other formulae for measuring readability.


_R_e_f_e_r_e_n_c_e_s


1.   L. L. Cherry, ``PARTS - A  System  for  Assigning  Word
     Classes  to  English  Text,''  submitted  to  _C_o_m_m_u_n_i_c_a_-
     _t_i_o_n_s _o_f _t_h_e _A_C_M.

2.   B. W. Kernighan and J. R. Mashey, ``The  UNIX  Program-
     ming  Environment,'' _S_o_f_t_w_a_r_e - _P_r_a_c_t_i_c_e & _E_x_p_e_r_i_e_n_c_e,
     9, 1-15 (1979).

3.   G.  R.  Klare,   ``Assessing   Readability,''   _R_e_a_d_i_n_g
     _R_e_s_e_a_r_c_h _Q_u_a_r_t_e_r_l_y, 1974-1975, _1_0, 62-102.

4.   E. A. Smith and P. Kincaid, ``Derivation and validation
     of the automated readability index for use with techni-
     cal materials,'' _H_u_m_a_n _F_a_c_t_o_r_s, 1970, 12, 457-464.

5.   J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S.
     Chissom,   ``Derivation  of  new  readability  formulas
     (Automated Readability Index,  Fog  count,  and  Flesch
     Reading  Ease  Formula)  for Navy enlisted personnel,''
     Navy Training  Command  Research  Branch  Report  8-75,
     Feb., 1975.

6.   M. Coleman and T. L.  Liau,  ``A  Computer  Readability
     Formula  Designed  for  Machine  Scoring,''  _J_o_u_r_n_a_l _o_f
     _A_p_p_l_i_e_d _P_s_y_c_h_o_l_o_g_y, 1975, 60, 283-284.

7.   R. Flesch, ``A New Readability Yardstick,'' _J_o_u_r_n_a_l  _o_f
     _A_p_p_l_i_e_d _P_s_y_c_h_o_l_o_g_y, 1948, 32, 221-233.

8.   E. U. Coke, private communication.

9.   D. W. Ewing, _W_r_i_t_i_n_g _f_o_r _R_e_s_u_l_t_s, John  Wiley  &  Sons,
     Inc., New York, N. Y. (1974).

10.  G. Leggett, C. D. Mead and  W.  Charvat,  _P_r_e_n_t_i_c_e-_H_a_l_l
     _H_a_n_d_b_o_o_k  _f_o_r  _W_r_i_t_e_r_s,  Seventh Edition, Prentice-Hall
     Inc., Englewood Cliffs, N. J. (1978).

11.  E. B. Coleman, ``Learning  of  Prose  Written  in  Four
     Grammatical   Transformations,''   _J_o_u_r_n_a_l  _o_f  _A_p_p_l_i_e_d
     _P_s_y_c_h_o_l_o_g_y, 1965, vol. 49, no. 5, pp. 332-341.

12.  A. V. Aho and M. J. Corasick, ``Efficient String Match-
     ing:  an  aid to Bibliographic Search,'' _C_o_m_m_u_n_i_c_a_t_i_o_n_s
     _o_f _t_h_e _A_C_M, 18, (6), 333-340, June 1975.

13.  Bell Laboratories,  ``_U_N_I_X  _T_I_M_E-_S_H_A_R_I_N_G  _S_Y_S_T_E_M:  _U_N_I_X
     _P_R_O_G_R_A_M_M_E_R'_S _M_A_N_U_A_L,'' Seventh Edition, Vol. 1 (January
     1979).


                           Appendix 1

                       STYLE Abbreviations


     a. d.
     A. M.
     a. m.
     b. c.
     Ch.
     ch.
     ckts.
     dB.
     Dept.
     dept.
     Depts.
     depts.
     Dr.
     Drs.
     e. g.
     Eq.
     eq.
     et al.
     etc.
     Fig.
     fig.
     Figs.
     figs.
     ft.
     i. e.
     in.
     Inc.
     Jr.
     jr.
     mi.
     Mr.
     Mrs.
     Ms.
     No.
     no.
     Nos.
     nos.
     P. M.
     p. m.
     Ph. D.
     Ph. d.
     Ref.
     ref.
     Refs.
     refs.
     St.
     vs.
     yr.


                           Appendix 2

                    Default DICTION Patterns

      a great deal of
      a large number of
      a lot of
      a majority of
      a need for
      a number of
      a particular preference for
      a preference for
      a small number of
      a tendency to
      abovementioned
      absolutely complete
      absolutely essential
      accomplished
      accordingly
      activate
      actual
      added increments
      adequate enough
      advent
      afford an opportunity
      aggregate
      all of
      all throughout
      along the line
      an indication of
      analyzation
      and etc
      and or
      another additional
      any and all
      arrive at a
      as a matter of fact
      as a method of
      as good or better than
      as of now
      as per
      as regards
      as related to
      as to
      assistance
      assistance to
      assistance to
      assuming that
      at a later date
      at about
      at above
      at all times
      at an early date
      at below
      at the present
      at the time when
      at this point in time
      at this time
      at which time
      at your earliest convenience
      authorization
      awful
      basic fundamentals
      basically
      be cognizant of
      being as
      being that
      brief in duration
      bring to a conclusion
      but that
      but what
      by means of
      by the use of
      carry out experiments
      center about
      center around
      center portion
      check into
      check on
      check up on
      circle around
      close proximity
      collaborate together
      collect together
      combine together
      come to an end
      commence
      common accord
      compensation
      completely eliminated
      comprise
      concerning
      conduct an investigation of
      conjecture
      connect up
      consensus of opinion
      consequent result
      consolidate together
      construct
      contemplate
      continue on
      continue to remain
      could of
      count up
      couple together
      debate about
      decide on
      deleterious effect
      demean
      demonstrate
      depreciate in value
      deserving of
      desirable benefits
      desirous of
      different than
      discontinue
      disutility
      divide up
      doubt but
      due to
      duly noted
      during the time that
      each and every
      early beginnings
      effectuate
      emotional feelings
      empty out
      enclosed herein
      enclosed herewith
      end result
      end up
      endeavor
      enter in
      enter into
      enthused
      entirely complete
      equally good as
      essentially
      eventuate
      every now and then
      exactly identical
      experiencing difficulty
      fabricate
      face up to
      facilitate
      facts and figures
      fast in action
      fearful of
      fearful that
      few in number
      file away
      final completion
      final ending
      final outcome
      final result
      finalize
      find it interesting to know
      first and foremost
      first beginnings
      first initiated
      firstly
      follow after
      following after
      for the purpose of
      for the reason that
      for the simple reason that
      for this reason
      for your information
      from the point of view of
      full and complete
      generally agreed
      good and
      got to
      gratuitous
      greatly minimize
      head up
      help but
      helps in the production of
      hopeful
      if and when
      if at all possible
      impact
      implement
      important essentials
      importantly
      in a large measure
      in a position to
      in accordance
      in advance of
      in agreement with
      in all cases
      in back of
      in behalf of
      in behind
      in between
      in case
      in close proximity
      in conflict with
      in conjunction with
      in connection with
      in fact
      in large measure
      in many cases
      in most cases
      in my opinion I think
      in order to
      in rare cases
      in reference to
      in regard to
      in regards to
      in relation with
      in short supply
      in size
      in terms of
      in the amount of
      in the case of
      in the course of
      in the event
      in the field of
      in the form of
      in the instance of
      in the interim
      in the last analysis
      in the matter of
      in the near future
      in the neighborhood of
      in the not too distant future
      in the proximity of
      in the range of
      in the same way as described
      in the shape of
      in the vicinity of
      in this case
      in view of the
      in violation of
      inasmuch as
      indicate
      indicative of
      initialize
      initiate
      injurious to
      inquire
      inside of
      institute a
      intents and purposes
      intermingle
      irregardless
      is defined as
      is used to control
      is when
      is where
      it is incumbent
      it stands to reason
      it was noted that if
      joint cooperation
      joint partnership
      just exactly
      kind of
      know about
      last but not least
      later on
      leaving out of consideration
      liable
      link up
      literally
      little doubt that
      lose out on
      lots of
      main essentials
      make a
      make adjustments to
      make an
      make application to
      make contact with
      make mention of
      make out a list of
      make the acquaintance of
      make the adjustment
      manner
      maximum possible
      meaningful
      meet up with
      melt down
      melt up
      methodology
      might of
      minimize as far as possible
      minor importance
      miss out on
      modification
      more preferable
      most unique
      must of
      mutual cooperation
      necessary requisite
      necessitate
      need for
      nice
      not be un
      not in a position to
      not of a high order of accuracy
      not un
      notwithstanding
      of considerable magnitude
      of that
      of the opinion that
      off of
      on a few occasions
      on account of
      on behalf of
      on the grounds that
      on the occasion
      on the part of
      one of the
      open up
      operates to correct
      outside of
      over with
      overall
      past history
      perceptive of
      perform a measurement
      perform the measurement
      permits the reduction of
      personalize
      pertaining to
      physical size
      plan ahead
      plan for the future
      plan in advance
      plan on
      present a conclusion
      present a report
      presently
      prior to
      prioritize
      proceed to
      procure
      productive of
      prolong the duration
      protrude out from
      provided that
      pursuant to
      put to use in
      range all the way from
      reason is because
      reason why
      recur again
      reduce down
      refer back
      reference to this
      reflective of
      regarding
      regretful
      reinitiate
      relative to
      repeat again
      representative of
      resultant effect
      resume again
      retreat back
      return again
      return back
      revert back
      seal off
      seems apparent
      send a communication
      short space of time
      should of
      single unit
      situation
      so as to
      sort of
      spell out
      still continue
      still remain
      subsequent
      substantially in agreement
      succeed in
      suggestive of
      superior than
      surrounding circumstances
      take appropriate
      take cognizance of
      take into consideration
      termed as
      terminate
      termination
      the author
      the authors
      the case that
      the fact
      the foregoing
      the foreseeable future
      the fullest possible extent
      the majority of
      the nature
      the necessity of
      the only difference being that
      the order of
      the point that
      the truth is
      there are not many
      through the medium of
      through the use of
      throughout the entire
      time interval
      to summarize the above
      total effect of all this
      totality
      transpire
      true facts
      try and
      ultimate end
      under a separate cover
      under date of
      under separate cover
      under the necessity to
      underlying purpose
      undertake a study
      uniformly consistent
      unique
      until such time as
      up to this time
      upshot
      utilize
      very
      very complete
      very unique
      vital
      which
      with a view to
      with reference to
      with regard to
      with the exception of
      with the object of
      with the result that
      with this in mind, it is clear that
      within the realm of possibility
      without further delay
      worth while
      would of
     ing behavior
     wise
     ~  which
     ~ about which
     ~ after which
     ~ at which
     ~ between which
     ~ by which
     ~ for which
     ~ from which
     ~ in which
     ~ into which
     ~ of which
     ~ on which
     ~ on which
     ~ over which
     ~ through which
     ~ to which
     ~ under which
     ~ upon which
     ~ with which
     ~ without which
     ~clockwise
     ~likewise
     ~otherwise