Writing Tools - The STYLE and DICTION Programs

L. L. Cherry
Bell Laboratories
Murray Hill, New Jersey 07974

W. Vesterman
Livingston College
Rutgers University

September 2, 1987

ABSTRACT

Text processing systems are now in heavy use in many companies to format documents. With many documents stored on line, it has become possible to use computers to study writing style itself and to help writers produce better written and more readable prose. The system of programs described here is an initial step toward such help. It includes programs and a data base designed to produce a stylistic profile of writing at the word and sentence level. The system measures readability, sentence and word length, sentence type, word usage, and sentence openers. It also locates common examples of wordy phrasing and bad diction. The system is useful for evaluating a document's style, locating sentences that may be difficult to read or excessively wordy, and determining a particular writer's style over several documents.

1. Introduction

Computers have become important in the document preparation process, with programs to check for spelling errors and to format documents. As the amount of text stored on line increases, it becomes feasible and attractive to study writing style and to attempt to help the writer in producing readable documents. The system of writing tools described here is a first step toward such help. The system includes programs and a data base to analyze writing style at the word and sentence level. We use the term ``style'' in this paper to describe the results of a writer's particular choices among individual words and sentence forms. Although many judgements of style are subjective, particularly those of word choice, there are some objective measures that experts agree lead to good style.
Three programs have been written to measure some of the objectively definable characteristics of writing style and to identify some commonly misused or unnecessary phrases. Although a document that conforms to the stylistic rules is not guaranteed to be coherent and readable, one that violates all of the rules is likely to be difficult or tedious to read. The program STYLE calculates readability, sentence length variability, sentence type, word usage and sentence openers at a rate of about 400 words per second on a PDP11/70 running the UNIX* Operating System. It assumes that the sentences are well-formed, i.e., that each sentence has a verb and that the subject and verb agree in number. DICTION identifies phrases that are either bad usage or unnecessarily wordy. EXPLAIN acts as a thesaurus for the phrases found by DICTION. Sections 2, 3, and 4 describe the programs; Section 5 gives the results on a cross-section of technical documents; Section 6 discusses accuracy and problems; Section 7 gives implementation details.

2. STYLE

The program STYLE reads a document and prints a summary of readability indices, sentence length and type, word usage, and sentence openers. It may also be used to locate all sentences in a document that are longer than a given length, have a readability index higher than a given number, contain a passive verb, or begin with an expletive.

STYLE is based on the system for finding English word classes or parts of speech, PARTS [1]. PARTS is a set of programs that uses a small dictionary (about 350 words) and suffix rules to partially assign word classes to English text. It then uses experimentally derived rules of word order to assign word classes to all words in the text with an accuracy of about 95%. Because PARTS uses only a small dictionary and general rules, it works on text about any subject, from physics to psychology. Style measures have been built into the output phase of the programs that make up PARTS.
Some of the measures are simple counters of the word classes found by PARTS; many are more complicated. For example, the verb count is the total number of verb phrases. This includes phrases like:

     has been going
     was only going to go

each of which counts as one verb.

Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey about the UNIX programming environment [2]. As the example shows, STYLE output is in five parts. After a brief discussion of sentences, we will describe the parts in order.

     programming environment
     readability grades:
             (Kincaid) 12.3  (auto) 12.8  (Coleman-Liau) 11.8  (Flesch) 13.5 (46.3)
     sentence info:
             no. sent 335 no. wds 7419
             av sent leng 22.1 av word leng 4.91
             no. questions 0 no. imperatives 0
             no. nonfunc wds 4362  58.8%  av leng 6.38
             short sent (<17) 35% (118) long sent (>32) 16% (55)
             longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
     sentence types:
             simple 34% (114) complex 32% (108)
             compound 12% (41) compound-complex 21% (72)
     word usage:
             verb types as % of total verbs
             tobe 45% (373) aux 16% (133) inf 14% (114)
             passives as % of non-inf verbs 20% (144)
             types as % of total
             prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
             noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
             nominalizations 2 % (155)
     sentence beginnings:
             subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot 67%
             prep 12% (39) adv 9% (31)
             verb 0% (1) sub_conj 6% (20) conj 1% (5)
             expletives 4% (13)

                                   Figure 1

__________________________
* UNIX is a Trademark of Bell Laboratories.

2.1. What is a sentence?

Readers of documents have little trouble deciding where the sentences end. People don't even have to stop and think about uses of the character ``.'' in constructions like 1.25, A. J. Jones, Ph.D., i.e., or etc. When a computer reads a document, finding the end of sentences is not as easy.
First we must throw away the printer's marks and formatting commands that litter the text in computer form. Then STYLE defines a sentence as a string of words ending in one of:

     .   !   ?   /.

The end marker ``/.'' may be used to indicate an imperative sentence. Imperative sentences that are not so marked are not identified as imperative. STYLE properly handles numbers with embedded decimal points and commas, strings of letters and numbers with embedded decimal points used for naming computer files, and the common abbreviations listed in Appendix 1. Numbers that end sentences, like the preceding sentence, cause a sentence break if the next word begins with a capital letter. Initials cause a sentence break only if the next word begins with a capital and is found in the dictionary of function words used by PARTS. So the string

     J. D. JONES

does not cause a break, but the string

     ... system H. The ...

does. With these rules most sentences are broken at the proper place, although occasionally either two sentences are called one or a fragment is called a sentence. More on this later.

2.2. Readability Grades

The first section of STYLE output consists of four readability indices. As Klare points out in [3], readability indices may be used to estimate the reading skills needed by the reader to understand a document. The readability indices reported by STYLE are based on measures of sentence and word lengths. Although the indices may not measure whether the document is coherent and well organized, experience has shown that high indices seem to be indicators of stylistic difficulty. Documents with short sentences and short words have low scores; those with long sentences and many polysyllabic words have high scores. The four formulae reported are the Kincaid Formula [5], the Automated Readability Index [4], the Coleman-Liau Formula [6] and a normalized version of the Flesch Reading Ease Score [7].
The formulae differ because they were experimentally derived using different texts and subject groups. We will discuss each of the formulae briefly; for a more detailed discussion the reader should see [3].

The Kincaid Formula, given by:

     Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59

was based on Navy training manuals that ranged in difficulty from 5.5 to 16.3 in reading grade level. The score reported by this formula tends to be in the mid-range of the 4 scores. Because it is based on adult training manuals rather than school book text, this formula is probably the best one to apply to technical documents.

The Automated Readability Index (ARI), based on text from grades 0 to 7, was derived to be easy to automate. The formula is:

     Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43

ARI tends to produce scores that are higher than Kincaid and Coleman-Liau but are usually slightly lower than Flesch.

The Coleman-Liau Formula, based on text ranging in difficulty from .4 to 16.3, is:

     Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8

Of the four formulae this one usually gives the lowest grade when applied to technical documents.

The last formula, the Flesch Reading Ease Score, is based on grade school text covering grades 3 to 12. The formula, given by:

     Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent

is usually reported in the range 0 (very difficult) to 100 (very easy). The score reported by STYLE is scaled to be comparable to the other formulas, except that the maximum grade level reported is set to 17. The Flesch score is usually the highest of the 4 scores on technical documents.
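The four formulas above can be written out as a short Python sketch. This is an illustration only, not the STYLE source: the function name and argument names are made up here, the caller is assumed to supply raw counts of letters, syllables, words and sentences, and the Flesch value is left as the raw Reading Ease score because the scaling STYLE applies to turn it into a grade is not given in this paper.

```python
def readability(letters, syllables, words, sentences):
    """Return the four scores from raw counts (hypothetical helper).

    The coefficients are the ones quoted in the text; everything else
    about this function is an assumption made for illustration.
    """
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")

    let_per_wd = letters / words
    syl_per_wd = syllables / words
    wds_per_sent = words / sentences
    sent_per_100_wds = 100.0 * sentences / words

    return {
        "kincaid": 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59,
        "ari": 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43,
        "coleman_liau": 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8,
        # Raw Reading Ease (0 = very difficult, 100 = very easy), unscaled.
        "flesch_reading_ease": 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent,
    }


if __name__ == "__main__":
    # Word and sentence counts are taken from Figure 1. The letter count is
    # approximated as 4.91 * 7419; the syllable count is NOT in the paper and
    # was back-solved from the Flesch score shown in Figure 1.
    scores = readability(letters=36427, syllables=12107, words=7419, sentences=335)
    for name, value in scores.items():
        print(f"{name}: {value:.1f}")
```

With the counts from Figure 1 and the assumed syllable total, the sketch gives roughly 12.3, 12.8, 11.8 and 46.3, in line with the Kincaid, automated, Coleman-Liau and raw Flesch values printed in the figure.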
Coke [8] found that the Kincaid Formula is probably the best predictor for technical documents; both ARI and Flesch tend to overestimate the difficulty; Coleman-Liau tends to underestimate. On text in the range of grades 7 to 9 the four formulas tend to be about the same. On easy text the Coleman-Liau formula is probably preferred since it is reasonably accurate at the lower grades and it is safer to present text that is a little too easy than a little too hard.

If a document has particularly difficult technical content, especially if it includes a lot of mathematics, it is probably best to make the text very easy to read, i.e., to lower the readability index by shortening the sentences and words. This will allow the reader to concentrate on the technical content and not the long sentences. The user should remember that these indices are estimators; they should not be taken as absolute numbers.

STYLE called with ``-r number'' will print all sentences with an Automated Readability Index equal to or greater than ``number''.

2.3. Sentence length and structure

The next two sections of STYLE output deal with sentence length and structure. Almost all books on writing style or effective writing emphasize the importance of variety in sentence length and structure for good writing. Ewing's first rule in discussing style in the book Writing for Results [9] is:

     ``Vary the sentence structure and length of your sentences.''

Leggett, Mead and Charvat break this rule into 3 in Prentice-Hall Handbook for Writers [10] as follows:

     ``34a. Avoid the overuse of short simple sentences.''
     ``34b. Avoid the overuse of long compound sentences.''
     ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''

Although experts agree that these rules are important, not all writers follow them.
Sample technical documents have been found with almost no sentence length or type variability. One document had 90% of its sentences about the same length as the average; another was made up almost entirely of simple sentences (80%).

The output sections labeled ``sentence info'' and ``sentence types'' give both length and structure measures. STYLE reports on the number and average length of both sentences and words, and the number of questions and imperative sentences (those ending in ``/.''). The measures of non-function words are an attempt to look at the content words in the document. In English, non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs; function words are prepositions, conjunctions, articles, and auxiliary verbs. Since most function words are short, they tend to lower the average word length. The average length of non-function words may be a more useful measure for comparing word choice of different writers than the total average word length. The percentages of short and long sentences measure sentence length variability. Short sentences are those at least 5 words shorter than the average; long sentences are those at least 10 words longer than the average. Last in the sentence information section is the length and location of the longest and shortest sentences. If the flag ``-l number'' is used, STYLE will print all sentences longer than ``number''.

Because of the difficulties in dealing with the many uses of commas and conjunctions in English, sentence type definitions vary slightly from those of standard textbooks, but still measure the same constructional activity.

1.   A simple sentence has one verb and no dependent clause.

2.   A complex sentence has one independent clause and one dependent clause, each with one verb. Complex sentences are found by identifying sentences that contain either a subordinate conjunction or a clause beginning with words like ``that'' or ``who''. The preceding sentence has such a clause.

3.   A compound sentence has more than one verb and no dependent clause. Sentences joined by ``;'' are also counted as compound.

4.   A compound-complex sentence has either several dependent clauses or one dependent clause and a compound verb in either the dependent or independent clause.

Even using these broader definitions, simple sentences dominate many of the technical documents that have been tested, but the example in Figure 1 shows variety in both sentence structure and sentence length.

2.4. Word Usage

The word usage measures are an attempt to identify some other constructional features of writing style. There are many different ways in English to say the same thing. The constructions differ from one another in the form of the words used. The following sentences all convey approximately the same meaning but differ in word usage:

     The cxio program is used to perform all communication between the systems.
     The cxio program performs all communications between the systems.
     The cxio program is used to communicate between the systems.
     The cxio program communicates between the systems.
     All communication between the systems is performed by the cxio program.

The distribution of the parts of speech and verb constructions helps identify overuse of particular constructions. Although the measures used by STYLE are crude, they do point out problem areas. For each category, STYLE reports a percentage and a raw count. In addition to looking at the percentage, the user may find it useful to compare the raw count with the number of sentences. If, for example, the number of infinitives is almost equal to the number of sentences, then many of the sentences in the document are constructed like the first and third in the preceding example. The user may want to transform some of these sentences into another form. Some of the implications of the word usage measures are discussed below.
Verbs are measured in several different ways to try to determine what types of verb constructions are most frequent in the document. Technical writing tends to contain many passive verb constructions and other usage of the verb ``to be''. The category of verbs labeled ``tobe'' measures both passives and sentences of the form:

     subject tobe predicate

In counting verbs, whole verb phrases are counted as one verb. Verb phrases containing auxiliary verbs are counted in the category ``aux''. The verb phrases counted here are those whose tense is not simple present or simple past. It might eventually be useful to do more detailed measures of verb tense or mood. Infinitives are listed as ``inf''. The percentages reported for these three categories are based on the total number of verb phrases found. These categories are not mutually exclusive; they cannot be added, since, for example, ``to be going'' counts as both ``tobe'' and ``inf''. Use of these three types of verb constructions varies significantly among authors.

STYLE reports passive verbs as a percentage of the finite verbs in the document. Most style books warn against the overuse of passive verbs. Coleman [11] has shown that sentences with active verbs are easier to learn than those with passive verbs. Although the inverted object-subject order of the passive voice seems to emphasize the object, Coleman's experiments showed that there is little difference in retention by word position. He also showed that the direct object of an active verb is retained better than the subject of a passive verb. These experiments support the advice of the style books suggesting that writers should try to use active verbs wherever possible. The flag ``-p'' causes STYLE to print all sentences containing passive verbs.

Pronouns add cohesiveness and connectivity to a document by providing back-reference.
They are often a short-hand notation for something previously mentioned, and therefore connect the sentence containing the pronoun with the word to which the pronoun refers. Although there are other mechanisms for such connections, documents with no pronouns tend to be wordy and to have little connectivity.

Adverbs can provide transition between sentences and order in time and space. In performing these functions, adverbs, like pronouns, provide connectivity and cohesiveness.

Conjunctions provide parallelism in a document by connecting two or more equal units. These units may be whole sentences, verb phrases, nouns, adjectives, or prepositional phrases. The compound and compound-complex sentences reported under sentence type are parallel structures. Other uses of parallel structures are indicated by the degree that the number of conjunctions reported under word usage exceeds the compound sentence measures.

Nouns and Adjectives. A ratio of nouns to adjectives near unity may indicate the over-use of modifiers. Some technical writers qualify every noun with one or more adjectives. Qualifiers in phrases like ``simple linear single-link network model'' often lend more obscurity than precision to a text.

Nominalizations are verbs that are changed to nouns by adding one of the suffixes ``ment'', ``ance'', ``ence'', or ``ion''. Examples are accomplishment, admittance, adherence, and abbreviation. When a writer transforms a nominalized sentence to a non-nominalized sentence, she/he increases the effectiveness of the sentence in several ways. The noun becomes an active verb and frequently one complicated clause becomes two shorter clauses. For example,

     Their inclusion of this provision is admission of the importance of the system.
     When they included this provision, they admitted the importance of the system.
Coleman found that the transformed sentences were easier to learn, even when the transformation produced sentences that were slightly longer, provided the transformation broke one clause into two. Writers who find their document contains many nominalizations may want to transform some of the sentences to use active verbs.

2.5. Sentence openers

Another agreed upon principle of style is variety in sentence openers. Because STYLE determines the type of sentence opener by looking at the part of speech of the first word in the sentence, the sentences counted under the heading ``subject opener'' may not all really begin with the subject. However, a large percentage of sentences in this category still indicates lack of variety in sentence openers.

Other sentence opener measures help the user determine if there are transitions between sentences and where the subordination occurs. Adverbs and conjunctions at the beginning of sentences are mechanisms for transition between sentences. A pronoun at the beginning shows a link to something previously mentioned and indicates connectivity.

The location of subordination can be determined by comparing the number of sentences that begin with a subordinator with the number of sentences with complex clauses. If few sentences start with subordinate conjunctions then the subordination is embedded or at the end of the complex sentences. For variety the writer may want to transform some sentences to have leading subordination.

The last category of openers, expletives, is commonly overworked in technical writing. Expletives are the words ``it'' and ``there'', usually with the verb ``to be'', in constructions where the subject follows the verb. For example,

     There are three streets used by the traffic.
     There are too many users on this system.

This construction tends to emphasize the object rather than the subject of the sentence.
The flag ``-e'' will cause STYLE to print all sentences that begin with an expletive.

3. DICTION

The program DICTION prints all sentences in a document containing phrases that are either frequently misused or indicate wordiness. The program, an extension of Aho's FGREP [12] string matching program, takes as input a file of phrases or patterns to be matched and a file of text to be searched. A data base of about 450 phrases has been compiled as a default pattern file for DICTION. Before attempting to locate phrases, the program maps upper case letters to lower case and substitutes blanks for punctuation. Sentence boundaries were deemed less critical in DICTION than in STYLE, so abbreviations and other uses of the character ``.'' are not treated specially. DICTION brackets all pattern matches in a sentence with the characters ``['' and ``]''. Although many of the phrases in the default data base are correct in some contexts, in others they indicate wordiness. Some examples of the phrases and suggested alternatives are:

     Phrase                      Alternative
     a large number of           many
     arrive at a decision        decide
     collect together            collect
     for this reason             so
     pertaining to               about
     through the use of          by or with
     utilize                     use
     with the exception of       except

Appendix 2 contains a complete list of the default file. Some of the entries are short forms of problem phrases. For example, the phrase ``the fact'' is found in all of the following and is sufficient to point out the wordiness to the user:

     Phrase                                  Alternative
     accounted for by the fact that          caused by
     an example of this is the fact that     thus
     based on the fact that                  because
     despite the fact that                   although
     due to the fact that                    because
     in light of the fact that               because
     in view of the fact that                since
     notwithstanding the fact that           although

Entries in Appendix 2 preceded by ``~'' are not matched. See Section 7 for details on the use of ``~''.
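The preprocessing and bracketing steps described for DICTION can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a naive longest-first scan rather than the FGREP matching algorithm, it ignores the ``~'' suppression entries, and the names normalize and bracket_matches are invented here. Patterns are assumed to be written with a blank before and after each phrase.

```python
import string

# Every ASCII punctuation character is replaced by a blank.
PUNCT_TO_BLANK = str.maketrans({c: " " for c in string.punctuation})


def normalize(sentence):
    """Lower-case the sentence, blank out punctuation, pad with blanks."""
    return " " + sentence.lower().translate(PUNCT_TO_BLANK) + " "


def bracket_matches(sentence, patterns):
    """Return the normalized sentence with each matched phrase in [ ].

    Hypothetical helper: a simple left-to-right scan, not FGREP.
    """
    text = normalize(sentence)
    # Drop empty patterns and try longer patterns first, so the longest
    # phrase wins when several patterns match at the same position.
    ordered = sorted((p for p in patterns if p.strip()), key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        hit = next((p for p in ordered if text.startswith(p, i)), None)
        if hit is None:
            out.append(text[i])
            i += 1
        else:
            lead = len(hit) - len(hit.lstrip(" "))
            trail = len(hit) - len(hit.rstrip(" "))
            out.append(" " * lead + "[" + hit.strip(" ") + "]" + " " * trail)
            # The blanks that belong to the pattern are consumed here.
            i += len(hit)
    return "".join(out).strip()


if __name__ == "__main__":
    patterns = [" for this reason ", " utilize ", " a large number of "]
    print(bracket_matches("For this reason, we utilize FGREP.", patterns))
```

Because the blank that ends one match is consumed, a second phrase that follows immediately, with no intervening word, is not matched by this sketch.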
The user may supply her/his own pattern file with the flag ``-f patfile''. In this case the default file will be loaded first, followed by the user file. This mechanism allows users to suppress patterns contained in the default file or to include their own pet peeves that are not in the default file. The flag ``-n'' will exclude the default file altogether. In constructing a pattern file, blanks should be used before and after each phrase to avoid matching substrings in words. For example, to find all occurrences of the word ``the'', the pattern `` the '' should be used. The blanks cause only the word ``the'' to be matched and not the string ``the'' in words like there, other, and therefore. One side effect of surrounding the words with blanks is that when two phrases occur without intervening words, only the first will be matched.

4. EXPLAIN

The last program, EXPLAIN, is an interactive thesaurus for phrases found by DICTION. The user types one of the phrases bracketed by DICTION and EXPLAIN responds with suggested substitutions for the phrase that will improve the diction of the document.

5. Results

5.1. STYLE

To get baseline statistics and check the program's accuracy, we ran STYLE on 20 technical documents. There were a total of 3287 sentences in the sample. The shortest document was 67 sentences long; the longest 339 sentences. The documents covered a wide range of subject matter, including theoretical computing, physics, psychology, engineering, and affirmative action. Table 1 gives the range, median, and standard deviation of the various style measures. As you will note, most of the measurements have a fairly wide range of values across the sample documents.

                                   Table 1
                  Text Statistics on 20 Technical Documents

     variable                  minimum   maximum    mean   standard deviation
     ------------------------------------------------------------------------
     Readability
       Kincaid                    9.5      16.9     13.3          2.2
       automated                  9.0      17.4     13.3          2.5
       Coleman-Liau              10.0      16.0     12.7          1.8
       Flesch                     8.9      17.0     14.4          2.2
     ------------------------------------------------------------------------
     sentence info.
       av sent length            15.5      30.3     21.6          4.0
       av word length            4.61      5.63     5.08          .29
       av nonfunction length     5.72      7.30     6.52          .45
       short sent                 23%       46%      33%          5.9
       long sent                   7%       20%      14%          2.9
     ------------------------------------------------------------------------
     sentence types
       simple                     31%       71%      49%         11.4
       complex                    19%       50%      33%          8.3
       compound                    2%       14%       7%          3.3
       compound-complex            2%       19%      10%          4.8
     ------------------------------------------------------------------------
     verb types
       tobe                       26%       64%    44.7%         10.3
       auxiliary                  10%       40%      21%          8.7
       infinitives                 8%       24%    15.1%          4.8
       passives                   12%       50%      29%          9.3
     ------------------------------------------------------------------------
     word usage
       prepositions             10.1%     15.0%    12.3%          1.6
       conjunctions              1.8%      4.8%     3.4%           .9
       adverbs                   1.2%      5.0%     3.4%          1.0
       nouns                    23.6%     31.6%    27.8%          1.7
       adjectives               15.4%     27.1%    21.1%          3.4
       pronouns                  1.2%      8.4%     2.5%          1.1
       nominalizations             2%        5%     3.3%           .8
     ------------------------------------------------------------------------
     sentence openers
       prepositions                6%       19%      12%          3.4
       adverbs                     0%       20%       9%          4.6
       subject                    56%       85%      70%          8.0
       verbs                       0%        4%       1%          1.0
       subordinating conj          1%       12%       5%          2.7
       conjunctions                0%        4%       0%          1.5
       expletives                  0%        6%       2%          1.7

As a comparison, Table 2 gives the median results for two different technical authors, a sample of instructional material, and a sample of the Federalist Papers. The two authors show similar styles, although author 2 uses somewhat shorter sentences and longer words than author 1. Author 1 uses all types of sentences, while author 2 prefers simple and complex sentences, using few compound or compound-complex sentences. The other major difference in the styles of these authors is the location of subordination. Author 1 seems to prefer embedded or trailing subordination, while author 2 begins many sentences with the subordinate clause. The documents tested for both authors 1 and 2 were technical documents, written for a technical audience. The instructional documents, which are written for craftspeople, vary surprisingly little from the two technical samples. The sentences and words are a little longer, and they contain many passive and auxiliary verbs, few adverbs, and almost no pronouns.
The instructional documents contain many impera- tive sentences, so there are many sentence with verb openers. The sample of Federalist Papers contrasts with the other samples in almost every way. _5._2. _D_I_C_T_I_O_N In the few weeks that DICTION has been available to users about 35,000 sentences have been run with about 5,000 string matches. The authors using the program seem to make the suggested changes about 50-75% of the time. To date, almost 200 of the 450 strings in the default file have been matched. Although most of these phrases are valid and correct in some contexts, the 50-75% change rate seems to show that the phrases are used much more often than concise diction warrants. _6. _A_c_c_u_r_a_c_y _6._1. _S_e_n_t_e_n_c_e _I_d_e_n_t_i_f_i_c_a_t_i_o_n The correctness of the STYLE output on the 20 document sample was checked in detail. STYLE misidentified 129 sen- tence fragments as sentences and incorrectly joined two or more sentences 75 times in the 3287 sentence sample. The problems were usually because of nonstandard formatting com- mands, unknown abbreviations, or lists of non-sentences. An impossibly long sentence found as the longest sentence in the document usually is the result of a long list of non- sentences. _6._2. _S_e_n_t_e_n_c_e _T_y_p_e_s Style correctly identified sentence type on 86.5% of the sentences in the sample. The type distribution of the sentences was 52.5% simple, 29.9% complex, 8.5% compound and 9% compound-complex. The program reported 49.5% simple, 31.9% complex, 8% compound and 10.4% compound-complex. Looking at the errors on the individual documents, the number of simple sentences was under-reported by about 4% and the complex and compound-complex were over-reported by 3% and 2%, respectively. The following matrix shows the September 2, 1987 - 14 - Table 2 Text Statistics on Single Authors cccccc llnnnn. variable author 1 author 2 inst. 
FED _ readability Kincaid 11.0 10.3 10.8 16.3 automated 11.0 10.3 11.9 17.8 Coleman-Liau 9.3 10.1 10.2 12.3 Flesch 10.3 10.7 10.1 15.0 _ sentence info av sent length 22.64 19.61 22.78 31.85 av word length 4.47 4.66 4.65 4.95 av nonfunction length 5.64 5.92 6.04 6.87 short sent 35% 43% 35% 40% long sent 18% 15% 16% 21% _ sentence types simple 36% 43% 40% 31% complex 34% 41% 37% 34% compound 13% 7% 4% 10% compound-complex 16% 8% 14% 25% _ verb type tobe 42% 43% 45% 37% auxiliary 17% 19% 32% 32% infinitives 17% 15% 12% 21% passives 20% 19% 36% 20% _ word usage prepositions 10.0% 10.8% 12.3% 15.9% conjunctions 3.2% 2.4% 3.9% 3.4% adverbs 5.05% 4.6% 3.5% 3.7% nouns 27.7% 26.5% 29.1% 24.9% adjectives 17.0% 19.0% 15.4% 12.4% pronouns 5.3% 4.3% 2.1% 6.5% nominalizations 1% 2% 2% 3% _ sentence openers prepositions 11% 14% 6% 5% adverbs 9% 9% 6% 4% subject 65% 59% 54% 66% verb 3% 2% 14% 2% subordinating conj 8% 14% 11% 3% conjunction 1% 0% 0% 3% expletives 3% 3% 0% 3% programs output vs. the actual sentence type. September 2, 1987 - 15 - csssss cccccc clnnnn. Program Results simple complex compound comp-complex Actual simple 1566 132 49 17 Sentence complex 47 892 6 65 Type compound 40 6 207 23 comp-complex 0 52 5 249 The system's inability to find imperative sentences seems to have little effect on most of the style statistics. A document with half of its sentences imperative was run, with and without the imperative end marker. The results were identical except for the expected errors of not finding verbs as sentence openers, not counting the imperative sen- tences, and a slight difference (1%) in the number of nouns and adjectives reported. _6._3. _W_o_r_d _U_s_a_g_e The accuracy of identifying word types reflects that of PARTS, which is about 95% correct. The largest source of confusion is between nouns and adjectives. The verb counts were checked on about 20 sentences from each document and found to be about 98% correct. _7. _T_e_c_h_n_i_c_a_l _D_e_t_a_i_l_s _7._1. 
_F_i_n_d_i_n_g _S_e_n_t_e_n_c_e_s The formatting commands embedded in the text increase the difficulty of finding sentences. Not all text in a document is in sentence form; there are headings, tables, equations and lists, for example. Headings like ``Finding Sentences'' above should be discarded, not attached to the next sentence. However, since many of the documents are formatted to be phototypeset, and contain font changes, which usually operate on the most important words in the document, discarding all formatting commands is not correct. To improve the programs' ability to find sentence boun- daries, the deformatting program, DEROFF [13], has been given some knowledge of the formatting packages used on the UNIX operating system. DEROFF will now do the following: 1. Suppress all formatting macros that are used for titles, headings, author's name, etc. 2. Suppress the arguments to the macros for titles, head- ings, author's name, etc. September 2, 1987 - 16 - 3. Suppress displays, tables, footnotes and text that is centered or in no-fill mode. 4. Substitute a place holder for equations and check for hidden end markers. The place holder is necessary because many typists and authors use the equation setter to change fonts on important words. For this reason, header files containing the definition of the EQN delimiters must also be included as input to STYLE. End markers are often hidden when an equation ends a sentence and the period is typed inside the EQN delim- iters. 5. Add a "." after lists. If the flag -ml is also used, all lists are suppressed. This is a separate flag because of the variety of ways the list macros are used. Often, lists are sentences that should be included in the analysis. The user must determine how lists are used in the document to be analyzed. Both STYLE and DICTION call DEROFF before they look at the text. The user should supply the -ml flag if the docu- ment contains many lists of non-sentences that should be skipped. _7._2. 
_D_e_t_a_i_l_s _o_f _D_I_C_T_I_O_N

The program DICTION is based on the string matching program FGREP. FGREP takes as input a file of patterns to be matched and a file to be searched, and outputs each line that contains any of the patterns, with no indication of which pattern was matched. The following changes have been made to FGREP:

1.  The basic unit that DICTION operates on is a sentence rather than a line. Each sentence that contains one of the patterns is output.

2.  Upper case letters are mapped to lower case.

3.  Punctuation is replaced by blanks.

4.  All pattern matches in the sentence are found and surrounded with ``['' ``]''.

5.  A method for suppressing a string match has been added. Any pattern that begins with ``~'' will not be matched. Because the matching algorithm finds the longest substring, the suppression of a match allows a word in some correct contexts not to be matched while allowing the word in another context to be found. For example, the word ``which'' is often incorrectly used instead of ``that'' in restrictive clauses. However, ``which'' is usually correct when preceded by a preposition or ``,''. The default pattern file suppresses the match of the common prepositions or a double blank followed by ``which'' and therefore matches only the suspect uses. The double blank accounts for the replaced comma.

_8. _C_o_n_c_l_u_s_i_o_n_s

A system of writing tools that measure some of the objective characteristics of writing style has been developed. The tools are sufficiently general that they may be applied to documents on any subject with equal accuracy. Although the measurements are only of the surface structure of the text, they do point out problem areas. In addition to helping writers produce better documents, these programs may be useful for studying the writing process and finding other formulae for measuring readability.

_R_e_f_e_r_e_n_c_e_s

1. L. L.
Cherry, ``PARTS - A System for Assigning Word Classes to English Text,'' submitted to _C_o_m_m_u_n_i_c_a_t_i_o_n_s _o_f _t_h_e _A_C_M.

2.  B. W. Kernighan and J. R. Mashey, ``The UNIX Programming Environment,'' _S_o_f_t_w_a_r_e - _P_r_a_c_t_i_c_e & _E_x_p_e_r_i_e_n_c_e, 9, 1-15 (1979).

3.  G. R. Klare, ``Assessing Readability,'' _R_e_a_d_i_n_g _R_e_s_e_a_r_c_h _Q_u_a_r_t_e_r_l_y, 1974-1975, _1_0, 62-102.

4.  E. A. Smith and P. Kincaid, ``Derivation and validation of the automated readability index for use with technical materials,'' _H_u_m_a_n _F_a_c_t_o_r_s, 1970, 12, 457-464.

5.  J. P. Kincaid, R. P. Fishburne, R. L. Rogers, and B. S. Chissom, ``Derivation of new readability formulas (Automated Readability Index, Fog count, and Flesch Reading Ease Formula) for Navy enlisted personnel,'' Navy Training Command Research Branch Report 8-75, Feb., 1975.

6.  M. Coleman and T. L. Liau, ``A Computer Readability Formula Designed for Machine Scoring,'' _J_o_u_r_n_a_l _o_f _A_p_p_l_i_e_d _P_s_y_c_h_o_l_o_g_y, 1975, 60, 283-284.

7.  R. Flesch, ``A New Readability Yardstick,'' _J_o_u_r_n_a_l _o_f _A_p_p_l_i_e_d _P_s_y_c_h_o_l_o_g_y, 1948, 32, 221-233.

8.  E. U. Coke, private communication.

9.  D. W. Ewing, _W_r_i_t_i_n_g _f_o_r _R_e_s_u_l_t_s, John Wiley & Sons, Inc., New York, N. Y. (1974).

10. G. Leggett, C. D. Mead and W. Charvat, _P_r_e_n_t_i_c_e-_H_a_l_l _H_a_n_d_b_o_o_k _f_o_r _W_r_i_t_e_r_s, Seventh Edition, Prentice-Hall Inc., Englewood Cliffs, N. J. (1978).

11. E. B. Coleman, ``Learning of Prose Written in Four Grammatical Transformations,'' _J_o_u_r_n_a_l _o_f _A_p_p_l_i_e_d _P_s_y_c_h_o_l_o_g_y, 1965, vol. 49, no. 5, pp. 332-341.

12. A. V. Aho and M. J. Corasick, ``Efficient String Matching: an aid to Bibliographic Search,'' _C_o_m_m_u_n_i_c_a_t_i_o_n_s _o_f _t_h_e _A_C_M, 18, (6), 333-340, June 1975.

13. Bell Laboratories, ``_U_N_I_X _T_I_M_E-_S_H_A_R_I_N_G _S_Y_S_T_E_M: _U_N_I_X _P_R_O_G_R_A_M_M_E_R'_S _M_A_N_U_A_L,'' Seventh Edition, Vol. 1 (January 1979).

Appendix 1

STYLE Abbreviations

a. d.    A. M.    a. m.    b. c.    Ch.      ch.      ckts.    dB.
Dept.    dept.    Depts.   depts.   Dr.      Drs.     e. g.    Eq.
eq.      et al.   etc.     Fig.     fig.     Figs.    figs.    ft.
i. e.    in.      Inc.     Jr.      jr.      mi.      Mr.      Mrs.
Ms.      No.      no.      Nos.     nos.     P. M.    p. m.    Ph. D.
Ph. d.   Ref.     ref.     Refs.    refs.    St.      vs.      yr.

Appendix 2

Default DICTION Patterns

a great deal of
a large number of
a lot of
a majority of
a need for
a number of
a particular preference for
a preference for
a small number of
a tendency to
abovementioned
absolutely complete
absolutely essential
accomplished
accordingly
activate
actual
added increments
adequate enough
advent
afford an opportunity
aggregate
all of
all throughout
along the line
an indication of
analyzation
and etc
and or
another additional
any and all
arrive at a
as a matter of fact
as a method of
as good or better than
as of now
as per
as regards
as related to
as to
assistance
assistance to
assistance to
assuming that
at a later date
at about
at above
at all times
at an early date
at below
at the present
at the time when
at this point in time
at this time
at which time
at your earliest convenience
authorization
awful
basic fundamentals
basically
be cognizant of
being as
being that
brief in duration
bring to a conclusion
but that
but what
by means of
by the use of
carry out experiments
center about
center around
center portion
check into
check on
check up on
circle around
close proximity
collaborate together
collect together
combine together
come to an end
commence
common accord
compensation
completely eliminated
comprise
concerning
conduct an investigation of
conjecture
connect up
consensus of opinion
consequent result
consolidate together
construct
contemplate
continue on
continue to remain
could of
count up
couple together
debate about
decide on
deleterious effect
demean
demonstrate
depreciate in value
deserving of
desirable benefits
desirous of
different than
discontinue
disutility
divide up
doubt but
due to
duly noted
during the time that
each and every
early beginnings
effectuate
emotional feelings
empty out
enclosed herein
enclosed herewith
end result
end up
endeavor
enter in
enter into
enthused
entirely complete
equally good as
essentially
eventuate
every now and then
exactly identical
experiencing difficulty
fabricate
face up to
facilitate
facts and figures
fast in action
fearful of
fearful that
few in number
file away
final completion
final ending
final outcome
final result
finalize
find it interesting to know
first and foremost
first beginnings
first initiated
firstly
follow after
following after
for the purpose of
for the reason that
for the simple reason that
for this reason
for your information
from the point of view of
full and complete
generally agreed
good and
got to
gratuitous
greatly minimize
head up
help but
helps in the production of
hopeful
if and when
if at all possible
impact
implement
important essentials
importantly
in a large measure
in a position to
in accordance
in advance of
in agreement with
in all cases
in back of
in behalf of
in behind
in between
in case
in close proximity
in conflict with
in conjunction with
in connection with
in fact
in large measure
in many cases
in most cases
in my opinion I think
in order to
in rare cases
in reference to
in regard to
in regards to
in relation with
in short supply
in size
in terms of
in the amount of
in the case of
in the course of
in the event
in the field of
in the form of
in the instance of
in the interim
in the last analysis
in the matter of
in the near future
in the neighborhood of
in the not too distant future
in the proximity of
in the range of
in the same way as described
in the shape of
in the vicinity of
in this case
in view of the
in violation of
inasmuch as
indicate
indicative of
initialize
initiate
injurious to
inquire
inside of
institute a
intents and purposes
intermingle
irregardless
is defined as
is used to control
is when
is where
it is incumbent
it stands to reason
it was noted that if
joint cooperation
joint partnership
just exactly
kind of
know about
last but not least
later on
leaving out of consideration
liable
link up
literally
little doubt that
lose out on
lots of
main essentials
make a
make adjustments to
make an
make application to
make contact with
make mention of
make out a list of
make the acquaintance of
make the adjustment
manner
maximum possible
meaningful
meet up with
melt down
melt up
methodology
might of
minimize as far as possible
minor importance
miss out on
modification
more preferable
most unique
must of
mutual cooperation
necessary requisite
necessitate
need for
nice
not be un
not in a position to
not of a high order of accuracy
not un
notwithstanding
of considerable magnitude
of that
of the opinion that
off of
on a few occasions
on account of
on behalf of
on the grounds that
on the occasion
on the part of
one of the
open up
operates to correct
outside of
over with
overall
past history
perceptive of
perform a measurement
perform the measurement
permits the reduction of
personalize
pertaining to
physical size
plan ahead
plan for the future
plan in advance
plan on
present a conclusion
present a report
presently
prior to
prioritize
proceed to
procure
productive of
prolong the duration
protrude out from
provided that
pursuant to
put to use in
range all the way from
reason is because
reason why
recur again
reduce down
refer back
reference to this
reflective of
regarding
regretful
reinitiate
relative to
repeat again
representative of
resultant effect
resume again
retreat back
return again
return back
revert back
seal off
seems apparent
send a communication
short space of time
should of
single unit
situation
so as to
sort of
spell out
still continue
still remain
subsequent
substantially in agreement
succeed in
suggestive of
superior than
surrounding circumstances
take appropriate
take cognizance of
take into consideration
termed as
terminate
termination
the author
the authors
the case that
the fact
the foregoing
the foreseeable future
the fullest possible extent
the majority of
the nature
the necessity of
the only difference being that
the order of
the point that
the truth is
there are not many
through the medium of
through the use of
throughout the entire
time interval
to summarize the above
total effect of all this
totality
transpire
true facts
try and
ultimate end
under a separate cover
under date of
under separate cover
under the necessity to
underlying purpose
undertake a study
uniformly consistent
unique
until such time as
up to this time
upshot
utilize
very
very complete
very unique
vital
which
with a view to
with reference to
with regard to
with the exception of
with the object of
with the result that
with this in mind, it is clear that
within the realm of possibility
without further delay
worth while
would of
ing behavior
wise
~ which
~ about which
~ after which
~ at which
~ between which
~ by which
~ for which
~ from which
~ in which
~ into which
~ of which
~ on which
~ on which
~ over which
~ through which
~ to which
~ under which
~ upon which
~ with which
~ without which
~clockwise
~likewise
~otherwise
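
As an illustration only, the matching rules of Section 7.2 can be sketched in a few lines of Python. This is not the FGREP-based implementation: it replaces the FGREP matcher with a simple left-to-right scan that takes the longest pattern at each position. The function names (normalize, split_patterns, mark) and the four sample patterns are assumptions made for the example. In particular, the pattern that suppresses ``which'' after a replaced comma is assumed to be ``~'' followed by two blanks and ``which''.

```python
def normalize(sentence):
    """Lower-case letters and replace punctuation with blanks.

    Every input character maps to exactly one output character, so
    offsets in the result line up with the original sentence.
    """
    out = []
    for ch in sentence:
        if ch.isalnum():
            low = ch.lower()
            out.append(low if len(low) == 1 else ch)
        else:
            out.append(" ")  # punctuation and whitespace become one blank
    return "".join(out)


def split_patterns(patterns):
    """Separate ordinary patterns from suppression patterns ("~" prefix)."""
    flagged, suppressed = set(), set()
    for p in patterns:
        if p.startswith("~"):
            if len(p) > 1:
                suppressed.add(p[1:])
        elif p:
            flagged.add(p)
    return flagged, suppressed


def mark(sentence, patterns):
    """Return the sentence with each flagged phrase wrapped in [ ].

    At every position the longest matching pattern wins; if that pattern
    is a suppression pattern, the text is copied through unmarked.
    """
    flagged, suppressed = split_patterns(patterns)
    candidates = sorted(flagged | suppressed, key=len, reverse=True)
    text = normalize(sentence)
    out, i = [], 0
    while i < len(text):
        hit = next((p for p in candidates if text.startswith(p, i)), None)
        if hit is None:
            out.append(sentence[i])
            i += 1
            continue
        piece = sentence[i:i + len(hit)]
        out.append(piece if hit in suppressed else "[" + piece + "]")
        i += len(hit)
    return "".join(out)


# Hypothetical sample patterns, chosen for the example only.
DEMO_PATTERNS = ["which", "in order to", "~in which", "~  which"]
```

With these sample patterns, mark("The file which we read is large.", DEMO_PATTERNS) returns "The file [which] we read is large.", while the same sentence written with ``, which'' comes back unmarked, because the comma becomes a blank and the longer suppression pattern matches first.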