<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1"/><title>Statistical and Linguistic Strategies in the Computer Grading of Essays</title><author surname="Page" givenname="Ellis B."></author></firstpageheader><frontmatter><p><b>Statistical and Linguistic Strategies in the Computer Grading of Essays</b></p></frontmatter><abstract><b>Ellis B. Page University of Connecticut Storrs, Conn., U.S.A.</b> <b>Essay tests are used in the schools and colleges of all nations, and in major testing programs of national and even international size. Potentially, such essay tests are an important applied field for computational linguistics, and should eventually provide focus for much work. Yet in the past, little direct attention has been paid to such grading, although there are ways to begin investigation which would not necessarily require much linguistic knowledge beyond that now available.</b> <b>Beginning in December of 1964, Project Essay Grade (PEG), at the University of Connecticut, has investigated the computer analysis and evaluation of student writing. In February, 1965, the project was given pilot funding by the College Entrance Examination Board of New York City, and in June, 1966, the United States Office of Education gave it much larger support. Through this period of preliminary investigation, certain problems have become much better understood (Daigon, 1966; Page, 1966, 1967).
This paper discusses these problems, relates certain major findings to date, and outlines apparently promising avenues for future work by linguists, computer scientists, psychologists, and educators.</b> </abstract></header><body><section title="Background"><p><b>It is useful to conceptualize the field of essay grading in two dimensions, as represented in Figure 1.</b></p><p><b><u>Figure</u> 1</b></p><p><b><i>Possible Dimensions of Essay Grading</i></b></p>
<table class="main" frame="box" rules="all" border="1" regular="False">
<tr class="row"><td class="cell"></td><td class="cell"><p><b>I. Content</b></p></td><td class="cell"><p><b>II. Style</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>A. Rating Simulation</b></p></td><td class="cell"><p><b>I (A)</b></p></td><td class="cell"><p><b>II (A)</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>B. Master Analysis</b></p></td><td class="cell"><p><b>I (B)</b></p></td><td class="cell"><p><b>II (B)</b></p></td></tr>
</table>
<p><b>Any serious effort to grade essays must obviously face problems of "content" as in Column I, and of "style" as in Column II. Yet these columns are clearly not mutually exclusive. Similarly, the rows are not mutually exclusive, but their general meaning must be mastered to understand the work to date and the problems of the field. The first row refers to the</b> <b>simulation of the human judgment, without great concern about the way this judgment was produced.</b><page local="2" global="2"/><b> The second row refers to the accurate, deep, "true" analysis of the essay.</b></p>
<p><b>We have coined two terms to describe this difference. Since the top row is concerned with ap<u>prox</u>imation, we speak of the computer variables employed as <u>proxes</u>.
Since the bottom row is concerned with the true "intrinsic" variables of interest, we speak of such variables as <u>trins</u>. A <u>trin</u>, then, is a variable of intrinsic interest to the human judge, for example, "aptness of word choice". Usually a trin is not directly measurable by present computer strategies. A <u>prox</u>, by contrast, is any variable measured by the computer as an approximation (or correlate) of some trin, for example, the proportion of uncommon words used by a student (where common words are discovered by a list look-up procedure in computer memory).</b></p>
<p><b>In the early part of our investigations, we concentrated on the right column and top row of Figure 1, looking for actuarial strategies, seeking out those proxes which would be of most immediate use in the simulation of the final human product, the ratings of stylistic factors.</b></p>
<p><b>For the first attempts, we evolved a general research design, which we have more or less followed to date:</b></p>
<p><b>(1) Samples of essays were judged by a number of independent experts. For our first trial, 272 essays, written by students in Grades 8 to 12 in an American high school, were judged by at least four independent teachers. These judgments of overall quality formed the trins.</b></p>
<p><b>(2) Hypotheses were generated about the variables which might be associated with these judgments. If measurable by computer, and feasible to program within the logistics of the study, these computer variables became the proxes of the study.</b></p>
<p><b>(3) Computer routines were written to measure these proxes in the essays. These were written in FORTRAN IV, for the IBM 7040 computer, and are highly modular and mnemonic programs, fairly well documented and available to computational linguists interested in using them or adapting them.</b></p>
<p><b>(4) Essays were prepared for computer input.
In the present stage of data processing, this means that they were typed by clerical workers on an ordinary keypunch. They were punched into cards, and these cards served as input for the next stage.</b></p><page local="3" global="3"/>
<p><b>(5) The essays were passed through the computer, under the control of the program which collected data about the proxes. The output appears in Figure 2.</b></p>
<doubt alpha="85.7" length="7" tooSmall="False" monospace="0.0">Figure 2</doubt><p><b><i>PEG-IA Output</i></b></p>
<p><b>Figure 2 shows a piece of output from PEG-IA. Line A shows the way a sentence from the student essay is rewritten in 12-character double-precision computer "words" and stored in memory. Line B shows the summary of data for that sentence just analyzed. The first number is the essay identification. The other numbers of Line B are some counts from that sentence. Line C shows a summary of these counts, across sentences, for this whole essay. And Line D shows these measures transformed in a number of simple ways, and ready for input into the final analysis.</b></p>
<p><b>(6) These scores were then analyzed for their multivariate relationship to the human ratings, were weighted appropriately, and were used to maximize the prediction of the expert human ratings. This was all done by use of standard multiple-regression programs.</b></p>
<p><b>The first analyses produced results as shown in Table 1. Here it is possible to read the list of proxes (Col. A), and their correlation, after transformation, with the human judgments of overall quality (Col. B). Col. C shows their contribution to the total multiple regression, and Col.
D indicates the test-retest reliability of the proxes themselves, as discovered from two essays written by the same students, with about a month between writings.</b></p>
<p><b>[Figure 2: PEG-IA printout, showing the stored sentence text (Line A), the counts for that sentence (Line B), the essay totals (Line C), and the transformed measures (Line D).]</b></p>
<page local="4" global="4"/><table caption="Table I*"></table><p><b><i>Variables Used in Project Essay Grade I-A for a Criterion of Overall Quality</i></b></p>
<p>*Number of students judged was 272. Multiple R against human criterion (four judges) was .71 for both Essay C and Essay D (D data shown here). F-ratios for Multiple R were highly significant.</p>
<p><b>The overall accuracy of this beginning strategy was startling. The proxes achieved a multiple correlation of .71 for the first set of essays analyzed and, by chance, achieved the identical coefficient for the second set. Furthermore, the beta weightings from one set of essays did well in predicting the human judgments for the second set of essays written by the same youngsters. All in all, the computer did a respectable, "human-expert" job
in grading essays, as is visible in Table 2.</b></p>
<table class="main" frame="box" rules="all" border="0" regular="False">
<tr class="row"><td class="cell"></td><td class="cell"><p><b>A.</b></p><p><b><i>Proxes</i></b></p></td><td class="cell"><p><b>B.</b></p><p><b><i>Corr. with Criterion</i></b></p></td><td class="cell"><p><b>C.</b></p><p><b><i>Beta wts.</i></b></p></td><td class="cell"><p><b>D.</b></p><p><b><i>Test-Ret. Rel. (Two essays)</i></b></p></td></tr>
<tr class="row"><td class="cell"><p><b>1.</b></p></td><td class="cell"><p><b>Title present</b></p></td><td class="cell"><p><b>.04</b></p></td><td class="cell"><p><b>.09</b></p></td><td class="cell"><p><b>.05</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>2.</b></p></td><td class="cell"><p><b>Av. sentence length</b></p></td><td class="cell"><p><b>.04</b></p></td><td class="cell"><p><b>-.13</b></p></td><td class="cell"><p><b>.63</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>3.</b></p></td><td class="cell"><p><b>Number of paragraphs</b></p></td><td class="cell"><p><b>.06</b></p></td><td class="cell"><p><b>-.11</b></p></td><td class="cell"><p><b>.42</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>4.</b></p></td><td class="cell"><p><b>Subject-verb openings</b></p></td><td class="cell"><p><b>-.16</b></p></td><td class="cell"><p><b>-.01</b></p></td><td class="cell"><p><b>.20</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>5.</b></p></td><td class="cell"><p><b>Length of essay in words</b></p></td><td class="cell"><p><b>.32</b></p></td><td class="cell"><p><b>.32</b></p></td><td class="cell"><p><b>.55</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>6.</b></p></td><td class="cell"><p><b>Number of parentheses</b></p></td><td class="cell"><p><b>.04</b></p></td><td class="cell"><p><b>-.01</b></p></td><td class="cell"><p><b>.21</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>7.</b></p></td><td class="cell"><p><b>Number of apostrophes</b></p></td><td class="cell"><p><b>-.23</b></p></td><td class="cell"><p><b>-.06</b></p></td><td class="cell"><p><b>.42</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>8.</b></p></td><td class="cell"><p><b>Number of commas</b></p></td><td class="cell"><p><b>.34</b></p></td><td class="cell"><p><b>.09</b></p></td><td class="cell"><p><b>.61</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>9.</b></p></td><td class="cell"><p><b>Number of periods</b></p></td><td class="cell"><p><b>-.05</b></p></td><td class="cell"><p><b>-.05</b></p></td><td class="cell"><p><b>.57</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>10.</b></p></td><td class="cell"><p><b>Number of underlined words</b></p></td><td class="cell"><p><b>.01</b></p></td><td class="cell"><p><b>.00</b></p></td><td class="cell"><p><b>.22</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>11.</b></p></td><td class="cell"><p><b>Number of dashes</b></p></td><td class="cell"><p><b>.22</b></p></td><td class="cell"><p><b>.10</b></p></td><td class="cell"><p><b>.44</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>12.</b></p></td><td class="cell"><p><b>No. colons</b></p></td><td class="cell"><p><b>.02</b></p></td><td class="cell"><p><b>-.03</b></p></td><td class="cell"><p><b>.29</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>13.</b></p></td><td class="cell"><p><b>No. semicolons</b></p></td><td class="cell"><p><b>.08</b></p></td><td class="cell"><p><b>.06</b></p></td><td class="cell"><p><b>.32</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>14.</b></p></td><td class="cell"><p><b>No. quotation marks</b></p></td><td class="cell"><p><b>.11</b></p></td><td class="cell"><p><b>.04</b></p></td><td class="cell"><p><b>.27</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>15.</b></p></td><td class="cell"><p><b>No. exclamation marks</b></p></td><td class="cell"><p><b>-.05</b></p></td><td class="cell"><p><b>.09</b></p></td><td class="cell"><p><b>.20</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>16.</b></p></td><td class="cell"><p><b>No. question marks</b></p></td><td class="cell"><p><b>-.14</b></p></td><td class="cell"><p><b>.01</b></p></td><td class="cell"><p><b>.29</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>17.</b></p></td><td class="cell"><p><b>No. prepositions</b></p></td><td class="cell"><p><b>.25</b></p></td><td class="cell"><p><b>.10</b></p></td><td class="cell"><p><b>.27</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>18.</b></p></td><td class="cell"><p><b>No. connective words</b></p></td><td class="cell"><p><b>.18</b></p></td><td class="cell"><p><b>-.02</b></p></td><td class="cell"><p><b>.24</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>19.</b></p></td><td class="cell"><p><b>No. spelling errors</b></p></td><td class="cell"><p><b>-.21</b></p></td><td class="cell"><p><b>-.13</b></p></td><td class="cell"><p><b>.23</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>20.</b></p></td><td class="cell"><p><b>No. relative pronouns</b></p></td><td class="cell"><p><b>.11</b></p></td><td class="cell"><p><b>.11</b></p></td><td class="cell"><p><b>.17</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>21.</b></p></td><td class="cell"><p><b>No. subordinating conjs.</b></p></td><td class="cell"><p><b>-.12</b></p></td><td class="cell"><p><b>.06</b></p></td><td class="cell"><p><b>.18</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>22.</b></p></td><td class="cell"><p><b>No. common words on Dale</b></p></td><td class="cell"><p><b>-.48</b></p></td><td class="cell"><p><b>-.07</b></p></td><td class="cell"><p><b>.65</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>23.</b></p></td><td class="cell"><p><b>No. sents. end punc. pres.</b></p></td><td class="cell"><p><b>-.01</b></p></td><td class="cell"><p><b>-.08</b></p></td><td class="cell"><p><b>.14</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>24.</b></p></td><td class="cell"><p><b>No. declar. sents. type A</b></p></td><td class="cell"><p><b>.12</b></p></td><td class="cell"><p><b>.14</b></p></td><td class="cell"><p><b>.34</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>25.</b></p></td><td class="cell"><p><b>No. declar. sents. type B</b></p></td><td class="cell"><p><b>.02</b></p></td><td class="cell"><p><b>.02</b></p></td><td class="cell"><p><b>.09</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>26.</b></p></td><td class="cell"><p><b>No. hyphens</b></p></td><td class="cell"><p><b>.18</b></p></td><td class="cell"><p><b>.07</b></p></td><td class="cell"><p><b>.20</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>27.</b></p></td><td class="cell"><p><b>No. slashes</b></p></td><td class="cell"><p><b>-.07</b></p></td><td class="cell"><p><b>-.02</b></p></td><td class="cell"><p><b>-.02</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>28.</b></p></td><td class="cell"><p><b>Aver. word length in ltrs.</b></p></td><td class="cell"><p><b>.51</b></p></td><td class="cell"><p><b>.12</b></p></td><td class="cell"><p><b>.62</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>29.</b></p></td><td class="cell"><p><b>Stan. dev. of word length</b></p></td><td class="cell"><p><b>.53</b></p></td><td class="cell"><p><b>.30</b></p></td><td class="cell"><p><b>.61</b></p></td></tr>
<tr class="row"><td class="cell"><p><b>30.</b></p></td><td class="cell"><p><b>Stan. dev. of sent. length</b></p></td><td class="cell"><p><b>-.07</b></p></td><td class="cell"><p><b>.03</b></p></td><td class="cell"><p><b>.48</b></p></td></tr>
</table>
<page local="5" global="5"/><doubt alpha="71.4" length="7" tooSmall="False" monospace="0.0">Table 2</doubt><p><b><i>Which One is the Computer?</i></b></p>
<p>Below is the intercorrelation matrix generated by the cross-validation of PEG I.</p>
<p><b>Here we see the results of a cross-validation. These are correlations between judgments of 138 essays done by five "judges," four of them human and one of them the computer. The computer judgments were the grades given by the regression weightings based on 138 <u>other</u> essays by <u>other</u> students. This cross-validation, then, is very conservative. Yet, from a practical point of view, the five judges are indistinguishable from one another.</b></p>
<p><b>However useful such an overall rating might be, we of course still wished greater <u>detail</u> in our analysis. We therefore broadened the analysis to five traits believed important in essays, adapted partly from those of Paul Diederich. They may be summarized as: <u>ideas</u>, <u>organization</u>, <u>style</u>, <u>mechanics</u>, and <u>creativity</u>.
We had a particular interest in creativity, since some critics from the beginning have believed that the computer must founder on this kind of measure. "You might grade mechanics all right," someone will say, "but what about originality? What about the fellow who is really different? The machine can't handle him!"</b></p>
<p><b>Therefore, in 1966 we called together a group of 32 highly qualified English teachers from the schools of Connecticut to see how <u>they</u> would handle creativity and these other traits. Each of 256 essays was rated on a five-point scale on each of these five important traits, by eight such expert judges, each acting independently of any other judge. The teacher ratings were then analyzed, and it was found that the essay and the trait contributed significant variances, as did the trait-by-essay interaction (perhaps the clearest demonstration of the ipsative profile). To investigate each of these five trait ratings, the same 30 proxes were again employed, with the results to be seen in Table 3.</b></p>
<table class="main" frame="box" rules="all" border="0" regular="False">
<tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>A</p></td><td class="cell"><p>B</p></td><td class="cell"><p><b><i>Judges </i></b><b><i>C</i></b></p></td><td class="cell"><p>D</p></td><td class="cell"><p>E</p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p>A</p></td><td class="cell"><p></p></td><td class="cell"><p>51</p></td><td class="cell"><p>51</p></td><td class="cell"><p>44</p></td><td class="cell"><p>57</p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p>B</p></td><td class="cell"><p>51</p></td><td class="cell"><p></p></td><td class="cell"><p>53</p></td><td class="cell"><p>56</p></td><td class="cell"><p>61</p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p>C</p></td><td class="cell"><p>51</p></td><td class="cell"><p>53</p></td><td class="cell"><p></p></td><td class="cell"><p>48</p></td><td class="cell"><p>49</p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p>D</p></td><td class="cell"><p>44</p></td><td class="cell"><p>56</p></td><td class="cell"><p>48</p></td><td class="cell"><p></p></td><td class="cell"><p>59</p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p>E</p></td><td class="cell"><p>57</p></td><td class="cell"><p>61</p></td><td class="cell"><p>49</p></td><td class="cell"><p>59</p></td><td class="cell"><p></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr>
</table>
<page local="6" global="6"/><doubt alpha="71.4" length="7" tooSmall="False" monospace="0.0">Table 3</doubt><p><b>Computer Simulation of Human Judgments For Five Essay Traits (30 predictors, 256 cases)</b></p>
<doubt alpha="66.7" length="6" tooSmall="False" monospace="0.0">Note:</doubt>
<p><b>Col. B represents the reliability of the human judgments of each trait, based upon the sum of eight independent ratings, August 1966.</b></p>
<p><b>Col. C represents the multiple-regression coefficients found in predicting the pooled human ratings with 30 independent proxes found in the essays by the computer program of PEG-IA.</b></p>
<p><b>Col. D presents these same coefficients, shrunken to eliminate capitalization on chance from the number of predictor variables (cf. McNemar, 1962, p. 184).</b></p>
<p><b>Col. E presents these coefficients, both shrunken and corrected for the unreliability of the human groups (cf. McNemar, 1962, p.
153).</b></p>
<p><b>In our rapidly growing knowledge, Table 3 may temporarily say the most to us about the computer analysis of important essay traits. Column A of course gives the titles of the five traits (more complete descriptions of the rating instructions may be supplied on request). Column B shows the rather low reliability of the group of eight human judges, computed by analysis of variance.</b></p>
<p><b>Here in Column B "creativity" is less reliably judged by these experts than are the other traits, even when eight judgments are pooled. And mechanics may be the most reliably graded of these five traits. Surely, then, <u>humans</u> seemed to have a harder time with creativity than with mechanics.</b></p>
<table class="main" frame="box" rules="all" border="0" regular="False">
<tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><b>A.</b></p><p><b>Essay Traits</b></p></td><td class="cell"><p><b>B.</b></p><p><b>Hum.-Gp. Reliab.</b></p></td><td class="cell"><p><b>C.</b></p><p><b>Mult. R</b></p></td><td class="cell"><p><b>D.</b></p><p><b>Shrunk. Mult. R</b></p></td><td class="cell"><p><b>E.</b></p><p><b>Corr. (Atten.)</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p><b>I.</b></p></td><td class="cell"><p><b>Ideas or Content</b></p></td><td class="cell"><p><b>.75</b></p></td><td class="cell"><p><b>.72</b></p></td><td class="cell"><p><b>.68</b></p></td><td class="cell"><p><b>.78</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p><b>II.</b></p></td><td class="cell"><p><b>Organization</b></p></td><td class="cell"><p><b>.75</b></p></td><td class="cell"><p><b>.62</b></p></td><td class="cell"><p><b>.55</b></p></td><td class="cell"><p><b>.64</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p><b>III.</b></p></td><td class="cell"><p><b>Style</b></p></td><td class="cell"><p><b>.79</b></p></td><td class="cell"><p><b>.73</b></p></td><td class="cell"><p><b>.69</b></p></td><td class="cell"><p><b>.77</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p><b>IV.</b></p></td><td class="cell"><p><b>Mechanics</b></p></td><td class="cell"><p><b>.85</b></p></td><td class="cell"><p><b>.69</b></p></td><td class="cell"><p><b>.64</b></p></td><td class="cell"><p><b>.69</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"><p><b>V.</b></p></td><td class="cell"><p><b>Creativity</b></p></td><td class="cell"><p><b>.72</b></p></td><td class="cell"><p><b>.71</b></p></td><td class="cell"><p><b>.66</b></p></td><td class="cell"><p><b>.78</b></p></td><td class="cell"></td></tr>
<tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr>
</table>
<page local="7" global="7"/><p><b>What of the computer? Column C shows the raw multiple correlations of the proxes with these rather unreliable group judgments.
These were the coefficients produced by the standard regression program run by Dieter Paulus and myself. Column D simply shows the same coefficients after the necessary shrinking to avoid the capitalization on chance which is inherent with multiple predictors. Finally, in order for a fair comparison to be made among the traits, the criterion's unreliability should be taken into account, as in Column E (each Column D coefficient divided by the square root of the corresponding Column B reliability). Here such difficult variables as creativity and organization no longer seem to suffer; the computer's difficulty is apparently in the criterion itself, and is therefore attributable to human limitations, rather than to machine or program limitations. Column E, then, exhibits the cross-validated correlation that might be expected from a similar set of essays, if predicting a perfectly reliable set of human judgments.</b></p><p><b>Current and Projected Problems</b></p><p><b>Of course, all this is a temporary reading taken in the middle of the research stream. Our investigators have also gone on with other strategies. Donald Marcotte (1967) has developed a phrase analyzer, and has discovered that clichés, as usually listed, were largely irrelevant to the judgment of such essays. Dieter Paulus (1967a) has studied the curvilinearity of proxes, and concluded that much elaborate statistical optimization may be a waste of time, and that the major improvements should probably be made in other ways. He has also studied feedback to the student writer, using an on-line time-sharing console (Paulus, 1967b), as has Michael Zieky. Another researcher, Jack H. Hiller (1967), has investigated quasi-psychological dimensions (including <u>opinionation</u> and <u>vagueness</u>) as predictors of the human judgments. Using techniques familiar from automatic content analysis (cf. 
Stone et <u>al.</u>, 1966), he constructed lists of words and phrases to define the variables of psychological interest, and found these negatively correlated, as he predicted, with writing quality. And, in May, 1967, a sizeable improvement was made in the statistical accuracy, increasing the multiple-regression coefficient from about .71 to about .77, and improving the variance accounted for by around 20%. In other words, the newest programs apparently do <u>better</u> than the individual, expert English teacher.</b></p><p><b>The early strategies, then, have provided fertile ground for statistical investigation of essay grading, especially in the actuarial simulation of rating of style. But what of the deeper dimensions of stylistic analysis, and what of subject-matter content, as in essay questions in history, philosophy, or science?</b></p><page local="8" global="8"/><p><b>Possible contributory linguistic strategies have been under more intensive study in recent months, with the advice and help of Susumu Kuno (1964), Stanley Petrick (Keyser and Petrick, 1967), John Olney (Olney and Londe, 1966; also see Harris, 1952) and others. (Of course these workers are not responsible for errors or misconceptions in the present paper.) Anticipated future strategies are currently summarized in Table 4. This table is based partly on work already accomplished in Project Essay Grade, partly on suggested minor adaptations of systems already working for others, and partly on projected programs which are not yet apparently operative in any system, but which do not seem impossibly difficult at the efficiency desired.</b></p><doubt alpha="71.4" length="7" tooSmall="False" monospace="0.0">Table 4</doubt><p><b>Project Essay Grade Hypothetical Complete Essay Grader</b></p></section><section number="1." title="INPUT and PUNCH.   Handwritten or typewritten or"><p><b>other raw response of the writer is converted for computer input.</b></p></section><section number="2." 
title="SNTORG.    Creates arrays of words and sentences as"><p><b>found in prose. This is just as performed in PEG-I.</b></p></section><section number="3." title="DICT.    Assignment of available syntactic roles to"><p><b>each word. This is currently done by many programs, but needs an expanded dictionary and an ambiguity resolver. At the same time, the semantic information will be stored in the workspace for reference by other parts of the program. Availability of the tape-written Random House Dictionary (Unabridged) has been promised.</b></p><doubt alpha="62.3" length="53" tooSmall="False" monospace="0.0">4. PARS.    A modified Kuno (1964) program seems most</doubt><p><b>promising, and is currently being programmed for both the 7094 and the 360 by workers at IBM. Alterations will be necessary to accept well-formed substrings.</b></p></section><section number="5." title="REFER. This is intended to identify and encode the"><p><b>most likely referents of pronouns and other anaphoric expressions (cf. Olney and Londe, 1966). This process must employ both syntactic features and semantic information from DICT.</b></p><page local="9" global="9"/></section><section number="6." title="KERNEL and STRUC.    From the rewritten string output"><p><b>of (5), KERNEL would establish a set of elementary propositions, and STRUC would encode the relationships among these elements. This step would retain all the information of an essay in the simplest possible units, yet would retain additional information about emphasis, subordination, causal relation, etc., among these units.</b></p></section><section number="7." title="EQUIV.    The elementary units would be augmented by"><p><b>the semantic information in DICT. To each word would be assigned a cluster of permissible synonyms, with weightings of semantic distance. 
This permits an analysis of redundancy and emphasis in the essay, and permits a comparison of the content of the student essay with that of the key or master essay.</b></p></section><section number="8." title="STYLE.    Descriptions of the surface structure char-"><p><b>acteristics of the essays: parts of speech, organization of themes, types and varieties of sentence structure, grammatical depths, tightness of reference, etc.; information about grammatical errors and strengths.</b></p></section><section number="9." title="CONTNT.    Comparison of the agreement of student and"><p><b>master essay, through measure of kernel hits and struc hits, these weighted by semantic distance of language chosen.</b></p><p><b>10. SCOR. Multivariate prediction of appropriate profile for the immediate purpose.</b></p><p><b>The limitations of space will permit only a few comments on this table, which may be seen as representing a hypothetical, ideal essay grader. For large grading systems, over established substantive content, it would be possible, <u>for the key or master essay</u>, to edit by hand the output from certain routines (especially REFER and STRUC). Of course, four of the most important routines listed in Table 4 are far from perfected in any existing programs. Ideally, they would assume better solutions to certain major, stubborn problems in computational linguistics.</b></p><page local="10" global="10"/><p><b>Indeed, the steps in this hypothetical essay grader are close to the heart of the most persistent and troublesome problems in linguistics. Is it necessary that sentences be syntactically analyzed before mapping into deep structure? What is the proper role of semantics in such deep structure? How can the outside knowledge of the reader be incorporated into the machine analysis? (For some discussion of this problem, see Quillian, 1966.) 
In general, how may we incorporate some of the intuitive richness which the literate human brings to his reading?</b></p><p><b>It is not expected that workers in essay grading will suddenly resolve all such questions. They may be recognized as those which so trouble linguists as to contribute to the recent official pessimism, in the United States, about the future of mechanical translation. After 15 years of effort, mechanical translation is still regarded as disappointing in quality, and virtually no sustained output of any machine program would be ordinarily mistaken for the work of a professional human translator.</b></p><p><b>On the other hand, the earliest attempts at essay grading by computer have, in a very limited way, leaped ahead of machine translation. And if the expert human ratings of high school essays may be regarded as an acceptable goal, then the machine program appears to have reached such a goal already. For that matter, improved performance, even superior to that of the individual human expert, appears to be immediately practicable as well.</b></p><p><b>The explanation of this advantage, of course, is that the problem of essay grading as attacked in the current work is much easier than the problem of machine translation. In translation, every nuance of the input string should be accounted for in the output string. In essay grading, only a certain portion of the input text needs to be accounted for, and the output does not depend on the existence of any large language-generating system. High quality machine translation apparently demands a fair portion of the total language-manipulating capability of the human, but essay grading may use only a fraction of it, and may process language in ways quite different from those of the human being. 
For example, our present programs have to date largely ignored <u>order</u> and <u>sequence</u> in the essays, although to the human the order of words is, of course, of crucial and unceasing importance.</b></p><p><b>Since essay grading can work with such fractional information, then, why pursue the deeper analysis of Table 4? Clearly, the purpose is not entirely the same as it would be for the usual linguist. At any discrete</b> <b>time in research, what is sought is not <u>necessarily</u> the perfect humanoid behavior, but rather those portions of that behavior which, given any current state of the art, will contribute optimally to efficient and practicable improvements in output.</b><page local="11" global="11"/><b> Indeed, regardless of the eventual perfection of deep linguistic behavior, for any specific application to essay grading, at any one moment, large portions of such available behavior may be irrelevant, just as it seems that ordinary human language processing does not usually call for our full linguistic effort.</b></p><p><b>Yet we regard it as eventually important to be <u>able</u> to perform these various kinds of advanced machine analysis when required. Therefore, the eventual uses of the ideal essay analyzer may require analytic capability as deep as may be imagined. Writing out suitable comments for the student, for example, will, in some cases, tax any system which may be foreseen.</b></p><p><b>Even approximate solutions to these problems, however, though unsatisfactory for certain scientific purposes, could make important contributions to the educational description and evaluation of essays. For such evaluation is itself probabilistic, limited by imperfect asymptotes of writer consistency and rater agreement. And such evaluation therefore does not require, to be practicable and satisfactory, the same deterministic perfection which has continued to elude and frustrate researchers in mechanical translation. 
There is a fundamental difference in goals, which must be realized. As has been demonstrated here, the output from much cruder statistical programs has already reached a quality not too remote from usefulness. The more advanced strategies currently seem, at least to the present workers, bright with promise.</b><page local="12" global="12"/></p><p><b>References</b></p><p><b>Daigon, Arthur. Computer Grading of English Composition. <u>The English Journal</u>, January, 1966, 46-52.</b></p><p><b>Harris, Z. S. Discourse analysis. <u>Language</u>, 1952, 28 (4), 474-493.</b></p><doubt alpha="60.8" length="51" tooSmall="False" monospace="0.0">Hiller, Jack H., Page, E. B., and Marcotte, D. R. A</doubt><p><b>Computer Search for Traits of Opinionation, Vagueness, and Specificity-Distinction in Student Essays. Paper read at the Annual Meeting of the American Psychological Association, Washington, D.C., September 2, 1967.</b></p><doubt alpha="62.6" length="99" tooSmall="False" monospace="0.0">Keyser, S. J., and Petrick, S. R.    Syntactic Analysis, 1966.    (In press in a forthcoming book.)</doubt><p><b>Kuno, Susumu. Some characteristics of the Multiple-Path Syntactic Analyzer. <u>Language Data Processing</u>, Cambridge: Harvard Computation Laboratory, 1964. C6, 1-8.</b></p><p><b>Marcotte, Donald. The Computer Analysis of Cliché Behavior in Student Writing. Paper read at the Annual Meeting of the American Educational Research Association, New York, February 18, 1967.</b></p><p><b>McNemar, Quinn. <u>Psychological Statistics</u>, 3rd ed. New York: Wiley, 1962.</b></p><p><b>Olney, John and Londe, D. A research plan for investigating English discourse structure with particular attention to anaphoric relationships. <u>Tech Memo </u>TM-(L)-3256. Santa Monica, California: System Development Corporation. November 22, 1966. 17 p.</b></p><p><b>Page, Ellis B. 
The Imminence of Grading Essays by Computer. <u>Phi Delta Kappan</u>, January, 1966, 238-243.</b></p><p><b>Page, Ellis B. Grading Essays by Computer: Progress Report. <u>Proceedings of the 1966 Invitational Conference on Testing Problems</u>. Princeton, N.J.: Educational Testing Service, 1967. Pp. 87-100.</b></p><p><b>Paulus, Dieter. Problems of Nonlinearity in Grading Essays. Paper read at the Annual Meeting of the American Educational Research Association, New York, February 16, 1967a.</b></p><p><b>Paulus, Dieter. Feedback in Project Essay Grade. Paper read at the Annual Meeting of the American Psychological Association, Washington, D.C., September 2, 1967b.</b></p><p><b>Quillian, M. Ross. Semantic Memory. Cambridge, Mass.: Bolt Beranek and Newman, 1966.</b></p><page local="13" global="13"/><p><b>References (Continued)</b></p><p><b>Stone, Philip J., Dunphy, Dexter C., Smith, Marshall S., and Ogilvie, Daniel M. <u>The General Inquirer: A Computer Approach to Content Analysis</u>. Cambridge: M.I.T. Press, 1966. Pp. 651.</b></p><p><b>Woods, William A. Semantics for a Question-Answering System. Paper read at the Annual Meeting of the Association for Machine Translation and Computational Linguistics. Atlantic City, N.J. April 21, 1967.</b></p></section></body></article>