<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1"/><title>Syntactic Analysis by Alternating Computation and Inspection</title><author surname="Leunbach" givenname="Gustav"></author></firstpageheader><frontmatter></frontmatter><abstract></abstract></header><body><section title=""><p><b>Syntactic Analysis by Alternating Computation and Inspection.</b></p><p><b>By Gustav Leunbach, The Danish Institute for Educational Research.</b></p><p><b>Automated procedures for analyzing a given text of some language into a sequence of morphemes representing word classes and flexions are important for machine translation, for automatic abstracting and indexing, and possibly for other technical purposes. It is a commonplace that an automatic procedure cannot resolve every possible utterance of the language, and one of the several problems facing programmers of such procedures is whether to leave some sequences unresolved or to present them for inspection to a person acting on the basis of some linguistic knowledge (theoretical and/or practical).</b></p><p><b>It is also a question of theoretical interest to what extent a given automatic procedure will function and how much it will be improved by the addition of one or another set of rules. The purpose of this paper is not to study existing programs from this viewpoint, but rather to build up a program from nothing, investigating step by step the economy of various additions to it.</b></p><p><b>In addition to the word Inspection, for which a definition has been attempted above, the title of the paper contains another key word, Computation, which is taken to mean any manipulation of symbols by a fixed set of rules. In this context the symbols are linguistic entities such as phonemes, letters, morphemes, words or sentence clauses, as well as formal symbols, e.g.
numerals, used by definition to represent these entities and the relationships between them.</b><page local="2"/></p><p><b>Computation may be performed either by means of an electronic computer, a datamat, or by clerical assistance. The advantages of using datamats can in the main be summed up under three headings:</b></p><p><b>1. Accuracy. Errors due to malfunction of the datamat will in almost all cases be easily distinguished from correct results. There will, however, often be a need for coding to a datamatic medium, and the errors associated with clerical computation may be expected to appear in this part of the work.</b></p><p><b>2. Speed. My personal experience - competing with too many others for access to a datamat with too frequent technical breakdowns - has taught me not to overemphasize this advantage.</b></p><p><b>3. Controllability of instruction. If a clerk performs computational work on a text of a language he knows, it is not possible to prevent his common sense from interfering with the program he is performing, often with benefit, but at times in ways that are harmful to the purpose of the investigation. If the language is completely unknown to him, and he knows that it is - incorrect knowledge is even more harmful than correct knowledge - he will tend to develop headaches and lose both accuracy and speed.
The datamat performs exactly the instructions contained in the program - programming errors may be much harder to detect than errors due to mechanical malfunctioning, but this matter pertains to the field of program writing technique in general.</b></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">2</doubt><page local="3"/><p><b>The text which I have used in my study is a written one - I lack the necessary facilities for handling oral language - a Western novel in Danish coded on paper tape for use in typographical machinery - of course in a tape format almost incompatible with those used by datamats, but with some inventiveness it has been transferred character by character onto tape of the normal format for the datamat in question. In the composition of the novel no literary merit was intended, but since stylistic analysis is not a part of my program this is not a serious defect.</b></p><p><b>The symbols of the code are readily divided into letters and non-letter symbols, the latter being normal symbols of written language (space, case shifts, comma, point etc.) and typographical symbols such as italicizing and de-italicizing codes.</b></p><p><b>In the first computational phase the text is sorted into words and separators between words; a word is defined as an unbroken sequence of letter symbols, with the exception that a lower case symbol is allowed if the word begins with a capital.</b></p><p><b>The words are sorted alphabetically and each word is given a number above 200; all non-letter symbols have numbers below 100 (with separate values for each in upper and lower case); a space directly between two words is suppressed; otherwise the text is stored as a sequence of numerals in the two intervals mentioned.
The text is broken up into units of a maximum size determined by the storage rules of the datamat, but as far as possible terminated by a full stop, question mark or the like.</b></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">3</doubt><page local="4"/><p><b>Thus, when a later computational phase implies typing out the context of a certain expression, the context will not be a fixed number of words before and after, but in most cases a linguistically relevant section of the text.</b></p><p><b>The occurrences of each word are counted, not with the purpose of investigating any of the current theories of word frequencies, but for two specific reasons:</b></p><p><b>1. The words which occur only once are listed for inspection with the purpose of finding as many misprints as possible for correction in a later computational program (the tapes available for this study are the input before proofreading).</b></p><p><b>2. A list of the most frequent words is printed. This will contain some words directly related to subject matter - the name of the novel's leading character is placed at about no. 15 in frequency order - but mostly it will be words whose meaning is largely defined by their place in the syntactic structure, and many of the instructions in a program for computational analysis will deal with the treatment of these words. Even when words of concrete denotation enter this list, their frequency (in this text) may make it good economy to add information about word class etc. for them.</b></p><p><b>A further computational program in the first phase compares all words in the alphabetic list with a set of flexions and indicates which words may possibly be derived from other words in the list.
This information is added to the word list.</b></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">4</doubt><page local="5"/><p><b>At this stage the first set of general computational rules for sentence construction is introduced, partially by inspection of the list of most frequent words. This list is likely to contain instances of the following word classes:</b></p><p><b>Personal pronouns in subject form (I, he, we etc.).</b></p><p><b>Personal pronouns in non-subject form (me, him, us etc.).</b></p><p><b>Pronouns with nominal function (who etc.).</b></p><p><b>Pronouns with nominal or adjectival function.</b></p><p><b>Prepositions whose usual function is to adverbialize the following nominal clause.</b></p><p><b>Auxiliary verbs which fulfill the function of the finite verb, but are usually followed by other verb forms.</b></p><p><b>Conjunctions.</b></p><p><b>Adverbs which are characterized by their position in relation to the finite verb.</b></p><p><b>Particles which form a complete sentence (yes).</b></p><p><b>(Note: Concrete examples are replaced by their English counterparts whenever possible, which it often is, owing to the structural similarity between Danish and English. Three important differences may be noted: The definite article is a separate word if an adjective is present; otherwise it is appended to the noun as a flexion. The present tense of a verb is always different from the infinitive and has no personal flexion.
The past participle is different from the past tense.)</b></p><p><b>Homonyms may occur in the list with one meaning in one of the above classes and another meaning in a class of words of concrete denotation. Example: "så" (at about the 20th place in frequency order) may be an adverb translated as "so" or "then", or the verb form "saw".</b><page local="6"/><b> The rules by which such words are treated must contain a warning; it may be reasonable to include word class information for the second meaning.</b></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">5</doubt><p><b>The numbers between 100 and 200 are available for the coding of the grammatical information mentioned here.</b></p><p><b>The text may contain sentences which are completely analyzed by the first set of rules: "He likes me." "He has given it to her." Such sentences establish certain words as finite verb forms (reserving the possibility that they may be homonyms) and others as participles. Other sentences will with great likelihood establish certain words as nouns or adjectives.</b></p><p><b>Semicolons are sentence separators with at least the same degree of certainty as a full stop, except that they do not cancel out a subsequent upper case shift as a signal of a proper name. Commas may be sentence separators, but in many instances they are alternatives to the conjunctions "and" and "or"; computational rules must thus to some extent treat these conjunctions and the comma as equals ("and" and "or" may also be sentence separators).</b></p><p><b>Now examples of incompletely analyzed sentences will be typed out for inspection with computed tentative assignments to word classes, including computed hypotheses of homonymy. A count is made of the proportion of the text which has been analyzed.
The inspector may judge it necessary to have typed out the sentences on which the assignment of certain words depends - guided by the frequency count: with very rare words it may be useless, with very frequent words it may be necessary to restrict output.</b><page local="7"/></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">6</doubt><p><b>The inspected sentences may reveal other less frequent words which ought to be assigned to the above-mentioned structured word classes. Or they may point to the necessity of assigning word classes to some words of concrete denotation (particularly homonyms), or of accepting or rejecting computed analyses of words into root morpheme and flexion morpheme. Or supposed sentence separators may be revealed to be abbreviation points and re-coded accordingly (this may lead to general computational rules such as re-coding all instances of "Mr.").</b></p><p><b>After this, the whole text is again computed, and examples of incompletely analyzed sentences "on a higher level" are presented for inspection, etc. (If a "hard core" remains, this may contribute to the list of instances of unresolvable ambiguity for future treatises on structural linguistics.) The important point is that every inspection phase is strictly limited; otherwise computation would be of no help.</b></p><doubt alpha="0.0" length="1" tooSmall="False" monospace="0.0">7</doubt></section></body></article>