American Journal of Computational Linguistics Microfiche 36

13TH ANNUAL MEETING

Timothy C. Diller, Editor

Sperry-Univac
St. Paul, Minnesota 55101

Copyright © 1975 by the Association for Computational Linguistics

PREFACE

The fifth and final ACL session was split into two subsessions: one continued the treatment of discourse structure and general knowledge begun in session 4; the other provided a look at several automated text analysis systems. Georgette Silva kindly chaired both subsessions. Only five of the six talks given are represented in this Proceedings. The paper detailing Salton's talk on automatic indexing was far too extensive to be included on this fiche and hence will be published separately.

The paper by Klappholz and Lockman discusses the problems involved in the resolution of cross-sentential reference and sketches an algorithm for their solution. (Note the closely related paper by Deutsch in Session 4.) Rosenschein addresses the problem of restricting the generation of inferential propositions given a set of beliefs and proposes a structural constraint upon inferencing. Beckles et al. present a man-machine approach to the description of idiolect variations in an environment extraordinarily complex linguistically and sociologically. Brill and Oshika describe a set of programs which permit both batch and interactive processing of orthographic and phonological strings to provide information on frequency, contextual variation, and associational relations. Anderson, Bross, and Sager present a theory of linguistic compression in written texts and describe the results of an implementation of that theory.

Timothy C. Diller, Program Committee Chairman

TABLE OF CONTENTS

MODELING DISCOURSE AND WORLD KNOWLEDGE II

Contextual Reference Resolution
    David Klappholz and Abe Lockman

How Does a System Know When to Stop Inferencing?
    Stan Rosenschein

Developing a Computer System to Handle Inherently Variable Linguistic Data
    D. Beckles, L. Carrington, and G. Warner in collaboration with C. Borely, H. Knight, P. Aquing, and J. Marquee

A Natural Language Processing Package
    David Brill and Beatrice T. Oshika

On the Role of Words and Phrases in Automatic Text Analysis
    Gerard Salton (Abstract only)

Grammatical Compression in Notes and Records: Analysis and Computation
    Barbara B. Anderson, Irwin D. J. Bross, and Naomi Sager

American Journal of Computational Linguistics Microfiche 36 : 4

CONTEXTUAL REFERENCE RESOLUTION

DAVID KLAPPHOLZ AND ABE LOCKMAN
Department of Electrical Engineering and Computer Science
Columbia University
New York, New York 10027

ABSTRACT

With the exception of pronominal reference, little has been written (in the field of computational linguistics) about the phenomenon of reference in natural language. This paper investigates the power and use of reference in natural language, and the problems involved in its resolution. An algorithm is sketched for accomplishing reference resolution using a notion of cross-sentential focus, a mechanism for hypothesizing all possible contextual references, and a judgment mechanism for discriminating among the hypotheses.

The reference resolution problem

The present work began as an attempt to develop a set of algorithms and/or heuristics to enable a primitive-based, inference-driven model of a natural language user (Schank 1972, Rieger 1974) to properly resolve pronominal references across sentence boundaries. The authors quickly realized, however, that the problem of pronominal reference resolution is only a small aspect of a problem which might be termed nominal reference resolution, itself but a small aspect of the problem of the coherence of a text (or conversation), i.e. the manner in which it "means" more than the logical conjunction of the meanings of its individual constituent sentences.

Examples of the first problem, i.e. pronominal reference resolution, are given in sentence sequences 1-4 below.

1. Yesterday some boys from our village chased a pack of wild dogs; the largest one fell into a ditch.
2. The wild dogs which forage just outside our village suffer from a strange bone-weakening disease. Yesterday some boys from our village chased a pack of wild dogs; the largest one broke a leg and fell into a ditch.
3. Yesterday John chased Bill half a block; he was soon out of breath.
4. My friend Bill has an extremely severe case of asthma. Yesterday John chased Bill half a block; he was soon out of breath.

The problem in utterance (text, conversation, etc.) excerpts of the above type is that of determining the referents of the various occurrences of the pronouns "one" and "he." For the moment we simply note that the usually preferred referents of the two occurrences of "one" are "boy" and "dog" (examples 1 and 2 respectively) and those of the two occurrences of "he" are "John" and "Bill" (examples 3 and 4 respectively).

The more general problem of nominal reference resolution is exhibited in the following annotated excerpt from a recent newspaper article (N. Y. Times 7/15/75, byline Arnold Lubasch); subscripted bracketing of the excerpt is intended only to enable later reference to specific parts of the text.

1[Some of the major provisions of 2[the state's Fair Campaign Code]2]1 were declared unconstitutional here yesterday by 3[a special Federal court]3 that assailed 4[the restrictions on election campaigning]4 as "repugnant to the right of freedom of speech."

5[The three-judge court,]5 which was convened to consider a constitutional challenge by three State Assembly candidates last year, threw out 6[7[the code's]7 prohibition against attacking any political candidate's race, sex, religion or ethnic background]6.

8[It]8 also overturned 9[10[11[the code's]11 ban]10 on any misrepresentation of a candidate's party affiliation, position on political issues and personal qualifications, including the use of "character defamation" and "scurrilous attacks."]9

According to 12[the court's]12 38-page decision, written by 13[Judge Henry F. Werber]13 with the concurrence of 14[Judges Leonard P. Moore and Mark A. Constantino]14, 15[16[the provisions]16 banning misrepresentation]15 "cast a substantial chill on the expression of protected speech that are unconstitutionally overbroad and vague."

If newspaper reporters had a bit more sympathy for those of us concerned with natural language processing, the above excerpt might have read as follows:

The state has a Fair Campaign Code. Some of the major provisions of the state's Fair Campaign Code are provisions which restrict something. Some of the things restricted by some of the major provisions of the state's Fair Campaign Code which restrict something are activities having to do with election campaigning.

Some of the activities having to do with election campaigning which are restricted by some of the major provisions of the state's Fair Campaign Code which restrict something are attacking a political candidate's race, sex, religion or ethnic background and misrepresenting a candidate's party affiliation, position on political issues . . .

Last year three state assembly candidates filed a constitutional challenge to some of the major provisions of the state's Fair Campaign Code which restrict something. Yesterday a special Federal court declared unconstitutional those of the major provisions of the state's Fair Campaign Code which restrict something . . .

The point is that in order for a machine or a human to validly claim to have "understood" the original excerpt, he/she/it must be able at the very least to demonstrate that he/she/it has established the following relationships between various items occurring in the excerpt. (Integers represent subscripted bracketed segments of the original excerpt.)

(i) The identity of 2, 7, and 11
(ii) The identity of 3, 5, 8, and 12
(iii) The fact that 4, 6, 9, and 15 are elements, subsets or parts of 1
(iv) The fact that 13 and 14 are members of 3

and on and on and on. (I.e. a closer analysis of the original excerpt reveals many more relationships which must be established before "understanding" may be claimed.)

If people actually wrote/spoke in the style of the somewhat facetious paraphrase of the original excerpt, the nominal reference problem would be reduced to one of matching lexical patterns and recognizing a few syntactic cues; to state the obvious, the necessity for more succinct linguistic communication has forced the development of elliptical devices which shift the burden of nominal reference resolution from syntactic analysis to an analysis of the "semantics" of sentences in context. More specifically, nominal references cannot in general be resolved without the use of general semantic information as well as specific world knowledge.

While the fact that syntactic analysis alone is insufficient for understanding is anything but novel, the question of the magnitude of the nominal reference problem and of its solution's crucial dependence upon local context seems to have been little commented upon. (Clark (1975) discusses the problem from a viewpoint different from that of this paper.) The reader who remains unconvinced by the examples above that local context (and specific world knowledge relating to local context) must play a crucial role in reference resolution is asked to consider the two sentence sequences 5a, 6, and 5b, 6.

5. a. The founding fathers had a difficult time agreeing on how the basic laws governing our country should be framed.
   b. Those foolish people at the country club have spent an incredible amount of time arguing about club rules.
6. The second article of the constitution, for example, was argued about for months before agreement was reached.

In sentence sequence 5a, 6, "the second article" clearly refers to the second article of the constitution of the United States, while in sentence sequence 5b, 6, the reference is to the second article of the constitution of the country club. In each case the only factor involved in resolving the reference is the semantic content of its local context, in this case the meaning of the sentence preceding the one in which the reference occurs.

Since the lexical item "the constitution" appears in the example just considered, a word concerning such proper-noun-like objects is in order. In any language there are lexical items and phrases, such as those appearing in 7 below, which, in the absence of a compelling alternative, have standard default referents; for example, the standard default referents of the items in 7 are the corresponding items in 8.

7. a. The constitution
   b. The founding fathers
   c. Wall Street
   d. The establishment
   e. The president
   f. Madison Avenue
8. a. The constitution of the U.S.
   b. The founding fathers of the U.S.
   c. The U.S. business community (or that part of it residing in New York City)
   d. Those people who have the power to influence the course of events in the nation, etc. etc.
   e. The president of the U.S.
   f. The advertising industry

In order for textual occurrences of such proper-noun-like objects to be properly handled, their standard default referents must be listed in the lexicon. This is not to say that occurrences of proper-noun-like objects cannot be references to objects occurring previously in the text; rather it is the case that their default options must also be considered as possible referents.
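The role of default referents can be illustrated with a minimal sketch. It assumes a lexicon stored as a simple table keyed by phrase; the names `DEFAULT_REFERENTS` and `candidate_referents` are hypothetical and are not part of the system described in this paper. The sketch shows only that a default referent is added to, rather than substituted for, the candidates supplied by the preceding text.

```python
# Hypothetical lexicon of standard default referents, mirroring lists 7 and 8.
# The table layout and all names here are assumptions made for illustration.
DEFAULT_REFERENTS = {
    "the constitution": "the constitution of the U.S.",
    "the founding fathers": "the founding fathers of the U.S.",
    "wall street": "the U.S. business community",
    "the establishment": "those people who have the power to influence "
                         "the course of events in the nation",
    "the president": "the president of the U.S.",
    "madison avenue": "the advertising industry",
}


def candidate_referents(phrase, focus_objects):
    """Return the possible referents of a phrase: the objects already
    present in the local context, followed by the default referent
    if the lexicon lists one for this phrase."""
    candidates = list(focus_objects)
    default = DEFAULT_REFERENTS.get(phrase.lower())
    if default is not None and default not in candidates:
        candidates.append(default)
    return candidates
```

For sequence 5b, 6, a call such as `candidate_referents("The constitution", ["the constitution of the country club"])` would offer both the club's constitution and the default; choosing between them is left to whatever judgment is applied afterwards.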

As final examples of the reference resolution problem let us consider sentence sequences 9 and 10 below.

9. The president was shot while riding in a motorcade down one of the major boulevards of Dallas yesterday; it caused a panic on Wall Street.
10. John was invited to tea at the Quimby's last Saturday; he would have loved to go, but he knew he'd be busy then.

In example 9, while the first sentence of the sequence contains a number of noun objects (president, motorcade, boulevards, Dallas) which are potential referents for the occurrence of "it" in the second sentence, none of these is, in fact, the proper referent; rather, the proper referent of "it" is the event (or fact) that "The president was shot while . . ." In example 10 we have an instance of an adverbial reference ("then") which must be recognized as referring to "last Saturday" rather than to some non-adverbial object occurring in the first sentence of that example.

Sketch of a Solution

From the point of view of computer implementation, the problem of nominal reference resolution is one of creating tokens for noun objects mentioned in a text, and discovering and encoding the relations, alluded to in the text, which hold between them and various other tokens in memory.

This problem, though certainly not its magnitude or ramifications, was noticed by Rieger (1974) in his pioneering implementation of a primitive-based model of a natural language user. Rieger's system, however, suffers from the incredible inefficiency resulting from its need to search all of memory in order to attempt any reference resolution; in addition it will often miss a quite obvious referent entirely, and, in fact, resolves non-pronominal references only accidentally if at all.

Before presenting a sketch of a proposed solution to the nominal reference resolution problem, it would be well to detail more precisely the overall language processing environment within which it is meant to operate and of which it is a most necessary part. First, we assume that a relatively small set, S, of semantic primitives and a logical-calculus-like language, L, for expressing "meanings" are available. The set S and language L must satisfy the following two conditions.

(i) The predicate, function, and constant symbols of L are members of S.
(ii) There is a one-to-one mapping from meanings of (natural language) sentences to formulas of L.

While a set of primitives and a meaning representation language even demonstrably close to satisfying the above conditions have yet to be produced, we will, in examples to follow, make use of meaning representations; the only claim we will make for them is that the functions served by their constituent constructs must be served by the elements of any adequate system.

In addition to a meaning representation scheme we will assume an encoding of world knowledge of the sort which a "typical" adult might possess, again with the same obvious caveat.

While the question of translation from natural language sentences to meaning representations will not be touched upon here, we will assume sentence-by-sentence translation of the sort exhibited in various examples to follow.

The solution to the reference resolution problem rests in recognizing the fact that reference is an elliptical device, and that the human understander of natural language cannot recapture that which was elided once he is too far from it in the text; in fact, he cannot resolve a reference to a point in the text more than a few sentences back without going back and pondering it (if he can do so at all). We should note that this is true even in the case in which the referent doesn't actually appear in the text, but appears only in an inference from some statement made in the text. In this latter case, a case which we will discuss only at the very end of this paper, the reference is not resolvable (and would not therefore have been made by the creator of the text in the first place) unless the statement from which the inference is made appears shortly before in the text. Though we cannot say precisely how far back is meant by "shortly before," it is certainly no more than a few sentences. For a given sentence, S, appearing in a text we will refer to the sequence of sentences preceding S by no more than the intended distance as the focus of S.

In terms of computer implementation, we will, in the processing of a text (which we conceive of as proceeding sentence-by-sentence), maintain the following focus sets.

(i) The noun-object focus - the set of tokens of all noun objects occurring in the meaning representation of the focus of S (where S is the sentence currently being processed)
(ii) The event focus - a set containing, for every sentence W in the focus of S, the object EVENT(F), where F is the meaning representation of W, and EVENT is a function which maps the meaning of a formula, F, into a noun-like object whose meaning is "the event (or fact) that F"
(iii) The time focus - a set containing tokens for all time references (e.g. yesterday, five o'clock, etc.) occurring in the meaning representation of the focus of S.

The reader may question our inclusion of every object appearing in the meaning representation of the focus of S in one of the above focus sets, i.e. in the set of potential referents. In fact, however, it seems to be the case that any object (of one of the above-mentioned types) occurring in the meaning representation of the focus of S may be the referent of an object occurring later in the meaning representation of S. Consider, for example, the sentence sequences formed by taking each of the sentences of 12 below, in turn, as an immediate continuation of a text containing sentence 11 below.

11. Stan argued with his sister Fran in an attempt to convince her that she should bring Mary, whom he would like to get to know, on their planned trip to the San Diego Zoo tomorrow.

12. a. _He_ was really insistent.
    b. _She_ was hard to convince.
    c. _It_ was useless.
    d. He thinks _she_'s the prettiest one of all Fran's friends.
    e. _The prospect_ really excites him.
    f. He argued that _it_ wouldn't tie Mary up for more than half a day.
    g. _It_'s the best one in the country, you know.
    h. _She_ thought _it_ was a terrible idea.
    i. She happened to be busy _then_, but expressed an interest in coming along another time.

Each of the underlined items in sentences 12a-12i references some object in sentence 11. (For the sake of clarity we present in 13 below the referents as we understand them.)

13. a. Stan
    b. Fran
    c. The attempt (to convince . . .)
    d. Mary
    e. EVENT (Stan will get to know Mary)
    f. The trip
    g. The San Diego Zoo
    h. Both "she" and "it" are ambiguous; if "she" is taken to be "Fran," then "it" refers to EVENT (Fran will bring Mary . . .); if "she" is taken to be "Mary," then "it" refers to EVENT (Mary will come . . .)

The point is, of course, that any item in (the meaning representation of) a sentence, S, may be referenced by some item in (the meaning representation of) a later sentence.
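The three focus sets described above can be pictured with a minimal sketch. It assumes a fixed window of three sentences, since the text commits only to "a few sentences"; the class name `Focus`, its method names, and the representation of EVENT(F) as a tagged pair are all hypothetical choices made for illustration.

```python
from collections import deque

FOCUS_LENGTH = 3  # assumed window size; the text says only "a few sentences"


class Focus:
    """Holds what each of the most recent sentences contributed to the
    noun-object, event, and time focus sets (a hypothetical design)."""

    def __init__(self, length=FOCUS_LENGTH):
        # A deque with maxlen drops the oldest sentence automatically
        # when a new one is appended to a full window.
        self._window = deque(maxlen=length)

    def add_sentence(self, formula, noun_tokens, time_tokens):
        self._window.append({
            "nouns": list(noun_tokens),
            "event": ("EVENT", formula),  # "the event (or fact) that F"
            "times": list(time_tokens),
        })

    @property
    def noun_objects(self):
        return [n for s in self._window for n in s["nouns"]]

    @property
    def events(self):
        return [s["event"] for s in self._window]

    @property
    def times(self):
        return [t for s in self._window for t in s["times"]]
```

Because the focus of S consists of the sentences preceding S, a caller would read the three sets before calling `add_sentence` for S itself.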

On the other side of the coin, the question of identifying potential references is just as important as that of identifying the set of all possible referents for an object which is known to reference something. If we were concerned only with pronominal reference resolution, the problem would have a simple solution; every pronoun is a reference. For nominal items other than pronouns the problem is far less simple; if a noun occurs in a text just how do we know if there is a previously occurring nominal item to which it refers? As much as we would like there to be algorithmically testable criteria, i.e. recognizable syntactic and/or semantic cues, for making the decision, there seem to be none. Thus, the mechanism we propose considers every token appearing in the translation of a sentence as a possible reference.

At present, we hypothesize the existence of a small set, R, of relations which are sufficient to account for all instances of nominal reference. Included in this set are, at the very least, the relations identity, member of, subset of, and part of. Note that although this list of relations is quite small, it suffices to handle all the examples of reference presented thus far (i.e. those occurring in sentence sequences 1-6 and 9-12 as well as those occurring in the excerpted newspaper article above).

All of the above observations taken together lead to the following sketch of an algorithm for reference resolution.

I. As each new sentence, S, is translated into its meaning representation, the various focus sets (noun-object, event, time) are updated.
II. A set, H, is formed containing all triples of the form (N1, N2, P) such that N1 is a nominal item occurring in (the meaning representation of) S, N2 is an object occurring in the focus set (noun-object, event, or time) appropriate to N1, and P is a member of R; H is the set of all current reference hypotheses arising from S.
III. A "judgment mechanism," discussed below, is invoked to determine the likelihoods of the correctness of the various members of H.

It is clear that following step II any further processing of reference hypotheses requires that all members of H be considered relative to one another, since the correctness or incorrectness of one may depend crucially upon that of others. In the general case not all hypotheses will turn out to be correct, and in fact some may contradict others, for instance in the case of two hypothesis-triples with identical first and second elements and different third elements. Once it has been created, the set H is submitted to a "judgment mechanism" whose task it is to choose some of the hypotheses as valid and others as invalid. The judgment mechanism must clearly have access to the world knowledge stored in memory, and must be capable of performing inferencing of a sort which produces decisions as to the relative likelihoods of the various hypotheses.
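Step II of the algorithm amounts to forming a Cartesian product, as the following minimal sketch shows. The function names `form_hypotheses` and `conflicting_pairs` are hypothetical; the relation names follow the set R given in the text, and the second function simply enumerates the kind of potential conflict mentioned above (same first and second elements, different third element).

```python
from itertools import product

# The relation set R as listed in the text.
R = ("identical to", "member of", "subset of", "part of")


def form_hypotheses(nominal_items, focus_objects, relations=R):
    """Step II: every triple (N1, N2, P) with N1 from the current
    sentence, N2 from the appropriate focus set, and P from R."""
    return [(n1, n2, p)
            for n1, n2, p in product(nominal_items, focus_objects, relations)]


def conflicting_pairs(hypotheses):
    """Pairs of triples with identical first and second elements and
    different third elements, which must be judged relative to one another."""
    pairs = []
    for i, a in enumerate(hypotheses):
        for b in hypotheses[i + 1:]:
            if a[0] == b[0] and a[1] == b[1] and a[2] != b[2]:
                pairs.append((a, b))
    return pairs
```

With three nominal items, two focus objects, and the four relations of R, `form_hypotheses` yields 24 triples, which is why a judgment step that prunes H is needed before anything else can be done with it.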

Before giving examples of just how such a judgment mechanism might work, we should make it clear that our sense of "inferencing" is very different from Rieger's (1974). In Rieger's sense inferencing is undirected, while ours is directed toward the goal of validating hypotheses. There is, in addition, another sense in which the sort of inferencing to be done by the judgment mechanism is directed. The fact that the reasons for validating or throwing out a particular reference hypothesis (on the part of human natural language users) involve the information conveyed in local context as well as world knowledge relating to items contained in that information (and world knowledge relating to items contained in world knowledge relating to items contained in that information, etc.) constitutes a good guess as to the particular pieces of world knowledge and the rules of inference which must be involved in judging that hypothesis.

Examples of reference resolution

14 and 15 below contain components of possible meaning representations of the two sentences of sentence sequence 1 at the beginning of this paper.

14. C1: CHASED (x1, x2)
    C2: TIME (C1, YESTERDAY)
    C3: SUBSET (x1, [BOY])
    C4: SUBSET (x2, [DOG])
    C5: GREATER (SIZE (x1), 1)
    C6: GREATER (SIZE (x2), 1)
15. C7: FALL INTO (y1, y2)
    C8: TIME (C7, PAST)
    C9: MEMBER (y2, [DITCH])
    C10: MEMBER (y1, y3)
    C11: LARGEST (y1, y3)

The meaning representations proposed for the two sentences are C1 ∧ C2 ∧ C3 ∧ C4 ∧ C5 ∧ C6 and C7 ∧ C8 ∧ C9 ∧ C10 ∧ C11 respectively. Note that we are not claiming that the predicates CHASED and FALL INTO and the constants YESTERDAY, BOY, DOG, PAST and DITCH are at the level of semantic primitives; rather, the above analyses are at just the level which we need to illustrate the operation of the reference resolution mechanism. Furthermore, the symbols YESTERDAY, BOY, DOG, PAST and DITCH should be taken as pointers to the definitions of the appropriate items encoded in memory in whatever fashion. The bracketing in the notation [A], where A is a pointer to a definition, is meant to be a function which takes A into an object whose meaning is the class of items satisfying the meaning pointed to by A.

Once the translation of the first sentence of sequence 1 into its meaning representation has been completed, on the assumption that that sentence is at the beginning of the text being processed, the various focus sets will contain the following:

noun-object focus: {x1, x2}
event focus: {EVENT (C1 ∧ C2 ∧ C3 ∧ C4 ∧ C5 ∧ C6)}
time focus: {YESTERDAY}

After the second sentence is translated, the set, H, of reference triple hypotheses presented to the judgment mechanism will then consist of every hypothesis of the form "yi P xj," where yi is one of y1, y2, y3, xj is one of x1, x2, and P is one of the following relations:

is identical to
is a member of
is a subset of
is a part of

Note that no member of the event focus occurs in H because the translation of the second sentence contains no term of the form EVENT(y); for simplicity we omit the question of time referencing.

All of the relations between y2 and x1 or x2 can be ruled out on the basis of SUBSET (x1, [BOY]), SUBSET (x2, [DOG]), MEMBER (y2, [DITCH]) and of the world knowledge to the effect that boys/dogs cannot be identical to, members of, subsets of or parts of ditches (of course in some weird fairy tale setting one of these might be possible and shouldn't be thrown out; but in such a case local context would inform us of the "weird" situation and the appropriate one wouldn't be thrown out).

The hypothesis that y1 or y3 is a part of either x1 or x2 can be ruled out on the basis of SUBSET (x1, [BOY]) and SUBSET (x2, [DOG]), which tell us that x1 and x2 are sets of objects, and the world knowledge that sets don't have "parts" in the sense of the "part of" relation. Identity between y1 and either x1 or x2 can be ruled out on the basis of MEMBER (y1, y3), which tells us that y1 is an individual, and SUBSET (x1, [BOY]), SUBSET (x2, [DOG]), GREATER (SIZE (x1), 1), and GREATER (SIZE (x2), 1), which tell us that x1 and x2 are sets containing more than one object. (Remember that we're not doing axiomatic set theory, in which there are no "individuals" in our sense and in which the sort of "individual" which is dealt with can be a subset of some set.)

Finally, the "member of" relation between y3 and either x1 or x2 can be ruled out on the basis of MEMBER (y1, y3), which requires that y3 be a set, SUBSET (x1, [BOY]), SUBSET (x2, [DOG]), GREATER (SIZE (x1), 1), and GREATER (SIZE (x2), 1), which tell us that x1 and x2 are sets containing more than one element each, and the fact that sets are not members of sets. (Again, we're not dealing with set theory; if, in fact, we were talking about axiomatic set theory in English, then local context would contain that information, and different inferences would come into play.)
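The eliminations argued above can be sketched as a simple filter over the 24 hypotheses. This is only an illustration under stated assumptions: the table `KIND` hand-codes what C1-C11 and world knowledge tell us about each token, the function `ruled_out` is hypothetical, and the rule that an individual cannot be a subset of a set is taken from the parenthetical remark about individuals rather than from an explicit elimination step in the text.

```python
RELATIONS = ("identical to", "member of", "subset of", "part of")

# Assumed summary of C1-C11: x1 and x2 are sets of more than one boy/dog,
# y1 is an individual, y2 is a ditch, and y3 is a set.
KIND = {"x1": "set", "x2": "set",
        "y1": "individual", "y2": "ditch", "y3": "set"}


def ruled_out(n1, n2, relation):
    """Return True if the hypothesis (n1, n2, relation) is eliminated."""
    k1, k2 = KIND[n1], KIND[n2]
    if k1 == "ditch":
        return True  # no relation in R holds between ditches and boys/dogs
    if relation == "part of" and k2 == "set":
        return True  # sets do not have "parts"
    if relation == "identical to" and k1 == "individual" and k2 == "set":
        return True  # an individual is not a set of more than one object
    if relation == "member of" and k1 == "set" and k2 == "set":
        return True  # sets are not members of sets
    if relation == "subset of" and k1 == "individual":
        return True  # assumption: individuals are not subsets
    return False


# All hypotheses (N1, N2, P) for the second sentence of sequence 1.
H = [(n1, n2, p)
     for n1 in ("y1", "y2", "y3")
     for p in RELATIONS
     for n2 in ("x1", "x2")]

surviving = [h for h in H if not ruled_out(*h)]
```

Under these assumptions six hypotheses survive: y1 as a member of x1 or x2, and y3 as identical to, or a subset of, x1 or x2.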

This leaves us with the following hypotheses:

y3 {is identical to, is a subset of} {x1, x2}
y1 is a member of {x1, x2}

But some of these hypotheses are consistent with one another: in fact the hypotheses "y3 is identical to xi" and "y3 is a subset of xi," i = 1, 2, imply the hypotheses "y1 is a member of xi," i = 1, 2, respectively because of MEMBER (y1, y3). At any rate, the judgment mechanism assumes at this point that y1 is a member of either x1 or x2. The reader is asked to recall at this point that in presenting the usually preferred referents for references in sentence sequences 1-4 the claim was made that in sentence sequence 1, the usually preferred referent for "one" is "boys." The reason for this claim is the authors' observation that, when such a pronominal reference occurs as the surface subject of a sentence, in the absence of semantic content which discriminates among the various possible referents, most people seem to take the surface subject of the last sentence in the focus as the intended referent. The reason for this human judgment is probably that the reader/hearer takes the surface subject to be the "topic" of a sentence. If this observation is correct, the judgment mechanism should, in the current example, simply choose "one of the boys" (y1 is a member of x1) as the proper referent. If this observation is incorrect, the judgment mechanism should judge that there is ambiguity in the reference "one."

Sentence sequence 2 at the beginning of this paper would be handled in precisely the same manner as sentence sequence 1 up to the point at which "y1 is a member of x1" and "y1 is a member of x2" were the remaining hypotheses. The knowledge that "the dogs" referred to suffer from a strange bone-weakening disease would then cause the judgment mechanism to strengthen the likelihood that "one" refers to "dogs," thus causing "y1 is a member of x2" to be the preferred judgment.

Sentence sequence 16 below contains an example of EVENT reference.

16. The president was shot yesterday. It caused a panic on Wall Street.

Omitting all other details of the translation into meaning representation, we simply note that the primitive-level predicate into which "cause" is translated requires an object of the form EVENT (F) as its subject (i.e. if we say something like "John caused a stir" what we mean is that John did something and the event (or fact) that he did that caused a stir). Thus, when the second sentence is handled, the only possible referents for "it" will be the objects contained in the EVENT focus, namely just EVENT (the president was shot yesterday). The judgment mechanism thus must simply decide if the event (or fact) that the president was shot yesterday was likely to have caused a panic on Wall Street, a judgment which, with adequate world knowledge, should certainly be confirmed.

Sentence sequence 17 is a very similar case.

17. The president was shot yesterday. Bill told me all about it. It caused a panic on Wall Street.

In order to resolve the reference "it" in the last sentence of 17, the judgment mechanism would have to decide on the relative likelihoods of i and ii below.

(i) The event (or fact) that the president was shot yesterday caused a panic on Wall Street.
(ii) The event (or fact) that Bill told me about the president being shot yesterday caused a panic on Wall Street.

Again, with the availability of reasonable world knowledge about such things as presidents, their being shot and panics, the judgment mechanism should be able to choose the proper referent for "it."
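The two EVENT examples can be pictured with a minimal sketch in which the subject requirement of a predicate selects the focus set to search and a scoring function stands in for the judgment mechanism. Everything named here is hypothetical: the table `REQUIRED_SUBJECT_KIND`, both functions, the fallback to noun objects for other predicates, and above all the two scores, which are invented placeholders for world knowledge and not results of any kind.

```python
# Hypothetical table of subject requirements; only CAUSE comes from the text.
REQUIRED_SUBJECT_KIND = {"CAUSE": "event"}


def candidate_subjects(predicate, noun_focus, event_focus):
    """Pick the focus set appropriate to the predicate's subject.
    Falling back to the noun-object focus is an assumption."""
    if REQUIRED_SUBJECT_KIND.get(predicate) == "event":
        return list(event_focus)
    return list(noun_focus)


def choose_referent(candidates, likelihood):
    """Stand-in for the judgment mechanism: take the candidate that the
    supplied likelihood function scores highest."""
    return max(candidates, key=likelihood)


# Focus contents after the first two sentences of sequence 17.
event_focus = [
    ("EVENT", "the president was shot yesterday"),
    ("EVENT", "Bill told me about the president being shot yesterday"),
]
noun_focus = ["president", "Bill"]

# Invented placeholder scores standing in for world knowledge.
placeholder_likelihood = {event_focus[0]: 0.9, event_focus[1]: 0.1}

candidates = candidate_subjects("CAUSE", noun_focus, event_focus)
referent = choose_referent(candidates, placeholder_likelihood.get)
```

The design point is only that the type requirement prunes the candidates before any judgment is made; how the likelihoods themselves would be derived is exactly the part the paper leaves for further investigation.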

While a fully detailed specification of the judgment mechanism must await further investigation, the above examples should illustrate, at least in part, the manner in which we conceive of its operation.

Conclusions

The phenomenon with which we have been dealing is one example of what we would like to call the "creative" aspect of language use; more specifically, reference of the sort we have described, and attempted to handle, is an elliptical device necessary for effective communication; moreover, it is a device which exhibits the ability of language to "change the ground rules" in a very flexible and fluid manner in response to context.

At this point we must admit that there is an even more creative type of reference than the sort we have dealt with. 18 below is an example of this type of reference.

18. Last week I caught a cold while visiting my mother in Chicago; as usual, the chicken soup had too much pepper in it.

The interesting reference in the above example is "chicken soup." There is no item in the first sentence to which it is directly related; on the other hand, few people have any trouble resolving it by interpolating between the two sentences of example 18 the idea expressed in sentence 19 below:

19. When I get sick my mother makes me chicken soup.

If sentence 19 were available, our reference resolution mechanism would easily come up with an identity relation between the two occurrences of "chicken soup." Obviously, for our proposed mechanism to resolve this reference, some sort of inferencing must first work on the first sentence of 18 to produce the meaning of 19 as an inference. Thus it is clear that reference resolution and general inferencing must be interleaved.

The mechanism proposed above does not handle the entire problem. It does, however, seem to be a minimal model of reference resolution (minimal in the sense that at least this much must be going on). In addition, it provides for that control over the use of general inferencing which is required to avoid a combinatorial explosion (BOOM).

References

Clark, H. H. (1975), "Bridging," Conference on Theoretical Issues in Natural Language Processing, 10-13 June 1975, Cambridge, Mass.
Rieger, C. J. (1974), Conceptual Memory: A Theory and Computer Program for Processing the Meaning Content of Natural Language Utterances, Ph.D. Thesis, Stanford University, 1974.
Schank, R. (1972), "Conceptual Dependency: A Theory of Natural Language Understanding," Cognitive Psychology 3(4), 1972.

American Journal of Computational Linguistics Microfiche 36 : 26

HOW DOES A SYSTEM KNOW WHEN TO STOP INFERENCING?*

STAN ROSENSCHEIN**
The Moore School of Electrical Engineering
University of Pennsylvania, Philadelphia 19174

Abstract

The problem of constraining the set of inferences added to a set of beliefs is considered. One method, based on finding a minimal unifying structure, is presented and discussed. The method is meant to provide internal criteria for inference cut-off.

I. Introduction

Natural language processing systems that are sensitive to the semantic and logical content of processed sentences and to the pragmatics of their use generally draw inferences. A set of formulas representing the meaning of a sentence and the 'state of belief' of the system is augmented by other related formulas (the inferences) which are retrieved and/or constructed during the processing. The problem to be investigated here is: How can this process be controlled? Can reasonable criteria be found for restraining the addition of inferences?

Top-down inferences following from the meaning of lexical items (often expressed by decomposition into primitives) are clearly bounded, if no interactions are allowed among the generated sub-formulas. This process (which we call EXPANSION) will not be discussed here. Rather, we shall be concerned with SYNTHESIS, i.e., the addition of new formulas based on the presence of already generated lower-level formulas, which we shall call beliefs. In particular, we are concerned with inferences added because a set of beliefs is recognized as fitting a pre-defined pattern. The question we ask is: Given an initial set of beliefs over a set of primitives, what criterion can be used to limit the process of pattern matching and associated inference addition? The major structural feature that we use to provide such a criterion is a partial order over the set of patterns.

* This work was partially supported by NSF Grant SCC 72-05465A01.
** Author's current address: Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York, New York 10012.

Before pursuing this suggestion any further, let us examine some alternative approaches to inference and inference cut-off. To logicians, deductive inference involves rules by which formulas can [...] 'provable' statements. The termination criterion for a particular invocation of the mechanism might be the appearance of an 'interesting' formula or the loss of interest of the inferencer, but in general the statement of the rules of inference says nothing about when to cease deriving formulas. This paradigm from logic has been carried over into Artificial Intelligence, [...] which makes certain that appropriate criteria are applied to halt the inferencing. This is most apparent in systems written in PLANNER-like languages, which have user-programmable mechanisms for controlling the proof process.

In the work of Schank and Rieger (Sch,75), (Ri,74), inference has more of the flavor of free association; inferences are conceived of as expanding spheres in 'inference space.' Two termination strategies are employed: (1) the discovery of a chain of inferences leading from one of the initial beliefs to another through a shared formula, or 'contact point' in inference space, and (2) the association of numerical 'strengths' to formulas so that a line of inference can be discontinued if the strength falls below a certain threshold. Strategy (2) is somewhat unsatisfying in view of the [...] arbitrariness and attendant difficulties in evaluating the role of particular numerical constants in the total behavior of a complex system. These constants, presumably, have little to do with the logical structure of the formal inference schema, and as such we would call them 'external criteria.' A pure strategy like (1) above, on the other hand, is more 'internal' and is to be preferred.

A goal of the present work is to formulate a reasonable internal criterion for inference cut-off which can be stated formally as part of the inference rule. To do this, we shall impose a structure on the set of patterns to be used in inferencing, and the rule for adding inferences will be formulated in terms of this structure. The operations to be described below are explained more fully in (R,75), where a description of a computer implementation is also presented.

II. A Partial Order for the Pattern Set

The inference rule we are aiming for is to depend on the set of input beliefs and the set of patterns. The notion we are trying to formalize is "What does this set of beliefs suggest with respect to this set of patterns?" The particular class of inferences we are concerned with are those gotten by matching beliefs in the input set against a pattern and augmenting the beliefs with additional propositions as dictated by the pattern. We want to find the least instances of patterns which cover (include) the set of input beliefs. We will take as inferences all propositions (an arbitrary number) which are entailed by that instance of the pattern.

Put another way, the inference operation is to jump to conclusions. However, it is only to jump to those conclusions required to make the resulting set an instance of the least possible pattern in the pattern set.

The key concept here is 'least', in that this is what controls how many inferences are added. What would be a suitable ordering relation for patterns and propositional beliefs? One which naturally suggests itself and which is currently under investigation relies on the relations of instantiation and set inclusion: (1) $S \leq S'$ if $S'$ is a substitution instance of $S$, and (2) $S \leq S'$ if $S \subseteq S'$. Combining these two, we say that $\{p_1, \ldots, p_n\} \leq \{q_1, \ldots, q_m\}$, where the $p_i$'s and $q_i$'s are propositional forms, if there is a substitution, $s$, for the variables of $\{p_1, \ldots, p_n\}$ such that $\{s(p_1), \ldots, s(p_n)\} \subseteq \{q_1, \ldots, q_m\}$.

We adopt the notational convention of prefixing variables with '?'. For example, let P = [...] and let Q = {(HAPPY JOHN), (GIVE MR. JONES JOHN TOY), (PARENT MR. JONES JOHN)}. Then $P \leq Q$ under the substitution ?x→JOHN, ?y→MR. JONES. The 'less-than-or-equal' relation is also defined for pairs of patterns: Let PAT-1 = {(P ?x ?y), (Q ?y ?z)} and let PAT-2 = {(R ?u ?v ?w), (Q ?w ?v), (P ?u ?w)}. Clearly, PAT-1 $\leq$ PAT-2 under the substitution ?x→?u, ?y→?w, ?z→?v.
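The ordering on sets of propositional forms can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the implementation described in (R,75): forms are encoded as Python tuples, variables as strings beginning with '?', and the helper names `is_var`, `match`, `find_substitution` and `leq` are hypothetical. Variables on the larger side are treated as constants, which is what lets one pattern be compared with another.

```python
def is_var(term):
    """Variables are strings prefixed with '?', following the paper's convention."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, target, subst):
    """One-way match of a pattern term onto a target term.

    Returns the (possibly extended) substitution, or None if no match exists.
    """
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        return {**subst, pattern: target}
    if isinstance(pattern, tuple) and isinstance(target, tuple):
        if len(pattern) != len(target):
            return None
        for p, t in zip(pattern, target):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == target else None


def find_substitution(small, large, subst=None):
    """Search (with backtracking) for s such that s(small) is a subset of large."""
    subst = {} if subst is None else subst
    if not small:
        return subst
    first, rest = small[0], small[1:]
    for candidate in large:
        extended = match(first, candidate, subst)
        if extended is not None:
            result = find_substitution(rest, large, extended)
            if result is not None:
                return result
    return None


def leq(small, large):
    """small <= large in the combined instantiation/inclusion ordering."""
    return find_substitution(small, large) is not None


# The two patterns from the text.
PAT_1 = [("P", "?x", "?y"), ("Q", "?y", "?z")]
PAT_2 = [("R", "?u", "?v", "?w"), ("Q", "?w", "?v"), ("P", "?u", "?w")]

print(find_substitution(PAT_1, PAT_2))
print(leq(PAT_1, PAT_2), leq(PAT_2, PAT_1))
```

On these two patterns the search recovers the substitution given in the text, and the relation fails in the opposite direction because the R clause of PAT-2 has no counterpart in PAT-1.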

This definition of $\leq$ is quite straightforward and can be made to accommodate expressions with embeddings and predicate variables. (These are included in the implementation.)

Note that the relation $\leq$ can be thought of as an information-content comparison; if $S \leq S'$ then $S'$ contains at least as much 'information' as $S$ (and possibly more), either by virtue of variables having been replaced by particular constants or by additional formulas having been added to the set.

Given $\leq$ for relating pairs of belief sets, pairs of patterns, and belief-set/pattern pairs, we can now formulate the belief-set-extending operation.

III. The Inference Operation: SYNTHESIZE

Given a set P of patterns and an input set Bel of beliefs, SYNTHESIZE returns a set I of instantiated patterns from P such that the following three conditions all hold:

(1) (Coverage of input beliefs) For each instantiated pattern $p \in I$, Bel $\leq p$.
(2) (Pairwise incomparability) If $p, q \in I$ and $p \neq q$, then neither $p \leq q$ nor $q \leq p$.
(3) (Minimality) There are no other instances $r$ of patterns in P which are not in I and yet which are $\leq$ some element of I and for which Bel $\leq r$.

The elements of I = SYNTHESIZE(Bel) represent possible minimal extensions of Bel; $\bigcap I$ represents clear extensions of Bel, namely the superset of Bel contained in all minimal extensions.

Let P = { p1 = {(A ?x), (B ?x), (C ?x)},
          p2 = {(B ?x), (C ?x), (D ?x)},
          p3 = {(A ?x), (B ?x), (C ?x), (G ?x)} }

[Figure: graphical representation of the ordering of p1, p2 and p3.]

If the input set Bel = {(A JOHN), (C JOHN)} then SYNTHESIZE(Bel) = { {(A JOHN), (B JOHN), (C JOHN)} }. There is only one possible minimal extension; (B JOHN) is inferred.

If the input set Bel = {(B JOHN), (C JOHN)} then SYNTHESIZE(Bel) = { {(A JOHN), (B JOHN), (C JOHN)}, {(B JOHN), (C JOHN), (D JOHN)} }. There are two possible minimal extensions, but the set of clear extensions contains no inferences beyond the input set, Bel. (Had p1 and p2 shared another clause, however, an inference would have been added.)

If the input set Bel = {(G JOHN), (B JOHN)} then SYNTHESIZE(Bel) = { {(G JOHN), (A JOHN), (B JOHN), (C JOHN)} }. Pattern p3 is the least pattern which when instantiated covers the inputs, and there are two inferred propositions: (A JOHN) and (C JOHN).
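The three worked cases above can be reproduced with a short sketch. It is not the lattice-theoretic formulation of (R,75); it assumes, for simplicity, that beliefs are ground and that the inputs bind every variable of a covering pattern, so that instances are ground and the ordering between instances reduces to set inclusion. The tuple encoding and the names `match`, `substitute`, `covering_substitutions` and `synthesize` are hypothetical.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, target, subst):
    """One-way match; returns an extended substitution or None."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        return {**subst, pattern: target}
    if isinstance(pattern, tuple) and isinstance(target, tuple):
        if len(pattern) != len(target):
            return None
        for p, t in zip(pattern, target):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == target else None


def substitute(term, subst):
    """Apply a substitution to a term, leaving unbound variables in place."""
    if is_var(term):
        return subst.get(term, term)
    if isinstance(term, tuple):
        return tuple(substitute(t, subst) for t in term)
    return term


def covering_substitutions(pattern, beliefs, subst=None):
    """Yield each substitution under which every belief equals some clause."""
    subst = {} if subst is None else subst
    if not beliefs:
        yield subst
        return
    for clause in pattern:
        extended = match(clause, beliefs[0], subst)
        if extended is not None:
            yield from covering_substitutions(pattern, beliefs[1:], extended)


def synthesize(patterns, beliefs):
    """Return the minimal pattern instances that cover the beliefs."""
    candidates = set()
    for pattern in patterns:
        for subst in covering_substitutions(pattern, list(beliefs)):
            candidates.add(frozenset(substitute(c, subst) for c in pattern))
    # Assumption: instances are ground, so 'strictly below' is proper inclusion.
    return {r for r in candidates if not any(other < r for other in candidates)}


P1 = [("A", "?x"), ("B", "?x"), ("C", "?x")]
P2 = [("B", "?x"), ("C", "?x"), ("D", "?x")]
P3 = [("A", "?x"), ("B", "?x"), ("C", "?x"), ("G", "?x")]
PATTERNS = [P1, P2, P3]

case_1 = synthesize(PATTERNS, [("A", "JOHN"), ("C", "JOHN")])
case_2 = synthesize(PATTERNS, [("B", "JOHN"), ("C", "JOHN")])
case_3 = synthesize(PATTERNS, [("G", "JOHN"), ("B", "JOHN")])
clear_2 = frozenset.intersection(*case_2)  # the 'clear extensions' of case 2
print(len(case_1), len(case_2), len(case_3), sorted(clear_2))
```

Coverage holds by construction, and discarding every candidate that properly contains another candidate leaves a pairwise incomparable, minimal set, which is how the sketch approximates conditions (1) to (3).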

The description given here has been necessarily brief and incomplete. A more formal treatment of SYNTHESIZE in terms of lattice-theoretic operations is given in (R,75) and is summarized in (JR,75). One additional technical point should be made: It often happens that for a given input set there are no single pattern instances which cover all the inputs, though patterns exist whose instances cover subsets of the inputs. In such a case we use an extended SYNTHESIZE operation which is defined in the same spirit as SYNTHESIZE. (See (R,75).)

Even without the full formal treatment, several things should now be clear. First, the actual number of inferences drawn (propositions added) for a particular input set may be small or large (depending on the inputs and the pattern set), but it is bounded in a principled way because of the definition of SYNTHESIZE.

Second, the usual distinction between 'antecedent' and 'consequent' clauses in the pattern is not maintained; a clause in the pattern may serve as an antecedent on one occasion and a consequent on another. Third, if 'defined' lexical items were to be associated with the patterns, noting which variables are to be bound as arguments upon instantiation, then the SYNTHESIZE function can be used to compute summarizing expressions. Thus SYNTHESIZE represents a possible formalism for lexical insertion.

IV. An Example of the Operation SYNTHESIZE

For the sake of illustration, let the primitives be:

(BENIGN ?x) -- ?x is benign
(THREATEN ?x ?y) -- ?x threatens ?y
(GIVE ?x ?ob ?y) -- ?x gives ?ob to ?y
(BELONG ?ob ?x) -- ?ob belongs to ?x
(INTEND ?x ?Q) -- ?x intends to do ?Q
(RETURN ?x ?ob ?y) -- ?x returns ?ob to ?y
(PAYS-INTEREST ?x ?y) -- ?x pays interest to ?y

(These primitives and the patterns below may appear somewhat artificial, but we have chosen a simple illustration due to the difficulties in following examples with more than a few clauses.) Let the pattern set consist of the following four patterns:

PAT-1 (borrow): {(BENIGN ?x), (BELONG ?ob ?y), (GIVE ?y ?ob ?x), (INTEND ?x (RETURN ?x ?ob ?y))}
PAT-2 (?x takes-loan-from ?y): {(BENIGN ?x), (BELONG ?ob ?y), (GIVE ?y ?ob ?x), (INTEND ?x (RETURN ?x ?ob ?y)), (PAYS-INTEREST ?x ?y)}
PAT-3 (?x robs ?y): {(NOT (BENIGN ?x)), (BELONG ?ob ?y), (THREATEN ?x ?y), (GIVE ?y ?ob ?x), (NOT (INTEND ?x (RETURN ?x ?ob ?y)))}
PAT-4 (?x plays-practical-joke-on ?y): {(BENIGN ?x), (BELONG ?ob ?y), (THREATEN ?x ?y), (GIVE ?y ?ob ?x), (INTEND ?x (RETURN ?x ?ob ?y))}

A rough graphic representation of the set of patterns is shown in Figure 1.

Figure 1. The patterns PAT-1 to PAT-4 (including PAT-2, takes-loan-from, and PAT-4, plays-practical-joke) arranged over the clauses PAYS-INTEREST, BENIGN, BELONG, GIVE, THREATEN, INTEND, NOT-BENIGN and NOT-INTEND.

Now consider the following situations:

Situation 1. Input beliefs:

Bel = {(BELONG WALLET HARRY), (GIVE HARRY WALLET MOE)}

SYNTHESIZE(Bel) =
{ {(BELONG WALLET HARRY), (BENIGN MOE), (GIVE HARRY WALLET MOE),
   (INTEND MOE (RETURN MOE WALLET HARRY))},
  {(NOT (BENIGN MOE)), (BELONG WALLET HARRY), (THREATEN MOE HARRY),
   (GIVE HARRY WALLET MOE),
   (NOT (INTEND MOE (RETURN MOE WALLET HARRY)))} }

The minimal matched patterns are rob and borrow, adding the (conjectural) information that either Harry was threatened, or Moe intends to return the wallet.

Situation 2. Input beliefs:

Bel = {(GIVE BANK 1000-DOLLARS JOHNDOE), (PAYS-INTEREST JOHNDOE BANK)}

SYNTHESIZE(Bel) =
{ {(BENIGN JOHNDOE), (BELONG 1000-DOLLARS BANK),
   (GIVE BANK 1000-DOLLARS JOHNDOE),
   (INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK)),
   (PAYS-INTEREST JOHNDOE BANK)} }

As a result of matching the loan pattern, we have added three clauses.

Situation 3. Input beliefs:

Bel = {(INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK)),
       (PAYS-INTEREST JOHNDOE BANK)}

Here SYNTHESIZE(Bel) returns exactly the same set as was returned in Situation 2. Note, however, that the roles of

(1) (GIVE BANK 1000-DOLLARS JOHNDOE)

and

(2) (INTEND JOHNDOE (RETURN JOHNDOE 1000-DOLLARS BANK))

have been reversed. In Situation 2, (1) was an input and (2) was inferred, whereas in Situation 3, (2) was input and (1) inferred. The corresponding clauses of the loan pattern were serving as antecedents on one occasion and consequents on the other. This follows naturally from the way SYNTHESIZE was defined.
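The reversal of roles in Situations 2 and 3 can be checked with a minimal sketch that completes the loan pattern alone (the minimality test over the whole pattern set is left out). The tuple encoding, including the nested RETURN term, and the names `match`, `substitute`, `cover` and `complete` are assumptions made for illustration and are not taken from the implementation in (R,75).

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, target, subst):
    """One-way match that also descends into embedded terms."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == target else None
        return {**subst, pattern: target}
    if isinstance(pattern, tuple) and isinstance(target, tuple):
        if len(pattern) != len(target):
            return None
        for p, t in zip(pattern, target):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == target else None


def substitute(term, subst):
    if is_var(term):
        return subst.get(term, term)
    if isinstance(term, tuple):
        return tuple(substitute(t, subst) for t in term)
    return term


def cover(pattern, beliefs, subst=None):
    """Return the first substitution under which the pattern covers the beliefs."""
    subst = {} if subst is None else subst
    if not beliefs:
        return subst
    for clause in pattern:
        extended = match(clause, beliefs[0], subst)
        if extended is not None:
            result = cover(pattern, beliefs[1:], extended)
            if result is not None:
                return result
    return None


def complete(pattern, beliefs):
    """Return (instance, inferred clauses), or None if the pattern cannot cover."""
    subst = cover(pattern, list(beliefs))
    if subst is None:
        return None
    instance = frozenset(substitute(c, subst) for c in pattern)
    return instance, instance - frozenset(beliefs)


LOAN = [  # PAT-2, takes-loan-from
    ("BENIGN", "?x"),
    ("BELONG", "?ob", "?y"),
    ("GIVE", "?y", "?ob", "?x"),
    ("INTEND", "?x", ("RETURN", "?x", "?ob", "?y")),
    ("PAYS-INTEREST", "?x", "?y"),
]
GIVE = ("GIVE", "BANK", "1000-DOLLARS", "JOHNDOE")
INTEND = ("INTEND", "JOHNDOE", ("RETURN", "JOHNDOE", "1000-DOLLARS", "BANK"))
PAYS = ("PAYS-INTEREST", "JOHNDOE", "BANK")

instance_2, inferred_2 = complete(LOAN, [GIVE, PAYS])    # Situation 2
instance_3, inferred_3 = complete(LOAN, [INTEND, PAYS])  # Situation 3
print(instance_2 == instance_3, INTEND in inferred_2, GIVE in inferred_3)
```

Nothing in the pattern marks a clause as antecedent or consequent; which clauses count as inferred depends only on which ones were supplied as input.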

In this regard the reader may notice that some input belief sets might yield 'unwarranted' or 'spurious' inferences -- jumping to too many conclusions. However, the incremental addition of new patterns corrects this anomaly in a natural way: Patterns which formerly were 'least covers' may cease to be so in the extended pattern set.

V. Using Definitions to Set Up the Pattern Space

We have been particularly interested in using definitions of words to set up pattern spaces in which SYNTHESIZE could work as an inferencer and a lexical insertion technique. Special attention was paid to the 'speech act' verbs, and a brief sample list is presented below. (The symbol '?Pr' denotes a predicate variable. Also, primitive predicates are capitalized, while defined predicates are underlined.) Again, the definitions are greatly oversimplified for illustrative purposes.

(define tell (?x ?y ?p ?t)
  (and (BEFORE ?t0 ?t)
       (NOT (KNOW ?y ?p ?t0))
       (SAY ?x ?y ?p ?t)
       (KNOW ?y ?p ?t)
       (CAUSE (SAY ?x ?y ?p ?t) (KNOW ?y ?p ?t))))

(define request (?x ?y ?p ?t)
  (tell ?x ?y (WANT ?x ?p ?t) ?t))

(define promise (?x ?y ?Pr ?t)
  (and (FEELS-OBLIGATED ?x (?Pr ?x) ?t)
       (tell ?x ?y (INTEND ?x (?Pr ?x) ?t) ?t)))

(define command (?x ?y ?Pr ?t)
  (request ?x ?y (?Pr ?y) ?t))

(define implore (?x ?y ?Pr ?t)
  (and (WANTS-FAVOR-FROM ?x ?y) ...

The expansion of these items to patterns over the primitives yields a set in which, for example, KNOW $\leq$ tell $\leq$ request $\leq$ command. The input set

Bel = {(BEFORE t1 t2),
       (SAY JAMES MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2),
       (FEELS-OBLIGATED JAMES (OPEN JAMES DOOR) t2)}

would be synthesized to (promise JAMES MASTER (OPEN + DOOR) t2), with added inferences (KNOW MASTER (INTEND JAMES (OPEN JAMES DOOR) t2) t2), etc., as dictated by the pattern instance of promise.

A method has been proposed for 'free' inferencing by pattern matching in which inference cut-off can be structurally constrained: A pattern is matched if it is one of the minimal patterns whose instantiation covers the input information -- even if this necessitates adding an arbitrary amount of additional information. Similarly, on the question of how many inferences to draw: Enough inferences are drawn to enable a coherent pattern to be matched.

The method we have proposed is general in that it makes no assumptions about the particular predicates to be used in the patterns and beliefs. (Of course, it does make assumptions about what counts as a pattern or a belief.) The inferencing could be done by a general purpose component which accepts a set of patterns as a parameter. Thus, a programmer designing a system for inference by pattern match need not devise external criteria, and certainly not criteria to be associated with every pattern. Rather, the criteria are implicit in the system as a whole; any set of patterns which can be described in a very general pattern description language will generate its own set of internal criteria for inference cut-off.

We are continuing to investigate formal systems for structuring pattern sets in the hope of gaining further insights into this class of inferences.

References

(C,75) Clark, H.H. "Bridging," in Proceedings of the Workshop on Theoretical Issues in Natural Language Processing. Cambridge, Massachusetts. June 1975.
(JR,75) Joshi, A.K. and Rosenschein, S. "A Formalism for Relating Lexical and Pragmatic Information," in Proceedings of the Workshop on Theoretical Issues in Natural Language Processing. Cambridge, Massachusetts. June 1975.
(Ri,74) Rieger, C. Conceptual Memory. Ph.D. Thesis. Stanford University. Stanford, California. 1974.
(R,75) Rosenschein, S. Structuring a Pattern Space, with Applications to Lexical Information and Event Interpretation. Doctoral Dissertation. University of Pennsylvania. Philadelphia, Pennsylvania. 1975.
(Sch,75) Schank, R., Goldman, N., Rieger, C., and Riesbeck, C. "Inference and Paraphrase by Computer." Journal of the A.C.M. Volume 22, No. 3. July, 1975.
(W,75) Wilks, Y. "A Preferential, Pattern-seeking Semantics for Natural Language Inference." Artificial Intelligence. Volume 6. 1975.

American Journal of Computational Linguistics Microfiche 36 : 40

DEVELOPING A COMPUTER SYSTEM TO HANDLE INHERENTLY VARIABLE LINGUISTIC DATA

D. BECKLES, L. CARRINGTON, AND G. WARNER
IN COLLABORATION WITH C. BORELY, H. KNIGHT, P. AQUING, AND J. MARQUEZ
Department of Mathematics and School of Education
University of the West Indies
St. Augustine, Trinidad

ABSTRACT

Linguistic communication in Trinidad and Tobago is characterised by intra- and inter-idiolectal variation in a spectrum ranging from Creole-English to Internationally Acceptable English. The tape-recorded speech of a sample of children is being analysed to determine the structure of their language, its correlation with socio-linguistic factors and their progress in the use of English. A computer system is designed to deal with manually codified data in the form of parse trees with associated grammatical and semantic information. The communication complex does not have readily identifiable norms. The analytical method and computer system effect recognition of stable sub-systems (regardless of the external criteria which determine these sub-systems) and comparison of these sub-systems with English, as well as a statement of the evolution of the children's language.

Acknowledgement

The research of which this paper is a working document is partially funded by Ford Foundation Grant 690-06641. The authors acknowledge the kind assistance of the IBM World Trade Corporation, Port of Spain, Trinidad.

The design and some results of the research to which the computer system relates are described by Carrington, Borely and Knight (1969, 1972, 1974). [...] applicable to curriculum development and teacher education, the structure of the speech of school-children aged 5-11+ in Trinidad and Tobago and to compare this speech with English. The official language and medium of instruction is English. However, the medium of daily communication ranges from a type of Creole-English to a modified variety of Internationally Acceptable English (IAE). The term "post-creole dialect continuum" has been used by several researchers, notably Le Page (1957), De Camp (1971) and Bickerton (1973), to refer to apparently analogous situations in Jamaica and Guyana. In addition to Creole, English and variants of both, a large part of the population is exposed to a local variety of Hindi (Bhojpuri). Smaller numbers are exposed to Lesser Antillean French Creole and fewer still to Spanish.

Communication within the society is characterised by inter-idiolectal variation related to several socio-linguistic factors: ethno-linguistic background, social class, educational level, occupation, sex and age. Code-switching and intra-idiolectal variation related to the context, content and purpose of communication complicate the examination of the communication system. Since the variant levels of the complex appear to overlap, they are difficult to separate into distinct sub-systems.

The Linguistic Data

The available corpus comprises 100 hours of the recorded conversation of almost 1,000 children between 5 and 11+ selected randomly from 30 schools. The data fall into two pre-determined categories: (a) free (with peer group); (b) controlled (with investigator). Given the nature of the communication complex stated above, variation and contrast are central to the data. In addition to the usual socio-linguistic correlates of variation, these data have the possibility of containing linguistic elements which are not paralleled anywhere else in the community. These elements may occur as a result of the instability intrinsic to the performance of a vulnerable age cohort. We are not dealing with fully learned discrete languages or dialects but with partially learned systems of speech communication being used by children who, by virtue of being in school, are under pressure to abandon part of their communication repertoire in favour of another variety of speech.

Implications of the Data Type for the Analytical Procedure

English is the only code of the communication complex for which adequate grammatical descriptions are available. It is demonstrably untenable to assume that the informants are attempting to speak English at all times. They are communicating in a set of language varieties which are assumed to be rule-governed. A statement of frequency and type of deviation from English cannot therefore be an adequate analysis. The first task of the analysis must be to determine the structures, both major and minor, used by informants of various socio-linguistic descriptions.

A preliminary examination of the data shows that at the level of phrase-structure of utterances, the structures will appear to be predominantly identical with English. It is the components of the elements, their meanings and functions that will show the differences from English. Consequently, the analysis must note the levels at which derivational trees cease to be compatible with English.

In view of the variability inherent in the data, the analysis must discover the socio-linguistic correlates of the occurrence of elements, as well as state co-occurrence restrictions of a given element. Since it is possible that some elements may be distributed in a way that does not permit correlation with the stated socio-linguistic factors, the analysis must permit grouping of informants based on shared linguistic features for subsequent re-examination. This provision admits the possibility that sets of features may be typical of a language acquisition stage of the informants regardless of their socio-linguistic descriptions.

The Analytical Procedure

1. Each utterance is phonetically transcribed and ascribed to an informant by an identification procedure. Doubtful identity is specially coded.
2. Each utterance is rewritten in English orthography.
3. For each utterance a parse tree is constructed using the following protocol where each category described below forms the content of a node of the parse tree. The numbers are for reference and indicate the hierarchical relationship of the nodes.

0.0 Utterance type: S sentence; SEL elliptical S; FRAG fragment; FREL elliptical FRAG
0.1 Utterance complexity: SIMP simple; CP compound; CX complex; CPCX compound-complex
0.2 DEC declarative; INT interrogative; IMP imperative
0.3 Semantic type: STMT statement; QU question; COMM command; RHET rhetorical intent
0.4 Linear order and type of clauses occurring. e.g. MC1 + ADVC TEMP 2
0.5 Linear order and type of phrases occurring (where not part of a clause). e.g. PREP P1 + VBL P2
0.6 Dependency of clauses: dependent, embedded, co-ordinate, included. e.g. 2/1 = clause 2 is embedded in clause 1
0.7 ACTV active; PAS passive; EQ equational; STAT stative; LOC locative; AFM affirmative; NEG negative
1.0 Surface structure of the clause/phrase occurring first. e.g. MC → SUBJ + PRED* + IOBJ + DOBJ + PREP P (*PRED = predicator, not predicate)
1.1 Detailed analysis of first occurring element of 1.0. e.g. SUBJ → PRMD + HDW
1.1.1 First element of subject. e.g. PRMD → [HE] PADJ, RD, MASC, SG, NOK; IAE: [HIS] etc.
2.0 Surface structure of the clause/phrase occurring second, etc. to 7.0.

As exemplified at 1.1.1, the last node of each sub-part states the actual literal being described. The acceptability of the item as IAE is noted, OK or NOK, together with a reasonable IAE alternative. Apart from the obligatory information required by the procedure, the analyst may make additional comments which may be either in keywords or English, e.g. CMNT: probably idiosyncratic, or CMNT: double NEG.

e.g. 8.0 [SCRUNT] → scrounge for a living

e.g. 9.0 TAG-[YOU HEAR]

Fig. 1 shows a sample analysis.

Figure 1. Sample analysis of the utterance [MY SISTER AND THEM DOES BREAK A SET OF PLATE, YES]

Glossary of keywords

ADV - adverb(ial); ANIM - animate; ASOC - associative; AUX - auxiliary; BCL - base form final cluster; C - clause; CLF - final cluster results from suffixation; CMNT - comment; CTN - completion; DET - determiner; DOBJ - direct object; GR - grammar; HDW - headword; INAN - inanimate; IND - indefinite; IOBJ - indirect object; LEX - lexical; MASC - masculine; MC - main clause; N - noun; NCO - countable noun; NEQV - no equivalent; NEUT - neutral; P - phrase; PADJ - possessive adjective; PATT - pattern; PL - plural; PLZR - pluralizer; PRED - predicator; PREP - preposition; PRMD - pre-head modifier; PROG - progressive; RD - third person; SG - singular; SUBJ - subject; TEMP - temporal; TM - time; TRAN - transitive; VBL - verbal; VT - verb used transitively; * - alternative parse or meaning; @ - absence of ...; [ ] enclose literals; ; - end of information set; , - minor separator.

Developing the Computer System

The structure of the parse tree is, in general, quite complex and a simple ad hoc approach to validity checking was quickly seen to be inadequate. As a result a formal description of the tree was developed and used to construct a (partially) syntax-driven validity checking routine. The output of this routine consists of a listing of the input, with error comments where necessary, together with the internal representation of the valid trees, which is written onto a file (the parse-tree file) for the subsequent analyses.

Several other files are used in addition to the parse tree file. There is the informant file, which contains profiles of the informants (e.g. age, sex, linguistic background, etc.), a set of form class files and a set of classification files. The form class files are groupings of the various keywords which may occur in the data. Thus, for example, one form class file contains all keywords which may occur on the left-hand side of a rewrite. A classification file contains a group number for each informant; for example, one classification file contains 0 for each informant not aged 5 and a non-zero group number otherwise. In any operation on the data the utterances of informants in group 0 of the relevant classification will be ignored. Each node of a tree in the parse tree file consists of a name (in the case of a rewrite this is the left-hand side of the rewrite, otherwise it is the level number) and a set of descriptors, e.g. the grammar associated with the name. Thus, in the example of Figure 1, the lines 1.1, 1.1.1, 1.1.2, 1.1.2.1 become the sub-tree of Figure 2, where the descriptors are put in parentheses.

Figure 2. Sub-tree rooted at SUBJ, with the nodes PRMD (PADJ, ST, SG, OK) and HDW (ANIM, NOK; IAE: NEQV), and NCO and ASOC below HDW.

For any tree, each analysis starts at the root, and many of the tasks to be described below may be regarded, in part, as a pattern matching exercise. The difficulties, and interest, arise because each node of the parse tree carries a substantial amount of information and, except for literals, only a partial matching of the nodes is usually required. In addition, some tasks require the matching of disjoint sub-trees within a given parse tree, occasionally subject to side conditions which may involve nodes not lying on the paths between the root and any of the sub-trees of interest. Apart from the pattern matching, there is the problem of classification of the occurrences of the various patterns. This is a simple tabulation complicated, in some cases, by the fact that the total number of categories is unknown. The basic task of the system may be cast in the form: count with respect to a given classification file, and subject to stated side conditions, the occurrences of a given pattern.
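The basic task just stated can be sketched in a few lines. This is an illustrative sketch only and makes no claim about the original system's data structures: the `Node` class, the functions `count_in_tree`, `tabulate` and `affirmative_active`, the informant identifiers and the toy trees are all hypothetical. Partial matching is modelled by requiring only that the pattern's descriptors be a subset of a node's descriptors.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str                           # left-hand side of a rewrite, or level number
    descriptors: frozenset = frozenset()
    children: list = field(default_factory=list)


def count_in_tree(root, name, required):
    """Count nodes called `name` whose descriptors include all of `required`."""
    hits, stack = 0, [root]
    while stack:
        node = stack.pop()
        if node.name == name and required <= node.descriptors:
            hits += 1
        stack.extend(node.children)
    return hits


def tabulate(utterances, groups, name, required, side_condition):
    """Count pattern occurrences per group; group 0 and failing trees are skipped."""
    table = Counter()
    for informant, root in utterances:
        group = groups.get(informant, 0)
        if group == 0 or not side_condition(root):
            continue
        table[group] += count_in_tree(root, name, required)
    return table


def affirmative_active(root):
    """Side condition: the tree carries a 0.7 node marked AFM and ACTV."""
    return any(
        child.name == "0.7" and {"AFM", "ACTV"} <= child.descriptors
        for child in root.children
    )


def tree(polarity, *aux_descriptor_sets):
    """Build a toy utterance tree with one PRED node and some AUX nodes."""
    aux = [Node("AUX", frozenset(d)) for d in aux_descriptor_sets]
    return Node("S", children=[
        Node("0.7", frozenset({polarity, "ACTV"})),
        Node("PRED", children=aux),
    ])


utterances = [
    ("inf1", tree("AFM", {"NEUT TM", "PATT", "OK"})),
    ("inf2", tree("NEG", {"PATT"})),                # fails the side condition
    ("inf3", tree("AFM", {"PATT"})),                # informant is in group 0
    ("inf4", tree("AFM", {"PATT", "NOK"}, {"OK"})),
]
groups = {"inf1": 1, "inf2": 2, "inf3": 0, "inf4": 2}

table = tabulate(utterances, groups, "AUX", frozenset({"PATT"}), affirmative_active)
print(dict(table))
```

Descriptors that the pattern does not mention (OK, NOK, NEUT TM in the toy data) play no part in the match.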

Since there are only 1,000 informants and they fall into a reasonably small number of classes, it is economical to pre-classify on the basis of the informant profiles rather than build the classification process into the rest of the analysis. The system is instructed to produce a classification file by a statement of the form:

CLASS = <classification file name>, (<expression list>)

where <classification file name> is the name by which the file will be known, and each expression in <expression list> is a Boolean expression. For example:

CLASS = HINDI, (AGE = 5 & LANG = HINDI, AGE = 5 & LANG ≠ HINDI)

will produce the classification file given earlier as an example.
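A classification file of this kind might be built as in the sketch below. The paper does not spell out how expressions map to group numbers, so the sketch assumes that an informant receives the 1-based position of the first expression it satisfies and 0 if it satisfies none; the profile fields, informant identifiers and the function name `build_classification` are likewise hypothetical.

```python
def build_classification(informants, expressions):
    """Map each informant to the number of the first expression it satisfies.

    Assumption: groups are numbered from 1 in expression-list order, and an
    informant matching no expression is placed in group 0 (ignored later).
    """
    groups = {}
    for ident, profile in informants.items():
        groups[ident] = next(
            (number for number, expr in enumerate(expressions, start=1)
             if expr(profile)),
            0,
        )
    return groups


# Hypothetical informant profiles.
informants = {
    "inf01": {"AGE": 5, "LANG": "HINDI"},
    "inf02": {"AGE": 5, "LANG": "CREOLE"},
    "inf03": {"AGE": 9, "LANG": "HINDI"},
}

# CLASS = HINDI, (AGE = 5 & LANG = HINDI, AGE = 5 & LANG ≠ HINDI)
hindi_expressions = [
    lambda p: p["AGE"] == 5 and p["LANG"] == "HINDI",
    lambda p: p["AGE"] == 5 and p["LANG"] != "HINDI",
]

HINDI = build_classification(informants, hindi_expressions)
print(HINDI)
```

Pre-computing the mapping once, as the text suggests, keeps the later tabulations to a simple lookup per utterance.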

The side conditions refer to items in the parse trees which must occur if the tree is to be included in a given analysis. For example, if only affirmative active utterances are to be analysed, the side condition 0.7 AFM ACTV is used. The pattern to be used is stated in a manner similar to that used in specifying the input data. Thus, the pattern description

PRED → ... + AUX ...; GR: @ CTN, NEUT TM, @ PROG, PATT

indicates that the sub-tree PRED → AUX (GR: @ CTN, NEUT TM, @ PROG, PATT) is of interest, subject to the convention that both the order of node descriptors (where given) and node descriptors not mentioned in the pattern are to be ignored. The occurrence of keyword FORM = <form class file name> indicates that the contents of the stated form class file are to form an additional dimension to the final tabulations. Thus the pattern

AUX → [?] FORM = OKFILE

where OKFILE contains the keywords OK and NOK, is an abbreviation for the pair of patterns

AUX → [?] OK
AUX → [?] NOK

The symbol ? indicates that the items found there are also to add an additional dimension to the tabulations. The output of each tabulation may also be used to construct a classification file of the informants, to be used in further analyses.

CONCLUSION

In respect of performance of groups with different socio-linguistic descriptions, for purposes of this study, it is assumed that the frequency of occurrence of particular basic parse trees is a meaningful indicator of differences in speech patterns. A major difficulty is that no two trees in the study are identical, but at the same time if we strip too much information from each node there are too few trees to make an analysis worthwhile, and in part, the study aims at determining the degree to which stripping of information at interior nodes is necessary if the computer is to be a useful aid.

REFERENCES

Bickerton, D. 1973. "The Nature of a Creole Continuum". Language 49 (3), p. 640-669.
Carrington, L. and Borely, C. 1969. "An Investigation into English Language Learning and Teaching Problems in Trinidad and Tobago: Progress Report". U.W.I. Institute of Education, St. Augustine (mimeo).
Carrington, L., Borely, C. and Knight, H. 1972. Away Robin Run: A Critical Description of the Teaching of Language Arts in the Primary Schools of Trinidad and Tobago. U.W.I. Institute of Education, St. Augustine (mimeo).
Carrington, L., Borely, C. and Knight, H. 1974. "Linguistic Exposure of Trinidadian Children". Caribbean Journal of Education No. 1, p. 12-22.
De Camp, D. 1971. "The study of pidgin and creole languages" in Hymes, Pidginization and Creolization of Languages. CUP, p. 13-39.
Le Page, R.B. 1957. "General outlines of creole English dialects in the British Caribbean". Orbis 6, p. 373-391.
Knight, H., Carrington, L. and Borely, C. 1974. "Preliminary Comments on Language Arts Textbooks in use in the primary schools of Trinidad and Tobago". Caribbean Journal of Education No. 2.

American Journal of Computational Linguistics Microfiche 36 : 52

A NATURAL LANGUAGE PROCESSING PACKAGE

DAVID BRILL AND BEATRICE T. OSHIKA
Speech Communications Research Laboratory, Inc.
800A Miramonte Drive
Santa Barbara, California 93109

ABSTRACT

A set of SAIL programs has been implemented for analyzing large bodies of natural language data in which associations exist between strings and sets of strings. These programs include facilities for compiling information such as frequency of occurrence of strings (e.g. word frequencies) or substrings (e.g. consonant cluster frequencies), and describing relationships among strings (e.g. various phonological realizations of a word). Also, an associative data base may be interactively accessed on the basis of keys corresponding to different types of data elements, and a pattern matcher allows retrieval of incompletely specified elements. Applications of this natural language processing package include analysis of phonological variation for specifying and testing phonological rules, and comparison across languages for historical reconstruction.

I. NATURAL LANGUAGE PROCESSING PACKAGE

A. General characteristics

The natural language processing package implemented at the Speech Communications Research Laboratory (SCRL) is currently used in the analysis of associated lists of string data such as discourse transcriptions or pronouncing dictionaries.
The package consists of

a) a set of "batch" programs which provide frequency and context information on the lexical and phonological forms appearing in the input; and

b) a system for interactively accessing the data on the basis of orthographic and phonological patterns.

All of the programs in this package are written in SAIL, an ALGOL-based language offering extended string and set manipulation operations and an associative data base. The programs run on a DEC PDP-10 at Carnegie-Mellon University via the Advanced Research Projects Agency (ARPA) computer network (ARPANET). The ARPANET is accessed by the ELF operating system developed by SCRL, which runs on a local PDP-11 [1].

While the processing package is applicable to various types of natural language data, it has been used most extensively at SCRL in the analysis of discourse transcriptions. The discourses consist of conversational speech gathered in interviews with adult speakers of various dialects of American English. More than twenty-five discourses, transcribed orthographically and phonologically, have been processed, yielding detailed information on over 28,000 utterances representing about 3,500 distinct lexical items. All examples in this section are taken from a typical discourse.

B. "Batch" Facility

Discourse processing usually begins with the generation of a transcription reference file in which orthographic and phonological representations are listed in discourse order, as illustrated in Figure 1.

Figure 1. Transcription reference file (excerpt). Orthographic forms: WELL LET'S TRY CLASSIFYING THEM ACCORDING= TO THE EXCUSES. Phonological forms: TRAY KLAES$CFAYIHNS DHAXM //AXK$ORDIHN/TUW//

In this example, the phonological realization of TRY is /tray/ (coded TRAY). The phonological code shown is a basic ARPA phonemic alphabet augmented by special symbols indicating some phonetic detail, such as vowel height. The realization of THE, for example, is coded DH$I, indicating that the vowel fell between /i/ and /I/.

Reference numbers assigned to each utterance serve as an index to the discourse context in which utterances occur, and are used to interpret the output of other programs in the package. Separate reference number sequences are provided for the orthographic and phonological forms in the reference files, since there may not be a one-to-one correspondence between these forms, as in the case of phonological merging which obscures word boundaries. In Figure 1, for example, the two orthographic items WELL and LET'S are realized as a single phonological item /wɛlɛts/ (coded WELEHTS).

The core of the "batch" processing facility is a set of three programs: PROCON, ENVIRN and CLUSTR. PROCON provides frequency and context information on the lexical level, while the other two provide similar information on the phonological level.

PROCON output contains an alphabetically sorted list of the utterance types occurring in the input discourse transcription file, as illustrated in Figure 2. Frequency of occurrence of each type is given, along with the various phonological realizations. For each phonological realization, a frequency count and reference numbers are provided.

Figure 2. Sample PROCON output: the entry for HAVE, listing its phonological realizations (among them AXV and HHAEV) with a frequency count and reference numbers for each.

In Figure 2, for example, HAVE occurred eight times, and was pronounced AXV (/əv/) three times and HHAEV (/hæv/) three times. Using the reference numbers associated with these pronunciations, it is possible to establish the discourse context. One would find that the three AXV pronunciations (i.e. utterances 11, 337 and 703) all involved the auxiliary construction, as in "...may have felt...", "...seemed to have been..." and "...which have since been...".

ENVIRN tallies occurrences of phonological segments and environments in the discourse transcriptions. The output of this program lists frequencies of all phonemes appearing in the input file, as illustrated in Figure 3.
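As a rough illustration of the lexical-level tally that PROCON produces (ENVIRN and CLUSTR do analogous bookkeeping over phonemes and clusters), the following Python sketch groups utterances by orthographic type and phonological realization. It is not the SAIL implementation; the function name `procon`, the tuple layout, and the sample data are assumptions made for illustration. Only the reference numbers 11, 337 and 703 for AXV come from the text; the other entries are invented, and the sketch uses a single reference number sequence rather than the separate orthographic and phonological sequences of the actual package.

```python
from collections import defaultdict


def procon(tokens):
    """Tally (reference_number, orthographic, phonological) tokens.

    Returns an alphabetically sorted list of
    (orthographic_type, total_count, [(realization, count, refs), ...]).
    """
    table = defaultdict(lambda: defaultdict(list))
    for ref, orth, phon in tokens:
        table[orth][phon].append(ref)

    report = []
    for orth in sorted(table):
        rows = [(phon, len(refs), refs)
                for phon, refs in sorted(table[orth].items())]
        total = sum(count for _, count, _ in rows)
        report.append((orth, total, rows))
    return report


# Hypothetical input; only the AXV reference numbers are from the paper.
SAMPLE = [
    (11, "HAVE", "AXV"),
    (120, "HAVE", "HHAEV"),
    (337, "HAVE", "AXV"),
    (703, "HAVE", "AXV"),
    (704, "BEEN", "BIHN"),
]

if __name__ == "__main__":
    for orth, total, rows in procon(SAMPLE):
        print(orth, total)
        for phon, count, refs in rows:
            print("   ", phon, count, ",".join(map(str, refs)))
```

Keeping the reference numbers alongside each count is what lets a user go from a frequency back to the discourse context, as in the HAVE example.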

Figure 3

Glottal stop, coded Q, occurred a total of thirty times in the discourse. The immediate environments of Q are listed alphabetically by left context, with word boundaries indicated by slash /, and a frequency count and reference numbers are given for each environment. For example, Q appeared eight times in the context EH--EN, and a check of the reference list shows that all these occurrences were in the word sentence. ENVIRN output also provides a frequency ordered list of phonemes, with frequency totals broken down according to occurrence in word initial, medial and final position.

CLUSTR, the third of the "batch" programs, is used in the analysis of phoneme cluster distribution in the discourse data. All clusters are indexed by each of their component phonemes, so that the cluster NDZ (/ndz/), which is listed under D in Figure 4, also appears under N and Z in the full output.

Figure 4. Sample CLUSTR output for within-word consonant clusters indexed under D (entries include D Q EN T S, D V, D Z and N D Z, each with a frequency count and reference numbers).

Separate output may be generated for clusters occurring within words or across word boundaries. Currently, consonant and vowel clusters are tallied, but the program can be easily modified to handle sequences of phonemes belonging to arbitrary user-defined classes (e.g. voiced sounds, nasals, unvoiced stops, etc.). For each phoneme belonging to a selected class, CLUSTR provides a count of the number of times that the phoneme appears in clusters, an alphabetically sorted list of those clusters, and a frequency count and reference numbers for each cluster. Figure 4, a sample of CLUSTR output for within-word consonant clusters, shows that D appeared in clusters a total of 70 times, with 32 of these being ND clusters. Reference numbers may be used to establish the discourse context of any cluster. For example, the cluster D Q EN T S (/d?n̩ts/) appears in utterance 486, which is the word students. Like ENVIRN, CLUSTR provides a frequency ordered list of cluster types in addition to the alphabetic list.

C. Interactive Retrieval Facility

The set of "batch" programs is complemented by a language data retrieval system which allows the user to interactively retrieve data items conforming to various orthographic, phonological and syntactic patterns.

Linguistic data is internally stored in the system as a network of associations between items of various types. These associations are implemented in SAIL as LEAP triples [2], and the element types entering into these associations vary according to the particular application. For example, in analysis of the discourse data described above, triples contain orthographic, phonological and syntactic elements. For study of phonetic-to-phonemic mapping, triples might be orthographic, phonemic and phonetic elements. In comparative linguistic research, triples might consist of an orthographic element and two phonological elements corresponding to two languages or dialects.

Data can be accessed on the basis of patterns directed to any one (or any combination) of these elements. For example, if the data base contains associations between orthographic, phonological and syntactic elements, then the query

    P/ O: THE

retrieves the phonological items associated with the spelling THE, and might return DHAX (/ðə/) and DHIY (/ði/). The query

    O/ P: TUW

would return the orthographic items pronounced TUW (/tu/), e.g. two, too, to.
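The two queries above can be mimicked with a very small associative store. This is a minimal Python sketch, assuming each association is held as a record with O (orthographic), P (phonological) and S (syntactic) elements; it is not the SAIL/LEAP mechanism itself, and the class name `AssociationStore`, its methods, and the syntactic codes in the sample data are hypothetical.

```python
class AssociationStore:
    """Toy store of associations between typed elements (O, P, S)."""

    def __init__(self):
        self._records = []

    def add(self, **elements):
        # Each record associates one element of each type, e.g. O, P, S.
        self._records.append(dict(elements))

    def query(self, want, **keys):
        """Return elements of type `want` from records matching all keys."""
        results = []
        for record in self._records:
            if all(record.get(k) == v for k, v in keys.items()):
                value = record.get(want)
                if value is not None and value not in results:
                    results.append(value)
        return results


store = AssociationStore()
# Sample associations; the syntactic codes are made up for illustration.
store.add(O="THE", P="DHAX", S="DET")
store.add(O="THE", P="DHIY", S="DET")
store.add(O="TWO", P="TUW", S="NUM")
store.add(O="TOO", P="TUW", S="ADV")
store.add(O="TO", P="TUW", S="PREP")

if __name__ == "__main__":
    print(store.query("P", O="THE"))   # analogue of  P/ O: THE
    print(store.query("O", P="TUW"))   # analogue of  O/ P: TUW
```

Because any element type can serve as the key or as the result, the same store answers both spelling-to-pronunciation and pronunciation-to-spelling questions.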

Patterns such as THE and TUW completely specify the element to which they are directed, but various special forms allow partial specifications to be expressed also. The symbol $ matches any single segment (in a phonological pattern) or character (in an orthographic pattern), and the symbol = matches any number, including zero, of contiguous segments (or characters). Thus, if N is the syntactic code for Noun, the query

    O/ P: $$, S: N, O: D=

searches for all two-phoneme nouns which begin with the letter D, and might return dye, day, doe, dough.

Each phonological element is defined in terms of a set of features such as UV (unvoiced) and ST (stop), and these features may be used to specify segments in phonological patterns. To search for phonological realizations containing /i/ between unvoiced stops, one could use the query

    P/ P: =(UV + ST)IY(UV + ST)=

to find /kip/ (keep), /pikɪŋ/ (peeking), and /rɪpitɪd/ (repeated).

Boolean operators are also available for specifying pattern segments. For example, the query

    O/ O: (C OR K)=, P: (NOT K)=

returns orthographic items which begin with C or K and are not pronounced with initial /k/, e.g. cite, change, know.
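The pattern forms described above ($, =, feature specifications, and the Boolean operators OR and NOT) can be sketched with a short backtracking matcher. The Python below is only an illustration under stated assumptions: phonological forms are pre-segmented tuples of ARPA-style symbols, patterns are lists of per-segment predicates, and the feature table, lexicon entries and helper names (`seg`, `feat`, `one_of`, `not_`, `query`) are all hypothetical rather than part of the SAIL system.

```python
RUN = object()  # stands for "=": any number of contiguous segments, incl. zero

# Hypothetical feature definitions (UV = unvoiced, ST = stop).
FEATURES = {
    "P": {"UV", "ST"}, "T": {"UV", "ST"}, "K": {"UV", "ST"},
    "D": {"ST"},
}


def ANY(segment):                 # stands for "$": any single segment
    return True


def seg(symbol):                  # a fully specified segment
    return lambda s: s == symbol


def feat(*names):                 # e.g. (UV + ST)
    wanted = set(names)
    return lambda s: wanted <= FEATURES.get(s, set())


def one_of(*symbols):             # e.g. (C OR K)
    allowed = set(symbols)
    return lambda s: s in allowed


def not_(pred):                   # e.g. (NOT K)
    return lambda s: not pred(s)


def matches(pattern, segments):
    """True if the whole segment sequence matches the whole pattern."""
    if not pattern:
        return not segments
    head, rest = pattern[0], pattern[1:]
    if head is RUN:
        return any(matches(rest, segments[i:])
                   for i in range(len(segments) + 1))
    return bool(segments) and head(segments[0]) and matches(rest, segments[1:])


# Toy lexicon of (orthographic, phonological segments, syntactic code).
LEXICON = [
    ("KEEP", ("K", "IY", "P"), "V"),
    ("PEEKING", ("P", "IY", "K", "IH", "NX"), "V"),
    ("DAY", ("D", "EY"), "N"),
    ("DOUGH", ("D", "OW"), "N"),
    ("DO", ("D", "UW"), "V"),
    ("CITE", ("S", "AY", "T"), "V"),
    ("KNOW", ("N", "OW"), "V"),
    ("CAT", ("K", "AE", "T"), "N"),
]


def query(orth=None, phon=None, syn=None):
    """Return orthographic items satisfying every pattern that is given."""
    hits = []
    for o, p, s in LEXICON:
        if orth is not None and not matches(orth, tuple(o)):
            continue
        if phon is not None and not matches(phon, p):
            continue
        if syn is not None and s != syn:
            continue
        hits.append(o)
    return hits


if __name__ == "__main__":
    # O/ P: $$, S: N, O: D=
    print(query(phon=[ANY, ANY], syn="N", orth=[seg("D"), RUN]))
    # P: =(UV + ST)IY(UV + ST)=  (reported here by spelling)
    print(query(phon=[RUN, feat("UV", "ST"), seg("IY"), feat("UV", "ST"), RUN]))
    # O/ O: (C OR K)=, P: (NOT K)=
    print(query(orth=[one_of("C", "K"), RUN], phon=[not_(seg("K")), RUN]))
```

Treating every pattern element as a predicate on one segment keeps literals, feature bundles and Boolean combinations uniform, so only the = wildcard needs special handling.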

Several capabilities lacking in the current interactive system will be available in the near future. The user will be able to (1) specify optional segments and sequences of segments in phonological patterns; (2) create and name sets containing items of interest, e.g. monosyllabic function words, and use set operations such as union and intersection; (3) interactively modify feature definitions of phonological symbols; (4) retrieve several elements, e.g. orthographic and phonological forms, simultaneously; (5) display the discourse context of any given item; and (6) write retrieval queries and responses to a file for subsequent analysis.

The processing package can be used in the analysis of various kinds of natural language data, as illustrated in the following examples.

A. Phonological variation

The programs can be used to efficiently index and sort natural language data so that systematic phonological variation can be easily examined. For example, inspection of a PROCON output for a ten minute interview consisting of over

-- final /t/ alternates with final glottal stop /?/ under certain conditions;

-- alveolar flapping occurs under several stress conditions which appear to be related to noun affixes.

These preliminary observations can be systematically investigated using the interactive query system.

The data base can be queried for all phonological realizations ending in T (/t/) or Q (/?/), and the corresponding orthographic entries, using the queries

    P/ P: =(T OR Q)

and

    O/ P: =(T OR Q)

The resulting list might include the orthographic entries art, but, can't, fished, limit, raft, that and want, with phonological realizations such as /kænt/ and /kæn?/ for can't and /fɪʃt/ for fished. That is, final /t/ appears to vary with final /?/ following vowels and following nasals, but not elsewhere. This hypothesis, represented as a context-sensitive phonological rule, could then be tested against additional data using any of several computer rule testers [3-5].

Forthcoming modifications will allow queries with set operations, such that the intersection of orthographic entries having final /t/ alternating with /?/ can be requested directly by the query

    O/ P: =T ∩ P: =Q

That is, only entries with /t/ and /?/ alternation would be retrieved, and the entries art, fished and raft would not be returned.
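The effect of the planned intersection query can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the planned SAIL feature: the function name `final_alternation` is hypothetical, and the segmented transcriptions in the sample data are invented stand-ins for real PROCON output.

```python
from collections import defaultdict


def final_alternation(pairs, first="T", second="Q"):
    """Orthographic entries realized with BOTH final segments.

    `pairs` is an iterable of (orthographic, phonological_segments).
    """
    finals = defaultdict(set)
    for orth, phon in pairs:
        finals[orth].add(phon[-1])
    return sorted(o for o, ends in finals.items()
                  if first in ends and second in ends)


# Invented sample realizations (ARPA-style symbols, Q = glottal stop).
PAIRS = [
    ("CAN'T", ("K", "AE", "N", "T")),
    ("CAN'T", ("K", "AE", "N", "Q")),
    ("THAT", ("DH", "AE", "T")),
    ("THAT", ("DH", "AE", "Q")),
    ("ART", ("AA", "R", "T")),
    ("FISHED", ("F", "IH", "SH", "T")),
    ("RAFT", ("R", "AE", "F", "T")),
]

if __name__ == "__main__":
    print(final_alternation(PAIRS))
```

Entries that only ever end in T, such as ART, FISHED and RAFT in this sample, drop out of the result, which is the behaviour described for the intersection query.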

In order to determine the conditions under which alveolar flapping occurs, the queries

    P/ P: =DX=

and

    O/ P: =DX=

can be used to retrieve phonological items which contain DX (the alveolar flap) and the corresponding orthographic items. Such a list might include ability, city, facility, letter, petty, responsibility and writing, each with a phonological realization containing the flap.

Flapping occurs in a descending stress pattern, e.g. city, letter, petty, writing, in which a stressed vowel precedes the flap and an unstressed vowel follows. In addition, the flap appears to occur between unstressed vowels when the sequence represents the noun affix -ity, as in ability. To check this, the query

    P/ O: =ITY, S: N

could be used to retrieve all nouns ending in -ity, and the subset involving affixed forms (i.e. excluding city, pity) could be examined for occurrences of flapping.

B. Word Error Recognition Testing

The interactive facility can be used to examine the kinds of word recognition errors which might occur in a speech understanding system due to indeterminacies in segment labelling. If a string is completely specified as /likɪŋ/ (coded LIYKIHNX), then it matches a single word, leaking. However, if labelling is less precise, then alternative (and incorrect) word matches might occur. Using the interactive retrieval system, alternative labels and resulting word matches can be examined for any given lexicon.

In the example above, the labelled string might be

    L (VOC HIGH ANT) K IH NX

with the stressed vowel represented as a set of features: vocalic, high, anterior. Resulting word matches might include leaking and licking.
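The effect of an underspecified label on word matching can be sketched as follows. This is a minimal Python illustration, not the SCRL system: the feature assignments, the four-word lexicon and the function names are assumptions chosen so that the labelled string above behaves as described.

```python
# Hypothetical feature definitions for a few vowels.
FEATURES = {
    "IY": frozenset({"VOC", "HIGH", "ANT"}),
    "IH": frozenset({"VOC", "HIGH", "ANT"}),
    "AE": frozenset({"VOC", "ANT"}),
}

# Toy lexicon with pre-segmented phonological forms.
LEXICON = {
    "LEAKING": ("L", "IY", "K", "IH", "NX"),
    "LICKING": ("L", "IH", "K", "IH", "NX"),
    "LACKING": ("L", "AE", "K", "IH", "NX"),
    "REEKING": ("R", "IY", "K", "IH", "NX"),
}


def label_matches(label, segment):
    """A label is either a symbol or a set of required features."""
    if isinstance(label, str):
        return label == segment
    return label <= FEATURES.get(segment, frozenset())


def word_matches(labels, lexicon=LEXICON):
    """Words whose segments are all compatible with the label string."""
    return [word for word, segments in lexicon.items()
            if len(segments) == len(labels)
            and all(label_matches(l, s) for l, s in zip(labels, segments))]


if __name__ == "__main__":
    exact = ["L", "IY", "K", "IH", "NX"]
    loose = ["L", frozenset({"VOC", "HIGH", "ANT"}), "K", "IH", "NX"]
    print(word_matches(exact))
    print(word_matches(loose))
```

Replacing further symbols by feature sets widens the candidate list in the same way, which makes the growth of possible confusions easy to inspect for a given lexicon.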

If the initial consonant is also specified as a set of features (consonant, sonorant, continuant), as in the string

    (CON SON CONT) (VOC HIGH ANT) K IH NX

then the resulting word matches might be leaking, licking, reeking. If the K is specified less precisely as a voiceless stop, word matches might include leaking, licking, reeking, leaping, ripping. The interactive facility allows the system designer to easily determine the nature of possible incorrect matches due to phonological indeterminacy, especially as the size of the lexicon increases.

C. Comparative Linguistic Relationships

If the data base is represented as an orthographic list with two associated phonological lists representing two languages or dialects, the interactive system can be used to discover systematic sound correspondences, and to aid in the study of dialect relationships and historical reconstruction.

A sample data base might be:

    Gloss       Language A   Language B
    a fish      plaa         pa
    to have     mii          mia
    no, not     plaaw        paw
    brother     phii         fia
    bamboo      phaay        fay

A query directed at language A items with initial /pl-/ clusters would retrieve the corresponding items in language B, e.g. pa and paw, indicating that consonant cluster simplification may have occurred in language B. The query

    B/ A: =IYIY

would retrieve those items in language B which correspond to items in language A with final /-ii/, e.g. the diphthongized mia and fia. A large data base could be accessed in this way to discover systematic correspondences between languages A and B, such as the correspondences /pl-/:/p-/, /m-/:/m-/, /ph-/:/f-/, /-ii/:/-ia/, /-aa/:/-a/, etc. The flexibility of the interactive system, combined with the linguistic intuition of the user, can be used to specify and retrieve any set of correspondences, without the need to format the data according to initial consonants or clusters, vowel nuclei, finals, etc. Information such as tonal contours and stress can also be represented and accessed.

REFERENCES

[1] Retz, D. L., J. R. Miller, J. L. McClurg, B. W. Schafer, Elf Kernel Programmer's Guide, Speech Communications Research Laboratory, Santa Barbara, California, April, 1975.

[2] Feldman, J. A. and P. Rovner, "An ALGOL-based Associative Language," Comm. ACM, Volume 12, August, 1969, 439-449.

[3] Barnett, J. A., A Phonological Rules System, TM-5478/000/00, System Development Corporation, Santa Monica, California, 1975.

[4] Bobrow, D. G. and J. B. Fraser, "A Phonological Rule Tester," Comm. ACM, Volume 11, November, 1968, 766-772.

[5] Friedman, J. and Y. C. Morin, Phonological Grammar Tester: Description, Natural Language Studies No. 9, Phonetics Laboratory, The University of Michigan, 1971.

ACKNOWLEDGEMENT

This research was supported in part by the Advanced Research Projects Agency of the Department of Defense through Contract N00014-73-C-0221 administered by the Office of Naval Research Information Systems Program.
ON THE ROLE OF WORDS AND PHRASES IN AUTOMATIC TEXT ANALYSIS

GERARD SALTON

(Abstract only)

Automatic indexing normally consists in assigning to documents either single terms, or more specific entities such as phrases, or more general entities such as term classes. Discrimination value analysis assigns an appropriate role in the indexing operation to the single terms, term phrases, and thesaurus categories. To enhance precision it is useful to form phrases from high-frequency single term components. To improve recall, low-frequency terms should be grouped into affinity classes, assigned as content identifiers instead of the single terms.

Collections in different subject areas are used in experiments to characterize the type of phrase and word class most effective for content representation.

The following typical conclusions can be reached:

a) the addition of phrases improves performance considerably;

b) use of phrases is better with corresponding deletion of single terms in practically all cases;

c) the use of both high-frequency and medium-frequency phrases is generally more effective than the use of either phrase type alone;

d) the most effective thesaurus categories are those which include a large number of low-frequency terms;

e) the least effective classes either consist of only one or two terms, or else they include terms with unequal frequency characteristics, permitting the high-frequency terms to overcome the others.

The discrimination value theory is developed and appropriate experimental output is supplied.

American Journal of Computational Linguistics Microfiche 36 : 68

GRAMMATICAL COMPRESSION IN NOTES AND RECORDS: ANALYSIS AND COMPUTATION

BARBARA B. ANDERSON
Department of Anthropology
University of New Brunswick

IRWIN D. J. BROSS
Roswell Park Memorial Institute
Buffalo, New York

NAOMI SAGER
Linguistic String Project
New York University

ABSTRACT

Linguistic mechanisms of compression are used when making notes within a context where the objects and meanings are known. Mechanisms of compression in medical records for a collaborative study of breast cancer are described. The syntactic devices were mainly deletion of words having a special status in the grammar of the whole language and deletion in particular positions of words having a special status in the sublanguage. The deleted forms are described and sublanguage word classes defined. A subcorpus of the medical records was parsed by an existing computer parsing system; a component covering the deletion-forms was added to the grammar. Modifications to the computer grammar are discussed and the parsing results are summarized.

Introduction

All languages have mechanisms of compression. Sentences may be embedded within other sentences by means of nominalization and complementation. Various grammatical transformations involve deletion of certain parts of the sentence. In medical records, we find entries such as no evidence of metastases, which may be said to be derived from something like There is no evidence of metastases. Such incomplete sentences are not common in the spoken language of the medical records (i.e. dictated reports). However, when physicians themselves are required to write material for records, compression mechanisms are commonly used. Although this paper will deal with a specific corpus, similar devices would often be used for compression in other situations where there is pressure to write as little as possible. Legal, educational, and scientific records where informal notes are kept would be other examples of this class of situations.

The original motivation for this study was to develop effective methods for storing the information in a medical record and to be able to retrieve this information for purposes of research, medical care, or administration. From previous research, the feasibility of verbatim input of dictated narrative has been established. Computerized extraction of the information has been shown to be feasible in a test system, ACORN (Automated Coding of Report Narrative). This system has been described in detail in a series of previous papers. 1,2,3

1 I.D.J. Bross et al. "Information in Natural Languages: A New Approach". Journal of the American Medical Association, Vol. 207, No. 11, 1969, pp. 2080-2084.

2 I.D.J. Bross et al. "Feasibility of Automated Information Systems in the User's Natural Language". American Scientist, Vol. 57, No. 2, 1969, pp. 193-205.

3 P.A. Shapiro and D.F. Stermole. "ACORN (Automated Coding of Report Narrative): An Automated Natural-Language Question-Answering System for Surgical Reports". Computers and Automation, Vol. 20, No. 7, 1971.

For a highly structured medical record where the entries are single words or very restricted sentences, the feasibility of computer-assisted editing and coding has also been established. A procedure for typing in the entries verbatim in a medical record, called 'TICES' (Type-In Coding and Editing System), has been reported elsewhere. 4 However, the third, intermediate class of material cannot be handled by ACORN or by TICES. Therefore, a linguistic analysis of this type of material has been undertaken with the ultimate objective of setting up a comprehensive computer system that can handle almost everything in the medical records.

In the earlier efforts to develop natural language technology, the work was facilitated by the fact that the documents involved were strictly for the transmission of factual information. 5 Such documents are regarded as important both by the persons who are filling them out and by the persons who read them. In this no-nonsense situation where the record may be critically reviewed by the peers of the person who is reporting the information, unambiguous and informative transmission of information is a critical need. Some of the simplicities in the present analysis may be peculiar to this type of situation.

The existence of a subculture with shared training, objectives, and experience may facilitate the note-taking process in somewhat the same way that a person taking notes for himself can somehow be more concise without ambiguity. However, many other note-taking situations would involve a subculture, though not necessarily a medical one, and the findings here might be expected to have some general applicability.

4 I.D.J. Bross et al. "Unobtrusive Biomedical Data-Input Systems". Bio-Medical Computing, No. 4, 1973, pp. 219-228.

5 I.D.J. Bross, P.A. Shapiro and B.B. Anderson. "How Information Is Carried in Scientific Sublanguages". Science, Vol. 176, No. 4041, 1972, pp. 1303-1307.

Source of Material

The medical notes discussed here are from the records of the Surgical Adjuvant Breast Project, a nationwide collaborative study involving 36 medical institutions. The records were filled out by medical and paramedical personnel at the participating institutions and centralized at Roswell Park Memorial Institute in a statistical unit under the direction of Dr. Nelson Slack. A sample of approximately 50 was taken from the 2734 case histories of patients in the program and is being used in the linguistic analysis. Each case history ordinarily consists of 3-6 pages of detailed information on the patient's initial status, treatment, pathology report, medical problems, and subsequent fate. When the structured information in the record was excluded, each case history had between 6 and 26 notated items, each item consisting of 1 to 5 partial sentences. While this material is specialized to the purposes of the collaborative study, this type of information is fairly typical of what is found in the usual hospital record.

The notes were typed verbatim using an IBM Mag Card Communicator so as to obtain simultaneously a typed paper document and a record in computer-usable form. This device is used in the data-input system of TICES, an existing system for handling completely structured records. It would presumably be used in any extension of TICES which would handle medical notes. In this analysis the computer was used to reorganize the material in a form more convenient for manual analysis by the linguist.

Anderson analyzed the linguistic structure of the entries in a sample of the medical records involving radiation findings. A discussion of this analysis will take up the next part of the paper. Sager and associates used some of the findings from this study to develop methods for processing these same medical records by computer, adapting a program and grammar which had been developed for parsing science articles. This project will be discussed in the final part of the paper.

Linguistic Characteristics of Medical Notes

Many of the entries on the medical records are in the form of notes which are neither complete sentences nor single word entries, but linguistic strings of an intermediate type, which we will hereafter call fragments. Fragments are a compressed type of linguistic material resulting from various transformations which have the effect of making linguistic strings shorter by reducing or deleting material. The writer of these stretches of material must make his entries brief, in order to save time and effort, but also make them informative and unambiguous. For this reason the deleted material has to be easily recoverable, or in other words it must not contain much information. An analysis of the fragments shows that deletion is mainly of a small class of sentence parts: (1) tense and the verb be (t be); (2) subject, tense and the verb be; (3) the subject; and (4) subject, tense, and verb (V) other than be.

A second characteristic of fragments which makes deleted material recoverable is that both the deleted material and the remainders consist of words in easily defined subclasses, based on both distributional and semantic criteria. These subclasses are easily defined because of the nature of the sublanguage; in general the vocabulary is limited and each word has a limited semantic range. The question on a form which is being answered can also be used as a basis for restoring deleted material.

One of the most commonly deleted items in the medical records is t be (1 and 2). Tense is perhaps the most important information be gives. The deletion of tense in the medical records causes no ambiguity because usually the physician describes the situation at the time of filling out the report. Otherwise he gives the time in a time phrase: x-rays on November 2.

Fragment Types

In Table 1 we list the fragment types, giving an example of each, but not with all occurring word subclasses. The types will first be given according to what material is deleted and then will be further subclassed according to the highest nodes of the tree structure of the remainder. The material in brackets is the word subclasses which are assumed to have been deleted.

TABLE 1. FRAGMENT TYPES

    Material Deleted     Structure of Fragment   Example
    1. t be              N Ven (by N)            no metastatic lesions [were] detected [by N-physician (rare)]
                         N Adj                   chest films [were] normal
                         N P N                   patient [was] without cough
                         N to be Ven             this form [is] to be used ...
                         N Ving                  wound [is] healing well
    2. Subject t be      Ven                     [N-disease was] aspirated once
                         Adj                     [N-patient is] dead
                         to be Ven               [N-patient is] to be seen by gynecologist
                         Ving                    [N-patient is] doing well
    3. Subject           t V Object              [I] found osteochondritis in rib (5th right)
                                                 [N-patient] had period one week ago
                                                 [N-disease] invades skin
                                                 [N-disease] seems minor
    4. Subject t V       Object                  [I V-discovered] no bony metastases
                                                 [N-physician did] excision of (r) 5th costal cartilage
                                                 [N-patient has] no bone pain

Word Subclasses

The word subclasses should have three characteristics: (1) they should enable deleted material to be recovered, (2) they should make it possible to extract and store informational units such as those in ACORN 6 and (3) they should be defined so that a linguistically unsophisticated person can easily put words into their subclasses. The word subclasses are based on both semantic and distributional criteria. To a large extent nouns can conveniently be subclassed on a semantic basis and verbs can be subclassed on a distributional basis, according to the subclasses of nouns which they take as subject and object. Due to the nature of the sublanguage there is relatively little overlap (e.g., a given verb is likely to take only one noun subclass as subject) compared to what we would find in the language as a whole.

Two important subclasses of human nouns used in the medical records are N-physician and N-patient. Each has only a few members, but is important because many verbs characteristically take it as subject or object, and also because both, but particularly N-physician, are usually deleted. It is on the basis of the verbs which characteristically take them as subject or object that they can usually be recovered without ambiguity.

Other noun subclasses concern more directly the subject matter of the reports, the concrete objects with which the physician is dealing. Unlike N-physician and N-patient, these classes usually have many members and they are seldom deleted. As with N-physician and N-patient, certain verb subclasses characteristically take them as subject or object.
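The idea that a deleted human subject can be recovered from the verb subclass can be sketched as a simple lookup. The Python below is only an illustration under stated assumptions: the verb lists are taken from the examples of V-patient-subject and V-physician-subject in Table 2, the input is assumed to be a fragment whose first word is a verb in base form, and the function names are hypothetical. It is not the procedure used in the study.

```python
# Example members of two verb subclasses, as listed in Table 2.
V_PATIENT_SUBJECT = {
    "complain", "come", "cooperate", "enter", "feel", "gain",
    "go", "have", "refuse", "show", "suffer", "take",
}
V_PHYSICIAN_SUBJECT = {
    "feel", "have", "place", "tell", "transfer", "treat", "see",
}


def candidate_subjects(verb):
    """Noun subclasses that the verb characteristically takes as subject."""
    candidates = set()
    if verb in V_PATIENT_SUBJECT:
        candidates.add("N-patient")
    if verb in V_PHYSICIAN_SUBJECT:
        candidates.add("N-physician")
    return candidates


def restore_subject(fragment):
    """Prefix the fragment with its subject class when it is unambiguous.

    `fragment` is a list of words whose first word is a base-form verb
    (a simplifying assumption of this sketch).
    """
    candidates = candidate_subjects(fragment[0])
    if len(candidates) == 1:
        return ["[" + next(iter(candidates)) + "]"] + fragment
    return fragment  # ambiguous or unknown: leave the fragment as written


if __name__ == "__main__":
    print(restore_subject(["refuse", "medication"]))
    print(restore_subject(["treat", "lesion"]))
    print(restore_subject(["have", "no", "pain"]))
```

Verbs such as feel and have appear in both example lists, so a lookup of this kind leaves them unresolved; in practice further context, such as the object or the question on the form, would be needed.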

Table 2 gives some of the word subclasses with examples of each.

TABLE 2. SOME WORD SUBCLASSES

    N-bodypart            abdomen, axilla, bone, breast, cervix, pelvis
    N-change              change, elevation, enlargement, gain, increase
    N-dimension           pressure, rate, rhythm, size, weight
    N-disease             carcinoma, cough, disease, edema, fibrosis
    N-exam                biopsy, exam, film, mammogram, scan, x-ray
    N-location            area, field, floor, lobe, neck, part, region
    N-patient             she, her, patient, lady, woman
    N-physician           doctor, he, him, his, I, M.D., radiologist
    N-therapy             drug, insulin, medication, medicine, radiation
    N-time                date, month, time, visit, winter, year
    V-be-equivalent       appear, feel, indicate, remain, represent, seem
    V-change              alter, clear, change, enlarge, heal, progress
    V-discover            detect, find, identify, note, observe, see
    V-patient-object      admit, give, leave, place, readmit, see, transfer, treat
    V-patient-subject     complain, come, cooperate, enter, feel, gain, go, have, refuse, show, suffer, take
    V-physician-subject   feel, have, place, tell, transfer, treat, see
    V-show                show, demonstrate, indicate, reveal, suggest
    Adj-bodypart          axillary, bony, clavicular, lumbar, pelvic
    Adj-changed           elevated, enlarged, healed, stable, unchanged
    Adj-degree            considerable, extensive, intermittent, little
    Adj-discover          absent, evident, known, possible, present
    Adj-disease quality   active, bad, benign, degenerative, firm, hard, malignant, metastatic, nodular
    Adj-[...]             adjoining, distal, dorsal, frontal, left
    Adj-[...]             clear, free, healthy, negative, normal

Computer Parsing of Medical Records

To test the linguistic analysis, a subset of the manually analyzed corpus of medical records was parsed by computer, using the NYU Linguistic String Parser. 8 The LSP grammar of English is based on the same linguistic principles as the ACORN grammar. Hence it could also serve to test the feasibility of adding a note-handling capability to the ACORN-TICES system. The LSP system, which was designed for text processing, was adapted to the parsing of medical records by deleting portions of the grammar which are not required for this type of material and adding a section covering sentence fragments. These changes are described below, followed by the parsing results.

The corpus which was parsed consisted of 12 sections of the Radiation Findings extracted in their order of appearance from the medical records. These sections contained 245 sentences or sentence fragments (word sequences ending in a period). Of these, 37 were complete English sentences and 205 were fragments; 3 were combinations of both types. 21 entries were identical to others in the corpus, accounting in all for 139 of the sentences or sentence fragments. Of the complete sentences, some were quite long, e.g., Reexamination shows some scarring and thickening over the right apex which is perhaps slightly more evident than it was before, but nothing is seen that is typical of tumor involvement. Typical shorter sentences are Chest films on 10-25-68 and 12-14-68 do not show any essential changes since last reports. Liver scan 1-29-69 was normal. Fragments were, as predicted, of the types listed in Table 1, above, though not all types were represented in the parsed corpus.

Table 3 shows the new definitions or redefinitions which were added to the LSP grammar to cover fragments. These definitions are written in Backus-Naur Form (BNF), as are all the ca. 180 definitions which comprise the context-free part of the LSP English grammar. The BNF definitions are used by the parser to construct a tree representing the structure of the input sentence.

TABLE 3. DEFINITIONS ADDED TO THE LSP GRAMMAR TO COVER SENTENCE FRAGMENTS

<SENTENCE>      ::= <TEXTLET>.
<TEXTLET>       ::= <OLD-SENTENCE><MORESENT>.
<OLD-SENTENCE>  ::= <INTRODUCER><CENTER><ENDMARK>.
<MORESENT>      ::= NULL/<TEXTLET>.
<INTRODUCER>    ::= NULL.
<CENTER>        ::= <ASSERTION>/<FRAGMENT>/<IMPERATIVE>.
<FRAGMENT>      ::= <SA> (<SOBJBESHOW>/<ASTG><SA>/<NSTG><SA>/
                    <VENPASS>/<NSTG> (<ASSERTION>/<SOBJBESHOW>)).
<SOBJBESHOW>    ::= <SUBJECT><BE-OR-SHOW><OBJBE><SA>.
<BE-OR-SHOW>    ::= '--'/NULL.
<ENDMARK>       ::= '.'/','/';'/'-'.

In addition to BNF definitions, the grammar contains restrictions, which test the sentence trees for grammatical and selectional well-formedness. The starting, or root, definition of the grammar is SENTENCE, so this is the first definition seen in Table 3. In the case of medical records, the unit may be longer than one sentence, but we have retained the root-word SENTENCE and defined SENTENCE in this case to be a TEXTLET (definition 2), which consists of a sentence (called OLD-SENTENCE, definition 3) optionally followed by more sentences (MORESENT, definition 4). The definition of OLD-SENTENCE has the same three elements (INTRODUCER, CENTER, ENDMARK) that the definition of SENTENCE does in the LSP grammar; however, in this case, the INTRODUCER (definition 5) is NULL; the CENTER (definition 6) contains an option FRAGMENT in addition to the options ASSERTION and IMPERATIVE defined in the English grammar (other options of CENTER, e.g. QUESTION, have been deleted); and the ENDMARK (definition 10) contains unconventional punctuation, such as dashes and commas, in addition to the period and semicolon. Since our main interest here is in FRAGMENT (definition 7), we will elaborate on this definition.

9 For more explanation of the LSP system and grammar, see N. Sager and R. Grishman, "The Restriction Language for Computer Grammars of Natural Language," Commun. of the ACM, 18, 390-400, 1975, and the references cited there.

In defining FRAGMENT, we have used parts of the grammar which were defined independently of the fragment problem.
That this is possible is in itself a partial verification of the conclusion from manual analysis that only limited, grammatically specifiable deletion-forms occur in the fragments seen in notes and records. For example, the dropping of the verb (type 1 of Table 1) can occur in normal English when a sentence containing the verb be occurs as the object of a verb like find, e.g. "We found the chest clear to percussion and auscultation." In the LSP grammar there is an object string defined for such occurrences; it is called SOBJBE (Subject + Object of be). This same string can then be made an option of CENTER to analyze fragments having the same form, e.g. "Chest clear to percussion and auscultation."
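As an illustration, the definitions of Table 3 can be written down as plain data and inspected mechanically. The Python sketch below is ours and is not the LSP implementation. The dictionary layout, the function names, and the flattening of FRAGMENT (repeating its leading SA in every option) are assumptions made for the example, and the punctuation literals follow our reading of the table. The second function lists the symbols that Table 3 uses but does not define, i.e. the parts of the grammar that were defined independently of the fragment problem.

```python
# Table 3 as data: each definition maps to a list of options, and each
# option is a sequence of symbols. "NULL" stands for the empty option.
# Assumption: FRAGMENT is flattened so that every option repeats the
# shared leading SA; literal spellings are our reading of the table.
TABLE3 = {
    "SENTENCE": [["TEXTLET"]],
    "TEXTLET": [["OLD-SENTENCE", "MORESENT"]],
    "OLD-SENTENCE": [["INTRODUCER", "CENTER", "ENDMARK"]],
    "MORESENT": [["NULL"], ["TEXTLET"]],
    "INTRODUCER": [["NULL"]],
    "CENTER": [["ASSERTION"], ["FRAGMENT"], ["IMPERATIVE"]],
    "FRAGMENT": [
        ["SA", "SOBJBESHOW"],
        ["SA", "ASTG", "SA"],
        ["SA", "NSTG", "SA"],
        ["SA", "VENPASS"],
        ["SA", "NSTG", "ASSERTION"],
        ["SA", "NSTG", "SOBJBESHOW"],
    ],
    "SOBJBESHOW": [["SUBJECT", "BE-OR-SHOW", "OBJBE", "SA"]],
    "BE-OR-SHOW": [["'--'"], ["NULL"]],
    "ENDMARK": [["'.'"], ["','"], ["';'"], ["'-'"]],
}


def reachable(grammar, root):
    """Return every symbol that can be reached from root via the options."""
    seen, stack = set(), [root]
    while stack:
        symbol = stack.pop()
        if symbol in seen:
            continue
        seen.add(symbol)
        for option in grammar.get(symbol, []):
            stack.extend(option)
    return seen


def undefined_symbols(grammar, root):
    """Symbols used under root but not defined in this table
    (NULL and quoted literals are not counted)."""
    return sorted(
        s for s in reachable(grammar, root)
        if s not in grammar and s != "NULL" and not s.startswith("'")
    )


print(undefined_symbols(TABLE3, "SENTENCE"))
```

In this reading, the symbols left undefined by Table 3 (ASSERTION, IMPERATIVE, SA, ASTG, NSTG, VENPASS, SUBJECT, OBJBE) are exactly those taken over from the existing English grammar.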

In detail, the definition of FRAGMENT begins with the element SA (Sentence Adjunct). The definition of SA (not shown here) contains 16 options covering all types of sentence modifiers. In this material the most frequent SA is a time expression, usually a date (called PDATE, for optional Preposition + date) or "this examination", "this visit". Following SA in the definition of FRAGMENT are the options proper, naming definitions already in the LSP grammar. The first option, SOBJBESHOW (Subject + Object of be or show), corresponds to the second and third structures of type 1 and also to occurrences like "Chest film no change"; it is an expansion of SOBJBE, discussed above. This option covers deletions of the two most common verbs in this material, be and show. The place of be or show (definition 8) in a fragment is either empty or is filled by a dash.

The second and fourth options, ASTG and VENPASS, in FRAGMENT correspond to structures of type 2 in Table 1 (e.g., "Negative", "felt to be a benign lesion"), where the subject, tense and verb be have been dropped. In the LSP grammar, ASTG (Adjective string) is an option of OBJBE, and VENPASS (V-en passive string) is also permitted after be and in other places. The third option, NSTG (Noun string), is an object of show, e.g., "Mild degenerative changes" (from "X-rays show mild degenerative changes"). It also covers occurrences of the first structure of type 1 (e.g. "No X-rays taken") where for regularity with more complete entries the passive verb (taken) is seen as a right adjunct of the noun. The last option, consisting of NSTG followed by either ASSERTION or SOBJBESHOW, covers such occurrences as "PA and lateral chest 11-5-71 reexamination shows some scarring and thickening over the right apex", where a noun phrase (PA and lateral chest 11-5-71) precedes an assertion about that noun phrase.
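The way the options of FRAGMENT divide up the fragment types can be shown with a toy recognizer. The Python sketch below is an illustration under strong simplifying assumptions and is not the LSP parser: the constituents of a fragment are taken to be already recognized and labeled (SA, NSTG, ASTG, and so on), SA is assumed to be allowed to be empty, and all function and variable names are ours. The options are tried in the order described above, and the name of the first option that accounts for the whole fragment is returned.

```python
# Options of FRAGMENT in the order described in the text. Each pattern
# starts with the shared SA; the last option of the definition is split
# into its two alternatives.
FRAGMENT_OPTIONS = [
    ("SOBJBESHOW", ["SA", "SOBJBESHOW"]),
    ("ASTG", ["SA", "ASTG", "SA"]),
    ("NSTG", ["SA", "NSTG", "SA"]),
    ("VENPASS", ["SA", "VENPASS"]),
    ("NSTG + ASSERTION", ["SA", "NSTG", "ASSERTION"]),
    ("NSTG + SOBJBESHOW", ["SA", "NSTG", "SOBJBESHOW"]),
]


def matches(pattern, labels):
    """True if the labeled constituents fit the pattern exactly.
    Assumption: an SA in the pattern may be empty."""
    if not pattern:
        return not labels
    head, rest = pattern[0], pattern[1:]
    if labels and labels[0] == head and matches(rest, labels[1:]):
        return True
    # Let an SA slot stay empty and try the rest of the pattern.
    return head == "SA" and matches(rest, labels)


def first_option(labels):
    """Name of the first FRAGMENT option covering the fragment, or None."""
    for name, pattern in FRAGMENT_OPTIONS:
        if matches(pattern, labels):
            return name
    return None


# A date followed by a subject-object string with the verb deleted.
print(first_option(["SA", "SOBJBESHOW"]))   # SOBJBESHOW
# A bare noun string.
print(first_option(["NSTG"]))               # NSTG
# A noun phrase followed by an assertion about it.
print(first_option(["NSTG", "ASSERTION"]))  # NSTG + ASSERTION
```

Because the sketch requires each option to cover the whole fragment, a bare NSTG and an NSTG followed by an ASSERTION are kept apart even though both begin with the same element.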

Space permits only a few remarks about these definitions. It was helpful to order the options so that the longer options precede the shorter ones, since some of the shorter options (e.g., NSTG) can have the same form as the first element of the longer ones. This is not required in parsing texts, since in full sentences there is usually no other way of fitting in the remainder of the sentence. Also, in text sentences, many nouns require a preceding determiner, so that compound nouns are not split into separate noun phrases. In this material, determiners are rarely employed, so this constraint cannot be applied. This, combined with verb deletions and the use of commas both in the text and as sentence separators, makes for a great deal of syntactic ambiguity. However, as the next section shows, it was possible to obtain the intended parse as the first output in most cases. This was accomplished without using the subclasses special to the medical material, which are used in a subsequent stage of processing preparatory to information retrieval.

Parsing Results

Parsing output is in the form of a tree, illustrated for a typical fragment in Fig. 1. (Only the nodes mentioned above are shown, plus LN/RN = left/right modifiers of Noun.) The full power of the parser is better illustrated by the long full sentences, but space does not permit presenting them here.

Fig. 1 Parse tree for FRAGMENT = 5-2-67 chest--no change since 2-7-67

A summary of the parsing results is given in Table 4. Of the total 245 sentences, a correct first parse was obtained for 171, or 69.8%, and a first parse adequate for further processing to obtain an "information format" was obtained in 213 cases, or 86.9%. The latter statement brings us to the important question of how these parses are to be used.

TABLE 4. PARSING RESULTS

                      Number of Sentences   Percentage
Full sentence                  37               15.1
Fragment                      205               83.7
Full S + Fragment               3                1.2
TOTAL                         245              100.0

(The remaining columns of Table 4 are headed "1st parse correct", "1st parse OK for format", "2nd or 3rd parse OK for format", "No parse or parses 1-3 not OK for format", and "Average time for 1st parse (seconds)"; their entries are not legible.)

The aim in processing natural language notes and records is to arrive at forms for the data which are suitable for computerized information retrieval. The data structures must not change the meaning. This is why syntactic methods are important. Parsing with an English grammar provides the gross structure of input sentences. (The use of English transformations makes the grammatical analysis more refined.) In each specialized technical area, more specific structure is possible, making use of the restricted word usage characteristic of the discourse in the given subject area.10 A second stage of processing of this type is now being applied to the parsed corpus of medical records and will be reported in a subsequent paper. A convenient test of the adequacy of the parsing outputs is therefore whether they can serve as input to this second stage of processing (called formatting). It can be seen in Table 4 that a number of "wrong" parses were still adequate as input to the formatting; the segmentation of the sentence into parts was correct even if the parts were assigned an incorrect syntactic status, e.g., object instead of adjunct. Only when the first parse was not adequate for formatting was the sentence rerun to obtain alternative analyses.

The parsing times are a rough indication of the efficiency of the parsing, but two points should be kept in mind. (1) The present LSP system is not a production model, but a research tool, with all that implies. (2) A significant fraction of the input sentences were "no data" types, e.g., "None this visit." These word sequences were so limited linguistically that a literal formula could serve to recognize them. The experimental use of such a formula cut down parsing times on the no-data entries from about 1.817 to 0.030. However, this formula was not used in the parsing summarized in Table 4.

This investigation was supported by Public Health Service Research Grant number CA-11531 from the National Cancer Institute.

I am grateful to Cynthia Insolio and Lynette Hirschman for their help in processing these data. (N. S.)

8 R. Grishman, N. Sager, C. Raze, and B. Bookchin, "The Linguistic String Parser," Proceedings of the NCC, AFIPS Press, Montvale, N. J., 1973.

10 See Ref. 5 and N. Sager, "Syntactic Formatting of Scientific Information," Proc. FJCC, AFIPS Press, Montvale, N. J., 1972.

et al., "Information in Natural Languages: A New Approach," 1969.

utterance tokens yields general observations such as

3a. N-physician Subject
3b. N-patient Subject
3c. N-disease Subject

Subject + V
4a. N-physician + V-discover
4b. N-physician + V-do
4c. N-patient + have