A Self-Learning Universal Concept Spotter

Tomek Strzalkowski and Jin Wang

GE Corporate Research and Development P.O. Box 8 Schenectady, NY 12301 USA

{strzalkowski,wangj}@crd.ge.com

We describe the Universal Spotter, a system for identifying in-text references to entities of an arbitrary, user-specified type, such as people, organizations, equipment, products, materials, etc. Starting with some initial seed examples and a training text corpus, the system generates rules that will find further concepts of the same type. The initial seed information is provided by the user in the form of a typical lexical context in which the entities to be spotted occur, e.g., "the name ends with Co.", or "to the right of produced or made", and so forth, or by simply supplying examples of the concept itself, e.g., Ford Taurus, gas turbine, Big Mac. In addition, negative examples can be supplied, if known. Given a sufficiently large training corpus, an unsupervised learning process is initiated in which the system will: (1) find instances of the sought-after concept using the seed-context information while maximizing recall and precision; (2) find additional contexts in which these entities occur; and (3) expand the initial seed-context with selected new contexts to find even more entities. Preliminary results of creating spotters for organizations and products are discussed.

Identifying concepts in natural language text is an important information extraction task. Depending upon the current information needs, one may be interested in finding all references to people, locations, dates, organizations, companies, products, equipment, and so on. These concepts, along with their classification, can be used to index any given text for search or categorization purposes, to generate summaries, or to populate database records. However, automating the process of concept identification in unformatted text has not been an easy task. Various single-purpose spotters have been developed for specific types of concepts, including people names, company names, location names, dates, etc., but these were usually either hand crafted for particular applications or domains, or relied heavily on a priori lexical clues, such as keywords (e.g., 'Co.'), case (e.g., 'John K. Big'), predictable format (e.g., 123 Maple Street), or a combination thereof. This makes creation and extension of such spotters an arduous manual job. Other, less salient entities, such as products, equipment, foodstuff, or generic references of any kind (e.g., 'a Japanese automaker') could only be identified if a sufficiently detailed domain model was available. Domain-model driven extraction was used in ARPA-sponsored Message Understanding Conferences (MUC); a detailed overview of current research can be found in the proceedings of MUC-5 (muc5, 1993) and the recently concluded MUC-6, as well as Tipster Project meetings, or ARPA's Human Language Technology workshops (tipster1, 1993), (hltw, 1994).

We take a somewhat different approach to identify various types of text entities, both generic and specific, without a detailed understanding of the text domain, and relying instead on a combination of shallow linguistic processing (to identify candidate lexical entities), statistical knowledge acquisition, unsupervised learning techniques, and possibly broad (universal but often shallow) knowledge sources, such as on-line dictionaries (e.g., WordNet, Comlex, OALD, etc.). Our method moves beyond the traditional name spotters and towards a universal spotter where the requirements on what to spot can be specified as input parameters, and a specific-purpose spotter could be generated automatically. In this paper, we describe a method of creating spotters for entities of a specified category given only initial seed examples, and using an unsupervised learning process to discover rules for finding more instances of the concept. At this time we place no limit on what kind of things one may want to build a spotter for, although our experiments thus far concentrated on entities customarily referred to with noun phrases, e.g., equipment (e.g., "gas turbine assembly"), tools (e.g., "adjustable wrench"), products (e.g., "canned soup", "Arm & Hammer baking soda"), organizations (e.g., American Medical Association), locations (e.g., Albany County Airport), people (e.g., Bill Clinton), and so on. We view the semantic categorization problem as a case of disambiguation, where for each lexical entity considered (words, phrases, N-grams), a binary decision has to be made whether or not it is an instance of the semantic type we are interested in. The problem of semantic tagging is thus reduced to the problem of partitioning the space of lexical entities into those that are used in the desired sense, and those that are not.
We should note here that it is acceptable for homonym entities to have different classification depending upon the context in which they are used. Just as the word "bank" can be assigned different senses in different contexts, so can "Boeing 777 jet" be once a product, and another time an equipment and not a product, depending upon the context. Other entities may be less context dependent (e.g., company names) if their definitions are based on internal context (e.g., "ends with Co.") as opposed to external context (e.g., "followed by manufactures"), or if they lack negative contexts.

The user provides the initial information (seed) about what kind of things he wishes to identify in text. This information should be in the form of a typical lexical context in which the entities to be spotted occur, e.g., "the name ends with Co.", or "to the right of produced or made", or "to the right of maker of", and so forth, or simply by listing or highlighting a number of examples in text. In addition, negative examples can be given, if known, to eliminate certain 'obvious' exceptions, e.g., "not to the right of made for", "not toothbrushes". Given a sufficiently large training corpus, an unsupervised learning process is initiated in which the system will: (1) generate initial context rules from the seed examples; (2) find further instances of the sought-after concept using the initial context while maximizing recall and precision; (3) find additional contexts in which these entities occur; and (4) expand the current context rules based on selected new contexts to find even more entities.

In the rest of the paper we discuss the specifics of our system. We present and evaluate preliminary results of creating spotters for organizations and products.

If we want to identify some things in a stream of text, we first need to learn how to distinguish them from other items. For example, company names are usually capitalized and often end with 'Co.', 'Corp.', 'Inc.' and so forth. Place names, such as cities, are normally capitalized, sometimes are followed by a state abbreviation (as in Albany, NY), and may be preceded by locative prepositions (e.g., in, at, from, to). Products may have no distinctive lexical appearance, but they tend to be associated with verbs such as 'produce', 'manufacture', 'make', 'sell', etc., which in turn may involve a company name. Other concepts, such as equipment or materials, have few if any obvious associations with the surrounding text, and one may prefer just to point them out directly to the learning program. There are texts, e.g., technical manuals, where such specialized entities occur more often than elsewhere, and it may be advantageous to use these texts to derive spotters.

The seed can be obtained either by hand tagging some text or by using a naive spotter that has high precision but presumably low recall. A naive spotter may contain simple contextual rules such as those mentioned above, e.g., for organizations: a noun phrase ending with "Co." or "Inc."; for products: a noun phrase following "manufacturer of", "producer of", or "retailer of". When such a naive spotter is difficult to come by, one may resort to hand tagging.
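A naive seed spotter of this kind can be sketched in a few lines. The following is a minimal illustration only, assuming candidate noun phrases and their left contexts have already been extracted; the function names and the exact regular expression are hypothetical and not part of the system described here.

```python
import re

# Seed rule for organizations: the phrase ends with "Co." or "Inc."
# (the two suffixes are the ones named in the text; the regex is an assumption).
ORG_SUFFIX = re.compile(r"\b(?:Co|Inc)\.$")

# Seed rule for products: the phrase directly follows one of these cues.
PRODUCT_CUES = ("manufacturer of", "producer of", "retailer of")


def is_seed_organization(phrase: str) -> bool:
    """Accept a candidate noun phrase that ends with 'Co.' or 'Inc.'."""
    return bool(ORG_SUFFIX.search(phrase.strip()))


def is_seed_product(left_context: str) -> bool:
    """Accept a candidate whose left context ends with one of the cue phrases."""
    return left_context.rstrip().lower().endswith(PRODUCT_CUES)
```

Rules like these are expected to be precise but to miss most instances, which is the situation the learning process is meant to improve.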

The seed should identify the sought-after entities with a high precision (though not necessarily 100%); its recall, however, is assumed to be low, or else we would already have a good spotter. Our task is now to increase the recall while maintaining (or even increasing, if possible) the precision.

We proceed by examining the lexical context in which the seed entities occur. In the simplest instance of this process we consider a context to consist of N words to the left of the seed and N words to the right of the seed, as well as the words in the seed itself. Each piece of significant contextual evidence is then weighted against its distribution in the balance of the training corpus. This in turn leads to selection of some contexts to serve as indicators of relevant entities, in other words, they become the initial rules of the emerging spotter.

As an example, let's consider building a spotter for company names, starting with seeds as illustrated in the following fragments (with seed contexts highlighted):

... HENRY KAUFMAN is president of Henry Kaufman & Co., a ... Gabelli, chairman of Gabelli Funds Inc.; Claude N. Rosenberg ... is named president of Skandinaviska Enskilda Banken ... become vice chairman of the state-owned electronics giant Thomson S.A. ... banking group, said the formal merger of Skanska Banken into ... water maker Source Perrier S.A., according to French stock ...

Having "Co." and "Inc." to pick out "Henry Kaufman & Co." and "Gabelli Funds Inc." as seeds, we proceed to find new evidence in the training corpus, using an unsupervised learning process, and discover that "chairman of" and "president of" are very likely to precede company names. We expand our initial set of rules, which allows us to spot more companies:

... HENRY KAUFMAN is president of Henry Kaufman & Co., a ... Gabelli, chairman of Gabelli Funds Inc.; Claude N. Rosenberg ... is named president of Skandinaviska Enskilda Banken ... become vice chairman of the state-owned electronics giant Thomson S.A. ... banking group, said the formal merger of Skanska Banken into ... water maker Source Perrier S.A., according to French stock ...

This evidence discovery can be repeated in a bootstrapping process by replacing the initial set of seeds with the new set of entities obtained from the last iteration. In the above example, we now have "Skandinaviska Enskilda Banken" and "the state-owned electronics giant Thomson S.A." in addition to the initial two names. A further iteration may add "S.A." and "Banken" to the set of contextual rules, and so forth. In general, entities can be both added and deleted from the evolving set of examples, depending on how exactly the evidence is weighted and combined. The details are explained in the following sections.

In most cases the text needs to be preprocessed to isolate basic lexical tokens (words, abbreviations, symbols, annotations, etc.), and structural units (sections, paragraphs, sentences) whenever applicable. In addition, part-of-speech tagging is usually desirable, in which case the tagger may need to be re-trained on a text sample to optimize its performance (Brill, 1992), (Meteer, Schwartz & Weischedel, 1991). Finally, a limited amount of lexical normalization, or stemming, may be performed.

The entities we are looking for may be expressed by certain types of phrases. For example, people names are usually sequences of proper nouns, while equipment names are contained within noun phrases, e.g., 'forward looking infrared radar'. We use part-of-speech information to delineate those sequences of lexical tokens that are likely to contain 'our' entities. From then on we restrict any further processing to these sequences and their contexts.

These preparatory steps are desirable since they reduce the amount of noise through which the learning process needs to plow, but they are not, strictly speaking, necessary. Further experiments are required to determine the level of preprocessing required to optimize the performance of the Universal Spotter.

The semantic categorization problem described here displays some parallels to the word sense disambiguation problem where homonym words need to be assigned to one of several possible senses, (Yarowsky, 1995), (Gale, Church & Yarowsky, 1992), (Brown, Pietra, Pietra & Mercer, 1991). There are two important differences, however. First, in the semantic categorization problem, there is at least one open-ended category serving as a grab bag for all things non-relevant. This category may be hard, if not impossible, to describe by any finite set of rules. Second, unlike the word sense disambiguation where the items to be classified are known a priori, we attempt to accomplish two things at the same time:

2. actually decide if an item belongs to a given category, or falls outside of it.

The categorization of a lexical token as belonging to a given semantic class is based upon the information provided by the words occurring in the token itself, as well as the words that precede and follow it in text. In addition, positional relationships among these words may be of importance. To capture this information, we define the notion of an evidence set for a lexical unit $W_1 W_2 \ldots W_m$ (a phrase, or an N-gram) as follows. Let $\ldots W_{-n} \ldots W_{-1}\, W_1 W_2 \ldots W_m\, W_{+1} \ldots W_{+n} \ldots$ be a string of subsequent tokens (e.g., words) in text, such that $W_1 W_2 \ldots W_m$ is a unit of interest (e.g., a noun phrase) and $n$ is the maximum size of the context window on either side of the unit. The actual window size may be limited by boundaries of structural units such as sentences or paragraphs. For each unit $W_1 W_2 \ldots W_m$, a set of evidence items is collected as a set union of the following four sets:

1. Pairs of (word, position), where position $\in \{p, s, f\}$ indicates whether word is found in the context preceding (p) the central unit, following (f) it, or whether it comes from the central unit itself (s).

$$E_1 = \{ (W_{-n}, p), \ldots, (W_{-1}, p),\; (W_1, s), \ldots, (W_m, s),\; (W_{+1}, f), \ldots, (W_{+n}, f) \}$$

2. Pairs of (bi-gram, position) to capture word sequence information.

$$E_2 = \{ ((W_{-n}, W_{-(n-1)}), p), \ldots, ((W_{-2}, W_{-1}), p),\; ((W_1, W_2), s), \ldots, ((W_{m-1}, W_m), s),\; ((W_{+1}, W_{+2}), f), \ldots, ((W_{+(n-1)}, W_{+n}), f) \}$$

3. 3-tuples (word, position, distance), where distance indicates how far word is located relative to $W_1$ or $W_m$.

$$E_3 = \{ (W_{-n}, p, n), \ldots, (W_{-1}, p, 1),\; (W_1, s, m), \ldots, (W_m, s, 1),\; (W_{+1}, f, 1), \ldots, (W_{+n}, f, n) \}$$

4. 3-tuples (bi-gram, position, distance).

$$E_4 = \{ ((W_{-n}, W_{-(n-1)}), p, n-1), \ldots, ((W_{-2}, W_{-1}), p, 1),\; ((W_1, W_2), s, m-1), \ldots, ((W_{m-1}, W_m), s, 1),\; ((W_{+1}, W_{+2}), f, 1), \ldots, ((W_{+(n-1)}, W_{+n}), f, n-1) \}$$

For example, in the fragment below, the central phrase the door has the context window of size 2:

... boys kicked the door with rage ...

The set of evidence items generated for this fragment, i.e., $E_1 \cup E_2 \cup E_3 \cup E_4$, contains the following elements:

(boys, p), (kicked, p), (the, s), (door, s), (with, f), (rage, f), ((boys, kicked), p), ((the, door), s), ((with, rage), f), (boys, p, 2), (kicked, p, 1), (the, s, 2), (door, s, 1), (with, f, 1), (rage, f, 2), ((boys, kicked), p, 1), ((the, door), s, 1), ((with, rage), f, 1)
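The construction of an evidence set can be sketched as follows. This is an illustrative reading of the four definitions rather than the system's actual code; the function name `evidence_set` and the list-based input format are assumptions, and the distance of a bi-gram is taken to be that of its word nearer to the unit boundary, which reproduces the distances in the example above.

```python
def evidence_set(left, unit, right, n):
    """Build the union of the four evidence sets for one unit.

    left, right: token lists adjacent to the unit, already clipped at
    sentence or paragraph boundaries; only the n nearest tokens are used.
    unit: the tokens W_1 ... W_m of the central phrase.
    """
    left = left[-n:] if n > 0 else []
    right = right[:n]
    items = set()
    # Distances: context words count outward from the unit; words inside
    # the unit count down from m (for W_1) to 1 (for W_m).
    sections = (
        ("p", left, [len(left) - i for i in range(len(left))]),
        ("s", unit, [len(unit) - i for i in range(len(unit))]),
        ("f", right, [i + 1 for i in range(len(right))]),
    )
    for pos, words, dist in sections:
        for i, w in enumerate(words):
            items.add((w, pos))               # (word, position)
            items.add((w, pos, dist[i]))      # (word, position, distance)
        for i in range(len(words) - 1):
            bigram = (words[i], words[i + 1])
            d = dist[i] if pos == "f" else dist[i + 1]
            items.add((bigram, pos))          # (bi-gram, position)
            items.add((bigram, pos, d))       # (bi-gram, position, distance)
    return items


# The fragment "... boys kicked the door with rage ..." with window size 2.
example = evidence_set(["boys", "kicked"], ["the", "door"], ["with", "rage"], 2)
```

Here `example` holds the 18 items listed above. A set suffices for a single unit; repeated items only matter later, when items are pooled over many candidate phrases.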

Items in evidence sets are assigned significance weights (SW) to indicate how strongly they point towards or against the hypothesis that the central unit belongs to the semantic category of interest to the spotter. The significance weights are acquired through corpus-based training.

Evidence items for all candidate phrases in the training corpus, for those selected by the initial user-supplied seed, as well as for those added by a training iteration, are divided into two groups. Group A items are collected from the candidate phrases that are accepted by the spotter; group R items come from the candidate phrases that are rejected. Note that A and R may contain repeated elements.

For each evidence item t, its significance weight is computed as:

$$SW(t) = \begin{cases} \dfrac{f(t,A) - f(t,R)}{f(t,A) + f(t,R)} & \text{if } f(t,A) + f(t,R) > s \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where f(t, X) is the frequency of t in group X, and s is a constant used to filter the noise of very low frequency items.

As defined, SW(t) takes values in the interval [-1, 1]. SW(t) close to 1.0 means that t appears nearly exclusively with the candidates that have been accepted by the spotter, and thus provides the strongest positive evidence. Conversely, SW(t) close to -1.0 means that t is a strong negative indicator since it occurs nearly always with the rejected candidates. SW(t) close to 0 indicates neutral evidence, which is of little or no consequence to the spotter. In general, we take $SW(t) > \epsilon > 0$ as a piece of positive evidence, and $SW(t) < -\epsilon$ as a piece of negative evidence, as provided by item t. Weights of evidence items within an evidence set are then combined to arrive at the compound context weight, which is used to accept or reject a candidate phrase.
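Formula (1) translates directly into code. The sketch below assumes that groups A and R are given as flat lists of evidence items with repeats; the function name and the default cutoff `s=2` are illustrative choices, not values taken from the experiments.

```python
from collections import Counter


def significance_weights(accepted_items, rejected_items, s=2):
    """Compute SW(t) for every evidence item t, following formula (1).

    accepted_items / rejected_items: evidence items (with repeats) pooled
    from the accepted (A) and rejected (R) candidate phrases.
    s: noise cutoff; items seen s times or fewer get a neutral weight of 0.
    """
    f_a, f_r = Counter(accepted_items), Counter(rejected_items)
    weights = {}
    for t in f_a.keys() | f_r.keys():
        total = f_a[t] + f_r[t]
        weights[t] = (f_a[t] - f_r[t]) / total if total > s else 0.0
    return weights


# Toy counts, made up for illustration only.
A = [("Co.", "s")] * 5 + [("of", "p")] * 2 + [("said", "f")]
R = [("of", "p")] * 2 + [("said", "f")] * 3 + [("rare", "f")]
sw = significance_weights(A, R)
```

With these toy counts, `("Co.", "s")` receives weight 1.0, `("of", "p")` is neutral, `("said", "f")` is negative at -0.5, and the singleton `("rare", "f")` is suppressed by the cutoff.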

At this time, we make no claim as to whether (1) is an optimal formula for calculating evidence weights. An alternative method we considered was to estimate certain conditional probabilities, similarly to the formula used in (Yarowsky, 1995):

$$SW(t) = \log \frac{P(p \in A \mid t)}{P(p \in R \mid t)} \approx \log \frac{f(t,A)\, f(A)}{f(t,R)\, f(R)} \qquad (2)$$

Here f(A) is (an estimate of) the probability that any given candidate phrase will be accepted by the spotter, and f(R) is the probability that this phrase is rejected, i.e., f(R) = 1 - f(A). Thus far our experiments show that (1) produces better results than (2). We continue investigating other weighting schemes as well.

In order to classify a candidate phrase, all evidence items are collected from its context and their SW weights are combined. When the combined weight exceeds a threshold value, the candidate is accepted and the phrase becomes available for tagging by the spotter. Otherwise, the candidate is rejected, although it may be reevaluated in a future iteration.

There are many ways to combine evidence weights. In our experiments we tried the following two options:

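$$x \oplus y = \begin{cases} x + y - xy & \text{if } x > 0 \text{ and } y > 0 \\ x + y + xy & \text{if } x < 0 \text{ and } y < 0 \\ x + y & \text{otherwise} \end{cases} \qquad (3)$$

and

$$x \oplus y = \begin{cases} x & \text{if } |x| > |y| \\ y & \text{otherwise} \end{cases} \qquad (4)$$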

In (3), $x \oplus y$ is greater than either x or y when both x and y are positive, and it is less than both x and y for negative x and y. In all cases, $x \oplus y$ remains within the [-1, +1] interval.

In (4) only the dominating evidence is considered. This formula is more noise resistant than (3), but generally yields lower recall.
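Both combination rules are easy to state in code. The sketch below folds a list of weights from left to right starting at 0; that fold order and the function names are assumptions made for illustration, since the order in which weights are combined is not specified here (and (3) is order-sensitive when signs are mixed).

```python
from functools import reduce


def combine_additive(x, y):
    """Formula (3): agreeing evidence reinforces, mixed evidence adds up."""
    if x > 0 and y > 0:
        return x + y - x * y
    if x < 0 and y < 0:
        return x + y + x * y
    return x + y


def combine_dominant(x, y):
    """Formula (4): keep only the weight with the larger magnitude."""
    return x if abs(x) > abs(y) else y


def context_weight(weights, combine):
    """Fold all SW values of an evidence set into one context weight."""
    return reduce(combine, weights, 0.0)
```

For instance, two positive weights of 0.5 combine to 0.75 under (3) but stay at 0.5 under (4), which illustrates why (4) is the more conservative choice.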

The evidence training and candidate selection cycle forms a bootstrapping process, as follows:

Procedure Bootstrapping
    Collect seeds
    loop
        Training phase
        Tagging phase
    until Satisfied.

The bootstrapping process allows more and new contextual evidence to be collected, thereby increasing the recall of the spotter. This is possible thanks to overall redundancy and repetitiveness of information, particularly local context information, in large bodies of text. For example, in our three-sectional context representation (preceding, self, following), if one section contains strong evidence that the candidate phrase is selectable, evidence found in other sections will be considered in the next training cycle, in order to select additional candidates.

An important consideration here is to maintain an overall precision level throughout the entire process. Although it may be possible to recover from some misclassification errors (e.g., (Yarowsky, 1995)), care should be taken when adjusting the process parameters so that precision does not deteriorate too rapidly. For instance, acceptance thresholds of evidence weights, initially set high, can be gradually decreased to allow more recall while keeping precision at a reasonable level.
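The training and tagging cycle, with an acceptance threshold that is relaxed from loop to loop, can be sketched as below. This is a toy illustration under several assumptions: evidence items are plain strings, weights follow formula (1) and are combined with formula (4), seeds are never dropped, and the threshold schedule and all function names are hypothetical. The candidates "Acme Co.", "Zeta Inc.", "the meeting" and "the report" are made up for the example.

```python
from collections import Counter


def train(candidates, accepted, s):
    """Training phase: formula (1) over accepted (A) and rejected (R) items."""
    f_a = Counter(t for c in candidates if c in accepted for t in candidates[c])
    f_r = Counter(t for c in candidates if c not in accepted for t in candidates[c])
    return {
        t: (f_a[t] - f_r[t]) / (f_a[t] + f_r[t]) if f_a[t] + f_r[t] > s else 0.0
        for t in f_a.keys() | f_r.keys()
    }


def context_weight(items, sw):
    """Formula (4): keep only the dominating piece of evidence."""
    weight = 0.0
    for t in sorted(items):  # sorted only to make tie-breaking deterministic
        if abs(sw[t]) >= abs(weight):
            weight = sw[t]
    return weight


def bootstrap(candidates, seeds, thresholds, s=1):
    """Run one training/tagging loop per threshold, high thresholds first."""
    accepted = set(seeds)
    for threshold in thresholds:
        sw = train(candidates, accepted, s)                   # training phase
        tagged = {c for c, items in candidates.items()
                  if context_weight(items, sw) >= threshold}  # tagging phase
        accepted = tagged | set(seeds)  # assumption: seeds are never dropped
    return accepted


candidates = {
    "Henry Kaufman & Co.": {"p:president of", "s:Co."},
    "Gabelli Funds Inc.": {"p:chairman of", "s:Inc."},
    "Acme Co.": {"p:president of", "s:Co."},
    "Zeta Inc.": {"p:chairman of", "s:Inc."},
    "Skandinaviska Enskilda Banken": {"p:president of", "s:Banken"},
    "the meeting": {"p:after", "s:meeting"},
    "the report": {"p:after", "s:report"},
}
seeds = {c for c in candidates if c.endswith(("Co.", "Inc."))}
after_one_loop = bootstrap(candidates, seeds, thresholds=(0.9,))
after_four_loops = bootstrap(candidates, seeds, thresholds=(0.9, 0.6, 0.3, 0.3))
```

In this toy run the first loop accepts only the seeds, because "president of" is still diluted by an unaccepted candidate; once the threshold is relaxed, "Skandinaviska Enskilda Banken" is accepted through that context, while the phrases preceded by "after" stay rejected.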

In addition, (Yarowsky, 1995) and (Gale, Church & Yarowsky, 1992) point out that there is a strong tendency for words to occur in one sense within any given discourse ("one sense per discourse"). The same seems to apply to concept selection, that is, multiple occurrences of a candidate phrase within a discourse should all be either accepted or rejected by the spotter. This in turn allows the bootstrapping process to gather more contextual evidence more quickly, and thus to converge faster, producing better results.
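One simple way to enforce a uniform decision per discourse is sketched below. How the decisions are pooled is not specified here, so the rule used, letting the occurrence with the strongest context weight decide for all occurrences of the same phrase in the same document, is purely an assumption, as are the function name and the threshold value.

```python
def unify_by_discourse(occurrences, threshold=0.5):
    """Make one accept/reject decision per (document, phrase) pair.

    occurrences: iterable of (doc_id, phrase, context_weight) triples.
    Pooling rule (an assumption): the weight of largest magnitude wins.
    """
    strongest = {}
    for doc_id, phrase, weight in occurrences:
        key = (doc_id, phrase)
        if key not in strongest or abs(weight) > abs(strongest[key]):
            strongest[key] = weight
    return {key: weight >= threshold for key, weight in strongest.items()}


# Hypothetical weights for three occurrences in two documents.
decisions = unify_by_discourse([
    ("doc1", "Thomson S.A.", 0.9),
    ("doc1", "Thomson S.A.", 0.1),
    ("doc2", "Thomson S.A.", 0.2),
])
```

In this example both occurrences in `doc1` are accepted on the strength of the first one, while the single weak occurrence in `doc2` is rejected.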

We used the Universal Spotter to find organizations and products in a 7-MByte corpus consisting of articles from the Wall Street Journal. First, we pre-processed the text with a part-of-speech tagger and identified all simple noun groups to be used as candidate phrases. Ten articles were set aside and hand tagged as the key for evaluation. Subsequently, seeds were constructed manually in the form of contextual rules. For organizations, these initial rules had a 98% precision and 49% recall; for products, the corresponding numbers were 97% and 42%. Formula (4) was used to combine evidence. No lexicon verification (see later) was used, in order to show more clearly the behavior of the learning method itself (the performance can be enhanced by lexicon verification). Also note that the quality of the seeds affects the performance of the final spotter since they define what type of concept the system is supposed to look for. The seeds that we used in our experiments are quite simple, perhaps too simple. Better seeds may be needed (possibly developed through an interaction with the user) to obtain strong results for some categories of concepts.

[Figure 1: precision plotted against recall (both in %) for the seeds, the 1st loop, and the 4th loop.]

For organization tagging, the recall and precision results obtained after the first and the fourth bootstrapping cycle are given in Figure 1.

The point with the maximum precision × recall product in the fourth run is 95% precision and 90% recall. Examples of extracted organizations include: "the State Statistical Institute Istat", "Wertheim Schroder & Co", "Skandinaviska Enskilda Banken", "Statistics Canada".
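For reference, precision, recall, and the selection of the operating point by their product can be computed as in the sketch below. The set-based scoring of tagged phrases against a hand-tagged key and the function names are assumptions; the two operating points in the usage example are the organization figures quoted in the text (seeds and fourth loop).

```python
def precision_recall(tagged, key):
    """Score a set of tagged phrases against the hand-tagged key."""
    tagged, key = set(tagged), set(key)
    correct = len(tagged & key)
    precision = correct / len(tagged) if tagged else 0.0
    recall = correct / len(key) if key else 0.0
    return precision, recall


def best_operating_point(points):
    """Pick the (precision, recall) pair with the largest product."""
    return max(points, key=lambda pr: pr[0] * pr[1])


# Seeds: 98% precision, 49% recall; fourth loop: 95% precision, 90% recall.
best = best_operating_point([(0.98, 0.49), (0.95, 0.90)])
```

The product rises from about 0.48 for the seeds to about 0.86 after the fourth loop, so `best` is the fourth-loop point.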

The results for product tagging are given in Figure 2. Examples of extracted products include: "the Mercury Grand Marquis and Ford Crown Victoria cars", "Chevrolet Prizm", "Pump shoe", "AS/400".

The effect of bootstrapping is clearly visible in both charts: it improves the recall while maintaining or even improving the precision. We may also notice that some misclassifications due to an imperfect seed (e.g., see the first dip in precision on the products chart) can in fact be corrected in further bootstrapping loops. The generally lower performance levels for the product spotter are probably due to the fact that the concept of product is harder to circumscribe.

[Figure 2: Product spotter results (precision plotted against recall, both in %).]

The items identified in the second step can be further validated for their broad semantic classification using on-line lexical databases such as Comlex or Longman Dictionary, or Princeton's WordNet (Miller, 1990). For example, "gas turbine" is an acceptable equipment/machinery name since 'turbine' is listed as "machine" or "device" in WordNet hierarchy. More complex validation may involve other words in the phrase (e.g., "circuit breaker") or words in the immediate context.
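The head-word check described above can be illustrated with a toy lookup. The dictionary below is a hand-made stand-in for a lexical database, not the WordNet interface; its single entry mirrors the "turbine" example from the text, and the function name and the use of the last word as the head are assumptions.

```python
# Toy stand-in for a lexical database: head word -> broad semantic classes.
TOY_LEXICON = {"turbine": {"machine", "device"}}


def validate_head(phrase, allowed_classes, lexicon=TOY_LEXICON):
    """Accept a phrase if its head (last) word belongs to an allowed class."""
    head = phrase.lower().split()[-1]
    return bool(lexicon.get(head, set()) & set(allowed_classes))
```

With this toy lexicon, `validate_head("gas turbine", {"machine"})` succeeds, whereas "circuit breaker" fails simply because its head is not listed, which is the kind of case where the more complex validation mentioned above would be needed.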

The current program cannot deal with conjunctions. The difficulty with conjunctions lies not in classifying the conjoined noun phrases (which is in fact easier, because they carry more evidence) but in identifying the phrase itself, because of the structural ambiguities it typically involves, which cannot be dealt with easily at the lexical or even syntactic level.

In this paper we presented the Universal Spotter, a system that learns to spot in-text references to instances of a given semantic class: people, organizations, products, equipment, tools, to name just a few. A specific class spotter is created through an unsupervised learning process on a text corpus given only an initial user-supplied seed: either a number of examples of the concept, or a typical context in which they can be found. The experiment shows that this method can indeed produce useful spotters based on easy-to-construct seeds. The results shown here are promising and can be further improved by using lexicon verification. Different methods of computing SWs, combining SWs, and adjusting parameters for the bootstrapping process need to be explored, as we believe there is still room for improvement. The method is being continuously refined as we gain more feedback from empirical tests across several different applications.

We believe that the Universal Spotter can replace much of the need to create hand-crafted concept spotters commonly used in text extraction operations. It can also be applied to building spotters other than the most common ones, such as those for people names, place names, or company names. In fact, it can be used to create more-or-less on-demand spotters, depending upon the application and its subject domain. In particular, we believe such spotters will be required to make further advances in intelligent text indexing and retrieval applications, text summarization, and database applications, e.g., (Harman, 1995), (Strzalkowski, 1995).

hltw. 1994. Proceedings of the Human Language Technology Workshop, Princeton. San Francisco, CA: Morgan Kaufmann Publishers.

muc5. 1993. Proceedings of 5th Message Understanding Conference, Baltimore. San Francisco, CA: Morgan Kaufmann Publishers.

tipster1. 1993. Tipster Text Phase I: 24 month Conference, Fredericksburg, Virginia.

Brill, E. 1992. A Simple Rule-based Part of Speech Tagger. Proceedings of 3rd Applied Natural Language Processing, San Francisco, CA: Morgan Kaufmann Publishers.

Brown, P., S. Pietra, V. Pietra and R. Mercer. 1991. Word Sense Disambiguation Using Statistical Methods. Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, pp. 264-270.

Gale, W., K. Church and D. Yarowsky. 1992. A Method for Disambiguating Word Senses in a Large Corpus. Computers and the Humanities, 26, pp. 415-439.

Harman, D. 1995. Overview of the Third Text REtrieval Conference. Overview of the Third Text REtrieval Conference (TREC-3), pp. 1-20.

Meteer, M., R. Schwartz, and R. Weischedel. 1991. Studies in Part of Speech Labeling. Proceedings of the 4th DARPA Speech and Natural Language Workshop, Morgan Kaufmann, San Mateo, CA, pp. 331-336.

Miller, G. 1990. WordNet: An On-line Lexical Database. International Journal of Lexicography, 3, 4.

Strzalkowski, T. 1995. Natural Language Information Retrieval. Information Processing and Management, vol. 31, no. 3, pp. 397-417.

Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pp. 189-196.