<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="931"/><title>A Self-Learning Universal Concept Spotter</title><author surname="Strzalkowski" givenname="Tomek"><org  name="GE Corporate Research and Development P" country="USA"/></author><author surname="Wang" givenname="Jin"><org  name="GE Corporate Research and Development P" country="USA"/></author></firstpageheader><frontmatter><p>A Self-Learning Universal Concept Spotter</p><p><b>Tomek Strzalkowski and Jin Wang</b></p><p>GE Corporate Research and Development, P.O. Box 8, Schenectady, NY 12301, USA</p><p><b>{strzalkowski,wangj}@crd.ge.com</b></p></frontmatter><abstract>We describe the Universal Spotter, a system for identifying in-text references to entities of an arbitrary, user-specified type, such as people, organizations, equipment, products, materials, etc. Starting with some initial seed examples, and a training text corpus, the system generates rules that will find further concepts of the same type. The initial seed information is provided by the user in the form of a typical lexical context in which the entities to be spotted occur, e.g., "the name ends with <i>Co.</i>", or "to the right of <i>produced </i>or <i>made</i>", and so forth, or by simply supplying examples of the concept itself, e.g., <i>Ford Taurus, gas turbine, Big Mac. </i>In addition, negative examples can be supplied, if known. Given a sufficiently large training corpus, an unsupervised learning process is initiated in which the system will: (1) find instances of the sought-after concept using the seed-context information while maximizing recall and precision; (2) find additional contexts in which these entities occur; and (3) expand the initial seed-context with selected new contexts to find even more entities. 
Preliminary results of creating spotters for organizations and products are discussed. </abstract></header><body><section number="1" title="Introduction"><p>Identifying concepts in natural language text is an important information extraction task. Depending upon the current information needs one may be interested in finding all references to people, locations, dates, organizations, companies, products, equipment, and so on. These concepts, along with their classification, can be used to index any given text for search or categorization purposes, to generate summaries, or to populate database records. However, automating the process of concept identification in unformatted text has not been an easy task. Various single-purpose spotters have been developed for specific types of concepts, including people names, company names, location names, dates, etc., but these were usually either hand crafted for particular applications or domains, or were heavily relying on a priori lexical clues, such as keywords (e.g., 'Co.'), case (e.g., 'John K. Big'), predictable format (e.g., 123 Maple Street), or a combination thereof. This makes creation and extension of such spotters an arduous manual job. Other, less salient entities, such as products, equipment, foodstuff, or generic references of any kind (e.g., 'a Japanese automaker') could only be identified if a sufficiently detailed domain model was available. 
Domain-model driven extraction was used in ARPA-sponsored Message Understanding Conferences (MUC); a detailed overview of current research can be found in the proceedings of MUC-5 (muc5, 1993) and the recently concluded MUC-6, as well as Tipster Project meetings, or ARPA's Human Language Technology workshops (tipster1, 1993), (hltw, 1994).</p><p>We take a somewhat different approach to identify various types of text entities, both generic and specific, without a detailed understanding of the text domain, and relying instead on a combination of shallow linguistic processing (to identify candidate lexical entities), statistical knowledge acquisition, unsupervised learning techniques, and possibly broad (universal but often shallow) knowledge sources, such as on-line dictionaries (e.g., WordNet, Comlex, OALD, etc.). Our method moves beyond the traditional name spotters and towards a universal spotter where the requirements on what to spot can be specified as input parameters, and a specific-purpose spotter could be generated automatically. In this paper, we describe a method of creating spotters for entities of a specified category given only initial seed examples, and using an unsupervised learning process to discover rules for finding more instances of the concept. 
At this time we place no limit on what kind of things one may want to build a spotter for, although our experiments thus far concentrated on entities customarily referred to with noun phrases, e.g., equipment (e.g., "gas turbine assembly"), tools (e.g., "adjustable wrench"), products (e.g., "canned soup", "Arm &amp; Hammer baking soda"), organizations (e.g., American Medical Association), locations (e.g., Albany County Airport), people (e.g., Bill Clinton), and so on.<page local="2" global="932"/> We view the semantic categorization problem as a case of disambiguation, where for each lexical entity considered (words, phrases, N-grams), a binary decision has to be made whether or not it is an instance of the semantic type we are interested in. The problem of semantic tagging is thus reduced to the problem of partitioning the space of lexical entities into those that are used in the desired sense, and those that are not. We should note here that it is acceptable for homonym entities to have different classifications depending upon the context in which they are used. Just as the word "bank" can be assigned different senses in different contexts, so can "Boeing 777 jet" be a product in one context and a piece of equipment, and not a product, in another. Other entities may be less context dependent (e.g., company names) if their definitions are based on internal context (e.g., "ends with Co.") as opposed to external context (e.g., "followed by <i>manufactures</i>"), or if they lack negative contexts.</p><p>The user provides the initial information (seed) about what kind of things he wishes to identify in text. This information should be in the form of a typical lexical context in which the entities to be spotted occur, e.g., "the name ends with Co.", or "to the right of <i>produced </i>or <i>made</i>", or "to the right of <i>maker of</i>", and so forth, or simply by listing or highlighting a number of examples in text. 
In addition, negative examples can be given, if known, to eliminate certain 'obvious' exceptions, e.g., "not to the right of <i>made for</i>", "not <i>toothbrushes</i>". Given a sufficiently large training corpus, an unsupervised learning process is initiated in which the system will: (1) generate initial context rules from the seed examples; (2) find further instances of the sought-after concept using the initial context while maximizing recall and precision; (3) find additional contexts in which these entities occur; and (4) expand the current context rules based on selected new contexts to find even more entities.</p><p>In the rest of the paper we discuss the specifics of our system. We present and evaluate preliminary results of creating spotters for organizations and products.</p></section><section number="2" title="What do you want to find: seed selection"><p>If we want to identify some things in a stream of text, we first need to learn how to distinguish them from other items. For example, company names are usually capitalized and often end with 'Co.', 'Corp.', 'Inc.' and so forth. Place names, such as cities, are normally capitalized, sometimes are followed by a state abbreviation (as in <i>Albany, NY</i>), and may be preceded by locative prepositions (e.g., <i>in, at, from, to</i>). Products may have no distinctive lexical appearance, but they tend to be associated with verbs such as 'produce', 'manufacture', 'make', 'sell', etc., which in turn may involve a company name. Other concepts, such as equipment or materials, have few if any obvious associations with the surrounding text, and one may prefer just to point them out directly to the learning program. 
There are texts, e.g., technical manuals, where such specialized entities occur more often than elsewhere, and it may be advantageous to use these texts to derive spotters.</p><p>The seed can be obtained either by hand tagging some text or using a naive spotter that has high precision but presumably low recall. A naive spotter may contain simple contextual rules such as those mentioned above, e.g., for organizations: a noun phrase ending with "Co." or "Inc."; for products: a noun phrase following "manufacturer of", "producer of", or "retailer of". When such a naive spotter is difficult to come by, one may resort to hand tagging.</p></section><section number="3" title="From seeds to spotters"><p>The seed should identify the sought-after entities with a high precision (though not necessarily 100%); however, its recall is assumed to be low, or else we would already have a good spotter. Our task is now to increase the recall while maintaining (or even increasing, if possible) the precision.</p><p>We proceed by examining the lexical context in which the seed entities occur. In the simplest instance of this process we consider a context to consist of N words to the left of the seed and N words to the right of the seed, as well as the words in the seed itself. Each piece of significant contextual evidence is then weighted against its distribution in the balance of the training corpus. This in turn leads to selection of some contexts to serve as indicators of relevant entities; in other words, they become the initial rules of the emerging spotter.</p><p>As an example, let's consider building a spotter for company names, starting with seeds as illustrated in the following fragments (with seed contexts highlighted):</p><p>... HENRY KAUFMAN is president of <i>Henry Kaufman &amp; </i>Co., a ... Gabelli, chairman of <i>Gabelli Funds </i>Inc.; Claude N. Rosenberg ... is named president of Skandinaviska Enskilda Banken ... 
become vice chairman of the state-owned electronics giant Thomson S.A. ... banking group, said the formal merger of Skanska Banken into ... water maker Source Perrier S.A., according to French stock ...</p><page local="3" global="933"/><p>Having used "Co." and "Inc." to pick out "Henry Kaufman &amp; Co." and "Gabelli Funds Inc." as seeds, we proceed to find new evidence in the training corpus, using an unsupervised learning process, and discover that "chairman of" and "president of" are very likely to precede company names. We expand our initial set of rules, which allows us to spot more companies:</p><p>... HENRY KAUFMAN is president of <i>Henry Kaufman &amp; </i>Co., a ... Gabelli, chairman of <i>Gabelli Funds </i>Inc.; Claude N. Rosenberg ... is named president of <i>Skandinaviska Enskilda Banken</i> ... become vice chairman of <i>the state-owned electronics giant Thomson S.A.</i> ... banking group, said the formal merger of Skanska Banken into ... water maker Source Perrier S.A., according to French stock ...</p><p>This evidence discovery can be repeated in a bootstrapping process by replacing the initial set of seeds with the new set of entities obtained from the last iteration. In the above example, we now have "Skandinaviska Enskilda Banken" and "the state-owned electronics giant Thomson S.A." in addition to the initial two names. A further iteration may add "S.A." and "Banken" to the set of contextual rules, and so forth. In general, entities can be both added to and deleted from the evolving set of examples, depending on how exactly the evidence is weighted and combined. 
The details are explained in the following sections.</p></section><section number="4" title="Text preparation"><p>In most cases the text needs to be preprocessed to isolate basic lexical tokens (words, abbreviations, symbols, annotations, etc.), and structural units (sections, paragraphs, sentences) whenever applicable. In addition, part-of-speech tagging is usually desirable, in which case the tagger may need to be re-trained on a text sample to optimize its performance (Brill, 1992), (Meteer, Schwartz &amp; Weischedel, 1991). Finally, a limited amount of lexical normalization, or stemming, may be performed.</p><p>The entities we are looking for may be expressed by certain types of phrases. For example, people names are usually sequences of proper nouns, while equipment names are contained within noun phrases, e.g., 'forward looking infrared radar'. We use part of speech information to delineate those sequences of lexical tokens that are likely to contain 'our' entities. From then on we restrict any further processing to these sequences and their contexts.</p><p>These preparatory steps are desirable since they reduce the amount of noise through which the learning process needs to plow, but they are not, strictly speaking, necessary. Further experiments are required to determine the level of preprocessing required to optimize the performance of the Universal Spotter.</p></section><section number="5" title="Evidence items"><p>The semantic categorization problem described here displays some parallels to the word sense disambiguation problem where homonym words need to be assigned to one of several possible senses (Yarowsky, 1995), (Gale, Church &amp; Yarowsky, 1992), (Brown, Pietra, Pietra &amp; Mercer, 1991). There are two important differences, however. First, in the semantic categorization problem, there is at least one open-ended category serving as a grab bag for all things non-relevant. 
This category may be hard, if not impossible, to describe by any finite set of rules. Second, unlike the word sense disambiguation where the items to be classified are known a priori, we attempt to accomplish two things at the same time:</p><p>1. discover the items to be considered for categorization;</p><p>2. actually decide if an item belongs to a given category, or falls outside of it.</p><p>The categorization of a lexical token as belonging to a given semantic class is based upon the information provided by the words occurring in the token itself, as well as the words that precede and follow it in text. In addition, positional relationships among these words may be of importance. To capture this information, we define the notion of an <i>evidence set </i>for a lexical unit $W_1 W_2 \ldots W_m$ (a phrase, or an N-gram) as follows. Let $\ldots W_{-n} \ldots W_{-1} W_1 \ldots W_m W_{+1} \ldots W_{+n} \ldots$ be a string of subsequent tokens (e.g., words) in text, such that $W_1 W_2 \ldots W_m$ is a unit of interest (e.g., a noun phrase) and $n$ is the maximum size of the context window on either side of the unit. The actual window size may be limited by boundaries of structural units such as sentences or paragraphs. For each unit $W_1 W_2 \ldots W_m$, a set of <i>evidence items </i>is collected as a set union of the following four sets:</p><p>1. Pairs of <i>(word, position)</i>, where <i>position </i>∈ {p, s, f} indicates whether <i>word </i>is found in the context preceding (p) the central unit, following (f) it, or whether it comes from the central unit itself (s). 
$E_1 = \{ (W_{-n},p), \ldots, (W_{-1},p),\ (W_1,s), \ldots, (W_m,s),\ (W_{+1},f), \ldots, (W_{+n},f) \}$</p><p>2. Pairs of <i>(bi-gram, position) </i>to capture word sequence information. $E_2 = \{ ((W_{-n},W_{-(n-1)}),p), \ldots, ((W_{-2},W_{-1}),p),\ ((W_1,W_2),s), \ldots, ((W_{m-1},W_m),s),\ ((W_{+1},W_{+2}),f), \ldots, ((W_{+(n-1)},W_{+n}),f) \}$<page local="4" global="934"/></p><p>3. 3-tuples <i>(word, position, distance)</i>, where <i>distance </i>indicates how far <i>word </i>is located relative to $W_1$ or $W_m$. $E_3 = \{ (W_{-n},p,n), \ldots, (W_{-1},p,1),\ (W_1,s,m), \ldots, (W_m,s,1),\ (W_{+1},f,1), \ldots, (W_{+n},f,n) \}$</p><p>4. 3-tuples <i>(bi-gram, position, distance)</i>. $E_4 = \{ ((W_{-n},W_{-(n-1)}),p,n-1), \ldots, ((W_{-2},W_{-1}),p,1),\ ((W_1,W_2),s,m-1), \ldots, ((W_{m-1},W_m),s,1),\ ((W_{+1},W_{+2}),f,1), \ldots, ((W_{+(n-1)},W_{+n}),f,n-1) \}$</p><p>For example, in the fragment below, the central phrase <i>the door </i>has the context window of size 2:</p><p>... 
boys kicked <i>the door </i>with rage ...</p><p>The set of evidence items generated for this fragment, i.e., $E_1 \cup E_2 \cup E_3 \cup E_4$, contains the following elements:</p><p>(boys, p), (kicked, p), (the, s), (door, s), (with, f), (rage, f), ((boys, kicked), p), ((the, door), s), ((with, rage), f), (boys, p, 2), (kicked, p, 1), (the, s, 2), (door, s, 1), (with, f, 1), (rage, f, 2), ((boys, kicked), p, 1), ((the, door), s, 1), ((with, rage), f, 1)</p><p>Items in evidence sets are assigned <i>significance weights </i>(SW) to indicate how strongly they point towards or against the hypothesis that the central unit belongs to the semantic category of interest to the spotter. The significance weights are acquired through corpus-based training.</p></section><section number="6" title="Training"><p>Evidence items for all candidate phrases in the training corpus, for those selected by the initial user-supplied seed, as well as for those added by a training iteration, are divided into two groups. Group A items are collected from the candidate phrases that are accepted by the spotter; group R items come from the candidate phrases that are rejected. 
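</p><p>As an illustration of how the evidence items that feed groups A and R might be generated, the following is a minimal sketch of the Section 5 construction. The function name evidence_items, the half-open index convention for the central unit, and clipping the window at the ends of the token list (standing in for sentence or paragraph boundaries) are our assumptions for illustration only, not a description of the implemented system.</p><p>
```python
def evidence_items(tokens, start, end, n):
    """Collect E1..E4 for the unit tokens[start:end] with a window of n tokens.

    Hypothetical helper: names and index conventions are illustrative.
    """
    left = tokens[max(0, start - n):start]   # preceding context (p)
    unit = tokens[start:end]                 # central unit (s)
    right = tokens[end:end + n]              # following context (f)
    items = []
    for pos, words in (("p", left), ("s", unit), ("f", right)):
        k = len(words)
        for i, w in enumerate(words):
            # Distance grows away from the unit for p and f; inside the
            # unit it is counted from the right edge, as in the example.
            dist = i + 1 if pos == "f" else k - i
            items.append((w, pos))            # E1
            items.append((w, pos, dist))      # E3
        for i in range(k - 1):
            bigram = (words[i], words[i + 1])
            dist = i + 1 if pos == "f" else k - 1 - i
            items.append((bigram, pos))       # E2
            items.append((bigram, pos, dist)) # E4
    return items


tokens = "boys kicked the door with rage".split()
items = evidence_items(tokens, 2, 4, 2)
```
</p><p>For the fragment "boys kicked the door with rage", with the unit <i>the door </i>and a window of size 2, this sketch produces the 18 evidence items listed in Section 5.</p><p>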
Note that A and R may contain repeated elements.</p><p>For each evidence item <i>t</i>, its significance weight is computed as:</p><p>$$SW(t) = \begin{cases} \dfrac{f(t,A) - f(t,R)}{f(t,A) + f(t,R)} &amp; \text{if } f(t,A) + f(t,R) &gt; s \\ 0 &amp; \text{otherwise} \end{cases} \qquad (1)$$</p><p>where <i>f(t, X) </i>is the frequency of <i>t </i>in group <i>X</i>, and <i>s </i>is a constant used to filter the noise of very low frequency items.</p><p>As defined, <i>SW(t) </i>takes values in the interval [-1, 1]. <i>SW(t) </i>close to 1.0 means that <i>t </i>appears nearly exclusively with the candidates that have been accepted by the spotter, and thus provides the strongest positive evidence. Conversely, <i>SW(t) </i>close to -1.0 means that <i>t </i>is a strong negative indicator since it occurs nearly always with the rejected candidates. <i>SW(t) </i>close to 0 indicates neutral evidence, which is of little or no consequence to the spotter. In general, we take <i>SW(t) </i>&gt; <i>ε </i>&gt; 0 as a piece of positive evidence, and <i>SW(t) </i>&lt; -<i>ε </i>as a piece of negative evidence, as provided by item <i>t</i>. Weights of evidence items within an evidence set are then combined to arrive at the compound context weight which is used to accept or reject the candidate phrase.</p><p>At this time, we make no claim as to whether (1) is an optimal formula for calculating evidence weights. 
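</p><p>Formula (1) can be illustrated with a short sketch. The function name significance_weights, the use of plain lists for groups A and R, and the default value of <i>s </i>are assumptions made for the example; the text does not specify the value of <i>s </i>used in our experiments.</p><p>
```python
from collections import Counter


def significance_weights(accepted, rejected, s=1):
    """Compute SW(t) of formula (1) for every evidence item t.

    accepted, rejected: evidence items (repeats allowed) from groups A and R.
    s: noise filter; items with f(t,A) + f(t,R) not exceeding s get weight 0.
    """
    f_a, f_r = Counter(accepted), Counter(rejected)
    weights = {}
    for t in set(f_a) | set(f_r):
        total = f_a[t] + f_r[t]
        weights[t] = (f_a[t] - f_r[t]) / total if total > s else 0.0
    return weights


group_a = [("of", "p")] * 3 + [("Inc.", "s")] * 2 + [("x", "f")]
group_r = [("of", "p")] + [("said", "f")] * 4
sw = significance_weights(group_a, group_r, s=1)
```
</p><p>In this toy input an item seen only with accepted candidates receives weight 1.0, an item seen only with rejected candidates receives -1.0, an item seen three times in A and once in R receives 0.5, and an item seen only once overall is filtered to 0.</p><p>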
An alternative method we considered was to estimate certain conditional probabilities, similarly to the formula used in (Yarowsky, 1995):</p><p>$$SW(t) = \log \frac{P(p \in A \mid t)}{P(p \in R \mid t)} = \log \frac{f(t,A)\, f(A)}{f(t,R)\, f(R)} \qquad (2)$$</p><p>Here <i>f(A) </i>is (an estimate of) the probability that any given candidate phrase will be accepted by the spotter, and <i>f(R) </i>is the probability that this phrase is rejected, i.e., <i>f(R) </i>= 1 - <i>f(A)</i>. Thus far our experiments show that (1) produces better results than (2). We continue investigating other weighting schemes as well.</p></section><section number="7" title="Combining evidence weights to classify phrases"><p>In order to classify a candidate phrase, all evidence items need to be collected from its context and their <i>SW </i>weights are combined. When the combined weight exceeds a threshold value, the candidate is accepted and the phrase becomes available for tagging by the spotter. Otherwise, the candidate is rejected, although it may be reevaluated in a future iteration.</p><p>There are many ways to combine evidence weights. 
In our experiments we tried the following two options:</p><p>$$x \oplus y = \begin{cases} x + y - xy &amp; \text{if } x &gt; 0 \text{ and } y &gt; 0 \\ x + y + xy &amp; \text{if } x &lt; 0 \text{ and } y &lt; 0 \\ x + y &amp; \text{otherwise} \end{cases} \qquad (3)$$</p><p>and</p><p>$$x \oplus y = \begin{cases} x &amp; \text{if } abs(x) &gt; abs(y) \\ y &amp; \text{otherwise} \end{cases} \qquad (4)$$</p><p>In (3), $x \oplus y$ is greater than either <i>x </i>or <i>y </i>when both <i>x </i>and <i>y </i>are positive, and it is less than both <i>x </i>and <i>y </i>for negative <i>x </i>and <i>y</i>. In all cases, $x \oplus y$ remains within the [-1, +1] interval.</p><p>In (4) only the dominating evidence is considered. This formula is more noise resistant than (3), but generally produces lower recall.</p><page local="5" global="935"/></section><section number="8" title="Bootstrapping"><p>The evidence training and candidate selection cycle forms a bootstrapping process, as follows:</p><p><b>Procedure </b>Bootstrapping</p><p>Collect seeds</p><p><b>loop</b></p><p>Training phase</p><p>Tagging phase</p><p><b>until </b>Satisfied.</p><p>The bootstrapping process allows the spotter to collect new contextual evidence and thereby to increase its recall. This is possible thanks to the overall redundancy and repetitiveness of information, particularly local context information, in large bodies of text. For example, in our three-sectional context representation (preceding, self, following), if one section contains strong evidence that the candidate phrase is selectable, evidence found in other sections will be considered in the next training cycle, in order to select additional candidates.</p><p>An important consideration here is to maintain an overall precision level throughout the entire process. 
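</p><p>The bootstrapping procedure above, together with formula (1) and the combination options (3) and (4), can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the names combine_sum, combine_max and bootstrap, the fixed acceptance threshold, the rule that seeds are never dropped, and the stopping test (no change in the accepted set, or a loop limit) are all ours, and stand in for the unspecified "Satisfied" condition.</p><p>
```python
from collections import Counter
from functools import reduce


def combine_sum(x, y):
    """Option (3): agreeing evidence reinforces, conflicting evidence adds."""
    if x > 0 and y > 0:
        return x + y - x * y
    if 0 > x and 0 > y:
        return x + y + x * y
    return x + y


def combine_max(x, y):
    """Option (4): keep only the dominating evidence."""
    return x if abs(x) > abs(y) else y


def bootstrap(candidates, seeds, threshold=0.3, s=1,
              combine=combine_max, max_loops=10):
    """candidates: dict id -> list of evidence items; seeds: accepted ids."""
    accepted = set(seeds)
    for _ in range(max_loops):
        # Training phase: split evidence into groups A and R, apply (1).
        f_a, f_r = Counter(), Counter()
        for cid, items in candidates.items():
            (f_a if cid in accepted else f_r).update(items)
        sw = {}
        for t in set(f_a) | set(f_r):
            total = f_a[t] + f_r[t]
            sw[t] = (f_a[t] - f_r[t]) / total if total > s else 0.0
        # Tagging phase: accept candidates whose combined weight is high.
        tagged = {cid for cid, items in candidates.items()
                  if reduce(combine, (sw[t] for t in items), 0.0) > threshold}
        tagged |= set(seeds)      # assumption: seeds always stay accepted
        if tagged == accepted:    # assumption: "Satisfied" means no change
            break
        accepted = tagged
    return accepted


toy = {
    "c1": [("Co.", "s"), ("president of", "p")],
    "c2": [("Inc.", "s"), ("president of", "p")],
    "c3": [("Banken", "s"), ("president of", "p")],
    "c4": [("door", "s"), ("kicked", "p")],
}
result = bootstrap(toy, seeds={"c1", "c2"})
```
</p><p>In the toy input, the context "president of" occurs with both seeds, so the unseeded candidate c3 is picked up in the first loop, while c4, which shares no evidence with the seeds, stays rejected.</p><p>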
Although it may be possible to recover from some misclassification errors (e.g., (Yarowsky, 1995)), care should be taken when adjusting the process parameters so that precision does not deteriorate too rapidly. For instance, acceptance thresholds of evidence weights, initially set high, can be gradually decreased to allow more recall while keeping precision at a reasonable level.</p><p>In addition, (Yarowsky, 1995) and (Gale, Church &amp; Yarowsky, 1992) point out that there is a strong tendency for words to occur in one sense within any given discourse ("one sense per discourse"). The same seems to apply to concept selection, that is, multiple occurrences of a candidate phrase within a discourse should all be either accepted or rejected by the spotter. This in turn allows the bootstrapping process to gather more contextual evidence more quickly, and thus to converge faster, producing better results.</p></section><section number="9" title="Experiments and Results"><p>We used the Universal Spotter to find <i>organizations </i>and <i>products </i>in a 7 MByte corpus consisting of articles from the Wall Street Journal. First, we pre-processed the text with a part-of-speech tagger and identified all simple noun groups to be used as candidate phrases. Ten articles were set aside and hand tagged as the key for evaluation. Subsequently, seeds were constructed manually in the form of contextual rules. For organizations, these initial rules had a 98% precision and 49% recall; for products, the corresponding numbers were 97% and 42%. Formula (4) was used to combine evidence. 
No lexicon verification (see later) has been used, in order to show more clearly the behavior of the learning method itself (the performance can be enhanced by lexicon verification). Also note that the quality of the seeds affects the performance of the final spotter since they define what type of concept the system is supposed to look for. The seeds that we used in our experiments are quite simple, perhaps too simple. Better seeds may be needed (possibly developed through an interaction with the user) to obtain strong results for some categories of concepts.</p><figure caption="Figure 1: Organization spotter results."></figure><p>[Plot of precision (vertical axis) against recall (horizontal axis), both in percent; curves: Seeds, 1st loop, 4th loop.]</p><p>For organization tagging, the recall and precision results obtained after the first and the fourth bootstrapping cycle are given in Figure 1.</p><p>The point with the maximum precision*recall in the fourth run is at 95% precision and 90% recall. Examples of extracted organizations include: <i>"the State Statistical Institute Istat", "Wertheim Schroder &amp; Co", "Skandinaviska Enskilda Banken", "Statistics Canada".</i></p><p>The results for products tagging are given in Figure 2. Examples of extracted products include: <i>"the Mercury Grand Marquis and Ford Crown Victoria cars", "Chevrolet Prizm", "Pump shoe", "AS/400".</i></p><p>The effect of bootstrapping is clearly visible in both charts: it improves the recall while maintaining or even improving the precision. 
We may also notice that some misclassifications due to an imperfect seed (e.g., see the first dip in precision on the products chart) can in fact be corrected in further bootstrapping loops. The generally lower performance levels for the product spotter are probably due to the fact that the concept of product is harder to circumscribe.</p><figure caption="Figure 2: Product spotter results."></figure><p>[Plot of precision (vertical axis) against recall (horizontal axis), both in percent.]</p></section><section number="10" title="Further options"><subsection number="10.1" title="Lexicon verification"><p>The items identified in the second step can be further validated for their broad semantic classification using on-line lexical databases such as Comlex<page local="6" global="936"/> or Longman Dictionary, or Princeton's WordNet (Miller, 1990). For example, "gas turbine" is an acceptable equipment/machinery name since 'turbine' is listed as "machine" or "device" in the WordNet hierarchy. More complex validation may involve other words in the phrase (e.g., "circuit breaker") or words in the immediate context.</p></subsection><subsection number="10.2" title="Conjunctions"><p>The current program cannot deal with conjunction. 
The difficulty with conjunction is not with classification of the conjoined noun phrases (it is easier, as a matter of fact, because they carry more evidence) but with identification of the phrase itself, because of the structural ambiguities it typically involves that cannot be dealt with easily at the lexical or even syntactic level.</p></subsection></section><section number="11" title="Conclusions"><p>In this paper we presented the Universal Spotter, a system that learns to spot in-text references to instances of a given semantic class: people, organizations, products, equipment, tools, to name just a few. A specific class spotter is created through an unsupervised learning process on a text corpus given only an initial user-supplied seed: either a number of examples of the concept, or a typical context in which they can be found. The experiment shows that this method indeed can produce useful spotters based on easy-to-construct seeds. The results shown here are promising and can be further improved by using lexicon verification. Different methods of computing SWs, combining SWs, and adjusting parameters for the bootstrapping process need to be explored, as we believe there is still room for improvement. The method is being continuously refined as we gain more feedback from empirical tests across several different applications.</p><p>We believe that the Universal Spotter can replace much of the need to create hand-crafted concept spotters commonly used in text extraction operations. It can also be applied to building spotters other than the most common ones, such as those for people names, place names, or company names. In fact, it can be used to create more-or-less on-demand spotters, depending upon the application and its subject domain. 
In particular, we believe such spotters will be required to make further advances in intelligent text indexing and retrieval applications, text summarization, and database applications, e.g., (Harman, 1995), (Strzalkowski, 1995).</p></section><references><p>hltw. 1994. <i>Proceedings of the Human Language Technology Workshop, </i>Princeton. San Francisco, CA: Morgan Kaufmann Publishers.</p><p>muc5. 1993. <i>Proceedings of 5th Message Understanding Conference, </i>Baltimore. San Francisco, CA: Morgan Kaufmann Publishers.</p><p>tipster1. 1993. <i>Tipster Text Phase I: 24 month Conference, </i>Fredericksburg, Virginia.</p><p>Brill, E. 1992. A Simple Rule-based Part of Speech Tagger. <i>Proceedings of 3rd Applied Natural Language Processing, </i>San Francisco, CA: Morgan Kaufmann Publishers.</p><p>Brown, P., S. Pietra, V. Pietra and R. Mercer. 1991. Word Sense Disambiguation Using Statistical Methods. <i>Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, </i>pp. 264-270.</p><p>Gale, W., K. Church and D. Yarowsky. 1992. A Method for Disambiguating Word Senses in a Large Corpus. <i>Computers and the Humanities, </i>26, pp. 415-439.</p><p>Harman, D. 1995. Overview of the Third Text REtrieval Conference. <i>Overview of the Third Text REtrieval Conference (TREC-3), </i>pp. 1-20.</p><p>Meteer, M., R. Schwartz, and R. Weischedel. 1991. Studies in Part of Speech Labeling. <i>Proceedings of the 4th DARPA Speech and Natural Language Workshop, </i>Morgan Kaufmann, San Mateo, CA. pp. 331-336.</p><p>Miller, G. 1990. WordNet: An On-line Lexical Database. <i>International Journal of Lexicography, </i>3, 4.</p><p>Strzalkowski, T. 1995. Natural Language Information Retrieval. <i>Information Processing and Management, vol. 31, no. 3, </i>pp. 397-417.</p><p>Yarowsky, D. 1995. Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. 
<i>Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, </i>pp. 189-196.</p></references></body></article>