Squibs and Discussions

Unsupervised Named Entity Recognition Using Syntactic and Semantic Contextual Evidence

Alessandro Cucchiarelli* Università di Ancona
Paola Velardi† Università di Roma 'La Sapienza'

Proper nouns form an open class, making the incompleteness of manually or automatically learned classification rules an obvious problem. The purpose of this paper is twofold: first, to suggest the use of a complementary "backup" method to increase the robustness of any hand-crafted or machine-learning-based NE tagger; and second, to explore the effectiveness of using more fine-grained evidence, namely syntactic and semantic contextual knowledge, in classifying NEs.

1. Proper Noun Classification

In this paper we present a corpus-driven statistical technique that uses a learning corpus to acquire contextual classification cues, and then uses the results of this phase to classify unrecognized proper nouns (PN) in an unlabeled corpus. Training examples of proper nouns are obtained using any available named entity (NE) recognizer (in our experiments we used a rule-based recognizer and a machine-learning-based recognizer). The contextual model of PN categories is learned without supervision.

The approach described in this paper is complementary to current methods for NE recognition: our objective is to improve, without additional manual effort, the robustness of any available NE system through the use of more "fine-grained" contextual knowledge, best exploited at a relatively late stage of analysis. The method is particularly useful when an available NE system must be rapidly adapted to another language or to another domain, provided the shift is not dramatic.

Furthermore, our study provides experimental evidence relating to two issues still under debate: (i) the effectiveness, in practical NLP applications, of using syntactic relations (most systems use plain collocations and morphological features), and (ii) context expansion based on thesauri. While we do not provide a definitive argument in favor of syntactic contexts and semantic expansion for word sense disambiguation tasks in general, we do show that they can be successfully used for unknown proper noun classification. Proper nouns have particular characteristics, such as low or zero ambiguity, which make it easier to characterize their contexts.

2. Description of the U_PN Classification Method

In this section we briefly summarize the corpus-based tagging technique for the classification of unknown proper nouns (for more details, see Cucchiarelli, Luzi, and Velardi [1998]).

* Istituto di Informatica, Via Brecce Bianche I-60131 Ancona, Italy. E-mail: alex@inform.unian.it
† Dipartimento di Scienze dell'Informazione, Via Salaria 113, I-00198 Roma, Italy. E-mail: velardi@dsi.uniroma1.it

Computational Linguistics Volume 27, Number 1

2.1 Learning Contextual Sense Indicators

Our method proceeds as follows: first, by means of any available NE recognition technique (which we will call an early NE classifier), at least some examples of PNs in each category are detected. Second, through an unsupervised corpus-based technique, typical PN syntactic and semantic contexts are learned. Syntactic and semantic cues can then be used to extend the coverage of the early NE classifier, increasing its robustness to the limitations of the gazetteers (PN dictionaries) and domain shifts.

In phase one, a learning corpus in the application domain is morphologically processed. The gazetteer lookup and the early NE classifier are then used to detect PNs. At the end of this phase, "some" PNs are recognized and classified, depending upon the size of the gazetteer and the actual performance (in the domain) of the NE classifier.
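Phase one can be sketched as follows. This is a minimal illustration, not the system used in the experiments: it assumes the PNs have already been detected by morphological processing, and the function name `phase_one`, the dictionary-based gazetteer, and the toy suffix rule standing in for the early NE classifier are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def phase_one(
    proper_nouns: Iterable[str],
    gazetteer: Dict[str, str],
    early_classifier: Callable[[str], Optional[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Split detected PNs into classified ones and unknown ones (U_PN).

    The gazetteer is consulted first; the early NE classifier is used as a
    fallback and may return None when it cannot assign a category.
    """
    known: Dict[str, str] = {}
    unknown: List[str] = []
    for pn in proper_nouns:
        category = gazetteer.get(pn) or early_classifier(pn)
        if category is not None:
            known[pn] = category
        elif pn not in unknown:  # keep each unknown PN once
            unknown.append(pn)
    return known, unknown


# Toy stand-ins (assumptions for illustration only).
gazetteer = {"Xerox": "Organization"}


def suffix_rule(pn: str) -> Optional[str]:
    return "Organization" if pn.endswith(" Inc.") else None


known, unknown = phase_one(
    ["Xerox", "Acme Inc.", "Kodak", "Kodak"], gazetteer, suffix_rule
)
```

The PNs left in `unknown` are the ones that phase two tries to classify from their contexts.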

In phase two, the objective is to learn a contextual model of each PN category, augmented with syntactic and semantic features. Since the algorithm is unsupervised, statistical techniques are applied to smooth the weight of acquired examples as a function of semantic and syntactic ambiguity.$^1$

Syntactic processing is applied over the corpus. A shallow parser (see details in Basili, Pazienza, and Velardi [1994]) extracts from the learning corpus elementary syntactic relations such as Subject-Object, Noun-Preposition-Noun, etc.$^2$ An elementary syntactic link (esl) is represented as:

$$\mathit{esl}(w_j, \mathit{mod}(\mathit{type}_i, w_k))$$

where $w_j$ is the headword, $w_k$ is the modifier, and $\mathit{type}_i$ is the type of syntactic relation (e.g., Prepositional Phrase, Subject-Verb, Verb-Direct-Object, etc.). For example, esl(close, mod(G_N_V_Act, Xerox)) reads: Xerox is the modifier of the head close in a Subject-Verb (G_N_V_Act) syntactic relation.
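As a rough illustration of this representation, an esl can be held in a small record type. The class name `Esl`, its field names, and the helper methods are assumptions made for this sketch; they are not taken from the shallow parser cited above.

```python
from typing import NamedTuple


class Esl(NamedTuple):
    """Elementary syntactic link: esl(head, mod(rel_type, modifier))."""

    head: str       # w_j, the headword
    rel_type: str   # type_i, e.g. "G_N_V_Act" for Subject-Verb
    modifier: str   # w_k, the modifier

    def includes(self, word: str) -> bool:
        """True if `word` is one of the two arguments of the link."""
        return word in (self.head, self.modifier)

    def other(self, word: str) -> str:
        """Return the argument that is not `word`."""
        if word == self.head:
            return self.modifier
        if word == self.modifier:
            return self.head
        raise ValueError(f"{word!r} is not an argument of {self}")

    def __str__(self) -> str:
        return f"esl({self.head}, mod({self.rel_type}, {self.modifier}))"


# The example from the text: Xerox modifies the head "close".
example = Esl(head="close", rel_type="G_N_V_Act", modifier="Xerox")
```

Keeping the head and modifier as separate fields preserves the position of each argument, which matters whenever two links are compared.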

In our study, the context of a word w in a sentence S is represented by the esls including w as one of its arguments ($w_j$ or $w_k$). The esls that include semantically classified PNs as one of their arguments are grouped in a database, called PN_esl. This database provides contextual evidence for assigning a category to unknown PNs.

2.2 Tagging Unknown PNs

A corpus-driven algorithm is used to classify unknown proper nouns recognized as such, but not semantically classified by the early NE recognizer.$^3$

• Let U_PN be an unknown proper noun, i.e., a single word or a complex nominal. Let $C_{pn} = (C_{pn_1}, C_{pn_2}, \ldots, C_{pn_N})$ be the set of semantic categories for proper nouns (e.g., Person, Organization, Product, etc.). Finally, let ESL be the set of esls (often more than one in a text) that include U_PN as one of their arguments.

• For each $\mathit{esl}_i$ in ESL let:

$$\mathit{esl}_i(w_j, \mathit{mod}(\mathit{type}_i, w_k)) = \mathit{esl}_i(x, \mathit{U\_PN})$$

where $x = w_j$ or $x = w_k$ and U_PN $= w_k$ or $w_j$ (the unknown PN can be either the head or the modifier), $\mathit{type}_i$ is the syntactic type of the esl (e.g., N-of-N, N_N, V-for-N, etc.), and furthermore let:

$$\mathit{pl}(\mathit{esl}_i(x, \mathit{U\_PN}))$$

be the plausibility of a detected esl. Plausibility is a measure of the statistical evidence of a detected syntactic relation (Basili, Marziali, and Pazienza 1994; Grishman and Sterling 1994) that depends upon local (i.e., sentence-level) syntactic ambiguity and global corpus evidence. The plausibility accounts for the uncertainty arising from syntactic ambiguity.

• Finally, let:

- ESLA be a set of esls in PN_esl (the previously learned contextual model) defined as follows: for each $\mathit{esl}_i(x, \mathit{U\_PN})$ in ESL, put in ESLA the set of $\mathit{esl}_j(x, \mathit{PN}_j)$ with $\mathit{type}_j = \mathit{type}_i$, x in the same position as in $\mathit{esl}_i$, and $\mathit{PN}_j$ a known proper noun, in the same position as U_PN in $\mathit{esl}_i$.

- ESLB be a set of esls in PN_esl defined as follows: for each $\mathit{esl}_i(x, \mathit{U\_PN})$ in ESL, put in ESLB the set of $\mathit{esl}_j(w, \mathit{PN}_j)$ with $\mathit{type}_j = \mathit{type}_i$, w in the same position as x in $\mathit{esl}_i$, $\mathit{Sim}(w, x) > \delta$, and $\mathit{PN}_j$ a known proper noun, in the same position as U_PN in $\mathit{esl}_i$. $\mathit{Sim}(w, x)$ is a similarity measure between x and w. In our experiments, $\mathit{Sim}(w, x) > \delta$ iff w and x have a common hyperonym H in WordNet. The generality of H (i.e., the number of levels from x to H) is made parametric, to analyze the effect of generalization.

• For each semantic category $C_{pn_j}$ compute $\mathit{evidence}(C_{pn_j})$ as:

$$\mathit{evidence}(C_{pn_j}) = \alpha \, \frac{\sum_{\mathit{esl}_i \in \mathit{ESLA},\; C(\mathit{PN}_j) = C_{pn_j}} \mathit{weight}_{ij}(x) \, D(x, C(\mathit{PN}_j))}{\sum_{\mathit{esl}_i \in \mathit{ESLA}} \mathit{weight}_{ij}(x) \, D(x, C(\mathit{PN}_j))} + \beta \, \frac{\sum_{\mathit{esl}_i \in \mathit{ESLB},\; C(\mathit{PN}_j) = C_{pn_j}} \mathit{weight}_{ij}(w) \, D(w, C(\mathit{PN}_j))}{\sum_{\mathit{esl}_i \in \mathit{ESLB}} \mathit{weight}_{ij}(w) \, D(w, C(\mathit{PN}_j))}$$

where:

$$\mathit{weight}_{ij}(x) = \mathit{weight}_{ij}(\mathit{esl}_i(x, \mathit{PN}_j)) = \mathit{pl}(\mathit{esl}_i(x, \mathit{PN}_j)) \cdot \left(1 - \frac{\mathit{amb}(x) - 1}{k - 1}\right)$$

$$\mathit{weight}_{ij}(w) = \mathit{weight}_{ij}(\mathit{esl}_i(w, \mathit{PN}_j)) = \mathit{pl}(\mathit{esl}_i(w, \mathit{PN}_j)) \cdot \left(1 - \frac{\mathit{amb}(w) - 1}{k - 2}\right)$$

• $\mathit{pl}(\mathit{esl}_i(x, \mathit{PN}_j))$ is the plausibility and $\mathit{amb}(x)$ is the ambiguity of x in $\mathit{esl}_i$.

• k is a constant factor used to incrementally reduce the influence of ambiguous words. The smoothing is tuned to be higher in ESLB.

• $\alpha$ and $\beta$ are parametric, and can be used to study the evidence provided by ESLA and ESLB.

• $D(x, C(\mathit{PN}_j))$ is a discrimination factor used to determine the saliency (Yarowsky 1992) of a context $\mathit{esl}_i(x, \_)$ for a category $C(\mathit{PN}_j)$, i.e., how good a context is at discriminating between $C(\mathit{PN}_j)$ and the other categories.$^4$

The selected category for U_PN is

$$C = \arg\max_{C_{pn_k}} \mathit{evidence}(C_{pn_k})$$
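The tagging step above can be sketched in a few lines of Python. This is an illustrative sketch under several assumptions, not the authors' implementation: the `Context` record, the `similar` callback (standing in for the WordNet common-hyperonym test), the constant discrimination factor, the smoothing denominators `k - 1` (ESLA) and `k - 2` (ESLB), the exclusion of x itself from ESLB, and the values chosen for `k`, `alpha`, and `beta` are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Context:
    """An esl with its proper noun factored out: esl(x, _)."""
    word: str         # x (or w): the argument that is not the PN
    rel_type: str     # syntactic type, e.g. "G_N_V_Act"
    pn_is_head: bool  # position of the PN within the esl
    pl: float         # plausibility of the esl


def weight(pl: float, amb: int, k: float, shift: int) -> float:
    # Assumed form: pl * (1 - (amb - 1) / (k - shift)), with k > shift.
    return pl * (1.0 - (amb - 1) / (k - shift))


def evidence(
    unknown_contexts: List[Context],
    pn_esl: List[Tuple[Context, str]],   # learned contexts with PN category
    amb: Dict[str, int],
    D: Callable[[str, str], float],      # discrimination factor
    similar: Callable[[str, str], bool],
    k: float = 10.0,
    alpha: float = 0.5,
    beta: float = 0.5,
) -> Dict[str, float]:
    num_a, num_b = defaultdict(float), defaultdict(float)
    den_a = den_b = 0.0
    for u in unknown_contexts:
        for c, category in pn_esl:
            # Same syntactic type and same PN position are always required.
            if (c.rel_type, c.pn_is_head) != (u.rel_type, u.pn_is_head):
                continue
            if c.word == u.word:            # member of ESLA
                s = weight(c.pl, amb.get(c.word, 1), k, 1) * D(c.word, category)
                num_a[category] += s
                den_a += s
            elif similar(c.word, u.word):   # member of ESLB
                s = weight(c.pl, amb.get(c.word, 1), k, 2) * D(c.word, category)
                num_b[category] += s
                den_b += s
    scores: Dict[str, float] = {}
    for category in set(num_a) | set(num_b):
        a = num_a[category] / den_a if den_a else 0.0
        b = num_b[category] / den_b if den_b else 0.0
        scores[category] = alpha * a + beta * b
    return scores


def classify(scores: Dict[str, float]) -> Optional[str]:
    """Return the category with maximal evidence, or None without evidence."""
    return max(scores, key=scores.get) if scores else None


# Toy data: the unknown PN is the subject of "close".
pn_esl = [
    (Context("close", "G_N_V_Act", False, 1.0), "Organization"),
    (Context("close", "G_N_V_Act", False, 0.5), "Person"),
    (Context("shut", "G_N_V_Act", False, 1.0), "Organization"),
]
unknown = [Context("close", "G_N_V_Act", False, 1.0)]
scores = evidence(
    unknown, pn_esl, amb={"close": 1, "shut": 1},
    D=lambda word, category: 1.0,
    similar=lambda w, x: {w, x} == {"close", "shut"},
)
best = classify(scores)
```

In this toy run the exact-match contexts (ESLA) are split between two categories, while the generalized context (ESLB) supports only one of them, so the two terms of the evidence formula can be inspected separately by setting `alpha` or `beta` to zero.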

When grouping all the evidence of a U_PN in a text, the underlying hypothesis is that, in a given linguistic domain (finance, medicine, etc.), a PN has a unique sense. This is a reasonable restriction for proper nouns, supported by empirical evidence, though we would be more skeptical about the applicability of the one-sense-per-discourse paradigm (Gale, Church, and Yarowsky 1992) to generic words. We believe that it is precisely this restriction that makes the use of syntactic and semantic contexts effective for PNs.

Note that the evidence formula has several smoothing factors that work together to reduce the influence of unreliable or uninformative contexts. The formula also has parameters ($k$, $\alpha$, $\beta$), estimated by running systematic experiments. Standard statistical techniques have been used to balance experimental conditions and the sources of variance.