<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1"/><title>A Method of Measuring Term Representativeness - Baseline Method Using Co-occurrence Distribution</title><author surname="Hisamitsu" givenname="Toru"><org name="Central Research Laboratory, Hitachi, Ltd."/></author><author surname="Niwa" givenname="Yoshiki"><org name="Central Research Laboratory, Hitachi, Ltd."/></author><author surname="Tsujii" givenname="Jun-ichi"><org name="Graduate School of Science, the University of Tokyo"/></author></firstpageheader><frontmatter><p>A Method of Measuring Term Representativeness - Baseline Method Using Co-occurrence Distribution -</p><p>Toru Hisamitsu,† Yoshiki Niwa,† and Jun-ichi Tsujii‡</p><p>†Central Research Laboratory, Hitachi, Ltd., Akanuma 2520, Hatoyama, Saitama 350-0395, Japan, {hisamitu, yniwa}@harl.hitachi.co.jp</p><p>‡Graduate School of Science, the University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8654, Japan, tsujii@is.s.u-tokyo.ac.jp</p></frontmatter><abstract>This paper introduces a scheme, which we call the baseline method, to define a measure of term representativeness, and presents measures defined by using the scheme. The representativeness of a term is measured by a normalized characteristic value defined for the set of all documents that contain the term. Normalization is done by comparing the original characteristic value with the characteristic value defined for a randomly chosen document set of the same size. The latter value is estimated by a baseline function obtained by random sampling and logarithmic linear approximation. We found that the distance between the word distribution in a document set and the word distribution in a whole corpus is an effective characteristic value to use for the baseline method. Measures defined by the baseline method have several advantages: they can be used to compare the representativeness of two terms with very different frequencies, and they have well-defined threshold values for being representative. In addition, the baseline function for a corpus is robust against differences in corpora; that is, it can be used for normalization in a different corpus that has a different size or is in a different domain.</abstract></header><body><section number="1" title="Introduction"><p>Measuring the representativeness (i.e., the informativeness or domain specificity) of a term<footnote anchor="1"/> is essential to various tasks in natural language processing (NLP) and information retrieval (IR). It is particularly crucial when applied to an IR interface to help a user find informative terms. For instance, when the number of retrieved documents is intractably large, an overview of representative words in the documents is needed to understand the contents. To enable this, an IR system called <i>DualNAVI</i>, which has two navigation windows, one of which displays a graph of representative words in the retrieved documents, was developed (Nishioka et al. 1997). Figure 1 shows an example of a graph for the query "electronic money", with <i>Nihon Keizai Shimbun</i> (a financial newspaper) 1996 as the corpus. Frequently appearing words are displayed in the upper part of the window, and words are selected by a <i>tf-idf</i>-like measure (Niwa et al. 1997). Typical non-representative words are filtered out by using a stop-word list. This window helps users grasp the contents of retrieved documents, but it also exposes problems concerning existing representativeness measures.</p><figure caption="Figure 1"></figure><footnote label="1">A term is a word or a word sequence.</footnote><p><b>Figure 1</b></p><p>A topic word graph when the query is "electronic money".</p><p>One problem is the difficulty of suppressing uninformative words such as "year", "one", and "month", because classical measures, such as <i>tf-idf</i>, are too sensitive to word frequency and no established method for automatically constructing a stop-word list has been developed.</p><p>Another problem is that the difference in the representativeness of words is not sufficiently indicated. In the example above, highlighting "cipher" over less representative words such as "read" would be useful. Most classical measures based only on term frequency and document frequency cannot overcome this problem.</p>
<p>To define a more elaborate measure, attempts to incorporate more precise co-occurrence information have been made. Caraballo et al. (1999) tried to define a measure for the "specificity" of a noun by using its co-occurrence information, but it was not very successful in the sense that the measure did not particularly outperform term frequency.</p><p>Hisamitsu et al. (1999) developed a measure of the representativeness of a term by using co-occurrence information and a normalization <page local="2"/>technique. The measure is based on the distance between the word distribution in the documents containing a term and the word distribution in the whole corpus. Their measure overcomes the previously mentioned problems, and preliminary experiments showed that it worked better than existing measures in picking out representative/non-representative terms. 
Since the normalization technique plays a crucial part in constructing the measure, issues related to the normalization need more study.</p><p>In this paper we review Hisamitsu's measure and introduce a generic scheme, which we call the baseline method for convenience, that can be used to define various measures including the above. A characteristic value of all documents containing a term <i>T</i> is normalized by using a baseline function that estimates the characteristic value of a randomly chosen document set of the same size. The normalized value is then used to measure the representativeness of the term <i>T</i>. A measure defined by the baseline method has several advantages compared to classical measures.</p><p>We compare four measures (two classical ones and two newly defined ones) from various viewpoints, and show the superiority of the measure based on the normalized distance between two word distributions. Another important finding is that the baseline function is substantially portable; that is, one defined for a corpus can be used for a different corpus even if the two corpora have considerably different sizes or are in different domains.</p></section><section number="2." title="Existing measures of representativeness"><subsection number="2.1" title="Overview"><p>Various methods for measuring the informativeness or domain specificity of a word have been proposed in the domains of IR and term extraction in NLP (see the survey paper by Kageura 1996). In characterizing a term, Kageura introduced the concepts of "unithood" and "termhood": unithood is "the degree of strength or stability of syntagmatic combinations or collocations," and termhood is "the degree to which a linguistic unit is related to (or more straightforwardly, represents) domain-specific concepts." Kageura's termhood is therefore what we call representativeness here.</p><p>Representativeness measures were first introduced in the IR domain for determining indexing words. 
The simplest measure is calculated from only word frequency within a document. For example, the weight $I_{ij}$ of word $w_i$ in document $d_j$ is defined from $f_{ij}$ alone, where $f_{ij}$ is the frequency of word $w_i$ in document $d_j$ (Sparck-Jones 1973, Noreault et al. 1977). More elaborate measures for termhood combine word frequency within a document and word occurrence over a whole corpus. For instance, <i>tf-idf</i>, the most commonly used measure, was originally defined as</p><p>$I_{ij} = f_{ij} \times \log(N_{total} / N_i),$</p><p>where $N_i$ and $N_{total}$ are, respectively, the number of documents containing word $w_i$ and the total number of documents (Salton et al. 1973). There are a variety of definitions of <i>tf-idf</i>, but its basic feature is that a word appearing more frequently in fewer documents is assigned a higher value. If documents are categorized beforehand, we can use a more sophisticated measure based on the $\chi^2$ test of the hypothesis that an occurrence of the target word is independent of categories (Nagao et al. 1976).</p><p>Research on automatic term extraction in NLP domains has led to several measures for weighting terms, mainly by considering the unithood of a word sequence. For instance, mutual information (Church et al. 1990) and the log-likelihood (Dunning 1993) methods for extracting word bigrams have been widely used. Other measures for calculating the unithood of n-grams have also been proposed (Frantzi et al. 1996, Nakagawa et al. 1998, Kita et al. 
1994).</p></subsection><subsection number="2.2" title="Problems"><p>Existing measures suffer from at least one of the following problems:</p><p>(1) Classical measures such as <i>tf-idf</i> are so sensitive to term frequencies that they fail to avoid very frequent non-informative words.</p><p>(2) Methods using cross-category word distributions (such as the $\chi^2$ method) can be applied only if documents in a corpus are categorized.</p><p>(3) Most measures in NLP domains cannot treat single-word terms because they use the unithood strength of multiple words.</p><p>(4) The threshold value for being representative is defined in an <i>ad hoc</i> manner.</p><p>The scheme that we describe here constructs measures that are free of these problems.</p></subsection></section><section number="3." title="Baseline method for defining representativeness measures"><subsection number="3.1" title="Basic idea"><p>This subsection describes the method we developed for defining a measure of term representativeness. Our basic idea is summarized by the famous quote (Firth 1957):</p><p><i>"You shall know a word by the company it keeps."</i></p><p>We interpreted this as the following working hypothesis:</p><page local="3"/><p>For any term <i>T</i>, if the term is representative, <i>D(T)</i>, the set of all documents containing <i>T</i>, should have some characteristic property compared to the "average".</p><p>To apply this hypothesis, we need to specify a measure to obtain some "property" of a document set and the concept of "average". Thus, we converted this hypothesis into the following procedure:</p><p>Choose a measure <i>M</i> characterizing a document set. For term <i>T</i>, calculate <i>M(D(T))</i>, the value of the measure for <i>D(T)</i>. 
Then compare <i>M(D(T))</i> with <i>B_M(#D(T))</i>, where <i>#D(T)</i> is the number of words contained in <i>D(T)</i>, and <i>B_M</i> estimates the value of <i>M(D)</i> when <i>D</i> is a randomly chosen document set of size <i>#D(T)</i>.</p><p>Here, <i>M</i> measures the property and <i>B_M</i> estimates the average. The size of a document set is defined as the number of words it contains.</p><p>We tried two measures as <i>M</i>. One was the number of different words (referred to here as <i>DIFFNUM</i>) appearing in a document set. Teramoto conducted an experiment with a small corpus and reported that <i>DIFFNUM</i> was useful for picking out important words (Teramoto et al. 1999)<footnote anchor="2"/> under the hypothesis that the number of different words co-occurring with a topical (representative) word is smaller than that with a generic word. The other measure was the distance between the word distribution in <i>D(T)</i> and the word distribution in the whole corpus <i>D_0</i>. The distance between the two distributions can be measured in various ways; we used the log-likelihood ratio as in Hisamitsu et al. (1999), and denote this measure as <i>LLR</i>. Figure 2 plots the points (<i>#D</i>, <i>M(D)</i>) when <i>M</i> is <i>DIFFNUM</i> or <i>LLR</i>, where <i>D</i> varies over sets of randomly selected documents of various sizes from the articles in <i>Nikkei-Shinbun</i> 1996.</p><p>For measure <i>M</i>, we define <i>Rep(T, M)</i>, the representativeness of <i>T</i>, by normalizing <i>M(D(T))</i> by <i>B_M(#D(T))</i>. 
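</p><p>As a minimal sketch of the two characteristic values just introduced, the following Python code computes DIFFNUM and a distance of the log-likelihood-ratio type for a tokenized document set. The text above does not spell out the LLR formula, so the form used here, the sum over words of k_i * log((k_i / #D) / (K_i / #D_0)) with k_i and K_i the frequencies of word w_i in D and in the whole corpus D_0, is an assumption, as are the function names and the toy corpus.</p><p>
```python
import math
from collections import Counter

def word_counts(docs):
    """Count word occurrences over a list of tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

def diffnum(docs):
    """DIFFNUM: number of different words appearing in the document set."""
    return len(word_counts(docs))

def llr(docs, corpus_counts):
    """Assumed LLR form: sum_i k_i * log((k_i / #D) / (K_i / #D_0))."""
    k = word_counts(docs)
    size_d = sum(k.values())               # #D, measured in words
    size_d0 = sum(corpus_counts.values())  # #D_0, measured in words
    return sum(
        k_i * math.log((k_i / size_d) / (corpus_counts[w] / size_d0))
        for w, k_i in k.items()
    )

# Hypothetical toy corpus; each document is a list of words.
corpus = [
    ["cipher", "key", "year"],
    ["year", "month", "do"],
    ["do", "year", "economy"],
]
corpus_counts = word_counts(corpus)

# D(T) for T = "cipher": all documents containing the term.
d_cipher = [doc for doc in corpus if "cipher" in doc]
print(diffnum(d_cipher), llr(d_cipher, corpus_counts))
```
</p><p>Under this assumed form the distance is zero when <i>D</i> is the whole corpus, and it grows as the word distribution of <i>D</i> departs from that of the corpus.</p><p>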
The next subsection describes the construction of <i>B_M</i> and the normalization.</p></subsection><subsection number="3.2" title="Baseline function and normalization"><p>Using the case of <i>LLR</i> as an example, this subsection explains why normalization is necessary and describes the construction of a baseline function.</p><figure caption="Figure 2"></figure><p><b>Figure 2</b></p><p>Values of <i>DIFFNUM</i> and <i>LLR</i> for randomly chosen document sets (horizontal axis: <i>#D</i>, the size of a randomly chosen document set).</p><p>Figure 3 superimposes the coordinates (<i>#D(T)</i>, <i>LLR(D(T))</i>) onto the graph of <i>LLR</i>, where <i>T</i> varies over "cipher", "year", "month", "read", "one", "do", and "economy". Figure 3 shows that, for example, <i>LLR(D("do"))</i> is smaller than <i>LLR(D("economy"))</i>, which reflects our linguistic intuition that words co-occurring with "economy" are more biased than those with "do". However, <i>LLR(D("cipher"))</i> is smaller than <i>LLR(D("economy"))</i> and smaller even than <i>LLR(D("do"))</i>. This contradicts our linguistic intuition, and is why values of <i>LLR</i> are not directly used to compare the representativeness of terms. This phenomenon arises because <i>LLR(D(T))</i> generally increases as <i>#D(T)</i> increases. We therefore need to use some form of normalization to offset this underlying tendency.</p><p>We used a baseline function to normalize the values. In this case, <i>B_LLR(*)</i> was designed so that it approximates the curve in Fig. 3. From the definition of the distance, it is obvious that <i>B_LLR(0)</i> = <i>B_LLR(#D_0)</i> = 0. 
In the limit <i>#D_0</i> → ∞, <i>B_LLR(*)</i> becomes a monotonically increasing function.</p><p>The curve could be approximated precisely through logarithmic linear approximation near (0, 0). To make an approximation, up to 300 documents are randomly sampled at a time. (Let each randomly chosen document set be denoted by <i>D</i>. The number of sampled documents is increased from one to 300, repeating each number up to five times.) Each (<i>#D</i>, <i>LLR(D)</i>) is converted to (log(<i>#D</i>), log(<i>LLR(D)</i>)). The curve formed by the (log(<i>#D</i>), log(<i>LLR(D)</i>)) values, which is very close to a straight line, is further divided into multiple parts, and each part is approximated by a linear function. For instance, in the interval <i>I</i> = {<i>x</i> | 10,000 &lt; <i>x</i> &lt; 15,000}, log(<i>LLR(D)</i>) could be approximated by 1.103 + 1.023 × log(<i>#D</i>) with $R^2$ = 0.996.</p><p>For <i>LLR</i>, we define <i>Rep(T, LLR)</i>, the representativeness of <i>T</i>, by normalizing <i>LLR(D(T))</i> by <i>B_LLR(#D(T))</i> as follows:</p><footnote label="2">With Teramoto's method, eight parameters must be tuned to normalize DIFFNUM(D(T)), but the details of how this was done were not disclosed.</footnote><p>$Rep(T, LLR) = 100 \times \left( \frac{\log(LLR(D(T)))}{\log(B_{LLR}(\#D(T)))} - 1 \right).$</p><page local="4"/><p>For instance, when we used <i>Nihon Keizai Shimbun</i> 1996, the average of 100 × (log(<i>LLR(D)</i>) / log(<i>B_LLR(#D)</i>) - 1), <i>Avr</i>, was -0.00423 and the standard deviation, σ, was about 0.465 when <i>D</i> varies over randomly selected document sets. 
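</p><p>The fitting and normalization steps described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: it fits a single straight line to log-log samples by ordinary least squares (the text uses several such pieces), it assumes natural logarithms, and the sample points are hypothetical values placed on the line quoted for the interval <i>I</i>. The names fit_log_linear and rep are chosen here for illustration.</p><p>
```python
import math

def fit_log_linear(samples):
    """Least-squares fit of log(M) = a + b * log(size) over (size, M) pairs."""
    xs = [math.log(size) for size, _ in samples]
    ys = [math.log(value) for _, value in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def rep(value, size, a, b):
    """Rep = 100 * (log(M(D(T))) / log(B_M(#D(T))) - 1), with log(B_M(s)) = a + b * log(s)."""
    log_baseline = a + b * math.log(size)
    return 100.0 * (math.log(value) / log_baseline - 1.0)

# Hypothetical (size, LLR) samples placed exactly on the line quoted in the text.
samples = [(s, math.exp(1.103 + 1.023 * math.log(s))) for s in (10000, 12000, 14000)]
a, b = fit_log_linear(samples)

# A document set lying on the baseline scores about zero; one whose LLR is
# twice the baseline value at the same size scores above zero.
baseline_value = math.exp(a + b * math.log(12000))
print(rep(baseline_value, 12000, a, b), rep(2 * baseline_value, 12000, a, b))
```
</p><p>Because the score is a ratio of logarithms, it does not depend on the base of the logarithm; only the fitted intercept and slope do.</p><p>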
Every observed value fell within <i>Avr</i> ± 4σ and 99% of observed values fell within <i>Avr</i> ± 3σ. This happened in all seven corpora we tested. Therefore, we can define the threshold of being representative as, say, <i>Avr</i> + 4σ.</p><p><b>Figure 3</b></p><p>Baseline and sample word distribution (horizontal axis: <i>#D</i> and <i>#D(T)</i>).</p></subsection><subsection number="3.3" title="Treatment of very frequent terms"><p>So far we have been unable to treat extremely frequent terms, such as "do". We therefore used random sampling to calculate the <i>Rep(T, LLR)</i> of a very frequent term <i>T</i>. If the number of documents in <i>D(T)</i> is larger than a threshold value <i>N</i>, which was calculated from the average number of words contained in a document, <i>N</i> documents are randomly chosen from <i>D(T)</i> (we used <i>N</i> = 150). This subset is denoted <i>D'(T)</i>, and <i>Rep(T, LLR)</i> is defined by 100 × (log(<i>LLR(D'(T))</i>) / log(<i>B_LLR(#D'(T))</i>) - 1). This is effective because we can use a well-approximated part of the baseline curve; it also reduces the amount of calculation required.</p><p>By using <i>Rep(T, LLR)</i> defined above, we obtained <i>Rep("do", LLR)</i> = -0.573, <i>Rep("economy", LLR)</i> = 4.08, and <i>Rep("cipher", LLR)</i> = 6.80, which reflect our linguistic intuition.</p></subsection><subsection number="3.4" title="Features of Rep(T, M)"><p><i>Rep(T, M)</i> has the following advantages by virtue of its definition:</p><p>(1) Its definition is mathematically clear.</p><p>(2) It can compare high-frequency terms with low-frequency terms.</p><p>(3) The threshold value of being representative can be defined systematically.</p><p>(4) It can be applied to n-gram terms for any n.</p></subsection></section><section number="4." 
title="Experiments"><subsection number="4.1" title="Evaluation of monograms"><p>Taking topic-word selection for a navigation window for IR (see Fig. 1) into account, we examined the relation between the value of <i>Rep(T, M)</i> and a manual classification of words (monograms) extracted from 158,000 articles (excluding special-styled non-sentential articles such as company-personnel-affair articles) in the 1996 issues of the <i>Nikkei Shinbun</i>.</p><subsubsection number="4.1.1" title="Preparation"><p>We randomly chose 20,000 words from 86,000 words having document frequencies larger than 2, then randomly chose 2,000 of them and classified these into three groups: <i>class a</i> (acceptable), words useful for the navigation window; <i>class d</i> (delete), words not useful for the navigation window; and <i>class u</i> (uncertain), words whose usefulness in the navigation window was either neutral or difficult to judge. In the classification process, a judge used the <i>DualNAVI</i> system and examined the informativeness of each word as guidance. Classification into <i>class d</i> was done conservatively because the consequences of removing informative words from the window are more serious than those of allowing useless words to appear.</p><p>Table 1 shows part of the classification of the 2,000 words. Words marked "p" are proper nouns. The difference between proper nouns in <i>class a</i> and proper nouns in other classes is that the former are well known. Most words classified as "d" are very common verbs (such as "do" and "have"), adverbs, demonstrative pronouns, conjunctions, and numbers. 
It is therefore impossible to define a stop-word list by only using parts of speech because almost all parts of speech appear in <i>class d</i> words.</p></subsubsection><subsubsection number="4.1.2" title="Measures used in the experiments"><p>To evaluate the effectiveness of several measures, we compared the ability of each measure to gather (avoid) representative (non-representative) terms. We randomly sorted the 20,000 words and then compared the results with the results of sorting by other criteria: <i>Rep(*, LLR)</i>, <i>Rep(*, DIFFNUM)</i>, <i>tf</i> (term frequency), and <i>tf-idf</i>. The comparison was done by using the accumulated number of words marked by a specified class that appeared in the first <i>N</i> (1 &lt; <i>N</i> &lt; 2,000) words. The definition we used for <i>tf-idf</i> was the product of a term-frequency factor based on <i>TF(T)</i> and a factor of the form log(<i>N_total</i> / <i>N(T)</i>), where <i>T</i> is a term, <i>TF(T)</i> is the term frequency of <i>T</i>, <i>N_total</i> is the total number of documents, and <i>N(T)</i> is the number of documents that contain <i>T</i>.</p></subsubsection><subsubsection number="4.1.3" title="Results"><figure caption="Figure 4"></figure><page local="5"/><p>Figure 4 compares, for all the sorting criteria, the accumulated number of words marked "a". The total number of <i>class a</i> words was 911. <i>Rep(*, LLR)</i> clearly outperformed the other measures. Although <i>Rep(*, DIFFNUM)</i> outperformed <i>tf</i> and <i>tf-idf</i> up to about the first 9,000 monograms, it otherwise under-performed them. If we use the threshold value of <i>Rep(*, LLR)</i>, the words from the first to the 1,511th are considered representative. In this case, the recall and precision of the 1,511 words against all <i>class a</i> words were 85% and 50%, respectively. 
When using <i>tf-idf</i>, the recall and precision of the first 1,511 words against all <i>class a</i> words were 79% and 47%, respectively (note, though, that <i>tf-idf</i> does not have a clear threshold value).</p><p>Although the degree of out-performance by <i>Rep(*, LLR)</i> does not seem large, this is a promising result because it has been pointed out that, in the related domain of term extraction, existing measures hardly outperform even the use of frequency (for example, Daille et al. 1994, Caraballo et al. 1999) when this type of comparison based on accumulated numbers is used.</p><p>Figure 5 compares, for all the sorting criteria, the accumulated number of words marked "d" (454 in total). In this case, the fewer the words, the better. The difference is far clearer in this case: <i>Rep(*, LLR)</i> obviously outperformed the other measures. In contrast, <i>tf-idf</i> and frequency barely outperformed random sorting. <i>Rep(*, DIFFNUM)</i> outperformed <i>tf</i> and <i>tf-idf</i> until about the first 3,000 monograms, but under-performed otherwise.</p><p>Figure 6 compares, for all the sorting criteria, the accumulated number of words marked "ap" (acceptable proper nouns, 216 in total). Comparing this figure with Fig. 4, we see that the out-performance of <i>Rep(*, LLR)</i> is more pronounced. 
Also, <i>Rep(*, DIFFNUM)</i> globally outperformed <i>tf</i> and <i>tf-idf</i>, while the performance of <i>tf</i> and <i>tf-idf</i> was nearly the same as, or even worse than, that of random sorting.</p><p><b>Figure 4</b></p><p>Sorting results on <i>class a</i> words (horizontal axis: order; curves: random, <i>Rep(*, LLR)</i>, <i>Rep(*, DIFFNUM)</i>, <i>tf-idf</i>, <i>tf</i>).</p><p><b>Figure 5</b></p><p>Sorting results on <i>class d</i> words (same axis and curves as Figure 4).</p><p><b>Figure 6</b></p><p>Sorting results on <i>class ap</i> words (same axis and curves as Figure 4).</p><p><b>Table 1</b></p><p>Examples of the classified words (English glosses; "p" marks a proper noun)</p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p><i>class a</i></p></td><td class="cell"><p>amusement park, threatening letter, firewall, antique, Atlanta (p)</p></td></tr><tr class="row"><td class="cell"><p><i>class u</i></p></td><td class="cell"><p>chilly, depressed, Ishigami (p), Shigeyuki (p), misdirected, agility</p></td></tr><tr class="row"><td class="cell"><p><i>class d</i></p></td><td class="cell"><p>83,000,000, greatly, 1,146, all, not... in the least</p></td></tr></table><p>In the experiments, proper nouns generally have a high <i>Rep</i> value, and some have particularly high scores. 
Proper nouns having particularly high scores are, for instance, the names of <i>sumo</i> wrestlers or horses. This is because they appear in articles with special formats such as sports reports.</p><p>We attribute the difference in performance between <i>Rep(*, LLR)</i> and <i>Rep(*, DIFFNUM)</i> to the quantity of information used. Obviously, information on the distribution of words in a document is more comprehensive than that on the number of different words. This encourages us to try other measures of document properties that incorporate even more precise information.</p><page local="6"/></subsubsection></subsection><subsection number="4.2" title="Picking out frequent non-representative monograms"><p>When we concentrate on the most frequent terms, <i>Rep(*, DIFFNUM)</i> outperformed <i>Rep(*, LLR)</i> in the following sense. We marked "clearly non-representative terms" in the 2,000 most frequent monograms, then counted the number of marked terms that were assigned <i>Rep</i> values smaller than the threshold value of a specified representativeness measure.</p><p>The total number of checked terms was 563, and 409 of them (about 73%) were identified as non-representative by <i>Rep(*, LLR)</i>. On the other hand, <i>Rep(*, DIFFNUM)</i> identified 453 terms (about 80%) as non-representative.</p></subsection><subsection number="4.3" title="Rank correlation between measures"><p>We investigated the rank correlation of the sorting results for the 20,000 terms used in the experiments described in subsection 4.1. 
Rank correlation was measured by Spearman's method and Kendall's method (see Appendix) using 2,000 terms randomly selected from the 20,000 terms. Table 2 shows the correlation between <i>Rep(*, LLR)</i> and other measures. It is interesting that the ranking by <i>Rep(*, LLR)</i> and that by <i>Rep(*, DIFFNUM)</i> had a very low correlation, even lower than with <i>tf</i> or <i>tf-idf</i>. This indicates that a combination of <i>Rep(*, LLR)</i> and <i>Rep(*, DIFFNUM)</i> should provide a strong discriminative ability in term classification; this possibility deserves further investigation.</p><p><b>Table 2</b></p><p>Two types of rank correlation between term rankings by <i>Rep(*, LLR)</i> and other measures.</p></subsection><subsection number="4.4" title="Portability of baseline functions"><p>We examined the robustness of the baseline functions; that is, whether a baseline function defined from a corpus can be used for normalization in a different corpus. This was investigated by using <i>Rep(*, LLR)</i> with seven different corpora. Seven baseline functions were defined from seven corpora, and were then used for normalization in defining <i>Rep(*, LLR)</i> in the corpus used in the experiments described in subsection 4.1. The performance of the <i>Rep(*, LLR)</i>s defined using the different baseline functions was compared in the same way as in subsection 4.1. 
The seven corpora used to construct baseline functions were as follows:</p><p>NK96-ORG: 158,000 articles used in the experiments in 4.1</p><p>NK96-50000: 50,000 randomly selected articles from the whole corpus NK96 (206,803 articles of <i>Nikkei-shinbun</i> 1996)</p><p>NK96-100000: 100,000 randomly selected articles from NK96</p><p>NK96-200000: 200,000 randomly selected articles from NK96</p><p>NK98-158000: 158,000 randomly selected articles from articles in <i>Nikkei-shinbun</i> 1998</p><p>NC-158000: 158,000 randomly selected abstracts of academic papers from the NACSIS corpus (Kando et al. 1999)</p><p>NC-ALL: all abstracts (333,003 abstracts) in the NACSIS corpus.</p><p>Statistics on their content words are shown in Table 3.</p><p><b>Table 3</b></p><p>Corpora and statistics on their content words</p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p>Corpus</p></td><td class="cell"><p># of total words</p></td><td class="cell"><p># of different words</p></td></tr><tr class="row"><td class="cell"><p>NK96-ORG</p></td><td class="cell"><p>42,555,095</p></td><td class="cell"><p>210,572</p></td></tr><tr class="row"><td class="cell"><p>NK96-50000</p></td><td class="cell"><p>13,498,244</p></td><td class="cell"><p>127,852</p></td></tr><tr class="row"><td class="cell"><p>NK96-100000</p></td><td class="cell"><p>26,934,068</p></td><td class="cell"><p>172,914</p></td></tr><tr class="row"><td class="cell"><p>NK96-200000</p></td><td class="cell"><p>53,816,407</p></td><td class="cell"><p>233,668</p></td></tr><tr class="row"><td class="cell"><p>NK98-158000</p></td><td class="cell"><p>39,762,127</p></td><td class="cell"><p>196,261</p></td></tr><tr class="row"><td class="cell"><p>NC-158000</p></td><td class="cell"><p>30,770,682</p></td><td class="cell"><p>231,769</p></td></tr><tr class="row"><td class="cell"><p>NC-ALL</p></td><td class="cell"><p>64,806,627</p></td><td class="cell"><p>350,991</p></td></tr></table><p>Figure 7 compares, for all the baseline functions, the accumulated number of words marked "a" (see subsection 4.1). The performance decreased only slightly when the baseline defined from NC-ALL was used. 
In the other cases, the differences were so small that they were almost invisible in Fig. 7. The same results were obtained when using <i>class d</i> words and <i>class ap</i> words.</p><p><b>Figure 7</b></p><p>(Horizontal axis: order; curves: random, NK96-ORG, NK96-50000, NK96-100000, NK96-200000, NK98-158000, NC-158000, NC-ALL.)</p><p>We also examined the rank correlations between the rankings that resulted from each representativeness measure in the same way as described in subsection 4.3 (see Table 4). They were close to 100% except when combining Kendall's method and the NACSIS corpus baselines.</p><p><b>Table 4</b></p><p>Rank correlation between the measure defined by the NK96-ORG baseline and those defined by other baselines</p><p>These results suggest that a baseline function constructed from a corpus can be used to rank terms in considerably different corpora. This is particularly useful when we are dealing with a corpus similar to a known corpus but do not know the precise word distributions in the corpus. 
The same kind of robustness was observed when we used <i>Rep(*, DIFFNUM)</i>.<page local="7"/> This robustness of the baseline function is an important feature of measures defined using the baseline method.</p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><i>Rep(*, DIFFNUM)</i></p></td><td class="cell"><p><i>tf-idf</i></p></td><td class="cell"><p><i>tf</i></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Spearman</p></td><td class="cell"><p>-0.00792</p></td><td class="cell"><p>0.202</p></td><td class="cell"><p>0.198</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Kendall</p></td><td class="cell"><p>-0.0646</p></td><td class="cell"><p>0.161</p></td><td class="cell"><p>0.153</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>NK96-50000</p></td><td class="cell"><p>NK96-100000</p></td><td class="cell"><p>NK96-200000</p></td><td class="cell"><p>NK98-158000</p></td><td class="cell"><p>NC-158000</p></td><td class="cell"><p>NC-ALL</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Spearman</p></td><td class="cell"><p>0.997</p></td><td 
class="cell"><p>0.997</p></td><td class="cell"><p>0.996</p></td><td class="cell"><p>0.999</p></td><td class="cell"><p>0.912</p></td><td class="cell"><p>0.900</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Kendall</p></td><td class="cell"><p>0.970</p></td><td class="cell"><p>0.956</p></td><td class="cell"><p>0.951</p></td><td class="cell"><p>0.979</p></td><td class="cell"><p>0.789</p></td><td class="cell"><p>0.780</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table></subsection></section><section number="5." title="Conclusion and future works"><p>We have developed a better method, the baseline method, for defining the representativeness of a term. A characteristic value of <i>D</i>(<i>T</i>), the set of all documents containing a term <i>T</i>, is normalized by using a baseline function that estimates the characteristic value of a randomly chosen document set of the same size as <i>D</i>(<i>T</i>). The normalized value measures the representativeness of the term <i>T</i>. A measure defined by the baseline method offers several advantages over classical measures:</p><p>(1) its definition is mathematically simple and clear, (2) it can compare high-frequency terms with low-frequency terms, (3) the threshold value for being representative can be defined systematically, and (4) it can be applied to n-gram terms for any n.</p><p>We developed two measures: one based on the normalized distance between two word distributions (Rep(*, <i>LLR</i>)) and another based on the number of different words in a document set (Rep(*, <i>DIFFNUM</i>)). We compared these measures with two classical measures from various viewpoints, and confirmed that Rep(*, <i>LLR</i>) was superior. 
Experiments showed that the newly developed measures were particularly effective for discarding frequent but uninformative terms. We expect that these measures can be used to construct stop-word lists automatically and to improve document-similarity calculation.</p><p>An important finding was that the baseline function is portable; that is, one defined on a corpus can be used for normalization in a different corpus even if the two corpora have considerably different sizes or are in different domains. We can therefore apply the measures in practical applications that involve multiple similar corpora whose word distributions are not fully known, provided that the distribution is known for one of them.</p><p>We plan to apply Rep(*, <i>LLR</i>) and Rep(*, <i>DIFFNUM</i>) to several tasks in the IR domain, such as the construction of a stop-word list for indexing and term weighting in document-similarity calculation.</p><p>It will also be interesting to theoretically estimate the baseline functions by using fundamental parameters such as the total number of words in a corpus or the total number of different words in the corpus. The nature of the baseline functions deserves further study.</p></section><section title="Acknowledgements"><p>This project is supported in part by the Advanced Software Technology Project under the auspices of the Information-technology Promotion Agency, Japan (IPA).</p></section><references><p>Caraballo, S. A. and Charniak, E. (1999). Determining the specificity of nouns from text. <i>Proc. of EMNLP'99</i>, pp.63-70.</p><p>Church, K. W. and Hanks, P. (1990). Word Association Norms, Mutual Information, and Lexicography, <i>Computational Linguistics</i> 6(1), pp.22-29.</p><p>Daille, B. and Gaussier, E., and Lange, J. (1994). Towards automatic extraction of monolingual and bilingual terminology. <i>Proc. of COLING'94</i>, pp.515-521.</p><p>Dunning, T. (1993). Accurate Method for the Statistics of Surprise and Coincidence, <i>Computational Linguistics</i> 19(1), pp.61-74.</p><p>Firth, J. (1957). A synopsis of linguistic theory 1930-1955. <i>Studies in Linguistic Analysis</i>, Philological Society, Oxford.</p><p>Frantzi, K. T., Ananiadou, S., and Tsujii, J. (1996). Extracting Terminological Expressions, <i>IPSJ Technical Report of SIGNL</i>, NL112-12, pp.83-88.</p><p>Hisamitsu, T., Niwa, Y., and Tsujii, J. (1999). Measuring Representativeness of Terms, <i>Proc. of IRAL'99</i>, pp.83-90.</p><p>Kageura, K. and Umino, B. (1996). Methods of automatic term recognition: A review. <i>Terminology</i> 3(2), pp.259-289.</p><p>Kando, N., Kuriyama, K., and Nozue, T. (1999). NACSIS test collection workshop (NTCIR-1), <i>Proc. of the 22nd Annual International ACM SIGIR Conf. on Research and Development in IR</i>, pp.299-300.</p><p>Kita, Y., Kato, Y., Otomo, T., and Yano, Y. (1994). Comparative Study of Automatic Extraction of Collocations from Corpora: Mutual Information vs. Cost Criteria, <i>Journal of Natural Language Processing</i>, 1(1), pp.21-33.</p><p>Nagao, M., Mizutani, M., and Ikeda, H. (1976). An Automated Method of the Extraction of Important Words from Japanese Scientific Documents, <i>Trans. of IPSJ</i>, 17(2), pp.110-117.</p><p>Nakagawa, H. and Mori, T. (1998). Nested Collocation and Compound Noun For Term Extraction, <i>Proc. of Computerm'98</i>, pp.64-70.</p><p>Nishioka, S., Niwa, Y., Iwayama, M., and Takano, A. (1997). <i>DualNAVI</i>: An information retrieval interface. <i>Proc. of WISS'97</i>, pp.43-48. (in Japanese)</p><p>Niwa, Y., Nishioka, S., Iwayama, M., and Takano, A. (1997). Topic graph generation for query navigation: Use of frequency classes for topic extraction. <i>Proc. of NLPRS'97</i>, pp.95-100.</p><p>Noreault, T., McGill, M., and Koll, M. B. (1977). A Performance Evaluation of Similarity Measure, Document Term Weighting Schemes and Representation in a Boolean Environment. In Oddey, R. N. (ed.), <i>Information Retrieval Research</i>. London: Butterworths, pp.57-76.</p><p>Salton, G. and Yang, C. S. (1973). On the Specification of Term Values in Automatic Indexing. <i>Journal of Documentation</i> 29(4), pp.351-372.</p><p>Sparck-Jones, K. (1973). Index Term Weighting. <i>Information Storage and Retrieval</i> 9(11), pp.616-633.</p><p>Teramoto, Y., Miyahara, Y., and Matsumoto, S. (1999). Word weight calculation for document retrieval by analyzing the distribution of co-occurrence words, <i>Proc. of the 59th Annual Meeting of IPSJ</i>, IP-06. (in Japanese)</p></references><appendix title="Appendix"><p>Assume that items are ranked by measures <i>A</i> and <i>B</i>, and that the rank of item $I_j$ assigned by <i>A</i> (<i>B</i>) is $R_A(j)$ ($R_B(j)$), where $R_A(i) \neq R_A(j)$ ($R_B(i) \neq R_B(j)$) if $i \neq j$. Then, Spearman's rank correlation between the two rankings is given as</p><p>\[ 1 - \frac{6 \sum_{j} \left( R_A(j) - R_B(j) \right)^2}{N(N^2 - 1)}, \]</p><p>and Kendall's rank correlation between the two rankings is given as</p><p>\[ \frac{1}{N(N-1)} \left( \#\{(i,j) \mid v(R_A(i) - R_A(j)) = v(R_B(i) - R_B(j))\} - \#\{(i,j) \mid v(R_A(i) - R_A(j)) = -v(R_B(i) - R_B(j))\} \right), \]</p><p>where $N$ is the number of items, the pairs $(i,j)$ range over $i \neq j$, and $v(x) = 1$ if x &gt; 0 and $v(x) = -1$ if x &lt; 0.</p></appendix></body></article>