<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="170"/><title>Bipolar Person Name Identification of Topic Documents Using Principal Component Analysis</title><pubinfo>Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pages 170-178, Beijing, August 2010</pubinfo><author surname="Chen" givenname="Chien Chin"><org  name="National Taiwan University" country="Taiwan" city="Taipei"/></author><author surname="Wu" givenname="Chen-Yuan"><org  name="National Taiwan University" country="Taiwan" city="Taipei"/></author></firstpageheader><frontmatter><p><b>Bipolar Person Name Identification of Topic Documents Using Principal Component Analysis</b></p><p><b>Chien Chin Chen</b></p><p>Department of Information Management</p><p>National Taiwan University</p><p>paton@im.ntu.edu.tw</p><p><b>Chen-Yuan Wu</b></p><p>Department of Information Management</p><p>National Taiwan University</p><p>r97725035@ntu.edu.tw</p></frontmatter><abstract>In this paper, we propose an unsupervised approach for identifying bipolar person names in a set of topic documents. We employ principal component analysis (PCA) to discover bipolar word usage patterns of person names in the documents and show that the signs of the entries in the principal eigenvector of PCA partition the person names into bipolar groups spontaneously. Empirical evaluations demonstrate the efficacy of the proposed approach in identifying bipolar person names of topics. </abstract></header><body><section number="1" title="Introduction"><p>With the advent of Web 2.0, many online collaborative tools, e.g., weblogs and discussion forums, are being developed to allow Internet users to express their perspectives on a wide variety of topics via Web documents. 
One benefit is that the Web has become an invaluable knowledge base for Internet users to learn about a topic comprehensively. Since the essence of Web 2.0 is knowledge sharing, collaborative tools are generally designed with few constraints so that users will be motivated to contribute their knowledge. As a result, the number of topic documents on the Internet is growing exponentially. Research subjects, such as topic threading and timeline mining (Nallapati et al., 2004; Feng and Allan, 2007; Chen and Chen, 2008), are thus being studied to help Internet users comprehend numerous topic documents efficiently.</p><p>A topic consists of a sequence of related events associated with a specific time, place, and person(s) (Nallapati et al., 2004). Topics that involve bipolar (or competitive) viewpoints are often attention-getting and attract a large number of topic documents. For such topics, identifying the polarity of the named entities, especially person names, in the topic documents would help readers learn the topic efficiently. For instance, for the 2008 American presidential election, Internet users can find numerous Web documents about the Democratic and Republican parties. Identifying important people in the competing parties would help readers form a balanced view of the campaign.</p><p>Existing works on topic content mining focus on extracting important themes in topics. In this paper, we propose an unsupervised approach that identifies bipolar person names in a set of topic documents automatically. We employ principal component analysis (PCA) (Smith, 2002) to discover bipolar word usage patterns of important person names in a set of topic documents, and show that the signs of the entries in the principal eigenvector of PCA partition the person names into bipolar groups spontaneously. 
In addition, we present two techniques, called off-topic block elimination and weighted correlation coefficient, to reduce the effect of data sparseness on person name bipolarization. The results of experiments based on two topic document sets, written in English and Chinese respectively, demonstrate that the proposed PCA-based approach is effective in identifying bipolar person names. Furthermore, the approach is language independent.</p><page local="2" global="171"/></section><section number="2" title="Related Work"><p>Our research is closely related to opinion mining, which involves identifying the polarity (or sentiment) of a word in order to extract positive or negative sentences from review documents (Ganapathibhotla and Liu, 2008). Hatzivassiloglou and McKeown (1997) validated that language conjunctions, such as <i>and</i>, <i>or</i>, and <i>but</i>, are effective indicators for judging the polarity of conjoined adjectives. The authors observed that most conjoined adjectives (77.84%) have the same orientation, while conjunctions that use <i>but</i> generally connect adjectives of different orientations. They proposed a log-linear regression model that learns the distributions of conjunction indicators from a training corpus to predict the polarity of conjoined adjectives. Turney and Littman (2003) manually selected seven positive and seven negative words as a polarity lexicon and proposed using pointwise mutual information (PMI) to calculate the polarity of a word. A word has a positive orientation if it tends to co-occur with positive words; otherwise, it has a negative orientation. More recently, Esuli and Sebastiani (2006) developed a lexical resource, called SentiWordNet, which calculates the degrees of objective, positive, and negative sentiments of a synset in WordNet. The authors employed a bootstrap strategy to collect training datasets for the sentiments and trained eight sentiment classifiers to assign sentiment scores to a synset. 
Kanayama and Nasukawa (2006) posited that polar clauses with the same polarity tend to appear successively in contexts. The authors derived the coherent precision and coherent density of a word in a training corpus to predict the word's polarity. Ganapathibhotla and Liu (2008) investigated comparative sentences in product reviews. To identify the polarity of a comparative word (e.g., longer) with a product feature (e.g., battery life), the authors collected phrases that describe the Pros and Cons of products from Epinions.com and proposed one-side association (OSA), which is a variant of PMI. OSA assigns a positive (negative) orientation to the comparative-feature combination if the synonyms of the comparative word and feature tend to co-occur in the Pros (resp. Cons) phrases.</p><p>Our research differs from existing approaches in three respects. First, most works identify the polarity of adjectives and adverbs because these syntactic constructs generally express sentimental semantics. In contrast, our method identifies the polarity of person names. Second, to the best of our knowledge, all existing polarity identification methods require external information sources (e.g., WordNet, manually selected polarity words, or training corpora). However, our method identifies bipolar person names by simply analyzing person name usage patterns in topic documents without using external information. Finally, our method does not require any language constructs, such as conjunctions; hence, it can be applied to different languages.</p></section><section number="3" title="Method"><subsection number="3.1" title="Data Preprocessing"><p>Given a set of topic documents, we first decompose the documents into a set of non-overlapping <i>blocks</i> $B = \{b_1, b_2, \ldots, b_n\}$. A block can be a paragraph or a document, depending on the granularity of PCA sampling. Let $U = \{u_1, u_2, \ldots, u_m\}$ be a set of <i>textual units</i> in $B$. 
In this study, a unit refers to a person name. Then, the document set can be represented as an $m \times n$ <i>unit-block association matrix</i> $A$. A column in $A$, denoted as $b_i$, represents a decomposed block $i$. It is an $m$-dimensional vector whose $j$'th entry, denoted as $b_{i,j}$, is the frequency of $u_j$ in $b_i$. In addition, a row in $A$, denoted as $u_i$, represents a textual unit $i$; it is an $n$-dimensional vector whose $j$'th entry, denoted as $u_{i,j}$, is the frequency of $u_i$ in $b_j$.</p></subsection><subsection number="3.2" title="PCA-based Person Name Bipolarization"><p>Principal component analysis is a well-known statistical method that is used primarily to identify the most important feature pattern in a high-dimensional dataset (Smith, 2002). In our research, it identifies the most important unit pattern in the topic blocks by first constructing an $m \times m$ <i>unit relation matrix</i> $R$, in which the $(i,j)$-entry (denoted as $r_{i,j}$) denotes the correlation coefficient of $u_i$ and $u_j$. The correlation is computed as follows:</p><p>$$r_{i,j} = corr(u_i, u_j) = \frac{\sum_{k=1}^{n} (u_{i,k} - \bar{u}_i)(u_{j,k} - \bar{u}_j)}{\sqrt{\sum_{k=1}^{n} (u_{i,k} - \bar{u}_i)^2} \; \sqrt{\sum_{k=1}^{n} (u_{j,k} - \bar{u}_j)^2}},$$</p><p>where $\bar{u}_i = \frac{1}{n}\sum_{k=1}^{n} u_{i,k}$ and $\bar{u}_j = \frac{1}{n}\sum_{k=1}^{n} u_{j,k}$ are the average frequencies of units $i$ and $j$ respectively.</p><page local="3" global="172"/><p>The range of $r_{i,j}$ is within [-1,1] and the value represents the degree of correlation between $u_i$ and $u_j$ under the decomposed blocks. 
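</p><p>The unit relation matrix defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: NumPy, the function name unit_relation_matrix, and the toy counts are introduced here for the example, and every unit is assumed to have a non-constant frequency across blocks so that no denominator is zero.</p>
<p><![CDATA[
```python
import numpy as np

def unit_relation_matrix(A):
    """Return the m x m matrix R of correlation coefficients between rows of A.

    A is the m x n unit-block association matrix: A[i, k] is the frequency
    of unit i in block k. (Function name is hypothetical.)
    """
    A = np.asarray(A, dtype=float)
    centered = A - A.mean(axis=1, keepdims=True)   # subtract each unit's mean frequency
    norms = np.sqrt((centered ** 2).sum(axis=1))   # one norm per unit, assumed non-zero
    return (centered @ centered.T) / np.outer(norms, norms)

# Toy example with made-up counts: 3 units observed in 4 blocks.
A = np.array([[2, 0, 3, 0],
              [1, 0, 2, 0],
              [0, 3, 0, 2]])
R = unit_relation_matrix(A)
```
]]></p>
<p>In this toy matrix the first two units appear in the same blocks and the third appears only where they are absent, so the first pair is positively correlated and the third unit is negatively correlated with both.</p><p>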
If $r_{i,j} = 0$, we say that $u_i$ and $u_j$ are uncorrelated; that is, the occurrences of unit $u_i$ and unit $u_j$ in the blocks show no linear association. If $r_{i,j} &gt; 0$, we say that units $u_i$ and $u_j$ are positively correlated; that is, $u_i$ and $u_j$ tend either to co-occur in the blocks or to be jointly absent. If $r_{i,j} &lt; 0$, we say that $u_i$ and $u_j$ are negatively correlated; that is, if one unit appears, the other tends not to appear in the same block. Note that if $r_{i,j} \neq 0$, $|r_{i,j}|$ scales the strength of the positive or negative correlation. Moreover, since the correlation coefficient is commutative, $r_{i,j}$ is identical to $r_{j,i}$, so matrix $R$ is symmetric.</p><p>A unit pattern is represented as a vector $v$ of dimension $m$ in which the $i$'th entry $v_i$ indicates the weight of the $i$'th unit in the pattern. Since matrix $R$ depicts the correlation of the units in the topic blocks, for a given $v$, $v^T R v$ measures the variance of the decomposed blocks that the pattern captures. A pattern is important if it captures a large share of that variance. PCA can then identify the most important unit pattern by using the following objective function:</p><p>$$\max_{v} \; v^T R v \quad \text{s.t.} \quad v^T v = 1.$$</p><p>Without any constraint on $v$, the objective function becomes arbitrarily large as the entries of $v$ grow. The constraint $v^T v = 1$ limits the search space to the set of length-normalized vectors. Chen and Chen (2008) show that the desired $v$ for the above constrained optimization problem is the eigenvector of $R$ with the largest eigenvalue. 
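</p><p>The constrained maximization above can be illustrated with a short Python sketch. It assumes NumPy and uses a made-up 4 x 4 relation matrix; the function name bipolarize and the labels p1, p2, q1, q2 are hypothetical. It takes the eigenvector of the largest eigenvalue and splits the units by the signs of its entries, which is how the method uses the principal pattern to form two groups.</p>
<p><![CDATA[
```python
import numpy as np

def bipolarize(R, names):
    """Split names into two groups by the signs of the principal eigenvector of R."""
    # eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(R)
    v = eigenvectors[:, -1]                      # unit-length eigenvector of the largest eigenvalue
    positive = [name for name, x in zip(names, v) if x >= 0]
    negative = [name for name, x in zip(names, v) if x < 0]
    return v, positive, negative

# Made-up relation matrix: units 0-1 and units 2-3 form two opposing groups.
R = np.array([[ 1.0,  0.8, -0.6, -0.5],
              [ 0.8,  1.0, -0.7, -0.4],
              [-0.6, -0.7,  1.0,  0.9],
              [-0.5, -0.4,  0.9,  1.0]])
v, group_a, group_b = bipolarize(R, ["p1", "p2", "q1", "q2"])
```
]]></p>
<p>Because an eigenvector remains an eigenvector when multiplied by -1, which group receives the positive sign is arbitrary; only the partition itself is meaningful.</p><p>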
Furthermore, as $R$ is a symmetric matrix, such an eigenvector always exists (Spence et al., 2000) and the optimization problem is solvable.</p><p>PCA is not the only method that identifies important textual patterns in terms of eigenvectors. For instance, Gong and Liu (2001) and Chen and Chen (2008) utilize the eigenvectors of symmetric matrices to extract salient concepts and salient themes from documents respectively<footnote anchor="1"/>. The difference between PCA and other eigenvector-based approaches lies in the way the unit relation matrix is constructed. PCA calculates $r_{i,j}$ by using the correlation coefficient, whereas the other approaches employ the inner product or cosine formula<footnote anchor="2"/> (Manning et al., 2008) to derive the relationship between textual units. Specifically, the correlation coefficient is identical to the cosine formula if we normalize each unit with its mean:</p><p>$$corr(u_i, u_j) = \frac{\sum_{k=1}^{n} (u_{i,k} - \bar{u}_i)(u_{j,k} - \bar{u}_j)}{\sqrt{\sum_{k=1}^{n} (u_{i,k} - \bar{u}_i)^2} \; \sqrt{\sum_{k=1}^{n} (u_{j,k} - \bar{u}_j)^2}} = \frac{\sum_{k=1}^{n} u^*_{i,k} \, u^*_{j,k}}{\sqrt{\sum_{k=1}^{n} (u^*_{i,k})^2} \; \sqrt{\sum_{k=1}^{n} (u^*_{j,k})^2}} = cosine(u_i^*, u_j^*),$$</p><p>where $u_i^* = u_i - \bar{u}_i [1, 1, \ldots, 1]^T$ and $u_j^* = u_j - \bar{u}_j [1, 1, \ldots, 1]^T$ are the mean-normalized vectors of $u_i$ and $u_j$, respectively. Conceptually, the mean normalization process is the only difference between PCA and other eigenvector-based approaches.</p><p>Since the eigenvectors of a symmetric matrix form an orthonormal basis of $\mathbb{R}^m$, they may contain negative entries (Spence et al., 2000). 
Even though Kleinberg (1999) and Chen and Chen (2008) have shown experimentally that negative entries in an eigenvector are as important as positive entries for describing a certain unit pattern, the meaning of negative entries in their approaches cannot be explained. This is because textual units (e.g., terms, sentences, and documents) in information retrieval are usually characterized by frequency-based metrics, e.g., term frequency, document frequency, or TFIDF (Manning et al., 2008), which can never be negative. In PCA, however, the mean normalization process of the correlation coefficient gives bipolar meaning to positive and negative entries, which helps us partition textual units into bipolar groups in accordance with their signs in $v$.</p><p>1 The right singular vectors of a matrix $A$ used by Gong and Liu (2001) are equivalent to the eigenvectors of the symmetric matrix $A^T A$, whose entries are the inner products of the corresponding columns of $A$.</p><p>2 The inner product is equivalent to the cosine formula when the calculated vectors are length normalized (Manning et al., 2008).</p><page local="4" global="173"/><p>[Figure 1 plot: axes $u_1$ and $u_2$; regions below and above the average after normalization; principal pattern $v = \langle -0.707, 0.707 \rangle$.]</p><figure caption="Figure 1. The effect of the mean normalization process."></figure><p>The synthesized example in Figure 1 illustrates the effect of the normalization process. In this example, we are only interested in textual units $u_1$ and $u_2$; the corpus consists of ten blocks. 
Graphically, each block can be represented as a point in a 2-dimensional vector space. The mean normalization process moves the origin of the 2-dimensional vector space to the centroid of the blocks, which makes negative unit values explainable. A negative unit value of a block in this normalized vector space indicates that the number of occurrences of the unit in the block is less than the unit's average; by contrast, a positive unit value means that the number of occurrences of the unit in the block is above the average. In the figure, the most important unit pattern $v = \langle -0.707, 0.707 \rangle$ calculated by PCA is represented by the dashed line. The signs of the entries of $v$ indicate that the occurrence of $u_1$ will be lower than the average if $u_2$ occurs frequently in a block. In addition, as the signs of the entries in an eigenvector are invertible (Spence et al., 2000), $v$ also implies that if $u_1$ occurs frequently in a block, then the probability that we will observe $u_2$ in the same block will be lower than expected. The instances of bipolar word usage behavior presented in $v$ are consistent with the distribution of the ten blocks. As mentioned in Section 2, Kanayama and Nasukawa (2006) validated that polar text units with the same polarity tend to appear together to make contexts coherent. Consequently, we believe that the signs in PCA's principal eigenvector are effective in partitioning textual units into bipolar groups.</p></subsection><subsection number="3.3" title="Sparseness of Textual Units"><p>A major problem with employing PCA to process textual data is the sparseness of textual units. To illustrate this problem, we collected 411 news documents about the 2009 NBA Finals from Google News and counted the frequency with which each person name occurred in the documents. 
We also evaluate the documents in the experiment section to determine whether the proposed approach is capable of correctly bipolarizing the person names into the teams that played in the finals. We rank the units according to their frequencies and list the frequencies in descending order in Figure 2. The figure shows that the frequency distribution follows Zipf's law (Manning et al., 2008); for most units, the distribution over the blocks will be very sparse.</p><figure caption="Figure 2. The rank-frequency distribution of person names on logarithmic scales (base 10)."></figure><p>We observe that a unit will not occur in a block in the following three scenarios. 1) The polarity of the block is the opposite of the polarity of the unit. For instance, if the unit represents a player in one team and the block narrates information about the other team, the block's author would not mention the unit in the block to ensure that the block's content is coherent. 2) Even if the polarity of a block is identical to that of the unit, the length of the block may not be sufficient to contain the unit. 3) The block is off-topic, so the unit will not appear in the block. In the last two scenarios, the absence of units will impact the estimation of the correlation coefficient. To alleviate the problem, we propose two techniques, the weighted correlation coefficient and off-topic block elimination, which we describe in the following sub-sections.</p><p><b>Weighted Correlation Coefficient</b></p><p>The so-called data sparseness problem in scenario 2 affects many statistical information retrieval and language models (Manning et al., 2008). For units with the same polarity, data sparseness could lead to underestimation of their correlations because the probability that the units will occur together is reduced. 
Conversely, for uncorrelated units or units with opposite polarities, data sparseness may lead to overestimation of their correlations because they are frequently jointly absent in the decomposed blocks.<page local="5" global="174"/> While smoothing approaches, such as Laplace's law (also known as add-one smoothing), have been developed to alleviate data sparseness in language models (Manning et al., 2008), they are not appropriate for PCA. This is because the correlation coefficient of PCA measures the divergence of units from their means, so adding one to the frequency of each unit in every block will not change the divergence. To summarize, data sparseness could influence the correlation coefficient when units do not co-occur. Thus, for two units $u_i$ and $u_j$, we separate $B$ into co-occurring and non-co-occurring parts and apply the following weighted correlation coefficient:</p><p>$$corr_w(u_i, u_j) = \frac{(1-\alpha)\sum_{b \in co(i,j)} (u_{i,b} - \bar{u}_i)(u_{j,b} - \bar{u}_j) + \alpha \sum_{b \in B - co(i,j)} (u_{i,b} - \bar{u}_i)(u_{j,b} - \bar{u}_j)}{\sqrt{(1-\alpha)\sum_{b \in co(i,j)} (u_{i,b} - \bar{u}_i)^2 + \alpha \sum_{b \in B - co(i,j)} (u_{i,b} - \bar{u}_i)^2} \; \sqrt{(1-\alpha)\sum_{b \in co(i,j)} (u_{j,b} - \bar{u}_j)^2 + \alpha \sum_{b \in B - co(i,j)} (u_{j,b} - \bar{u}_j)^2}},$$</p><p>where $corr_w(u_i, u_j)$ represents the weighted correlation coefficient between units $i$ and $j$; and $co(i,j)$ denotes the set of blocks in which units $i$ and $j$ co-occur. The range of parameter $\alpha$ is within [0,1]. 
It weights the influence of non-co-occurring blocks when calculating the correlation coefficient. When $\alpha = 0.5$, the equation is equivalent to the standard correlation coefficient; and when $\alpha = 0$, the equation only considers the blocks in which units $i$ and $j$ co-occur. Conversely, when $\alpha = 1$, only non-co-occurring blocks are employed to calculate the units' correlation. In the experiment section, we will examine the effect of $\alpha$ on bipolar person name identification.</p><p><b>Off-topic Block Elimination</b></p><p>Including off-topic blocks in PCA will lead to overestimation of the correlation between units. This is because units are usually jointly absent from off-topic blocks, which makes uncorrelated or even negatively correlated units appear positively correlated. To eliminate the effect of off-topic blocks on unit bipolarization, we construct a centroid of all the decomposed blocks by averaging the block vectors $b_i$. Then, blocks whose cosine similarity to the centroid is lower than a predefined threshold $\beta$ are excluded from calculation of the correlation coefficient.</p></subsection></section><section number="4" title="Performance Evaluations"><p>In this section, we evaluate two topics with bipolar (or competitive) viewpoints to demonstrate the efficacy of the proposed approach.</p><p><b>4.1 The 2009 NBA Finals</b></p><p>For this experiment, we collected 411 news documents about the 2009 NBA Finals from Google News during the period of the finals (from 2009/06/04 to 2009/06/16). The matchup of the finals was the Lakers versus the Orlando Magic. In this experiment, a block is a topic document, as paragraph tags are not provided in the evaluated documents. First, we parsed the blocks by using Stanford Named Entity Recognizer<footnote anchor="3"/> to extract all possible named entities. 
We observed that the parser sometimes extracted false entities (such as Lakers Kobe) because the words in the headlines were capitalized, which confused the parser. To reduce the effect of false extraction by the parser, we examined the extracted named entities manually. After eliminating false entities, the dataset comprised 546 unique named entities; 538 were person names and the others represented organizations, such as basketball teams and basketball courts. To examine the effect of the weighted correlation coefficient, parameter $\alpha$ is varied from 0 to 1 in increments of 0.1, and the threshold $\beta$ used by off-topic block elimination is set at 0.3. The frequency distribution of the person names, shown in Figure 2, indicates that many of the person names rarely appeared in the examined blocks, so their distribution was too sparse for PCA. Hence, in the following subsections, we sum the frequencies of the 538 person names in the examined blocks and select the $k$ most frequent person names, whose accumulated term frequencies reach 60% of the total frequencies, for evaluation. In other words, the evaluated person names account for 60% of the person name occurrences in the examined blocks.</p></section><section number="3" title="http://nlp.stanford.edu/software/CRF-NER.shtml"><page local="6" global="175"/><p>For each parameter setting, we perform principal component analysis on the examined blocks and the selected entities, and partition the entities into two bipolar groups according to their signs in the principal eigenvector. To evaluate the accuracy rate of bipolarization, we need to label the team of each bipolar group. Then, the accuracy rate is the proportion of the entities in the groups that actually belong to the labeled teams. Team labeling is performed by examining the person names in the larger bipolarization group. 
If the majority of the entities in the group belong to the Lakers (Magic), we label the group as Lakers (Magic) and the other group as Magic (Lakers). If the two bipolar groups are the same size, the group that contains the most Lakers (Magic) entities is labeled as Lakers (Magic), and the other group is labeled as Magic (Lakers). If both groups contain the same number of Lakers (Magic) entities, we randomly assign team labels because all assignments produce the same accuracy score. To the best of our knowledge, there is no similar work on person name bipolarization; therefore, for comparison, we use a baseline method that assigns the same polarity to all the person names.</p><p><u>Magic</u> | <u>Lakers</u></p><p>Dwight Howard 0.0884 | Derek Fisher -0.0105</p><p>Hedo Turkoglu 0.1827 | Kobe Bryant -0.2033</p><p>Jameer Nelson 0.3317 | Lamar Odom -0.1372</p><p>Jeff Van Gundy*+ 0.3749 | LeBron James" -0.0373</p><p>Magic Johnson* 0.3815 | Mark Jackson*" -0.2336</p><p>Rafer Alston 0.3496 | Pau Gasol -0.1858</p><p>Rashard Lewis 0.1861 | Paul Gasol*+ -0.1645</p><p>Stan Van Gundy 0.4035 | Phil Jackson -0.2553</p><table caption="Table 1. The bipolarization results for NBA person names. (α = 0.8 and β = 0.3)"></table><p>Table 1 shows the bipolarization results for frequent person names in the dataset. The parameter $\alpha$ is set at 0.8 because of its superior performance. 
The left-hand column of the table lists the person names labeled as Magic and their entry values in the principal eigenvector; the right-hand column lists the person names labeled as Lakers. It is interesting to note that the evaluated entities contain person names irrelevant to the players in the NBA finals. For instance, the frequency of Magic Johnson, an ex-Lakers player, is high because he constantly spoke in support of the Lakers during the finals. In addition, many documents misspell Pau Gasol as Paul Gasol. Even though the names refer to the same player, the named entity recognizer parses them as distinct entities. We propose two evaluation strategies, called <i>strict evaluation</i> and <i>non-strict evaluation</i>. The strict evaluation strategy treats the person names that do not refer to players or coaches in the finals as false positives. Under the non-strict strategy, the person names that are closely related to Lakers or Magic players, such as a player's relatives or misspellings, are deemed true positives if they are bipolarized into the correct teams. In Table 1, the symbol * indicates that the entity is bipolarized incorrectly. For instance, Magic Johnson is not a member of Magic. The symbol " indicates that the person name is neutral (or irrelevant) to the teams in the finals. In addition, the symbol + indicates that the person name represents a relative of a member of the team he/she is bipolarized to, or that the name is a misspelling that refers to a member of the bipolarized team. This kind of bipolarization is correct under the non-strict evaluation strategy. As shown in Table 1, the proposed method bipolarizes the important persons in the finals correctly without using any external information source. The accuracy rates of strict and non-strict evaluation are 68.8% and 81.3% respectively. The rates are far better than those of the baseline method, which are 37.5% and 43.8% respectively. 
If we ignore the neutral entities, which are always wrong no matter what bipolarization approach is employed, the strict and non-strict accuracies are 78.6% and 92.9% respectively. In the non-strict evaluation, the only error is Magic Johnson, who is mis-bipolarized as Magic. The mistake also reflects a problem with person name resolution when the person names that appear in a document are ambiguous. In our dataset, the word 'Magic' sometimes refers to Magic Johnson and sometimes to Orlando Magic. Here, we do not consider a sophisticated person name resolution scheme; instead, we simply assign the frequency of a person name to all its specific entities (e.g., Magic to Magic Johnson, and Kobe to Kobe Bryant) so that specific person names are frequent enough for PCA. As a result, Magic Johnson tends to co-occur with the members of Magic and is incorrectly bipolarized to the Magic team. Another interesting phenomenon is that LeBron James (a player with the Cavaliers) is incorrectly bipolarized to Lakers. This is because Kobe Bryant (a player with the Lakers) and LeBron James were rivals for the most valuable player (MVP) award in the 2009 NBA season. The documents that mentioned Kobe Bryant during the finals often compared him with LeBron James to attract the attention of readers. As the names often co-occur in the documents, LeBron James was wrongly classified as a member of Lakers.</p><page local="7" global="176"/><p>Figures 3 and 4 illustrate the effects of the weighted correlation coefficient and off-topic block elimination on NBA person name bipolarization. As shown in the figures, eliminating off-topic blocks generally improves the system performance. It is noteworthy that, when off-topic blocks are eliminated, large $\alpha$ values produce good bipolarization performance. 
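</p><p>To make the role of the weighting parameter in this discussion concrete, the weighted correlation coefficient of Section 3.3 can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: NumPy and the function name weighted_corr are introduced here, a block is taken to be co-occurring when both frequencies are positive, the means are taken over all blocks, and the counts are made up.</p>
<p><![CDATA[
```python
import numpy as np

def weighted_corr(ui, uj, alpha):
    """Weighted correlation of two unit frequency vectors (one entry per block).

    Blocks in which both units occur get weight (1 - alpha); all other
    blocks get weight alpha. Means are computed over all blocks (assumption).
    """
    ui = np.asarray(ui, dtype=float)
    uj = np.asarray(uj, dtype=float)
    co = (ui > 0) & (uj > 0)                  # co(i, j): blocks where both units occur
    w = np.where(co, 1.0 - alpha, alpha)      # per-block weight
    di, dj = ui - ui.mean(), uj - uj.mean()   # deviations from the unit means
    numerator = np.sum(w * di * dj)
    denominator = np.sqrt(np.sum(w * di ** 2)) * np.sqrt(np.sum(w * dj ** 2))
    return numerator / denominator

# Made-up counts of two person names over six blocks.
ui = [2, 0, 1, 0, 3, 0]
uj = [1, 0, 0, 2, 2, 0]
standard = np.corrcoef(ui, uj)[0, 1]          # ordinary correlation, for comparison
half = weighted_corr(ui, uj, 0.5)             # weights cancel, so this matches `standard`
high = weighted_corr(ui, uj, 0.8)             # emphasizes non-co-occurring blocks
```
]]></p>
<p>With equal weights the factor 0.5 cancels between numerator and denominator, which is why that setting reduces to the standard coefficient.</p><p>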
As mentioned in Section 3.3, a large $\alpha$ implies that non-co-occurring blocks are important for calculating the correlation between a pair of person names. When off-topic blocks are eliminated, the set of non-co-occurring blocks specifically reveals opposing or jointly-absent relationships between entities. Therefore, the bipolarization performance improves as $\alpha$ increases. Conversely, when off-topic blocks are not eliminated, the set of non-co-occurring blocks will contain off-topic blocks. As both entities in a pair tend to be absent in off-topic blocks, a large $\alpha$ value will lead to overestimation of the correlation between bipolar entities. Consequently, the bipolarization accuracy decreases as $\alpha$ increases. It is also interesting to note that the bipolarization performance decreases as $\alpha$ decreases. We observed that some of the topic documents are recaps of the finals, which tend to mention Magic and Lakers players together. As a small $\alpha$ value makes co-occurrence blocks important, recap-style documents lead to overestimation of the correlation between bipolar entities. Consequently, the bipolarization performance is inferior when $\alpha$ is small.</p><p>[Figure 3 plot: x-axis, $\alpha$ value; series: before off-topic block elimination, after off-topic block elimination, and the baseline method.]</p><p>Figure 3. The effects of the weighted correlation coefficient and off-topic block elimination on NBA person name bipolarization. 
(Strict)</p><p>[Figure 4 plot. Horizontal axis: <i>a </i>value, from 0.1 to 0.9.]</p><p>Figure 4. The effects of the weighted correlation coefficient and off-topic block elimination on NBA person name bipolarization. (Non-strict)</p><subsection number="4.2" title="Taiwan's 2009 Legislative By-Elections"><p>For this experiment, we evaluated Chinese news documents about Taiwan's 2009 legislative by-elections, in which two major parties, the Democratic Progressive Party (DPP) and the Kuomintang (KMT), campaigned for three legislative positions. Since the by-elections were regional, not many news documents were published during the campaign. In total, we collected 89 news documents that were published in The Liberty Times <footnote anchor="4"/> during the election period (from 2009/12/27 to 2010/01/11). Then, we used a Chinese word processing system, called Chinese Knowledge and Information Processing (CKIP)<footnote anchor="5"/>, to extract possible Chinese person names in the documents. Once again, the names were examined manually to remove false extractions. The dataset comprised 175 unique person names. As many of the names only appeared once, we selected for evaluation the <i>k </i>most frequent person names, whose accumulated frequency was at least 60% of the total frequency count of all person names. We calculated the accuracy of person name bipolarization by the same method as in the NBA experiment in order to assess how well the bipolarized groups represented the KMT and the DPP. As none of the selected names were misspelled, we do not show the non-strict accuracy of bipolarization. The threshold <i>β </i>is set at 0.3, and each block is a topic document.</p><p>Table 2 shows the bipolarization results for the frequent person names of the candidates of the respective parties, the party chairpersons, and important party staff members. 
The accuracy rates of the bipolarization and the baseline methods are 70% and 50%, respectively.<page local="8" global="177"/> It is noteworthy that the chairs of the DPP and the KMT, Ing-wen Tsai and Ying-jeou Ma respectively, are correctly bipolarized. We observed that, during the campaign, the chairs repeatedly helped their respective parties' candidates gain support from the public. As the names of the chairs and the candidates often co-occur in the documents, they can be bipolarized accurately. We also found that our approach bipolarized two candidates incorrectly if the competition between them was fierce. For instance, Kun-cheng Lai and Li-chen Kuang campaigned intensively for a single legislative position. As they often commented on each other during the campaign, they tend to co-occur in the topic documents. PCA therefore misclassifies them as positively correlated and incorrectly groups Kun-cheng Lai with the KMT.</p><footnote>http://www.libertytimes.com.tw/index.htm</footnote><footnote>http://ckipsvr.iis.sinica.edu.tw/</footnote><p><u>KMT</u></p><doubt alpha="66.7" length="27" tooSmall="False" monospace="0.0">Kun-cheng Lai* 0.39</doubt><doubt alpha="65.4" length="26" tooSmall="False" monospace="0.0">Li-chen Kuang 0.40</doubt><doubt alpha="60.9" length="23" tooSmall="False" monospace="0.0">Li-ling Chen 0.01</doubt><doubt alpha="56.5" length="23" tooSmall="False" monospace="0.0">Ying-jeou Ma 0.05</doubt><table caption="Table 2. The bipolarization results for the election dataset. 
(a = 0.7)"></table><p>[Figure 5 plot. Series: before off-topic block elimination; after off-topic block elimination; the baseline method.]</p><figure caption="Figure 5. The effects of the weighted correlation coefficient and off-topic block elimination."></figure><p>Figure 5 shows that off-topic block elimination is effective in person name bipolarization. However, the weighted correlation coefficient only improves the bipolarization performance slightly. We have investigated this problem and believe that the evaluated person names in the documents are frequent enough to prevent the data sparseness problem. While the weighted correlation coefficient does not improve the bipolarization performance significantly, the proposed PCA-based approach can still identify the bipolar parties of important persons accurately.</p><p>Unlike the results in the previous section, the accuracy rate in this experiment does not decrease as <i>a </i>decreases. This is because the topic documents generally report news about a single party. As the documents rarely recap the activities of both parties, the co-occurrence blocks accurately reflect the bipolar relationship between the persons. Hence, bipolar person names can be identified effectively even with a small <i>a </i>value.</p><p>The evaluations of the NBA and the election datasets demonstrate that the proposed PCA-based approach identifies bipolar person names in topic documents effectively. As the writing styles of topic documents in different domains vary, the weighted correlation coefficient may not always improve bipolarization performance. 
However, because we eliminate off-topic blocks, a large <i>a </i>value always produces superior bipolarization performance.</p><p><b>Conclusion</b></p><p>In this paper, we have proposed an unsupervised approach for identifying bipolar person names in topic documents. We show that the signs of the entries in the principal eigenvector of PCA can partition person names into bipolar groups spontaneously. In addition, we introduce two techniques, namely the weighted correlation coefficient and off-topic block elimination, to address the data sparseness problem. The experimental results demonstrate that the proposed approach identifies bipolar person names of topics successfully without using any external knowledge; moreover, it is language independent. The results also show that off-topic block elimination along with a large <i>a </i>value for the weighted correlation coefficient generally produces accurate person name bipolarization. In the future, we will integrate text summarization techniques with the proposed bipolarization method to provide users with polarity-based topic summaries. We believe that summarizing important information about different polarities would help users gain a comprehensive knowledge of a topic.</p><p><b>Acknowledgments</b></p><p>The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported in part by NSC 97-2221-E-002-225-MY2.</p><table caption="Table 2. The bipolarization results for the election dataset. 
(a = 0.7)" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"><p>DPP</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"><p>Wen-chin Yu*</p></td><td class="cell"><p>-0.56</p></td></tr><tr class="row"><td class="cell"><p>Den-yih Wu*</p></td><td class="cell"><p>-0.03</p></td></tr><tr class="row"><td class="cell"><p>Chao-tung Chien</p></td><td class="cell"><p>-0.56</p></td></tr><tr class="row"><td class="cell"><p>Ing-wen Tsai</p></td><td class="cell"><p>-0.17</p></td></tr><tr class="row"><td class="cell"><p>Tseng-chang Su</p></td><td class="cell"><p>-0.01</p></td></tr><tr class="row"><td class="cell"><p>Jung-chung Kuo</p></td><td class="cell"><p>-0.01</p></td></tr></table><page local="9" global="178"/></subsection></section><references><p>Chen, Chien Chin and Meng Chang Chen. 2008. TSCAN: a novel method for topic summarization and content anatomy. In <i>Proceedings of the 31st annual international ACM SIGIR Conference on Research and Development in Information Retrieval, </i>pages 579-586.</p><p>Esuli, Andrea and Fabrizio Sebastiani. 2006. SENTIWORDNET: A Publicly Available Lexical Resource for Opinion Mining. In <i>Proceedings of the 5th Conference on Language Resources and Evaluation.</i></p><p>Feng, Ao and James Allan. 2007. Finding and Linking Incidents in News. 
In <i>Proceedings of the sixteenth ACM Conference on information and knowledge management, </i>pages 821-830.</p><p>Ganapathibhotla, Murthy and Bing Liu. 2008. Mining Opinions in Comparative Sentences. In <i>Proceedings of the 22nd International Conference on Computational Linguistics, </i>pages 241-248.</p><p>Gong, Yihong and Xin Liu. 2001. Generic text summarization using relevance measure and latent semantic analysis. In <i>Proceedings of the 24th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, </i>pages 19-25.</p><p>Hatzivassiloglou, Vasileios and Kathleen R. McKeown. 1997. Predicting the Semantic Orientation of Adjectives. In <i>Proceedings of the eighth conference on European chapter of the Association for Computational Linguistics, </i>pages 174-181.</p><p>Kanayama, Hiroshi and Tetsuya Nasukawa. 2006. Fully automatic lexicon expansion for domain-oriented sentiment analysis. In <i>Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, </i>pages 355-363.</p><p>Kleinberg, Jon M. 1999. Authoritative sources in a hyperlinked environment. <i>Journal of the ACM, </i>46(5), pages 604-632.</p><p>Manning, Christopher D., Prabhakar Raghavan and Hinrich Schütze. 2008. <i>Introduction to Information Retrieval. </i>Cambridge University Press.</p><p>Nallapati, Ramesh, Ao Feng, Fuchun Peng and James Allan. 2004. Event Threading within News Topics. In <i>Proceedings of the thirteenth ACM international conference on Information and knowledge management, </i>pages 446-453.</p><p>Smith, Lindsay I. 2002. <i>A Tutorial on Principal Components Analysis. </i>Cornell University.</p><p>Spence, Lawrence E., Arnold J. Insel and Stephen H. Friedberg. 2000. <i>Elementary Linear Algebra, A Matrix Approach. </i>Prentice Hall.</p><p>Turney, Peter D. and Michael L. Littman. 2003. Measuring Praise and Criticism: Inference of Semantic Orientation from Association. 
<i>ACM Transactions on Information Systems (TOIS), </i>pages 315-346.</p></references></body></article>