<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1"/><title>Determining Recurrent Sound Correspondences by Inducing Translation Models</title><author surname="Kondrak" givenname="Grzegorz"><org  name="University of Toronto" country="Canada" city="Toronto"/></author></firstpageheader><frontmatter><p><b>Determining Recurrent Sound Correspondences by Inducing Translation Models</b></p><p><b>Grzegorz Kondrak</b></p><p>Department of Computer Science</p><p>University of Toronto, Toronto, Ontario, Canada M5S 3G4</p></frontmatter><abstract>I present a novel approach to the determination of recurrent sound correspondences in bilingual wordlists. The idea is to relate correspondences between sounds in wordlists to translational equivalences between words in bitexts (bilingual corpora). My method induces models of sound correspondence that are similar to models developed for statistical machine translation. The experiments show that the method is able to determine recurrent sound correspondences in bilingual wordlists in which less than 30% of the pairs are cognates. By employing the discovered correspondences, the method can identify cognates with higher accuracy than the previously reported algorithms. </abstract></header><body><section number="1" title="Introduction"><p>Genetically related languages often exhibit recurrent sound correspondences (henceforth referred to simply as correspondences) in words with similar meaning. For example, <i>t:d, θ:t, n:n, </i>and other known correspondences between English and Latin are demonstrated by the word pairs in Table 1. Word pairs that contain such correspondences are called <i>cognates, </i>because they originate from the same protoform in the ancestor language. 
Correspondences in cognates are preserved over time thanks to the regularity of sound changes, which normally apply to sounds in a given phonological context across all words in the language.</p><p>The determination of correspondences is the principal step of the comparative method of language reconstruction. Not only does it provide evidence for the relatedness of languages, but it also makes it possible to distinguish cognates from loan words and chance resemblances. However, because manual determination of correspondences is an extremely time-consuming process, it has yet to be accomplished for many proposed language families. A system able to perform this task automatically from unprocessed bilingual wordlists could be of great assistance to historical linguists. The <i>Reconstruction Engine </i>(Lowe and Mazaudon, 1994), a set of programs designed to be an aid in language reconstruction, requires a set of correspondences to be provided beforehand.</p><p>Table 1: Examples of English-Latin cognates exhibiting correspondences. The corresponding phonemes shown in boldface originate from a single proto-phoneme.</p><p>The determination of correspondences is closely related to another task that has been much studied in computational linguistics, the identification of cognates. Cognates have been employed for sentence and word alignment in bitexts (Simard et al., 1992; Melamed, 1999), improving statistical machine translation models (Al-Onaizan et al., 1999), and inducing translation lexicons (Koehn and Knight, 2001). Some of the proposed cognate identification algorithms implicitly determine and employ correspondences (Tiedemann, 1999; Mann and Yarowsky, 2001).</p><p>Although it may not be immediately apparent, there is a strong similarity between the task of matching phonetic segments in a pair of cognate words, and the task of matching words in two sentences that are mutual translations (Figure 1). 
The consistency with which a word in one language is translated into a word in another language is mirrored by the consistency of sound correspondences.<page local="2"/> The former is due to the semantic relation of synonymy, while the latter follows from the principle of the regularity of sound change. Thus, as already asserted by Guy (1994), it should be possible to use similar techniques for both tasks.</p><table class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>English</p></td><td class="cell"><p>Latin</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>t</b>n</p></td><td class="cell"><p><b>d</b>eke</p></td><td class="cell"><p>'ten'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>t</b>u</p></td><td class="cell"><p><b>d</b>uo</p></td><td class="cell"><p>'two'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>i<b>t</b></p></td><td class="cell"><p>e<b>d</b></p></td><td class="cell"><p>'eat'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>t</b>u<b>θ</b></p></td><td class="cell"><p><b>d</b>en<b>t</b></p></td><td class="cell"><p>'tooth'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>n</b>es<b>t</b></p></td><td class="cell"><p><b>n</b>i<b>d</b></p></td><td class="cell"><p>'nest'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>n</b>i</p></td><td class="cell"><p>ge<b>n</b></p></td><td class="cell"><p>'knee'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>n</b>e<b>f</b>ju</p></td><td 
class="cell"><p><b>n</b>e<b>p</b>ot</p></td><td class="cell"><p>'nephew'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>f</b>u<b>t</b></p></td><td class="cell"><p><b>p</b>e<b>d</b></p></td><td class="cell"><p>'foot'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>f</b>o<b>m</b></p></td><td class="cell"><p>s<b>p</b>u<b>m</b></p></td><td class="cell"><p>'foam'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>wu<b>l</b><b>f</b></p></td><td class="cell"><p><b>l</b>u<b>p</b></p></td><td class="cell"><p>'wolf'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><doubt alpha="51.4" length="37" tooSmall="False" monospace="0.0">Snow    lies       on      the ground</doubt><doubt alpha="53.6" length="28" tooSmall="False" monospace="0.0">Nix     iacet       in terra</doubt><doubt alpha="100.0" length="4" tooSmall="False" monospace="0.0">wulf</doubt><doubt alpha="100.0" length="3" tooSmall="False" monospace="0.0">lup</doubt><figure caption="Figure 1: The similarity of word alignment in bitexts and phoneme alignment between cognates."></figure><p>The primary objective of the method proposed in this paper is the automatic determination of correspondences in bilingual wordlists, such as the one in Table 1. The method exploits the idea of relating correspondences in bilingual wordlists to translational equivalence associations in bitexts through the employment of models developed in the context of statistical machine translation. The second task addressed in this paper is the identification of cognates on the basis of the discovered correspondences. 
The experiments to be described in Section 6 show that the method is capable of determining correspondences in bilingual wordlists in which less than 30% of the pairs are cognates, and outperforms comparable algorithms on cognate identification. Although the experiments focus on bilingual wordlists, the approach presented in this paper could potentially be applied to other bitext-related tasks.</p></section><section number="2" title="Related work"><p>In a schematic description of the comparative method, the two steps that precede the determination of correspondences are the identification of cognate pairs (Kondrak, 2001), and their phonetic alignment (Kondrak, 2000). Indeed, if a comprehensive set of correctly aligned cognate pairs is available, the correspondences can be extracted by simply following the alignment links. Unfortunately, in order to make reliable judgments of cognation, it is necessary to know in advance what the correspondences are. Historical linguists solve this apparent circularity by guessing a small number of likely cognates and refining the set of correspondences and cognates in an iterative fashion.</p><p>Guy (1994) outlines an algorithm for identifying cognates in bilingual wordlists which is based on correspondences. The algorithm estimates the probability of phoneme correspondences by employing a variant of the χ² statistic on a contingency table, which indicates how often two phonemes co-occur in words of the same meaning. The probabilities are then converted into estimates of cognation by means of some experimentation-based heuristics. The paper does not contain any evaluation on authentic language data, but Guy's program COGNATE, which implements the algorithm, is publicly available. 
An experimental evaluation of COGNATE is described in Section 6.</p><p>Oakes (2000) describes a set of programs that together perform several steps of the comparative method, from the determination of correspondences in wordlists to the actual reconstruction of the protoforms. Word pairs are considered cognate if their edit distance is below a certain threshold. The edit operations cover a number of sound-change categories. Sound correspondences are deemed to be regular if they are found to occur more than once in the data. The paper describes experimental results of running the programs on a set of wordlists representing four Indonesian languages, and compares those to the reconstructions found in the linguistic literature. Section 6 contains an evaluation of one of the programs in the set, JAKARTA, on the cognate identification task.</p></section><section number="3" title="Models of translational equivalence"><p>In statistical machine translation, a translation model approximates the probability that two sentences are mutual translations by computing the product of the probabilities that each word in the target sentence is a translation of some source language word. A model of translational equivalence that determines the word translation probabilities can be <i>induced </i>from bitexts. The difficulty lies in the fact that the mapping, or alignment, of words between two parts of a bitext is not known in advance.</p><p>Algorithms for word alignment in bitexts aim at discovering word pairs that are mutual translations. 
A straightforward approach is to estimate the likelihood that words are mutual translations by computing a similarity function based on a co-occurrence statistic, such as mutual information, Dice coefficient, or the χ² test.<page local="3"/> The underlying assumption is that the association scores for different word pairs are independent of each other.</p><p>Melamed (2000) shows that the assumption of independence leads to invalid word associations, and proposes an algorithm for inducing models of translational equivalence that outperform the models that are based solely on co-occurrence counts. His models employ the <i>one-to-one </i>assumption, which formalizes the observation that most words in bitexts are translated to a single word in the corresponding sentence. The algorithm, which is related to the expectation-maximization (EM) algorithm, iteratively re-estimates the <i>likelihood scores </i>which represent the probability that two word types are mutual translations. In the first step, the scores are initialized according to the <i>G² </i>statistic (Dunning, 1993). Next, the likelihood scores are used to induce a set of one-to-one <i>links </i>between word tokens in the bitext. The links are determined by a greedy <i>competitive linking </i>algorithm, which proceeds to link pairs that have the highest likelihood scores. After the linking is completed, the link counts are used to re-estimate the likelihood scores, which in turn are applied to find a new set of links. The process is repeated until the translation model converges to the desired degree.</p><p>Melamed presents three translation-model estimation methods. Method A re-estimates the likelihood scores as the logarithm of the probability of jointly generating the pair of words <i>u </i>and <i>v:</i></p><p>$$\mathit{score}_A(u,v) = \log \frac{\mathit{links}(u,v)}{\sum_{u',v'} \mathit{links}(u',v')}$$</p><p>where <i>links</i>(<i>u</i>, <i>v</i>) denotes the number of links induced between <i>u </i>and <i>v</i>. Note that the co-occurrence counts of <i>u </i>and <i>v </i>are not used for the re-estimation. In Method B, an explicit noise model with auxiliary parameters $\lambda^{+}$ and $\lambda^{-}$ is constructed in order to improve the estimation of likelihood scores. $\lambda^{+}$ is a probability that a link is induced between two co-occurring words that are mutual translations, while $\lambda^{-}$ is a probability that a link is induced between two co-occurring words that are not mutual translations. Ideally, $\lambda^{+}$ should be close to one and $\lambda^{-}$ should be close to zero. The actual values of the two parameters are calculated by the maximum likelihood estimation. Let <i>cooc</i>(<i>u</i>, <i>v</i>) be the number of co-occurrences of <i>u </i>and <i>v</i>. The <i>score </i>function is defined as:</p><p>$$\mathit{score}_B(u,v) = \log \frac{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda^{+})}{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda^{-})}$$</p><p>where <i>B</i>(<i>k</i> | <i>n</i>, <i>p</i>) denotes the probability of <i>k </i>being generated from a binomial distribution with parameters <i>n </i>and <i>p.</i></p><p>In Method C, bitext tokens are divided into classes, such as content words, function words, punctuation, etc., with the aim of producing more accurate translation models. 
The auxiliary parameters are estimated separately for each class.</p><p>$$\mathit{score}_C(u,v \mid Z = \mathit{class}(u,v)) = \log \frac{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda_Z^{+})}{B(\mathit{links}(u,v) \mid \mathit{cooc}(u,v), \lambda_Z^{-})}$$</p></section><section number="4" title="Models of sound correspondence"><p>Thanks to its generality and symmetry, Melamed's parameter estimation process can be adapted to the problem of determining correspondences. The main idea is to induce a model of sound correspondence in a bilingual wordlist, in the same way as one induces a model of translational equivalence among words in a parallel corpus. After the model has converged, phoneme pairs with the highest likelihood scores represent the most likely correspondences.</p><p>While there are strong similarities between the task of estimating translational equivalence of words and the task of determining recurrent correspondences of sounds, a number of important modifications to Melamed's original algorithm are necessary in order to make it applicable to the latter task. The modifications include the method of finding a good alignment, the handling of null links, and the method of computing the alignment score.</p><p>For the task at hand, I employ a different method of aligning the segments in two corresponding sequences. In sentence translation, the alignment links frequently cross and it is not unusual for two words in different parts of sentences to correspond. In contrast, the processes that lead to link intersection in diachronic phonology, such as <i>metathesis, </i>are quite sporadic. 
The introduction of the <i>no-crossing-links constraint </i>on alignments not only leads to a dramatic reduction of the search space, but also makes it possible to replace the approximate competitive-linking algorithm of Melamed with a variant of the well-known dynamic programming algorithm (Wagner and Fischer, 1974; Kondrak, 2000), which computes the <i>optimal </i>alignment between two strings in polynomial time.<page local="4"/></p><p><i>Null links </i>in statistical machine translation are induced for words on one side of the bitext that have no clear counterparts on the other side of the bitext. Melamed's algorithm explicitly calculates the likelihood scores of null links for every word type occurring in a bitext. In diachronic phonology, phonological processes that lead to insertion or deletion of segments usually operate on individual words rather than on particular sounds across the language. Therefore, I model insertion and deletion by employing a constant <i>indel </i>penalty for unlinked segments.</p><p>The alignment score between two words is computed by summing the number of induced links, and applying an indel penalty for each unlinked segment, with the exception of the segments beyond the rightmost link. The exception reflects the relative instability of word endings in the course of linguistic evolution. In order to avoid inducing links that are unlikely to represent recurrent sound correspondences, only pairs whose likelihood scores exceed a set threshold are linked. All correspondences above the threshold are considered to be equally valid. 
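</p><p>The scoring scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CORDI's actual C++ implementation: the function name, the representation of words as strings of one-character phonemes, the set of allowed pairs standing in for the correspondences above the threshold, and the convention that an alignment with no links scores zero are all hypothetical choices made for this example.</p><p>
```python
def alignment_score(word_a, word_b, allowed, d=0.15):
    """Best no-crossing alignment score between two phoneme sequences.

    Each link between an allowed phoneme pair adds 1; each unlinked segment
    costs d, except segments beyond the rightmost link, which are free.
    All names here are hypothetical; this is not CORDI's actual code.
    """
    n, m = len(word_a), len(word_b)
    # best[i][j]: best score for the prefixes word_a[:i] and word_b[:j],
    # with every unlinked segment in those prefixes penalized.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -d * i
    for j in range(1, m + 1):
        best[0][j] = -d * j
    top = 0.0  # assumed score of an alignment that contains no links
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [best[i - 1][j] - d, best[i][j - 1] - d]
            if (word_a[i - 1], word_b[j - 1]) in allowed:
                linked = best[i - 1][j - 1] + 1.0
                candidates.append(linked)
                # Stopping right after a link leaves the trailing segments free.
                top = max(top, linked)
            best[i][j] = max(candidates)
    return top


# English /fut/ and Latin /ped/ with the correspondences f:p and t:d allowed:
# two links minus two indel penalties for the unlinked vowels, about 1.7.
print(alignment_score("fut", "ped", {("f", "p"), ("t", "d")}))
```
</p><p>The sketch returns only the best score; recovering the links themselves and handling several equally good alignments are left out.</p><p>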
In the cases where more than one best alignment is found, each link is assigned a weight that is its average over the entire set of best alignments (for example, a link present in only one of two competing alignments receives the weight of 0.5).</p></section><section number="5" title="Implementation"><p>The method described above has been implemented as a C++ program, named CORDI, which will soon be made publicly available. The program takes as input a bilingual wordlist and produces an ordered list of correspondences. A model for a 200-pair list usually converges after 3-5 iterations, which takes only a few seconds on a Sparc workstation. The user can choose between methods A, B, and C, described in Section 3, and an additional Method D. In Method C, phonemes are divided into two classes: non-syllabic (consonants and glides), and syllabic (vowels); links between phonemes belonging to different classes are not induced. Method D differs from Method C in that the syllabic phonemes do not participate in any links.</p><p>Adjustable parameters include the indel penalty ratio <i>d </i>and the minimum-strength correspondence threshold <i>t</i>. The parameter <i>d </i>fixes the ratio between the negative indel weight and the positive weight assigned to every induced link. (A lower ratio causes the program to be more adventurous in positing sparse links.) The parameter <i>t </i>controls the tradeoff between reliability and the number of links. In Method A, the value of <i>t </i>is the minimum number of phoneme links that have to be induced for the correspondence to be valid. In methods B, C, and D, the value of <i>t </i>implies a likelihood score threshold of $t \cdot \log \frac{\lambda^{+}}{\lambda^{-}}$, which is a score achieved by a pair of phonemes that have <i>t </i>links out of <i>t </i>co-occurrences. In the experiments reported in Section 6, <i>d </i>was set to 0.15, and <i>t </i>was set to 1 (sufficient to reject all non-recurring correspondences). 
In Method D, where the lack of vowel links causes the linking constraints to be weaker, a higher value of <i>t </i>= 3 was used. These parameter values were optimized on the development set described below.</p></section><section number="6" title="Evaluation"><subsection number="6.1" title="The data for experiments"><p>The experiments in this section were performed using a well-known list of 200 basic meanings that are considered universal and relatively resistant to lexical replacement (Swadesh, 1952). The Swadesh 200-word lists are widely used in linguistics and have been compiled for a large number of languages.</p><p>The development set consisted of three 200-word list pairs adapted from the Comparative Indoeuropean Data Corpus (Dyen et al., 1992). The corpus contains the 200-word lists for a number of Indoeuropean languages together with cognation judgments made by the renowned historical linguist Isidore Dyen. Unfortunately, the words are represented in the Roman alphabet without any diacritical marks, which makes them unsuitable for automatic phonetic analysis. The Polish-Russian, Spanish-Romanian, and Italian-Serbocroatian pairs were selected because they represent three different levels of relatedness (73.5%, 58.5%, and 25.3% of cognate pairs, respectively), and also because they have relatively transparent grapheme-to-phoneme conversion rules. 
They were transcribed into a phonetic notation by means of <i>Perl </i>scripts and then stemmed and corrected manually.</p><p>The test set consisted of five 200-word lists representing English, German, French, Latin, and Albanian, compiled by Kessler (2001). As the lists contain rich phonetic and morphological information, the stemmed forms were automatically converted from the XML format with virtually no extra processing.<page local="5"/> The word pairs classified by Kessler as doubtful cognates were assumed to be unrelated.</p></subsection><subsection number="6.2" title="Determination of correspondences in word pairs"><p>Experiments show that CORDI has little difficulty in determining correspondences given a set of cognate pairs (Kondrak, 2002). However, the assumption that a set of identified cognates is already available as the input for the program is not very plausible. The very existence of a reliable set of cognate pairs implies that the languages in question have already been thoroughly analyzed and that the sound correspondences are known. A more realistic input requirement is a list of word pairs from two languages such that the corresponding words have the same, well-defined meaning. Determining correspondences in a list of synonyms is clearly a more challenging task than extracting them from a list of reliable cognates because the non-cognate pairs introduce noise into the data. Note that Melamed's original algorithm is designed to operate on aligned sentences that are guaranteed to be mutual translations.</p><p>In order to test CORDI's ability to determine correspondences in noisy data, Method D was applied to the 200-word lists for English and Latin. Only 29% of word pairs are actually cognate; the remaining 71% of the pairs are unrelated lexemes. The top ten correspondences discovered by the program are shown in Table 2. Remarkably, all but one are valid. 
In contrast, only four of the top ten phoneme matchings picked up by the χ² statistic are valid correspondences (the validity judgements are my own).</p></subsection><subsection number="6.3" title="Identification of cognates in word pairs"><p>The quality of correspondences produced by CORDI is difficult to validate, quantify, and compare with the results of alternative approaches. However, it is possible to evaluate the correspondences indirectly by using them to identify cognates. The likelihood of cognation of a pair of words increases with the number of correspondences that they contain. Since CORDI explicitly posits correspondence links between words, the likelihood of cognation can be estimated by simply dividing the number of induced links by the length of the words that are being compared. A minimum-length parameter can be set in order to avoid computing cognation estimates for very short words, which tend to be unreliable.</p><p>The evaluation method for cognate identification algorithms adopted in this section is to apply them to a bilingual wordlist and order the pairs according to their scores (refer to Table 3). The ranking is then evaluated against a gold standard by computing the <i>n</i>-point average precision, a generalization of the 11-point average precision, where <i>n </i>is the total number of cognate pairs in the list. The <i>n</i>-point average precision is obtained by taking the average of <i>n </i>precision values that are calculated for each point in the list where we find a cognate pair: $p_i = i / r_i$, $i = 1, \ldots, n$, where <i>i </i>is the number of the cognate pair counting from the top of the list produced by the algorithm, and $r_i$ is the rank of this cognate pair among all word pairs. The <i>n</i>-point precision of the ranking in Table 3 is (1.0 + 0.66)/2 = 0.83. 
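</p><p>As a hedged illustration of the measure defined above, the following Python sketch computes the n-point average precision of a ranked list. The function name and the list-of-booleans input format are assumptions made for this example, as is the choice to return zero for a list without cognates; none of them comes from the original evaluation setup.</p><p>
```python
def n_point_average_precision(is_cognate):
    """Average the precision values taken at each cognate pair in a ranking.

    is_cognate: list of booleans ordered from the top-ranked pair down
    (hypothetical input format chosen for this sketch).
    """
    precisions = []
    found = 0  # i: number of cognate pairs seen so far
    for rank, flag in enumerate(is_cognate, start=1):  # rank is r_i
        if flag:
            found += 1
            precisions.append(found / rank)  # p_i = i / r_i
    if not precisions:
        return 0.0  # no cognates in the list: set to 0 here by assumption
    return sum(precisions) / len(precisions)


# Ranking of Table 3: cognate, non-cognate, cognate.
table3_ranking = [True, False, True]
print(round(n_point_average_precision(table3_ranking), 2))
```
</p><p>For the three-pair ranking of Table 3 the sketch averages 1/1 and 2/3, which gives approximately 0.83.</p><p>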
The expected <i>n</i>-point precision of a program that randomly orders word pairs is close to the proportion of cognate pairs in the list.</p><table caption="Table 3: An example ranking of cognate pairs." class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>$r_i$</p></td><td class="cell"><p>word pair</p></td><td class="cell"><p>cognate?</p></td><td class="cell"><p><i>i</i></p></td><td class="cell"><p>$p_i$</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>1</p></td><td class="cell"><p>/hart/:/kord/</p></td><td class="cell"><p>yes</p></td><td class="cell"><p>1</p></td><td class="cell"><p>1.00</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>2</p></td><td class="cell"><p>/hat/:/kalid/</p></td><td class="cell"><p>no</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>3</p></td><td class="cell"><p>/sno/:/niw/</p></td><td class="cell"><p>yes</p></td><td class="cell"><p>2</p></td><td class="cell"><p>0.66</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 2: English-Latin correspondences discovered by CORDI in noisy synonym data." 
class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><i>cooc</i></p></td><td class="cell"><p><i>links</i></p></td><td class="cell"><p><i>score</i></p></td><td class="cell"><p>valid</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>r:r</p></td><td class="cell"><p>26</p></td><td class="cell"><p>24</p></td><td class="cell"><p>158.7</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>n:n</p></td><td class="cell"><p>24</p></td><td class="cell"><p>23</p></td><td class="cell"><p>154.2</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>t:d</p></td><td class="cell"><p>18</p></td><td class="cell"><p>18</p></td><td class="cell"><p>122.4</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>k:k</p></td><td class="cell"><p>12</p></td><td class="cell"><p>11</p></td><td class="cell"><p>72.5</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>s:s</p></td><td class="cell"><p>11</p></td><td class="cell"><p>10</p></td><td class="cell"><p>65.7</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>f:p</p></td><td class="cell"><p>9</p></td><td class="cell"><p>9</p></td><td class="cell"><p>61.2</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>m:m</p></td><td class="cell"><p>10</p></td><td class="cell"><p>9</p></td><td 
class="cell"><p>58.9</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>d:t</p></td><td class="cell"><p>10</p></td><td class="cell"><p>8</p></td><td class="cell"><p>49.8</p></td><td class="cell"><p>no</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>l:l</p></td><td class="cell"><p>14</p></td><td class="cell"><p>9</p></td><td class="cell"><p>49.7</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>h:k</p></td><td class="cell"><p>7</p></td><td class="cell"><p>7</p></td><td class="cell"><p>47.6</p></td><td class="cell"><p>yes</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 4: Average cognate identification precision on the development set for various methods." 
class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Languages</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>Method</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>A</p></td><td class="cell"><p>B</p></td><td class="cell"><p>C</p></td><td class="cell"><p>D</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Polish</p></td><td class="cell"><p>Russian</p></td><td class="cell"><p>.989</p></td><td class="cell"><p>.994</p></td><td class="cell"><p>.994</p></td><td class="cell"><p>.986</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Romanian</p></td><td class="cell"><p>Spanish</p></td><td class="cell"><p>.898</p></td><td class="cell"><p>.948</p></td><td class="cell"><p>.948</p></td><td class="cell"><p>.875</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Italian</p></td><td class="cell"><p>Serbocr.</p></td><td class="cell"><p>.499</p></td><td class="cell"><p>.455</p></td><td class="cell"><p>.527</p></td><td class="cell"><p>.615</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><page local="6"/><p>Table 4 compares the average precision achieved by methods A, B, C, and D on the development set. 
The cognation judgments from the Comparative Indoeuropean Data Corpus served as the gold standard.</p><p>All four methods proposed in this paper as well as other cognate identification programs were uniformly applied to the test set representing five Indoeuropean languages. Apart from the English-German and the French-Latin pairs, all remaining language pairs are quite challenging for a cognate identification program. In many cases, the gold-standard cognate judgments distill the findings of decades of linguistic research. In fact, for some of those pairs, Kessler finds it difficult to show by statistical techniques that the surface regularities are unlikely to be due to chance. Nevertheless, in order to avoid making subjective choices, CORDI was evaluated on all possible language pairs in Kessler's set.</p><p>Two programs mentioned in Section 2, COGNATE and JAKARTA, were also applied to the test set. The source code of JAKARTA was obtained directly from the author and slightly modified according to his instructions in order to make it recognize additional phonemes. Word pairs were ordered according to the confidence scores in the case of COGNATE, and according to the edit distances in the case of JAKARTA. Since the other two programs do not impose any length constraints on words, the minimum-length parameter was not used in the experiments described here.</p><p>The results on the test set are shown in Table 5. The best result for each language pair is underlined.</p><p>The performance of COGNATE and JAKARTA is quite similar, even though they represent two radically different approaches to cognate identification. On average, methods B, C, and D outperform both comparison programs. On closely related languages, Method B, with its relatively unconstrained linking, achieves the highest precision. Method D, which considers only consonants, is the best on fairly remote languages, where vowel correspondences tend to be weak. 
The only exception is the extremely difficult Albanian-English pair, where the relative ordering of methods seems to be accidental. As expected, Method A is outperformed by methods that employ an explicit noise model. However, in spite of its extra complexity, Method C is not consistently better than Method B, perhaps because of its inability to detect important vowel-consonant correspondences, such as the ones between French nasal vowels and Latin /n/.</p></subsection></section><section number="7" title="Conclusions and future work"><p>I have presented a novel approach to the determination of correspondences in bilingual wordlists. The results of experiments indicate that the approach is robust enough to handle a substantial amount of noise that is introduced by unrelated word pairs. CORDI does well even when the number of non-cognate pairs is more than double the number of cognate pairs. When tested on the cognate-identification task, CORDI achieves substantially higher precision than comparable programs. The correspondences are explicitly posited, which means that, unlike in some statistical approaches, they can be verified by examining individual cognate pairs. In contrast with approaches that assume a rigid alignment based on the syllabic structure, the models presented here can link phonemes in any word position.<page local="7"/></p><table caption="Table 5: Average cognate identification precision on the test set for various methods." 
class="main" frame="box" rules="all" border="0" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Languages</p></td><td class="cell"><p></p></td><td class="cell"><p>Proportion</p></td><td class="cell"><p>COGNATE</p></td><td class="cell"><p>JAKARTA</p></td><td class="cell"><p>Method</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>of cognates</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>A</p></td><td class="cell"><p>B</p></td><td class="cell"><p>C</p></td><td class="cell"><p>D</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>English</p></td><td class="cell"><p>German</p></td><td class="cell"><p>.590</p></td><td class="cell"><p>.878</p></td><td class="cell"><p>.888</p></td><td class="cell"><p>.936</p></td><td class="cell"><p>.957</p></td><td class="cell"><p>.952</p></td><td class="cell"><p>.950</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>French</p></td><td class="cell"><p>Latin</p></td><td class="cell"><p>.560</p></td><td class="cell"><p>.867</p></td><td class="cell"><p>.787</p></td><td class="cell"><p>.843</p></td><td class="cell"><p>.914</p></td><td class="cell"><p>.838</p></td><td class="cell"><p>.866</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>English</p></td><td class="cell"><p>Latin</p></td><td class="cell"><p>.290</p></td><td class="cell"><p>.590</p></td><td class="cell"><p>.447</p></td><td class="cell"><p>.584</p></td><td class="cell"><p>.641</p></td><td 
class="cell"><p>.749</p></td><td class="cell"><p>.853</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>German</p></td><td class="cell"><p>Latin</p></td><td class="cell"><p>.290</p></td><td class="cell"><p>.532</p></td><td class="cell"><p>.518</p></td><td class="cell"><p>.617</p></td><td class="cell"><p>.723</p></td><td class="cell"><p>.736</p></td><td class="cell"><p>.857</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>English</p></td><td class="cell"><p>French</p></td><td class="cell"><p>.275</p></td><td class="cell"><p>.324</p></td><td class="cell"><p>.411</p></td><td class="cell"><p>.482</p></td><td class="cell"><p>.528</p></td><td class="cell"><p>.545</p></td><td class="cell"><p>.559</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>French</p></td><td class="cell"><p>German</p></td><td class="cell"><p>.245</p></td><td class="cell"><p>.390</p></td><td class="cell"><p>.406</p></td><td class="cell"><p>.347</p></td><td class="cell"><p>.502</p></td><td class="cell"><p>.487</p></td><td class="cell"><p>.528</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Albanian</p></td><td class="cell"><p>Latin</p></td><td class="cell"><p>.195</p></td><td class="cell"><p>.449</p></td><td class="cell"><p>.455</p></td><td class="cell"><p>.403</p></td><td class="cell"><p>.432</p></td><td class="cell"><p>.568</p></td><td class="cell"><p>.606</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Albanian</p></td><td class="cell"><p>French</p></td><td class="cell"><p>.165</p></td><td class="cell"><p>.306</p></td><td class="cell"><p>.432</p></td><td class="cell"><p>.249</p></td><td class="cell"><p>.292</p></td><td class="cell"><p>.319</p></td><td class="cell"><p>.437</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td 
class="cell"><p>Albanian</p></td><td class="cell"><p>German</p></td><td class="cell"><p>.125</p></td><td class="cell"><p>.277</p></td><td class="cell"><p>.248</p></td><td class="cell"><p>.156</p></td><td class="cell"><p>.177</p></td><td class="cell"><p>.154</p></td><td class="cell"><p>.312</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Albanian</p></td><td class="cell"><p>English</p></td><td class="cell"><p>.100</p></td><td class="cell"><p>.225</p></td><td class="cell"><p>.227</p></td><td class="cell"><p>.302</p></td><td class="cell"><p>.373</p></td><td class="cell"><p>.319</p></td><td class="cell"><p>.196</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Average</b></p></td><td class="cell"><p></p></td><td class="cell"><p><b>.283</b></p></td><td class="cell"><p><b>.484</b></p></td><td class="cell"><p><b>.482</b></p></td><td class="cell"><p><b>.492</b></p></td><td class="cell"><p><b>.554</b></p></td><td class="cell"><p><b>.567</b></p></td><td class="cell"><p><b>.616</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Currently, I am working on the incorporation of complex correspondences into the cognate identification algorithm by employing Melamed's (1997) algorithm for discovering non-compositional compounds in parallel data. Such an extension would overcome the limitation of the one-to-one model, in which links are induced only between individual phonemes. 
Other possible extensions include taking into account the phonological context of correspondences, combining the correspondence-based approach with phonetic-based approaches, and identifying correspondences and cognates directly in dictionary-type data.</p><p>The results presented here prove that the techniques developed in the context of statistical machine translation can be successfully applied to a problem in diachronic phonology. The transfer of methods and insights should also be possible in the other direction.</p></section><section title="Acknowledgments"><p>Thanks to Graeme Hirst, Radford Neal, and Suzanne Stevenson for helpful comments, to Michael Oakes for assistance with JAKARTA, and to Gemma Enriquez for helping with the experimental evaluation of COGNATE. This research was supported by the Natural Sciences and Engineering Research Council of Canada.</p></section><references><p>Y. Al-Onaizan, J. Curin, M. Jahr, K. Knight, J. Lafferty, D. Melamed, F. Och, D. Purdy, N. Smith, and D. Yarowsky. 1999. Statistical machine translation. Technical report, Johns Hopkins University.</p><p>Ted Dunning. 1993. Accurate methods for the statistics of surprise and coincidence. <i>Computational Linguistics, </i>19(1):61-74.</p><p>Isidore Dyen, Joseph B. Kruskal, and Paul Black. 1992. An Indoeuropean classification: A lexicostatistical experiment. <i>Transactions of the American Philosophical Society, </i>82(5). Word lists available at http://www.ldc.upenn.edu/ldc/service/comp-ie.</p><p>Jacques B. M. Guy. 1994. An algorithm for identifying cognates in bilingual wordlists and its applicability to machine translation. <i>Journal of Quantitative Linguistics, </i>1(1):35-42. MS-DOS executable available at http://garbo.uwasa.fi.</p><p>Brett Kessler. 2001. <i>The Significance of Word Lists. </i>Stanford: CSLI Publications. Word lists available at http://spell.psychology.wayne.edu/ bkessler.</p><p>Philipp Koehn and Kevin Knight. 2001. 
Knowledge sources for word-level translation models. In <i>Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing, </i>pages 27-35.</p><p>Grzegorz Kondrak. 2000. A new algorithm for the alignment of phonetic sequences. In <i>Proceedings of NAACL 2000: 1st Meeting of the North American Chapter of the Association for Computational Linguistics, </i>pages 288-295.</p><p>Grzegorz Kondrak. 2001. Identifying cognates by phonetic and semantic similarity. In <i>Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, </i>pages 103-110.</p><p>Grzegorz Kondrak. 2002. <i>Algorithms for Language Reconstruction. </i>Ph.D. thesis, University of Toronto. Available at http://www.cs.toronto.edu/ kondrak.</p><p>John B. Lowe and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. <i>Computational Linguistics, </i>20:381-417.</p><p>Gideon S. Mann and David Yarowsky. 2001. Multipath translation lexicon induction via bridge languages. In <i>Proceedings of NAACL 2001: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, </i>pages 151-158.</p><p>I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In <i>Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, </i>pages 97-108.</p><p>I. Dan Melamed. 1999. Bitext maps and alignment via pattern recognition. <i>Computational Linguistics, </i>25(1):107-130.</p><p>I. Dan Melamed. 2000. Models of translational equivalence among words. <i>Computational Linguistics, </i>26(2):221-249.</p><p>Michael P. Oakes. 2000. Computer estimation of vocabulary in protolanguage from word lists in four daughter languages. <i>Journal of Quantitative Linguistics, </i>7(3):233-243.</p><p>Michel Simard, George F. Foster, and Pierre Isabelle. 1992. 
Using cognates to align sentences in bilingual corpora. In <i>Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, </i>pages 67-81, Montreal, Canada.</p><p>Morris Swadesh. 1952. Lexico-statistical dating of prehistoric ethnic contacts. <i>Proceedings of the American Philosophical Society, </i>96:452-463.</p><p>Jörg Tiedemann. 1999. Automatic construction of weighted string similarity measures. In <i>Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, </i>College Park, Maryland.</p><p>Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. <i>Journal of the Association for Computing Machinery, </i>21(1):168-173.</p></references></body></article>