<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" xmlns:ns2="http://www.tei-c.org/ns/Examples">
    <teiHeader>
        <fileDesc>
            <titleStmt>
                <title>Squibs and Discussions Decoding Complexity in Word-Replacement Translation Models</title>
            </titleStmt>
        </fileDesc>
    </teiHeader>
    <text>
        <front/>
        <body>
            <div>
                <p>Kevin Knight* University of Southern California Statistical machine translation is a relatively new approach to the long-standing problem of translating human languages by computer. Current statistical techniques uncover translation rules from bilingual training texts and use those rules to translate new texts. The general architecture is the source-channel model: an English string is statistically generated (source), then statistically transformed into French (channel). In order to translate (or &quot;decode&quot;) a French string, we look for the most likely English source. We show that for the simplest form of statistical models, this problem is NP-complete, i.e., probably exponential in the length of the observed sentence. We trace this complexity to factors not present in other decoding problems. Statistical models are widely used in attacking natural language problems. The sourcechannel framework is especially popular, finding applications in part-of-speech tagging, accent restoration, transliteration, speech recognition, and many other areas. In this framework, we build an underspecified model of how certain structures (such as strings) are generated and transformed. We then instantiate the model through training on a database of sample structures and transformations.</p>
                <p>Recently, Brown et al. (1993) built a source-channel model of translation between English and French. They assumed that English strings are produced according to some stochastic process (source model) and transformed stochastically into French strings (channel model). To translate French to English, it is necessary to find an English source string that is likely according to the models. With a nod to its cryptographic antecedents, this kind of translation is called decoding. This paper looks at decoding complexity. The prototype source-channel application in natural language is part-of-speech tagging (Church 1988). We review it here for purposes of comparison with machine translation.</p>
                <p>Source strings comprise sequences of part-of-speech tags like noun, verb, etc. A simple source model assigns a probability to a tag sequence tl .. •tm based on the probabilities of the tag pairs inside it. Target strings are English sentences, e.g., wl ... win. The channel model assumes each tag is probabilistically replaced by a word (e.g., noun by dog) without considering context. More concretely, we have: • v total tags • A bigram source model with v 2 parameters of the form b(t\]t), where P(tl... tin) &quot;&quot; b(tllboundary) • b(t2\]tl) ..... b(tn\]tm-1) &quot; b(boundary\]tm) • Information Sciences Institute, Marina del Rey, CA 90292 @ 1999 Association for Computational Linguistics Computational Linguistics Volume 25, Number 4 • A substitution channel model with parameters of the form s(w\]t), where P(wl ... Wmlh... tm) ~ S(Wllh)&quot; S(W21t2)&quot; ...&quot; S(Wraltm) • an m-word text annotated with correct tags • an m-word unannotated text</p>
                <p>We can assign parts-of-speech to a previously unseen word sequence wl... Wm by finding the sequence tl... tm that maximizes P(h... tmlWl... Wm). By Bayes' rule, we can equivalently maximize P(h ... tm)'P(wl.., wmlh.., tin), which we can calculate directly from the b and s tables above.</p>
                <p>Three interesting complexity problems in the source-channel framework are: • Can parameter values be induced from annotated text efficiently? • Can optimal decodings be produced efficiently? • Can parameter values be induced from unannotated text efficiently?</p>
                <p>The first problem is solved in O(m) time for part-of-speech tagging--we simply count tag pairs and word/tag pairs, then normalize. The second problem seems to require enumerating all O(v m) potential source sequences to find the best, but can actually be solved in O(mv 2) time with dynamic programming. We turn to the third problem in the context of another application: cryptanalysis. In a substitution cipher, a plaintext message like HELLO WORLD is transformed into a ciphertext message like EOPPX YXAPF via a fixed letter-substitution table. As with tagging, we can assume an alphabet of v source tokens, a bigram source model, a substitution channel model, and an m-token coded text.</p>
                <p>If the coded text is annotated with corresponding English, then building source and channel models is trivially O(m). Comparing the situation to part-of-speech tagging: • (Bad news.) Cryptanalysts rarely get such coded/decoded text pairs and</p>
                <p>must employ &quot;ciphertext-only&quot; attacks using unannotated training data. • (Good news.) It is easy to train a source model separately, on raw</p>
                <p>unannotated English text that is unconnected to the ciphertext. Then the problem becomes one of acquiring a channel model, i.e., a table s(fle )with an entry for each code-letter/plaintext-letter pai . Starting with an initially uniform table, we can use the estimation-maximization (EM) algorithm to iteratively revise s(fle )so as to increase the probability of the observed corpus P(f). Figure 1 shows a naive EM implementation that runs in O(mv m) time. There is an efficient O(mv 2) EM implementation based on dynamic programming that accomplishes the same thing. Once the s(fle )table has been learned, there is a similar O(mv 2) algorithm for optimal decoding. Such methods can break English letter-substitution ciphers of moderate size. Knight Decoding Complexity Given coded text f of length m, a plaintext vocabulary of v tokens, and a source model b: 1. set the s0Cle) table initially to be uniform 2. for several iterations do: a, b. C. d. set up a count table c0CI e) with zero entries P(f) = 0 for all possible source texts el... em (el drawn from vocabulary)</p>
                <p>compute P(e) = b(ell boundary), b(boundary lem). \[Ii~=2 b(eilei_l)</p>
                <p>compute P(fle) = I~j=l m s(fjleJ)</p>
                <p>P(f) += P(e). P(fle) for all source texts e of length m</p>
                <p>compute P(elf ) = P(e)'P(fle) P(f)</p>
                <p>for j = 1 to m</p>
                <p>c0~lej) += P(e~) normalize c0Ci e) table to create a revised s0CI e) Figure 1 A naive application of the EM algorithm to break a substitution cipher. It runs in O(mv m) time.</p>
            </div>
            <note n="1." place="below">Introduction</note>
            <note n="2." place="below">Part-of-Speech Tagging</note>
            <note n="3." place="below">Substitution Ciphers</note>
            <note n="608" place="below"></note>
            <div1>
                <head xml:id="sec4.">Machine Translation</head>
                <p>In our discussion of substitution ciphers, we were on relatively sure ground the channel model we assumed in decoding is actually the same one used by the cipher writer for encoding. That is, we know that plaintext is converted to ciphertext, letter by letter, according to some table. We have no such clear conception about how English gets converted to French, although many theories exist. Brown et al. (1993) recently cast some simple theories into a source-channel framework, using the bilingual Canadian parliament proceedings as training data. We may assume: • • • • • v total English words. A bigram source model with V 2 parameters. Various substitution/permutation channel models. A collection of bilingual sentence pairs (sentence lengths &lt; m). A collection of monolingual French sentences (sentence lengths &lt; m).</p>
                <p>Bilingual texts seem to exhibit English words getting substituted with French ones, though not one-for-one and not without changing their order. These are important departures from the two applications discussed earlier.</p>
                <p>In the main channel model of Brown et al. (1993), each English word token ei in a source sentence is assigned a &quot;fertility&quot; @, which dictates how many French words it will produce. These assignments are made stochastically according to a table n(~le ). Then actual French words are produced according to s(fie )and permuted into new positions according to a distortion table d(jli, m, 1). Here, j and i are absolute target/source word positions within a sentence, and m and I are target/source sentence lengths.</p>
                <p>Inducing n, s, and d parameter estimates is easy if we are given annotations in the form of word alignments. An alignment is a set of connections between English and French words in a sentence pair. In Brown et al. (1993), aligrtrnents are asymmetric-each French word is connected to exactly one English word. Computational Linguistics Volume 25, Number 4 Given a collection of sentence pairs: 1. 2. 3. collect estimates for the ~(m\]l) table directly from the data set the s0e\]e) table initially to be uniform for several iterations do: a. set up a count table c(f\]e) with zero entries b. for each given sentence pair e, f with respective lengths I, m:</p>
                <p>foral=ltol</p>
                <p>for a2 = 1 to 1 /* select connections for a word alignment */ for am = 1 to l 1-Ij=, m s~l%) compute P(al ...... \]e, f) - p(f' al ...... \]e)</p>
                <p>P(f\]e) G'o; =, &quot; • ~,,-=, ' I-\[j=, m s~le,;)</p>
                <p>for j = 1 to m</p>
                <p>c0~l%) += P(al... amle, f) C. normalize c0~\]ei ) table to create new s(fi\]ei) Figure 2 Naive EM training for the Model 1 channel model.</p>
                <p>Word-aligned data is usually not available, but large sets of unaligned bilingual sentence pairs do sometimes exist. A single sentence pair will have \[m possible alignments--for each French word position 1... m, there is a choice of I English positions to connect to. A naive EM implementation will collect n, s, and d counts by considering each alignment, but this is expensive. (By contrast, part-of-speech tagging involves a single alignment, leading to O(m) training). Lacking a polynomial reformulation, Brown et al. (1993) decided to collect counts only over a subset of likely alignments. To bootstrap, they required some initial idea of what alignments are reasonable, so they began with several iterations of a simpler channel model (called Model 1) that has nicer computational properties.</p>
                <p>In the following description of Model 1, we represent an aligmnent formally as a vector al ..... am, with values aj ranging over English word positions 1... I. Model 1 Channel Parameters: c(mll ) and s(f\[e). Given a source sentence of length I:</p>
                <p>1. choose a target sentence length m according to ¢(mll )</p>
                <p>2. for j = 1 to m, choose an English word position aj according to the</p>
                <p>uniform distribution over 1... l</p>
                <p>3. for j = 1 to m, choose a French word j~ according to s~\]%)</p>
                <p>4. read off fl ...fro as the target sentence</p>
                <p>Because the same e may produce the same f by means of many different alignments, we must sum over all of them to obtain P(fle): P(fl e) = c(mll) T~ 1 Y~al=l l ~a2=l 1 &quot;&quot;&quot; Y~am=l l I\]j=l m s(fjleai) Figure 2 illustrates naive EM training for Model 1. If we compute P(fle) once per iteration, outside the &quot;for a&quot; loops, then the complexity is O(ml m) per sentence pair, per iteration. Knight Decoding Complexity</p>
                <p>More efficient O(lm) training was devised by Brown et al. (1993). Instead of processing each alignment separately, they modified the algorithm in Figure 2 as follows: b. for each given sentence pair e, f of respective lengths l, m:</p>
                <p>for j = 1 to m sum = 0 for i = 1 to I</p>
                <p>sum += s(fjlei) for i = 1 to I</p>
                <p>c(fjlei ) += s(fjlei ) / sum This works because of the algebraic trick that the portion of P(fle) we originally wrote as ~al=,&quot; 1 &quot;&quot; Y~am=l 1 1-Ij=l m S(J~\[ e aj) can be rewritten as YIj=I m Y~I=I s(fjlei)&quot;</p>
                <p>We next consider decoding. We seek a string e that maximizes P(elf), or equivalently maximizes P(e) • P(fle). A naive algorithm would evaluate all possible source strings, whose lengths are potentially unbounded. If we limit our search to strings at most twice the length m of our observed French, then we have a naive O(m2v 2m) method: Given a string f of length m</p>
                <p>1. for all source strings e of length I _ 2m:</p>
                <p>a. compute P(e) = b(el I boundary) -b(boundary Iet) &quot; I-lli=2 b(eilei-1)</p>
                <p>b. compute P(fle) = c(mll ) ~ l-\[j=1 m ~1i=1 s(fjlei)</p>
                <p>c. compute P(elf) ,-~ P(e) • P(fle)</p>
                <p>d. if P(elf ) is the best so far, remember it 2. print best e We may now hope to find a way of reorganizing this computation, using tricks like the ones above. Unfortunately, we are unlikely to succeed, as we now show. For proof purposes, we define our optimization problem with an associated yes-no decision problem: Definition: M1-OPTIMIZE Given a string f of length m and a set of parameter tables (b, e, s), return a string e of length I &lt; 2m that maximizes P(elf), or equivalently maximizes P(e) - P(fle) = b(el I boundary) -b(boundary Iel ) 1 • 1-\[i=2 b(eilei-1) • c(mll ) ± l m llj=l vi m x-,! /'i=1 s(fjlei) Definition: M1-DECIDE Given a string f of length m, a set of parameter tables (b, e, s), and a real number k, does there exist a string e of length l &lt; 2m such that P(e) • P(fle) &gt; k? We will leave the relationship between these two problems somewhat open and intuitive, noting only that M1-DECIDE's intractability does not bode well for M1OPTIMIZE. Computational Linguistics Volume 25, Number 4 Theorem M1-DECIDE is NP-complete.</p>
                <p>To show inclusion in NP, we need only nondeterministically choose e for any problem instance and verify that it has the requisite P(e) • P(fle) in O(m 2) time. Next we give separate polynomial-time reductions from two NP-complete problems. Each reduction highlights a different source of complexity. 4.1 Reduction 1 (from Hamilton Circuit Problem) The Hamilton Circuit Problem asks: given a directed graph G with vertices labeled 0,...,n, does G have a path that visits each vertex exactly once and returns to its starting point? We transform any Hamilton Circuit instance into an M1-DECIDE instance as follows. First, we create a French vocabulary fl ..... fn, associating word fi with vertex i in the graph. We create a slightly larger English vocabulary e0 ..... en, with e0 serving as the &quot;boundary&quot; word for source model scoring. Ultimately, we will ask M1-DECIDE to decode the string fl...fn.</p>
                <p>We create channel model tables as follows: s~.lei) = {10 ifi=j</p>
                <p>otherwise ¢(mll) = {10 ifl=m</p>
                <p>otherwise</p>
                <p>These tables ensure that any decoding e off1 ...fn will contain the n words el .... , en (in some order). We now create a source model. For every pair (i,j) such that 0 G i,j G n:</p>
                <p>= ~l/n if graph G contains an edge from vertex i to vertex j b(ej\[ei) to otherwise</p>
                <p>Finally, we set k to zero. To solve a Hamilton Circuit Problem, we transform it as above (in quadratic time), then invoke M1-DECIDE with inputs b, c, s, k, and fl...fm.</p>
                <p>If M1-DECIDE returns yes, then there must be some string e with both P(e) and P(fle) nonzero. The channel model lets us conclude that if P(f\[e) is nonzero, then e contains the n words el,..., en in some order. If P(e) is nonzero, then every bigram in e (including the two boundary bigrams involving e0) has nonzero probability. Because each English word in e corresponds to a unique vertex, we can use the order of words in e to produce an ordering of vertices in G. We append vertex 0 to the beginning and end of this list to produce a Hamilton Circuit. The source model construction guarantees an edge between each vertex and the next.</p>
                <p>If M1-DECIDE returns no, then we know that every string e includes at least one zero value in the computation of either P(e) or P(fle). From any proposed Hamilton Circuit--i.e., some ordering of vertices in G--we can construct a string e using the same ordering. This e will have P(f\]e) = 1 according to the channel model. Therefore, P(e) = 0. By the source model, this can only happen if the proposed &quot;circuit&quot; is actually broken somewhere. So no Hamilton Circuit exists.</p>
                <p>Figure 3 illustrates the intuitive correspondence between selecting a good word order and finding a Hamilton Circuit. We note that Brew (1992) discusses the NPcompleteness of a related problem, that of finding some permutation of a string that is acceptable to a given context-free grammar. Both of these results deal with decision problems. Returning to optimization, we recall another circuit task called the Traveling Knight Decoding Complexity my b°uid~N ~ r ~'/~~ falls Thursday Figure 3 Selecting a good source word order is like solving the Hamilton Circuit Problem. If we assume that the channel model offers deterministic, word-for-word translations, then the bigram source model takes responsibility for ordering them. Some word pairs in the source language may be illegal. In that case, finding a legal word ordering is like finding a complete circuit in a graph. (In the graph shown above, a sample circuit is boundary --, this ---* year ~ comma ~ my --* birthday --~ falls --~ on ---* a --+ Thursday ~ boundary). If word pairs have probabilities attached to them, then word ordering resembles the finding the least-cost circuit, also known as the Traveling Salesman Problem. Salesman Problem. It introduces edge costs dq and seeks a minimum-cost circuit. By viewing edge costs as log probabilities, we can cast the Traveling Salesman Problem as one of optimizing P(e), that is, of finding the best source word order in Model 1 decoding. 4.2 Reduction 2 (from Minimum Set Cover Problem) The Minimum Set Cover Problem asks: given a collection C of subsets of finite set S, and integer n, does C contain a cover for S of size ~ n, i.e., a subcollection whose union is S? We now transform any instance of Minimum Set Cover into an instance of M1-DECIDE, using polynomial time. This time, we assume a rather neutral source model in which all strings of a given length are equally likely, but we construct a more complex channel.</p>
                <p>We first create a source word ei for each subset in C, and let gi be the size of that subset. We create a table b(eilej) with values set uniformly to the reciprocal of the source vocabulary size (i.e., the number of subsets in C).</p>
                <p>Assuming S has m elements, we next create target words fl ..... fm corresponding to each of those elements, and set up channel model tables as follows: if the element in S corresponding to j~ is also in the subset corresponding to ei otherwise ¢(mll) = {10 ifl~n</p>
                <p>otherwise ~(m fl if l&gt;n</p>
                <p>otherwise</p>
                <p>Finally, we set k to zero. This completes the reduction. To solve an instance of Minimum Set Cover in polynomial time, we transform it as above, then call M1DECIDE with inputs b, c, s, k, and the words fl ..... fm in any order. Computational Linguistics Volume 25, Number 4</p>
                <p>obtained however m~ ) ted J ..... ~d left left the meal Figure 4 Selecting a concise set of source words is like solving the Minimum Set Cover Problem. A channel model with overlapping, one-to-many dictionary entries will typically license many decodings. The source model may prefer short decodings over long ones. Searching for a decoding of length _&lt; n is difficult, resembling the problem of covering a finite set with a small collection of subsets. In the example shown above, the smallest acceptable set of source words is {and, cooked, however, left, comma, period}.</p>
                <p>If M1-DECIDE returns yes, then some decoding e with P(e) • P(f\]e) &gt; 0 must exist. We know that e must contain n or fewer words--otherwise P(f\[e) = 0 by the c table. Furthermore, the s table tells us that every word fj is covered by at least one English word in e. Through the one-to-one correspondence between elements of e and C, we produce a set cover of size G n for S.</p>
                <p>Likewise, if M1-DECIDE returns no, then all decodings have P(e) • P(f\[e) = 0. Because there are no zeroes in the source table b, every e has P(f\[e) = 0. Therefore either (1) the length of e exceeds n, or (2) somefj is left tmcovered by the words in e. Because source words cover target words in exactly the same fashion as elements of C cover S, we conclude that there is no set cover of size &lt; n for S. Figure 4 illustrates the intuitive correspondence between source word selection and minimum set covering. 5. Discussion The two proofs point up separate factors in MT decoding complexity. One is wordorder selection. But even if any word order will do, there is still the problem of picking a concise decoding in the face of overlapping bilingual dictionary entries. The former is more closely tied to the source model, and the latter to the channel model, though the complexity arises from the interaction of the two.</p>
                <p>We should note that Model 1 is an intentionally simple translation model, one whose primary purpose in machine translation has been to allow bootstrapping into more complex translation models (e.g., IBM Models 2-5). It is easy to show that the intractability results also apply to stronger &quot;fertility/distortion&quot; models; we assign zero probability to fertilities other than 1, and we set up uniform distortion tables. Simple translation models like Model 1 find more direct use in other applications (e.g., lexicon construction, idiom detection, psychological norms, and cross-language information retrieval), so their computational properties are of wider interest. Knight Decoding Complexity</p>
                <p>The proofs we presented are based on a worst-case analysis. Real s, e, and b tables may have properties that permit faster optimal decoding than the artificial tables constructed above. It is also possible to devise approximation algorithms like those devised for other NP-complete problems. To the extent hat word ordering is like solving the Traveling Salesman Problem, it is encouraging substantial progress continues to be made on Traveling Salesman algorithms. For example, it is often possible to get within two percent of the optimal tour in practice, and some researchers have demonstrated an optimal tour of over 13,000 U.S. cities. (The latter experiment relied on things like distance symmetry and the triangle inequality constraint, however, which do not hold in word ordering.) So far, statistical translation research has either opted for heuristic beam-search algorithms or different channel models. For example, some researchers avoid bag generation by preprocessing bilingual texts to remove word-order differences, while others adopt channels that eliminate syntactically unlikely alignments.</p>
                <p>Finally, expensive decoding also suggests expensive training from unannotated (monolingual) texts, which presents a challenging bottleneck for extending statistical machine translation to language pairs and domains where large bilingual corpora do not exist. References Brew, Chris. 1992. Letting the cat out of the</p>
                <p>bag: Generation for shake-and-bake MT.</p>
                <p>In Proceedings of the 14th International</p>
                <p>Conference on Computational Linguistics</p>
                <p>(COLING), pages 610-616, Nantes, France,</p>
                <p>August. Brown, Peter, Stephen Della-Pietra, Vincent</p>
                <p>Della-Pietra, and Robert Mercer. 1993. The</p>
                <p>mathematics of statistical machine</p>
                <p>translation: Parameter estimation.</p>
                <p>Computational Linguistics, 19(2):263-311. Church, Kenneth. 1988. A stochastic parts</p>
                <p>program and noun phrase parser for</p>
                <p>unrestricted text. In Proceedings of the 2nd</p>
                <p>Conference on Applied Natural Language</p>
                <p>Processing, pages 136-143, Austin, TX,</p>
                <p>June.</p>
            </div1>
            <div1>
                <head xml:id="sec615"></head>
            </div1>
        </body>
        <back/>
    </text>
</TEI>
