Squibs and Discussions

Decoding Complexity in Word-Replacement Translation Models

Kevin Knight*
University of Southern California

Statistical machine translation is a relatively new approach to the long-standing problem of translating human languages by computer. Current statistical techniques uncover translation rules from bilingual training texts and use those rules to translate new texts. The general architecture is the source-channel model: an English string is statistically generated (source), then statistically transformed into French (channel). In order to translate (or "decode") a French string, we look for the most likely English source. We show that for the simplest form of statistical models, this problem is NP-complete, i.e., probably exponential in the length of the observed sentence. We trace this complexity to factors not present in other decoding problems.

Statistical models are widely used in attacking natural language problems. The source-channel framework is especially popular, finding applications in part-of-speech tagging, accent restoration, transliteration, speech recognition, and many other areas. In this framework, we build an underspecified model of how certain structures (such as strings) are generated and transformed. We then instantiate the model through training on a database of sample structures and transformations.

Recently, Brown et al. (1993) built a source-channel model of translation between English and French. They assumed that English strings are produced according to some stochastic process (source model) and transformed stochastically into French strings (channel model). To translate French to English, it is necessary to find an English source string that is likely according to the models. With a nod to its cryptographic antecedents, this kind of translation is called decoding. This paper looks at decoding complexity.

The prototype source-channel application in natural language is part-of-speech tagging (Church 1988). We review it here for purposes of comparison with machine translation.

Source strings comprise sequences of part-of-speech tags like noun, verb, etc. A simple source model assigns a probability to a tag sequence $t_1 \ldots t_m$ based on the probabilities of the tag pairs inside it. Target strings are English sentences, e.g., $w_1 \ldots w_m$. The channel model assumes each tag is probabilistically replaced by a word (e.g., noun by dog) without considering context. More concretely, we have:
• $v$ total tags
• A bigram source model with $v^2$ parameters of the form $b(t \mid t')$, where $P(t_1 \ldots t_m) \sim b(t_1 \mid \text{boundary}) \cdot b(t_2 \mid t_1) \cdot \ldots \cdot b(t_m \mid t_{m-1}) \cdot b(\text{boundary} \mid t_m)$
• A substitution channel model with parameters of the form $s(w \mid t)$, where $P(w_1 \ldots w_m \mid t_1 \ldots t_m) \sim s(w_1 \mid t_1) \cdot s(w_2 \mid t_2) \cdot \ldots \cdot s(w_m \mid t_m)$
• an $m$-word text annotated with correct tags
• an $m$-word unannotated text

* Information Sciences Institute, Marina del Rey, CA 90292. © 1999 Association for Computational Linguistics. Computational Linguistics, Volume 25, Number 4.

We can assign parts-of-speech to a previously unseen word sequence $w_1 \ldots w_m$ by finding the sequence $t_1 \ldots t_m$ that maximizes $P(t_1 \ldots t_m \mid w_1 \ldots w_m)$. By Bayes' rule, we can equivalently maximize $P(t_1 \ldots t_m) \cdot P(w_1 \ldots w_m \mid t_1 \ldots t_m)$, which we can calculate directly from the $b$ and $s$ tables above.
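As an illustration of the quantity being maximized, the following minimal Python sketch scores one candidate tag sequence from the b and s tables and then decodes by enumerating every candidate. The dictionary layout, the `BOUNDARY` symbol, the function names, and the toy probabilities are assumptions made for this example only; they are not taken from the paper.

```python
from itertools import product

BOUNDARY = "<b>"  # hypothetical symbol standing for the sentence boundary

def joint_score(tags, words, b, s):
    """Return P(t) * P(w|t) for one tag sequence.

    b maps (tag, previous_tag) -> b(tag | previous_tag)
    s maps (word, tag)         -> s(word | tag)
    Missing entries are treated as probability zero.
    """
    p = 1.0
    prev = BOUNDARY
    for t, w in zip(tags, words):
        p *= b.get((t, prev), 0.0) * s.get((w, t), 0.0)
        prev = t
    # close the sequence with the boundary transition
    return p * b.get((BOUNDARY, prev), 0.0)

def brute_force_tag(words, tagset, b, s):
    """Enumerate all v^m tag sequences and keep the highest-scoring one."""
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: joint_score(tags, words, b, s))

# Toy tables (invented numbers, two tags, two words).
b = {("noun", BOUNDARY): 0.6, ("verb", BOUNDARY): 0.4,
     ("noun", "noun"): 0.2, ("verb", "noun"): 0.5, (BOUNDARY, "noun"): 0.3,
     ("noun", "verb"): 0.5, ("verb", "verb"): 0.1, (BOUNDARY, "verb"): 0.4}
s = {("dogs", "noun"): 0.7, ("bark", "noun"): 0.3,
     ("dogs", "verb"): 0.1, ("bark", "verb"): 0.9}

print(brute_force_tag(["dogs", "bark"], ["noun", "verb"], b, s))
# prints ('noun', 'verb')
```

The enumeration visits every one of the v^m candidate sequences, so it is only usable on very short inputs.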

Three interesting complexity problems in the source-channel framework are:
• Can parameter values be induced from annotated text efficiently?
• Can optimal decodings be produced efficiently?
• Can parameter values be induced from unannotated text efficiently?

The first problem is solved in $O(m)$ time for part-of-speech tagging: we simply count tag pairs and word/tag pairs, then normalize. The second problem seems to require enumerating all $O(v^m)$ potential source sequences to find the best, but can actually be solved in $O(mv^2)$ time with dynamic programming.

We turn to the third problem in the context of another application: cryptanalysis. In a substitution cipher, a plaintext message like HELLO WORLD is transformed into a ciphertext message like EOPPX YXAPF via a fixed letter-substitution table. As with tagging, we can assume an alphabet of $v$ source tokens, a bigram source model, a substitution channel model, and an $m$-token coded text.
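Returning briefly to the tagging case, the two efficient procedures mentioned above (training by counting, and dynamic-programming decoding) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the dynamic program is written as the standard Viterbi recurrence, which the text does not name; the data layout, the `BOUNDARY` symbol, and the function names are hypothetical; no smoothing is applied; and the input is assumed to be non-empty.

```python
from collections import Counter

BOUNDARY = "<b>"  # hypothetical symbol standing for the sentence boundary

def train_by_counting(tagged):
    """One pass over a list of (word, tag) pairs: count, then normalize."""
    pair_counts, emit_counts = Counter(), Counter()
    prev = BOUNDARY
    for word, tag in tagged:
        pair_counts[(tag, prev)] += 1   # tag-pair count
        emit_counts[(word, tag)] += 1   # word/tag count
        prev = tag
    pair_counts[(BOUNDARY, prev)] += 1  # closing boundary transition
    prev_totals, tag_totals = Counter(), Counter()
    for (_, p), c in pair_counts.items():
        prev_totals[p] += c
    for (_, t), c in emit_counts.items():
        tag_totals[t] += c
    b = {k: c / prev_totals[k[1]] for k, c in pair_counts.items()}
    s = {k: c / tag_totals[k[1]] for k, c in emit_counts.items()}
    return b, s

def viterbi(words, tagset, b, s):
    """Best tag sequence: m positions, v tags, v predecessors each."""
    tags = list(tagset)
    # score[t] = probability of the best partial sequence ending in tag t
    score = {t: b.get((t, BOUNDARY), 0.0) * s.get((words[0], t), 0.0)
             for t in tags}
    backptrs = []
    for w in words[1:]:
        new_score, ptr = {}, {}
        for t in tags:
            p = max(tags, key=lambda q: score[q] * b.get((t, q), 0.0))
            ptr[t] = p
            new_score[t] = score[p] * b.get((t, p), 0.0) * s.get((w, t), 0.0)
        score = new_score
        backptrs.append(ptr)
    # include the closing boundary transition, then follow back-pointers
    last = max(tags, key=lambda t: score[t] * b.get((BOUNDARY, t), 0.0))
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy annotated text (invented for this example).
tagged = [("dogs", "noun"), ("bark", "verb"), ("cats", "noun"), ("sleep", "verb")]
b, s = train_by_counting(tagged)
print(viterbi(["cats", "bark"], ["noun", "verb"], b, s))
# prints ['noun', 'verb']
```

Back-pointers are stored instead of full paths so that the work per position stays proportional to v^2.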

If the coded text is annotated with corresponding English, then building source and channel models is trivially $O(m)$. Comparing the situation to part-of-speech tagging:
• (Bad news.) Cryptanalysts rarely get such coded/decoded text pairs and must employ "ciphertext-only" attacks using unannotated training data.
• (Good news.) It is easy to train a source model separately, on raw unannotated English text that is unconnected to the ciphertext.

Then the problem becomes one of acquiring a channel model, i.e., a table $s(f \mid e)$ with an entry for each code-letter/plaintext-letter pair. Starting with an initially uniform table, we can use the expectation-maximization (EM) algorithm to iteratively revise $s(f \mid e)$ so as to increase the probability of the observed corpus $P(f)$. Figure 1 shows a naive EM implementation that runs in $O(mv^m)$ time. There is an efficient $O(mv^2)$ EM implementation based on dynamic programming that accomplishes the same thing. Once the $s(f \mid e)$ table has been learned, there is a similar $O(mv^2)$ algorithm for optimal decoding. Such methods can break English letter-substitution ciphers of moderate size.

Given coded text $f$ of length $m$, a plaintext vocabulary of $v$ tokens, and a source model $b$:
1. set the $s(f \mid e)$ table initially to be uniform
2. for several iterations do:
   a. set up a count table $c(f \mid e)$ with zero entries
   b. $P(f) = 0$
   c. for all possible source texts $e_1 \ldots e_m$ ($e_i$ drawn from vocabulary)
        compute $P(e) = b(e_1 \mid \text{boundary}) \cdot b(\text{boundary} \mid e_m) \cdot \prod_{i=2}^{m} b(e_i \mid e_{i-1})$
        compute $P(f \mid e) = \prod_{j=1}^{m} s(f_j \mid e_j)$
        $P(f)$ += $P(e) \cdot P(f \mid e)$
   d. for all source texts $e$ of length $m$
        compute $P(e \mid f) = \frac{P(e) \cdot P(f \mid e)}{P(f)}$
        for $j = 1$ to $m$
          $c(f_j \mid e_j)$ += $P(e \mid f)$
   e. normalize $c(f \mid e)$ table to create a revised $s(f \mid e)$

Figure 1
A naive application of the EM algorithm to break a substitution cipher. It runs in $O(mv^m)$ time.
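For readers who want to experiment, the procedure of Figure 1 can be transcribed into Python roughly as follows. This is a sketch under stated assumptions rather than the author's implementation: the dictionary layout, the `BOUNDARY` symbol, the function name `naive_em`, and the toy source model are all hypothetical; the coded text is assumed to be non-empty; and the joint probabilities computed in the first pass are cached for the second pass instead of being recomputed.

```python
from collections import defaultdict
from itertools import product
from math import prod

BOUNDARY = "<b>"  # hypothetical symbol standing for the text boundary

def naive_em(f, vocab, b, iterations=5):
    """Estimate s(f|e) by enumerating all v^m source texts.

    f     : list of coded tokens (non-empty)
    vocab : list of plaintext tokens
    b     : dict mapping (e, previous_e) -> b(e | previous_e)
    """
    m = len(f)
    code_letters = sorted(set(f))
    uniform = 1.0 / len(code_letters)
    # initial table: uniform over the observed code letters
    s = {(c, e): uniform for c in code_letters for e in vocab}
    for _ in range(iterations):
        counts = defaultdict(float)  # count table c(f|e), all zeros
        p_f = 0.0
        joint = {}                   # cache of P(e) * P(f|e)
        # first pass: accumulate P(f) over every source text
        for e in product(vocab, repeat=m):
            p_e = (b.get((e[0], BOUNDARY), 0.0)
                   * b.get((BOUNDARY, e[-1]), 0.0)
                   * prod(b.get((e[i], e[i - 1]), 0.0) for i in range(1, m)))
            p_f_given_e = prod(s[(f[j], e[j])] for j in range(m))
            joint[e] = p_e * p_f_given_e
            p_f += joint[e]
        if p_f == 0.0:
            raise ValueError("every source text has zero probability")
        # second pass: collect fractional counts weighted by P(e|f)
        for e, p in joint.items():
            posterior = p / p_f
            for j in range(m):
                counts[(f[j], e[j])] += posterior
        # normalize the counts into a revised s(f|e)
        totals = defaultdict(float)
        for (c, e), n in counts.items():
            totals[e] += n
        s = {(c, e): counts[(c, e)] / totals[e] if totals[e] > 0 else uniform
             for c in code_letters for e in vocab}
    return s

# Toy source model (invented numbers) that favors alternating a, b, a, b.
b = {("a", BOUNDARY): 0.9, ("b", BOUNDARY): 0.1,
     ("a", "a"): 0.1, ("b", "a"): 0.8, (BOUNDARY, "a"): 0.1,
     ("a", "b"): 0.8, ("b", "b"): 0.1, (BOUNDARY, "b"): 0.1}
s1 = naive_em(list("xyxy"), ["a", "b"], b, iterations=1)
```

On this toy input a single iteration already moves probability mass toward pairing x with a and y with b, since the source model prefers texts that start with a and alternate. The enumeration over `product(vocab, repeat=m)` is what makes the running time exponential in m.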