NARA: A Two-way Simultaneous Interl3retation System between Korean and Jal~anese -A methodological study-

\[lee Sung Chung and Tosiyasu L. Kunii

Department of" Information Science Faculty of Science, University of Tokyo 7-3-I Ilongo, Bunkyo-ku Abstr_~ paper presents a new computing model for con-

a two-way simultaneous interpretation system Tokyo, 113 Japan the adjusted grammar with the interpretation mechanism: a representation, an a\]goritb~l, and a complexity metric. We take the following items as the subjects of c;lethodological study. (I) The theory of grammar We require an adjusted grammar to be suitable for description of the two languages as input and output. It is intuitiw}ly clear that the more cc~,municatable the adjusted gralr~imr is, expressed by a powerful forreal system, the more efficiently is the grammar interprcted. We adopt general ized phrase structure grarl,~lar (GPSG) framework\[\]{\]. (2) The notion of direct realization of interpretation Because we need to connect competence and performance as directly as possible, one of the goals of our study is to identify rules of the grammar with the manipulative unit of interpretation in a one-to-one fashion° Thus we carefully distinguish between the grammar and the rules of interpretation. For this, we adopt the following notions as the methodological principles of our system: I) Equivalence of grammar\[5\]~ 2) gralfmlar cover' and grarmlar modif:ieation\[6\]5 3) type transparency\[2\], and 4) an invariant of forma\] languages\[4\]. (3) The notion of complexity measure fhe direct association between unit interpretation time cost and the complexity of a sequential operation during interpretation can be measured. 3. Linguistic Data Structure and Computing Model

in order to investigate the correspondence between the two languages, we partition a grammar into independent components: segmented words, the word order, morphology, syntax, and semantics. The partition of a grammar constitutes an important step of modular decomposition into the interpretation subsystems. 3.1 Interpretation strategy of segmented word component 3.1.1 Data structure

In comparison with other symbol system, every hmlan language has a remarkable characteristics; namely, the structure of segmented words. The utterance as a segmented word conveys a message regarding some matter, and communicates the information concerning the matter. A se\[~nented word is a word or an ordered pair of words. Using some criteria: positional transformation, substitution and insertion, we can specify a segmented word of Korean or Japanese. 3.1.2 Word order in a segmented word of Korean or Japanese

This structing between Korean and Japanese. We also propose several methodological approaches to the construction of a two-way simultaneous interpretation system, and realize the two-way interpreting process as a model unifying both linguistic competence and linguistic performance. The model is verified theoretically and through actual applications. I. Introduction

Our goal is to develop a two-way simultaneous interpretation system between Korean and Japanese. In order to achieve this goal, we have designed a specific computing model, which is a computer program based on the a\]gorithll that formalizes the mechanism of two-way simultaneous interpretation and the correspondence of the two languages. Our computational approach consists of two parts° First, build an explicit computational model, then sh(m the practical applicabiIity and theoretical validity of the model° The most significant advantage of using a formal descrip-. tion to represent our" system is in that the descriptive contents of the representative algoritb~ do not depend upon the conventJ.orml approaches to machine translation° ~.Jc have also implemented a prototypint~ system .\[~AR~.., a two-way simultaneous interpretation system between Korean and Japanese. In this paper, we outline the features of the system without \[4oin\[~ into the details. 2. Methodology

Our approach is intuitively motivated by Cha~mky's hypothesis\[l\]: hc~logeneous communication by the same linguistic performance is possible among those who have the same linguistic competence. We take a performance theory to be the study of real time processing of languages. The performance theory cannot be developed without a c~apetence theory. This hypothesis suggests that a key point of contact between the theory of grab, mr and the interpretation control is the natural link between the theory of knowledge representation and the theory of knowledge proeessing. That is, for two classes of languages to be interpretable by humaln being, there exists an interpreting procedure. Consequently, if we can show that there is an adjusted grab,far for the two languages plus an adequate interpreting procedure to predetermine the mechanism of' our two-way simultaneous interpretation, then we have some support for our methodology. In order to guarantee two-way simultaneous interpretation, there are several subareas to be inquired. The first is the type of representation constructed during the interpretation. The second is the method of utilizing the representation during the interpretation. The third is the measure of c~nputational complexity during the interpretation. These three components of a complete c(~aputational model are necessary for linking

Between Korean and Japanese, some common properties are observed, such as an agglutinative language structure and the identical word order(SOV). In addition, we sight three corresponding word order properties of segmented words between the two languages: For some (kl, k2) e:Sk and (jl, j2) eSj, where Sk and Sj are a set of Korean segmented wordsj a set of Japanese segmented words, respectively, and I a binary relation (interpretation) : \[Property I\] reflexivity (kl,k2) (jl,j2). e.g. 0~ W 4-~I q~l (our nation) \[Property 2\] synlnetry (jl,j2) (k2,kl). e.g.~55 --/~ ~,}.~,,! ~1 (one more time) \[Property 3\] transitivity (jl,no,j2) (kl,k2). e.g. Bako X %t-t'~ ~ (a Japanese) ~mong above properties, Property 3 depends upon Korean pragmatic information. 3.1.3 Computing model

The production form of a se\[{nented word of Korean or Japanese can be described in the rule forms in a regular granmar, and it is right linear. Since a language L generated by some right linear grammar G is regular, there exists a finite automaton which accepts L. If L is a context-free language and s is a substitution map such that for every a e V(a fixed vocabulary), s(a) is a context-free language, then s(L) is a context-free language. A type of substitution map that is of special intcrest is a hc~mmorphiaa. If' L is a regular language and h is a hom~rJorphia~l, then the range of tile inverse homomorphism ff~(L) is also regular language. And, for two given regular grammars G and G', if L(G) : L(G'), there is a sequence equivalence. Two sequences generate the same word order in the increasing length order. 3.2 Interpretation strategy of Norphological component 3.2.1. Data structure

The study of the structure of words occupies an important place within linguistics, sandwiched between phonology and syntax. Horphemes may also be partitioned into lexical and grammatical classes. Lexical morphemes are generally free, while many of the grammatical morphemes are bound. 3.2.2 C~nputing model

In a given Korean-Japanese (or Japanese-Korean) dictionary, let Dk be the set of morphemes of Korean, and Dj be the set of morphemes of Japanese. A mapping I between the sets is defined as follows. I(Dk) = Dj implying that the image of Dk is D j; taking the inverse mapping, I(Dj) : Dk. By generalizing the relation and the mapping between the two sets, we may consider the set of Korean words to be a domain, and the set of Japanese words a range. ~ssuming the same cardinality for both, Dk and Dj may be partitioned as shown below. Here we suppose {I<I, k2,..kn}eDk, {jl, j2,..jm}eDj: (I) one-to-one (ki,ji) e DkxDj. (2) one-to-many (ki. lJn.Ji2,...Ji,(il\]) e DkX2"i (3) many-to-nlany (Ikihki2,-..ki,(i)l,lJil,Ji2.'"Ji,,,(i)}) e 21)kx2 l)j where, A xB is the Cartesian product of the two sets A and B~ and 2 A is the a power set of a set A. Obviously, one-to-one correspondence is isomorphic. Naturally, our attention will be focused on the oneto-many and many-to-many relations. Interpretation of these relations depends on various factors: allomorph, synonym and homonym. Thus, as for the interpretation which is dependent on synonymy or polysemy, we charac-terize the interpretation by specifying the canonical form, or the semantic feature instantiation, respectively. 3.3 Syntax level interpretation strategy

We examine the syntactic structure of the two languages. Frcn~ the correspondence in a segTaented word and word order, it is seen intuitively that they are strongly equivalent. And there is a sufficient linguistic evidence for it based on the study of experimental comparative linguistics\[2\]. ~ phrase structure preserves each lexical semantic feature of a constituent structure, and a parse tree describes the construction of syntactic representation of a sentence. Horeover~ a partial tree in the whole parse tree plays a role of adjusting semantic and syntactic interpretation. Let us compare the examples of two parse tree constructions(Fig I):

VP VP ./ \, \

VP / \ / k V AUX VP P V AUX l I I I I { / NP PP VP

/\ VP I': I I / \

Fig 1: Syntactic trees of "(I) thought (somebody) went (to somewhere) ' It is obvious that parse trees coincide with each other in one-to-one fashion, but syntactic categories do not. This implies that two given languages, Korean and Japanese, do not generate the same set of sentential forms. Furthermore, there is no algorit~n for deciding whether or not two given context-free grammars generate the same sentential forms. This is the reason why we adopt the covering grar~ar technique to parse the source language for interpretation. 3.4 Semantics, pra~aatics and ambiguity

Semantics and pragmatics also play an important role in generating the well-formed target language. In the interpretation between Korean and Japanese, there exist several kinds of inherently ambiguous sentences which are generated only by the ambiguous gralrmars of both languages. (see 5°Fragments of interpretation) 4. K-J Gr~nmar

We design the K-J (or J-K) grammar which elgninates syntactical and semantical ambiguity of' both languages for interpretation° This gra~m~mr corresponds to the ccxnmunicative c~npetence for the interpretation between Korean and Japanese. The K-J (J-K) grammar is motivated by grammar modification and the coverinl\] grammar. ALGORITHM" irregularity categories removal or adjustment and semantic features insertion.

Input: a 5-tuple phrase structure grammar G : (N,Tk,Tj, P,S).

Output: an equivalent 5-tuple phrase structure grammar G' : (N',Tk'\[semj\],Tj',P',S'). Method: entpirical and heuristic method° llere N and I~' are nontermina\].s, Tk, Tj, Tk' ancl T o ' are terminals, sem~ is semantic features, P and P' are production rules, ~nd S and S' are the start symbols° The J-K granmmr is designed analogously. In the framc~vorl,: of the generalized phrase structure grammar, the semantic features are accepted by a special phrase structure rule, that is a linking rule, which causes the relevant information about the phrase to be passed down the tree as a feature on the syntactic nodes. Therefore, interpretation procedure is constructed by a succinct algorithn founded on the K-J(J-K) grammar. 5. Fragments of Irlterpretation

In this section, we exhibit the frap~nents of our intcrpretatJon system: how phrase structure rules and semantic £eatures interact in the interpretation procedure aceordJ.ru;; to the K-.J(J-K) grai~lt/iai". 5.1 \[Iomonymous construction

There are some kinds of construction types provided by syntax relations of each constituent. Among them, modificatiorl is a construction type related to Head and Attributes. Coordination imp\].ies that more than two subconstituents have syntactical coordination re-. lation. Let us consider the following Japanese utteranoes : I) t~2~.I-~,~ \[T\] ~>~<~o (modification

"(Someone) goes to school, and eats bread. 2) ~>99:< \[-c\] g*~-~9<o (coordination

"(Someone) eats \[)read and goes to school.' The two utterances imply the semantic notions of modification and coordination, respectively, but have the same conjunction morpheme \[tel. Semantically, they are represented in Korean by the outcome of interpretation as follows: I) ~d;,~°ll d*l ~'~~z '~#4=*:I. (modification) 2) ,,~.~-,-'\].,L q;,'-ol\] ",\]'-l. (coordination) All such morpheme ambiguities induce not only lexieal semantic ambiguity but sentential ambiguity, in order to interpret such ambiguous utterances, we c~nploy semantic feature specification as the discipline of the semantic conjunction schemata. The foilowin{~ rules account immediately for the sentences in the example. Here we use the GPSG notations: (I) modification schema S-> \[l\[sem~0 , Conj *l \], lt\[\]semc~ 1\] where~Y~\[ (0,1), (O,O) \] (2) eoordinat:i.on schema S -> \[i\[semrm , Conj .,'- \], II\[sem~l \] where~,(:{ (I,0), (1,1) \] 5.2 Missing construction

Korean and Japanese allow one of" the constituents of a sentence not to be explicitly stated when it is understandable fr~ll the context. In the GPSG framework, this kind of difference can be expressed by a FOOT feature S\[,ASH\[3\]. The SLAS\[I feature indicates that somethinl\] is lilJ.ssJ.n 6 in the structure dominated by the category specified, in this subsection, we exhibit a semantically ~m\]biguous utterance across a h(~nonytilous construction and a missing construction° Consider the fol\].owing Korean utterance. This utteranee also has inherent syntactical and semantic ambil~ui ty. 1 ) *l-R-°llq -',4°I 'ffd-<: 11,II°1 -~.:~. This utterance has two distinct syntactic trees: ( 1 ) j i / S/PP \ / \ PP\[de\] '\] ~J ~b lJ, 6 #,/PP VP

" . . . . _<F. _'>~.

S/PP Conj\[ P- \] ~fJ~ ~4~o PP/PP VP / "Frcxn Seoul came a report that there was a fire (in Seoui ) ' (2) SIPP. /\]S- " S/PP / S Con j\[ ~ \] PP/PP VP -11"- i jC"E "(From Seoul) came a report that there was a fire in Seoul ' In the above example, h~nonymous construction does not arise in Japanese, but missing construction remains. We ~nploy a parse tree (2) for semantic adjustment, and fill the gap of local environment with syntactically and semantically agreeable vocabulary; then such utterance of Korean and Japanese is interpretable without ambiguity° Consequently, the utterance of Korean I) is interpreted as follows° \[\[seoul- cj~ \[kazi-ga okitato\] \[se99~\-~a_/~ \[renraku ga kita\]\]\]. 327 / S S Conj \[ ~ \] PP S S "From Seoul came a report that there was a fire in Seoul ' 6. K-J(J-k) system

In order to define a two-way interpretation system more formally, we formulate the internal interfaee(K-J system) for the interpretation. This interface corresponds to the transducer of interpretation. We can define the K-J(J-K) syster,; as a 3-tuple grammar G:(wj,k(or j),wk ), wherewk and w i are Korean words and Japanese words, respectively, and k(j) : Wi-~Wk ( Wk--~Wj ) is a homomorphism. The K-J(J-K) system G defines the following sequence preserving the word order: VP w~-k(wD, w~w~=k(wDk(w~),. ... . It also defines the language L(Gk) = {ki(wi)li>O}. As mentioned above, the K-J(J-K) systel;L constitutes a simple device for interpretation. A language defined by the K-J(J-K) systom corresponds to the target language. Inversely, the mapping j of w~ into w i is such that the inverse homc{i~orphi~i j(wO = {wilk(wi) =w~} , j :k exists. Thus, we define the two-way simultaneous interpretation system ~ by: j(Lk) :k ~(Lk) = {wilk(wj) eLk}. We can define our system ~ using the extended notion; the inverse homo~\]orphism can be replaced by the direct operation of a finite substitution. Consider a gra\[~ar(e.g. Korean) GK" = (Nk, Tk, Pk, Sk) and let j be a finite substitution, defined on the vocabulary (Nk u Tk)*, such that j(w) is a finite(possibly empty) set of word for each word A. We denote

j(Nk) = \[~j, j(Tk) = Tj, PjDj(Pk), Sjnj(Sk). Then, the gray, nat (e. g. Japanese) Gj = (Nj, Tj, Pj, Sj) Gk. If I(Gk), I(Cj) are the

of Gk and G j, respectively, I is an invariant for Gk and NARA

The complexity of the algorithm is usually measured by the growth rate of its time and space requirements, as a function of the size of its input (or the length of input string) to which the algorithm is applied. We adopt a finite state transducer as a computing model which governs the fundamental interpretation control. Since we do not count the time it takes to read the input, finite state languages have zero complexity. If reading the input is counted, then finite languages have time complexity of exactly A (the length of input string). Such languages are interpretable in exactly 328 is an interpretation of sets of all interpretation then I(G#') = I(Gj), and Gj. 7. Complexity of System time it, and then called real-time languages. The interpretation which is accompanied by co-occurrence dependency cannot be done in general without relying on arbitrary look-ahead or rescanning of the output. However, the nature of on line interpretation is unchangeable. Consequently, our system \]~R_& is interpreted in real-time. 8. Concluding Remarks

Our approach for constructing this system has both logical view and experimental view; the former is given by mathematical formalization, the latter by the correspondence of two languages. In the view of computational linguistics, we separated the mechanism of our two-way simultaneous interpretation system into the levels of abstract theory, algorit~ii, and implementation to carve out the results at each level ira more independent f'ashion. In order to do so, we specified four important levels of description; the lowest level is morphology, the second level is se~lented word, the third level is syntax and semantics, and the top level controls the computing model of each level. Hence, we could determine the range of correspondence between internal representations of both grammars, and the basic architecture of the machinery actually instantiates the algorithn. Consequently, our model produces the extra power by the proposed theory with multiple levels of representation and systematic mapping between the corresponding levels of two languages, because interpretation efficiency requires both functional and mathematical discussions. Nevertheless, the complete pragmatic interpretation still remains quite obscure. Finally, we confront the proble~,~ whether it is possible to construct a two-way simuitaneous interpretation system between other two different language systems such as Japanese and English. We presuppose that the key point of problem-solving is in the study of universality and individuality between two given languages. Acknowled ~F~ients We are deeply grateful to Prof. If. YAI~\[ADA for his encouragement. We would like to thank Dr. A. ADACHI and Dr. K. HASHIDA, for many stimulating discussions and for detailed commentsp and to I lr. Y. SHIRAI and Hr. I. FUJISIIIRO for suggestions to if~iprove the paper. References \[ I\] N. CHOHSKY, Aspects of the Theory of Syntax, I I.I.T. Press, Reading, 1963. \[2\] H. S. CI\]UNG, Current Korean: Elementary Sentence Patterns and Structures, Komasholin, Reading, 1982(in Japanese). \[3\] GAZDAR, KLEIN, PULLUM arld SAG, Generalized Phrase Structure Grammar, Blacl~Jell, Reading, 1985. \[4\] H. HORZ, Eine l'leue Invariante f(Jr Kontext-freie Sprachen, Theoretical Computer Science 11, 1980. \[5\] H. R. LE\]~IS, C. H. PAPADIHITRIOU, ELEMEf~TS OF THE THEORY OF COHPUTATION, Prentice-Hall, Inc. Reading, 1981. \[6\] A. NIHOLT, Context-Free Grill,mr: Cover, Normal Forms and Parsing, Springer, Reading, 1980. \[7\] A. SALOMAA, Jewels of Formal Language Theory, Computer Science Press, Reading, 1981.