<?xml version="1.0"?><!DOCTYPE article SYSTEM "/project/take/software/searchbench_offline_processing/paperxml_generator/aclextractor/src/python/../resource/dtd/paperxml.dtd"><article><header><firstpageheader><page local="1" global="537"/><title>Hindi Urdu Machine Transliteration using Finite-State Transducers</title><pubinfo>Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 537-544, Manchester, August 2008</pubinfo><author surname="Malik" givenname="M. G. Abbas"><org  name="GTALP, Laboratoire d'Informatique Grenoble Université Joseph Fourier" country="France"/></author><author surname="Boitet" givenname="Christian"><org  name="Laboratoire d'Informatique Grenoble Université Joseph Fourier" country="France"/></author><author surname="Bhattacharyya" givenname="Pushpak"><org  name="Dept. of Computer Science and Engineering, IIT Bombay" country="India"/></author></firstpageheader><frontmatter><p><b>Hindi Urdu Machine Transliteration using Finite-state Transducers</b></p><p><b>M G Abbas Malik Christian Boitet</b></p><p>GTALP, Laboratoire d'Informatique Grenoble Université Joseph Fourier, France</p><p>abbas.malik@imag.fr,</p><p>Christian.Boitet@imag.fr</p></frontmatter><abstract>Finite-state transducers (FSTs) can implement inter-dialectal transliteration very efficiently. We illustrate this on the Hindi-Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (Universal Intermediate Transcription) for the same pair, based on their common phonetic repository, in such a way that it can be extended to other languages such as Arabic, Chinese, English, French, <i>etc. </i>We describe a transliteration model based on FSTs and UIT, and evaluate it on Hindi and Urdu corpora. 
</abstract></header><body><section number="1" title="Introduction"><p>Transliteration is mainly used to transcribe a word written in one language into the writing system of another language while keeping an approximate phonetic equivalence. It is useful for MT (to create possible equivalents of unknown words) (Knight and Stall, 1998; Paola and Sanjeev, 2003), cross-lingual information retrieval (Pirkola et al., 2003), the development of multilingual resources (Yan et al., 2003) and multilingual text and speech processing. Inter-dialectal translation without lexical changes is quite useful, and sometimes even necessary, when the dialects in question use different scripts; it can be achieved by transliteration alone. That is the case of HUMT (Hindi-Urdu Machine Transliteration), where each word has to be transliterated from Hindi to Urdu and <i>vice versa, </i>irrespective of its type (noun, verb, <i>etc.</i>, and not only proper noun or unknown word).</p><p>© 2008. Licensed under the <i>Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported </i>license (http://creativecommons.org/licenses/by-nc-sa/3.0/). Some rights reserved.</p><p><b>Pushpak Bhattacharyya</b></p><p>Dept. of Computer Science and Engineering, IIT Bombay, India pb@cse.iitb.ac.in</p><p>"One man's Hindi is another man's Urdu" (Rai, 2000). The major difference between Hindi and Urdu is that the former is written in the Devanagari script with a more Sanskritized vocabulary, while the latter is written in the Urdu script (a derivation of the Persio-Arabic script) with more vocabulary borrowed from Persian and Arabic. Apart from this difference in script, Hindi and Urdu share grammar, morphology, a huge vocabulary, history, classical literature, cultural heritage, <i>etc. </i>Hindi is the national language of India, with 366 million native speakers. Urdu is the national language of Pakistan and one of the state languages of India, with 60 million native speakers (Rahman, 2004). 
Table 1 gives an idea of the size of the Hindi and Urdu speaker populations.</p><p>Hindi and Urdu, being varieties of the same language, cover a huge proportion of the world's population. People from the Hindi and Urdu communities can understand each other's speech but not each other's writing. HUMT is an effort to bridge this scriptural divide between India and Pakistan.</p><p>The Hindi and Urdu scripts are briefly introduced in section 2. Universal Intermediate Transcription (UIT) is described in section 3, and UIT mappings for Hindi and Urdu are given in section 4. Contextual HUMT rules are presented and discussed in section 5. An implementation of the HUMT system and its evaluation are provided in sections 6 and 7. Section 8 presents future work and the conclusion.</p><table caption="Table 1: Hindi and Urdu speakers" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><b>Native Speakers</b></p></td><td class="cell"><p><b>2nd Language Speakers</b></p></td><td class="cell"><p><b>Total</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Hindi</p></td><td class="cell"><p>366,000,000</p></td><td class="cell"><p>487,000,000</p></td><td class="cell"><p>853,000,000</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Urdu</p></td><td class="cell"><p>60,290,000</p></td><td class="cell"><p>104,000,000</p></td><td class="cell"><p>164,290,000</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Total</p></td><td class="cell"><p>426,290,000</p></td><td class="cell"><p>591,000,000</p></td><td class="cell"><p><b>1,017,290,000</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td 
class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><page local="2" global="538"/></section><section number="2" title="HUMT"><p>Three languages are spoken at the border between India and Pakistan: Kashmiri, Punjabi and Sindhi. All of them are mainly written in two scripts, one derived from the Persio-Arabic script and the other being the Devanagari script. A person using the Persio-Arabic script cannot read the Devanagari script and <i>vice versa. </i>The same is true for Hindi and Urdu, which are varieties or dialects of the same language, called <i>Hindustani</i> by Platts (1909).</p><p>PMT (Punjabi Machine Transliteration) (Malik, 2006) was a first effort to bridge the divide between the two scripts of Punjabi, namely Shahmukhi (a derivation of the Persio-Arabic script) and Gurmukhi (a derivation of Landa, Shardha and Takri, old Indian scripts). HUMT is a logical extension of PMT. Our HUMT system is generic and flexible, so that it can be extended to handle similar cases such as Kashmiri, Punjabi, Sindhi, <i>etc. </i>Like PMT, HUMT is a special type of machine transliteration.</p><p>A brief account of Hindi and Urdu is first given for unacquainted readers.</p><subsection number="2.1" title="Hindi"><p>The Devanagari (literally "godly urban") script, a simplified version of the alphabet used for Sanskrit, is a left-to-right script. Each consonant symbol carries by default the inherent vowel sound [ə]. Two or more consonants may be combined to form a cluster called a conjunct, which marks the absence of the inherent vowel [ə] between two consonants (Kellogg, 1872; Montaut, 2004). 
A sentence illustrating Devanagari is given below:</p><p>हिन्दी हिन्दुस्तान की क़ौमी ज़बान है</p><p>[hindi hindustan ki qomi zuban hæ] (Hindi is the national language of India.)</p></subsection><subsection number="2.2" title="Urdu"><p>Urdu is written in an alphabet derived from the Persio-Arabic alphabet. It is a right-to-left script, and the shape assumed by a character in a word is context-sensitive, <i>i.e. </i>the shape of a character differs depending on whether it occurs at the beginning, in the middle or at the end of a word (Zia, 1999). A sentence illustrating Urdu is given below:</p><p>اردو پاکستان کی قومی زبان ہے</p><p>[urdu pakistan ki qomi zuban hæ] (Urdu is the national language of Pakistan.)</p></subsection></section><section number="3" title="Universal Intermediate Transcription"><p>UIT (Universal Intermediate Transcription) is a scheme for transcribing texts in Hindi, Urdu, Punjabi, <i>etc. </i>unambiguously, encoded in the ASCII range 32-126, since a text in this range is portable across computers and operating systems (James, 1993; Wells, 1995). SAMPA (Speech Assessment Methods Phonetic Alphabet) is a widely accepted scheme for encoding the IPA (International Phonetic Alphabet) in ASCII. It was first developed for Danish, Dutch, French, German and Italian, and has since been extended to many languages such as Arabic, Czech, English, Greek, Hebrew, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, <i>etc. </i>We define UIT as a logical extension of SAMPA. The UIT encoding for Hindi and Urdu is developed on the basis of the rules and principles of SAMPA and X-SAMPA (Wells, 1995), which cover all symbols on the IPA chart. 
Phonemes are the most appropriate invariants to mediate between the scripts of Hindi, Punjabi, Urdu, <i>etc., </i>so this encoding choice is logical and suitable.</p></section><section number="4" title="Analysis of Scripts and UIT Mappings"><p>For analysis and comparison, the characters of the Hindi and Urdu scripts are divided into groups on the basis of character type.</p><subsection number="4.1" title="Consonants"><p>These are grouped into two categories:</p><p><b>Aspirated Consonants: </b>Hindi and Urdu both have 15 aspirated consonants. In Hindi, 11 aspirated consonants are represented by separate characters, <i>e.g. </i>ख [kh], भ [bh], <i>etc. </i>The remaining 4 consonants are represented by combining the simple consonant to be aspirated with the conjunct form of HA ह [h], <i>e.g. </i>ल [l] + ◌् + ह [h] = ल्ह [lh].</p><p>In Urdu, all aspirated consonants are represented by a combination of the simple consonant to be aspirated and Heh Doachashmee (<b>ھ</b>) [h], <i>e.g. 
ک </i>[k] + <b>ھ </b>[h] = کھ [kh], ب [b] + <b>ھ </b>[h] = بھ [bh], ل [l] + <b>ھ </b>[h] = لھ [lh], <i>etc.</i></p><p>The UIT mapping for aspirated consonants is given in Table 2.<page local="3" global="539"/></p><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>भ</p></td><td class="cell"><p>بھ [bh]</p></td><td class="cell"><p>b_h</p></td><td class="cell"><p>र्ह</p></td><td class="cell"><p>رھ [rh]</p></td><td class="cell"><p>r_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>फ</p></td><td class="cell"><p>پھ [ph]</p></td><td class="cell"><p>p_h</p></td><td class="cell"><p>ढ़</p></td><td class="cell"><p>ڑھ [ɽh]</p></td><td class="cell"><p>r'_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>थ</p></td><td class="cell"><p>تھ [th]</p></td><td class="cell"><p>t_d_h</p></td><td class="cell"><p>ख</p></td><td class="cell"><p>کھ [kh]</p></td><td class="cell"><p>k_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>ठ</p></td><td class="cell"><p>ٹھ [ʈh]</p></td><td class="cell"><p>t'_h</p></td><td class="cell"><p>घ</p></td><td class="cell"><p>گھ [gh]</p></td><td class="cell"><p>g_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>झ</p></td><td class="cell"><p>جھ [dʒh]</p></td><td class="cell"><p>d_Z_h</p></td><td class="cell"><p>ल्ह</p></td><td class="cell"><p>لھ [lh]</p></td><td class="cell"><p>l_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p><b>Non-aspirated Consonants: </b>Hindi has 29 
non-aspirated consonant symbols representing 28 consonant sounds, as both SHA (श) and SSA (ष) represent the same sound [ʃ]. Similarly, Urdu has 35 consonant symbols representing 27 sounds, as multiple characters are used to represent the same sound, <i>e.g. </i>Heh (<b>ح</b>) and Heh-Goal (<b>ہ</b>) represent the sound [h], and Theh (<b>ث</b>), Seen (<b>س</b>) and Sad (<b>ص</b>) represent the sound [s], <i>etc.</i></p><p>The UIT mapping for non-aspirated consonants is given in Table 3.</p></subsection><subsection number="4.2" title="Vowels"><p>Hindi has 11 vowels and 10 of them have nasalized forms. They are represented by 11 independent vowel symbols, <i>e.g. </i>आ [a], ऊ [u], औ [ɔ], <i>etc.</i>, and 10 dependent vowel symbols, <i>e.g. </i>◌ा [a], ◌ू [u], ◌ौ [ɔ], <i>etc.</i>, called <i>maatraas. </i>When a vowel comes at the start of a word or a syllable, the independent form is used; otherwise the dependent form is used (Kellogg, 1872; Montaut, 2004).</p><p>Urdu contains 10 vowels and 7 of them have nasalized forms (Hussain, 2004; Khan, 1997). Urdu vowels are represented using four long vowels (Alef Madda (<b>آ</b>), Alef (<b>ا</b>), Vav (<b>و</b>) and Choti Yeh (<b>ی</b>)) and three short vowels (Arabic Fatha or Zabar, Arabic Damma or Pesh, and Arabic Kasra or Zer). Vowel representation is context-sensitive in Urdu. Vav (<b>و</b>) and Choti Yeh (<b>ی</b>) are also used as consonants.</p><p>Hamza (<b>ء</b>) is a placeholder between two successive vowel sounds, <i>e.g. </i>in کمائی [kəmai] (earning), Hamza (<b>ء</b>) separates the two vowel sounds Alef (<b>ا</b>) [a] and Choti Yeh (<b>ی</b>) [i]. Noonghunna (<b>ں</b>) is used as a nasalization marker. The analysis and mapping of Hindi and Urdu vowels are given in Table 5.</p></subsection><subsection number="4.3" title="Diacritical Marks"><p>Urdu contains 15 diacritical marks. They represent vowel sounds, except Hamza-e-Izafat and Kasr-e-Izafat, which are used to build compound words, <i>e.g. </i><b>ادارۂ سائنس 
</b>[idarəhisains] (Institute of Science), <b>تاریخِ پیدائش </b>[tarixipedaiʃ] (date of birth), <i>etc. </i>Shadda is used to geminate a consonant, <i>e.g. </i><b>ربّ </b>[rəbb] (God), <b>اچّھا </b>[ətʃtʃha] (good), <i>etc. </i>Jazm is used to mark the absence of a vowel after the base consonant (Platts, 1909). In Hindi, the conjunct form is used to geminate a consonant. The mapping of Urdu diacritical marks is given in Table 4.</p><p>Diacritical marks exist in Urdu but are used sparingly. They are very important for the correct pronunciation of a word and for understanding its meaning. For example,</p><p><b>یہ سڑک بہت چوڑی ہے </b>[je səɽək buhət tʃɔɽi hæ] (This is a wide road.) <b>میری چوڑی سرخ ہے </b>[meri tʃuɽi surəx hæ] (My bangle is red.) In the first sentence, the word <b>چوڑی </b>is pronounced [tʃɔɽi] (wide) and in the second, it is pronounced [tʃuɽi] (bangle).<page local="4" global="540"/> There should be</p><table caption="Table 2: Hindi Urdu aspirated consonants" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>«5          *=¥ W]</p></td><td class="cell"><p>t_S_h 3-</p></td><td class="cell"><p>?      <b>£</b><b>j</b><b>o </b>[mh] m_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>ÎT           <b>aj </b>[dh]</p></td><td class="cell"><p>d_d_h ?</p></td><td class="cell"><p>?       
*j [nh] n_h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>5       <b>aj </b>[dh]</p></td><td class="cell"><p>d'_h</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 3: Hindi Urdu non-aspirated consonants 4.2 Vowels" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><b>m </b><b>[b]</b></p></td><td class="cell"><p>b</p></td><td class="cell"><p></p></td><td class="cell"><p><footnote anchor="00"/> [s]</p></td><td class="cell"><p>s2</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>V <b>[p]</b></p></td><td class="cell"><p>p</p></td><td class="cell"><p></p></td><td class="cell"><p>^ [z]</p></td><td class="cell"><p>z2</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>H</p></td><td class="cell"><p>ù [t]</p></td><td class="cell"><p>t_d</p></td><td class="cell"><p>H</p></td><td class="cell"><p>^ [t]</p></td><td class="cell"><p>t_d1</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>z</p></td><td class="cell"><p>- [t]</p></td><td class="cell"><p>t'</p></td><td class="cell"><p></p></td><td class="cell"><p>i 
<b>[z]</b></p></td><td class="cell"><p>z3</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>^ <b>[s]</b></p></td><td class="cell"><p>s1</p></td><td class="cell"><p>-</p></td><td class="cell"><p>t [?]</p></td><td class="cell"><p>?</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>E [*]</p></td><td class="cell"><p>d_Z</p></td><td class="cell"><p>&lt;iï</p></td><td class="cell"><p>è [y]</p></td><td class="cell"><p>X</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>€ ffl</p></td><td class="cell"><p>t_S</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>f</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>?</p></td><td class="cell"><p>C <b>[h]</b></p></td><td class="cell"><p>h1</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p>q</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>m</p></td><td class="cell"><p>C <b>[x]</b></p></td><td class="cell"><p>x</p></td><td class="cell"><p></p></td><td class="cell"><p>^ <b>[k]</b></p></td><td class="cell"><p>k</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>z</p></td><td class="cell"><p>J [d]</p></td><td class="cell"><p>d_d</p></td><td class="cell"><p></p></td><td class="cell"><p>^ <b>[g]</b></p></td><td class="cell"><p>g</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>J [d]</p></td><td class="cell"><p>d'</p></td><td class="cell"><p></p></td><td class="cell"><p>J <b>[l]</b></p></td><td class="cell"><p>l</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>i <b>[z]</b></p></td><td 
class="cell"><p>z1</p></td><td class="cell"><p>W</p></td><td class="cell"><p>f <b>[m]</b></p></td><td class="cell"><p>m</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>J <b>[r]</b></p></td><td class="cell"><p>r</p></td><td class="cell"><p></p></td><td class="cell"><p>û <b>[n]</b></p></td><td class="cell"><p>n</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>s</p></td><td class="cell"><p>J [r]</p></td><td class="cell"><p>r'</p></td><td class="cell"><p></p></td><td class="cell"><p>J <b>[v]</b></p></td><td class="cell"><p>v</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>j <b>[z]</b></p></td><td class="cell"><p>z</p></td><td class="cell"><p>?</p></td><td class="cell"><p>» <b>[h]</b></p></td><td class="cell"><p>h</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>J [3]</p></td><td class="cell"><p>Z</p></td><td class="cell"><p></p></td><td class="cell"><p>* L<b>i]</b></p></td><td class="cell"><p>j</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>0* [s]</p></td><td class="cell"><p>s</p></td><td class="cell"><p>H</p></td><td class="cell"><p><footnote anchor="4"/> [t]</p></td><td class="cell"><p>t_d2</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>^ [J]</p></td><td class="cell"><p>S</p></td><td class="cell"><p></p></td><td class="cell"><p>- M</p></td><td class="cell"><p>n'</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>^ [J]</p></td><td class="cell"><p>S1</p></td><td class="cell"><p></p></td><td class="cell"><p>u [q]</p></td><td class="cell"><p>~</p></td><td class="cell"></td></tr><tr 
class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 4: Diacritical Marks of Urdu" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>-</p></td><td class="cell"><p>O [8]</p></td><td class="cell"><p>@</p></td><td class="cell"><p>CT</p></td><td class="cell"><p>O [a]</p></td><td class="cell"><p>A</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>O [i]</p></td><td class="cell"><p>I</p></td><td class="cell"><p></p></td><td class="cell"><p>O |o<b>n|</b></p></td><td class="cell"><p>@n</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>O</b></p></td><td class="cell"><p>O [u]</p></td><td class="cell"><p>U</p></td><td class="cell"><p>C*</p></td><td class="cell"><p>O [u<b>n]</b></p></td><td class="cell"><p>Un</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>"&lt;\</p></td><td class="cell"><p>O <b>[u]</b></p></td><td class="cell"><p>u</p></td><td class="cell"><p></p></td><td class="cell"><p><i>O </i><b>[m]</b></p></td><td class="cell"><p>In</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>O <b>[i]</b></p></td><td 
class="cell"><p>i</p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Zabar and Pesh after Cheh (چ) in the above words, respectively; the correct transcriptions are چَوڑی (wide) and چُوڑی (bangle). Diacritical marks are therefore essential for removing ambiguities, for natural language processing and for speech synthesis.</p></subsection></section><section number="5" title="HUMT Rules"><p>In this section, the UIT mappings of the Hindi and Urdu alphabets and the contextual rules that are necessary for Hindi-Urdu transliteration are discussed.</p><subsection number="5.1" title="UIT Mappings"><p>UIT mappings for the Hindi and Urdu alphabets and their vowels are given in Tables 2 to 5. In Hindi, SHA (श) and SSA (ष) both represent the sound [ʃ] and have one equivalent symbol in Urdu, <i>i.e. </i>Sheen (ش). To distinguish SHA (श) from SSA (ष) in UIT, they are mapped to S and S1 respectively. Similarly, in Urdu, Seh (ث), Seen (س) and Sad (ص) represent the sound [s] and have one equivalent symbol in Hindi, <i>i.e. </i>SA (स). To distinguish among them in UIT, they are mapped to s1, s and s2 respectively. All similar cases are shown in Table 6.<page local="5" global="541"/></p><table caption="Table 5: Analysis and Mapping of Hindi Urdu Vowels" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Vowel</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>Hindi (UIT)</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>ə</b></p></td><td class="cell"><p>It is represented by Alef (<b>ا</b>) + Zabar at the start of a word, <i>e.g. </i><b>اَب </b>[əb] (now), and by Zabar in the middle of a word, <i>e.g. </i><b>رَبّ </b>[rəbb] (God). It never comes at the end of a word.</p></td><td class="cell"><p>अ (@)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>a</p></td><td class="cell"><p>It is represented by Alef Madda (<b>آ</b>) at the start of a word, <i>e.g. </i><b>آدمی </b>[admi] (man), and by Alef (<b>ا</b>) or Alef Madda (<b>آ</b>) in the middle of a word, <i>e.g. </i><b>جانا </b>[dʒana] (go), <b>بالآخر </b>[bilaxər] (at last). At the end of a word, it is represented by Alef (<b>ا</b>). In some Arabic loan words, it is represented by Choti Yeh (<b>ی</b>) + Khari Zabar at the end of a word, <i>e.g. </i><b>اعلیٰ </b>[əʔla] (Superior), and by Khari Zabar in the middle of a word, <i>e.g. </i><b>الٰہی </b>[ilahi] (God).</p></td><td class="cell"><p>आ or ◌ा (A)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>e</p></td><td class="cell"><p>It is represented by Alef (<b>ا</b>) + Choti Yeh (<b>ی</b>) at the start of a word, <i>e.g. </i><b>ایثار </b>[esar] (sacrifice), <b>ایک </b>[ek] (one), <i>etc.</i>, and by Choti Yeh (<b>ی</b>) or Baree Yeh (<b>ے</b>) in the middle of a word, <i>e.g. </i><b>میرا </b>[mera] (mine), <b>اندھیرا 
[3</b>ndhera] (darkness), <b>j*S^&gt; </b>[beg<b>hs</b>r] (homeless) <i>etc. </i>At the end of a word, It is represented by Baree Yeh (<b>^</b>) <i>e.g. </i><b>^jl^ </b>[sare] (all).</p></td><td class="cell"><p>IT or O (e)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Zabar - + Choti Yeh <b>(s) </b>at the start of a word <i>e.g. </i>[ash] (this) and by Zabar - + Choti Yeh <b>(s) </b>in the middle of a word <i>e.g. </i>[masl] (dirt). At the end of a word, it is represented by Zabar - + Baree Yeh (<b>^</b>) <i>e.g. ^ </i>[has] (is).</p></td><td class="cell"><p>* orO({)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>i</p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Zer - at the start of a word <i>e.g. </i><b>ol </b>[is] (this) and by Zer - in the middle of a word <i>e.g. </i><b>l</b><b>A</b><b>j</b><b>W </b>[barif] (rain). It never comes at the end of a word. At the end of a word, it is used as Kasr-e-Izafat to build compound words.</p></td><td class="cell"><p>? or fO (I)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>i</p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Zer - + Choti Yeh <b>(s) </b>at the start of a word <i>e.g. </i>[iman] (belief) and by Zer - + Choti Yeh <b>(s) </b>in the middle or at the end of a word <i>e.g. </i><b>sj^I </b>[amiri] (richness), [q<b>a</b>rib] (near), <i>etc.</i></p></td><td class="cell"><p>f or O (i)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>u</p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Pesh - at the start of a word <i>e.g. </i>[udd<b>h3</b>r] (there) and by Pesh - in the middle of a word <i>e.g. </i><b>i% </b>[mull] (price). 
It never comes at the end of a word.</p></td><td class="cell"><p>3 or O (U)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>u</p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Pesh - + Vav <b>(j) </b>at the start of a word <i>e.g. </i><b>lj$Sijf [ü</b>g<b>hq</b>ta] (dozzing) and by</p><p>Pesh - + Vav (<b>j) </b>in the middle or at the end of a word <i>e.g. </i><b>cjjjL, </b>[sur<b>a</b>t] (face), <b>jjIJ </b>[t<b>a</b>razu] (physical bal­ance), <i>etc.</i></p></td><td class="cell"><p>3&gt; or ° (u)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>o</p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Vav <b>(j) </b>at the start of a word <i>e.g. </i><b>W</b><b>j-jI </b>[otfha] (nasty) and by Vav <b>(j) </b>in the middle or at the end of a word <i>e.g.      </i>[holi] (slowly), <i>j£ </i>[k<b>a</b>ho] (say), <i>etc.</i></p></td><td class="cell"><p>3<b>fT </b>or <i>O </i>(o)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p>It is represented by Alef <b>(I) </b>+ Zabar - + Vav (<b>j) </b>at the start of a word <i>e.g. </i><b>ijl </b>[Df] (hindrance) and by Zabar -+ Vav <b>(j) </b>in the middle or at the end of a word <i>e.g. </i><b>^</b>&gt;<b>&gt;"</b>&gt; [mDt] (death).</p></td><td class="cell"><p>3ft or <b>o</b><b>T</b><b> </b>(O)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>r</p></td><td class="cell"><p>It is represented by a consonant symbol Reh <b>(j) </b>[r] as this vowel is only present in Sanskrit loan words. It is almost not used in modern standard Hindi. 
It is not present in Urdu as it is used only in Sanskrit loan words.</p></td><td class="cell"><p>ऋ or ◌ृ (rl)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>Note: In Hindi, nasalization of a vowel is done by adding Anunasik (◌ँ) or Anusavar (◌ं) after the vowel. Anusavar (◌ं) is used when the vowel graph goes over the upper line; otherwise Anunasik (◌ँ) is used (Kellogg, 1872; Montaut, 2004). In UIT, ~ is added at the end of the UIT encoding to nasalize any of the above vowels except the last one, which does not have a nasalized form.</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>IPA</b></p></td><td class="cell"><p><b>Urdu (UIT)</b></p></td><td class="cell"><p><b>Hindi (UIT)</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>t</p></td><td class="cell"><p>ت (t_d), ط (t_d1), <b><i>i </i></b>(t_d2)</p></td><td class="cell"><p>त (t_d)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>s</p></td><td class="cell"><p>ث (s1), س (s), ص (s2)</p></td><td class="cell"><p>स (s)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>h</p></td><td class="cell"><p>ح (h1), ہ (h)</p></td><td class="cell"><p>ह (h)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Multi-equivalences are problematic for Hindi-Urdu transliteration.</p><p>UIT can be extended to other languages such as English, French, Kashmiri, Punjabi, 
Sindhi, <i>etc. </i>For example, Punjabi has one more character than Urdu, <i>i.e. </i>Rnoon [ɳ] <b>(if), </b>which is mapped onto 'n ' in UIT. Similarly, UIT, a phonetic encoding scheme, can be extended to other languages.</p><p>All these mappings can be implemented by simple finite-state transducers using Xerox's XFST language (Beesley and Karttunen, 2003). <u>A sample XFST code is given in Figure 1.</u></p><doubt alpha="40.0" length="40" tooSmall="False" monospace="0.0">read regex [m-&gt; b,v-&gt; p,E-&gt; [d "_" Z] ];</doubt><doubt alpha="37.8" length="37" tooSmall="False" monospace="0.0">read regex [[ea] -&gt; [d "_" Z "_" h]];</doubt><doubt alpha="40.0" length="35" tooSmall="False" monospace="0.0">read regex[j-&gt; v,is-&gt; j || .#. _ ];</doubt><doubt alpha="45.7" length="35" tooSmall="False" monospace="0.0">read regex[j-&gt; v,is-&gt; j || _[I|I]];</doubt><doubt alpha="60.0" length="35" tooSmall="False" monospace="0.0">read regex[s-&gt; e || CONSONANTS _ ];</doubt><doubt alpha="39.3" length="28" tooSmall="False" monospace="0.0">read regex [s-&gt; i ||[|.#.]];</doubt><doubt alpha="34.0" length="100" tooSmall="False" monospace="0.0">read regex pT-&gt; b,*T-&gt; p,ö-&gt; z,    -&gt; [d "_" Z "_" h]]; read regex[3T-&gt; "@",3JT-&gt; A,f-&gt; i || .#. _ ]</doubt><p><b>Figure 1: Sample XFST code</b></p><p>Finite-state transducers are robust and efficient in time and space (Mohri, 1997). They are a logical choice for Hindi-Urdu transliteration via UIT, as this problem can also be seen as string matching that produces an analysis string as output, as in finite-state morphological analysis.</p></subsection><subsection number="5.2" title="Contextual HUMT Rules"><p>UIT mappings need to be accompanied by the necessary contextual HUMT rules for correct Hindi to Urdu transliteration and <i>vice versa.</i></p><p>For example, Vav <b>(و) </b>and Choti Yeh <b>(ی) </b>are used to represent vowels like [o], [d], [i], [e], <i>etc. </i>but they are also used as consonants. 
Vav <b>(و) </b>and Choti Yeh <b>(ی) </b>are consonants when they come at the beginning of a word or when they are followed by Alef mada <b>(آ) </b>or Alef <b>(ا). </b>Also, Choti Yeh <b>(ی) </b>represents the vowel [e] when it is preceded by a consonant, but when it comes at the end of a word and is preceded by a consonant it represents the vowel [i]. These rules are shown in red colour in Figure 1.</p><p>Thus HUMT contextual rules are necessary for Hindi-Urdu transliteration, and they can also be implemented as finite-state transducers using XFST. Not all of these rules can be given here for lack of space.</p></subsection></section><section number="6" title="HUMT System"><p>The HUMT system exploits the simplicity, robustness, power, and time and space efficiency of finite-state transducers. Exactly the same transducer that encodes a Hindi or Urdu text into UIT can be used in the reverse direction to generate Hindi or Urdu text from the UIT-encoded text. This two-way power of finite-state transducers (Mohri, 1997) has significantly reduced the effort needed to build the HUMT system. Another important strength of finite-state transducers is that they can be composed into a single transducer that performs the same task as two or more transducers applied sequentially (Mohri, 1997). This not only allows us to build a direct Hindi &lt;-&gt; Urdu transducer, but also helps to divide difficult and complex problems into simple ones, and has indeed simplified the process of building the HUMT system. A direct Hindi &lt;-&gt; Urdu transducer can be used in applications where UIT encoding is not necessary, such as a Hindi-Urdu MT system.</p><p>The HUMT system can be extended to perform transliteration between two or more different scripts used for the same language, as with Kashmiri, Kazakh, Malay, Punjabi, Sindhi, <i>etc. 
</i>or between language pairs like English-Hindi, English-Urdu, English-French, by just introducing the respective transducers in the Finite-state Transducers Manager of the HUMT system to build a <i>multilingual machine transliteration system.</i></p><p><b>Figure 2: HUMT System</b> (diagram labels: Hindi or Urdu Unicode text; Text Tokenizer; Hindi or Urdu words; Hindi Urdu Machine Transliteration System)<page local="6" global="542"/></p><table caption="Table 6: Multiple Characters for one IPA" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>z      i (z1), J (z), j (Z),     (z2), i (z3)</p></td><td class="cell"><p>ö (z)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>J </b>(S)</p></td><td class="cell"><p>?r (S), W (S1)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>r                       </b>J (r)</p></td><td class="cell"><p>T (r), ^ (r1)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>In the HUMT system, the Text Tokenizer takes the input Hindi or Urdu Unicode text, tokenizes it into Hindi or Urdu words and passes them to the UIT Enconverter. The enconverter enconverts Hindi or Urdu words into UIT words using the appropriate transducer from the Finite-state Transducers Manager, <i>e.g. </i>for Hindi words it uses the Hindi &lt;-&gt; UIT transducer. 
It passes these UIT-encoded words to the UIT Deconverter, which deconverts them into Hindi or Urdu words using the appropriate transducer from the Finite-state Transducers Manager in reverse and generates the target Hindi or Urdu text.</p><subsection number="6.1" title="Enconversion of Hindi-Urdu to UIT"><p>The Hindi &lt;-&gt; UIT transducer is a composition of the mapping-rule transducers and the contextual-rule transducers. This is shown in Figure 3 with sample XFST code.</p><p>clear stack set char-encoding UTF-8 define CONSONANTS pr7 j M j J j q j S j 55 j <u>compose net</u></p><doubt alpha="45.2" length="31" tooSmall="False" monospace="0.0">read regex[C-&gt; J,C:-&gt; h,C-&gt; 0];</doubt><doubt alpha="35.3" length="51" tooSmall="False" monospace="0.0">read regex     -&gt; k,M-&gt; [k "_" h],JT-&gt; g,q-&gt; [g "_"</doubt><doubt alpha="18.9" length="53" tooSmall="False" monospace="0.0">h],S-&gt; [n "@" g],q-&gt; [t "_" S],55-&gt; [t "_" S "_" h]];</doubt><doubt alpha="43.4" length="53" tooSmall="False" monospace="0.0">read regex [pf7C cf&gt;]-&gt; [k k], pf7C M]-&gt; [k k "_" h],</doubt><doubt alpha="29.0" length="31" tooSmall="False" monospace="0.0">FC-&gt; [g g],C q]-&gt; [g g "_" h]];</doubt><doubt alpha="37.3" length="75" tooSmall="False" monospace="0.0">read regex [pr7fC]-&gt; [k h],[TT]-&gt; [n A],C]-&gt; [j h],FC]-&gt; [v h]jj.#. _ .#.];</doubt><figure caption="Figure 3: Sample code for Hindi &lt;-&gt; UIT Transducer"></figure><p>How the HUMT system works is shown with the help of an example. Take the Hindi sentence:</p><p>^imdl W?^H 33t 3<b>T</b>WT<b>T </b>cfïï fTftnT<b>T </b>?</p><p>[faxsta muhgbgt or smsn ka niJan hœ]</p><p>(The dove is a symbol of love and peace)</p><p>This sentence is received by the Text Tokenizer and tokenized into Hindi words, which the UIT Enconverter enconverts into UIT words using the mapping and contextual rules of the Hindi &lt;-&gt; UIT transducer. 
The Hindi words and their UIT enconversions are given in Table 7.</p></subsection><subsection number="6.2" title="Deconversion of UIT to Hindi-Urdu"><p>For the deconversion, the Hindi &lt;-&gt; UIT or Urdu &lt;-&gt; UIT transducer is applied in reverse on the UIT enconverted words to generate Hindi or Urdu words. To continue with the example in the previous section, the UIT words are deconverted into Urdu words by the UIT Deconverter using the Urdu &lt;-&gt; UIT transducer in reverse. The Urdu words are given in Table 8 with the Hindi and the UIT words.</p><p>Finally, the following Urdu sentence is generated from Urdu words.</p><p>Here the word H-&gt;l*sJdi [faxgta] (Dove) is transliterated wrongly into 'tints' because the vowel [a] at the end of some Urdu words (borrowed from Persian) is transcribed with the help of Heh-gol [h] (ہ). This phenomenon is a problem for Hindi to Urdu transliteration but not for Urdu to Hindi transliteration.</p></subsection></section><section number="7" title="Evaluation Experiments and Results"><p>For evaluation purposes, we used a Hindi corpus containing 374,150 words and an Urdu corpus containing 38,099 words. The Hindi corpus is extracted from the Hindi WordNet<footnote anchor="2"/> developed by the Resource Center for Indian Language Technology Solutions, CSE Department, Indian Institute of Technology (IIT) Bombay, India, and from the project CIFLI (GETALP-LIG<footnote anchor="3"/>, University Joseph Fourier), a project for building resources and tools for network-based "linguistic survival" communication between French, English and Indian languages like Hindi, Tamil, etc. The Urdu corpus was developed manually from a book titled ^&amp;&gt;" [zulmgt kgda]. 
The Hindi-Urdu corpus contains 412,249 words in total.</p><p>The HUMT system is an initial step to build Urdu resources and add Urdu to the languages of<page local="7" global="543"/> SurviTra-CIFLI (Survival Translation) (Boitet et al, 2007), a multilingual digital phrase-book that helps tourists with communication and enquiries such as restaurant and hotel reservations, flight enquiries, etc.</p><footnote label="2"> http://www.cfilt.iitb.ac.in</footnote><footnote label="3"> http://www.liglab.fr</footnote><table caption="Table 7: Hindi Words with UIT" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Hindi Words</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>^imdl [faxsta]</p></td><td class="cell"><p>fAx@t_dA</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>W?^d [muhsbst]</p></td><td class="cell"><p>mUh@b@t_d</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>[or]</p></td><td class="cell"><p>Or</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>JWT [smsn]</p></td><td class="cell"><p>@m@n</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>cfïï [ka]</p></td><td class="cell"><p>kA</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>f?T?iïTT [niJan]</p></td><td class="cell"><p>nISAn</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>? [hœ]</p></td><td class="cell"><p>H{</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 8: Hindi, UIT and Urdu Words" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Hindi</b></p></td><td class="cell"><p><b>UIT</b></p></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>dirndl [faxsta]</p></td><td class="cell"><p>fAx@t_dA</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>W?^d [muhsbst]</p></td><td class="cell"><p>mUh@b@t_d</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>[or]</p></td><td class="cell"><p>Or</p></td><td class="cell"><p>Jj!</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>JWT [smsn]</p></td><td class="cell"><p>@m@n</p></td><td class="cell"><p>ù^I</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>cfïï [ka]</p></td><td class="cell"><p>kA</p></td><td class="cell"><p>l£</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>f?T?iïTT [niJan]</p></td><td class="cell"><p>nISAn</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>? [hœ]</p></td><td class="cell"><p>H{</p></td><td class="cell"><p></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>To reduce evaluation and testing effort, unique words are extracted from the Hindi-Urdu corpus and transliterated using the HUMT system. These unique words and their transliterations are checked for accuracy with the help of dictionaries (Platts, 1911; Feroz).</p><subsection number="7.1" title="Urdu → Hindi Transliteration Results"><p>Several problems occur when transliterating Urdu into Hindi: multi-equivalences, no-equivalences, and missing diacritical marks in Urdu text.</p><p>For example, Sheen [ʃ] (ش) can be transliterated in Hindi into SHA [ʃ] (श) or SSA [ʃ] (ष), which are present in 7,917 and 6,399 corpus words respectively. Sheen [ʃ] (ش) is transliterated into SHA [ʃ] (श) by default. Thus, 6,399 words containing SSA [ʃ] (ष) are wrongly transliterated into Hindi by HUMT. Urdu to Hindi multi-equivalence cases are given in Table 9 with their frequencies.</p><table caption="Table 9: Urdu → Hindi Multi-equivalences" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>Hindi (corpus frequency)</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>ش [ʃ]</p></td><td class="cell"><p>श (7,917), ष (6,399)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>ر [r]</p></td><td class="cell"><p>र (79,345), ऋ (199)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Some Hindi characters do not have equivalent characters in Urdu, <i>e.g. </i>NNA [ɳ] (ण), the retroflex version of [n], is approximately mapped onto Noon [n] (ن). 
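This creates a problem when a word actually containing NNA [ɳ] (ण) is transliterated from Urdu to Hindi. No-equivalence cases are given in Table 10.</p><table caption="Table 10: Urdu → Hindi No-equivalences" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Urdu</b></p></td><td class="cell"><p><b>Hindi (corpus frequency)</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>-</p></td><td class="cell"><p>W (4,744)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>-</p></td><td class="cell"><p>5 (0)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>-</p></td><td class="cell"><p>3T (532)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Missing diacritical marks are the major problem when transliterating Urdu into Hindi. The importance of diacritical marks has already been explained in section 4.3. This work assumes that all necessary diacritical marks are present in the Urdu text because they play a vital role in Urdu to Hindi transliteration. Results of Urdu to Hindi transliteration are given in Table 11.</p></subsection><subsection number="7.2" title="Hindi → Urdu Transliteration Results"><p>Hindi → Urdu transliteration also has multi-equivalence and no-equivalence problems, which are given in Table 12.</p><table caption="Table 12: Hindi → Urdu Multi &amp; No equivalences" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Urdu (corpus frequency)</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>^ (41,751), i (1,312)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>oo (53,289), oo (751), ^ (86)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>» (72,850), c (1,800)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>j (2,551), o± (1,489), i (228), i (215), j (2)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p>t (2,857)</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><p>Results of Hindi to Urdu transliteration are given in Table 13.</p><table caption="Table 13: Hindi → Urdu Transliteration Results"></table><p>Interestingly, Hindi to Urdu conversion is 14.47 percentage points less accurate on the unique words than on the corpus data (83.41% versus 97.88%), whereas the reverse conversion is slightly more accurate on the unique words (98.54% versus 97.12%).</p><p>The HUMT system gives 97.12% accuracy for Urdu to Hindi and 97.88% accuracy for Hindi to Urdu. 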
Thus, averaged over both directions, the HUMT system works with 97.50% accuracy.</p><table caption="Table 11: Urdu → Hindi Transliteration Results" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><b>Error Words</b></p></td><td class="cell"><p><b>Accuracy</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Corpus</b></p></td><td class="cell"><p>11,874</p></td><td class="cell"><p>97.12%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Unique Words</b></p></td><td class="cell"><p>123</p></td><td class="cell"><p>98.54%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table><table caption="Table 13: Hindi → Urdu Transliteration Results" class="main" frame="box" rules="all" border="1" regular="False"><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p></p></td><td class="cell"><p><b>Error Words</b></p></td><td class="cell"><p><b>Accuracy</b></p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Corpus</b></p></td><td class="cell"><p>8,740</p></td><td class="cell"><p>97.88%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"><p><b>Unique Words</b></p></td><td class="cell"><p>1,400</p></td><td class="cell"><p>83.41%</p></td><td class="cell"></td></tr><tr class="row"><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td><td class="cell"></td></tr></table></subsection></section><section number="8" title="Future Implications"><p>Hindi-Urdu transliteration is one of the cases where one language is written in two or more mutually incomprehensible scripts; others include Kazakh, Kashmiri, Malay, Punjabi, Sindhi, <i>etc. </i>The HUMT system can be extended to these by extending UIT and introducing the respective finite-state transducers. It can similarly be extended to transliterate between language pairs, <i>e.g. </i>English-Arabic, English-Hindi, English-Urdu, French-Hindi, <i>etc. </i>Thus, it can grow into a <i>multilingual machine transliteration system </i>that can be used for <i>cross-scriptural transliteration </i>and MT.</p><p>We intend to resolve the problems of multi-equivalences and no-equivalences and, most importantly, the restoration of diacritical marks in Urdu text, which were observed but left unaddressed in the current work. Restoration of diacritical marks in Urdu, Sindhi, Punjabi, Kashmiri, <i>etc. </i>texts is essential for word sense disambiguation, natural language processing and speech synthesis of the said languages.</p><p>The HUMT system will also provide a basis for the development of <i>inter-dialectal </i>translation systems and MT systems for <i>surface-close </i>languages like Indonesian-Malay, Japanese-Korean,<page local="8" global="544"/> Hindi-Marathi, Hindi-Urdu, <i>etc. </i>Translation between <i>surface-close </i>languages, or <i>inter-dialectal </i>translation, can be performed mainly by transliteration plus some lexical translation. Thus HUMT will also provide a basis for <i>Cross-Scriptural Transliteration, Cross-scriptural Information Retrieval, Cross-scriptural Application Development, inter-dialectal translation </i>and <i>translation of surface-close languages.</i></p></section><section number="9" title="Conclusion"><p>Finite-state transducers are very efficient, robust, and simple to use. Their simplicity and powerful features are exploited in the HUMT model to perform Hindi-Urdu transliteration using UIT, a generic and flexible encoding scheme that uniquely encodes natural languages into ASCII. The HUMT system gives 97.50% accuracy when it is applied to the Hindi-Urdu corpora containing 412,249 words in total. It is an endeavor to bridge the scriptural, ethnic, cultural and geographical division between 1,017 million people around the globe.</p><p><b>Acknowledgement</b></p><p>This study is partially supported by the project CIFLI, funded under the ARCUS-INDIA program by the <i>Ministry of Foreign Affairs </i>and the <i>Rhone-Alpes </i>region.</p></section><references><p>Beesley, Kenneth R. and Karttunen, Lauri. 2003. <i>Finite State Morphology. </i>CSLI Publications, USA.</p><p>Boitet, Christian. Bhattacharyya, Pushpak. Blanc, Etienne. Meena, Sanjay. Boudhh, Sangharsh. Fafiotte, Georges. Falaise, Achille. Vacchani, Vishal. 2007. <i>Building Hindi-French-English-UNL Resources for SurviTra-CIFLI, a linguistic survival system under construction. </i>Proceedings of the Seventh Symposium on NLP, 13 - 15 December, Chonburi, Thailand.</p><p>Feroz ul Din. 
<b>j^J ^L^Ijjjja </b>Feroz Sons Publishers, Lahore, Pakistan.</p><p>Hussain, Sarmad. 2004. <i>Letter to Sound Rules for Urdu Text to Speech System. </i>Proceedings of Workshop on Computational Approaches to Arabic Script-based Languages, COLING 2004, Geneva, Switzerland.</p><p>James, L. Hieronymus. 1993. <i>ASCII Phonetic Symbols for the World's Languages: Worldbet. </i>AT&amp;T Bell Laboratories, Murray Hill, NJ 07974, USA.</p><p>Kellogg, Rev. S. H. 1872. <i>A Grammar of Hindi Language. </i>Delhi, Oriental Book Reprints.</p><p>Khan, Mehboob Alam. 1997. ^ <b>or<footnote anchor="1"/>'^ £ j^J </b>(Sound System in Urdu) National Language Authority, Pakistan.</p><p>Knight, K. and Graehl, J. 1998. <i>Machine Transliteration. </i>Computational Linguistics, 24(4).</p><p>Knight, K. and Stall, B G. 1998. <i>Translating Names and Technical Terms in Arabic Text. </i>Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.</p><p>Malik, M. G. Abbas. 2006. <i>Punjabi Machine Transliteration. </i>Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, July 2006, Sydney.</p><p>Mohri, Mehryar. 1997. <i>Finite-state Transducers in Language and Speech Processing. </i>Computational Linguistics, 23(2).</p><p>Montaut, A. 2004. <i>A Linguistic Grammar of Hindi. </i>Studies in Indo-European Linguistics Series, München, Lincom Europa.</p><p>Paola, V. and Sanjeev, K. 2003. <i>Transliteration of proper names in cross-language applications. </i>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.</p><p>Pirkola, A. Toivonen, J. Keskustalo, H. Visala, K. and Järvelin, K. 2003. <i>Fuzzy translation of cross-lingual spelling variants. </i>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada.</p><p>Platts, John T. 1909. <i>A Grammar of the Hindustani or Urdu Language. </i>Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate Hill, London, E.C.</p><p>Platts, John T. 1911. <i>A Dictionary of Urdu, Classical Hindi and English. </i>Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate Hill, London, E.C.</p><p>Rahman, Tariq. 2004. <i>Language Policy and Localization in Pakistan: Proposal for a Paradigmatic Shift. </i>Crossing the Digital Divide, SCALLA Conference on Computational Linguistics.</p><p>Rai, Alok. 2000. <i>Hindi Nationalism. </i>Orient Longman Private Limited, New Delhi.</p><p>Wells, J C. 1995. <i>Computer-coding the IPA: A Proposed Extension of SAMPA. </i>University College London. http://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf.</p><p>Yan Qu, Gregory Grefenstette, David A. Evans. 2003. <i>Automatic transliteration for Japanese-to-English text retrieval. </i>Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.</p><p>Zia, Khaver. 1999a. <i>Standard Code Table for Urdu. </i>Proceedings of 4th Symposium on Multilingual Information Processing (MLIT-4), Yangon, Myanmar, CICC, Japan.</p></references></body></article>