Hindi Urdu Machine Transliteration using Finite-State Transducers

Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 537-544, Manchester, August 2008


M G Abbas Malik, Christian Boitet

GETALP, Laboratoire d'Informatique Grenoble, Université Joseph Fourier, France

abbas.malik@imag.fr, Christian.Boitet@imag.fr

Pushpak Bhattacharyya

Dept. of Computer Science and Engineering, IIT Bombay, India

pb@cse.iitb.ac.in

Finite-state Transducers (FST) can implement inter-dialectal transliteration very efficiently. We illustrate this on the Hindi and Urdu language pair. FSTs can also be used for translation between surface-close languages. We introduce UIT (universal intermediate transcription) for the same pair on the basis of their common phonetic repository, in such a way that it can be extended to other languages like Arabic, Chinese, English, French, etc. We describe a transliteration model based on FST and UIT, and evaluate it on Hindi and Urdu corpora.

Transliteration is mainly used to transcribe a word written in one language in the writing system of another language, while keeping an approximate phonetic equivalence. It is useful for MT (to create possible equivalents of unknown words) (Knight and Stall, 1998; Paola and Sanjeev, 2003), cross-lingual information retrieval (Pirkola et al., 2003), the development of multilingual resources (Yan et al., 2003) and multilingual text and speech processing. Inter-dialectal translation without lexical changes is quite useful and sometimes even necessary when the dialects in question use different scripts; it can be achieved by transliteration alone. That is the case of HUMT (Hindi-Urdu Machine Transliteration), where each word has to be transliterated from Hindi to Urdu and vice versa, irrespective of its type (noun, verb, etc., and not only proper nouns or unknown words).

"One man's Hindi is another man's Urdu" (Rai, 2000). The major difference between Hindi and Urdu is that the former is written in the Devanagari script with a more Sanskritized vocabulary, and the latter is written in the Urdu script (a derivation of the Persio-Arabic script) with more vocabulary borrowed from Persian and Arabic. In contrast to the transcriptional difference, Hindi and Urdu share grammar, morphology, a huge vocabulary, history, classical literature, cultural heritage, etc. Hindi is the national language of India, with 366 million native speakers. Urdu is the national language of Pakistan and one of the state languages of India, with 60 million native speakers (Rahman, 2004). Table 1 gives an idea about the size of Hindi and Urdu.

Hindi and Urdu, being varieties of the same language, cover a huge proportion of the world's population. People from Hindi and Urdu communities can understand the verbal expressions of each other but not the written expressions. HUMT is an effort to bridge this scriptural divide between India and Pakistan.

Hindi and Urdu scripts are briefly introduced in section 2. Universal Intermediate Transcription (UIT) is described in section 3, and UIT mappings for Hindi and Urdu are given in section 4. Contextual HUMT rules are presented and discussed in section 5. An HUMT system implementation and its evaluation are provided in sections 6 and 7. Section 8 is on future work and conclusion.


	Native Speakers	2nd Language Speakers	Total
Hindi	366,000,000	487,000,000	853,000,000
Urdu	60,290,000	104,000,000	164,290,000
Total	426,290,000	591,000,000	1,017,000,000
Table 1: Size of Hindi and Urdu

2 HUMT

There exist three languages at the border between India and Pakistan: Kashmiri, Punjabi and Sindhi. All of them are mainly written in two scripts, one being a derivation of the Persio-Arabic script and the other being the Devanagari script. A person using the Persio-Arabic script cannot understand the Devanagari script and vice versa. The same is true for Hindi and Urdu, which are varieties or dialects of the same language, called Hindustani by Platts (1909).

PMT (Punjabi Machine Transliteration) (Malik, 2006) was a first effort to bridge the scriptural divide between the two scripts of Punjabi, namely Shahmukhi (a derivation of the Persio-Arabic script) and Gurmukhi (a derivation of Landa, Shardha and Takri, old Indian scripts). HUMT is a logical extension of PMT. Our HUMT system is generic and flexible, so that it can be extended to handle similar cases like Kashmiri, Punjabi, Sindhi, etc. Like PMT, HUMT is a special type of machine transliteration.

A brief account of Hindi and Urdu is first given for unacquainted readers.

2.1 Hindi

The Devanagari (literally "godly urban") script, a simplified version of the alphabet used for Sanskrit, is a left-to-right script. Each consonant symbol inherits by default the vowel sound [a]. Two or more consonants may be combined together to form a cluster called Conjunct that marks the absence of the inherited vowel [a] between two consonants (Kellogg, 1872; Montaut, 2004). A sentence illustrating Devanagari is given below:

[hindi hindustan ki qomi zuban hae] (Hindi is the national language of India)

2.2 Urdu

Urdu is written in an alphabet derived from the Persio-Arabic alphabet. It is a right-to-left script and the shape assumed by a character in a word is context-sensitive, i.e. the shape of a character is different depending on whether its position is at the beginning, in the middle or at the end of a word (Zia, 1999). A sentence illustrating Urdu is given below:

اردو پاکستان کی قومی زبان ہے

[urdu pakistan ki qomi zuban hæ] (Urdu is the national language of Pakistan.)

UIT (Universal Intermediate Transcription) is a scheme to transcribe texts in Hindi, Urdu, Punjabi, etc. in an unambiguous way, encoded in the ASCII range 32 - 126, since a text in this range is portable across computers and operating systems (James, 1993; Wells, 1995). SAMPA (Speech Assessment Methods Phonetic Alphabet) is a widely accepted scheme for encoding the IPA (International Phonetic Alphabet) into ASCII. It was first developed for Danish, Dutch, French, German and Italian, and since then it has been extended to many languages like Arabic, Czech, English, Greek, Hebrew, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, etc. We define UIT as a logical extension of SAMPA. The UIT encoding for Hindi and Urdu is developed on the basis of the rules and principles of SAMPA and X-SAMPA (Wells, 1995), which cover all symbols on the IPA chart. Phonemes are the most appropriate invariants to mediate between the scripts of Hindi, Punjabi, Urdu, etc., so that the encoding choice is logical and suitable.
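The portability constraint stated above (ASCII range 32 - 126) is easy to check mechanically. The following is a minimal Python sketch, not part of the HUMT system; the function name `is_uit_portable` is hypothetical, and the sample strings are a UIT word and UIT codes that appear in this paper.

```python
def is_uit_portable(text: str) -> bool:
    """Return True if every character lies in the printable ASCII range 32-126."""
    return all(32 <= ord(ch) <= 126 for ch in text)


# UIT codes and a UIT word taken from this paper
samples = ["b_h", "t_d_h", "d_Z", "@", "{", "fAx@t_dA"]
print(all(is_uit_portable(s) for s in samples))

# A word in Urdu script is outside the range
print(is_uit_portable("\u0627\u0631\u062F\u0648"))
```

A check of this kind only verifies the character range; it says nothing about whether a string is a well-formed UIT transcription.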

For the analysis and comparison, scripts of Hindi and Urdu are divided into different groups on the basis of character types.

Consonants are grouped into two categories:

Aspirated Consonants: Hindi and Urdu both have 15 aspirated consonants. In Hindi, 11 aspirated consonants are represented by separate characters, e.g. ख [kʰ], भ [bʰ], etc. The remaining 4 consonants are represented by combining a simple consonant to be aspirated and the conjunct form of HA ह [h], e.g. ल [l] + ् + ह [h] = ल्ह [lʰ].

In Urdu, all aspirated consonants are represented by a combination of a simple consonant to be aspirated and Heh Doachashmee (ھ) [h], e.g. ک [k] + ھ [h] = کھ [kʰ], ب [b] + ھ [h] = بھ [bʰ], ل [l] + ھ [h] = لھ [lʰ], etc.

The UIT mapping for aspirated consonants is given in Table 2.

Hindi	Urdu	UIT	Hindi	Urdu	UIT
भ	بھ [bʰ]	b_h	र्ह	رھ [rʰ]	r_h
फ	پھ [pʰ]	p_h	ढ़	ڑھ [ɽʰ]	r'_h
थ	تھ [t̪ʰ]	t_d_h	ख	کھ [kʰ]	k_h
ठ	ٹھ [ʈʰ]	t'_h	घ	گھ [gʰ]	g_h
झ	جھ [dʒʰ]	d_Z_h	ल्ह	لھ [lʰ]	l_h
छ	چھ [tʃʰ]	t_S_h	म्ह	مھ [mʰ]	m_h
ध	دھ [d̪ʰ]	d_d_h	न्ह	نھ [nʰ]	n_h
ढ	ڈھ [ɖʰ]	d'_h
Table 2: UIT mapping for aspirated consonants

Non-aspirated Consonants: Hindi has 29 non-aspirated consonant symbols representing 28 consonant sounds, as both SHA (श) and SSA (ष) represent the same sound [ʃ]. Similarly, Urdu has 35 consonant symbols representing 27 sounds, as multiple characters are used to represent the same sound, e.g. Heh (ح) and Heh-Goal (ہ) represent the sound [h], and Theh (ث), Seen (س) and Sad (ص) represent the sound [s], etc.

UIT mapping for non-aspirated consonants is given in Table 3.

Hindi	Urdu	UIT	Hindi	Urdu	UIT
ब	ب [b]	b	स	ص [s]	s2
प	پ [p]	p	ज़	ض [z]	z2
त	ت [t̪]	t_d	त	ط [t̪]	t_d1
ट	ٹ [ʈ]	t'	ज़	ظ [z]	z3
स	ث [s]	s1	-	ع [ʔ]	?
ज	ج [dʒ]	d_Z	ग़	غ [ɣ]	X
च	چ [tʃ]	t_S	फ़	ف [f]	f
ह	ح [h]	h1	क़	ق [q]	q
ख़	خ [x]	x	क	ک [k]	k
द	د [d̪]	d_d	ग	گ [g]	g
ड	ڈ [ɖ]	d'	ल	ل [l]	l
ज़	ذ [z]	z1	म	م [m]	m
र	ر [r]	r	न	ن [n]	n
ड़	ڑ [ɽ]	r'	व	و [v]	v
ज़	ز [z]	z	ह	ہ [h]	h
	ژ [ʒ]	Z	य	ی [j]	j
स	س [s]	s	त	ۃ [t̪]	t_d2
श	ش [ʃ]	S	ण	- [ɳ]	n'
ष	ش [ʃ]	S1		ں [ŋ]	~
Table 3: UIT mapping for non-aspirated consonants

Hindi has 11 vowels and 10 of them have nasalized forms. They are represented by 11 independent vowel symbols, e.g. आ [a], ऊ [u], औ [ɔ], etc., and 10 dependent vowel symbols, e.g. ा [a], ू [u], ौ [ɔ], etc., called maatraas. When a vowel comes at the start of a word or a syllable, the independent form is used; otherwise the dependent form is used (Kellogg, 1872; Montaut, 2004).

Urdu contains 10 vowels and 7 of them have nasalized forms (Hussain, 2004; Khan, 1997). Urdu vowels are represented using four long vowels (Alef Madda (آ), Alef (ا), Vav (و) and Choti Yeh (ی)) and three short vowels (Arabic Fatha or Zabar ( َ ), Arabic Damma or Pesh ( ُ ) and Arabic Kasra or Zer ( ِ )). Vowel representation is context-sensitive in Urdu. Vav (و) and Choti Yeh (ی) are also used as consonants.

Hamza (ء) is a place holder between two successive vowel sounds, e.g. in کمائی [kəmai] (earning), Hamza (ء) separates the two vowel sounds Alef (ا) [a] and Choti Yeh (ی) [i]. Noonghunna (ں) is used as nasalization marker. Analysis and mapping of Hindi Urdu vowels is given in Table 5.

Urdu contains 15 diacritical marks. They represent vowel sounds, except Hamza-e-Izafat ( ٔ ) and Kasr-e-Izafat ( ِ ), which are used to build compound words, e.g. ادارۂ سائنس [idarəhisains] (Institute of Science), تاریخِ پیدائش [tarixipedaiʃ] (date of birth), etc. Shadda ( ّ ) is used to geminate a consonant, e.g. ربّ [rəbb] (God), اچّھا [ətʃtʃʰa] (good), etc. Jazm ( ْ ) is used to mark the absence of a vowel after the base consonant (Platts, 1909). In Hindi, the conjunct form is used to geminate a consonant. Urdu diacritical marks mapping is given in Table 4.

Hindi	Urdu	UIT	Hindi	Urdu	UIT
-	 َ [ə]	@	ा	 ٰ [a]	A
ि	 ِ [ɪ]	I		 ً [ən]	@n
ु	 ُ [ʊ]	U		 ٌ [ʊn]	Un
ू	 ٗ [u]	u		 ٍ [ɪn]	In
ी	 ٖ [i]	i
Table 4: Urdu diacritical marks mapping

Diacritical marks are present in Urdu but are sparingly used by people. They are very important for the correct pronunciation and for understanding the meaning of a word. For example,

یہ سڑک بہت چوڑی ہے
[je səɽək buhət tʃɔɽi hæ] (This is a wide road.)

میری چوڑی سرخ ہے
[meri tʃuɽi surəx hæ] (My bangle is red.)

In the first sentence, the word چوڑی is pronounced as [tʃɔɽi] (wide) and in the second, it is pronounced as [tʃuɽi] (bangle). There should be Zabar ( َ ) and Pesh ( ُ ) after Cheh (چ) in the above words, and the correct transcriptions are چَوڑی (wide) and چُوڑی (bangle). Diacritical marks are thus essential for removing ambiguities, natural language processing and speech synthesis.
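The role of Zabar and Pesh in this example can be illustrated with a small Python sketch. It is only an illustration and not the XFST implementation used in HUMT: the tables cover just the characters of this one word, the UIT codes are those listed in the paper's consonant and vowel tables, and the function name `to_uit` is hypothetical.

```python
# Unicode code points for the characters of the example word
CHEH, VAV, RREH, CHOTI_YEH = "\u0686", "\u0648", "\u0691", "\u06CC"
ZABAR, PESH = "\u064E", "\u064F"  # Arabic Fatha and Arabic Damma

CONSONANTS = {CHEH: "t_S", RREH: "r'"}
# Zabar + Vav and Pesh + Vav are distinct vowels; bare Vav defaults to "o"
VOWELS = {ZABAR + VAV: "O", PESH + VAV: "u", VAV: "o"}


def to_uit(word: str) -> str:
    """Encode the example word into UIT, preferring two-character vowel sequences."""
    out, i = [], 0
    while i < len(word):
        pair, ch = word[i:i + 2], word[i]
        if pair in VOWELS:
            out.append(VOWELS[pair])
            i += 2
        elif ch in VOWELS:
            out.append(VOWELS[ch])
            i += 1
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
        elif ch == CHOTI_YEH and i == len(word) - 1 and i > 0 and word[i - 1] in CONSONANTS:
            # Word-final Choti Yeh after a consonant is the vowel [i]
            out.append("i")
            i += 1
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(out)


wide = CHEH + ZABAR + VAV + RREH + CHOTI_YEH
bangle = CHEH + PESH + VAV + RREH + CHOTI_YEH
bare = CHEH + VAV + RREH + CHOTI_YEH  # no diacritical marks

print(to_uit(wide), to_uit(bangle), to_uit(bare))
```

With the marks present the two words receive different UIT strings; without them both collapse onto the same string, which matches neither pronunciation.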

In this section, UIT mappings of Hindi Urdu alphabets and contextual rules that are necessary for Hindi-Urdu transliteration are discussed.

UIT mappings for Hindi and Urdu alphabets and their vowels are given in Tables 2 - 5. In Hindi, SHA (श) and SSA (ष) both represent the sound [ʃ] and have one equivalent symbol in Urdu, i.e. Sheen (ش). To make a distinction between SHA (श) and SSA (ष) in UIT, they are mapped on S and S1 respectively. Similarly in Urdu, Seh (ث), Seen (س) and Sad (ص) represent the sound [s] and have one equivalent symbol in Hindi, i.e. SA (स). To make a distinction among them in UIT, they are mapped on s1, s and s2 respectively. All similar cases are shown in Table 6.


Vowel	Urdu	Hindi (UIT)
ə	It is represented by Alef (ا) + Zabar ( َ ) at the start of a word, e.g. اَب [əb] (now), and by Zabar ( َ ) in the middle of a word, e.g. رَبّ [rəbb] (God). It never comes at the end of a word.	अ (@)
a	It is represented by Alef Madda (آ) at the start of a word, e.g. آدمی [admi] (man), and by Alef (ا) or Alef Madda (آ) in the middle of a word, e.g. جانا [dʒana] (go), بالآخر [bilaxar] (at last). At the end of a word, it is represented by Alef (ا). In some Arabic loan words, it is represented by Choti Yeh (ی) + Khari Zabar ( ٰ ) at the end of a word, e.g. اعلیٰ [aʔla] (Superior), and by Khari Zabar ( ٰ ) in the middle of a word, e.g. الٰہی [ilahi] (God).	आ or ा (A)
e	It is represented by Alef (ا) + Choti Yeh (ی) at the start of a word, e.g. ایثار [esar] (sacrifice), ایک [ek] (one), etc., and by Choti Yeh (ی) or Baree Yeh (ے) in the middle of a word, e.g. میرا [mera] (mine), اندھیرا [əndʰera] (darkness), بےگھر [begʰər] (homeless), etc. At the end of a word, it is represented by Baree Yeh (ے), e.g. سارے [sare] (all).	ए or े (e)
æ	It is represented by Alef (ا) + Zabar ( َ ) + Choti Yeh (ی) at the start of a word, e.g. [æh] (this), and by Zabar ( َ ) + Choti Yeh (ی) in the middle of a word, e.g. مَیل [mæl] (dirt). At the end of a word, it is represented by Zabar ( َ ) + Baree Yeh (ے), e.g. ہَے [hæ] (is).	ऐ or ै ({)
ɪ	It is represented by Alef (ا) + Zer ( ِ ) at the start of a word, e.g. اِس [ɪs] (this), and by Zer ( ِ ) in the middle of a word, e.g. بارِش [barɪʃ] (rain). As a vowel it never comes at the end of a word; at the end of a word, Zer is used as Kasr-e-Izafat to build compound words.	इ or ि (I)
i	It is represented by Alef (ا) + Zer ( ِ ) + Choti Yeh (ی) at the start of a word, e.g. ایمان [iman] (belief), and by Zer ( ِ ) + Choti Yeh (ی) in the middle or at the end of a word, e.g. امیری [amiri] (richness), قریب [qarib] (near), etc.	ई or ी (i)
ʊ	It is represented by Alef (ا) + Pesh ( ُ ) at the start of a word, e.g. اُدھر [ʊdʰər] (there), and by Pesh ( ُ ) in the middle of a word, e.g. مُلّ [mʊll] (price). It never comes at the end of a word.	उ or ु (U)
u	It is represented by Alef (ا) + Pesh ( ُ ) + Vav (و) at the start of a word, e.g. اُونگھتا [ũgʰəta] (dozing), and by Pesh ( ُ ) + Vav (و) in the middle or at the end of a word, e.g. صُورت [surat] (face), ترازُو [tarazu] (physical balance), etc.	ऊ or ू (u)
o	It is represented by Alef (ا) + Vav (و) at the start of a word, e.g. اوچھا [otʃʰa] (nasty), and by Vav (و) in the middle or at the end of a word, e.g. [holi] (slowly), کہو [kaho] (say), etc.	ओ or ो (o)
ɔ	It is represented by Alef (ا) + Zabar ( َ ) + Vav (و) at the start of a word, e.g. اَوٹ [ɔʈ] (hindrance), and by Zabar ( َ ) + Vav (و) in the middle or at the end of a word, e.g. مَوت [mɔt] (death).	औ or ौ (O)
r̥	It is represented by the consonant symbol Reh (ر) [r], as this vowel is present only in Sanskrit loan words and is not present in Urdu. It is rarely used in modern standard Hindi.	ऋ or ृ (r1)
Note: In Hindi, nasalization of a vowel is done by adding Anunasik (ँ) or Anusavar (ं) after the vowel. Anusavar (ं) is used when the vowel graph goes over the upper line; otherwise Anunasik (ँ) is used (Kellogg, 1872; Montaut, 2004). In UIT, ~ is added at the end of the UIT encoding for nasalization of all the above vowels except the last one, which does not have a nasalized form.
Table 5: Analysis and mapping of Hindi Urdu vowels


IPA	Urdu (UIT)	Hindi (UIT)
t̪	ت (t_d), ط (t_d1), ۃ (t_d2)	त (t_d)
s	ث (s1), س (s), ص (s2)	स (s)
h	ح (h1), ہ (h)	ह (h)
Table 6: Hindi Urdu multi-equivalences

Multi-equivalences are problematic for Hindi-Urdu transliteration.
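Why such multi-equivalences are problematic can be seen in a minimal Python sketch of a round trip Urdu to UIT to Hindi and back, using the [s] row of Table 6. The dictionaries and the function `round_trip` are illustrative names only; in particular, the assumption that Hindi SA is always encoded as `s` follows the Hindi (UIT) column of that table.

```python
SEH, SEEN, SAD = "\u062B", "\u0633", "\u0635"  # Urdu letters for [s]
SA = "\u0938"                                  # Devanagari letter SA

URDU_TO_UIT = {SEH: "s1", SEEN: "s", SAD: "s2"}
UIT_TO_HINDI = {"s1": SA, "s": SA, "s2": SA}   # all three collapse onto SA
HINDI_TO_UIT = {SA: "s"}                       # Hindi side has a single code
UIT_TO_URDU = {code: letter for letter, code in URDU_TO_UIT.items()}


def round_trip(urdu_letter: str) -> str:
    """Urdu -> UIT -> Hindi -> UIT -> Urdu for a single letter."""
    hindi = UIT_TO_HINDI[URDU_TO_UIT[urdu_letter]]
    return UIT_TO_URDU[HINDI_TO_UIT[hindi]]


for letter in (SEH, SEEN, SAD):
    print(letter, "->", round_trip(letter))
```

Only Seen survives the round trip; Seh and Sad both come back as Seen, because the distinction carried by `s1` and `s2` is lost once the text is in Devanagari.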

UIT is extendable to other languages like English, French, Kashmiri, Punjabi, Sindhi, etc. For example, Punjabi has one more character than Urdu, i.e. Rnoon [ɳ] (ݨ), which is mapped on n' in UIT. Similarly, UIT, a phonetic encoding scheme, can be extended to other languages.

All these mappings can be implemented by simple finite-state transducers using XEROX's XFST (Beesley and Karttunen, 2003) language. A sample XFST code is given in Figure 1.

# Urdu to UIT mapping rules
read regex [ب -> b, پ -> p, ج -> [d "_" Z] ];
read regex [[ج ھ] -> [d "_" Z "_" h]];
# Contextual rules for Vav and Choti Yeh
read regex [و -> v, ی -> j || .#. _ ];
read regex [و -> v, ی -> j || _ [ا | آ]];
read regex [ی -> e || CONSONANTS _ ];
read regex [ی -> i || CONSONANTS _ .#. ];
# Hindi to UIT mapping rules
read regex [ब -> b, प -> p, ज़ -> z, झ -> [d "_" Z "_" h]];
read regex [अ -> "@", आ -> A, ई -> i || .#. _ ];

Figure 1: Sample XFST code

Finite-state transducers are robust and time and space efficient (Mohri, 1997). They are a logical choice for Hindi-Urdu transliteration via UIT, as this problem can also be seen as string matching that produces an analysis string as output, like finite-state morphological analysis.
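For readers without access to XFST, the effect of the first two mapping rules of Figure 1 can be approximated in plain Python. This is a sketch under stated assumptions, not the authors' implementation: it replaces rule composition by a longest-match lookup, and the names `MAPPING` and `enconvert` are hypothetical.

```python
BEH, PEH, JEEM = "\u0628", "\u067E", "\u062C"
HEH_DOACHASHMEE = "\u06BE"

MAPPING = {
    BEH: "b",
    PEH: "p",
    JEEM: "d_Z",
    JEEM + HEH_DOACHASHMEE: "d_Z_h",  # aspirated form, must win over bare Jeem
}


def enconvert(text: str) -> str:
    """Map Urdu characters to UIT, trying the longest key first."""
    keys = sorted(MAPPING, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(MAPPING[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # characters outside the sample table pass through
            i += 1
    return "".join(out)


print(enconvert(JEEM + HEH_DOACHASHMEE))
print(enconvert(BEH + JEEM))
```

Trying longer keys first plays the role that rule ordering plays in the XFST script: the digraph for the aspirated consonant has to be consumed before the rule for the simple consonant can apply.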

UIT mappings need to be accompanied by necessary contextual HUMT rules for correct Hindi to Urdu transliteration and vice versa.

For example, Vav (و) and Choti Yeh (ی) are used to represent vowels like [o], [ɔ], [i], [e], etc., but they are also used as consonants. Vav (و) and Choti Yeh (ی) are consonants when they come at the beginning of a word or when they are followed by Alef Madda (آ) or Alef (ا). Also, Choti Yeh (ی) represents the vowel [e] when it is preceded by a consonant, but when it comes at the end of a word and is preceded by a consonant, it represents the vowel [i]. These rules are shown in red colour in Figure 1.

Thus HUMT contextual rules are necessary for Hindi-Urdu transliteration, and they can also be implemented as finite-state transducers using XFST. Not all of these rules can be given here due to shortage of space.
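The contextual rules for Vav and Choti Yeh described above can be sketched as a small Python function. It covers only the rules stated in this section; the consonant set is a small sample chosen for the examples, the function name `classify` is hypothetical, and positions not covered by these rules return `None` rather than a guess.

```python
ALEF, ALEF_MADDA = "\u0627", "\u0622"
VAV, CHOTI_YEH = "\u0648", "\u06CC"
MEEM, REH, SEEN, LAM = "\u0645", "\u0631", "\u0633", "\u0644"
CONSONANTS = {MEEM, REH, SEEN, LAM}  # sample only, not the full Urdu inventory


def classify(word: str, i: int):
    """Return the UIT code for Vav or Choti Yeh at position i, or None if no rule applies."""
    ch = word[i]
    if ch not in (VAV, CHOTI_YEH):
        raise ValueError("only Vav and Choti Yeh are handled in this sketch")
    prev = word[i - 1] if i > 0 else None
    nxt = word[i + 1] if i + 1 < len(word) else None
    # Consonant at the beginning of a word or before Alef / Alef Madda
    if i == 0 or nxt in (ALEF, ALEF_MADDA):
        return "v" if ch == VAV else "j"
    # Choti Yeh after a consonant: [i] word-finally, [e] otherwise
    if ch == CHOTI_YEH and prev in CONSONANTS:
        return "i" if nxt is None else "e"
    return None


mera = MEEM + CHOTI_YEH + REH + ALEF               # [mera] (mine)
amiri = ALEF + MEEM + CHOTI_YEH + REH + CHOTI_YEH  # [amiri] (richness)
print(classify(mera, 1), classify(amiri, 4), classify(CHOTI_YEH + ALEF, 0))
```

The order of the tests matters in the same way as the order of the rules in Figure 1: the consonant contexts are checked before the vowel readings.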

The HUMT system exploits the simplicity, robustness, power, and time and space efficiency of finite-state transducers. Exactly the same transducer that encodes a Hindi or Urdu text into UIT can be used in the reverse direction to generate Hindi or Urdu text from the UIT encoded text. This two-way power of finite-state transducers (Mohri, 1997) has significantly reduced the effort needed to build the HUMT system. Another important strength of finite-state transducers is that they can be composed to build a single transducer that performs the same task as two or more transducers applied sequentially (Mohri, 1997). Composition not only allows us to build a direct Hindi ↔ Urdu transducer, but also helps to divide difficult and complex problems into simple ones, and has indeed simplified the process of building the HUMT system. A direct Hindi ↔ Urdu transducer can be used in applications where UIT encoding is not necessary, such as a Hindi-Urdu MT system.
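Inversion and composition can be illustrated on finite relations, which are the simplest possible stand-in for transducers. The Python sketch below works on sets of symbol pairs rather than on real finite-state machines, and the three-letter tables are a tiny sample with UIT codes from Table 3; `invert` and `compose` are illustrative helper names.

```python
def invert(relation):
    """Swap input and output sides of a relation."""
    return {(b, a) for a, b in relation}


def compose(r1, r2):
    """Relate a to c whenever r1 relates a to some b and r2 relates that b to c."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}


HINDI_UIT = {("\u0915", "k"), ("\u0917", "g"), ("\u0932", "l")}  # KA, GA, LA
URDU_UIT = {("\u06A9", "k"), ("\u06AF", "g"), ("\u0644", "l")}   # Kaf, Gaf, Lam

# Direct Hindi -> Urdu relation: Hindi -> UIT composed with UIT -> Urdu
HINDI_URDU = compose(HINDI_UIT, invert(URDU_UIT))
print(sorted(HINDI_URDU))
```

Inverting the composed relation gives the Urdu to Hindi direction without writing any new rules, which is the property the HUMT system relies on.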

The HUMT system can be extended to perform transliteration between two or more different scripts used for the same language, like Kashmiri, Kazakh, Malay, Punjabi, Sindhi, etc., or between language pairs like English-Hindi, English-Urdu, English-French, etc., by just introducing the respective transducers in the Finite-state Transducers Manager of the HUMT system to build a multilingual machine transliteration system.

[Diagram of the Hindi Urdu Machine Transliteration System: Hindi or Urdu Unicode text, Text Tokenizer, Hindi or Urdu words, Hindi or Urdu Unicode text]

Figure 2: HUMT System

IPA	Urdu (UIT)	Hindi (UIT)
z	ذ (z1), ز (z), ژ (Z), ض (z2), ظ (z3)	ज़ (z)
ʃ	ش (S)	श (S), ष (S1)
r	ر (r)	र (r), ऋ (r1)
Table 6 (continued)

In the HUMT system, the Text Tokenizer takes the input Hindi or Urdu Unicode text, tokenizes it into Hindi or Urdu words and passes them to the UIT Enconverter. The enconverter enconverts Hindi or Urdu words into UIT words using the appropriate transducer from the Finite-state Transducers Manager, e.g. for Hindi words, it uses the Hindi ↔ UIT transducer. It passes these UIT encoded words to the UIT Deconverter, which deconverts them into Hindi or Urdu words using the appropriate transducer from the Finite-state Transducers Manager in reverse and generates the target Hindi or Urdu text.
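The flow through the Text Tokenizer, the UIT Enconverter and the UIT Deconverter can be sketched in a few lines of Python. This is a toy model under explicit assumptions: each "transducer" is a plain character table covering only the word [ka], contextual rules are ignored, and all function and variable names are hypothetical.

```python
# Toy "Finite-state Transducers Manager": one character table per script
TRANSDUCERS = {
    "hindi": {"\u0915": "k", "\u093E": "A"},  # KA, vowel sign AA
    "urdu": {"\u06A9": "k", "\u0627": "A"},   # Kaf, Alef
}


def tokenize(text):
    """Text Tokenizer: split the input text into words."""
    return text.split()


def enconvert(word, script):
    """UIT Enconverter: script characters -> list of UIT symbols."""
    table = TRANSDUCERS[script]
    return [table[ch] for ch in word]


def deconvert(uit_symbols, script):
    """UIT Deconverter: apply the script's table in reverse."""
    reverse = {uit: ch for ch, uit in TRANSDUCERS[script].items()}
    return "".join(reverse[sym] for sym in uit_symbols)


def transliterate(text, source, target):
    words = tokenize(text)
    return " ".join(deconvert(enconvert(w, source), target) for w in words)


print(transliterate("\u0915\u093E", "hindi", "urdu"))
```

Keeping the UIT word as a list of symbols avoids having to re-segment multi-character codes such as `t_d` during deconversion; a real transducer handles this segmentation internally.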

The Hindi ↔ UIT transducer is a composition of the mapping rules transducers and the contextual rules transducers. This is shown in Figure 3 with a sample XFST code.

clear stack
set char-encoding UTF-8
define CONSONANTS क | ख | ग | घ | ङ | छ |
read regex [C-> J,C:-> h,C-> 0];
read regex [क -> k, ख -> [k "_" h], ग -> g, घ -> [g "_" h], ङ -> [n "@" g], च -> [t "_" S], छ -> [t "_" S "_" h]];
read regex [[क ् क] -> [k k], [क ् ख] -> [k k "_" h], [ग ् ग] -> [g g], [ग ् घ] -> [g g "_" h]];
read regex [[pr7fC]-> [k h],[TT]-> [n A],C]-> [j h],FC]-> [v h] || .#. _ .#.];
compose net

Figure 3: Sample XFST code of the Hindi ↔ UIT transducer

The enconversion is illustrated with the help of an example. Take the Hindi sentence:

फ़ाख़ता मुहबत और अमन का निशान है

[faxəta muhəbət ɔr əmən ka niʃan hæ]

(Dove is symbol of love and peace)

This sentence is received by the Text Tokenizer and is tokenized into Hindi words, which are enconverted into UIT words by the UIT Enconverter using the mapping and the contextual rules of the Hindi ↔ UIT transducer. The Hindi words and their UIT enconversions are given in Table 7.

For the deconversion, the Hindi ↔ UIT or Urdu ↔ UIT transducer is applied in reverse on the UIT enconverted words to generate Hindi or Urdu words. To continue with the example of the previous section, the UIT words are deconverted into Urdu words by the UIT Deconverter using the Urdu ↔ UIT transducer in reverse. The Urdu words are given in Table 8 with the Hindi and the UIT words.

Finally, the following Urdu sentence is generated from the Urdu words.

Here the word फ़ाख़ता [faxəta] (Dove) is transliterated wrongly into فاختا, because the vowel [a] at the end of some Urdu words (borrowed from Persian) is written with Heh-gol [h] (ہ). This phenomenon is a problem for Hindi to Urdu transliteration but not for Urdu to Hindi transliteration.

For evaluation purposes, we used a Hindi corpus containing 374,150 words and an Urdu corpus with 38,099 words. The Hindi corpus is extracted from the Hindi WordNet developed by the Resource Center for Indian Language Technology Solutions, CSE Department, Indian Institute of Technology (IIT) Bombay, India (http://www.cfilt.iitb.ac.in) and from the project CIFLI (GETALP-LIG, University Joseph Fourier, http://www.liglab.fr), a project for building resources and tools for network-based "linguistic survival" communication between French, English and Indian languages like Hindi, Tamil, etc. The Urdu corpus was developed manually from a book titled ظلمت کدہ [zulmət kəda]. The Hindi-Urdu corpus contains in total 412,249 words.

The HUMT system is an initial step to build Urdu resources and add Urdu to the languages of SurviTra-CIFLI (Survival Translation) (Boitet et al., 2007), a multilingual digital phrase-book that helps tourists with communication and enquiries such as restaurant and hotel reservations, flight enquiries, etc.

Hindi Words	UIT
फ़ाख़ता [faxəta]	fAx@t_dA
मुहबत [muhəbət]	mUh@b@t_d
और [ɔr]	Or
अमन [əmən]	@m@n
का [ka]	kA
निशान [niʃan]	nISAn
है [hæ]	H{
Table 7: Hindi Words with UIT

Hindi	UIT	Urdu
फ़ाख़ता [faxəta]	fAx@t_dA
मुहबत [muhəbət]	mUh@b@t_d
और [ɔr]	Or	اور
अमन [əmən]	@m@n	امن
का [ka]	kA	کا
निशान [niʃan]	nISAn
है [hæ]	H{
Table 8: Hindi, UIT and Urdu Words

To reduce evaluation and testing efforts, unique words are extracted from the Hindi-Urdu corpus and are transliterated using the HUMT system. These unique words and their transliterations are checked for accuracy with the help of dictionaries (Platts, 1911; Feroz).

While transliterating Urdu into Hindi, several problems occur, such as multi-equivalences, no-equivalences and missing diacritical marks in Urdu text.

For example, Sheen [ʃ] (ش) can be transliterated in Hindi into SHA [ʃ] (श) or SSA [ʃ] (ष), which are present in 7,917 and 6,399 corpus words respectively. Sheen [ʃ] (ش) is transliterated into SHA [ʃ] (श) by default. Thus, 6,399 words containing SSA [ʃ] (ष) are wrongly transliterated into Hindi using HUMT. Urdu to Hindi multi-equivalence cases are given in Table 9 with their frequencies.

Urdu	Hindi (corpus frequency)
ش [ʃ]	श (7,917), ष (6,399)
ر [r]	र (79,345), ऋ (199)
Table 9: Urdu → Hindi Multi-equivalences

Some Hindi characters do not have equivalent characters in Urdu, e.g. NNA [ɳ] (ण), the retroflex version of [n], has been approximately mapped onto Noon [n] (ن). This creates a problem when a word actually containing NNA [ɳ] (ण) is transliterated from Urdu to Hindi. No-equivalence cases are given in Table 10.

Urdu	Hindi (corpus frequency)
-	ण (4,744)
-	5 (0)
-	3T (532)
Table 10: Urdu → Hindi No-equivalences

Missing diacritical marks are the major problem when transliterating Urdu into Hindi. The importance of diacritical marks has already been explained in section 4.3. This work assumes that all necessary diacritical marks are present in the Urdu text, because they play a vital role in Urdu to Hindi transliteration. Results of Urdu to Hindi transliteration are given in Table 11.

Hindi to Urdu transliteration also has multi-equivalence and no-equivalence problems, which are given in Table 12.

Hindi	Urdu (corpus frequency)
त	ت (41,751), ط (1,312)
स	س (53,289), ص (751), ث (86)
ह	ہ (72,850), ح (1,800)
ज़	ز (2,551), ض (1,489), i (228), i (215), ژ (2)
-	ع (2,857)
Table 12: Hindi → Urdu Multi-equivalences and No-equivalences
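The cost of a fixed default for a multi-equivalent character can be estimated directly from corpus frequencies such as those of Table 9. The Python sketch below picks the most frequent alternative as the default and counts the remaining words as errors. Choosing the default by frequency is an assumption made for this illustration (the paper only states that Sheen defaults to SHA); the dictionary keys are descriptive labels, and `default_and_errors` is a hypothetical name.

```python
# Corpus frequencies of the Hindi alternatives for two Urdu letters (Table 9)
TABLE_9 = {
    "Sheen": {"SHA": 7917, "SSA": 6399},
    "Reh": {"RA": 79345, "vocalic R": 199},
}


def default_and_errors(candidates):
    """Return the most frequent alternative and the number of words it gets wrong."""
    default = max(candidates, key=candidates.get)
    errors = sum(count for name, count in candidates.items() if name != default)
    return default, errors


for letter, candidates in TABLE_9.items():
    print(letter, default_and_errors(candidates))
```

For Sheen this reproduces the 6,399 wrongly transliterated words mentioned in the text; resolving such cases properly would need lexical or contextual information beyond character frequencies.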

Interestingly, Hindi to Urdu conversion is 14.47 percentage points less accurate on the unique words (83.41%) than on the corpus data (97.88%), whereas the reverse conversion is slightly more accurate on the unique words (98.54%) than on the corpus data (97.12%).

The HUMT system gives 97.12% accuracy for Urdu to Hindi and 97.88% accuracy for Hindi to Urdu on the corpus. Thus, on average, the HUMT system works with 97.50% accuracy.
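These figures can be re-derived from the error counts reported in the results tables (8,740 error words for Hindi to Urdu and 11,874 for Urdu to Hindi). The Python sketch below assumes that corpus accuracy is computed over the full 412,249-word Hindi-Urdu corpus in both directions; that assumption is not stated explicitly in the paper, but it reproduces the reported percentages.

```python
TOTAL_WORDS = 412_249  # size of the Hindi-Urdu corpus


def accuracy(error_words, total=TOTAL_WORDS):
    """Percentage of correctly transliterated words, rounded to two decimals."""
    return round(100 * (1 - error_words / total), 2)


hindi_to_urdu = accuracy(8_740)
urdu_to_hindi = accuracy(11_874)
overall = round((hindi_to_urdu + urdu_to_hindi) / 2, 2)

# Gap between corpus and unique-word accuracy for Hindi to Urdu
gap = round(hindi_to_urdu - 83.41, 2)

print(hindi_to_urdu, urdu_to_hindi, overall, gap)
```

The overall figure is a simple mean of the two directions, not a word-weighted average; since both directions are evaluated on the same number of words here, the two coincide.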

Hindi-Urdu transliteration is one of the cases where one language is written in two or more mutually incomprehensible scripts, as is also the case for Kazakh, Kashmiri, Malay, Punjabi, Sindhi, etc. The HUMT system can be enhanced by extending UIT and introducing the respective finite-state transducers. It can similarly be enhanced to transliterate between language pairs, e.g. English-Arabic, English-Hindi, English-Urdu, French-Hindi, etc. Thus, it can be enhanced to build a multilingual machine transliteration system that can be used for cross-scriptural transliteration and MT.

We intend to resolve the problems of multi-equivalences and no-equivalences and, most importantly, the restoration of diacritical marks in Urdu text, which are observed but left unattended in the current work. Restoration of diacritical marks in Urdu, Sindhi, Punjabi, Kashmiri, etc. texts is essential for word sense disambiguation, natural language processing and speech synthesis of the said languages.

	Error Words	Accuracy
Corpus	11,874	97.12%
Unique Words	123	98.54%
Table 11: Urdu → Hindi Transliteration Results

	Error Words	Accuracy
Corpus	8,740	97.88%
Unique Words	1,400	83.41%
Table 13: Hindi → Urdu Transliteration Results

The HUMT system will also provide a basis for the development of inter-dialectal translation systems and MT systems for surface-close languages like Indonesian-Malay, Japanese-Korean, Hindi-Marathi, Hindi-Urdu, etc. Translation of surface-close languages or inter-dialectal translation can be performed by using mainly transliteration and some lexical translations. Thus HUMT will also provide a basis for Cross-Scriptural Transliteration, Cross-scriptural Information Retrieval, Cross-scriptural Application Development, inter-dialectal translation and translation of surface-close languages.

Finite-state transducers are very efficient, robust, and simple to use. Their simplicity and powerful features are exploited in the HUMT model to perform Hindi-Urdu transliteration using UIT, which is a generic and flexible encoding scheme to uniquely encode natural languages into ASCII. The HUMT system gives 97.50% accuracy when it is applied on the Hindi-Urdu corpora containing 412,249 words in total. It is an endeavor to bridge the scriptural, ethnic, cultural and geographical division between 1,017 million people around the globe.

Acknowledgement

This study is partially supported by the project CIFLI funded under ARCUS-INDIA program by Ministry of Foreign Affairs and Rhone-Alpes region.

References

Beesley, Kenneth R. and Karttunen, Lauri. 2003. Finite State Morphology. CSLI Publications, USA.

Boitet, Christian. Bhattacharyya, Pushpak. Blanc, Etienne. Meena, Sanjay. Boudhh, Sangharsh. Fafiotte, Georges. Falaise, Achille. Vacchani, Vishal. 2007. Building Hindi-French-English-UNL Resources for SurviTra-CIFLI, a linguistic survival system under construction. Proceedings of the Seventh Symposium on NLP, 13 - 15 December, Chonburi, Thailand.

Feroz ul Din. فیروز اللغات. Feroz Sons Publishers, Lahore, Pakistan.

Hussain, Sarmad. 2004. Letter to Sound Rules for Urdu Text to Speech System. Proceedings of Workshop on Computational Approaches to Arabic Script-based Languages, COLING 2004, Geneva, Switzerland.

James, L. Hieronymus. 1993. ASCII Phonetic Symbols for the World's Languages: Worldbet. AT&T Bell Laboratories, Murray Hill, NJ 07974, USA.

Kellogg, Rev. S. H. 1872. A Grammar of Hindi Language. Delhi, Oriental Book Reprints.

Khan, Mehboob Alam. 1997. اردو کا صوتی نظام (Sound System in Urdu). National Language Authority, Pakistan.

Knight, K. and Graehl, J. 1998. Machine Transliteration. Computational Linguistics, 24(4).

Knight, K. and Stall, B G. 1998. Translating Names and Technical Terms in Arabic Text. Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages.

Malik, M. G. Abbas. 2006. Punjabi Machine Transliteration. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, July 2006, Sydney.

Mohri, Mehryar. 1997. Finite-state Transducers in Language and Speech Processing. Computational Linguistics, 23(2).

Montaut A. 2004. A Linguistic Grammar of Hindi. Studies in Indo-European Linguistics Series, München, Lincom Europa.

Paola, V. and Sanjeev, K. 2003. Transliteration of proper names in cross-language applications. Proceedings of the 26th annual International ACM SIGIR conference on research and development in information retrieval.

Pirkola, A. Toivonen, J. Keskustalo, H. Visala, K. and Järvelin, K. 2003. Fuzzy translation of cross-lingual spelling variants. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada.

Platts, John T. 1909. A Grammar of the Hindustani or Urdu Language. Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate hill, London. E.C.

Platts, John T. 1911. A Dictionary of Urdu, Classical Hindi and English. Crosby Lockwood and Son, 7 Stationers Hall Court, Ludgate hill, London, E.C.

Rahman, Tariq. 2004. Language Policy and Localization in Pakistan: Proposal for a Paradigmatic Shift. Crossing the Digital Divide, SCALLA Conference on Computational Linguistics.

Rai, Alok. 2000. Hindi Nationalism. Orient Longman Private Limited, New Delhi.

Wells, J C. 1995. Computer-coding the IPA: A Proposed Extension of SAMPA. University College London. http://www.phon.ucl.ac.uk/home/sampa/ipasam-x.pdf.

Yan Qu, Gregory Grefenstette, David A. Evans. 2003. Automatic transliteration for Japanese-to-English text retrieval. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.

Zia, Khaver. 1999a. Standard Code Table for Urdu. Proceedings of 4th Symposium on Multilingual Information Processing (MLIT-4), Yangon, Myanmar, CICC, Japan.