Machine Transliteration in Java

by Richard Calvi

Abstract:

This project is a Java application that takes text input in non-Roman alphabets and produces Romanized output. The design process involved searching for tables, researching the reading rules and special cases for the various scripts, and then finding ways to process the text efficiently in terms of memory and processing time. Transliteration techniques ranged from simple remapping routines and context-based searches to more involved analysis in the case of Thai. Languages supported include Greek, Russian, Japanese, Korean, Thai, Hindi, Gujarati and several Chinese dialects. In one case, a font-specific table was made, but generally, tables were saved in UTF-8 encoding and conformed to Unicode character mappings.

Javadoc Documentation is available here.

Screenshots are now available!


CONTENTS:
What is Transliteration?
Why Make a Transliterator?
Categories of Languages
Principles of Transliteration
Research Phase
Testing Process
Evaluation of Results
     Russian
     Greek
     Chinese
     Korean
          Hanja
          Hangeul
     Japanese
          Kana
          Kanji
     Thai
     Hindi (Devanagari)
     Gujarati
     Cherokee
     Yi
Languages that Could Not Be Transliterated Well
Text Sources
Table Sources

What is Transliteration?

Transliteration is the phonetic representation of one language using the alphabet of another language. For example, “Vladimir Putin” is the English transliteration of the Russian name “Владимир Путин.” When the target is the Roman (Latin) alphabet, which English uses, the process is called Romanization.

Although my program currently does only Romanization, it could easily transliterate into other alphabets, if given the proper data tables.

Why Make a Transliterator?

This project was originally begun as an attempt to look for borrowed English words in foreign texts. While it is easy to spot English words in languages that use a Latin-based alphabet, it is harder in other languages, such as Russian, Thai, or Chinese, unless you are familiar with their corresponding writing systems. Although it may not be very hard to learn the Russian alphabet, other languages can be a bit more of a challenge. Thai has more letters, and many more reading rules involved, and written Chinese can take years to master. Furthermore, even if all writing systems were as easy to learn as Russian is, there are simply too many alphabets in existence for any one person to master all of them. If one were to automate the process with a computer, it would theoretically be possible for anyone with access to a transliterator to “read” documents in foreign writing systems without much hesitation. If the transliterated output were clear and readable, English words would be somewhat easy to spot, and readers could have a vague idea of what any given foreign text “sounded” like.

Transliterators can be especially useful to English speakers who speak a foreign language fluently, but are not literate in it. Students beginning a foreign language course might also use it when trying to read foreign texts. With a phonetic guide, people can see how words in languages of a common family are related, even if the languages use completely different scripts. Plus, fans of foreign music could enjoy being able to pronounce the words in their songs without learning a foreign script.

Categories of Languages:

One underlying goal of this project was universality. The transliterator should support as many languages as possible, using as few specialized Java classes as possible. To do this, each transliterator class must be specific enough to handle most of the reading rules of a given language, yet general enough to support more than one language. This way, the transliterator can be expanded to support many more languages by providing more data tables rather than more Java code. To keep things general, I split the character sets that I set out to transliterate into basic categories loosely based on their characteristics. In some cases, most or all of the “general characteristics” did not apply to a language, but I was still able to format a transliteration data table that suited it.

Simple Replacement
- General characteristics:
- Character sets in this category: Unicode punctuation character sets
- Other character sets that might fall into this category: any other systems in which any given character is always transliterated in the same exact way
Simple Replacement with Capitalization (an extended class of the previous category)
- General characteristics:
- Languages currently in this category: Greek, Russian
- Other languages that might fall into this category: Other languages that use the Cyrillic alphabet (Ukrainian, Uzbek, Mongolian, etc.)
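
As a rough illustration of this category, the sketch below maps each character through a table and carries capitalization over to the output. The class name `CapitalizingReplacer` and the handful of table entries are assumptions made for the example, not the project's actual classes or tables. The Cyrillic letters are written as Unicode escapes so that the file compiles regardless of source encoding; the sample input spells “Владимир Путин”.

```java
import java.util.HashMap;
import java.util.Map;

final class CapitalizingReplacer {
    // Hypothetical table: lower-case Cyrillic letter -> Roman letters.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0430', "a");   // a
        TABLE.put('\u0432', "v");   // ve
        TABLE.put('\u0434', "d");   // de
        TABLE.put('\u0438', "i");   // i
        TABLE.put('\u043B', "l");   // el
        TABLE.put('\u043C', "m");   // em
        TABLE.put('\u043D', "n");   // en
        TABLE.put('\u043F', "p");   // pe
        TABLE.put('\u0440', "r");   // er
        TABLE.put('\u0442', "t");   // te
        TABLE.put('\u0443', "u");   // u
        TABLE.put('\u0448', "sh");  // sha
    }

    static String transliterate(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char lower = Character.toLowerCase(c);
            String roman = TABLE.get(lower);
            if (roman == null) {
                out.append(c);                 // not in the table: pass through unchanged
            } else if (c != lower) {
                // Upper-case source letter: capitalize only the first Roman letter.
                out.append(Character.toUpperCase(roman.charAt(0)))
                   .append(roman.substring(1));
            } else {
                out.append(roman);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "\u0412\u043B\u0430\u0434\u0438\u043C\u0438\u0440 "
                    + "\u041F\u0443\u0442\u0438\u043D";
        System.out.println(transliterate(name)); // prints "Vladimir Putin"
    }
}
```

Storing only lower-case entries keeps the table half as large; a multi-letter output such as “sh” becomes “Sh” when the source letter is upper case.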
Simple Replacement with Spaces (another extended class of the first category)
- General characteristics:
- Languages currently in this category: Cherokee, Yi
- Other languages that might fall into this category: Other languages that use a syllabary (a huge table of characters entailing every possible syllable in the language)
Thai-like
- General characteristics:¹²
  - Vowel “structures” that surround a consonant determine a vowel sound.
    (For example, if k were a consonant in Thai, “kา” would be “kaa”, “โk” would be “ko”, and “แk็ว” would be “kaeo”.)
  - Consonants can have completely different pronunciations, depending on whether they are initial consonants or final consonants. Only some consonants can be final consonants.
  - Certain pairs of consonants can combine to make consonant clusters.
  - If there is no vowel indicated for a syllable, there is assumed an “inherent” vowel.
  - Some characters are always transliterated in exactly the same way (such as numerals)
- Languages currently in this category: Hindi, Thai, Gujarati, Arabic, Hebrew
- Other languages that might fall into this category: Indian languages with alphabets similar to Hindi (Sanskrit, Bengali, Punjabi, etc...) and Southeast Asian languages with alphabets similar to Thai (Burmese, Khmer, Lao, etc...)
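
The vowel “structures” described above can be pictured as patterns with a slot for the consonant. The sketch below is a minimal illustration, assuming a hypothetical class `VowelStructures`, a two-entry consonant table, and only the three patterns from the example (“kา”, “โk”, “แk็ว”). It does not handle final consonants, clusters, or inherent vowels, and it is not the project's actual Thai class. Thai characters are written as Unicode escapes.

```java
import java.util.HashMap;
import java.util.Map;

final class VowelStructures {
    // Hypothetical consonant table (initial-consonant values only).
    private static final Map<Character, String> CONSONANTS = new HashMap<>();
    static {
        CONSONANTS.put('\u0E01', "k"); // ko kai
        CONSONANTS.put('\u0E21', "m"); // mo ma
    }

    // Each row: {text before the consonant, text after the consonant, vowel}.
    // Longer patterns come first so that they win over shorter ones.
    private static final String[][] PATTERNS = {
        {"\u0E41", "\u0E47\u0E27", "aeo"}, // sara ae + C + maitaikhu + wo waen
        {"\u0E42", "",             "o"},   // sara o + C
        {"",       "\u0E32",       "aa"},  // C + sara aa
    };

    static String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            boolean matched = false;
            for (String[] p : PATTERNS) {
                int slot = i + p[0].length(); // index where the consonant must sit
                if (text.startsWith(p[0], i) && slot < text.length()
                        && CONSONANTS.containsKey(text.charAt(slot))
                        && text.startsWith(p[1], slot + 1)) {
                    out.append(CONSONANTS.get(text.charAt(slot))).append(p[2]);
                    i = slot + 1 + p[1].length();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No vowel structure here: emit a bare consonant or pass through.
                char ch = text.charAt(i);
                String roman = CONSONANTS.get(ch);
                out.append(roman != null ? roman : String.valueOf(ch));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u0E01\u0E32"));             // kaa
        System.out.println(transliterate("\u0E42\u0E01"));             // ko
        System.out.println(transliterate("\u0E41\u0E01\u0E47\u0E27")); // kaeo
    }
}
```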
Chinese/Japanese/Korean (CJK) Ideographs
- General characteristics:⁴
  - Any one character can have multiple possible readings.
  - Each character has a tone number assigned to it. Tones indicate how one's voice should rise or fall when pronouncing the syllable.
- Languages currently in this category: Korean Hanja and several Chinese dialects
- Other languages that might fall into this category: Any other Chinese dialects for which tables exist, Vietnamese Chữ Nôm, and Japanese kanji (to an extent)
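
A minimal sketch of how multiple readings and tone numbers might be handled, assuming a hypothetical class `ToneReadings`, a two-character Mandarin table, and “/” as the separator between alternate readings (none of these come from the project itself). When tones are switched off, readings that differ only by tone collapse into one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ToneReadings {
    // Hypothetical table: character -> readings with trailing tone numbers.
    private static final Map<Character, List<String>> TABLE = new HashMap<>();
    static {
        TABLE.put('\u4E2D', Arrays.asList("zhong1", "zhong4")); // "middle"
        TABLE.put('\u6587', Arrays.asList("wen2"));             // "writing"
    }

    static String transliterate(String text, boolean withTones) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            List<String> readings = TABLE.get(c);
            if (readings == null) {
                out.append(c); // not in the table: pass through unchanged
                continue;
            }
            // A LinkedHashSet keeps table order and drops duplicates that
            // appear once the tone digits are stripped.
            Set<String> shown = new LinkedHashSet<>();
            for (String r : readings) {
                shown.add(withTones ? r : r.replaceAll("[0-9]", ""));
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.join("/", shown));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u4E2D\u6587", true));  // zhong1/zhong4 wen2
        System.out.println(transliterate("\u4E2D\u6587", false)); // zhong wen
    }
}
```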
Korean
- General characteristics:
  - Each Korean “character” in Unicode consists of a consonant, a vowel, and from zero to two more consonants.⁶
  - The pronunciation of a consonant can change depending on whether it is an initial consonant or an end consonant.⁶
  - The pronunciation of a consonant can change depending on the previous end consonant, the next initial consonant, and the current vowel.⁹
- Languages currently in this category: Korean Hangeul
- Other languages that might fall into this category: None
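
Because precomposed Hangul syllables are laid out arithmetically in Unicode (19 initial consonants, 21 vowels, and 28 final slots including “none”, starting at U+AC00), a syllable can be split into its parts without a per-syllable table. The sketch below shows that decomposition, assuming a hypothetical class `HangulSketch` and letter values that roughly follow the Revised Romanization, with double final consonants reduced to a single sound. It treats every syllable in isolation, so none of the context-dependent changes listed above are applied, and it is not the project's actual Korean class.

```java
final class HangulSketch {
    // Romanizations indexed by position in the Unicode Hangul layout.
    private static final String[] INITIAL = {
        "g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
        "ss", "", "j", "jj", "ch", "k", "t", "p", "h"
    };
    private static final String[] VOWEL = {
        "a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
        "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"
    };
    // Index 0 means "no final consonant". Clusters are simplified (assumption).
    private static final String[] FINAL = {
        "", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
        "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"
    };

    static String romanize(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < '\uAC00' || c > '\uD7A3') {
                out.append(c); // not a precomposed Hangul syllable
                continue;
            }
            int index = c - 0xAC00;
            int initial = index / (21 * 28);
            int vowel = (index % (21 * 28)) / 28;
            int fin = index % 28;
            out.append(INITIAL[initial]).append(VOWEL[vowel]).append(FINAL[fin]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(romanize("\uD55C\uAE00")); // hangeul
        System.out.println(romanize("\uAC01"));       // gak
    }
}
```

The second sample shows the initial/final distinction: the same consonant comes out as “g” at the start of the syllable and “k” at the end.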
Japanese Kana
- General characteristics:¹⁴
  - Mostly a phonetic syllabary, with characteristics very suitable to Simple Replacement
  - A っ or a ッ before a consonant causes that consonant to be doubled in the transliterated output.
  - A ー causes the previous vowel to be lengthened.
- Character sets currently in this category: Hiragana, Katakana and Half-width Katakana
- Other languages that might fall into this category: None
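
The two special cases above fit naturally on top of a simple replacement loop. The sketch below is a minimal illustration, assuming a hypothetical class `KanaSketch`, a four-entry katakana table, and the convention that a lengthened vowel is written by repeating it (“oo”, “ii”); the project may well have used a different convention. Kana are written as Unicode escapes; the samples spell “カップ” and “コーヒー”.

```java
import java.util.HashMap;
import java.util.Map;

final class KanaSketch {
    private static final char SMALL_TSU_HIRAGANA = '\u3063';
    private static final char SMALL_TSU_KATAKANA = '\u30C3';
    private static final char LONG_VOWEL_MARK = '\u30FC';

    // Hypothetical table with just enough entries for the samples.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u30AB', "ka");
        TABLE.put('\u30B3', "ko");
        TABLE.put('\u30D2', "hi");
        TABLE.put('\u30D7', "pu");
    }

    private static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    static String transliterate(String kana) {
        StringBuilder out = new StringBuilder();
        boolean doubleNext = false; // set after a small tsu
        for (int i = 0; i < kana.length(); i++) {
            char c = kana.charAt(i);
            if (c == SMALL_TSU_HIRAGANA || c == SMALL_TSU_KATAKANA) {
                doubleNext = true;
                continue;
            }
            if (c == LONG_VOWEL_MARK) {
                // Lengthen by repeating the vowel that was just written.
                int n = out.length();
                if (n > 0 && isVowel(out.charAt(n - 1))) {
                    out.append(out.charAt(n - 1));
                }
                continue;
            }
            String roman = TABLE.get(c);
            if (roman == null) {
                out.append(c);
                doubleNext = false;
                continue;
            }
            if (doubleNext && !isVowel(roman.charAt(0))) {
                out.append(roman.charAt(0)); // double the following consonant
            }
            doubleNext = false;
            out.append(roman);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u30AB\u30C3\u30D7"));       // kappu
        System.out.println(transliterate("\u30B3\u30FC\u30D2\u30FC")); // koohii
    }
}
```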
Context Based
- General characteristics: Very much like CJK Ideographs, except that the problem of multiple readings per character is resolved by paying attention to the context of the character, and using a table with an extensive list of possible contexts of the characters
- Languages currently in this category: Japanese
- Other languages that might fall into this category: Chinese, Vietnamese Chữ Nôm and Korean Hanja
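
One common way to use a table of contexts is greedy longest-match lookup: at each position, try the longest entry first and fall back to shorter ones. The sketch below illustrates that idea under stated assumptions; the class `ContextLookup`, the five table entries, and the “/” separator for unresolved readings are all hypothetical, and the source does not say that the project's lookup worked exactly this way.

```java
import java.util.HashMap;
import java.util.Map;

final class ContextLookup {
    // Hypothetical context table: longer keys carry a reading fixed by context.
    private static final Map<String, String> TABLE = new HashMap<>();
    private static final int MAX_KEY;
    static {
        TABLE.put("\u65E5\u672C", "nihon");  // two-character word
        TABLE.put("\u4ECA\u65E5", "kyou");   // two-character word
        TABLE.put("\u65E5", "hi/nichi");     // alone: several readings remain
        TABLE.put("\u672C", "hon");
        TABLE.put("\u4ECA", "ima");
        int max = 0;
        for (String key : TABLE.keySet()) {
            max = Math.max(max, key.length());
        }
        MAX_KEY = max;
    }

    static String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int consumed = 0;
            // Try the longest possible context first, then shorter ones.
            for (int len = Math.min(MAX_KEY, text.length() - i); len >= 1; len--) {
                String reading = TABLE.get(text.substring(i, i + len));
                if (reading != null) {
                    out.append(reading);
                    consumed = len;
                    break;
                }
            }
            if (consumed == 0) {
                out.append(text.charAt(i)); // unknown character: pass through
                consumed = 1;
            }
            i += consumed;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u65E5\u672C")); // nihon
        System.out.println(transliterate("\u65E5"));       // hi/nichi
    }
}
```

The same character yields one reading inside a known word and several when it stands alone, which is the behavior the category description is after.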

Principles of Transliteration:

These are some of the rules that I set for myself to make certain that the final product would be useful to others:

Use unusual đįăĉŕĩŧĭċǻŀ marks sparingly (occasional apostrophes, grave accents and diereses are OK)
If no standard exists, mix transliterations from existing systems, with emphasis on accuracy and readability.
Make it so that someone like me, who has virtually no knowledge of these languages, can understand vaguely how to pronounce the final output.

Research Phase:

Since I had no prior knowledge of the languages being transliterated, I needed to research the reading rules for them. (See the Sources section at the end). In most cases, it was simply a matter of downloading documentation, usually in Adobe Portable Document Format, listing which characters corresponded to which consonants and vowels, and how different combinations of them yielded different pronunciations. This information was intended for human transliterators, so I needed to read through it and represent it in a data table such that a transliteration program could use it.

However, languages such as Japanese and Chinese have thousands of characters, many of which have multiple readings depending on their context. For these, I used free tables, also available online (see the Table Sources section at the end). Many of these tables were intended for an Input Method Editor (IME), a program that lets users with an English keyboard type English letters and convert them to a foreign script -- essentially the opposite of the transliterator. For this reason, I had to reorganize or re-sort much of the data in some of these tables. Additionally, every table used by the transliterator must have a simple header, so at the very minimum, each borrowed table was modified by having a header added.
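
Reorganizing an IME table for transliteration amounts to inverting it: the IME maps a typed reading to candidate characters, while the transliterator needs to go from a character to its readings. The sketch below shows one way to do that, assuming a hypothetical line format of a reading followed by whitespace and its candidate characters, and a hypothetical class `ImeTableInverter`. The real tables each have their own formats, and the header the transliterator expects is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ImeTableInverter {
    // Turns "reading candidates" lines into a character -> readings map.
    // Assumes every candidate is a single UTF-16 code unit.
    static Map<Character, List<String>> invert(List<String> imeLines) {
        Map<Character, List<String>> byChar = new TreeMap<>();
        for (String line : imeLines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // blank or malformed line
            }
            String reading = parts[0];
            String candidates = parts[1].replaceAll("\\s+", "");
            for (char c : candidates.toCharArray()) {
                List<String> readings = byChar.computeIfAbsent(c, k -> new ArrayList<>());
                if (!readings.contains(reading)) {
                    readings.add(reading);
                }
            }
        }
        return byChar;
    }

    public static void main(String[] args) {
        List<String> ime = Arrays.asList(
            "hang \u884C",         // one candidate
            "xing \u884C\u661F");  // two candidates
        Map<Character, List<String>> table = invert(ime);
        System.out.println(table.get('\u884C')); // [hang, xing]
        System.out.println(table.get('\u661F')); // [xing]
    }
}
```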

Testing Process:

Since I cannot speak any of the languages that I transliterated, I cannot guarantee the accuracy of my transliterations. To tell if they were somewhat accurate, I did the following:

Researched rules for transliteration, and special cases that applied to them
Checked texts on the web that were transliterated by a human, and compared their transliterations to mine
Consulted native speakers whenever possible
Listened to songs in the foreign languages and compared how they sounded to what my transliterated output of the lyrics was.
(Although Roman letters may not seem to fit the sounds in some languages, this method was very effective in identifying glaring errors.)

Evaluation of Results (broken down by language):

Russian
- Pro:
  - Fairly straightforward; letters are transliterated in the order they are written, and it seemed that one character always corresponded to the same character or sequence of characters in English
- Con:
  - Certain consonants can stand alone as a word; for example, “v” is a word by itself. It sounded to me like “voh”, but I have seen professional transliterations transcribe it as “v”¹³, so my transliterator does the same.
Greek
- Pro:
  - Fairly straightforward, like Russian
- Cons:
  - Two “i” sounds: one that is like a short “i” and one that is like a long “i”.² I could not find a simple way of making this clear without slightly garbling the output.
  - Certain pairs of consonants are used for foreign sounds in borrowed words. For example, the consonant pair “mp” is sometimes used to represent “b”.⁵ Thus, the Greek word for “basketball” is “mpasket”.¹⁵ Unfortunately, I could not find an efficient way to make the transliterator handle cases like this. Simply replacing every instance of “mp” with “b” would make words that were supposed to have an “mp” sound be spelled incorrectly.
Chinese (Simplified Mandarin, Traditional Mandarin, Cantonese, Teochew, Hakka, and Taiwanese)
- Pros:
  - Output was fairly readable in most Traditional Chinese texts for most dialects, whenever the transliterator was told not to consider tones in output.
- Cons:
  - In simplified Chinese and Taiwanese, many characters had several alternate readings, making it difficult to tell how something was pronounced.
  - If tones are enabled, there are too many readings per character for long text inputs to be readable. A context-based table and/or a frequency chart might fix this, but I could not find a free one.
Korean (Hanja)
- Pros:
  - Output was very readable; there was only one reading per character
- Cons:
  - The table might have been incomplete
  - The transliterator considered each character in isolation; pronunciations might shift a bit depending on surrounding consonants.
Korean (Hangeul)
- Pros:
  - It accounts for several special cases in which the pronunciations of certain consonants change.
  - It transliterates “시, 샤, etc.” as “shi, sha, etc.” instead of “si, sya, etc.” to be more consistent with pronunciation, even though this diverges a bit from the new standard transliteration. (This might be a con, if you are trying to stick to the standard.)
  - Hyphens are inserted in certain cases where there is ambiguity. (For instance, “한글” is transliterated as “han-geul” so that it is not misread as “hang eul”)
- Cons:
  - It does not account for all possible special cases that occur (although it is easy to tweak the code if there are many more significant special cases to add).
  - In certain cases, there are two consonants in the bottom row of one Hangeul unit; the consonant on the bottom right is pronounced if the top consonant on the next unit is silent.⁶ In some special cases, this top consonant is not silent, so my transliterator pronounces it and ignores the bottom-right one. In reality, there are rules to follow to decide which consonant gets pronounced⁶, but that would complicate things too much for my purposes.
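
The hyphen rule mentioned in the pros can be sketched as a check at each syllable boundary. The example below is only an illustration: the class `HyphenJoiner` is hypothetical, and the two conditions it tests (“n” followed by “g”, and “ng” followed by a vowel or glide) are assumptions about which boundaries are ambiguous. Only the first is confirmed by the “han-geul” example.

```java
import java.util.Arrays;
import java.util.List;

final class HyphenJoiner {
    // Joins already-romanized syllables, hyphenating ambiguous boundaries.
    static String join(List<String> syllables) {
        StringBuilder out = new StringBuilder();
        String prev = "";
        for (String next : syllables) {
            if (needsHyphen(prev, next)) {
                out.append('-');
            }
            out.append(next);
            prev = next;
        }
        return out.toString();
    }

    private static boolean needsHyphen(String prev, String next) {
        if (prev.isEmpty() || next.isEmpty()) {
            return false;
        }
        // "han" + "geul" could otherwise be read as "hang" + "eul".
        boolean nThenG = prev.endsWith("n") && next.startsWith("g");
        // "gang" + "a" could otherwise be read as "gan" + "ga" (assumed rule).
        boolean ngThenVowel = prev.endsWith("ng")
                && "aeiouyw".indexOf(next.charAt(0)) >= 0;
        return nThenG || ngThenVowel;
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("han", "geul"))); // han-geul
        System.out.println(join(Arrays.asList("ga", "na")));    // gana
    }
}
```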
Japanese (Kana)
- Pro:
  - It accounts for changes due to lengthened vowels, and doubled consonants.
- Con:
  - Certain kana can have multiple pronunciations:
    - “は” is “ha” unless it is used as a grammatical particle, in which case it is “wa”.
    - “へ” is “he” unless it is a grammatical particle, in which case it is “e”.
    - “を” is normally “o”, but in songs, it can be pronounced “wo”.
Japanese (Kanji)
(When this transliterator is enabled, it takes over for the kana transliterator, since the readings for kanji sometimes depend on the surrounding kana)
- Pro:
  - Apostrophes or spaces are inserted where necessary to distinguish between “n” as a nasal sound and “n” as an initial consonant
- Cons:
  - See cons for kana
  - Unlike Chinese characters, Japanese kanji can have many (in some cases as many as 10) completely different alternate readings for one character, depending on its context. Using a huge context-based table, the transliterator was able to narrow down the possibilities in many cases, but in others, there were still several readings, making the output hard to read. It seems a more sophisticated table would be needed, complete with a frequency table suggesting when certain readings would be most appropriate.
  - The frequency table takes a few seconds to load, and can eat up ~20MB of RAM.
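
The apostrophe rule in the pros above can be sketched as a check after the nasal “n” (ん): if the next syllable begins with a vowel or “y”, an apostrophe keeps the two apart. The class `NasalNSketch` and its small hiragana table are assumptions for the example; the samples spell “しんいち” and “しにち”.

```java
import java.util.HashMap;
import java.util.Map;

final class NasalNSketch {
    private static final char NASAL_N = '\u3093';

    // Hypothetical table with just enough entries for the samples.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u3057', "shi");
        TABLE.put(NASAL_N, "n");
        TABLE.put('\u3044', "i");
        TABLE.put('\u3061', "chi");
        TABLE.put('\u306B', "ni");
    }

    static String transliterate(String kana) {
        StringBuilder out = new StringBuilder();
        boolean afterNasal = false;
        for (int i = 0; i < kana.length(); i++) {
            char c = kana.charAt(i);
            String roman = TABLE.get(c);
            if (roman == null) {
                out.append(c);
                afterNasal = false;
                continue;
            }
            // Without the apostrophe, "n" + "i" would read as the syllable "ni".
            if (afterNasal && "aeiouy".indexOf(roman.charAt(0)) >= 0) {
                out.append('\'');
            }
            out.append(roman);
            afterNasal = (c == NASAL_N);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterate("\u3057\u3093\u3044\u3061")); // shin'ichi
        System.out.println(transliterate("\u3057\u306B\u3061"));       // shinichi
    }
}
```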
Thai
- Pros:
  - The program considers pronunciation shifts depending on whether consonants are initial or final.
  - Vowel clusters are identified; this is critical to deciding what vowel sound goes with a syllable.
- Cons:
  - The transliterator was not designed to identify tones for each syllable. All tone indicating characters get ignored.
  - Due to my lack of knowledge of the Thai alphabet, I am unable to predict whether a consonant in the middle of a sentence is an initial consonant or an end consonant; this leads to occasional erroneous transliterations.
  - There are normally no spaces between Thai words; spaces come only between sentences.¹¹ Without a dictionary table, the transliterator is unable to tell where one word ends and another begins. This makes transliterated output hard to read.
  - When no vowel is specified to go in between an initial consonant and an end consonant, a vowel, either “a” or “o” is assumed by the reader.¹² I, however, have no way of telling whether it should be an “a” or an “o”, so the transliterator always makes it an “o”.
  - Certain special Thai characters, such as nikkhahit ( ํ), which indicates an “n” or “ng”, and maiyamok (ๆ), which indicates the repetition of one or more symbols¹², are not implemented.
Hindi (Devanagari)
- Pro:
  - Its written system seems similar to that of Thai⁸, but a little simpler, making it conform to the Thai class well.
- Cons:
  - Half consonants are not implemented.
  - Some reading rules in the Thai class, such as one which removes the “h” consonant in special cases, might interfere with the Hindi transliteration process.
Gujarati (for AkrutiGopi)
- Pro:
  - Its written system seems similar to that of Thai⁷, but a little simpler, making it conform to the Thai class well.
- Cons:
  - See cons for Hindi above
  - This one is font specific, so it does not conform to Unicode; however, it is much easier to find Gujarati documents in AkrutiGopi.
  - Either the table I generated is flawed, or I am misunderstanding the reading rules for Gujarati; many transliterated Strings come up containing the substring “null”, indicating that something went wrong when reading the original Gujarati text.
Cherokee and Yi
- Pro:
  - Characters are always mapped to Roman characters in the same exact way, making transliteration trivial.
- Con:
  - As far as I can tell, there are no sites (excluding Unicode test pages) written in these languages

Languages That Could Not Be Transliterated Well:

Arabic and Hebrew

These two languages have vowels, but seldom write them. The reader must be able to recognize the written words, going on consonants alone. As a result, transliterated output was not very readable. For example, when transliterating Microsoft's Egyptian homepage, the String “m'ykrwswft” showed up in the output; this was most likely an Arabic rendering of the company's name, but with words like this throughout a text, the transliterator cannot serve its purpose well. Perhaps a Romanized Arabic dictionary with words in the original script and the transliterated form would be necessary to process such texts. For words that are not in the dictionary, maybe a frequency table could be used to figure out the most probable vowel sound.

Text Sources:

Unfortunately, many of these links point to places on Søren Binks' "Transliteration of Non-Roman Scripts" site, which no longer exists. Here is an archived copy of his homepage, just to give you an idea of what his site once had.

1. “Arabic: Transliteration, rev. 2, 2001-09-01” http://homepage.mac.com/sirbinks/Arabic.r2.pdf
2. Binks, Søren. “Greek” http://homepage.mac.com/sirbinks/Greek.pdf
3. “Cherokee: Transliteration, rev. 2, 2001-03-15” http://homepage.mac.com/sirbinks/Cherokee.r2.pdf
4. “CJKV Dictionaries and Printed Resources” http://www.sungwh.freeserve.co.uk/flux/resources.htm
5. “Classics Log 9805c - Message Number 131” http://omega.cohums.ohio-state.edu:8080/hyper-lists/classics-l/listserve_archives/log98/9805c/9805c.131.html
6. Eisenberg, David J. “Introduction to Korean” http://langintro.com/kintro/
7. “Gujarati: Transliteration, rev. 3, 2001-03-10” http://homepage.mac.com/sirbinks/Gujarati.r3.pdf
8. “Hindi: Transliteration, rev. 4, 2001-03-10” http://homepage.mac.com/sirbinks/Hindi.r4.pdf
9. “New System of Romanization for the Korean Language” http://www.homestaykorea.com/2002_01/intro/romanization.htm
10. “Russian: Transliteration, rev. 1, 2001-02-25” http://homepage.mac.com/sirbinks/Russian.r1.pdf
11. “Thai Language - Learn to Read Thai” http://www.learningthai.com/thai_alphabet.html
12. “Thai: Transliteration, rev. 2, 2001-03-10” http://homepage.mac.com/sirbinks/Thai.r2.pdf
13. “"Vsya Rossiya" v Sovete Federacii. 1996-...” http://www.cityline.ru:8084/politika/fs/sf2fvr.html
14. http://www.prettyodango.net/senshi/info/faq/japanese/
15. http://www.smartphrase.com/Greek/gr_going_out_voc.shtml

Table Sources:

Cantonese (in Traditional Chinese) http://www.hku.hk/linguist/lshk/Jyutping/download/phrase.box
Hakka (in Traditional Chinese) http://www.asiawind.com/hakka/hagfa.box
Japanese (Kanji) http://cvs.namazu.org/*checkout*/migemo/migemo-dict?rev=1.5&hideattic=0
Korean (Hanja) http://dbsun3.kaist.ac.kr/~kyuho/info/hantable/hantable.txt
Mandarin (in Simplified Chinese) http://www.dbis.ns.ca/~stirling/pyorder.html
Mandarin (in Traditional Chinese) http://www.ibiblio.org/ccic/ftp-pub/software/data/py2big.tab
Taiwanese (in Traditional Chinese) http://daiwanway.dynip.com/tw/twug
