MACHINE TRANSLITERATION IN JAVA
by Richard Calvi

Abstract:

This project is a Java application that takes text written in non-Roman scripts and produces Romanized output. The design process involved finding transliteration tables, researching the reading rules and special cases of each script, and then finding ways to process the text efficiently in both memory and processing time. Transliteration techniques range from simple remapping routines and context-based searches to more involved analysis in the case of Thai. Supported languages include Greek, Russian, Japanese, Korean, Thai, Hindi, Gujarati, and several Chinese dialects. In one case a font-specific table was made, but in general the tables were saved in UTF-8 encoding and conform to Unicode character mappings.

Javadoc Documentation is available here.

Screenshots are now available!

CONTENTS:
What is Transliteration?
Why Make a Transliterator?
Categories of Languages
Principles of Transliteration
Research Phase
Testing Process
Evaluation of Results
     Russian
     Greek
     Chinese
     Korean
          Hanja
          Hangeul
     Japanese
          Kana
          Kanji
     Thai
     Hindi (Devanagari)
     Gujarati
     Cherokee
     Yi
Languages That Could Not Be Transliterated Well
Text Sources
Table Sources

What is Transliteration?

Transliteration is the phonetic representation of one language using the alphabet of another. For example, “Vladimir Putin” is the English transliteration of the Russian name “Владимир Путин.” When the target is the Roman (Latin) alphabet, as used by English, the process is called Romanization.

Although my program currently does only Romanization, it could easily transliterate into other alphabets, if given the proper data tables.

Why Make a Transliterator?

This project began as an attempt to look for borrowed English words in foreign texts. While it is easy to spot English words in languages that use a Latin-based alphabet, it is harder in languages such as Russian, Thai, or Chinese unless you are familiar with their writing systems. The Russian alphabet may not be very hard to learn, but other scripts are more of a challenge: Thai has more letters and many more reading rules, and written Chinese can take years to master. Furthermore, even if every writing system were as easy to learn as Russian, there are simply too many in existence for any one person to master them all. If the process were automated, anyone with access to a transliterator could in theory “read” documents in foreign writing systems without much hesitation. If the transliterated output were clear and readable, English words would be fairly easy to spot, and readers could get a rough idea of what a given foreign text “sounds” like.

Transliterators can be especially useful to English speakers who speak a foreign language fluently but are not literate in it. Students beginning a foreign language course might also use one when trying to read foreign texts. With a phonetic guide, people can see how words in languages of a common family are related, even if the languages use completely different scripts. In addition, fans of foreign music could pronounce the words of their favorite songs without learning a foreign script.

Categories of Languages:

One underlying goal of this project was universality: the transliterator should support as many languages as possible using as few specialized Java classes as possible. To achieve this, each transliterator class must be specific enough to handle most of the reading rules of a given language, yet general enough to serve more than one language. This way, the transliterator can be extended to many more languages by supplying more data tables rather than more Java code. To keep things general, I split the character sets that I set out to transliterate into basic categories, loosely based on their characteristics. In some cases most or all of a category's “general characteristics” did not apply to a language, but I was still able to format a transliteration data table that suited it.

  1. Simple Replacement

  2. Simple Replacement with Capitalization (an extended class of the previous category)

  3. Simple Replacement with Spaces (another extended class of the first category)

  4. Thai-like

  5. Chinese/Japanese/Korean (CJK) Ideographs

  6. Korean

  7. Japanese Kana

  8. Context Based
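
As a rough illustration of how the first two categories might relate, the sketch below shows a table-driven replacement class and a capitalization-aware subclass that reuses the same lower-case table. It is not the project's actual code: the class names, the lookup design, and the small Cyrillic table fragment are all assumptions made for this example, and the real tables are loaded from data files rather than built in code. The input string is the name “Владимир Путин” written with Unicode escapes.

```java
import java.util.HashMap;
import java.util.Map;

/** Category 1 (sketch): replace each character via a table; pass unknown characters through. */
class SimpleReplacement {
    protected final Map<Character, String> table;

    SimpleReplacement(Map<Character, String> table) {
        this.table = table;
    }

    /** Returns the Roman form of one character, or null if the table has no entry. */
    protected String lookup(char c) {
        return table.get(c);
    }

    String transliterate(String input) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String roman = lookup(c);
            out.append(roman != null ? roman : String.valueOf(c));
        }
        return out.toString();
    }
}

/** Category 2 (sketch): same table, but upper-case input reuses the lower-case entry. */
class CapitalizingReplacement extends SimpleReplacement {
    CapitalizingReplacement(Map<Character, String> table) {
        super(table);
    }

    @Override
    protected String lookup(char c) {
        String roman = table.get(c);
        if (roman == null && Character.isUpperCase(c)) {
            roman = table.get(Character.toLowerCase(c));
            if (roman != null && !roman.isEmpty()) {
                // Capitalize only the first Roman letter of the replacement.
                roman = Character.toUpperCase(roman.charAt(0)) + roman.substring(1);
            }
        }
        return roman;
    }
}

class TransliterationDemo {
    /** Hypothetical table fragment: just the lower-case Cyrillic letters this demo needs. */
    static Map<Character, String> sampleCyrillicTable() {
        Map<Character, String> t = new HashMap<>();
        t.put('\u0432', "v"); // ve
        t.put('\u043B', "l"); // el
        t.put('\u0430', "a"); // a
        t.put('\u0434', "d"); // de
        t.put('\u0438', "i"); // i
        t.put('\u043C', "m"); // em
        t.put('\u0440', "r"); // er
        t.put('\u043F', "p"); // pe
        t.put('\u0443', "u"); // u
        t.put('\u0442', "t"); // te
        t.put('\u043D', "n"); // en
        return t;
    }

    public static void main(String[] args) {
        SimpleReplacement t = new CapitalizingReplacement(sampleCyrillicTable());
        String name = "\u0412\u043B\u0430\u0434\u0438\u043C\u0438\u0440 \u041F\u0443\u0442\u0438\u043D";
        System.out.println(t.transliterate(name)); // prints "Vladimir Putin"
    }
}
```

With this arrangement, supporting another simple alphabet would mean supplying a different table to the same classes, which is the property the categories are meant to preserve.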

Principles of Transliteration:

These are some of the rules that I set for myself to make certain that the final product would be useful to others:

Research Phase:

Since I had no prior knowledge of the languages being transliterated, I needed to research their reading rules (see the Text Sources section at the end). In most cases this was simply a matter of downloading documentation, usually in Adobe Portable Document Format (PDF), that listed which characters corresponded to which consonants and vowels and how different combinations of them yielded different pronunciations. This information was intended for human transliterators, so I had to read through it and represent it in a data table that a transliteration program could use.

However, languages such as Japanese and Chinese have thousands of characters, many of which have multiple readings depending on context. For these, I used free tables that are also available online (see the Table Sources section at the end). Many of these were written for an Input Method Editor (IME), a program that lets users with an English keyboard type English letters and convert them to a foreign script, which is essentially the opposite of what the transliterator does. For this reason, I had to reorganize or re-sort much of the data in some of these tables. Additionally, every table used by the transliterator must have a simple header, so at the very minimum each borrowed table was modified by having a header added.
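
To make the reorganization step concrete, here is a minimal sketch of inverting an IME-style table (Roman key to candidate characters) into a transliteration-style table (character to Roman readings), sorted by character. The tab-separated line format, the class name `ImeTableInverter`, and the three sample entries are assumptions for illustration only; the real IME tables differ in layout, and the header the transliterator requires is not shown because its format is not described here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ImeTableInverter {
    /**
     * Turns lines of the assumed form "romanKey<TAB>candidate candidate ..."
     * into a map from each script character to its Roman readings.
     * A TreeMap keeps the result sorted by character.
     */
    static Map<String, List<String>> invert(List<String> imeLines) {
        Map<String, List<String>> readings = new TreeMap<>();
        for (String line : imeLines) {
            String[] parts = line.split("\t", 2);
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            String roman = parts[0].trim();
            for (String candidate : parts[1].trim().split("\\s+")) {
                if (candidate.isEmpty()) {
                    continue;
                }
                List<String> list = readings.computeIfAbsent(candidate, k -> new ArrayList<>());
                if (!list.contains(roman)) {
                    list.add(roman); // a character may collect several readings
                }
            }
        }
        return readings;
    }

    public static void main(String[] args) {
        // Hypothetical IME entries: "hang" and "xing" both offer U+884C,
        // and "ma" offers U+99AC and U+5ABD.
        List<String> ime = Arrays.asList(
                "hang\t\u884C",
                "ma\t\u99AC \u5ABD",
                "xing\t\u884C");
        Map<String, List<String>> table = invert(ime);
        System.out.println(table.get("\u884C")); // prints "[hang, xing]"
    }
}
```

A character that ends up with more than one reading, like the first one above, is exactly the case where context has to decide which reading to emit.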

Testing Process:

Since I cannot speak any of the languages that I transliterated, I cannot guarantee the accuracy of my transliterations. To tell if they were somewhat accurate, I did the following:

Evaluation of Results (broken down by language):

  1. Russian
  2. Greek
  3. Chinese (Simplified Mandarin, Traditional Mandarin, Cantonese, Teochew, Hakka, and Taiwanese)
  4. Korean (Hanja)
  5. Korean (Hangeul)
  6. Japanese (Kana)
  7. Japanese (Kanji)
    (When this transliterator is enabled, it takes over for the kana transliterator, since the readings for kanji sometimes depend on the surrounding kana)
  8. Thai
  9. Hindi (Devanagari)
  10. Gujarati (for AkrutiGopi)
  11. Cherokee and Yi

Languages That Could Not Be Transliterated Well:

Arabic and Hebrew

The scripts of these two languages have vowel marks but seldom use them, so the reader must recognize written words from their consonants alone. As a result, transliterated output was not very readable. For example, when transliterating Microsoft's Egyptian homepage, the string “m'ykrwswft” showed up in the output. This was most likely an Arabic rendering of the company's name, but with words like this throughout a text, the transliterator cannot serve its purpose well. Processing such texts would probably require a Romanized Arabic dictionary that pairs words in the original script with their transliterated forms. For words that are not in the dictionary, a frequency table might be used to choose the most probable vowel sound.
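
The dictionary idea can be sketched as a lookup that tries whole words first and falls back to letter-by-letter replacement. Everything in this sketch is hypothetical: the class name, the nine-letter table (chosen so that it reproduces the “m'ykrwswft” output quoted above), the assumed Arabic spelling of the company name, and the dictionary entry “maikrosoft”. The frequency-table fallback for unknown words is not attempted.

```java
import java.util.HashMap;
import java.util.Map;

class DictionaryFirstTransliterator {
    /** Assumed Arabic spelling of the company name, written with Unicode escapes. */
    static final String SAMPLE_WORD =
            "\u0645\u0627\u064A\u0643\u0631\u0648\u0633\u0648\u0641\u062A";

    private final Map<String, String> dictionary; // whole word -> readable Roman form
    private final Map<Character, String> letters; // single letter -> Roman letter(s)

    DictionaryFirstTransliterator(Map<String, String> dictionary,
                                  Map<Character, String> letters) {
        this.dictionary = dictionary;
        this.letters = letters;
    }

    String transliterate(String text) {
        StringBuilder out = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                word.append(c);   // keep collecting the current word
            } else {
                flush(word, out); // word boundary: emit the word, then the separator
                out.append(c);
            }
        }
        flush(word, out);
        return out.toString();
    }

    /** Emits the pending word: dictionary form if known, else letter by letter. */
    private void flush(StringBuilder word, StringBuilder out) {
        if (word.length() == 0) {
            return;
        }
        String w = word.toString();
        word.setLength(0);
        String known = dictionary.get(w);
        if (known != null) {
            out.append(known);
            return;
        }
        for (int i = 0; i < w.length(); i++) {
            String roman = letters.get(w.charAt(i));
            out.append(roman != null ? roman : String.valueOf(w.charAt(i)));
        }
    }

    /** Hypothetical consonant-only table covering just the letters of SAMPLE_WORD. */
    static Map<Character, String> sampleLetters() {
        Map<Character, String> t = new HashMap<>();
        t.put('\u0645', "m"); // meem
        t.put('\u0627', "'"); // alef
        t.put('\u064A', "y"); // yeh
        t.put('\u0643', "k"); // kaf
        t.put('\u0631', "r"); // reh
        t.put('\u0648', "w"); // waw
        t.put('\u0633', "s"); // seen
        t.put('\u0641', "f"); // feh
        t.put('\u062A', "t"); // teh
        return t;
    }

    public static void main(String[] args) {
        Map<String, String> dict = new HashMap<>();
        DictionaryFirstTransliterator t =
                new DictionaryFirstTransliterator(dict, sampleLetters());
        System.out.println(t.transliterate(SAMPLE_WORD)); // prints "m'ykrwswft"
        dict.put(SAMPLE_WORD, "maikrosoft");              // hypothetical dictionary entry
        System.out.println(t.transliterate(SAMPLE_WORD)); // prints "maikrosoft"
    }
}
```

The trade-off is that output quality then depends on dictionary coverage, so words missing from the dictionary would still come out as bare consonants.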

Text Sources:

Unfortunately, many of these links point to places on Søren Binks' "Transliteration of Non-Roman Scripts" site, which no longer exists. Here is an archived copy of his homepage, just to give you an idea of what his site once had.

  1. “Arabic: Transliteration, rev. 2, 2001-09-01” http://homepage.mac.com/sirbinks/Arabic.r2.pdf
  2. Binks, Søren. “Greek” http://homepage.mac.com/sirbinks/Greek.pdf
  3. “Cherokee: Transliteration, rev. 2, 2001-03-15” http://homepage.mac.com/sirbinks/Cherokee.r2.pdf
  4. “CJKV Dictionaries and Printed Resources” http://www.sungwh.freeserve.co.uk/flux/resources.htm
  5. “Classics Log 9805c - Message Number 131” http://omega.cohums.ohio-state.edu:8080/hyper-lists/classics-l/listserve_archives/log98/9805c/9805c.131.html
  6. Eisenberg, David J. “Introduction to Korean” http://langintro.com/kintro/
  7. “Gujarati: Transliteration, rev. 3, 2001-03-10” http://homepage.mac.com/sirbinks/Gujarati.r3.pdf
  8. “Hindi: Transliteration, rev. 4, 2001-03-10” http://homepage.mac.com/sirbinks/Hindi.r4.pdf
  9. “New System of Romanization for the Korean Language” http://www.homestaykorea.com/2002_01/intro/romanization.htm
  10. “Russian: Transliteration, rev. 1, 2001-02-25” http://homepage.mac.com/sirbinks/Russian.r1.pdf
  11. “Thai Language - Learn to Read Thai” http://www.learningthai.com/thai_alphabet.html
  12. “Thai: Transliteration, rev. 2, 2001-03-10” http://homepage.mac.com/sirbinks/Thai.r2.pdf
  13. “"Vsya Rossiya" v Sovete Federacii. 1996-...” http://www.cityline.ru:8084/politika/fs/sf2fvr.html
  14. http://www.prettyodango.net/senshi/info/faq/japanese/
  15. http://www.smartphrase.com/Greek/gr_going_out_voc.shtml

Table Sources:
