MACHINE TRANSLITERATION IN JAVA
by Richard Calvi
Abstract:
This project is a Java application that takes text input in non-Roman
alphabets and produces Romanized output. The design process involved
searching for tables, researching the reading rules and special cases
for the various scripts, and then finding ways to process the text
efficiently in terms of memory and processing time. Transliteration
techniques ranged from simple remapping routines and context-based
searches to more involved analysis in the case of Thai. Languages
supported include Greek, Russian, Japanese, Korean, Thai, Hindi,
Gujarati and several Chinese dialects. In one case, a font-specific
table was made, but generally, tables were saved in UTF-8 encoding and
conformed to Unicode character mappings.
Javadoc Documentation is available here.
CONTENTS:
What is Transliteration?
Why Make a Transliterator?
Categories of Languages
Principles of Transliteration
Research Phase
Testing Process
Evaluation of Results
Russian
Greek
Chinese
Korean
Hanja
Hangeul
Japanese
Kana
Kanji
Thai
Hindi (Devanagari)
Gujarati
Cherokee
Yi
Languages that Could Not Be Transliterated Well
Text Sources
Table Sources
What is Transliteration?
Transliteration is the phonetic representation of one language using
the alphabet of another language. For example, “Vladimir
Putin” is the English transliteration of the Russian name
“Владимир Путин.” When the target is the Latin (Roman) alphabet,
which English uses, the process is called
Romanization.
Although my program currently does only Romanization, it could easily
transliterate into other alphabets, if given the proper data tables.
Why Make a Transliterator?
This project was originally begun as an attempt to look for borrowed
English words in foreign texts. While it is easy to spot English
words in languages that use a Latin-based alphabet, it is harder in
other languages, such as Russian, Thai, or Chinese, unless you are
familiar with their corresponding writing systems. Although it may
not be very hard to learn the Russian alphabet, other languages can be
a bit more of a challenge. Thai has more letters, and many more
reading rules involved, and written Chinese can take years to master.
Furthermore, even if all writing systems were as easy to learn as
Russian is, there are simply too many alphabets in existence for any
one person to master all of them. If one were to automate the process
with a computer, it would theoretically be possible for anyone with
access to a transliterator to “read” documents
in foreign writing systems without much hesitation. If the
transliterated output were clear and readable, English words would be
somewhat easy to spot, and readers could have a vague idea of what any
given foreign text “sounded” like.
Transliterators can be especially useful to English speakers who speak
a foreign language fluently, but are not literate in it. Students
beginning a foreign language course might also use it when trying to
read foreign texts. With a phonetic guide, people can see how words
in languages of a common family are related, even if the languages use
completely different scripts. Plus, fans of foreign music could enjoy
being able to pronounce the words in their songs without learning a
foreign script.
Categories of Languages:
One underlying goal of this project was universality. The
transliterator should be able to support as many languages as
possible, using as few specialized Java classes as possible. To do
this, each transliterator class must be specific enough to handle
most reading rules of a given language, yet general enough that
it can support more than one language. This
way, the transliterator could be expanded to support many more
languages by providing more data tables, and not more Java code. To
keep things general, I split up the character sets that I set out to
transliterate into different basic categories loosely based on their
characteristics. In some cases, most or all of the
“general characteristics” might not have applied
to a language, but I was able to format a transliteration data table
that would suit the language despite this.
- Simple Replacement
Characters are
simply remapped, from one character set to another, based entirely on
the value (ASCII or Unicode) assigned to the initial
character
- Character
sets in this category: Unicode punctuation character sets
- Other character sets that might fall into this category:
any other systems in which any given character is always
transliterated in the same exact way
- Simple Replacement with Capitalization (an extended class
of the previous category)
Same as the previous, except that the transliterator must pay
attention to capitalization. For instance, the Russian character
“ч” is transliterated as
“ch”, but when it is capitalized, it may come
out as either “Ch” or “CH”
depending on whether the word it appears in is in title case, or all
caps.
- Languages currently in this category: Greek, Russian
- Other languages that might fall into this category: Other
languages that use the Cyrillic alphabet (Ukrainian, Uzbek, Mongolian,
etc.)
- Simple Replacement with Spaces (another extended class of
the first category)
Same as simple replacement, but with spaces between the
characters to improve readability.
- Languages currently in this category: Cherokee, Yi
- Other languages that might fall into this category: Other
languages that use a syllabary (a large table of characters covering
every possible syllable in the language)
- Thai-like
- General characteristics: [12]
- Vowel “structures” that surround a consonant
determine a vowel sound.
(For example, if k were a consonant in
Thai, “kา” would be
“kaa”, “โk”
would be “ko”, and
“แk็ว” would be
“kaeo”.)
- Consonants can have completely different pronunciations, depending
on whether they are initial consonants or final consonants. Only some
consonants can be final consonants.
- Certain pairs of consonants can combine to make consonant
clusters.
- If there is no vowel indicated for a syllable, an
“inherent” vowel is assumed.
- Some characters are always transliterated in exactly the same way
(such as numerals)
- Languages currently in this category: Hindi, Thai,
Gujarati, Arabic, Hebrew
- Other languages that might fall into this category: Indian
languages with alphabets similar to Hindi (Sanskrit, Bengali, Punjabi,
etc.) and Southeast Asian languages with alphabets similar to Thai
(Burmese, Khmer, Lao, etc.)
- Chinese/Japanese/Korean (CJK) Ideographs
- General characteristics: [4]
- Any one character can have multiple possible readings.
- Each character has a tone number assigned to it. Tones indicate
how one's voice should rise or fall when pronouncing the
syllable.
- Languages currently in this category: Korean Hanja and
several Chinese dialects
- Other languages that might fall into this category: Any
other Chinese dialects for which tables exist, Vietnamese
Chữ Nôm, and Japanese kanji (to an extent)
- Korean
- General characteristics:
- Each Korean “character” in Unicode consists
of a consonant, a vowel, and from zero to two more consonants. [6]
- The pronunciation of a consonant can change depending on whether
it is an initial consonant or an end consonant. [6]
- The pronunciation of a consonant can change depending on the
previous end consonant, the next initial consonant, and the current
vowel. [9]
- Languages currently in this category: Korean Hangeul
- Other languages that might fall into this category:
None
- Japanese Kana
- General characteristics: [14]
- Mostly a phonetic syllabary, with characteristics very suitable to
Simple Replacement
- A っ or a ッ before a consonant causes that
consonant to be doubled in the transliterated output.
- A ー causes the previous vowel to be lengthened.
- Character sets currently in this category: Hiragana,
Katakana and Half-width Katakana
- Other languages that might fall into this category:
None
- Context Based
- General characteristics: Very much like CJK Ideographs,
except that the problem of multiple readings per character is resolved
by paying attention to the context of the character, and using a table
with an extensive list of possible contexts of the characters
- Languages currently in this category: Japanese
- Other languages that might fall into this category:
Chinese, Vietnamese Chữ Nôm and Korean Hanja
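The Simple Replacement with Capitalization category above can be illustrated with a minimal Java sketch. It is not the project's actual code: the class name, method name, and the three-letter table are hypothetical, and the test for "all caps" (at least two characters, none of them lowercase) is an assumption about how title case and all caps might be told apart. Cyrillic letters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of capitalization-aware replacement; names are illustrative only. */
final class CapitalizingRemapper {
    // Illustrative table keyed by lowercase Cyrillic letter; a real table covers the alphabet.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u0447', "ch"); // Cyrillic che
        TABLE.put('\u0430', "a");  // Cyrillic a
        TABLE.put('\u0439', "y");  // Cyrillic short i
    }

    static String transliterateWord(String word) {
        // Assumption: a word counts as all caps if it has two or more characters
        // and upper-casing it changes nothing.
        boolean allCaps = word.length() > 1 && word.equals(word.toUpperCase(Locale.ROOT));
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            String mapped = TABLE.get(Character.toLowerCase(c));
            if (mapped == null) {
                out.append(c);                                // not in the table: pass through
            } else if (!Character.isUpperCase(c)) {
                out.append(mapped);                           // lowercase input, e.g. "ch"
            } else if (allCaps) {
                out.append(mapped.toUpperCase(Locale.ROOT));  // all-caps word, e.g. "CH"
            } else {
                out.append(Character.toUpperCase(mapped.charAt(0)))
                   .append(mapped, 1, mapped.length());       // title-case word, e.g. "Ch"
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transliterateWord("\u0427\u0430\u0439")); // title case
        System.out.println(transliterateWord("\u0427\u0410\u0419")); // all caps
    }
}
```

Keying the table by lowercase letter keeps it half the size and leaves the capitalization decision to the code, which is one way to add languages by adding tables rather than classes.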
Principles of Transliteration:
These are some of the rules that I set for myself to make certain that
the final product would be useful to others:
- Use unusual
đįăĉŕĩŧĭċǻŀ
marks sparingly (occasional apostrophes, grave accents and diereses
are OK)
- If no standard exists, mix transliterations from existing systems,
with emphasis on accuracy and readability.
- Make it so that someone like me, who has virtually no knowledge of
these languages, can understand vaguely how to pronounce the final
output.
Research Phase:
Since I had no prior knowledge of the languages being transliterated,
I needed to research the reading rules for them. (See the Text Sources section at the end.) In most cases,
it was simply a matter of downloading documentation, usually in Adobe
Portable Document Format, listing which characters corresponded to
which consonants and vowels, and how different combinations of them
yielded different pronunciations. This information was intended for
human transliterators, so I needed to read through it and represent it
in a data table such that a transliteration program could use it.
However, in languages such as Japanese and Chinese, there are
thousands of characters, many of which have multiple readings,
depending on their context. For these, I used free tables, also
available online. (See the Table Sources section
at the end.) Many of these tables were intended for an
Input Method Editor (IME), a program that allows users with an English
keyboard to type English letters and convert them to a foreign script -- essentially the opposite of the transliterator. For this reason, I had to reorganize or re-sort much of the data in some of these tables. Additionally, every table used by the transliterator must have a simple header, so at the very minimum, all the tables I borrowed were modified by having a header added.
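The reorganization described above amounts to inverting a lookup table. The following is a minimal sketch, assuming the IME table has already been parsed into a map from typed Latin keys to candidate characters; the class name, method names, and sample entries are hypothetical, and the real tables' file formats and headers are not shown. Characters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: turn an IME table into a transliteration table. */
final class ImeTableInverter {
    /** Inverts "typed key -> candidate characters" into "character -> possible readings". */
    static Map<String, List<String>> invert(Map<String, List<String>> imeTable) {
        Map<String, List<String>> readings = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : imeTable.entrySet()) {
            for (String character : entry.getValue()) {
                // One character may be reachable from several keys: keep every reading.
                readings.computeIfAbsent(character, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return readings;
    }

    /** A tiny made-up IME table with toneless Mandarin keys. */
    static Map<String, List<String>> sampleImeTable() {
        Map<String, List<String>> ime = new LinkedHashMap<>();
        ime.put("hang", Arrays.asList("\u884C"));           // one candidate
        ime.put("xing", Arrays.asList("\u884C", "\u661F")); // shares U+884C with "hang"
        ime.put("ma", Arrays.asList("\u9A6C", "\u5988"));
        return ime;
    }

    public static void main(String[] args) {
        // U+884C ends up with two readings, "hang" and "xing".
        System.out.println(invert(sampleImeTable()).get("\u884C"));
    }
}
```

A character reachable from several keys ends up with several readings, which is the source of the multiple-readings problem in the CJK categories.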
Testing Process:
Since I cannot speak any of the languages that I transliterated, I
cannot guarantee the accuracy of my transliterations. To tell if they
were somewhat accurate, I did the following:
- Researched rules for transliteration, and special cases that
applied to them
- Checked texts on the web that were transliterated by a human, and
compared their transliterations to mine
- Consulted native speakers whenever possible
- Listened to songs in the foreign languages and compared how they
sounded to my transliterated output of the
lyrics.
(Although Roman letters may not seem to fit the sounds in some
languages, this method was very effective in identifying glaring
errors.)
Evaluation of Results (broken down by language):
- Russian
- Pro:
- Fairly straightforward; letters are transliterated in the order they are written, and it seemed that one character always corresponded to the same character or sequence of characters in English
- Con:
- Certain consonants can be one word. For example,
“v” can be one word. It sounded to me like
“voh”, but I have seen professional
transliterations transcribe it as “v” [13], so my transliterator does the
same.
- Greek
- Pro:
- Fairly straightforward, like Russian
- Cons:
- Two “i” sounds: one that is like a short
“i” and one that is like a long
“i”. [2]
I could not find a simple way of making this clear without slightly
garbling the output.
- Certain pairs of consonants are used for foreign sounds in
borrowed words. For example, the consonant pair
“mp” is sometimes used to represent
“b”. [5] Thus, the
Greek word for “basketball” is
“mpasket”. [15]
Unfortunately, I could not find an efficient way to make the
transliterator handle cases like this. Simply replacing every
instance of “mp” with “b”
would make words that were supposed to have an
“mp” sound be spelled incorrectly.
- Chinese (Simplified Mandarin, Traditional
Mandarin, Cantonese, Teochew, Hakka, and Taiwanese)
- Pros:
- Output was fairly readable in most Traditional Chinese texts for
most dialects, whenever the transliterator was told not to consider
tones in output.
- Cons:
- In simplified Chinese and Taiwanese, many characters had several
alternate readings, making it difficult to tell how something was
pronounced.
- If tones are enabled, there are too many readings per character
for long text inputs to be readable. A context-based table and/or a
frequency chart might fix this, but I could not find a free
one.
- Korean (Hanja)
- Pros:
- Output was very readable; there was only one reading per character
- Cons:
- The table might have been incomplete
- The transliterator considered each character in isolation;
pronunciations might shift a bit depending on surrounding
consonants.
- Korean (Hangeul)
- Pros:
- It accounts for several special cases in which the pronunciations
of certain consonants change.
- It transliterates “시, 샤,
etc.” as “shi, sha, etc.” instead of
“si, sya, etc.” to be more consistent with
pronunciation, even though this diverges a bit from the new standard
transliteration. (This might be a con, if you are trying to stick to
the standard.)
- Hyphens are inserted in certain cases where there is ambiguity.
(For instance, “한글” is
transliterated as “han-geul” so that it is not
misread as “hang eul”)
- Cons:
- It does not account for all possible special cases that
occur (although it is easy to tweak the code if there are many more
significant special cases to add).
- In certain cases, there are two consonants in the bottom row of
one Hangeul unit; the consonant on the bottom right is pronounced if
the top consonant on the next unit is silent. [6] In some special cases, this top consonant
is not silent, so my transliterator pronounces it
and ignores the bottom-right one. In reality, there are rules to
follow to decide which consonant gets pronounced [6], but implementing them would complicate things too much
for my purposes.
- Japanese (Kana)
- Pro:
- It accounts for changes due to lengthened vowels, and doubled consonants.
- Con:
- Certain kana can have multiple pronunciations:
- “は” is “ha”
unless it is used as a grammatical particle, in which case it is
“wa”.
- “へ” is “he”
unless it is a grammatical particle, in which case it is
“e”.
- “を” is normally
“o”, but in songs, it can be pronounced
“wo”.
In the case of the first two examples, there is no way to
determine which pronunciation applies without lexical analysis. Thus, the
transliterator always gives the first pronunciation listed.
- Japanese (Kanji)
(When this transliterator is enabled, it takes over for the kana
transliterator, since the readings for kanji sometimes depend on the
surrounding kana)
- Pro:
- Apostrophes or spaces are inserted where necessary to distinguish
between “n” as a nasal sound and
“n” as an initial consonant
- Cons:
- See cons for kana
- Unlike characters in Chinese, Japanese can have many (in some
cases as many as 10) completely different alternate readings for one
character, depending on the context of the character. Using a huge
context-based table, the transliterator was able to narrow down the
possibilities in many cases, but in others, there were still several
readings, making the output hard to read. It seems a more
sophisticated table would be needed, complete with a frequency table
suggesting when certain readings would be most appropriate.
- The frequency table takes a few seconds to load, and can eat up
~20MB of RAM.
- Thai
- Pros:
- The program considers pronunciation shifts depending on whether
consonants are initial or final.
- Vowel clusters are identified; this is critical to deciding what
vowel sound goes with a syllable.
- Cons:
- The transliterator was not designed to identify tones for each
syllable. All tone indicating characters get ignored.
- Due to my lack of knowledge of the Thai alphabet, I am unable to
predict whether a consonant in the middle of a sentence is an initial
consonant or an end consonant; this leads to occasional erroneous
transliterations.
- There are normally no spaces between Thai words; spaces come only
between sentences. [11] Without a
dictionary table, the transliterator is unable to tell where one word
ends and another begins. This makes transliterated output hard to
read.
- When no vowel is specified to go in between an initial consonant
and an end consonant, a vowel, either “a” or
“o” is assumed by the reader. [12] I, however, have no way of telling
whether it should be an “a” or an
“o”, so the transliterator always makes it an
“o”.
- Certain special Thai characters, such as nikkhahit ( ํ), which indicates an “n” or
“ng”, and maiyamok (ๆ), which
indicates the repetition of one or more symbols [12], are not implemented.
- Hindi (Devanagari)
- Pro:
- Its written system seems similar to that of Thai [8], but a little simpler, making it conform
to the Thai class well.
- Cons:
- Half consonants are not implemented.
- Some reading rules in the Thai class, such as one which removes
the “h” consonant in special cases, might
interfere with the Hindi transliteration process.
- Gujarati (for AkrutiGopi)
- Pro:
- Its written system seems similar to that of Thai [7], but a little simpler, making it conform
to the Thai class well.
- Cons:
- See cons for Hindi above
- This one is font specific, so it does not conform to Unicode
(however, it is much easier to find Gujarati documents in AkrutiGopi).
- Either the table I generated is flawed, or I am misunderstanding
the reading rules for Gujarati; many transliterated Strings come up
containing the substring “null”, indicating that
something went wrong when reading the original Gujarati
text.
- Cherokee and Yi
- Pro:
- Characters are always mapped to Roman characters in the same exact
way, making transliteration trivial.
- Con:
- As far as I can tell, there are no sites (excluding Unicode
test pages) written in these languages
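The Hangeul entry above relies on the fact that each precomposed Hangul syllable in Unicode (U+AC00 through U+D7A3) can be split arithmetically into an initial consonant, a vowel, and an optional final consonant. The sketch below shows that decomposition together with the hyphenation case mentioned above. It is a simplified illustration, not the project's code: the class name is hypothetical, the letter tables give each sound in isolation, and none of the context-dependent special cases (including “shi” for “si”) are handled. Hangul is written as Unicode escapes so that the source compiles regardless of file encoding.

```java
/** Hypothetical sketch of Hangul syllable decomposition; ignores most special cases. */
final class HangulSketch {
    // Tables follow the Unicode ordering of initials (19), vowels (21), and finals (28).
    private static final String[] INITIALS = {
        "g", "kk", "n", "d", "tt", "r", "m", "b", "pp", "s",
        "ss", "", "j", "jj", "ch", "k", "t", "p", "h"};
    private static final String[] VOWELS = {
        "a", "ae", "ya", "yae", "eo", "e", "yeo", "ye", "o", "wa", "wae",
        "oe", "yo", "u", "wo", "we", "wi", "yu", "eu", "ui", "i"};
    // Simplified: each final is given the sound it has at the end of a word.
    private static final String[] FINALS = {
        "", "k", "k", "k", "n", "n", "n", "t", "l", "k", "m", "l", "l", "l",
        "p", "l", "m", "p", "p", "t", "t", "ng", "t", "t", "k", "t", "p", "t"};

    static String romanize(String text) {
        StringBuilder out = new StringBuilder();
        int previousFinal = -1; // final index of the preceding Hangul syllable, -1 if none
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0xAC00 || c > 0xD7A3) {   // not a precomposed syllable: pass through
                out.append(c);
                previousFinal = -1;
                continue;
            }
            int offset = c - 0xAC00;
            int initial = offset / (21 * 28);
            int vowel = (offset % (21 * 28)) / 28;
            int fin = offset % 28;
            // Final "n" (index 4) followed by initial "g" (index 0) would read as "ng",
            // so a hyphen marks the syllable boundary.
            if (previousFinal == 4 && initial == 0) {
                out.append('-');
            }
            out.append(INITIALS[initial]).append(VOWELS[vowel]).append(FINALS[fin]);
            previousFinal = fin;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(romanize("\uD55C\uAE00")); // the word for the Korean script
    }
}
```

Because the previous syllable's final consonant is already tracked, further context rules could be added at the same point as the hyphen check.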
Languages That Could Not Be Transliterated Well:
Arabic and Hebrew
These two languages have vowels, but seldom use them. The reader must
be able to recognize the written words, going on consonants alone. As
a result, transliterated output was not very readable. For example,
when transliterating Microsoft's Egyptian
homepage, the String “m'ykrwswft” showed up
in the output; this was most likely an Arabic rendering of the
company's name, but with words like this throughout a text, the
transliterator cannot serve its purpose well. Perhaps a Romanized
Arabic dictionary with words in the original script and the
transliterated form would be necessary to process such texts. For
words that are not in the dictionary, maybe a frequency table could be
used to figure out the most probable vowel sound.
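The dictionary idea suggested above could look roughly like the following sketch: look the whole word up first, and fall back to letter-by-letter replacement only when it is missing. Everything here is hypothetical, including the class name, the nine-letter table, and the dictionary entry; the Arabic spelling of the company name is an assumption chosen to reproduce the “m'ykrwswft” output quoted above. The frequency-table step for guessing vowels is not shown. Arabic letters are written as Unicode escapes so that the source compiles regardless of file encoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a dictionary-first lookup with a letter-by-letter fallback. */
final class DictionaryFirstTransliterator {
    // Assumed letter table covering only the letters of the sample word.
    private static final Map<Character, String> LETTERS = new HashMap<>();
    static {
        LETTERS.put('\u0645', "m"); // meem
        LETTERS.put('\u0627', "'"); // alef
        LETTERS.put('\u064A', "y"); // yeh
        LETTERS.put('\u0643', "k"); // kaf
        LETTERS.put('\u0631', "r"); // reh
        LETTERS.put('\u0648', "w"); // waw
        LETTERS.put('\u0633', "s"); // seen
        LETTERS.put('\u0641', "f"); // feh
        LETTERS.put('\u062A', "t"); // teh
    }

    private final Map<String, String> dictionary; // original word -> romanized form

    DictionaryFirstTransliterator(Map<String, String> dictionary) {
        this.dictionary = dictionary;
    }

    String transliterateWord(String word) {
        String known = dictionary.get(word);
        if (known != null) {
            return known;                 // whole-word match wins
        }
        StringBuilder out = new StringBuilder();
        for (char c : word.toCharArray()) {
            String mapped = LETTERS.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c)); // fallback per letter
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed spelling of the company name: meem, alef, yeh, kaf, reh, waw, seen, waw, feh, teh.
        String word = "\u0645\u0627\u064A\u0643\u0631\u0648\u0633\u0648\u0641\u062A";
        Map<String, String> dictionary = new HashMap<>();
        System.out.println(new DictionaryFirstTransliterator(dictionary).transliterateWord(word));
        dictionary.put(word, "maikrosoft"); // made-up dictionary entry
        System.out.println(new DictionaryFirstTransliterator(dictionary).transliterateWord(word));
    }
}
```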
Text Sources:
Unfortunately, many of these links point to places on Søren
Binks' "Transliteration of Non-Roman Scripts" site, which no longer
exists. Here
is an archived copy of his homepage, just to give you an idea of what
his site once had.
1. “Arabic: Transliteration, rev. 2, 2001-09-01” http://homepage.mac.com/sirbinks/Arabic.r2.pdf
2. Binks, Søren. “Greek” http://homepage.mac.com/sirbinks/Greek.pdf
3. “Cherokee: Transliteration, rev. 2, 2001-03-15” http://homepage.mac.com/sirbinks/Cherokee.r2.pdf
4. “CJKV Dictionaries and Printed Resources” http://www.sungwh.freeserve.co.uk/flux/resources.htm
5. “Classics Log 9805c - Message Number 131” http://omega.cohums.ohio-state.edu:8080/hyper-lists/classics-l/listserve_archives/log98/9805c/9805c.131.html
6. Eisenberg, David J. “Introduction to Korean” http://langintro.com/kintro/
7. “Gujarati: Transliteration, rev. 3, 2001-03-10” http://homepage.mac.com/sirbinks/Gujarati.r3.pdf
8. “Hindi: Transliteration, rev. 4, 2001-03-10” http://homepage.mac.com/sirbinks/Hindi.r4.pdf
9. “New System of Romanization for the Korean Language” http://www.homestaykorea.com/2002_01/intro/romanization.htm
10. “Russian: Transliteration, rev. 1, 2001-02-25” http://homepage.mac.com/sirbinks/Russian.r1.pdf
11. “Thai Language - Learn to Read Thai” http://www.learningthai.com/thai_alphabet.html
12. “Thai: Transliteration, rev. 2, 2001-03-10” http://homepage.mac.com/sirbinks/Thai.r2.pdf
13. “"Vsya Rossiya" v Sovete Federacii. 1996-...” http://www.cityline.ru:8084/politika/fs/sf2fvr.html
14. http://www.prettyodango.net/senshi/info/faq/japanese/
15. http://www.smartphrase.com/Greek/gr_going_out_voc.shtml
Table Sources: