My name is not really Andrei

» 31 July 2006 » In PHP »

Ryan Kennedy commented on the presentation I gave at OSCON, specifically on the transliteration support in PHP 6. I wanted to follow up and explain exactly what it is and, unfortunately, what it is not.

Ryan was excited about the possibilities presented by transliteration, especially as it applies to representing foreign names in the reader’s native script (think mail readers). This works really well for Japanese names:

echo str_transliterate("やまもと, のぼる", "Any", "Latin");

And the result is:

  yamamoto, noboru

This is sweet, right? We get an approximate spelling of the foreign name and one could even attempt to pronounce it. But does it work for all script pairs?

echo str_transliterate("Tom Cruise", "Latin", "Cyrillic");

What do we get for this paragon of fame?

  Том Цруисе

Hmm. If I had to reconstruct for you English speakers what that sounds like in Russian, it would be something close to TOM TSRU-EE-SEH. Probably not how he’d like to be known in Russia. What is the problem here?

The problem is that English orthography is defective. There is a disconnect between the orthography (spelling) of English and the phonemes (sounds) of the language. We’ve all seen this: each English letter may represent more than one sound (c can be [s] or [k]), and each English sound may be written by more than one letter ([f] can be f, or ph, or gh).

This plays havoc with transliteration, which is a literal mapping from one system of writing to another. Transliteration is supposed to be lossless and thus reversible. To achieve this, the mapping rules must represent each letter (glyph) of the source script as a separate glyph or a unique combination of glyphs in the target script. Transliteration knows nothing about the underlying sounds of the language and works only with the written forms. You can see how this is problematic when you come to a language like English.
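To make the "literal mapping" idea concrete, here is a minimal sketch in Python (chosen only so the example is self-contained; it is not the PHP API). The table `LATIN_TO_CYRILLIC` and the helper `transliterate()` are hypothetical names, and the letter pairs are picked to reproduce the Tom Cruise output above, not taken from the real rule set behind str_transliterate(). Lowercase only, to keep the table short.

```python
# Toy Latin -> Cyrillic table covering just the letters of "tom cruise".
# These pairs are an illustrative assumption, not the built-in rules.
LATIN_TO_CYRILLIC = {
    "t": "т", "o": "о", "m": "м", "c": "ц",
    "r": "р", "u": "у", "i": "и", "s": "с", "e": "е",
}

# Every source letter has its own distinct target letter, so the table
# can simply be inverted. That is what makes the mapping reversible.
CYRILLIC_TO_LATIN = {target: source for source, target in LATIN_TO_CYRILLIC.items()}


def transliterate(text, table):
    """Map each character through the table; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


forward = transliterate("tom cruise", LATIN_TO_CYRILLIC)
backward = transliterate(forward, CYRILLIC_TO_LATIN)

print(forward)   # том цруисе
print(backward)  # tom cruise
```

The round trip gets the original spelling back, which is the whole point of the reversibility requirement. Nothing in the table knows that this particular "c" is pronounced [k], and that is exactly where the odd-sounding result comes from.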

What you really want is transcription, which maps the sounds of the source language to the script of the target language. This is fairly easy for an efficient language, one where the sounds have a one-to-one mapping to glyphs, and becomes progressively more difficult for less efficient languages. With English, the transcription process has to rely on a dictionary providing exact phonetic transcription of pretty much every word.
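A dictionary-driven transcription could be sketched roughly like this, again in Python and again as a toy. `TRANSCRIPTIONS` and `transcribe()` are hypothetical names, and the two entries were written by hand for this example; a usable dictionary would need an entry for nearly every word.

```python
# Hypothetical pronunciation dictionary: English word -> Cyrillic spelling
# of how the word sounds. Only two hand-written entries for illustration.
TRANSCRIPTIONS = {
    "tom": "Том",
    "cruise": "Круз",
}


def transcribe(text):
    """Replace each known word with its transcription; keep unknown words as-is."""
    words = text.split()
    return " ".join(TRANSCRIPTIONS.get(word.lower(), word) for word in words)


print(transcribe("Tom Cruise"))  # Том Круз
print(transcribe("Tom Hanks"))   # Том Hanks
```

The second call shows the cost of this approach: any word missing from the dictionary falls straight through untouched.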

Still, transliteration works fairly well for a good number of script pairs, since it attempts to map the letters of the source script to similar-sounding letters in the target one. The results mostly depend on how efficient the source language is (as with Russian->English). Transliteration rules can be customized, and if you’re willing to live without the reversibility requirement, you can get fairly accurate representations. The str_transliterate() function in PHP 6 uses default built-in transliteration rules, but there will be a way to provide your own rules towards achieving this goal.
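As a rough illustration of that trade-off, the Python sketch below layers a few context-sensitive rules on top of a per-letter table. Everything here is an assumption made for the example: `CONTEXT_RULES`, `LETTERS` and `transliterate_custom()` are hypothetical names, the rules are tuned to this one name, and none of it reflects the rule syntax PHP will actually accept.

```python
import re

# Hypothetical context rules, applied in order before the per-letter fallback.
CONTEXT_RULES = [
    (r"c(?=[eiy])", "с"),  # "c" before e, i or y usually sounds like [s]
    (r"c", "к"),           # any other "c" usually sounds like [k]
    (r"ui", "у"),          # the "ui" in "cruise" is a single vowel sound
    (r"se\b", "з"),        # word-final "se" in "cruise" sounds like [z]
]

# Per-letter fallback for whatever the context rules left in Latin script.
LETTERS = {
    "t": "т", "o": "о", "m": "м", "r": "р", "u": "у",
    "i": "и", "s": "с", "e": "е", "a": "а", "k": "к",
}


def transliterate_custom(text):
    """Apply the context rules first, then map the remaining letters one by one."""
    text = text.lower()
    for pattern, replacement in CONTEXT_RULES:
        text = re.sub(pattern, replacement, text)
    return "".join(LETTERS.get(ch, ch) for ch in text)


print(transliterate_custom("Tom Cruise"))  # том круз

# Reversibility is gone: "c" and "k" now land on the same Cyrillic letter,
# so two different source spellings produce the same output.
print(transliterate_custom("cat") == transliterate_custom("kat"))  # True
```

The result is much closer to how the name is actually pronounced, but there is no longer any way to tell from the output alone whether the source had a "c" or a "k".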

Hope this helps explain some of the issues concerning mapping one writing system to another. Until next time.
