My name is not really Andrei

» 31 July 2006 » In PHP »

Ryan Kennedy commented on the presentation I gave at OSCON, specifically about the transliteration support in PHP 6. I wanted to follow up and explain exactly what it is and, unfortunately, what it is not.

Ryan was excited about the possibilities presented by transliteration, especially as it applies to representing foreign names in the reader’s native script (think mail readers). This works really well for Japanese names:

echo str_transliterate("やまもと, のぼる", "Any", "Latin");

And the result is:

  yamamoto, noboru

This is sweet, right? We get an approximate spelling of the foreign name, and one could even attempt to pronounce it. But does it work for all script pairs?

echo str_transliterate("Tom Cruise", "Latin", "Cyrillic");

What do we get for this paragon of fame?

  Том Цруисе

Hmm. If I had to reconstruct for you English speakers what that sounds like in Russian, it would be something close to TOM TSRU-EE-SEH. Probably not how he’d like to be known in Russia. What is the problem here?

The problem is that English orthography is defective. There is a disconnect between the orthography (spelling) of English and the phonemes (sounds) of the language. We’ve all seen this: each English letter may represent more than one sound (c can be [s] or [k]), and each English sound may be written by more than one letter ([f] can be f, or ph, or gh).

This plays havoc with transliteration, which is a literal mapping from one system of writing to another. Transliteration is supposed to be lossless and thus reversible. In order to achieve this, the mapping rules must represent each letter (glyph) of the source script as a separate glyph or a unique combination of glyphs in the target script. Transliteration knows nothing about the underlying sounds of the language and works only with the written forms. You can see how this is problematic when you come to a language like English.
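
To make the reversibility concrete, here is a sketch of a round trip; the intermediate Latin spelling shown is only my approximation of what the default built-in rules would produce:

$latin = str_transliterate("Горбачёв", "Cyrillic", "Latin");
$back  = str_transliterate($latin, "Latin", "Cyrillic");

Here $latin comes out as something like Gorbačëv: every Cyrillic letter gets its own unique Latin form (č for ч, ë for ё), which is exactly what allows the second call to recover Горбачёв unchanged.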

What you really want is transcription, which maps the sounds of the source language to the script of the target language. This is fairly easy for an efficient language, one where the sounds have a one-to-one mapping to glyphs, and becomes progressively more difficult for less efficient languages. With English, the transcription process has to rely on a dictionary providing an exact phonetic transcription of pretty much every word.
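
The idea is easy to sketch in PHP; the lexicon below is a two-entry stand-in for a real pronunciation dictionary:

$lexicon = array(
    "Tom"    => "Том",   // whole words are looked up, not individual letters
    "Cruise" => "Круз",
);
echo strtr("Tom Cruise", $lexicon);

And the result is:

  Том Круз

Any word missing from the lexicon falls through unchanged, which is where a letter-based transliteration would have to take over as a fallback.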

Still, transliteration works fairly well for a good number of script pairs, since it attempts to map the letters of the source script to similar-sounding letters in the target one. The results mostly depend on how efficient the source language is (as with Russian->English). Transliteration rules can be customized, and if you are willing to live without the reversibility requirement, you can get fairly accurate representations. The str_transliterate() function in PHP 6 uses default built-in transliteration rules, but there will be a way to provide your own rules toward achieving this goal, as sketched below.
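
To give a flavor of what such rules could look like, here is a sketch using ICU-style transform rules; the str_transliterate_rules() function is purely hypothetical, a stand-in for whatever the final rule-loading interface turns out to be:

$rules = "Tom > Том ; Cruise > Круз ;";
echo str_transliterate_rules("Tom Cruise", $rules);

Which would finally give Mr. Cruise his proper Russian spelling:

  Том Круз

Note that whole-word rules like these trade away reversibility: nothing in Круз tells you it came from Cruise rather than some other Latin spelling.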

Hope this helps explain some of the issues concerning mapping one writing system to another. Until next time.


  1. Abu Hurayrah
    31/07/2006 at 11:48 am

    All the more reason to implement customizable transliteration tables…:-D

    Actually, would it be so absurd to transliterate the letters into their abstract phonological sounds (represented by whatever complete system you like), and then, once they are in this unambiguous form, translate them into the appropriate format in the destination language based on a “grammar”? We are, after all, talking about languages, aren’t we? Are there any projects that aim to approach a solution to this by codifying rules like “i-before-e-except-after-c” and so on? Surely, a good portion of each language, English or otherwise, can be described by discrete rules…or am I living in some kind of a fantasy world?

    How ironic is it that the most imprecise language is the one we use to communicate with one another?

  2. Abu Hurayrah
    31/07/2006 at 11:56 am

    I just read the WP entry for a defective script, by the way, and I wanted to comment: is English so much a defective script as it is an imprecise one? For example, there are more than enough letters for most sounds, except the obvious ones like ‘th’ (the), ‘th’ (bath), and ‘ch’, but the problem is that the same letter can have different sounds and the same sound can have different letters.

    Contrast this with the example of Arabic given in the defective scripts entry. That is, Arabic, when written without any diacritical marks, can be somewhat ambiguous, and when written without dotted letters, can be downright incomprehensible to one not well-versed in the dialect in which it was written.

    So, in that particular example you gave, is it the defectiveness of English that breaks the transliteration, or is it the imprecision with which letters are used for sounds? I think there is a notable difference, subtle though it may be.

    I am NOT a linguist, by the way, so I apologize for being rather untechnical in my analysis.

  3. Ezku
    31/07/2006 at 12:28 pm

    Abu Hurayrah:

    You have a point. It’s not English, the language, that is “defective”; its orthography, the way the language is put into symbols, very much is. Many other languages have at one point or another gone through a spelling reform, in which ambiguities, redundancies and outright wrongs have been corrected. There have been many attempts to do the same for English, but they have all failed.

    The Wikipedia article on this might prove of interest: http://en.wikipedia.org/wiki/Spelling_reform

  4. Andrei
    31/07/2006 at 9:53 pm

    Abu Hurayrah,

    | Surely, a good portion of each language, English or otherwise, can be described
    | by discrete rules…or am I living in some kind of a fantasy world?

    While English does have some pronunciation rules, there are also a lot of exceptions. Consider pairs such as “cough” and “bough”, or “laugh” and “caught”. What kind of rule can you make up that will distinguish the [f] sound of “gh” from the silent one?

    As for your comment on English not being “defective”, you are correct; I should have used the term “deficient” or “inefficient”.

  5. Abu Hurayrah
    31/07/2006 at 10:07 pm

    Andrei,

    The complexity of codifying English would be a factor of how much of it really CAN be codified, and how much of it is simply going to be a large array of exceptions. I don’t know where the logical threshold between a look-up table and a complex set of grammar rules would lie, but surely, as with anything else, it’s POSSIBLE, right? I wonder if any projects exist in this direction.

    I would consider a more complex issue to be that of heteronyms (described at http://en.wikipedia.org/wiki/Homonym), namely, words that are SPELLED the same but pronounced differently (e.g., desert – an arid, dry place with lots of sand, or desert – to abandon someone or something). Would we need to utilize context? If that’s the case, then the parser has to actually UNDERSTAND what it’s parsing on a language level, and not just on a character or word level…talk about complex. 😀

  6. David Rodger
    01/08/2006 at 3:18 am

    There’s also a dreadful disconnect (sic) between English grammar and perverse usage when one makes the word “disconnect” a noun.

    Stop it right now!

  7. Andrei
    01/08/2006 at 10:11 am

    Hey David,

    Dictionaries seem to disagree with you:

    http://dictionary.reference.com/search?q=disconnect

    Welcome to the wonderful world of word formation, by the way.

  8. Abu Hurayrah
    01/08/2006 at 1:26 pm

    It’s a perfectly cromulent usage of the word “disconnect”…
