The first article of the Universal Declaration of Human Rights, Vietnamese translation. Note that at the end of the video, the engine correctly handles obviously non-Vietnamese words like "system" and "wonderful", which would end up as sýtem and ươnderful if the TELEX typing rules were applied blindly.

Making a Vietnamese keyboard engine was one of the earlier challenges I worked on in my career. The first version I made was in Python (bogo-python), back in 2012 or so (almost 10 years ago now, I feel old 😆), porting cmptig's C version. The motivating problem was to make a Vietnamese keyboard for the Linux desktop that doesn't use pre-edit (the annoying underline with wacky behavior that breaks a lot of people's productivity). We decided to make our own engine because the Unikey engine's code was rather difficult to work with and, you know, young and hard-headed engineers find every reason they can to reinvent all the wheels (the Not Invented Here syndrome ran strong!). It worked well for a while but never quite well enough, and we constantly ran into problems because we were working against the system (see Ước mơ bộ gõ kiểu Unikey trên Linux).

Anyway, even if the goal of a keyboard without pre-edit was futile, the problem of processing Vietnamese diacritical marks has stayed with me. Over the years I have made various attempts at tackling it from many angles, using different theoretical methods, trying to find the most elegant implementation, both for fun and to satisfy my curiosity.

This time, as I'm learning Prolog, a symbolic programming language based on first-order logic, it's only natural to attempt the challenge with it. I find Prolog to be a particularly good fit for this (unsurprisingly, since it was created in the 1970s to deal with computational linguistics problems in the first place). The program might change in the future, but I'll try to explain the core logic here. For brevity, whenever I mention Prolog I mean SWI-Prolog, as there are some differences among the various Prolog implementations. I assume the reader has only a cursory familiarity with Vietnamese.

Disclaimer: I'm a relative novice at Prolog and I'm sure there are many ways to make the engine more concise and performant. Please feel free to throw your ideas at me. I hope you enjoy this!

Basics of Vietnamese Orthography

First, let's talk a bit about the Vietnamese writing system. It's a pretty strict and close approximation of the phonology of standard Northern Vietnamese. The mapping is close enough that, for our purposes, I'll treat the orthography as the same as the phonology. Vietnamese is written syllable by syllable, with syllables separated by spaces and punctuation marks. Note that the space character is a syllable separator, not a word separator like in European languages. In this regard, modern written Vietnamese is closer to written Japanese and Mandarin Chinese.
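To make the "syllable as the unit" idea concrete, here's a minimal sketch of splitting text into syllables. It's in Python rather than Prolog purely for illustration and is not part of the engine; the function name `syllables` is made up for this example, and it assumes that any run of letters between spaces or punctuation counts as one syllable.

```python
import re
import unicodedata


def syllables(text):
    """Split text into syllables on whitespace and punctuation."""
    # NFC keeps each accented letter as a single code point, so the
    # letter-only pattern below never breaks a syllable at a combining mark.
    composed = unicodedata.normalize("NFC", text)
    # [^\W\d_] matches any Unicode letter (word characters minus digits and "_").
    return re.findall(r"[^\W\d_]+", composed)


print(syllables("Việt Nam, xin chào!"))
```

Note that "Việt Nam" comes out as two items even though it is a single word: the space only tells us where syllables end.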

The syllable is the unit we're working with. A syllable has 4 major parts:

  1. An optional initial consonant letter sequence. The only consonant letter with a diacritic is đ, and it can only appear as the entire initial consonant.
  2. A non-empty vowel nucleus letter sequence, the members of which must belong to one of these groups:
    1. non-diacritic letters: a, e, i, o, u, y
    2. hat above modified letters: â, ê, ô
    3. breve modified letters: ă
    4. horn modified letters: ơ, ư
  3. An optional tone marking on the vowel nucleus, one of:
    1. unmarked (ngang). e.g. a
    2. grave (huyền). e.g. à
    3. acute (sắc). e.g. á
    4. hook above (hỏi). e.g. ả
    5. tilde (ngã). e.g. ã
    6. dot below (nặng). e.g. ạ
  4. An optional final consonant letter sequence
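
The four parts above can be pulled out of a syllable mechanically. Here's a rough sketch, in Python rather than Prolog and not taken from the engine itself; `split_syllable`, `TONES` and `VOWELS` are names invented for this example. It leans on Unicode normalization: in NFD form the tone mark is a separate combining character that can be peeled off, and recomposing with NFC restores letters like â or ư. It deliberately ignores subtleties such as "gi" and "qu", where a vowel letter arguably belongs to the initial consonant.

```python
import unicodedata

# Combining marks for the five marked tones, keyed by their NFD code point.
TONES = {
    "\u0300": "grave",
    "\u0301": "acute",
    "\u0309": "hook above",
    "\u0303": "tilde",
    "\u0323": "dot below",
}
# Vowel letters without tone marks, normalized so each is one code point.
VOWELS = set(unicodedata.normalize("NFC", "aeiouyâêôăơư"))


def split_syllable(syllable):
    """Return (initial, nucleus, tone, final) for a single syllable."""
    tone = "unmarked"
    kept = []
    for ch in unicodedata.normalize("NFD", syllable.lower()):
        if ch in TONES:
            tone = TONES[ch]  # remember the tone, drop the mark
        else:
            kept.append(ch)
    # Recompose so hat, breve and horn letters become single letters again.
    letters = unicodedata.normalize("NFC", "".join(kept))

    start = 0
    while start < len(letters) and letters[start] not in VOWELS:
        start += 1
    end = start
    while end < len(letters) and letters[end] in VOWELS:
        end += 1

    initial, nucleus, final = letters[:start], letters[start:end], letters[end:]
    if not nucleus or any(ch in VOWELS for ch in final):
        raise ValueError(f"not a single vowel nucleus: {syllable!r}")
    return initial, nucleus, tone, final


print(split_syllable("nghiêng"))
print(split_syllable("được"))
```

This only checks the overall shape (consonants, then vowels, then consonants); it does not check that the consonant or vowel sequences are ones Vietnamese actually allows.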

Some Examples

The tone marks can combine with the vowel modification marks to make this monstrous table:

àáảãạa ằắẳẵặă ầấẩẫậâ
èéẻẽẹe ềếểễệê ìíỉĩịi
òóỏõọo ồốổỗộô ờớởỡợơ
ùúủũụu ừứửữựư ỳýỷỹỵy

Yes. They laugh at you. 🙃
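
The table doesn't have to be typed by hand, though. As a small sketch (again in Python for illustration, not the engine's Prolog; `BASE_VOWELS`, `TONE_MARKS` and `tone_variants` are names made up here), each cell is just a base vowel plus one combining tone mark, composed into a single character with NFC normalization:

```python
import unicodedata

# The 12 vowel letters, in the same order as the table above.
BASE_VOWELS = unicodedata.normalize("NFC", "aăâeêioôơuưy")
# Combining grave, acute, hook above, tilde, dot below, then no mark.
TONE_MARKS = ["\u0300", "\u0301", "\u0309", "\u0303", "\u0323", ""]


def tone_variants(vowel):
    """Return the six tone forms of one vowel letter as a string."""
    return "".join(
        unicodedata.normalize("NFC", vowel + mark) for mark in TONE_MARKS
    )


# Print three vowels per row, like the table above.
groups = [tone_variants(v) for v in BASE_VOWELS]
for i in range(0, len(groups), 3):
    print(" ".join(groups[i:i + 3]))
```

Run as is, this should reproduce the four rows above: 12 vowel letters times 6 tones gives 72 distinct lowercase vowel characters.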