Rating: 7.5/10.
A fairly short book that introduces readers to the Unicode standard and the pitfalls of character encoding, suitable for both programmers and linguists; it’s probably one of the shorter treatments of the topic I’ve come across. However, it focuses more on topics of interest to linguists, and several important issues related to Unicode aren’t discussed.
The book heavily emphasizes the challenges of equivalence, where the same string can be represented in multiple ways. While this is a crucial issue, it’s not the only one. The book doesn’t touch on other common problems programmers face with Unicode, like string length, substring handling, and collation (sorting). There’s also no mention of locale-specific challenges with writing systems like Chinese and Japanese logographs, or of the complications of right-to-left scripts. Instead, there’s a lot of focus on the International Phonetic Alphabet (IPA) and how it’s represented in Unicode, which is something linguists use frequently.
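To illustrate the kind of programmer-facing pitfalls I mean (my own example, not taken from the book), Python’s built-in string operations work on code points, which produces surprises with length, slicing, and sorting:

    s1 = "café"          # 'é' stored as one precomposed code point (U+00E9)
    s2 = "cafe\u0301"    # 'e' followed by COMBINING ACUTE ACCENT (U+0301)
    print(len(s1), len(s2))   # 4 5 -- the "length" depends on the representation
    print(s2[:4])             # 'cafe' -- slicing by code points cut off the accent
    print(sorted(["Zebra", "apple", "étude"]))
    # ['Zebra', 'apple', 'étude'] -- naive code-point order, not dictionary order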
Chapter 1: The history of encoding written language began with telegraphy, which required text to be encoded into transmission codes like Morse code. Subsequent binary codes assigned a fixed length to each character: ITA had 64 characters, ASCII had 128, and MCS had 256, though the latter only supported European languages. Various other proposals were crafted for different writing systems worldwide, but it wasn’t until 1991 that Unicode was introduced, offering a 16-bit code space intended to support all of them.
Some terminology: Transcription is recording the exact sounds of speech; while IPA is sometimes used for this purpose, it doesn’t always capture the sounds perfectly. Orthography is the standard way of writing, but it often deviates from actual pronunciation due to historical reasons. The term grapheme or character refers to the smallest symbolic unit in a writing system, but its exact definition can vary depending on context. A diacritic is a mark, like an accent, that accompanies a character.
Chapter 2: Unicode was groundbreaking in its proposal of a single system to represent all scripts. A code point is an integer that uniquely identifies a Unicode character, but the character itself is an abstract concept without a fixed visual representation. In contrast, a glyph is a visual rendering of a character, and a font is a collection of glyphs. Blocks are Unicode’s organizational units for contiguous ranges of code points, but they don’t always correspond strictly to a particular script.
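A quick Python illustration of code points as integers (my example, not the book’s):

    import unicodedata

    ch = "ə"
    print(ord(ch))                # 601 -- the code point is just an integer
    print(f"U+{ord(ch):04X}")     # U+0259 -- the conventional hexadecimal notation
    print(unicodedata.name(ch))   # LATIN SMALL LETTER SCHWA
    print(chr(0x0259) == ch)      # True -- how it actually looks is up to the font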
Characters in Unicode can sometimes be represented in multiple ways; for instance, characters with accents might be encoded as several code points or just one, but they’ll look the same visually. Normalization is the process of converting these different representations into a consistent form, often by merging multiple code point sequences into a single one where possible.
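For instance (my example; the book makes the same point more generally), two strings can be visually identical yet compare unequal:

    composed   = "\u00e9"          # é as LATIN SMALL LETTER E WITH ACUTE
    decomposed = "e\u0301"         # e + COMBINING ACUTE ACCENT
    print(composed, decomposed)    # both render as é
    print(composed == decomposed)  # False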
Chapter 3. There are various pitfalls and surprises with Unicode. There is no simple mapping between characters and glyphs: depending on the context, a single character might be rendered with different glyphs or as a sequence of several glyphs, and conversely a single glyph might represent a combination of several characters. Characters also differ from graphemes, like “sh”, which can be a single grapheme in one language and two in another; Unicode locales can be used to specify this distinction. The zero-width non-joiner can be used to mark instances where “s” and “h” together aren’t a single grapheme.
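A small sketch of that mechanism (my example, assuming I’ve understood the book’s point about U+200C correctly):

    # Inserting ZERO WIDTH NON-JOINER (U+200C) between 's' and 'h' renders invisibly
    # but signals that the two letters do not form a digraph (as in English "mis-hap").
    plain    = "mishap"
    explicit = "mis\u200chap"
    print(plain == explicit)                        # False -- the marker changes the code points
    print([f"U+{ord(c):04X}" for c in "s\u200ch"])  # ['U+0073', 'U+200C', 'U+0068']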
Most fonts don’t cover all possible glyphs, so a fallback font is used for missing characters; as a result, the substituted glyph might look inconsistent with the surrounding text. Font rendering gets complex for languages that can stack multiple simultaneous diacritics.
Blocks are organizational units in Unicode, but they’re only rough indicators of where a given character can be found. The names of characters aren’t always precise and can sometimes be misleading. There are homoglyphs: characters that look alike but have different code points. People often type the wrong member of a homoglyph pair by accident.
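For example (mine, not the book’s), three capital-A lookalikes from different scripts:

    import unicodedata

    for ch in ["A", "\u0391", "\u0410"]:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0041 LATIN CAPITAL LETTER A
    # U+0391 GREEK CAPITAL LETTER ALPHA
    # U+0410 CYRILLIC CAPITAL LETTER A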
Normalization can be done in either NFC, which composes characters into precomposed code points where possible, or NFD, which decomposes them into base characters plus combining marks. The application can choose which form it wants. There are still cases where characters look identical but aren’t unified by normalization.
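In Python terms (my sketch, not code from the book):

    import unicodedata

    composed, decomposed = "\u00e9", "e\u0301"
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True
    print(unicodedata.normalize("NFD", composed) == decomposed)  # True
    # But normalization does not unify mere lookalikes:
    print(unicodedata.normalize("NFC", "\u0430") == "a")         # False -- Cyrillic а vs Latin a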
Encoding is about converting the integer code points into bytes, with challenges like managing line endings and choosing between little- and big-endian byte order. UTF-32 encodes every code point in 32 bits, while UTF-8 uses a variable-length encoding of one to four bytes per code point. UTF-8 is recommended and commonly used because it’s more space-efficient, especially for mostly-ASCII text.
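A quick illustration of the size and byte-order differences (my example):

    s = "naïve"
    print(len(s.encode("utf-8")))     # 6 -- 'ï' takes two bytes, the ASCII letters one each
    print(len(s.encode("utf-32")))    # 24 -- four bytes per code point plus a four-byte BOM
    print(s.encode("utf-32-le")[:4])  # b'n\x00\x00\x00'
    print(s.encode("utf-32-be")[:4])  # b'\x00\x00\x00n' -- same text, opposite byte order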
Chapter 4. The International Phonetic Alphabet (IPA) was first proposed in 1886 to record arbitrary speech sounds, though in practice the same symbol can represent somewhat different sounds across languages. Distinct symbols are generally reserved for contrasting sounds, while diacritics provide additional detail, like allophonic variation.
Various attempts have been made to encode IPA: some assigned three-digit numbers to each symbol, while others, like SAMPA and X-SAMPA, mapped IPA to ASCII; these are still sometimes used in speech technology. Eventually IPA was included in the Unicode standard, though some pre-Unicode encodings remain in use.
Chapter 5. There are challenges in encoding IPA in Unicode. There is no single block containing all of IPA (the IPA Extensions block only holds symbols not already encoded elsewhere), so the symbols are scattered across several blocks, including the Latin ones. There are many homoglyphs, both within IPA itself and between IPA symbols and similar-looking characters in other scripts; the book provides numerous examples. Some IPA symbols have been deprecated, but since Unicode never removes characters, the deprecated symbols remain encoded.
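Two of the classic confusable pairs (my example; the book lists many more):

    import unicodedata

    pairs = [("g", "\u0261"),       # ordinary g vs the IPA symbol for the voiced velar stop
             ("\u03b1", "\u0251")]  # Greek alpha vs the IPA open back unrounded vowel
    for a, b in pairs:
        print(f"U+{ord(a):04X} {unicodedata.name(a)}  vs  U+{ord(b):04X} {unicodedata.name(b)}")
    # U+0067 LATIN SMALL LETTER G  vs  U+0261 LATIN SMALL LETTER SCRIPT G
    # U+03B1 GREEK SMALL LETTER ALPHA  vs  U+0251 LATIN SMALL LETTER ALPHA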
Chapter 6. Practical recommendations for programmers, covering how to input special characters and which fonts work best for rendering IPA.
Chapter 7. A new format called an orthography profile is proposed for describing orthographic systems. Unicode locales aim to serve a similar purpose, but they aren’t well suited to linguistic use cases. A profile defines the graphemes of an orthography and specifies how strings should be split into them.
Chapter 8. Discusses companion libraries in Python and R for working with orthography profiles. They’re designed to segment strings into the orthography’s graphemes and then map those graphemes to IPA. The chapter dives into the complexities of grapheme tokenization and how the libraries address them.
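I haven’t used the libraries, but the core idea of profile-based tokenization is roughly greedy longest-match over the profile’s graphemes; here’s a hand-rolled sketch with a made-up toy profile (mine, not the authors’ API):

    # Minimal sketch of orthography-profile-style tokenization (not the authors' library):
    # split a string into the profile's graphemes by longest match, then map each to IPA.
    profile = {        # hypothetical profile for a toy German-like orthography
        "sch": "ʃ", "ch": "x", "a": "a", "s": "s", "u": "u", "l": "l", "e": "ə",
    }

    def tokenize(text, profile):
        graphemes = sorted(profile, key=len, reverse=True)  # try longest graphemes first
        out, i = [], 0
        while i < len(text):
            for g in graphemes:
                if text.startswith(g, i):
                    out.append(g)
                    i += len(g)
                    break
            else:
                out.append(text[i])  # unknown character: pass it through unchanged
                i += 1
        return out

    segments = tokenize("schule", profile)
    print(segments)                                        # ['sch', 'u', 'l', 'e']
    print(" ".join(profile.get(g, g) for g in segments))   # ʃ u l ə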