Time spent digging through Unicode categories

August 8, 2022 by Don Hosek

So, in the course of writing documentation for my Unicode Rust crate, I started digging into the depths of what’s in each general category, and what I’ve found is, well, a mess. Some of the basic stuff, which is what I’m using in finl, is OK: letters are correctly identified and classed, and marks are mostly OK (although it’s not clear why vowels are treated as spacing marks in some scripts derived from Brahmi and as letters in others). Perhaps if I knew a bit more about these scripts I could make sense of the choices.¹ Certainly, some assumptions that hold for Latin script turn out not to make sense elsewhere. For example, with Latin text we can apply a canonical decomposition and then strip out the marks to get a simplified version of a word for use in, e.g., URLs. Taking the spacing and non-spacing marks out of a word in Devanagari script would leave a result as nonsensical as stripping the vowels from Latin-alphabet text.
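As a rough illustration of the decompose-and-strip approach, here is a minimal sketch in Python using the standard-library `unicodedata` module (not my Rust crate). The function name `strip_marks` is made up for this example, and the categories it sees come from whatever Unicode version ships with the Python interpreter.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Apply canonical decomposition (NFD), then drop every character
    whose general category is a mark (Mn, Mc, or Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


# Latin: the accent decomposes into a combining mark and is removed.
print(strip_marks("caf\u00e9"))

# Devanagari: the word "Hindi" (U+0939 U+093F U+0928 U+094D U+0926 U+0940).
# The vowel signs (Mc) and the virama (Mn) are all removed, leaving only
# the three bare consonants.
print(strip_marks("\u0939\u093f\u0928\u094d\u0926\u0940"))
```

The Latin word survives as a readable slug, while the Devanagari word loses both of its vowel signs and the virama that joins two of its consonants.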

In the punctuation realm, some of the decisions about what counts as opening, closing, initial and final punctuation look arbitrary. The format category mixes printing and non-printing characters, along with some characters that have been repurposed for emoji. The curse of backwards compatibility, it seems, has forced Unicode to live with its early mistakes for eternity.
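To see the kind of thing I mean, here is a small Python sketch that looks up the general category of a handful of characters with the standard-library `unicodedata` module. The sample list is my own selection for illustration, and `categories` is a hypothetical helper name.

```python
import unicodedata


def categories(chars):
    """Map each character to its Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}


samples = [
    "\u201c",      # LEFT DOUBLE QUOTATION MARK
    "\u201d",      # RIGHT DOUBLE QUOTATION MARK
    "\u201e",      # DOUBLE LOW-9 QUOTATION MARK
    "\u00ab",      # LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00bb",      # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00ad",      # SOFT HYPHEN
    "\u0600",      # ARABIC NUMBER SIGN
    "\U000E0067",  # TAG LATIN SMALL LETTER G
]

for ch, cat in categories(samples).items():
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X}  {cat}  {name}")
```

The quotation marks land in three different categories: initial (Pi), final (Pf), and, for the low-9 mark, open (Ps). The last three characters are all format characters (Cf), even though one is an invisible hyphenation hint, one is a visible sign, and one is a tag character of the sort now used in emoji flag sequences.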

¹ The only Brahmi-derived script I’m familiar with is Thai, which uses a mix of non-spacing marks and letters to write its vowels.
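A quick way to see the Thai mix is to print the general category of each character in a couple of short words. This is a minimal sketch using Python's standard-library `unicodedata` module; the helper name `describe` and the two sample words are my own choices for illustration.

```python
import unicodedata


def describe(word):
    """Return (code point, general category) for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in word]


# KO KAI + SARA I + NO NU: the vowel SARA I is a non-spacing mark.
print(describe("\u0e01\u0e34\u0e19"))

# KO KAI + SARA AA: the vowel SARA AA is classed as a letter.
print(describe("\u0e01\u0e32"))
```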

