Time spent digging through Unicode categories
While writing documentation for my Unicode Rust crate, I started digging into what’s actually in each category, and what I’ve found is, well, a mess. The basic stuff that I’m using in finl is OK: letters are correctly identified and classed, and marks are mostly OK (although it’s not clear why vowels are treated as spacing marks in some scripts derived from Brahmi and as letters in others). Perhaps if I knew a bit more about these scripts I could make sense of it.¹ Certainly, some assumptions that hold for Latin script turn out not to make sense elsewhere. For example, in Latin script we can apply a decomposing normalization and then strip out the marks to get a simplified version of a word for use in, e.g., URLs. Taking the spacing and non-spacing marks out of a word in Devanagari script would leave a result as nonsensical as stripping the vowels from Latin-alphabet text.
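To make the decompose-and-strip point concrete, here is a minimal sketch. It uses Python's standard `unicodedata` module rather than the Rust crate discussed here, purely so the example is self-contained. The helper name `strip_marks` and the sample words are assumptions chosen for illustration.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Apply NFD, then drop every character in a Mark category (Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


if __name__ == "__main__":
    # Latin: accents decompose into combining marks, which are then removed.
    print(strip_marks("caf\u00e9 r\u00e9sum\u00e9"))  # cafe resume

    # Devanagari: the word "Hindi" is letter + vowel sign + letter + virama
    # + letter + vowel sign. Stripping marks leaves three bare consonants.
    hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"
    print(strip_marks(hindi))
```

For the Latin input the result is still a recognizable word. For the Devanagari input both vowel signs and the virama are classed as marks, so only the three consonant letters survive.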
In the punctuation realm, some of the decisions about what counts as opening, closing, initial, and final punctuation seem arbitrary. Format characters include a mix of printing and non-printing characters, plus some characters that have been repurposed for emoji. The curse of backwards compatibility, it seems, has forced Unicode to live with its early mistakes for eternity.
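A quick way to inspect these punctuation classes is to look up the general category of a few quotation marks and brackets. The sketch below again uses Python's standard `unicodedata` module. The helper name `categories` and the choice of sample characters are assumptions for illustration, and they are not necessarily the cases this post has in mind.

```python
import unicodedata


def categories(chars: str) -> dict[str, str]:
    """Map each character to its two-letter Unicode general category."""
    return {ch: unicodedata.category(ch) for ch in chars}


# Brackets, guillemets, and curly/low quotation marks.
SAMPLES = "()\u00ab\u00bb\u201c\u201d\u201e"

if __name__ == "__main__":
    for ch, cat in categories(SAMPLES).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cat}")
```

Parentheses come back as Ps/Pe (open/close) and the guillemets and curly double quotes as Pi/Pf (initial/final), while the low double quotation mark U+201E is classed as Ps rather than Pi, even though it serves as an opening quote in several languages.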
¹ The only Brahmi-derived script I’m familiar with is Thai, which uses a mix of non-spacing marks and letters to represent text.