Why finl? A manifesto

In 1994, LaTeX2e was released as a transitional step towards LaTeX 3. 26 years later, there still isn’t a 1.0 release of LaTeX 3. In the interim, we’ve seen the rise of HTML and the web, the dominance of PDF as a format for representation of printed material (and now there is a plan to have PDF extended with “liquid mode” that allows reflowing of PDF text for smaller screens).

In the meantime, the TeX engine has been extended multiple times, the little-used TeX-XeT, some early efforts to support large Asian character sets, and we have in widish use pdfTeX, XeTeX, LuaTeX along with an assortment of abandoned engines. Worst of all, it seems that none of pdfTeX, XeTeX or LuaTeX can serve as the one TeX to rule them all, each with some limitations that can require users to switch engines depending on their needs.

As I’ve thought about it, the problem at its root is TeX itself. It’s what would be referred to in contemporary software engineering parlance, as a tightly-coupled monolith. Worse still, it’s a tightly-coupled monolith with numerous compromises baked in because of the limitations of 1970s computing hardware. It seems that the vast majority of what work has been done with LaTeX 3 has been geared towards dealing with the limitations of TeX as a programming language.

On top of that, there’s been an explosion of questionable, if not outright harmful practices from the greater LaTeX community. Ideally, a document should be translated from one document class to another structurally similar class (naming-wise, the choice of “class” to name document classes is unfortunate, but understandable) should not require changing anything after the preamble, better still, nothing but the \documentclass command itself. All the appearance should be handled through the document class and packages should be employed to provide document structure enhancements or new capabilities). There are numerous violations of this. The memoir class is a mess, claiming to be a replacement for article, report and book (this reminds me of the mess that’s PHP where the same data structure acts as an array and an associative array and as a consequence manages to merge the worst aspects of both in one inefficient construct) and at the same time, providing a number of bits of functionality that belong in packages rather than the document class. On the flipside, packages like geometry and fancyhdr fall into a category that LaTeX2e doesn’t really define, bits of common code that would be helpful to document class writers but shouldn’t really be exposed to document authors.

So what can be done? I believe a complete rethink of LaTeX is necessary. The macro-based language of LaTeX is intellectually satisfying in many of the same ways that solving a sudoku puzzle or playing chess can be. It is not, however, a straightforward programming model and does not conform to any popular paradigm (I don’t think any other programming language is of similar structure). It’s time to separate out the language for formatting the document from the language for writing the document, to make a clear break between extensions that modify appearance from extensions that provide new functionality.

While we’re at it, a number of other TeX limitations need to be done away with. TeX was written with the needs of typesetting The Art of Computer Programming and so doesn’t handle complicated page layouts well and for all its valiant efforts, this has impacted LaTeX. Multi-column layouts are painful, float handling is a mess, some things that designers expect to be able to do, like specifying baseline-to-baseline skips or forcing text onto grid lines are difficult if not impossible.

Unicode needs to be a first-class citizen. There’s no reason in 2020 for a document writer to have to type \’a instead of á in a document. UTF-8 is the new 7-bit ASCII.

The code of the engine should be decoupled. Why shouldn’t any application that is willing to interface with the code be able to directly access the parser or the paragraph layout engine or the math layout engine? Imagine if a user of Word could access a plugin that let her type \sum_{i=0}^{\infty} \frac{1}{2^{i}} or even ∑_{i=0}^{∞} \frac{1}{2^{i}} or selecting these symbols via GUI into a low-level token stream to get

in her document. Imagine if a document could be painlessly transformed into HTML or even (unicode) plain text because the page layout and formatting engines are decoupled enough to manage such things.

Announcing finl_unicode 1.0.0

The first externally useful module of the finl project is now available on crates.io: finl_unicode. This library provides the Unicode handling that finl needs for its parsing module, in particular, it allows checking the character category of a char and getting the next grapheme from Peekable<CharIndices> or Peekable<Chars>. The category determination is roughly ten times faster than the equivalent code in the unicode_categories crate (which is also a bit out of date these days), and the grapheme segmentation runs roughy twice as fast as the unicode_segmentation crate (much to my surprise).

More details are in the readme. available at the crates.io link above.

Time spent digging through unicode categories

So, in the course of writing documentation for my Unicode Rust crate, I started digging into the depths of what’s in each category and what I’ve found is, well, it’s a mess. Some of the basic stuff, what I’m using in finl, is ok: letters are correctly identified and classed and marks are mostly ok (although it’s not clear why in some scripts derived from Brahmi, vowels are treated as spacing marks and in others they’re letters). Perhaps if I knew a bit more about the alphabets I could make sense of them.¹ Certainly, some assumptions that are made in Latin script (e.g., that we can do a decompose normalization and then strip out marks to get a simplified version of a word for use in, e.g., URLs) turn out not to make sense for non-Latin scripts (taking out the spacing and non-spacing marks from a word in Devanagari script would render the result as nonsensical as stripping vowels from Latin alphabet text).

In the punctuation realm, some of the decisions about what is opening, closing, initial and final punctuation are arbitrary and somewhat random. Format characters include a mix of printing and non-printing characters and some characters that have been repurposed for emoji. The curse of backwards compatibility, it seems, has forced Unicode to live with its early mistakes for eternity.

1. The only Brahmi-derived script I have familiarity with is Thai, which uses a mix of non-spacing marks and letters to indicate the text.

Two approaches to the two-stage table implementation, benchmark results

On my first pass on implementing a two-stage table for the Unicode functionality that I’m writing for finl, I had the first table consist of entries like this:

  Code(0x91),
Page(1),
Page(2),
...


Where an entry of Code indicated that everything in that 256-byte range had the same code, and an entry of Page was a pointer into a table of actual values. I figured this way, I would reduce the amount of memory that I was using.

But in the back of my mind, I kept thinking that this was a false economy. There wouldn’t be that many distinct values for Code so the memory shouldn’t be that big of a deal (even if it doubled the roughly 32K that I was reserving, in applications where everything is measured in megabytes, those tables were a rounding error¹). I rewrote my table-generation code to test this and ran some benchmarks.²

Not too surprisingly, eliminating having to look at a discriminant in the first-stage table improved things, but not consistently. For Latin-alphabet text, whether in English or Czech, I saw no change when looking to see if something was a letter, but a mysterious 30–35% improvement for the lowercase-only check³. The big difference came with processing Japanese text where the code ran roughly twice as fast. In that case, I suspect that the difference comes from the fact that for Japanese, almost every call would be returning a Code entry rather than a Page entry. On the surface, one might expect those cases to not show an improvement, after all, there isn’t a need to look at a second memory address. I think, however, that what’s happening here is a branch prediction failure: the CPU has decided that Page is the more likely result than Code and every time it’s wrong, it needs to back up.

The moral of the story:⁴ avoiding conditional branches in code is a good thing. Table lookups will usually be faster.

1. My earliest programming environments were severely memory constrained. I started on Apple ][+ machines which had 48K of RAM for program, data and display memory, later got to work with the luxurious 128K of the Apple //e, and then the relatively unconstrained but still constrained by modern standards memory of the IBM and VAX time share systems that were my primary computing environment for my college years. I didn’t go over 16G of RAM until I’d been programming for over a decade so it’s no surprise that concerns about memory usage and processing speed always haunt me.
2. Even shutting down everything on my Mac except for terminal left the benchmarks fairly noisy. Successive runs on the same code could show changes of as much as ±5–10%.
3. My only explanation for the latter is that there is some lingering cache available to the CPU for the second test over the first, although why this would only show up with this optimization is a mystery.
4. Which might not be correct

Latest news

Since it’s been a while since I’ve posted an update, current status: I’m working on the parser, but had a bit of a digression to create some Unicode support with the interface that I need for finl. The Unicode Segmentation crate is a nice piece of code, but it assumes that it is to be the primary interface for iterating over input and I, instead, want to be able to augment the CharIndices iterator that I’m using in the parser. So I had to create my own implementation of Unicode grapheme segmentation. I had initially (incorrectly) assumed I would also need to have to implement Unicode category support as well for this but it turned out I was wrong. Regardless, I did it anyway because I find this kind of coding fun. My implementation of Unicode category identification is significantly faster than the Unicode Categories crate (by a factor of ten according to my benchmarks) which is good because I call this on all command names so it will have a noticeable improvement in speed. My segmentation code on the other hand is running much slower than the Unicode segmentation crate, but it may be an artifact of the fact of how the two are implemented (I return a new String from a call on the CharIndices iterator while they, because they’re creating an iterator over *str, are able to return *str references and avoid the copy and allocate). As a result, to make sure that my algorithm is not flawed (it does at least pass all the tests), I’m also implementing the algorithm as an iterator on a *str for benchmarking purposes. I then need to do some documentation and I will put a 1.0 crate for finl_unicode out into the world. I’m torn between hoping that I won’t have to implement more Unicode support and hoping that I will. This stuff is fun to code.¹

1. If anyone wants to hire me to write Unicode support code, I’m listening.

Revised definition of commands/environments

From the readme for finl_parse, some thoughts on how commands and environments are defined.

Input text

Input text is assumed to be in NFD normalized format.

Commands

You indicate a command with \ followed by either a named command or a command symbol.

Named commands consist of one or more letters. Letters are defined to include all characters in the categories letter, non-spacing mark and combining mark. Inside user-defined macros only, _ can be considered as part of a named comand

A command symbol is a single non-letter. A single non-letter is defined to include symbols which may be represented by multiple code points but a single output grapheme, For example, the symbol 🇨🇦 is actually represented by the input sequence U+1F1E8, U+1F1E6. Since this is generally invisible to the user, it’s most ergonomic to handle these sequences as if they were a single code point.

A command may have zero or more parameters.

Parameters

Each parameter has a format and a type

Parameter formats

Parameters are one of the following formats:

• * an optional star indicating some alternate behavior for a command. (Corresponding to xparse s)
• optional arguments delimited with [] (corresponding to xparse o/O)
• required arguments consisting either of a single token or multiple tokens delimited with {} (corresponding to xparse m)
• required arguments with mandatory braces
• arbitrary required argument where either the delimiters are {} or will be delimited by the first non-space character after the command and the next occurrence of that character. This argument may not be broken across lines. (corresponding to xparse v)

Possible future enhancements:

• optional argument indicating style override delimited with <>
• something akin to the xparse r/R type but allowing arbitrary delimiters (for dealing with TeX-style argument handling)
• something akin to the xparse d/D type but allowing arbitrary delimiters to an optional argument (I probably won’t do this one)
• something akin to the xparse e/E embellishment mechanism.

Parameter types

Parameters can have one of the following types:

• parsed tokens. The passed argument will be parsed and passed to the command implementer as a list of tokens
• verbatim text. The passed argument will be treated as verbatim and passed to the command implementer without further processing. Depending on the delimiter style, it may require braces or brackets to be balanced within the argument, e.g., if an optional argument is treated as verbatim text, the user cannot do something like

\foo[ [ ] 

I’m undecided as to whether to allow any sort of workaround for this, like

\foo[{[}] 

or just not allow that use case at all.

• boolean values. For * arguments, this will simply pass true if present false if absent. Possibly a required or optional argument can have a boolean expression in them, although this might be something reserved for commands called within macro definitions.
• key-value lists. This will be a list of keys (which must conform to UAX31) followed by a value with = separating the keys and values. Leading and trailing spaces will be ignored and an outer set of braces will be stripped off so that, e.g.,

foo = { e } , bar = kappa , baz=3 

will set foo to e, bar to kappa and baz to 3. All values are treated as token lists and cannot contain undefined commands. If a command is found when searching for an identifier, it must be a user-defined macro and expand to valid key-value list data.

Trailing commas in the key-value list will be ignored.

• Macro definition token lists. This applies only to macro definitions. All spaces are ignored, ~ puts a space into the token list, and _ can appear in a named command. I don’t know if I’ll allow extensions and document types to create new commands with this as a parameter type.
• Math mode.

Possible future types:

• JSON. Perhaps with a proviso that allows dropping the enclosing braces on a JSON object, so, e.g.,

\foo{ this: true, that: 3.1 } 

rather than

\foo{ { this: true, that: 3.1 } } 

and perhaps also dropping brackets from an array so:

\foo{ 1, 2, 3, 4, 5 } 

rather than

\foo{ [1, 2, 3, 4, 5] } 

This will possibly also require a definition of the actual data types used in the JSON, or maybe not, I’m still not decided.

Environments

Environments work similarly to commands but have an additional declaration of the contents of the environment. A first parameter of * indicates that we’re also providing for a *-form of the environment.

I think that environment names must conform to UAX31 (so finl would be stricter than LaTeX on this front¹).

Environment bodies can have the following types:

• Parse token list
• Math
• Verbatim²
• YAML³
6-Nov-2021: Add required argument with mandatory braces as format.

1. This is perhaps not a big deal though—a quick grep on the LaTeX directory indicates that only a handful of existing packages use environment names that don’t conform to UAX31 with a couple having internal dashes and one package defining + versions of some math environments.
2. The finl version of the verbatim environment will be able to “see” the indentation of the \end{verbatim} command and strip that indentation from all lines automatically.
3. Worth remembering that YAML is a superset of JSON. The JSON shortcuts mentioned for arguments will not be supported in an environment body, but environment implementer are free to manage this themselves by taking input as verbatim and parsing it however they like.

At long last code

The first piece of code (something relatively trivial) for finl is now on crates.io: finl-charsub is the character substitution module for finl. It may have use outside of finl when fixed character sequence replacements are needed in a text stream.

Future enhancements: I’m pretty sure that there’s room to optimize the hash map implementation that’s currently in use for building the trie. Right now, I’m using FnvHashMap, but I have to believe that there’s a way, perhaps by eschewing Swiss Tables, to get still better performance for this use case. Since I’ll likely be implementing at least one more trie in the  code (if I remember correctly, the TeX hyphenation algorithm, which I plan to shamelessly steal also uses a trie internally), there will be a chance to return to this for future optimization of the code.

Building a trie in Rust—optimizations

A trie is a variation on a tree whose purpose is to show the result of a sequence of values, something that I’ll be using a lot in finl. For example, TeX has a number of input conventions implemented via ligatures (plus an active character) that I’ll be making optionally available through a character substitution mechanism. The trie for this looks like:

The implementation of this is to have each node in the trie consisting of a map of characters to nodes and an optional substitution text.

The obvious choice of what to reach for as a map is a hash map of some sort, but Rust makes things interesting in that the hashing function is easily configurable¹ so I decided to do some experimenting since the default hasher is heavier weight than I need and with the char sub algorithm, nearly every single character in the file will end up getting hashed at least once.² I don’t need to worry about hash DOS attacks or cryptographic strength in this situation so the default hasher SipHasher13 is overkill. The recommended alternative for short keys is FNVHasher from the fnv crate, so I set up a benchmark that runs a collection of Dickens stories in TeX format against the TeX character substitution table to run benchmarks. As expected, I saw an improvement of 24%. But I wondered if I could do better. In particular, since my keys are a simple type and guaranteed to be less than 24 bits. Perhaps even an identity hasher would make sense in this case.

And then writing that last sentence, I remembered, hash collisions. I once interviewed for Amazon³ and having heard they liked to ask a lot of data structure books, I read a data structures book the week before the interview, so I have a good idea of what happens below the hood with a hash map: Essentially, you’re creating an array of keys and pointers to your stored data and you take the hash value, mod it the size of the array, then look at the resulting point element in the array. If it’s empty, there’s no match. If it’s not, you then check the keys for equality. Here’s where things get tricky: If they’re equal, you’re done, but if they’re not, you need to then iterate through the array until you either get equality, reach an empty element in the array or finish the array.

And Rust’s implementation of this is… not great in this respect. For an application like my trie, I want to fail fast on lookup, but Rust will happily fill its entire capacity before resizing on insert. This is one place where Java wins: It allows users of HashMap to specify a load factor (0.55 by default) so that rather than resizing when the capacity is filled, it resizes once you have capacity × load factor elements inserted. Since I’m expecting most lookups in the trie to fail, it’s better to fail faster. I boosted the initial capacity of the map to 64 and the benchmark improved by 53% (one small optimization I made was to only specify an initial capacity for the root map of children. For this application, at least, once we’re starting down a trie, the possible matches are a smaller set and there’s less danger of there being a significant cost to false positives).

But one more thing. My keys are short and of constant size. Maybe I could get better performance with a simpler hashing function. I tried creating a simple null hasher that just used the value passed in:

impl Hasher for NullHasher {    fn finish(&self) -> u64 {        self.0}    fn write(&mut self, bytes: &[u8]) {        self.0 = bytes.iter()            .take(7)            .rfold(0u64, |acc, x| acc * 256 + *x as u64);    }    fn write_u32(&mut self, i: u32) {        self.0 = i as u64;    }}

and was surprised to discover that, in fact, things slowed down, apparently because of cleverness in the Swiss Tables algorithm that I don’t fully understand. So, without doing a full hash table algorithm using a simple mod trick, it looks like FNV + capacity is my best bet for performance at the moment.

1. Coming from Java this is a bit of a shock since Java’s HashMap can only use the hash() function of the key. If you want a custom hasher for a key, you need to create a custom class for the key with a new hash() function. But since common key types like String or Char are declared final, you can’t just subclass the type and override hash, you need to create a wrapper class instead.
2. One of the interesting challenges in writing the char sub algorithm was dealing with mappings like AaABCb where an input like ABD should get mapped to aD which meant that in that instance, we need to backtrack on the input and A and B end up getting hashed twice in that scenario.
3. They said no.

A current status on finl

This site got submitted to Hacker News (a lot sooner than I would have chosen), so a few notes on where things are and where they’re going.

I’m definitely writing in Rust. Much like I expect finl to remove pain points around LaTeX, Rust removes a lot of pain points around code that I would have otherwise written in C++. I’m currently focused on finishing up gftopdf, a replacement for gftodvi, as a tool to explore Rust, PDF creation and begin implementation of some sub-libraries of finl.

My roadmap for the immediate future:

1. gftopdf (along with finl-charsub)
2. finl-parse (be able to parse a finl document into an internal token list)
3. finl-math (math typesetting initial version)
4. finl-html (output a document to HTML)

finl is not LaTeX not in the same way that GNU is not Unix. The latter is, with a few idiosyncrasies, a drop-in replacement for Unix tools. finl is not attempting to replicate TeX and LaTeX but to reimagine and replace it. It’s more like Java vs C++. The document markup language is going to be almost identical to LaTeX (although the tabular environment might well be very different). But there will not be a TeX-style macro language. There will be non-Turing complete macros available to document authors, but any more sophisticated handling will be done through extensions written in Python.

Document structure and style will be separate concepts. Style will be declarative. I’m leaning towards YAML for this, but I’ve not made any firm commitments in my mind on this.

Extension loading will be designed to work with artifact repositories and will have three-part naming similar to dependencies in mvn/groovy.

A document preamble might look like the following (everything is subject to change):

\DocumentType{article}
\DocumentStyle{IEEE:transactions:1.0.0}

\LoadExtension{ams:commutative-diagrams:1.2.3}[\cd -> \CD, *]

\RenewDocumentCommand{\le}{\leqslant}


This shows specifying the document style, loading an extension (with a command name mapped to a different name for the document) and a user-level macro to have \le typeset ⩽ instead of ≤ . Where a namespace is not specified (like in the \DocumentType command), the default namespace of finl will be assumed. Likewise, a missing version will be assumed to be the special version of LATEST. I’m thinking that the first time a document is processed, a file foo.lock will be created which will have the versions applied for any unspecified (or range-based) versions. Extensions are allowed to have their own dependencies. I think (but I can’t commit to it), that it should be possible to have extensionA require version 2 of extensionC and extensionB require version 3 of extensionC without conflict.

Oh, one more thing: Right now, some formatting on the blog is off—in particular, you cannot click on an article title to get to the permalink for the article, and comments can only be accessed from the article’s page which is not readily accessible. I welcome comments at my email of don.hosek@gmail.com

First impressions on Rust

I really like the expressiveness of the language. I’m writing code to read GF files from Metafont right now and there’s the Knuthian 1–4 byte parameter reading. This ends up being rather elegant in Rust, particularly the 3-byte version:

pub fn read3<R: Read>(input: &mut R) -> io::Result<i32> {    let mut buf = [0u8; 4];    input.read(&mut buf[1..4])?;    Ok(i32::from_be_bytes(buf))}

This initializes a 4-byte buffer with zeros, fills the last three bytes of that buffer with the next three bytes of the input file and then translates the 4-byte buffer into an integer with big-endian semantics. When I figured out how to do this, I got happy endorphins rushing through my brain. It’s just cleaner than the C++ equivalent

std::int_fast32_t read3() {    stream->read(buffer, 3);    return static_cast<uint8_t>(buffer[0]) << 16      | static_cast<uint8_t>(buffer[1]) << 8      | static_cast<uint8_t>(buffer[2]);}

where I had to assemble the data manually (I didn’t need to declare or initialize the buffer here since in my implementation it was a field in the enclosing class).

I’m reading three Rust books simultaneously as I learn. My thoughts on the books: The Rust Programming Language by  Steve Klabnik and Carol Nichols is superb. I like reading paper and I got it from the library, but it’s also freely available online (and updated).

Beginning Rust: From Novice to Professional by Carlo Milanesi seems a bit of a hot mess. It introduces things earlier than necessary, especially given its avowed audience. I could have predicted that though, from the publisher. I have yet to see a good book from Apress.

I’ve only just started Programming Rust: Fast, Safe Systems Development by Jim Blandy and Jason Overdorff so I have no opinion on that one yet.

I’m a bit disappointed with the Rust plugin for CLion. I’ve gotten spoiled by Jetbrains’s support for Java and C++ in their IDEs where the IDE provide a lot of input-time checking and suggesting for me. It made getting back into C++ a lot easier. With Rust, numerous errors don’t actually show up until a proper compilation is run, and the suggestions for resolving issues aren’t always present or useful, particularly around borrowing. IDEs have made me fat and lazy and now I have to work. I don’t like that.

I’ve had questions along the way and it seems like the two online support communities I’ve found aren’t all that great. Neither reddit nor Stack Overflow have seemed adequate. In many instances, the suggestions are completely wrong, but the language and documentation are such that I’ve managed to find my way despite this.

Stupid Mac tricks

I recently switched from Dropbox to sync for my synchronized cloud folder needs. Along the way, I moved all the files that were in ~/Dropbox to ~/Sync  I had a terminal window open with a shell cd’d to ~/DropBox/PreppyLion for the book I’m working on. I expected to get an error about that directory being in use and I couldn’t move the directory along with the other files. No error. Then I look in the terminal window and type in pwd  It showed ~/Sync/PreppyLion  It’s little things like this that always make me happy with modern computing platforms. I don’t know if this would work on Linux or Windows, but I’m happy it does on MacOS.