Why finl? A manifesto

In 1994, LaTeX2e was released as a transitional step towards LaTeX 3. 26 years later, there still isn’t a 1.0 release of LaTeX 3. In the interim, we’ve seen the rise of HTML and the web, the dominance of PDF as a format for representation of printed material (and now there is a plan to have PDF extended with “liquid mode” that allows reflowing of PDF text for smaller screens).

In the meantime, the TeX engine has been extended multiple times: the little-used TeX-XeT, some early efforts to support large Asian character sets, and now pdfTeX, XeTeX and LuaTeX in fairly wide use, along with an assortment of abandoned engines. Worst of all, none of pdfTeX, XeTeX or LuaTeX can serve as the one TeX to rule them all; each has limitations that can require users to switch engines depending on their needs.

As I’ve thought about it, the problem at its root is TeX itself. It’s what would be referred to, in contemporary software engineering parlance, as a tightly coupled monolith. Worse still, it’s a tightly coupled monolith with numerous compromises baked in because of the limitations of 1970s computing hardware. It seems that the vast majority of the work that has been done on LaTeX 3 has been geared towards dealing with the limitations of TeX as a programming language.

On top of that, there’s been an explosion of questionable, if not outright harmful, practices from the greater LaTeX community. Ideally, translating a document from one document class to another structurally similar class (naming-wise, the choice of “class” for document classes is unfortunate, but understandable) should not require changing anything after the preamble; better still, nothing but the \documentclass command itself. All of the appearance should be handled through the document class, and packages should be employed to provide document structure enhancements or new capabilities. There are numerous violations of this. The memoir class is a mess, claiming to be a replacement for article, report and book (this reminds me of the mess that is PHP, where the same data structure acts as both an array and an associative array and, as a consequence, manages to merge the worst aspects of both in one inefficient construct) while at the same time providing a number of bits of functionality that belong in packages rather than in the document class. On the flip side, packages like geometry and fancyhdr fall into a category that LaTeX2e doesn’t really define: bits of common code that would be helpful to document class writers but shouldn’t really be exposed to document authors.

So what can be done? I believe a complete rethink of LaTeX is necessary. The macro-based language of LaTeX is intellectually satisfying in many of the same ways that solving a sudoku puzzle or playing chess can be. It is not, however, a straightforward programming model and does not conform to any popular paradigm (I don’t think any other programming language is of similar structure). It’s time to separate out the language for formatting the document from the language for writing the document, to make a clear break between extensions that modify appearance from extensions that provide new functionality.

While we’re at it, a number of other TeX limitations need to be done away with. TeX was written around the needs of typesetting The Art of Computer Programming and so doesn’t handle complicated page layouts well, and for all its valiant efforts, LaTeX has inherited that limitation. Multi-column layouts are painful, float handling is a mess, and some things that designers expect to be able to do, like specifying baseline-to-baseline skips or forcing text onto grid lines, are difficult if not impossible.

Unicode needs to be a first-class citizen. There’s no reason in 2020 for a document writer to have to type \'a instead of á in a document. UTF-8 is the new 7-bit ASCII.

The code of the engine should be decoupled. Why shouldn’t any application that is willing to interface with the code be able to directly access the parser or the paragraph layout engine or the math layout engine? Imagine if a user of Word could access a plugin that let her type \sum_{i=0}^{\infty} \frac{1}{2^{i}} or even ∑_{i=0}^{∞} \frac{1}{2^{i}}, or select these symbols via a GUI, and have that converted into a low-level token stream to get

sum from 0 to infinity of 1/2^i

in her document. Imagine if a document could be painlessly transformed into HTML or even (unicode) plain text because the page layout and formatting engines are decoupled enough to manage such things.

At long last, code

The first piece of code (something relatively trivial) for finl is now on crates.io: finl-charsub is the character substitution module for finl. It may have use outside of finl when fixed character sequence replacements are needed in a text stream.

Future enhancements: I’m pretty sure that there’s room to optimize the hash map implementation that’s currently in use for building the trie. Right now, I’m using FnvHashMap, but I have to believe that there’s a way, perhaps by eschewing Swiss Tables, to get still better performance for this use case. Since I’ll likely be implementing at least one more trie in the code (if I remember correctly, the TeX hyphenation algorithm, which I plan to shamelessly steal, also uses a trie internally), there will be a chance to return to this for future optimization of the code.

Building a trie in Rust—optimizations

A trie is a variation on a tree that maps sequences of values to results, something that I’ll be using a lot in finl. For example, TeX has a number of input conventions implemented via ligatures (plus an active character) that I’ll be making optionally available through a character substitution mechanism. The trie for this looks like:

[Figure: trie example for the TeX input conventions]

The implementation has each node in the trie consist of a map from characters to child nodes plus an optional substitution text.
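In Rust terms, a node might look something like this (a minimal sketch; the names are mine, not necessarily what finl-charsub actually uses):

use std::collections::HashMap;

// Sketch of a trie node: each node maps the next input character to a
// child node and optionally carries a substitution that applies when a
// match ends at this node.
struct CharSubNode {
    children: HashMap<char, CharSubNode>,
    substitution: Option<String>,
}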

The obvious choice of what to reach for as a map is a hash map of some sort, but Rust makes things interesting in that the hashing function is easily configurable,¹ so I decided to do some experimenting: the default hasher is heavier weight than I need, and with the char sub algorithm, nearly every single character in the file will end up getting hashed at least once.² I don’t need to worry about hash DoS attacks or cryptographic strength in this situation, so the default hasher, SipHasher13, is overkill. The recommended alternative for short keys is FnvHasher from the fnv crate, so I set up a benchmark that runs a collection of Dickens stories in TeX format against the TeX character substitution table. As expected, I saw an improvement of 24%. But I wondered if I could do better, particularly since my keys are a simple type and guaranteed to be less than 24 bits. Perhaps even an identity hasher would make sense in this case.

And then, writing that last sentence, I remembered: hash collisions. I once interviewed at Amazon³ and, having heard they liked to ask a lot of data structure questions, I read a data structures book the week before the interview, so I have a good idea of what happens under the hood with a hash map. Essentially, you’re creating an array of keys and pointers to your stored data; you take the hash value, mod it by the size of the array, then look at the resulting element of the array. If it’s empty, there’s no match. If it’s not, you then check the keys for equality. Here’s where things get tricky: if they’re equal, you’re done, but if they’re not, you need to iterate through the array until you either get equality, reach an empty element, or exhaust the array.
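As a toy sketch of that lookup (open addressing with linear probing; not Rust’s actual implementation):

// Toy sketch: hash the key, mod by the table size, then probe forward
// until we find the key or hit an empty slot.
fn probe_lookup(table: &[Option<(char, u32)>], key: char, hash: u64) -> Option<u32> {
    if table.is_empty() {
        return None;
    }
    let mut idx = (hash as usize) % table.len();
    for _ in 0..table.len() {
        match table[idx] {
            None => return None,                        // empty slot: no match
            Some((k, v)) if k == key => return Some(v), // keys equal: found it
            _ => idx = (idx + 1) % table.len(),         // collision: keep probing
        }
    }
    None // wrapped around the whole table without finding the key
}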

And Rust’s implementation of this is… not great in this respect. For an application like my trie, I want to fail fast on lookup, but Rust will happily fill its entire capacity before resizing on insert. This is one place where Java wins: it allows users of HashMap to specify a load factor (0.75 by default) so that rather than resizing when the capacity is filled, it resizes once you have capacity × load factor elements inserted. Since I’m expecting most lookups in the trie to fail, it’s better to fail faster. I boosted the initial capacity of the map to 64 and the benchmark improved by 53%. (One small optimization I made was to only specify an initial capacity for the root map of children. For this application, at least, once we’re starting down a trie, the possible matches are a smaller set and there’s less danger of a significant cost from false positives.)
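Concretely, the root map ends up being constructed along these lines (again a sketch, reusing the CharSubNode sketch above; the real finl-charsub code may differ):

use fnv::FnvHashMap;

// Sketch: FnvHashMap is just std's HashMap with an FNV BuildHasher.
// Giving the root map a generous initial capacity means that most failed
// lookups land on an empty slot immediately; child maps stay small.
fn new_root() -> FnvHashMap<char, CharSubNode> {
    FnvHashMap::with_capacity_and_hasher(64, Default::default())
}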

But one more thing. My keys are short and of constant size. Maybe I could get better performance with a simpler hashing function. I tried creating a simple null hasher that just used the value passed in:

use std::hash::Hasher;

// The hasher state is just the 64-bit value it was fed.
// Default lets BuildHasherDefault create fresh instances.
#[derive(Default)]
struct NullHasher(u64);

impl Hasher for NullHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    // Fallback for arbitrary byte slices: fold up to 7 bytes into a u64.
    fn write(&mut self, bytes: &[u8]) {
        self.0 = bytes.iter()
            .take(7)
            .rfold(0u64, |acc, x| acc * 256 + *x as u64);
    }

    // char keys hash via write_u32, so just pass the value through.
    fn write_u32(&mut self, i: u32) {
        self.0 = i as u64;
    }
}

and was surprised to discover that, in fact, things slowed down, apparently because of cleverness in the Swiss Tables algorithm that I don’t fully understand. So, short of writing a full hash table implementation of my own using a simple mod trick, it looks like FNV plus a larger initial capacity is my best bet for performance at the moment.
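(For reference, plugging a hasher like that into a map goes through a BuildHasher; with the derived Default above, it is roughly the following sketch.)

use std::collections::HashMap;
use std::hash::BuildHasherDefault;

// Sketch: a map type that builds NullHasher instances instead of the
// default SipHash hasher.
type NullHashMap<K, V> = HashMap<K, V, BuildHasherDefault<NullHasher>>;

fn new_null_map() -> NullHashMap<char, u32> {
    NullHashMap::default()
}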


  1. Coming from Java, this is a bit of a shock since Java’s HashMap can only use the hashCode() method of the key. If you want a custom hash for a key, you need to create a custom class for the key with a new hashCode() implementation. But since common key types like String or Character are declared final, you can’t just subclass the type and override hashCode(); you need to create a wrapper class instead.
  2. One of the interesting challenges in writing the char sub algorithm was dealing with mappings like A → a and ABC → b, where an input like ABD should get mapped to aD, which meant that in that instance we need to backtrack on the input, and A and B end up getting hashed twice in that scenario.
  3. They said no.

A current status on finl

This site got submitted to Hacker News (a lot sooner than I would have chosen), so a few notes on where things are and where they’re going.

I’m definitely writing in Rust. Much like I expect finl to remove pain points around LaTeX, Rust removes a lot of pain points around code that I would otherwise have written in C++. I’m currently focused on finishing up gftopdf, a replacement for gftodvi, as a tool to explore Rust and PDF creation and to begin implementing some sub-libraries of finl.

My roadmap for the immediate future:

  1. gftopdf (along with finl-charsub)
  2. finl-parse (be able to parse a finl document into an internal token list)
  3. finl-math (math typesetting initial version)
  4. finl-html (output a document to HTML)

finl is not LaTeX, but not in the same way that GNU is not Unix. The latter is, with a few idiosyncrasies, a drop-in replacement for Unix tools. finl is not attempting to replicate TeX and LaTeX but to reimagine and replace them; it’s more like Java vs C++. The document markup language is going to be almost identical to LaTeX (although the tabular environment might well be very different), but there will not be a TeX-style macro language. There will be non-Turing-complete macros available to document authors, but any more sophisticated handling will be done through extensions written in Python.

Document structure and style will be separate concepts. Style will be declarative. I’m leaning towards YAML for this, but I’ve not made any firm commitments in my mind on this.

Extension loading will be designed to work with artifact repositories and will have three-part naming similar to dependencies in Maven/Gradle.

A document preamble might look like the following (everything is subject to change):

\DocumentType{article}
\DocumentStyle{IEEE:transactions:1.0.0}

\LoadExtension{ams:commutative-diagrams:1.2.3}[\cd -> \CD, *]

\RenewDocumentCommand{\le}{\leqslant}

This shows specifying the document style, loading an extension (with a command name mapped to a different name for the document) and a user-level macro to have \le typeset ⩽ instead of ≤. Where a namespace is not specified (as in the \DocumentType command), the default namespace of finl will be assumed. Likewise, a missing version will be assumed to be the special version LATEST. I’m thinking that the first time a document is processed, a file foo.lock will be created which will record the versions applied for any unspecified (or range-based) versions. Extensions are allowed to have their own dependencies. I think (but I can’t commit to it) that it should be possible to have extensionA require version 2 of extensionC and extensionB require version 3 of extensionC without conflict.

Oh, one more thing: right now, some formatting on the blog is off; in particular, you cannot click on an article title to get to the permalink for the article, and comments can only be accessed from the article’s page, which is not readily accessible. I welcome comments by email at don.hosek@gmail.com.

First impressions on Rust

I really like the expressiveness of the language. I’m writing code to read GF files from Metafont right now and there’s the Knuthian 1–4 byte parameter reading. This ends up being rather elegant in Rust, particularly the 3-byte version:

use std::io::{self, Read};

pub fn read3<R: Read>(input: &mut R) -> io::Result<i32> {
    let mut buf = [0u8; 4];
    // read_exact guarantees all three bytes are filled (or returns an error).
    input.read_exact(&mut buf[1..4])?;
    Ok(i32::from_be_bytes(buf))
}

This initializes a 4-byte buffer with zeros, fills the last three bytes of that buffer with the next three bytes of the input file and then translates the 4-byte buffer into an integer with big-endian semantics. When I figured out how to do this, I got happy endorphins rushing through my brain. It’s just cleaner than the C++ equivalent 

std::int_fast32_t read3() {
    stream->read(buffer, 3);
    return static_cast<uint8_t>(buffer[0]) << 16
         | static_cast<uint8_t>(buffer[1]) << 8
         | static_cast<uint8_t>(buffer[2]);
}

where I had to assemble the data manually (I didn’t need to declare or initialize the buffer here since in my implementation it was a field in the enclosing class). 

I’m reading three Rust books simultaneously as I learn. My thoughts on the books: The Rust Programming Language by Steve Klabnik and Carol Nichols is superb. I like reading on paper and I got it from the library, but it’s also freely available online (and updated).

Beginning Rust: From Novice to Professional by Carlo Milanesi seems a bit of a hot mess. It introduces things earlier than necessary, especially given its avowed audience. I could have predicted that though, from the publisher. I have yet to see a good book from Apress.

I’ve only just started Programming Rust: Fast, Safe Systems Development by Jim Blandy and Jason Orendorff, so I have no opinion on that one yet.

I’m a bit disappointed with the Rust plugin for CLion. I’ve gotten spoiled by JetBrains’s support for Java and C++ in their IDEs, where the IDE provides a lot of input-time checking and suggestions for me. It made getting back into C++ a lot easier. With Rust, numerous errors don’t actually show up until a proper compilation is run, and the suggestions for resolving issues aren’t always present or useful, particularly around borrowing. IDEs have made me fat and lazy and now I have to work. I don’t like that.

I’ve had questions along the way, and it seems like the two online support communities I’ve found aren’t all that great. Neither reddit nor Stack Overflow has seemed adequate. In many instances, the suggestions are completely wrong, but the language and documentation are such that I’ve managed to find my way despite this.

Stupid Mac tricks

I recently switched from Dropbox to Sync for my synchronized cloud folder needs. Along the way, I moved all the files that were in ~/Dropbox to ~/Sync. I had a terminal window open with a shell cd’d to ~/Dropbox/PreppyLion for the book I’m working on. I expected to get an error about that directory being in use and that I wouldn’t be able to move it along with the other files. No error. Then I looked in the terminal window and typed pwd. It showed ~/Sync/PreppyLion. It’s little things like this that always make me happy with modern computing platforms. I don’t know if this would work on Linux or Windows, but I’m happy it does on macOS.

Page references

I’ve been thinking about page references lately and how they translate to non-print presentations. At the moment, I’m reading The Rust Programming Language by Steve Klabnik and Carol Nichols and I noticed that there was a page reference when they wrote:

We’ll discuss this concept in detail in “Variables and Mutability” on page 32.

This, in itself, is not notable; what is notable is that there is a freely available version of the book on the web, which gives a chance to see how this got translated in practice. There, the sentence is

We’ll be discussing this concept in detail in the “Variables and Mutability” section in Chapter 3.

(hyperlink preserved in my copy of the sentence). Two things: this is clearly a manual rewrite of the sentence (with a change to the tense of the verb from simple future to future progressive), and it avoids the page reference entirely.

A bit disappointing, but I am curious about what to do with things like the equivalent of a LaTeX \pageref if the document is translated for a web or ebook presentation. I had hoped to get something usable as a model from this, but this wasn’t it. Ideally, I’d like to be able to automatically translate something like page~\pageref{foo} into either “page 32” for a print/PDF presentation or some alternate text for the HTML/ebook presentation, but I have no idea what that alternate text should be.

Looking at rust is already beneficial

In addition to programming, I’m also a minor writer of fiction. One thing that I often do while working on a novel is write a short story to develop some aspect of craft. For example, part of the motivation for writing my story “Saint Jude’s Medallion” was to work out how to tell a story through a conversation between two characters so that I could employ what I learned in the novel I was writing at the time (which served as my MFA thesis but has since been trunked). In an analogous mode, finl functions for me as a novel, and along the way I’ll end up writing some “short story” programs to try things out and learn how to solve problems in a way that will be useful for writing finl.

The first of these “short story” programs is a mostly useless program, gftopdf, which will serve as a drop-in replacement for Knuth’s gftodvi. I have little need for this myself. It’s been decades since I wrote Metafont code seriously. But as Knuth mentions in gftodvi.web, gftodvi is a mini-TeX, and similarly, I could see using gftopdf as a mini-finl.

My gf reading code was my chance to finally implement some ideas I had played with around 1995, using function pointers as a means of simplifying the reading of a Knuthian byte file (I was thinking in terms of dvi back then, but gf is similar in spirit if not detail). I ended up creating a complicated structure using functors to handle each of the thirteen possible commands spread over 240+ opcodes, along with the four possible argument readers (1–4 bytes of signed/unsigned data). Then, while writing a post to ask about how to implement this idiomatically in rust, I realized that all the functor infrastructure was completely unnecessary: I could instead implement the functors directly as method calls, since I was building an array of lambdas anyway¹. The rust version of the code is going to be far simpler than the C++ version and only part of that (if any of it) will be due to rust. As an added bonus, I had puzzled over how the original code could be made extensible for use by other programs, and using virtual functions in place of functors means that my gf reading class can be subclassed to enable variations on the existing functionality in the unlikely event that anyone ever cares to reuse the code.

¹ The original idea was that it was going to be an array of Opcode functor/Argument functor pairs that would be composed during the file reading loop. A combination of scoping rules and C++ syntax made that less appealing once I wrote that loop and I ended up abandoning that original plan and going with an array of lambdas instead.
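In Rust, the natural stand-in for those virtual functions is a trait with overridable default methods; something like the following sketch (names invented for illustration, not the actual gftopdf code):

use std::io::{self, Read};

// Hypothetical sketch: a handler trait whose methods default to doing
// nothing, so a consumer only overrides the commands it cares about.
trait GfHandler {
    // A few representative commands; signatures simplified for the sketch.
    fn paint(&mut self, _run_length: u32) {}
    fn end_of_character(&mut self) {}
}

// The reading loop decodes each opcode, reads its arguments, and calls
// the matching handler method.
fn read_gf<R: Read, H: GfHandler>(input: &mut R, handler: &mut H) -> io::Result<()> {
    let mut byte = [0u8; 1];
    while input.read(&mut byte)? == 1 {
        match byte[0] {
            0..=63 => handler.paint(byte[0] as u32), // paint_0 .. paint_63
            _ => { /* decode the remaining opcodes and their 1-4 byte arguments */ }
        }
    }
    Ok(())
}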

Choosing a programming language revisited

A comment from someone on Hacker News who happened to stumble across this site inspired me to give another look at rust as a programming language. Having spent a few months digging into C++, my initial thoughts are that the pros are:

  • ICU for Unicode support
  • harfbuzz for text rendering
  • within the limits of compiler support for bleeding edge features, code should be widely portable by sticking to standard C++

and the cons:

  • library support is painful. I’ve been spoiled writing for the JVM over the last 13 years: I can add a library to a Java project by just adding a dependency to my pom.xml and I’m done. Since there is no real standard for C++ object/library files, adding a library often means downloading and compiling from source. It’s not uncommon to deal with dependencies by creating a git submodule, and if you’re lucky, it will have a CMake configuration that allows it to be used as a subproject.
  • Lack of a standard project structure means that there’s a lot of mental labor involved in just figuring out where to put files.
  • PDF creation seems to have only a couple of options, and they seem not to be actively maintained. libharu has a post on its homepage asking for a volunteer maintainer to take over, and the mailing list has at least one volunteer whose offer has been met with silence. And libharu’s CMake configuration is a bit broken, so it requires (a) some unmerged changes to build as a subproject and (b) some non-standard declarations in the master project to integrate it.
  • C++20 is still not quite all here yet. I had hoped to use some features, including modules and ranges, in my coding, but it appears that these are not yet supported by clang. I have an irrational resistance to gcc and a rational resistance to msvc, but neither of those compilers fully supports C++20 either. Again, the Java world has spoiled me. It’s only inertia that keeps me from setting my compile level higher than Java 11 on stuff I’m working on right now.

So what about rust? Some digging reveals that people have used it as a development language for iOS. It’s still in its infancy, but it should be much better by the time I need to care. Since rust is really a single development system rather than an undefined galaxy of compilers implementing a common standard, as with C++, it’s able to wrap up library dependencies using the crate mechanism. ICU and harfbuzz both have rust crates, and there’s an actively maintained PDF creation crate.

And as an added bonus, I can use the rust plugin for IntelliJ (or CLion, although I may drop my use of that if rust works out).

The big negative is that rust is a completely new language to me, but then again, C++ has changed enough since I was actively using it in the 90s that it might as well be one too.

The things we forget

Looking at a post on tex.stackexchange.com, I found myself digging into man pages to find out about some (comparatively) new options on the modern TeX engines and not remembering what virTeX was (it’s TeX with no format preloaded). Along the way to refreshing my memory, I stumbled upon this post from thirty years ago (almost exactly, in fact), which reveals (a) that once upon a time this was part of my basic knowledge and (b) that, as I had recalled, during my maintenance of the VAX/VMS port of TeX I did, in fact, create the first TeX executable with iniTeX built into the same executable as TeX and accessed via a command-line switch.

Character substitutions in text

TeX handles some character sequence substitutions by (ab)using the ligature mechanism, e.g., ``→“. This works reasonably well for Computer Modern, which defines these in its ligature table, but falls apart once we start trying to use non-TeX fonts. Furthermore, there’s the added complication that most fonts put the characters ' and ` in character positions 39 and 96, while TeX is expecting the characters in those positions to typeset ’ and ‘.


I’m thinking that a solution to this would be to have a character sequence substitution that’s run-time configurable as part of the text-input pipeline. This would happen after commands have been interpreted but before the text stream is processed for ligatures and line breaks. The standard route would be to import a tab-delimited table of input and output sequences of Unicode characters. The standard TeX input would look like:


`	‘
``	“
'	’
''	”
--	–
---	—
!`	¡
?`	¿
~	\u00a0


Note that we no longer have an active character concept to allow using ~ for non-breaking spaces. Also, the timing of when the substitutions take place means that we cannot use this mechanism to insert commands into the input stream: a mapping like TeX → \TeX will not typeset the TeX logo but will instead typeset the sequence \TeX, including the backslash (actually, given the use of \ to open a Unicode hex sequence, it might produce something like a tab followed by eX, depending on what other escape sequences are employed).


Other TeX conventions, like the AMS Cyrillic transliteration where ligatures are used to map sequences like yu→ю, can easily be managed. Similarly, Silvio Levy’s ASCII input scheme for polytonic Greek can also be handled. These would allow for easy input of non-Latin alphabets for users who primarily write in Latin alphabets and work on operating systems where switching keyboard layouts to allow input of non-Latin scripts is difficult.
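As a sketch of how a tab-delimited table like the one above might be loaded (hypothetical code, not necessarily the format or API that finl-charsub will actually use), with \uXXXX hex escapes decoded in the output column:

use std::collections::HashMap;

// Hypothetical sketch: parse tab-delimited "input<TAB>output" lines into a
// map of substitutions. Lines without a tab are skipped.
fn parse_charsub_table(table: &str) -> HashMap<String, String> {
    let mut subs = HashMap::new();
    for line in table.lines() {
        if let Some((input, output)) = line.split_once('\t') {
            subs.insert(input.to_string(), decode_escapes(output));
        }
    }
    subs
}

// Decode \uXXXX (hex) escapes; everything else passes through unchanged.
fn decode_escapes(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' && chars.peek() == Some(&'u') {
            chars.next(); // consume the 'u'
            let hex: String = chars.by_ref().take(4).collect();
            match u32::from_str_radix(&hex, 16).ok().and_then(char::from_u32) {
                Some(decoded) => out.push(decoded),
                None => {
                    // Not a valid escape; keep the literal text.
                    out.push('\\');
                    out.push('u');
                    out.push_str(&hex);
                }
            }
        } else {
            out.push(c);
        }
    }
    out
}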