Character substitutions in text

TeX handles some character-sequence substitutions by (ab)using the ligature mechanism, e.g., ``→“. This works reasonably well for Computer Modern, which defines these in its ligature table, but falls apart once we start trying to use non-TeX fonts. Furthermore, there’s the added complication that most fonts put the characters ' and ` in character positions 39 and 96 while TeX expects those positions to typeset ’ and ‘.

I’m thinking that a solution to this would be a character-sequence substitution that’s run-time configurable as part of the text-input pipeline. This would happen after commands have been interpreted but before the text stream is processed for ligatures and line breaks. The standard route would be to import a tab-delimited table of input and output sequences of Unicode characters. The table for standard TeX input would look like:

`       ‘
``      “
'       ’
''      ”
--      –
---     —
!`      ¡
?`      ¿
~       \u00a0

Note that we no longer have an active-character concept to allow using ~ for non-breaking spaces, hence the last row in the table above. Also, the timing of when the substitutions take place means that we cannot use this mechanism to insert commands into the input stream. Defining a mapping like TeX→\TeX will not typeset the TeX logo but will typeset the sequence \TeX instead, including the backslash. (Actually, given the use of \ to open a Unicode hex sequence, it might produce something like a tab followed by eX, depending on what other escape sequences are employed.)

Other TeX conventions, like the AMS Cyrillic transliteration, where ligatures are used to map sequences like yu→ю, can easily be managed this way, as can Silvio Levy’s ASCII input scheme for polytonic Greek. These would allow easy input of non-Latin alphabets for users who primarily write in Latin alphabets and work on operating systems where switching keyboard layouts to allow input of non-Latin scripts is difficult.
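
Here’s a minimal sketch of how such a substitution pass might look in the C++ I expect to be writing (see below), assuming the tab-delimited table has already been loaded into a map and taking a greedy longest match at each position so that `` wins over `. The function and parameter names are illustrative, not finl’s actual API.

#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

std::u32string substitute(const std::u32string& input,
                          const std::map<std::u32string, std::u32string>& table,
                          std::size_t maxKeyLength) {
    std::u32string output;
    std::size_t i = 0;
    while (i < input.size()) {
        bool matched = false;
        // Try the longest candidate first so that, e.g., --- is found before --.
        for (std::size_t len = std::min(maxKeyLength, input.size() - i);
             len > 0 && !matched; --len) {
            auto entry = table.find(input.substr(i, len));
            if (entry != table.end()) {
                output += entry->second;
                i += len;
                matched = true;
            }
        }
        if (!matched) {
            output += input[i++];  // no substitution applies; copy through
        }
    }
    return output;
}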

Mistakes of LaTeX: the tabular environment

One constant annoyance I’ve encountered when people learn LaTeX is the default behavior of the tabular environment. Rather than presenting itself as a separate paragraph block, it is instead typeset in TeX’s horizontal mode. This means that if, for example, a user writes:

some text

\begin{tabular}{...}
...
\end{tabular}

more text

the tabular will be presented left-ish aligned (actually, indented by \parindent). This is never what the user intended. Worse still is the result if the blank line before or after the tabular is omitted. My solution was to encourage users to create an aroundtbl environment, which was usually just \newenvironment{aroundtbl}{\begin{center}}{\end{center}}, but the correct solution would have been to not require this in the first place. Instances where a user wants a tabular to be in horizontal mode are rare, and if they really were important, wrapping the tabular in a minipage would be a better solution (or having a variant of the tabular environment).

It’s long past time for this to be fixed in LaTeX proper, but it makes sense to do it with the tabular replacement in finl.

Defining a document markup language for finl

The markup language for finl will be based on LaTeX, but many of the pain points of LaTeX come from the macro-expansion approach that Knuth’s TeX takes towards parsing the document. I can remember being a teenager reading The TeXbook, puzzling over the whole mouth-gullet-stomach description and finding it challenging to follow.

LaTeX attempts to impose a sense of order on some of the randomness of TeX’s syntax (albeit at the price of occasional verbosity and losing some of the natural-language feel of plain TeX, cf. the difference between typing {a \over b} vs. \frac{a}{b}). Still, LaTeX’s basic markup language is a good starting point. This gives us commands and environments as the basic markup for a document. There’s something to be said for different modes of parsing as well: parsing rules for math would differ from parsing rules for text, and there should be an option to take a chunk of input completely unparsed, e.g., for \verb or the verbatim environment. Changing the timing of how things are parsed would enable us to do things like \footnote{This footnote has \verb+some_verbatim+ text}.

Commands

We retain the basic definition of how commands are parsed. A command begins with \ and is followed by either a single non-letter or a string of letters running until we reach a non-letter. Letters are defined as those Unicode characters that are marked as alphabetic, which means that not only is \bird a valid command, but so are \pták, \طائر and \鳥.
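
A sketch of what that scan might look like using ICU4C’s Unicode property test (u_hasBinaryProperty with UCHAR_ALPHABETIC is the library’s “alphabetic” check); the function name and the assumption that the input has already been decoded to UTF-32 are mine, not a finl commitment.

#include <unicode/uchar.h>
#include <cstddef>
#include <string>

// Scan a command name. pos points at the character just after the backslash.
std::u32string scanCommandName(const std::u32string& input, std::size_t& pos) {
    auto isLetter = [](char32_t c) {
        return u_hasBinaryProperty(static_cast<UChar32>(c), UCHAR_ALPHABETIC) != 0;
    };
    std::u32string name;
    if (pos < input.size() && !isLetter(input[pos])) {
        name += input[pos++];  // a single non-letter is the whole name
        return name;
    }
    while (pos < input.size() && isLetter(input[pos])) {
        name += input[pos++];  // letters accumulate until the first non-letter
    }
    return name;
}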

Commands can take any number of arguments. Arguments can be of the following forms:

  • Required argument. This will be either a single token (a Unicode character, a command with its arguments, or an environment) or a delimited token list (usually enclosed in braces, but see below). 
    Note that this varies from existing LaTeX syntax in that a user could write, e.g., \frac 1\sqrt{2} rather than \frac 1{\sqrt{2}}. I may change my mind on this later.
  • Optional argument.  This must be delimited by square brackets. Inside an optional argument, if a closing square bracket is desired, it must appear inside braces, or, can be escaped as \]. \[ will also be treated as a square bracket within an optional argument.
  • Ordered pair. Two floating-point numbers, separated by a comma and enclosed in parentheses. Any white space inside the parentheses will be ignored. This would be used for, e.g., graphics environments (a parsing sketch follows this list).
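
A sketch of the ordered-pair form, leaning on sscanf’s whitespace handling; the %n conversion records how much of the argument was consumed so trailing junk is rejected. The helper name is hypothetical.

#include <cstdio>
#include <optional>
#include <string>
#include <utility>

std::optional<std::pair<double, double>> parseOrderedPair(const std::string& arg) {
    double x = 0.0, y = 0.0;
    int consumed = 0;
    // White space in the format string matches any run of white space, including none.
    if (std::sscanf(arg.c_str(), " ( %lf , %lf ) %n", &x, &y, &consumed) == 2
        && consumed == static_cast<int>(arg.size())) {
        return std::make_pair(x, y);
    }
    return std::nullopt;  // malformed pair
}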

A command can have a single * immediately after the command which indicates an alternate form of the command.

Arguments can have types as well. These can be:

  • Parsed text. This will be parsed as normal, including any enclosed macros.
  • Mathematics. Spaces will be ignored. Math-only commands will be defined. ^ will indicate a superscript, _ will indicate a subscript (note that outside of math mode, ^ and _ will simply typeset those characters).
  • Unparsed text. This will be treated as a straight unparsed character stream, and the command will be responsible for parsing it. The argument can either be enclosed in braces, or the character at the beginning of the unparsed character stream will be used to indicate its end, as with LaTeX’s \verb.
  • Key-value pair list. This will be a list of key-value pairs with the key separated from the value by ->. Any white space at the beginning or end of the list will be ignored, as will any white space surrounding the arrows. If a value contains ->, the whole value should be enclosed in braces (a sketch of splitting a pair follows this list).
  • No space mode. Parsed as normal except all spaces are ignored.
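
And a sketch of splitting a single key-value pair on its -> arrow, skipping arrows inside braces so that a braced value containing -> survives intact; what separates one pair from the next is a detail I haven’t pinned down above, so this handles just one pair. Helper names are illustrative.

#include <cstddef>
#include <optional>
#include <string>
#include <utility>

std::optional<std::pair<std::string, std::string>> splitKeyValue(const std::string& pair) {
    auto trim = [](const std::string& s) {
        std::size_t b = s.find_first_not_of(" \t\n");
        std::size_t e = s.find_last_not_of(" \t\n");
        return b == std::string::npos ? std::string{} : s.substr(b, e - b + 1);
    };
    int depth = 0;  // brace nesting; an arrow only separates at depth zero
    for (std::size_t i = 0; i + 1 < pair.size(); ++i) {
        if (pair[i] == '{') ++depth;
        else if (pair[i] == '}') --depth;
        else if (depth == 0 && pair[i] == '-' && pair[i + 1] == '>') {
            return std::make_pair(trim(pair.substr(0, i)), trim(pair.substr(i + 2)));
        }
    }
    return std::nullopt;  // no top-level arrow found
}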

As an aside, unparsed text can be re-submitted for parsing, potentially after manipulation of the text by the command definition.

Environments

Environments are marked with \begin{X} and \end{X} where X is the environment name.

Environment names can consist of any characters except {, } or *. A single * at the end of the environment name indicates that this is an alternate form of the environment, as with commands above.

The \begin environment command can take any number of arguments as above.

The contents of the environment can be any of the types for command arguments as above. Unparsed text, however, can only be concluded with the appropriate \end environment command.
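
Capturing an unparsed environment body can then be as simple as searching the raw character stream for the literal \end{name}; a sketch, with hypothetical names, assuming the raw stream is available as a string:

#include <cstddef>
#include <optional>
#include <string>

std::optional<std::string> captureUnparsedBody(const std::string& input,
                                               std::size_t& pos,
                                               const std::string& name) {
    const std::string terminator = "\\end{" + name + "}";
    std::size_t end = input.find(terminator, pos);
    if (end == std::string::npos) {
        return std::nullopt;  // runaway environment: no matching \end
    }
    std::string body = input.substr(pos, end - pos);
    pos = end + terminator.size();  // resume normal parsing after \end{name}
    return body;
}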

There will be some special shortcuts for environments: \(…\) and $…$ shall be equivalent to \begin{math}…\end{math}, and \[…\] and $$…$$ shall be equivalent to \begin{displaymath}…\end{displaymath}.

Updates

26 Feb 2021 Minor formatting change, indicate that key-value pairs are separated by ->.

27 May 2021 Oh my, the key-value language was crud. Fixed that. 

Choosing a programming language

There are five platforms in common usage in 2020. On traditional computing platforms, we see Windows, Linux and macOS. For mobile we have Android and iOS. Other platforms exist, but they have negligible usage (e.g., various BSD variants and quixotic efforts at creating a third OS for mobile). Of these, iOS is the most restricted in its options for a development language. If we were to put aside iOS, I would likely choose a JVM-based language if only because I spend my days writing Java code and I know it well.

But I’m an Apple guy. I have a MacBook Pro on my lap, a Mac Mini on my desk, an iPhone in my pocket and an iPad at my side. I’ll be doing all my development on my Mac, and that will be the first target for finl and what I’ll use as my reference implementation. I’d be tempted to write the code in Swift, a language with which I’ve dabbled, but I suspect that it retains some of the unusual aspects of Objective-C (most notably the method dispatch) which could make it challenging to integrate with applications written in other languages. For example, calling C++ from Java is well-defined, but calling Swift from Java requires creating a C wrapper around the Swift code.

Given my desire for finl to be usable as a library by other applications, it looks like C++ is the best way forward (the only other option is to write in C, which feels like it would be a hard pill to swallow in 2020 for application development). On the plus side, there are robust libraries for many of the features we want to work with, most notably ICU4C, which seems like it will be essential for dealing with Unicode text (and also for finding candidate line breaks in text that doesn’t use Western standards of inter-word spacing, such as Amharic or East Asian Han-based scripts). It does mean I need to bring myself 20 years up to date on the state of C++ (I attempted to write one C++ program in 2005 only to discover that the language had already changed enough to make my skills partially obsolete). This is going to slow things down considerably versus writing code in a JVM-based language, but I’m up for the challenge.
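
As a taste of what ICU4C offers here, its line BreakIterator hands back candidate break positions directly; this sketch just collects them (illustrative only, with minimal error handling):

#include <unicode/brkiter.h>
#include <unicode/locid.h>
#include <unicode/unistr.h>
#include <memory>
#include <vector>

std::vector<int32_t> candidateLineBreaks(const icu::UnicodeString& text) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> iter(
        icu::BreakIterator::createLineInstance(icu::Locale::getRoot(), status));
    std::vector<int32_t> breaks;
    if (U_FAILURE(status)) {
        return breaks;  // iterator failed to initialize; report no candidates
    }
    iter->setText(text);
    // Each boundary the iterator reports is a legal place to break the line.
    for (int32_t p = iter->first(); p != icu::BreakIterator::DONE; p = iter->next()) {
        breaks.push_back(p);
    }
    return breaks;
}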

Why finl? A manifesto

In 1994, LaTeX2e was released as a transitional step towards LaTeX 3. 26 years later, there still isn’t a 1.0 release of LaTeX 3. In the interim, we’ve seen the rise of HTML and the web, the dominance of PDF as a format for representation of printed material (and now there is a plan to have PDF extended with “liquid mode” that allows reflowing of PDF text for smaller screens).

In the meantime, the TeX engine has been extended multiple times: the little-used TeX-XeT, some early efforts to support large Asian character sets, and, in widish use, pdfTeX, XeTeX and LuaTeX, along with an assortment of abandoned engines. Worst of all, it seems that none of pdfTeX, XeTeX or LuaTeX can serve as the one TeX to rule them all; each has limitations that can require users to switch engines depending on their needs.

As I’ve thought about it, the problem at its root is TeX itself. It’s what would be referred to in contemporary software-engineering parlance as a tightly coupled monolith. Worse still, it’s a tightly coupled monolith with numerous compromises baked in because of the limitations of 1970s computing hardware. It seems that the vast majority of the work that has been done on LaTeX 3 has been geared towards dealing with the limitations of TeX as a programming language.

On top of that, there’s been an explosion of questionable, if not outright harmful, practices from the greater LaTeX community. Ideally, translating a document from one document class to another structurally similar class should not require changing anything after the preamble; better still, nothing but the \documentclass command itself (naming-wise, the choice of “class” to name document classes is unfortunate, but understandable). All the appearance should be handled through the document class, and packages should be employed to provide document-structure enhancements or new capabilities. There are numerous violations of this. The memoir class is a mess, claiming to be a replacement for article, report and book (this reminds me of the mess that is PHP, where the same data structure acts as an array and an associative array and as a consequence manages to merge the worst aspects of both in one inefficient construct) while at the same time providing a number of bits of functionality that belong in packages rather than the document class. On the flip side, packages like geometry and fancyhdr fall into a category that LaTeX2e doesn’t really define: bits of common code that would be helpful to document-class writers but shouldn’t really be exposed to document authors.

So what can be done? I believe a complete rethink of LaTeX is necessary. The macro-based language of LaTeX is intellectually satisfying in many of the same ways that solving a sudoku puzzle or playing chess can be. It is not, however, a straightforward programming model, and it does not conform to any popular paradigm (I don’t think any other programming language has a similar structure). It’s time to separate the language for formatting the document from the language for writing the document, and to make a clear break between extensions that modify appearance and extensions that provide new functionality.

While we’re at it, a number of other TeX limitations need to be done away with. TeX was written with the needs of typesetting The Art of Computer Programming in mind, and so it doesn’t handle complicated page layouts well; for all LaTeX’s valiant efforts, this has impacted it as well. Multi-column layouts are painful, float handling is a mess, and some things that designers expect to be able to do, like specifying baseline-to-baseline skips or forcing text onto grid lines, are difficult if not impossible.

Unicode needs to be a first-class citizen. There’s no reason in 2020 for a document writer to have to type \'a instead of á in a document. UTF-8 is the new 7-bit ASCII.

The code of the engine should be decoupled. Why shouldn’t any application that is willing to interface with the code be able to directly access the parser, the paragraph-layout engine or the math-layout engine? Imagine if a user of Word could access a plugin that let her type \sum_{i=0}^{\infty} \frac{1}{2^{i}}, or even ∑_{i=0}^{∞} \frac{1}{2^{i}}, or select these symbols via a GUI, into a low-level token stream to get the typeset sum

\sum_{i=0}^{\infty} \frac{1}{2^{i}}

in her document. Imagine if a document could be painlessly transformed into HTML or even (Unicode) plain text because the page layout and formatting engines are decoupled enough to manage such things.