Character substitutions in text

February 17, 2021 by Don Hosek

TeX handles some character sequence substitutions by (ab)using the ligature mechanism, e.g., ``→“. This works reasonably well for Computer Modern which defines these in its ligature table, but falls apart once we start trying to use non-TeX fonts. Furthermore, there’s the added complication that most fonts put the characters ' and ` in character positions 39 and 126 while TeX is expecting those characters to typeset ’ and ‘.

I’m thinking that a solution to this would be to have a character sequence substitution that’s run-time configurable as part of the text-input pipeline. This would happen after commands have been interpreted but before the text stream is processed for ligatures and line breaks. The standard route would be to import a tab-delimited table of input and output sequences of Unicode characters. The standard TeX input would look like:

`	‘
``	“
`'`	’
`''`	”
`--`	–
`---`	—
!`	¡
?`	¿
`~`	`\u00a0`

Note that we no longer have an active character concept to allow using ~ for non-breaking spaces. Also, the timing of when the substitutions take place mean that we cannot use this mechanism to insert commands into the input stream. doing a mapping like TeX→\TeX will not typeset the TeX logo but will typeset the sequence \TeX instead including the backslash (Actually, given the use of \ to open a Unicode hex sequence it might produce something like a tab followed by eX depending on what other escape sequences are employed.

Other TeX conventions, like the AMS Cyrillic transliteration where ligatures are used to map sequences like yu→ю can easily be managed. Similarly Silvio Levy’s ASCII input scheme for polytonic Greek can also be easily managed. These would allow for easy input of non-Latin alphabets for users who primarily write in Latin alphabets and work on operating systems where switching keyboard layouts to allow input of non-Latin scrips are difficult.

Comments |1|

Re: At long last code – finl is not LaTeX Cancel

This site uses Akismet to reduce spam. Learn how your comment data is processed.

At long last code – finl is not LaTeX says:

5 October 2021 at 08:58

[…] piece of code (something relatively trivial) for finl is now on crates.io: finl-charsub is the character substitution module for finl. It may have use outside of finl when fixed character sequence replacements are needed in […]

Reply

Legend *) Required fields are marked
**) You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Category: architecture

finl is not LaTeX