Two approaches to the two-stage table implementation, benchmark results
On my first pass at implementing a two-stage table for the Unicode functionality that I’m writing for finl, I had the first table consist of entries like this:

`Code(0x91), Page(1), Page(2), ...`

where an entry of `Code` indicated that everything in that 256-codepoint block had the same value, and an entry of `Page` was an index into a table of actual values. I figured this way, I would reduce the amount of memory that I was using.
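A minimal sketch of that first design (this is my reconstruction, not finl’s actual code; the names and value types are assumptions):

```rust
/// One entry per 256-codepoint block in the first-stage table.
#[derive(Clone, Copy)]
enum FirstStage {
    /// Every codepoint in this block shares the same value.
    Code(u8),
    /// Index of a 256-entry page in the second-stage table.
    Page(usize),
}

/// Look up the value for a codepoint: test the discriminant,
/// and only fetch from the second stage when the block is non-uniform.
fn lookup(c: char, first: &[FirstStage], pages: &[[u8; 256]]) -> u8 {
    let cp = c as usize;
    match first[cp >> 8] {
        FirstStage::Code(v) => v,                   // whole block, one value
        FirstStage::Page(p) => pages[p][cp & 0xFF], // branch plus second load
    }
}
```

The appeal is that uniform blocks (which dominate in Unicode property data) cost no second-stage storage at all.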
But in the back of my mind, I kept thinking that this was a false economy. There wouldn’t be that many distinct values for `Code`, so the memory shouldn’t be that big of a deal (even if it doubled the roughly 32K that I was reserving; in applications where everything is measured in megabytes, those tables were a rounding error¹). I rewrote my table-generation code to test this and ran some benchmarks.²
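The rewritten scheme drops the discriminant entirely: every first-stage entry is just a page index, and a uniform block points at a shared page filled with its single value. A sketch under the same assumptions as above (again, not finl’s actual code):

```rust
/// Branch-free two-stage lookup: two unconditional loads,
/// no discriminant to test, no branch to predict.
fn lookup_flat(c: char, first: &[u16], pages: &[[u8; 256]]) -> u8 {
    let cp = c as usize;
    pages[first[cp >> 8] as usize][cp & 0xFF]
}
```

The cost is that uniform blocks now need a real (possibly shared) 256-entry page; the benefit is that every lookup takes the same straight-line path through the code.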
Not too surprisingly, eliminating the need to look at a discriminant in the first-stage table improved things, but not consistently. For Latin-alphabet text, whether in English or Czech, I saw no change when checking whether something was a letter, but a mysterious 30–35% improvement for the lowercase-only check.³ The big difference came with processing Japanese text, where the code ran roughly twice as fast. In that case, I suspect that the difference comes from the fact that for Japanese, almost every call would be returning a `Code` entry rather than a `Page` entry. On the surface, one might expect those cases not to show an improvement; after all, there isn’t a need to look at a second memory address. I think, however, that what’s happening here is a branch-prediction failure: the CPU has decided that `Page` is a more likely result than `Code`, and every time it’s wrong, it needs to back up.
The moral of the story:⁴ avoiding conditional branches in code is a good thing. Table lookups will usually be faster.
1. My earliest programming environments were severely memory-constrained. I started on Apple ][+ machines, which had 48K of RAM for program, data, and display memory; later got to work with the luxurious 128K of the Apple //e; and then the relatively unconstrained (but still constrained by modern standards) memory of the IBM and VAX time-sharing systems that were my primary computing environment during my college years. I didn’t go over 16G of RAM until I’d been programming for over a decade, so it’s no surprise that concerns about memory usage and processing speed always haunt me.
2. Even shutting down everything on my Mac except for the terminal left the benchmarks fairly noisy. Successive runs of the same code could show changes of as much as ±5–10%.
3. My only explanation for the latter is that there is some lingering cache available to the CPU for the second test over the first, although why this would only show up with this optimization is a mystery.
4. Which might not be correct.