Build: marp index.md -o index.html --html=true Watch: marp index.md -o index.html --html=true -w

1. Input processor (transforms input text to tokens)
2. Expansion processor (expands macros, conditionals, etc.)
3. Execution processor (performs commands, i.e. changes state, makes assignments, adds to the horizontal/vertical list)
4. Visual processor (kerning, ligatures, line breaking, page breaking)
5. Backend

`\isempty` scans one argument, and we want to do different things depending on whether it is empty. On the expansion level TeX has a few `\if` primitives, but they mostly compare two consecutive tokens for equality. We can still use them with a trick. `\ifx` checks whether two tokens have the same meaning, so we write `\ifx\something<maybe empty>\something`. If the argument is empty, `\ifx` sees `\something\something` as the next two tokens and we take the "true branch". (Taking the true branch means that when TeX later encounters `\else`, it keeps reading tokens but ignores everything up to `\fi`; we are still on the expansion level, not in a true programming language.) If the argument is not empty (e.g. it is `Hello`), TeX sees `\ifx\something Hello\something`. `\something` is not the same token as `H`, so TeX ignores everything up to `\else` and "executes" the else branch. Conveniently, this also means we don't care how many tokens the argument has, even though we only have a single-token comparison.
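A minimal plain TeX sketch of the test described above. The sentinel name `\emptymarker` and the typeset words are placeholders chosen for this example; the sentinel only has to be a token that will not appear at the start of the argument.

```tex
% \emptymarker is a hypothetical sentinel. It needs no definition,
% because \ifx compares the two tokens without expanding them.
\def\isempty#1{%
  \ifx\emptymarker#1\emptymarker
    empty%
  \else
    not empty%
  \fi
}

\isempty{}      % takes the true branch
\isempty{Hello} % takes the else branch
\bye
```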

You have probably used the `\frac` macro for typesetting fractions in LaTeX. On the primitive level TeX uses the `\over` primitive, which expects the numerator to precede it and the denominator to follow it. Weirdly enough, Knuth made a lot of the math internals much more complicated (there is an extra pass to figure out the sizes) just to be able to say `\over`, which reads more naturally, yet in LaTeX we get `\frac`, which throws that benefit away. As the `\frac` macro takes two undelimited arguments, each argument captures either contents wrapped in `{}` or a single token. This means that for a simple fraction with single-digit numbers we can omit the braces.
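To see the two notations side by side in plain TeX, here is a simplified stand-in for `\frac`. It is an assumption made for illustration: LaTeX's actual definition differs in detail.

```tex
% Simplified stand-in for \frac, built on the \over primitive.
\def\frac#1#2{{#1\over#2}}

% \frac12 works because each undelimited argument grabs a single token.
$$ {1\over 2} = \frac{1}{2} = \frac12 \qquad \frac{a}{b+1} $$
\bye
```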

There is the `\vskip` primitive for inserting glue into the current vertical list, and `\hskip` for inserting glue into the current horizontal list. If we wanted a single macro called `\skip` that expands to `\hskip` or `\vskip` depending on the current mode, we would need to construct the control sequence name (`\hskip`, `\vskip`) dynamically. We can do this by checking the current mode (`\ifhmode`) and emitting `h` or `v` accordingly. Finally, to get a dynamic control sequence, we use `\csname` and `\endcsname`. Everything written between the pair is expanded, and the remaining "ASCII" codes (ignoring category codes) are used as the control sequence name.
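A small sketch of the mode-dependent skip. The macro is named `\anyskip` here rather than `\skip`, because `\skip` is already a TeX primitive (it accesses glue registers) and redefining it would break plain TeX.

```tex
% Builds \hskip in horizontal mode and \vskip otherwise.
\def\anyskip{\csname\ifhmode h\else v\fi skip\endcsname}

First paragraph.\par
\anyskip 20pt            % vertical mode: acts as \vskip 20pt
Left\anyskip 20pt right. % horizontal mode: acts as \hskip 20pt
\bye
```

The sketch treats every mode other than horizontal as vertical, so it is not suitable for math mode as written.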

Every control sequence has its meaning stored in the table of equivalents. We can copy the meaning with `\let`. This allows us to create true copies of primitives and macros, but also of other tokens. For example, the `\bgroup` and `\egroup` commands are just copies of `{` and `}`, which are the more primitive ways to start and end a group. (Disclaimer: it's actually more involved. `{` and `}` are special syntactically, as they always need to be scanned in a balanced way. `\bgroup` and `\egroup` bypass that: they have the begin-group and end-group meaning semantically, but don't need to be balanced syntactically. There are also `\begingroup` and `\endgroup`, but they only start and end a group; they can't be used e.g. as `\hbox\begingroup\endgroup`, whereas `\hbox\bgroup\egroup` works.)
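A short sketch of both points: `\let` copies the current meaning, and `\bgroup`/`\egroup` (which plain TeX defines via `\let`) can be split across separate macros. The names `\greet`, `\savedgreet`, `\startbox` and `\stopbox` are made up for the example.

```tex
\def\greet{Hello}
\let\savedgreet=\greet  % copy the current meaning of \greet
\def\greet{Bye}         % redefining \greet does not affect the copy
\savedgreet, \greet.    % typesets "Hello, Bye."

% \bgroup and \egroup need not be balanced inside a definition,
% so a box can be opened and closed by two different macros.
\def\startbox{\hbox\bgroup}
\def\stopbox{\egroup}
\startbox boxed text\stopbox
\bye
```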

Hash tables, which map keys to values, are a very important data structure. We often want them in TeX too, and we can emulate them with dynamically constructed control sequences. We can define macros `\setkv` and `\getkv` that allow us to store and retrieve arbitrary things. To store a value we define a control sequence named `kv:key`, whose replacement text is the `value`. As the name includes a character that normally can't be part of a macro name (i.e. `:`), it is unlikely to collide with anything the user writes. To generate the dynamic control sequence name we of course use `\csname`. But we have a problem: we want to say `\def\dynamiccontrolsequence{something}`, and `\def` is a command which in the main loop of TeX just says "give me the next token, assert that it is a control sequence, then scan the parameter template and the replacement text in braces (`{}`)". If we write `\def\csname...`, TeX will just redefine `\csname`. We need to construct the dynamic control sequence name first, so that `\def` already sees it. For this we want to change the expansion order. The most important primitive in that regard is `\expandafter`. It reads one token and sets it aside unexpanded, then expands the token after it once, and finally puts the first token back in front of the result. So in this case `\expandafter\def\csname` first creates the dynamic control sequence name, and only then does `\def` run and see the right thing.
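A minimal sketch of `\setkv` and `\getkv` as described above. The keys and values in the usage lines are arbitrary examples.

```tex
% Store: define the control sequence named "kv:<key>" to expand to <value>.
\def\setkv#1#2{\expandafter\def\csname kv:#1\endcsname{#2}}
% Retrieve: build the same name again and let TeX expand it.
\def\getkv#1{\csname kv:#1\endcsname}

\setkv{color}{blue}
\setkv{shape}{round}
\getkv{color}, \getkv{shape}
\bye
```

Looking up a key that was never set produces no error in this sketch: `\csname` gives an undefined name the meaning of `\relax`, so the lookup silently typesets nothing.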

Another case where we need to be careful with expansion is defining a macro that appends some tokens to an existing macro. Let's break it down on a simpler example, where we just want to add some text to the macro `\a`. The first thing we might try is `\def\a{\a hello}`: define `\a` to be `\a` followed by something. But this doesn't work. We define `\a` to mean the token `\a` followed by a few letter tokens, which is a recursive definition that will overflow TeX's input stack on expansion. What we really want is to define `\a` to be the _expanded value_ of `\a` followed by what we want to add. But `\def` doesn't expand anything when it scans the replacement text of a macro. We can use `\expandafter` to solve this. Just before `\def` we start a chain of `\expandafter`s that reaches the `\a` we want to expand. Thanks to this, by the time `\def` runs, it already sees the expanded meaning of `\a` followed by what we want to add. To create the generic `\addto` macro that we wanted initially, we just make `\a` and the text to add into parameters of `\addto`.
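The `\expandafter` chain and the generic `\addto` could look like the sketch below. It assumes the macro being extended takes no parameters.

```tex
\def\a{Hi}
% The chain expands the inner \a once before \def runs,
% so TeX effectively executes \def\a{Hi\space hello}.
\expandafter\def\expandafter\a\expandafter{\a\space hello}

% Generic version: #1 is the macro, #2 the tokens to append.
\def\addto#1#2{\expandafter\def\expandafter#1\expandafter{#1#2}}
\addto\a{ world}
\a   % typesets "Hi hello world"
\bye
```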

Say we have a macro called `\bold` that receives an argument and typesets it in bold, and an `\italic` macro that typesets its argument in italic. Say that, based on some condition, we want to make the text between braces following the conditional either bold or italic. The first thing we may come up with is to just use `\bold` and `\italic` in the respective branches. The problem is that each of them scans the next token as its argument, which will be `\else` or `\fi` respectively, completely confusing TeX. Before `\bold` and `\italic` execute, we want them to see just `{word}` after them, so that they scan the correct thing as their argument. We will use the fact that `\if`, `\else` and `\fi` work on the expansion level. Expansion of `\else` just keeps ignoring tokens until it reaches `\fi`. Expansion of `\fi` just pops the `\if` from the condition stack. If we use `\expandafter` to expand `\else` and `\fi` before `\bold` and `\italic` take effect, the condition first disappears from the input completely, and `\bold` and `\italic` then correctly scan the text following the condition as their argument.
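A sketch of the `\expandafter` approach. The definitions of `\bold` and `\italic`, the wrapper `\emphasize` and the switch `\ifimportant` are assumptions made so that the example is self-contained.

```tex
\def\bold#1{{\bf #1}}
\def\italic#1{{\it #1}}
\newif\ifimportant % hypothetical condition for this example

% \expandafter removes \else...\fi (or just \fi) before \bold or \italic
% looks for its argument, so the argument is the {word} that follows.
\def\emphasize{%
  \ifimportant \expandafter\bold \else \expandafter\italic \fi}

\importanttrue  \emphasize{word}
\importantfalse \emphasize{word}
\bye
```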

Another trick to achieve a similar thing (insert a macro only after `\fi`) is to introduce a helper macro which eats everything up to the `\fi`, ignores it, executes the `\fi` to get rid of the condition on the stack, and then inserts the thing that we wanted to carry out of the condition. And finally, yet another way to carry something out of the condition is to define a temporary macro inside it and use that macro after the condition.
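Both tricks can be sketched as follows. As before, `\bold`, `\italic`, `\ifimportant` and the two wrapper names are assumptions made for the example; `\afterfi` is written out here as one possible definition of the helper.

```tex
\def\bold#1{{\bf #1}}
\def\italic#1{{\it #1}}
\newif\ifimportant

% \afterfi grabs the token to carry out (#1), discards the rest of the
% conditional (#2), closes it with \fi, and then puts #1 back.
\def\afterfi#1#2\fi{\fi#1}
\def\emphasizeA{\ifimportant \afterfi\bold \else \afterfi\italic \fi}

% Temporary-macro variant: remember the choice in \next, use it after \fi.
\def\emphasizeB{%
  \ifimportant \let\next=\bold \else \let\next=\italic \fi
  \next}

\importanttrue  \emphasizeA{one} \emphasizeB{two}
\importantfalse \emphasizeA{three} \emphasizeB{four}
\bye
```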

Say we want to implement a macro called `\cite`, which gets a list of references that we want to resolve and print nice labels for. We don't care much about the typesetting part, only about how to parse a comma-separated list in square brackets. Unlike `{` and `}`, square brackets are not special to TeX. So we can just define a macro whose parameter text matches an opening square bracket and then an argument delimited by the closing square bracket. We won't use any predefined loop macros; we will hand-code the recursion ourselves. A single step of the recursion processes one element. To get one element from the input, we define a macro that reads an argument delimited by a comma. As all elements need to be delimited by a comma, we must add a comma after the initial `\cite` argument before even starting the recursive processing. As always with recursion, we must also come up with a base case. For that we add yet another comma after the initial argument. This means that our recursive macro will read one extra element, which will be empty. As we already know how to check for an empty argument, it serves nicely as our terminating condition. In the recursive step we just need to print the citation somehow (not really important for us here) and invoke the macro recursively. But again we have a problem: `\citeimpl` immediately tries to read up to the first comma, but the `\fi` token is in the way and we need to get rid of it. We already know how to do that; in this case `\expandafter` works nicely. The nice thing about our macro is that it is _tail recursive_. Because the `\expandafter` gets rid of the `\fi`, our recursive macro calls itself as the very last thing. It doesn't need any extra stack space and is quite efficient. Most similar loop macros in TeX are tail recursive, usually for the same reason ours is: we need to read past `\fi`.
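The parsing steps above can be sketched like this. The sentinel `\citeend`, the bracketed output format and the citation keys are placeholders; a real `\cite` would resolve each key instead of printing it.

```tex
% Append two commas: one delimits the last element, one yields the
% empty element that stops the recursion.
\def\cite[#1]{\citeimpl#1,,}
\def\citeimpl#1,{%
  \ifx\citeend#1\citeend
    % empty element: the extra comma was reached, stop
  \else
    [#1]%
    \expandafter\citeimpl % remove \fi, then read the next element
  \fi
}

\cite[texbook,latexbook,metafontbook]
\bye
```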

We can define a more user-friendly variant of the `\cite` macro. The problem with the previous version is that it scans each element as simply everything up to the next comma. So if some elements start with leading whitespace, we treat it as part of the citation name. This is not nice, and we can get rid of the leading whitespace with a trick. The trick is that reading an undelimited argument (i.e. a single token or text surrounded by braces) skips whitespace until it finds the argument. So if we read the input as an undelimited argument, we get rid of the leading whitespace automatically. But that means we will not read up to the next comma, only one letter. To solve the problem, we scan both an undelimited and a delimited argument. The first skips the leading whitespace and grabs the first letter. The second reads the rest of the element up to the comma. The only thing we have to be careful about is that, to reconstruct the element, we must combine `#1` and `#2`. Our terminating condition also has to change slightly. The macro will no longer read an empty argument at the end; instead the first argument grabs whatever the next token is, which will be one of the extra commas that we inserted. So in this case we need to insert one more comma into the input, and check the base case by comparing the first argument to a comma.
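A sketch of the whitespace-tolerant variant under the same assumptions as before (placeholder keys and output). Note the three appended commas: one ends the last element, one is grabbed as `#1`, and one delimits the then-empty `#2`.

```tex
\def\cite[#1]{\citeimpl#1,,,}
\def\citeimpl#1#2,{%
  \ifx,#1%
    % #1 grabbed a sentinel comma: stop
  \else
    [#1#2]% first token plus the rest of the element
    \expandafter\citeimpl
  \fi
}

\cite[texbook, latexbook,   metafontbook]
\bye
```

The sketch only removes leading whitespace; a space before a comma would still end up in the element.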

In plain TeX, loops are implemented with a macro called `\loop`. It scans the body of the loop, which is everything up to `\repeat` and needs to end with an `\if` condition. It looks like special syntax, but it really isn't. Under the hood, `\loop` just defines `\body` as the body of the loop including the condition at the end, and then calls a helper recursively, using the `\next` trick, until the condition becomes false.
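For reference, this is roughly how plain TeX defines the loop (reproduced from memory, so treat the exact spacing as approximate), followed by a small usage example. The counter `\loopcounter` is made up for the example.

```tex
% Definitions as in plain TeX, repeated here only for reference.
\def\loop#1\repeat{\def\body{#1}\iterate}
\def\iterate{\body \let\next=\iterate \else \let\next=\relax \fi \next}
\let\repeat=\fi % lets a whole \loop...\repeat be skipped inside a conditional

\newcount\loopcounter % hypothetical counter for the example
\loopcounter=0
\loop
  \number\loopcounter\space
  \advance\loopcounter by 1
\ifnum\loopcounter<5 \repeat % typesets 0 1 2 3 4
\bye
```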

In this case, however, we could leave out the `\next` trick and just use simple recursion. But that isn't tail recursion: a `\fi` is left pending for each iteration. The `\next` variant is tail recursive, as would be the `\expandafter` variant.
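The two alternatives could be sketched as below. The names `\iteratesimple` and `\iterateexp` are hypothetical, and `\body` is defined by hand here instead of through `\loop`.

```tex
% Simple recursion: works, but every iteration leaves its \fi pending.
\def\iteratesimple{\body \iteratesimple \fi}

% \expandafter variant: \fi is consumed before the recursive call.
\def\iterateexp{\body \expandafter\iterateexp \fi}

\newcount\loopcounter
% The body ends with an open condition, just as a \loop body does.
\def\body{\number\loopcounter\space \advance\loopcounter by 1 \ifnum\loopcounter<5 }

\loopcounter=0 \iteratesimple\par
\loopcounter=0 \iterateexp
\bye
```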

The demo document shows some features that we didn't have time to cover:

1. Examples of typesetting commands like `\section`. There we want to handle section numbering, set up penalties to encourage a page break before the section, and forbid a page break between the section title and the first paragraph.
2. Marks: notes about what the current chapter is, which can be used to typeset the name of the current chapter on a page.
3. Footnotes and figures, which are "inserts": floating elements for which TeX tries to reserve space on the page, while the output routine ultimately chooses where to place them. The simplest examples are `\topinsert`, which is placed at the top of the current page or, if it does not fit there, of the next page, and `\footnote`, which typesets a footnote at the bottom of the current page.
4. Writing to and reading from files. To read a whole file, we can just use `\input` with the file name. But writes with `\write` are, surprisingly, not performed at the point where TeX's main loop executes them (unless prefixed with `\immediate`). Instead, they are added as a node which performs the write as part of `\shipout`. This means that writes happen on finalized pages, so e.g. the page number is usually known at that time.
5. The "at shipout" property of writes is important for implementing features like a table of contents. As TeX ships out pages sequentially and in a single pass, it can't typeset the table of contents at the beginning of the document, because it doesn't yet know what the chapters are. Writes provide a solution: write the information about the chapters to a helper file in the first pass, and read that information in the second pass, which is then able to typeset the table of contents. The write with the section information has to be very careful with expansion. Some parts need to be expanded immediately (like the section number register, since it can change multiple times on a page), while others must be read only when the write actually happens (like the page number, which is correct only at that point). A typical problem with these writes is also special characters (control sequences and e.g. `~`) that would otherwise be expanded. To prevent that we escape them with `\detokenize`, which turns all characters into "other" tokens (catcode 12), except spaces, which are left as spacers (catcode 10).
6. Leaders: repeated elements, used to typeset the "leading" dots in a table of contents, but also the header lines. Internally they are represented like glue, but instead of being just whitespace, they are realized by repeating their contents.
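The careful mix of immediate and delayed expansion in the table-of-contents write can be sketched as follows. This assumes an e-TeX based engine such as pdfTeX, because `\detokenize` is an e-TeX primitive. The names `\tocfile`, `\secno`, `\tocentry` and this `\section` are made up for the example, not taken from the demo document.

```tex
\newwrite\tocfile
\immediate\openout\tocfile=\jobname.toc
\newcount\secno

\def\section#1{%
  \advance\secno by 1
  % \edef expands \the\secno and \detokenize right now; \noexpand keeps
  % \folio unexpanded, so the page number is filled in at shipout time.
  \edef\tocentry{\write\tocfile{\the\secno: \detokenize{#1}, page \noexpand\folio}}%
  \bigbreak
  \noindent\tocentry{\bf \the\secno. #1}\par
  \nobreak\medskip
}

\section{Boxes \& glue}
Some text.
\section{Macros}
More text.
\bye
```

A second pass would have to `\input` the helper file before `\openout` runs, because opening the file for writing empties it.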

Another interesting thing to implement in TeX is a verbatim environment, which prints a piece of text exactly as it appears in the source code, usually in a monospace font. While TeX normally ignores duplicate spaces, and line breaks in the source are more or less treated like spaces, in a verbatim environment we need each space and end-of-line character to have its literal effect. A verbatim environment thus often boils down to setting the category codes of everything to 12, except for space and newline, which need a bit of special handling, as they are not only semantically but also syntactically significant for TeX. The challenge is this: when we set the category codes of all ASCII characters to 12, how can we scan up to `\endverbatim`? A naive definition tries to match up to the single token that is the `\endverbatim` control sequence. But since we set the category code of backslash to 12, it no longer creates control sequences. So we actually need to define `\verbatim` to be delimited not by the control sequence `\endverbatim`, but by the twelve category code 12 tokens `\`, `e`, `n`, etc. For this we can use `\string` to turn the control sequence `\endverbatim` into just category 12 tokens. But of course, to have them there at the right time, we need an `\expandafter` sequence.
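A minimal sketch of such a verbatim environment in plain TeX. The helper name `\verbatimscan` is an assumption. The special handling of spaces and line ends is delegated to plain TeX's `\obeyspaces` and `\obeylines`, which is a simplification: spaces at the start of a line and empty lines are lost.

```tex
% Delimit #1 by the twelve catcode-12 characters "\endverbatim":
% the \expandafter chain lets \string run before \def reads its parameter text.
\expandafter\def\expandafter\verbatimscan
  \expandafter#\expandafter1\string\endverbatim{#1\endgroup}

\def\verbatim{\begingroup
  \tt
  \count255=0
  \loop % set the category code of every ASCII character to 12
    \catcode\count255=12
    \advance\count255 by 1
  \ifnum\count255<128 \repeat
  \obeylines \obeyspaces % line ends and spaces become active characters
  \verbatimscan}

\verbatim
\def\foo#1{$x^2$ & 100% {literal}}
\endverbatim
\bye
```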

In plain TeX, assignments to registers are done directly, usually with the optional equals sign, which improves readability. In LaTeX there is, for example, a macro called `\setlength` that receives the register and the dimension as two parameters and expands to the two placed one after the other, which executes the assignment. The trailing `\relax` makes sure nothing further is scanned as part of the dimension.
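A sketch contrasting the two styles. The wrapper is named `\mysetlength` to mark it as an illustration rather than LaTeX's actual definition, and `\gap` is a register made up for the example.

```tex
\newdimen\gap        % hypothetical register for the example
\gap=10pt            % plain TeX: direct assignment with optional equals sign

% Register and dimension placed after each other; \relax stops the scanning.
\def\mysetlength#1#2{#1=#2\relax}
\mysetlength\gap{2\gap}
\the\gap             % typesets 20.0pt
\bye
```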

In plain TeX, a temporary change of font to bold would be achieved with a local group. The font would be set to bold, all the text in between would be typeset in bold, and the assignment would be undone at the end of the group. LaTeX hides the underlying concept of the group and just exposes a command that makes its argument bold. It's also a bit less efficient, as it needs to scan its argument.

LaTeX environments are marked with `\begin` and `\end`. Behind the scenes, these just translate to a control sequence named after the environment for the start (`\begin{foo}` calls `\foo`), and a control sequence with the prefix `end` for the end (`\end{foo}` calls `\endfoo`). Environments also start a group internally, so all assignments are local.
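The translation can be sketched in plain TeX as below. `\beginenv` and `\endenv` are stand-ins for LaTeX's `\begin` and `\end` (in plain TeX, `\end` is the primitive that finishes the job), and LaTeX's real definitions do additional checks. The `shout` environment is made up for the example.

```tex
% Start a group and call \<name>; call \end<name> and close the group.
\def\beginenv#1{\begingroup\csname #1\endcsname}
\def\endenv#1{\csname end#1\endcsname\endgroup}

% A hypothetical "shout" environment: bold text followed by "!".
\def\shout{\bf}
\def\endshout{!}

\beginenv{shout}Hello\endenv{shout} and back to normal.
\bye
```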

Instead of using the primitives `\hrule` or `\vrule` with keyword arguments, LaTeX uses `\rule` with two macro arguments.

The famous `\makeatletter` and `\makeatother` macros from LaTeX just change the category code of the at sign (`@`), so that it either is or is not scanned as part of control sequence names. Usually these macros are used to temporarily gain access to "internal" definitions, which use the `@` in their names to avoid accidental redefinitions.
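In plain TeX, where these two macros are not predefined, the idea can be sketched like this. `\my@internal` and `\public` are names made up for the example.

```tex
\def\makeatletter{\catcode`\@=11\relax} % @ becomes a letter
\def\makeatother{\catcode`\@=12\relax}  % @ becomes "other" again

\makeatletter
\def\my@internal{secret}     % hypothetical internal helper
\def\public{\my@internal}    % tokenized while @ is a letter
\makeatother
\public                      % still works: the name was fixed at definition time
\bye
```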

Everything you didn't want to know about TeX

Goals of the talk

What is TeX?

History of TeX

Yak shave

About TeX

Product of its time

Inner workings of TeX

Boxes and glue model

Interchangeable terms

Boxes and glue in the backend

Output deep dive

Execution processor

Main loop of TeX

Main loop deep dive

Semantic nest

Semantic nest deep dive

Visual processor

Line breaking

Line breaking II

Nodes for breaking

Page breaking

Glue elasticity

Getting tokens

Category codes

Input processor

Expandable commands

Macros

Macros II

Macros deep dive

Beware

TeX examples

Check if macro argument is empty

Macro arguments

Dynamic control sequences

Copy current meaning of another token:

Hash tables

Append to macro

Ending ifs

`\afterfi` and `\let\next` tricks

Parsing comma separated input

Parsing comma separated input - ignoring leading whitespace

Plain TeX generic loop

Plain TeX generic loop - alternatives

Demo

Verbatim

LaTeXisms

Resources
