https://en.wikibooks.org/wiki/TeX/catcode
0 = Escape character, normally \
1 = Begin grouping, normally {
2 = End grouping, normally }
3 = Math shift, normally $
4 = Alignment tab, normally &
5 = End of line, normally <return>
6 = Parameter, normally #
7 = Superscript, normally ^
8 = Subscript, normally _
9 = Ignored character, normally <null>
10 = Space, normally <space> and <tab>
11 = Letter, normally only contains the letters a,...,z and A,...,Z.
12 = Other, normally everything else not listed in the other categories
13 = Active character, for example ~
14 = Comment character, normally %
15 = Invalid character, normally <delete>
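The table can be checked against a running format. The following is a minimal sketch, assuming plain TeX's default category codes: `\catcode` followed by a character constant yields the current code, and `\message` writes it to the terminal and log.

```tex
\message{\the\catcode`\\} % 0  (escape)
\message{\the\catcode`\a} % 11 (letter)
\message{\the\catcode`\@} % 12 (other in plain TeX)
\message{\the\catcode`\~} % 13 (active in plain TeX)
\message{\the\catcode`\%} % 14 (comment)
\bye
```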
In the first stage of input processing, TeX applies a system-dependent conversion to its internal representation of ASCII (Unicode) and to consistent line breaks.
A line break is mostly converted to a space, but an empty line emits \par
(end of paragraph).
In the second stage, in a simplified view, TeX either sees \ (the escape character)
and scans a control sequence (a run of letter characters, with any following spaces
skipped), e.g. \hbox, or reads a character token of some category, e.g. a
Code reference: scan_control_sequence
There is a limited number of unexpandable commands (up to max_command, ~100-140) that can reach the main loop.
The remaining commands are either expandable primitive commands or macros ("call commands").
Expandable primitive examples:
\expandafter, \noexpand
\if, \else, \fi
\csname, \string
Macros are defined with \def:
\def\cs{<tokens>}
The definition of a macro is stored as a token list (a linked list of tokens). Example:
\def\a{A}
- \a stores one token
\def\b{\B\C}
- \b stores two tokens
Expanding a macro involves pushing the contents of the stored token list onto the top of the input stack.
At any moment, TeX may be reading input tokens from a token list (e.g. a macro) or from an input file.
Macros can have parameters:
\def\greet#1{Hello #1!}
- scans one parameter, and stores 8 tokens
Actually, a macro definition provides a matching template:
\def\mac a#1#2 \b {#1\-a ##1#2 #2}
There are:
{
}
).
Tracing:
\tracingmacros=1
Code references:
Spaces not consumed on the input processor and expansion level are
interpreted as spacer commands that insert glue in horizontal mode or are
ignored in vertical mode.
Beware of TeX scanning for numbers or dimensions (e.g. plus) further than you
wanted:
\def\a{\penalty200} vs \def\a{\penalty200\relax}
... \a 0 ...
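The difference can be observed with a count register instead of a penalty. This is a minimal sketch for plain TeX; `\demo` and `\setdemo` are hypothetical names used only for illustration.

```tex
\newcount\demo            % hypothetical scratch register
\def\setdemo{\demo=20}
\setdemo 0                % the 0 is absorbed into the number: \demo is now 200
\message{\the\demo}       % writes 200
\def\setdemo{\demo=20\relax}
\setdemo 0                % \relax ends the number: \demo is 20, the 0 is typeset
\message{\the\demo}       % writes 20
\bye
```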
\def\isempty#1{%
\ifx\end#1\end
% empty
\else
% not empty
\fi
}
\def\frac#1#2{{#1 \over #2}}
$\frac{11}{5}$
$\frac12$
\def\skip{\csname \ifhmode h\else v\fi skip\endcsname}
\let\bgroup={
\let\egroup=}
\let\endgraf=\par
\setkv{a}{b}
\getkv{a}
\def\setkv#1#2{%
\expandafter\def\csname kv:#1\endcsname{#2}%
}
\def\getkv#1{%
\csname kv:#1\endcsname
}
% more elegant, but has problem
\def\setkv#1{%
\expandafter\def\csname kv:#1\endcsname
}
\setkv{a}b % doesn't work
\def\a{}
\addto\a{hello} % hello
\addto\a{ world} % hello world
\addto\a{!} % hello world!
\def\a{}
\def\a{\a hello} % WRONG: \a recursively refers to itself, we want to expand it first
\def\a{}
\expandafter\def\expandafter\a\expandafter{\a hello}
% \def\a{hello}
\expandafter\def\expandafter\a\expandafter{\a world!}
% \def\a{helloworld!}
\def\addto#1#2{\expandafter\def\expandafter#1\expandafter{#1#2}}
\def\addto#1#2{\def#1{#1#2}} % WRONG
\def\bold#1{{\bf #1}}
\def\italic#1{{\it #1}}
\ifnum1>0 \bold \else \italic \fi {word} % bad
\ifnum1>0 \expandafter \bold \else \expandafter \italic \fi {word} % good
\afterfi and \let\next tricks
\def\afterfi#1#2\fi{\fi#1}
\ifnum1>0 \afterfi \bold \else \afterfi \italic \fi {word} % good
\ifnum1>0 \let\next=\bold \else \let\next=\italic \fi \next{word}
\cite[a,b,c]
\def\cite[#1]{\citeimpl#1,,}
\def\citeimpl#1,{%
\ifx\end#1\end
% terminating condition reached
\else
[#1]% print citation
\expandafter\citeimpl % loop and continue
\fi
}
\cite[a, b, c]
\def\cite[#1]{\citeimpl#1,,,}
\def\citeimpl#1#2,{%
\ifx#1,%
% terminating condition reached
\else
[#1#2]%
\expandafter\citeimpl
\fi
}
\loop
\message{\number\MyCount}
\advance\MyCount by 1
\ifnum\MyCount<100 \repeat
\def\loop#1\repeat{\def\body{#1}\iterate}
\def\iterate{%
\body
\let\next=\iterate
\else
\let\next=\relax
\fi
\next
}
\loop
\message{\number\MyCount}
\advance\MyCount by 1
\ifnum\MyCount<100 \repeat
\def\loop#1\repeat{\def\body{#1}\iterate}
\def\iterate{\body \iterate\fi}
\def\iterate{\body \expandafter\iterate\fi}
document.tex
= plain TeX with:
\verbatim
Text printed verbatim
\endverbatim
\def\verbatim#1\endverbatim{...}
\expandafter\def\expandafter\verbatim\expandafter#\expandafter1\string\endverbatim{...}
% plain TeX
\parindent=1em
% LaTeX
\def\setlength#1#2{#1#2\relax}
\setlength{\parindent}{1em}
% plain TeX
{\bf bold text}
% LaTeX
\def\textbf#1{{\bf #1}}
\textbf{bold text}
\begin{environment}
\end{environment}
\def\begin#1{%
\csname#1\endcsname
}
\def\end#1{%
\csname end#1\endcsname
}
\environment
\endenvironment
% LaTeX
\rule{1cm}{0.4pt}
% TeX
\hrule height 0.4pt width 1cm
\def\makeatletter{\catcode`\@=11\relax}
\def\makeatother{\catcode`\@=12\relax}
Build: marp index.md -o index.html --html=true
Watch: marp index.md -o index.html --html=true -w
1. Input processor (transforms input text to tokens)
2. Expansion processor (expands macros, conditionals, etc.)
3. Execution processor (performs commands, i.e. changes state, makes assignments, adds to the horizontal/vertical list)
4. Visual processor (kerning, ligatures, line breaks, page breaks)
5. Backend
`\isempty` scans one argument, and we want to do different things depending on whether it is empty or not. On the expansion level TeX has a few `\if` primitives, but they mostly compare two consecutive tokens for equality. We can still use them with a trick: `\ifx` checks whether two tokens are equal, so we can write `\ifx\something<maybe empty>\something`. If the argument turns out to be empty, `\ifx` sees `\something\something` as the two following tokens, and we go to the "true branch". (Going to the true branch means that when TeX later encounters `\else`, it keeps reading tokens but ignores everything until `\fi`; we are still on the expansion level, not in a true programming language.) If the argument is not empty (e.g. it is Hello), TeX sees `\ifx\something Hello\something`; `\something` is not the same token as `H`, so TeX ignores everything up to `\else` and "executes" the else branch. Conveniently, this also means that we don't care whether the argument is a single token or not, even though we only have a single-token comparison.
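As a minimal sketch of the trick in plain TeX, the macro below reports the result with `\message`. The name `\isemptydemo` is hypothetical, and the sketch assumes the argument does not itself begin with the sentinel token `\end`; the sentinel is only compared, never executed.

```tex
\def\isemptydemo#1{%
  \ifx\end#1\end
    \message{[empty]}%
  \else
    \message{[not empty]}%
  \fi
}
\isemptydemo{}      % writes [empty]
\isemptydemo{Hello} % writes [not empty]
\bye
```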
You have probably used the `\frac` macro for typesetting fractions in LaTeX before. On the primitive level, TeX uses the `\over` primitive, which expects the numerator to precede it and the denominator to follow it. Weirdly enough, Knuth made a lot of the math internals much more complicated (there is an extra pass to figure out the sizes) just to be able to say `\over`, which is more natural, but with `\frac` we lose that benefit. As the `\frac` macro takes two undelimited arguments, each of them captures either contents wrapped in `{}` or just one token. This means that for a simple fraction with single-digit numbers, we can omit the braces.
There is the `\vskip` primitive for inserting glue into the current vertical list, and `\hskip` for inserting glue into the current horizontal list. If we want a single macro called `\skip` that expands to `\hskip` or `\vskip` depending on the current mode, we need to dynamically construct a control sequence name (`\hskip`, `\vskip`). We can do this by checking the current mode (`\ifhmode`) and emitting `h` or `v` based on it. Finally, to get a dynamic control sequence, we use `\csname` and `\endcsname`. Everything written between the pair is expanded, and the remaining "ASCII" codes (ignoring category codes) are used as the control sequence name.
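A minimal sketch of this in plain TeX follows. It uses the hypothetical name `\modeskip` rather than `\skip`, because `\skip` is already a TeX primitive (it accesses glue registers) and redefining it can break other macros. The `\relax` after each dimension stops TeX from scanning further for `plus` or `minus`.

```tex
\def\modeskip{\csname\ifhmode h\else v\fi skip\endcsname}
First paragraph.\par
\modeskip 1cm\relax          % vertical mode: acts as \vskip 1cm
Two\modeskip 1cm\relax words. % horizontal mode: acts as \hskip 1cm
\bye
```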
Every control sequence has its meaning stored in the table of equivalents. We can copy the meaning with `\let`. This allows us to create true copies of primitives and macros, but also of other tokens. For example, the `\bgroup` and `\egroup` commands are just copies of `{` and `}`, which are the more primitive ways to start and end a group. (Disclaimer: it's actually more involved. `{` and `}` are special syntactically: they always need to be scanned in a balanced way. `\bgroup` and `\egroup` bypass that, as they have the begin-group and end-group meaning semantically, but don't need to be balanced syntactically. There are also `\begingroup` and `\endgroup`, but they only start and end a group; they can't replace `\bgroup` and `\egroup` in e.g. `\hbox\bgroup\egroup`.)
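Because `\bgroup` and `\egroup` need not be balanced syntactically, a group can be opened in one macro and closed in another. A minimal sketch in plain TeX, with the hypothetical names `\startbold` and `\stopbold`:

```tex
% With { and } these two definitions would be rejected as unbalanced.
\def\startbold{\bgroup\bf}
\def\stopbold{\egroup}
\startbold bold text\stopbold{} normal text

% An implicit brace can also delimit box contents.
\hbox\bgroup text in a box\egroup
\bye
```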
Hash tables, which map keys to values, are a very important data structure. We often want them in TeX too, and we can emulate them with dynamically constructed control sequences. We can define macros `\setkv` and `\getkv` that allow us to store and retrieve arbitrary things. To store a value, we define a control sequence named `kv:key`, which stores the `value`. As the name includes a character that normally can't be part of a macro name (i.e. `:`), it is unlikely to collide with anything the user writes. To generate the dynamic control sequence name we of course use `\csname`. But we have a problem: we want to say `\def\dynamiccontrolsequence{something}`, and `\def` is a command which in the main loop of TeX just says "give me the next token, assert that it is a control sequence, then scan the matching template and the text in braces (`{}`)". If we write `\def\csname...`, TeX will just redefine `\csname`. We need to construct the dynamic control sequence name first, so that `\def` already sees it. For this we want to change the expansion order. The most important primitive in that regard is `\expandafter`. `\expandafter` reads a token temporarily, leaves it unexpanded, and reads a second token, which it expands. So in this case, `\expandafter\def\csname` first creates the dynamic control sequence name, and only then does `\def` run and see the right thing.
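Put together as a small plain TeX file, the two macros can be exercised like this. It is only a sketch: the keys `author` and `system` are made up for illustration, and looking up a key that was never set leaves a control sequence whose meaning is `\relax`.

```tex
\def\setkv#1#2{\expandafter\def\csname kv:#1\endcsname{#2}}
\def\getkv#1{\csname kv:#1\endcsname}
\setkv{author}{Knuth}
\setkv{system}{TeX}
\message{\getkv{system} by \getkv{author}} % writes: TeX by Knuth
\bye
```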
Another case where we need to be careful with expansion is when we want to, for example, define a macro that appends some tokens to an existing macro. Let's break it down with a simpler example, where we just want to add some text to the macro `\a`. The first thing we might try is `\def\a{\a hello}`: define `\a` to be `\a` followed by something. But this doesn't really work; we would define `\a` to mean the token `\a` followed by a few letter tokens. We would get a recursive definition of a macro that causes a stack overflow on expansion. What we really want is to define `\a` to be the _expanded value_ of `\a` followed by what we want to add. But `\def` doesn't expand while it scans the meaning of the macro. We can use `\expandafter` to solve this. Just before `\def` we start a chain of `\expandafter`s that reaches the `\a` that we want to expand. Thanks to this, by the time `\def` runs, it already sees the expanded meaning of `\a` followed by what we want to add. To create the generic `\addto` macro that we wanted initially, we just make `\a` and the text to add into parameters of `\addto`.
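One way to check that the appended definition contains the expanded text, and not a reference to itself, is to print it with `\meaning`. A minimal sketch in plain TeX:

```tex
\def\addto#1#2{\expandafter\def\expandafter#1\expandafter{#1#2}}
\def\a{}
\addto\a{hello}
\addto\a{ world}
\message{\meaning\a} % writes: macro:->hello world
\bye
```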
Say we have a macro called `\bold` that receives an argument and typesets it in bold, and an `\italic` macro that typesets its argument in italic. Say that, based on some condition, we want to make the text between braces either bold or italic. The first thing we may come up with is to just use `\bold` and `\italic` in the respective branches. But the problem is that both just scan the next token, which will be `\else` or `\fi` respectively, completely confusing TeX. Before `\bold` and `\italic` execute, we want them to see just `{word}` after them, so that they scan the correct thing as their argument. We will use the fact that `\if`, `\else` and `\fi` work on the expansion level. Expansion of `\else` just keeps ignoring tokens until it reaches `\fi`. Expansion of `\fi` just pops the `\if` from the stack. If we use `\expandafter` to expand `\else` and `\fi` before `\bold` and `\italic` take effect, the condition first disappears from the input completely, and `\bold` and `\italic` correctly scan the text following the condition as their argument.
Another trick to achieve a similar thing (insert a macro only after `\fi`) is to introduce a macro that eats everything up to the `\fi`, ignores it, executes the `\fi` to get rid of the condition on the stack, and then inserts the thing that we wanted to carry out of the condition. And finally, yet another way to carry something out of the condition is to define a temporary macro and use it after the condition.
Say we want to implement a macro called `\cite`, which gets a list of references that we want to resolve and print a nice label for. We don't care much about the typesetting part, only about how to parse a comma-separated list in square brackets. Unlike `{` and `}`, square brackets are not special to TeX, so we can just define a macro that matches an opening square bracket followed by an argument delimited by the closing square bracket. We won't use any predefined loop macros; we will hand-code the recursion ourselves. A single step of the recursion processes one element. To get one element from the input, we can define a macro that reads an argument delimited by a comma. As we need all elements to be delimited by a comma, we must add a comma after our initial `\cite` argument before even calling the recursive macro. But as always with recursion, we must come up with a base case. For that, we add another comma after the initial argument. This means that our recursive macro will read one extra element, which will be empty. As we already know how to check for an empty argument, it serves nicely as our terminating condition. In the recursive step, we just need to print the citation somehow (not really important for us here) and invoke the macro recursively. But again we have a problem: `\citeimpl` immediately tries to read up to the first comma, but the `\fi` is in the way, and we need to get rid of it. We already know how to do that; in this case `\expandafter` works nicely. The nice thing about our macro is that it is _tail recursive_. Because the `\expandafter` gets rid of the `\fi` on the input stack, our recursive macro calls itself as the very last thing. It doesn't need any extra stack space and is quite efficient. Most similar loop macros in TeX are tail recursive, usually for the same reason ours is: we need to read past the `\fi`.
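The recursion can be traced by writing each element to the log instead of typesetting it. This is a minimal sketch in plain TeX; the names `\parselist` and `\parseitem` are hypothetical stand-ins for `\cite` and `\citeimpl`, and elements are assumed not to begin with the sentinel `\end`.

```tex
\def\parselist[#1]{\parseitem#1,,}
\def\parseitem#1,{%
  \ifx\end#1\end
    % empty element: base case, stop
  \else
    \message{[#1]}%
    \expandafter\parseitem % \fi is expanded away before the next element is read
  \fi
}
\parselist[a,b,c] % writes [a], [b] and [c] in turn
\bye
```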
We can define a more user-friendly variant of the `\cite` macro. The problem with the previous version is that it scans each element as just everything up to the next comma. So if some elements start with leading whitespace, we treat it as part of the citation name. This is not nice, and we can get rid of the leading whitespace with a trick. The trick is that reading an undelimited argument (i.e. a single token or text surrounded by braces) skips whitespace until it finds the argument. So if we read the input as an undelimited argument, we get rid of leading whitespace automatically. But that means that we don't read up to the next comma, only one letter. To solve the problem, we can scan both an undelimited argument and a delimited argument. The first skips leading whitespace and scans the first letter. The second reads the rest of the element up to the comma. The only thing we have to be careful about is that, to reconstruct the element, we must combine `#1` and `#2`. Our terminating condition also has to change slightly: now our macro will not read an empty argument, but will read whatever the next token is into the first argument, so it will read one of the extra commas that we insert. So in this case we need to insert one more comma into the input, and we check the base case by comparing the first argument to a comma.
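A minimal sketch of the whitespace-tolerant variant in plain TeX, again with the hypothetical names `\parselist` and `\parseitem` and with `\message` in place of real typesetting. It assumes a non-empty list; spaces before a comma are still kept as part of the element.

```tex
\def\parselist[#1]{\parseitem#1,,,}
\def\parseitem#1#2,{%
  \ifx#1,%
    % #1 grabbed one of the extra commas: base case, stop
  \else
    \message{[#1#2]}%
    \expandafter\parseitem
  \fi
}
\parselist[a, bc, d] % writes [a], [bc] and [d] in turn
\bye
```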
In plain TeX, loops are implemented with a macro called `\loop`. It scans the body of the loop, i.e. everything up to `\repeat`, which needs to end with an `\if` condition. It looks like special syntax, but it really isn't. Under the hood, `\loop` just defines `\body` as the body of the loop including the condition at the end, and `\iterate` calls itself recursively, using the `\next` trick, until the condition fails.
In this case, however, we could leave out the `\next` trick and just have simple recursion. But it isn't tail recursion -- a `\fi` will be left on the input stack for each iteration. The `\next` variant is tail recursive, as is the `\expandafter` variant.
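For reference, a complete loop that can be run as is looks like the following. It is a minimal sketch using plain TeX's own `\loop` and `\repeat`, with a small bound so the output is easy to check.

```tex
\newcount\MyCount
\MyCount=0
\loop
  \message{\number\MyCount}%
  \advance\MyCount by 1
\ifnum\MyCount<3 \repeat
% The log shows 0, 1 and 2; the loop stops once \MyCount reaches 3.
\bye
```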
The demo document shows some features that we didn't have time to cover:
1. Some examples of typesetting commands like `\section`. There we want to handle the numbering of sections, set up penalties to encourage a page break before a section, and forbid a page break between the section title and the first paragraph.
2. Marks: notes about what the current chapter is, which can be used to typeset the name of the current chapter on a page.
3. Footnotes and figures, which are "inserts": floating elements for which TeX tries to reserve space on the page, and which the output routine ultimately decides where to place. The simplest examples are `\topinsert`, which is placed at the top of the current page if it fits or otherwise at the top of the next page, and `\footnote`, which typesets a footnote at the bottom of the current page.
4. Writing to and reading from files. To read a file in full, we can just use `\input` with the file name. But writes with `\write` are, surprisingly, not performed at the point where TeX's main loop executes them. Instead, they are added as a node that performs the write as part of `\shipout`. This means that writes happen on finalized pages, so e.g. the page number is usually known at that time.
5. The "at shipout" property of writes is important for implementing features like a table of contents. As TeX ships out pages sequentially and in a single pass, it can't typeset a table of contents at the beginning of the document, because it doesn't yet know what the chapters are. Writes provide a solution: write the information about the chapters to a helper file in the first pass, and read that information in the second pass, which is then able to typeset the table of contents. The write with the section information has to be very careful with expansion, as it needs some parts expanded immediately (like the section number register, since it can change multiple times on a page), while other registers must be read only when the write happens (like the page number, which is correct only at that point). A typical problem with these writes to files is also special characters (control sequences and e.g. `~`) that would otherwise be expanded, which we don't want. For that we escape them with `\detokenize`, which turns all characters into "other" tokens (catcode 12), except spaces, which are left as spacers (catcode 10).
6. Leaders: repeated elements, used to typeset the "leading" dots in a table of contents, but also the header lines. Internally they are represented like glue, but instead of being just whitespace, they are realized by repeating their contents.
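The mixed expansion timing for table-of-contents writes could be sketched as follows. This is not the demo document's code: `\tocfile`, `\secno` and `\demosection` are hypothetical names, and the sketch assumes a plain TeX format on an e-TeX-enabled engine, since `\detokenize` is not part of Knuth's original TeX.

```tex
\newwrite\tocfile
\immediate\openout\tocfile=\jobname.toc
\newcount\secno
\def\demosection#1{%
  \advance\secno by 1
  % \edef expands \the\secno and \detokenize right now;
  % \noexpand keeps \folio unexpanded until \shipout performs the \write.
  \edef\next{\write\tocfile{\the\secno: \detokenize{#1}, page \noexpand\folio}}%
  \bigskip
  % The write node travels inside the heading line, so it is shipped out
  % on the same page as the heading.
  \noindent\next{\bf \the\secno. #1}\par
  \nobreak\medskip
}
\demosection{First steps}
Some text.
\bye
```

After a run, the `.toc` file should contain one line per section of the form `1: First steps, page 1`.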
Another interesting thing to implement in TeX is a verbatim environment, which prints a piece of text verbatim as it appears in the source code, usually in a monospace font. While TeX normally ignores duplicate spaces, and line breaks in the source code are more or less treated like spaces, in a verbatim environment we need each space and end-of-line character to have its desired effect. A verbatim environment thus often boils down to setting the category codes of everything to 12, except for space and newline, which need a bit of special handling, as they are significant for TeX not only semantically but also syntactically. But the challenge is: when we set the category codes of all ASCII characters to 12, how can we scan until `\endverbatim`? Our definition tries to match up to a single token, the `endverbatim` control sequence. But since we set the category code of the backslash to 12, it will no longer create control sequences. So we actually need to define `\verbatim` to be delimited not by the control sequence `\endverbatim`, but by the twelve category code 12 tokens `\`, `e`, `n`, etc. For this we can use `\string` to get the control sequence `endverbatim` expanded into just category 12 tokens. But of course, to have them there at the right time, we need an `\expandafter` sequence.
In plain TeX, assignments to registers are done directly, often with the optional equals sign, which increases readability. In LaTeX there is, for example, a macro called `\setlength` that receives the register and the dimension as two parameters and expands to putting them one after the other, which executes the assignment. Finally, the `\relax` makes sure that nothing more is scanned as part of the dimension.
In plain TeX, a temporary change of font to bold is achieved with a local group. The font is set to bold, the assignment is undone at the end of the group, and all the text in between is typeset in bold. LaTeX hides the underlying concept of the group and just exposes a command that makes its argument bold. It's also a bit less efficient, as it needs to scan its argument.
LaTeX environments are marked with `\begin` and `\end`. Behind the scenes, these just translate to the control sequence named after the environment for the start, and to the control sequence with the name prefixed by `end` for the end. Usually the environment definitions also start a group internally, so all assignments are local.
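A minimal sketch of this mechanism that can be run in plain TeX is shown below. It uses the hypothetical names `\Begin` and `\End`, because redefining the primitive `\end` would break `\bye`, and it adds the group directly in the two macros. The `loud` environment is made up for illustration.

```tex
\def\Begin#1{\begingroup\csname#1\endcsname}
\def\End#1{\csname end#1\endcsname\endgroup}
% An "environment" is just a pair of control sequences.
\def\loud{\bf}
\def\endloud{}
\Begin{loud}bold text\End{loud} normal text
\bye
```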
Instead of using the primitives `\hrule` or `\vrule` with keyword arguments, LaTeX uses `\rule` with two macro arguments.
The famous `\makeatletter` and `\makeatother` macros from LaTeX just change the category code of the at sign (`@`), so that it either can or cannot be scanned as part of a control sequence name. Usually these macros are used to temporarily access "internal" definitions, which use `@` in their names to avoid accidental redefinition.
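In plain TeX, where `@` normally has category code 12, the effect can be sketched like this. The names `\my@helper` and `\public` are hypothetical.

```tex
\def\makeatletter{\catcode`\@=11\relax}
\def\makeatother{\catcode`\@=12\relax}
\makeatletter
\def\my@helper{internal text} % @ is a letter here, so this is one control sequence
\def\public{\my@helper}
\makeatother
\public % typesets: internal text
% Outside the block, \my@helper would be read as \my followed by the characters @helper.
\bye
```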