>From BNB@math.ams.com Mon Oct 23 08:40:43 1989
Return-Path: <root@ese>
Date: Sun 22 Oct 89 10:51:40-EST
From: bbeeton <BNB@math.ams.com>
Subject: Knuth's description of new TeX and Metafont
To: TeX-implementors@math.ams.com
Mail-System-Version: <VAX-MM(229)+TOPSLIB(132)+PONY(228)@MATH.AMS.COM>
Sender: BNB@vax01.ams.com

Date:	  22 October 89				Message No:	019

To:	  TeX implementors and distributors

From:	  Barbara Beeton

Subject:  Knuth's description of new TeX and Metafont


Knuth has sent me the article below for TUGboat and to disseminate to
the world.  You are welcome to distribute this further, but if you do,
please mark it "draft", and note that it will appear in TUGboat.

I have successfully retrieved the main TeX README file from the directory
/pub/tex at labrea.stanford.edu and scanned the contents of the TeX
directories.  Nearly everything was brought from Score with the original
timestamp more-or-less intact.  The directory structure is changed somewhat
from Score, but it is logical, and should be easy enough to navigate.
However, nothing new seems to have appeared, in particular the new errata
(it will be called "errata.fiv") and the additions to the bug files (which
will be called tex82.89 and mf84.89).  The errata and TeX bug files will
be in the subdirectory /pub/tex/tex and the MF bug file in /pub/tex/mf .
The README file is included for information.

I will send a notice as soon as I see that the new .WEB files and
documentation have been installed at labrea.  However, I will be away
from tomorrow through 1 November, so you won't hear anything until at
least the beginning of November.


########################################################################

[ README file from  /pub/tex at labrea.stanford.edu ]

This is the official TeX distribution directory at Stanford University.
Most of the files here were moved from Score.Stanford.EDU in September
1989 when that host was retired from service.  File modification times
were preserved as much as possible.

All of the files here are available for anonymous FTP by anyone and
may be freely redistributed.  Some are copyrighted; see the files
themselves for more details.  If you have any questions or comments,
please send mail to "tex@labrea.stanford.edu".

(Note: at present, things are still somewhat unsettled, so some files
may be missing or have problems.  Some of the directories are being
reorganized.  By the end of October 1989 it should be more stable.)

Here is a short description of what is contained in the
subdirectories.  For further information, refer to the file README
(or a similar name) in each subdirectory.

amsfonts	AMS Cyrillic and math symbols fonts

amstex		The AMSTeX macros

bibtex		The BibTeX program and related files

cm		Metafont sources for the Computer Modern fonts

fonts		TFM files for standard (and some non-standard) fonts

gf		GF font files (for Imagen and LaserWriter)

imagen		Various files to support Imagen printers

latex		LaTeX macros, font sources, related files

lib		Input files read by TeX, Metafont and other programs

ln03		Various files to support LN03 printers

mf		Metafont source code and documentation

mfware		Metafont utilities

misc		Miscellaneous files (temporary)

tex		TeX source code and documentation

texware		TeX utilities

texhax		TeXhax archives

tugboat		Files for things mentioned in TUGboat

unix		Unix TeX distribution from University of Washington

web		The WEB system


########################################################################

%	Knuth's article

\font\logo=logo10 % font used for the METAFONT logo
\font\logosl=logosl10 % font used for slanted METAFONT logo

\def\MF{{\logo META}\-{\logo FONT}}
\def\MFbook{{\sl The {\logosl METAFONT}\kern1pt book}}
\def\TeX{T\hbox{\hskip-.1667em\lower.424ex\hbox{E}\hskip-.125em X}}
\def\ldt{\mathinner{\ldotp\ldotp}}

\line{\bf The New Versions of \TeX\ and \MF\ \hfill by Donald E. Knuth}
\bigskip
\noindent
For more than five years I held firm to my conviction that a stable system
was far better than a system that continues to evolve. But during the TUG
meeting at Stanford in August, 1989, I~was persuaded to make one last set of
changes, in order to bring \TeX\ and \MF\ to a state of completion consistent
with their overall philosophy and goals.

The main reason for the changes was the fact that I~had guessed wrong about
7-bit character sets versus 8-bit character sets. I~believed that standard text
input would continue indefinitely to be confined to at most 128~characters,
since I~did not think a keyboard with 256~different outputs would be
especially efficient. Needless to say, I~was proved wrong, especially by
developments in Europe and Asia. As soon as I~realized that a text formatting
program with 7-bit input would rapidly begin to seem as archaic as the 6-bit
systems we once had, I~knew that a fundamental revision was necessary.

But the 7-bit assumption pervaded everything, so I needed to take the programs
apart and redo them thoroughly in 8-bit style. This put \TeX\
onto the operating table and under the knife
for the first time since 1984, and I~had a final
opportunity to include a few new features that had occurred to me or been
suggested by users since then.

The new extensions are entirely upward compatible with previous versions
of \TeX\ and \MF\ (with a few small exceptions mentioned below).
This means that error-free inputs to the old \TeX\ and \MF\ will still
be error-free inputs to the new systems, and they will still produce the
same outputs.

However, anybody who dares to use the new extensions will be unable to get
the desired results from old versions of \TeX\ and \MF\null. I~am therefore
asking the \TeX\ community to update all copies of the old versions
as soon as possible. Let us root out and destroy the obsolete 7-bit systems,
even though we were able to do many fine things with them.

In this note I'll discuss the changes, one by one; then I'll describe
the exceptions to upward compatibility.

\bigskip
\noindent
{\bf 1. The character set.}
Up to 256 distinct characters are now allowed in input files. The codes that
were formerly limited to the range $0\ldt 127$ are now in the range
$0\ldt 255$. All characters are alike; you are free to use any character
for any purpose in \TeX, assigning appropriate values to its
{\tt{\char'134}catcode},
{\tt{\char'134}mathcode},
{\tt{\char'134}lccode},
{\tt{\char'134}uccode},
{\tt{\char'134}sfcode},
and
{\tt{\char'134}delcode}.
Plain \TeX\ initializes these code values for characters above~127 just as
it initializes the codes for ordinary punctuation characters
like~`{\tt{\char'041}}'. 

There's a new convention for inputting an arbitrary 8-bit character
to \TeX\ when you can't necessarily type~it: The four consecutive
characters 
{\tt{\char'136\char'136}}$\alpha\beta$, where $\alpha$ and~$\beta$ are
any of the ``lowercase hexadecimal digits'' 
{\tt{0}},
{\tt{1}},
{\tt{2}},
{\tt{3}},
{\tt{4}},
{\tt{5}},
{\tt{6}},
{\tt{7}},
{\tt{8}},
{\tt{9}},
{\tt{a}},
{\tt{b}},
{\tt{c}},
{\tt{d}},
{\tt{e}},
or
{\tt{f}},
are treated by \TeX\ on input as if they were a single character with
specified code digits. For example, 
{\tt{\char'136\char'136}80}
gives character code~128; the entire character set
is available from
{\tt{\char'136\char'136}00}
to
{\tt{\char'136\char'136}ff}.
The old convention discussed in Appendix~C, under which character~0 was
{\tt{\char'136\char'136\char'100}},
character~1 (control--A) was
{\tt{\char'136\char'136}A},
\dots,
and character~127 was
{\tt{\char'136\char'136}?},
still works for the first 128~character codes, except that the
character following
{\tt{\char'136\char'136}}
should not be a lowercase hexadecimal digit when the immediately following
character is another such digit.
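
As a hedged illustration (the particular codes are chosen here only as
examples), the arithmetic works out as follows:
{\tt{\char'136\char'136}e9}
denotes character code $14\times16+9=233$, and
{\tt{\char'136\char'136}5a}
denotes code $5\times16+10=90$, the letter~Z. Under the old convention
{\tt{\char'136\char'136}M}
is still character~13, since M~is not a lowercase hexadecimal digit; and
{\tt{\char'136\char'136}a.}
is still `{\tt{\char'041}}' followed by a period, since the character
after the~{\tt a} is not such a digit.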

The existence of 8-bit characters has less effect
in \MF\ than in \TeX, because \MF's character classes are built in to each
installation. The normal set of 95~printing characters described on
page~51 of 
\MFbook\
can be supplemented by extended characters as discussed on page~282, but this
is rarely done because it leads to problems of portability. \MF's 
{\bf char} operator is now redefined to operate modulo~256 instead 
of modulo~128.
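
As a worked example (the number~200 is chosen here only for
illustration): since $200 \bmod 128 = 72$, the expression {\tt{char200}}
used to yield the same one-character string as {\tt{char72}}, namely
{\tt{"H"}}; it now yields character~200 itself, and it is
{\tt{char456}} that agrees with {\tt{char200}}, since
$456 \bmod 256 = 200$.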

\bigskip\noindent
{\bf 2. Hyphenation tables.}
Up to 256 distinct sets of rules for hyphenation are now allowed in \TeX.
There's a new integer parameter called
{\tt{\char'134}language},
whose current value specifies the hyphenation convention in force. If
{\tt{\char'134}language}
is negative or greater than~255, \TeX\ acts as if 
$\hbox{\tt{\char'134}language}=0$.

When you list hyphenation exceptions with \TeX's 
{\tt{\char'134}hyphenation}
primitive, those exceptions apply to the current language only. Similarly,
the
{\tt{\char'134}patterns}
primitive tells \TeX\ to remember new hyphenation patterns for the current
language; this operation is allowed only in the special ``initialization''
program called {\tt INITEX}\null. Hyphenation exceptions can be added at any
time, but new patterns cannot be added after a paragraph has been typeset.

When \TeX\ reads the text of a paragraph, it automatically inserts
``whatsit nodes'' into the horizontal list for that paragraph whenever
a character comes from  a different
{\tt{\char'134}language}
than its predecessor. In that way \TeX\ can tell what hyphenation
rules to use on each word of the paragraph even if you switch 
frequently back and forth among many different languages.

The special whatsit nodes are inserted automatically in unrestricted horizontal
mode (i.e.,  when you are creating a paragraph, but not when you are
specifying the contents of an hbox). You can insert a special whatsit
yourself in restricted horizontal mode by saying
{\tt{\char'134}language}$\langle$number$\rangle$.
This is needed only if you are doing something tricky, like unboxing some
contribution to a paragraph.
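
A minimal sketch, assuming a format in which the number~1 has been
assigned to some second language (that numbering, and the sample word,
are hypothetical):

\halign{\qquad\qquad{\tt{#}}\hfil\cr
{\char'134}language=1 {\char'134}hyphenation{\char'173}man-u-script{\char'175}\cr
{\char'134}language=0\cr}

\noindent
The exception for `manuscript' is then recorded for language~1 only;
text typeset after the return to language~0 is hyphenated by the rules
of language~0.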

\bigskip\noindent
{\bf 3. Hyphenated fragment control.}
\TeX\ has new parameters 
{\tt{\char'134}lefthyphenmin}
and
{\tt{\char'134}righthyphenmin},
which specify the smallest word fragments that will appear at the beginning
or end of a word that has been hyphenated. Previously the values
{\tt{\char'134}lefthyphenmin=2}
and
{\tt{\char'134}righthyphenmin=3}
were hard-wired into \TeX\ and impossible to change. Now plain \TeX\
format supplies the old values, which are still recommended for most
American publications; but you can get more hyphens by decreasing these
parameters, and you can get fewer hyphens by increasing them. If the sum of
{\tt{\char'134}lefthyphenmin}
and
{\tt{\char'134}righthyphenmin}
is~64 or more, all hyphenation is suppressed. (You can also suppress
hyphenation by using a font with
{\tt{\char'134}hyphenchar=-1},
or by switching to a 
{\tt{\char'134}language}
that has no hyphenation patterns or exceptions.)
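
For instance, a format designer who wants fewer hyphens might say
(the values~3 and~4 are merely illustrative)
$$\hbox{\tt{\char'134}lefthyphenmin=3 {\char'134}righthyphenmin=4}$$
so that at least three letters must precede a hyphen and at least four
must follow~it.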

\bigskip\noindent
{\bf 4. Smarter ligatures.}
Now here's the most radical change.
Previous versions of \TeX\ had only one kind of ligature, in which two
characters like~`f' and~`i' were changed into a single character like~`fi'
when they appeared consecutively. The new \TeX\ understands much more
complex constructions by which, for example, we could change
an~`i' following~`f' to a dotless~`\i' while the~`f' remains
 unchanged:~`f\i'.

As before, you get ligatures only if they have been provided in the font
you are using. So let's look at the new features of \MF\ by which
enhanced ligatures can be created. A~\MF\ programmer can specify a
``ligature/kerning program'' for any character of the font being
created. If, for example, the~`fi' combination appears in font
position~12, the replacement of~`f' and~`i' by~`fi' is specified by
including the statement
$$\hbox{\tt{"i"~=:~12}}$$
in the ligature/kerning program for {\tt{"f"}}; this is \MF's present
convention.

The new ligatures allow you to retain one or both of the original characters
while inserting a new one. Instead of {\tt{=:}} you can also write
{\tt{\char'174}=:} if you wish to retain the left character, or
{\tt{=:{\char'174}}} if you wish to retain the right character,
or {\tt{\char'174}=:{\char'174}} if you want to keep them both.
For example, if the dotless~\i\ appears in font position~16, you can
get the behavior mentioned above by having
$$\hbox{%
{\tt{"i" {\char'174}=: 16}}
}$$
in f's program.

There also are four additional operators
$$\hbox{%
{\tt{\char'174}=:{\char'076}},\qquad
{\tt{=:{\char'174\char'076}}},\qquad
{\tt{\char'174}=:{\char'174\char'076}},\qquad
{\tt{\char'174}=:{\char'174\char'076\char'076}},
}$$
where each {\tt\char'076} tells \TeX\ to shift its focus one position
to the right. For example, if~f and~i had been replaced by~f
and dotless~\i\ as above, \TeX\ would begin again to execute f's
ligature/kern program, possibly inserting a kern before the dotless~\i,
or possibly changing the~f to an entirely different character, etc.
But if the instruction had been
$$\hbox{%
{\tt{"i" {\char'174}=:{\char'076} 16}}
}$$
instead, \TeX\ would turn immediately to the ligature/kern program for
characters following character~16 (the dotless \i);
no further change would be made between~f and~\i\ even if the font
had something specified there.
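
The four basic operators can be summarized as follows, in a sketch that
uses $x$ and~$y$ for the two original characters and $z$ for the
inserted one (these names are only for illustration):
$$\vcenter{\halign{{\tt{#}}\hfil\qquad&#\hfil\cr
=:&$xy$ becomes $z$\cr
{\char'174}=:&$xy$ becomes $xz$\cr
=:{\char'174}&$xy$ becomes $zy$\cr
{\char'174}=:{\char'174}&$xy$ becomes $xzy$\cr}}$$
Appending {\tt\char'076} to one of the last three changes only the
place where \TeX\ resumes its work, not the characters that result.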

\bigskip\noindent
{\bf 5. Boundary ligatures.}
Every consecutive string of `characters' read by \TeX\ in horizontal mode
(after macro expansion) can be called a `word'. (Technically we consider
a `character' in this definition to be either a character whose
{\tt{\char'134}catcode}
is a
letter or otherchar, or a control sequence that has been
{\tt{\char'134}let}
equal to such a character, or a control sequence that has been defined by
{\tt{\char'134}chardef},
or the construction
{\tt{\char'134}char}$\langle$number$\rangle$.)
The new \TeX\ now imagines that there is an invisible ``left boundary
character'' just before every such word, and an invisible ``right boundary
character'' just after it. These boundary characters take effect if the font
designer has specified ligatures and/or kerning between them and the
adjacent letters. Thus, the first or last character of a word can 
now be made to change its shape automatically.

A ligature/kern program for the left boundary character is specified within
\MF\ by using the special label~
{\tt{\char'174\char'174}:}
in a {\bf ligtable} command. A~ligature or kern with the right
boundary character is specified by assigning a value to the new internal
\MF\ parameter 
{\it boundarychar},
and by specifying a ligature or kern with respect to this character.
The
{\it boundarychar\/}
may or may not exist as a real character in the font.

For example, suppose we want to change the first letter of a word from~`F'
to~`ff' if we are doing some olde English. The \MF\ font designer could then
say
$$\hbox{ligtable {\tt{\char'174\char'174}: "F" {\char'174}=: 11}}$$
if character 11 is the `ff'. The same ligtable instruction should
appear in the programs for characters like~( and~` and~`` and~- that can
precede strings of letters; then `{\tt Bassington-French}' will
yield `Bassington-ffrench'.

If the `s' of our font is the pre-19th
century~s that looks like a mutilated~`f', and if we have a modern~`s'
in position~128, we can convert the final~s's as Ben Franklin did by
introducing ligature instructions such as
$$\vcenter{\halign{{\tt{#}}\hfil$\;$&{\tt{#}}\hfil\cr
boundarychar :=&255;\cr
ligtable "s":&255 =:{\char'174} 128,\cr
&"." =:{\char'174} 128,\cr
&"," =:{\char'174} 128,\cr
&")" =:{\char'174} 128,\cr
&"'" =:{\char'174} 128,\cr}}$$
and so on. (A true oldstyle font would also have 
ligatures for 
ss and si and sl and ssi and ssl
and~st; it would be fun to create a Computer Modern Oldstyle.)

The implicit left boundary character is omitted by \TeX\ if you say
{\tt{\char'134}noboundary}
just before the word; the implicit right boundary is omitted if you say
{\tt{\char'134}noboundary}
just after it.
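
For example, with the hypothetical olde English font sketched above,
$$\hbox{\tt{\char'134}noboundary French}$$
would keep the ordinary~`F', because the left boundary ligature is not
applied; similarly, with the Franklin-style ligatures,
$$\hbox{\tt his{\char'134}noboundary}$$
would keep the old-style~s at the end of the word.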

\bigskip\noindent
{\bf 6. More compact ligatures.}
Two or more ligtables can now share common code. To do this in \MF, you
say `{\bf skipto}~$\langle n\rangle$' at the end of one {\bf ligtable}
command, then you say `$\langle n\rangle$::' within another. Such local labels
can be reused; e.g., you can say {\bf skipto}~1 again after {\tt 1::} has
appeared, and this skips to the {\it next\/} appearance of~{\tt 1::}.  There
are 256~local labels, numbered~0 to~255. Restriction: At most 128 ligature
or kern commands can intervene between a {\bf skipto} and its matching label.
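
A minimal sketch of such sharing (the characters, the label number, and
the kern amounts {\tt k\char'043} and {\tt kk\char'043} are hypothetical):
$$\vcenter{\halign{{\tt{#}}\hfil\cr
ligtable "A": "V" kern k{\char'043}, skipto 1;\cr
ligtable "L": "T" kern kk{\char'043}, 1:: "y" kern k{\char'043};\cr}}$$
Here the program for~A consists of its own kern with~V followed by the
kern with~y that it shares with~L.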

The {\tt TFM} file format has been upwardly extended to allow more than 32,500
ligature/kern commands per font. (Previously there was an effective limit
of 256.)

\bigskip\noindent
{\bf 7. Better looking sloppiness.}
There is now a better way to avoid overfull boxes, for people who don't want
to look at their documents to fix unfeasible line breaks manually. Previously
people tried to do this by setting 
{\tt{\char'134}tolerance=10000},
but the result was terrible because \TeX\ would tend to consolidate
all the badness in one truly horrible line. (\TeX\ considers all badness
$\ge10000$ to be infinitely bad, and all these infinities are equal.)

The new feature is a dimension parameter called
{\tt{\char'134}emergencystretch}.
If
{\tt{\char'134}emergencystretch}
is positive and if \TeX\ has been unable to typeset a paragraph without
exceeding the given tolerances, another pass over the paragraph is made
in which \TeX\ pretends that additional stretchability equal to
{\tt{\char'134}emergencystretch}
is present in every line. The effect of this is to scale down all the
badnesses into a range where previously infinite cases become finite; 
\TeX\ will find an optimum solution to the scaled-down problem, and this
will be about as good as possible in a practical sense. (The extra stretching
is not really present; therefore underfull boxes will be reported in warning
messages unless
{\tt{\char'134}hbadness}
is increased.)
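
To see the scaling at work, recall that the badness of a stretched line
is approximately $100r^3$, where $r$~is the ratio of the stretch needed
to the stretch available. In a hypothetical line that needs 10\thinspace pt
of stretch but has only 2\thinspace pt available, $r=5$ and the badness
$100\cdot5^3=12500$ is treated as infinite. With
{\tt{\char'134}emergencystretch=20pt}
(an arbitrary illustrative value) \TeX\ pretends that 22\thinspace pt is
available, so $r=10/22\approx0.45$ and the badness drops to about~9.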

\bigskip\noindent
{\bf 8. Looking at badness.}
\TeX\ has a new internal integer parameter called
{\tt{\char'134}badness}
that records the badness of the box it has most recently constructed.
If that box was overfull,
{\tt{\char'134}badness}
will be 1000000; otherwise
{\tt{\char'134}badness}
will be between~0 and~10000.
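
A minimal sketch of a test for an overfull box (the box contents and
the message text are hypothetical):

\halign{\qquad\qquad{\tt{#}}\hfil\cr
{\char'134}setbox0={\char'134}hbox to 1pt{\char'173}text{\char'175}\cr
{\char'134}ifnum{\char'134}badness{\char'076}10000 {\char'134}message{\char'173}overfull{\char'175}{\char'134}fi\cr}

\noindent
The test succeeds only when the box just constructed was overfull,
since every other box has a badness of at most~10000.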

\bigskip\noindent
{\bf 9. Looking at the line number.}
\TeX\ also has a new internal integer parameter called
{\tt{\char'134}inputlineno},
which contains the number of the line that \TeX\ would show on an error message
if an error occurred now. (This parameter and
{\tt{\char'134}badness}
are ``read only'' in the same way as
{\tt{\char'134}lastpenalty}:
You can use them in the context of a $\langle$number$\rangle$, e.g., by saying
`{\tt{\char'134}ifnum{\char'134}inputlineno{\char'076\char'134}badness ...\
{\char'134}fi}'
or
`{\tt{\char'134}the{\char'134}inputlineno}',
but you cannot set them to new values.)

\bigskip\noindent
{\bf 10. Not looking at error context.}
There's a new integer parameter called
{\tt{\char'134}errorcontextlines}
that specifies the maximum number of two-line pairs of context displayed with
\TeX's error messages (in addition to the top and bottom lines, which always
appear). Plain \TeX\ now sets
{\tt{\char'134}errorcontextlines=5},
but higher level format packages might prefer
{\tt{\char'134}errorcontextlines=1} 
or even
{\tt{\char'134}errorcontextlines=0}.
In the latter case, an error that previously involved three or more pairs of
context would now appear as follows:

\halign{\qquad\qquad{\tt{#}}\hfil\cr
{\char'041} Error.\cr
$\langle$somewhere$\rangle$ The {\char'134}top\cr
\phantom{$\langle$somewhere$\rangle$ The {\char'134}top\ }line\cr
...\cr
l.123 {\char'134}The\cr
\phantom{l.123 {\char'134}The\ }bottom line.\cr}

\noindent
(If 
{\tt{\char'134}errorcontextlines{\char'074}0}
you wouldn't even see the `{\tt{...}}' here.)

\bigskip\noindent
{\bf 11. Output recycling.}
One more new integer parameter completes the set. If
{\tt{\char'134}holdinginserts{\char'076}0}
when \TeX\ is putting the current page into
{\tt{\char'134}box255}
for the 
{\tt{\char'134}output}
routine, \TeX\ will not move anything from insertion nodes into the
corresponding boxes; all insertion nodes will stay in place. Designers of
output routines can use this when they want to put the contents of box~255 back
into the current page to be re-broken (because they might want to change
{\tt{\char'134}vsize}
or something).
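
A minimal sketch of such a two-pass output routine, assuming plain \TeX\
(the new page height {\tt 40pc} is hypothetical, and a real routine
would need more care):

\halign{\qquad\qquad{\tt{#}}\hfil\cr
{\char'134}holdinginserts=1\cr
{\char'134}output={\char'173}{\char'134}ifnum{\char'134}holdinginserts{\char'076}0\cr
\ \ {\char'134}global{\char'134}holdinginserts=0 {\char'134}global{\char'134}vsize=40pc\cr
\ \ {\char'134}unvbox255 {\char'134}penalty{\char'134}outputpenalty\cr
\ \ {\char'134}else {\char'134}plainoutput {\char'134}global{\char'134}holdinginserts=1 {\char'134}fi{\char'175}\cr}

\noindent
On the first pass the insertions stay in place, so the material of
box~255 can be returned to the current page and broken again with the
new
{\tt{\char'134}vsize};
on the second pass the insertions are moved as usual and the page is
shipped out.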

\bigskip\noindent
{\bf 12. Exceptions to upward compatibility.}
The new features of \TeX\ and \MF\ imply that a few things work differently
than before. I~will try to list all such cases here (except when the 
previous behavior was erroneous due to a bug in \TeX\ or \MF\null).
I~don't know of any cases where users will actually be affected, because
all of these exceptions are pretty esoteric.

\medskip $\bullet$\enspace
\TeX\ used to convert the character strings
{\tt{\char'136\char'136}0},
{\tt{\char'136\char'136}1},
\dots,
{\tt{\char'136\char'136}9},
{\tt{\char'136\char'136}a},
{\tt{\char'136\char'136}b},
{\tt{\char'136\char'136}c},
{\tt{\char'136\char'136}d},
{\tt{\char'136\char'136}e},
{\tt{\char'136\char'136}f}
into the respective single characters
{\tt p},
{\tt q},
\dots,
{\tt y},
{\tt{\char'041}},
{\tt "},
{\tt{\char'043}},
{\tt{\char'044}},
{\tt{\char'045}},
{\tt{\char'046}}.
It will no longer do this if the following character is one of the characters
{\tt 0123456789abcdef}.

\medskip $\bullet$\enspace
\TeX\ used to insert no character at the end of an input line if
{\tt{\char'134}endlinechar{\char'076}127}.
It will now insert a character unless
{\tt{\char'134}endlinechar{\char'076}255}.
(As previously,
{\tt{\char'134}endlinechar{\char'074}0}
suppresses the end-of-line character. This character is normally
$13=$ ASCII control--M $=$ carriage return.)

\medskip $\bullet$\enspace
Some diagnostic messages from \TeX\ used to have the notation
{\tt ["80]} \dots {\tt ["FF]}
when referring to characters $128\ldt 255$ (for example when displaying the
contents of an overfull box involving fonts that include such characters).
The notation
{\tt{\char'136\char'136}80} $\ldots$ 
{\tt{\char'136\char'136}ff}
is now used instead.

\medskip $\bullet$\enspace
The expressions
{\tt{char128}} and {\tt{char0}} used to be equivalent in \MF; now
{\bf char} is defined modulo~256 instead. Hence {\tt{char-1}} $=$
{\tt{char255}}, etc.

\medskip $\bullet$\enspace
{\tt INITEX} used to forget all previous hyphenation patterns each time
you specified
{\tt{\char'134}patterns}.
Now all hyphenation pattern specifications are cumulative, and you are not
permitted to use
{\tt{\char'134}patterns}
after a paragraph has been hyphenated by {\tt INITEX}.

\medskip $\bullet$\enspace
\TeX\ used to act a bit differently when you tried to typeset missing
characters of a font. A~missing character is now considered to be a word
boundary, so you will get slightly more diagnostic output when
{\tt{\char'134}tracingcommands{\char'076}0}.

\medskip $\bullet$\enspace
\TeX\ and \MF\ will report different statistics at the end of a run because
they now have a different number of primitives.

\medskip $\bullet$\enspace
Programs that use the string pool feature of {\tt TANGLE} will no longer run
without changes, because the new {\tt TANGLE} starts numbering multicharacter
strings at~256 instead of~128.

\medskip $\bullet$\enspace
{\tt INITEX} programs must now set
{\tt{\char'134}lefthyphenmin=2} and
{\tt{\char'134}righthyphenmin=3}
in order to reproduce their previous behavior.
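
\medskip\noindent
To illustrate the first of these exceptions with a concrete case (chosen
here only as an example): the four characters
{\tt{\char'136\char'136}ab}
used to be read as~`{\tt{\char'041}}' followed by~`{\tt b}'; they now
denote the single character with code $10\times16+11=171$.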

\bye


########################################################################

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%  Character code reference
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
%                       Upper case letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
%                       Lower case letters: abcdefghijklmnopqrstuvwxyz
%                                   Digits: 0123456789
% Square, curly, angle braces, parentheses: [] {} <> ()
%           Backslash, slash, vertical bar: \ / |
%                              Punctuation: . ? ! , : ;
%          Underscore, hyphen, equals sign: _ - =
%                Quotes--right left double: ' ` "
%"at", "number" "dollar", "percent", "and": @ # $ % &
%           "hat", "star", "plus", "tilde": ^ * + ~
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

[ end of message 019 ]
-------