\input texinfo @c -*-texinfo-*-
@c vim: filetype=texinfo tabstop=4 shiftwidth=4
@c %**start of header (This is for running Texinfo on a region.)
@setfilename texindex.info
@settitle Texindex @VERSION@: A program for sorting indices
@c %**end of header (This is for running Texinfo on a region.)
@c Merge the function and variable indexes into the concept index,
@c but without the code font; in the index entries we'll do the
@c font management ourselves. Also merge in the chunk definition
@c and reference entries, which jrweave creates for us.
@c (Ordinarily this would be in the header, but jrweave puts the
@c defindexes later.)
@synindex fn cp
@synindex vr cp
@synindex cd cp
@synindex cr cp
@ifnottex
@ifnotdocbook
@macro ii{text}
@i{\text\}
@end macro
@end ifnotdocbook
@end ifnottex
@ifdocbook
@macro ii{text}
@inlineraw{docbook,\text\}
@end macro
@end ifdocbook
@copying
This @command{texindex} program (version @VERSION@, @UPDATED@) sorts the
raw index files created by @file{texinfo.tex}. (This Texinfo source is
a literate program written using TexiWeb@tie{}Jr., not a user manual.)
Copyright @copyright{} 2014-2024 Free Software Foundation, Inc.
@quotation
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see @url{https://www.gnu.org/licenses/}.
@end quotation
@end copying
@titlepage
@title Texindex
@subtitle version @VERSION@, @UPDATED@
@author Arnold D. Robbins
@author and Texinfo maintainers
@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage
@contents
@ifnottex
@node Top
@top Texindex
This file defines @command{texindex} (version @VERSION@,
@UPDATED@), an @code{awk} program that processes the raw index
files produced by the @file{texinfo.tex} file.
@end ifnottex
@menu
* Preface:: Introductory remarks.
* Requirements:: How the program needs to work.
* High-level organization:: The overall outline.
* Processing records:: Processing each record.
* Necessary stuff:: Copyright, helper functions, i18n.
* Index:: Combined index.
@detailmenu
* Intended audience:: Who should read this document.
* History:: @file{texindex.awk} development history.
* Desired printed output:: What a printed index should look like.
* Texinfo indexing commands:: How to write indexing commands.
* Input form:: The input to @file{texindex.awk}.
* Output form:: The output from @file{texindex.awk}.
* Processing:: Processing the data.
* Assumptions:: Additional assumptions.
* Portability:: Using portable @command{awk}.
* First line:: The first line in the file.
* Initial setup:: Set up variables and constants used
throughout.
* Argument processing:: Processing command line arguments.
* Setup for each input file:: What happens at the start of each file.
* Processing each record:: Pulling apart the fields and storing data.
* Remove duplicates:: Removing duplicate entries.
* Remove leading @code{\entry}:: Remove the leading @code{\entry} command.
* Get the initial:: Get the initial for this entry.
* Set up and name fields:: Pull apart the line.
* Store the data for this line:: Store the data for later access.
* Check for more than one initial:: See if there are multiple initials.
* Splitting the record:: Split the record apart.
* End-of-file sorting and printing:: Sorting the entries for each index.
* Quicksort:: Sorting our input.
* Multilevel comparisons:: Handling multilevel entries.
* Comparing index entries:: The heart of the sorting algorithm.
* Printing the data:: Printing the final results.
* printing top level:: Top level logic.
* printing a single entry:: Handling a single entry.
* Copyright statement:: Copyright info.
* Library functions:: From the @code{gawk} library:
@file{ftrans.awk}, @code{join()}.
* Helper functions:: @code{del_array()},
@code{check_split_null()}, @code{fatal()},
@dots{}
* del_array:: Clearing out an array.
* check_split_null:: Checking if @command{awk} splits on the
null string.
* char_split:: Splitting a line into individual
characters.
* fatal:: Printing fatal errors.
* is@dots{} functions:: Checking character types.
* make_regexp:: Make a regexp to match @TeX{} control
sequences.
* escape:: Escaping backslashes for strings.
* min:: Get the minimum of two numbers.
* I18N:: Internationalization.
@end detailmenu
@end menu
@node Preface
@unnumbered Preface
This file defines @file{texindex.awk}, a reimplementation of the C
program @file{texindex.c}. The purpose is to make the program more
maintainable. As a practical benefit, it also supports correct sorting
and initials for the @samp{@{} and @samp{@}} characters in an index,
and multi-level index entries.
@cindex @{ (left brace), example index entry for
@cindex @} (right brace), example index entry for
@cindex TexiWeb Jr.@: literate programming system
@cindex Texinfo document formatting language
This is a @dfn{literate program}, written using the
@uref{https://github.com/arnoldrobbins/texiwebjr, @sc{TexiWeb Jr.@:}
literate programming system}. The underlying documentation system is
@uref{https://www.gnu.org/software/texinfo, Texinfo}, the GNU
documentation formatting language. A single source file produces the
runnable program, a printable document, and an online document.
@menu
* Intended audience:: Who should read this document.
* History:: @file{texindex.awk} development history.
@end menu
@node Intended audience
@section Intended Audience
You should read this if you want to understand how @file{texindex.awk}
works. You should be familiar with the @command{awk} programming
language.
If you are interested in array indexing, you've come to
the wrong place. @xref{knuth}.
@c Scale figure to 4.5 inches which is good for both smallbook
@c and regular. TeX will scale height also automatically.
@float Figure,knuth
@caption{Indexing (@url{https://xkcd.com/163/})}
@center @image{dek_idx, 5in, , Indexing}
@end float
@node History
@section Development History
This program was originally written in 2014 in order to enable
using left and right braces in index entries, and to provide a program
that would be more easily maintainable going forward than the C version.
In 2019, discussions on the Texinfo bug mailing list around adding
multi-level indexing and ``see'' and ``see also'' entries motivated
reworking the program.
@node Requirements
@chapter Requirements
The input to this program is the list of unsorted index entries produced
by @file{texinfo.tex} when a Texinfo document is processed.
This chapter presents the input to the program, the Texinfo
commands that produce that input, and the expected output from
this program. It also presents some additional notes concerning
the requirements.
@menu
* Desired printed output:: What a printed index should look like.
* Texinfo indexing commands:: How to write indexing commands.
* Input form:: The input to @file{texindex.awk}.
* Output form:: The output from @file{texindex.awk}.
* Processing:: Processing the data.
* Assumptions:: Additional assumptions.
* Portability:: Using portable @command{awk}.
@end menu
@node Desired printed output
@section Where We're Going
Let's first look at the kind of output desired.
A high-quality index has
several types of entries:
@table @asis
@item Single level entries
These are the most common; each entry has text and a list of one
or more page numbers.
@item Double level entries
These entries have subtopics; the top level entry may also have a
page number, or it may not.
@item Triple level entries
These entries have subtopics for the subtopics; the top level
and secondary entries may also have
page numbers, or they may not.
@item ``See @dots{}'' entries
Entries that point at other entries in the index, generally without
any subtopics. ``See'' entries do not have a page number of their own.
@item ``See also @dots{}'' entries
Entries that point at other entries, but often do have
subsequent direct page references of their own.
``See also'' entries are merged with regular entries,
their text coming after the page numbers.
@end table
Here's what they might look like when printed
(apologies in advance
for the use of a constant-width font):
@example
coffee makers . . . . . . . . . . 15 @ii{Single level entry}
electric . . . . . . . 17, 22 @ii{Double level entry}
blue . . . . . . . . 42 @ii{Triple level entry}
pink . . . . . . . . 35 @ii{Another triple level entry}
@end example
The same hierarchy might appear without page numbers:
@example
coffee makers, @ii{Single level entry}
electric, @ii{Double level entry}
blue . . . . . . . . . . 42 @ii{Triple level entry}
pink . . . . . . . . . . 35 @ii{Another triple level entry}
@end example
A ``See'' entry has no page numbers; a ``See also'' entry follows any page numbers:
@example
espresso makers, @ii{See} coffee makers
toasters, . . . . . . . . 42, @ii{See also} coffee makers
@end example
@node Texinfo indexing commands
@section Texinfo Indexing Commands
Texinfo provides a number of different commands for putting entries
into different indices. For discussion we use the @code{@@cindex}
command in the following examples. Of interest to @file{texindex.awk}
is the text of the one-to-three parts of an entry, and how the index
should be sorted. Some examples:
@example
@@cindex coffee makers @ii{One level}
@@cindex coffee makers @@subentry electric @ii{Two levels}
@@cindex coffee makers @@subentry electric @@subentry blue @ii{Three levels}
@end example
Here, the @code{@@subentry} separates the secondary and tertiary parts of
the entry from the primary part.
Additionally, each part may have an @code{@@sortas} clause:
@example
@@cindex coffee makers @@sortas@{Coffee Makers@}
@end example
``See'' and ``See also'' entries look like this:
@example
@@cindex espresso makers @@seeentry@{coffee makers@}
@@cindex toasters @@seealso@{coffee makers@}
@end example
@noindent
Note that there is (or should be) no comma between the primary text and the @code{@@seeentry}
or @code{@@seealso}: @file{texindex.awk} supplies a comma in the final
printed entry.
@node Input form
@section Input To The Program
The output from @file{texinfo.tex} contains the data for the
different kinds of index entries described in the previous
section. Each line is an @dfn{entry}. Each entry has from three
to five fields, where the first three fields represent the same
data for all entries. Entries look as follows:
@example
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}@{@var{secondary}@}
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}@{@var{secondary}@}@{@var{tertiary}@}
@end example
The braces are balanced in all cases, although for use by this program,
literal braces (not necessarily balanced) can be included in the sort
key by escaping them with the @dfn{command character}.
@cindex backslash vs.@: at
@cindex command character, @samp{\} vs.@: @samp{@@}
In the example above, the command character is @samp{@@} (as in Texinfo
itself). Historically, however, the command character was backslash
(@samp{\}), and @file{texindex.awk} can handle either one.
Because older versions of @command{texi2dvi} only understand backslash as
the command character, it remains set to backslash, so that newer versions
of @file{texinfo.tex} will work with older versions of @command{texi2dvi}.
Once the newer version of @command{texi2dvi} that also understands
@samp{@@} has had a chance to spread, (we can hope that) the command
character will change to @samp{@@}.
The command character is determined at run time by looking at the first
character on the first line of each input file.
The fields are as follows:
@table @var
@item sortkey
The text to use for sorting index entries.
Generally, this is the text of the line with all markup removed.
When an @code{@@sortas} clause is provided, its contents are used instead.
This field should contain only ASCII characters.
When there are subentries, the sort key is the concatenation of up to three
fields (or their @code{@@sortas} clauses), separated by
@code{@@subentry} and a space. @file{texindex.awk} needs to recognize
where there are multiple sort keys in order to print entries appropriately.
@item page or see
Either a page number (as a roman numeral or an integer, possibly
with additional markup), or an indication that this is a ``See''
or ``See also'' entry.
@item primary
The primary text of the index entry.
@item secondary
The (optional) secondary text of the index entry.
@item tertiary
The (optional) tertiary text of the index entry.
@end table
Our mission (which we choose to accept) is to read the above input,
sort it appropriately, and produce the correct output.
@node Output form
@section What The Output Should Look Like
Output consists of four different commands, derived from the indexing input.
The middle two can have a variant where there is no page number.
@example
@@initial @{A@} @ii{For the initial over each group}
@@entry@{@var{indexing text}@}@{@var{pagenum}@} @ii{Primary entry}
@@entry@{@var{indexing text}@}@{@} @ii{Primary entry without page number}
@@secondary@{@var{indexing text}@}@{@var{pagenum}@} @ii{Secondary entry}
@@secondary@{@var{indexing text}@}@{@} @ii{Secondary entry without page number}
@@tertiary@{@var{indexing text}@}@{@var{pagenum}@} @ii{Tertiary entry}
@end example
The commands are:
@table @code
@item @@initial
Besides the index entries,
@file{texindex.awk} must output special lines indicating the
first character (the @dfn{initial}) of keys grouped together, but only
if there is more than one initial used throughout the input file.
@item @@entry
This is for a plain index entry, or for the primary term in a multi-level
index entry. When the primary appears only with a secondary entry,
there won't be a page number.
No page number is printed when there
is an @code{@@seeentry} in the entry. In that case,
the output from @file{texindex.awk} should be a combination of
the original input fields three and two.
@item @@secondary
This is for a secondary index entry in a multi-level
index entry. When the secondary appears only with a tertiary entry,
there won't be a page number.
@item @@tertiary
This is for a tertiary index entry in a multi-level
index entry. This one always has a page number, unless it is
a ``see'' entry.
@end table
@node Processing
@section Processing Index Entries
The job is to sort the entries, and merge those which are identical
except for the page numbers.
The sorting should be in the order of: all symbols first, then all
digits, then all letters, with uppercase letters following lowercase
ones, so we will need some smarts.
Once sorted, the lines must be output in the correct form, depending
upon how many entries and subentries they have.
Input lines might be duplicated (same entry, same page, more than once),
so we will have to deal with that.
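The ordering rule can be made concrete with a small stand-alone sketch.
It is an illustration only, not the comparison code that
@file{texindex.awk} uses later. The function names @code{char_class()}
and @code{char_less()} are hypothetical, the input is assumed to be
ASCII, and ``uppercase following lowercase'' is read here as a
tie-breaker between otherwise equal letters:
@example
# Hypothetical helper: 1 = symbol, 2 = digit, 3 = letter.
function char_class(c)
@{
    if (c ~ /^[0-9]$/)
        return 2
    if (c ~ /^[A-Za-z]$/)
        return 3
    return 1
@}

# Hypothetical helper: return 1 if character a sorts before character b.
function char_less(a, b,    ca, cb, la, lb)
@{
    ca = char_class(a)
    cb = char_class(b)
    if (ca != cb)
        return (ca < cb)
    if (ca == 3) @{
        la = tolower(a)
        lb = tolower(b)
        if (la != lb)
            return (la < lb)
        # Same letter: the lowercase form comes first.
        return (a == la && b != lb)
    @}
    return (a < b)
@}

BEGIN @{
    print char_less("#", "7")   # 1: symbol before digit
    print char_less("7", "a")   # 1: digit before letter
    print char_less("a", "A")   # 1: lowercase before uppercase
    print char_less("B", "a")   # 0: "a" sorts before "B"
@}
@end example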
@node Assumptions
@section Assumptions About Our Data
In the rest of the program we make two fundamental assumptions:
@enumerate 1
@item
If a given sort key has more than one display text, we only take the
first (this matches the behavior of C @command{texindex}). Put another
way, if the same sort key has two different display texts, it means that
different markup was used, probably inadvertently, and we just take the
first. As an example, consider these two Texinfo commands:
@example
@@cindex @@file@{field_split()@} function
@dots{}
@@cindex @@code@{field_split()@} function
@end example
@noindent
They produce the following output via @file{texinfo.tex},
which in turn is the input to @file{texindex.awk}:
@example
@@entry@{field_split() function@}@{2@}@{@@file @{field_split()@} function@}
@@entry@{field_split() function@}@{7@}@{@@code @{field_split()@} function@}
@end example
@noindent
The result will be a single entry, using @code{@@file},
accumulating the page numbers:
@example
@@entry@{@@file @{field_split()@} function@}@{2, 7@}
@end example
@item
@cindex roman numerals
For the same sort key and text, page numbers will be monotonically
increasing. This means we can just use a new page number when it comes
in, and not have to sort entries based on both sort key and page number.
In turn, this means that we don't need to worry about page numbers
that are roman numerals (which can occur).
@end enumerate
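Both assumptions can be seen at work in a short stand-alone sketch.
It is not the code used later in this program; the @code{add()}
function and the @code{Text}, @code{Pages} and @code{Last} arrays are
hypothetical names chosen for the illustration:
@example
# Hypothetical sketch: keep the first display text for a sort key
# and append page numbers in the order in which they arrive.
function add(key, text, page)
@{
    if (! (key in Text)) @{
        Text[key] = text        # assumption 1: first text wins
        Pages[key] = page
    @} else if (Last[key] != page)
        Pages[key] = Pages[key] ", " page   # assumption 2: no sorting
    Last[key] = page
@}

BEGIN @{
    k = "field_split() function"
    add(k, "@@file @{field_split()@} function", 2)
    add(k, "@@code @{field_split()@} function", 7)
    add(k, "@@code @{field_split()@} function", 7)
    # prints: @@entry@{@@file @{field_split()@} function@}@{2, 7@}
    printf "@@entry@{%s@}@{%s@}\n", Text[k], Pages[k]
@}
@end example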
@node Portability
@section Using Portable @command{awk}
An additional requirement, for ease of deployment, is that the program
be written in portable @command{awk}, and not use features found only in
GNU @command{awk} (@command{gawk}). For our purposes, ``portable''
means ``new'' @command{awk} as defined in the 1988 book by Aho,
Weinberger and Kernighan. This gives us functions, multidimensional
arrays and a number of other important features over the original
@command{awk} shipped with V7 Unix.
In practice, we can also rely on basic features added in POSIX
@command{awk} should we need to (such as @code{CONVFMT}), although
at the moment there are no such features used by @file{texindex.awk}.
We tested the program with five versions of @command{awk} (@command{gawk},
@command{mawk}, Brian Kernighan's @command{awk}, @command{goawk},
and Busybox @command{awk}) on a large index and got byte-identical
results from all five.
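As one illustration of what the constraint means in practice, clearing
a whole array with a bare @code{delete array} statement began as a
@command{gawk} extension, whereas deleting one element at a time works
in every ``new'' @command{awk}. The following sketch shows the portable
form; the name @code{clear_array()} is hypothetical and is not meant to
describe how this program's own helper is written:
@example
# Hypothetical sketch: remove every element using only
# the per-element form of delete.
function clear_array(a,    i)
@{
    for (i in a)
        delete a[i]
@}

BEGIN @{
    seen["x"] = 1
    seen["y"] = 1
    clear_array(seen)
    n = 0
    for (k in seen)
        n++
    print n     # prints 0
@}
@end example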
@node High-level organization
@chapter High-level Organization
The general outline is as follows:
@(texindex.awk@) =
@
@
@
@
BEGIN {
@
@
}
@<@code{beginfile()} work function@>
@<@code{endfile()} work function@>
@
@
@
@menu
* First line:: The first line in the file.
* Initial setup:: Set up variables and constants used
throughout.
* Argument processing:: Processing command line arguments.
@end menu
@node First line
@section The Program's First Line
@cindex first line
@cindex @code{#!} header
@cindex header, shebang
For the first line of the generated output, we hardwire our intended
output file name and how it got made. We do not use a @samp{#!} header
because, being a GNU program, we need to accept the @option{--help} and
@option{--version} options. This cannot be done with a standalone
@code{awk} script; we need a shell wrapper, and hence, the @code{awk}
script itself need not be executable. Also, it's simpler not to worry
about the location of the @code{awk} program.
@=
# texindex.awk, generated by jrtangle from ti.twjr.
@
@node Initial setup
@section Initial Setup
@cindex initial setup
The initial setup sets up some constants, including the version of the
program. In the program itself, we follow a convenient convention:
global variable and array names start with a capital letter.
@cindex @code{Invocation_name} variable
Per GNU standards, we sometimes hardwire the string @samp{texindex} as
the name of the program, and sometimes use the name by which the program
was invoked. We'll call the latter @code{Invocation_name}; it's
supposed to be passed in from the shell wrapper.
@cindex @code{Can_split_null} variable
The last line below sets up @code{Can_split_null}, which tells us if the
built-in @code{split()} function will split apart a string into its
individual characters or if we have to do it manually.
@cindex @code{TRUE} constant
@cindex @code{FALSE} constant
@cindex @code{EXIT_SUCCESS} constant
@cindex @code{EXIT_FAILURE} constant
@cindex @code{Texindex_version} variable
@cindex @code{check_split_null()} function
@cindex @code{Can_split_null} variable
@cindex @code{Invocation_name} variable
@=
TRUE = 1
FALSE = 0
EXIT_SUCCESS = 0
EXIT_FAILURE = 1
Texindex_version = "@VERSION@"
if (! Invocation_name) {
# provide fallback in case it's not passed in.
Invocation_name = "texindex"
}
Can_split_null = check_split_null()
@
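The idea behind @code{Can_split_null} can be sketched as a probe that
splits a known string using the null string as the separator and then
checks what came back. This is only an assumption about how such a test
might look; the name @code{splits_on_null()} is hypothetical, and the
real function is presented later (@pxref{check_split_null}):
@example
# Hypothetical probe: does split() with "" yield one
# character per array element?
function splits_on_null(    n, parts)
@{
    n = split("abc", parts, "")
    return (n == 3 && parts[1] == "a")
@}

BEGIN @{
    if (splits_on_null())
        print "split() yields single characters"
    else
        print "split() does not; use substr() in a loop"
@}
@end example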
@node Argument processing
@section Command-line Argument Processing
@cindex argument processing
@cindex @code{usage()} function
@cindex @code{version()} function
Argument processing is straightforward, though manual. The important
thing is to remove options and their arguments from @code{ARGV} so that
they're not treated as filenames. The options that print version or
help information automatically exit, so there's no need to mess with
@code{ARGV} in those cases.
@cindex @code{-h} (@code{--help}) option
@cindex @code{-k} (@code{--keep}), no-op option
@cindex @code{--} option
@cindex @code{--version} option
@cindex @code{EXIT_SUCCESS} constant
@cindex @code{EXIT_FAILURE} constant
@cindex @code{fatal()} function
@=
for (i = 1; i < ARGC; i++) {
if (ARGV[i] == "-h" || ARGV[i] == "--help") {
usage(EXIT_SUCCESS)
} else if (ARGV[i] == "--version") {
version()
} else if (ARGV[i] == "-k" || ARGV[i] == "--keep") {
# do nothing, backwards compatibility
delete ARGV[i]
} else if (ARGV[i] == "--") {
delete ARGV[i]
break
} else if (ARGV[i] ~ /^--?.+/) {
fatal(_"%s: unrecognized option `%s'\n" \
"Try `%s --help' for more information.\n",
Invocation_name, ARGV[i], Invocation_name)
# fatal() will do `exit EXIT_FAILURE'
} else {
break
}
}
@
@node Processing records
@chapter Processing Records
Processing records includes setting things up for each input file,
pulling apart each record, sorting the data at the end, and writing out
the data properly.
@menu
* Setup for each input file:: What happens at the start of each file.
* Processing each record:: Pulling apart the fields and storing
data.
* End-of-file sorting and printing:: Sorting the entries for each index.
* Printing the data:: Printing the final results.
@end menu
@node Setup for each input file
@section Setup For Each Input File
At the beginning of each input file, the @code{beginfile()} function
clears our variables from any previous processing and sets up the
output file name. We always append an @samp{s} to the name of the input
file. This is the standard convention.
When @code{beginfile()} is called, the first record has already been
read, so it's possible to perform the checks for a Texinfo index file:
the first character must be either @samp{\} or @samp{@@}
(@pxref{Requirements}), and the next five characters must be the word
@samp{entry}.
@cindex @code{Special_chars} variable
@code{Special_chars} are the characters that must be preceded by
the command character inside the first key. This includes the command
character itself.
Finally, several variables are set to regular expressions that
match control sequences of interest.
@cindex @code{fatal()} function
@cindex @code{FALSE} constant
@cindex @code{beginfile()} function
@cindex @code{Output_file} variable
@cindex @code{Do_initials} variable
@cindex @code{Prev_initial} variable
@cindex @code{Command_char} variable
@cindex @code{Special_chars} variable
@cindex @code{Entries} variable
@<@code{beginfile()} work function@>=
function beginfile(filename)
{
Output_file = filename "s"
@
Entries = 0
Do_initials = FALSE
Prev_initial = ""
Command_char = substr($0, 1, 1)
if ((Command_char != "\\" && Command_char != "@") \
|| substr($0, 2, 5) != "entry")
fatal(_"%s is not a Texinfo index file\n", filename)
Special_chars = "{}" Command_char
@
}
@
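The reason the first record is already available is that
@code{beginfile()} is called from a rule that fires on the first record
of each file, in the style of @file{ftrans.awk} from the @command{gawk}
library (@pxref{Library functions}). A minimal sketch of such a driver
follows. The stub functions only print a trace, and the @code{_prev}
variable is a hypothetical name:
@example
# Stubs standing in for the real work functions.
function beginfile(name) @{ print "begin " name ": " $0 @}
function endfile(name)   @{ print "end " name @}

# The first record of each input file triggers the transition.
FNR == 1 @{
    if (_prev != "")
        endfile(_prev)
    _prev = FILENAME
    beginfile(FILENAME)     # $0 already holds the first record
@}

END @{
    if (_prev != "")
        endfile(_prev)
@}
@end example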
@node Processing each record
@section Processing Each Record
Record processing consists of building the data structures for use in
sorting and printing once the whole file has been processed.
@=
{
@
@
@
@
@
@
@
}
@
@menu
* Remove duplicates:: Removing duplicate entries.
* Remove leading @code{\entry}:: Remove the leading @code{\entry} command.
* Get the initial:: Get the initial for this entry.
* Set up and name fields:: Pull apart the line.
* Store the data for this line:: Store the data for later access.
* Check for more than one initial:: See if there are multiple initials.
* Splitting the record:: Split the record apart.
@end menu
@node Remove duplicates
@subsection Removing Duplicate Lines
@cindex removing duplicates
@cindex duplicates, removing
@cindex @code{Seen} array
Duplicate lines are exact copies of one another, so removing them is
easy: store each incoming line as an index of an array named @code{Seen}.
If a line is not yet in the array, it has not been seen before; otherwise
it has, and we move on to the next record.
@cindex @code{TRUE} constant
@cindex @code{Seen} array
@=
# Remove duplicates, which can happen
if ($0 in Seen)
next
Seen[$0] = TRUE
@
We have to clear out the @code{Seen} array at the start of each input file.
@cindex @code{del_array()} function
@=
# Reinitialize these for each input file
del_array(Seen)
@
@node Remove leading @code{\entry}
@subsection Remove The Leading @code{\entry} Or @code{@@entry}
We use @code{substr()} here to avoid possible hassles with leading
backslashes in @code{sub()}.
@=
$0 = substr($0, 7) # remove leading \entry or @entry
@
@node Get the initial
@subsection Get The Initial
@cindex @code{extract_initial()} function
@=
initial = extract_initial($0)
@
The sort key is the first part of the line after @samp{@@entry},
starting with an open brace, and continuing to a matching close brace.
The very first character of the sort key can be an open brace.
If so, we extract the component of the sort key surrounded by balanced
braces. We don't account for @samp{\@{} or @samp{\@}} inside this component, as
@file{texinfo.tex} isn't expected to produce such output.
An example can be seen in what older versions of @file{texinfo.tex}
generated if you needed to index a real backslash, namely an input line
something like the following:
@example
\entry@{@{\tt \indexbackslash @} (backslash)@}@{14@}@{\code @{@{\tt @dots{}@}@}
@end example
Earlier versions of @command{texindex} took the first non-brace
character as the initial, in this example @samp{\}, and output it as
@samp{\\}; this was not, however, a control sequence recognized by the
older versions of @file{texinfo.tex}.
@cindex @code{extract_initial()} function
@cindex @code{char_split()} function
@cindex @code{fatal()} function
@=
function extract_initial(key, initial, nextchar, i, l, kchars, bracecount)
{
l = char_split(key, kchars)
if (l >= 3 && kchars[2] == "{") {
bracecount = 1
i = 3
while (bracecount > 0 && i <= l) {
if (kchars[i] == "{")
bracecount++
else if (kchars[i] == "}")
bracecount--
i++
}
if (i > l)
fatal(_"%s:%d: Bad key %s in record\n", FILENAME, FNR, key)
initial = substr(key, 2, i - 2)
} else if (kchars[2] == Command_char) {
nextchar = kchars[3]
if (initial == Command_char && index("{}", nextchar) > 0)
initial = substr(key, 2, 3)
else {
initial = toupper(nextchar)
}
} else {
initial = toupper(kchars[2])
}
return initial
}
@
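To see the brace-matching branch in isolation, here is a stand-alone
sketch that walks the key with @code{substr()} instead of
@code{char_split()}. The name @code{leading_group()} is hypothetical,
error checking is omitted, and the sample line is abbreviated:
@example
# Hypothetical sketch: return the balanced brace group that
# starts at position 2 of the key.
function leading_group(key,    i, depth, c)
@{
    depth = 1
    for (i = 3; i <= length(key) && depth > 0; i++) @{
        c = substr(key, i, 1)
        if (c == "@{")
            depth++
        else if (c == "@}")
            depth--
    @}
    return substr(key, 2, i - 2)
@}

BEGIN @{
    # prints: @{\tt \indexbackslash @}
    print leading_group("@{@{\\tt \\indexbackslash @} (backslash)@}@{14@}@{text@}")
@}
@end example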
@node Set up and name fields
@subsection Set Up And Name The Fields
The next step is to pull out the data of interest from the multiple sets of
braces. This is delegated to a function named @code{field_split()}.
There must be at least three fields, and there can be up to five.
@cindex @code{fatal()} function
@cindex @code{field_split()} function
@cindex @code{fields} array, setting up
@=
numfields = field_split($0, fields, "{", "}", Command_char)
if (numfields < 3 || numfields > 5)
fatal(_"%s:%d: Bad entry; expected 3 to 5 fields, not %d\n",
FILENAME, FNR, numfields)
@
We give the fields names for later use.
@=
key = fields[1]
pagenum = fields[2]
primary_text = fields[3]
secondary_text = (numfields > 3 ? fields[4] : "")
tertiary_text = (numfields > 4 ? fields[5] : "")
@
@node Store the data for this line
@subsection Store The Data For This Line
@cindex storing data
We use multiple arrays to store different parts of the data.
The sort key from the input is invariant across entries, so we use that
as the index in the various arrays.
We need the following arrays:
@table @code
@item Numfields
How many fields (entries) in this line: one, two, or three.
@item Initials
The initial for this line.
@item Primary
The primary index text. This is the real text, not the stripped
value appearing in the sort key.
@item Secondary
The secondary index text, if present.
@item Tertiary
The tertiary index text, if present.
@item Pagedata
The page numbers on which identical entries occur.
@item See_count
The number of ``see'' entries for a given
index line. Nothing prevents there being multiple such:
@cindex toast @seeentry{toaster}
@cindex toast @seeentry{jam}
@example
@@cindex toast @@seeentry@{toaster@}
@@cindex toast @@seeentry@{jam@}
@end example
@noindent
So we have to be prepared to handle such input.
The @code{See_count} array records how many such
texts there are for each key.
@item See
The actual texts of the @samp{@@seeentry} value for
the line's sort key. Each entry goes into
@code{See[key, 1]}, @code{See[key, 2]}, etc., up to
@code{See_count[key]}.
@item Seealso_count
@itemx Seealso
These serve the same purpose as @code{See_count} and @code{See},
but for @code{@@seealso} entries.
@end table
In the event that a particular key has more than one associated output
text, we'll keep the first and ignore the remainder (this is the same
behavior as the C implementation). @xref{Assumptions}.
For page numbers, we merely append the page number field from the input,
preceded by a comma and space, unless that page number was already the
last that's been stored. (We're assuming the page numbers don't jump
around, which, in fact, they don't, so we don't need a more complex
approach.) This also handles any page numbers that appear as roman
numerals (from the so-called front matter), should there be such.
In addition to all the previously described arrays, the key is stored in
the @code{Keys} array the first time it is seen; this array is sorted
later on. Its indices are just incremented integers, stored in the
global @code{Entries} variable. The @code{Allkeys} associative array
lets us easily track if we have seen a key before.
@cindex @code{Keys} array
@cindex @code{Allkeys} array
@cindex @code{Entries} variable
@cindex @code{Seeentry_re} variable
@cindex @code{Seealso_re} variable
@cindex @code{Primary} array
@cindex @code{Secondary} array
@cindex @code{Tertiary} array
@cindex @code{Numfields} array
@cindex @code{See} array
@cindex @code{See_count} array
@cindex @code{Seealso} array
@cindex @code{Seealso_count} array
@cindex @code{Pagedata} array
@cindex @code{escape()} function
@=
if (! (key in Allkeys)) {
# first time we've seen this key
Keys[++Entries] = key
Allkeys[key] = key
Initials[key] = initial
Numfields[key] = numfields - 2 # don't count sortkey, page number
Primary[key] = primary_text
if (secondary_text)
Secondary[key] = secondary_text
if (tertiary_text)
Tertiary[key] = tertiary_text
@
if (pagenum ~ Seeentry_re) {
See_count[key]++
See[key, See_count[key]] = pagenum
} else if (pagenum ~ Seealso_re) {
Seealso_count[key]++
Seealso[key, Seealso_count[key]] = pagenum
} else
Pagedata[key] = pagenum
} else {
# We've seen this key before:
# Add to see or see also, or else add to list of pages.
# In the latter case, make sure we've not seen this
# page number before. (Shouldn't happen based on the
# earlier removal of exact duplicates, but we could have
# an identical key with different formatting of actual text.)
if (pagenum ~ Seeentry_re) {
See_count[key]++
See[key, See_count[key]] = pagenum
} else if (pagenum ~ Seealso_re) {
Seealso_count[key]++
Seealso[key, Seealso_count[key]] = pagenum
} else if (! (key in Pagedata)) {
Pagedata[key] = pagenum
} else if (Pagedata[key] != pagenum \
&& Pagedata[key] !~ escape(", " pagenum "$")) {
Pagedata[key] = Pagedata[key] ", " pagenum
}
}
@
We split the key into subparts, using the @samp{@@subentry} as
the separator. The subparts are stored in the @code{Subkeys} array.
@cindex @code{Subkeys} array
@=
n = split(key, subparts, Subentry_re)
for (i = 1; i <= n; i++)
Subkeys[key, i] = subparts[i]
@
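The effect of the split can be tried on its own. In this sketch the
command character is assumed to be @samp{@@} and the separator is
written out as a literal dynamic regexp, whereas the program itself uses
@code{Subentry_re}. The sample key is made up for the illustration:
@example
BEGIN @{
    key = "coffee makers@@subentry electric@@subentry blue"
    # A separator string longer than one character is
    # treated as a regular expression.
    n = split(key, parts, " *@@subentry +")
    for (i = 1; i <= n; i++)
        print i ": " parts[i]
    # prints:
    # 1: coffee makers
    # 2: electric
    # 3: blue
@}
@end example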
The @code{Seeentry_re}, @code{Seealso_re} and @code{Subentry_re}
variables are regular expressions that match the corresponding @TeX{}
control sequences. They're initialized once for each input file, since
the command character might be different between files.
The @code{make_regexp()} function is described in @ref{make_regexp}.
@cindex @code{Seeentry_re} variable
@cindex @code{Seealso_re} variable
@cindex @code{Subentry_re} variable
@cindex @code{make_regexp()} function
@=
Seeentry_re = make_regexp("%seeentry")
Seealso_re = make_regexp("%seealso")
Subentry_re = make_regexp(" *%subentry +")
@
Here too, we have to clear out the arrays we've used at the start of
each input file.
@cindex @code{Keys} array
@cindex @code{Subkeys} array
@cindex @code{Allkeys} array
@cindex @code{Initials} array
@cindex @code{Numfields} array
@cindex @code{Primary} array
@cindex @code{Secondary} array
@cindex @code{Tertiary} array
@cindex @code{See} array
@cindex @code{See_count} array
@cindex @code{Seealso} array
@cindex @code{Seealso_count} array
@cindex @code{Pagedata} array
@cindex @code{del_array()} function
@=
del_array(Keys)
del_array(Allkeys)
del_array(Subkeys)
del_array(Initials)
del_array(Numfields)
del_array(Primary)
del_array(Secondary)
del_array(Tertiary)
del_array(See)
del_array(See_count)
del_array(Seealso)
del_array(Seealso_count)
del_array(Pagedata)
@
@node Check for more than one initial
@subsection Check For More Than One Initial
@cindex initial, checking for more than one
Finally, we need to determine if more than one initial occurs in the
input. If so, we set @code{Do_initials} to true. As soon as it's true,
we don't need to do further checking on subsequent lines.
@cindex @code{TRUE} constant
@cindex @code{Do_initials} variable
@cindex @code{Prev_initial} variable
@=
if (! Do_initials) {
if (Prev_initial == "")
Prev_initial = initial
else if (initial != Prev_initial)
Do_initials = TRUE
}
@
@node Splitting the record
@subsection Splitting The Record: @code{field_split()}
Let's take a look at the function that breaks apart the record. Upon
entry to the function, the value of @code{record} looks something like:
@example
@{POSIX awk@}@{5@}@{POSIX @@command @{awk@}@}
@end example
The first field may have instances of @samp{@@@{} and/or @samp{@@@}} (or
@samp{\@{} and/or @samp{\@}}), so the braces aren't necessarily exactly
balanced.
The @code{field_split()} function uses fairly straightforward ``count
the delimiters'' code. The loop starts at two, since we know the first
character is an open brace. The main things to handle are the command
character and the final closing brace.
@cindex @code{field_split()} function
@cindex @code{char_split()} function
@=
function field_split( \
record, fields, start, end, com_ch, # parameters
chars, numchars, out, delim_count, i, j, k) # locals
{
del_array(fields)
numchars = char_split(record, chars)
j = 1 # index into fields
k = 1 # index into out
delim_count = 1
for (i = 2; i <= numchars; i++) {
if (chars[i] == com_ch) {
@
} else if (chars[i] == start) {
delim_count++
out[k++] = chars[i]
} else if (chars[i] == end) {
delim_count--
if (delim_count == 0) {
@
} else
out[k++] = chars[i]
} else
out[k++] = chars[i]
}
return j - 1 # num fields
}
@
If the command character is doubled, we pass that on through, so that
@TeX{} will process it correctly.
If the character following the command character is an open brace or close
brace, we pull it in. Otherwise, the
command character is left alone as part of the field.
@cindex @code{Command_char} variable
@cindex @code{Special_chars} variable
@=
if (chars[i+1] == Command_char) { # input was @@
out[k++] = chars[i+1]
out[k++] = chars[i+1]
i++
} else if (index(Special_chars, chars[i+1]) != 0) {
out[k++] = chars[i+1]
i++
} else
out[k++] = chars[i]
@
Upon seeing the final closing brace, we put all the characters back
together into a string using @code{join()}. We then reset the
@code{out} array for the next time through. If the next character isn't
an open brace, then the line is bad and we print a fatal error.
Otherwise, we reset @code{delim_count} to one.
@cindex @code{join()} function
@cindex @code{del_array()} function
@cindex @code{fatal()} function
@=
fields[j++] = join(out, 1, k-1, SUBSEP)
del_array(out) # reset for next time through
k = 1
i++
if (i <= numchars && chars[i] != start)
fatal(_"%s:%d: Bad entry; expected %s at column %d\n",
FILENAME, FNR, start, i)
delim_count = 1
@
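The ``count the delimiters'' idea can be seen in isolation in the
following simplified sketch. It is not the function used by the program:
the name @code{split_fields()} is hypothetical, the delimiters are
hard-coded as braces, and the command character gets no special
treatment, so escaped braces are not handled.
@example
# split_fields --- simplified brace-counting splitter (illustration only)
function split_fields(record, fields,    n, i, c, depth, cur)
@{
    n = 0; depth = 0; cur = ""
    for (i = 1; i <= length(record); i++) @{
        c = substr(record, i, 1)
        if (c == "@{") @{
            if (depth++ > 0)        # nested brace: keep it in the field
                cur = cur c
        @} else if (c == "@}") @{
            if (--depth == 0) @{     # final closing brace: field is done
                fields[++n] = cur
                cur = ""
            @} else
                cur = cur c
        @} else
            cur = cur c
    @}
    return n
@}
BEGIN @{
    n = split_fields("@{POSIX awk@}@{5@}@{POSIX @@command @{awk@}@}", f)
    for (i = 1; i <= n; i++)
        print i ": " f[i]
@}
@end example
@noindent
This prints:
@example
1: POSIX awk
2: 5
3: POSIX @@command @{awk@}
@end example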
@node End-of-file sorting and printing
@section End-of-file Sorting And Printing
Upon end of input, the processing is straightforward: sort the entries
and write them out. Additionally, each time the initial changes from one
entry to the next, we print the new initial if initials are in use.
(That printing task is delegated to a small function.)
@cindex @code{endfile()} function
@cindex @code{quicksort()} function
@cindex @code{write_index_entry()} function
@cindex @code{Entries} variable
@cindex @code{Keys} array
@cindex @code{Initials} array
@cindex @code{print_initial()} function
@cindex @code{Output_file} variable
@<@code{endfile()} work function@>=
function endfile(filename, i, prev_initial, initial)
{
# sort the entries
quicksort(Keys, 1, Entries, "index")
prev_initial = ""
for (i = 1; i <= Entries; i++) {
# deal with initial
initial = Initials[Keys[i]]
if (initial != prev_initial) {
prev_initial = initial
print_initial(initial)
}
write_index_entry(i)
}
close(Output_file)
}
@
Printing an initial is not complicated. The main thing is to precede
special characters with the command character.
@cindex @code{Command_char} variable
@cindex @code{Special_chars} variable
@cindex @code{Do_initials} variable
@cindex @code{print_initial()} function
@cindex @code{Output_file} variable
@=
function print_initial(initial)
{
if (! Do_initials)
return
if (index(Special_chars, initial) != 0)
initial = Command_char initial
printf("%cinitial {%s}\n",
Command_char, initial) > Output_file
}
@
@menu
* Quicksort:: Sorting our input.
* Multilevel comparisons:: Handling multilevel entries.
* Comparing index entries:: The heart of the sorting algorithm.
@end menu
@node Quicksort
@subsection Quicksort
@cindex quicksort algorithm
@cindex Hoare, C.A.R.
Sorting uses a standard quicksort algorithm. We need to sort both
multilevel index entries and plain strings. To that end, the
@code{compare} parameter indicates which kind of ``less than''
comparison to perform.
@cindex @code{quicksort()} function
@cindex @code{quicksort_swap()} function
@=
# quicksort --- C.A.R. Hoare's quick sort algorithm. See Wikipedia
# or almost any algorithms or computer science text
# Adapted from K&R-II, page 110
#
function quicksort(data, left, right, compare, # parameters
i, last, use_index, lt) # locals
{
if (left >= right) # do nothing if array contains fewer
return # than two elements
@
quicksort_swap(data, left, int((left + right) / 2))
last = left
for (i = left + 1; i <= right; i++) {
@
if (lt)
quicksort_swap(data, ++last, i)
}
quicksort_swap(data, left, last)
quicksort(data, left, last - 1, compare)
quicksort(data, last + 1, right, compare)
}
# quicksort_swap --- quicksort helper function, could be inline
#
function quicksort_swap(data, i, j, temp)
{
temp = data[i]
data[i] = data[j]
data[j] = temp
}
@
We set a Boolean (numeric) variable to indicate what kind of
comparison to do, so that the string comparison, whose result
cannot change between iterations, is done only once.
@=
use_index = (compare == "index")
@
The @code{less_than()} function supplies the comparison for index entries
(@pxref{Multilevel comparisons}, and @pxref{Comparing index entries}).
The @code{key_compare()} function is used for string comparisons.
@cindex @code{less_than()} function
@cindex @code{key_compare()} function
@=
lt = (use_index \
? less_than(data, i, left) \
: key_compare(data[i], data[left]) < 0)
@
@node Multilevel comparisons
@subsection Handling Multilevel Entries
The @code{less_than()} function has to take into account
that we are comparing multilevel index entries. We can't just
compare the full sort key, since the @samp{@@subentry} throws off
the comparison; we want to compare based only on the key texts
themselves.
To that end, the comparison happens on two levels. At the higher
level, we compare the subkeys; if the first subkeys are equal
then we differentiate between them based on the second subkey.
If, in turn, the second ones are equal, we differentiate based
on the third one.
By definition, an index entry with only one subkey sorts before
an entry with two, and one with two comes before one with three.
The underlying @code{key_compare()} function, which does the hard
work of comparison, returns a three-way value a la the C @code{strcmp()}
function: less than zero if the first string is less than the second,
zero if they're equal, or greater than zero if the first string is
greater than the second one.
We make an effort here to call the comparison function only as
much as necessary, since it's a relatively expensive operation.
@cindex @code{less_than()} function
@cindex @code{key_compare()} function
@cindex @code{Numfields} array
@cindex @code{Subkeys} array
@=
function less_than(data, l, r,    left, right, left_fields, right_fields,
                   nfields, cmp1, cmp2)
{
left = data[l]
right = data[r]
left_fields = Numfields[left]
right_fields = Numfields[right]
nfields = min(left_fields, right_fields)
# At least one field, always check the first subkey
cmp1 = key_compare(Subkeys[left, 1], Subkeys[right, 1])
if (cmp1 != 0)
return cmp1 < 0
# cmp1 == 0: one side has 1 field, other side has 1 to 3 fields
if (nfields == 1)
return left_fields < right_fields
# At least two fields, check second subkey
cmp2 = key_compare(Subkeys[left, 2], Subkeys[right, 2])
if (cmp2 != 0)
return cmp2 < 0
# cmp1 == 0, cmp2 == 0, one side has 2 fields,
# other has 2 to 3 fields
if (nfields == 2)
return left_fields < right_fields
# Three fields
return key_compare(Subkeys[left, 3], Subkeys[right, 3]) < 0
}
@
@node Comparing index entries
@subsection Comparing Index Entries
The sort key comparison function is the heart of the sorting algorithm. The
comparison is based on the indexing rules, which are:
@itemize @bullet
@item
All symbols first.
@item
Followed by digits.
@item
Followed by letters. Lowercase precedes uppercase and both ``a'' and
``A'' precede anything starting with ``b'' or ``B'' (etc.).
@end itemize
Implementing these rules is a little complicated. The first thing we
need is a table that maps characters to comparison values. The
following code is based on the original C @command{texindex}, although
the actual comparison algorithm is more sophisticated.
We set up an @code{Ordval} array to map characters to numeric values.
Most characters map to their ASCII code. We add 512 to the value of each
of the digits; this causes them to come after all symbols.
The letters are handled a little differently. We set things up so that
lowercase letters come before uppercase ones, but both ``a'' and
``A'' come before ``b'', and so on. This then lets us use
a simple subtraction in the comparison to determine if two letters are
less than, equal to, or greater than each other. In any case, the mapping
also ensures that letters come after digits.
(This code should also work for EBCDIC systems, although @TeX{} does
everything in ASCII, so it's not likely to make a difference.)
The table must be built completely before changing the mapping of the
letters, because all of the uppercase and lowercase letters must be in
the table before we can change their values.
@cindex @code{Ordval} array
@cindex @code{isdigit()} function
@cindex @code{islower()} function
@=
BEGIN {
for (i = 0; i < 256; i++) {
c = sprintf("%c", i)
Ordval[c] = i # map character to value
if (isdigit(c))
Ordval[c] += 512
}
# Set things up such that 'a' < 'A' < 'b' < 'B' < ...
i = Ordval["a"]
j = Ordval["z"]
newval = i + 512
for (; i <= j; i++) {
c = sprintf("%c", i)
if (islower(c)) {
Ordval[c] = newval++
Ordval[toupper(c)] = newval++
}
}
}
@
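To see what the resulting table looks like, here is a standalone sketch
that builds an equivalent mapping and prints a few values. It is an
illustration only: it uses inline @code{index()} tests in place of the
program's @code{isdigit()} and @code{islower()} helpers, and the numbers
shown assume an ASCII-based system.
@example
BEGIN @{
    lower = "abcdefghijklmnopqrstuvwxyz"
    for (i = 0; i < 256; i++) @{
        c = sprintf("%c", i)
        Ordval[c] = i                   # default: the character code
        if (index("0123456789", c))
            Ordval[c] += 512            # digits after all symbols
    @}
    newval = Ordval["a"] + 512          # letters after digits
    for (i = 1; i <= 26; i++) @{
        c = substr(lower, i, 1)
        Ordval[c] = newval++            # 'a' ...
        Ordval[toupper(c)] = newval++   # ... then 'A', then 'b', ...
    @}
    print Ordval["+"], Ordval["0"], Ordval["a"], Ordval["A"], Ordval["b"]
@}
@end example
@noindent
On an ASCII system this prints @samp{43 560 609 610 611}: the symbol
sorts first, then the digit, then @samp{a}, @samp{A} and @samp{b} in
that order.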
Here is the @code{key_compare()} function. It returns less than zero if the
@code{left} string is ``less than'' the @code{right} string, zero if they are
equal, and greater than zero if the @code{left} string is ``greater than'' the
@code{right} string.
The comparison algorithm is not too complicated, once we define how
things should work. We loop over each pair of characters in the
@code{left} and @code{right} strings, comparing them one at a time.
When comparing two characters, there are three cases, one of which has
three subcases, as follows:
@table @i
@item Two letters
@c nested table
@table @i
@item Same letter, but different case
This is the slightly complicated case.
When the two characters are the same letter in different case, we have
to look ahead at the next characters to decide whether to continue the
loop or quit. As long as we are not at the end of the shorter string,
and the following character in at least one of the strings is a letter,
we continue the loop.
Otherwise we compare the respective @code{Ordval} values and return.
@item Two different letters, but same case
@itemx Two different letters, different case
Use the comparison of the respective @code{Ordval} values.
@end table
@c end nested table
@item A letter and something else
@itemx Two nonletters
Use the comparison of the respective @code{Ordval} values.
@end table
@noindent
When the values are equal, continue around the loop. And, as
usual, if one string is an initial substring of the other, that one is
considered to be ``less than'' the other one.
The rules just described produce @emph{better} results than did the C
@command{texindex}. For example, @samp{beginfile()} sorts
before @samp{BEGINFILE}, whereas with the C version they came out in the
opposite order.
@cindex @code{Ordval} array
@cindex @code{char_split()} function
@cindex @code{key_compare()} function
@cindex @code{isalpha()} function
@=
function key_compare(left, right,    len_l, len_r, len, chars_l, chars_r, i)
{
len_l = length(left)
len_r = length(right)
len = (len_l < len_r ? len_l : len_r)
char_split(left, chars_l)
char_split(right, chars_r)
for (i = 1; i <= len; i++) {
if (isalpha(chars_l[i]) && isalpha(chars_r[i])) {
# same char different case
# upper case comes out last
if (chars_l[i] != chars_r[i] &&
tolower(chars_l[i]) == tolower(chars_r[i])) {
if (i != len \
&& (isalpha(chars_l[i+1]) || isalpha(chars_r[i+1])))
continue
# negative, zero, or positive
return Ordval[chars_l[i]] - Ordval[chars_r[i]]
}
# same case, different char,
# or different case, different char:
# letter order wins
if (Ordval[chars_l[i]] < Ordval[chars_r[i]])
return -1
if (Ordval[chars_l[i]] > Ordval[chars_r[i]])
return 1
# equal, keep going
continue
}
# letter and something else, or two non-letters
# letter order wins
if (Ordval[chars_l[i]] < Ordval[chars_r[i]])
return -1
if (Ordval[chars_l[i]] > Ordval[chars_r[i]])
return 1
# equal, keep going
}
# equal so far, shorter one wins
if (len_l < len_r)
return -1
if (len_l > len_r)
return 1
return 0
}
@
@node Printing the data
@section Printing The Final Results
Printing an index entry is where all the data we collected
gets used. Much of the complexity is here, since we have to
output up to three lines per entry.
@menu
* printing top level:: Top level logic.
* printing a single entry:: Handling a single entry.
@end menu
@node printing top level
@subsection Top Level Logic For Printing An Entry
So, let's start. The logic is going to be a little complicated:
@cindex @code{write_index_entry()} function
@cindex @code{Keys} array
@cindex @code{Numfields} array
@=
function write_index_entry(current, key)
{
key = Keys[current] # current sort key
if (Numfields[key] == 1) {
@
} else if (Numfields[key] == 2) {
@
@
} else if (Numfields[key] == 3) {
@
@
@
}
}
@
Consider the three-level case, for an entry like:
@example
@@cindex coffee makers @@subentry electric @@subentry blue
@end example
@noindent
There may not have been separate preceding entries like @samp{@@cindex
coffee makers} or @samp{@@cindex coffee makers @@subentry electric}. Thus,
we have to generate the preceding @code{@@entry} and @code{@@secondary}
lines before generating the final @code{@@tertiary} line.
Printing the entries is similar no matter what kind of entry; there's
a lot of work to be done so it's isolated in the @code{print_entry()}
function described in the next section.
@cindex @code{print_entry()} function
@=
print_entry(key, "entry", Primary)
@
@cindex @code{print_entry()} function
@=
print_entry(key, "secondary", Secondary)
@
@cindex @code{print_entry()} function
@=
print_entry(key, "tertiary", Tertiary)
@
@node printing a single entry
@subsection Printing A Single Entry
Printing a single entry is quite involved:
@cindex @code{print_entry()} function
@cindex @code{print_see_entry()} function
@cindex @code{See} array
@cindex @code{See_count} array
@cindex @code{Seealso} array
@cindex @code{Seealso_count} array
@cindex @code{Pagedata} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
function print_entry(key, entry_command, entry_text,
@)
{
if ((key, 1) in See) # at least one ``see''
print_see_entry(key, entry_command, entry_text,
See_count, See)
if (key in Pagedata) { # at least one page number
@
printf("%c%s{%s}{%s}\n",
Command_char, entry_command,
entry_text[key], Pagedata[key]) > Output_file
Printed[key] = TRUE # mark this key as printed
} else if ((key, 1) in Seealso) { # at least one ``see also''
# Only ``see also'' entry, print it
@
printf("%c%s{%s}{",
Command_char, entry_command,
entry_text[key]) > Output_file
# now add them to the page data
for (i = 1; i <= count; i++) {
printf("%s", see_entries[i]) > Output_file
if (i != count)
printf(", ") > Output_file
}
printf("}\n") > Output_file
}
}
@
Note that we only take note of having printed the key
for lines with page numbers. Otherwise, a ``see'' entry followed
by a regular multilevel entry is not handled correctly.
When there exist both regular index entries for a topic and
also a ``see also'' entry, we place the ``see also'' text
after all the page numbers, so that there is only one printed
index entry for the topic.
This ends up being involved, since potentially there could be
multiple ``see also'' entries (even though this is bad form).
@=
if ((key, 1) in Seealso) {
@
# now add them to the page data
for (i = 1; i <= count; i++)
Pagedata[key] = Pagedata[key] ", " see_entries[i]
}
@
@=
count, see_entries, i
@
Although it's bad practice, there could be multiple
``see also'' entries for a given key. In that case, we
must sort them before using them.
@=
count = Seealso_count[key]
# Copy the entries to a separate array
for (i = 1; i <= count; i++)
see_entries[i] = Seealso[key, i]
# sort them
quicksort(see_entries, 1, count, "string")
@
Be sure to empty out @code{Printed} at the start of each file.
@cindex @code{Printed} array
@cindex @code{del_array()} function
@=
del_array(Printed)
@
And add our function to the group of work functions.
@=
@
@
Printing ``see'' entries is potentially messy if there
is more than one. (A good index won't have more than one, but nothing
prevents there being multiple such entries, so we have to handle them.)
@cindex @code{print_see_entry()} function
@cindex @code{quicksort()} function
@cindex @code{Output_file} variable
@=
function print_see_entry(key, entry_command, entry_text, # parameters
count_array, see_text_array, # parameters
i, count, see_entries) # locals
{
count = count_array[key]
if (count == 1) { # the easy case
printf("%c%s{%s, %s}{}\n",
Command_char, entry_command,
entry_text[key], see_text_array[key, 1]) > Output_file
return
}
# Otherwise, we need to sort the entries and then print them
# Copy the entries to a separate array
for (i = 1; i <= count; i++)
see_entries[i] = see_text_array[key, i]
# sort them
quicksort(see_entries, 1, count, "string")
# now print them
for (i = 1; i <= count; i++)
printf("%c%s{%s, %s}{}\n",
Command_char, entry_command,
entry_text[key], see_entries[i]) > Output_file
}
@
@cindex test case @seeentry{testing}
@cindex test case @seeentry{brief case}
@cindex coffee
@cindex coffee @seealso{tea}
@cindex coffee @seealso{coca-cola}
@cindex whisky @seealso{soda}
@cindex whisky @seealso{scotch}
@=
@
@
When checking if we need to print the primary and secondary entry,
we need to use the subparts of the key.
The subparts represent the key for those entries;
each will have an index in @code{Printed} if
we already printed such an entry.
The subparts are already available in the @code{Subkeys} array.
@cindex @code{Subkeys} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
if (! (Subkeys[key, 1] in Printed)) {
printf("%centry{%s,}{}\n",
Command_char, Primary[key]) > Output_file
Printed[Subkeys[key, 1]] = TRUE
}
@
Printing the secondary entry is a little subtle; we have to check
whether the combination of primary and secondary subkeys has been
printed, and we use that combination as the index into @code{Printed}.
@cindex @code{Subkeys} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
subkey = (Subkeys[key, 1] Command_char "subentry " Subkeys[key, 2])
if (! (subkey in Printed)) {
printf("%csecondary{%s,}{}\n",
Command_char, Secondary[key]) > Output_file
Printed[subkey] = TRUE
}
@
Here are some test cases:
@c These should make a nice test case!
@cindex coffee makers
@cindex toasters @subentry british
@cindex toasters @subentry american
@cindex microwaves @subentry electric @subentry 110 volt
@cindex microwaves @subentry electric @subentry 220 volt
@cindex children
@cindex children @subentry small
@cindex children @subentry small @subentry toddlers
@cindex children @subentry small @subentry infants
@cindex children @subentry teenagers
@cindex children @subentry adult
@example
@@c Top level entry, with page
@@cindex coffee makers
@@c Double level, no separate top level
@@cindex toasters @@subentry british
@@cindex toasters @@subentry american
@@c Triple level, no separate 1st or 2nd level
@@cindex microwaves @@subentry electric @@subentry 110 volt
@@cindex microwaves @@subentry electric @@subentry 220 volt
@@c All 3 levels, with pages
@@cindex children
@@cindex children @@subentry small
@@cindex children @@subentry small @@subentry toddlers
@@cindex children @@subentry small @@subentry infants
@@cindex children @@subentry teenagers
@@cindex children @@subentry adult
@@c More examples
@@cindex test case @@seeentry@{testing@}
@@cindex test case @@seeentry@{brief case@}
@@cindex coffee
@@cindex coffee @@seealso@{tea@}
@@cindex coffee @@seealso@{coca-cola@}
@@cindex whisky @@seealso@{soda@}
@@cindex whisky @@seealso@{scotch@}
@end example
@noindent
@xref{Index}, to see if the above test cases are handled properly.
Finally, we add the above code into the set of work functions:
@=
@
@
@node Necessary stuff
@chapter Necessary Stuff That Isn't Thrilling
This chapter provides some necessary but unexciting elements.
@menu
* Copyright statement:: Copyright info.
* Library functions:: From the @code{gawk} library:
@file{ftrans.awk}, @code{join()}.
* Helper functions:: @code{del_array()}, @code{check_split_null()},
@code{fatal()}, @dots{}
* I18N:: Internationalization.
@end menu
@node Copyright statement
@section Copyright Statement
@cindex copyright statement
@cindex GNU General Public License
@cindex License, GNU General Public
@cindex GPL (GNU General Public License)
Every program needs a copyright statement.
@=
#
# Copyright 2014-2024 Free Software Foundation, Inc.
#
# This file is part of GNU Texinfo.
#
# Texinfo is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# Texinfo is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <https://www.gnu.org/licenses/>.
@
@node Library functions
@section Library Functions: @file{ftrans.awk} and @code{join()}
The program uses several library routines discussed in detail
in the @command{gawk} documentation. The first sets up the
infrastructure for the @code{beginfile()} and @code{endfile()} functions.
@xref{Filetrans Function,,, gawk, GNU Awk User's Guide},
for an explanation of how this function works.
@cindex @file{ftrans.awk} library file
@cindex @code{beginfile()} function
@cindex @code{endfile()} function
@=
# ftrans.awk --- handle data file transitions
#
# user supplies beginfile() and endfile() functions
#
# Arnold Robbins, arnold@skeeve.com, Public Domain
# November 1992
FNR == 1 {
if (_filename_ != "")
endfile(_filename_)
_filename_ = FILENAME
beginfile(FILENAME)
}
END { endfile(_filename_) }
@
The next function is @code{join()}, which joins an array of characters
back into a string. @xref{Join Function,,, gawk, GNU Awk User's Guide},
for an explanation of how this function works.
@cindex @file{join.awk} library file
@cindex @code{join()} function
@=
# join.awk --- join an array into a string
#
# Arnold Robbins, arnold@skeeve.com, Public Domain
# May 1993
function join(array, start, end, sep, result, i)
{
if (sep == "")
sep = " "
else if (sep == SUBSEP) # magic value
sep = ""
result = array[start]
for (i = start + 1; i <= end; i++)
result = result sep array[i]
return result
}
@
@node Helper functions
@section Helper Functions
These helper functions make the main code easier to follow.
@menu
* del_array:: Clearing out an array.
* check_split_null:: Checking if @command{awk} splits on the null string.
* char_split:: Splitting a line into individual characters.
* fatal:: Printing fatal errors.
* is@dots{} functions:: Checking character types.
* make_regexp:: Make a regexp to match @TeX{} control sequences.
* escape:: Escaping backslashes for strings.
* min:: Get the minimum of two numbers.
@end menu
@node del_array
@subsection @code{del_array()}: Deleting An Array
@code{del_array()} clears out an array.
@cindex @code{del_array()} function
@=
function del_array(a)
{
# Portable and faster than
# for (i in a)
# delete a[i]
split("", a)
}
@
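A quick standalone check of the @code{split()} idiom (illustration only;
the array name and its contents are made up):
@example
BEGIN @{
    a["x"] = 1; a["y"] = 2
    split("", a)        # splitting the empty string empties the array
    n = 0
    for (k in a)
        n++
    print n
@}
@end example
@noindent
This prints @samp{0}, since no elements remain in @code{a}.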
@node check_split_null
@subsection @code{check_split_null()}: Checking If @command{awk} Splits On The Null String
@code{check_split_null()} determines whether the @command{awk} running
this program supports using the null string as the separator, splitting
each character off into a separate element. If so, the return value from @code{split()}
is the number of characters in the string (five for the test string
used here). The function is called at program startup.
@cindex @code{check_split_null()} function
@=
function check_split_null( n, a)
{
n = split("abcde", a, "")
return (n == 5)
}
@
@node char_split
@subsection @code{char_split()}: Splitting A String Into Characters
@code{char_split()} splits a string into separate characters, letting
@command{awk} do the work if possible. If not, each character is
extracted manually using a loop and @code{substr()}.
@cindex @code{char_split()} function
@cindex @code{Can_split_null} variable
@cindex @code{del_array()} function
@=
function char_split(string, array, n, i)
{
if (Can_split_null)
return split(string, array, "")
# do it the hard way
del_array(array)
n = length(string)
for (i = 1; i <= n; i++)
array[i] = substr(string, i, 1)
return n
}
@
@node fatal
@subsection @code{fatal()}: Printing Fatal Error Messages
@cindex @command{cat} command
@cindex stderr
The @code{fatal()} function prints a @code{printf}-formatted message to
standard error and then exits with a failure status.
For maximal portability, it opens a pipeline to @command{cat},
redirected to standard error; not all systems have a @file{/dev/stderr}
file, and not all versions of @command{awk} recognize that name
internally. (Thus, we can't use @samp{print @dots{} > "/dev/stderr"}.)
@cindex @code{EXIT_FAILURE} constant
@cindex @code{fatal()} function
@=
function fatal(format, arg1, arg2, arg3, arg4, arg5,
arg6, arg7, arg8, arg9, arg10, cat)
{
cat = "cat 1>&2" # maximal portability
printf(format, arg1, arg2, arg3, arg4, arg5,
arg6, arg7, arg8, arg9, arg10) | cat
close(cat)
exit EXIT_FAILURE
}
@
@node is@dots{} functions
@subsection @code{is@dots{}} Functions: Checking Character Types
@cindex @code{isupper()} function
@cindex @code{islower()} function
@cindex @code{isalpha()} function
@cindex @code{isdigit()} function
The following functions help identify what a character is; they are
similar in nature to the various macros in the C @code{<ctype.h>} header
file. Since most of them return the position of the character within a
list, the return value could be used to compute which character from the
set was seen; this turned out not to be
necessary in this program but might be useful in some other context.
By using @code{index()} with lists of letters, these functions will
also work on EBCDIC systems, should that ever be necessary.
@=
function isupper(c)
{
return index("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c)
}
function islower(c)
{
return index("abcdefghijklmnopqrstuvwxyz", c)
}
function isalpha(c)
{
return islower(c) || isupper(c)
}
function isdigit(c)
{
return index("0123456789", c)
}
@
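As a small standalone illustration of what @code{index()} returns for
such a list (this is not part of the program):
@example
BEGIN @{
    print index("0123456789", "7")   # position of "7" in the list
    print index("0123456789", "x")   # not in the list
@}
@end example
@noindent
This prints @samp{8} and then @samp{0}. A nonzero result counts as true,
and subtracting one from it would recover the digit's numeric value.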
@node make_regexp
@subsection @code{make_regexp()}: Matching @TeX{} Control Sequences
@file{texindex.awk} has to handle input where the command
character may be an @samp{@@} or a @samp{\}. When matching
command strings in regular expressions, if the command character
is a backslash, it must be doubled in order to be treated
literally. The @code{make_regexp()} function handles this
for us; a @samp{%} in the text of the regexp stands for
the command character and is replaced appropriately.
@cindex @code{make_regexp()} function
@=
function make_regexp(regexp, a, sep, n)
{
n = split(regexp, a, "%")
if (Command_char == "\\")
sep = Command_char Command_char
else
sep = Command_char
return join(a, 1, n, sep)
}
@
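The effect of the substitution can be sketched in a standalone way as
follows. The helper name @code{demo_regexp()} is hypothetical; unlike
@code{make_regexp()}, it takes the command character as a parameter
instead of using the global @code{Command_char}, and it joins the pieces
with a simple loop instead of calling @code{join()}.
@example
function demo_regexp(template, cmd_ch,    a, n, i, sep, result)
@{
    n = split(template, a, "%")
    # a backslash must be doubled to match literally in a regexp
    sep = (cmd_ch == "\\") ? cmd_ch cmd_ch : cmd_ch
    result = a[1]
    for (i = 2; i <= n; i++)
        result = result sep a[i]
    return result
@}
BEGIN @{
    print demo_regexp(" *%subentry +", "@@")
    print demo_regexp(" *%subentry +", "\\")
@}
@end example
@noindent
The first line of output is @samp{ *@@subentry +} and the second is
@samp{ *\\subentry +}.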
@node escape
@subsection @code{escape()}: Escaping Backslashes for Strings
If the command character is a backslash, occurrences of backslash need to be
doubled before the containing string can be used as a regexp. This
function does that job; it's very similar to @code{make_regexp()}
(@pxref{make_regexp}).
@cindex @code{escape()} function
@=
function escape(regexp, a, n)
{
if (Command_char != "\\")
return regexp
n = split(regexp, a, "\\")
if (n == 1)
return regexp
return join(a, 1, n, "\\\\")
}
@
@node min
@subsection @code{min()}: Getting The Minimum of Two Numbers
It'd be nice if @command{awk} had this built-in@enddots{}
@=
function min(a, b)
{
return (a < b ? a : b)
}
@
@node I18N
@section Internationalization
For @command{gawk}, we can arrange for the various messages, e.g., in
the @code{usage()} and @code{version()} functions, to be translated. We
do this by setting the text domain at startup. For more information on
internationalization in @command{gawk},
@pxref{Internationalization,,, gawk, GNU Awk User's Guide}.
@cindex @code{TEXTDOMAIN} variable
@=
TEXTDOMAIN = "texinfo"
@
@noindent
On non-GNU versions of @command{awk}, this is a harmless
assignment, and the @code{_"..."} construct below is a harmless
concatenation of an unassigned variable @code{_}, i.e., the empty
string, with the following string constant.
The @code{usage()} and @code{version()} functions print the necessary
information and then exit. The strings that can and should be
translated are prefixed with an underscore.
@cindex @code{Texindex_version} variable
@cindex @code{usage()} function
@cindex @code{version()} function
@cindex @code{EXIT_SUCCESS} constant
@use_smallexample
@=
function usage(exit_val)
{
printf(_"Usage: %s [OPTION]... FILE...\n", Invocation_name)
print _"Generate a sorted index for each TeX output FILE."
print _"Usually FILE... is specified as `foo.??' for a document `foo.texi'."
print ""
print _"Options:"
print _" -h, --help display this help and exit"
print _" --version display version information and exit"
print _" -- end option processing"
print ""
print _"Email bug reports to bug-texinfo@gnu.org,"
print _"general questions and discussion to help-texinfo@gnu.org."
print _"Texinfo home page: https://www.gnu.org/software/texinfo/"
exit exit_val
}
function version()
{
print "texindex (GNU texinfo)", Texindex_version
print ""
printf _"Copyright (C) %s Free Software Foundation, Inc.\n", "2024"
print _"License GPLv3+: GNU GPL version 3 or later "
print _"This is free software: you are free to change and redistribute it."
print _"There is NO WARRANTY, to the extent permitted by law."
exit EXIT_SUCCESS
}
@
@use_example
@node Index
@unnumbered Index
@printindex cp
@bye