\input texinfo @c -*-texinfo-*- @c vim: filetype=texinfo tabstop=4 shiftwidth=4 @c %**start of header (This is for running Texinfo on a region.) @setfilename texindex.info @settitle Texindex @VERSION@: A program for sorting indices @c %**end of header (This is for running Texinfo on a region.) @c Merge the function and variable indexes into the concept index, @c but without the code font; in the index entries we'll do the @c font management ourselves. Also merge in the chunk definition @c and reference entries, which jrweave creates for us. @c (Ordinarily this would be in the header, but jrweave puts the @c defindexes later.) @synindex fn cp @synindex vr cp @synindex cd cp @synindex cr cp @ifnottex @ifnotdocbook @macro ii{text} @i{\text\} @end macro @end ifnotdocbook @end ifnottex @ifdocbook @macro ii{text} @inlineraw{docbook,\text\} @end macro @end ifdocbook @copying This @command{texindex} program (version @VERSION@, @UPDATED@) sorts the raw index files created by @file{texinfo.tex}. (This Texinfo source is a literate program written using TexiWeb@tie{}Jr., not a user manual.) Copyright @copyright{} 2014-2023 Free Software Foundation, Inc. @quotation This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see @url{https://www.gnu.org/licenses/}. @end quotation @end copying @titlepage @title Texindex @subtitle version @VERSION@, @UPDATED@ @author Arnold D. 
Robbins @author and Texinfo maintainers @page @vskip 0pt plus 1filll @insertcopying @end titlepage @contents @ifnottex @node Top @top Texindex This file defines @command{texindex} (version @VERSION@, @UPDATED@), an @code{awk} program that processes the raw index files produced by the @file{texinfo.tex} file. @end ifnottex @menu * Preface:: Introductory remarks. * Requirements:: How the program needs to work. * High-level organization:: The overall outline. * Processing records:: Processing each record. * Necessary stuff:: Copyright, helper functions, i18n. * Index:: Combined index. @detailmenu * Intended audience:: Who should read this document. * History:: @file{texindex.awk} development history. * Desired printed output:: What a printed index should look like. * Texinfo indexing commands:: How to write indexing commands. * Input form:: The input to @file{texindex.awk}. * Output form:: The output from @file{texindex.awk}. * Processing:: Processing the data. * Assumptions:: Additional assumptions. * Portability:: Using portable @command{awk}. * First line:: The first line in the file. * Initial setup:: Set up variables and constants used throughout. * Argument processing:: Processing command line arguments. * Setup for each input file:: What happens at the start of each file. * Processing each record:: Pulling apart the fields and storing data. * Remove duplicates:: Removing duplicating entries. * Remove leading @code{\entry}:: Remove the leading @code{\entry} command. * Get the initial:: Get the initial for this entry. * Set up and name fields:: Pull apart the line. * Store the data for this line:: Store the data for later access. * Check for more than one initial:: See if there are multiple initials. * Splitting the record:: Split the record apart. * End-of-file sorting and printing:: Sorting the entries for each index. * Quicksort:: Sorting our input. * Multilevel comparisons:: Handling multilevel entries. 
* Comparing index entries:: The heart of the sorting algorithm. * Printing the data:: Printing the final results. * printing top level:: Top level logic. * printing a single entry:: Handling a single entry. * Copyright statement:: Copyright info. * Library functions:: From the @code{gawk} library: @file{ftrans.awk}, @code{join()}. * Helper functions:: @code{del_array()}, @code{check_split_null()}, @code{fatal()}, @dots{} * del_array:: Clearing out an array. * check_split_null:: Checking if @command{awk} splits on the null string. * char_split:: Splitting a line into individual characters. * fatal:: Printing fatal errors. * is@dots{} functions:: Checking character types. * make_regexp:: Make a regexp to match @TeX{} control sequences. * escape:: Escaping backslashes for strings. * min:: Get the minimum of two numbers. * I18N:: Internationalization. @end detailmenu @end menu @node Preface @unnumbered Preface This file defines @file{texindex.awk}, a reimplementation of the C program @file{texindex.c}. The purpose is to make the program more maintainable. As a practical benefit, it also supports correct sorting and initials for the @samp{@{} and @samp{@}} characters in an index, and multi-level index entries. @cindex @{ (left brace), example index entry for @cindex @} (right brace), example index entry for @cindex TexiWeb Jr.@: literate programming system @cindex Texinfo document formatting language This is a @dfn{literate program}, written using the @uref{https://github.com/arnoldrobbins/texiwebjr, @sc{TexiWeb Jr.@:} literate programming system}. The underlying documentation system is @uref{https://www.gnu.org/software/texinfo, Texinfo}, the GNU documentation formatting language. A single source file produces the runnable program, a printable document, and an online document. @menu * Intended audience:: Who should read this document. * History:: @file{texindex.awk} development history. 
@end menu @node Intended audience @section Intended Audience You should read this if you want to understand how @file{texindex.awk} works. You should be familiar with the @command{awk} programming language. If you are interested in array indexing, you've come to the wrong place. @xref{knuth}. @c Scale figure to 4.5 inches which is good for both smallbook @c and regular. TeX will scale height also automatically. @float Figure,knuth @caption{Indexing (@url{https://xkcd.com/163/})} @center @image{dek_idx, 5in, , Indexing} @end float @node History @section Development History This program was originally written in 2014 in order to enable using left and right braces in index entries, and to provide a program that would be more easily maintainable going forward than the C version. In 2019, discussions on the Texinfo bug mailing list around adding multi-level indexing and ``see'' and ``see also'' entries motivated reworking the program. @node Requirements @chapter Requirements The input to this program is the list of unsorted index entries produced by @file{texinfo.tex} when a Texinfo document is processed. This chapter presents the input to the program, the Texinfo commands that produce that input, and the expected output from this program. It also presents some additional notes concerning the requirements. @menu * Desired printed output:: What a printed index should look like. * Texinfo indexing commands:: How to write indexing commands. * Input form:: The input to @file{texindex.awk}. * Output form:: The output from @file{texindex.awk}. * Processing:: Processing the data. * Assumptions:: Additional assumptions. * Portability:: Using portable @command{awk}. @end menu @node Desired printed output @section Where We're Going Let's first look at the kind of output desired. A high-quality index has several types of entries: @table @asis @item Single level entries These are the most common; each entry has text and a list of one or more page numbers. 
@item Double level entries
These entries have subtopics; the top level entry may also
have a page number, or it may not.

@item Triple level entries
These entries have subtopics for the subtopics; the top level and
secondary entries may also have page numbers, or they may not.

@item ``See @dots{}'' entries
Entries that point at other entries in the index, generally without
any subtopics.  ``See'' entries do not have a page number of their own.

@item ``See also @dots{}'' entries
Entries that point at other entries, but often do have subsequent
direct page references of their own.  ``See also'' entries are merged
with regular entries, their text coming after the page numbers.
@end table

Here's what they might look like when printed (apologies in advance
for the use of a constant-width font):

@example
coffee makers . . . . . . . . . . 15     @ii{Single level entry}
   electric . . . . . . . 17, 22         @ii{Double level entry}
      blue . . . . . . . . 42            @ii{Triple level entry}
      pink . . . . . . . . 35            @ii{Another triple level entry}
@end example

The same hierarchy might appear without page numbers:

@example
coffee makers,                           @ii{Single level entry}
   electric,                             @ii{Double level entry}
      blue . . . . . . . . . . 42        @ii{Triple level entry}
      pink . . . . . . . . . . 35        @ii{Another triple level entry}
@end example

A ``See'' entry doesn't have page numbers:

@example
espresso makers, @ii{See} coffee makers
toasters, . . . . . . . . 42, @ii{See also} coffee makers
@end example

@node Texinfo indexing commands
@section Texinfo Indexing Commands

Texinfo provides a number of different commands for putting entries
into different indices.  For discussion we use the @code{@@cindex}
command in the following examples.

Of interest to @file{texindex.awk} is the text of the one-to-three
parts of an entry, and how the index should be sorted.
Some examples:

@example
@@cindex coffee makers                                       @ii{One level}
@@cindex coffee makers @@subentry electric                    @ii{Two levels}
@@cindex coffee makers @@subentry electric @@subentry blue     @ii{Three levels}
@end example

Here, the @code{@@subentry} separates the secondary and tertiary parts
of the entry from the primary part.
Additionally, each part may have an @code{@@sortas} clause:

@example
@@cindex coffee makers @@sortas@{Coffee Makers@}
@end example

``See'' and ``See also'' entries look like this:

@example
@@cindex espresso makers @@seeentry@{coffee makers@}
@@cindex toasters @@seealso@{coffee makers@}
@end example

@noindent
Note that there is (or should be) no comma between the primary text
and the @code{@@seeentry} or @code{@@seealso}: @file{texindex.awk}
supplies a comma in the final printed entry.

@node Input form
@section Input To The Program

The output from @file{texinfo.tex} contains the data for the different
kinds of index entries described in the previous section.
Each line is an @dfn{entry}.  Each entry has from three
to five fields, where the first three fields represent the same data
for all entries.  Entries look as follows:

@example
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}@{@var{secondary}@}
@@entry@{@var{sortkey}@}@{@var{page or see}@}@{@var{primary}@}@{@var{secondary}@}@{@var{tertiary}@}
@end example

The braces are balanced in all cases, although for use by this
program, literal braces (not necessarily balanced) can be included in
the sort key by escaping them with the @dfn{command character}.

@cindex backslash vs.@: at
@cindex command character, @samp{\} vs.@: @samp{@@}
In the example above, the command character is @samp{@@} (as in Texinfo
itself).  Historically, however, the command character was backslash
(@samp{\}), and @file{texindex.awk} can handle either one.
Because older versions of @command{texi2dvi} only understand backslash
as the command character, it remains set to backslash, so that newer
versions of @file{texinfo.tex} will work with older versions of
@command{texi2dvi}.  Once the newer version of @command{texi2dvi} that
also understands @samp{@@} has had a chance to spread, (we can hope
that) the command character will change to @samp{@@}.

The command character is determined at run time by looking at the
first character on the first line of each input file.

The fields are as follows:

@table @var
@item sortkey
The text to use for sorting index entries.  Generally, this is the
text of the line with all markup removed.  When an @code{@@sortas}
clause is provided, its contents are used instead.  This field should
contain only ASCII characters.

When there are subentries, the sort key is the concatenation of three
fields (or their @code{@@sortas} clauses), separated by
@code{@@subentry} and a space.  @file{texindex.awk} needs to recognize
where there are multiple sort keys in order to print entries
appropriately.

@item page or see
Either a page number (as a roman numeral or an integer, possibly with
additional markup), or an indication that this is a ``See'' or
``See also'' entry.

@item primary
The primary text of the index entry.

@item secondary
The (optional) secondary text of the index entry.

@item tertiary
The (optional) tertiary text of the index entry.
@end table

Our mission (which we choose to accept) is to read the above input,
sort it appropriately, and produce the correct output.

@node Output form
@section What The Output Should Look Like

Output consists of four different commands, derived from the indexing
input.  The middle two can have a variant where there is no page number.
@example
@@initial @{A@}                                        @ii{For the initial over each group}
@@entry@{@var{indexing text}@}@{@var{pagenum}@}          @ii{Primary entry}
@@entry@{@var{indexing text}@}@{@}                 @ii{Primary entry without page number}
@@secondary@{@var{indexing text}@}@{@var{pagenum}@}      @ii{Secondary entry}
@@secondary@{@var{indexing text}@}@{@}             @ii{Secondary entry without page number}
@@tertiary@{@var{indexing text}@}@{@var{pagenum}@}       @ii{Tertiary entry}
@end example

The commands are:

@table @code
@item @@initial
Besides the index entries, @file{texindex.awk} must output special
lines indicating the first character (the @dfn{initial}) of keys
grouped together, but only if there is more than one initial used
throughout the input file.

@item @@entry
This is for a plain index entry, or for the primary term in a
multi-level index entry.  When the primary appears only with a
secondary entry, there won't be a page number.

No page number is printed when there is an @code{@@seeentry} in the
entry.  In that case, the output from @file{texindex.awk} should be a
combination of the original input fields three and two.

@item @@secondary
This is for a secondary index entry in a multi-level index entry.
When the secondary appears only with a tertiary entry, there won't be
a page number.

@item @@tertiary
This is for a tertiary index entry in a multi-level index entry.
This one always has a page number, unless it is a ``see'' entry.
@end table

@node Processing
@section Processing Index Entries

The job is to sort the entries, and merge those which are identical
except for the page numbers.  The sorting should be in the order of:
all symbols first, then all digits, then all letters, with uppercase
letters following lowercase ones, so we will need some smarts.

Once sorted, the lines must be output in the correct form, depending
upon how many entries and subentries they have.

Input lines might be duplicated (same entry, same page, more than
once), so we will have to deal with that.
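
The ordering rule stated in this section can be sketched for single
characters as follows.  This is only an illustration of the rule, not
the comparison code this program uses
(@pxref{Comparing index entries}).  The names @code{char_class()} and
@code{char_cmp()} are hypothetical and are not part of
@file{texindex.awk}; the sketch also assumes that any character that
is neither an ASCII digit nor an ASCII letter counts as a symbol.

@example
# Hypothetical helper: 1 for symbols, 2 for digits, 3 for letters.
function char_class(c)
@{
    if (index("0123456789", c) > 0)
        return 2
    if (index("abcdefghijklmnopqrstuvwxyz", tolower(c)) > 0)
        return 3
    return 1
@}

# Hypothetical helper: negative if a sorts before b, zero if they
# are equal, positive otherwise.  Both arguments are single characters.
function char_cmp(a, b,    ca, cb, la, lb)
@{
    ca = char_class(a)
    cb = char_class(b)
    if (ca != cb)
        return ca - cb              # symbols, then digits, then letters
    if (a == b)
        return 0
    if (ca == 3) @{
        la = tolower(a)
        lb = tolower(b)
        if (la != lb)
            return (la < lb) ? -1 : 1
        return (a == la) ? -1 : 1   # same letter: lowercase first
    @}
    return ((a "") < (b "")) ? -1 : 1   # force string comparison
@}
@end example

With these definitions, @code{char_cmp("!", "7")},
@code{char_cmp("7", "a")} and @code{char_cmp("a", "A")} all return
negative values, matching ``symbols, then digits, then letters, with
uppercase following lowercase.''  Ranking by class first keeps the
result independent of the character set's native ordering, in which
uppercase letters would otherwise precede lowercase ones.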
@node Assumptions @section Assumptions About Our Data In the rest of the program we make two fundamental assumptions: @enumerate 1 @item If a given sort key has more than one display text, we only take the first (this matches the behavior of C @command{texindex}). Put another way, if the same sort key has two different display texts, it means that different markup was used, probably inadvertently, and we just take the first. As an example, consider these two Texinfo commands: @example @@cindex @@file@{field_split()@} function @dots{} @@cindex @@code@{field_split()@} function @end example @noindent They produce the following output via @file{texinfo.tex}, which in turn is the input to @file{texindex.awk}: @example @@entry@{field_split() function@}@{2@}@{@@file @{field_split()@} function@} @@entry@{field_split() function@}@{7@}@{@@code @{field_split()@} function@} @end example @noindent The result will be a single entry, using @code{@@file}, accumulating the page numbers: @example @@entry@{@@file @{field_split()@} function@}@{2, 7@} @end example @item @cindex roman numerals For the same sort key and text, page numbers will be monotonically increasing. This means we can just use a new page number when it comes in, and not have to sort entries based on both sort key and page number. In turn, this means that we don't need to worry about page numbers that are roman numerals (which can occur). @end enumerate @node Portability @section Using Portable @command{awk} An additional requirement, for ease of deployment, is that the program be written in portable @command{awk}, and not use features found only in GNU @command{awk} (@command{gawk}). For our purposes, ``portable'' means ``new'' @command{awk} as defined in the 1988 book by Aho, Weinberger and Kernighan. This gives us functions, multidimensional arrays and a number of other important features over the original @command{awk} shipped with V7 Unix. 
In practice, we can also rely on basic features added in POSIX @command{awk} should we need to (such as @code{CONVFMT}), although at the moment there are no such features used by @file{texindex.awk}. We tested the program with five versions of @command{awk} (@command{gawk}, @command{mawk}, Brian Kernighan's @command{awk}, @command{goawk}, and Busybox @command{awk}) on a large index and got byte-identical results from all five. @node High-level organization @chapter High-level Organization The general outline is as follows: @(texindex.awk@) = @ @ @ @ BEGIN { @ @ } @<@code{beginfile()} work function@> @<@code{endfile()} work function@> @ @ @ @menu * First line:: The first line in the file. * Initial setup:: Set up variables and constants used throughout. * Argument processing:: Processing command line arguments. @end menu @node First line @section The Program's First Line @cindex first line @cindex @code{#!} header @cindex header, shebang For the first line of the generated output, we hardwire our intended output file name and how it got made. We do not use a @samp{#!} header because, being a GNU program, we need to accept the @option{--help} and @option{--version} options. This cannot be done with a standalone @code{awk} script; we need a shell wrapper, and hence, the @code{awk} script itself need not be executable. Also, it's simpler not to worry about the location of the @code{awk} program. @= # texindex.awk, generated by jrtangle from ti.twjr. @ @node Initial setup @section Initial Setup @cindex initial setup The initial setup sets up some constants, including the version of the program. In the program itself, we follow a convenient convention: global variable and array names start with a capital letter. @cindex @code{Invocation_name} variable Per GNU standards, we sometimes hardwire the string @samp{texindex} as the name of the program, and sometimes use the name by which the program was invoked. 
We'll call the latter @code{Invocation_name}; it's supposed to be passed in from the shell wrapper. @cindex @code{Can_split_null} variable The last line below sets up @code{Can_split_null}, which tells us if the built-in @code{split()} function will split apart a string into its individual characters or if we have to do it manually. @cindex @code{TRUE} constant @cindex @code{FALSE} constant @cindex @code{EXIT_SUCCESS} constant @cindex @code{EXIT_FAILURE} constant @cindex @code{Texindex_version} variable @cindex @code{check_split_null()} function @cindex @code{Can_split_null} variable @cindex @code{Invocation_name} variable @= TRUE = 1 FALSE = 0 EXIT_SUCCESS = 0 EXIT_FAILURE = 1 Texindex_version = "@VERSION@" if (! Invocation_name) { # provide fallback in case it's not passed in. Invocation_name = "texindex" } Can_split_null = check_split_null() @ @node Argument processing @section Command-line Argument Processing @cindex argument processing @cindex @code{usage()} function @cindex @code{version()} function Argument processing is straightforward, though manual. The important thing is to remove options and their arguments from @code{ARGV} so that they're not treated as filenames. The options that print version or help information automatically exit, so there's no need to mess with @code{ARGV} in those cases. 
@cindex @code{-h} (@code{--help}) option @cindex @code{-k} (@code{--keep}), no-op option @cindex @code{--} option @cindex @code{--version} option @cindex @code{EXIT_SUCCESS} constant @cindex @code{EXIT_FAILURE} constant @cindex @code{fatal()} function @= for (i = 1; i < ARGC; i++) { if (ARGV[i] == "-h" || ARGV[i] == "--help") { usage(EXIT_SUCCESS) } else if (ARGV[i] == "--version") { version() } else if (ARGV[i] == "-k" || ARGV[i] == "--keep") { # do nothing, backwards compatibility delete ARGV[i] } else if (ARGV[i] == "--") { delete ARGV[i] break } else if (ARGV[i] ~ /^--?.+/) { fatal(_"%s: unrecognized option `%s'\n" \ "Try `%s --help' for more information.\n", Invocation_name, ARGV[i], Invocation_name) # fatal() will do `exit EXIT_FAILURE' } else { break } } @ @node Processing records @chapter Processing Records Processing records includes setting things up for each input file, pulling apart each record, sorting the data at the end, and writing out the data properly. @menu * Setup for each input file:: What happens at the start of each file. * Processing each record:: Pulling apart the fields and storing data. * End-of-file sorting and printing:: Sorting the entries for each index. * Printing the data:: Printing the final results. @end menu @node Setup for each input file @section Setup For Each Input File At the beginning of each input file, the @code{beginfile()} function clears our variables from any previous processing and sets up the output file name. We always append an @samp{s} to the name of the input file. This is the standard convention. When @code{beginfile()} is called, the first record has already been read, so it's possible to perform the checks for a Texinfo index file: The first character must be either @samp{\} or @samp{@@} (@pxref{Requirements}), and the next five characters must be the word @samp{entry}. 
@cindex @code{Special_chars} variable @code{Special_chars} are the characters that must be preceded by the command character inside the first key. This includes the command character itself. Finally, several variables are set to regular expressions that match control sequences of interest. @cindex @code{fatal()} function @cindex @code{FALSE} constant @cindex @code{beginfile()} function @cindex @code{Output_file} variable @cindex @code{Do_initials} variable @cindex @code{Prev_initial} variable @cindex @code{Command_char} variable @cindex @code{Special_chars} variable @cindex @code{Entries} variable @<@code{beginfile()} work function@>= function beginfile(filename) { Output_file = filename "s" @ Entries = 0 Do_initials = FALSE Prev_initial = "" Command_char = substr($0, 1, 1) if ((Command_char != "\\" && Command_char != "@") \ || substr($0, 2, 5) != "entry") fatal(_"%s is not a Texinfo index file\n", filename) Special_chars = "{}" Command_char @ } @ @node Processing each record @section Processing Each Record Record processing consists of building the data structures for use in sorting and printing once the whole file has been processed. @= { @ @ @ @ @ @ @ } @ @menu * Remove duplicates:: Removing duplicating entries. * Remove leading @code{\entry}:: Remove the leading @code{\entry} command. * Get the initial:: Get the initial for this entry. * Set up and name fields:: Pull apart the line. * Store the data for this line:: Store the data for later access. * Check for more than one initial:: See if there are multiple initials. * Splitting the record:: Split the record apart. @end menu @node Remove duplicates @subsection Removing Duplicate Lines @cindex removing duplicates @cindex duplicates, removing @cindex @code{Seen} array Duplicates are going to be exact. Removing them is thus easy; store each incoming line as the index of an array named @code{Seen}. If a line is not there, it has not been seen. Otherwise it has, and we move on to the next record. 
@cindex @code{TRUE} constant @cindex @code{Seen} array @= # Remove duplicates, which can happen if ($0 in Seen) next Seen[$0] = TRUE @ We have to clear out the @code{Seen} array at the start of each input file. @cindex @code{del_array()} function @= # Reinitialize these for each input file del_array(Seen) @ @node Remove leading @code{\entry} @subsection Remove The Leading @code{\entry} Or @code{@@entry} We use @code{substr()} here to avoid possible hassles with leading backslashes in @code{sub()}. @= $0 = substr($0, 7) # remove leading \entry or @entry @ @node Get the initial @subsection Get The Initial @cindex @code{extract_initial()} function @= initial = extract_initial($0) @ The sort key is the first part of the line after @samp{@@entry}, starting with an open brace, and continuing to a matching close brace. The very first character of the sort key can be an open brace. If so, we extract the component of the sort key surrounded by balanced braces. We don't account for @samp{\@{} or @samp{\@}} inside this component, as @file{texinfo.tex} isn't expected to produce such output. An example can be seen in what older versions of @file{texinfo.tex} generated if you needed to index a real backslash, namely an input line something like the following: @example \entry@{@{\tt \indexbackslash @} (backslash)@}@{14@}@{\code @{@{\tt @dots{}@}@} @end example Earlier versions of @command{texindex} took the first non-brace character as the initial, in this example @samp{\}, and output it as @samp{\\}; this was not, however, a control sequence recognized by the older versions of @file{texinfo.tex}. 
@cindex @code{extract_initial()} function @cindex @code{char_split()} function @cindex @code{fatal()} function @= function extract_initial(key, initial, nextchar, i, l, kchars) { l = char_split(key, kchars) if (l >= 3 && kchars[2] == "{") { bracecount = 1 i = 3 while (bracecount > 0 && i <= l) { if (kchars[i] == "{") bracecount++ else if (kchars[i] == "}") bracecount-- i++ } if (i > l) fatal(_"%s:%d: Bad key %s in record\n", FILENAME, FNR, key) initial = substr(key, 2, i - 2) } else if (kchars[2] == Command_char) { nextchar = kchars[3] if (initial == Command_char && index("{}", nextchar) > 0) initial = substr(key, 2, 3) else { initial = toupper(nextchar) } } else { initial = toupper(kchars[2]) } return initial } @ @node Set up and name fields @subsection Set Up And Name The Fields The next step is to pull out the data of interest from the multiple sets of braces. This is delegated to a function named @code{field_split()}. There must be at least three fields, and there can be up to five. @cindex @code{fatal()} function @cindex @code{field_split()} function @cindex @code{fields} array, setting up @= numfields = field_split($0, fields, "{", "}", Command_char) if (numfields < 3 || numfields > 5) fatal(_"%s:%d: Bad entry; expected 3 to 5 fields, not %d\n", FILENAME, FNR, numfields) @ We give the fields names for later use. @= key = fields[1] pagenum = fields[2] primary_text = fields[3] secondary_text = (numfields > 3 ? fields[4] : "") tertiary_text = (numfields > 4 ? fields[5] : "") @ @node Store the data for this line @subsection Store The Data For This Line @cindex storing data We use multiple arrays to store different parts of the data. The sort key from the input is invariant across entries, so we use that as the index in the various arrays. We need the following arrays: @table @code @item Numfields How many fields (entries) in this line: one, two, or three. @item Initials The initial for this line. @item Primary The primary index text. 
This is the real text, not the stripped value appearing in the sort key. @item Secondary The secondary index text, if present. @item Tertiary The tertiary index text, if present. @item Pagedata The page numbers on which identical entries occur. @item See_count The number of ``see'' entries for a given index line. Nothing prevents there being multiple such: @cindex toast @seeentry{toaster} @cindex toast @seeentry{jam} @example @@cindex toast @@seeentry@{toaster@} @@cindex toast @@seeentry@{jam@} @end example @noindent So we have to be prepared to handle such input. The @code{See_count} array counts the number of such texts as there may be. @item See The actual texts of the @samp{@@seeentry} value for the line's sort key. Each entry goes into @code{See[key, 1]}, @code{See[key, 2]}, etc., up to @code{See_count[key]}. @item Seealso_count @itemx Seealso These serve the same purpose as @code{See_count} and @code{See}, but for @code{@@seealso} entries. @end table In the event that a particular key has more than one associated output text, we'll keep the first and ignore the remainder (this is the same behavior as the C implementation). @xref{Assumptions}. For page numbers, we merely append the page number field from the input, preceded by a comma and space, unless that page number was already the last that's been stored. (We're assuming the page numbers don't jump around, which, in fact, they don't, so we don't need a more complex approach.) This also handles any page numbers that appear as roman numerals (from the so-called front matter), should there be such. In addition to all the previously described arrays, the key is stored in the @code{Keys} array the first time it is seen; this array is sorted later on. Its indices are just incremented integers, stored in the global @code{Entries} variable. The @code{Allkeys} associative array lets us easily track if we have seen a key before. 
@cindex @code{Keys} array @cindex @code{Allkeys} array @cindex @code{Entries} variable @cindex @code{Seeentry_re} variable @cindex @code{Seealso_re} variable @cindex @code{Primary} array @cindex @code{Secondary} array @cindex @code{Tertiary} array @cindex @code{Numfields} array @cindex @code{See} array @cindex @code{See_count} array @cindex @code{Seealso} array @cindex @code{Seealso_count} array @cindex @code{Pagedata} array @cindex @code{escape()} function @= if (! (key in Allkeys)) { # first time we've seen this full line Keys[++Entries] = key Allkeys[key] = key Initials[key] = initial Numfields[key] = numfields - 2 # don't count sortkey, page number Primary[key] = primary_text if (secondary_text) Secondary[key] = secondary_text if (tertiary_text) Tertiary[key] = tertiary_text @ if (pagenum ~ Seeentry_re) { See_count[key]++ See[key, See_count[key]] = pagenum } else if (pagenum ~ Seealso_re) { Seealso_count[key]++ Seealso[key, Seealso_count[key]] = pagenum } else Pagedata[key] = pagenum } else { # We've seen this key before: # Add to see or see also, or else add to list of pages. # In the latter case, make sure we've not seen this # page number before. (Shouldn't happen based on the # earlier removal of exact duplicates, but we could have # an identical key with different formatting of actual text. if (pagenum ~ Seeentry_re) { See_count[key]++ See[key, See_count[key]] = pagenum } else if (pagenum ~ Seealso_re) { Seealso_count[key]++ Seealso[key, Seealso_count[key]] = pagenum } else if (! (key in Pagedata)) { Pagedata[key] = pagenum } else if (Pagedata[key] != pagenum \ && Pagedata[key] !~ escape(", " pagenum "$")) { Pagedata[key] = Pagedata[key] ", " pagenum } } @ We split the key into subparts, using the @samp{@@subentry} as the separator. The subparts are stored in the @code{Subkeys} array. 
@cindex @code{Subkeys} array
@=
n = split(key, subparts, Subentry_re)
for (i = 1; i <= n; i++)
    Subkeys[key, i] = subparts[i]
@

The @code{Seeentry_re}, @code{Seealso_re} and @code{Subentry_re}
variables are regular expressions that match the corresponding @TeX{}
control sequences.  They're initialized once for each input file, since the
command character might be different between files.  The
@code{make_regexp()} function is described in @ref{make_regexp}.

@cindex @code{Seeentry_re} variable
@cindex @code{Seealso_re} variable
@cindex @code{Subentry_re} variable
@cindex @code{make_regexp()} function
@=
Seeentry_re = make_regexp("%seeentry")
Seealso_re = make_regexp("%seealso")
Subentry_re = make_regexp(" *%subentry +")
@

Here too, we have to clear out the arrays we've used at the start
of each input file.

@cindex @code{Keys} array
@cindex @code{Subkeys} array
@cindex @code{Allkeys} array
@cindex @code{Initials} array
@cindex @code{Numfields} array
@cindex @code{Primary} array
@cindex @code{Secondary} array
@cindex @code{Tertiary} array
@cindex @code{See} array
@cindex @code{See_count} array
@cindex @code{Seealso} array
@cindex @code{Seealso_count} array
@cindex @code{Pagedata} array
@cindex @code{del_array()} function
@=
del_array(Keys)
del_array(Allkeys)
del_array(Subkeys)
del_array(Initials)
del_array(Numfields)
del_array(Primary)
del_array(Secondary)
del_array(Tertiary)
del_array(See)
del_array(See_count)
del_array(Seealso)
del_array(Seealso_count)
del_array(Pagedata)
@

@node Check for more than one initial
@subsection Check For More Than One Initial
@cindex initial, checking for more than one

Finally, we need to determine if more than one initial occurs in the
input.  If so, we set @code{Do_initials} to true.  As soon as it's true,
we don't need to do further checking on subsequent lines.

@cindex @code{TRUE} constant
@cindex @code{Do_initials} variable
@cindex @code{Prev_initial} variable
@=
if (! Do_initials) {
    if (Prev_initial == "")
        Prev_initial = initial
    else if (initial != Prev_initial)
        Do_initials = TRUE
}
@

@node Splitting the record
@subsection Splitting The Record: @code{field_split()}

Let's take a look at the function that breaks apart the record.
Upon entry to the function, the value of @code{record} looks
something like:

@example
@{POSIX awk@}@{5@}@{POSIX @@command @{awk@}@}
@end example

The first field may have instances of @samp{@@@{} and/or @samp{@@@}}
(or @samp{\@{} and/or @samp{\@}}), so the braces aren't necessarily
exactly balanced.

The @code{field_split()} function uses fairly straightforward
``count the delimiters'' code.  The loop starts at two, since we know
the first character is an open brace.  The main things to handle are the
command character and the final closing brace.

@cindex @code{field_split()} function
@cindex @code{char_split()} function
@=
function field_split( \
    record, fields, start, end, com_ch, # parameters
    chars, numchars, out, delim_count, i, j, k) # locals
{
    del_array(fields)
    numchars = char_split(record, chars)
    j = 1   # index into fields
    k = 1   # index into out
    delim_count = 1
    for (i = 2; i <= numchars; i++) {
        if (chars[i] == com_ch) {
            @
        } else if (chars[i] == start) {
            delim_count++
            out[k++] = chars[i]
        } else if (chars[i] == end) {
            delim_count--
            if (delim_count == 0) {
                @
            } else
                out[k++] = chars[i]
        } else
            out[k++] = chars[i]
    }

    return j - 1    # num fields
}
@

If the command character is doubled, we pass that on through, so that
@TeX{} will process it correctly.  If the character following the command
character is an open brace or close brace, we pull it in.  Otherwise, the
command character is left alone as part of the field.
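
The ``count the delimiters'' idea, together with the three cases for the
command character, can be tried out in isolation with a sketch such as the
following.  It is a simplified stand-in and not the program's
@code{field_split()}: it assumes @samp{@@} as the command character and
braces as delimiters, walks the string with @code{substr()}, and does no
error checking.  The name @code{split_fields()} is hypothetical.

```awk
# Standalone sketch of the "count the delimiters" technique.
# Assumptions: "@" is the command character, "{" and "}" delimit
# fields, and the input is well formed (no error checking here).
function split_fields(record, fields,    n, i, c, next_c, depth, cur, nf)
{
    n = length(record)
    depth = 0
    nf = 0
    cur = ""
    for (i = 1; i <= n; i++) {
        c = substr(record, i, 1)
        next_c = substr(record, i + 1, 1)
        if (c == "@" && next_c == "@") {
            cur = cur "@@"          # doubled command character: keep both
            i++
        } else if (c == "@" && (next_c == "{" || next_c == "}")) {
            cur = cur next_c        # escaped brace: keep it, don't count it
            i++
        } else if (c == "{") {
            if (depth++ > 0)        # a nested open brace is ordinary text
                cur = cur c
        } else if (c == "}") {
            if (--depth == 0) {     # closing brace of the whole field
                fields[++nf] = cur
                cur = ""
            } else
                cur = cur c
        } else
            cur = cur c             # includes a lone command character
    }
    return nf
}

BEGIN {
    n = split_fields("{POSIX awk}{5}{POSIX @command {awk}}", f)
    for (i = 1; i <= n; i++)
        printf("field %d: %s\n", i, f[i])
}
```

For the sample record shown earlier, the sketch yields three fields:
@samp{POSIX awk}, @samp{5}, and @samp{POSIX @@command @{awk@}}.
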
@cindex @code{Command_char} variable
@cindex @code{Special_chars} variable
@=
if (chars[i+1] == Command_char) { # input was @@
    out[k++] = chars[i+1]
    out[k++] = chars[i+1]
    i++
} else if (index(Special_chars, chars[i+1]) != 0) {
    out[k++] = chars[i+1]
    i++
} else
    out[k++] = chars[i]
@

Upon seeing the final closing brace, we put all the characters back together
into a string using @code{join()}.  We then reset the @code{out} array
for the next time through.  If the next character isn't an open brace,
then the line is bad and we print a fatal error.  Otherwise, we reset
@code{delim_count} to one.

@cindex @code{join()} function
@cindex @code{del_array()} function
@cindex @code{fatal()} function
@=
fields[j++] = join(out, 1, k-1, SUBSEP)
del_array(out)  # reset for next time through
k = 1
i++
if (i <= numchars && chars[i] != start)
    fatal(_"%s:%d: Bad entry; expected %s at column %d\n",
        FILENAME, FNR, start, i)
delim_count = 1
@

@node End-of-file sorting and printing
@section End-of-file Sorting And Printing

Upon end of input, the processing is straightforward: sort the entries and
write them out.  If initials are being printed, we also emit one each
time the initial changes.  (That printing task is delegated to a small
function.)

@cindex @code{endfile()} function
@cindex @code{quicksort()} function
@cindex @code{write_index_entry()} function
@cindex @code{Entries} variable
@cindex @code{Keys} array
@cindex @code{Initials} array
@cindex @code{print_initial()} function
@cindex @code{Output_file} variable
@<@code{endfile()} work function@>=
function endfile(filename,  i, prev_initial, initial)
{
    # sort the entries
    quicksort(Keys, 1, Entries, "index")

    prev_initial = ""
    for (i = 1; i <= Entries; i++) {
        # deal with initial
        initial = Initials[Keys[i]]
        if (initial != prev_initial) {
            prev_initial = initial
            print_initial(initial)
        }

        write_index_entry(i)
    }
    close(Output_file)
}
@

Printing an initial is not complicated.  The main thing is to
precede special characters with the command character.
@cindex @code{Command_char} variable
@cindex @code{Special_chars} variable
@cindex @code{Do_initials} variable
@cindex @code{print_initial()} function
@cindex @code{Output_file} variable
@=
function print_initial(initial)
{
    if (! Do_initials)
        return

    if (index(Special_chars, initial) != 0)
        initial = Command_char initial
    printf("%cinitial {%s}\n", Command_char, initial) > Output_file
}
@

@menu
* Quicksort::                   Sorting our input.
* Multilevel comparisons::      Handling multilevel entries.
* Comparing index entries::     The heart of the sorting algorithm.
@end menu

@node Quicksort
@subsection Quicksort

@cindex quicksort algorithm
@cindex Hoare, C.A.R.
Sorting uses a standard quicksort algorithm.  It turns out we need to sort
both multilevel index entries and regular text.  To that end, the
@code{compare} parameter indicates which way to do the ``less than''
comparison.

@cindex @code{quicksort()} function
@cindex @code{quicksort_swap()} function
@=
# quicksort --- C.A.R. Hoare's quick sort algorithm. See Wikipedia
#               or almost any algorithms or computer science text
# Adapted from K&R-II, page 110
#
function quicksort(data, left, right, compare,  # parameters
                   i, last, use_index, lt)      # locals
{
    if (left >= right)  # do nothing if array contains fewer
        return          # than two elements

    @

    quicksort_swap(data, left, int((left + right) / 2))
    last = left
    for (i = left + 1; i <= right; i++) {
        @
        if (lt)
            quicksort_swap(data, ++last, i)
    }
    quicksort_swap(data, left, last)
    quicksort(data, left, last - 1, compare)
    quicksort(data, last + 1, right, compare)
}

# quicksort_swap --- quicksort helper function, could be inline
#
function quicksort_swap(data, i, j,     temp)
{
    temp = data[i]
    data[i] = data[j]
    data[j] = temp
}
@

We set a Boolean (numeric) variable to indicate what kind of comparison
to do.  This avoids repeating, on every loop iteration, a string comparison
whose result never changes.
@=
use_index = (compare == "index")
@

The @code{less_than()} function supplies the comparison for index entries
(@pxref{Multilevel comparisons}, and @pxref{Comparing index entries}).
The @code{key_compare()} function is used for string comparisons.

@cindex @code{less_than()} function
@cindex @code{key_compare()} function
@=
lt = (use_index \
        ? less_than(data, i, left) \
        : key_compare(data[i], data[left]) < 0)
@

@node Multilevel comparisons
@subsection Handling Multilevel Entries

The @code{less_than()} function has to take into account that we are
comparing multilevel index entries.  We can't just compare the full
sort key, since the @samp{@@subentry} throws off the comparison; we
want to compare based only on the key texts themselves.

To that end, the comparison happens on two levels.  At the higher level,
we compare the subkeys; if the first subkeys are equal then we differentiate
between them based on the second subkey.  If, in turn, the second ones are
equal, we differentiate based on the third one.  By definition, when all
the subkeys that two entries share are equal, an index entry with only one
subkey sorts before an entry with two, and one with two comes before one
with three.

The underlying @code{key_compare()} function, which does the hard work
of comparison, returns a three-way value a la the C @code{strcmp()}
function: less than zero if the first string is less than the second,
zero if they're equal, or greater than zero if the first string is
greater than the second one.

We make an effort here to call the comparison function only as much
as necessary, since it's a relatively expensive operation.
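
The two-level scheme can also be expressed as a loop over the shared
subkeys.  The sketch below is only an illustration under loud assumptions:
it splits on a literal @samp{ @@subentry } separator, it uses plain string
comparison where the program uses @code{key_compare()}, and the names
@code{str_cmp()} and @code{multi_less_than()} are hypothetical.

```awk
# Sketch of the two-level comparison.
# Assumptions: subkeys are separated by a literal " @subentry ", and
# plain string comparison stands in for key_compare().
function str_cmp(a, b)
{
    # three-way result, in the style of C's strcmp()
    return (((a "") < (b "")) ? -1 : (((a "") > (b "")) ? 1 : 0))
}

function multi_less_than(left, right,    l, r, nl, nr, n, i, c)
{
    nl = split(left, l, " @subentry ")
    nr = split(right, r, " @subentry ")
    n = (nl < nr ? nl : nr)
    for (i = 1; i <= n; i++) {
        c = str_cmp(l[i], r[i])
        if (c != 0)         # first differing subkey decides
            return c < 0
    }
    return nl < nr          # shared subkeys equal: fewer levels sort first
}

BEGIN {
    print multi_less_than("children", "children @subentry small")
    print multi_less_than("children @subentry small", "children @subentry adult")
}
```

The first call prints 1, since the one-level entry sorts before the
two-level one.  The second prints 0, because @samp{small} does not sort
before @samp{adult}.
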
@cindex @code{less_than()} function
@cindex @code{key_compare()} function
@cindex @code{Numfields} array
@cindex @code{Subkeys} array
@=
function less_than(data, l, r,  left, right, left_fields, right_fields,
                   nfields, cmp1, cmp2)
{
    left = data[l]
    right = data[r]

    left_fields = Numfields[left]
    right_fields = Numfields[right]
    nfields = min(left_fields, right_fields)

    # At least one field, always check the first subkey
    cmp1 = key_compare(Subkeys[left, 1], Subkeys[right, 1])
    if (cmp1 != 0)
        return cmp1 < 0

    # cmp1 == 0: one side has 1 field, other side has 1 to 3 fields
    if (nfields == 1)
        return left_fields < right_fields

    # At least two fields, check second subkey
    cmp2 = key_compare(Subkeys[left, 2], Subkeys[right, 2])
    if (cmp2 != 0)
        return cmp2 < 0

    # cmp1 == 0, cmp2 == 0, one side has 2 fields,
    # other has 2 to 3 fields
    if (nfields == 2)
        return left_fields < right_fields

    # Three fields
    return key_compare(Subkeys[left, 3], Subkeys[right, 3]) < 0
}
@

@node Comparing index entries
@subsection Comparing Index Entries

The sort key comparison function is the heart of the sorting algorithm.
The comparison is based on the indexing rules, which are:

@itemize @bullet
@item
All symbols first.

@item
Followed by digits.

@item
Followed by letters.  Lowercase precedes uppercase and both ``a'' and ``A''
precede anything starting with ``b'' or ``B'' (etc.).
@end itemize

Implementing these rules is a little complicated.  The first thing we
need is a table that maps characters to comparison values.  The following
code is based on the original C @command{texindex}, although the actual
comparison algorithm is more sophisticated.

We set up an @code{Ordval} array to map characters to numeric values.
Most characters map to their ASCII code.  We add 512 to the value of each
of the digits; this causes them to come after all symbols.
The letters are handled a little differently.  We set things up so that
lowercase letters come before uppercase ones, but both ``a'' and ``A''
come before ``b'', and so on.
This then lets us use a simple subtraction in the comparison to determine if
two letters are less than, equal to, or greater than each other.
In any case, the mapping also ensures that letters come after digits.
(This code should also work for EBCDIC systems, although @TeX{} does
everything in ASCII, so it's not likely to make a difference.)

The table must be built completely before changing the mapping
of the letters, because all of the uppercase and lowercase letters
must be in the table before we can change their values.

@cindex @code{Ordval} array
@cindex @code{isdigit()} function
@cindex @code{islower()} function
@=
BEGIN {
    for (i = 0; i < 256; i++) {
        c = sprintf("%c", i)
        Ordval[c] = i   # map character to value
        if (isdigit(c))
            Ordval[c] += 512
    }

    # Set things up such that 'a' < 'A' < 'b' < 'B' < ...
    i = Ordval["a"]
    j = Ordval["z"]
    newval = i + 512
    for (; i <= j; i++) {
        c = sprintf("%c", i)
        if (islower(c)) {
            Ordval[c] = newval++
            Ordval[toupper(c)] = newval++
        }
    }
}
@

Here is the @code{key_compare()} function.  It returns less than zero if
the @code{left} string is ``less than'' the @code{right} string, zero if
they are equal, and greater than zero if the @code{left} string is
``greater than'' the @code{right} string.

The comparison algorithm is not too complicated, once we define how things
should work.  We loop over each pair of characters in the
@code{left} and @code{right} strings, comparing them one at a time.

When comparing two characters, there are three cases, one of which
has three subcases, as follows:

@table @i
@item Two letters
@c nested table
@table @i
@item Same letter, but different case
This is the slightly complicated case.  When two characters differ only
in case, we have to look ahead at the next characters to decide whether to
continue the loop or quit.  As long as we are not at the end of the string,
and at least one of the following characters in either string is a letter,
we continue the loop.  Otherwise we do the character comparison and return.

@item Two different letters, but same case
@itemx Two different letters, different case
Use the comparison of the respective @code{Ordval} values.
@end table
@c end nested table

@item A letter and something else
@itemx Two nonletters
Use the comparison of the respective @code{Ordval} values.
@end table

@noindent
When the values are equal, continue around the loop.
And, as usual, if one string is an initial substring of the other, that one
is considered to be ``less than'' the other one.

The rules just described produce @emph{better} results than did the
C @command{texindex}.  For example, @samp{beginfile()} sorts before
@samp{BEGINFILE}, whereas with the C version they came out in the opposite
order.

@cindex @code{Ordval} array
@cindex @code{char_split()} function
@cindex @code{key_compare()} function
@cindex @code{isalpha()} function
@=
function key_compare(left, right,   len_l, len_r, len, chars_l, chars_r, i)
{
    len_l = length(left)
    len_r = length(right)
    len = (len_l < len_r ? len_l : len_r)

    char_split(left, chars_l)
    char_split(right, chars_r)

    for (i = 1; i <= len; i++) {
        if (isalpha(chars_l[i]) && isalpha(chars_r[i])) {
            # same char different case
            # upper case comes out last
            if (chars_l[i] != chars_r[i] &&
                tolower(chars_l[i]) == tolower(chars_r[i])) {
                if (i != len \
                    && (isalpha(chars_l[i+1]) || isalpha(chars_r[i+1])))
                    continue

                # negative, zero, or positive
                return Ordval[chars_l[i]] - Ordval[chars_r[i]]
            }

            # same case, different char,
            # or different case, different char:
            # letter order wins
            if (Ordval[chars_l[i]] < Ordval[chars_r[i]])
                return -1

            if (Ordval[chars_l[i]] > Ordval[chars_r[i]])
                return 1

            # equal, keep going
            continue
        }

        # letter and something else, or two non-letters
        # letter order wins
        if (Ordval[chars_l[i]] < Ordval[chars_r[i]])
            return -1

        if (Ordval[chars_l[i]] > Ordval[chars_r[i]])
            return 1

        # equal, keep going
    }

    # equal so far, shorter one wins
    if (len_l < len_r)
        return -1

    if (len_l > len_r)
        return 1

    return 0
}
@

@node Printing the data
@section Printing The Final Results

Printing an index entry is where all the data we collected gets used.
Much of the complexity is here, since we have to output up to three lines
per entry.

@menu
* printing top level::          Top level logic.
* printing a single entry::     Handling a single entry.
@end menu

@node printing top level
@subsection Top Level Logic For Printing An Entry

So, let's start.  The logic is going to be a little complicated:

@cindex @code{write_index_entry()} function
@cindex @code{Keys} array
@cindex @code{Numfields} array
@=
function write_index_entry(current,     key)
{
    key = Keys[current] # current sort key
    if (Numfields[key] == 1) {
        @
    } else if (Numfields[key] == 2) {
        @
        @
    } else if (Numfields[key] == 3) {
        @
        @
        @
    }
}
@

Consider the three-level case, for an entry like:

@example
@@cindex coffee makers @@subentry electric @@subentry blue
@end example

@noindent
There may not have been separate preceding entries like
@samp{@@cindex coffee makers} or
@samp{@@cindex coffee makers @@subentry electric}.
Thus, we have to generate the preceding @code{@@entry} and
@code{@@secondary} lines before generating the final @code{@@tertiary} line.

Printing the entries is similar no matter what kind of entry; there's a lot
of work to be done so it's isolated in the @code{print_entry()} function
described in the next section.
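
To make the three-level case concrete, the following sketch builds the
three lines that would be written for the @samp{coffee makers} entry when
nothing has yet been printed for its primary or secondary keys.  It reuses
the @code{printf} formats that appear in the chunks of this section, but
everything else is an assumption made for illustration: @samp{@@} as the
command character, a made-up page number, and the hypothetical function
name @code{three_level_lines()}.

```awk
# Sketch only: build the three output lines for one three-level entry.
# Assumptions: "@" is the command character, the page number is made
# up, and neither the primary nor the secondary key was printed before.
function three_level_lines(primary, secondary, tertiary, page,    cmd)
{
    cmd = "@"
    return sprintf("%sentry{%s,}{}\n%ssecondary{%s,}{}\n%stertiary{%s}{%s}",
                   cmd, primary, cmd, secondary, cmd, tertiary, page)
}

BEGIN {
    print three_level_lines("coffee makers", "electric", "blue", 5)
}
```

The result is an @code{@@entry} line and an @code{@@secondary} line, each
with an empty page list, followed by the @code{@@tertiary} line that
carries the page number.
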
@cindex @code{print_entry()} function
@=
print_entry(key, "entry", Primary)
@

@cindex @code{print_entry()} function
@=
print_entry(key, "secondary", Secondary)
@

@cindex @code{print_entry()} function
@=
print_entry(key, "tertiary", Tertiary)
@

@node printing a single entry
@subsection Printing A Single Entry

Printing a single entry is quite involved:

@cindex @code{print_entry()} function
@cindex @code{print_see_entry()} function
@cindex @code{See} array
@cindex @code{See_count} array
@cindex @code{Seealso} array
@cindex @code{Seealso_count} array
@cindex @code{Pagedata} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
function print_entry(key, entry_command, entry_text,
        @)
{
    if ((key, 1) in See)        # at least one ``see''
        print_see_entry(key, entry_command, entry_text, See_count, See)

    if (key in Pagedata) {      # at least one page number
        @
        printf("%c%s{%s}{%s}\n", Command_char, entry_command, entry_text[key],
            Pagedata[key]) > Output_file
        Printed[key] = TRUE     # mark this key as printed
    } else if ((key, 1) in Seealso) {   # at least one ``see also''
        # Only ``see also'' entry, print it
        @
        printf("%c%s{%s}{", Command_char, entry_command,
            entry_text[key]) > Output_file
        # now print them, separated by commas
        for (i = 1; i <= count; i++) {
            printf("%s", see_entries[i]) > Output_file
            if (i != count)
                printf(", ") > Output_file
        }
        printf("}\n") > Output_file
    }
}
@

Note that we record a key as printed only for lines with
page numbers.  Otherwise, a ``see'' entry followed by a regular multilevel
entry is not handled correctly.

When there exist both regular index entries for a topic and also
a ``see also'' entry, we place the ``see also'' text after all the
page numbers, so that there is only one printed index entry for
the topic.  This ends up being involved, since potentially there
could be multiple ``see also'' entries (even though this is bad form).
@=
if ((key, 1) in Seealso) {
    @
    # now add them to the page data
    for (i = 1; i <= count; i++)
        Pagedata[key] = Pagedata[key] ", " see_entries[i]
}
@

@=
count, see_entries, i
@

Although it's bad practice, there could be multiple ``see also'' entries
for a given key.  In that case, we must sort them before using them.

@=
count = Seealso_count[key]
# Copy the entries to a separate array
for (i = 1; i <= count; i++)
    see_entries[i] = Seealso[key, i]

# sort them
quicksort(see_entries, 1, count, "string")
@

Be sure to empty out @code{Printed} at the start of each file.

@cindex @code{Printed} array
@cindex @code{del_array()} function
@=
del_array(Printed)
@

And add our function to the group of work functions.

@=
@
@

Printing ``see'' entries is potentially messy if there is more than one.
(A good index won't have more than one, but nothing prevents there being
multiple such entries, so we have to handle them.)

@cindex @code{print_see_entry()} function
@cindex @code{quicksort()} function
@cindex @code{Output_file} variable
@=
function print_see_entry(key, entry_command, entry_text,    # parameters
                         count_array, see_text_array,       # parameters
                         i, count, see_entries)             # locals
{
    count = count_array[key]
    if (count == 1) {   # the easy case
        printf("%c%s{%s, %s}{}\n", Command_char, entry_command,
            entry_text[key], see_text_array[key, 1]) > Output_file
        return
    }

    # Otherwise, we need to sort the entries and then print them
    # Copy the entries to a separate array
    for (i = 1; i <= count; i++)
        see_entries[i] = see_text_array[key, i]

    # sort them
    quicksort(see_entries, 1, count, "string")

    # now print them
    for (i = 1; i <= count; i++)
        printf("%c%s{%s, %s}{}\n", Command_char, entry_command,
            entry_text[key], see_entries[i]) > Output_file
}
@

@cindex test case @seeentry{testing}
@cindex test case @seeentry{brief case}
@cindex coffee
@cindex coffee @seealso{tea}
@cindex coffee @seealso{coca-cola}
@cindex whisky @seealso{soda}
@cindex whisky @seealso{scotch}

@=
@
@

When checking if we need to print the
primary and secondary entry, we need to use the subparts of the key.
The subparts represent the key for those entries; each is an
index in @code{Printed} if we already printed such an entry.
The subparts are already available in the @code{Subkeys} array.

@cindex @code{Subkeys} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
if (! (Subkeys[key, 1] in Printed)) {
    printf("%centry{%s,}{}\n", Command_char, Primary[key]) > Output_file
    Printed[Subkeys[key, 1]] = TRUE
}
@

Printing the secondary entry is a little subtle; we have to check whether
the combination of primary and secondary subkeys has already been printed,
and use that combination as the index into @code{Printed}.

@cindex @code{Subkeys} array
@cindex @code{Printed} array
@cindex @code{Output_file} variable
@=
subkey = (Subkeys[key, 1] Command_char "subentry " Subkeys[key, 2])
if (! (subkey in Printed)) {
    printf("%csecondary{%s,}{}\n", Command_char, Secondary[key]) > Output_file
    Printed[subkey] = TRUE
}
@

Here are some test cases:

@c These should make a nice test case!
@cindex coffee makers
@cindex toasters @subentry british
@cindex toasters @subentry american
@cindex microwaves @subentry electric @subentry 110 volt
@cindex microwaves @subentry electric @subentry 220 volt
@cindex children
@cindex children @subentry small
@cindex children @subentry small @subentry toddlers
@cindex children @subentry small @subentry infants
@cindex children @subentry teenagers
@cindex children @subentry adult

@example
@@c Top level entry, with page
@@cindex coffee makers

@@c Double level, no separate top level
@@cindex toasters @@subentry british
@@cindex toasters @@subentry american

@@c Triple level, no separate 1st or 2nd level
@@cindex microwaves @@subentry electric @@subentry 110 volt
@@cindex microwaves @@subentry electric @@subentry 220 volt

@@c All 3 levels, with pages
@@cindex children
@@cindex children @@subentry small
@@cindex children @@subentry small @@subentry toddlers
@@cindex children @@subentry small @@subentry infants
@@cindex children @@subentry teenagers
@@cindex children @@subentry adult

@@c More examples
@@cindex test case @@seeentry@{testing@}
@@cindex test case @@seeentry@{brief case@}
@@cindex coffee
@@cindex coffee @@seealso@{tea@}
@@cindex coffee @@seealso@{coca-cola@}
@@cindex whisky @@seealso@{soda@}
@@cindex whisky @@seealso@{scotch@}
@end example

@noindent
@xref{Index}, to see if the above test cases are handled properly.

Finally, we add the above code into the set of work functions:

@=
@
@

@node Necessary stuff
@chapter Necessary Stuff That Isn't Thrilling

This chapter provides some necessary but unexciting elements.

@menu
* Copyright statement::         Copyright info.
* Library functions::           From the @code{gawk} library: @file{ftrans.awk}, @code{join()}.
* Helper functions::            @code{del_array()}, @code{check_split_null()}, @code{fatal()}, @dots{}
* I18N::                        Internationalization.
@end menu

@node Copyright statement
@section Copyright Statement

@cindex copyright statement
@cindex GNU General Public License
@cindex License, GNU General Public
@cindex GPL (GNU General Public License)
Every program needs a copyright statement.

@=
#
# Copyright 2014-2023 Free Software Foundation, Inc.
#
# This file is part of GNU Texinfo.
#
# Texinfo is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 3 of the License, or
# (at your option) any later version.
#
# Texinfo is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <https://www.gnu.org/licenses/>.
@

@node Library functions
@section Library Functions: @file{ftrans.awk} and @code{join()}

The program uses two library routines discussed in detail in the
@command{gawk} documentation.  The first sets up the infrastructure for
the @code{beginfile()} and @code{endfile()} functions.
@xref{Filetrans Function,,, gawk, GNU Awk User's Guide}, for an
explanation of how this function works.

@cindex @file{ftrans.awk} library file
@cindex @code{beginfile()} function
@cindex @code{endfile()} function
@=
# ftrans.awk --- handle data file transitions
#
# user supplies beginfile() and endfile() functions
#
# Arnold Robbins, arnold@skeeve.com, Public Domain
# November 1992

FNR == 1 {
    if (_filename_ != "")
        endfile(_filename_)
    _filename_ = FILENAME
    beginfile(FILENAME)
}

END { endfile(_filename_) }
@

The next function is @code{join()}, which joins an array of characters
back into a string.
@xref{Join Function,,, gawk, GNU Awk User's Guide}, for an
explanation of how this function works.
@cindex @file{join.awk} library file
@cindex @code{join()} function
@=
# join.awk --- join an array into a string
#
# Arnold Robbins, arnold@skeeve.com, Public Domain
# May 1993

function join(array, start, end, sep,    result, i)
{
    if (sep == "")
        sep = " "
    else if (sep == SUBSEP) # magic value
        sep = ""
    result = array[start]
    for (i = start + 1; i <= end; i++)
        result = result sep array[i]
    return result
}
@

@node Helper functions
@section Helper Functions

These helper functions make the main code easier to follow.

@menu
* del_array::                   Clearing out an array.
* check_split_null::            Checking if @command{awk} splits on the null string.
* char_split::                  Splitting a line into individual characters.
* fatal::                       Printing fatal errors.
* is@dots{} functions::         Checking character types.
* make_regexp::                 Make a regexp to match @TeX{} control sequences.
* escape::                      Escaping backslashes for strings.
* min::                         Get the minimum of two numbers.
@end menu

@node del_array
@subsection @code{del_array()}: Deleting An Array

@code{del_array()} clears out an array.

@cindex @code{del_array()} function
@=
function del_array(a)
{
    # Portable and faster than
    # for (i in a)
    #     delete a[i]
    split("", a)
}
@

@node check_split_null
@subsection @code{check_split_null()}: Checking If @command{awk} Splits On The Null String

@code{check_split_null()} determines whether the @command{awk} running
this program supports using the null string for the separator,
splitting each character off into a separate element.  If so, the return
value from @code{split()} is the number of characters in the string
(five for the test string used here).  It is called at program startup.

@cindex @code{check_split_null()} function
@=
function check_split_null(  n, a)
{
    n = split("abcde", a, "")
    return (n == 5)
}
@

@node char_split
@subsection @code{char_split()}: Splitting A String Into Characters

@code{char_split()} splits a string into separate characters, letting
@command{awk} do the work if possible.
If not, each character is extracted manually using a loop and
@code{substr()}.

@cindex @code{char_split()} function
@cindex @code{Can_split_null} variable
@cindex @code{del_array()} function
@=
function char_split(string, array,  n, i)
{
    if (Can_split_null)
        return split(string, array, "")

    # do it the hard way
    del_array(array)
    n = length(string)
    for (i = 1; i <= n; i++)
        array[i] = substr(string, i, 1)

    return n
}
@

@node fatal
@subsection @code{fatal()}: Printing Fatal Error Messages

@cindex @command{cat} command
@cindex stderr
The @code{fatal()} function prints a @code{printf}-formatted message
to standard error and then exits with a failure status.  For maximal
portability, it opens a pipeline to @command{cat}, redirected to standard
error; not all systems have a @file{/dev/stderr} file, and not all versions
of @command{awk} recognize that name internally.  (Thus, we can't
use @samp{print @dots{} > "/dev/stderr"}.)

@cindex @code{EXIT_FAILURE} constant
@cindex @code{fatal()} function
@=
function fatal(format, arg1, arg2, arg3, arg4, arg5,
               arg6, arg7, arg8, arg9, arg10,   cat)
{
    cat = "cat 1>&2"    # maximal portability
    printf(format, arg1, arg2, arg3, arg4, arg5,
            arg6, arg7, arg8, arg9, arg10) | cat
    close(cat)

    exit EXIT_FAILURE
}
@

@node is@dots{} functions
@subsection @code{is@dots{}} Functions: Checking Character Types

@cindex @code{isupper()} function
@cindex @code{islower()} function
@cindex @code{isalpha()} function
@cindex @code{isdigit()} function
The following functions help identify what a character is; they are
similar in nature to the various macros in the C @code{<ctype.h>}
header file.  Since most of them return the position of the character
within the list (or zero if it is absent), the return value could be
used to compute which character from the set was seen; this turned out
not to be necessary in this program but might be useful in some other
context.  By using @code{index()} with lists of letters, these functions
will also work on EBCDIC systems, should that ever be necessary.
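
To make the remark about the return value concrete: @code{index()}
returns the 1-based position of the match, or zero if there is none.
The sketch below uses that to recover a digit's numeric value.  The
helper @code{digit_value()} is hypothetical and is not part of
@file{texindex.awk}.

```awk
# Sketch: index() returns a 1-based position (0 if absent), so the
# result identifies which character of the list matched.
# digit_value() is a hypothetical helper, not part of the program.
function digit_value(c,    pos)
{
    pos = index("0123456789", c)
    return (pos > 0 ? pos - 1 : -1)
}

BEGIN {
    print index("abcdefghijklmnopqrstuvwxyz", "e")
    print digit_value("7")
    print digit_value("x")
}
```

This prints 5, 7 and @minus{}1: @samp{e} is the fifth letter of the list,
@samp{7} sits at position eight of the digit list, and @samp{x} is not a
digit at all.
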
@=
function isupper(c)
{
    return index("ABCDEFGHIJKLMNOPQRSTUVWXYZ", c)
}

function islower(c)
{
    return index("abcdefghijklmnopqrstuvwxyz", c)
}

function isalpha(c)
{
    return islower(c) || isupper(c)
}

function isdigit(c)
{
    return index("0123456789", c)
}
@

@node make_regexp
@subsection @code{make_regexp()}: Matching @TeX{} Control Sequences

@file{texindex.awk} has to handle input where the command character may
be an @samp{@@} or a @samp{\}.  When matching command strings in
regular expressions, if the command character is a backslash, it must be
doubled in order to be treated literally.  The @code{make_regexp()}
function handles this for us; a @samp{%} in the text of the regexp stands
for the command character and is replaced appropriately.

@cindex @code{make_regexp()} function
@=
function make_regexp(regexp,    a, sep, n)
{
    n = split(regexp, a, "%")
    if (Command_char == "\\")
        sep = Command_char Command_char
    else
        sep = Command_char
    return join(a, 1, n, sep)
}
@

@node escape
@subsection @code{escape()}: Escaping Backslashes for Strings

If the command character is a backslash, occurrences of backslash need
to be doubled before the containing string can be used as a regexp.
This function does that job; it's very similar to @code{make_regexp()}
(@pxref{make_regexp}).

@cindex @code{escape()} function
@=
function escape(regexp,     a, n)
{
    if (Command_char != "\\")
        return regexp

    n = split(regexp, a, "\\")
    if (n == 1)
        return regexp

    return join(a, 1, n, "\\\\")
}
@

@node min
@subsection @code{min()}: Getting The Minimum of Two Numbers

It'd be nice if @command{awk} had this built-in@enddots{}

@=
function min(a, b)
{
    return (a < b ? a : b)
}
@

@node I18N
@section Internationalization

For @command{gawk}, we can arrange for the various messages, e.g., in the
@code{usage()} and @code{version()} functions, to be translated.
We do this by setting the text domain at startup.
For more information on internationalization in @command{gawk},
@pxref{Internationalization,,, gawk, GNU Awk User's Guide}.

@cindex @code{TEXTDOMAIN} variable
@=
TEXTDOMAIN = "texinfo"
@

@noindent
On non-GNU versions of @command{awk}, this is a harmless assignment,
and the @code{_"..."} construct below is a harmless concatenation of
an unassigned variable @code{_}, i.e., the empty string, with the following
string constant.

The @code{usage()} and @code{version()} functions print the necessary
information and then exit.  The strings that can and should be translated
are prefixed with an underscore.

@cindex @code{Texindex_version} variable
@cindex @code{usage()} function
@cindex @code{version()} function
@cindex @code{EXIT_SUCCESS} constant
@use_smallexample
@=
function usage(exit_val)
{
    printf(_"Usage: %s [OPTION]... FILE...\n", Invocation_name)
    print _"Generate a sorted index for each TeX output FILE."
    print _"Usually FILE... is specified as `foo.??' for a document `foo.texi'."
    print ""
    print _"Options:"
    print _" -h, --help display this help and exit"
    print _" --version display version information and exit"
    print _" -- end option processing"
    print ""
    print _"Email bug reports to bug-texinfo@gnu.org,"
    print _"general questions and discussion to help-texinfo@gnu.org."
    print _"Texinfo home page: https://www.gnu.org/software/texinfo/"

    exit exit_val
}

function version()
{
    print "texindex (GNU texinfo)", Texindex_version
    print ""
    printf _"Copyright (C) %s Free Software Foundation, Inc.\n", "2023"
    print _"License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>"
    print _"This is free software: you are free to change and redistribute it."
    print _"There is NO WARRANTY, to the extent permitted by law."

    exit EXIT_SUCCESS
}
@
@use_example

@node Index
@unnumbered Index

@printindex cp

@bye