%% This is part of the OpTeX project, see http://petr.olsak.net/optex

\_codedecl \pdfunidef {PDFunicode strings for outlines <2021-02-08>} % preloaded in format

   \_doc -----------------------------
   \`\_hexprint` is a command defined in Lua that scans a number and expands
   to its UTF-16 Big Endian encoded form for use in PDF hexadecimal strings.
   Code points above `0xFFFF` are printed as a surrogate pair.
   \_cod -----------------------------

\bgroup
\_catcode`\%=12
\_gdef\_hexprint{\_directlua{
   local num = token.scan_int()
   if num < 0x10000 then
      tex.print(string.format("%04X", num))
   else
      num = num - 0x10000
      local high = bit32.rshift(num, 10) + 0xD800
      local low = bit32.band(num, 0x3FF) + 0xDC00
      tex.print(string.format("%04X%04X", high, low))
   end
}}
\egroup

   \_doc -----------------------------
   \`\pdfunidef``\macro{<text>}` defines `\macro` as <text> converted to Big Endian
   UTF-16 and enclosed in \code{<>}. Example of usage:
   `\pdfunidef\infoauthor{Petr Olšák} \pdfinfo{/Author \infoauthor}`.\nl
   \^`\pdfunidef` does more than convert to a hexadecimal PDF string.
   The <text> can be scanned in verbatim mode (this is the case because \^`\_Xtoc`
   reads the <text> in verbatim mode). The first `\edef` does
   `\_scantextokens\unexpanded` and the second `\edef` expands the parameter
   according to the current values of selected macros from `\_regoul`. Then
   \`\_removeoutmath` converts `..$x^2$..` to `..x^2..`, i.e.\ it removes the dollars.
   Then \`\_removeoutbraces` converts `..{x}..` to `..x..`.
   Finally, the <text> is detokenized, spaces are preprocessed using \^`\replstring`
   and then \`\_pdfunidefB` is repeated on each character. It calls the
   `\directlua` chunk in the macro \^`\_hexprint` to print hexadecimal numbers.\nl
   Characters for quotes (and separators for quotes) are activated by the first
   `\_scantextokens` and they are defined as the same non-active characters.
   But `\_regoul` can change this definition.
   \_cod -----------------------------

\_def\_pdfunidef#1#2{%
   \_begingroup
      \_catcodetable\_optexcatcodes \_adef"{"}\_adef'{'}%
      \_the\_regoul \_relax % \_regmacro alternatives of logos etc.
      \_ifx\_savedttchar\_undefined \_def#1{\_scantextokens{\_unexpanded{#2}}}%
      \_else \_lccode`\;=\_savedttchar \_lowercase{\_prepinverb#1;}{#2}\_fi
      \_edef#1{#1}%
      \_escapechar=-1
      \_edef#1{#1\_empty}%
      \_escapechar=`\\
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutmath #1$\_fin$}% $x$ -> x
      \_ea\_edef \_ea#1\_ea{\_ea\_removeoutbraces #1{\_fin}}% {x} -> x
      \_edef#1{\_detokenize\_ea{#1}}%
      \_replstring#1{ }{{ }}% text text -> text{ }text
      \_catcode`\\=12 \_let\\=\_bslash
      \_edef\_out{<FEFF}
      \_ea\_pdfunidefB#1^% text -> \_out in hexadecimal
   \_ea
   \_endgroup
   \_ea\_def\_ea#1\_ea{\_out>}
}
\_def\_pdfunidefB#1{%
   \_ifx^#1\_else
      \_edef\_out{\_out \_hexprint `#1}
      \_ea\_pdfunidefB \_fi
}

\_def\_removeoutbraces #1#{#1\_removeoutbracesA}
\_def\_removeoutbracesA #1{\_ifx\_fin#1\_else #1\_ea\_removeoutbraces\_fi}
\_def\_removeoutmath #1$#2${#1\_ifx\_fin#2\_else #2\_ea\_removeoutmath\_fi}

   \_doc -----------------------------
   The \`\_prepinverb``<macro><separator>{<text>}`, e.g.\
   `\_prepinverb\tmpb|{aaa |bbb| cccc |dd| ee}` does
   `\def\tmpb{<su>{aaa }bbb<su>{ cccc }dd<su>{ ee}}` where <su> is
   `\scantextokens\unexpanded`. It means that the in-line verbatim parts are not
   arguments of `\scantextokens`. The first `\edef\tmpb` tokenizes the <text> again,
   but not the parts which were in the in-line verbatim.
   \_cod -----------------------------

\_def\_prepinverb#1#2#3{\_def#1{}%
   \_def\_dotmpb ##1#2##2{\_addto#1{\_scantextokens{\_unexpanded{##1}}}%
      \_ifx\_fin##2\_else\_ea\_dotmpbA\_ea##2\_fi}%
   \_def\_dotmpbA ##1#2{\_addto#1{##1}\_dotmpb}%
   \_dotmpb#3#2\_fin
}

   \_doc -----------------------------
   The \^`\regmacro` is used in order to set the values of macros
   `\em`, `\rm`, `\bf`, `\it`, `\bi`, `\tt`, `\/` and `~` to values usable in
   PDF outlines.
   \_cod -----------------------------

\_regmacro {}{}{\_let\em=\_empty \_let\rm=\_empty \_let\bf=\_empty
   \_let\it=\_empty \_let\bi=\_empty \_let\tt=\_empty \_let\/=\_empty
   \_let~=\_space
}
\public \pdfunidef ;

\_endcode % --------------------------------

There are only two encodings for PDF strings (used in PDFoutlines, PDFinfo, etc.).
The first one is PDFDocEncoding, which is a single-byte encoding, but it lacks most
international characters. The second encoding is Big Endian UTF-16, which is
implemented in this file. It encodes a single character in either two or four
bytes. This encoding is \TeX/-unfriendly because it looks like

\begtt
<FEFF 0043 0076 0069 010D 0065 006E 00ED 0020 006A 0065 0020 007A 00E1 0074
      011B 017E 0020 0061 0020 0078 0020 2208 0020 D835DD44>
\endtt

This example shows a hexadecimal PDF string (enclosed in \code{<>} as opposed to
the literal PDF string enclosed in `()`). In these strings each byte is
represented by two hexadecimal characters (`0-9`, `A-F`). You can tell the
encoding is UTF-16BE, because it starts with the \"Byte order mark" `FEFF`. Each
Unicode character is then encoded in one or two byte pairs. The example string
corresponds to the text \"Cvičení je zátěž a ${\rm x} ∈ 𝕄$". Notice the 4 bytes
for the last character, $𝕄$. (The whitespace shown in the example would even be
acceptable in a PDF file, because PDF viewers should ignore it, but \LuaTeX\
doesn't allow it.)

\_endinput

2021-02-08 \_octalprint -> \_hexprint
2020-03-12 Released