$Id: INTERNALS,v 1.3 1997/04/10 19:53:54 dps Exp $

Here is how the program works.

reader.cc (1.10)

read_character reads characters from a word document, suitably
translated, including distinguishing between multiple and single ^Gs,
etc. The output is fetched by chunk_reader::read_chunk_raw, which
assembles it into bits, ignoring inclusions. chunk_reader::read_chunk
gets these chunks and parcels them out with the inclusions separated
out.

tok_seq::rd_token adds start and end tags for rows, fields, paragraphs
and all the rest, storing the tokens of a table on a separate queue
before transferring them all onto the main queue. tok_seq::rd_token
also keeps track of the size and detects the probable end of the
table. tok_seq::feed_token takes a token off the queue and requests a
refill at the appropriate time. At the end of the document it tests a
flag and, if the flag is not set, adds a document end entry (and then
feeds it to the caller).

OK so far? Now the fun begins! If you look at the output now you see
horrific stuff like 550 *eq \F(foo,bar)= 42, so the input is further
processed by tok_seq::math_collect(). math_collect() uses saved_tok as
a one-token push-back mechanism and will use this token before asking
feed_token() for one. Non-paragraphs and non-equations go straight
through.

When math_collect sees a paragraph it peers at the next item. If this
is not an equation it just forwards the token and stashes the item it
got in saved_tok (saved_tok is definitely free: either it was used or
feed_token supplied something). If it sees an equation it calls
math_reverse_scan to work out whether there is any equation material
in the string (guesswork, but it works quite nicely). If
math_reverse_scan decides it is all real text the token is just
forwarded (with the extra token still stashed in saved_tok). Assuming
math_reverse_scan found something to move, that material is moved into
the equation, and ntok and the current token are modified. saved_tok
still points to ntok, so we reuse the same structure with new
strings. The reduced paragraph token is returned.

-----

When the code sees an equation special (quite possibly saved_tok from
the paragraph processing above) it asks feed_token() for the next two
tokens. The next token is the end token for the special; the one after
that is the interesting one and will be called T (the token itself is
*ntok in the code).

If T is an equation the end spec token is junked and the two equations
are joined; one of the equations is then junked. The end special is
pushed onto the start of the output for feed_token to find there;
saved_tok is pointed at the expanded equation. The code then returns
to the original read-a-token state so further aggregation can take
place.

If T is a paragraph then the code uses math_forward_scan to see how
much of it is consumed as part of the equation. If none, the end
special and paragraph tokens are pushed onto the front of the output
queue and saved_tok is invalidated. The code then returns the current
(equation special) token. The end special passes straight through and
then the accumulation can begin again. If T (a paragraph) is partially
consumed, the current equation and T are adjusted, and then the
processing is the same as if the paragraph had no formula content. If
T (a paragraph) is entirely consumed, its contents are added onto the
equation text, the paragraph is junked, and the end spec is pushed
back. saved_tok is pointed at the current, expanded equation. The code
then returns to the original read-a-token state so further aggregation
can take place.
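The saved_tok trick above is just a one-token push-back in front of
the feed_token() stream. Here is a minimal, self-contained sketch of
the idea; the token type, the fixed input array and main() are
simplified stand-ins for illustration, not the real word2x code:

    /* Sketch: one-token push-back, as math_collect() does with
     * saved_tok. */
    #include <cstdio>

    struct token { const char *text; };

    /* stand-in for feed_token(): hands out tokens from a fixed list */
    static token stream[] =
        { {"<para>"}, {"550 * "}, {"*eq \\F(foo,bar)"}, {NULL} };
    static int pos = 0;

    static token *feed_token(void)
    {
        return (stream[pos].text != NULL) ? &stream[pos++] : NULL;
    }

    static token *saved_tok = NULL;   /* the single push-back slot */

    /* Use saved_tok before asking feed_token() for a fresh token;
     * afterwards the slot is definitely free again. */
    static token *next_token(void)
    {
        token *t = saved_tok;
        if (t != NULL) { saved_tok = NULL; return t; }
        return feed_token();
    }

    int main(void)
    {
        token *cur = next_token();   /* the paragraph token        */
        saved_tok = next_token();    /* peek ahead, then push back */
        printf("current=%s next=%s\n", cur->text, next_token()->text);
        return 0;
    }

This is all the machinery a one-token lookahead needs: either the slot
was consumed or feed_token() supplied something, so the slot is always
free when math_collect wants to stash the peeked token.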
The output now contains nice stuff like 550 * \F(foo,bar) = 29 and
even horrors that word viewer renders as displayed equations like
550 * \F(foo,bar) = 29. This output is requested by
tok_seq::eqn_rd_token(), which is an internal method. It is not devoid
of tricks, however. Anything other than the start of a paragraph
passes straight through. When it sees a paragraph it pushes it onto a
separate queue and accumulates totals of the characters and specials
it sees. The loop exits when any of the following applies:

- The paragraph character total exceeds the (small, currently 3)
  threshold.
- The end of the paragraph is spotted.
- A non-special, non-paragraph, non-other item is seen (if this
  happens we add the threshold to the count to be sure of being >= to
  it).

On exit from the loop, if the total is less than the critical value
the queue is reversed and inserted at the front of the output queue,
minus the paragraph items. Since each token is inserted at the front
of the output the tokens appear in reverse order of insertion (hence
the prior reversal makes the elements appear in the original order on
the output queue; see the sketch at the end of this file). This
deletes that extraneous and wrong full stop, for example. Otherwise
the queue's elements are transferred to the front of the output queue
in their existing order (this actually just sets a couple of
pointers). Either way the temporary queue is now empty and is
deleted. The first item dequeued is returned. (This is what rtest2
shows you.)

-----

The output of eqn_rd_token is fed to the list-handling guesswork. A
list is started by a paragraph starting with the number 1 (enumerate)
or a bullet character (itemize). If a paragraph does not fit then it
is checked for a list start---if so it is assumed to be a sublist. The
end of the list is signalled by MAX_ITEM_SEP_PARS paragraphs which can
not be part of a list (the default value is 5). Only the last
paragraph with the appropriate lead-in text is included. Since this
involves delaying tokens and looking ahead, the easiest place to do
this is in the reader.

Only the bottom-level list is actually ended by enough non-list
paragraphs; the list is closed and the output is fed back in, giving
the non-list paragraphs the chance to kill the list the next level
up. The list queues are completely separate from anything eqn_rd_token
uses: each list builds its tokens in items. Note that the first item
includes the list start to make it easy to recover the original text
if required.

When a level of list is popped these tokens are pointed to by
recycled, which is initially NULL. If the list only has one item the
list is transformed back into its (listless) tokens, as received from
eqn_rd_token. The first token is appended to the list below the one
popped, or to the output queue if there is no such list. The main loop
of read_token grabs values from recycled if it is not NULL, setting it
back to NULL when it becomes empty. If recycled is NULL then the code
asks eqn_rd_token for a token. As with eqn_rd_token the loop waits for
its own output queue (outqueue) to have something in it and returns
the first item in this queue.

The overall effect of the layered intelligence described above is that
the code in reader.cc is too complicated for comfort. The overall
performance is nice though...

----------------------------------------------------------------------

Oh yes, and the *TeX output format also uses context cues. There is a
minimal amount of context-cue usage in the ascii format.
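For reference, here is the reverse-and-splice trick from
eqn_rd_token() promised above, as a minimal sketch. The singly-linked
token type, the NULL-text paragraph markers and the under-threshold
filter are simplified assumptions for illustration, not the real
word2x structures:

    /* Sketch: reinsert a buffered side queue at the front of the
     * output queue, preserving the original token order. */
    #include <cstdio>

    struct token { const char *text; token *next; };

    /* reverse a list in place */
    static token *reverse(token *q)
    {
        token *rev = NULL;
        while (q != NULL)
        {
            token *t = q; q = q->next;
            t->next = rev; rev = t;
        }
        return rev;
    }

    /* Push the side queue onto the front of the output queue.  Each
     * push-front reverses the order again, so reversing first makes
     * the two reversals cancel and the original order survives.
     * drop_paras selects the under-threshold case, where the
     * paragraph markers (NULL text here) are discarded on the way. */
    static token *splice_front(token *side, token *out, int drop_paras)
    {
        side = reverse(side);
        while (side != NULL)
        {
            token *t = side; side = side->next;
            if (drop_paras && t->text == NULL) continue;
            t->next = out; out = t;
        }
        return out;
    }

    int main(void)
    {
        /* buffered paragraph: <para> "eq" </para>, under threshold */
        token pe = { NULL, NULL };
        token eq = { "eq", &pe };
        token ps = { NULL, &eq };
        token tail = { "rest-of-output", NULL };

        token *out = splice_front(&ps, &tail, 1);
        for (token *t = out; t != NULL; t = t->next)
            printf("[%s] ", t->text != NULL ? t->text : "para");
        printf("\n");   /* prints: [eq] [rest-of-output] */
        return 0;
    }

Pushing onto the front of a singly linked list is what forces the
double reversal; the real code does the same dance, presumably with
its own queue type.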
Overall this program tends towards my idea of a complex AI program using context cues to do the right stuff with what word throws at it!! I hope this is now 100% clear.