bhanzi suite


Here's a brief summary of what this suite of programs is supposed to do. The goal is to make it easy for me to write and typeset Chinese text over an ASCII-based interface such as telnet, using a regular text editor such as emacs or pico to do all my work.

The input to the program suite is a plain text file consisting of mixed English and pinyin. The two languages are separated by the escape sequences \c and \e, each on a line by itself: \c switches to Chinese, and \e switches back to English.
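
For example, a source file might look like this (the sample text, and the tone-number convention of writing ni3 for nǐ, are my own illustrative assumptions):

    Dear reader, here is a greeting in Chinese:
    \c
    ni3 hao3 [,] shi4 jie4 []
    \e
    And now we are back in English.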

The pinyin text can use whitespace however it likes; the suite removes all of it. Chinese text enclosed in [brackets] is passed along, brackets and all, to the next stage of processing -- with the exception of the empty bracket pair [], which is silently discarded.
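
Run through the suite's first pass, the Chinese section of the sample above would presumably collapse to one bracketed syllable apiece:

    Dear reader, here is a greeting in Chinese:
    [ni3][hao3][,][shi4][jie4]
    And now we are back in English.

The whitespace between syllables is gone, the [,] came through brackets and all, and the empty [] was dropped.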

The result of the passage through bpinyin is a file consisting solely of unescaped English text, with Chinese text escaped with [brackets]. Each [bracketed] section should correspond to exactly one Hanzi glyph; a section that doesn't match any glyph in the database is echoed to the output file without change. A section that does match is replaced by that glyph's Unicode number in the form ^^^^1234 (in hexadecimal). Whitespace is preserved at this stage.
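
Continuing the example: shi4 by itself would most likely resolve to 是, but assuming the matcher is steered to the pair 世界 -- that is, to 世 (U+4E16) followed by 界 (U+754C) -- and to 你 (U+4F60) and 好 (U+597D) for the greeting (the frequency data driving these choices is described below), the line becomes:

    Dear reader, here is a greeting in Chinese:
    ^^^^4f60^^^^597d[,]^^^^4e16^^^^754c
    And now we are back in English.

The [,] matched no glyph, so it was echoed untouched.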

The result of the passage through bhanzi is a file consisting solely of English text with interspersed ^^^^1234 "Omega-format" Unicode glyphs. The program blatex takes that text and pretty-prints it for passage through unitrans --o2f and LaTeX. This stage adds a standardized header and footer, as well as tags intended to enlarge paragraphs consisting only of Chinese text.
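
I won't reproduce the real header and footer here, but as a purely hypothetical sketch of the shape of blatex's output:

    \documentclass{article}
    % ...the standardized header: encoding and font setup...
    \begin{document}
    Dear reader, here is a greeting in Chinese:

    {\Large ^^^^4f60^^^^597d[,]^^^^4e16^^^^754c}

    \end{document}

The {\Large ...} wrapper stands in for whatever tag blatex actually emits to enlarge a Chinese-only paragraph.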

Passing this LaTeX-formatted file through unitrans --o2f turns the Omega-format Unicode numbers into UTF-8, and the source file is then ready to be passed to latex or pdflatex. The user must already have installed a Unicode-ready font -- a non-trivial procedure in and of itself -- and will probably need to edit the output file slightly to make sure it finds the right font to use.
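
Assuming every stage, unitrans included, reads standard input and writes standard output, a complete run would look something like this (mytext.txt is a placeholder name):

    bpinyin < mytext.txt | bhanzi | blatex | unitrans --o2f > mytext.tex
    pdflatex mytext.tex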

The Hanzi-matching program bhanzi is undoubtedly the least trivial filter in the suite. It relies on two database files: one mechanically derived from the Unihan database, listing all of its glyphs, and another derived from a Chinese corpus, recording the frequencies of individual unigrams and digrams. bhanzi uses this data to predict which glyphs match the given pinyin at each step.
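
To make the idea concrete, here is a rough sketch -- mine, not bhanzi's actual code, and every identifier in it is hypothetical -- of how unigram and digram counts can drive the choice of glyph:

    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    typedef unsigned int Glyph;                        // a Unicode code point

    // Corpus statistics, loaded elsewhere from the frequency database.
    std::map<Glyph, double> unigram;                   // count of each glyph
    std::map<std::pair<Glyph, Glyph>, double> digram;  // count of each glyph pair

    // Candidate glyphs for each pinyin syllable, from the Unihan-derived table.
    std::map<std::string, std::vector<Glyph> > candidates;

    // Prefer a glyph the corpus has seen following the previous glyph;
    // fall back on raw frequency when the pair was never seen.
    double score(Glyph prev, Glyph g)
    {
        std::map<std::pair<Glyph, Glyph>, double>::const_iterator d =
            digram.find(std::make_pair(prev, g));
        if (d != digram.end())
            return 1000.0 * d->second;   // arbitrary weight favoring digrams
        std::map<Glyph, double>::const_iterator u = unigram.find(g);
        return u == unigram.end() ? 0.0 : u->second;
    }

    // Greedily pick one glyph per syllable.
    std::vector<Glyph> match(const std::vector<std::string>& syllables)
    {
        std::vector<Glyph> out;
        Glyph prev = 0;
        for (std::size_t i = 0; i < syllables.size(); ++i) {
            const std::vector<Glyph>& cand = candidates[syllables[i]];
            if (cand.empty())
                continue;                // no match: the real filter echoes the input
            Glyph best = cand[0];
            for (std::size_t j = 1; j < cand.size(); ++j)
                if (score(prev, cand[j]) > score(prev, best))
                    best = cand[j];
            out.push_back(best);
            prev = best;
        }
        return out;
    }

A greedy left-to-right pass like this commits to one glyph at a time; scoring whole sequences of choices, Viterbi-style, is the natural refinement, and I don't know which approach bhanzi actually takes.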

Update: bhanzi is pretty awful. I have therefore rewritten it from scratch in C++, using STL classes and real math. The new program, called hao, works a lot better, and I anticipate using hao as the base for a monolithic converter that will do the work of at least bpinyin and bhanzi, and possibly blatex as well.

The current version of hao is divided into three source modules: hao_main, hao_input, and hao_types. The first contains the hanzi-matching algorithm; the second recognizes pinyin syllables in the input stream and tokenizes them; and the third defines all the C++ classes used by the program. The two data files used by bhanzi have been replaced by a single data file, hao.dat, which is much smaller and loads much more quickly. See its format here.
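
For flavor, here is a hypothetical sketch -- again mine, not hao_input's actual code, with every identifier invented for illustration -- of longest-match pinyin tokenization:

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    std::set<std::string> syllable_table;  // "ni", "hao", "zhuang", ... loaded elsewhere

    // Repeatedly take the longest prefix of the input that is a legal
    // syllable, attaching a trailing tone digit when one follows.
    std::vector<std::string> tokenize(const std::string& text)
    {
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < text.size()) {
            std::size_t best = 0;
            // No toneless pinyin syllable is longer than six letters ("zhuang").
            for (std::size_t len = 1; len <= 6 && pos + len <= text.size(); ++len)
                if (syllable_table.count(text.substr(pos, len)))
                    best = len;
            if (best == 0) {             // not the start of a known syllable;
                ++pos;                   // a real tokenizer would hand it back
                continue;
            }
            std::string syl = text.substr(pos, best);
            pos += best;
            if (pos < text.size() && text[pos] >= '1' && text[pos] <= '5')
                syl += text[pos++];      // tone number, if the input marks tones
            tokens.push_back(syl);
        }
        return tokens;
    }

Greedy longest-match will misread a string like xian when xi an was intended; the empty [] pair in the input format may exist precisely to force such a boundary.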