The format of the 'hao.dat' unified data file is as follows:
An 8-byte identification header, which must be the 8 bytes (in hexadecimal) 89 48 41 4F 0D 0A 1A 0A.
(This is a modification of the PNG header with the letters
"HAO" in place of "PNG".)
A 4-byte file offset of the Table1 data table, in big-endian
("network byte order") format. This should be the number 16, because
the header and the two offsets occupy 8 + 4 + 4 = 16 bytes.
A 4-byte file offset of the Table2 data table,
in big-endian format.
The Table1 data table.
The Table2 data table.
End of file. (There is no explicit end-of-data marker.)
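The fixed 16-byte prefix described above can be checked with a few lines of Python. This is only a minimal sketch: the names MAGIC and parse_header are illustrative and not part of the format, and the Table2 offset used in the example is a made-up value.

```python
import struct

# The 8 identification bytes listed above.
MAGIC = bytes([0x89, 0x48, 0x41, 0x4F, 0x0D, 0x0A, 0x1A, 0x0A])

def parse_header(data):
    """Return (table1_offset, table2_offset) read from the first 16 bytes."""
    if len(data) < 16:
        raise ValueError("file too short for a 16-byte header")
    # ">8sII": big-endian, 8 raw bytes, then two unsigned 32-bit integers.
    magic, table1_offset, table2_offset = struct.unpack(">8sII", data[:16])
    if magic != MAGIC:
        raise ValueError("bad identification header")
    return table1_offset, table2_offset

# Hand-built header; the Table2 offset 1000 is a made-up value.
example = MAGIC + struct.pack(">II", 16, 1000)
print(parse_header(example))  # (16, 1000)
```

A stricter reader might also reject a Table1 offset other than 16, since the text says that value is expected.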
The Table1 data format
The format of the Table1 data table is as follows:
A UTF8-encoded delta representing the difference between the
previous Unicode number and the current one. (The "previous"
number starts at zero.) If the resulting Unicode number is
equal to 0xFFFFFFFF (2^32-1), then the table is
over and processing stops immediately.
A UTF8-encoded number representing the order-0 frequency of
this character in Chinese text.
A single byte representing the number of strokes in this
character.
A list of two-byte codes representing the
pronunciations of this character; see the table at the end of this
document. The single byte 0xFF terminates this list
immediately; the list will be followed immediately by the
UTF8-encoded delta of the next table entry. The byte 0xFF
will not appear as part of any legitimate two-byte code.
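The following Python sketch reads Table1 entries as described above. It rests on two assumptions that the text does not confirm. First, standard UTF-8 cannot represent a value as large as 0xFFFFFFFF, so read_number assumes the UTF-8 bit layout is simply extended to sequences of up to 7 bytes (lead byte 0xFE). Second, each two-byte pronunciation code is assumed to be stored high byte first. The function names are illustrative only, and the frequency and pronunciation code in the sample are made-up values.

```python
import io

def read_number(stream):
    """Decode one number. ASSUMPTION: UTF-8 layout extended to 7 bytes."""
    first = stream.read(1)
    if not first:
        raise EOFError("unexpected end of data")
    lead = first[0]
    if lead < 0x80:
        return lead  # single-byte value
    # The count of leading 1 bits is the total length of the sequence.
    length, mask = 0, 0x80
    while lead & mask:
        length += 1
        mask >>= 1
    if length < 2 or length > 7:
        raise ValueError("invalid lead byte 0x%02X" % lead)
    value = lead & (mask - 1)  # payload bits of the lead byte
    for _ in range(length - 1):
        nxt = stream.read(1)
        if not nxt or (nxt[0] & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (nxt[0] & 0x3F)
    return value

def read_table1(stream):
    """Yield (unicode_number, frequency, strokes, codes) for each entry."""
    number = 0  # the "previous" number starts at zero
    while True:
        number += read_number(stream)
        if number == 0xFFFFFFFF:
            return  # end-of-table marker
        frequency = read_number(stream)
        strokes = stream.read(1)[0]
        codes = []
        while True:
            high = stream.read(1)[0]
            if high == 0xFF:
                break  # single byte 0xFF terminates the list
            low = stream.read(1)[0]
            codes.append((high << 8) | low)  # ASSUMPTION: high byte first
        yield number, frequency, strokes, codes

sample = bytes([
    0xE4, 0xB8, 0x80,  # delta 0x4E00, giving U+4E00
    0x64,              # frequency 100 (made-up value)
    0x01,              # 1 stroke
    0x12, 0x34,        # one pronunciation code (made-up value)
    0xFF,              # end of the pronunciation list
    0xFE, 0x83, 0xBF, 0xBF, 0xBB, 0x87, 0xBF,  # delta 0xFFFFB1FF, giving 0xFFFFFFFF
])
for entry in read_table1(io.BytesIO(sample)):
    print(entry)
```

The sketch does no error handling for a truncated table; a read past the end of the data simply raises an exception.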
The Table2 data format
The format of the Table2 data table is as follows:
A UTF8-encoded delta representing the difference between the
previous Unicode number[1] and the current one. (The "previous"
number starts at zero.) If the resulting Unicode number[1] is
equal to 0xFFFFFFFF (2^32-1), then the table is
over and processing stops immediately.
A UTF8-encoded delta representing the difference between the
previous Unicode number[2] and the current one. (The "previous"
number starts at zero.) This delta wraps around at 2^32,
so given a previous value of 0xF900 and a delta of
0xFFFF3001, the current value would compute to
0x00002901.
A UTF8-encoded number representing the order-1 frequency
of this character 2-tuple in Chinese text.
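The wrap-around rule for the second Unicode number amounts to addition modulo 2^32, which lets a large delta act as a step backwards. A minimal Python sketch, reproducing the example from the text (the name apply_delta is illustrative only):

```python
MASK32 = 0xFFFFFFFF  # 2**32 - 1

def apply_delta(previous, delta):
    """Add a delta to the previous value, wrapping around at 2**32."""
    return (previous + delta) & MASK32

print(hex(apply_delta(0xF900, 0xFFFF3001)))  # 0x2901
```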
The syllable encoding
The encoding of syllable pronunciations into a two-byte code
goes as follows:
nil B C Ch D F G H J K L M N P Q R S Sh T W X Y Z Zh (24 cases)
A E I O U V IA UA UE IO (10 cases)
nil A E I O U (6 cases)
nil N NG R (4 cases)
tone0 tone1 tone2 tone3 tone4 (5 cases)
Thus there are 24*10*6*4*5 = 28800 possible syllables, whose
hexadecimal values range from 0x0000 to 0x707F (28799).
However, there is a final modification: Any value whose low-order byte is
0xFF is modified by setting the high bit of the code
(0x8000) and clearing the low-order byte; thus the value
0x42FF is recorded as 0xC200.
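The syllable count and the final 0xFF modification can be checked with a short Python sketch. The function names are illustrative only. The sketch does not show how the five components are combined into a single value, because the text does not state their order. Because every unmodified value is below 0x8000, a set high bit unambiguously marks a modified code.

```python
import math

CASES = (24, 10, 6, 4, 5)          # sizes of the five component lists
SYLLABLE_COUNT = math.prod(CASES)  # 28800

def escape_code(value):
    """Return the recorded two-byte code for a syllable value."""
    if not 0 <= value < SYLLABLE_COUNT:
        raise ValueError("syllable value out of range")
    if (value & 0xFF) == 0xFF:
        # Set the high bit and clear the low-order byte.
        return 0x8000 | (value & 0xFF00)
    return value

def unescape_code(recorded):
    """Invert escape_code: recover the syllable value from a recorded code."""
    if recorded & 0x8000:
        return (recorded & 0x7F00) | 0x00FF
    return recorded

print(SYLLABLE_COUNT, hex(SYLLABLE_COUNT - 1))  # 28800 0x707f
print(hex(escape_code(0x42FF)))                 # 0xc200
print(hex(unescape_code(0xC200)))               # 0x42ff
```

After this modification neither byte of a recorded code can be 0xFF (the high byte is at most 0xF0), which is what allows a single 0xFF byte to terminate the pronunciation list.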