Format of the hao.dat unified data file

The format of the 'hao.dat' unified data file is as follows:

The Table1 data format

The format of the Table1 data table is as follows:

The Table2 data format

The format of the Table2 data table is as follows:

The syllable encoding

A syllable pronunciation is encoded into a two-byte code from five components, one chosen from each of the following lists:
nil B C Ch D F G H J K L M N P Q R S Sh T W X Y Z Zh (24 cases)
A E I O U V IA UA UE IO (10 cases)
nil A E I O U (6 cases)
nil N NG R (4 cases)
tone0 tone1 tone2 tone3 tone4 (5 cases)
Thus there are 24*10*6*4*5 = 28800 possible syllables. Since 28800 = 0x7080, the hexadecimal values range from 0x0000 to 0x707F. However, there is a final modification: any value whose low-order byte is 0xFF is recorded with the high bit of the code (0x8000) set and the low-order byte cleared; thus the value 0x42FF is recorded as 0xC200.
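
The count and the final modification can be sketched in Python as below. This is only an illustration: the names CASES, syllable_count, escape_code and unescape_code are hypothetical and do not come from the hao.dat format. How the five components are combined into the 16-bit value is not stated here, so the sketch covers only the 0xFF rule.

```python
# Sizes of the five component lists given above.
CASES = (24, 10, 6, 4, 5)


def syllable_count() -> int:
    """Number of possible syllables: the product of the five list sizes."""
    total = 1
    for size in CASES:
        total *= size
    return total


def escape_code(value: int) -> int:
    """Apply the final modification to an unmodified syllable value.

    A value whose low-order byte is 0xFF gets the high bit (0x8000) set
    and its low-order byte cleared; any other value is left unchanged.
    """
    if not 0 <= value < 0x8000:
        raise ValueError("syllable value must fit in 15 bits")
    if value & 0xFF == 0xFF:
        return (value | 0x8000) & 0xFF00
    return value


def unescape_code(recorded: int) -> int:
    """Undo the modification: the high bit marks a cleared 0xFF low byte."""
    if recorded & 0x8000:
        return (recorded & 0x7F00) | 0xFF
    return recorded
```

Because unmodified values never have the high bit set, the high bit alone is enough to tell a modified code from an unmodified one when reading.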

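The "stop" code is the single byte 0xFF. After the modification described above, neither byte of a recorded syllable code can equal 0xFF (the low-order byte is never 0xFF, and the high-order byte is at most 0xF0), so the stop code cannot be mistaken for part of a syllable code.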
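
A reader for a run of syllable codes might look like the following sketch. It rests on two assumptions that are not stated in this description: that a run consists only of two-byte codes followed by the stop byte, and that the byte order is as given by the byteorder parameter (the default "big" is a guess). The name read_recorded_codes is hypothetical. The values returned are still in recorded form, that is, the low-order 0xFF modification has not been undone.

```python
from typing import List

STOP = 0xFF  # the single-byte stop code


def read_recorded_codes(data: bytes, byteorder: str = "big") -> List[int]:
    """Collect two-byte codes from data until the stop byte is reached.

    byteorder is an assumption; pass "little" if the low-order byte
    turns out to be stored first.
    """
    codes = []
    pos = 0
    while pos < len(data):
        if data[pos] == STOP:
            return codes  # stop code reached; ignore anything after it
        if pos + 2 > len(data):
            raise ValueError("truncated two-byte code")
        codes.append(int.from_bytes(data[pos:pos + 2], byteorder))
        pos += 2
    raise ValueError("stop code 0xFF not found")
```

For example, the bytes C2 00 12 34 FF would yield the recorded values 0xC200 and 0x1234 under the big-endian assumption.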