Unicode Follies
Internationalizing programs is hard. Programmers are naturally
more familiar with their native languages. Especially in the United
States, they may not know any other languages. And it goes without
saying that the stuff that you don't understand is the stuff you can't
test.
So programmers naturally welcome anything that makes
internationalization easier. One of those things is the Unicode
standard. As the name implies, Unicode is intended to be a universal
standard for representing text in any of the world's many languages.
The Unicode Consortium has an impressive list of members: Adobe, Apple,
Google, IBM, Microsoft, and so forth. It has also been impressively
successful: all the major operating systems, including Windows and Mac
OS X, use Unicode internally.
Unicode has a lot of advantages. However, this wouldn't be a very good
rant if I didn't focus on the things that could be improved. So, as
you might have guessed, that's exactly what I'm going to do today.
The standard that wasn't
One of my biggest gripes against Unicode is that it doesn't specify a
canonical way of serializing text. In my naivety, I expected
that a standard for representing text would tell me how to, um,
represent text.
But Unicode doesn't quite do that. Because when you have only one way
to do things, that's like,
the man, keeping you down. Totally
uncool.
Instead, Unicode gave us lots and lots of different ways of serializing
text. We have UTF-8, UTF-16 (big endian), UTF-16 (little endian),
UTF-32 (big endian), UTF-32 (little endian), UCS-2, and even more
obscure encodings.
All of these encodings are simply ways of representing
code
points. Now what is a code point, you may ask? Is it a character?
Well, not quite. Characters are made up of one or more code points.
So, for example, in Unicode, "Ä" (an A with two dots over it) is made
up of two Unicode code points, U+0041 and U+0308.
Now, the distinction between code points and characters makes sense
from a programmer's point of view. There are things other than the
letter A that can have two dots over them. Rather than giving them all
their own code point, we can just combine U+0308, the diaeresis code
point, with whatever vowel we want.
But here's what doesn't make sense: there are other ways to get the
letter Ä besides U+0041, U+0308. You can also use the single code
point U+00C4.
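A quick Python sketch makes the difference concrete (this is just an
illustration; any language with decent Unicode support will show the same
thing):

    # Two ways to write "Ä": one precomposed code point, or a plain A
    # followed by a combining diaeresis.
    precomposed = "\u00C4"        # U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
    decomposed  = "\u0041\u0308"  # U+0041 followed by U+0308 COMBINING DIAERESIS

    print(precomposed, decomposed)            # both display as Ä
    print(precomposed == decomposed)          # False -- different code point sequences
    print(len(precomposed), len(decomposed))  # 1 vs 2 code points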
So the simple operation of comparing two strings, which was basically
a
memcmp back in ASCII-land, is a complete clusterfsck in
Unicode. First you have to figure out what encoding the strings are
in-- UTF-8, UTF-16, etc. Then you have to normalize the sequences of
code points so that they represent composed characters the same way.
If you're taking user input, you probably have no idea whether it's
normalized or not. So you'll probably have to normalize everything,
just to be safe. In fact, there's not even a single normalization form
for Unicode-- there are four. Did we really need four?
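For what it's worth, here's roughly what that normalization dance looks
like in Python, using the standard unicodedata module (a minimal sketch,
not a complete recipe):

    import unicodedata

    a = "\u00C4"        # precomposed Ä
    b = "\u0041\u0308"  # decomposed A + combining diaeresis

    print(a == b)  # False -- the naive comparison fails

    # Normalize both sides to the same form before comparing.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        na = unicodedata.normalize(form, a)
        nb = unicodedata.normalize(form, b)
        print(form, na == nb)  # True for every form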
This is another case where having multiple ways to do things is
not better. Guys, if you are designing a standard, take a
stand. Have one way of doing things and stick to it. Don't come up
with a standard that just says "do what thou wilt shall be the whole of
the law."
Someone set us up the BOM
It's pretty clear, even to a first year comp sci student, that having
multiple encodings in the wild is going to cause problems. You're
always going to be reading text from one source and interpreting it as
something that it isn't. Avoiding such situations is kind of the point
of standardization. That, and eating donuts while sitting around a big
table.
However, the Consortium had a solution for this problem too. That
solution was BOMs, or "byte order marks." They were supposed to appear
at the start of documents to identify the kind of encoding and byte
order used within. Unlike with everything else, the Consortium
actually took a stand on what they should look like-- two bytes, at the
very start of the document (well, two bytes in UTF-16; the same marker is
three bytes in UTF-8).
I know, pretty bold. Couldn't they have given us more choices, like
"two bytes, except on alternate Tuesdays when it's an HTML tag encoded
in EBCDIC." But no, two bytes it was.
The problem with this, of course, is that it's utterly unworkable in
the real world. Especially on UNIX, the assumption is that files are
flat streams of data in a standard format. So some people used the
BOM and some didn't. And of course, older documents never had a BOM,
because they had been created before the concept existed.
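If you've ever had to cope with this, the sniffing logic looks something
like the Python sketch below (a rough illustration built on the standard
codecs module, not a production-grade detector):

    import codecs

    # Check the longer BOMs first: BOM_UTF16_LE is a prefix of BOM_UTF32_LE.
    BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8,     "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def sniff_encoding(raw: bytes, default: str = "utf-8") -> str:
        """Guess an encoding from a leading BOM, if there is one."""
        for bom, name in BOMS:
            if raw.startswith(bom):
                return name
        return default  # no BOM -- which, in practice, is most files

    print(sniff_encoding(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))  # utf-16-le
    print(sniff_encoding(b"plain old ASCII"))                              # utf-8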
64k should be enough for anyone
According to the Unicode FAQ,
the first version of Unicode
was a 16-bit encoding. The idea was that every code point could be
represented with 16 bits. 65,536 code points should be enough for
anyone, right?
Well,
no, actually. There are more than 65,536 Chinese
characters in existence. So even if all you wanted to support was
Chinese and English, you would already need more than 16 bits.
To be fair, the average Chinese person only needs to know a few
thousand glyphs to be considered fluent. However, you really still
want to be able to represent those characters so that you can, for
example, store and view ancient documents without mangling them.
However, the Consortium didn't see it that way-- at least at first.
Chinese has a lot of glyphs. So do Korean and Japanese. In order to
stay within the self-imposed 16-bit limit, the Unicode Consortium
decided to perform something called the "Han unification." Basically,
the idea was that many glyphs from Chinese looked similar to glyphs in
Japanese or Korean. So they were given the same code point. (Japan
actually has multiple writing systems, which I'm glossing over
here.)
The problem with this is that although these characters may look
similar, they're not the same! In the words of
Suzanne
Topping:
For example, the traditional Chinese glyph for "grass" uses four
strokes for the "grass" radical, whereas the simplified Chinese,
Japanese, and Korean glyphs use three. But there is only one Unicode
point for the grass character (U+8349) regardless of writing system.
Another example is the ideograph for "one," which is different in
Chinese, Japanese, and Korean. Many people think that the three versions
should be encoded differently.
Eventually, the 16-bit dream faded. The original 16-bit code point
space is now referred to as the
"basic multilingual
plane," or BMP. There are 17 planes in total, the equivalent of a little
bit more than 20 bits worth of space.
And now, I can finally explain the difference between UCS-2 and UTF-16.
Basically, both encode text as a series of 16-bit units, but UTF-16 is
variable length (a pair of units can reach beyond the BMP), while UCS-2
is not. So UCS-2 is doomed to be unable to
represent anything but the original 16-bit code point space (the BMP).
UCS-2 is deprecated, but you may still find it kicking around in some
places.
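Here's the arithmetic, plus one character from outside the BMP, as a quick
Python illustration (the example ideograph is arbitrary):

    # 17 planes of 65,536 code points each -- a bit over 20 bits of space.
    print(17 * 65536)            # 1114112 possible code points
    print(hex(17 * 65536 - 1))   # 0x10ffff, the highest code point

    bmp_char = "\u00C4"          # Ä, inside the BMP
    astral   = "\U00020000"      # a CJK ideograph from Extension B, outside the BMP

    # In UTF-16, BMP characters take two bytes; anything beyond needs a
    # surrogate pair (four bytes) -- which plain UCS-2 simply cannot express.
    print(len(bmp_char.encode("utf-16-be")))   # 2
    print(len(astral.encode("utf-16-be")))     # 4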
It's now illegal to sell computers in China that do not conform to
GB18030. This standard
mandates support for code points outside of the original BMP.
Han unification is still a sore point for many in Japan and Korea.
Unicode adoption in Japan has been slow, partially because of this.
The same font cannot correctly represent both Japanese and Chinese
Unicode text, because the same code points are used for glyphs that
look different.
How it should have ended
It's easy to complain about things. It's harder to make constructive
suggestions. With that in mind, here is what I think the Unicode
Consortium should have done:
- Only have one encoding-- UTF-8.
UTF-8 is a variable-length encoding, so you never have to worry about
running out of code points. It's also easy to upgrade existing C
programs to UTF-8. They can continue to pass around "pointers to
char." There is no flag day when everything needs to change all at
once. And since UTF-8's code unit is a single byte, it avoids all the
religious wars over endianness. It is also extremely compact for
ASCII text, which still makes up the majority of computer text out
there. (See the sketch after this list.)
- Represent every glyph with exactly one series of code points.
The Unicode normalization quagmire is annoying and pointless. Each
valid series of code points should represent something unique, so
that comparing strings becomes a single memory comparison again.
Considering the lengths they went to in trying to stuff everything
into 16 bits, the Consortium gave remarkably little thought to the
efficiency problems caused by denormalized forms. Obviously, there's an
interoperability risk here, too.
- Don't try to "unify" glyphs that look different! Text should
look right in any font-- at least, assuming the font has support for
the code points you're using.
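As a small illustration of the UTF-8 point above, here's what the
variable-length encoding looks like in practice (a Python sketch; the
sample characters are chosen to span one to four bytes):

    # UTF-8 uses 1 to 4 bytes per code point, and ASCII stays at 1 byte,
    # so plain English text is exactly as compact as it ever was.
    samples = ["A", "\u00C4", "\u8349", "\U0001F4A9"]  # A, Ä, the grass character, and a guest star

    for ch in samples:
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {encoded.hex(' ')}")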
The Future
Unicode support has been improving everywhere. For C programmers, good
libraries like IBM's ICU have been created to deal with Unicode.
Higher-level languages usually come with Unicode support built in
(although it's not always perfect).
People are also slowly converging on UTF-8 as a standard for
interchanging data-- at least in the UNIX world. The hateful idea of
BOMs has slowly faded into the mist, along with Ace of Base and
Kris Kross, two other scourges of the 1990s.
At the end of the day, Unicode has made things better for everyone.
But we should keep in mind the lessons learned here so that future
standards bodies can do even better.
Update
Apparently Unicode now has a "pile of poo" character.
Name: PILE OF POO
Block: Miscellaneous Symbols And Pictographs
Category: Symbol, Other [So]
Index entries: POO, PILE OF
Comments: dog dirt
Version: Unicode 6.0.0 (October 2010)
HTML Entity: 💩
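If you want to check for yourself, it's a couple of lines of Python (the
code point is U+1F4A9, which sits well outside the original BMP):

    poo = "\N{PILE OF POO}"
    print(poo, hex(ord(poo)))            # 💩 0x1f4a9
    print(poo.encode("utf-8").hex(" "))  # f0 9f 92 a9 -- four bytes of progress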
I guess we should all be happy that we are living in a bright future
where even piles of poo are standardized and cross-platform.