One of the (many) ongoing discussions in the Drizzle developer community is the level of support the database server kernel should provide for non-Unicode character set encodings. Actually, when I say non-Unicode, I actually mean non-UTF8, since we’ve stripped out all other character sets and “standardized” on 4-byte UTF8. I’ll come back to why exactly I put standardized in quotes in just a bit…but to sum up, in childish terms, my thoughts after spending 4 hours tonight reading about character sets and collations, here is an exchange between Toru and myself on Freenode #drizzle:
<jaypipes> tmaesaka: how do you write "I wish everyone would just speak English" in Japanese?<tmaesaka> みんな英語使うといいのに。
A Little Background
For those of you new to the world of character sets and collations, I’ll briefly summarize the concepts and terms I’ll talk about in this article. Incidentally, I consider myself to be in this crowd, since I’ve never really had to deal with anything more than a cursory knowledge of them in reference to how they work in MySQL (not the internals).
Character Sets and Encodings
A character set, or character encoding scheme, is a system for matching characters — such as “A” or “み” or “ß” — with a machine-readable code for the character. This machine-readable code can be represented simply as a decimal number, or in more complex character sets, a hexidecimal number. The “encoding” of the character set is the protocol, or instructions, that the character set uses in order to enable the computer to understand a series of byte sequences and interpret the sequence as a specific character.

Character sets such as ASCII are very simple, and consist of a single 8-bit byte for each character contained in the character set — with only 7 bits actually used for the character code. ASCII consists of the English-language alphabetic characters, including their captalized forms, the numbers 0 through 9, a variety of punctuation and “common” symbols like “$” and “;”, and a series of non-printable “control” characters. This encoding scheme works wonderfully for those of us in the U.S., but it is utterly lacking when it comes to representing the myriad characters and symbols in other languages.
Other more-complex character encodings are localized for a specific language, or writing system. For instance, the Shift_JIS character encoding scheme encodes, in 2 bytes, the ASCII character set (with 2 exceptions), the “half-width Katakana” characters, and the JIS X 0208 set of kanji symbols. Sound complicated? It is. And it gets even more complicated the further down the rabbit-hole one goes.
Which leads me to Unicode…
What the Heck is Unicode and UTF?
Many folks think that Unicode is merely another character set or encoding scheme. It’s not. It’s actually more than that. It’s an entire system which endeavours to standardize the way that computers can read, sort, and transform characters encoded in various character sets.
Actually, The Unicode standard according to Wikipedia
…consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order…
Got all that?
So, Unicode is a set of standards for dealing with lots of varying languages and characters, and transcoding character codes from one encoding scheme to another. What, then, is UTF[8|16|32]?
UTF stands for Unicode Transformation Format, and is a set of mapping methods for translating one of Unicode’s 1,114,112 code points (characters or control sequences) to a hexadecimal number. UTF8 is a variably-sized mapping method, which uses between one and four bytes to represent one of the code points. ASCII and most Western character sets take up 1 byte of storage, whilst CJK (Chinese/Japanese/Korean) characters typically consume 3 bytes of space per character. It is important to note that this 3 bytes is one more byte per character than encoding schemes like Shift_JIS, which use either 1 or 2 bytes for characters. Yoshinori Matsunobu published a short article today on these storage space differences.
UTF16 is a variable-width mapping scheme which uses the first 16 bits of the hexadecimal number to represent what “category” or “plane” of characters the code point belongs to. UTF16 generally uses a little bit less storage space for CJK characters versus UTF8. However, when analyzing actual CJK text, which includes spaces and other ASCII characters, the storage difference seems to be negligible. UTF32 is a fixed-length mapping method which uses 4 bytes to store each code point.
UTF8 is dominant in the web space, with all modern browsers able to understand and encode for UTF8.
OK, So What is a Collation?
So, if a character encoding scheme, such as UTF8, is used to identify a set of characters and symbols as a machine-readable sequence of bytes, then what exactly is a collation, and why are they important?
Glad you asked. A collation , or collating sequence, refers to the order in which different characters in a character set should appear when sorted in a list. The alphabetic collating sequence is the one some of us, in our little English-only world, are familiar with. But in various regions of the world, the same set of characters may be ordered differently when appearing in a list of characters. And therefore, even with a character encoding scheme like UTF8, one must also specify a collation when listing textual results in a specific order.
In MySQL, as well as Drizzle, the method for ordering results by a specific collation is fairly simple: one merely specifies the collation in the ORDER BY clause, like the example below shows:
mysql> CREATE TABLE utf8_tests (
-> my_text VARCHAR(100) NOT NULL
-> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)
mysql> INSERT INTO utf8_tests VALUES ('comb'),('cukor'),('csak'),('folyik'),('folyó'),('folyosó'),('fő'),('födém');
Query OK, 8 rows affected (0.00 sec)
Records: 8 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_general_ci;
+----------+
| my_text |
+----------+
| comb |
| csak |
| cukor |
| födém |
| fő |
| folyó |
| folyik |
| folyosó |
+----------+
8 rows in set (0.00 sec)
mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_hungarian_ci;
+----------+
| my_text |
+----------+
| comb |
| csak |
| cukor |
| fő |
| födém |
| folyó |
| folyik |
| folyosó |
+----------+
8 rows in set (0.00 sec)
You’ll notice that the words “fő” and “födém” are reversed depending on the collation used in the ORDER BY clause.
Any Hungarians reading this article? If there are, you’ll likely have already spotted the problem with the above output. The problem is that it’s wrong. “csak” should appear after cukor, since “cs” is a digraph (two-characters interpreted as one) which comes after “c” in the Hungarian alphabet.
The above behaviour is known bug in MySQL since August 2005, over three years. The above bug is something I noticed while reading up on collations and comparing what’s going on in MySQL/Drizzle to what the standard expects. The ICU project has a set of HTML pages where you can type in a list of words in a language and sort by various collations, and it will show you the correct sort order. I ran into the bug above, as well as a new bug in the German collation I found today.
Where Drizzle Is Right Now
Currently, all but the UTF8 character set have been removed from Drizzle. Furthermore, the UTF8 implementation in Drizzle is full 4-byte UTF8, which differs from the 3-byte variety used in MySQL <= 5.1. There are two major benefits that this decision and subsequent removal has given Drizzle:
- Reductions in the size and complexity of the Drizzle parser — removal of some CONVERT() stuff, introducers, and more
- Easier to understand (and potentially refactor) code surrounding character sets and collations
So, it seems that although we’ve stripped out a lot of complexity by moving to only UTF8 and its collations, we’ve inherited a system that, frankly, was never designed to handle complex collations. Instead, it is designed to be fast, not entirely accurate. So, what is a project to do?
We have a number of options, all of which we’ve been debating over on the mailing lists:
- Use libICU for all character set and collation services — libICU is a full-featured library written by experts in the languages and transcoding fields, why not take advantage of that expertise?
- Use GLib’s locale facilities — this has mostly been ruled out because of performance concerns over non-reentrant code dependent on setlocale()
- Write our own — this is essentially where we are right now
- Use C++’s <locale> facilities, as Monty demonstrated on the mailing lists — actually, I also have O’Reilly’s C++ Cookbook, so I know where that code originated…
libICU is, frankly, quite a large library, and it’s not certain that the performance of it would be satisfactory. However, I can certainly envision taking libICU’s test case suite and converting it to the Drizzle test suite format. This would certainly poke holes in our current character set handling that need to be discovered.
Although Yoshinori-san’s objections about UTF8 storage requirements versus localized Japanese character sets are valid, I don’t think at this point that we’ll re-introduce non-UTF8 character sets into the server at this time. If there is a huge uproar over this, in the future, pluggable character sets are a possibility, after changes to the plugin API to enable it. Pluggable collations too…
This last option is the one which interests me the most, and I find most appealing. In fact, I compiled a small test program based on the C++ <locale> facilities which actually produces the correct collation order for the bug demonstrated above:
#include <locale> #include <iostream> #include <vector> #include <string> #include <algorithm> using namespace std; bool localeLessThan(const string& s1, const string& s2) { locale locale1("hu_HU.utf8"); const collate<char>& col= use_facet<collate<char> >(locale1); const char* pb1= s1.data(); const char* pb2= s2.data(); return (col.compare(pb1, pb1 + s1.size(), pb2, pb2 + s2.size()) < 0); } int main(int argc, char** argv) { string s1 = "comb"; string s2 = "csak"; string s3 = "cukor"; vector<string> all_the_strings; all_the_strings.push_back(s1); all_the_strings.push_back(s2); all_the_strings.push_back(s3); sort(all_the_strings.begin(), all_the_strings.end(), localeLessThan); for (vector<string>::const_iterator p= all_the_strings.begin(); p != all_the_strings.end(); ++p) cout << *p << endl; }
Compiling and running the program shows the correct sorted order for the words:
[518][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ g++ test.cc [519][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ ./a.out comb cukor csak
I’m thinking that the refactoring work that still needs to be completed around CHARSET_INFO and MY_CHARSET_HANDLER should experiment with the technique above and verify any performance regression (or improvement) that may occur. Accuracy, in my opinion, and the ability to let a library not written by Drizzle developers do the heavy lifting, is more important than a small performance increase.
The Edwin Strikes Back

As he is sometimes prone to do, my dear colleague, Edwin DeSouza, shot me an email with a link he thought I might find interesting. He saw I had been chatting about character sets and had discovered a forum thread over on the Ruby Forum which went into some detail about the difficulties surrounding encoding conversion, localization, and internationalization.

Suffice it to say, “some detail” would be the understatement of the year in this case. The forum thread is longer than Jörmungandr, the mythical Norse sea serpent.
Notable voices on the thread include Matz, creator of Ruby and a core influencer in its direction, and Tim Bray, of our own Sun Microsystems and XML fame. The original poster, one Michael Selig, began the thread, entitled Character encodings – a radical suggestion, with an ostensibly simple suggestion:
Remove internal support for non-ASCII encodings completely, and when
reading/writing UTF-16 (and UTF-32) files automatically transcode
to/from UTF-8.
Unfortunately for Michael, this small suggestion was the online equivalent of stepping in a pile of elephant dung.
Until reading the above-mentioned forum thread, I really had no idea about the complexities involved in character set handling, especially in the Asian countries. If you are interested in character sets, collations, and Unicode vs. local encodings, reading through the forum thread will truly enlighten you as to the various arguments for and against UTF8. It’s highly recommended reading, but be warned, it may leave you gasping for breath at some points…enjoy.
