Last Week in Drizzle – Volume 2

This is the second post in the weekly series “Last Week in Drizzle” where we summarize the efforts of various folks in the Drizzle community over the past week. This edition covers the work and conversations of the past two weeks, as both a vacation and procrastination took their toll on getting the weekly edition done. As with the week before, a number of developers and community advocates continue to refactor the code base, come together in discussions on the mailing list, and brainstorm on how to solve the tough problems that Drizzle is trying to address. Mark Schoonover and I are now collaborating on the Last Week in Drizzle series. Thanks Mark!

Growth in the Drizzle Community

The week before last, we had 148 subscribers to the Drizzle mailing list — this week, that number is up to 176 members, a good growth of 19% in just two weeks. As I type this, there are 41 folks hanging out on the #drizzle Freenode channel, but last week I saw the number fluctuate between 30 and 50 developers, which was awesome to see.

The Drizzle wiki has been steadily growing in its content; so much so that there has been some discussion around moving from Wikia to a dedicated wiki on the newly-acquired drizzleproject.org domain. It’s not a high-priority thing right now, so perhaps there will be more news about the wiki move in future editions of this series… As for new content on the wiki, there’s been a number of entries added, including the following:

Ongoing Conversations on the Mailing List

As with previous weeks, the mailing list discussions, both bikeshed stuff and tough problem-solving discussions, are flourishing. On the bikeshed front, discussions have been around the domain name, future home page, and wiki stuff. On more substantive issues, there have been interesting threads regarding a proposed persistence and stream format for table metadata (more below, and also see Brian Aker’s section) and on the current plugin architecture. Also, the ongoing debate over the “proper” way of dealing with drizzle configuration files and CLI options continues, with various parties making a stand for and against a solution which simplifies the configuration directives and reduces the number of files.

The enthusiasm and openness with which these conversations have been happening continues to thrill me. We’ve been getting input from MySQL engineers, PostgreSQL engineers, DBAs, storage engine developers, and many others in the community. It’s fantastic to see the exchange of ideas in a transparent arena, and I encourage those of you who are interested in such things to check out the mailing lists!

Config File Organization

One of the ongoing conversations is what to do with the configuration files and command line options. A number of different suggestions have been offered regarding how to simplify the process of searching for configuration files, but AFAIK, no decision has been made regarding moving forward to a solution.

Table Schema Metadata Persistence

Another big discussion has been around how to eliminate the older .frm (format) files that, in MySQL, store the definition of the schema’s table objects. The FRM file format hasn’t changed a whole lot since the early (1980s) days of Unireg, which was the basis of modern-day MySQL. Whilst the FRM files have served their purpose up until now, there are a number of more modern and efficient ways of communicating object definitions between subsystems that have emerged over the past five years.

Brian has led the charge on incorporating one of these object definition communication methods, namely the Google Protocol Buffers library, into the kernel. After some discussion about alternatives, Brian decided to prototype the Protocol Buffers integration in a private bzr branch. Once it was working, this branch was merged into trunk early this week. Expect to see a lot more writing from Brian and others explaining the move and what benefits it brings the server.

UTF8, Collations and Charsets

When many of us were at OSCON, a discussion started between Brian, myself, Monty Widenius, and others regarding how to reduce the complexity involved in handling multiple character sets and collations in the strings/ library and the MySQL server codebase. MySQL currently supports a fantastic array of charsets and collations, and allows DBAs to be extremely flexible when defining schemas — giving the DBA the ability to change charsets and collations all the way down to the table column level. Of course, this flexibility comes with a cost: while not particularly a performance penalty, the charset and collation library calls add complexity to the overall server design. We wanted to simplify much of this complexity as well as investigate whether we could remove the custom library code and use existing, community-supported libraries such as GLib or libICU.

So, one might ask, “why mess with something that works just fine?” Good question, and Monty W was (and likely still is) of the opinion that the existing custom charset/collation library is just fine and there is no need to mess with it. But, there is something to be said for utilizing an open source library that is well-known and supported by the broader community. First, there’s the principle of many eyes making better, less buggy software. Whether or not you subscribe to this mantra, it is difficult to dispute that it is easier and less error-prone to “outsource” the development, and especially maintenance, of library code to a group of programmers who are experts in the field in which the library deals. In the case of charset and collation support, developers who know a ton about charset conversion and collation build and maintain the ICU project. Why not utilize their expertise and allow them to maintain this piece of the puzzle? A decision was made early on by Brian and others to make use of open source projects within the server kernel as much as possible, so that drizzle can fully benefit from the open source development world. Whether it is charset support, SSL routines, Protocol Buffers, or PCRE regex support, we look to libraries developed by the open source community to fill the needs of the server, so that we do not need to support custom libraries.

So, back to this past couple week’s conversation around character sets and collation on the mailing list… There seems to be some consensus on the mailing list that scrapping all character sets in favor of using UTF8 only is a pretty good idea. There are some detractors to this strategy who argue that the additional space that UTF8 characters consume in some non-Western languages will negatively impact performance for various international users. Others argue that compression routines can significantly increase performance in these cases, and that removing the complexities of dealing with character sets all over the kernel is a worthy trade-off.
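To put rough numbers on the storage argument, here is a minimal sketch (not code from the Drizzle tree; the function names are made up for illustration) that computes how many bytes a single Unicode code point needs in UTF-8 and in UTF-16. Most CJK characters land in the three-byte UTF-8 range, whereas UTF-16 and many legacy double-byte charsets store them in two bytes, which is where the "additional space" concern comes from.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative helpers only; these names do not exist in Drizzle.

// Number of bytes needed to encode one Unicode code point in UTF-8.
std::size_t utf8_length(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);      // outside the Unicode range otherwise
    if (code_point < 0x80) return 1;     // ASCII
    if (code_point < 0x800) return 2;    // Latin supplements, Greek, Cyrillic, ...
    if (code_point < 0x10000) return 3;  // rest of the BMP, including most CJK
    return 4;                            // supplementary planes
}

// Number of bytes needed to encode one Unicode code point in UTF-16.
std::size_t utf16_length(std::uint32_t code_point) {
    assert(code_point <= 0x10FFFF);
    return code_point < 0x10000 ? 2 : 4; // one code unit, or a surrogate pair
}
```

For example, U+0041 (Latin A) takes 1 byte in UTF-8 and 2 in UTF-16, while U+65E5 (a common CJK ideograph) takes 3 bytes in UTF-8 and 2 in UTF-16. Whether that difference matters in practice depends on the data and on compression, which is exactly what the thread is debating.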

As for collations, this is where the trickier part comes into play. Whilst character sets typically come into play on the storage and retrieval end of the code, collations come into play during the sorting and searching pieces of the code. It is not feasible to get rid of collations, since character ordering differs from language to language and region to region. So, the current debate on the mailing list is in regards to which external library, if any, to use in order to handle collations. GLib, which implements collation-switching using the setlocale() function, is almost a non-starter since setlocale(3) requires a global mutex on the server process locale structure and therefore would be a significant source of contention. libICU, on the other hand, seems to be overkill for many things and has the additional problem of being natively UTF-16, meaning that every character takes at least 2 bytes (4 for characters outside the Basic Multilingual Plane), which could be a performance regression… so the issue remains up in the air for now. Likely, character sets will be simplified first, and then the collation issue will be looked into.

The SET data type

On the 13th, Brian asked the mailing lists what to do about the SET data type. Keep it? Is it useful? What are the gotchas? A number of folks responded with opinions, and the resulting suggestion was to kill support for this data type in drizzle and, in the future, when custom user-defined data types are supported, perhaps make a plugin which provides a SET data type. Apparently very few people actually use the functionality in MySQL now, and its limitations and performance have typically forced users to employ a more flexible custom solution revolving around bit operations on standard integer types.
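For readers who have not seen the bit-operations workaround, here is a minimal sketch of the general idea on the application side. The flag names and helper functions are hypothetical and only illustrate the technique: each possible member gets one bit, and the whole set is stored in an ordinary integer column.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical members of a "set"; each one occupies a single bit.
enum Feature : std::uint32_t {
    FEATURE_WIFI    = 1u << 0,
    FEATURE_POOL    = 1u << 1,
    FEATURE_PARKING = 1u << 2
};

// Returns the set with the given member added.
inline std::uint32_t add_member(std::uint32_t set, Feature f) {
    return set | f;
}

// Returns the set with the given member removed.
inline std::uint32_t remove_member(std::uint32_t set, Feature f) {
    return set & ~static_cast<std::uint32_t>(f);
}

// True if the given member is present in the set.
inline bool has_member(std::uint32_t set, Feature f) {
    return (set & f) != 0;
}
```

The resulting integer is what would be written to a plain integer column, and membership tests in SQL then become bitwise AND expressions against the same constants.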

Plugin Architecture

Under the thread posted originally by Brian entitled “PAM, What I learned“, a very interesting conversation broke out surrounding proposed improvements in the current plugin architecture. Mats, Stewart, Monty Taylor, myself, and Brian hashed out some ideas about making the plugin architecture more modular, and cleanly separating the public plugin API from the internal server plugin factory interface. This prompted me to map out a prototype plugins functionality which I will be continuing work on this week and getting up and running on a separate bzr branch. More info on this in a future blog post.

MyISAM Concurrent insert

Brian asked if there was ever a situation where you would want to disable the concurrent insert functionality in the MyISAM storage engine. This led to a discussion from Sheeri Kritzer and Arjen Lentz about removal of a number of legacy or irrelevant global server variables such as old-passwords.

Development Last Week

With last week came a flurry of bug reports from Giuseppe Maxia and others. Many of the bugs revolved around support and crashes on MacOSX. Through careful debugging, Brian, Monty Taylor and Andrew Garner were able to pinpoint the issues, which revolved around drizzled’s handling of socket option setting on sockets that were already closed or in an error state, and subsequently release fixes into trunk. Other work still revolved around cleanup, and details of the work are below.
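The general lesson from this class of bug is that setsockopt(2) can fail and the caller has to treat that as a normal error path instead of assuming success. The following is a minimal sketch of that defensive pattern, not the actual patch; the function name is made up, and SO_KEEPALIVE is chosen purely for illustration since the options drizzled sets are not listed here.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>

// Hypothetical helper: tries to enable SO_KEEPALIVE on a socket.
// Returns 0 on success, or the errno value reported by setsockopt()
// so the caller can drop the connection cleanly instead of crashing.
int enable_keepalive(int fd) {
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return errno;  // e.g. EBADF when fd is not an open descriptor
    }
    return 0;
}
```

A caller would check the return value right after accepting a connection and simply close and skip the socket on failure, rather than asserting that the call succeeded.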

IPv6 Support and Bug Fixes from Andrew Garner

Andrew Garner (muzazzi) threw himself into the Drizzle development community these past two weeks by diagnosing and providing patches for a boatload of bugs, particularly some of the nasty sockets handling errors that were plaguing MacOSX users. He writes:

I’ve primarily been working on bug fixes. However, I did submit a
couple patches that enable ipv6 support in drizzle and lets drizzled
listen on multiple interfaces (it would previously stop at the first
one found.) We’re still looking at ways to further improve ipv6
support.
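As a rough illustration of what "listening on multiple interfaces" involves, here is a sketch, under my own assumptions and not taken from Andrew's patches, of the usual getaddrinfo(3) pattern: walk the whole result list and open a listening socket for every address returned, instead of stopping at the first one. The function name and the backlog value are hypothetical.

```cpp
#include <cassert>
#include <netdb.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: opens one listening TCP socket per wildcard address
// (IPv4 and/or IPv6) for the given port and returns the descriptors.
std::vector<int> listen_on_all(const char* port) {
    std::vector<int> fds;

    addrinfo hints = {};
    hints.ai_family = AF_UNSPEC;      // ask for both IPv4 and IPv6
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_PASSIVE;      // wildcard addresses suitable for bind()

    addrinfo* results = nullptr;
    if (getaddrinfo(nullptr, port, &hints, &results) != 0) {
        return fds;
    }

    // Keep going through the whole list; do not stop at the first success.
    for (addrinfo* ai = results; ai != nullptr; ai = ai->ai_next) {
        int fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0) {
            continue;                 // address family not available here
        }
        if (ai->ai_family == AF_INET6) {
            // Keep the IPv6 socket from also claiming the IPv4 wildcard.
            int on = 1;
            setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on));
        }
        if (bind(fd, ai->ai_addr, ai->ai_addrlen) != 0 || listen(fd, 16) != 0) {
            close(fd);
            continue;
        }
        fds.push_back(fd);
    }

    freeaddrinfo(results);
    return fds;
}
```

The server would then wait on all returned descriptors (with poll or select, for instance) and accept connections from whichever one becomes readable.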

Elaborating more on the bug fixes, Andrew continues,

I fixed a handful of bugs this week:
– Bug 252312 – Accessing an ARCHIVE table crashes the server – This
fixed the archive engine which was crashing drizzled. I also fixed
the archive test which exposed another bug (also fixed) related to
renaming tables which likely affected other storage engines.
– Bug 255860 – using information_schema crashes both client and server
– This was a twofer. Uninitialized memory would cause the drizzle
client to crash when building the completion hash. An unexpected
client abort would then cause the drizzle server to assertion-crash.
Both fixes with a couple simple patches.
– Bug 252507 – scanning drizzle port crashes the server – This bug
primarily affected OS X (so it was less obvious what the problem was),
but perhaps other platforms as well. This was similar to some other
headaches experienced with setsockopt on closed sockets or sockets w/
errors (e.g. aborted connect) on OS X.

There were a few other bugs late last week that I initially submitted
patches for through the bug system (before I quite grokked Launchpad)
that Mark Atwood was nice enough to merge in for me:
– Bug 250961 – com_statistics was crashing drizzle.
– Bug 250078 – parser error with space at end of statement.
– Bug 250065 – drizzle crashes when using unknown functions.

Some of these fixes will be made obsolete by further refactoring
others are actively working on, but I hope these contributions help.
They’ve definitely helped me get familiar with launchpad and the
drizzle source. At any rate, we did fix a ton of bugs this week.
We currently don’t have any reported, open “critical” bugs without a
fix available. I’m sure that will be fixed soon. 🙂

Removal of Assembler-based Strings Functions and Continued Code Style Cleanup

Monty Taylor (mordred), who just celebrated his 33rd birthday (congrats Monty!), worked mostly on fixing bugs with Andrew Garner these past two weeks and continuing his battle for code style and consistency in his codestyle branch. In addition, he removed the legacy assembler-based strings functions and made the conditional build of various string functions cleaner. This week, he’ll be looking at the aforementioned ICU character set and collation library and at using the C++ standard template library to remove dependencies on the custom mysys library calls in the drizzle command line client.

Google Protocol Buffers for Object Definition Communication

Brian Aker (krow) has been working diligently these past two weeks on removing the legacy dependency on the Unireg library routines for creating the .FRM format files for table definitions. The active development branch of drizzle now includes a dependency on the Google Protocol Buffers library, which is a stream-based C++ library for communicating simplified, standardized object definitions between subsystems.

PAM Authentication Plugin

As you may or may not know, one of the first things on the drizzle chopping block that Brian removed from the original MySQL server source code was the ACL (access control list) functionality — you know, all that GRANT stuff? The idea behind removing it was not to make drizzle the least secure database server in the world, but rather to pull all non-essential functionality out of the core server and into extensible modules.

PAM stands for Pluggable Authentication Modules, and provides a high-level API for dealing with multiple lower-level authentication mechanisms.

The PAM authentication module, a ~30 line plugin written by Brian last week, is the first such example plugin which hands off the responsibility of authentication from the database kernel to a plugin. Brian notes that there is a README in the plugins/pam_auth/ directory which explains installation and usage. Check it out, and if you’re interested in developing plugins for authentication and authorization for drizzle, investigate the source code of the pam_auth.cc plugin as a guide.

Merge of the mem* patches and Initial Work on Cloud-Aware Replication Protocol

Mats Kindahl, resident C++ wizard, focused his contributions these past couple weeks on continued refactoring and cleanup of redundant or unnecessary casts in the codebase, as well as a merge of the mem* patches into the mainline trunk. He writes:

Got the mem* patch merged with main.

Going to go over the code some more and see if there are things I can
improve to make the code faster and simpler. Going to have a look at
virtual functions usage, since that is the single most abused C++
feature (from a performance perspective) for modern processors.

Started designing a replication focused on clouds, it is there as the
blueprint drizzle-replication, but it will be slow progress since I have
a day work to take care of next week. 🙂

Heh, much agreed, Mats! 🙂
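To make the virtual function point a little more concrete, here is a small sketch of the general trade-off. It is not from Mats's patches, and all names are invented. The first version calls a comparator through a virtual function inside a loop, which typically costs an indirect call per iteration; the second passes the comparator as a template parameter so the compiler can resolve, and usually inline, the call at compile time.

```cpp
#include <cassert>

// Dynamic dispatch: the comparison is looked up through the vtable on each call.
struct Comparator {
    virtual ~Comparator() {}
    virtual bool less(int a, int b) const = 0;
};

struct Ascending : Comparator {
    bool less(int a, int b) const override { return a < b; }
};

int min_dynamic(const Comparator& cmp, const int* values, int count) {
    assert(count > 0);
    int best = values[0];
    for (int i = 1; i < count; ++i) {
        if (cmp.less(values[i], best)) best = values[i];
    }
    return best;
}

// Static dispatch: the comparator type is known at compile time.
struct AscendingStatic {
    bool less(int a, int b) const { return a < b; }
};

template <typename Cmp>
int min_static(const Cmp& cmp, const int* values, int count) {
    assert(count > 0);
    int best = values[0];
    for (int i = 1; i < count; ++i) {
        if (cmp.less(values[i], best)) best = values[i];
    }
    return best;
}
```

Both functions return the same result; the difference is only in how the call is dispatched. Whether the static version is actually faster in a given hot path is something to measure, not assume.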

eBay’s new MEMORY storage engine now included

Harrison Fisk did the work of porting eBay’s improved MEMORY storage engine to drizzle. The improvements to the MEMORY storage engine include support for variable-length fields in the table definition. Thanks Harrison, and great work!

Doxygen Builds

One smaller blueprint I had the chance to address was automating doxygen code auto-documentation into the build system. After I created a Doxyfile configuration file and added a make doxygen target to Makefile.am, Ronald Bradford added a new Builder to the Build Farm over at 42SQL.com and we now have doxygen documentation automatically being generated from the sources. Expect to see more code comments get converted into the JavaDoc standard and more code documentation creep into the doxygen output.
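For anyone unfamiliar with the JavaDoc comment style that doxygen understands, here is a small illustrative example. The function is made up for this post and is not part of the Drizzle source; the point is the comment block, which opens with a slash and two asterisks and uses @param and @return tags.

```cpp
#include <cassert>
#include <cstddef>

/**
 * Counts how many times a byte value occurs in a buffer.
 *
 * @param buf    Pointer to the first byte to scan; may be NULL only if len is 0.
 * @param len    Number of bytes to scan.
 * @param needle Byte value to look for.
 * @return Number of positions in the buffer whose byte equals needle.
 */
std::size_t count_byte(const unsigned char* buf, std::size_t len,
                       unsigned char needle) {
    assert(buf != nullptr || len == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; ++i) {
        if (buf[i] == needle) ++count;
    }
    return count;
}
```

When doxygen runs over a file like this, the description, parameters, and return value should show up on the generated page for the function.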

Cleanup of the Include File Spiderweb

Another task that I took on two weeks ago was the cleanup of the monolithic mysql_priv.h header file. There were a number of issues with it outside of the obvious issue of its massive size. First, the include file was being used for both client and server programs and was conditionally defined to handle special circumstances. So, I pulled client-specific code out into libdrizzle/drizzle_comm.h and include/drizzle.h where it belonged. Secondly, there were next to no comments in mysql_priv.h indicating what was actually being included in the include pipeline. Thirdly, a number of dependent include headers had no include guards on them, which meant you had to be extremely careful about not including files twice. Fourthly, the mysql_priv.h include header file was a dumping ground for developers to simply throw class and function declarations when they didn’t want to create a corresponding header file for their code. Finally, the order in which server data structures were declared was very much convoluted. So, after all this cleanup, mysql_priv.h was renamed to drizzled/server_includes.h and now only includes server-specific declarations. More cleanup is needed, but this was a good start.
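As a reminder of why the missing include guards were a problem, here is a tiny sketch. The header name, macro, and struct are hypothetical. The same header body is pasted twice below to simulate what the preprocessor sees when a file is included twice; the guard makes the second copy a no-op, whereas without it the struct would be defined twice and compilation would fail.

```cpp
#include <cassert>

// --- first inclusion of a hypothetical header, drizzled/example_table.h ---
#ifndef DRIZZLED_EXAMPLE_TABLE_H
#define DRIZZLED_EXAMPLE_TABLE_H

struct ExampleTable {
    int field_count;
};

#endif /* DRIZZLED_EXAMPLE_TABLE_H */

// --- second inclusion of the same header: skipped entirely by the guard ---
#ifndef DRIZZLED_EXAMPLE_TABLE_H
#define DRIZZLED_EXAMPLE_TABLE_H

struct ExampleTable {
    int field_count;
};

#endif /* DRIZZLED_EXAMPLE_TABLE_H */

// Code that depends on the header can now be written without caring
// how many times, or in what order, the header was pulled in.
inline int field_count_of(const ExampleTable& table) {
    assert(table.field_count >= 0);
    return table.field_count;
}
```

With guards on every header, each file can include exactly what it needs directly, which is what makes it possible to untangle an include order like the old one.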

Build Farm News

Ronald has continued to champion the Drizzle Build Farm, and writes the following update and request to the community:

The Build Farm has helped us in identifying problems but we are still
seeking people to contribute different platforms to help our automated
process become better.

Currently we cover the following platforms

– Ubuntu 8.04 – 32 & 64 bit
– Debian 5 – 32 & 64 bit
– Gentoo 8 – 32 & 64 bit
– CentOS 5 – 64 bit
– Fedora 8 – 32 & 64 bit
– SUSE 11 – 32 bit
– Mac OS/X 10.5 – 64 bit

See the wiki for how to contribute to the Drizzle Build Farm.

Final Words

That wraps up this week’s entry. My apologies to anyone I missed in this edition. Feel free to add errata and additions to the comments of the entry and I will update the blog post accordingly.