Understanding reservations, concurrency, and locking in Nova

Imagine that two colleagues, Alice and Bob, issue a command to launch a new virtual machine at approximately the same moment in time. Both Alice’s and Bob’s virtual machines must be given an IP address within the range of IP addresses granted to their project. Let’s say that range is 192.168.20.0/28, which would allow for a total of 16 IP addresses for virtual machines [1]. At some point during the launch sequence of these instances, Nova must assign one of those addresses to each virtual machine.

How do we prevent Nova from assigning the same IP address to both virtual machines?

In this blog post, I’ll try to answer the above question and shed some light on the way in which OpenStack projects currently address (and sometimes fail to address) this issue.

Demonstrating the problem

figure A

Dramatically simplified, the launch sequence of Nova looks like figure A. Of course, I’m leaving out hugely important steps, like the provisioning and handling of block devices, but the figure demonstrates the important steps in the launch sequence for the purposes of our discussion here. The specific step in which we find our IP address reservation problem is the determine networking details step.

figure B

Now, within the determine networking details step, we have a set of tasks that looks like figure B. All of the tasks except the last revolve around interacting with the Nova database [2]. The tasks are all pretty straightforward: we grab a record for a “free” IP address from the database and mark it “assigned” by setting the IP address record’s instance ID to the ID of the instance being launched, and the host field to the ID of the compute node that was selected during the determine host machine step in figure A. We then save the updated record to the database.
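To make the race concrete, here is a minimal sketch of that read-then-update flow (not Nova's actual code), written in Python with SQLAlchemy (1.4-style Core API) against a hypothetical fixed_ips table; the function and variable names are my own:

import sqlalchemy as sa

metadata = sa.MetaData()
fixed_ips = sa.Table(
    'fixed_ips', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('host', sa.Integer, nullable=True),
    sa.Column('instance_id', sa.Integer, nullable=True),
)

def assign_ip_naive(engine, instance_id, host_id):
    """Race-prone: two concurrent callers can read the same 'free' row."""
    with engine.begin() as conn:
        # Grab the "first" unassigned IP address record.
        row = conn.execute(
            sa.select(fixed_ips)
            .where(fixed_ips.c.host.is_(None),
                   fixed_ips.c.instance_id.is_(None))
            .order_by(fixed_ips.c.id)
            .limit(1)
        ).first()
        if row is None:
            raise RuntimeError('No free IP addresses left')
        # Mark it assigned and save it. Nothing prevents another request
        # from having read the same row in the meantime.
        conn.execute(
            sa.update(fixed_ips)
            .where(fixed_ips.c.id == row.id)
            .values(host=host_id, instance_id=instance_id)
        )
        return row.id

If two requests run this at roughly the same moment, both SELECTs can return the same row, which is exactly the interleaving shown in figure C below.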

OK, so back to our problem situation. Imagine if Alice and Bob’s launch requests were made at essentially the same moment in time, and that both requests arrived at the start of the determine networking details step at the same point in time, but that the tasks from figure B are executed in an interleaved fashion between Alice and Bob’s requests like figure C shows.

figure C

If you step through the numbered actions in both Alice and Bob’s request process, you will notice a problem. Actions #7 and #9 will both return the same IP address information to their callers. Worse, the database record for that single IP address will show the IP address is assigned to Alice’s instance, even though Bob’s instance was (very briefly) assigned to the IP address because the database update in action #5 occurred (and succeeded) before the database update in action #8 occurred (and also succeeded). In the words of Mr. Mackey, “this is bad, m’kay”.

There are a number of ways to solve this problem. Nova happens to employ a traditional solution: database-level write-intent locks.

Database-level Locking

At its core, any locking solution is intended to protect some critical piece of data from simultaneous changes. Write-intent locks in traditional database systems are no different. One thread announces that it intends to change one or more records that it is reading from the database. The database server will mark the records in question as locked by the thread, and return the records to the thread. While these locks are held, any other thread that attempts to either read the same records with the intent to write, or write changes to those records, will get what is called a lock wait.

Only once the thread indicates that it is finished making changes to the records in question — by issuing a COMMIT statement — will the database release the locks on the records. What this lock strategy accomplishes is prevention of two threads simultaneously reading the same piece of data that they intend to change. One thread will wait for the other thread to finish reading and changing the data before its read succeeds. This means that using a write-intent lock on the database system results in the following order of events:

figure D

For MySQL and PostgreSQL, the SQL construct used to tell the database server that the calling thread intends to change the records it is asking for is SELECT ... FOR UPDATE.

Using a couple of MySQL command-line client sessions, I’ll show you what effect this SELECT ... FOR UPDATE construct has on a normal MySQL database server (though the effect is identical for PostgreSQL). I created a test database table called fixed_ips that looks like the following:

CREATE TABLE `fixed_ips` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `host` INT(11) DEFAULT NULL,
  `instance_id` INT(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

I then populate the table with a few records representing IP addresses, all “available” for an instance: the host and instance_id fields are set to NULL:

mysql> SELECT * FROM fixed_ips;
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  1 | NULL |        NULL |
|  2 | NULL |        NULL |
|  3 | NULL |        NULL |
+----+------+-------------+
3 rows in set (0.00 sec)

And now, interleaved in time order, are the SQL commands executed in each of the sessions. Alice’s statements run in session A (sessA), Bob’s in session B (sessB).

sessA>BEGIN;
Query OK, 0 rows affected (...)

sessA>SELECT NOW();
+---------------------+
| NOW()               |
+---------------------+
| 2014-12-31 09:03:07 |
+---------------------+
1 row in set (0.00 sec)

sessA>SELECT * FROM fixed_ips
    -> WHERE instance_id IS NULL
    -> AND host IS NULL
    -> ORDER BY id LIMIT 1
    -> FOR UPDATE;
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  2 | NULL |        NULL |
+----+------+-------------+
1 row in set (...)
 
 
sessB>BEGIN;
Query OK, 0 rows affected (...)

sessB>SELECT NOW();
+---------------------+
| NOW()               |
+---------------------+
| 2014-12-31 09:04:05 |
+---------------------+
1 row in set (0.00 sec)

sessB>SELECT * FROM fixed_ips
    -> WHERE instance_id IS NULL
    -> AND host IS NULL
    -> ORDER BY id LIMIT 1
    -> FOR UPDATE;
sessA>UPDATE fixed_ips
    -> SET host = 42,
    ->     instance_id = 42
    -> WHERE id = 2;
Query OK, 1 row affected (...)
Rows matched: 1  Changed: 1

sessA>COMMIT;
Query OK, 0 rows affected (...)

(Session B's SELECT ... FOR UPDATE, issued earlier, now finally returns:)
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  3 | NULL |        NULL |
+----+------+-------------+
1 row in set (42.03 sec)

sessB>COMMIT;
Query OK, 0 rows affected (...)

There are two important things to note about the interplay between session A and session B. First, the 42.03 seconds: it shows the amount of time session B’s SELECT ... FOR UPDATE statement waited on the write-intent locks held by session A. Second, the row with id 3 returned by session B’s SELECT ... FOR UPDATE statement shows that a different row was returned for the same query that session A issued. In other words, MySQL waited until session A issued a COMMIT before completing session B’s SELECT ... FOR UPDATE statement.

In this way, the write-intent locks constructed with the SELECT ... FOR UPDATE statement prevent the collision of threads changing the same record at the same time.
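Since Nova talks to the database through SQLAlchemy, the same write-intent lock can be expressed in Python by adding with_for_update() to the query, which emits the FOR UPDATE clause. Here is a minimal sketch (not Nova's actual code) against the same hypothetical fixed_ips table used above:

import sqlalchemy as sa

def assign_ip_locked(engine, fixed_ips, instance_id, host_id):
    """Take a write-intent lock on the candidate row before updating it."""
    with engine.begin() as conn:
        # SELECT ... FOR UPDATE: other writers block until we COMMIT.
        row = conn.execute(
            sa.select(fixed_ips)
            .where(fixed_ips.c.host.is_(None),
                   fixed_ips.c.instance_id.is_(None))
            .order_by(fixed_ips.c.id)
            .limit(1)
            .with_for_update()
        ).first()
        if row is None:
            raise RuntimeError('No free IP addresses left')
        conn.execute(
            sa.update(fixed_ips)
            .where(fixed_ips.c.id == row.id)
            .values(host=host_id, instance_id=instance_id)
        )
        return row.id

The locks are released when engine.begin() commits the transaction on exiting the with block.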

How locks “fail” with MySQL Galera Cluster

At the Atlanta design summit, I co-led an Ops Meetup session on databases and was actually surprised by my poll of who was using which database server for their OpenStack deployments. Out of approximately 220 people in the room, MySQL Galera Cluster was by far the most popular way of deploying MySQL for use by OpenStack services, with around 200 or so operators raising their hands that they used it. Standard MySQL was next, and there was one person using PostgreSQL.

MySQL Galera Cluster is a system that wraps the standard MySQL row-level binary replication log transmission with something called working-set replication, enabling synchronous replication between many nodes running the MySQL database server. Now, that’s a lot of fancy words to really say that Galera Cluster allows you to run a cluster of database nodes that do not suffer from replication slave lag. You are guaranteed that the data on disk on each of the nodes in a Galera Cluster is exactly the same.

One interesting thing about MySQL Galera Cluster is that it can efficiently handle writes to any node in the cluster. This is different from standard MySQL replication, which generally relies on a single master database server that handles writes and real-time reads, and one or more slave database servers that serve read requests from applications that can tolerate some level of lag between the master and slave. Many people refer to this setup as multi-master mode, but that is actually a misnomer, because with Galera Cluster, there is no such thing as a master and a slave. Every node in a cluster is the same. Each can apply writes coming to the node directly from a MySQL client. For this reason, I like to refer to such a setup as multi-writer mode.

This ability to have writes directed to and processed by any node in the Galera Cluster is actually pretty awesome. You can direct a load balancer to spread read and write load across all nodes in the cluster, allowing you to scale writes as well as reads. This multi-writer mode is ideal for WAN-replicated environments, believe it or not, as long as the amount of data being written is not crazy-huge (think: Ceilometer), because you can have application servers send writes to the closest database server in the cluster, and let Galera handle the efficiency of transmitting writesets across the WAN.

However, there’s a catch. Peter Boros, a principal architect at Percona, a company that makes a specialized version of Galera Cluster called Percona XtraDB Cluster, was actually the first to inform the OpenStack community about this catch — in the aforementioned Ops Meetup session. The problem with MySQL Galera Cluster is that it does not replicate the write-intent locks for SELECT ... FOR UPDATE statements. There’s actually a really good reason for this. Galera does not have any idea about the write-intent locks, because those locks are constructions of the underlying InnoDB storage engine, not the MySQL database server itself. So, there’s no good way for InnoDB to communicate to the MySQL row-based replication stream that write-intent locks are being held inside of InnoDB for a particular thread’s SELECT ... FOR UPDATE statement [3].

Figure E

The ramifications of this catch are interesting, indeed. If two application server threads issue the same SELECT ... FOR UPDATE request to a load balancer at the same time, which directs each thread to different Galera Cluster nodes, both threads will return the exact same record(s) with no lock waits [4]. Figure E illustrates this phenomenon, with the circled 1, 2, and 3 events representing things occurring at exactly the same time (due to no locks being acquired/held).

One might be tempted to say that Galera Cluster, due to its lack of support for SELECT ... FOR UPDATE write-intent locks, is no longer ACID-compliant, since now two threads can simultaneously select the same record with the intent of changing it. And while it is indeed true that two threads can select the same record with the intent of changing it, it is extremely important to point out that Galera Cluster is still ACID-compliant.

The reason is that even though two threads can simultaneously read the same record with the intent of changing it (the identical behaviour you would see if the FOR UPDATE were left off the SELECT statement), if both threads attempt to write a change to the same record via an UPDATE statement, either one of the threads or neither will succeed in updating the record, but never both. The reason for this lies in the way that Galera Cluster certifies a working set (the set of changes to data). If node 1 writes an update, it must certify with a quorum of nodes in the cluster that its update does not conflict with updates on those nodes. If node 3 has begun changing the same row of data, but has not yet certified that working set with the other nodes in the cluster, then it will fail to certify the original working set from node 1 and will send a certification failure back to node 1.

This certification failure manifests itself as a MySQL deadlock error, specifically error 1213, which will look like this:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

All nodes other than the one that first “won” — i.e. successfully committed and certified its transaction — will return this deadlock error to any other thread that attempted to change the same record(s) at the same time as the thread that “won”. Need a visual of all this interplay? Check out figure F, which I scraped together for the graphically-inclined.

Figure F

If you ever wondered why, in the Nova codebase, we make prodigious use of a decorator called @_retry_on_deadlock in the SQLAlchemy API module, it is partly because of this issue. These deadlock errors can be consistently triggered by running load tests or things like Tempest that can put a load on the database that forces “hot spots” in the data to occur. This decorator does exactly what you’d think it would do: it retries the transaction if a deadlock error is returned from the database server.
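Conceptually, the decorator is nothing fancy. Here is an illustrative sketch (not the actual Nova implementation) of what such a retry-on-deadlock wrapper looks like in Python; the exception class is a stand-in for whatever the database layer raises for MySQL error 1213, and the retry count and sleep values are arbitrary:

import functools
import time

class DBDeadlockError(Exception):
    """Stand-in for the deadlock exception raised by the DB layer."""

def retry_on_deadlock(max_retries=5, sleep=0.05):
    """Re-run the whole database transaction if it loses certification."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapped(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return fn(*args, **kwargs)
                except DBDeadlockError:
                    if attempt == max_retries - 1:
                        raise  # give up after the final attempt
                    # The server rolled the transaction back; wait briefly
                    # and run the whole transaction again.
                    time.sleep(sleep)
        return wrapped
    return decorator

@retry_on_deadlock()
def assign_ip(engine, fixed_ips, instance_id, host_id):
    ...  # the SELECT/UPDATE transaction from earlier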

So, given what we know about MySQL Galera Cluster, one thing we are trying to do is entirely remove any use of SELECT ... FOR UPDATE from the Nova code base. Since we know it doesn’t work the way people think it works on Galera Cluster, we might as well stop using this construct in our code. However, the retry-on-deadlock mechanism is actually not the most effective or efficient way to solve the concurrent update problems in the Nova code base. There is another technique, which I’ll call compare and swap, that offers a variety of benefits over retry-on-deadlock.

Compare and swap

One of the drawbacks to the retry-on-deadlock method of handling concurrency problems is that it is reactive by nature. We essentially wrap calls that tend to deadlock with a decorator that catches the deadlock error if it arises and retries the entire database transaction. The problem with this is that the deadlock error that manifests itself from the Galera Cluster working set certification failure (see Figure F above) takes a non-trivial amount of time to occur.

Think about it. A thread manages to start a write transaction on a Galera Cluster node. It writes the transaction on the local node and gets all the way up to the point of doing the COMMIT. At that point, the node sends out a certification request to each node in the cluster (in parallel). It must wait until a quorum of those nodes respond with a successful certification. If another node has an active working set that changes the same modified rows, then a deadlock will occur, and that deadlock will eventually bubble its way back to the caller, who will retry the exact same database transaction. All of these things, while individually very quick in Galera Cluster, do take some amount of time.

What if we used a technique that allowed us to structure our SQL statements in such a way that we avoid those roundtrips from one Galera Cluster node to the other nodes? Well, there is such a technique.

Consider the following SQL statements, taken from the above CLI examples:

BEGIN;
/* Grab the "first" unassigned IP address */
SELECT id FROM fixed_ips
WHERE host IS NULL
AND instance_id IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE;
/* Let's assume that the above query returned the
   fixed_ip with ID of 1
   We now "assign" the IP address to instance #42
   and on host #99 */
UPDATE fixed_ips
SET host = 99, instance_id = 42
WHERE id = 1;
COMMIT;

Now, we know that the locks taken for the FOR UPDATE statement won’t actually be considered by any other nodes in a Galera Cluster, so we need to get rid of the use of SELECT ... FOR UPDATE. But how can we structure things so that the SQL we send to any node in the Galera Cluster guarantees that we neither stumble into a deadlock error nor require the node executing our statements to contact any other node in order to determine whether another thread has updated the same record between the time we SELECT’d our record and the time we go to UPDATE it?

The answer lies in constructing an UPDATE statement whose WHERE clause includes all the fields from the previously SELECTed record, like so:

/* Grab the "first" unassigned IP address */
SELECT id FROM fixed_ips
WHERE host IS NULL
AND instance_id IS NULL
ORDER BY id
LIMIT 1;
/* Let's assume that the above query returned the
   fixed_ip with ID of 1
   We now "assign" the IP address to instance #42
   and on host #99, but specify that the host and
   instance_id fields must match our original view
   of that record -- i.e., they must both be NULL
*/
UPDATE fixed_ips
SET host = 99, instance_id = 42
WHERE id = 1
AND host IS NULL
AND instance_id IS NULL;

If we structure our application code so that it is executing the above SQL statements, each statement can be executed on any node in the cluster, without waiting for certification failures to occur before “knowing” if the UPDATE would succeed. Remember that working set certification in Galera only happens once the local node (i.e. the node originally receiving the SQL statement) is ready to COMMIT the changes. Well, if thread B managed to update the fixed_ip record with id = 1 in between the time when thread A does its SELECT and the time thread A does its UPDATE, then the WHERE condition:

WHERE id = 1
AND host IS NULL
AND instance_id IS NULL;

will fail to match any rows in the database, since host IS NULL AND instance_id IS NULL will no longer be true if another thread updated the record. We can catch this failure to update any rows more efficiently than waiting for a certification failure, since the thread that sent the UPDATE ... WHERE ... host IS NULL AND instance_id IS NULL statement is notified that zero rows were updated before any certification traffic would ever be generated (there’s no certification needed if nothing was updated).

Do we still need a retry mechanism? Yes, of course we do, in order to retry the SELECT, then UPDATE ... WHERE statements when a previous UPDATE ... WHERE statement returned zero rows affected. The difference between this compare-and-swap approach and the brute-force retry-on-deadlock approach is that we’re no longer reacting to an exception emitted after certification fails, but instead being proactive: we structure our UPDATE statement to pass in our previous view of the record we want to change, allowing for a tighter retry loop (no waiting on certification, simply detect whether rows_affected is greater than zero).
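Putting the pieces together, a compare-and-swap version of the IP assignment might look like the sketch below (again in Python/SQLAlchemy against the hypothetical fixed_ips table, not Nova's actual code). The crucial bit is the rowcount check: zero rows affected means another thread won the race, so we re-read and try again:

import sqlalchemy as sa

def assign_ip_cas(engine, fixed_ips, instance_id, host_id, max_retries=5):
    """Compare-and-swap: the UPDATE succeeds only if the row still looks
    the way it did when we read it (host and instance_id still NULL)."""
    for _ in range(max_retries):
        with engine.begin() as conn:
            row = conn.execute(
                sa.select(fixed_ips.c.id)
                .where(fixed_ips.c.host.is_(None),
                       fixed_ips.c.instance_id.is_(None))
                .order_by(fixed_ips.c.id)
                .limit(1)
            ).first()
            if row is None:
                raise RuntimeError('No free IP addresses left')
            result = conn.execute(
                sa.update(fixed_ips)
                .where(fixed_ips.c.id == row.id,
                       fixed_ips.c.host.is_(None),
                       fixed_ips.c.instance_id.is_(None))
                .values(host=host_id, instance_id=instance_id)
            )
            if result.rowcount == 1:
                return row.id
        # rowcount was 0: someone else grabbed that IP between our SELECT
        # and our UPDATE -- loop around and pick another candidate.
    raise RuntimeError('Could not assign an IP after %d attempts' % max_retries)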

This compare and swap mechanism is what I describe in the lock-free-quota-management Nova blueprint specification. There’s been a number of mailing list threads and IRC conversations about this particular issue, so I figured I would write a bit and create some pretty graphics to illustrate the sequencing of events that occurs. Hope this has been helpful. Let me know if you have thoughts on the topic or see any errors in my work. Always happy for feedback.

[1] This is just for example purposes. Technically, such a CIDR would result in 13 available addresses in Nova, since addresses for the gateway, cloudpipe VPN, and broadcast addresses are reserved for use by Nova.

[2] We are not using Neutron in our example here, but the same general problem resides in Neutron’s IPAM code as is described in this post.

[3] Technically, there are trade-offs between pessimistic locking (which InnoDB uses locally) and optimistic locking (which Galera uses in its working-set certification). For an excellent read on the topic, check out Jay Janssen‘s blog article on multi-node writing and deadlocks in Galera.

[4] If both threads happened to hit the same Galera Cluster node, then the last thread to execute the SELECT ... FOR UPDATE would end up waiting for the locks (in InnoDB) on that particular cluster node.

MySQL Stored Procedures Ain’t All That

I give quite a lot of presentations. A whole lot less than I used to, but still quite a few per year. Most of the time, the presentations are on performance tuning MySQL.

Almost every time I give a presentation on MySQL performance tuning — and this happens 100% of the time if I am presenting to a Windows SQL Server crowd — I get the following question:

Why don’t you cover using stored procedures in order to increase performance? Wouldn’t that be the easiest way to get better performance since the stored procedures will only be parsed once and then the compiled bytecode would be efficiently executed from then on?

Every person that asks this question assumes something about MySQL’s stored procedure implementation; they incorrectly believe that stored procedures are compiled and stored in a global stored procedure cache, similar to the stored procedure cache in Microsoft SQL Server[1] or Oracle[2].

This is wrong. Flat-out incorrect.

Here is the truth: Every single connection to the MySQL server maintains its own stored procedure cache.

This means two very important things that users of stored procedures should understand:

  • If you operate in a shared-nothing environment — for example, the majority of PHP and Python applications, which do not use connection pooling or persistent connections — and your application uses stored procedures, then every single time you connect to the database server and issue a CALL statement, the connection compiles the stored procedure, stores it in its own cache, and throws that cache away when the connection closes (see the sketch just after this list)
  • If you use stored procedures, the memory usage of every single connection that uses those stored procedures is going to increase, and will increase substantially if you use many stored procedures
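To make the first point above concrete, here is a hedged Python sketch (using PyMySQL; the procedure name and connection settings are made up) contrasting the shared-nothing pattern with a reused connection:

import pymysql

CONN_ARGS = dict(host='localhost', user='app', password='secret',
                 database='appdb')  # made-up credentials

def shared_nothing_calls(n):
    """A fresh connection per request: the stored procedure is re-parsed
    into (and then thrown away with) each connection's private cache."""
    for _ in range(n):
        conn = pymysql.connect(**CONN_ARGS)
        with conn.cursor() as cur:
            cur.callproc('get_customer_orders', (42,))  # hypothetical proc
        conn.close()

def pooled_calls(n):
    """A persistent connection: the procedure is compiled once into this
    connection's cache and reused on every subsequent CALL."""
    conn = pymysql.connect(**CONN_ARGS)
    with conn.cursor() as cur:
        for _ in range(n):
            cur.callproc('get_customer_orders', (42,))
    conn.close()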

Ooops, I Invalidated Everything Again

So, what happens when you CREATE, ALTER, or DROP any stored procedures? Since MySQL stores all stored procedure execution code on the connection threads, each of those connection threads must invalidate just the changed procedure in its cache, right?

No, it’s worse. Every time ANY stored procedure is added, dropped, or updated, ALL stored procedures on ALL connection threads will be invalidated and must be re-compiled. Here is how the “caches” are invalidated:

from /sql/sp_cache.cc, lines 193-197, in MySQL 5.5

/*
  Invalidate all routines in all caches.
 
  SYNOPSIS
    sp_cache_invalidate()
 
  NOTE
    This is called when a VIEW definition is created or modified (and in some
    other contexts). We can't destroy sp_head objects here as one may modify
    VIEW definitions from prelocking-free SPs.
*/
void sp_cache_invalidate()
{
  DBUG_PRINT("info",("sp_cache: invalidating"));
  thread_safe_increment(Cversion, &Cversion_lock);
}

It’s a bit misleading, since it actually doesn’t invalidate anything at all. What the above code does is increment the global “Cversion” variable. The next time a connection thread attempts to execute, drop, or add a procedure, it will notice that its local cache’s version number is less than this Cversion number, destroy the entire cache, and rebuild it gradually as procedures are executed.

So, Should You Use Stored Procedures in MySQL?

Does the above warning mean that you should never use stored procedures in MySQL? No. What it means (besides being a bit of a rant on the implementation of MySQL’s stored procedures) is that you should be aware of these issues and use stored procedures where they make the most sense:

  • When you know that you will be executing the stored procedure over and over again on the same connection — for instance, in a bulk loading script or similar
  • When you know that you will not be disconnecting from the MySQL server at the end of script execution — for instance, if you use JDBC connection pooling
  • When you know that you have a limited number of stored procedures and the memory usage of connections won’t be an issue

Finally, if you see benchmarks that purport to show a huge performance increase from using stored procedures in MySQL, be careful to understand what the benchmark is doing and whether that benchmark represents your real-world environment. For instance, if you see a huge performance increase in sysbench when using stored procedures, but you have a PHP shared-nothing environment, understand that those benchmark results mean very little to you, since sysbench connections don’t get destroyed until the end of the run…

[1] From my copy of Inside SQL Server 2000, Delaney (2001), pages 852-865. For a short, but decent, online explanation of SQL Server’s stored procedure cache, see here

[2] Oracle’s stored procedures are stored in the shared pool of the Oracle system global area (SGA)

Now Recording Drizzle Contributor Tutorial

Hi all!

I was swamped with registrations for the online contributor tutorial for Drizzle, and so I’ve bumped up my account to a DimDim Pro account. This means two things:

  1. I can take >20 registrations
  2. I can record the session

So, Diego, rest assured, the session will be recorded (hopefully with no glitches). I’m going to call DimDim to see if I can do a practice recording beforehand to verify that Linux64 is a platform they support for recording (if not, I’ll go to my neighbour’s Windows computer to record).

Again, if you’re interested in the webinar, please do register using the widget below:


Cheers,

jay

Signup for Drizzle Contributor Tutorial Webinar – May 15th

Hi all!

I’ll be giving an online webinar for Drizzle contributors on Saturday, May 15th @ 1am GMT (in the U.S. this is Friday, May 14th @ 9pm EDT, 6pm PDT).

Note that the DimDim widget below shows the time as May 14th @ 8pm. The widget is wrong, since DimDim does not account for daylight savings.

Space is strictly limited to 20 people and this will be done via DimDim.com. Please register for the webinar by entering your email address in the widget below and clicking “Sign Up”.

The agenda for this 2-3 hour tutorial will be:

  1. First Steps
    • Getting registered as a contributor for Drizzle on Launchpad
    • Registering your SSH keys with Launchpad
    • Picking up and creating blueprints
    • Basics of Bazaar
    • Setting up a local code repository for Drizzle
    • Committing your work to a Bazaar branch
    • Pushing your code to Launchpad
    • Requesting a code review and merge into trunk
    • One slide explaining the license your contributions may be submitted under
  2. The Drizzle Source Code
    • Our coding standards
    • Our build system
    • Walkthrough of major directories in Drizzle
    • Understanding the plugin system
    • Understanding what the kernel is responsible for
    • Where the Dragons live — and how to avoid them
  3. Walkthrough of a SELECT statement

    • Client communication with server
    • The role of the session scheduler plugin
    • How, when and where authentication and authorization plugins are called
    • How the drizzled::statement::Statement subclasses work
    • Dive into drizzled::statement::Select::execute()
    • Walkthrough how a Table’s definition (metadata) is read from a protobuffer file or an engine
    • Dive into mysql_lock_tables()
    • How does drizzled::plugin::StorageEngine::startStatement() work?
    • How does drizzled::plugin::TransactionalStorageEngine::startTransaction() work?
    • Inside the join optimizer and Join::optimize()
    • How does the nested loops algorithm get executed and how does READ_RECORD work?
    • How does drizzled::Cursor perform table and index scans and seeks?
    • How are result sets packaged up and sent to clients?
  4. Plugin Development Tutorial

    • What plugin classes are even available?
    • Creating your basic plugin
    • The plugin.ini file
    • The module initialization and configuration file
    • Registering your plugin with the kernel with plugin::Context::add()
    • Publishing your plugin’s information using table functions
    • Providing users control over your plugin with user-defined functions

Holy Google Summer of Code, Batman

So, last year, Drizzle participated in the Google Summer of Code under the MySQL project organization. We had four excellent student submissions and myself, Monty Taylor, Eric Day and Stewart Smith all mentored students for the summer. It was my second year mentoring, and I really enjoyed it, so I was looking forward to this year’s summer of code.

This year, Padraig O’Sullivan, a GSoC student last year, is now working at Akiban Technologies, partly on Drizzle. He is the GSoC administrator and also a mentor for Drizzle, which is its own sponsored project organization this year. Thank you, Padraig!

I have been absolutely floored by the flood of potential students who have shown up on the mailing list and the #drizzle IRC channel. I have been even more impressed with those students’ ambition, sense of community, and willingness to ask questions and help other students as they show up. A couple of students have even gotten code into the source trees before submitting their official applications to GSoC. See, I told you they were ambitious! :)

This year, Drizzle has a listing of 16 potential projects for students to work on. The projects are for students interested in developing in C++, Python, or Perl.

If you are interested in participating, please do check out Drizzle! For those new to Launchpad, Bazaar, and C++ development with Drizzle, feel free to check out these blog articles which cover those topics:

And, in other news, Go Buckeyes!

Understanding Drizzle’s Transaction Log

Today I pushed up the initial patch which adds XA support to Drizzle’s transaction log. So, to give myself a bit of a rest from coding, I’m going to blog a bit about the transaction log and show off some of its features.

WARNING: Please keep in mind that the transaction log module in Drizzle is under heavy development and should not be used in production environments. That said, I’d love to get as much feedback as possible on it, and if you feel like throwing some heavy data at it, that would be awesome ;)

What is the Transaction Log?

Simply put, the transaction log is a record of every modification to the state of the server’s data. It is similar to MySQL’s binlog, with some substantial differences:

  • The transaction log is composed of Google Protobuffer messages. Because of this, it is possible to read the log using a variety of programming languages, as Marcus Eriksson‘s RabbitReplication project demonstrates.
  • The transaction log is a plugin[1]. It lives entirely outside of the Drizzle kernel. The advantage of this is that development of the transaction log does not need to be linked with development in the kernel and versioning of the transaction log can happen independently of the kernel.
  • Currently, there is only a single log file. MySQL’s binlog can be split into multiple files. This may or may not change in the future. :)
  • Drizzle’s transaction log is indexed. Among other things, this means that you can query the transaction log directly from within a Drizzle client via DATA_DICTIONARY views. I will demonstrate this feature below.

It is important to also point out that Drizzle’s transaction log is not required for Drizzle replication. This probably sounds very weird to folks who are accustomed to MySQL replication, which depends on the MySQL binlog. In Drizzle, the replication API is different. Although the transaction log can be used in Drizzle’s replication system, it’s not required. I’ll write more on this in later blog posts which demonstrate how the replication system is not dependent on the transaction log, but in this article I just want to highlight the transaction log module.

How Do I Enable the Transaction Log?

First things first, let’s see how we can enable the Transaction Log. If you’ve built Drizzle from source or have installed Drizzle locally, you will be familiar with the process of starting up a Drizzle server. To review, here is how you do so:

cd $basedir
./drizzled [options] &

Where $basedir is the directory in which you built or installed Drizzle. For the [options], typically you will need at the very least a --datadir=$DATADIR and a --mysql-protocol-port=$PORT value. For an explanation of the --mysql-protocol-port option, see Eric Day‘s recent article.

To demonstrate, I’ve built a Drizzle server in a local directory of mine, and I’ll use the /tests/var/ directory as my $datadir:

cd /home/jpipes/repos/drizzle/xa-transaction-log/drizzled/
./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &

You should see output similar to this:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &
[1] 31499
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 15:41:51  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 15:41:53 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)

To connect to the above server, I then do:

../client/drizzle --port=9306

If all went well, you should be at a drizzle client prompt:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> 

You can check to see whether the transaction log is enabled by querying the DATA_DICTIONARY.VARIABLES table. The transaction log is not on by default:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | OFF             | 
| transaction_log_enable_checksum | OFF             | 
| transaction_log_enable_xa       | OFF             | 
| transaction_log_log_file        | transaction.log | 
| transaction_log_sync_method     | 0               | 
| transaction_log_truncate_debug  | OFF             | 
| transaction_log_xa_num_slots    | 8               | 
+---------------------------------+-----------------+
7 rows in set (0 sec)

OK, let’s restart the server, this time with the transaction log enabled. To shut down Drizzle, there is no need to use a tool like mysqladmin. You can shut down the server via the client:

drizzle> exit
Bye
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306 --shutdown
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled: Normal shutdown
100317 15:53:48  InnoDB: Starting shutdown...
100317 15:53:49  InnoDB: Shutdown completed; log sequence number 44244
...

Now let’s start up the server, this time passing the --transaction-log-enable and the --default-replicator-enable options. The --default-replicator-enable option is needed when the transaction log is not in XA mode (more on that later):

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable &
[2] 31582
[1]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
...
./drizzled: ready for connections.

And again, connect to the server and check our transaction log variables again:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | ON              | 
| transaction_log_enable_checksum | OFF             | 
| transaction_log_enable_xa       | OFF             | 
| transaction_log_log_file        | transaction.log | 
| transaction_log_sync_method     | 0               | 
| transaction_log_truncate_debug  | OFF             | 
| transaction_log_xa_num_slots    | 8               | 
+---------------------------------+-----------------+
7 rows in set (0 sec)

drizzle> 

OK. So, if you check the $datadir, you should see a file called transaction.log, with a size of 0:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  6 jpipes jpipes 4.0K 2010-03-17 15:54 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 15:54 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:54 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:41 ib_logfile1
-rwxr-----  1 jpipes jpipes    6 2010-03-17 15:54 serialcoder.pid
-rwx------  1 jpipes jpipes    0 2010-03-17 15:54 transaction.log

Back in the drizzle client, let’s go ahead and create a new schema, a new table, and add a single row to that table. This will add some entries to the transaction log that we’ll be able to view:

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 rows affected (0.06 sec)
drizzle> USE lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY,
    -> hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.05 sec)

Checking in on our transaction log file, we see it now has some size to it:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  7 jpipes jpipes 4.0K 2010-03-17 16:11 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 16:11 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile1
drwxrwx--x  2 jpipes jpipes 4.0K 2010-03-17 16:11 lebowski
-rwxr-----  1 jpipes jpipes    6 2010-03-17 16:11 serialcoder.pid
-rwx------  1 jpipes jpipes  444 2010-03-17 16:11 transaction.log

Finding Out What’s In the Transaction Log

OK, so now for the really cool part of this little demonstration. :) Let’s take a look at what is now contained in the transaction log, all via the Drizzle client and the DATA_DICTIONARY views.

There are currently three DATA_DICTIONARY views which show information about the transaction log and its contents:

  • DATA_DICTIONARY.TRANSACTION_LOG
  • DATA_DICTIONARY.TRANSACTION_LOG_ENTRIES
  • DATA_DICTIONARY.TRANSACTION_LOG_TRANSACTIONS

To see what each view contains, simply do a DESC on them:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> DESC TRANSACTION_LOG;
+---------------------+---------+-------+---------+-----------------+-----------+
| Field               | Type    | Null  | Default | Default_is_NULL | On_Update |
+---------------------+---------+-------+---------+-----------------+-----------+
| FILE_NAME           | VARCHAR | FALSE |         | FALSE           |           | 
| FILE_LENGTH         | BIGINT  | FALSE |         | FALSE           |           | 
| NUM_LOG_ENTRIES     | BIGINT  | FALSE |         | FALSE           |           | 
| NUM_TRANSACTIONS    | BIGINT  | FALSE |         | FALSE           |           | 
| MIN_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           | 
| MAX_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           | 
| MIN_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           | 
| MAX_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           | 
| INDEX_SIZE_IN_BYTES | BIGINT  | FALSE |         | FALSE           |           | 
+---------------------+---------+-------+---------+-----------------+-----------+
9 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_ENTRIES;
+--------------+---------+-------+---------+-----------------+-----------+
| Field        | Type    | Null  | Default | Default_is_NULL | On_Update |
+--------------+---------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET | BIGINT  | FALSE |         | FALSE           |           | 
| ENTRY_TYPE   | VARCHAR | FALSE |         | FALSE           |           | 
| ENTRY_LENGTH | BIGINT  | FALSE |         | FALSE           |           | 
+--------------+---------+-------+---------+-----------------+-----------+
3 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_TRANSACTIONS;
+-----------------+--------+-------+---------+-----------------+-----------+
| Field           | Type   | Null  | Default | Default_is_NULL | On_Update |
+-----------------+--------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET    | BIGINT | FALSE |         | FALSE           |           | 
| TRANSACTION_ID  | BIGINT | FALSE |         | FALSE           |           | 
| SERVER_ID       | BIGINT | FALSE |         | FALSE           |           | 
| START_TIMESTAMP | BIGINT | FALSE |         | FALSE           |           | 
| END_TIMESTAMP   | BIGINT | FALSE |         | FALSE           |           | 
| NUM_STATEMENTS  | BIGINT | FALSE |         | FALSE           |           | 
| CHECKSUM        | BIGINT | FALSE |         | FALSE           |           | 
+-----------------+--------+-------+---------+-----------------+-----------+
7 rows in set (0 sec)

Let’s see what each of the views tells us about what is in the transaction log. Remember, we’ve executed a CREATE SCHEMA, a CREATE TABLE, and a single INSERT. Here is what the TRANSACTION_LOG view shows:

drizzle> SELECT * FROM TRANSACTION_LOG\G
*************************** 1. row ***************************
          FILE_NAME: transaction.log
        FILE_LENGTH: 444
    NUM_LOG_ENTRIES: 3
   NUM_TRANSACTIONS: 3
 MIN_TRANSACTION_ID: 1
 MAX_TRANSACTION_ID: 3
  MIN_END_TIMESTAMP: 1268856698672620
  MAX_END_TIMESTAMP: 1268856707093000
INDEX_SIZE_IN_BYTES: 73736

The column names should be self-explanatory. The FILE_LENGTH shows the size in bytes of the log (which matches the output we got from our ls -lha above). The INDEX_SIZE_IN_BYTES is the total amount of memory allocated for the transaction log index.

The TRANSACTION_LOG_ENTRIES view isn’t that interesting at first glance:

drizzle> SELECT * FROM TRANSACTION_LOG_ENTRIES;
+--------------+-------------+--------------+
| ENTRY_OFFSET | ENTRY_TYPE  | ENTRY_LENGTH |
+--------------+-------------+--------------+
|            0 | TRANSACTION |           89 | 
|           89 | TRANSACTION |          223 | 
|          312 | TRANSACTION |          132 | 
+--------------+-------------+--------------+

You might be tempted to ask what the heck the TRANSACTION_LOG_ENTRIES view is for. It is a bit of a bridge table that allows one to see the type of entry at each offset. Currently, the only types of entries in the transaction log are TRANSACTION entries — basically serialized Google Protobuffer Transaction messages — and BLOB entries, which are for storage of large blob data.

The TRANSACTION_LOG_TRANSACTIONS view shows all the transaction log entries which are of type TRANSACTION:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 | 
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 | 
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 | 
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
3 rows in set (0 sec)

As you can see, there is some basic information about each transaction entry in the log, including the offset in the transaction log, the start and end timestamp of the transaction, its transaction identifier, the number of statements involved in the transaction, and an optional checksum for the message (more on checksums below).

Viewing the Transaction Content

While the above view output may be nice, what we’d really like to be able to do is see precisely what changes a transaction effected. To see this, we can use the PRINT_TRANSACTION_MESSAGE(log_file, offset) UDF. Below, I’ve added two more rows to the lebowski.characters table within an explicit transaction. I then query the DATA_DICTIONARY views using the PRINT_TRANSACTION_MESSAGE() function to show the changes logged to the transaction log:

drizzle> use lebowski
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> START TRANSACTION;
Query OK, 0 rows affected (0 sec)

drizzle> INSERT INTO characters VALUES ('walter','bowling');
Query OK, 1 row affected (0 sec)

drizzle> INSERT INTO characters VALUES ('donny','bowling');
Query OK, 1 row affected (0 sec)

drizzle> COMMIT;
Query OK, 0 rows affected (0.09 sec)

We now see an additional Transaction Log entry and can see that this transaction contains the two individual INSERT statements just executed:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 | 
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 | 
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 | 
|          444 |              4 |         1 | 1268857926482600 | 1268857938514312 |              1 |        0 | 
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
...
drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET) as info
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 444\G
*************************** 1. row ***************************
info: transaction_context {
  server_id: 1
  transaction_id: 4
  start_timestamp: 1268857926482600
  end_timestamp: 1268857938514312
}
statement {
  type: INSERT
  start_timestamp: 1268857926482605
  end_timestamp: 1268857938514310
  insert_header {
    table_metadata {
      schema_name: "lebowski"
      table_name: "characters"
    }
    field_metadata {
      type: VARCHAR
      name: "name"
    }
    field_metadata {
      type: VARCHAR
      name: "hobby"
    }
  }
  insert_data {
    segment_id: 1
    end_segment: true
    record {
      insert_value: "walter"
      insert_value: "bowling"
    }
    record {
      insert_value: "donny"
      insert_value: "bowling"
    }
  }
}

1 row in set (0.01 sec)

You may notice that NUM_STATEMENTS is equal to 1 even though there were 2 INSERT statements issued. This is because the kernel packages both the INSERTs into a single message::Statement::InsertData package for more efficient storage. If there had been an INSERT and an UPDATE, NUM_STATEMENTS would be 2.

Enable Automatic Checksumming

One final feature I’ll highlight in this blog post is an option to automatically store a checksum of each transaction message when writing entries to the transaction log. To enable this feature, simply use the --transaction-log-enable-checksum command line option. You can view the checksums of entries in the TRANSACTION_LOG_TRANSACTIONS view, as demonstrated below:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable --transaction-log-enable-checksum &
[5] 32042
[4]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 16:47:07  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:07  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:08  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 16:47:08 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)
...
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 row affected (0.05 sec)

drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
ERROR 1046 (3D000): No database selected
drizzle> use lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.11 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.1 sec)

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT ENTRY_OFFSET, TRANSACTION_ID, CHECKSUM FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+------------+
| ENTRY_OFFSET | TRANSACTION_ID | CHECKSUM   |
+--------------+----------------+------------+
|            0 |              2 |  143866125 | 
|           89 |              8 | 1466831622 | 
|          312 |              9 |  460824986 | 
+--------------+----------------+------------+
3 rows in set (0 sec)

DDL is not Statement-based Replication

As a final note, I’d like to point out that even DDL in Drizzle is replicated as row-based transaction messages, and not as raw SQL statements like in MySQL. You can see, for instance, the message::Statement::CreateTableStatement inside the transaction message which contains all the metadata about the table you just created. :)

drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET)
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 89\G
*************************** 1. row ***************************
PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET): transaction_context {
  server_id: 1
  transaction_id: 2
  start_timestamp: 1268858897017396
  end_timestamp: 1268858897017447
}
statement {
  type: CREATE_TABLE
  start_timestamp: 1268858897017402
  end_timestamp: 1268858897017445
  create_table_statement {
    table {
      name: "characters"
      engine {
        name: "InnoDB"
      }
      field {
        name: "name"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 20
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      field {
        name: "hobby"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 10
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      indexes {
        name: "PRIMARY"
        is_primary: true
        is_unique: true
        type: UNKNOWN_INDEX
        key_length: 80
        index_part {
          fieldnr: 0
          compare_length: 80
          key_type: 0
        }
        options {
          binary_pack_key: true
          var_length_key: true
        }
      }
      type: STANDARD
      options {
        collation: "utf8_general_ci"
        collation_id: 45
      }
    }
  }
}

1 row in set (0 sec)

If you like or don’t like what you see, please do get in touch with me or fire off a wishlist to the Drizzle Discuss mailing list. We’d love to hear from ya!

[1] Actually, the transaction log module is a set of plugins.

O’Gara Cloud Computing Article Off Base

Maureen O’Gara, self-described as “the most read technology reporter for the past 20 years”, has written an article about Drizzle at Rackspace for one of Sys-con’s online zines called Cloud Computing Journal, of which she is an editor.

I tried commenting on Maureen’s article on their website, but the login system is apparently borked, at least for registered users who use OpenID; it still wants them to have a separate user ID and login. Note to sys-con.com: OpenID is designed so that users don’t have to remember yet another login for your website.

Leaving aside my limited patience for content-sparse websites that simply provide an online haven for dozens of Flash advertisements per web page, the article had some serious problems, not the least of which was its use of large chunks of my Happiness is a Warm Cloud article without citation. Very professional.

OK, to start with, let’s take this quote from the article:

Drizzle runs the risk of not being as stable as MySQL, because the Drizzle team is taking things out and putting other stuff in. Of course it may be successful in trying to create a product that’s more stable than MySQL. But creating a stable DBMS engine is something that has always taken years and years.

This is just about the most naïve explanation for whether a product will or will not be stable that I’ve ever read. If Maureen had bothered to email or call any one of the core Drizzle developers, they’d have been happy to tell her what is and is not stable about Drizzle, and why. Drizzle has not changed the underlying storage engines, so the InnoDB storage engine in Drizzle is the same plugin as available in MySQL (version 1.0.6).

The pieces of MySQL which were removed from Drizzle happen to be the parts of MySQL which have had the most stability issues — namely the additional features added to MySQL 5.0: stored procedures, views, triggers, stored functions, the INFORMATION_SCHEMA implementation, and server-side cursors and prepared statements. In addition to these removed features of MySQL, Drizzle also has no built-in Query Cache, does not support anything other than UTF-8 character sets, and has removed the MySQL replication system and binary logging — moving a rewrite of these pieces out into the plugin ecosystem.

The pieces that were added to Drizzle have mostly been added as plugins that provide new functionality. Maureen, the reason this was done was precisely to improve the stability of the kernel: new features and functionality are segregated into the plugin ecosystem, where they can be properly versioned and quarantined. It’s pretty much the biggest principle of Drizzle’s design…

The core developers of Drizzle (and much of the Drizzle community) would also have been happy to tell Maureen how the Drizzle team defines “stability”: when the community says Drizzle is stable — simple as that.

OK, so the next thing I took exception to is the following line:

Half of Rackspace’s customers are on MySQL so there’ll be some donkey-style nosing to get them to migrate.

I think my Rackspace colleagues might have quite a bit to say about the above. I haven’t seen any Rackers talking about mass migration from MySQL to Drizzle. As far as I have seen, the plan is to provide Drizzle as an additional service to Rackspace customers.

Rackspace evidently wants its new boys, who were not the core pillars of the MySQL engineering team, to hitch MySQL, er, Drizzle to Cassandra

MySQL != Drizzle. Implying that the two are equal does a disservice to both, as they have very different target markets and developer audiences.

The smart money is betting that even if a good number of high-volume web sites go down this route, an even higher number such as Facebook and Google will continue with relational databases, primarily MySQL.

Again, probably best to do your homework on this one, too. Facebook runs an amalgamation of a custom MySQL version and storage engines, distributed key-value stores, and Memcached servers. I would think that Facebook moving to Drizzle would be one tough migration. Thousands (tens of thousands?) of MySQL servers all running custom software and integrated into their caching layers is a huge barrier to entry, and not one I would expect a large site like Facebook to casually undertake. But, the same could be said about a move to SQL Server or Oracle, for that matter, and has little to do with Drizzle.

Google is moving away from using MySQL entirely. Mark Callaghan, previously at Google, has moved over to Facebook (possibly because of this trend at Google to get rid of MySQL), and Anthony Curtis, formerly of MySQL and then Google, left Google partly for this reason.

OK, so the next quote got me really fired up because it demonstrates a complete lack of understanding (maybe not Maureen’s, but at least that of the unnamed source it comes from):

Somebody – sorry we forget who exactly – claimed that as GPL 2 code Drizzle “severely limits revenue opportunities. For Rackspace, the opportunity to have some key Drizzle developers on its payrolls basically comes down to a promotional benefit, trying to position Rackspace as particularly Drizzle-savvy in the eyes of the community and currying favor for its seemingly generous contributions. What’s unclear is whether they may develop some Drizzle-related functionality that they will then not release as open source and just rent out to Rackspace hosting customers…that would be a way for them to differentiate themselves from competitors and GPLv2 would in principle allow this.”

A few points to make about the above quote.

First, name your source. I find it difficult to believe that the most-read technology writer would not write down a source. Is it the same person you deliberately left out of a quote from my Happiness article? (why did you do that, btw?).

Second, the MySQL server source code is licensed under the GPL 2, and so is Drizzle’s kernel, because it is a derivative work of the MySQL server.

Let me be clear: Developers who contribute code to Drizzle do so under the GPLv2 if that contribution is in the Drizzle kernel. If the code contribution is a plugin, the contributor is free to pick whatever license they choose.

Third, licensing has little if anything to do with revenue at all. The license is beside the point. There are two things which dictate how a company can derive revenue from software:

  1. Copyright ownership
  2. Principles of the Company

Neither Drizzle, Rackspace, nor any company a Drizzle contributor works for owns the copyright to the MySQL source code, from which Drizzle’s kernel is derived. Oracle does. Therefore, those companies do not have any right to re-sell Drizzle (under any license) without explicit permission from Oracle. Period. It has nothing to do with the GPLv2.

That said, contributors do have the right to make money on plugins built for the Drizzle server, and Rackspace, while it has not expressed any interest to yours truly in doing so, has the same right as any other Drizzle contributor to make money on plugins its contributors create for Drizzle.

It is my understanding (after actually having talked to Rackspace managers and decision makers) that Rackspace is not interested in getting into the business of selling commercial Drizzle plugins. Their core direction is to create value for their customers, and I fail to see how getting into the commercial software sales business would serve that goal.

Next time, please feel free to contact me or any other Drizzle contributor to get the low-down on Drizzle-related stuff. We’ll be nice. I promise.

Recent Work on Improving Drizzle’s Storage Engine API

Over the past six weeks or so, I have been working on cleaning up the pluggable storage engine API in Drizzle.  I’d like to describe some of this work and talk a bit about the next steps I’m taking in the coming months as we roll towards implementing Log Shipping in Drizzle.

First, how did it come about that I started working on the storage engine API?

From Commands to Transactions

Well, it really goes back to my work on Drizzle’s replication system.  I had implemented a simple, fast, and extensible log which stored records of the data changes made to a server.  Originally, the log was called the Command Log, because the Google Protobuffer messages it contained were called message::Commands.  The API  for implementing replication plugins was very simple and within a month or so of debuting the API, quite a few replication plugins had been built, including one replicating to Memcached, a prototype one replicating to Gearman, and a filtering replicator plugin.

In addition, Marcus Eriksson had created the RabbitReplication project, which could replicate from Drizzle to other data stores, including Cassandra and Project Voldemort.  However, Marcus did not actually implement any C/C++ plugins using the Drizzle replication API.  Instead, RabbitReplication simply read the new Command Log, which, since it is just a file full of Google Protobuffer messages, was quick and easy to read into memory using a variety of different programming languages.  RabbitReplication is written in Java, and it was great to see other programming languages able to read Drizzle’s replication log so easily.  Marcus later coded up a C++ TransactionApplier plugin which replaces the Drizzle replication log and instead replicates the GPB messages directly to RabbitMQ.

And there, you’ll note that one of the plugins involved in Drizzle’s replication system is called TransactionApplier.  It used to be called CommandApplier. That was because the GPB Command messages were, for the most part, individual row change events.  However, I made a series of changes to the replication API, and the GPB messages sent through the API are now of class message::Transaction.  message::Transaction objects contain a transaction context, with information about the transaction’s start and end time and its transaction identifier, along with a series of message::Statement objects, each of which represents a part of the data changes that the SQL transaction made.
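Whatever hook a plugin uses to get at it, what the plugin ends up holding is a message::Transaction, and because it is a plain Google Protobuffer message, walking it from C++ (or any language with GPB bindings) is trivial. Here is a minimal sketch; the helper function and the include path are mine, and only the generated protobuf accessors, which follow directly from the message definitions shown above, are real:

#include <cstdio>
#include <drizzled/message/transaction.pb.h>  /* include path assumed */

/* Hypothetical helper showing how an applier or log reader can walk a
 * Transaction message using the generated protobuf accessors. */
static void inspect_transaction(const drizzled::message::Transaction &txn)
{
  printf("server %u, transaction %llu\n",
         (unsigned) txn.transaction_context().server_id(),
         (unsigned long long) txn.transaction_context().transaction_id());

  for (int i= 0; i < txn.statement_size(); ++i)
  {
    const drizzled::message::Statement &stmt= txn.statement(i);
    /* stmt.type() says whether this piece is an INSERT, UPDATE,
     * DELETE, CREATE TABLE, and so on. */
    printf("  statement %d: type %d\n", i, (int) stmt.type());
  }
}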

Thus, the Command Log turned into the Transaction Log, and everywhere the term Command had been used was replaced with the terms Transaction and Statement (depending on whether you were talking about the entire transaction or a piece of it). Log entries are now written to the Transaction Log at COMMIT and are not written if no COMMIT occurred [1].

After finishing this work to make the transaction log write Transaction messages at commit time, I was keen to begin coding up the publisher and subscriber plugins which represent a node in the replication environment. However, Brian had asked me to delay working on other replication features and ensure that the replication API could support fully distributed transactions via the X/Open XA distributed transaction protocol. XA support had been removed from Drizzle when the MySQL binlog and original replication system were ripped out, and it needed some TLC. Fair enough, I said. So, off I went to work on XA.

If Only It Were Simple…

As anyone who has worked on the MySQL source code or developed storage engines for MySQL knows, working with the MySQL pluggable storage engine API is sometimes not the easiest or most straightforward thing. I think the biggest problem with the MySQL storage engine API is that, due to understandable historical reasons, it’s an API that was designed with the MyISAM and HEAP storage engines in mind. Many of the transactional pieces of the API seem to be a bolted-on afterthought and can be very confusing to work with.

As an example, Paul McCullagh, developer of the transactional storage engine PBXT, recently emailed the mysql internals mailing list asking how the storage engine could tell when a SQL statement started and ended. You would think that such a seemingly basic functionality would have a simple answer. You’d be wrong. Monty Widenius answered like this:

Why not simply have a counter in your transaction object for how start_stmt – reset(); When this is 0 then you know stmnt ended.

In Maria we count number of calls to external_lock() and when the sum goes to 0 we know the transaction has ended.

To this, Mark Callaghan responded:

Why does the solution need to be so obscure?

Monty answered (emphasis mine):

Historic reasons.

MySQL never kept a count of which handlers are used by a transaction, only which tables.

So the original logic was that external_lock(lock/unlock) is called for each usage of the table, which is normally more than enough information for a handler to know when a statement starts/ends.

The one case this didn’t work was in the case someone does lock tables as then external_lock is not called per statement. It was to satisfy this case that we added a call to start_stmt() for each table.

It’s of course possible to change things so that start_stmt() / end_stmt() would be called once per used handler, but this would be yet another overhead for the upper level to do which the current handlers that tracks call to external_lock() doesn’t need.

Well, in Drizzle-land, we aren’t beholden to “historic reasons” :) So, after looking through the in-need-of-attention transaction processing code in the kernel, I decided that I would clean up the API so that storage engines did not have to jump through hoops to notify the kernel they participate in a transaction or just to figure out when a statement and a transaction started and ended.

The resulting changes to the API are quite dramatic I think, but I’ll leave it to the storage engine developers to tell me if the changes are good or not. The following is a summary of the changes to the storage engine API that I committed in the last few weeks.

plugin::StorageEngine Split Into Subclasses

The very first thing I did was to split the enormous base plugin class for a storage engine, plugin::StorageEngine, pulling its transactional elements out into two subclasses. plugin::TransactionalStorageEngine is now the base class for all storage engines which implement SQL transactions:

/**
 * A type of storage engine which supports SQL transactions.
 *
 * This class adds the SQL transactional API to the regular
 * storage engine.  In other words, it adds support for the
 * following SQL statements:
 *
 * START TRANSACTION;
 * COMMIT;
 * ROLLBACK;
 * ROLLBACK TO SAVEPOINT;
 * SET SAVEPOINT;
 * RELEASE SAVEPOINT;
 */
class TransactionalStorageEngine :public StorageEngine
{
public:
  TransactionalStorageEngine(const std::string name_arg,
                             const std::bitset<HTON_BIT_SIZE> &flags_arg= HTON_NO_FLAGS);
 
  virtual ~TransactionalStorageEngine();
...
private:
  void setTransactionReadWrite(Session& session);
 
  /*
   * Indicates to a storage engine the start of a
   * new SQL transaction.  This is called ONLY in the following
   * scenarios:
   *
   * 1) An explicit BEGIN WORK/START TRANSACTION is called
   * 2) After an explicit COMMIT AND CHAIN is called
   * 3) After an explicit ROLLBACK AND RELEASE is called
   * 4) When in AUTOCOMMIT mode and directly before a new
   *    SQL statement is started.
   */
  virtual int doStartTransaction(Session *session, start_transaction_option_t options)
  {
    (void) session;
    (void) options;
    return 0;
  }
 
  /**
   * Implementing classes should override these to provide savepoint
   * functionality.
   */
  virtual int doSetSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doRollbackToSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doReleaseSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
 
  /**
   * Commits either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doCommit(Session *session, bool normal_transaction)= 0;
 
  /**
   * Rolls back either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doRollback(Session *session, bool normal_transaction)= 0;
  virtual int doReleaseTemporaryLatches(Session *session)
  {
    (void) session;
    return 0;
  }
  virtual int doStartConsistentSnapshot(Session *session)
  {
    (void) session;
    return 0;
  }
};

As you can see, plugin::TransactionalStorageEngine inherits from plugin::StorageEngine and extends it with a series of private pure virtual methods that implement the SQL transaction parts of a query — doCommit(), doRollback(), etc. Implementing classes simply inherit from plugin::TransactionalStorageEngine and implement their internal transaction processing in these private methods.
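To make that concrete, here is a rough sketch of what the transactional hooks of a toy engine might look like. The class name, the include path, and the empty bodies are all mine; a real engine would also implement the cursor and table-creation parts of plugin::StorageEngine, which I am omitting, so this is illustrative rather than a complete, loadable engine.

#include <drizzled/plugin/transactional_storage_engine.h>  /* path assumed */

using namespace drizzled;

/* Hypothetical engine used only to illustrate the shape of the API. */
class ToyTransactionalEngine : public plugin::TransactionalStorageEngine
{
public:
  ToyTransactionalEngine(const std::string &name_arg)
    : plugin::TransactionalStorageEngine(name_arg)
  {}

private:
  /* The kernel calls this exactly once at the start of each new SQL
   * transaction (explicit BEGIN, COMMIT/ROLLBACK AND CHAIN, or the
   * implicit transaction started in AUTOCOMMIT mode). */
  int doStartTransaction(Session *, start_transaction_option_t)
  {
    /* begin an internal engine transaction here */
    return 0;
  }

  /* Savepoint plumbing. */
  int doSetSavepoint(Session *, NamedSavepoint &)        { return 0; }
  int doRollbackToSavepoint(Session *, NamedSavepoint &) { return 0; }
  int doReleaseSavepoint(Session *, NamedSavepoint &)    { return 0; }

  /* normal_transaction == true means a real commit that must be made
   * durable; false means only the end of a statement that is part of
   * a larger transaction. */
  int doCommit(Session *, bool normal_transaction)
  {
    if (normal_transaction)
    {
      /* flush and make the transaction's changes permanent */
    }
    return 0;
  }

  int doRollback(Session *, bool normal_transaction)
  {
    (void) normal_transaction;
    /* undo either the statement's changes or the whole transaction */
    return 0;
  }
};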

In addition to SQL transactions, however, there is the concept of an XA transaction, which is used for distributed transaction coordination. The XA protocol is a two-phase commit protocol because it implements a PREPARE step before a COMMIT occurs. This XA API is exposed via two other classes, plugin::XaResourceManager and plugin::XaStorageEngine. plugin::XaResourceManager derived classes implement the resource manager API of the XA protocol. plugin::XaStorageEngine is a storage engine subclass which implements XA transactions in addition to SQL transactions.

Here is the plugin::XaResourceManager class:

/**
 * An abstract interface class which exposes the participation
 * of implementing classes in distributed transactions in the XA protocol.
 */
class XaResourceManager
{
public:
  XaResourceManager() {}
  virtual ~XaResourceManager() {}
...
private:
  /**
   * Does the COMMIT stage of the two-phase commit.
   */
  virtual int doXaCommit(Session *session, bool normal_transaction)= 0;
  /**
   * Does the ROLLBACK stage of the two-phase commit.
   */
  virtual int doXaRollback(Session *session, bool normal_transaction)= 0;
  /**
   * Does the PREPARE stage of the two-phase commit.
   */
  virtual int doXaPrepare(Session *session, bool normal_transaction)= 0;
  /**
   * Rolls back a transaction identified by a XID.
   */
  virtual int doXaRollbackXid(XID *xid)= 0;
  /**
   * Commits a transaction identified by a XID.
   */
  virtual int doXaCommitXid(XID *xid)= 0;
  /**
   * Notifies the transaction manager of any transactions
   * which had been marked prepared but not committed at
   * crash time or that have been heuristically completed
   * by the storage engine.
   *
   * @param[out] Reference to a vector of XIDs to add to
   *
   * @retval
   *  Returns the number of transactions left to recover
   *  for this engine.
   */
  virtual int doXaRecover(XID * append_to, size_t len)= 0;
};

and here is the plugin::XaStorageEngine class:

/**
 * A type of storage engine which supports distributed
 * transactions in the XA protocol.
 */
class XaStorageEngine :public TransactionalStorageEngine,
                       public XaResourceManager
{
public:
  XaStorageEngine(const std::string name_arg,
                  const std::bitset<HTON_BIT_SIZE> &flags_arg= HTON_NO_FLAGS);
 
  virtual ~XaStorageEngine();
  ...
};

Pretty clear. A plugin::XaStorageEngine inherits from both plugin::TransactionalStorageEngine and plugin::XaResourceManager because it implements both SQL transactions and XA transactions. The InnobaseEngine is a plugin which inherits from plugin::XaStorageEngine because InnoDB supports SQL transactions as well as XA.

Explicit Statement and Transaction Boundaries

The second major change I made addressed the problem that Mark Callaghan noted in asking why finding out when a statement starts and ends was so obscure. I added two new methods to plugin::StorageEngine called doStartStatement() and doEndStatement(). The kernel now explicitly tells storage engines when a SQL statement starts and ends. This happens before any calls to Cursor::external_lock() happen, and there are no exception cases. In addition, the kernel now always tells transactional storage engines when a new SQL transaction is starting. It does this via an explicit call to plugin::TransactionalStorageEngine::doStartTransaction(). No exceptions, and yes, even for DDL operations.

What this means is that a transactional storage engine no longer needs to “count the calls to Cursor::external_lock()” in order to know when a statement or transaction starts and ends. For a SQL transaction, there is now a clear call path, and there is no need for the storage engine to track whether the session is in AUTOCOMMIT mode or not; the kernel does all that work for the storage engine. Imagine a Session that executes a single INSERT statement against an InnoDB table while in AUTOCOMMIT mode. This is what the call path looks like:

 drizzled::Statement::Insert::execute()
 |
 -> drizzled::mysql_lock_tables()
    |
    -> drizzled::TransactionServices::registerResourceForTransaction()
       |
       -> drizzled::plugin::TransactionalStorageEngine::startTransaction()
          |
          -> InnobaseEngine::doStartTransaction()
       |
       -> drizzled::plugin::StorageEngine::startStatement()
          |
          -> InnobaseEngine::doStartStatement()
       |
       -> drizzled::plugin::StorageEngine::getCursor()
          |
          -> drizzled::Cursor::write_row()
             |
             -> InnobaseCursor::write_row()
       |
       -> drizzled::TransactionServices::autocommitOrRollback()
          |
           -> drizzled::plugin::TransactionalStorageEngine::commit()
             |
             -> InnobaseEngine::doCommit()

I think this will come as a welcome change to storage engine developers working with Drizzle.
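To make the contrast with the old counting approach concrete, here is a hedged sketch of the statement-boundary hooks an engine (say, the hypothetical ToyTransactionalEngine sketched earlier) could implement. I am assuming doStartStatement() and doEndStatement() take the Session and return void, mirroring the other do*() hooks; check the plugin::StorageEngine header for the exact signatures.

/* Sketch only: assumes the hooks take the Session and return void, and
 * that the overrides are declared in the engine class. */
void ToyTransactionalEngine::doStartStatement(Session *session)
{
  (void) session;
  /* Per-statement setup, e.g. take an implicit statement-level
   * savepoint so a failed statement can be rolled back on its own.
   * No call counting and no AUTOCOMMIT checks are needed: the kernel
   * has already called doStartTransaction() if this statement begins
   * a new transaction. */
}

void ToyTransactionalEngine::doEndStatement(Session *session)
{
  (void) session;
  /* The statement is over.  Release the statement-level savepoint;
   * the kernel will drive doCommit()/doRollback() for the enclosing
   * transaction when the time comes. */
}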

No More Need for Engine to Call trans_register_ha()

There was an interesting comment in the original documentation for the transaction processing code. It read:

Roles and responsibilities
————————–

The server has no way to know that an engine participates in
the statement and a transaction has been started
in it unless the engine says so. Thus, in order to be
a part of a transaction, the engine must “register” itself.
This is done by invoking trans_register_ha() server call.
Normally the engine registers itself whenever handler::external_lock()
is called. trans_register_ha() can be invoked many times: if
an engine is already registered, the call does nothing.
In case autocommit is not set, the engine must register itself
twice — both in the statement list and in the normal transaction
list.

That comment, which I’ve read dozens of times, has always seemed strange to me. I mean, does the server really have no way of knowing that an engine participates in a statement or transaction unless the engine tells it? Of course it does.

So, I removed the need for a storage engine to “register itself” with the kernel. Now, the transaction manager inside the Drizzle kernel (implemented in the TransactionServices component) automatically monitors which engines are participating in an SQL transaction and the engine doesn’t need to do anything to register itself.

In addition, due to the break-up of the plugin::StorageEngine class and the XA API into plugin::XaResourceManager, Drizzle’s transaction manager can now coordinate XA transactions from plugins other than storage engines. Yep, that’s right. Any plugin which implements plugin::XaResourceManager can participate in an XA transaction and Drizzle will act as the transaction manager. What’s the first plugin that will do this? Drizzle’s transaction log. The transaction log isn’t a storage engine, but it is able to participate in an XA transaction, so it will implement plugin::XaResourceManager but not plugin::StorageEngine.
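As a rough illustration of what such a plugin might look like, here is a sketch that implements only the pure virtual methods of plugin::XaResourceManager quoted above. The class name and the empty bodies are hypothetical; the real transaction log will of course do actual durable work in its doXaPrepare() and doXaCommit().

/* Sketch of an XA participant that is not a storage engine; only the
 * plugin::XaResourceManager interface shown above is implemented. */
using namespace drizzled;

class ToyXaLog : public plugin::XaResourceManager
{
private:
  /* Phase one of two-phase commit: guarantee the entry can be made
   * durable, or return non-zero to veto the distributed commit. */
  int doXaPrepare(Session *, bool)  { return 0; }

  /* Phase two: make the prepared entry permanent ... */
  int doXaCommit(Session *, bool)   { return 0; }

  /* ... or discard it. */
  int doXaRollback(Session *, bool) { return 0; }

  /* Complete a prepared transaction identified by XID, typically
   * during crash recovery. */
  int doXaCommitXid(XID *)   { return 0; }
  int doXaRollbackXid(XID *) { return 0; }

  /* Hand back any transactions still prepared-but-not-committed at
   * crash time; returning 0 means nothing is left to recover. */
  int doXaRecover(XID *, size_t) { return 0; }
};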

Performance Impact of Code Changes

So, that “yet another overhead” Monty talked about in the quote above? There was no noticeable impact on performance or scalability at all. So much for optimize-first coding.

What’s Next?

The next thing I’m working on is removing the notion of the “statement transaction”, which is also a historical by-product, this time because of BerkeleyDB. Gee, I’ve got a lot of work ahead of me…

[1] Actually, there is a way that a transaction that was rolled back can get written to the transaction log. For bulk operations, the server can cut a Transaction message into multiple segments, and if the SQL transaction is rolled back, a special RollbackStatement message is written to the transaction log.

A Follow Up on the SQL Puzzle

Or…What the Heck is Wrong with CREATE TABLE IF NOT EXISTS ... SELECT?

So, earlier this week, I blogged about an SQL puzzle that had come up in my current work on Drizzle’s new transaction log.

I posed the question to readers what the “correct” result of the following would be:

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 VALUES (1,1),(1,2);
CREATE TEMPORARY TABLE t2 (a INT, b INT, PRIMARY KEY (a));
BEGIN;
INSERT INTO t2 VALUES (100,100);
CREATE TEMPORARY TABLE IF NOT EXISTS t2 (PRIMARY KEY (a)) SELECT * FROM t1;
 
# The above statement will correctly produce an ERROR 23000: Duplicate entry '1' FOR KEY 'PRIMARY'
# What should the below RESULT be?
 
SELECT * FROM t2;
COMMIT;

A number of readers responded, and, to be fair, most everyone was “correct” in their own way. Why? Well, because the way that MySQL deals with calls to CREATE TABLE ... SELECT, CREATE TABLE IF NOT EXISTS ... SELECT and their temporary-table counterparts is completely stupid, as I learned this week. Rob Wultsch essentially sums up my feelings about the behaviour of DDL statements with regard to transactions in a session:

Implicit commit is evil and stupid. Ideally we the server should error and roll back, imho.

The Officially Correct Answer (at least in MySQL)

OK, so here’s the “official” correct answer:

CREATE TABLE IF NOT EXISTS ... SELECT does not simply skip the statement when the table in question already exists. Instead, if the table does exist, CREATE TABLE IF NOT EXISTS ... SELECT behaves like an INSERT INTO ... SELECT statement. Yep, you heard right. So, instead of just issuing a warning when it notices that the table exists, MySQL attempts to insert the rows from the SELECT query into the existing table.

Here is the official MySQL explanation:

For CREATE TABLE … SELECT, if IF NOT EXISTS is given and the table already exists, MySQL handles the statement as follows:
* The table definition given in the CREATE TABLE part is ignored. No error occurs, even if the definition does not match that of the existing table.
* If there is a mismatch between the number of columns in the table and the number of columns produced by the SELECT part, the selected values are assigned to the rightmost columns. For example, if the table contains n columns and the SELECT produces m columns, where m < n, the selected values are assigned to the m rightmost columns in the table. Each of the initial n – m columns is assigned its default value, either that specified explicitly in the column definition or the implicit column data type default if the definition contains no default. If the SELECT part produces too many columns (m > n), an error occurs.
* If strict SQL mode is enabled and any of these initial columns do not have an explicit default value, the statement fails with an error.

So, given the above manual explanation, the correct answer to the original blog post is:

   a |   b
 100 | 100

partly because there is an implicit COMMIT directly before the CREATE TABLE is executed (committing the 100,100 record to the table) and because the primary key violation kills off the inserts of the (1,1) and (1,2) rows in InnoDB. For a MyISAM table, the (1,1) record would remain in the table, since MyISAM has no idea what a ROLLBACK is.

I Think Drizzle Should Follow PostgreSQL’s Example Here

Regarding implicit commits before DDL operations, I believe they should all go bye-bye. DDL should be transactional in Drizzle, and if a statement cannot be executed within a transaction, it should throw an error when there is an active transaction. Period.

As for the behaviour of CREATE TABLE ... SELECT acting like an INSERT INTO ... SELECT, that entire code path should be ripped out.

PostgreSQL’s DDL operations, IMHO, are sane. Sane is good. PostgreSQL allows quite a bit of flexibility by implementing the SQL standard’s CREATE TABLE and CREATE TABLE AS statements. I believe Drizzle should scrap all the DDL table-creation code and instead implement PostgreSQL’s much-nicer DDL methods. There, I said it. Slashdot MySQL haters, there ya go.

An SQL Puzzle?

Dear Lazy Web,

What should the result of the SELECT below be? Assume InnoDB as the storage engine for all tables.

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 VALUES (1,1),(1,2);
CREATE TEMPORARY TABLE t2 (a INT, b INT, PRIMARY KEY (a));
BEGIN;
INSERT INTO t2 VALUES (100,100);
CREATE TEMPORARY TABLE IF NOT EXISTS t2 (PRIMARY KEY (a)) SELECT * FROM t1;
 
# The above statement will correctly produce an ERROR 23000: Duplicate entry '1' FOR KEY 'PRIMARY'
# What should the below RESULT be?
 
SELECT * FROM t2;
COMMIT;