Developing Nova on Linux – Getting Started

In the past few weeks, I’ve gotten involved in the newly-debuted OpenStack project. Right now, my focus is on the Compute sub-project of the stack, called Nova. The initial pieces I am focusing on are the unit tests and end-to-end systems testing of the compute stack.

I struggled over the last couple days to solve a bug that turned out to be not a bug at all, but an issue with the Python development environment I use. I figured I’d write a blog article for those Python developers who are looking to contribute to the Nova project and may also be struggling to get up and going.

If you’re contributing to an open source project like Nova, you’ll want to be able to work on multiple branches of the source code at the same time — for instance, if you’re working on fixing a few bugs simultaneously.

There are quite a few dependencies for Nova, and, because of the way Python searches for packages, it’s imperative that you use a tool such as virtualenv to isolate your multiple branches into their own development environments. Otherwise, as I learned today, the location of your site-packages and what has previously been installed on your development machine can wreak havoc on you. :)

NOTE: For this article, I assume the reader is on Debian/Ubuntu Linux, since that is what I use as my development machine. If you’re on a different flavour of Linux, feel free to adapt the instructions here to suit your particular package manager.

Installing the Tools for Installing the Tools

Before we get into our virtual development environments, you’ll first want to ensure you’ve got a few packages installed, including bzr, libssl-dev, swig and virtualenv. The following should do the trick:

sudo apt-get install -y swig libssl-dev bzr python-virtualenv

A Setup for Source Control and Virtual Environments

In order to get properly set up to contribute to the Nova project, you’ll want to set up a local repository to keep branches of source code that you work on. Although bzr is not required as your revision control system, I use bzr myself and will use it in this article. Adapt as needed if you use git-bzr or similar.

I like to have the following directory structure for working on Python projects:

~/repos/$projectname/ <-- shared repository for branches of your project
~/repos/$projectname/trunk <-- local trunk branch
~/repos/$projectname/$branch <-- a branch you work in
~/virtenvs/$projectname/ <-- Development environments for your project
~/virtenvs/$projectname/$branch <-- development environment for a branch you work in
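
As a minimal sketch of the layout above, the following Python helper computes the repository and virtual environment directories for a given project and branch. The function name `branch_paths` and its `home` parameter are hypothetical and exist only for illustration; nothing in Nova or bzr provides them.

```python
from pathlib import Path


def branch_paths(project, branch, home=None):
    """Return (repo_dir, venv_dir) for a branch, following the layout above.

    `home` defaults to the current user's home directory; it is a
    parameter only so the function is easy to try out and test.
    """
    home = Path(home) if home is not None else Path.home()
    repo_dir = home / "repos" / project / branch
    venv_dir = home / "virtenvs" / project / branch
    return repo_dir, venv_dir


if __name__ == "__main__":
    repo, venv = branch_paths("nova", "bugXXXXX")
    print(repo)   # e.g. ~/repos/nova/bugXXXXX, expanded for your user
    print(venv)   # e.g. ~/virtenvs/nova/bugXXXXX, expanded for your user
```

Keeping the branch name identical in both trees makes it easy to tell which environment belongs to which branch.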

Assuming you want to contribute to the Nova project and you want to work on fixing a bug #XXXXX, then the following would get you started:

bzr init-repo ~/repos/nova
cd ~/repos/nova
bzr branch lp:nova trunk
bzr branch trunk bugXXXXX
mkdir -p ~/virtenvs/nova

At this point, we'll go ahead and create a virtual development environment for bugXXXXX:

cd ~/virtenvs/nova
virtualenv --no-site-packages bugXXXXX
cd bugXXXXX
source bin/activate

At this point, you'll notice your prompt change, indicating that you are now in a virtual development environment. The --no-site-packages ensures that your locally-installed Python packages aren't included in your Python PATH when inside your virtual environment.
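
If you want to double-check the isolation from inside Python, a small sketch like the one below can help. It assumes that classic virtualenv releases set `sys.real_prefix`, while the standard library's venv module (and newer virtualenv releases) make `sys.base_prefix` differ from `sys.prefix`; the helper names are hypothetical.

```python
import sys


def in_virtualenv():
    """Best-effort check for whether this interpreter runs in a virtual environment."""
    # Older virtualenv releases record the original prefix in sys.real_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases keep the original prefix in sys.base_prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


def site_package_dirs():
    """List the package directories currently on the module search path."""
    return [
        p for p in sys.path
        if isinstance(p, str) and p.endswith(("site-packages", "dist-packages"))
    ]


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    for directory in site_package_dirs():
        print(directory)
```

Inside an isolated environment, the directories printed should all live under the environment itself rather than under the system-wide Python installation.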

Next step is to install into this virtual development environment all the packages and dependencies we'll need. This should do the trick:

easy_install twisted tornado boto M2Crypto IPy carrot mox redis
easy_install http://python-gflags.googlecode.com/files/python_gflags-1.3-py2.5.egg

Alright, next we simply link to our bzr branch location from inside the virtual environment and run the Nova test suite:

ln -s ~/repos/nova/bugXXXXX bugXXXXX
cd bugXXXXX
python run_tests.py

If all went smoothly, you'll see all test cases passing :)

Having issues getting up and running? Find us on Freenode IRC #openstack.

See ya,

Jay

MySQL Stored Procedures Ain’t All That

I give quite a lot of presentations. A whole lot less than I used to, but still quite a few per year. Most of the time, the presentations are on performance tuning MySQL.

Almost every time I give a presentation on MySQL performance tuning — and this happens 100% of the time if I am presenting to a Windows SQL Server crowd — I get the following question:

Why don’t you cover using stored procedures in order to increase performance? Wouldn’t that be the easiest way to get better performance since the stored procedures will only be parsed once and then the compiled bytecode would be efficiently executed from then on?

Every person that asks this question assumes something about MySQL’s stored procedure implementation; they incorrectly believe that stored procedures are compiled and stored in a global stored procedure cache, similar to the stored procedure cache in Microsoft SQL Server[1] or Oracle[2].

This is wrong. Flat-out incorrect.

Here is the truth: Every single connection to the MySQL server maintains its own stored procedure cache.

This means two very important things that users of stored procedures should understand:

  • If you operate in a shared-nothing environment (for example, the majority of PHP and Python applications that do not use connection pooling or persistent connections) and your application uses stored procedures, then each connection compiles the stored procedure, stores it in a cache, and destroys that cache every single time you connect to the database server and issue a CALL statement
  • If you use stored procedures, the memory usage of every single connection that uses those stored procedures is going to increase, and will increase substantially if you use many stored procedures
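
To make the first point concrete, here is a toy Python model of a per-connection procedure cache. It is not MySQL code and does not talk to a server: the `Connection` class, the procedure name `get_user` and the counting functions are all hypothetical, and "compiling" is just a counter increment.

```python
class Connection:
    """Toy stand-in for one database connection with its own procedure cache."""

    def __init__(self):
        self.cache = {}      # procedure name -> "compiled" form
        self.compiles = 0    # how many times this connection had to compile

    def call(self, name):
        # A cache miss forces a compile; the result lives only as long
        # as this connection object does.
        if name not in self.cache:
            self.compiles += 1
            self.cache[name] = "compiled:" + name
        return self.cache[name]


def shared_nothing(requests, name="get_user"):
    """One fresh connection per request, as in a typical shared-nothing script."""
    total = 0
    for _ in range(requests):
        conn = Connection()
        conn.call(name)
        total += conn.compiles
    return total


def persistent(requests, name="get_user"):
    """One long-lived connection reused for every request."""
    conn = Connection()
    for _ in range(requests):
        conn.call(name)
    return conn.compiles


if __name__ == "__main__":
    print(shared_nothing(1000))  # 1000 compiles, one per connection
    print(persistent(1000))      # 1 compile, reused afterwards
```

In this model the shared-nothing pattern pays the compile cost on every request, which is the situation the first bullet describes.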

Ooops, I Invalidated Everything Again

So, what happens when you CREATE, ALTER, or DROP any stored procedures? Since MySQL stores all stored procedure execution code on the connection threads, each of those connection threads must invalidate the procedure in its caches that has changed, right?

No, it’s worse. Every time ANY stored procedure is added, dropped, or updated, ALL stored procedures on ALL connection threads will be invalidated and must be re-compiled. Here is how the “caches” are invalidated:

from /sql/sp_cache.cc, lines 193-197, in MySQL 5.5

/*
  Invalidate all routines in all caches.
 
  SYNOPSIS
    sp_cache_invalidate()
 
  NOTE
    This is called when a VIEW definition is created or modified (and in some
    other contexts). We can't destroy sp_head objects here as one may modify
    VIEW definitions from prelocking-free SPs.
*/
void sp_cache_invalidate()
{
  DBUG_PRINT("info",("sp_cache: invalidating"));
  thread_safe_increment(Cversion, &Cversion_lock);
}

It’s a bit misleading, since it actually doesn’t invalidate anything at all. What the above code does is increment the global “Cversion” variable. When a connection thread attempts to execute, drop or insert a new procedure, it will notice that its local cache’s version number is less than this Cversion number and will destroy the entire cache and rebuild it gradually as procedures are affected or executed.
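
The version-counter scheme described above can be sketched as a toy Python model. This is an illustration of the idea only, not a translation of sp_cache.cc: the class and method names (`GlobalVersion`, `ConnectionCache`, `flushes`) are hypothetical.

```python
import threading


class GlobalVersion:
    """Toy analogue of a global, lock-protected version counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # "Invalidating" only bumps the counter; no cache is touched here.
        with self._lock:
            self._value += 1

    def get(self):
        with self._lock:
            return self._value


class ConnectionCache:
    """Toy per-connection cache that lazily flushes itself when stale."""

    def __init__(self, global_version):
        self._global = global_version
        self._version = global_version.get()
        self._routines = {}
        self.flushes = 0

    def _flush_if_stale(self):
        current = self._global.get()
        if self._version < current:
            # The whole cache goes, not just the routine that changed.
            self._routines.clear()
            self._version = current
            self.flushes += 1

    def lookup(self, name):
        self._flush_if_stale()
        return self._routines.get(name)

    def insert(self, name, compiled):
        self._flush_if_stale()
        self._routines[name] = compiled


if __name__ == "__main__":
    version = GlobalVersion()
    a, b = ConnectionCache(version), ConnectionCache(version)
    a.insert("p1", "compiled p1")
    b.insert("p2", "compiled p2")
    version.increment()       # some connection changes any routine
    print(a.lookup("p1"))     # None: a lost its whole cache
    print(b.lookup("p2"))     # None: so did b
```

The increment itself is cheap, but every connection pays for it later by rebuilding its cache from scratch.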

So, Should You Use Stored Procedures in MySQL?

Does the above warning mean that you should never use stored procedures? No. What it means (besides being a bit of a rant on the implementation of MySQL’s stored procedures) is that you should be aware of these issues and use stored procedures where they make the most sense:

  • When you know that you will be executing the stored procedure over and over again on the same connection — for instance, in a bulk loading script or similar
  • When you know that you will not be disconnecting from the MySQL server at the end of script execution — for instance, if you use JDBC connection pooling
  • When you know that you have a limited number of stored procedures and the memory usage of connections won’t be an issue

Finally, if you see benchmarks that purport to show a huge performance increase from using stored procedures in MySQL, be careful to understand what the benchmark is doing and whether that benchmark represents your real-world environment. For instance, if you see a huge performance increase in sysbench when using stored procedures, but you have a PHP shared-nothing environment, understand that those benchmark results mean very little to you, since sysbench connections don’t get destroyed until the end of the run…

[1] From my copy of Inside SQL Server 2000, Delaney (2001), pages 852-865. For a short, but decent, online explanation of SQL Server’s stored procedure cache, see here

[2] Oracle’s stored procedures are stored in the shared pool of the Oracle system global area (SGA)

Now Recording Drizzle Contributor Tutorial

Hi all!

I was swamped with registrations for the online contributor tutorial for Drizzle, and so I’ve bumped up my account to a DimDim Pro account. This means two things:

  1. I can take >20 registrations
  2. I can record the session

So, Diego, rest assured, the session will be recorded (hopefully with no glitches). I’m going to call DimDim to see if I can do a practice recording beforehand to verify Linux64 is a platform they support for recording (if not, I’ll go to my neighbour’s Windows computer to record)

Again, if you’re interested in the webinar, please do register using the widget below:


Cheers,

jay

Signup for Drizzle Contributor Tutorial Webinar – May 15th

Hi all!

I’ll be giving an online webinar for Drizzle contributors on Saturday, May 15th @ 1am GMT (in the U.S. this is Friday, May 14th @ 9pm EDT, 6pm PDT).

Note that the DimDim widget below shows the time as May 14th @ 8pm. The widget is wrong, since DimDim does not account for daylight saving time.
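
For anyone who wants to check the arithmetic, here is a small Python sketch of the conversion. It uses fixed UTC offsets (EDT as UTC-4, PDT as UTC-7, EST as UTC-5) rather than a time zone database, which is an assumption made to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
EDT = timezone(timedelta(hours=-4), "EDT")  # Eastern time with daylight saving
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific time with daylight saving
EST = timezone(timedelta(hours=-5), "EST")  # Eastern time without daylight saving

# The webinar start time as announced: Saturday, May 15th, 1am GMT.
webinar_utc = datetime(2010, 5, 15, 1, 0, tzinfo=UTC)


def local_times(moment):
    """Convert an aware datetime to the three fixed-offset zones above."""
    return {
        "EDT": moment.astimezone(EDT),
        "PDT": moment.astimezone(PDT),
        "EST": moment.astimezone(EST),
    }


if __name__ == "__main__":
    for name, local in local_times(webinar_utc).items():
        print(name, local.strftime("%Y-%m-%d %H:%M"))
    # EDT 2010-05-14 21:00
    # PDT 2010-05-14 18:00
    # EST 2010-05-14 20:00
```

The 8pm figure corresponds to the UTC-5 offset, which is what you would get by ignoring daylight saving time on the U.S. East Coast.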

Space is strictly limited to 20 people and this will be done via DimDim.com. Please register for the webinar by entering your email address in the widget below and clicking “Sign Up”.

The agenda for this 2-3 hour tutorial will be:

  1. First Steps
    • Getting registered as a contributor for Drizzle on Launchpad
    • Registering your SSH keys with Launchpad
    • Picking up and creating blueprints
    • Basics of Bazaar
    • Setting up a local code repository for Drizzle
    • Committing your work to a Bazaar branch
    • Pushing your code to Launchpad
    • Requesting a code review and merge into trunk
    • One slide explaining the license your contributions may be submitted under
  2. The Drizzle Source Code
    • Our coding standards
    • Our build system
    • Walkthrough of major directories in Drizzle
    • Understanding the plugin system
    • Understanding what the kernel is responsible for
    • Where the Dragons live — and how to avoid them
  3. Walkthrough of a SELECT statement

    • Client communication with server
    • The role of the session scheduler plugin
    • How, when and where authentication and authorization plugins are called
    • How the drizzled::statement::Statement subclasses work
    • Dive into drizzled::statement::Select::execute()
    • Walkthrough how a Table’s definition (metadata) is read from a protobuffer file or an engine
    • Dive into mysql_lock_tables()
    • How does drizzled::plugin::StorageEngine::startStatement() work?
    • How does drizzled::plugin::TransactionalStorageEngine::startTransaction() work?
    • Inside the join optimizer and Join::optimize()
    • How does the nested loops algorithm get executed and how does READ_RECORD work?
    • How does drizzled::Cursor perform table and index scans and seeks?
    • How are result sets packaged up and sent to clients?
  4. Plugin Development Tutorial

    • What plugin classes are even available?
    • Creating your basic plugin
    • The plugin.ini file
    • The module initialization and configuration file
    • Registering your plugin with the kernel with plugin::Context::add()
    • Publishing your plugin's information using table functions
    • Providing users control over your plugin with user-defined functions

Slides from Developing Drizzle Replication Plugins Tutorial

Hi all!

So, Padraig, Toru, and I teamed up yesterday at the MySQL Conference for about thirty or so attendees to discuss developing Drizzle plugins in C++. It was a set of slides that covered basic stuff all the way up through pretty advanced topics. We hope attendees got something out of it :)

Below are the slides from Padraig’s and my part of the tutorial which focused on plugin development basics and the replication plugin API in Drizzle. I’ve also tacked them onto my page of presentations.

Enjoy, and feel free to email me with comments and suggestions to SELECT REVERSE('moc.liamg@sepipyaj');

Developing Drizzle Replication Plugins


Open Office Impress slides
PDF slides


Topics included in the slides:

  • About the Drizzle Community and Expectations of Contributors
  • Getting started on Launchpad
  • Various features of Launchpad
  • Understanding the Source Code Directory Structure
  • Code walkthrough of Drizzle plugin basics
  • Drizzle’s System Architecture
  • Overview of Drizzle’s Replication System
  • Understanding Google Protobuffers
  • The Transaction message in Detail
  • In-depth code walkthrough of the Filtered Replicator module
  • In-depth code walkthrough of the Transaction Log module
  • Future of Drizzle replication – Publisher and Subscriber plugins

Holy Google Summer of Code, Batman

So, last year, Drizzle participated in the Google Summer of Code under the MySQL project organization. We had four excellent student submissions, and Monty Taylor, Eric Day, Stewart Smith and I all mentored students for the summer. It was my second year mentoring, and I really enjoyed it, so I was looking forward to this year’s summer of code.

This year, Padraig O’Sullivan, a GSoC student last year who now works at Akiban Technologies (partly on Drizzle), is the GSoC Administrator and also a mentor for Drizzle, and Drizzle is its own sponsored project organization. Thank you, Padraig!

I have been absolutely floored by the flood of potential students who have shown up on the mailing list and the #drizzle IRC channel. I have been even more impressed with those students’ ambition, sense of community, and willingness to ask questions and help other students as they show up. A couple students have even gotten code contributed to the source trees even before submitting their official applications to GSoC. See, I told you they were ambitious! :)

This year, Drizzle has a listing of 16 potential projects for students to work on. The projects are for students interested in developing in C++, Python, or Perl.

If you are interested in participating, please do check out Drizzle! For those new to Launchpad, Bazaar, and C++ development with Drizzle, feel free to check out these blog articles which cover those topics:

And, in other news, Go Buckeyes!

Understanding Drizzle’s Transaction Log

Today I pushed up the initial patch which adds XA support to Drizzle’s transaction log. So, to give myself a bit of a rest from coding, I’m going to blog a bit about the transaction log and show off some of its features.

WARNING: Please keep in mind that the transaction log module in Drizzle is under heavy development and should not be used in production environments. That said, I’d love to get as much feedback as possible on it, and if you feel like throwing some heavy data at it, that would be awesome ;)

What is the Transaction Log?

Simply put, the transaction log is a record of every modification to the state of the server’s data. It is similar to MySQL’s binlog, with some substantial differences:

  • The transaction log is composed of Google Protobuffer messages. Because of this, it is possible to read the log using a variety of programming languages, as Marcus Eriksson’s RabbitReplication project demonstrates.
  • The transaction log is a plugin[1]. It lives entirely outside of the Drizzle kernel. The advantage of this is that development of the transaction log does not need to be linked with development in the kernel and versioning of the transaction log can happen independently of the kernel.
  • Currently, there is only a single log file. MySQL’s binlog can be split into multiple files. This may or may not change in the future. :)
  • Drizzle’s transaction log is indexed. Among other things, this means that you can query the transaction log directly from within a Drizzle client via DATA_DICTIONARY views. I will demonstrate this feature below.

It is important to also point out that Drizzle’s transaction log is not required for Drizzle replication. This probably sounds very weird to folks who are accustomed to MySQL replication, which depends on the MySQL binlog. In Drizzle, the replication API is different. Although the transaction log can be used in Drizzle’s replication system, it’s not required. I’ll write more on this in later blog posts which demonstrate how the replication system is not dependent on the transaction log, but in this article I just want to highlight the transaction log module.

How Do I Enable the Transaction Log

First things first, let’s see how we can enable the Transaction Log. If you’ve built Drizzle from source or have installed Drizzle locally, you will be familiar with the process of starting up a Drizzle server. To review, here is how you do so:

cd $basedir
./drizzled [options] &

Where $basedir is the directory in which you built or installed Drizzle. For the [options], typically you will need at the very least a --datadir=$DATADIR and a --mysql-protocol-port=$PORT value. For an explanation of the --mysql-protocol-port option, see Eric Day’s recent article.

To demonstrate, I’ve built a Drizzle server in a local directory of mine, and I’ll use the /tests/var/ directory as my $datadir:

cd /home/jpipes/repos/drizzle/xa-transaction-log/drizzled/
./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &

You should see output similar to this:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &
[1] 31499
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 15:41:51  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 15:41:53 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)

To connect to the above server, I then do:

../client/drizzle --port=9306

If all went well, you should be at a drizzle client prompt:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle>

You can check to see whether the transaction log is enabled by querying the DATA_DICTIONARY.GLOBAL_VARIABLES table. The transaction log is not on by default:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | OFF             |
| transaction_log_enable_checksum | OFF             |
| transaction_log_enable_xa       | OFF             |
| transaction_log_log_file        | transaction.log |
| transaction_log_sync_method     | 0               |
| transaction_log_truncate_debug  | OFF             |
| transaction_log_xa_num_slots    | 8               |
+---------------------------------+-----------------+
7 rows in set (0 sec)

OK, let’s restart the server, this time with the transaction log enabled. To shut down Drizzle, there is no need to use a tool like mysqladmin. You can shut down the server via the client:

drizzle> exit
Bye
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306 --shutdown
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled: Normal shutdown
100317 15:53:48  InnoDB: Starting shutdown...
100317 15:53:49  InnoDB: Shutdown completed; log sequence number 44244
...

Now let’s start up the server, this time passing the --transaction-log-enable and the --default-replicator-enable options. The --default-replicator-enable option is needed when the transaction log is not in XA mode (more on that later):

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable &
[2] 31582
[1]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
...
./drizzled: ready for connections.

Once again, connect to the server and check our transaction log variables:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | ON              |
| transaction_log_enable_checksum | OFF             |
| transaction_log_enable_xa       | OFF             |
| transaction_log_log_file        | transaction.log |
| transaction_log_sync_method     | 0               |
| transaction_log_truncate_debug  | OFF             |
| transaction_log_xa_num_slots    | 8               |
+---------------------------------+-----------------+
7 rows in set (0 sec)

drizzle>

OK. So, if you check the $datadir, you should see a file called transaction.log, with a size of 0:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  6 jpipes jpipes 4.0K 2010-03-17 15:54 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 15:54 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:54 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:41 ib_logfile1
-rwxr-----  1 jpipes jpipes    6 2010-03-17 15:54 serialcoder.pid
-rwx------  1 jpipes jpipes    0 2010-03-17 15:54 transaction.log

Back in the drizzle client, let’s go ahead and create a new schema, a new table, and add a single row to that table. This will add some entries to the transaction log that we’ll be able to view:

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 rows affected (0.06 sec)
drizzle> USE lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY,
    -> hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.05 sec)

Checking in on our transaction log file, we see it now has some size to it:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  7 jpipes jpipes 4.0K 2010-03-17 16:11 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 16:11 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile1
drwxrwx--x  2 jpipes jpipes 4.0K 2010-03-17 16:11 lebowski
-rwxr-----  1 jpipes jpipes    6 2010-03-17 16:11 serialcoder.pid
-rwx------  1 jpipes jpipes  444 2010-03-17 16:11 transaction.log

Finding Out What’s In the Transaction Log

OK, so now for the really cool part of this little demonstration. :) Let’s take a look at what is now contained in the transaction log, all via the Drizzle client and the DATA_DICTIONARY views.

There are currently three DATA_DICTIONARY views which show information about the transaction log and its contents:

  • DATA_DICTIONARY.TRANSACTION_LOG
  • DATA_DICTIONARY.TRANSACTION_LOG_ENTRIES
  • DATA_DICTIONARY.TRANSACTION_LOG_TRANSACTIONS

To see what each view contains, simply do a DESC on them:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> DESC TRANSACTION_LOG;
+---------------------+---------+-------+---------+-----------------+-----------+
| Field               | Type    | Null  | Default | Default_is_NULL | On_Update |
+---------------------+---------+-------+---------+-----------------+-----------+
| FILE_NAME           | VARCHAR | FALSE |         | FALSE           |           |
| FILE_LENGTH         | BIGINT  | FALSE |         | FALSE           |           |
| NUM_LOG_ENTRIES     | BIGINT  | FALSE |         | FALSE           |           |
| NUM_TRANSACTIONS    | BIGINT  | FALSE |         | FALSE           |           |
| MIN_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           |
| MAX_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           |
| MIN_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           |
| MAX_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           |
| INDEX_SIZE_IN_BYTES | BIGINT  | FALSE |         | FALSE           |           |
+---------------------+---------+-------+---------+-----------------+-----------+
9 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_ENTRIES;
+--------------+---------+-------+---------+-----------------+-----------+
| Field        | Type    | Null  | Default | Default_is_NULL | On_Update |
+--------------+---------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET | BIGINT  | FALSE |         | FALSE           |           |
| ENTRY_TYPE   | VARCHAR | FALSE |         | FALSE           |           |
| ENTRY_LENGTH | BIGINT  | FALSE |         | FALSE           |           |
+--------------+---------+-------+---------+-----------------+-----------+
3 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_TRANSACTIONS;
+-----------------+--------+-------+---------+-----------------+-----------+
| Field           | Type   | Null  | Default | Default_is_NULL | On_Update |
+-----------------+--------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET    | BIGINT | FALSE |         | FALSE           |           |
| TRANSACTION_ID  | BIGINT | FALSE |         | FALSE           |           |
| SERVER_ID       | BIGINT | FALSE |         | FALSE           |           |
| START_TIMESTAMP | BIGINT | FALSE |         | FALSE           |           |
| END_TIMESTAMP   | BIGINT | FALSE |         | FALSE           |           |
| NUM_STATEMENTS  | BIGINT | FALSE |         | FALSE           |           |
| CHECKSUM        | BIGINT | FALSE |         | FALSE           |           |
+-----------------+--------+-------+---------+-----------------+-----------+
7 rows in set (0 sec)

Let’s see what each of the views tells us about what is in the transaction log. Remember, we’ve executed a CREATE SCHEMA, a CREATE TABLE, and a single INSERT. Here is what the TRANSACTION_LOG view shows:

drizzle> SELECT * FROM TRANSACTION_LOG\G
*************************** 1. row ***************************
          FILE_NAME: transaction.log
        FILE_LENGTH: 444
    NUM_LOG_ENTRIES: 3
   NUM_TRANSACTIONS: 3
 MIN_TRANSACTION_ID: 1
 MAX_TRANSACTION_ID: 3
  MIN_END_TIMESTAMP: 1268856698672620
  MAX_END_TIMESTAMP: 1268856707093000
INDEX_SIZE_IN_BYTES: 73736

The column names should be self-explanatory. The FILE_LENGTH shows the size in bytes of the log (which matches the output we had from our ls -lha above). The INDEX_SIZE_IN_BYTES is the total amount of memory allocated for the transaction log index.

The TRANSACTION_LOG_ENTRIES view isn’t that interesting at first glance:

drizzle> SELECT * FROM TRANSACTION_LOG_ENTRIES;
+--------------+-------------+--------------+
| ENTRY_OFFSET | ENTRY_TYPE  | ENTRY_LENGTH |
+--------------+-------------+--------------+
|            0 | TRANSACTION |           89 |
|           89 | TRANSACTION |          223 |
|          312 | TRANSACTION |          132 |
+--------------+-------------+--------------+

You might be tempted to ask what the heck the purpose of the TRANSACTION_LOG_ENTRIES view is. It is a bit of a bridge table that allows one to see the type of entry at each offset. Currently, the only types of entries in the transaction log are TRANSACTION (basically a serialized Google Protobuffer message) and BLOB, which is for storage of large blob data.
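
One thing the view does make easy to verify is that, in this example, the entries sit back to back in the file: each offset equals the previous offset plus the previous length. The Python sketch below checks that using the three rows shown above; the function name `check_contiguous` is hypothetical, and whether entries are always packed this way is an assumption drawn only from this output.

```python
# (ENTRY_OFFSET, ENTRY_LENGTH) pairs copied from the view output above.
entries = [(0, 89), (89, 223), (312, 132)]


def check_contiguous(entries):
    """Verify each entry starts where the previous one ended.

    Returns the offset just past the last entry, which should equal
    the total length of the log file.
    """
    expected_offset = 0
    for offset, length in entries:
        if offset != expected_offset:
            raise ValueError(
                "gap or overlap: expected offset %d, found %d"
                % (expected_offset, offset)
            )
        expected_offset += length
    return expected_offset


if __name__ == "__main__":
    print(check_contiguous(entries))  # 444, matching FILE_LENGTH
```

The result, 444, matches both the FILE_LENGTH column of TRANSACTION_LOG and the size reported by ls.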

The TRANSACTION_LOG_TRANSACTIONS view shows all the transaction log entries which are of type TRANSACTION:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 |
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 |
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
3 rows in set (0 sec)

As you can see, there is some basic information about each transaction entry in the log, including the offset in the transaction log, the start and end timestamp of the transaction, its transaction identifier, the number of statements involved in the transaction, and an optional checksum for the message (more on checksums below).
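
The timestamp columns are large integers. The Python sketch below decodes them under the assumption that they are microseconds since the Unix epoch; the article does not state the unit, but that reading is consistent with the 16:11 file times in the ls output if the machine's local time was UTC-4. The helper names are hypothetical.

```python
from datetime import datetime, timezone


def from_micros(ts):
    """Convert an integer microsecond timestamp to an aware UTC datetime."""
    # Integer division avoids floating-point rounding of the microseconds.
    seconds, micros = divmod(ts, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=micros)


def duration_micros(start, end):
    """Elapsed time between two microsecond timestamps."""
    return end - start


if __name__ == "__main__":
    start, end = 1268856698672606, 1268856698672620  # first row above
    print(from_micros(start))           # 2010-03-17 20:11:38.672606+00:00
    print(duration_micros(start, end))  # 14
```

Under that assumption, the first transaction (the CREATE SCHEMA) spans 14 microseconds between its start and end timestamps.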

Viewing the Transaction Content

While the above view output may be nice, what we’d really like to be able to do is see what precisely were the changes a Transaction effected. To see this, we can use the PRINT_TRANSACTION_MESSAGE(log_file, offset) UDF. Below, I’ve added two more rows to the lebowski.characters table within an explicit transaction. I then query the DATA_DICTIONARY views using the PRINT_TRANSACTION_MESSAGE() function to show the changes logged to the transaction log:

drizzle> use lebowski
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> START TRANSACTION;
Query OK, 0 rows affected (0 sec)

drizzle> INSERT INTO characters VALUES ('walter','bowling');
Query OK, 1 row affected (0 sec)

drizzle> INSERT INTO characters VALUES ('donny','bowling');
Query OK, 1 row affected (0 sec)

drizzle> COMMIT;
Query OK, 0 rows affected (0.09 sec)

We now see an additional Transaction Log entry and can see that this transaction contains the two individual INSERT statements just executed:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 |
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 |
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 |
|          444 |              4 |         1 | 1268857926482600 | 1268857938514312 |              1 |        0 |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
...
drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET) as info
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 444\G
*************************** 1. row ***************************
info: transaction_context {
  server_id: 1
  transaction_id: 4
  start_timestamp: 1268857926482600
  end_timestamp: 1268857938514312
}
statement {
  type: INSERT
  start_timestamp: 1268857926482605
  end_timestamp: 1268857938514310
  insert_header {
    table_metadata {
      schema_name: "lebowski"
      table_name: "characters"
    }
    field_metadata {
      type: VARCHAR
      name: "name"
    }
    field_metadata {
      type: VARCHAR
      name: "hobby"
    }
  }
  insert_data {
    segment_id: 1
    end_segment: true
    record {
      insert_value: "walter"
      insert_value: "bowling"
    }
    record {
      insert_value: "donny"
      insert_value: "bowling"
    }
  }
}

1 row in set (0.01 sec)

You may notice that NUM_STATEMENTS is equal to 1 even though there were 2 INSERT statements issued. This is because the kernel packages both the INSERTs into a single message::Statement::InsertData package for more efficient storage. If there had been an INSERT and an UPDATE, NUM_STATEMENTS would be 2.
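
To make the counting concrete, here is a minimal sketch of that kind of packing. The types are simplified, hypothetical stand-ins for the real GPB messages, and the grouping rule (same statement type and same table as the previous change) is my assumption about what makes two changes share a statement; the kernel’s actual rules may be more involved.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for the real GPB messages.
enum class StatementType { Insert, Update, Delete };

struct RowChange
{
  StatementType type;
  std::string table;
  std::vector<std::string> values;
};

struct PackedStatement
{
  StatementType type;
  std::string table;
  std::vector<std::vector<std::string>> records;
};

// Fold a transaction's row changes into statements: a change joins the
// previous statement when it has the same type and targets the same table
// (an assumed rule); otherwise it opens a new statement.
inline std::vector<PackedStatement> packStatements(const std::vector<RowChange> &changes)
{
  std::vector<PackedStatement> statements;
  for (const RowChange &change : changes)
  {
    if (statements.empty() ||
        statements.back().type != change.type ||
        statements.back().table != change.table)
    {
      statements.push_back(PackedStatement{change.type, change.table, {}});
    }
    statements.back().records.push_back(change.values);
  }
  return statements;
}
```

Fed the two INSERTs from the example, this yields one statement holding two records. An INSERT followed by an UPDATE yields two statements.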

Enable Automatic Checksumming

One final feature I’ll highlight in this blog post is an option to automatically store a checksum of each transaction message when writing entries to the transaction log. To enable this feature, simply use the --transaction-log-enable-checksum command line option. You can view the checksums of entries in the TRANSACTION_LOG_TRANSACTIONS view, as demonstrated below:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable --transaction-log-enable-checksum &
[5] 32042
[4]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 16:47:07  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:07  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:08  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 16:47:08 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)
...
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 row affected (0.05 sec)

drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
ERROR 1046 (3D000): No database selected
drizzle> use lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.11 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.1 sec)

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT ENTRY_OFFSET, TRANSACTION_ID, CHECKSUM FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+------------+
| ENTRY_OFFSET | TRANSACTION_ID | CHECKSUM   |
+--------------+----------------+------------+
|            0 |              2 |  143866125 |
|           89 |              8 | 1466831622 |
|          312 |              9 |  460824986 |
+--------------+----------------+------------+
3 rows in set (0 sec)
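
For anyone wondering what a reader of the log could do with that column, here is a rough sketch of verifying an entry against its stored checksum. Two loud assumptions: this post does not say which checksum algorithm the log uses, so plain CRC-32 is swapped in purely as a stand-in, and a stored value of 0 is treated as “no checksum recorded”, going by the zeros shown earlier when the option was off. The function names are hypothetical.

```cpp
#include <cstdint>
#include <string>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). A stand-in only:
// the actual algorithm used by the transaction log is not stated here.
inline uint32_t crc32Of(const std::string &bytes)
{
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : bytes)
  {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
    {
      crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Compare a serialized message against the checksum stored with it.
// Assumption: a stored checksum of 0 means checksumming was disabled.
inline bool entryLooksIntact(const std::string &serialized_message,
                             uint32_t stored_checksum)
{
  if (stored_checksum == 0)
    return true; // nothing to verify against
  return crc32Of(serialized_message) == stored_checksum;
}
```

The point is simply that a changed byte anywhere in the serialized message changes the computed value, so the mismatch is detectable when the entry is read back.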

DDL is not Statement-based Replication

As a final note, I’d like to point out that even DDL in Drizzle is replicated as row-based transaction messages, and not as raw SQL statements like in MySQL. You can see, for instance, the message::Statement::CreateTableStatement inside the transaction message which contains all the metadata about the table you just created. :)

drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET)
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 89\G
*************************** 1. row ***************************
PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET): transaction_context {
  server_id: 1
  transaction_id: 2
  start_timestamp: 1268858897017396
  end_timestamp: 1268858897017447
}
statement {
  type: CREATE_TABLE
  start_timestamp: 1268858897017402
  end_timestamp: 1268858897017445
  create_table_statement {
    table {
      name: "characters"
      engine {
        name: "InnoDB"
      }
      field {
        name: "name"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 20
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      field {
        name: "hobby"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 10
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      indexes {
        name: "PRIMARY"
        is_primary: true
        is_unique: true
        type: UNKNOWN_INDEX
        key_length: 80
        index_part {
          fieldnr: 0
          compare_length: 80
          key_type: 0
        }
        options {
          binary_pack_key: true
          var_length_key: true
        }
      }
      type: STANDARD
      options {
        collation: "utf8_general_ci"
        collation_id: 45
      }
    }
  }
}

1 row in set (0 sec)

If you like or don’t like what you see, please do get in touch with me or fire off a wishlist to the Drizzle Discuss mailing list. We’d love to hear from ya!

[1] Actually, the transaction log module is a set of plugins.

Is Anyone Else Looking Forward To…

…the new Clash of the Titans movie, coming out soon? I loved the original back in 1981. Can’t wait to see this one. :)

I would have linked to the new trailer site, but it was all Flash and took over two minutes to load even with broadband. I figured I’d save users the hassle…

O’Gara Cloud Computing Article Off Base

Maureen O’Gara, self-described as “the most read technology reporter for the past 20 years”, has written an article about Drizzle at Rackspace for one of Sys-con’s online zines called Cloud Computing Journal, of which she is an editor.

I tried commenting on Maureen’s article on their website, but the login system is apparently borked, at least for registered users who use OpenID: it still wants a separate user ID and login. Note to sys-con.com: OpenID is designed so that users don’t have to remember yet another login for your website.

I have little patience for content-sparse websites that simply provide an online haven for dozens of Flash advertisements per web page. Beyond that, the article had some serious problems, not the least of which was using large chunks of my Happiness is a Warm Cloud article without citation. Very professional.

OK, to start with, let’s take this quote from the article:

Drizzle runs the risk of not being as stable as MySQL, because the Drizzle team is taking things out and putting other stuff in. Of course it may be successful in trying to create a product that’s more stable than MySQL. But creating a stable DBMS engine is something that has always taken years and years.

This is just about the most naïve explanation for whether a product will or will not be stable that I’ve ever read. If Maureen had bothered to email or call any one of the core Drizzle developers, they’d have been happy to tell her what is and is not stable about Drizzle, and why. Drizzle has not changed the underlying storage engines, so the InnoDB storage engine in Drizzle is the same plugin as available in MySQL (version 1.0.6).

The pieces of MySQL which were removed from Drizzle happen to be the parts of MySQL which have had the most stability issues — namely the additional features added to MySQL 5.0: stored procedures, views, triggers, stored functions, the INFORMATION_SCHEMA implementation, and server-side cursors and prepared statements. In addition to these removed features of MySQL, Drizzle also has no built-in Query Cache, does not support anything other than UTF-8 character sets, and has removed the MySQL replication system and binary logging — moving a rewrite of these pieces out into the plugin ecosystem.

The pieces that were added to Drizzle have mostly been done by adding plugins that provide functionality. Maureen, the reason this was done was precisely to allow for greater stability of the kernel by segregating new features and functionality into the plugin ecosystem, where they can be properly versioned and quarantined, therefore increasing kernel stability. It’s pretty much the biggest principle of Drizzle’s design…

The core developers of Drizzle (and much of the Drizzle community) would also have been happy to tell Maureen how the Drizzle team defines “stability”: when the community says Drizzle is stable — simple as that.

OK, so the next thing I took objection to is the following line:

Half of Rackspace’s customers are on MySQL so there’ll be some donkey-style nosing to get them to migrate.

I think my Rackspace colleagues might have quite a bit to say about the above. I haven’t seen any Rackers talking about mass migration from MySQL to Drizzle. As far as I have seen, the plan is to provide Drizzle as an additional service to Rackspace customers.

Rackspace evidently wants its new boys, who were not the core pillars of the MySQL engineering team, to hitch MySQL, er, Drizzle to Cassandra

MySQL != Drizzle. Implying that the two are equal does a disservice to both, as they have very different target markets and developer audiences.

The smart money is betting that even if a good number of high-volume web sites go down this route, an even higher number such as Facebook and Google will continue with relational databases, primarily MySQL.

Again, probably best to do your homework on this one, too. Facebook runs an amalgamation of a custom MySQL version and storage engines, distributed key-value stores, and Memcached servers. I would think that Facebook moving to Drizzle would be one tough migration. Thousands (tens of thousands?) of MySQL servers all running custom software and integrated into their caching layers is a huge barrier to entry, and not one I would expect a large site like Facebook to casually undertake. But, the same could be said about a move to SQL Server or Oracle, for that matter, and has little to do with Drizzle.

Google is moving away from using MySQL entirely. Mark Callaghan, previously at Google, has moved over to Facebook (possibly because of this trend at Google to get rid of MySQL), and Anthony Curtis, formerly of MySQL, then Google, left Google partially because of this reason.

OK, so the next quote got me really fired up because it demonstrates a complete lack of understanding (maybe not Maureen’s, but the unnamed source it’s from at least):

Somebody – sorry we forget who exactly – claimed that as GPL 2 code Drizzle “severely limits revenue opportunities. For Rackspace, the opportunity to have some key Drizzle developers on its payrolls basically comes down to a promotional benefit, trying to position Rackspace as particularly Drizzle-savvy in the eyes of the community and currying favor for its seemingly generous contributions. What’s unclear is whether they may develop some Drizzle-related functionality that they will then not release as open source and just rent out to Rackspace hosting customers…that would be a way for them to differentiate themselves from competitors and GPLv2 would in principle allow this.”

A few points to make about the above quote.

First, name your source. I find it difficult to believe that the most-read technology writer would not write down a source. Is it the same person you deliberately left out of a quote from my Happiness article? (why did you do that, btw?).

Second, the MySQL server source code is licensed under the GPL 2, and so is Drizzle’s kernel, because it is a derivative work of the MySQL server.

Let me be clear: Developers who contribute code to Drizzle do so under the GPLv2 if that contribution is in the Drizzle kernel. If the code contribution is a plugin, the contributor is free to pick whatever license they choose.

Third, licensing has little if anything to do with revenue at all. The license is beside the point. There are two things which dictate how a company can derive revenue from software:

  1. Copyright ownership
  2. Principles of the Company

Drizzle, Rackspace, or any company a Drizzle contributor works for, does not have the copyright ownership of the MySQL source code, from which Drizzle’s kernel is derived. Oracle does. Therefore, companies do not have any right to re-sell Drizzle (under any license) without explicit permission from Oracle. Period. Has nothing to do with the GPLv2.

That said, contributors do have the right to make money on plugins built for the Drizzle server, and Rackspace, while not having expressed any interest to yours truly in doing so, has the right like any other Drizzle contributor, to make money on plugins its contributors create for Drizzle.

It is my understanding (after actually having talked to Rackspace managers and decision makers) that Rackspace is not interested in getting into the business of selling commercial Drizzle plugins. Their core direction is to create value for their customers, and I fail to see how getting into the commercial software sales business meets that goal.

Next time, please feel free to contact me or any other Drizzle contributor to get the low-down on Drizzle-related stuff. We’ll be nice. I promise.

Recent Work on Improving Drizzle’s Storage Engine API

Over the past six weeks or so, I have been working on cleaning up the pluggable storage engine API in Drizzle.  I’d like to describe some of this work and talk a bit about the next steps I’m taking in the coming months as we roll towards implementing Log Shipping in Drizzle.

First, how did it come about that I started working on the storage engine API?

From Commands to Transactions

Well, it really goes back to my work on Drizzle’s replication system.  I had implemented a simple, fast, and extensible log which stored records of the data changes made to a server.  Originally, the log was called the Command Log, because the Google Protobuffer messages it contained were called message::Commands.  The API  for implementing replication plugins was very simple and within a month or so of debuting the API, quite a few replication plugins had been built, including one replicating to Memcached, a prototype one replicating to Gearman, and a filtering replicator plugin.

In addition, Marcus Eriksson had created the RabbitReplication project which could replicate from Drizzle to other data stores, including Cassandra and Project Voldemort.  However, Marcus did not actually implement any C/C++ plugins using the Drizzle replication API.  Instead, RabbitReplication simply read the new Command Log, which due to it simply being a file full of Google Protobuffer messages, was quick and easy to read into memory using a variety of different programming languages.  RabbitReplication is written in Java, and it was great to see other programming languages be able to read Drizzle’s replication log so easily.  Marcus later coded up a C++ TransactionApplier plugin which replaces the Drizzle replication log and instead replicates the GPB messages directly to RabbitMQ.

And there, you’ll note that one of the plugins involved in Drizzle’s replication system is called TransactionApplier.  It used to be called CommandApplier. That was because the GPB Command messages were individual row change events for the most part.  However, I made a series of changes to the replication API and now the GPB messages sent through the APIs are of class message::Transaction.  message::Transaction objects contain a transaction context, with information about the transaction’s start and end time and its transaction identifier, along with a series of message::Statement objects, each of which represents a part of the data changes that the SQL transaction made.

Thus, the Command Log turned into the Transaction Log, and everywhere the term Command was used, it was replaced with the terms Transaction and Statement (depending on whether you were talking about the entire Transaction or a piece of it). Log entries were now written at COMMIT to the Transaction Log and were not written if no COMMIT occurred [1].
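
The write-at-COMMIT behaviour can be sketched roughly as follows. Everything here is a hypothetical, cut-down stand-in (the real messages are GPB classes with many more fields), and the sketch deliberately ignores the segmented bulk-operation case.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for message::Transaction and friends.
struct TransactionContextSketch
{
  uint32_t server_id;
  uint64_t transaction_id;
};

struct StatementSketch
{
  std::string description;
};

struct TransactionSketch
{
  TransactionContextSketch context;
  std::vector<StatementSketch> statements;
};

// Buffers statements for the open transaction and appends a single
// Transaction entry to the log only when COMMIT happens.
class TransactionLogSketch
{
public:
  void begin(uint32_t server_id, uint64_t transaction_id)
  {
    pending_ = TransactionSketch{{server_id, transaction_id}, {}};
  }
  void addStatement(const std::string &description)
  {
    pending_.statements.push_back(StatementSketch{description});
  }
  void commit()
  {
    entries_.push_back(std::move(pending_));
    pending_ = TransactionSketch();
  }
  void rollback()
  {
    pending_ = TransactionSketch(); // nothing reaches the log
  }
  const std::vector<TransactionSketch> &entries() const { return entries_; }

private:
  TransactionSketch pending_{};
  std::vector<TransactionSketch> entries_;
};
```

A rolled-back transaction leaves entries() untouched, while a committed one adds exactly one entry carrying all of its statements.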

After finishing this work to make the transaction log write Transaction messages at commit time, I was keen to begin coding up the publisher and subscriber plugins which represent a node in the replication environment. However, Brian had asked me to delay working on other replication features and ensure that the replication API could support fully distributed transactions via the X/Open XA distributed transaction protocol. XA support had been removed from Drizzle when the MySQL binlog and original replication system were ripped out and needed some TLC. Fair enough, I said. So, off I went to work on XA.

If Only It Were Simple…

As anyone who has worked on the MySQL source code or developed storage engines for MySQL knows, working with the MySQL pluggable storage engine API is sometimes not the easiest or most straightforward thing. I think the biggest problem with the MySQL storage engine API is that, due to understandable historical reasons, it’s an API that was designed with the MyISAM and HEAP storage engines in mind. Much of the transactional pieces of the API seem to be a bolted-on afterthought and can be very confusing to work with.

As an example, Paul McCullagh, developer of the transactional storage engine PBXT, recently emailed the mysql internals mailing list asking how the storage engine could tell when a SQL statement started and ended. You would think that such a seemingly basic functionality would have a simple answer. You’d be wrong. Monty Widenius answered like this:

Why not simply have a counter in your transaction object for how start_stmt – reset(); When this is 0 then you know stmnt ended.

In Maria we count number of calls to external_lock() and when the sum goes to 0 we know the transaction has ended.

To this, Mark Callaghan responded:

Why does the solution need to be so obscure?

Monty answered (emphasis mine):

Historic reasons.

MySQL never kept a count of which handlers are used by a transaction, only which tables.

So the original logic was that external_lock(lock/unlock) is called for each usage of the table, which is normally more than enough information for a handler to know when a statement starts/ends.

The one case this didn’t work was in the case someone does lock tables as then external_lock is not called per statement. It was to satisfy this case that we added a call to start_stmt() for each table.

It’s of course possible to change things so that start_stmt() / end_stmt() would be called once per used handler, but this would be yet another overhead for the upper level to do which the current handlers that tracks call to external_lock() doesn’t need.

Well, in Drizzle-land, we aren’t beholden to “historic reasons” :) So, after looking through the in-need-of-attention transaction processing code in the kernel, I decided that I would clean up the API so that storage engines did not have to jump through hoops to notify the kernel they participate in a transaction or just to figure out when a statement and a transaction started and ended.
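
For reference, the counting workaround Monty describes boils down to something like the sketch below. This is a hypothetical illustration, not MySQL or Drizzle code, and it ignores the LOCK TABLES case (the one that needed start_stmt()).

```cpp
// Hypothetical sketch of the "count the external_lock() calls" workaround.
class LockCountingTransaction
{
public:
  // Called once for every table locked on behalf of the current statement.
  // Returns true when this call marks the start of a statement.
  bool onExternalLock()
  {
    return ++locked_tables_ == 1;
  }

  // Called once for every table unlocked. Returns true when the count
  // drops back to zero, which is taken to mean "the statement has ended".
  bool onExternalUnlock()
  {
    if (locked_tables_ == 0)
      return false; // unbalanced unlock; ignored in this sketch
    return --locked_tables_ == 0;
  }

  bool statementInProgress() const { return locked_tables_ > 0; }

private:
  unsigned int locked_tables_ = 0;
};
```

The engine has to infer a statement boundary from lock bookkeeping, which is exactly the kind of hoop-jumping I wanted to get rid of.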

The resulting changes to the API are quite dramatic I think, but I’ll leave it to the storage engine developers to tell me if the changes are good or not. The following is a summary of the changes to the storage engine API that I committed in the last few weeks.

plugin::StorageEngine Split Into Subclasses

The very first thing I did was to split the enormous base plugin class for a storage engine, plugin::StorageEngine, into two other subclasses containing transactional elements. plugin::TransactionalStorageEngine is now the base class for all storage engines which implement SQL transactions:

/**
 * A type of storage engine which supports SQL transactions.
 *
 * This class adds the SQL transactional API to the regular
 * storage engine.  In other words, it adds support for the
 * following SQL statements:
 *
 * START TRANSACTION;
 * COMMIT;
 * ROLLBACK;
 * ROLLBACK TO SAVEPOINT;
 * SET SAVEPOINT;
 * RELEASE SAVEPOINT;
 */
class TransactionalStorageEngine :public StorageEngine
{
public:
  TransactionalStorageEngine(const std::string name_arg,
                             const std::bitset<HTON_BIT_SIZE> &flags_arg= HTON_NO_FLAGS);
 
  virtual ~TransactionalStorageEngine();
...
private:
  void setTransactionReadWrite(Session& session);
 
  /*
   * Indicates to a storage engine the start of a
   * new SQL transaction.  This is called ONLY in the following
   * scenarios:
   *
   * 1) An explicit BEGIN WORK/START TRANSACTION is called
   * 2) After an explicit COMMIT AND CHAIN is called
   * 3) After an explicit ROLLBACK AND RELEASE is called
   * 4) When in AUTOCOMMIT mode and directly before a new
   *    SQL statement is started.
   */
  virtual int doStartTransaction(Session *session, start_transaction_option_t options)
  {
    (void) session;
    (void) options;
    return 0;
  }
 
  /**
   * Implementing classes should override these to provide savepoint
   * functionality.
   */
  virtual int doSetSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doRollbackToSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
  virtual int doReleaseSavepoint(Session *session, NamedSavepoint &savepoint)= 0;
 
  /**
   * Commits either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doCommit(Session *session, bool normal_transaction)= 0;
 
  /**
   * Rolls back either the "statement transaction" or the "normal transaction".
   *
   * @param[in] The Session
   * @param[in] true if it's a real commit, that makes persistent changes
   *            false if it's not in fact a commit but an end of the
   *            statement that is part of the transaction.
   * @note
   *
   * 'normal_transaction' is also false in auto-commit mode where 'end of statement'
   * and 'real commit' mean the same event.
   */
  virtual int doRollback(Session *session, bool normal_transaction)= 0;
  virtual int doReleaseTemporaryLatches(Session *session)
  {
    (void) session;
    return 0;
  }
  virtual int doStartConsistentSnapshot(Session *session)
  {
    (void) session;
    return 0;
  }
};

As you can see, plugin::TransactionalStorageEngine inherits from plugin::StorageEngine and extends it with a series of private virtual methods, most of them pure virtual, that implement the SQL transaction parts of a query: doCommit(), doRollback(), etc. Implementing classes simply inherit from plugin::TransactionalStorageEngine and implement their internal transaction processing in these private methods.
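
Here is a small, self-contained sketch of that pattern: public entry points on the base class that forward to private virtual hooks which the engine overrides. The classes are hypothetical, cut-down stand-ins, not the real Drizzle headers. The public wrappers are my assumption about the elided part of the class (their names follow the call path shown later in this post), and the options argument of doStartTransaction() is left out for brevity.

```cpp
#include <string>
#include <vector>

struct Session {}; // hypothetical placeholder for drizzled::Session

// Cut-down stand-in for plugin::TransactionalStorageEngine.
class TransactionalEngineSketch
{
public:
  virtual ~TransactionalEngineSketch() {}

  // Public entry points used by the kernel; they forward to the private hooks.
  int startTransaction(Session *session) { return doStartTransaction(session); }
  int commit(Session *session, bool normal_transaction)
  {
    return doCommit(session, normal_transaction);
  }
  int rollback(Session *session, bool normal_transaction)
  {
    return doRollback(session, normal_transaction);
  }

private:
  // Optional hook with a do-nothing default.
  virtual int doStartTransaction(Session *) { return 0; }
  // Required hooks: every transactional engine must provide these.
  virtual int doCommit(Session *session, bool normal_transaction)= 0;
  virtual int doRollback(Session *session, bool normal_transaction)= 0;
};

// A toy engine that only records which hooks were reached.
class RecordingEngine :public TransactionalEngineSketch
{
public:
  std::vector<std::string> calls;

private:
  int doCommit(Session *, bool normal_transaction) override
  {
    calls.push_back(normal_transaction ? "commit" : "statement commit");
    return 0;
  }
  int doRollback(Session *, bool normal_transaction) override
  {
    calls.push_back(normal_transaction ? "rollback" : "statement rollback");
    return 0;
  }
};
```

Keeping the hooks private means only the base class can invoke them, so the kernel has a single place to wrap every engine call with its own bookkeeping.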

In addition to the SQL transaction, however, is the concept of an XA transaction, which is for distributed transaction coordination. The XA protocol is a two-phase commit protocol because it implements a PREPARE step before a COMMIT occurs. This XA API is exposed via two other classes, plugin::XaResourceManager and plugin::XaStorageEngine. plugin::XaResourceManager derived classes implement the resource manager API of the XA protocol. plugin::XaStorageEngine is a storage engine subclass which, while also implementing SQL transactions, also implements XA transactions.

Here is the plugin::XaResourceManager class:

/**
 * An abstract interface class which exposes the participation
 * of implementing classes in distributed transactions in the XA protocol.
 */
class XaResourceManager
{
public:
  XaResourceManager() {}
  virtual ~XaResourceManager() {}
...
private:
  /**
   * Does the COMMIT stage of the two-phase commit.
   */
  virtual int doXaCommit(Session *session, bool normal_transaction)= 0;
  /**
   * Does the ROLLBACK stage of the two-phase commit.
   */
  virtual int doXaRollback(Session *session, bool normal_transaction)= 0;
  /**
   * Does the PREPARE stage of the two-phase commit.
   */
  virtual int doXaPrepare(Session *session, bool normal_transaction)= 0;
  /**
   * Rolls back a transaction identified by a XID.
   */
  virtual int doXaRollbackXid(XID *xid)= 0;
  /**
   * Commits a transaction identified by a XID.
   */
  virtual int doXaCommitXid(XID *xid)= 0;
  /**
   * Notifies the transaction manager of any transactions
   * which had been marked prepared but not committed at
   * crash time or that have been heuristically completed
   * by the storage engine.
   *
   * @param[out] Reference to a vector of XIDs to add to
   *
   * @retval
   *  Returns the number of transactions left to recover
   *  for this engine.
   */
  virtual int doXaRecover(XID * append_to, size_t len)= 0;
};

and here is the plugin::XaStorageEngine class:

/**
 * A type of storage engine which supports distributed
 * transactions in the XA protocol.
 */
class XaStorageEngine :public TransactionalStorageEngine,
                       public XaResourceManager
{
public:
  XaStorageEngine(const std::string name_arg,
                  const std::bitset<HTON_BIT_SIZE> &flags_arg= HTON_NO_FLAGS);
 
  virtual ~XaStorageEngine();
  ...
};

Pretty clear. A plugin::XaStorageEngine inherits from both plugin::TransactionalStorageEngine and plugin::XaResourceManager because it implements both SQL transactions and XA transactions. The InnobaseEngine is a plugin which inherits from plugin::XaStorageEngine because InnoDB supports SQL transactions as well as XA.

Explicit Statement and Transaction Boundaries

The second major change I made addressed the problem that Mark Callaghan noted in asking why finding out when a statement starts and ends was so obscure. I added two new methods to plugin::StorageEngine called doStartStatement() and doEndStatement(). The kernel now explicitly tells storage engines when a SQL statement starts and ends. This happens before any calls to Cursor::external_lock() happen, and there are no exception cases. In addition, the kernel now always tells transactional storage engines when a new SQL transaction is starting. It does this via an explicit call to plugin::TransactionalStorageEngine::doStartTransaction(). No exceptions, and yes, even for DDL operations.

What this means is that a transactional storage engine no longer needs to “count the calls to Cursor::external_lock()” in order to know when a statement or transaction starts and ends. For a SQL transaction, this means that there is a clear code call path and there is no need for the storage engine to track whether the session is in AUTOCOMMIT mode or not. The kernel does all that work for the storage engine. Imagine a Session executes a single INSERT statement against an InnoDB table while in AUTOCOMMIT mode. This is what the call path looks like:

 drizzled::Statement::Insert::execute()
 |
 -> drizzled::mysql_lock_tables()
    |
    -> drizzled::TransactionServices::registerResourceForTransaction()
       |
       -> drizzled::plugin::TransactionalStorageEngine::startTransaction()
          |
          -> InnobaseEngine::doStartTransaction()
       |
       -> drizzled::plugin::StorageEngine::startStatement()
          |
          -> InnobaseEngine::doStartStatement()
       |
       -> drizzled::plugin::StorageEngine::getCursor()
          |
          -> drizzled::Cursor::write_row()
             |
             -> InnobaseCursor::write_row()
       |
       -> drizzled::TransactionServices::autocommitOrRollback()
          |
          -> drizzled::plugin::TransactionalStorageEngine::commit()
             |
             -> InnobaseEngine::doCommit()

I think this will come as a welcome change to storage engine developers working with Drizzle.
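
To illustrate the division of labour, here is a toy model of that call order. All names are hypothetical stand-ins. The engine contains no AUTOCOMMIT logic at all; the session-side code decides when to start and commit. Where exactly the end-of-statement notification falls relative to the commit is my assumption, since the call path above does not show it.

```cpp
#include <string>
#include <vector>

// Toy engine that only records the order in which its hooks are reached.
// Note that it contains no AUTOCOMMIT logic whatsoever.
struct TracingEngine
{
  std::vector<std::string> trace;
  void doStartTransaction() { trace.push_back("doStartTransaction"); }
  void doStartStatement() { trace.push_back("doStartStatement"); }
  void writeRow() { trace.push_back("write_row"); }
  void doEndStatement() { trace.push_back("doEndStatement"); }
  void doCommit() { trace.push_back("doCommit"); }
};

// Hypothetical stand-in for the kernel side of one session.
class SessionSketch
{
public:
  SessionSketch(TracingEngine &engine, bool autocommit)
    : engine_(engine), autocommit_(autocommit), in_transaction_(false) {}

  void executeInsert()
  {
    if (!in_transaction_)
    {
      engine_.doStartTransaction();
      in_transaction_ = true;
    }
    engine_.doStartStatement();
    engine_.writeRow();
    engine_.doEndStatement(); // placement before the commit is an assumption
    if (autocommit_)
      commit();
  }

  void commit()
  {
    if (in_transaction_)
    {
      engine_.doCommit();
      in_transaction_ = false;
    }
  }

private:
  TracingEngine &engine_;
  bool autocommit_;
  bool in_transaction_;
};
```

With autocommit on, a single executeInsert() produces start transaction, start statement, write, end statement, commit. With autocommit off, the same engine code sees several statements bracketed by one transaction, without having changed a line.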

No More Need for Engine to Call trans_register_ha()

There was an interesting comment in the original documentation for the transaction processing code. It read:

Roles and responsibilities
--------------------------

The server has no way to know that an engine participates in
the statement and a transaction has been started
in it unless the engine says so. Thus, in order to be
a part of a transaction, the engine must “register” itself.
This is done by invoking trans_register_ha() server call.
Normally the engine registers itself whenever handler::external_lock()
is called. trans_register_ha() can be invoked many times: if
an engine is already registered, the call does nothing.
In case autocommit is not set, the engine must register itself
twice — both in the statement list and in the normal transaction
list.

That comment, and I’ve read it dozens of times, always seemed strange to me. I mean, does the server really not know that an engine participates in a statement or transaction unless the engine tells it? Of course not.

So, I removed the need for a storage engine to “register itself” with the kernel. Now, the transaction manager inside the Drizzle kernel (implemented in the TransactionServices component) automatically monitors which engines are participating in an SQL transaction and the engine doesn’t need to do anything to register itself.
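
In spirit, the kernel-side tracking looks something like the sketch below. The class and method names are hypothetical and greatly simplified; the real work happens inside TransactionServices, whose actual signatures are not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical placeholder for a storage engine plugin.
struct EngineSketch
{
  std::string name;
};

// Hypothetical, simplified stand-in for the kernel's transaction manager.
class TransactionServicesSketch
{
public:
  // Called by the kernel itself each time a statement touches a table
  // owned by `engine`. Repeat calls for the same engine are no-ops, so
  // the engine never has to register itself.
  void noteEngineUsed(const EngineSketch *engine)
  {
    if (std::find(participants_.begin(), participants_.end(), engine) ==
        participants_.end())
    {
      participants_.push_back(engine);
    }
  }

  std::size_t participantCount() const { return participants_.size(); }

  // At COMMIT or ROLLBACK the list is reset for the next transaction.
  void endTransaction() { participants_.clear(); }

private:
  std::vector<const EngineSketch *> participants_;
};
```

The kernel already knows which engine owns each table it opens, so it can build the participant list on its own.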

In addition, due to the break-up of the plugin::StorageEngine class and the XA API into plugin::XaResourceManager, Drizzle’s transaction manager can now coordinate XA transactions from plugins other than storage engines. Yep, that’s right. Any plugin which implements plugin::XaResourceManager can participate in an XA transaction and Drizzle will act as the transaction manager. What’s the first plugin that will do this? Drizzle’s transaction log. The transaction log isn’t a storage engine, but it is able to participate in an XA transaction, so it will implement plugin::XaResourceManager but not plugin::StorageEngine.
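
A bare-bones sketch of how a transaction manager might drive such participants through two-phase commit follows. The interface is a hypothetical simplification of plugin::XaResourceManager (no Session or XID arguments), and the failure handling is deliberately naive: if any participant fails to prepare, every participant is simply told to roll back.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for plugin::XaResourceManager.
class XaParticipantSketch
{
public:
  virtual ~XaParticipantSketch() {}
  virtual bool prepare()= 0;  // phase one: can this participant commit?
  virtual void commit()= 0;   // phase two, success path
  virtual void rollback()= 0; // phase two, failure path
};

// Two-phase commit: every participant must prepare successfully
// before any of them is told to commit.
inline bool twoPhaseCommit(const std::vector<XaParticipantSketch *> &participants)
{
  for (XaParticipantSketch *participant : participants)
  {
    if (!participant->prepare())
    {
      for (XaParticipantSketch *each : participants)
        each->rollback();
      return false;
    }
  }
  for (XaParticipantSketch *participant : participants)
    participant->commit();
  return true;
}

// Toy participant. It could stand for a storage engine or for something
// that is not an engine at all, such as a transaction log.
class ToyParticipant :public XaParticipantSketch
{
public:
  explicit ToyParticipant(bool can_prepare)
    : state("active"), can_prepare_(can_prepare) {}

  std::string state;

  bool prepare() override
  {
    if (can_prepare_)
      state= "prepared";
    return can_prepare_;
  }
  void commit() override { state= "committed"; }
  void rollback() override { state= "rolled back"; }

private:
  bool can_prepare_;
};
```

The coordinator only ever sees the participant interface, which is why a non-engine plugin can sit in the same list as InnoDB.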

Performance Impact of Code Changes

So, that “yet another overhead” Monty talked about in the quote above? There wasn’t any noticeable impact in performance or scalability at all. So much for optimize-first coding.

What’s Next?

The next thing I’m working on is removing the notion of the “statement transaction”, which is also a historical by-product, this time because of BerkeleyDB. Gee, I’ve got a lot of work ahead of me…

[1] Actually, there is a way that a transaction that was rolled back can get written to the transaction log. For bulk operations, the server can cut a Transaction message into multiple segments, and if the SQL transaction is rolled back, a special RollbackStatement message is written to the transaction log.
