OpenStack Dev Tip — Easily Pull a Review Branch

Just a quick tip for developers working on OpenStack projects that work on multiple development machines or want to pull a colleague’s code from the Gerrit review system and test it locally.

If you have followed the instructions about setting up a development environment successfully, you will have installed the git-review tool that Jim Blair and Monty Taylor maintain. The git-review tool has a nice little feature that enables you to easily pull any branch that anyone has pushed up to code review:

$> git review -d $REVIEW_NUM

The $REVIEW_NUM variable should be replaced with the identifier of the review branch in Gerrit.

For example, I developed some code on my laptop that I now want to pull to my beefier work machine. The original branch is failing a few tests in Jenkins and I want to diagnose what’s going on. The review branch is here: https://review.openstack.org/#change,1656. The review number (ID) is 1656.

To grab that branch into my local environment and check it out, I do:

jpipes@uberbox:~/repos/glance$ git review -d 1656
Downloading refs/changes/56/1656/2 from gerrit into review/jay_pipes/bug/850377

Doing a git status, you’ll note that I am now in the local branch called review/jay_pipes/bug/850377:

jpipes@uberbox:~/repos/glance$ git status
# On branch review/jay_pipes/bug/850377
# Your branch and 'gerrit/master' have diverged,
# and have 1 and 2 different commit(s) each, respectively.
#
nothing to commit (working directory clean)

I can now run tests, diagnose the issue(s), fix code up and do a:

$> git commit -a --amend
$> git review

And my changes will be pushed up to the original review in Gerrit for others to look at.

Essex Design Summit — QA Sessions to Note

Essex Design Summit

There are quite a few folks interested in QA coming to the OpenStack
Essex Design Summit
next week. I wanted to give you all a heads-up on
the sessions that may be of interest to you.

Here they are:

Monday, Oct 3rd:

09:30-10:25 – Essex Release Cycle

Thierry Carrez, our illustrious release manager, will do a post-mortem
on the Diablo release cycle and discuss potential changes for the
Essex release cycle. I know almost all QAers have expressed desires to
have maintenance branches managed by the QA team and I’ve heard
suggestions about various QA-centric freeze points. Those interested
in advocating for these things should plan to attend this session.

14:00-14:45 – Stable Release Updates

Dave Walker from Canonical plans to outline some possibilities for how
to maintain and update stable releases of OpenStack projects.

15:00-15:45 – Separating API from Implementation of the API

Total self-promotion of a session I’ve proposed… I think anyone
interested in stabilizing the OpenStack APIs and having OpenStack APIs
become the open standards for the cloud computing industry should
attend.

16:30-17:15 – OpenStack Compute API 2.0

Glen Campbell will be leading a discussion about how to improve the
Compute (Nova) API for a 2.0 API series. I think it’s important that a
number of folks on the QA team attend this session and get an idea of
the things that we will be looking at in the future regarding the
Compute API. Personally, I’m definitely planning on attending this
one.

17:30 – 17:55 – NetStack Continuous Integration Planning

Personally, I will not be at this session as I have another session to
lead. However, I think it is important that a number of people from
the QA team attend this session, listen to the needs of the NetStack
contributors, voice our support for their projects, explain what the
goals of our team are, and enable some cross-team collaborative
efforts around CI and QA.

Tuesday, Oct 4th:

09:30 – 09:55 – Documentation Strategies for OpenStack

Anne Gentle will be leading a discussion about documentation of
OpenStack projects. One of the deliverables of the OpenStack QA team
is clearly to identify areas where specifications don’t match
behaviour, so I think it’s pretty critical that the Doc Team and the
QA team be on the same page when it comes to how to coordinate
communication of documentation discrepancies.

09:30 – 09:55 – VM Disk Management in Nova

At the same time as the documentation session, Paul Voccio is leading
a discussion about VM disk management in Nova. Those QAers focusing on
disk/volume management may want to attend this session to ensure the
QA team has a good grasp of changes coming in this arena.

10:00 – 10-25 – OpenStack Common

Brian Lamar will be leading a discussion on getting serious about the
potential of an openstack-common Python library of common code shared
amongst many OpenStack projects. Hey, it’s a heck of a lot easier to
QA code that’s in one location than the same code, written with slight
differences, spread across many projects… seems like a no-brainer
for the QA team to attend and support this idea. :)

11:00 – 11:25 – Monitoring in Swift

John Dickinson will be leading a session to discuss what things should
be monitored across a Swift cluster, and what tools are available for
monitoring. I think this discussion will be valuable for those of us
interested in long-running production integration tests where Swift is
one of the components of a full OpenStack test cluster.

12:00 – 12:25 – Integration Test Suites and Gating Trunk

A no brainer… in this session we will talk about the various
integration test suites for Nova/Glance/Keystone and discuss the
effort already underway to combine them. In addition, we will talk
about what policies to recommend for OpenStack projects regarding what
level of passing integration tests should hold up a gated trunk.

15:30 – 15:55 – Making VM State Handling More Robust

Phil Day is leading a discussion about ways in which the handling of
VM state transitions can be inconsistent and confusing. Since the QA
team is responsible for documenting just such inconsistencies and
building tests cases for such inconsistent behaviour, I think this
session would be good to hang around in and listen/take notes.

16:30 – 17:25 – OpenStack Faithful Implementation Test Suites (FITS)

Josh McKenty will be talking about certain proposals regarding a FITS
for OpenStack APIs. Should be an interesting session :)

Wednesday, Oct 5th:

09:30 – 10:25 – XenServer/KVM Feature Parity Plan

This session should be good for those QAers interested in identifying
areas where feature parity between hypervisors is lacking, and
discussing ways in which the QA team can document these disparities
and produce tests for identifying future disparity among hypervisors.

11:00 – 11:45 – Glance Throughput Improvements

This session is being led by Tim Reddins, from HP, who (along with his
team) have done some analysis on ways to improve Glance’s throughput.
QAers interested in stress, capacity, and parallelism testing should
definitely attend!

11:30 – 11:55 – Nova Upgrades

Ray Hookway will talk about ways that Nova’s update process can be
made more robust. I imagine that the talk’s recommendations will be
generally applicable to many OpenStack projects, not just Nova. I also
think that some members of the QA team should attend — we should be
able to create functional tests for upgrade processes for all
OpenStack projects…

14:30 – 14:55 – Git/Gerrit Best Practices

Monty Taylor is leading this session on Gerrit/Git best practices. I
recommend everyone go, if only to see the fireworks.

15:30 – 15:55 – Quality Assurance in OpenStack

Uhm, duh, you should all be at this one. :) We’ll discuss how to
divide the voluminous amount of work among our members, talk about
which projects (and components within certain projects) are
high-priority items, the ways we should communicate and track
progress, etc

17:00 – 17:25 – Internal Service Communication

Brian Waldon is leading a session on internal service communication
that should be quite interesting. The integration testing coverage of
major internal service components of Nova is currently light, and is
one of those areas I think should be carefully picked over by our QA
team.

OK, that’s the recommendations from me, but of course, feel free to
attend whatever sessions are of most interest to you. I’m very much
looking forward to meeting all of you (we’re up to 28 members as of
this writing).

Cheers, and see you tomorrow!
-jay

What the Heck is Hyperscale?

<rant>
Please, can we not use the term “hyperscale”. It doesn’t have any meaning at all.

If you say “a cloud that has 10,000 physical hosts” or “a cloud that supports 100,000 concurrent accounts” or even “a cloud that grows baby elephants”, then people will understand what you mean. Otherwise, nobody really knows what you’re talking about and it’s all just marketing-speak in the same vein of Microsoft’s “Get Me to the Cloud” commercials.
</rant>

Developing Nova on Linux – Getting Started

In the past few weeks, I’ve gotten involved in the newly-debuted OpenStack project. Right now, my focus is on the Compute sub-project of the stack, called Nova. The initial pieces I am focusing on are the unit tests and end-to-end systems testing of the compute stack.

I struggled over the last couple days to solve a bug that turned out to be not a bug at all, but an issue with the Python development environment I use. I figured I’d write a blog article for those Python developers who are looking to contribute to the Nova project and may also be struggling to get up and going.

If you’re contributing to an open source project like Nova, you’ll want to be able to work on multiple branches of the source code at the same time — for instance, if you’re working on fixing a few bugs simultaneously.

There are quite a few dependencies for Nova, and, because of the way Python searches for packages, it’s imperative that you use a tool such as virtualenv to isolate your multiple branches into their own development environments. Otherwise, as I learned today, the location of your site-packages and what has previously been installed on your development machine can wreak havoc on you. :)

NOTE: For this article, I assume the reader is on Debian/Ubuntu Linux, since that is what I use as my development machine. If you’re on a different flavour of Linux, feel free to adapt the instructions here to suit your particular package manager.

Installing the Tools for Installing the Tools

Before we get into our virtual development environments, you’ll first want to ensure you’ve got a few packages installed, including bzr, libssl-dev, swig and virtualenv. The following should do the trick:

sudo apt-get install -y swig libssl-dev bzr python-virtualenv

A Setup for Source Control and Virtual Environments

In order to get properly setup to contribute to the Nova project, you’ll want to setup a local repository to keep branches of source code that you work on. Although bzr is not required as your revision control system, I use bzr myself and will use it in this article. Adapt as needed if you use git-bzr or similar.

I like to have the following directory structure for working on Python projects:

~/repos/$projectname/ <-- shared repository for branches of your project
~/repos/$projectname/trunk <-- local trunk branch
~/repos/$projectname/$branch <-- a branch you work in
~/virtenvs/$projectname/ <-- Development environments for your project
~/virtenvs/$projectname/$branch <-- development environment for a branch you work in

Assuming you want to contribute to the Nova project and you want to work on fixing a bug #XXXXX, then following would get you started:

bzr init-repo ~/repos/nova
cd ~/repos/nova
bzr branch lp:nova trunk
bzr branch trunk bugXXXXX
mkdir -p ~/virtenvs/nova

At this point, we'll go ahead and create a virtual development environment for bugXXXXX:

cd ~/virtenvs/nova
virtualenv --no-site-packages bugXXXXX
cd bugXXXXX
source bin/activate

At this point, you'll notice your prompt change, indicating that you are now in a virtual development environment. The --no-site-packages ensures that your locally-installed Python packages aren't included in your Python PATH when inside your virtual environment.

Next step is to install into this virtual development environment all the packages and dependencies we'll need. This should do the trick:

easy_install twisted tornado boto M2Crypto IPy carrot mox redis
easy_install http://python-gflags.googlecode.com/files/python_gflags-1.3-py2.5.egg

Alright, next we simply link to our bzr branch location from inside the virtual environment and run the Nova test suite:

ln -s ~/repos/nova/bugXXXXX bugXXXXX
cd bugXXXXX
python run_tests.py

If all went smoothly, you'll see all passing test cases, like below :)

Having issues getting up and running? Find us on Freenode IRC #openstack.

See ya,

Jay

MySQL Stored Procedures Ain’t All That

I give quite a lot of presentations. A whole lot less than I used to, but still quite a few per year. Most of the time, the presentations are on performance tuning MySQL.

Almost every time I give a presentation on MySQL performance tuning — and this happens 100% of the time if I am presenting to a Windows SQL Server crowd — I get the following question:

Why don’t you cover using stored procedures in order to increase performance? Wouldn’t that be the easiest way to get better performance since the stored procedures will only be parsed once and then the compiled bytecode would be efficiently executed from then on?

Every person that asks this question assumes something about MySQL’s stored procedure implementation; they incorrectly believe that stored procedures are compiled and stored in a global stored procedure cache, similar to the stored procedure cache in Microsoft SQL Server[1] or Oracle[2].

This is wrong. Flat-out incorrect.

Here is the truth: Every single connection to the MySQL server maintains it’s own stored procedure cache.

This means two very important things that users of stored procedures should understand:

  • If you operate in a shared-nothing environment — for example, the majority of PHP and Python applications that do not use connection pooling or persistent connections — if your application uses stored procedures, the connection is compiling the stored procedure, storing it in a cache, and destroying that cache every single time you connect to the database server and issue a CALL statement
  • If you use stored procedures, the memory usage of every single connection that uses those stored procedures is going to increase, and will increase substantially if you use many stored procedures

Ooops, I Invalidated Everything Again

So, what happens when you CREATE, ALTER, or DROP any stored procedures? Since MySQL stores all stored procedure execution code on the connection threads, each of those connection threads must invalidate the procedure in its caches that has changed, right?

No, it’s worse. Every time ANY stored procedure is added, dropped, or updated, ALL stored procedures on ALL connection threads will be invalidated and must be re-compiled. Here is how the “caches” are invalidated:

from /sql/sp_cache.cc, lines 193-197, in MySQL 5.5

/*
  Invalidate all routines in all caches.
 
  SYNOPSIS
    sp_cache_invalidate()
 
  NOTE
    This is called when a VIEW definition is created or modified (and in some
    other contexts). We can't destroy sp_head objects here as one may modify
    VIEW definitions from prelocking-free SPs.
*/
void sp_cache_invalidate()
{
  DBUG_PRINT("info",("sp_cache: invalidating"));
  thread_safe_increment(Cversion, &Cversion_lock);
}

It’s a bit misleading, since it actually doesn’t invalidate anything at all. What the above code does is increment the global “Cversion” variable. When a connection thread attempts to execute, drop or insert a new procedure, it will notice that it’s local cache’s version number is less than this Cversion number and will destroy the entire cache and rebuild it gradually as procedures are affected or executed.

So, Should You Use Stored Procedures in MySQL?

The above warning doesn’t necessarily mean that you should never use stored procedures? No. What it means (besides being a bit of a rant on the implementation of MySQL’s stored procedures) is that you should be aware of these issues and use stored procedures where they make the most sense:

  • When you know that you will be executing the stored procedure over and over again on the same connection — for instance, in a bulk loading script or similar
  • When you know that you will not be disconnecting from the MySQL server at the end of script execution — for instance, if you use JDBC connection pooling
  • When you know that you have a limited number of stored procedures and the memory usage of connections won’t be an issue

Finally, if you see benchmarks that purport to show a huge performance increase from using stored procedures in MySQL, be careful to understand what the benchmark is doing and whether that benchmark represents your real-world environment. For instance, if you see a huge performance increase in sysbench when using stored procedures, but you have a PHP shared-nothing environment, understand that those benchmark results mean very little to you, since sysbench connections don’t get destroyed until the end of the run…

[1] From my copy of Inside SQL Server 2000, Delaney (2001), pages 852-865. For a short, but decent, online explanation of SQL Server’s stored procedure cache, see here

[2] Oracle’s stored procedures are stored in the shared pool of the Oracle system global area (SGA)

Now Recording Drizzle Contributor Tutorial

Hi all!

I was swamped with registrations for the online contributor tutorial for Drizzle, and so I’ve bumped up my account to a DimDim Pro account. This means two things:

  1. I can take >20 registrations
  2. I can record the session

So, Diego, rest assured, the session will be recorded (hopefully with no glitches). I’m going to call DimDim to see if I can do a practice recording beforehand to verify Linux64 is a platform they support for recording (if not, I’ll go to my neighbour’s Windows computer to record)

Again, if you’re interested in the webinar, please do register using the widget below:


Cheers,

jay

Signup for Drizzle Contributor Tutorial Webinar – May 15th

Hi all!

I’ll be giving an online webinar for Drizzle contributors on Saturday, May 15th @ 1am GMT (In the U.S. this is Friday, May14th @ 9pm EDT, 6pm PDT).

Note that the DimDim widget below shows the time as May 14th @ 8pm. The widget is wrong, since DimDim does not account for daylight savings.

Space is strictly limited to 20 people and this will be done via DimDim.com. Please register for the webinar by entering your email address in the widget below and clicking “Sign Up”.

The agenda for this 2-3 hour tutorial will be:

  1. First Steps
    • Getting registered as a contributor for Drizzle on Launchpad
    • Registering your SSH keys with Launchpad
    • Picking up and creating blueprints
    • Basics of Bazaar
    • Setting up a local code repository for Drizzle
    • Committing your work to a Bazaar branch
    • Pushing your code to Launchpad
    • Requesting a code review and merge into trunk
    • One slide explaining the license your contributions may be submitted under
  2. The Drizzle Source Code
    • Our coding standards
    • Our build system
    • Walkthrough of major directories in Drizzle
    • Understanding the plugin system
    • Understanding what the kernel is responsible for
    • Where the Dragons live — and how to avoid them
  3. Walkthrough of a SELECT statement

    • Client communication with server
    • The role of the session scheduler plugin
    • How, when and where authentication and authorization plugins are called
    • How the drizzled::statement::Statement subclasses work
    • Dive into drizzled::statement::Select::execute()
    • Walkthrough how a Table’s definition (metadata) is read from a protobuffer file or an engine
    • Dive into mysql_lock_tables()
    • How does drizzled::plugin::StorageEngine::startStatement() work?
    • How does drizzled::plugin::TransactionalStorageEngine::startTransaction() work?
    • Inside the join optimizer and Join::optimize()
    • How does the nested loops algorithm get executed and how does READ_RECORD work?
    • How does drizzled::Cursor perform table and index scans and seeks?
    • How are result sets packaged up and sent to clients?
  4. Plugin Development Tutorial

    • What plugin classes are even available?
    • Creating your basic plugin
    • The plugin.ini file
    • The module initialization and configuration file
    • Registering your plugin with the kernel with plugin::Context::add()
    • Publishing your plugin's information using table functions
    • Providing users control over your plugin with user-defined functions

Slides from Developing Drizzle Replication Plugins Tutorial

Hi all!

So, Padraig, Toru, and myself teamed up yesterday at the MySQL Conference for about thirty or so attendees to discuss developing Drizzle plugins in C++. It was a set of slides that covered basic stuff all the way up through pretty advanced topics. We hope attendees got something out of it :)

Below are the slides from Padraig’s and my part of the tutorial which focused on plugin development basics and the replication plugin API in Drizzle. I’ve also tacked them onto my page of presentations.

Enjoy, and feel free to email me with comments and suggestions to SELECT REVERSE('moc.liamg@sepipyaj');

Developing Drizzle Replication Plugins


Open Office Impress slides
PDF slides


Topics included in the slides:

  • About the Drizzle Community and Expectations of Contributors
  • Getting started on Launchpad
  • Various features of Launchpad
  • Understanding the Source Code Directory Structure
  • Code walkthrough of Drizzle plugin basics
  • Drizzle’s System Architecture
  • Overview of Drizzle’s Replication System
  • Understanding Google Protobuffers
  • The Transaction message in Detail
  • In-depth code walkthrough of the Filtered Replicator module
  • In-depth code walkthrough of the Transaction Log module
  • Future of Drizzle replication – Publisher and Subscriber plugins

Holy Google Summer of Code, Batman

So, last year, Drizzle participated in the Google Summer of Code under the MySQL project organization. We had four excellent student submissions and myself, Monty Taylor, Eric Day and Stewart Smith all mentored students for the summer. It was my second year mentoring, and I really enjoyed it, so I was looking forward to this year’s summer of code.

This year, Padraig O’Sullivan, a GSoC student last year, is now working at Akiban Technologies, partly on Drizzle, and is the GSoC Adminsitrator and also a mentor for Drizzle this year, and Drizzle is its own sponsored project organization this year. Thank you, Padraig!

I have been absolutely floored by the flood of potential students who have shown up on the mailing list and the #drizzle IRC channel. I have been even more impressed with those students’ ambition, sense of community, and willingness to ask questions and help other students as they show up. A couple students have even gotten code contributed to the source trees even before submitting their official applications to GSoC. See, I told you they were ambitious! :)

This year, Drizzle has a listing of 16 potential projects for students to work on. The projects are for students interested in developing in C++, Python, or Perl.

If you are interested in participating, please do check out Drizzle! For those new to Launchpad, Bazaar, and C++ development with Drizzle, feel free to check out these blog articles which cover those topics:

And, in other news, Go Buckeyes!

Understanding Drizzle’s Transaction Log

Today I pushed up the initial patch which adds XA support to Drizzle’s transaction log. So, to give myself a bit of a rest from coding, I’m going to blog a bit about the transaction log and show off some of its features.

WARNING: Please keep in mind that the transaction log module in Drizzle is under heavy development and should not be used in production environments. That said, I’d love to get as much feedback as possible on it, and if you feel like throwing some heavy data at it, that would be awesome ;)

What is the Transaction Log?

Simply put, the transaction log is a record of every modification to the state of the server’s data. It is similar to MySQL’s binlog, with some substantial differences:

  • The transaction log is composed of Google Protobuffer messages. Because of this, it is possible to read the log using a variety of programming languages, as Marcus Eriksson‘s RabbitReplication project demonstrates.
  • The transaction log is a plugin[1]. It lives entirely outside of the Drizzle kernel. The advantage of this is that development of the transaction log does not need to be linked with development in the kernel and versioning of the transaction log can happen independently of the kernel.
  • Currently, there is only a single log file. MySQL’s binlog can be split into multiple files. This may or may not change in the future. :)
  • Drizzle’s transaction log is indexed. Among other things, this means that you can query the transaction log directly from within a Drizzle client via DATA_DICTIONARY views. I will demonstrate this feature below.

It is important to also point out that Drizzle’s transaction log is not required for Drizzle replication. This probably sounds very weird to folks who are accustomed to MySQL replication, which depends on the MySQL binlog. In Drizzle, the replication API is different. Although the transaction log can be used in Drizzle’s replication system, it’s not required. I’ll write more on this in later blog posts which demonstrate how the replication system is not dependent on the transaction log, but in this article I just want to highlight the transaction log module.

How Do I Enable the Transaction Log

First things first, let’s see how we can enable the Transaction Log. If you’ve built Drizzle from source or have installed Drizzle locally, you will be familiar with the process of starting up a Drizzle server. To review, here is how you do so:

cd $basedir
./drizzled [options] &

Where $basedir is the directory you built Drizzle or installed Drizzle. For the [options], typically you will need at the very least a --datadir=$DATADIR and a --mysql-protocol-port=$PORT value. For an explanation of the --mysql-protocol-port option, see Eric Day‘s recent article.

To demonstrate, I’ve built a Drizzle server in a local directory of mine, and I’ll use the /tests/var/ directory as my $datadir:

cd /home/jpipes/repos/drizzle/xa-transaction-log/drizzled/
./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &

You should see output similar to this:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 &
[1] 31499
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 15:41:51  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 15:41:52  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 15:41:53 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)

To connect to the above server, I then do:

../client/drizzle --port=9306

If all went well, you should be at a drizzle client prompt:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle>

You can check to see whether the transaction log is enabled by querying the DATA_DICTIONARY.VARIABLES table. The transaction log is not on by default:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | OFF             |
| transaction_log_enable_checksum | OFF             |
| transaction_log_enable_xa       | OFF             |
| transaction_log_log_file        | transaction.log |
| transaction_log_sync_method     | 0               |
| transaction_log_truncate_debug  | OFF             |
| transaction_log_xa_num_slots    | 8               |
+---------------------------------+-----------------+
7 rows in set (0 sec)

OK, let’s start up the server, this time with the transaction log enabled. To shutdown Drizzle, there is no need to use a tool like mysqladmin. You can shutdown the server via the client:

drizzle> exit
Bye
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306 --shutdown
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled: Normal shutdown
100317 15:53:48  InnoDB: Starting shutdown...
100317 15:53:49  InnoDB: Shutdown completed; log sequence number 44244
...

Now let’s start up the server, this time passing the --transaction-log-enable and the --default-replicator-enable options. The --default-replicator-enable option is needed when the transaction log is not in XA mode (more on that later):

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable &
[2] 31582
[1]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
...
./drizzled: ready for connections.

And again, connect to the server and check our transaction log variables again:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT * FROM GLOBAL_VARIABLES WHERE VARIABLE_NAME LIKE 'transaction_log%';
+---------------------------------+-----------------+
| VARIABLE_NAME                   | VARIABLE_VALUE  |
+---------------------------------+-----------------+
| transaction_log_enable          | ON              |
| transaction_log_enable_checksum | OFF             |
| transaction_log_enable_xa       | OFF             |
| transaction_log_log_file        | transaction.log |
| transaction_log_sync_method     | 0               |
| transaction_log_truncate_debug  | OFF             |
| transaction_log_xa_num_slots    | 8               |
+---------------------------------+-----------------+
7 rows in set (0 sec)

drizzle>

OK. So, if you check the $datadir, you should see a file called transaction.log, with a size of 0:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  6 jpipes jpipes 4.0K 2010-03-17 15:54 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 15:54 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:54 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 15:41 ib_logfile1
-rwxr-----  1 jpipes jpipes    6 2010-03-17 15:54 serialcoder.pid
-rwx------  1 jpipes jpipes    0 2010-03-17 15:54 transaction.log

Back in the drizzle client, let’s go ahead and create a new schema, a new table, and add a single row to that table. This will add some entries to the transaction log that we’ll be able to view:

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 rows affected (0.06 sec)
drizzle> USE lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY,
    -> hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.06 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.05 sec)

Checking in on our transaction log file, we see it now has some size to it:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ls -lha ../tests/var/
total 21M
drwxr-xr-x  7 jpipes jpipes 4.0K 2010-03-17 16:11 .
drwxr-xr-x 11 jpipes jpipes 4.0K 2010-03-17 14:57 ..
-rw-rw----  1 jpipes jpipes  10M 2010-03-17 16:11 ibdata1
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile0
-rw-rw----  1 jpipes jpipes 5.0M 2010-03-17 16:11 ib_logfile1
drwxrwx--x  2 jpipes jpipes 4.0K 2010-03-17 16:11 lebowski
-rwxr-----  1 jpipes jpipes    6 2010-03-17 16:11 serialcoder.pid
-rwx------  1 jpipes jpipes  444 2010-03-17 16:11 transaction.log

Finding Out What’s In the Transaction Log

OK, so now for the really cool part of this little demonstration. :) Let’s take a look at what is now contained in the transaction log, all via the Drizzle client and the DATA_DICTIONARY views.

There are currently three DATA_DICTIONARY views which show information about the transaction log and its contents:

  • DATA_DICTIONARY.TRANSACTION_LOG
  • DATA_DICTIONARY.TRANSACTION_LOG_ENTRIES
  • DATA_DICTIONARY.TRANSACTION_LOG_TRANSACTIONS

To see what each view contains, simply do a DESC on them:

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> DESC TRANSACTION_LOG;
+---------------------+---------+-------+---------+-----------------+-----------+
| Field               | Type    | Null  | Default | Default_is_NULL | On_Update |
+---------------------+---------+-------+---------+-----------------+-----------+
| FILE_NAME           | VARCHAR | FALSE |         | FALSE           |           |
| FILE_LENGTH         | BIGINT  | FALSE |         | FALSE           |           |
| NUM_LOG_ENTRIES     | BIGINT  | FALSE |         | FALSE           |           |
| NUM_TRANSACTIONS    | BIGINT  | FALSE |         | FALSE           |           |
| MIN_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           |
| MAX_TRANSACTION_ID  | BIGINT  | FALSE |         | FALSE           |           |
| MIN_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           |
| MAX_END_TIMESTAMP   | BIGINT  | FALSE |         | FALSE           |           |
| INDEX_SIZE_IN_BYTES | BIGINT  | FALSE |         | FALSE           |           |
+---------------------+---------+-------+---------+-----------------+-----------+
9 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_ENTRIES;
+--------------+---------+-------+---------+-----------------+-----------+
| Field        | Type    | Null  | Default | Default_is_NULL | On_Update |
+--------------+---------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET | BIGINT  | FALSE |         | FALSE           |           |
| ENTRY_TYPE   | VARCHAR | FALSE |         | FALSE           |           |
| ENTRY_LENGTH | BIGINT  | FALSE |         | FALSE           |           |
+--------------+---------+-------+---------+-----------------+-----------+
3 rows in set (0 sec)

drizzle> DESC TRANSACTION_LOG_TRANSACTIONS;
+-----------------+--------+-------+---------+-----------------+-----------+
| Field           | Type   | Null  | Default | Default_is_NULL | On_Update |
+-----------------+--------+-------+---------+-----------------+-----------+
| ENTRY_OFFSET    | BIGINT | FALSE |         | FALSE           |           |
| TRANSACTION_ID  | BIGINT | FALSE |         | FALSE           |           |
| SERVER_ID       | BIGINT | FALSE |         | FALSE           |           |
| START_TIMESTAMP | BIGINT | FALSE |         | FALSE           |           |
| END_TIMESTAMP   | BIGINT | FALSE |         | FALSE           |           |
| NUM_STATEMENTS  | BIGINT | FALSE |         | FALSE           |           |
| CHECKSUM        | BIGINT | FALSE |         | FALSE           |           |
+-----------------+--------+-------+---------+-----------------+-----------+
7 rows in set (0 sec)

Let’s see what each of the views tells us about what is in the transaction log. Remember, we’ve executed a CREATE SCHEMA, a CREATE TABLE, and a single INSERT. Here is what the TRANSACTION_LOG view shows:

drizzle> SELECT * FROM TRANSACTION_LOG\G
*************************** 1. row ***************************
          FILE_NAME: transaction.log
        FILE_LENGTH: 444
    NUM_LOG_ENTRIES: 3
   NUM_TRANSACTIONS: 3
 MIN_TRANSACTION_ID: 1
 MAX_TRANSACTION_ID: 3
  MIN_END_TIMESTAMP: 1268856698672620
  MAX_END_TIMESTAMP: 1268856707093000
INDEX_SIZE_IN_BYTES: 73736

The column names should be self explanatory. The FILE_LENGTH shows the size in bytes of the log (which matches the output we had from our ls -lha above.) The INDEX_SIZE_IN_BYTES is total amount of memory allocated for the transaction log index.

The TRANSACTION_LOG_ENTRIES view isn’t that interesting at first glance:

drizzle> SELECT * FROM TRANSACTION_LOG_ENTRIES;
+--------------+-------------+--------------+
| ENTRY_OFFSET | ENTRY_TYPE  | ENTRY_LENGTH |
+--------------+-------------+--------------+
|            0 | TRANSACTION |           89 |
|           89 | TRANSACTION |          223 |
|          312 | TRANSACTION |          132 |
+--------------+-------------+--------------+

You might be tempted to ask what the heck the purpose of the TRANSACTION_LOG_ENTRIES view is for. It is a bit of a bridge table that allows one to see the type of entries at each offset. Currently, the only types of entries in the transaction log are of type TRANSACTION — basically a serialized GPB Protobuffer message — and a BLOB entry, which is for storage of large blob data.

The TRANSACTION_LOG_TRANSACTIONS view shows all the transaction log entries which are of type TRANSACTION:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 |
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 |
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
3 rows in set (0 sec)

As you can see, there is some basic information about each transaction entry in the log, including the offset in the transaction log, the start and end timestamp of the transaction, it’s transaction identifier, the number of statements involved in the transaction, and an optional checksum for the message (more on checksums below).

Viewing the Transaction Content

While the above view output may be nice, what we’d really like to be able to do is see what precisely were the changes a Transaction effected. To see this, we can use the PRINT_TRANSACTION_MESSAGE(log_file, offset) UDF. Below, I’ve added two more rows to the lebowski.characters table within an explicit transaction. I then query the DATA_DICTIONARY views using the PRINT_TRANSACTION_MESSAGE() function to show the changes logged to the transaction log:

drizzle> use lebowski
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> START TRANSACTION;
Query OK, 0 rows affected (0 sec)

drizzle> INSERT INTO characters VALUES ('walter','bowling');
Query OK, 1 row affected (0 sec)

drizzle> INSERT INTO characters VALUES ('donny','bowling');
Query OK, 1 row affected (0 sec)

drizzle> COMMIT;
Query OK, 0 rows affected (0.09 sec)

We now see an additional Transaction Log entry and can see that this transaction contains the two individual INSERT statements just executed:

drizzle> SELECT * FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
|            0 |              1 |         1 | 1268856698672606 | 1268856698672620 |              1 |        0 |
|           89 |              2 |         1 | 1268856702792284 | 1268856702792331 |              1 |        0 |
|          312 |              3 |         1 | 1268856707025455 | 1268856707093000 |              1 |        0 |
|          444 |              4 |         1 | 1268857926482600 | 1268857938514312 |              1 |        0 |
+--------------+----------------+-----------+------------------+------------------+----------------+----------+
...
drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET) as info
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 444\G
*************************** 1. row ***************************
info: transaction_context {
  server_id: 1
  transaction_id: 4
  start_timestamp: 1268857926482600
  end_timestamp: 1268857938514312
}
statement {
  type: INSERT
  start_timestamp: 1268857926482605
  end_timestamp: 1268857938514310
  insert_header {
    table_metadata {
      schema_name: "lebowski"
      table_name: "characters"
    }
    field_metadata {
      type: VARCHAR
      name: "name"
    }
    field_metadata {
      type: VARCHAR
      name: "hobby"
    }
  }
  insert_data {
    segment_id: 1
    end_segment: true
    record {
      insert_value: "walter"
      insert_value: "bowling"
    }
    record {
      insert_value: "donny"
      insert_value: "bowling"
    }
  }
}

1 row in set (0.01 sec)

You may notice that NUM_STATEMENTS is equal to 1 even though there were 2 INSERT statements issued. This is because the kernel packages both the INSERTs into a single message::Statement::InsertData package for more efficient storage. If there had been an INSERT and an UPDATE, NUM_STATEMENTS would be 2.

Enable Automatic Checksumming

One final feature I’ll highlight in this blog post is an option to automatically store a checksum of each transaction message when writing entries to the transaction log. To enable this feature, simply use the --transaction-log-enable-checksum command line option. You can view the checksums of entries in the TRANSACTION_LOG_TRANSACTIONS view, as demonstrated below:

jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable --transaction-log-enable-checksum &
[5] 32042
[4]   Done                    ./drizzled --datadir=/home/jpipes/repos/drizzle/xa-transaction-log/tests/var/ --mysql-protocol-port=9306 --transaction-log-enable --default-replicator-enable
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ InnoDB: The InnoDB memory heap is disabled
InnoDB: Mutexes and rw_locks use GCC atomic builtins.
InnoDB: The first specified data file ./ibdata1 did not exist:
InnoDB: a new database to be created!
100317 16:47:07  InnoDB: Setting file ./ibdata1 size to 10 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:07  InnoDB: Log file ./ib_logfile0 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile0 size to 5 MB
InnoDB: Database physically writes the file full: wait...
100317 16:47:08  InnoDB: Log file ./ib_logfile1 did not exist: new to be created
InnoDB: Setting log file ./ib_logfile1 size to 5 MB
InnoDB: Database physically writes the file full: wait...
InnoDB: Doublewrite buffer not found: creating new
InnoDB: Doublewrite buffer created
InnoDB: Creating foreign key constraint system tables
InnoDB: Foreign key constraint system tables created
100317 16:47:08 InnoDB Plugin 1.0.4 started; log sequence number 0
Listening on 0.0.0.0:9306
Listening on :::9306
Listening on 0.0.0.0:4427
Listening on :::4427
./drizzled: Forcing close of thread 0 user: ''
./drizzled: ready for connections.
Version: '2010.03.1314' Source distribution (xa-transaction-log)
...
jpipes@serialcoder:~/repos/drizzle/xa-transaction-log/drizzled$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 7 Source distribution (xa-transaction-log)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> CREATE SCHEMA lebowski;
Query OK, 1 row affected (0.05 sec)

drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
ERROR 1046 (3D000): No database selected
drizzle> use lebowski
Database changed
drizzle> CREATE TABLE characters (name VARCHAR(20) NOT NULL PRIMARY KEY, hobby VARCHAR(10) NOT NULL) ENGINE=InnoDB;
Query OK, 0 rows affected (0.11 sec)

drizzle> INSERT INTO characters VALUES ('the dude','bowling');
Query OK, 1 row affected (0.1 sec)

drizzle> use data_dictionary
Reading table information for completion of table and column names
    You can turn off this feature to get a quicker startup with -A

Database changed
drizzle> SELECT ENTRY_OFFSET, TRANSACTION_ID, CHECKSUM FROM TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+------------+
| ENTRY_OFFSET | TRANSACTION_ID | CHECKSUM   |
+--------------+----------------+------------+
|            0 |              2 |  143866125 |
|           89 |              8 | 1466831622 |
|          312 |              9 |  460824986 |
+--------------+----------------+------------+
3 rows in set (0 sec)

DDL is not Statement-based Replication

As a final note, I’d like to point out that even DDL in Drizzle is replicated as row-based transaction messages, and not as raw SQL statements like in MySQL. You can see, for instance, the message::Statement::CreateTableStatement inside the transaction message which contains all the metadata about the table you just created. :)

drizzle> SELECT PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET)
    -> FROM TRANSACTION_LOG_TRANSACTIONS WHERE ENTRY_OFFSET = 89\G
*************************** 1. row ***************************
PRINT_TRANSACTION_MESSAGE('transaction.log', ENTRY_OFFSET): transaction_context {
  server_id: 1
  transaction_id: 2
  start_timestamp: 1268858897017396
  end_timestamp: 1268858897017447
}
statement {
  type: CREATE_TABLE
  start_timestamp: 1268858897017402
  end_timestamp: 1268858897017445
  create_table_statement {
    table {
      name: "characters"
      engine {
        name: "InnoDB"
      }
      field {
        name: "name"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 20
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      field {
        name: "hobby"
        type: VARCHAR
        format: DefaultFormat
        constraints {
          is_nullable: false
        }
        string_options {
          length: 10
          collation_id: 45
          collation: "utf8_general_ci"
        }
      }
      indexes {
        name: "PRIMARY"
        is_primary: true
        is_unique: true
        type: UNKNOWN_INDEX
        key_length: 80
        index_part {
          fieldnr: 0
          compare_length: 80
          key_type: 0
        }
        options {
          binary_pack_key: true
          var_length_key: true
        }
      }
      type: STANDARD
      options {
        collation: "utf8_general_ci"
        collation_id: 45
      }
    }
  }
}

1 row in set (0 sec)

If you like or don’t like what you see, please do get in touch with me or fire off a wishlist to the Drizzle Discuss mailing list. We’d love to hear from ya!

[1] Actually, the transaction log module is a set of plugins.