Archive for category C/C++

The Deal with REPLACE .. Or Is It UPDATE?

Yesterday, I posed a question to the ZanyWeb about what exactly a REPLACE statement does behind the scenes in the storage engine. There were many excellent comments and these comments exposed some misunderstandings (including some of my own misconceptions) about the REPLACE statement itself and what goes on behind the scenes in the storage engine.

The question I asked was this: if I execute the following statements in a client, what would you expect would happen behind the scenes in the storage engine?

CREATE TABLE t1 (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
, padding VARCHAR(200) NOT NULL
);

INSERT INTO t1 VALUES (1, "I love testing.");
INSERT INTO t1 VALUES (2, "I hate testing.");

REPLACE INTO t1 VALUE (2, "I love testing.");

Based purely on the manual, one would expect, as Ryan Thiessen expressed in his comment

try to insert once, detect a failure and then delete/insert in failure case.

or, as Ryan put it in pseudo-code:

INSERT INTO t1 VALUES (1, "I love testing.");
INSERT INTO t1 VALUES (2, "I hate testing.");
INSERT INTO t1 VALUE (2, "I love testing.");
if error() { DELETE FROM t1 WHERE id = 2; INSERT INTO t1 VALUE (2, "I love testing."); }

Unfortunately, this is not the case. At least, it is not always the case.

ha_write_row() vs. ha_update_row()

In MySQL, as well as in Drizzle, there is a pluggable storage engine API. This API consists primarily of two classes: In MySQL, one class is called handlerton and the other is called handler — in Drizzle, these classes are called StorageEngine and Cursor (Current Set Of Records), because that is what they actually represent.

Without going into too many details, the Cursor (handler) interface has four calls which are relevant to this discussion:

  • int Cursor::ha_write_row(unsigned char *new_record) — inserts a new record into the table pointed to by the Cursor
  • int Cursor::ha_update_row(unsigned char *old_record, unsigned char *new_record) — update an existing record in the table pointed to by the Cursor
  • int Cursor::ha_delete_row(unsigned char *old_record) — delete an existing record in the table pointed to by the Cursor
  • virtual int Cursor::extra(ha_extra_function operation) — indicates to the Cursor that it should handle an operation in a certain way. If this sounds vague to you, it is, so read on…

OK, So What Does Happen?

What actually happens behind the scenes for the code above is the following “optimized” execution path, again in pseudo-code:

// (0) Enter mysql_insert() and prepare to do an INSERT
mysql_insert();

// (1) tell engine that a write can replace an existing record
Cursor::extra(HA_EXTRA_WRITE_CAN_REPLACE);

// (2) tell engine to ignore duplicate keys
Cursor::extra(HA_EXTRA_IGNORE_DUP_KEY);

// (3) For each record to be inserted...
for each record begin:

  // (4) Insert the record by calling ha_write_row()
  Cursor::ha_write_row();

  // (5) Cursor tries to "insert" the record
  error= Cursor::write_row();

  // (6) If the write fails, then DELETE, the record and try to INSERT again.
  if error is true:

    Cursor::ha_delete_row();
    Cursor::ha_write_row();

The above code looks very similar to what Ryan has already said he expected the underlying code to look like, no? The problem is the calls to Cursor::extra() (steps 1 and 2) and what happens inside InnoDB (step 5) when Cursor::write_row() is called.

InnoDB tries to insert the record and realizes that the new record violates an existing primary key value but since MySQL has already told it to ignore duplicate key violations, InnoDB updates the existing row and returns successfully from write_row()!.

Unfortunately, because of this “optimization”, Cursor::ha_write_row(), which is the kernel’s wrapper around the virtualized storage engine’s Cursor::write_row call is completely unaware that an UPDATE and not an INSERT has occurred. Why is this a problem? Well, what happens after a call to Cursor::write_row() succeeds? That’s correct: post-change logging occurs. In other words, logging for row-based replication occurs. Well, Cursor::ha_write_row() then logs an INSERT and not the appropriate UPDATE. Oops.

There are three major things to note about the above logic:

  1. This is undocumented behaviour. According to the manual, a REPLACE statement works like an INSERT, except if a primary or unique key is violated, the original row is DELETEd and the new row is INSERTed.
  2. This breaks the defined public Cursor (known as handler in MySQL) interface for ha_write_row() since rows may be updated when ha_update_row() is NOT called.
  3. The code has side effects, namely that the change to the state of the server is unknown to the Cursor: an update occurred but the Cursor believes an insert occurred.

When Optimizations Aren’t

This kind of coding can be argued to be an “optimization”. I will argue that it is not an optimization, but is overly clever and reduces the clarity of the code thereby making the code unnecessarily difficult to trace and follow. Personally, I wasted days of research time trying to understand how a call to ha_write_row() was actually updating an existing record.

In my opinion, a proper optimization would be to:

  • Make the internal code function exactly as the documentation says it does.
  • Adapt the documentation for REPLACE and state that REPLACE is not as efficient as INSERT .. ON DUPLICATE KEY UPDATE and whenever possible, prefer the latter statement over the former.

Does REPLACE belong in Drizzle?

One of the reasons that REPLACE may exist in MySQL is to make the INSERT statement idempotent when run on a replication slave when run in mixed or statement-based mode. Don’t believe me? Here is a code snippet and comment from the sql/log_event.cc file in MySQL 5.4:

Write_rows_log_event::do_before_row_operations(const Slave_reporting_capability *const)
...
  /**
     todo: to introduce a property for the event (handler?) which forces
     applying the event in the replace (idempotent) fashion.
  */
  if (bit_is_set(slave_exec_mode, SLAVE_EXEC_MODE_IDEMPOTENT) == 1 ||
      m_table->s->db_type()->db_type == DB_TYPE_NDBCLUSTER)
  {
    /*
      We are using REPLACE semantics and not INSERT IGNORE semantics
      when writing rows, that is: new rows replace old rows.  We need to
      inform the storage engine that it should use this behaviour.
    */

    /* Tell the storage engine that we are using REPLACE semantics. */
    thd->lex->duplicates= DUP_REPLACE;

    /*
      Pretend we're executing a REPLACE command: this is needed for
      InnoDB and NDB Cluster since they are not (properly) checking the
      lex->duplicates flag.
    */
    thd->lex->sql_command= SQLCOM_REPLACE;
    /*
       Do not raise the error flag in case of hitting to an unique attribute
    */
    m_table->file->extra(HA_EXTRA_IGNORE_DUP_KEY);
...

By contrast to MySQL 5.4, Drizzle’s replication system is a reflection of the atomic changes made to the state of a server. These changes are, by definition, deterministic, and applying a transaction log to a Drizzle replica can result in one and only one state in the resulting server. In Drizzle, having the application of the transaction log be idempotent is not necessary, as the transaction log stream is always deterministic. For instance, if an INSERT ... ON DUPLICATE KEY UPDATE is issued against a Drizzle server, the replication system shall create either an InsertStatement or an UpdateStatement message, depending on what precisely occurred on the server, and these messages are logged in Drizzle’s transaction log. There is no fudging of statements. What happened on the server is logged as an exact change in the state of the server, nothing more.

So…anyway, because of the above “optimization”, I’m now left with a tricky problem: continue to pay the interest on this Technical Debt or rework the Cursor interface so that behaviour is always idempotent via the wrapper interface and has no side effects. I’m not sure what I will do…but this has delayed me substantially. :(

Drizzle Replication – The CommandReplicator and CommandApplier Plugin API

IMPORTANT:
This article is out of date and the replication API has been updated. Please see the follow-up article for the most up to date information!

OK, so here is the next installment in the Drizzle replication article series. Today, I’ll be talking about the flow of the Command message object through the CommandReplicator and CommandApplier APIs. If you missed the first article about the structure of the Command message and Google Protobuffers, you may want to read that first. We’ll only be talking in this article about what happens on one server. We will be discussing the Command Log in the next article, and then discuss how messages are passed from one server to another. But, before we discuss those things, it is critical to first understand the CommandReplicator and CommandApplier plugin classes, which are two of the abstract interfaces out of which replication modules can be built (and from which a number of modules have already been built).

In this article, I’ll be showing code that was written by myself and by Padraig O’Sullivan, another Drizzle contributor.

Flow of Command messages

SIDEBAR: Namespaces in Drizzle

The astute reader may have noticed that there is quite a big difference in the organization of the Drizzle codebase versus MySQL. Drizzle uses C++ namespaces to make the code easier to read and understand.

New code written for Drizzle is always namespaced in a namespace corresponding to the exact directory structure in which you find the source files for a particular class. For instance, all plugin interfaces classes (abstract base classes) are in the drizzled::plugin namespace and are found in /drizzled/plugin/. All protobuffer message classes are defined in /drizzled/message/ and are in the namespace drizzled::message. We feel organizing the code like so, and using C++ namespaces, makes the code clearer and easier to read. You can read more about Drizzle’s coding style, including how we use namespaces on our wiki.

Command messages are created by a component within the Drizzle kernel whenever any SQL statement which modifies the state of the server is executed. This component is called ReplicationServices. This component is a singleton object and manages communication between the kernel and two kinds of plugins, called CommandReplicator and CommandApplier. The component is declared in drizzled/replication_services.h and defined in drizzled/replication_services.cc.

The advantage of the ReplicationServices component frees replication module developers from having to understand anything about what’s going on in the kernel. You don’t have to understand anything about the Session object, how statements are executed in the kernel, or even how storage engines are actually applying changes to their data. All you need to learn is the structure of the Command message (covered in the previous article) and the very, very simple CommandReplicator and CommandApplier plugin APIs. So, let’s look at those APIs. :)

The CommandReplicator Plugin API

Here is the abstract base class drizzled::plugin::CommandReplicator, available in drizzled/plugin/command_replicator.h.

/**
 * Class which replicates Command messages
 */
class CommandReplicator
{
public:
  CommandReplicator() {}
  virtual ~CommandReplicator() {}
  /**
   * Replicate a Command message to a CommandApplier.
   *
   * @note
   *
   * It is important to note that memory allocation for the 
   * supplied pointer is not guaranteed after the completion 
   * of this function -- meaning the caller can dispose of the
   * supplied message.  Therefore, replicators and appliers 
   * implementing an asynchronous replication system must copy
   * the supplied message to their own controlled memory storage
   * area.
   *
   * @param Pointer to the applier of the command message
   * @param Command message to be replicated
   */
  virtual void replicate(CommandApplier *in_applier, drizzled::message::Command &to_replicate)= 0;
};

See, I told you it was simple :) There is an additional isActive() method currently in this API, but this will disappear in the coming months as that method wil be “moving up” to a base Plugin class. You can ignore it for right now. The only other method which CommandReplicator plugins must implement is the replicate() method.

The replicate() method accepts only two parameters. The first is a pointer to an object which implements the CommandApplier interface (inherits from drizzled::plugin::CommandApplier). The second is the Command message that the ReplicationServices component constructs when a data-modification event occurs on the server.

The CommandApplier Plugin API

Before I show some implementation of CommandReplicators, let’s first quickly take a look at the drizzled::plugin::CommandApplier interface. It is equally simple.

/**
 * Base class for appliers of Command messages
 */
class CommandApplier
{
public:
  CommandApplier() {}
  virtual ~CommandApplier() {}
 /**
   * Apply something to a target.
   *
   * @note
   *
   * It is important to note that memory allocation for the 
   * supplied pointer is not guaranteed after the completion 
   * of this function -- meaning the caller can dispose of the
   * supplied message.  Therefore, appliers which are
   * implementing an asynchronous replication system must copy
   * the supplied message to their own controlled memory storage
   * area.
   *
   * @param Command message to be applied
   */
  virtual void apply(const drizzled::message::Command &to_apply)= 0;
};

The CommandApplier quite predictably implements a single method: apply(), which accepts a single parameter of a Command message. Yes, it’s that simple.

Putting it All Together

Before we get to the example implementation of the above plugin interfaces, it’s worth pointing out that Drizzle’s replication system allows multiple CommandReplicator and CommandApplier plugins to register themselves with the ReplicationServices component. The ReplicationServices::push() internal method simply loops through the active replicators and appliers, calling the replicate() method, passing in a constructed Command message and a pointer to each registered applier. Here is the code showing this process:

void ReplicationServices::push(message::Command &to_push)
{
  vector<plugin::CommandReplicator *>::iterator repl_iter= replicators.begin();
  vector<plugin::CommandApplier *>::iterator appl_start_iter, appl_iter;
  appl_start_iter= appliers.begin();
 
  plugin::CommandReplicator *cur_repl;
  plugin::CommandApplier *cur_appl;
 
  while (repl_iter != replicators.end())
  {
    cur_repl= *repl_iter;
    if (! cur_repl->isActive())
    {
      ++repl_iter;
      continue;
    }
 
    appl_iter= appl_start_iter;
    while (appl_iter != appliers.end())
    {
      cur_appl= *appl_iter;
 
      if (! cur_appl->isActive())
      {
        ++appl_iter;
        continue;
      }
 
      cur_repl->replicate(cur_appl, to_push);
      /* 
       * We update the timestamp for the last applied Command so that
       * publisher plugins can ask the replication services when the
       * last known applied Command was using the getLastAppliedTimestamp()
       * method.
       */
      last_applied_timestamp.fetch_and_store(to_push.timestamp());
      ++appl_iter;
    }
    ++repl_iter;
  }
}

Pretty simple, no? There’s a couple big things to point out about the code you’ve seen so far. First, Drizzle is focused on providing clean, simple, and encapsulated interfaces for plugin developers. There’s no need for plugin developers to understand the often-messy and poorly documented internals of the kernel. A plugin developer only needs to do two things:

  • Understand the well-documented Google Protobuffer message API and have a copy of the replication.proto file which contains the blueprint for the Command message
  • Implement either the drizzled::plugin::CommandReplicator::replicate() or drizzled::plugin::CommandApplier::apply() method

Secondly, the interface allows individuals and companies to completely customize how Drizzle replication works for them. You want replication to do X, Y, and Z? Fine, no problem. We give you some example plugins which clearly demonstrate the interfaces, and you can take it from there, without worrying about messing up the kernel. :)

So, now that you’ve seen the plugin interfaces, let’s discuss how these plugin interfaces are implemented in a couple plugins that are distributed with Drizzle right now.

The FilteredReplicator Plugin

After I wrote the default replicator plugin (which literally does nothing but the following:

void DefaultReplicator::replicate(plugin::CommandApplier *in_applier, message::Command &to_replicate)
{
  /* 
   * We do absolutely nothing but call the applier's apply() method, passing
   * along the supplied Command.  Yep, told you it was simple...
   */
  in_applier->apply(to_replicate);
}

Padraig O’Sullivan took the reins and wrote a FilteredReplicator plugin which allows a DBA to filter Command messages by schema name, table name, and a regular expression. His FilteredReplicator is much more interesting than the default replicator, so let’s take a look at some of his code. Here is the replicate() method from the FilteredReplicator plugin:

void FilteredReplicator::replicate(plugin::CommandApplier *in_applier, 
                                   message::Command &to_replicate)
{
  string schema_name;
  string table_name;
 
  /*
   * First, we check to see if the command consists of raw SQL. If so,
   * we need to parse this SQL and determine whether to filter the event
   * based on the information we obtain from the parsed SQL.
   * If not raw SQL, check if this event should be filtered or not
   * based on the schema and table names in the command message.
   */
  if (to_replicate.type() == message::Command::RAW_SQL)
  {
    parseQuery(to_replicate.sql(),
               schema_name,
               table_name);
  }
  else
  {
    schema_name.assign(to_replicate.schema());
    table_name.assign(to_replicate.table());
  }
 
  /*
   * Convert the schema name and table name strings to lowercase so that it
   * does not matter what case the table or schema name was specified in. We
   * also keep all entries in the vectors of schemas and tables to filter in
   * lowercase.
   */
  transform(schema_name.begin(), schema_name.end(),
                 schema_name.begin(), ::tolower);
  transform(table_name.begin(), table_name.end(),
                 table_name.begin(), ::tolower);
 
  if (isSchemaFiltered(schema_name) ||
      isTableFiltered(table_name))
  {
    return;
  }
 
   /*
   * We can now simply call the applier's apply() method, passing
   * along the supplied command.
   */
  in_applier->apply(to_replicate);
}

The above code is quite easy to read and understand. I encourage you to check out the rest of Padraig’s implementation in /plugin/filtered_replicator/filtered_replicator.cc and see just how easy it is to start writing new Drizzle replication modules.

The CommandLog::apply() Method

To demonstrate a CommandApplier implementation, I present the apply() method of the CommandLog plugin, which inherits from drizzled::plugin::CommandApplier. Without going into too much about the Command Log, which is the subject of the next article, I’ll let you take a look at just the apply() method, to see how a CommandApplier does its main job (see /plugin/command_log/command_log.cc):

void CommandLog::apply(const message::Command &to_apply)
{
  string buffer(""); /* Buffer we will write serialized command to */
 
  static const uint32_t HEADER_TRAILER_BYTES= sizeof(uint64_t) + /* 8-byte length header */
                                              sizeof(uint32_t); /* 4 byte checksum trailer */
 
  size_t length;
  ssize_t written;
  off_t cur_offset;
 
  to_apply.SerializeToString(&buffer);
 
  length= buffer.length(); 
 
  /*
   * Do an atomic increment on the offset of the log file position
   */
  cur_offset= log_offset.fetch_and_add((off_t) (HEADER_TRAILER_BYTES + length));
 
  /*
   * We adjust cur_offset back to the original log_offset before
   * the increment above...
   */
  cur_offset-= (off_t) (HEADER_TRAILER_BYTES + length);
 
  /* 
   * Quick safety...if an error occurs below, the log file will
   * not be active, therefore a caller could have been ready
   * to write...but the log is crashed.
   */
  if (unlikely(state == CRASHED))
    return;
 
  <span style="color: green;">/* We always write in network byte order */</span>
  unsigned char nbo_length[8];
  int8store(nbo_length, length);
 
  /* Write the length header */
  do
  {
    written= pwrite(log_file, nbo_length, sizeof(uint64_t), cur_offset);
  }
  while (written == -1 && errno == EINTR); /* Just retry the write when interrupted by a signal... */
 
  if (unlikely(written != sizeof(uint64_t)))
  {
    errmsg_printf(ERRMSG_LVL_ERROR, 
                  _("Failed to write full size of command.  Tried to write %" PRId64 " bytes at offset %" PRId64 ", but only wrote %" PRId64 " bytes.  Error: %s\n"), 
                  sizeof(int64_t), 
                  (int64_t) cur_offset,
                  (int64_t) written, 
                  strerror(errno));
    state= CRASHED;
    /* 
     * Reset the log's offset in case we want to produce a decent error message including
     * the original offset where an error occurred.
     */
    log_offset= cur_offset;
    is_active= false;
    return;
  }
 
  cur_offset+= (off_t) written;
 
  /* 
   * Quick safety...if an error occurs above in another writer, the log 
   * file will be in a crashed state.
   */
  if (unlikely(state == CRASHED))
  {
    /* 
     * Reset the log's offset in case we want to produce a decent error message including
     * the original offset where an error occurred.
     */
    log_offset= cur_offset;
    return;
  }
 
  /* Write the command message itself */
  do
  {
    written= pwrite(log_file, buffer.c_str(), length, cur_offset);
  }
  while (written == -1 && errno == EINTR); /* Just retry the write when interrupted by a signal... */
 
  if (unlikely(written != (ssize_t) length))
  {
    errmsg_printf(ERRMSG_LVL_ERROR, 
                  _("Failed to write full serialized command.  Tried to write %" PRId64 " bytes at offset %" PRId64 ", but only wrote %" PRId64 " bytes.  Error: %s\n"), 
                  (int64_t) length, 
                  (int64_t) cur_offset,
                  (int64_t) written, 
                  strerror(errno));
    state= CRASHED;
    log_offset= cur_offset;
    is_active= false;
  }
 
  cur_offset+= (off_t) written;
 
  /* 
   * Quick safety...if an error occurs above in another writer, the log 
   * file will be in a crashed state.
   */
  if (unlikely(state == CRASHED))
  {
    /* 
     * Reset the log's offset in case we want to produce a decent error message including
     * the original offset where an error occurred.
     */
    log_offset= cur_offset;
    return;
  }
 
  uint32_t checksum= 0;
 
  if (do_checksum)
  {
    checksum= crc32(0L, (unsigned char *) buffer.c_str(), length);
  }
 
  /* We always write in network byte order */
  unsigned char nbo_checksum[4];
  int4store(nbo_checksum, checksum);
 
  /* Write the checksum trailer */
  do
  {
    written= pwrite(log_file, nbo_checksum, sizeof(uint32_t), cur_offset);
  }
  while (written == -1 && errno == EINTR); /* Just retry the write when interrupted by a signal... */
 
  if (unlikely(written != (ssize_t) sizeof(uint32_t)))
  {
    errmsg_printf(ERRMSG_LVL_ERROR, 
                  _("Failed to write full checksum of command.  Tried to write %" PRId64 " bytes at offset %" PRId64 ", but only wrote %" PRId64 " bytes.  Error: %s\n"), 
                  (int64_t) sizeof(uint32_t), 
                  (int64_t) cur_offset,
                  (int64_t) written, 
                  strerror(errno));
    state= CRASHED;
    log_offset= cur_offset;
    is_active= false;
    return;
  }
}

Conclusion, for Now…

I do hope this article has been useful in getting to know how the Drizzle replication system passes the Command message around within a single server. You got to see the replicator and applier plugin APIs and example implementations of those APIs in the form of the FilteredReplicator and CommandLog classes. Next, we’ll be looking in detail at the Command Log itself, its format, and writing/reading from it.

Please do email me (jaypipes@gmail.com) or feel free to post to the Drizzle Discussion list any criticism, suggestions, or requests. I’m eager to hear from you. :)

Drizzle Replication – The Command Message

IMPORTANT:
This article is out of date and the replication API has been updated. Please see the follow-up article for the most up to date information!

I wanted to start writing about how Drizzle’s new replication system works, how its internals are structured, how logs are formatted, what are its (current) limitations, what are planned features, how you can get involved in development, and a lot more. Before jumping in, you may want to read a quick overview about the concepts of Drizzle replication here.

Fortunately, some advice from my friend Edwin DeSouza got me back to reality: “Jay, do a series of small, targeted, easily digestible blog posts”. And, so, this is the first in a series of quick blog posts about Drizzle replication. Today, I’ll start at the very beginning and talk about the basic unit of “currency” in Drizzle’s replication system: the Command Message.

Background

When two servers in a replicated environment must communicate with each other about changes in the state of one server, they must do so by sending raw bytes of information across some port or socket. One server must “package up” information about what changed and send it to the other server. The receiving server must then “unpack” the data and interpret it, subsequently applying the data change to its own schema.

In the image above, all of the arrows represent Command Message objects being passed between components within a single server or across the wire to another server.

This packing and unpacking of raw bytes is fairly complicated, and can get downright confusing when you take into account that different hardware architectures store raw bytes in different orders. Code needed to interpret raw streams of data is complex and, more importantly, makes the resulting code difficult to read. Furthermore, what we want to focus on in development is what to do with the unpacked data messages, not how to pack, unpack, and serialize the darn things.

Google Protobuffers to the Rescue

At Drizzle, we’re all about laziness. We don’t reinvent wheels if there is an excellent library that already does what we need.

It just so happens that a library exists that can solve the low-level problems of packing (serializing), unpacking (deserializing or parsing), and “versioning” formatted binary streams of data so that raw data can change its structure without having to re-implement all new serialization and deserialization routines. Enter Google Protobuffers.

The Google Protobuffers library (just GPB from now on) allows you to create a text file called a “proto” that contains a blueprint of a “message” class that represents some piece of data. This proto file is then consumed by the protoc program, which generates code files in a variety of programming languages. The code files contain class definitions of the message you define in your proto file.

This may all seem a bit strange until I show you exactly how it all works…and I will do so shortly for a C++ example. But first, let’s take a look at the proto file (the blueprint if you will) of the Command message, which stores information about data changes occurring on a server. Here is a section of the proto file, stored in /drizzled/message/replication.proto:

import "table.proto";

package drizzled.message;
option optimize_for = SPEED;


/*
  Context for a transaction.
*/

message TransactionContext
{
  required int32 server_id = 1;
  required uint64 transaction_id = 2; /* Globally-unique transaction ID */
}

/*
  Insert one record into a single table.
*/

message InsertRecord
{
  repeated Table.Field insert_field = 3;
  repeated bytes insert_value = 4;
}


/*
  Update one record in a single table.
*/

message UpdateRecord
{
  repeated Table.Field update_field = 3;
  repeated bytes before_value = 4;
  repeated bytes after_value = 5;
  repeated Table.Field where_field = 6;
  repeated bytes where_value = 7;
}


/*
  Deletes one record in a single table
*/

message DeleteRecord
{
  repeated Table.Field where_field = 3;
  repeated bytes where_value = 4;
}


/*
  A component of a transaction -- a single instruction or command
*/

message Command
{
  enum Type
  {
    START_TRANSACTION = 0;        /* A START TRANSACTION statement */
    COMMIT = 1;                   /* A COMMIT statement */
    ROLLBACK = 2;                 /* A ROLLBACK statement */
    INSERT = 3;                   /* An insert of a single record */
    DELETE = 4;                   /* A delete of a single record */
    UPDATE = 5;                   /* An update of a single record */
    RAW_SQL = 6;                  /* A raw SQL statement */
  }
  required Type type = 1;
  required uint64 timestamp = 2;  /* A nanosecond precision timestamp */

  /*
    Transaction Context is duplicated here so that Commands may
    be sent over the wire separately from the rest of the Commands in
    a transaction.
  */

  required TransactionContext transaction_context = 3;
  optional string schema = 4;     /* The schema affected */
  optional string table = 5;      /* The table affected */

  optional string sql = 6;  /* May contain the actual SQL supplied for the original statement */


  /*
    The below implement the actual change.  Each Command will
    have zero or one of the below sub-messages defined.
  */

  optional InsertRecord      insert_record = 7;
  optional DeleteRecord      delete_record = 8;
  optional UpdateRecord      update_record = 9;
}

The above proto file format defines the members of the Command message and some sub-messages that are pieces of a Command message.

There are exactly three required elements of a Command message:

  • type — an enumeration of the type of Command the message is. Does it insert, update, or delete a record? Does it start, rollback, or commit a transaction? Is it simply a raw SQL statement? The type element answers this.
  • timestamp — a timestamp of when the Command occurred on the server, to nanosecond precision.
  • transaction_context — this is an element which is a sub-message containing elements describing the ID of the server this command occurred on and the global transaction identifier the Command belongs to.

There are a number of optional members of the Command class, including the actual SQL string used in the original statement, and a set of optional sub-message classes that contain specialized data for certain types of commands.

Creating Generated Source Files

There’s probably a lot of readers currently asking themselves (or yelling at their blog reader) what the heck is the big deal about the above…I mean, it looks just like a simple definition of a POD (plain old data) class. Well, the real magic happens when you compile the above proto file into a source code file, and are able to take advantage of the GPB framework. Let’s do that now. We use the protoc program, passing in a couple simple arguments. Here, I’ll generate a source file with code creating C++ classes for the message definitions in our proto file. Drizzle keeps all the proto definition files for its message classes in the /drizzled/message/ subdirectory, so I’ll change to that directory and generate the source files:

jpipes@serialcoder:~/repos/drizzle/trunk$ cd drizzled/message/
jpipes@serialcoder:~/repos/drizzle/trunk/drizzled/message$ protoc --cpp_out=. replication.proto
jpipes@serialcoder:~/repos/drizzle/trunk/drizzled/message$ ls -lah replication.*
-rw-r--r-- 1 jpipes jpipes  95K 2009-08-12 06:48 replication.pb.cc
-rw-r--r-- 1 jpipes jpipes  66K 2009-08-12 06:48 replication.pb.h
-rw-r--r-- 1 jpipes jpipes 3.4K 2009-08-03 10:16 replication.proto

The --cpp_out=. simply directs protoc to generate C++ files, and look for proto files in the current directory. As you can see, protoc generated two files: replication.pb.h and replication.pb.cc.

Here’s the beauty of GPB: there’s absolutely no reason for you to ever look at the code in either of these generated files at all, unless you are just curious as to the style of code that GPB generates. The excellent online documentation for GPB contains all the information you’ll ever need in order to use these generated code files. All C++ classes generated by GPB will always follow an identical interface, so there is no reason to look at the generated files at all. We can always look at our .proto file (which is much simpler and shorter) to determine what methods our generated classes will have for us to use.

I already stated that the advantage of GPB is that serialization and deserialization is already done for you. Let’s look at some code that is in the Drizzle distribution which demonstrates using the Command message.

Constructing Command Messages

Inside Drizzle’s kernel, there is a component called drizzled::ReplicationServices which is responsible for constructing Command messages when data change events occur in the server and then passing these Command messages to replicators that are listening for change events. Here is some code from the drizzled::ReplicationServices::insertRecord() method which constructs a Command message.

#include <drizzled/replication.pb.h>
...
void ReplicationServices::setCommandTransactionContext(message::Command &in_command,
                                                       Session *in_session) const
{
  message::TransactionContext *trx= in_command.mutable_transaction_context();
  trx->set_server_id(in_session->getServerId());
  trx->set_transaction_id(in_session->getTransactionId());

  in_command.set_session_id((uint32_t) in_session->getSessionId());
}
...
void ReplicationServices::insertRecord(Session *in_session, Table *in_table)
{
  ...
  message::Command command;
  command.set_type(message::Command::INSERT);
  command.set_timestamp(in_session->getCurrentTimestamp());

  setCommandTransactionContext(command, in_session);

  const char *schema_name= in_table->getShare()->db.str;
  const char *table_name= in_table->getShare()->table_name.str;

  command.set_schema(schema_name);
  command.set_table(table_name);

  /*
   * Now we construct the specialized InsertRecord command inside
   * the message::Command container...
   */
  message::InsertRecord *change_record= command.mutable_insert_record();

  Field *current_field;
  Field **table_fields= in_table->field;
  String *string_value= new (in_session->mem_root) String(ReplicationServices::DEFAULT_RECORD_SIZE);
  string_value->set_charset(system_charset_info);

  message::Table::Field *current_proto_field;

  /* We will read all the table's fields... */
  in_table->setReadSet();

  while ((current_field= *table_fields++) != NULL)
  {
    current_proto_field= change_record->add_insert_field();
    current_proto_field->set_name(current_field->field_name);
    current_proto_field->set_type(message::Table::Field::VARCHAR); /* @TODO real types! */
    string_value= current_field->val_str(string_value);
    change_record->add_insert_value(string_value->c_ptr());
    string_value->free();
  }

  push(command);
}

The code above shows just how easy it is to construct a Command message containing information about a new record that has been inserted into a schema on the server.

We start by simply creating a message::Command variable on the stack, named “command”. We then set the required member variables of the Command by calling the set_type() and set_timestamp() methods of the Command message and the set_server_id() and set_transaction_id() methods of the TransactionContext submessage of the Command object.

When you create any GPB message class, any scalar member variables (variables that are not other message classes or arrays of things), the class will always have “setter” methods that follow the naming convention set_xxx(), where xxx is the exact name of the field. In the above case, you can clearly see the correlation between the “type” data member and the set_type() method.

But, what about those methods which begin with mutable_ ? Those member methods return a pointer to a data member of a class whose type is another message class. For example, the mutable_transaction_context() method of the Command message returns a pointer to a TransactionContext object that may have its own data members set using setter methods. It is important to note that GPB-generated code follows strict RAII and encapsulation principles, like good C++ libraries always do. This means that the pointers returned by mutable_xxx() methods of a message class are managed by the class itself, and you do not need to manage the memory yourself. When the Command message’s destructor is called, it will clean up any resources it has allocated for sub-messages like the TransactionContext class.

Accessing Command Message data members

OK, so above you see example code which sets the data members of a Command message. Let’s check out some code which accesses the data member of the Command message. Here is a snippet of code from the SubscriberApplier class in the async_replication module (/plugin/async_replication/subscriber_applier.cc):

void SubscriberApplier::apply(const message::Command &to_apply)
{
  string query;
...
  switch (to_apply.type())
  {
...
  case INSERT:
    query.assign("INSERT INTO `");
    query.append(to_apply.schema());
    query.append("`.`");
    query.append(to_apply.table());
    query.append("` (");

    const message::InsertRecord &record= to_apply.insert_record();

    int32_t num_fields= record.insert_field_size();

    int32_t x;
    for (x= 0; x < num_fields; x++)
    {
      if (x != 0)
        query.push_back(',');

      const message::Table::Field f= record.insert_field(x);

      query.push_back('`');
      query.append(f.name());
      query.push_back('`');
    }

    query.append(") VALUES ");

    /*
    * There may be an INSERT VALUES (),() type statement.  We know the
    * number of records is equal to the field_values array size divided
    * by the number of fields.
    *
    * So, we do an inner and an outer loop.  Outer loop is on the number
    * of records and the inner loop on the number of fields.  In this way,
    * we know that record.field_values(outer_loop * num_fields) + inner_loop))
    * always gives us our correct field value.
    */
    int32_t num_records= (record.insert_value_size() / num_fields);
    int32_t y;
    for (x= 0; x < num_records; x++)
    {
      if (x != 0)
        query.push_back(',');

      query.push_back('(');
      for (y= 0; y < num_fields; y++)
      {
        if (y != 0)
          query.push_back(',');

        query.push_back('"');
        query.append(record.insert_value((x * num_fields) + y));
        query.push_back('"');
      }
      query.push_back(')');
    }
    break;
...
  }

  result= dispatch_command(server_command, session, query.c_str(), query.length()) == false;
  ...
}

In the above code sample, you can see that "getter" methods for a Command class fall into a few different categories. For simple, scalar data members or sub-message data members, the getter method is simply the name of the data member itself, and these methods always return const references to their data members. For instance, Command::schema() returns a const std::string& with the the name of the schema and Command::transaction_context() returns a const message::TransactionContext& of the transaction_context message class data member.

For optional data members, there will always be a has_xxx "check for existence" method. So, since the "sql" data member of the Command message is optional, one might do the following:

message::Command &command= get_some_command_object();
if (command.has_sql())
{
  cout << command.sql() << endl;
}

Which would print out the SQL statement of the Command message if the sql data member has been set.

For data members that are of the "repeated" type (for instance, the InsertRecord message's insert_value data member, which is of type repeated string), the number of individual elements in the vector is always accessed with a method called xxx_size(), where xxx is the exact name of the data member. You can see in the example code above calls to insert_record.insert_value_size(), which returns the count of the number of elements in the insert_value data member.

To access specific elements of a repeated data member, simply call the method named for the data member, passing in the index of the element you wish to access.

Conclusion, For Now...

The above is just a quick introduction to the Command message, which is the basic unit of currency in Drizzle's replication system. The next article will show how the Command message is passed through the plugin::CommandReplicator and plugin::CommandApplier interfaces, the basic worker plugins of the Drizzle replication system. After that, we'll be discussing the format of the Command message log.

Questions on anything above? Please do write to the drizzle-discuss@ mailing list or drop my an email to joinfu@sun.com. Cheers.

UPDATE: Read the next article on the CommandReplicator and CommandApplier interfaces.

Towards a New Modular Replication Architecture

Over the past week, I’ve been refactoring the way that the Drizzle kernel communicates with plugin modules that wish to implement functionality related to replication. There are many, many potholes in the current way that row-based replication works in Drizzle, and my refactoring efforts were solely focused on three things:

  • Make an interface for replicating actions which occur inside a server that is clear and simple to understand for the caller or the interface
  • Make an interface that uses only documented data structures and standardized containers.
  • Completely remove the notion that logging is tightly-coupled with replication.

Let me expand on these two goals, and why they are critical to the success of a replication architecture.

Simple, Clear Interfaces Designed for the Interface Caller

I have a very strong belief that interfaces should always be designed from the perspective of the caller of the interface. By focusing on the caller of the interface, you produce interfaces that are inherently more stable and simpler than if you design an interface with the perspective of the consumer of the interface’s objects — or the internal implementation’s perspective. If you design an interface from the perspective of how the internals of a system work, then you invariably end up with interfaces that reflect the internal implementation of something. Interfaces should be generalized, not specific to an implementation. I’ll expand on this with an example in the next section.

Interfaces should be as simple and clear as possible. Why? Because people won’t implement the darn interfaces if they can’t understand them. Simple as that. Drizzle wants a vibrant ecosystem of plugins and modules. If developers have a difficult time understanding an interface, that means we get fewer plugins, and that just won’t do.

Interfaces Must Use Documented Data Structures and Standard Containers

This may seem like a no-brainer, but if you look at the interfaces in MySQL, you will notice that so many of them pass pointers to internal MySQL objects (THD, st_table, TABLE_SHARE, etc). You will also notice that in order to use the interface, implementors must become intimately familiar with the custom iterators (List<Item> List_iterator<>, List_iterator_fast<>, etc) in order to really implement their plugins. Many of these custom objects and containers are poorly or not documented at all. This makes the plugin developer’s life harder. This is why there are so very few plugins written for MySQL that have not been written by MySQL engineers who are familiar with the internal implementation of the server. The harder life is for the plugin developer, the smaller the ecosystem of plugins will be. Period.

Now, why do I say “only standard containers”? Well, how many of you readers realized that List<Item> was not actually a list of Items; that in fact it is a list of Item pointers? Probably very few. By using the STL and its very-well-documented, standardized container classes in the interface, we remove the need of the plugin developer to have to learn Yet-Another-Custom template implementation. Easier for the plugin developer, so faster time-to-market for plugins.

Laying Waste to Tight-Coupling of Replication with “Other Stuff” (Like Logging)

It’s no secret that MySQL’s internal subsystems are very tightly-coupled. Back in 2005, when I wrote the system internals chapter in Pro MySQL, I noted that the internal subsystems of MySQL were interwoven, tightly-coupled, and a bit of a bowl of spaghetti. Very little substantial progress has been made since then on refactoring those systems into separate modules. Replication is no different. In MySQL, the replication subsystem is tightly coupled to a logging subsystem that most readers would be familiar with as “The Binlog”.

Here’s the problem with that: it ties the implementation (the binlog) to the replication interface. Why is this bad? Well, it means that if you want to modularize the replication system, you can’t do so without fiddling with the binlog. And likewise, if you want to improve the binlog, you invariably will be affecting the replication system.

Tight-coupling of implementation to interfaces is disastrous from an architectural point of view. Touching each runs the risk of breaking the other. When these types of architectural faux-pas litter the code, it means that development and debugging life-cycles are extended well past what they should be, because developers working on one system tend to affect other subsystems that shouldn’t really be affected by their changes.

And so, one of the big goals of this refactoring work was to produce interfaces which did not interfere with each other. Changing an implementation shouldn’t break the whole system.

A New, Modular Replication Architecture

OK, enough about the goals of the new replication architecture. Let’s see some diagrams and some code, eh?

In Drizzle, we strive to have a clear separation between the core drizzled kernel and modules which implement functionality. The module communicates with the kernel through an API. The goal of the this API is to shield the internal implementation of the core kernel from plugin/module developers. All plugin developers need to know is what will be passed to them from the kernel (documented data structures) and what to provide the kernel back as output (a return code, for instance). The kernel doesn’t care how the plugin implements something. All it cares about is that the plugin provide methods which match the API’s base class interfaces.

There are a number of documented data structures which are used to in the API’s calling interface. These data structures are all Google Protobuffer (GPB) Message derived classes. Because they are all GPB derived message classes, they all work in the same way: via the Protobuffer Message API, which is well-documented and automatically handles versioning when changes to these message classes are needed. The current list of Message classes are:

  • EventList — describes a list of Events
  • Event — describes a single Event
  • StartTrxEvent — describes as Event which delineates the start of a transaction
  • EndTrxEvent — describes as Event which delineates the end of a transaction
  • InsertRecordEvent — describes a specialized type of Event which occurs when a new record in a table is inserted
  • DeleteRecordEvent — describes a specialized type of Event which occurs when a record is removed from a table
  • UpdateRecordEvent — describes a specialized type of Event which occurs when a record in a table is modified

More Event specialty classes will be added as needed, but they will all follow the same interface as each other.

In addition to the above data classes, for a replication architecture, what are the basic “worker classes” that plugins may implement? For the first phase of this new replication system, I have three basic class interfaces:

  • EventReplicator — Responsible for replicating an EventList to an EventApplier
    • The main interface method is bool EventReplicator::replicateEvents(EventApplier *, const EventList&)
  • EventApplier — Responsible for consuming an EventList and applying it to a target
    • The main interface method is bool EventApplier::applyEvents(const EventList&)
  • EventReader — Responsible for reading an EventList from some source and passing it to an EventApplier
    • The main interface method is bool EventReader::getEvents(EventList *);

But you might be thinking… Jay, where’s the Binlog?

There isn’t one in the interface. That’s the whole point. A “binlog” is an implementation, not an interface. Need a more concrete vision of how a module would use such an interface? OK, let’s imagine an example module, which we’ll call “RecoveryModule”. The purpose of this module would be to provide backup and restore capabilities to Drizzle. Let’s see how the interface to the Drizzle kernel would be implemented in the module…

The RecoveryModule Implementation Overview

As noted, the purpose of this module is clear: to provide backup and restore functionality for Drizzle. We’ll implement this functionality using a serialized transaction log. Here are the classes which we will map out for the module:

  • SynchronousReplicator — extends EventReplicator and is responsible for replicating event lists passed from the kernel to any registered EventAppliers in a synchronous, transactional way.
  • TrxLogWriter — extends EventApplier and is responsible for writing events to a serial transaction log.
  • TrxLogReader — extends EventReader and is responsible for reading events from the transaction log.
  • TrxLogApplier — extends EventApplier and is responsible for applying events it reads (via the EventReader) from the transaction log.

As you can see, all of the classes above extend (or more explicitly, inherit from a publicly defined interface class in /drizzled/plugin/) a base interface class. There is a clear separation of duties. Replicators replicate events. Appliers apply events. Readers read events. No mixing of one into the other. This way, a developer can work on the writing of records to a transaction log (the TrxLogWriter) and another developer can work on the implementation of applying those records during a recovery phase (TrxLogApplier). The two developers don’t need to know anything about how the other class is implemented. The simply implement the interface of their class. In fact, the TrxLogWriter implementation might be in an entirely different module! The developer of the TrxLogApplier doesn’t care, and isn’t affected by that. All she needs to know is the interface of an EventReader and the EventApplier, and the details of the Event data structures (PODs). That’s it. Clean separation.

Here is a graphical overview of how such a module might look, with a separation of the components of the module, the plugin interface API, and the drizzled kernel.

You will note that I also have put in a couple other classes in the module, called SqlLogWriter, which inherits from EventApplier, and SqlLogReader, which inherits from EventReader. This is to show that the module could implement the MySQL General Log easily using the existing API. A general log is merely an implementation detail. The module developer can create a general log implementation just by specializing the EventApplier class to write, for instance, raw SQL records…

Also note the AsynchReplicator class… Mark Callaghan, you could implement a SemiSynchReplicator if you wanted. :)

Flow of Events from Drizzle Kernel to Module

So, how would the flow of events happen from the kernel to a module and back? I’ve put together a quick diagram of how the API would be used to communicate between the kernel and the module. You can see some API method calls (such as replicateEvents() or registerApplier(). These method names are of course still up for debate, but I think they adequately communicate what the API would look like and the purpose of the interface calls…

OK, so in my next blog post I’ll show some example code from my local branch which implements a module similar to the above. I’ll end this one now because for the next blog post I need to get some code highlighting done to make it easier to read…

Now Is the Time to Influence the Shape of the API

Robert Hodges, I’m lookin’ at you, kid! ;)

In all seriousness, this is the time when we are defining the interfaces between the kernel and the modules/plugins. If you have suggestions, want something included in the API, or in general want to tell me I’m full of shit, then join us on the Drizzle Discussion mailing list and let your thoughts be heard! :)

Libdrizzle Benchmarks – Massive Performance Increases

Last night and today, I ran a series of benchmarks against Drizzle. These benchmarks were designed to isolate the performance improvement or regression from one change: using Eric Day‘s new libdrizzle client library instead of the legacy libdrizzleclient library from MySQL. The results are in, and they are stunning.

Here is a graph showing the difference between Drizzle sysbench on a readonly workload with the only difference being sysbench using the libdrizzle driver versus using the libdrizzleclient (libmysql) driver for sysbench:

As you can see, with libdrizzle, the throughput is dramatically increased, with Drizzle scaling to 8x the number of cores on the benchmark machine before a drop-off in throughput is seen. The grey vertical line in the chart is the number of cores on the machine used.

The performance improvement varies from 2.80% improvement at 16 concurrent threads to 29.95% improvement at 128 threads and a whopping 41.16% increase at a mere 2 threads. Overall, there was an average performance increase across all concurrency levels of 16.95%.

For a read-write workload, here are the results:

Again, we see a massive increase in throughput at all concurrency levels. This time, the performance increase was pretty consistent across the concurrency levels, and we don’t see a significant drop-off in throughput until 4x the number of processors. Across all concurrency levels, the performance increase of libdrizzle was 17.62%.

What this means for MySQL

As you may or may not be aware, libdrizzle actually is able to communicate with both Drizzle and MySQL. Looking for an easy way to dramatically increase the performance and throughput of your application without tinkering with much stuff? Then get involved in the libdrizzle movement, download one of the connectors and join us on Freenode #drizzle to discuss how you can use and help us test the incredible new asynchronous, non-blocking communication library from Eric Day.

What this means for Drizzle

It should be stressed that the above performance improvements are solely due to a replacement of the client library. However, Eric is now pulling apart the Protocol class in the main Drizzle server to take advantage of the asynchronous and non-blocking nature of libdrizzle. This means that you can expect to see some pretty good performance improvements on the server side as well as Eric continues his great work.

Please give Eric a bunch of kudos if you see him online and go check out his talks at the MySQL and Percona conferences in a couple weeks.

Small but steady progress in improving Drizzle performance

We’re making steady progress in removing bottlenecks in the Drizzle code base. So far, a number of mutexes have been removed and we’ve begin to replace a number of contention points with atomic instructions which remove the need for a lock structure on platforms which support atomic fetch and store instructions.

I’m pretty positive about the direction we are going so far. We’re seeing the right trends in our scaling graphs, with very little performance drop off in read-only workloads up to 4X the number of cores on the machine, and little performance drop off on the read-write workloads up to 2X the number of cores, as you can see from the graphs below.

It’s a little difficult to see, but we’ve made a small but steady improvement from r950 to r968, with numbers increasing around 1-2% across most concurrency levels. You can see the raw numbers here:

+--------------------------------+-------+-----+---------+----------+
| config_name                    | revno | c   | tps     | rwrps    |
+--------------------------------+-------+-----+---------+----------+
| drizzle_innodb_readonly_1000K  |   950 |   2 |  710.67 |  9949.34 |
| drizzle_innodb_readonly_1000K  |   950 |   4 | 1163.77 | 16292.81 |
| drizzle_innodb_readonly_1000K  |   950 |   8 | 1692.59 | 23696.29 |
| drizzle_innodb_readonly_1000K  |   950 |  16 | 2470.31 | 34584.39 |
| drizzle_innodb_readonly_1000K  |   950 |  32 | 3104.98 | 43469.73 |
| drizzle_innodb_readonly_1000K  |   950 |  64 | 3376.98 | 47277.73 |
| drizzle_innodb_readonly_1000K  |   950 | 128 | 2986.91 | 41816.74 |
| drizzle_innodb_readonly_1000K  |   950 | 256 | 2657.54 | 37205.47 |
| drizzle_innodb_readonly_1000K  |   968 |   2 |  712.73 |  9978.25 |
| drizzle_innodb_readonly_1000K  |   968 |   4 | 1081.72 | 15144.10 |
| drizzle_innodb_readonly_1000K  |   968 |   8 | 1714.77 | 24006.77 |
| drizzle_innodb_readonly_1000K  |   968 |  16 | 2480.48 | 34726.77 |
| drizzle_innodb_readonly_1000K  |   968 |  32 | 3140.16 | 43962.29 |
| drizzle_innodb_readonly_1000K  |   968 |  64 | 3394.03 | 47516.32 |
| drizzle_innodb_readonly_1000K  |   968 | 128 | 3008.74 | 42122.30 |
| drizzle_innodb_readonly_1000K  |   968 | 256 | 2676.62 | 37472.65 |
| drizzle_innodb_readwrite_1000K |   950 |   2 |  438.04 |  8322.77 |
| drizzle_innodb_readwrite_1000K |   950 |   4 |  720.68 | 13692.98 |
| drizzle_innodb_readwrite_1000K |   950 |   8 | 1068.65 | 20304.39 |
| drizzle_innodb_readwrite_1000K |   950 |  16 | 1454.71 | 27639.47 |
| drizzle_innodb_readwrite_1000K |   950 |  32 | 1699.74 | 32295.02 |
| drizzle_innodb_readwrite_1000K |   950 |  64 | 1506.04 | 28614.71 |
| drizzle_innodb_readwrite_1000K |   950 | 128 | 1341.46 | 25487.69 |
| drizzle_innodb_readwrite_1000K |   950 | 256 | 1157.95 | 22001.16 |
| drizzle_innodb_readwrite_1000K |   968 |   2 |  444.10 |  8437.81 |
| drizzle_innodb_readwrite_1000K |   968 |   4 |  700.45 | 13308.53 |
| drizzle_innodb_readwrite_1000K |   968 |   8 | 1075.59 | 20436.14 |
| drizzle_innodb_readwrite_1000K |   968 |  16 | 1457.83 | 27698.76 |
| drizzle_innodb_readwrite_1000K |   968 |  32 | 1732.04 | 32908.82 |
| drizzle_innodb_readwrite_1000K |   968 |  64 | 1506.98 | 28632.61 |
| drizzle_innodb_readwrite_1000K |   968 | 128 | 1355.17 | 25748.31 |
| drizzle_innodb_readwrite_1000K |   968 | 256 | 1157.29 | 21988.59 |
+--------------------------------+-------+-----+---------+----------+

Drizzle is all about open and transparent, and if you want to know how these numbers are generated, feel free to download the drizzle-automation project, a Python benchmarking and code coverage utility I’ve written over the past couple weeks. We use standard sysbench modified to use our client driver, with the following configuration in our /etc/drizzle-automation/bench.cnf file:

[defaults]

# Number of iterations the benchmark process should do for each level
# of concurrency
iterations= 5

# Comma-separate list of concurrency levels to run
concurrency_levels= 2,4,8,16,32,64,128,256

# Options passed as-as to make when building server
make_options= -j32

# Options passed as-is to configure when building server
configure_options= 

[drizzle_innodb_readonly_1000K]

# The program used to run the benchmark.  If you use paths, they
# will be relative to the sandbox directory in which the run is
# being processed
bench_cmd= sysbench

# Options given to the benchmark program on every iteration
bench_options= --max-time=60 --max-requests=0 --test=oltp \
--drizzle-db=test --drizzle-port=4427 --drizzle-host=127.0.0.1 \
--drizzle-user=root --db-ps-mode=disable --db-driver=drizzleclient \
--drizzle-table-engine=innodb --oltp-read-only=on --oltp-table-size=1000000

# The program used to start the server instance.  If you use paths, they
# will be relative to the sandbox directory in which the run is
# being processed
server_cmd= ./drizzled/drizzled

# Options passed to the server on startup
server_options= --port=4427 --datadir=/tmp --innodb-buffer-pool=4G \
--key-buffer-size=64M --scheduler=multi_thread --innodb_log_buffer_size=512M \
 --innodb_additional_mem_pool_size=120M --table_open_cache=4096 &

[drizzle_innodb_readwrite_1000K]

# The program used to run the benchmark.  If you use paths, they
# will be relative to the sandbox directory in which the run is
# being processed
bench_cmd= sysbench

# Options given to the benchmark program on every iteration
bench_options= --max-time=60 --max-requests=0 --test=oltp --drizzle-db=test \
--drizzle-port=4427 --drizzle-host=127.0.0.1 --drizzle-user=root --db-ps-mode=disable \
--db-driver=drizzleclient --drizzle-table-engine=innodb --oltp-read-only=off \
--oltp-table-size=1000000

# The program used to start the server instance.  If you use paths, they
# will be relative to the sandbox directory in which the run is
# being processed
server_cmd= ./drizzled/drizzled

# Options passed to the server on startup
server_options= --port=4427 --datadir=/tmp --innodb-buffer-pool=4G \
--key-buffer-size=64M --scheduler=multi_thread --innodb_log_buffer_size=512M \
--innodb_additional_mem_pool_size=120M --table_open_cache=4096 &

The machine we run the benchmarks on is an Intel Xeon QuadCore, and all tests fit into the available RAM on the box.

These benchmarks, which we can run against a single revision or a range of revisions, are invaluable in allowing us to pinpoint the specific cause of a performance regression. Feel free to check out the automation work, contribute to the project, or suggest improvements to it on the Drizzle Discuss mailing list.

LCOV Code Coverage Pages for Drizzle

Yesterday, Monty and I were fussing around with lcov and genhtml trying to generate code coverage analysis for Drizzle. After a few hours, I was finally able to get some good output, and I’ve published the results temporarily on my website.

We’re currently at 70.4% code coverage which is less-than-ideal, but at least we now have a baseline from which to improve. We’re all about making incremental improvements, and having statistics to tell us whether we’re going in the right direction is important. This is a good first step.

So, what exactly do these code coverage numbers mean?

OK, so for those readers not familiar with gcov or lcov, here is what these code coverage numbers actually mean… They represent the percentage of executable source lines which are executed during a run of Drizzle’s test suite. Basically, the percent gives us a rough idea of the percent of our code paths which are being verified by the test suite.

What we do is have GCov produce coverage data during compile for each source code file. We then run our test suite, and then have LCov collect (capture) the source code lines in each file was executed during the run of the test suite. We then call on the genhtml utility to produce pretty HTML pages from the “info files” generated from GCov.

Automating LCov Reports

The goal is to have LCov reports run for each revision in trunk, and have the output automatically rsynced up to the Drizzle.org server. I’m working with Mike Shadle from Intel and Monty on getting this done, along with Doxygen output, and hope to get the automated test coverage reports done by the end of the week.

Part of the automation is producing a text-based report which I can then store in a database and keep historical data about our progress in improving test coverage. Of course, those historical graphs will also be published. :)

How Can You Help Improve Drizzle’s Code Coverage?

Actually, this is a great starter task for new contributors to get involved in Drizzle development. The way to improve code coverage percentages is to analyze the blocks of code which are not covered in the following way:

  1. Determine if the code block is dead code
  2. If not dead code, come up with an addition to an existing or new test case which will ensure the code block is executed
  3. If it is dead code, remove the code from the tree after determining why the code is dead

Some more details on the above steps will come in another blog post later today… OK, gotta get back to work. Feel free to peruse the LCov HTML reports. There are some interesting things buried in there. :)

A Better Parser Needed?

Taking a little break from refactoring temporal data handling this evening, I decided to run some profiles against both Drizzle and MySQL 5.1.33. I profiled the two servers with callgrind (a valgrind tool/skin) while running the drizzleslap/mysqlslap test case. In both cases, I had to make a small change to the drizzled/tests/test-run.pl Perl script.

I then used KCacheGrind's excellent graphical interface to look at where the server's were spending their time. Now, as much as certain folks would like me to compare MySQL and Drizzle, I'm not interested in that. In fact, I'm actually going to focus on an area in both servers that I think could use a little TLC.

For the MySQL build, I used the BUILD/compile-amd64-debug-max build script. For Drizzle, I used my standard build process which builds Drizzle with maximum debugging symbols and hooks. It’s worth noting that the debug and build process for MySQL and Drizzle are very different, and the MySQL debug build contains hooks to the DBUG library, which you’ll notice appear on the MySQL call graphs and consume a lot of the overall function calls. You won’t see this in the Drizzle graphs because we do not use DBUG. For all intents and purposes, just ignore the calls to anything in the DBUG library in the MySQL graphs since in a non-debug build all that stuff is NOOPed out…

FYI, the drizzleslap/mysqlslap test case is a decent one to run profiling against because it tests a range of different SQL statements in a concurrent environment, something you won’t really see in the other tests cases. This is the reason I like using it when profiling with valgrind/callgrind/oprofile…

OK, so on to the graphs…

Drizzle Callgrind Profile — The Function Calls


Here is the top portion of the function call tree for the drizzled server over the time of the drizzleslap test run. I ordered the result by the percentage of total execution time that was spent in each function call. At the top, you’ll see that the get_text function used by the Lex_input_stream class and the InnoDB log_group_write_buf() function are the two top execution time consumers.

For the log_group_write_buf() function (defined in /storage/innobase/log/log0log.c) , this makes sense: Drizzle’s default storage engine is InnoDB (yes, even for the INFORMATION_SCHEMA) and therefore you’ll notice this function, which is responsible for writing to one of InnoDB’s log group files.

More interestingly, there is a fairly alarming 6.45% of the total execution time, for the get_text() static function (defined in drizzled/sql_lex.cc). If we look at this routine, it’s fairly clear what the function does — it simply reads an unescaped text literal, without quotes. Nothing particularly fancy, although this comment above the routine might be telling: “Return an unescaped text literal without quotes. Fix sometimes to do only one scan of the string”

Perhaps it’s time to look into that single-scan thing…or perhaps not.

The next “biggie” is the my_mb_wc_utf8mb4() routine, which is called an astonishing 20,779,971 times over the course of the execution of approximately 17,000 statements. How’d I come up with 17,000 statements? I just looked at the number of times DRIZZLEparse() is called, which is close enough…

my_mb_wc_utf8mb4() is called from a number of places, most notably from within the parser and lexer, and when converting to various string classes and primitives. It’s really scary that this function is called so many times! Why? Well, one reason is that Drizzle does not support any other character set than UTF8 full 4-byte. Although we support many collations, we don’t support the myriad character sets that MySQL does. So, it kind of makes sense that UTF8 routines would be called quite a bit. But…20M executions for 17K statements seems like there is an obvious inefficiency here. :)

This leads nicely to MySQL. If Drizzle is spending so much time in UTF8 routines, and MySQL doesn’t by default use UTF8 as it’s character set (and MySQL 5.1.33 doesn’t support 4-byte UTF8, it would make sense that MySQL would NOT be spending nearly as much time executing character set conversion routines, right? Well, not so much. :(

MySQL Callgrind Profile — The Function Calls


Remember when looking at the MySQL call graph to disregard the DBUG library calls (look for dbug.c in the source file column) since these would be optimized away in a production environment….

What we notice in the MySQL calls is that pthread_getspecific is the number one execution time consumer, followed by a few of those DBUG library calls. I have a suspicion that the pthread_specific calls are actually related to the DBUG library calls, which track debugging information in the threads. I might be wrong about this but given the stark difference between pthread_specific()‘s top spot in MySQL’s call graph and it not appearing in Drizzle’s graph, it makes sense that this is related to the DBUG library. So, I’ll ignore it for now. :)

So, after those, you’ll notice a bunch of calls to memcpy() and the number of calls to memcpy() very closely matches the number of calls from the Drizzle graph. This makes sense. Drizzle’s mechanism for transporting data across the wire and for translating record formats between the database kernel and the storage engine has not yet changed much, and this is where much of the calls to memcpy() are coming from. (this will change with ValueObjects, BTW, but more on that later…)

After memcpy(), though, if you scan the function call list, you’ll notice that MySQL, even with UTF8 not as the default character set, there is still a whole lot of calls, just like in Drizzle, to various character set routines — notably my_uni_utf8(), my_uft8_uni(), copy_and_convert(), my_mb_wc_latin1(), my_ismbchar_utf8(), and so on.

It turns out that if you add up all the character set conversion and comparison routine executions in both MySQL and Drizzle, that all those function calls are taking up more than 12% of the total execution time for both Drizzle and MySQL!

Call Trees Seem to Blame the Parser

I’m not going to go on too much further, as it’s getting late and I’m tired, but I’m putting the call trees for Drizzle and MySQL for the profiling runs below. I think it’s fairly clear that the parser is eating up a large chunk of execution time. Perhaps it’s time to look into prototyping and benchmarking other parsers, or at the very least, looking into streamlining the existing parser to be more efficient when it comes to character set routines… :)

Feel free to click on the images below for the fullsize versions.

Cheers!
-jay

Here is the Drizzle call tree:

And here is the MySQL one:

Drizzle, MySQL, and the mess that is Dates and SQL_MODE

The frustration builds.

I have come to despise MySQL’s sql_mode. It is a hack of the most gargantuan proportions.

Basically, the optimizer just ignores the sql_mode whenever it is convenient for it to do so. More importantly, the optimizer silently ignores bad datetime input in various places. The reason for this is because of my statement above: sql_mode is a big ole’ hack. Instead of fixing the runtime executor in MySQL to use real ValueObject types — that are immutable and know how to convert (and not convert) between each other, the runtime is a mess of checks for various runtime codes, warning modes, “count_cuted_field” crap and other miscellany that obfuscates the executor pipeline almost beyond recognition.

Slowly, I am attacking the mess, but the executor is so fragile that even tiny changes can wreak havoc on the system, so the going is slow and painful. It’s no wonder that the release cycle for the MySQL server is so bloody long…

To give you an example of the symptoms which manifest themselves because of this hack, I present the following, executed in MySQL 5.0.67:

mysql> create table t1(f1 int, f2 date);
Query OK, 0 rows affected (0.03 sec)

mysql> insert into t1 values(1,'2005-01-01'),(2,'2005-09-01'),(3,'2005-09-30'),   (4,'2005-10-01'),(5,'2005-12-30');
Query OK, 5 rows affected (0.04 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select @@sql_mode;
+------------+
| @@sql_mode |
+------------+
|            |
+------------+
1 row in set (0.00 sec)

mysql> select * from t1 where f2 <= '2005-09-31' order by f2;
+------+------------+
| f1   | f2         |
+------+------------+
|    1 | 2005-01-01 |
|    2 | 2005-09-01 |
|    3 | 2005-09-30 |
+------+------------+
3 rows in set (0.00 sec)

OK, so far, everything is as "expected" when running in a non-strict mode. Even though "2005-09-31" is clearly an invalid date, MySQL in non-strict mode is expected to ignore such triviality and, we can assume, "corrects" the bad date to something else (maybe an integer representation of "2005-09-30"?)

The MySQL manual is fairly clear about sql_mode=STRICT_ALL_TABLES and bad datetime values:

Strict mode controls how MySQL handles input values that are invalid or missing. A value can be invalid for several reasons. For example, it might have the wrong data type for the column, or it might be out of range.

So, we should expect the above statement to throw an error in STRICT_ALL_TABLES, right? Wrong:

mysql> set @@sql_mode="STRICT_ALL_TABLES";
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1 where f2 >= '0000-00-00' order by f2;
+------+------------+
| f1   | f2         |
+------+------------+
|    1 | 2005-01-01 |
|    2 | 2005-09-01 |
|    3 | 2005-09-30 |
|    4 | 2005-10-01 |
|    5 | 2005-12-30 |
+------+------------+
5 rows in set (0.00 sec)

mysql> select * from t1 where f2 > '2005-09-99' order by f2;
+------+------------+
| f1   | f2         |
+------+------------+
|    4 | 2005-10-01 |
|    5 | 2005-12-30 |
+------+------------+
2 rows in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-------------------------------------------------------------+
| Level   | Code | Message                                                     |
+---------+------+-------------------------------------------------------------+
| Warning | 1292 | Incorrect date value: '2005-09-99' for column 'f2' at row 1 |
+---------+------+-------------------------------------------------------------+
1 row in set (0.00 sec)

MySQL accepts the bad date, even in STRICT_ALL_TABLES mode, and does the same implicit conversion, giving us a warning when not the zero date but not telling us what conversion was done. Clearly, a conversion happened because MySQL winnowed the result set. But we have no way to know what the winnowing condition was on.

Let's take this even further. The docs suggest that STRICT_ALL_TABLES must be coupled with NO_ZERO_IN_DATE if real validation on dates is to occur:

Strict mode disallows invalid date values such as '2004-04-31'. It does not disallow dates with zero month or day parts such as '2004-04-00' or “zero” dates. To disallow these as well, enable the NO_ZERO_IN_DATE and NO_ZERO_DATE SQL modes in addition to strict mode.

So, maybe with "STRICT_ALL_TABLES, NO_ZERO_IN_DATE" sql_mode, we'll actually get an error...

mysql> set @@sql_mode="STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE";
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t1 where f2 > '2005-09-99' order by f2;
+------+------------+
| f1   | f2         |
+------+------------+
|    4 | 2005-10-01 |
|    5 | 2005-12-30 |
+------+------------+
2 rows in set, 1 warning (0.00 sec)

mysql> select * from t1 where f2 >= '0000-00-00' order by f2;
+------+------------+
| f1   | f2         |
+------+------------+
|    1 | 2005-01-01 |
|    2 | 2005-09-01 |
|    3 | 2005-09-30 |
|    4 | 2005-10-01 |
|    5 | 2005-12-30 |
+------+------------+
5 rows in set, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+-------------------------------------------------------------+
| Level   | Code | Message                                                     |
+---------+------+-------------------------------------------------------------+
| Warning | 1292 | Incorrect date value: '0000-00-00' for column 'f2' at row 1 |
+---------+------+-------------------------------------------------------------+
1 row in set (0.00 sec)

Well, the above ridiculousness is the symptom of a hacked-up runtime engine switching on modes, return codes, and warning levels all over the place. The danger in the above is that we don't actually know how the silent truncation/conversion has affected our query results. What if the query results are not what we intended? A warning on bad input is simply NOT enough.

Want to see another symptom of this kind of implicit truncation/conversion that is rampant in the runtime? Take a look at the following in MySQL:

mysql> create table t1(f1 int, f2 date not null);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t1 values(1,'2005-01-01'),(2,'2005-09-01'),(3,'2005-09-30'), (4,'2005-10-01'),(5,'2005-12-30');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from t1 where f2 is null;
Empty set (0.00 sec)

mysql> explain extended select * from t1 where f2 is null;
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL |    5 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set, 1 warning (0.02 sec)

mysql> show warnings;
+-------+------+--------------------------------------------------------------------------------------------------------+
| Level | Code | Message                                                                                                |
+-------+------+--------------------------------------------------------------------------------------------------------+
| Note  | 1003 | select `test`.`t1`.`f1` AS `f1`,`test`.`t1`.`f2` AS `f2` from `test`.`t1` where (`test`.`t1`.`f2` = 0) |
+-------+------+--------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

If you take a look at the above query, you will notice that the WHERE condition is asking for rows in which f2 IS NULL. However, f2 is defined as DATE NOT NULL.

POP QUIZ: Why doesn't the optimizer display a Impossible WHERE in the Extra column and optimize away the query?

ANSWER: Because of all those silly silent conversions. MySQL converts "IS NULL" to "= 0" because everything is designed around comparisons to "the zero date". So, even if you have STRICT_ALL_TABLES, NO_ZERO_DATE, NO_ZERO_IN_DATE mode on, MySQL is still using the zero date behind the scenes.

Here is the relevant code in sql_select.cc:

      /* fix to replace 'NULL' dates with '0' (shreeve@uci.edu) */
      else if (
          ((field->type() == DRIZZLE_TYPE_DATE) || (field->type() == DRIZZLE_TYPE_DATETIME))
          && (field->flags & NOT_NULL_FLAG)
          && ! field->table->maybe_null)
      {
        COND *new_cond;
        if ((new_cond= new Item_func_eq(args[0],new Item_int("0", 0, 2))))
        {
          cond= new_cond;
          /*
            Item_func_eq can't be fixed after creation so we do not check
            cond->fixed, also it do not need tables so we use 0 as second
            argument.
          */
          cond->fix_fields(session, &cond);
        }
      }

Suffice it to say, this "fix" is gone now from Drizzle, and we're back to producing the appropriate plan:

drizzle> use test
Database changed
drizzle> create table t1(f1 int, f2 date not null);
Query OK, 0 rows affected (0.02 sec)

drizzle> insert into t1 values(1,'2005-01-01'),(2,'2005-09-01'),(3,'2005-09-30'), (4,'2005-10-01'),(5,'2005-12-30');
Query OK, 5 rows affected (0.01 sec)
Records: 5  Duplicates: 0  Warnings: 0

drizzle> select * from t1 where f2 is null;
Empty set (0.00 sec)

drizzle> explain extended select * from t1 where f2 is null;
+----+-------------+-------+------+---------------+------+---------+------+------+----------+------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra            |
+----+-------------+-------+------+---------------+------+---------+------+------+----------+------------------+
|  1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL |     NULL | Impossible WHERE |
+----+-------------+-------+------+---------------+------+---------+------+------+----------+------------------+
1 row in set, 1 warning (0.00 sec)

drizzle> show warnings;
+-------+------+-----------------------------------------------------------------------------------+
| Level | Code | Message                                                                           |
+-------+------+-----------------------------------------------------------------------------------+
| Note  | 1003 | select `test`.`t1`.`f1` AS `f1`,`test`.`t1`.`f2` AS `f2` from `test`.`t1` where 0 |
+-------+------+-----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

The deeper I dig, the more scared I become.

Where Drizzle Is Succeeding

Over the past few weeks, I have been happy working on Drizzle. Why have I been happy? Is it because of some new incredible code that will revolutionize the database industry? Nope. Is it because we’ve been able to remove all the issues that plague the server core? Nope. Is it because I see Drizzle quickly morphing into a modular, standard-conforming super-kernel? Nope.

So, why am I joyous?

To paraphrase the late Charlton Heston: “[Drizzle] is people!

Recently, I’ve seen the fruit that transparent, open source development bears. This fruit takes the form of engaged, motivated, and humble individuals who wish to make their mark on a project.

Whether it’s on IRC on #drizzle, the drizzle-discuss mailing list (now with 354 active members), or via the platform which Launchpad.net provides our community, I’ve seen new developers scrambling to pick up blueprint tasks, tackle bugs (minor and major), and stamp their footprint on the code base.

With each new face comes an entirely new perspective, a new angle, a different set of skills and experience. And I’m taking the time to chat and learn with each of them. It’s a humbling experience for me, as I learn from each person who visits the ever-growing IRC channel and mailing list. It doesn’t matter if it’s the sage advice of folks like MySQL’s Mats Kindahl, Bernt Johnsen and Roy Lyseng, or database veteran Jim Starkey. It doesn’t matter if the new face is a college student wishing to help in any small way they can. Everyone makes a difference in their own way.

So, just like Monty Widenius says about his new company, that all employees will share in the profit, so is the case with Drizzle, and truly open source development projects. Those who contribute share in the project’s success and stamp their mark, forever, on its direction and shape. It is this fact that propels me in coding, and gives me joy when I log into IRC in the mornings.

Just thought I would share that happiness. Cheers.