Happiness is a Warm Cloud

Although a few folks knew where I and much of the Sun Drizzle team had ended up, we’ve waited until today to “officially” tell folks what’s up. We — Monty Taylor, Eric Day, Stewart Smith, Lee Bieber, and myself — are all now “Rackers”, working at Rackspace Cloud. And yep, we’re still workin’ on Drizzle. That’s the short story. Read on for the longer one :)

An Interesting Almost 3 Years at MySQL

I left my previous position of Community Relations Manager at MySQL to begin working on Brian Aker’s newfangled Drizzle project in October 2008.

Many people at MySQL still think that I abandoned MySQL when I did so. I did not. I merely had gotten frustrated with the slow pace of change in the MySQL engineering department and its resistance to transparency. Sure, over the 3 years I was at MySQL, the engineering department opened up a bit, but it was far from the ideal level of transparency I had hoped to inspire when I joined MySQL.

For almost 3 years, I had sent numerous emails to the MySQL internal email discussion lists asking the engineering and marketing departments, both headed by Zack Urlocker, to recognize the importance and necessity of major refactoring of the MySQL kernel, and the need to modularize the kernel or risk having more modular databases overtake MySQL as the key web infrastructure database. The focus was always on the short term, on keeping up with the Joneses as far as features went, and I railed against this kind of roadmap, instead pushing the idea of breaking up the server into modules that could be blackboxed and developed independently of the kernel. My ideas were met with mostly kind responses, but nothing ever materialized as far as major refactoring efforts were concerned.

I remember Jim Winstead casually responding to one of my emails, “Congratulations, you’ve just reinvented Apache 2.0”. And, yes, Jim, that was kind of the point…

The MySQL source code base had gotten increasingly unmaintainable over the years, and key engineers were extremely resistant to changing the internals of MySQL and modernizing it. There were some good reasons for being resistant, and some poor ones (such as “this is the way we’ve always done it”). Still, it’s tough to question the short-term strategy that Zack, Marten Mickos, and others pursued. After all, they managed to maneuver MySQL into a winning position that Sun Microsystems thought was worth one billion dollars. It’s hard to argue with that. :|

Working on Drizzle since October 2008 (officially)

I’m not the kind of person who likes to wait years to see change, and so the Drizzle project interested me: it was not concerned with backwards compatibility with MySQL, it wasn’t concerned with having a roadmap dependent on the whims of a few big customers, and it was very much interested in challenging the assumptions built into a 20-year-old code base. This is a project I could sink my teeth into. And I did.

Many folks have said that the only reason Drizzle is still around is because Sun continued to pay for a number of engineers to work on Drizzle as “an experiment of sorts” and that Drizzle has no customers and therefore nothing to lose and everything to gain. This was true, no doubt about it. At Sun CTO Labs, the few of us did have the ability to code on Drizzle without the pressure-cooker of product marketing and sales demands. We were lucky.

4… 6… 9… 10 Months in Purgatory

So, around rolls April 2009. The stock market and worldwide economy had collapsed and recession was in the air. There’s one thing that is absolutely certain in recession economies: companies that have poor leadership and direction and are beholden to the interests of a large stockholder will seek an end to their misery through acquisition by a larger, stronger firm.

And Sun Microsystems was no different. JAVA stock plummeted to two dollars a share, and Jonathan Schwartz and the Sun board began shopping Sun around to the highest bidder. IBM was courted along with other tech giants. So was Oracle.

And it was with a bit of a hangover that I awoke at the MySQL conference in April 2009 to the news that Oracle had purchased Sun Microsystems. Joy. We’d just gone through 14 months of ongoing integration with Sun Microsystems and now it was going to start all over again.

Anyone who follows PlanetMySQL knows about the ensuing battle before the European Commission over whether Oracle’s acquisition of MySQL would give it a monopoly in the database market. Monty Widenius, Eben Moglen, even Richard Stallman weighed in on the pros and cons of Oracle’s impending control over MySQL.

All the while, we Sun Microsystems employees had to hold our tongues and try to keep our jobs as Sun laid off thousands more workers while the EC battle ensued. Not fun. It was the employment equivalent of purgatory. And the time just dragged on, with many employees, including myself and the Sun Drizzle team, not having a clue as to what would happen to us. Management was completely silent about future plans. Oracle made zero attempts to outline its future strategy regarding software, and so most software employees simply kept on doing their work, not knowing whether the pink slip was arriving tomorrow. Lots of fun that was.

Oracle Doesn’t Need Our Services — Larry Don’t Need No Stinkin’ Cloud

The acquisition finally closed and very shortly afterwards, I got a call from my boss, Lee Bieber, that Oracle wouldn’t be needing our services. Monty, Eric, and Stewart had already resigned; none of them had any desire to work for Oracle. Lee and I had decided to see what Oracle had in mind for us. Apparently, not much.

Larry Ellison has gone on record that the whole “cloud thing” is faddish. I don’t know whether or not Larry understands that cloud computing and infrastructure-as-a-service, platform-as-a-service, and database-as-a-service will eventually put his beloved Oracle cash cow in its place. I don’t know whether Oracle is planning on embracing the cloud environments which will continue to eat up the market share of the more traditional in-house environments upon which its revenue streams depend. I really don’t.

But what I do know is that Rackspace is betting that providing these services is what the future of technology will be about.

Happiness is a Warm Cloud

Our team has landed at Rackspace Cloud. I’ve now been down to San Antonio twice to meet with key individuals with whom we’ll be working closely. Rackspace is not shy about why they wanted to acquire our team. They see Drizzle as a database that will provide them an infrastructure piece modular and scalable enough to meet the needs of their very diverse Cloud customers, of whom there are many tens of thousands.

Rackspace recognizes that the pain points they feel with traditional MySQL cannot be solved with simple hacks and workarounds, and that to service the needs of so many customers, they will need a database server that thinks of itself as a friendly piece of their infrastructure and not the driver of its applications. Drizzle’s core principles of flexibility and focus on scalability align with the goals Rackspace Cloud has for its platform’s future.

Rackspace is also heavily invested in Cassandra, and sees integration of Drizzle and Cassandra as being a key way to add value to its platforms and therefore for its customers.

Rackspace is all about the customers, and this is a really cool thing to experience. It’s typical for companies to claim they are all about the customer — in fact, every company I’ve ever worked for has claimed this. Rackspace is the first company I’ve worked for where you actually feel this spirit, though. You can see it in the fanaticism of Rackers and in how they always view what they do in terms of service to the customer. It’s infectious, and I’m pretty psyched to be on their team.

Anyway, that’s my story and I’m stickin’ to it. See y’all on the nets.

Describing Drizzle’s Development Process

Yesterday, I was working on a survey that Selena Deckelmann put together for open source databases. She will be presenting the results at Linux.conf.au this month.

One of the questions on the survey was this:

How would you describe your development process?

followed by these answer choices:

  • Individuals request features
  • Large/small group empowered to make decisions
  • Benevolent dictator
  • Other, please specify:____________

I thought a bit about the question and then answered the following in the “Other, please specify:” area:

Bit of a mix between all three above.

The more I think about it, the more I really do feel that Drizzle’s development process is indeed a mixture of individuals, groups, and a benevolent dictator. And I think it works pretty well. :) Here are some of the reasons why I believe our development process is effective in enabling contributions by being a mix of the above three styles.

Who’s the Benevolent Dictator of Drizzle?

First, let me get the BDFL question out of the way. We’ve made a big deal in the Drizzle community and mailing lists that anyone and everyone is encouraged to participate in the development process — so why would I say that Drizzle has a benevolent dictator?

Well, although he would probably disagree with the title of BDFL, Brian Aker does have some dictator-like abilities with regards to the development process, and rightfully so. Brian came up with many of the concepts behind what Drizzle aspires to be, and Brian has more experience working on the code base than any other contributor.

After having worked closely with Brian now for 18 months or so, I can definitively say that Brian’s brain works in a very, well, interesting way. Those of us who work with him understand that sometimes his brain works so fast, his typing fingers struggle to keep up, resulting in something I call “Krowspeak”. It’s kinda funny sometimes trying to translate :)

With this wonderfully unique noodle, Brian tends to knock out large chunks of code at a time, and often he wants to push these chunks into our build and regression system and into trunk to see the results of his work quickly. Sometimes this can cause other branches to fall out of sync and hit merge conflicts, and Brian will inform branch owners of the conflicts and work with them to resolve them.

So, regarding dictator-like development processes, I suppose we have Brian acting as the merge dictator because he’s got a lot of experience and understands best how both his code and others’ code integrate. We tried a little while back having myself and Monty Taylor be merge captains, but that distribution of merge work actually created a number of other problems, and we’ve since gone back to Brian being the merge captain by himself, with Lee, Monty, and myself improving our automated build and regression system to help Brian with the repetitive work.

That said, what Brian does not do is make decisions in a dictator-like way. Decisions about code style, reviews, features, syntax changes, etc. are made on the mailing list by consensus vote. If a consensus is not reached, generally no change is made which would depend on the decision. Brian does not influence the direction of the software or the source code style any more than anyone else on the mailing list who expresses an opinion about an issue; and for this, I greatly respect his wisdom to seek consensus in an open and community-oriented way.

Groups Empowered to Make Decisions

I’m assuming that Selena’s “large/small group empowered to make decisions” answer refers to what is sometimes called “Cabal Leadership” of a project. In other words, there is some group which steers the project and makes decisions which affect the rest of the project’s contributors.

Drizzle has at least one such group, the Sun Microsystems Drizzle Team, which is composed of Brian, Monty Taylor, Lee Bieber, Eric Day, Stewart Smith, and myself. One might call us the core committers for Drizzle.

However, while the Sun Drizzle team certainly is empowered to guide development, it is no different from any other group of developers that choose to contribute to Drizzle. There isn’t a “what the Sun Drizzle team decides” rule in effect. Our “power” in the development process is no greater or lesser than that of any other group of contributors. We act merely as a team of individuals who work on the Drizzle code and advocate for the project’s goals.

Individuals Empowered to Make Decisions

One thing I’ve been impressed with in the past 18 months is how the Drizzle community has embraced the opinions and work of individual contributors. I believe Toru Maesaka, Andrew Hutchings, Diego Medina and Padraig O’Sullivan were among the first individuals to begin actively contributing to Drizzle. Since then, dozens of others have joined the developer and advocate community, with each individual carving out a piece of the source code or community activities that they want to work on.

I have learned much from all these individuals over the last year or so, and I’ve tried my best to share knowledge and encourage others to do the same. Our IRC channel and mailing list are active places of discussion. Our code reviews are always completely open to the public for comments and discussed transparently on Launchpad, and this code review process has been a great mixing bowl of opinion, discussion, learning and debate. I love it.

More and more we have developers showing up and taking ownership of a bug, a blueprint, or just a part of the code that interests them. And nobody stands in their way and says “Oh, no, you shouldn’t work on that because <insert another contributor’s name> owns that code.” Instead, what you will more likely see on the lists or on IRC is a response like “hey, that’s awesome! be sure to chat with <insert another contributor’s name>. They are interested in that code, too, and you should share ideas!” This is incredibly refreshing to see.

In short, the Drizzle developer process is a nice mix of empowered individuals and groups, and a dash of dictatorship just to keep things moving efficiently. It’s open, transparent, and fun to work on Drizzle. Come join us :)

Great Job, MySQL Engineering!

Just a quick note to congratulate MySQL engineering on their next milestone release, MySQL 5.5. There seem to be some excellent new features, some of which have been hotly requested for quite some time:

  • Support for SIGNAL/RESIGNAL — the ANSI/ISO method of throwing errors inside SQL procedures
  • Support for partitioning by RANGE COLUMNS and LIST COLUMNS
  • Support for semi-synchronous replication, in which the master waits until at least one slave has received and logged a transaction before completing the commit

On the subject of SIGNAL/RESIGNAL, Roland Bouman has, as usual, an excellent and informative article on the subject. My personal opinion on SIGNAL/RESIGNAL is that it is one of the most poorly-architected, clunky error-raising frameworks ever invented, but alas, if you stick with ANSI standard SQL it’s the only game in town. I look forward to seeing the progress on this, especially in relation to the addition of DIAGNOSTICS.

Regarding semi-synchronous replication, Mark Callaghan had a quick write-up about it. Check out his short article or Giuseppe’s 5.5 primer for more information. Personally, I’m interested in this for Drizzle, because I’m working on the new replication system. Putting this functionality into Drizzle shouldn’t actually be too difficult. I’m looking into it.

Anyway, nice job MySQL! Keep up the good work, and continue the milestone release model. It works. It would be nice to hear MySQL engineers blogging about what they are working on, though. Give the community a chance to comment on your work, etc, and drum up interest in the new features before they arrive. Hint, hint ;)

A Laptop for Developers without paying The Windows Tax

I find it amazing that the U.S. Department of Justice can continue to cover its eyes and ears while Microsoft is allowed to exert its monopolistic power over all hardware manufacturers.

About 20 months ago, I was able to purchase a Lenovo Thinkpad T61 from the lenovo.com website without an operating system installed. Today, I went to purchase a new Lenovo Thinkpad laptop, again intending to avoid the Windows Tax. Turns out Lenovo has stopped offering this option. What a complete PILE OF SHIT. Somebody in Microsoft’s “Business Development” or “Partners” team must have told Lenovo to stop offering its customers the simple choice of not paying the OEM license fees for Windows. And there’s nothing anyone can do about it. Microsoft is just too big and too pervasive for anybody to have a damn effect on them.

Frankly, it’s anti-choice, anti-competition, anti-innovation behaviour from Microsoft.

And it’s ridiculous.

Does anyone out there know how to get a decent laptop any more without having to fork over my money to a software giant that continues to bully all competition out of the market? Your suggestions are most welcome.

P.S. Mac is not an option for me. Sorry.

P.P.S. The only thing this post has to do with MySQL is the general discussion on the acquisition of Sun by Oracle, and the pending investigation into possible monopoly concerns by the EC…but of course I can’t comment on that directly…grr.

UPDATE:

Seems DELL offers laptops with Ubuntu installed instead of Windows, at least according to search results from their website. Yeah! \o/ Of course, now I have to just figure out how to get to that customization option. When I’ve gone through the customization screens, no option other than Windows is available. :(

UPDATE 2:

The DELL representative on their online chat program was quite helpful and provided this link to laptops they offer with no Windows Tax.

Macro Support in new Drizzle Client Console?

Hi all!

I’ve been reading through the requested features for the new client on the wiki here:

I think all the stuff on that link is excellent so far. I’d also like to request a feature that I think will be a really cool timesaver for DBAs and developers using Drizzle.

Macro Support

Remember, “way back when”, how you could start recording your actions in Microsoft Excel, and when you stopped recording, Excel would store a “macro” of your actions that you could subsequently replay?

I think this would be incredibly useful for folks who do repetitive work in the console.

Sure, I know, I know…the first reaction folks will say is “but HEY, you guys removed stored procedures!” Yeah, yeah… but the feature I’m proposing here is different from stored procedures in the following ways:

  1. It’s entirely client-side. There is no server-side storage/cache, processing, parsing, or anything.
  2. It’s not limited to the small subset of SQL that stored procedures (at least in MySQL) currently support. Anything the new client can do would be able to go into a macro.
  3. Since the client is in Python, the macros are themselves re-writable in a scripting language. This gives the recorded macros incredible flexibility.
  4. No fussing with SQL stored procedure permissions at runtime (you know, the silly INVOKER/DEFINER crap)
  5. Ability to interact with result sets in the macro. Just try doing that easily in a SQL stored procedure. Using CURSORs is incredibly clunky and ugly. Applying a Python function or closure/lambda to each row of a result set is elegant and easy.

Imagine the following rough example interface…

drizzle> RECORD MACRO "sales_report_with_email" (to_email);
macro recording started.

drizzle> mode python;
in python mode.

python> import datetime
python> today= datetime.datetime.now().isoformat()
python> filename= "%s-%s-%s" % ("sales", to_email, today)
python> Ctrl-D

drizzle> SELECT * FROM sales
         WHERE manager = @to_email; > csv(@filename);
drizzle> mode python;
In python mode.

python> report_txt= open(filename, "r+b").read()
python> import smtplib
python> mailserver = smtplib.SMTP('localhost')
python> mailserver.sendmail('theboss@company.com', to_email, report_txt)
python> mailserver.quit()
python> print "Mail sent to %s\n" % to_email
python> Ctrl-D

drizzle> STOP MACRO;
Macro "sales_report_with_email" saved.

drizzle> macro("sales_report_with_email", "myboss@company.com");
Mail sent to myboss@company.com

Pretty powerful, eh?

If you follow the flow above, you will notice the only real trick is passing the macro’s arguments into the console’s variable array, and from there into the Python interpreter’s variable scope. But this is a fairly simple problem to solve…

Thoughts? Suggestions? If you’ve got comments, please feel free to share here, or on the Drizzle Discussion mailing list, or even update the wiki pages posted above. Thanks! :)

Sneak Peek – Drizzle Transaction Log and INFORMATION_SCHEMA

I’ve been coding up a storm in the last couple of days and have just about completed coding on three new INFORMATION_SCHEMA views which allow anyone to query the new Drizzle transaction log for information about its contents. I’ve also finished a new UDF for Drizzle called PRINT_TRANSACTION_MESSAGE() that prints out the Transaction message’s contents in an easy-to-read format.

I don’t have time for a full walk-through blog entry about it, so I’ll just paste some output below and let y’all take a looksie. A later blog entry will feature lots of source code explaining how you, too, can easily add INFORMATION_SCHEMA views to your Drizzle plugins.

Below are the results of the following sequence of actions:

  • Start up a Drizzle server with the transaction log enabled, checksumming enabled, and the default replicator enabled.
  • Open a Drizzle client
  • Create a sample table, insert some data into it, do an update to that table, then drop the table
  • Query the INFORMATION_SCHEMA views and take a look at the transaction messages and information the transaction log now contains

Enjoy! :)

jpipes@serialcoder:~/repos/drizzle/replication-group-commit/tests$ ./dtr --mysqld="--default-replicator-enable"\
 --mysqld="--transaction-log-enable"\
 --mysqld="--transaction-log-enable-checksum"\
 --start-and-exit
...
Servers started, exiting
jpipes@serialcoder:~/repos/drizzle/replication-group-commit/tests$ ../client/drizzle --port=9306
Welcome to the Drizzle client..  Commands end with ; or \g.
Your Drizzle connection id is 2
Server version: 2009.11.1181 Source distribution (replication-group-commit)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

drizzle> use test
Database changed
drizzle> CREATE TABLE t1 (   id INT NOT NULL PRIMARY KEY , padding VARCHAR(200) NOT NULL );
Query OK, 0 rows affected (0.01 sec)

drizzle> INSERT INTO t1 VALUES (1, "I love testing.");
Query OK, 1 row affected (0.01 sec)

drizzle> INSERT INTO t1 VALUES (2, "I hate testing.");
Query OK, 1 row affected (0.01 sec)

drizzle> UPDATE t1 SET padding="I love it when a plan comes together" WHERE id = 2;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

drizzle> DROP TABLE t1;
Query OK, 0 rows affected (0.17 sec)

drizzle> SELECT * FROM INFORMATION_SCHEMA.TRANSACTION_LOG\G
*************************** 1. row ***************************
         FILE_NAME: transaction.log
       FILE_LENGTH: 639
   NUM_LOG_ENTRIES: 5
  NUM_TRANSACTIONS: 5
MIN_TRANSACTION_ID: 0
MAX_TRANSACTION_ID: 9
 MIN_END_TIMESTAMP: 1257888458463696
 MAX_END_TIMESTAMP: 1257888473929116
1 row in set (0 sec)

drizzle> SELECT * FROM INFORMATION_SCHEMA.TRANSACTION_LOG_ENTRIES;
+--------------+-------------+--------------+
| ENTRY_OFFSET | ENTRY_TYPE  | ENTRY_LENGTH |
+--------------+-------------+--------------+
|            0 | TRANSACTION |          141 |
|          141 | TRANSACTION |          121 |
|          262 | TRANSACTION |          121 |
|          383 | TRANSACTION |          181 |
|          564 | TRANSACTION |           75 |
+--------------+-------------+--------------+
5 rows in set (0 sec)

drizzle> SELECT * FROM INFORMATION_SCHEMA.TRANSACTION_LOG_TRANSACTIONS;
+--------------+----------------+-----------+------------------+------------------+----------------+------------+
| ENTRY_OFFSET | TRANSACTION_ID | SERVER_ID | START_TIMESTAMP  | END_TIMESTAMP    | NUM_STATEMENTS | CHECKSUM   |
+--------------+----------------+-----------+------------------+------------------+----------------+------------+
|            0 |              0 |         1 | 1257888458463668 | 1257888458463696 |              1 | 3275955647 |
|          141 |              7 |         1 | 1257888462222183 | 1257888462226990 |              1 |  407829420 |
|          262 |              8 |         1 | 1257888465371330 | 1257888465378423 |              1 | 4073072174 |
|          383 |              9 |         1 | 1257888470209443 | 1257888470215165 |              1 |   92884681 |
|          564 |              9 |         1 | 1257888473929111 | 1257888473929116 |              1 | 2850269133 |
+--------------+----------------+-----------+------------------+------------------+----------------+------------+
5 rows in set (0 sec)

drizzle> SELECT PRINT_TRANSACTION_MESSAGE("transaction.log", ENTRY_OFFSET) as trx
       > FROM INFORMATION_SCHEMA.TRANSACTION_LOG_ENTRIES\G
*************************** 1. row ***************************
trx: transaction_context {
  server_id: 1
  transaction_id: 0
  start_timestamp: 1257888458463668
  end_timestamp: 1257888458463696
}
statement {
  type: RAW_SQL
  start_timestamp: 1257888458463676
  end_timestamp: 1257888458463694
  sql: "CREATE TABLE t1 (   id INT NOT NULL PRIMARY KEY , padding VARCHAR(200) NOT NULL )"
}

*************************** 2. row ***************************
trx: transaction_context {
  server_id: 1
  transaction_id: 7
  start_timestamp: 1257888462222183
  end_timestamp: 1257888462226990
}
statement {
  type: INSERT
  start_timestamp: 1257888462222185
  end_timestamp: 1257888462226989
  insert_header {
    table_metadata {
      schema_name: "test"
      table_name: "t1"
    }
    field_metadata {
      type: INTEGER
      name: "id"
    }
    field_metadata {
      type: VARCHAR
      name: "padding"
    }
  }
  insert_data {
    segment_id: 1
    end_segment: true
    record {
      insert_value: "1"
      insert_value: "I love testing."
    }
  }
}

*************************** 3. row ***************************
trx: transaction_context {
  server_id: 1
  transaction_id: 8
  start_timestamp: 1257888465371330
  end_timestamp: 1257888465378423
}
statement {
  type: INSERT
  start_timestamp: 1257888465371332
  end_timestamp: 1257888465378422
  insert_header {
    table_metadata {
      schema_name: "test"
      table_name: "t1"
    }
    field_metadata {
      type: INTEGER
      name: "id"
    }
    field_metadata {
      type: VARCHAR
      name: "padding"
    }
  }
  insert_data {
    segment_id: 1
    end_segment: true
    record {
      insert_value: "2"
      insert_value: "I hate testing."
    }
  }
}

*************************** 4. row ***************************
trx: transaction_context {
  server_id: 1
  transaction_id: 9
  start_timestamp: 1257888470209443
  end_timestamp: 1257888470215165
}
statement {
  type: UPDATE
  start_timestamp: 1257888470209446
  end_timestamp: 1257888470215163
  update_header {
    table_metadata {
      schema_name: "test"
      table_name: "t1"
    }
    key_field_metadata {
      type: INTEGER
      name: "id"
    }
    set_field_metadata {
      type: VARCHAR
      name: "padding"
    }
  }
  update_data {
    segment_id: 1
    end_segment: true
    record {
      key_value: "2"
      key_value: "I love it when a plan comes together"
      after_value: "I love it when a plan comes together"
    }
  }
}

*************************** 5. row ***************************
trx: transaction_context {
  server_id: 1
  transaction_id: 9
  start_timestamp: 1257888473929111
  end_timestamp: 1257888473929116
}
statement {
  type: RAW_SQL
  start_timestamp: 1257888473929113
  end_timestamp: 1257888473929115
  sql: "DROP TABLE `t1`"
}

5 rows in set (0.06 sec)

FYI, if you look closely, you’ll see some odd things — namely that there is a transaction with an ID of zero. I’m aware of this and am working on fixing it :) Like I said, I’m almost done coding…

The Great Escape

This week, I am working on putting together test cases which validate the Drizzle transaction log’s handling of BLOB columns.

I ran into an interesting set of problems and am wondering how to go about handling them. Perhaps the LazyWeb will have some solutions. :)

The problem, in short, is inconsistency in the way that the NUL character is escaped (or not escaped) in both the MySQL/Drizzle protocol and the MySQL/Drizzle client tools. And by client tools, I mean both everyone’s favourite little mysql command-line client and the mysqltest client, which provides infrastructure and runtime services for the MySQL and Drizzle test suites.

Even within the server and client protocol, there appears to be some inconsistency in how and when things are escaped. Take a look at this interesting output from the drizzle client program (FYI, output is identical for mysql client, I checked…)

drizzle> select 'test\0me';
+---------+
| test    |
+---------+
| test me | 
+---------+
1 row in set (0 sec)

You’ll notice that in the first SELECT statement, the column header is cut off — i.e. the column header is not escaping the \0 NUL character in the string 'test\0me'. However, the result data does not truncate the string but replaces the NUL character with a space character. So, I came to the conclusion that the drizzle client does not escape column headers but does do some sort of escaping for the result data. Given this conclusion, you will understand my raised eyebrow when the following SELECT statement was displayed:

drizzle> select 'test\0me' = 'test me';
+------------------------+
| 'test\0me' = 'test me' |
+------------------------+
|                      0 | 
+------------------------+
1 row in set (0 sec)

Hmmm…so maybe column headers are being escaped by the MySQL/Drizzle client? Clearly, the NUL character was escaped as the characters ‘\\’ followed by the character ‘0’ in the column header above. Indeed, quite puzzling.
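
As an aside, this truncation pattern is what you would expect if the column header passes through NUL-terminated C string handling somewhere while the result data takes a length-aware code path (that’s speculation on my part, but the two behaviours themselves are easy to demonstrate). A minimal sketch, illustrative only and not the client’s actual code:

#include <cstdio>
#include <cstring>

int main()
{
  const char raw[]= "test\0me";             /* 7 characters plus the implicit trailing NUL */

  printf("%zu\n", strlen(raw));             /* prints 4: strlen() stops at the embedded NUL */
  printf("%s\n", raw);                      /* prints "test": %s also stops at the NUL */
  fwrite(raw, 1, sizeof(raw) - 1, stdout);  /* writes all 7 bytes, embedded NUL included */
  putchar('\n');
  return 0;
}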

OK, so the above anomaly needs to be investigated. However, a similar issue exists for the mysqltest/drizzletest client program. To see the problem, I created a simple test case with the following in it:

--disable_warnings
DROP TABLE IF EXISTS t1;
--enable_warnings

SELECT 'test\0me';

CREATE TABLE t1 (fld BLOB NULL);
INSERT INTO t1 VALUES ('test\0me');
SELECT COUNT(*) FROM t1;
DROP TABLE t1;

Now, what you would expect to see for the output of the above — at least if you expect results similar to the MySQL/Drizzle client output — is the following:

DROP TABLE IF EXISTS t1;
SELECT 'test\0me';
test
test me
CREATE TABLE t1 (fld BLOB NULL);
INSERT INTO t1 VALUES ('test\0me');
SELECT COUNT(*) FROM t1;
COUNT(*)
1
DROP TABLE t1;

That is what you would expect to see in the output of course… Here is what you actually get in the output:

DROP TABLE IF EXISTS t1;
SELECT 'test\0me';
test
test

So, the mysqltest/drizzletest client apparently does not escape the NUL character for the result data at all. It looks like it does do some escaping/replacing for the NUL character in the column header, though, otherwise the second “test” line would not appear. This leads to the result file being essentially truncated as soon as a NUL character is included in any output to the mysqltest/drizzletest client, which renders the client useless for testing and validating BLOB data.

Possible Solutions?

I think the cleanest solution would be to create a shared library of code that would be responsible for uniformly and consistently escaping data, and then linking the various clients (and server) with this library and removing all of the various escaping functions currently in the server. This would, of course, take some time, but would be the most future-proof solution.
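
As a rough sketch (the routine and its name here are hypothetical, not existing Drizzle code), the key property of such a library is that it escapes by buffer length instead of relying on NUL termination:

#include <cstddef>
#include <string>

/* Escape binary data for display, walking the full length of the buffer
 * rather than stopping at the first NUL. Hypothetical sketch only. */
std::string escapeBinary(const char *data, size_t length)
{
  std::string out;
  out.reserve(length);
  for (size_t x= 0; x < length; ++x)
  {
    switch (data[x])
    {
    case '\0':
      out.append("\\0");   /* the NUL becomes the two printable characters \ and 0 */
      break;
    case '\\':
      out.append("\\\\");
      break;
    case '\'':
      out.append("\\'");
      break;
    default:
      out.push_back(data[x]);
    }
  }
  return out;
}

Anyone else have ideas on solving the problem of being able to test and validate binary data via the test suite? Cheers!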

A Month of Milestones

I’m finding myself smiling today. I lay in bed last night thinking about a number of milestones that this month marks for me.

October 15th marked four months since the last time I had a cigarette. I feel good about my chances at remaining smoke-free for the remainder of my life.

October 18th marked one year since I officially began working on the Drizzle project. Although, as Giuseppe can attest to, I had been contributing to Drizzle before October 18th, 2008, that date was the official start. :)

I think about how much has been accomplished by the Drizzle community since that time. The Drizzle of October 2008 is barely recognizable now. Monty’s incredible work on the build system, Stewart’s continued removal of legacy Unireg code like the FRM files, Eric Day joining the Sun Drizzle team and contributing amazing work on the protocol and client libraries, Monty’s reworking of the plugin system, new datetime (temporal) work, Padraig O’Sullivan’s enormous contributions in the arena of the INFORMATION_SCHEMA, the optimizer, runtime, replication, and memcached, and an automation system that provides per-commit regression feedback. It’s truly fantastic to be part of a living, breathing, active project with so many special contributors who bring infectious enthusiasm to the world of database development. I’m privileged to be a part of it.

And finally, Tuesday the 28th marked the day that the new transactional replication system hit Drizzle’s trunk. While a ton of work is of course left to be done on the replication system, Tuesday’s code hitting trunk was a big milestone that I’m really happy about.

Anyway, just wanted to share some happiness. Cheers.

Drizzle Replication – The Transaction Log

In this installment of my Drizzle Replication blog series, I’ll be talking about the Transaction Log. Before reading this entry, you may want to first read up on the Transaction Message, which is a central concept to this blog entry.

The transaction log is just one component of Drizzle’s default replication services, but it also serves as a generalized log of atomic data changes to a particular server. In this way, it is only partially related to replication. The transaction log is used by components of the replication services to store changes made to a server’s data. However, there is nothing that mandates that this particular transaction log be a required feature for Drizzle replication systems. For instance, Eric Lambert is currently working on a Gearman-based replication service which, while following the same APIs, does not require the transaction log to function. Furthermore, other, non-replication-related modules may use the transaction log themselves. For instance, a future Recovery and/or Backup module may just as easily use the transaction log for its own purposes as well.

Before we get into the details, it’s worth noting the general goals we’ve had for the transaction log, as these goals may help explain some of the design choices made. In short, the goals for the transaction log are:

  • Introduce no global contention points (mutexes/locks)
  • Once written, the transaction log may not be modified
  • The transaction log should be easily readable in multiple programming languages

Overview of the Transaction Log Structure


The format of the transaction log is simple and straightforward. It is a single file that contains log entries, one after another. These log entries have a type associated with them. Currently, there are only two types of entries that can go in the transaction log: a Transaction message entry and a BLOB entry. We will only cover the Transaction message entry in this article; I’ll leave how to deal with BLOBs for a separate article entirely.

Each entry in the transaction log is preceded by 4 bytes containing an integer code identifying the type of entry to follow. The bytes which follow this type header are interpreted based on the type of entry. For entries of type Transaction message, the layout is as follows: first, a 4-byte length header is written, then the serialized Transaction message, then a 4-byte checksum of the serialized Transaction message.
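
In code terms, the envelope arithmetic is simple. Here is an illustrative sketch (the constant names below are mine; the apply() code later in this article rolls the non-message bytes into a single HEADER_TRAILER_BYTES constant):

#include <stdint.h>
#include <stddef.h>

/* Illustrative constants for the entry envelope described above:
 * a 4-byte type code and a 4-byte message length before the serialized
 * message, and a 4-byte checksum after it. */
static const size_t ENTRY_TYPE_BYTES= sizeof(uint32_t);
static const size_t ENTRY_LENGTH_BYTES= sizeof(uint32_t);
static const size_t ENTRY_CHECKSUM_BYTES= sizeof(uint32_t);

/* Total bytes occupied in the log by one serialized Transaction message */
size_t entry_envelope_length(size_t message_byte_length)
{
  return ENTRY_TYPE_BYTES + ENTRY_LENGTH_BYTES
         + message_byte_length + ENTRY_CHECKSUM_BYTES;
}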

Details of the TransactionLog::apply() Method

For those interested in how the transaction log is written to, I’m going to detail the apply() method of the TransactionLog class in /plugin/transaction_log/transaction_log.cc. The TransactionLog class is simply a subclass of plugin::TransactionApplier and therefore must implement the single pure virtual apply method of that class interface.

The TransactionLog class has a private drizzled::atomic<off_t> called log_offset which is an offset into the transaction log file that is incremented with each atomic write to the log file. You will notice in the code below that this atomic off_t is stored locally, then incremented by the total length of the log entry to be written. A buffer is then written to the log file using pwrite() at the original offset. In this way, we completely avoid calling pthread_mutex_lock() or similar when writing to the log file, which should increase scalability of the transaction log.

void TransactionLog::apply(const message::Transaction &to_apply)
{
  uint8_t *buffer; /* Buffer we will write serialized header, message and trailing checksum to */
  uint8_t *orig_buffer;
 
  int error_code;
  size_t message_byte_length= to_apply.ByteSize();
  ssize_t written;
  off_t cur_offset;
  size_t total_envelope_length= HEADER_TRAILER_BYTES + message_byte_length;
 
  /*
   * Attempt allocation of raw memory buffer for the header,
   * message and trailing checksum bytes.
   */
  buffer= static_cast<uint8_t *>(malloc(total_envelope_length));
  if (buffer == NULL)
  {
    errmsg_printf(ERRMSG_LVL_ERROR,
      _("Failed to allocate enough memory to buffer header, transaction message, "
        "and trailing checksum bytes. Tried to allocate %" PRId64
        " bytes.  Error: %s\n"),
    static_cast<uint64_t>(total_envelope_length),
    strerror(errno));
    state= CRASHED;
    deactivate();
    return;
  }
  else
    orig_buffer= buffer; /* We will free() orig_buffer, as buffer is moved during write */
 
  /*
   * Do an atomic increment on the offset of the log file position
   */
  cur_offset= log_offset.fetch_and_add(static_cast<off_t>(total_envelope_length));
 
  /*
   * We adjust cur_offset back to the original log_offset before
   * the increment above…
   */
  cur_offset-= static_cast<off_t>(total_envelope_length);
 
  /*
   * Write the header information, which is the message type and
   * the length of the transaction message into the buffer
   */
  buffer= protobuf::io::CodedOutputStream::WriteLittleEndian32ToArray(
    static_cast<uint32_t>(ReplicationServices::TRANSACTION), buffer);
  buffer= protobuf::io::CodedOutputStream::WriteLittleEndian32ToArray(
    static_cast<uint32_t>(message_byte_length), buffer);
 
  /*
   * Now write the serialized transaction message, followed
   * by the optional checksum into the buffer.
   */
  buffer= to_apply.SerializeWithCachedSizesToArray(buffer);
  uint32_t checksum= 0;
  if (do_checksum)
  {
    checksum= drizzled::hash::crc32(reinterpret_cast<char *>(buffer) -
                     message_byte_length, message_byte_length);
  }
 
  /* We always write in network byte order */
  buffer= protobuf::io::CodedOutputStream::WriteLittleEndian32ToArray(checksum, buffer);
  /*
   * Quick safety…if an error occurs above in another writer, the log
   * file will be in a crashed state.
   */
  if (unlikely(state == CRASHED))
  {
    /*
     * Reset the log’s offset in case we want to produce a decent error message including
     * the original offset where an error occurred.
     */
    log_offset= cur_offset;
    free(orig_buffer);
    return;
  }
 
  /* Write the full buffer in one swoop */
  do
  {
    written= pwrite(log_file, orig_buffer, total_envelope_length, cur_offset);
  }
  while (written == -1 && errno == EINTR); /* Just retry the write when interrupted by a signal… */
 
  if (unlikely(written != static_cast<ssize_t>(total_envelope_length)))
  {
    errmsg_printf(ERRMSG_LVL_ERROR,
      _("Failed to write full size of transaction.  Tried to write %" PRId64
        " bytes at offset %" PRId64 ", but only wrote %" PRId64 " bytes.  Error: %s\n"),
      static_cast<uint64_t>(total_envelope_length),
      static_cast<uint64_t>(cur_offset),
      static_cast<uint64_t>(written),
      strerror(errno));
    state= CRASHED;
    /*
     * Reset the log's offset in case we want to produce a decent error message including
     * the original offset where an error occurred.
     */
    log_offset= cur_offset;
    deactivate();
  }
  free(orig_buffer);
  error_code= my_sync(log_file, 0);
  if (unlikely(error_code != 0))
  {
    errmsg_printf(ERRMSG_LVL_ERROR,
      _("Failed to sync log file. Got error: %s\n"),
      strerror(errno));
  }
}

Reading the Transaction Log

OK, so the above code shows how the transaction log is written. What about reading the log file? Well, it’s pretty simple. There is an example program in /drizzle/message/transaction_reader.cc which has code showing how to do this. Here’s a snippet from that program:

int main(int argc, char* argv[])
{
  …
  message::Transaction transaction;
 
  file= open(argv[1], O_RDONLY);
  if (file == -1)
  {
    fprintf(stderr, _("Cannot open file: %s\n"), argv[1]);
    return -1;
  }
      …
  protobuf::io::ZeroCopyInputStream *raw_input=
    new protobuf::io::FileInputStream(file);
  protobuf::io::CodedInputStream *coded_input=
    new protobuf::io::CodedInputStream(raw_input);
 
  char *buffer= NULL;
  char *temp_buffer= NULL;
  uint32_t length= 0;
  uint32_t previous_length= 0;
  uint32_t checksum= 0;
  bool result= true;
  uint32_t message_type= 0;
 
  /* Read in the length of the command */
  while (result == true &&
           coded_input->ReadLittleEndian32(&message_type) == true &&
           coded_input->ReadLittleEndian32(&length) == true)
  {
      if (message_type != ReplicationServices::TRANSACTION)
      {
        fprintf(stderr, _("Found a non-transaction message "
                            "in log.  Currently, not supported.\n"));
        exit(1);
      }
 
      if (length > INT_MAX)
      {
        fprintf(stderr, _("Attempted to read record bigger than INT_MAX\n"));
        exit(1);
      }
 
      if (buffer == NULL)
      {
        temp_buffer= (char *) malloc(static_cast<size_t>(length));
      }
      /* No need to allocate if we have a buffer big enough… */
      else if (length > previous_length)
      {
        temp_buffer= (char *) realloc(buffer, static_cast<size_t>(length));
      }
 
      if (temp_buffer == NULL)
      {
        fprintf(stderr, _("Memory allocation failure trying to "
                            "allocate %" PRIu64 " bytes.\n"),
                 static_cast<uint64_t>(length));
        break;
      }
      else
        buffer= temp_buffer;
 
      /* Read the Command */
      result= coded_input->ReadRaw(buffer, (int) length);
      if (result == false)
      {
        fprintf(stderr, _("Could not read transaction message.\n"));
        fprintf(stderr, _("GPB ERROR: %s.\n"), strerror(errno));
        fprintf(stderr, _("Raw buffer read: %s.\n"), buffer);
        break;
      }
 
      result= transaction.ParseFromArray(buffer, static_cast<int32_t>(length));
      if (result == false)
      {
        fprintf(stderr, _("Unable to parse command. Got error: %s.\n"),
                 transaction.InitializationErrorString().c_str());
        if (buffer != NULL)
          fprintf(stderr, _("BUFFER: %s\n"), buffer);
        break;
      }
    /* Print the transaction */
    printTransaction(transaction);
 
    /* Skip 4 byte checksum */
    coded_input->ReadLittleEndian32(&checksum);
 
    if (do_checksum)
    {
      if (checksum != drizzled::hash::crc32(buffer, static_cast<size_t>(length)))
      {
        fprintf(stderr, _("Checksum failed. Wanted %" PRIu32
                              " got %" PRIu32 "\n"),
                 checksum,
                 drizzled::hash::crc32(buffer, static_cast<size_t>(length)));
      }
    }
    previous_length= length;
  }
 
  if (buffer)
    free(buffer);
  delete coded_input;
  delete raw_input;
  return (result == true ? 0 : 1);
}

Shortcomings of the Transaction Log

So far, we’ve generally focused on a scalable design for the transaction log and have not spent too much time on performance tuning the code — and yes, performance != scalability. There are a number of problems with the current code which we will address in future versions of the transaction log. Namely:

  • Reduce calls to malloc(). Currently, each write of a transaction message to the log file incurs a call to malloc() to allocate enough memory to store the serialized log entry. Clearly, this is not optimal. We’ve considered a number of alternate approaches to calling malloc(), including a scoreboard approach where a vector of memory slabs is used in round-robin fashion. This would introduce some locking, however. I’ve also thought about using a hazard pointer list so that previously-allocated memory on the Session object could be reused for this purpose. But these ideas must be hashed out further.
  • There is no index into the transaction log. This is not a problem for writing the transaction log, of course, but for readers of the transaction log. I’m in the process of creating classes and a library for building indexes for a transaction log and, in addition, creating archived snapshots to enable log shipping for Drizzle replication. I’ll be pushing code for this to Launchpad later this week and will write a new article about log shipping and snapshot creation.
  • Each call to apply() calls fdatasync()/fsync() on the transaction log. Certain environments may consider this to be too strict a sync requirement, since the storage engine may already keep a transaction log file of its own that is also synced. For instance, InnoDB has a transaction log that, depending on the setting of InnoDB configuration variables, may call fdatasync() upon every transaction commit. It would be best to have the syncing behaviour be user-adjustable — for instance, a setting to allow the transaction log to be synced every X number of seconds (a rough sketch of such a policy follows this list)…
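
To sketch out that last idea (purely hypothetical; no such option exists in the plugin today), a time-based sync policy might look something like this, with apply() calling my_sync() only when shouldSync() returns true:

#include <ctime>

/* Hypothetical sketch of a user-adjustable sync policy: sync at most once
 * every sync_interval_secs rather than on every apply() call. An interval
 * of 0 keeps today's strict behaviour of syncing on every write. */
class SyncPolicy
{
public:
  explicit SyncPolicy(unsigned int interval_secs)
    : last_sync(0), sync_interval_secs(interval_secs)
  {}

  bool shouldSync()
  {
    if (sync_interval_secs == 0)
      return true; /* strict mode: sync the log file on every write */

    time_t now= time(NULL);
    if (now - last_sync >= static_cast<time_t>(sync_interval_secs))
    {
      last_sync= now;
      return true;
    }
    return false;
  }

private:
  time_t last_sync;
  unsigned int sync_interval_secs;
};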

Summary and Request for Comments

That’s it for the discussion about the transaction log. I’ll post some more code examples from the replication plugins which utilize the transaction log in a later blog entry.

What do you think of the design of the transaction log? What would you change? Comments are always welcome! Cheers. :)

Drizzle Replication – Changes in API to support Group Commit

Hi all. It’s been quite some time since my last article on the new replication system in Drizzle. My apologies for the delay in publishing the next article in the replication series.

The delay has been due to a reworking of the replication system to fully support “group commit” behaviour and to support fully transactional replication. The changes allow replicator and applier plugins to understand much more about the actual changes which occurred on the server, and to understand the transactional container properly.

The goals of Drizzle’s replication system are as follows:

  • Make replication modular and not dependent on one particular implementation
  • Make it simple and fun to develop plugins for Drizzle replication
  • Encapsulate all transmitted information in an efficient, portable, and standard format

This article serves to build on the last article and explain the changes to the Google Protobuffer message definitions used in the replication API. The actual replication API described in the last article remains almost the same. However, instead of being named CommandApplier and CommandReplicator, those plugin base classes are now named TransactionApplier and TransactionReplicator respectively. And, instead of consuming a Command message, they consume Transaction messages.


For my friend Edwin’s benefit, I’ll be including lots of pretty graphics. :) For my developer readers, I’m including lots of example C++ code to help you best understand how to read and manipulate the Transaction and Statement messages in the new replication system.

New Message Definitions

As I mentioned above, the Command message, previously discussed in the first replication article, has been changed in favour of a more space-efficient and transactional message format. The proto file is now called /drizzled/message/transaction.proto. You can look at the proto file online.

The Command message has become the Statement message, and a new Transaction message serves as a container for multiple Statement messages representing (in most cases) an atomic change in the state of the database server. I’ll discuss later in the article those specific cases where a Transaction message’s contents may contain only a partial atomic change to the server.

The image to the right depicts the Transaction message container. As you can see, the Transaction message contains two things: a TransactionContext message and an array of one or more Statement messages.

The TransactionContext Message

Each Transaction message contains a single TransactionContext message. The TransactionContext message contains information about the entire transaction. The data members of the TransactionContext are as follows:

  • server_id – (uint32_t) A numeric identifier for the server which executed this transaction
  • transaction_id – (uint64_t) A globally-unique transaction identifier
  • start_timestamp – (uint64_t) A nano-second precision timestamp of when the transaction began.
  • end_timestamp – (uint64_t) A nano-second precision timestamp of when the transaction completed.

Since TransactionContext is simply a Google Protobuffer message, accessing data members is simple and straightforward. If you’re writing a replicator or applier, a reference to a const Transaction message will be supplied to you via the standard API. For instance, let’s assume we’re writing a replicator and we want to filter all messages that are from the server with a server_id of 100. Kind of a silly example, but nevertheless, it allows us to see some example code.

As you may remember, the API for a replicator is dirt simple. There is a replicate() pure virtual method which accepts two parameters, the GPB message and a reference to the Applier which will “apply” the message to some target. The new function signature is the same as the last one, with the term “Command” replaced with the term “Transaction”:

virtual void replicate(TransactionApplier *in_applier,
                       message::Transaction &to_replicate)= 0;

Suppose our replicator class is called MyReplicator. Here is how to query the transaction context of the Transaction message and filter out transactions coming from server #100. :)

void MyReplicator::replicate(TransactionApplier *in_applier,
                             message::Transaction &to_replicate)
{
  const message::TransactionContext &ctx= to_replicate.transaction_context();
  if (ctx.server_id() != 100)
    in_applier->apply(to_replicate);
}

See? Pretty darn simple. :) OK, on to the Statement message, which is slightly more complicated.

The Statement Message

As noted above, the Transaction message contains an array of Statement messages. In Protobuffer terminology, the Transaction message contains a “repeated” Statement data member. The Statement message is an envelope containing the following information:

  • type – (enum Type) The type of Statement this message represents. Currently, the possible values of the type are as follows:
    • ROLLBACK
    • INSERT
    • UPDATE
    • DELETE
    • TRUNCATE_TABLE
    • CREATE_SCHEMA
    • ALTER_SCHEMA
    • DROP_SCHEMA
    • CREATE_TABLE
    • ALTER_TABLE
    • DROP_TABLE
    • SET_VARIABLE
    • RAW_SQL
  • start_timestamp – (uint64_t) A nano-second precision timestamp of when the statement began.
  • end_timestamp – (uint64_t) A nano-second precision timestamp of when the statement completed.
  • sql – (string) Optionally stores the exact original SQL string producing this message.
  • For certain types of Statement messages, there will also be a specialized header and data message (see below).

To access the Statement messages in a Transaction, use something like the following code, which loops over the Transaction message’s vector of Statement messages:

void MyReplicator::replicate(TransactionApplier *in_applier,
                             message::Transaction &to_replicate)
{
  /* Grab the number of statements in the Transaction message */
  size_t x;
  size_t num_statements= to_replicate.statement_size();

  /* Do something with each statement… */
  for (x= 0; x < num_statements; ++x)
  {
    const message::Statement &stmt= to_replicate.statement(x);
    /* processStatement() does something with the statement… */
    processStatement(stmt);
  }
}

Serialized Polymorphism with the type Member

The type data member is of critical importance to the Statement message, as it allows us to have a sort of polymorphism serialized within the Statement message itself. This polymorphism allows the generic Statement message to contain specialized submessages depending on what type of event occurred on the server.

The above paragraph probably sounds overly complicated, but in reality things are pretty simple. As usual, it’s easiest to see what’s going on by looking at an example in code. For our example, let’s build out our fictional processStatement() method from the snippet above.

The processStatement() method is basically a giant switch statement, switching on the supplied Statement message’s type data member. Here is the outline of the processStatement() method, with only our switch statement and some comments visible, which should give you an idea of how we deal with specific types of Statements:

void processStatement(const message::Statement &stmt)
{
  switch (stmt.type())
  {
  case message::Statement::INSERT:
    /* Handle statements which insert new data… */
    break;
  case message::Statement::UPDATE:
    /* Handle statements which update existing data… */
    break;
  case message::Statement::DELETE:
    /* Handle statements which delete existing data… */
    break;
  …
  }
}

Let’s go ahead and “fill out” one of the case blocks in the switch statement above. We will handle the case where the Statement type is INSERT. Note that this does not necessarily mean a SQL INSERT statement was executed. All this means is that an SQL statement was executed which resulted in a new record being added to a table on the server. This means that the actual SQL statement could have been any of INSERT, INSERT … SELECT, REPLACE INTO, or LOAD DATA INFILE.

The /drizzled/message/transaction.proto file will always contain lots of documentation explaining how each of the specific submessages in the Statement message class are handled. To the right is a graphic depicting the InsertHeader and InsertData message classes which compose the “meat” of Statements that inserted new records into the database. Whenever the Statement message’s type is INSERT, the Statement message will contain two submessages, one called insert_header and another called insert_data which will be populated with the InsertHeader and InsertData messages. The header message will contain information about the table and fields affected, while the data message will contain the values to be inserted into the table.

Here is some example code which queries the header and data messages and constructs an SQL string from them:

void processStatement(const message::Statement &stmt)
{
  switch (stmt.type())
  {
  case message::Statement::INSERT:
    /* Handle statements which insert new data… */
    {
      const message::InsertHeader &header= stmt.insert_header();
      const message::InsertData &data= stmt.insert_data();
      string destination;
      char quoted_identifier= '`';

      destination.assign("INSERT INTO ");
      destination.push_back(quoted_identifier);
      destination.append(header.table_metadata().schema_name());
      destination.push_back(quoted_identifier);
      destination.push_back('.');
      destination.push_back(quoted_identifier);
      destination.append(header.table_metadata().table_name());
      destination.push_back(quoted_identifier);
      destination.append(" (");

      /* Add field list to SQL string… */
      size_t num_fields= header.field_metadata_size();
      size_t x;

      for (x= 0; x < num_fields; ++x)
      {
        const message::FieldMetadata &field_metadata= header.field_metadata(x);
        if (x != 0)
          destination.push_back(',');

        destination.push_back(quoted_identifier);
        destination.append(field_metadata.name());
        destination.push_back(quoted_identifier);
      }

      destination.append(") VALUES (");

      /* Add insert values */
      size_t num_records= data.record_size();
      size_t y;

      for (x= 0; x < num_records; ++x)
      {
        if (x != 0)
          destination.append("),(");

        for (y= 0; y < num_fields; ++y)
        {
          if (y != 0)
            destination.push_back(',');

          destination.push_back('\'');
          destination.append(data.record(x).insert_value(y));
          destination.push_back('\'');
        }
      }
      destination.push_back(')');
    }
    break;
  …
  }
}

The example code above is far from production-ready, of course. I don’t take into account different field types, instead simply enclosing everything in single quotes. Also, I don’t handle errors or escape strings. The point isn’t to be perfect, but to show you the general way to get information out of the Statement message…

Partial Atomic Transactions

Above, I stated that the Transaction messages sent to Replicators and Appliers represent an atomic change to the state of a server. This is true, most of the time. :) There are specific situations when a Transaction message will not represent an atomic change, and you should be aware of these scenarios if you plan to write plugins which implement a replication scheme.

There are times when it is simply inefficient or impossible to create a Transaction message that represents the actual atomic change on a server. For instance, imagine a table having 100 million records. Now, imagine issuing an UPDATE against that table that potentially affected every row in the table.

In order to transmit to replicas the atomic change to the server, one gigantic Transaction message would need to be constructed on the master server. Not only is there a distinct chance that the master would run out of memory constructing such a large message object, but it’s safe to say that the master server would suffer from performance degradation during this construction. There must, therefore, be a way to start streaming the changes made to the master server before the actual final commit has happened on the master.

You may have noticed two data members of the InsertData message above named segment_id and end_segment. The first is of type uint32_t and the second is a bool. Together, these two data members fulfill the need to transmit transaction messages that are part of a bulk data modification. When a reader of a Transaction message sees that the end_segment data member is false, the reader knows that another data segment will follow the current data message and will contain more inserts, updates, or deletes for the current transaction.
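
To make that concrete, here is a sketch of how an applier might consume a segmented statement. applyRecords() and rememberPendingSegment() are hypothetical helpers of my own; only the message accessors come from the definitions described above:

void processInsertSegment(const message::Statement &stmt)
{
  const message::InsertData &data= stmt.insert_data();

  /* Apply this segment's records immediately rather than buffering the
   * entire (potentially enormous) transaction in memory. */
  applyRecords(data);

  if (! data.end_segment())
  {
    /*
     * More data segments for this statement are still in flight; remember
     * the context so the next segment can be matched to this statement
     * when it arrives.
     */
    rememberPendingSegment(stmt, data.segment_id());
  }
}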

Summary and Request for Comments

Hopefully, I’ve explained the changes that have been made to Drizzle’s replication system well enough above, but I understand the changes to the message definitions are substantial and am available at any time to discuss the changes and assist people with their code. You can find me on IRC, Freenode’s #drizzle channel, via the Drizzle discussion mailing list, or via email joinfu@sun.com. I very much welcome comments. The new replication system is just finishing up the valgrind regression tests and should hit trunk later today.

The next article covers the new Transaction Log, which is a serialized log of the Transaction messages used in the replication system.