Understanding reservations, concurrency, and locking in Nova

Imagine that two colleagues, Alice and Bob, issue a command to launch a new virtual machine at approximately the same moment in time. Both Alice’s and Bob’s virtual machines must be given an IP address within the range of IP addresses granted to their project. Let’s say that range is 192.168.20.0/28, which would allow for a total of 16 IP addresses for virtual machines [1]. At some point during the launch sequence of these instances, Nova must assign one of those addresses to each virtual machine.

How do we prevent Nova from assigning the same IP address to both virtual machines?

In this blog post, I’ll try to answer the above question and shed some light on the way in which OpenStack projects currently address (and sometimes fail to address) this problem.

Demonstrating the problem

figure A

Dramatically simplified, the launch sequence of Nova looks like figure A. Of course, I’m leaving out hugely important steps, like the provisioning and handling of block devices, but the figure shows the steps that matter for our discussion here. The specific step in which we find our IP address reservation problem is the determine networking details step.

figure B

Now, within the determine networking details step, we have a set of tasks that looks like figure B. All of the tasks except the last revolve around interacting with the Nova database [2]. The tasks are all pretty straightforward: we grab a record for a “free” IP address from the database and mark it “assigned” by setting the IP address record’s instance ID to the ID of the instance being launched, and the host field to the ID of the compute node that was selected during the determine host machine step in figure A. We then save the updated record to the database.
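To make that read-then-write pattern concrete, here is a minimal sketch of the figure B tasks in Python. The function name and schema are hypothetical (the fixed_ips table mirrors the test table shown later in this post), and the standard library’s sqlite3 module is used purely to keep the sketch self-contained. The point is that nothing in this code stops two concurrent callers from reading the same “free” record before either of them writes:

import sqlite3  # conn below is assumed to be a DB-API connection, e.g. sqlite3.connect("nova.db")


def assign_fixed_ip_naive(conn, instance_id, host):
    # Task 1: grab the record for the "first" free IP address
    row = conn.execute(
        "SELECT id FROM fixed_ips "
        "WHERE host IS NULL AND instance_id IS NULL "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        raise RuntimeError("no free fixed IPs")
    # Task 2: mark it assigned to this instance and host
    conn.execute(
        "UPDATE fixed_ips SET host = ?, instance_id = ? WHERE id = ?",
        (host, instance_id, row[0]))
    # Task 3: save the updated record
    conn.commit()
    return row[0]

The race described next is simply two concurrent calls to a function like this interleaving between the SELECT and the UPDATE.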

OK, so back to our problem situation. Imagine that Alice and Bob’s launch requests were made at essentially the same moment, that both requests arrived at the start of the determine networking details step at the same time, and that the tasks from figure B were executed in an interleaved fashion between Alice and Bob’s requests, as figure C shows.

figure C

If you step through the numbered actions in both Alice and Bob’s request processes, you will notice a problem. Actions #7 and #9 will both return the same IP address information to their callers. Worse, the database record for that single IP address will show the IP address assigned to Alice’s instance, even though Bob’s instance was (very briefly) assigned the IP address, because the database update in action #5 occurred (and succeeded) before the database update in action #8 occurred (and also succeeded). In the words of Mr. Mackey, “this is bad, m’kay”.

There are a number of ways to solve this problem. Nova happens to employ a traditional solution: database-level write-intent locks.

Database-level Locking

At its core, any locking solution is intended to protect some critical piece of data from simultaneous changes. Write-intent locks in traditional database systems are no different. One thread announces that it intends to change one or more records that it is reading from the database. The database server will mark the records in question as locked by the thread, and return the records to the thread. While these locks are held, any other thread that attempts to either read the same records with the intent to write, or write changes to those records, will get what is called a lock wait.

Only once the thread indicates that it is finished making changes to the records in question — by issuing a COMMIT statement — will the database release the locks on the records. What this lock strategy accomplishes is prevention of two threads simultaneously reading the same piece of data that they intend to change. One thread will wait for the other thread to finish reading and changing the data before its read succeeds. This means that using a write-intent lock on the database system results in the following order of events:

figure D

For MySQL and PostgreSQL, the SQL construct used to indicate to the database server that the calling thread intends to change the records it is asking for is SELECT ... FOR UPDATE.

Using a couple of MySQL command-line client sessions, I’ll show you what effect this SELECT ... FOR UPDATE construct has on a normal MySQL database server (though the effect is identical for PostgreSQL). I created a test database table called fixed_ips that looks like the following:

CREATE TABLE `fixed_ips` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `host` INT(11) DEFAULT NULL,
  `instance_id` INT(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

I then populated the table with a few records representing IP addresses, all “available” for an instance: the host and instance_id fields are set to NULL:

mysql> SELECT * FROM fixed_ips;
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  1 | NULL |        NULL |
|  2 | NULL |        NULL |
|  3 | NULL |        NULL |
+----+------+-------------+
3 rows in set (0.00 sec)

And now, here, interleaved in time order, are the SQL commands executed in each of the sessions. Alice’s commands (thread A) appear at the sessA> prompt; Bob’s (thread B) appear at the sessB> prompt.

sessA>BEGIN;
Query OK, 0 rows affected (...)

sessA>SELECT NOW();
+---------------------+
| NOW()               |
+---------------------+
| 2014-12-31 09:03:07 |
+---------------------+
1 row in set (0.00 sec)

sessA>SELECT * FROM fixed_ips
    -> WHERE instance_id IS NULL
    -> AND host IS NULL
    -> ORDER BY id LIMIT 1
    -> FOR UPDATE;
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  2 | NULL |        NULL |
+----+------+-------------+
1 row in set (...)
 
 
sessB>BEGIN;
Query OK, 0 rows affected (...)

sessB>SELECT NOW();
+---------------------+
| NOW()               |
+---------------------+
| 2014-12-31 09:04:05 |
+---------------------+
1 row in set (0.00 sec)

sessB>SELECT * FROM fixed_ips
    -> WHERE instance_id IS NULL
    -> AND host IS NULL
    -> ORDER BY id LIMIT 1
    -> FOR UPDATE;
sessA>UPDATE fixed_ips
    -> SET host = 42,
    ->     instance_id = 42
    -> WHERE id = 2;
Query OK, 1 row affected (...)
Rows matched: 1  Changed: 1

sessA>COMMIT;
Query OK, 0 rows affected (...)
 
 
+----+------+-------------+
| id | host | instance_id |
+----+------+-------------+
|  3 | NULL |        NULL |
+----+------+-------------+
1 row in set (42.03 sec)

sessB>COMMIT;
Query OK, 0 rows affected (...)

Two things about the interplay between session A and session B are important to note. The first is the 42.03 seconds: it shows how long session B’s SELECT ... FOR UPDATE statement waited on the write-intent locks held by session A. The second is the id of 3 returned by session B’s SELECT ... FOR UPDATE statement: a different row was returned for the same query that session A issued. In other words, MySQL waited until session A issued a COMMIT before executing session B’s SELECT ... FOR UPDATE statement.

In this way, the write-intent locks constructed with the SELECT ... FOR UPDATE statement prevent the collision of threads changing the same record at the same time.
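In application code, the same pattern looks something like the sketch below. This is not Nova’s actual code: it assumes a DB-API connection to a MySQL/InnoDB (or PostgreSQL) server using a MySQLdb/pymysql-style %s parameter format, and the table matches the fixed_ips example above. The important detail is that the write-intent locks are held from the SELECT ... FOR UPDATE until the COMMIT:

def assign_fixed_ip_locked(conn, instance_id, host):
    cur = conn.cursor()
    # Other writers asking FOR UPDATE on this row now block until we COMMIT.
    cur.execute(
        "SELECT id FROM fixed_ips "
        "WHERE host IS NULL AND instance_id IS NULL "
        "ORDER BY id LIMIT 1 "
        "FOR UPDATE")
    row = cur.fetchone()
    if row is None:
        raise RuntimeError("no free fixed IPs")
    cur.execute(
        "UPDATE fixed_ips SET host = %s, instance_id = %s WHERE id = %s",
        (host, instance_id, row[0]))
    conn.commit()  # releases the write-intent locks
    return row[0]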

How locks “fail” with MySQL Galera Cluster

At the Atlanta design summit, I co-led an Ops Meetup session on databases and was actually surprised by my poll of who was using which database server for their OpenStack deployments. Out of approximately 220 people in the room, MySQL Galera Cluster was by far the most popular way of deploying MySQL for use by OpenStack services, with around 200 or so operators raising their hands that they used it. Standard MySQL was next, and there was one person using PostgreSQL.

MySQL Galera Cluster is a system that wraps the standard MySQL row-level binary replication log transmission with something called write-set replication, enabling synchronous replication between many nodes running the MySQL database server. Now, that’s a lot of fancy words to really say that Galera Cluster allows you to run a cluster of database nodes that do not suffer from replication slave lag. You are guaranteed that the data on disk on each of the nodes in a Galera Cluster is exactly the same.

One interesting thing about MySQL Galera Cluster is that it can efficiently handle writes to any node in the cluster. This is different from standard MySQL replication, which generally relies on a single master database server that handles writes and real-time reads, and one or more slave database servers that serve read requests from applications that can tolerate some level of lag between the master and slave. Many people refer to this setup as multi-master mode, but that is actually a misnomer, because with Galera Cluster, there is no such thing as a master and a slave. Every node in a cluster is the same. Each can apply writes coming to the node directly from a MySQL client. For this reason, I like to refer to such a setup as multi-writer mode.

This ability to have writes be directed to and processed by any node in the Galera Cluster is actually pretty awesome. You can direct a load balancer to spread read and write load across all nodes in the cluster, allowing you to scale writes as well as reads. This multi-writer mode is ideal for WAN-replicated environments, believe it or not, as long as the amount of data being written is not crazy-huge (think: Ceilometer), because you can have application servers send writes to the closest database server in the cluster, and let Galera handle the efficiency of transmitting write sets across the WAN.

However, there’s a catch. Peter Boros, a principal architect at Percona, a company that makes a specialized version of Galera Cluster called Percona XtraDB Cluster, was actually the first to inform the OpenStack community about this catch — in the aforementioned Ops Meetup session. The problem with MySQL Galera Cluster is that it does not replicate the write-intent locks for SELECT ... FOR UPDATE statements. There’s actually a really good reason for this. Galera does not have any idea about the write-intent locks, because those locks are constructions of the underlying InnoDB storage engine, not the MySQL database server itself. So, there’s no good way for InnoDB to communicate to the MySQL row-based replication stream that write-intent locks are being held inside of InnoDB for a particular thread’s SELECT ... FOR UPDATE statement [3].

Figure E

The ramifications of this catch are interesting, indeed. If two application server threads issue the same SELECT ... FOR UPDATE request to a load balancer at the same time, and the load balancer directs each thread to a different Galera Cluster node, both threads will get back the exact same record(s) with no lock waits [4]. Figure E illustrates this phenomenon, with the circled 1, 2, and 3 events representing things occurring at exactly the same time (due to no locks being acquired or held).

One might be tempted to say that Galera Cluster, due to its lack of support for SELECT ... FOR UPDATE write-intent locks, is no longer ACID-compliant, since now two threads can simultaneously select the same record with the intent of changing it. And while it is indeed true that two threads can select the same record with the intent of changing it, it is extremely important to point out that Galera Cluster is still ACID-compliant.

The reason is that even though two threads can simultaneously read the same record with the intent of changing it (which is identical to the behaviour you would see if the FOR UPDATE were left off the SELECT statement), if both threads attempt to write a change to the same record via an UPDATE statement, at most one of the threads will succeed in updating the record, never both. The reason for this lies in the way that Galera Cluster certifies a write set (the set of changes to data). If node 1 writes an update to disk, it must certify with a quorum of nodes in the cluster that its update does not conflict with updates to those nodes. If node 3 has begun changing the same row of data, but has not certified with the other nodes in the cluster for that write set, then it will fail to certify the original write set from node 1 and will send a certification failure back to node 1.

This certification failure manifests itself as a MySQL deadlock error, specifically error 1213, which will look like this:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

All nodes other than the one that first “won” — i.e. successfully committed and certified its transaction — will return this deadlock error to any other thread that attempted to change the same record(s) at the same time as the thread that “won”. Need a visual of all this interplay? Check out figure F, which I scraped together for the graphically-inclined.

Figure F

If you ever wondered why, in the Nova codebase, we make prodigious use of a decorator called @_retry_on_deadlock in the SQLAlchemy API module, it is partly because of this issue. These deadlock errors can be consistently triggered by running load tests or things like Tempest that can put a load on the database that forces “hot spots” in the data to occur. This decorator does exactly what you’d think it would do: it retries the transaction if a deadlock error is returned from the database server.
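For illustration only, here is a minimal sketch of what such a retry decorator can look like. This is not the actual Nova implementation; the exception class is a hypothetical stand-in for whatever deadlock error (MySQL error 1213) your database driver surfaces:

import functools
import time


class DBDeadlockError(Exception):
    """Hypothetical stand-in for the driver's deadlock exception."""


def retry_on_deadlock(max_retries=5, delay=0.1):
    """Re-run the wrapped database transaction when a deadlock error bubbles up."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except DBDeadlockError:
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)  # back off briefly, then retry the whole transaction
        return wrapper
    return decorator

Note that the entire transaction body is re-executed on every retry, and each failed attempt has already paid for a round of certification before the deadlock error ever reaches the decorator.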

So, given what we know about MySQL Galera Cluster, one thing we are trying to do is entirely remove any use of SELECT ... FOR UPDATE from the Nova code base. Since we know it doesn’t work the way people think it works on Galera Cluster, we might as well stop using this construct in our code. However, the retry-on-deadlock mechanism is actually not the most effective or efficient mechanism we could use to solve the concurrent update problems in the Nova code base. There is another technique, which I’ll call compare and swap, which offers a variety of benefits over the retry-on-deadlock technique.

Compare and swap

One of the drawbacks to the retry-on-deadlock method of handling concurrency problems is that it is reactive by nature. We essentially wrap calls that tend to deadlock with a decorator that catches the deadlock error if it arises and retries the entire database transaction. The problem with this is that the deadlock error that manifests itself from the Galera Cluster write-set certification failure (see Figure F above) takes a not-insignificant amount of time to surface.

Think about it. A thread manages to start a write transaction on a Galera Cluster node. It writes the transaction on the local node and gets all the way up to the point of doing the COMMIT. At that point, the node sends out a certification request to each node in the cluster (in parallel). It must then wait until a quorum of those nodes respond with a successful certification. If another node has an active write set that changes the same rows, then a deadlock will occur, and that deadlock will eventually bubble its way back to the caller, who will retry the exact same database transaction. All of these things, while individually very quick in Galera Cluster, do take some amount of time.

What if we used a technique that allowed us to structure our SQL statements in such a way that we could find out whether our UPDATE will succeed without waiting on a round of certification traffic between Galera Cluster nodes? Well, there is one.

Consider the following SQL statements, modeled on the CLI examples above:

BEGIN;
/* Grab the "first" unassigned IP address */
SELECT id FROM fixed_ips
WHERE host IS NULL
AND instance_id IS NULL
ORDER BY id
LIMIT 1
FOR UPDATE;
/* Let's assume that the above query returned the
   fixed_ip with ID of 1
   We now "assign" the IP address to instance #42
   and on host #99 */
UPDATE fixed_ips
SET host = 99, instance_id = 42
WHERE id = 1;
COMMIT;

Now, we know that the locks taken for the FOR UPDATE statement won’t actually be considered by any other nodes in a Galera Cluster, so we need to get rid of the use of SELECT ... FOR UPDATE. But how can we structure things so that the SQL sent to any node in the Galera Cluster neither stumbles into a deadlock error nor requires the node executing our statements to contact any other node to determine whether another thread updated the same record between the time we SELECT‘d our record and the time we go to UPDATE it?

The answer lies in constructing an UPDATE statement whose WHERE clause includes the values of the fields from the previously SELECT‘ed record, like so:

/* Grab the "first" unassigned IP address */
SELECT id FROM fixed_ips
WHERE host IS NULL
AND instance_id IS NULL
ORDER BY id
LIMIT 1;
/* Let's assume that the above query returned the
   fixed_ip with ID of 1
   We now "assign" the IP address to instance #42
   and on host #99, but specify that the host and
   instance_id fields must match our original view
   of that record -- i.e., they must both be NULL
*/
UPDATE fixed_ips
SET host = 99, instance_id = 42
WHERE id = 1
AND host IS NULL
AND instance_id IS NULL;

If we structure our application code so that it executes the above SQL statements, each statement can be executed on any node in the cluster, without waiting for a certification failure before “knowing” whether the UPDATE would succeed. Remember that write-set certification in Galera only happens once the local node (i.e. the node originally receiving the SQL statement) is ready to COMMIT the changes. Well, if thread B managed to update the fixed_ip record with id = 1 in between the time when thread A does its SELECT and the time thread A does its UPDATE, then the WHERE condition:

WHERE id = 1
AND host IS NULL
AND instance_id IS NULL;

will fail to match any rows in the database to update, since host IS NULL AND instance_id IS NULL will no longer be true if another thread updated the record. We can catch this failure to update any rows far more cheaply than waiting for a certification failure: the thread that sent the UPDATE ... WHERE ... host IS NULL AND instance_id IS NULL statement is told that no rows were updated before any certification traffic would ever be generated (since there’s no certification needed if nothing was updated).

Do we still need a retry mechanism? Yes, of course we do: we retry the SELECT, then UPDATE ... WHERE statements when a previous UPDATE ... WHERE statement returned zero rows affected. The difference between this compare-and-swap approach and the brute-force retry-on-deadlock approach is that we’re no longer reacting to an exception emitted after a failed round of certification, but instead being proactive and structuring our UPDATE statement to pass in our previous view of the record we want to change, allowing for a much tighter retry loop (no waiting on certification, simply detect whether rows_affected is greater than zero).
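Expressed as application code, the compare-and-swap loop might look something like the following sketch. The function and table names are hypothetical (they follow the fixed_ips examples above), and plain DB-API calls with sqlite3-style ? parameters are used purely for illustration:

def allocate_fixed_ip(conn, instance_id, host, max_retries=5):
    """Compare-and-swap allocation of a free fixed IP record."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT id FROM fixed_ips "
            "WHERE host IS NULL AND instance_id IS NULL "
            "ORDER BY id LIMIT 1").fetchone()
        if row is None:
            raise RuntimeError("no free fixed IPs")
        # Only update the row if it still looks the way we read it.
        cursor = conn.execute(
            "UPDATE fixed_ips SET host = ?, instance_id = ? "
            "WHERE id = ? AND host IS NULL AND instance_id IS NULL",
            (host, instance_id, row[0]))
        conn.commit()
        if cursor.rowcount == 1:
            return row[0]  # our view was still current; we won the race
        # rowcount == 0: another thread grabbed this record; loop and pick another
    raise RuntimeError("could not allocate a fixed IP after %d attempts" % max_retries)

The retry here is a tight application-level loop: there is nothing to wait on, just a check of how many rows the UPDATE affected.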

This compare and swap mechanism is what I describe in the lock-free-quota-management Nova blueprint specification. There’s been a number of mailing list threads and IRC conversations about this particular issue, so I figured I would write a bit and create some pretty graphics to illustrate the sequencing of events that occurs. Hope this has been helpful. Let me know if you have thoughts on the topic or see any errors in my work. Always happy for feedback.

[1] This is just for example purposes. Technically, such a CIDR would result in 13 available addresses in Nova, since the gateway, cloudpipe VPN, and broadcast addresses are reserved for use by Nova.

[2] We are not using Neutron in our example here, but the same general problem resides in Neutron’s IPAM code as is described in this post.

[3] Technically, there are trade-offs between pessimistic locking (which InnoDB uses locally) and optimistic locking (which Galera uses in its write-set certification). For an excellent read on the topic, check out Jay Janssen‘s blog article on multi-node writing and deadlocks in Galera.

[4] If both threads happened to hit the same Galera Cluster node, then the last thread to execute the SELECT ... FOR UPDATE would end up waiting for the locks (in InnoDB) on that particular cluster node.

So, What is the Core of OpenStack?

In a conversation on Twitter, Lydia Leong stated something that I’ve heard from a number of industry folks and OpenStack insiders alike:

The core needs to be small, rock-solid stable, and readily extensible.

I responded:

Depends on what you mean by “core” :) I think that term has been abused.

It’s probably worth writing down a response that spans more than 140 characters, so I decided to write a post about why the term “core” is, indeed, abused, and some of my thoughts about Lydia’s pronouncement.

First, on specificity

In my years working in the cloud space, it has dawned on me that there really is no single way of looking at the cloud development and deployment space. As soon as one person (myself included) tries to describe with any detail what cloud systems are, invariably someone else will say “no, it’s not (just) that… it’s this as well.”

For instance, if I say that cloud is on-demand computing that gives application developers tools to drive their own deployment, someone will correctly point out that cloud is also hardware that has been virtualized to fit budgetary and technological needs of an IT department.

Similarly, if I said that cloud was all about treating VMs as cattle, someone will rightly come along and say that legacy applications and “pet VMs” have as much of a right to benefit from virtualized infrastructure as those hipster-scale folks.

One man’s young dame is another’s old woman. (http://bit.ly/to-each-his-own)

And, then John Dickinson will appropriately say, “hey, Jay, cloud isn’t all about compute, you know.” And of course, he’d be totally correct.

My point is that, well, cloud means something different to everyone. And there’s really nothing wrong with that. It just means that when you express an idea about the cloud space, you should qualify exactly what it is you are applying that idea to, and conversely, what you are not applying your idea to.

And, of course, Twitter, being limited in its conversational envelope size, is hardly an ideal medium to express grand ideas about a space such as “the cloud” that already suffers from a dearth of crisp definition.

On golf balls, layercake, and taxonomy

Forrester, Gartner and other companies obsessively attempt to categorize and rank companies and products in ways that they feel are helpful to their CIO/tech buyer audience. And that’s fine; everybody’s got to make a living out here.

But lately, it seems the OpenStack developer community has gotten gung-ho about categorizing various OpenStack projects. I want to summarize here some of the existing thoughts on the matter, before delving into my own personal opinions.

Late last year, Dean Troyer originally posted his ideas about categorizing projects within the OpenStack arena using a set of layers. His ideas were centered around finding a way to describe new projects in a technical (as opposed to trademark or political) sense, in order to more easily identify where the project fit in relation to other projects. The impetus for Dean’s “OpenStack Layers” approach was his work in DevStack, in trying to detect the boundaries and dependencies between components that DevStack configures.

Sean Dague more recently expanded on Dean’s ideas, attempting to further clarify where newer (since Dean’s post) incubated and integrated projects lie in the OpenStack layers. Monty Taylor and Robert Collins followed up Sean’s post with posts of their own, each attempting to further provide ways in which OpenStack projects may be grouped together.

Dean’s model had five simple layers:

  • Layer 0: Operating Systems and Libraries — OS, database, message queue, Oslo, other libraries
  • Layer 1: The Basics — Keystone, Glance, Nova
  • Layer 2: Extending the Base — Cinder, Neutron, Swift, Ironic
  • Layer 3: All the options — Ceilometer, Horizon
  • Layer 4: Turtles all the way up — Heat, Trove, Zaqar, Designate

Sean’s model extended Dean’s, adding in a couple of projects, like Barbican, that had not really been around at the time of Dean’s post, and calling the turtle layer “Consumption Services”.

Monty’s model grouped OpenStack projects in yet a different manner:

  • Layer #1 The Only Layer — Any project that is needed to spin up a VM running WordPress on it. His list here is Keystone, Glance, Nova, Cinder, Neutron and Designate. Monty believes this is where the layer model falls apart, and all other terms should be simple “tags” that describe some aspect of a project
  • Tag: Cloud Native — Any project “that provide(s) features for end user applications which could be provided by services run within VMs instead.” Examples here are Trove and Swift. Trove provides managed databases in VMs instead of databases running on bare metal. Swift provides object storage instead of the user having to run their own bare-metal machines with lots of disk space.
  • Tag: Operations — Any project that bridges functional gaps between various components for operators of OpenStack clouds. Examples here include Ceilometer and Ironic.
  • Tag: User Interface — Any project that enhances the user interface to other OpenStack services. Examples here are Horizon, Heat, and the openstack-sdk
http://bit.ly/flaming-golf-ball

Thierry Carrez redefined Monty’s “Layer #1” as “Ring 0”. His depiction of OpenStack is like the construction of a golf ball, with a small rubber core and a larger plastic covering[1], rather than the layercake of Sean and Dean. In Thierry’s model, the OpenStack projects that would live in “Ring 0” would be the projects that would need to be “tightly integrated” and “limited-by-design”. All other projects would just live outside this Ring 0, but still be “in OpenStack”. Thierry also brings up the question of what to do about the concept of Programs, which, to this date, are the official way in which the OpenStack community blesses teams of people working towards common missions. At least, that’s what Programs are supposed to be. In practice, they tend to be viewed as equal to the main project that the Program began life as.

Finally, Robert Collins’ take on the layers discussion brought up a couple of interesting points, about the importance of APIs versus implementations (which I will cover shortly) and about the “core” of OpenStack really being smaller than Monty envisioned, but Robert ended up with a taxonomy of OpenStack projects that was broken up by functional categories. There would be a set of teams that would select which projects implemented an API that belonged to one of the following functional categories:

  • IaaS product: selects components from the tent to make OpenStack/IaaS
  • PaaS product: selects components from the tent to make OpenStack/PaaS
  • CaaS product: (containers)
  • SaaS product: (storage)
  • NaaS product: (networking – but things like NFV, not the basic Neutron we love today). Things where the thing you get is useful in its own right, not just as plumbing for a VM.

So why do we insist on categorizing these projects?

All of the above OpenStack leaders try to get at the fundamental question of what is the core of OpenStack? But, really, what is this obsession we have with shoving OpenStack projects into these broad categories? Why do we need to constantly define what this core thing is? And what does Lydia refer to when she says that the “core needs to be small, rock-solid stable, and readily extensible”?

I am going to posit that labeling some set of projects “the OpenStack Core” is actually not useful to any of our users[2], and that replacing overloaded terms such as “core” or “integrated” or “incubated” with a set of informative tags for each project will lead to a more enlightened OpenStack user community.

Monty started his blog post with a discussion about the different types of users that we serve in the OpenStack community. I think when we answer questions about OpenStack projects, we always need to keep in mind what information is actually useful to these different user groups, and why that information is useful. When we understand the characteristics that make a certain piece of information useful to a group of users, we can emphasize those characteristics in the language we use to describe OpenStack.

Operators

Let’s take the group of OpenStack users that Monty calls “deployers”, that I like to call “operators”. These are folks that have deployed OpenStack and are running an OpenStack cloud for one or more users. What kinds of information do these users crave? I can think of a number of questions that this user group frequently asks:

  • Should I deploy OpenStack using an OpenStack distribution like RDO, or should I deploy OpenStack myself using something like DevStack or maybe the Chef cookbooks on Stackforge?
  • Is the Icehouse version of Nova more stable than Havana?
  • Does Ceilometer’s SQL driver handle 2000 or more VMs?
  • What notification queues can my Nagios NRPE plugin monitor to get an early warning sign that something is degraded?
  • Can my old Grizzly nova-network deployment be upgraded to Icehouse Neutron with no downtime to VM networking?
  • What is the best way to diagnose data plane network connectivity issues with Havana Neutron deployments that use OpenVSwitch 1.11 and the older OpenVSwitch agent with ML2?
  • Is Heat capable of supporting 100 concurrent users?

For operators, questions about OpenStack projects generally revolve around stability, performance & scalability and deployment & diagnostics. They want to know practical information on their options with regards to how they can deploy OpenStack, keep it up and running smoothly, and maintain it over time.

What does defining a set of “core OpenStack projects” give the operator user group? Nothing.

What might we be able to tag OpenStack projects with that would be of use to operators? Well, I think the answer to this question comes in the form of answers to the questions that these operators frequently ask. How’s this for a set of tags that might be useful to operators?

  • included-in-$distribution-$version: Indicates a project has been packaged for inclusion in some OpenStack distribution. Examples: included-in-rdo-icehouse, included-in-uca-trusty, included-in-mos-5.1
  • stability-$rating: Indicates the operator community’s viewpoint on the stability of a project. Examples: stability-experimental, stability-improved, stability-mature
  • driver-$driver-experimental: Indicates the developers of driver $driver consider the code to be experimental. Examples: driver-sql-experimental, driver-docker-experimental
  • puppet-tested: Indicates that Puppet modules exist in the openstack/ code namespace that are functionally tested to install and configure the service. Similar tags could exist for chef-tested or ansible-tested, etc
  • operator-docs[-$topic]: Indicates that there is Operator-specific documentation for the project, optionally with a $topic suffix. Examples: operator-docs-nagios, operator-docs-monitoring
  • rally-verified-sla: Indicates the project has one or more Rally SLA definitions included in its gate testing platform
  • upgrade-from-$version-(in-place|downtime-needed): Indicates a project can be upgraded with or without downtime from some previous $version. Examples: upgrade-from-icehouse-in-place, upgrade-from-juno-downtime-needed

I personally feel all of the above tags contain more useful information for operators than having a set of “core OpenStack projects”.

Application developers (end users and DevOps folks)

Monty calls this group of users “end users”. The end users of clouds are application developers and DevOps people who support the operation of an application on a cloud environment. Again, let’s take a look at what this group of users typically cares about, in the form of frequent questions that this user group poses:

  • Does the Fog Ruby library support connecting to HP Helion?
  • Can I use python-neutronclient to connect to RAX Cloud and if so, does it support all of Neutron’s Firewall API extensions?
  • Can I use Zaqar to implement a simple RPC mechanism for my cloud application? Do any public clouds expose Zaqar?
  • Can I develop my cloud application against a private OpenStack cloud created with DevStack and deploy the application on Amazon EC2 and S3?
  • How can I deploy a WordPress blog for my pointy-haired boss in my company’s OpenStack private cloud in less than 15 minutes to get him off my back?
  • Can I use Nagios to monitor my cloud application, or should I use Ceilometer for that?

What does defining a set of “core OpenStack projects” give the end user group? Nothing.

How about a set of informative tags instead?

  • supported-on-$publiccloud: Indicates that $publiccloud supports the service. Examples: supported-on-rax, supported-on-hp
  • cloud-native: Indicates the project implements a service designed for applications built for the cloud
  • user-docs[-$topic]: Indicates there is documentation specifically for application developers around this project, optionally for a smaller $topic. Examples: user-docs-nagios, user-docs-fog, user-docs-wordpress
  • compare-to-$subject: Indicates that the service provides similar functionality to $subject. Examples: compare-to-s3, compare-to-cloud-foundry
  • compat-with-$api: Indicates the project exposes functionality that allows $api to be used to perform some native action. Examples: compat-with-ec2-2013.10.9, compat-with-glance-v2. Optionally consider a -with-docs suffix to indicate documentation on the compatibility exists

You will probably note that many of the tags for application developers involve the existence of documentation about compatibility with, or comparison to, other APIs or services. This is because docs really matter to application developers. API docs, tutorials, usage examples, SDK documentation. All of this stuff is critical to driving strong adoption of OpenStack with cloud app developers. The move to a documentation focus can and should start with a simple set of user-focused tags that inform our application developer users about the existence of official documentation about the things they care about.

Packagers

The next group of OpenStack users are the packagers of the OpenStack projects. Monty calls this group of users the “distributors”. Folks who work on operating system packages of OpenStack projects and libraries, folks who work on the OpenStack distributions themselves, and folks who work on configuration management tool modules that install and configure an OpenStack project are all members of this user group. This group of users care about a number of things, such as:

  • Does the Juno Nova release have support for the version of libvirt on Ubuntu Trusty Tahr?
  • Does version X of the python-novaclient support version Y of the Nova REST API?
  • Can Havana Nova speak with an Icehouse Glance server?
  • Which Keystone token driver should I enable by default?
  • Have database migrations for Icehouse Neutron been tested with PostgreSQL 9.3?
  • Has the qpid message queue driver been tested with Juno Ceilometer?

What does defining a set of “core OpenStack projects” give the packager user group? Nothing.

You’ll notice that many of the concerns that packagers have revolve around two things: documentation around version dependencies and testing of various optional drivers or settings. What does a designation that a project is “in core OpenStack” answer for the packager? Really, nothing at all. A finer-grained source of information is what is desired. A set of tags would be much more useful:

  • gated-with-$thing[-$version]: Indicates that patches to the project are gated on successful functional integration testing with $thing at an optional $version. Examples: gated-with-neutron, gated-with-postgresql, gated-with-glanceclient-1.1
  • tested-with-$thing[-$version]: Indicates that functional integration tests are run post-merge against the project. Examples: tested-with-ceph, tested-with-postgresql-9.3
  • driver-$driver-(recommended|default|gated|tested): Indicates a $driver is recommended for use, is the default, or is tested with some gate or post-merge integration tests. Examples: driver-sql-recommended, driver-mongodb-gated

OpenStack Developers

OK, so the final group of OpenStack users is the developers of the OpenStack projects themselves. This is the group that does the most squealing about categorizing projects in and out of the OpenStack tent. We’ve created a governance structure and organizational model that, as Zane Bitter said, is like the reverse of Conway’s law. We tend to box ourselves in because we focus so intently on creating these categories by which we group the OpenStack projects.

We created the terms “incubated” and “integrated” to ostensibly inform ourselves of which projects have aligned with the OpenStack governance model and infrastructure tooling processes. And then we created the gate as a model of those categories, without first asking ourselves whether the terms “incubated” and “integrated” were really serving a technically sound purpose. We group all the integrated and incubated projects together, saying that all these projects must integrate with each other in one giant, complex gate testing platform.

Zane thinks that is madness, and I tend to agree.

Personally, I feel we need to stop thinking in terms of “core” or “incubated” or “integrated” and instead stick to thinking about the projects that live in the OpenStack tent[3] in terms of the soft and hard dependencies that each project has on the others. In graphical representation, the current set of OpenStack projects (that aren’t software libraries) might look something like this:

OpenStack Non-library Project Dependency Graph

Converting that graphical representation to something machine-readable might produce a YAML snippet like this:

projects:
  - keystone
  - tempest
  - swift:
      soft:
        - keystone
  - glance:
      hard:
        - keystone
      soft:
        - swift
  # ...
  - nova:
      hard:
        - keystone
        - glance
        - cinder
      soft:
        - neutron
        - ironic
        - barbican
The YAML above is nothing more than a set of tags that could be applied to OpenStack projects to inform developers about their relative dependency on other projects. This dependency graph information can then be used to construct a testing platform that should be more efficient than the current system that looks the way it does because we’ve insisted on using this single “integrated” category to describe the relationship of an OpenStack project with another.
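As a rough sketch of how such dependency tags could be consumed (assuming PyYAML is available and a layout like the snippet above; the helper below is purely illustrative), a gate-construction tool could compute the transitive hard dependencies of a project and co-gate only those:

import yaml  # PyYAML, assumed to be available

DEPENDENCIES = yaml.safe_load("""
projects:
  - keystone
  - glance:
      hard: [keystone]
      soft: [swift]
  - nova:
      hard: [keystone, glance, cinder]
      soft: [neutron, ironic, barbican]
""")


def hard_deps(projects, name, seen=None):
    """Transitively collect the hard dependencies of a single project."""
    seen = set() if seen is None else seen
    for entry in projects:
        if isinstance(entry, dict) and name in entry:
            for dep in (entry[name] or {}).get("hard", []):
                if dep not in seen:
                    seen.add(dep)
                    hard_deps(projects, dep, seen)
    return seen


print(sorted(hard_deps(DEPENDENCIES["projects"], "nova")))
# prints ['cinder', 'glance', 'keystone'], the minimal co-gate set for nova

Soft dependencies could be handled the same way, perhaps tested post-merge rather than on every patch.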

So, my answer to Lydia

Seems I’ve yet again exceeded my 140 character limit. And I haven’t addressed Lydia’s point about the “core needs to be small, rock-solid stable, and readily extensible.” I think you can probably tell by now that I don’t think there is a useful way of denoting “a core OpenStack”. At least, not useful to our users. Does this mean that I disagree with Lydia’s sentiment about what is important for the OpenStack contributor community to focus on? Absolutely not.

So, in the spirit of specificity and useful taxonomic tagging, I will go further and make the following statements about the OpenStack community and code using Lydia’s words.

Our public REST APIs need to be designed to solve small problems

Right now, the public REST APIs of a number of OpenStack projects suffer from what I like to call “extension-itis”. Some of the REST APIs honestly feel like they were stitched together by a group of kindergartners, each trying to purposefully do something different than the last kid who attached their little piece of the quilt. APIs need to be small, simple, and clear. Why? Because they are our first impression, and OpenStack will live and die by the grace of the application developers that use our APIs.

We should have a working group, composed of members of the Technical Committee, interested operators and end users, and API standards experts, that controls the public face of OpenStack: its HTTP REST APIs. This working group should have teeth to it, and be able to enforce a set of rules across OpenStack projects about API consistency, resource naming, clarity, discoverability and documentation.

Our deployment tools need to be as rock-solid stable as any other part of our code

I’m not just talking about the official OpenStack Deployment Program here (Triple-O). I’m talking about the combination of Chef cookbooks, Puppet modules, Ansible playbooks, Salt state files, DevStack scripts, Mirantis FUEL deployment tooling, and the myriad other things that deploy and configure OpenStack services. They need to be tested in our continuous integration platform with the same zeal as everything else. As a community, we need to dedicate resources to make this a reality.

Our governance model needs to be readily extensible

Unfortunately, I believe our governance model and organizational structure have a tendency to reinforce the status quo and not adapt to changing times as quickly as they should. Examples of this rigidity are plentiful. See, for example, the inability of the Technical Committee to come up with a single set of criteria for Zaqar’s graduation from incubation status. As a member of the Technical Committee, I can say I was pretty embarrassed at the way we treated the Zaqar contributor community. I’d love for the role of the Technical Committee to change from one of Judge to one of Advisor. The current Supreme OpenStack Court vision just reinforces the impression that our community cannot find a meaningful balance between progress and stability, and I for one refuse to admit that such a balance cannot be reached.

The current organization of official OpenStack Programs is something I also believe is no longer useful, and discourages our community from extending itself in the ways we should be extending. For example, we should be encouraging the sort of competition and pluralism that having multiple overlapping services like Ceilometer, Monasca, and Stacktach all under the OpenStack tent would bring.

There is no core

Finally, in case it’s not obvious by now: I don’t believe there is any OpenStack core. Or at least, I don’t believe there’s any use in spending the effort required to concoct what it might be.

OK, I think that’s just about enough for one night of writing. Thanks very much if you made it all the way through. I look forward to your comments.

[1] OK, yes, golf balls have more than just a core and a larger plastic covering…just go with me on this and stop being so argumentative.

[2] Denoting some subset of OpenStack projects as “core” may be useful to industry pundits and marketing folks, but those aren’t OpenStack users.

Pushing revisions to a Gerrit code review

Martin Stoyanov recently asked an excellent question about the proper way to push revisions to a changeset in Gerrit. This is a very common question that folks new to Gerrit or Git have, and I think it deserves its own post.

When you do the first push of a local working branch to Gerrit, the act of pushing your code creates a Gerrit changeset. The changeset can be reviewed, and in the process of doing that review, it’s common for reviewers to request that the submitter make some changes to the code. Sometimes these changes are stylistic or cosmetic. Other times, the requested modifications can be extensive.

How you handle making the requested modifications and submitting those changes back to Gerrit depends on a few things:

  1. Are the changes requested mostly stylistic or cosmetic?
  2. Are the changes requested going to provide additional functionality that is dependent on the existing changeset?
  3. Are the changes requested going to provide additional functionality that is independent of the existing changeset?

Depending on the answers to the above questions, you should either amend the existing changeset commit, push a new commit to Gerrit from the same local branch, or push a new commit from a new local branch. Here’s some quick guidelines to help you decide:

The Changes Requested Are Cosmetic or Style-related

When a reviewer is providing some stylistic advice or offering suggestions for cosmetic changes or cleanups, you should amend the original commit. Do so like this:

# After making the requested changes...
git commit -a --amend
# Modify the commit message, if necessary, and save your editor
git review

While this looks fairly simple (and it is…), many folks make a fatal mistake when they modify the commit message and add sections that describe the “sausage-making” involved in the cleanups. DO NOT do this. It’s not necessary. Avoid adding any lines to the commit message that look like this:

  • “Cleaned up whitespace”
  • “DRY’d up some stuff based on review comments”
  • “Fixed typos found during reviews”

If all you did was correct typos and whitespace, simply leave the commit message as it was originally. After the call to git review, you will see a new patchset appear in the original code review. This is expected. The changeset is still viewed by Gerrit and reviewers as a single changeset, and reviewers may even select “Patchset 1” from the “Old Version History” dropdown instead of “Base” in order to see only the changes made in this last amended commit.

The Changes Requested Are Extensive and Depend on Original Commit

If a reviewer has asked for modifications to your original code, and the requested modifications are fairly extensive and depend on the code in your original commit, you have four choices:

  • Amend the original commit to include all new changes
  • Amend the original commit for some things, push those changes to the original commit, make additional changes in the same local branch, git commit those additional changes and git review
  • Make additional changes, git commit those additional changes and git review
  • Lobby for your original commit to be accepted as-is, then when your change is accepted and merged into master, then create a new branch from master and push the additional changes in a new changeset

Whenever you do not amend a commit and issue a call to git review, you create a dependent changeset. Gerrit will assign a new Change-Id to the patchset, but understands that the commit logically follows your original changeset’s code. If you go to the code review screen of your newly-created changeset, you will see your original changeset referenced in the “Dependencies” section. Below, you can see a screenshot of a changeset that is part of a “dependency chain”: another patchset is dependent on this patchset, and this patchset is dependent on yet another patchset.

Changeset showing a dependent changeset and a changeset dependent on this one (chained dependent patchsets)

It’s best to avoid long chains of dependent patchsets. The reason is that if a reviewer requests changes for one of the changesets at the “bottom” of the dependency chain, the entire chain of changesets (even changesets that are approved, like the one shown above) is going to be held up from going through the gate tests.

The Changes Requested Are Extensive but are Independent of the Original Commit

If a reviewer has requested extensive changes, but points out that the changes they want are actually independent of the changes in your original commit, they will generally ask you to wait until the original changeset is merged and then create a new branch for the additional work. Normally, depending on the extent of the requested changes, reviewers will insist that the submitter create a new bug or blueprint on Launchpad to keep track of the additional work they feel is needed.

Conclusion

It’s up to the discretion of core reviewers and the original submitter to work out which of the above solutions works best for the particular changeset. Each changeset introduces code and functionality that must be treated differently, and changesets from one submitter may be dealt with differently than others. However, to keep things simple for yourself and upstream reviewers, it’s best to follow this simple advice:

  1. Prefer to amend the original commit. In most cases, this is the appropriate solution to push revisions to Gerrit.
  2. Don’t include sausage-making comments in the commit message.
  3. Prefer free-standing changesets to long chains of dependent patches.
  4. Ask reviewers what their preferences are.

Follow those guidelines and you’ll keep yourself out of the weeds. For more detailed information, including strategies for handling updates to your code that is dependent on another branch of code that gets updated, see the excellent OpenStack GerritWorkflow documentation.

Upgrade to Xubuntu 12.04 – All Keyboard Shortcuts Don’t Work

Seriously, Ubuntu, upgrading between versions has become just painful… I waited a few weeks before I upgraded from Xubuntu 11.10 to 12.04 because the upgrade last time completely hosed my system and left me with a borked X configuration. This time, I upgraded and now none of my keyboard shortcuts work. None of them. I mean, they appear in my Keyboard Settings -> Application Shortcuts, but none of them work anymore. Seriously, WTF.

UPDATE:

Turns out that the issue is that (for some stupid reason), XFCE changed the name of the <Ctrl> key to “Primary”, so you need to go to Accessories -> Settings Manager -> Keyboard -> Application Shortcuts and then remove all your custom shortcuts that show <Control> in them and re-add them. You’ll notice that when you press the Ctrl key, it will now show up as <Primary>. Completely retarded.

Slides from Developing Drizzle Replication Plugins Tutorial

Hi all!

So, Padraig, Toru, and I teamed up yesterday at the MySQL Conference for thirty or so attendees to discuss developing Drizzle plugins in C++. It was a set of slides that covered basic stuff all the way up through some pretty advanced topics. We hope attendees got something out of it :)

Below are the slides from Padraig’s and my part of the tutorial which focused on plugin development basics and the replication plugin API in Drizzle. I’ve also tacked them onto my page of presentations.

Enjoy, and feel free to email me with comments and suggestions to SELECT REVERSE('moc.liamg@sepipyaj');

Developing Drizzle Replication Plugins


Open Office Impress slides
PDF slides


Topics included in the slides:

  • About the Drizzle Community and Expectations of Contributors
  • Getting started on Launchpad
  • Various features of Launchpad
  • Understanding the Source Code Directory Structure
  • Code walkthrough of Drizzle plugin basics
  • Drizzle’s System Architecture
  • Overview of Drizzle’s Replication System
  • Understanding Google Protobuffers
  • The Transaction message in Detail
  • In-depth code walkthrough of the Filtered Replicator module
  • In-depth code walkthrough of the Transaction Log module
  • Future of Drizzle replication – Publisher and Subscriber plugins

O’Gara Cloud Computing Article Off Base

Maureen O’Gara, self-described as “the most read technology reporter for the past 20 years”, has written an article about Drizzle at Rackspace for one of Sys-con’s online zines called Cloud Computing Journal, of which she is an editor.

I tried commenting on Maureen’s article on their website, but the login system is apparently borked, at least for registered users who use OpenID, since it still wants them to have a separate user ID and login. Note to sys-con.com: OpenID is designed so that users don’t have to remember yet another login for your website.

Besides having little patience for content-sparse websites that simply provide an online haven for dozens of Flash advertisements per web page, the article had some serious problems with it, not the least of which was using large chunks of my Happiness is a Warm Cloud article without citation. Very professional.

OK, to start with, let’s take this quote from the article:

Drizzle runs the risk of not being as stable as MySQL, because the Drizzle team is taking things out and putting other stuff in. Of course it may be successful in trying to create a product that’s more stable than MySQL. But creating a stable DBMS engine is something that has always taken years and years.

This is just about the most naïve explanation for whether a product will or will not be stable that I’ve ever read. If Maureen had bothered to email or call any one of the core Drizzle developers, they’d have been happy to tell her what is and is not stable about Drizzle, and why. Drizzle has not changed the underlying storage engines, so the InnoDB storage engine in Drizzle is the same plugin as available in MySQL (version 1.0.6).

The pieces of MySQL which were removed from Drizzle happen to be the parts of MySQL which have had the most stability issues — namely the additional features added to MySQL 5.0: stored procedures, views, triggers, stored functions, the INFORMATION_SCHEMA implementation, and server-side cursors and prepared statements. In addition to these removed features of MySQL, Drizzle also has no built-in Query Cache, does not support anything other than UTF-8 character sets, and has removed the MySQL replication system and binary logging — moving a rewrite of these pieces out into the plugin ecosystem.

The pieces that were added to Drizzle have mostly been added as plugins that provide the new functionality. Maureen, the reason this was done was precisely to allow for greater stability of the kernel: new features and functionality are segregated into the plugin ecosystem, where they can be properly versioned and quarantined. It’s pretty much the biggest principle of Drizzle’s design…

The core developers of Drizzle (and much of the Drizzle community) would also have been happy to tell Maureen how the Drizzle team defines “stability”: when the community says Drizzle is stable — simple as that.

OK, so the next thing I took objection to is the following line:

Half of Rackspace’s customers are on MySQL so there’ll be some donkey-style nosing to get them to migrate.

I think my Rackspace colleagues might have quite a bit to say about the above. I haven’t seen any Rackers talking about mass migration from MySQL to Drizzle. As far as I have seen, the plan is to provide Drizzle as an additional service to Rackspace customers.

Rackspace evidently wants its new boys, who were not the core pillars of the MySQL engineering team, to hitch MySQL, er, Drizzle to Cassandra

MySQL != Drizzle. Implying that the two are equal does a disservice to both, as they have very different target markets and developer audiences.

The smart money is betting that even if a good number of high-volume web sites go down this route, an even higher number such as Facebook and Google will continue with relational databases, primarily MySQL.

Again, probably best to do your homework on this one, too. Facebook runs an amalgamation of a custom MySQL version and storage engines, distributed key-value stores, and Memcached servers. I would think that Facebook moving to Drizzle would be one tough migration. Thousands (tens of thousands?) of MySQL servers all running custom software and integrated into their caching layers is a huge barrier to entry, and not one I would expect a large site like Facebook to casually undertake. But, the same could be said about a move to SQL Server or Oracle, for that matter, and has little to do with Drizzle.

Google is moving away from using MySQL entirely. Mark Callaghan, previously at Google, has moved over to Facebook (possibly because of this trend at Google to get rid of MySQL), and Anthony Curtis, formerly of MySQL, then Google, left Google partly for the same reason.

OK, so the next quote got me really fired up because it demonstrates a complete lack of understanding (maybe not Maureen’s, but the unnamed source it’s from at least):

Somebody – sorry we forget who exactly – claimed that as GPL 2 code Drizzle “severely limits revenue opportunities. For Rackspace, the opportunity to have some key Drizzle developers on its payrolls basically comes down to a promotional benefit, trying to position Rackspace as particularly Drizzle-savvy in the eyes of the community and currying favor for its seemingly generous contributions. What’s unclear is whether they may develop some Drizzle-related functionality that they will then not release as open source and just rent out to Rackspace hosting customers…that would be a way for them to differentiate themselves from competitors and GPLv2 would in principle allow this.”

A few points to make about the above quote.

First, name your source. I find it difficult to believe that the most-read technology writer would not write down a source. Is it the same person you deliberately left out of a quote from my Happiness article? (why did you do that, btw?).

Second, the MySQL server source code is licensed under the GPL 2, and so is Drizzle’s kernel, because it is a derivative work of the MySQL server.

Let me be clear: Developers who contribute code to Drizzle do so under the GPLv2 if that contribution is in the Drizzle kernel. If the code contribution is a plugin, the contributor is free to pick whatever license they choose.

Third, licensing has little if anything to do with revenue at all. The license is beside the point. There are two things which dictate a company’s ability to derive revenue from software:

  1. Copyright ownership
  2. Principles of the Company

Drizzle, Rackspace, or any company a Drizzle contributor works for, does not have the copyright ownership of the MySQL source code, from which Drizzle’s kernel is derived. Oracle does. Therefore, companies do not have any right to re-sell Drizzle (under any license) without explicit permission from Oracle. Period. Has nothing to do with the GPLv2.

That said, contributors do have the right to make money on plugins built for the Drizzle server, and Rackspace, while not having expressed any interest to yours truly in doing so, has the right, like any other Drizzle contributor, to make money on plugins its contributors create for Drizzle.

It is my understanding (after actually having talked to Rackspace managers and decision makers) that Rackspace is not interested in getting into the business of selling commercial Drizzle plugins. Their core direction is to create value for their customers, and I fail to see how getting into the commercial software sales business meets that goal.

Next time, please feel free to contact myself or any other Drizzle contributor to get the low-down on Drizzle-related stuff. We’ll be nice. I promise.

jpipes.com is now joinfu.com

As you may have noticed, my blog has changed. No more old and busted Serendipity 0.9. Hello WordPress 2.9. Yeah \o/

Getting all the posts and comments from my old blog over to the new one took a manual data transformation, and I think I got everything transferred correctly. Please do let me know at REVERSE('moc.liamg@sepipyaj') if you see anything out of the ordinary or something is messed up. Thanks :)

Here’s to more blogging on joinfu.com ;)

An SQL Puzzle?

Dear Lazy Web,

What should the result of the SELECT below be? Assume InnoDB as the storage engine for all tables.

CREATE TABLE t1 (a INT, b INT);
INSERT INTO t1 VALUES (1,1),(1,2);
CREATE TEMPORARY TABLE t2 (a INT, b INT, PRIMARY KEY (a));
BEGIN;
INSERT INTO t2 VALUES (100,100);
CREATE TEMPORARY TABLE IF NOT EXISTS t2 (PRIMARY KEY (a)) SELECT * FROM t1;
 
# The above statement will correctly produce ERROR 23000: Duplicate entry '1' for key 'PRIMARY'
# What should the below result be?
 
SELECT * FROM t2;
COMMIT;