Understanding the OpenStack CI System

This post describes in detail the upstream OpenStack continuous integration platform. In the process, I’ll be describing the code flow in the upstream system — from the time the contributor submits a patch to Gerrit, all the way through the creation of a devstack environment in a virtual machine, the running of the Tempest test suite against the devstack installation, and finally the reporting of test results and archival of test artifacts. Hopefully, with a good understanding of how the upstream tooling works, setting up your own linked external testing platform will be easier.

Some History and Concepts

Over the past four years, there has been a steady evolution in the way that the source code of OpenStack projects is tested and reviewed. I remember when we used Bazaar for source control and Launchpad merge proposals for code review. There was no automated or continuous testing to speak of in those early days, which put pressure on core reviewers to do testing of proposed patches locally. There was also no standardized integration test suite, so often a change in one project would inadvertently break another project.

Thanks to the work of many contributors, particularly those patient souls in the OpenStack Infrastructure team, today there is a robust platform supporting continuous integration testing for OpenStack and Stackforge projects. At the center of this platform are the Jenkins CI servers, the Gerrit git and patch review server, and the Zuul gating system.

The Code Review System

When a contributor submits a patch to one of the OpenStack projects, they push their code to the git server managed by Gerrit running on review.openstack.org. Typically, contributors use the git-review Git plugin, which simplifies submitting to a git server managed by Gerrit. Gerrit controls which users or groups are allowed to propose code, merge code, and administer code repositories under its management. When a contributor pushes code to review.openstack.org, Gerrit creates a Changeset representing the proposed code. The original submitter and any other contributors can push additional amendments to that Changeset, and Gerrit collects all of the changes into the Changeset record. Here is a screenshot of a Changeset under review. You can see a number of patches (changes) listed in the review screen. Each of those patches was an amendment to the original commit.

Individual patches amend the changeset
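
For illustration, a typical contributor workflow with git-review looks something like the following. This is only a sketch; the repository, branch name, and commit message are made-up examples.

# One-time setup: clone the project and configure the Gerrit remote
git clone https://git.openstack.org/openstack/cinder
cd cinder
git review -s    # adds the "gerrit" remote and installs the Change-Id commit hook

# Create a topic branch, make a change, and submit it for review
git checkout -b fix-example-bug
# ... edit files ...
git commit -a -m "Fix example bug in the volume driver"
git review       # pushes the change to review.openstack.org as a new Changeset

# After review feedback, amend the same commit (keeping its Change-Id) and
# re-submit; Gerrit records the upload as a new patch in the same Changeset
git commit -a --amend
git review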

For each patch in Gerrit, there are three “labels” that may be applied to the patch: Code-Review, Verified, and Approved. Anyone can comment on a Changeset and/or review the code. A review is shown on the patch in the Code-Review column of the patch “label matrix”:

The "label matrix" on a Gerrit patch

The “label matrix” on a Gerrit patch

Non-core team members may give the patch a Code-Review label of +1 (Looks good to me), 0 (No strong opinion), or -1 (I would prefer you didn’t merge this). Core team members can give any of those values, plus +2 (Looks good to me, approved) and -2 (Do not submit).

The other columns in the label matrix are Verified and Approved. Only non-interactive users of Gerrit, such as Jenkins, are allowed to add a Verified label to a patch. The external testing platform you will set up is one of these non-interactive users. The value of the Verified label will be +1 (check pipeline tests passed), -1 (check pipeline tests failed), +2 (gate pipeline tests passed), or -2 (gate pipeline tests failed).

Only members of the OpenStack project’s core team can add an Approved label to a patch. It is either a +1 (Approved) value or not, appearing as a check mark in the Approved column of the label matrix:

An approved patch.

Continuous Integration Testing

Continuous integration (CI) testing is the act of running tests that validate a full application environment on a continual basis — i.e. when any change is proposed to the application. Typically, when talking about CI, we are referring to tests that are run against a full, real-world installation of the project. This type of testing, called integration testing, ensures that proposed changes to one component do not cause failures in other components. This is especially important for complex multi-project systems like OpenStack, with non-trivial dependencies between subsystems.

When code is pushed to Gerrit, a set of jobs is triggered that runs a series of tests against the proposed code. Jenkins is the server that executes and manages these jobs. It is a Java application with an extensible architecture that supports plugins that add functionality to the base server.

Each job in Jenkins is configured separately. Behind the scenes, Jenkins stores this configuration information in an XML file in its data directory. You may edit a Jenkins job manually as an administrator in Jenkins. However, in a testing platform as large as the upstream OpenStack CI system, doing so by hand would be virtually impossible and fraught with errors. Luckily, there is a helper tool called Jenkins Job Builder (JJB) that constructs these XML configuration files after reading a set of YAML files and job templating rules. We will describe JJB later in the article.

The “Gate”

When we talk about “the gate”, we are talking about the process by which code is kept out of a set of source code branches if certain conditions are not met.

OpenStack projects use a method of controlling merges into certain branches of their source trees called the Non-Human Gatekeeper model [1]. Gerrit (the non-human) is configured to allow merges by users in a group called “Non-Interactive Users” to the master and stable branches of git repositories under its control. The upstream main Jenkins CI server, as well as Jenkins CI systems running at third party locations, are the users in this group.

So, how do these non-interactive users actually decide whether to merge a proposed patch into the target branch? Well, there is a set of tests (different for each project) — unit, functional, integration, upgrade, style/linting — that is marked as “gating” that particular project’s source trees. For most of the OpenStack projects, there are unit tests (run in a variety of different supported versions of Python) and style checker tests for HACKING and PEP8 compliance. These unit and style tests are run in Python virtualenvs managed by the tox testing utility.
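
Developers can run the same unit and style checks locally that the gate runs, assuming the project uses the standard OpenStack tox setup (a sketch only; the exact environment names vary by project, and the test path in the last command is a made-up example):

# Run the Python 2.7 unit tests in a tox-managed virtualenv
tox -epy27

# Run the PEP8/HACKING style checks
tox -epep8

# Pass extra arguments through to the test runner to run a subset of tests
tox -epy27 -- tests.unit.test_something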

In addition to the Python unit and style tests, there are a number of integration tests that are executed against full installations of OpenStack. The integration tests are simply subsets of the Tempest integration test suite. Finally, many projects also include upgrade and schema migration tests in their gate tests.

How Upstream Testing Works

Graphically, the upstream continuous integration gate testing system works like this:

[Diagram: gerrit-zuul-jenkins-flow]

We step through this event flow in detail below, referencing the numbered steps shown in the diagram.

The Gerrit Event Stream and Zuul

After a contributor has pushed (1a) a new patch to a changeset or a core team member has reviewed the patch and added an Approved +1 label (1b), Gerrit pushes out a notification event to its event stream (2). This event stream can have a number of subscribers, including the Gerrit Jenkins plugin and Zuul. Zuul was developed to manage the many complex graphs of interdependent branch merge proposals in the upstream system. It monitors in-progress jobs for a set of related patches and will pre-emptively cancel any dependent test jobs that would not succeed due to a failure in a dependent patch [2].

In addition to this dependency monitoring, Zuul is responsible for constructing the pipelines of jobs that should be executed on various events. One of these pipelines is called the “gate” pipeline, appropriately named for the set of jobs that must succeed in order for a proposed patch to be merged into a target branch.

Zuul’s pipelines are configured in a single file called layout.yaml in the OpenStack-Infra config project. Here’s a snippet from that file that constructs the gate pipeline:

  - name: gate
    description: Changes that have been approved by core developers...
    failure-message: Build failed. For information on how to proceed...
    manager: DependentPipelineManager
    precedence: low
    trigger:
      gerrit:
        - event: comment-added
          approval:
            - approved: 1
        - event: comment-added
          comment_filter: (?i)^\s*reverify( (?:bug|lp)[\s#:]*(\d+))\s*$
    start:
      gerrit:
        verified: 0
    success:
      gerrit:
        verified: 2
        submit: true
    failure:
      gerrit:
        verified: -2

Zuul listens to the Gerrit event stream (3), and matches the type of event to one or more pipelines (4). The matching conditions for the gate pipeline are configured in the trigger:gerrit: section of the YAML snippet above:

    trigger:
      gerrit:
        - event: comment-added
          approval:
            - approved: 1
        - event: comment-added
          comment_filter: (?i)^\s*reverify( (?:bug|lp)[\s#:]*(\d+))\s*$

The above indicates that Zuul should fire the gate pipeline when it sees a review gain an Approved +1 label, or when it sees a comment on the review matching the comment_filter pattern (the word “reverify” followed by a bug identifier). Note that there is a similar pipeline that is fired when a new patchset is created or when a review comment is made with the word “recheck”. This pipeline is called the check pipeline. Look in the layout.yaml file for the configuration of the check pipeline.

Once the appropriate pipeline is matched, Zuul executes (5) that particular pipeline for the project that had a patch proposed.

“But wait, hold up…”, you may be asking yourself, “how does Zuul know which Jenkins jobs to execute for a particular project and pipeline?” Great question! :)

Also in the layout.yaml file, there is a section that configures which Jenkins jobs should be run for each project. Let’s take a look at the configuration of the gate pipeline for the Cinder project:

  - name: openstack/cinder
    template:
      - name: python-jobs
...snip...
    gate:
      - gate-cinder-requirements
      - gate-tempest-dsvm-full
      - gate-tempest-dsvm-postgres-full
      - gate-tempest-dsvm-neutron
      - gate-tempest-dsvm-large-ops
      - gate-tempest-dsvm-neutron-large-ops
      - gate-grenade-dsvm

Each of the lines in the gate: section indicates a specific Jenkins job that should be run in the gate pipeline for Cinder. In addition, there is the python-jobs item in the template: section. Project templates are a way for Zuul to consolidate the configuration of many similar jobs into a simple template configuration. The project template definition for python-jobs looks like this (still in layout.yaml):

project-templates:
  - name: python-jobs
...snip...
    gate:
      - 'gate-{name}-docs'
      - 'gate-{name}-pep8'
      - 'gate-{name}-python26'
      - 'gate-{name}-python27'

So, when determining which Jenkins jobs should be executed for a particular pipeline, Zuul sees the python-jobs project template in the Cinder configuration and expands it to execute the following Jenkins jobs:

  • gate-cinder-docs
  • gate-cinder-pep8
  • gate-cinder-python26
  • gate-cinder-python27

Jenkins Job Creation and Configuration

I previously mentioned that the configuration of an individual Jenkins job is stored in a config.xml file in the Jenkins data directory. Now, at last count, the upstream OpenStack Jenkins CI system has just shy of 2,000 jobs. It would be virtually impossible to manage the configuration of so many jobs using human-based processes. To solve this dilemma, the Jenkins Job Builder (JJB) python tool was created. JJB consumes YAML files that describe both individual Jenkins jobs as well as templates for parameterized Jenkins jobs, and writes the config.xml files for all Jenkins jobs that are produced from those templates. Important: Note that Zuul does not construct Jenkins jobs. JJB does that. Zuul simply configures which Jenkins jobs should run for a project and a pipeline.
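
As a quick illustration of what JJB produces (the paths here are just examples), you can ask it to render the XML it would generate without touching a Jenkins server at all:

# Install Jenkins Job Builder and render job XML locally.
# "jjb-config/" stands in for a directory of JJB YAML definitions.
pip install jenkins-job-builder
jenkins-jobs test jjb-config/ -o rendered-xml/

# Each generated file contains the config.xml content for one Jenkins job
ls rendered-xml/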

There is a master projects.yaml file in the Jenkins Job Builder configuration directory (more on that directory below) that lists the “top-level” definitions of jobs for all projects, and it is in this file that many of the variables used in job template instantiation are defined (including the {name} variable, which corresponds to the name of the project).

When JJB constructs the set of all Jenkins jobs, it reads the projects.yaml file and, for each project, takes the “name” attribute of the project and substitutes that value wherever it sees {name} in any of the jobs that are defined for that project. Let’s take a look at the Cinder project’s definition in the projects.yaml file:

- project:
    name: cinder
    github-org: openstack
    node: bare-precise
    tarball-site: tarballs.openstack.org
    doc-publisher-site: docs.openstack.org

    jobs:
      - python-jobs
      - python-grizzly-bitrot-jobs
      - python-havana-bitrot-jobs
      - openstack-publish-jobs
      - gate-{name}-pylint
      - translation-jobs

You will note one of the items in the jobs section is called python-jobs. This is not a single Jenkins job, but rather a job group. A job group definition is merely a list of jobs or job templates. Let’s take a look at the definition of the python-jobs job group:

- job-group:
    name: python-jobs
    jobs:
      - '{name}-coverage'
      - 'gate-{name}-pep8'
      - 'gate-{name}-python26'
      - 'gate-{name}-python27'
      - 'gate-{name}-python33'
      - 'gate-{name}-pypy'
      - 'gate-{name}-docs'
      - 'gate-{name}-requirements'
      - '{name}-tarball'
      - '{name}-branch-tarball'

Each of the items listed in the jobs section of the python-jobs job group definition above is a job template. Job templates are expanded in the same way that Zuul project templates and JJB job groups are. Let’s take a look at one such job template in the list above, called gate-{name}-python27.

(Hint: all Jenkins jobs for any OpenStack or Stackforge project are described in the OpenStack-Infra Config project’s modules/openstack_project/files/jenkins_job_builder/config/ directory).

The python-jobs.yaml file in the modules/openstack_project/files/jenkins_job_builder/config directory contains the definition of common Python project Jenkins job templates. One of those job templates is gate-{name}-python27:

- job-template:
    name: 'gate-{name}-python27'
... snip ...
    builders:
      - gerrit-git-prep
      - python27:
          github-org: '{github-org}'
          project: '{name}'
      - assert-no-extra-files

    publishers:
      - test-results
      - console-log

    node: '{node}'

Looking through the above job template definition, you will see a section called “builders“. The builders section of a job template lists (in sequential order of expected execution) the executable sections or scripts of the Jenkins job. The first executable section in the gate-{name}-python27 job template is called “gerrit-git-prep“. This executable section is defined in macros.yaml, which contains a number of commonly-run scriptlets. Here’s the entire gerrit-git-prep macro definition:

- builder:
    name: gerrit-git-prep
    builders:
      - shell: "/usr/local/jenkins/slave_scripts/gerrit-git-prep.sh https://review.openstack.org http://zuul.openstack.org git://git.openstack.org"

So, gerrit-git-prep is simply executing a Bash script called “gerrit-git-prep.sh” that is stored in the /usr/local/jenkins/slave_scripts/ directory. Let’s take a look at that file. You can find it in the modules/jenkins/files/slave_scripts/ directory [3] of the same OpenStack Infra Config project:

#!/bin/bash -e
 
GERRIT_SITE=$1
ZUUL_SITE=$2
GIT_ORIGIN=$3
 
# ... snip ...
 
set -x
if [[ ! -e .git ]]
then
    ls -a
    rm -fr .[^.]* *
    if [ -d /opt/git/$ZUUL_PROJECT/.git ]
    then
        git clone file:///opt/git/$ZUUL_PROJECT .
    else
        git clone $GIT_ORIGIN/$ZUUL_PROJECT .
    fi
fi
git remote set-url origin $GIT_ORIGIN/$ZUUL_PROJECT
 
# attempt to work around bugs 925790 and 1229352
if ! git remote update
then
    echo "The remote update failed, so garbage collecting before trying again."
    git gc
    git remote update
fi
 
git reset --hard
if ! git clean -x -f -d -q ; then
    sleep 1
    git clean -x -f -d -q
fi
 
if [ -z "$ZUUL_NEWREV" ]
then
    git fetch $ZUUL_SITE/p/$ZUUL_PROJECT $ZUUL_REF
    git checkout FETCH_HEAD
    git reset --hard FETCH_HEAD
    if ! git clean -x -f -d -q ; then
        sleep 1
        git clean -x -f -d -q
    fi
else
    git checkout $ZUUL_NEWREV
    git reset --hard $ZUUL_NEWREV
    if ! git clean -x -f -d -q ; then
        sleep 1
        git clean -x -f -d -q
    fi
fi
 
if [ -f .gitmodules ]
then
    git submodule init
    git submodule sync
    git submodule update --init
fi

The purpose of the script above is simple: Check out the source code of the proposed Gerrit changeset and ensure that the source tree is clean of any cruft from a previous run of a Jenkins job that may have run in the same Jenkins workspace. The concept of a workspace is important. When Jenkins runs a job, it must execute that job from within a workspace. The workspace is really just an isolated shell environment and filesystem directory that has a set of shell variables export’d inside it that indicate a variety of important identifiers, such as the Jenkins job ID, the name of the source code project that has triggered a job, the SHA1 git commit ID of the particular proposed changeset that is being tested, etc [4].
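
As a rough illustration, the environment inside a job workspace includes variables along these lines (the values shown here are made up, and the exact set of variables depends on the Zuul and Gerrit plugin versions in use):

# A few of the variables Zuul and Jenkins export into a job's workspace;
# the values below are illustrative only.
echo $WORKSPACE      # /home/jenkins/workspace/gate-cinder-python27
echo $ZUUL_PROJECT   # openstack/cinder -- project being tested
echo $ZUUL_BRANCH    # master -- target branch of the proposed change
echo $ZUUL_REF       # refs/zuul/master/Zabc1234 -- Zuul-prepared ref to fetch
echo $ZUUL_CHANGE    # 12345 -- Gerrit change number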

The next builder in the job template is the “python27” builder, which has two variables injected into itself:

      - python27:
          github-org: '{github-org}'
          project: '{name}'

The github-org variable is populated with the value of the existing {github-org} variable, and the project variable is populated with the value of the {name} variable. Here’s how the python27 builder is defined (in macros.yaml):

- builder:
    name: python27
    builders:
      - shell: "/usr/local/jenkins/slave_scripts/run-unittests.sh 27 {github-org} {project}"

Again, just a wrapper around another Bash script, called run-unittests.sh in the /usr/local/jenkins/slave_scripts directory. Here’s what that script looks like:

version=$1
org=$2
project=$3
 
# ... snip ...
 
venv=py$version
 
# ... snip ...
 
source /usr/local/jenkins/slave_scripts/select-mirror.sh $org $project
 
tox -e$venv
result=$?
 
echo "Begin pip freeze output from test virtualenv:"
echo "======================================================================"
.tox/$venv/bin/pip freeze
echo "======================================================================"
 
if [ -d ".testrepository" ] ; then
# ... snip ...
    .tox/$venv/bin/python /usr/local/jenkins/slave_scripts/subunit2html.py ./subunit_log.txt testr_results.html
    gzip -9 ./subunit_log.txt
    gzip -9 ./testr_results.html
# ... snip ...
fi
 
# ... snip ...

In short, for the Python 2.7 builder, the above runs the command tox -epy27 and then runs a prettifying script and gzips up the results of the unit test run. And that’s really the meat of the Jenkins job. We will discuss the publishing of the job artifacts a little later in this article, but if you’ve gotten this far, you have delved deep into the mines of the OpenStack CI system. Congratulations!

Devstack-Gate and Running Tempest Against a Real Environment

OK, so unit tests running in a simple Jenkins slave workspace are one thing. But what about Jenkins jobs that run integration tests against a full set of OpenStack endpoints, interacting with real database and message queue services? For these types of Jenkins jobs, things are more complicated. Yes, I know. You probably think things have been complicated up until this point, and you’re right! But the simple unit test jobs above are just the tip of the proverbial iceberg when it comes to the OpenStack CI platform.

For these complex Jenkins jobs, an additional set of tools are added to the mix:

  • Nodepool — Provides virtual machine instances to Jenkins masters for running complex, isolation-sensitive Jenkins jobs
  • Devstack-Gate — Scripts that create an OpenStack environment with Devstack, run tests against that environment, and archive logs and results

Assignment of a Node to Run a Job

Different Jenkins jobs require different workspaces, or environments, in which to run. For basic unit or style-checking test jobs, like the gate-{name}-python27 job template we dug into above, not much more is needed than a tox-managed virtualenv running in a source checkout of the project with the proposed change applied. However, for Jenkins jobs that run a series of integration tests against a full OpenStack installation, a workspace with significantly more resources and isolation is necessary. For these latter types of jobs, the upstream CI platform uses a pool of virtual machine instances. This pool of virtual machine instances is managed by a tool called Nodepool. The virtual machines run in both HP Cloud and Rackspace Cloud, which graciously donate these instances for the upstream CI system to use. You can see the configuration of the Nodepool-managed set of instances in the OpenStack-Infra config project.

Instances that are created by Nodepool run Jenkins slave software, so that they can communicate with the upstream Jenkins CI master servers. A script called prepare_node.sh runs on each Nodepool instance. This script just git clones the OpenStack Infra config project to the node, installs Puppet, and runs a Puppet manifest that sets up the node based on the type of node it is. There are bare nodes, nodes that are meant to run Devstack to install OpenStack, and nodes specific to the Triple-O project. The node type that we will focus on here is the node that is meant to run Devstack. The script that runs to prepare one of these nodes is prepare_devstack_node.sh, which in turn calls prepare_devstack.sh. This script caches all of the repositories needed by Devstack, along with Devstack itself, in a workspace cache on the node. This workspace cache is used to enable fast reset of the workspace that is used during the running of a Jenkins job that uses Devstack to construct an OpenStack environment.

Devstack-Gate

The Devstack-Gate project is a set of scripts that are executed by certain Jenkins jobs that need to run integration or upgrade tests against a realistic OpenStack environment. Going back to the Cinder project configuration in the Zuul layout.yaml file:

  - name: openstack/cinder
    template:
      - name: python-jobs
... snip ...
    gate:
      - gate-cinder-requirements
      - gate-tempest-dsvm-full
      - gate-tempest-dsvm-postgres-full
      - gate-tempest-dsvm-neutron
      - gate-tempest-dsvm-large-ops
      - gate-tempest-dsvm-neutron-large-ops
      - gate-grenade-dsvm
... snip ...

Note the gate-tempest-dsvm-full line. That Jenkins job template is one such job that needs an isolated workspace with a full OpenStack environment running on it. Note that “dsvm” stands for “Devstack virtual machine”.

Let’s take a look at the JJB configuration of the gate-tempest-dsvm-full job:

- job-template:
    name: '{pipeline}-tempest-dsvm-full{branch-designator}'
    node: '{node}'
... snip ...
    builders:
      - devstack-checkout
      - shell: |
          #!/bin/bash -xe
          export PYTHONUNBUFFERED=true
          export DEVSTACK_GATE_TIMEOUT=180
          export DEVSTACK_GATE_TEMPEST=1
          export DEVSTACK_GATE_TEMPEST_FULL=1
          export BRANCH_OVERRIDE={branch-override}
          if [ "$BRANCH_OVERRIDE" != "default" ] ; then
              export OVERRIDE_ZUUL_BRANCH=$BRANCH_OVERRIDE
          fi
          cp devstack-gate/devstack-vm-gate-wrap.sh ./safe-devstack-vm-gate-wrap.sh
          ./safe-devstack-vm-gate-wrap.sh
      - link-logs

    publishers:
      - devstack-logs
      - console-log

The devstack-checkout builder is simply a Bash script macro that looks like this:

- builder:
    name: devstack-checkout
    builders:
      - shell: |
          #!/bin/bash -xe
          if [[ ! -e devstack-gate ]]; then
              git clone git://git.openstack.org/openstack-infra/devstack-gate
          else
              cd devstack-gate
              git remote set-url origin git://git.openstack.org/openstack-infra/devstack-gate
              git remote update
              git reset --hard
              if ! git clean -x -f ; then
                  sleep 1
                  git clean -x -f
              fi
              git checkout master
              git reset --hard remotes/origin/master
              if ! git clean -x -f ; then
                  sleep 1
                  git clean -x -f
              fi
              cd ..
          fi

All the above does is git clone the devstack-gate repository into the Jenkins workspace; if the devstack-gate repository already exists, it checks out the latest code from the master branch.

Returning to our gate-tempest-dsvm-full JJB job template, we see the remaining part of the builder is a Bash scriptlet like so:

          #!/bin/bash -xe
          export PYTHONUNBUFFERED=true
          export DEVSTACK_GATE_TIMEOUT=180
          export DEVSTACK_GATE_TEMPEST=1
          export DEVSTACK_GATE_TEMPEST_FULL=1
          export BRANCH_OVERRIDE={branch-override}
          if [ "$BRANCH_OVERRIDE" != "default" ] ; then
              export OVERRIDE_ZUUL_BRANCH=$BRANCH_OVERRIDE
          fi
          cp devstack-gate/devstack-vm-gate-wrap.sh ./safe-devstack-vm-gate-wrap.sh
          ./safe-devstack-vm-gate-wrap.sh

Not all that complicated. It exports some environment variables, copies the devstack-vm-gate-wrap.sh script from the devstack-gate repository (cloned by the devstack-checkout macro) into the work directory, and then runs that script.

The devstack-vm-gate-wrap.sh script is responsible for setting even more environment variables and then calling the devstack-vm-gate.sh script, which is where the real magic happens.

Construction of OpenStack Environment with Devstack

The devstack-vm-gate.sh script is responsible for constructing a full OpenStack environment and running integration tests against that environment. To construct this OpenStack environment, it uses the excellent Devstack project. Devstack is an elaborate series of Bash scripts and functions that clones each OpenStack project’s source code into /opt/stack/new/$project [5], runs python setup.py install in each project checkout, and starts each relevant OpenStack service (e.g. nova-compute, nova-scheduler, etc.) in a separate Linux screen session.

Devstack’s creation script (stack.sh) is called from devstack-vm-gate.sh after it creates the localrc file that stack.sh uses when constructing the Devstack environment.
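
For a feel of what that generated localrc contains, here is a heavily simplified, hypothetical sketch. The real file written by devstack-vm-gate.sh sets many more options, and the exact variable names and log paths depend on the Devstack release in use.

# Hypothetical, trimmed-down localrc of the kind devstack-gate generates
DATABASE_PASSWORD=secretdatabase
RABBIT_PASSWORD=secretrabbit
SERVICE_PASSWORD=secretservice
ADMIN_PASSWORD=secretadmin
SERVICE_TOKEN=secrettoken
# Make sure Tempest is installed and configured alongside the other services
ENABLED_SERVICES+=,tempest
# Capture service logs so they can be archived later (illustrative paths)
LOGFILE=/opt/stack/new/devstacklog.txt
SCREEN_LOGDIR=/opt/stack/new/screen-logs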

Execution of Integration Tests Against an OpenStack Environment

Once the OpenStack environment is constructed, the devstack-vm-gate.sh script continues on to run a series of integration tests:

    cd $BASE/new/tempest
    if [[ "$DEVSTACK_GATE_TEMPEST_ALL" -eq "1" ]]; then
        echo "Running tempest all test suite"
        sudo -H -u tempest tox -eall -- --concurrency=$TEMPEST_CONCURRENCY
        res=$?
    elif [[ "$DEVSTACK_GATE_TEMPEST_FULL" -eq "1" ]]; then
        echo "Running tempest full test suite"
        sudo -H -u tempest tox -efull -- --concurrency=$TEMPEST_CONCURRENCY
        res=$?

You will note that the $DEVSTACK_GATE_TEMPEST_FULL Bash environment variable was set to “1” in the gate-tempest-dsvm-full Jenkins job builder scriptlet.

sudo -H -u tempest tox -efull triggers the execution of Tempest’s integration test suite. Tempest is the collection of canonical OpenStack integration tests that are used to validate that OpenStack APIs work according to spec and that patches to one OpenStack service do not inadvertently cause failures in another service.

If you are curious what actual commands are run, you can check out the tox.ini file in Tempest:

[testenv:full]
# The regex below is used to select which tests to run and exclude the slow tag:
# See the testrepostiory bug: https://bugs.launchpad.net/testrepository/+bug/1208610
commands =
  bash tools/pretty_tox.sh '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario|thirdparty|cli)) {posargs}'

In short, the above runs the Tempest API, scenario, CLI, and thirdparty tests.

Archival of Test Artifacts

The final piece of the puzzle is archiving all of the artifacts from the Jenkins job execution. These artifacts include log files from each individual OpenStack service running in Devstack’s screen sessions, the results of the Tempest test suite runs, as well as echo’d output from the devstack-vm-gate* scripts themselves.

These artifacts are gathered together by the devstack-logs and console-log JJB publisher macros:

- publisher:
    name: console-log
    publishers:
      - scp:
          site: 'static.openstack.org'
          files:
            - target: 'logs/$LOG_PATH'
              copy-console: true
              copy-after-failure: true


- publisher:
    name: devstack-logs
    publishers:
      - scp:
          site: 'static.openstack.org'
          files:
            - target: 'logs/$LOG_PATH'
              source: 'logs/**'
              keep-hierarchy: true
              copy-after-failure: true

Conclusion

I hope this article has helped you understand a bit more about how the OpenStack continuous integration platform works. We’ve stepped through the code flow across the various components of the platform, including which events trigger which actions in each component. You should now have a good idea of how the various parts of the upstream CI infrastructure are configured and where to look in the source code for more information.

The next article in this series discusses how to construct your own external testing platform that is linked with the upstream OpenStack CI platform. Hopefully, this article has given you most of the background you need to understand the steps and tools involved in constructing that external testing platform.


[1]— The link describes and illustrates the non-human gatekeeper model with Bazaar, but the same concept is applicable to Git. See the OpenStack GitWorkflow pages for an illustration of the OpenStack specific model.
[2]— Zuul really is a pretty awesome bit of code kit. Jim Blair, the author, does an excellent job of explaining the merge proposal dependency graph and how Zuul can “trim” dead-end branches of the dependency graph in the Zuul documentation.
[3]— Looking for where a lot of the “magic” in the upstream gate happens? Take an afternoon to investigate the scripts in this directory. :)
[4]— Gerrit Jenkins plugin and Zuul export a variety of workspace environment variables into the Jenkins jobs that they trigger. If you are curious what these variables are, check out the Zuul documentation on parameters.
[5]— The reason the projects are installed into /opt/stack/new/$project is because the current HEAD of the target git branch for the project is installed into /opt/stack/old/$project. This is to allow an upgrade test tool called Grenade to test upgrade paths.

Character Sets, Collations and the Jörmungandr

One of the (many) ongoing discussions in the Drizzle developer community is the level of support the database server kernel should provide for non-Unicode character set encodings. Actually, when I say non-Unicode, I really mean non-UTF8, since we’ve stripped out all other character sets and “standardized” on 4-byte UTF8. I’ll come back to why exactly I put standardized in quotes in just a bit…but to sum up, in childish terms, my thoughts after spending 4 hours tonight reading about character sets and collations, here is an exchange between Toru and myself on Freenode #drizzle:

<jaypipes> tmaesaka: how do you write "I wish everyone would just speak English" in Japanese? ;)
<tmaesaka> みんな英語使うといいのに。 (“It would be nice if everyone just used English.”)

A Little Background

For those of you new to the world of character sets and collations, I’ll briefly summarize the concepts and terms I’ll talk about in this article. Incidentally, I consider myself to be in this crowd, since I’ve never really had to deal with anything more than a cursory knowledge of them in reference to how they work in MySQL (not the internals).

Character Sets and Encodings

A character set, or character encoding scheme, is a system for matching characters — such as “A” or “み” or “ß” — with a machine-readable code for the character. This machine-readable code can be represented simply as a decimal number, or in more complex character sets, a hexadecimal number. The “encoding” of the character set is the protocol, or instructions, that the character set uses in order to enable the computer to understand a series of byte sequences and interpret the sequence as a specific character.


Character sets such as ASCII are very simple, and consist of a single 8-bit byte for each character contained in the character set — with only 7 bits actually used for the character code. ASCII consists of the English-language alphabetic characters, including their capitalized forms, the numbers 0 through 9, a variety of punctuation and “common” symbols like “$” and “;”, and a series of non-printable “control” characters. This encoding scheme works wonderfully for those of us in the U.S., but it is utterly lacking when it comes to representing the myriad characters and symbols in other languages.

Other more-complex character encodings are localized for a specific language, or writing system. For instance, the Shift_JIS character encoding scheme encodes, in 2 bytes, the ASCII character set (with 2 exceptions), the “half-width Katakana” characters, and the JIS X 0208 set of kanji symbols. Sound complicated? It is. And it gets even more complicated the further down the rabbit-hole one goes.

Which leads me to Unicode…

What the Heck is Unicode and UTF?

Many folks think that Unicode is merely another character set or encoding scheme. It’s not. It’s actually more than that. It’s an entire system which endeavours to standardize the way that computers can read, sort, and transform characters encoded in various character sets.

Actually, the Unicode standard, according to Wikipedia,

…consists of a repertoire of more than 100,000 characters, a set of code charts for visual reference, an encoding methodology and set of standard character encodings, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and a number of related items, such as character properties, rules for normalization, decomposition, collation, rendering and bidirectional display order…

Got all that?

So, Unicode is a set of standards for dealing with lots of varying languages and characters, and transcoding character codes from one encoding scheme to another. What, then, is UTF[8|16|32]?

UTF stands for Unicode Transformation Format, and is a set of mapping methods for translating one of Unicode’s 1,114,112 code points (characters or control sequences) into a sequence of bytes. UTF8 is a variable-width mapping method, which uses between one and four bytes to represent a code point. ASCII characters and most Western characters take up 1 byte of storage, whilst CJK (Chinese/Japanese/Korean) characters typically consume 3 bytes of space per character. It is important to note that this is one more byte per character than encoding schemes like Shift_JIS, which use either 1 or 2 bytes per character. Yoshinori Matsunobu published a short article today on these storage space differences.
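
You can see these storage differences for yourself with nothing more than a shell, assuming your terminal and locale are set to UTF-8:

# Byte counts of single characters when encoded as UTF-8
printf 'A' | wc -c    # 1 byte  (ASCII)
printf 'ß' | wc -c    # 2 bytes (Latin letter outside ASCII)
printf 'み' | wc -c   # 3 bytes (Japanese hiragana)
printf '𠀋' | wc -c   # 4 bytes (a CJK character outside the Basic Multilingual Plane)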

UTF16 is a variable-width mapping scheme built on 16-bit units: code points in the Basic Multilingual Plane fit in a single 16-bit unit, while code points in the supplementary planes are encoded as a pair of units (a surrogate pair). UTF16 generally uses a little bit less storage space for CJK characters than UTF8. However, when analyzing actual CJK text, which includes spaces and other ASCII characters, the storage difference seems to be negligible. UTF32 is a fixed-length mapping method which uses 4 bytes to store each code point.

UTF8 is dominant in the web space, with all modern browsers able to understand and encode for UTF8.

OK, So What is a Collation?

So, if a character encoding scheme, such as UTF8, is used to identify a set of characters and symbols as a machine-readable sequence of bytes, then what exactly is a collation, and why are they important?

Glad you asked. A collation, or collating sequence, refers to the order in which different characters in a character set should appear when sorted in a list. The alphabetic collating sequence is the one some of us, in our little English-only world, are familiar with. But in various regions of the world, the same set of characters may be ordered differently when appearing in a list of characters. And therefore, even with a character encoding scheme like UTF8, one must also specify a collation when listing textual results in a specific order.

In MySQL, as well as Drizzle, the method for ordering results by a specific collation is fairly simple: one merely specifies the collation in the ORDER BY clause, like the example below shows:

mysql> CREATE TABLE utf8_tests (
    ->   my_text VARCHAR(100) NOT NULL
    -> ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO utf8_tests VALUES ('comb'),('cukor'),('csak'),('folyik'),('folyó'),('folyosó'),('fő'),('födém');
Query OK, 8 rows affected (0.00 sec)
Records: 8  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_general_ci;
+----------+
| my_text  |
+----------+
| comb     | 
| csak     | 
| cukor    | 
| födém  | 
| fő      | 
| folyó   | 
| folyik   | 
| folyosó | 
+----------+
8 rows in set (0.00 sec)

mysql> SELECT * FROM utf8_tests ORDER BY my_text COLLATE utf8_hungarian_ci;
+----------+
| my_text  |
+----------+
| comb     | 
| csak     | 
| cukor    | 
| fő      | 
| födém  | 
| folyó   | 
| folyik   | 
| folyosó | 
+----------+
8 rows in set (0.00 sec)

You’ll notice that the words “fő” and “födém” are reversed depending on the collation used in the ORDER BY clause.

Any Hungarians reading this article? If there are, you’ll likely have already spotted the problem with the above output. The problem is that it’s wrong. “csak” should appear after “cukor”, since “cs” is a digraph (two characters interpreted as one) which comes after “c” in the Hungarian alphabet.

The above behaviour has been a known bug in MySQL since August 2005, over three years ago. It is something I noticed while reading up on collations and comparing what’s going on in MySQL/Drizzle to what the standard expects. The ICU project has a set of HTML pages where you can type in a list of words in a language and sort by various collations, and it will show you the correct sort order. I ran into the bug above, as well as a new bug in the German collation that I found today.
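
Incidentally, if your system has the Hungarian locale installed, you can sanity-check the expected order from the command line as well; glibc’s collation tables should agree with the ICU demo pages for this particular case:

# Generate the locale first if needed (on Debian/Ubuntu: sudo locale-gen hu_HU.UTF-8)
printf 'comb\ncukor\ncsak\n' | LC_ALL=hu_HU.UTF-8 sort
# Expected output: comb, cukor, csak -- "csak" sorts after "cukor"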

Where Drizzle Is Right Now

Currently, all but the UTF8 character set have been removed from Drizzle. Furthermore, the UTF8 implementation in Drizzle is full 4-byte UTF8, which differs from the 3-byte variety used in MySQL <= 5.1. There are two major benefits that this decision and subsequent removal has given Drizzle:

  1. Reductions in the size and complexity of the Drizzle parser — removal of some CONVERT() stuff, introducers, and more
  2. Easier to understand (and potentially refactor) code surrounding character sets and collations

So, it seems that although we’ve stripped out a lot of complexity by moving to only UTF8 and its collations, we’ve inherited a system that, frankly, was never designed to handle complex collations. Instead, it is designed to be fast, not entirely accurate. So, what is a project to do? :)

We have a number of options, all of which we’ve been debating over on the mailing lists:

  • Use libICU for all character set and collation services — libICU is a full-featured library written by experts in the languages and transcoding fields, why not take advantage of that expertise?
  • Use GLib’s locale facilities — this has mostly been ruled out because of performance concerns over non-reentrant code dependent on setlocale()
  • Write our own — this is essentially where we are right now
  • Use C++’s <locale> facilities, as Monty demonstrated on the mailing lists — actually, I also have O’Reilly’s C++ Cookbook, so I know where that code originated… ;)

libICU is, frankly, quite a large library, and it’s not certain that its performance would be satisfactory. However, I can certainly envision taking libICU’s test case suite and converting it to the Drizzle test suite format. This would certainly expose holes in our current character set handling that need to be discovered.

Although Yoshinori-san’s objections about UTF8 storage requirements versus localized Japanese character sets are valid, I don’t think we’ll re-introduce non-UTF8 character sets into the server at this time. If there is a huge uproar over this, pluggable character sets are a possibility in the future, after changes to the plugin API to enable them. Pluggable collations too…

This last option is the one which interests me most and which I find most appealing. In fact, I compiled a small test program based on the C++ <locale> facilities which actually produces the correct collation order for the bug demonstrated above:

#include <locale>
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>

using namespace std;

bool localeLessThan(const string& s1, const string& s2)
{
  locale locale1("hu_HU.utf8");

  const collate<char>& col = use_facet<collate<char> >(locale1);

  const char* pb1 = s1.data();
  const char* pb2 = s2.data();

  return (col.compare(pb1, pb1 + s1.size(),
                      pb2, pb2 + s2.size()) < 0);
}

int main(int argc, char** argv)
{
  string s1 = "comb";
  string s2 = "csak";
  string s3 = "cukor";

  vector<string> all_the_strings;

  all_the_strings.push_back(s1);
  all_the_strings.push_back(s2);
  all_the_strings.push_back(s3);

  sort(all_the_strings.begin(), all_the_strings.end(), localeLessThan);
  for (vector<string>::const_iterator p = all_the_strings.begin();
       p != all_the_strings.end();
       ++p)
    cout << *p << endl;
}

Compiling and running the program shows the correct sorted order for the words:

[518][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ g++ test.cc 
[519][jpipes@serialcoder: /home/jpipes/repos/drizzle/test-hun]$ ./a.out 
comb
cukor
csak

I’m thinking that the refactoring work that still needs to be completed around CHARSET_INFO and MY_CHARSET_HANDLER should experiment with the technique above and verify any performance regression (or improvement) that may occur. Accuracy, in my opinion, and the ability to let a library not written by Drizzle developers do the heavy lifting, is more important than a small performance increase.

The Edwin Strikes Back


As he is sometimes prone to do, my dear colleague, Edwin DeSouza, shot me an email with a link he thought I might find interesting. He saw I had been chatting about character sets and had discovered a forum thread over on the Ruby Forum which went into some detail about the difficulties surrounding encoding conversion, localization, and internationalization.


Suffice it to say, “some detail” would be the understatement of the year in this case. The forum thread is longer than Jörmungandr, the mythical Norse sea serpent.

Notable voices on the thread include Matz, creator of Ruby and a core influencer in its direction, and Tim Bray, of our own Sun Microsystems and XML fame. The original poster, one Michael Selig, began the thread, entitled Character encodings – a radical suggestion, with an ostensibly simple suggestion:

Remove internal support for non-ASCII encodings completely, and when
reading/writing UTF-16 (and UTF-32) files automatically transcode
to/from UTF-8.

Unfortunately for Michael, this small suggestion was the online equivalent of stepping in a pile of elephant dung.

Until reading the above-mentioned forum thread, I really had no idea about the complexities involved in character set handling, especially in the Asian countries. If you are interested in character sets, collations, and Unicode vs. local encodings, reading through the forum thread will truly enlighten you as to the various arguments for and against UTF8. It’s highly recommended reading, but be warned, it may leave you gasping for breath at some points…enjoy. :)

Enabling and Fixing Drizzle Test Cases

When Brian began the work on refactoring the MySQL 6.0 Server source code into what has now become the Drizzle Project, a number of code pieces were removed, including some major MySQL functionality such as stored procedures, server-side prepared statements, SQL Mode, some legacy code, and a variety of data types. The goal, of course, was to reduce the server code base down to a more streamlined and eventually modular kernel.

Of course, that vision is great, but it’s got some side effects! One of those side effects is a dramatic reduction in the number of test cases that pass the test suite in their current form, and an increase in the number of tests that have been disabled. I re-enabled and fixed a few tests yesterday, but as of this writing, there are only 54 of 408 tests currently passing in the test suite.

This is to be expected. You can’t just go and strip a huge chunk of the parser and functionality out of the server and expect the original test suite to run without problems :) So, Brian disabled some tests while removing code sections anticipating that these tests would eventually be re-enabled and any regressions fixed. Well, we are now at that point. With Brian’s work this week to remove the last vestiges of non-UTF8 character set support, fixing and re-enabling disabled tests in the test suite is now a high priority project. Luckily, this is a project which almost anyone — even non-coders — can get involved in and make a difference.

This article will explain the process of running the Drizzle test suite and identifying test cases which can be re-enabled or should be fixed. We’ll focus on stuff that you can help with as a contributor who wants to start getting involved in Drizzle and making an impact without having C/C++ coding experience. If you haven’t caught my previous articles on using Launchpad.net for code management, I’d suggest reading those now. In addition, although we won’t be doing C/C++ coding, you’ll need a build environment established in order to properly run the test suite. So, I’d also suggest reading my article on setting up a C/C++ development environment for Drizzle.

The Test Suite Basics

NOTE: This section describes the Drizzle test suite. However, if you are contributing to the MySQL Server project, the instructions in this section are exactly the same if you are working with the MySQL Server. Just change dtr to mtr. :)

The Drizzle test suite is a composite of a main Perl script — tests/test-run.pl — and a couple other tools. After you have built Drizzle with the standard build process, you will see a program in the /tests source directory called dtr. This is the test suite runner. When you issue the command:


make test

This test runner is called with some command-line options and a list of tests to run. You can verify this behaviour by looking at /tests/Makefile.am and seeing the actual command for the test make target.

The general form for running a test case is the following:


cd tests

./dtr $testname1 $testname2 ... $testnameN

There are a number of command-line options that the test suite runner accepts, and I’ll cover a smattering of them in this article.
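
A few of the options you will see used in this article and in tests/Makefile.am are sketched below (run ./dtr --help for the full list; the test names are just examples):

./dtr --force select insert     # keep running remaining tests after a failure
./dtr --record func_math        # (re-)record the expected-result file under tests/r/
./dtr --fast --reorder select   # skip some cleanup and group tests to reduce server restarts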

Running a Test

So, how do you know what the names of the tests are? Good question! :) In the /tests directory, you will find a /t directory which contains all the test cases in the main test suite. The main test suite contains tests that are not specific to a particular piece of server functionality, such as replication. For instance, t/select.test is the test case in the main test group corresponding to testing of the SELECT syntax and functionality. Other test cases for specific functional pieces can be found in /tests/suite. To run this t/select.test test case through the runner, we’d do this:


./dtr select

Note that you do not need to add the .test suffix. You should see results similar to the following:

[505][jpipes@serialcoder: /home/jpipes/repos/drizzle/trunk/tests]$ ./dtr select
Logging: ./dtr select
<snip>
MySQL Version 7.0.0
Using dynamic switching of binlog format
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 9306
Using MASTER_MYPORT1        = 9307
Using SLAVE_MYPORT          = 9308
Using SLAVE_MYPORT1         = 9309
Using SLAVE_MYPORT2         = 9310
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Saving snapshot of installed databases
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.select                    [ pass ]           9673
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 9.673 of 21 seconds executing testcases

As you can see, the test suite fires up a Drizzle server, loads the test file and performs the tests contained in the file. The tests in the file generally consist of SQL statements that are executed against one or more servers, but they can also be commands such as creating a new connection, logging output, and other things. For this article, we’ll be focusing on the SQL command tests. In a followup article, I may highlight some of the other test-case commands available to you.

Failing Test Cases

Well, it’s all fine and dandy if a test case succeeds like in the example above, but as I mentioned in the introduction of this article, we’re focused on the test cases that aren’t succeeding, and on getting those test cases to succeed! So, how do we find the tests which are failing? One method is to look at the Drizzle Build Farm and track down failures occurring in the test runs. Another way is to simply run a series of tests and see what fails. For simplicity’s sake, I’ve done a little research already and know a number of tests that are failing. So, we’ll go ahead and take a look at a test case that I know needs some TLC.

The test case I’ve chosen is the func_math test from the main test suite. It’s small and provides a good example of how we can work to fix up the failures. Here is what I get when running this test:

[505][jpipes@serialcoder: tests]$ ./dtr func_math
Logging: ./dtr func_math
<snip>
MySQL Version 7.0.0
Using dynamic switching of binlog format
Using MTR_BUILD_THREAD      = 0
Using MASTER_MYPORT         = 9306
Using MASTER_MYPORT1        = 9307
Using SLAVE_MYPORT          = 9308
Using SLAVE_MYPORT1         = 9309
Using SLAVE_MYPORT2         = 9310
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
Saving snapshot of installed databases
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.func_math                 [ fail ]

drizzletest: At line 134: query 'create table t1 (a varchar(90), ts datetime not null, index (a)) engine=innodb 
default charset=utf8' failed: 1064: You have an error in your SQL syntax; check the manual that corresponds 
to your MySQL server version for the right syntax to use near 'charset=utf8' at line 1

The result from queries just before the failure was:
< snip >
656	405
122	405
645	405
INSERT INTO t1 VALUES (3);
SELECT CAST(RAND(2) * 1000 AS UNSIGNED), CAST(RAND(a) * 1000 AS UNSIGNED) 
FROM t1;
CAST(RAND(2) * 1000 AS UNSIGNED)	CAST(RAND(a) * 1000 AS UNSIGNED)
656	405
122	405
645	405
858	656
354	906
SELECT CAST(RAND(2) * 1000 AS UNSIGNED), CAST(RAND(a) * 1000 AS UNSIGNED) 
FROM t1 WHERE a = 1;
CAST(RAND(2) * 1000 AS UNSIGNED)	CAST(RAND(a) * 1000 AS UNSIGNED)
656	405
122	405
645	405
DROP TABLE t1;
create table t1 (a varchar(90), ts datetime not null, index (a)) engine=innodb default charset=utf8;

More results from queries before failure can be found in 
/home/jpipes/repos/drizzle/trunk/tests/var/log/func_math.log

Stopping All Servers
Restoring snapshot of databases
Resuming Tests

-------------------------------------------------------
Stopping All Servers
Failed 1/1 tests, 0.00% were successful.

The log files in var/log may give you some hint
of what went wrong.
If you want to report this error, please read first the documentation at

http://dev.mysql.com/doc/mysql/en/mysql-test-suite.html

The servers were restarted 1 times
Spent 0.000 of 7 seconds executing testcases

mysql-test-run in default mode: *** Failing the test(s): main.func_math
mysql-test-run: *** ERROR: there were failing test cases

As you can see, the test fails and outputs the source of the failure.

Fixing a Broken Test

Now that we’ve identified a failing test, we need to follow a process in order to fix it. The process you should follow is this:

  1. Make a change to the test case file
  2. Re-run the test through dtr using the --record option
  3. If any failure occurs, go back to #1
  4. Once the test succeeds under the --record option, a test result file will be written to the /tests/r/ directory. We’ll need to bzr commit the changes to the test and the result file and push to a branch on Launchpad.
  5. Edit tests/Makefile.am and ensure the newly-passing test is included in the make test target

In this case, the failure is due to a mere syntax issue. We’ve removed character set support and standardized entirely on UTF8, and so the support in the parser syntax for the phrase DEFAULT CHARSET=utf8 is gone. To fix this test, we need to remove the pieces of the old MySQL syntax which are no longer supported in Drizzle.

So, we pop open our favorite editor and open up the /tests/t/func_math.test file. Go ahead and remove all instances of default charset=utf8.
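
If you prefer the command line to an editor, something along these lines does the same job (the sed invocation is just a sketch; it keeps a backup copy of the original file):

cd tests
# Strip the unsupported clause from the test file (case-insensitively),
# leaving the original behind as t/func_math.test.bak
sed -i.bak 's/ default charset=utf8//I' t/func_math.test

Either way, re-run the test with the --record option. You should see the following: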

[508][jpipes@serialcoder: tests]$ ./dtr --record func_math
<snip>
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.func_math                 [ fail ]

drizzletest: At line 160: query 'create table t1
(f1 varchar(32) not null,
f2 smallint(5) unsigned not null,
f3 int(10) unsigned not null default '0')
engine=myisam' failed: 1064: You have an error in your SQL syntax; check the manual 
that corresponds to your MySQL server version for the right syntax to use near '(5) unsigned not null,
f3 int(10) unsigned not null default '0')
engine=myisam' at line 2
<snip>

Again, it looks like we’ve run into another syntax problem. Above, the test case contains the old ZEROFILL syntax, which allows you to specify a number in parentheses after an integer data type. This functionality, a legacy from Unireg times, is not supported in Drizzle. So, we must remove it. After removing the (XX) ZEROFILL syntax from the CREATE TABLE definitions in the test case file, I re-run the test:

<snip>
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.func_math                 [ fail ]

drizzletest: At line 230: query 'CREATE TABLE t1(a SET('a','b','c'))' failed: 1064: 
You have an error in your SQL syntax; check the manual that corresponds to your 
MySQL server version for the right syntax to use near 'SET('a','b','c'))' at line 1
<snip>

Once again, we’ve run into a failure. This time, it’s because of the SET data type. This data type has been removed from Drizzle. So, we must remove it from the test case here. After doing so, I re-run the test case, and finally we see a success:

[509][jpipes@serialcoder: tests]$ ./dtr --record func_math
Logging: ./dtr --record func_math
<snip>
=======================================================

TEST                           RESULT         TIME (ms)
-------------------------------------------------------

main.func_math                 [ pass ]            107
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 0.107 of 8 seconds executing testcases

Cool. Looks good. Now we edit tests/Makefile.am and add the newly successful test to the make test target.


cd tests

vim Makefile.am

Here is what the section in the Makefile.am looks like, with the func_math line being the one I add for our newly successful test:

test-drizzle:
	  $(PERL) -I$(top_srcdir)/tests/lib \
		$(top_srcdir)/tests/test-run.pl --fast --reorder --force \
        1st \
        alter_table \
        bench_count_distinct \
	bulk_replace \
	comment_column2 \
	comments \
	consistent_snapshot \
        count_distinct \
        count_distinct2 \
        count_distinct3 \
	create_select_tmp \
        ctype_filename \
        delete \
        distinct \
	drizzleslap \
        endspace \
        flush2 \
        func_equal \
        func_group_innodb \
        func_isnull \
        func_like \
	func_math \
        greedy_optimizer \
        group_min_max_innodb \
        heap_auto_increment \

Alright, cool. Now we simply need to verify our test case, result file, and Makefile.am changes and then commit them. First, verification:

[511][jpipes@serialcoder: tests]$ bzr status
modified:
  tests/Makefile.am
  tests/r/func_math.result
  tests/t/func_math.test

Looks good. The final step is committing our work and then pushing to a code branch on Launchpad.net. Below, I am pushing to the branch lp:~drizzle-developers/drizzle/enable-tests, which is a team branch used to push code for the various test cleanups.

[514][jpipes@serialcoder: tests]$ bzr commit Makefile.am t/func_math.test \
> r/func_math.result -m "Fixed syntax errors in func_math test and re-enable \
> the test in the make test target"
Committing to: /home/jpipes/repos/drizzle/trunk/
modified tests/Makefile.am
modified tests/r/func_math.result
modified tests/t/func_math.test
Committed revision 405.                                                                                                                  
[515][jpipes@serialcoder: tests]$ bzr push lp:~drizzle-developers/drizzle/enable-tests
Pushed up to revision 405.

And that’s that! Test fixed; test case, result file, and Makefile.am edited; and changes committed.

A Contributor’s Guide to Launchpad.net – Part 2 – Code Management

In this second part of my Launchpad guidebook series, I’ll be covering the code management and repository features of Launchpad.net. If you missed the first part of my series, go check it out and get established on Launchpad.net. Then pop back to this article to dive into the magic of http://code.launchpad.net. In this article, we’ll cover the following aspects of the code management pieces of Launchpad:

  • The Structure of Project Source Code on Launchpad.net
  • Pulling Code into a Local Repository
  • Creating a Local Working Branch for Bug Fixing
  • Pushing Code to Launchpad
  • Notifying a Merge Captain of Your Code Pushes
  • Keeping Trunk Up To Date
  • Merging Local Code with a Trunk Branch

For the following article, we’ll be acting as if you are contributing to the MySQL Server project and wish to create a patch to fix a bug in the MySQL server. We’ll be working through all the steps to do so. If you are looking to contribute to a different project, or your own project, simply replace the names and URLs in the article with ones for your particular project. :)

The Structure of Project Source Code on Launchpad.net

Projects hosted on Launchpad.net are organized using the terminology branches, repositories and series. A branch is simply a Bazaar branch of the project’s source code. A repository is a collection of a project’s Bazaar branches. A series is a named branch which represents something special for the project — usually a tagged release or a development branch.


Because Launchpad uses Bazaar, a distributed version control system, sometimes it takes a little while to get used to the fact that there is no “central” source tree that you check code into and out of. Instead, what you do is “pull” a named branch from a Bazaar server to your local workstation and work on that local branch in peace and quiet until you want to “push” your code into another branch. The branch to which you “push” may be the “trunk”, or “active development” branch, but normally you will not push to the trunk branch. Instead, you’ll push your local branch containing your source code changes to a personal branch on the Launchpad.net server and propose that your branch be “merged into trunk”. The image to the right, from the Bazaar user guide, shows the general flow of code using this type of process, called a “decentralized system with a human gatekeeper”.

This pull, code, push and merge process is the recommended way to manage code changes for a project. It allows a core set of “merge captains” to review and check your code before merging your code into the active development branch. It is this process which I will be demonstrating in this article.
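
In terminal terms, the whole cycle looks roughly like this (illustrative project and branch names; each step is covered in detail in the sections below):

bzr branch lp:projectname trunk                    # pull the active development branch
bzr branch trunk my-working-branch                 # create a local working branch
cd my-working-branch                               # ... hack on the code ...
bzr commit -m "Describe what the change does"
bzr push lp:~yourusername/projectname/my-working-branch
# then propose the branch for merging on Launchpad.net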

Pulling Code into a Local Repository

When you work on a project in Launchpad, you work on code in a local branch of the project. To get rolling, you will first want to set up a local repository if you haven’t already done so. To do so, we use the bzr init-repo command:

NOTE: You will need Bazaar installed to do so. Don’t have Bazaar installed? See my previous article on Getting a C/C++ Development Environment Established. Not coding in C/C++? Don’t worry, just read the section of that article on installing Bazaar.


cd ~/repos # Change this to the folder in which you plan to have bzr repositories...
bzr init-repo mysql-server-5.1

At this point, a shared repository will be created. What the heck is a shared repository? Well, it’s basically a special folder that Bazaar knows contains information about source code branches. It facilitates speedier branching and merging and is especially useful for larger source trees and changeset histories like the MySQL server. You can verify that Bazaar knows something about your newly created repository by checking for a .bzr hidden folder in your repository folder:

[504][jpipes@serialcoder: /home/jpipes/repos]$ ls -la mysql-server-5.1/
total 12
drwxr-xr-x 3 jpipes jpipes 4096 2008-08-26 15:00 .
drwxr-xr-x 7 jpipes jpipes 4096 2008-08-26 15:00 ..
drwxr-xr-x 4 jpipes jpipes 4096 2008-08-26 15:00 .bzr

The next step is pulling the active development series of your particular project. In our example, we’ll pull the active development branch of the “5.1” series of the MySQL Server. To do so, we use the bzr branch command:


cd mysql-server-5.1/

bzr branch lp:mysql-server/5.1 trunk

WARNING: When doing the initial pull of the first branch in a shared repository, the branch command can take quite some time to execute, especially when pulling a branch of a project like the MySQL Server which has a huge history of changesets to it. Be prepared to wait a while, and if Bazaar looks like it’s stuck doing stuff, just leave it alone. :) It typically will take about 80-90 minutes to complete the first time!

In the bzr branch command above, lp: designates we are looking for a branch residing on the Launchpad.net Bazaar servers. The colon is followed by the name of the project, in this case mysql-server, followed by a slash and the name of the series, in this case 5.1. You can always check the names of a project’s series by going to the main code area of a project. For MySQL Server, that address would be http://code.launchpad.net/mysql-server
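
To make the addressing scheme concrete, here are a few illustrative bzr branch invocations (the user and branch names are hypothetical):

bzr branch lp:mysql-server/5.1 trunk                    # the 5.1 series of the mysql-server project
bzr branch lp:drizzle trunk                             # a bare project name resolves to its active development branch
bzr branch lp:~someuser/mysql-server/some-branch fix    # a personal branch owned by ~someuser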

When the initial branch finishes, you’ll see something like the following, shown with the time command to illustrate the amount of time you should expect for the MySQL Server initial branch:

[511][jpipes@serialcoder: /home/jpipes/repos/mysql-server-5.1]$ time bzr branch lp:mysql-server/5.1 trunk 
Server is too old for streaming pull, reconnecting.  (Upgrade the server to Bazaar 1.2 to avoid this)
Branched 2719 revision(s).

real	91m30.337s
user	14m6.825s
sys	6m7.355s

The total amount of “stuff” that is downloaded for the 5.1 server is around 600MB, so it shouldn’t be surprising that it takes some time to do the initial branch…

You can ignore the message about upgrading the server to Bazaar 1.2; it’s because I’m not using the very recent Bazaar 1.6 client. The performance of this first branch is currently being investigated by John Arbash Meinel, one of Bazaar’s developers, whom I spent some “quality time” with on IRC today. :) He mentions that the developers are working on something called “shallow trees” which should significantly speed up initial branching for large projects.

Creating a Local Working Branch for Bug Fixing

OK, if you’ve gotten this far, then you will have a local shared repository containing a single branch, which holds the source code and changeset history for the 5.1 series of the MySQL Server. What we want to do now is fix a bug in the MySQL Server 5.1 locally on our workstation. This will be a fictitious bug called Bug#99999 – “authors.h file doesn’t contain MY NAME!” :)

Certainly, we could just start hacking up the code in the trunk branch we just pulled. But, that’s not the most practical way of doing structured development on a local workstation. Instead, you should create a branch from trunk which will house only the changes specific to what you are working on: in this case, our fictitious bug#99999. Why is this a better practice than simply making the changes in the trunk branch? Well, a couple reasons:

  • It keeps the trunk branch free of any of your changes. This is very important if you want to keep trunk up-to-date with changes from the rest of the project’s developers and not have to constantly resolve merge conflicts (more on that later). Having a clean trunk branch local to your workstation means you can quickly and easily make local working branches from trunk for anything you are working on.
  • It creates an easy-to-understand naming structure for your local repository. After a while, you might have a repository with the following structure:
    ~/repos
      /mysql-server-5.1
        /trunk # perhaps a cron job keeps this in synch every morning...?
        /bugXXXX-fix-invalid-pointer-in-binlog
        /bugXXXX-wrong-error-message-on-create-table
        /my-fancy-storage-engine
    

OK, so hopefully I’ve convinced you to follow the advice of creating separate local working branches for actually changing source code. :) Let’s create a local working branch to do the fix for our Bug#99999. We use the bzr branch command, as before, but this time we’ll be branching from our local trunk, and not the Launchpad trunk branch for the 5.1 series. You will notice that the branching will be significantly faster than before. This is due both to doing things locally, and to the fact that you now have a shared repository set up. Enter the following:

bzr branch trunk bug99999-fix-authors-file

Once completed, you should see something like the following:

[571][jpipes@serialcoder: /home/jpipes/repos/mysql-server-5.1]$ bzr branch trunk bug99999-fix-authors-file
Branched 2719 revision(s).

We’re now ready to start fixing our bug. Hop into the newly created working branch and open up the sql/authors.h file in your editor of choice (here, I’ll use Vim):

cd bug99999-fix-authors-file
vim sql/authors.h

The “fix” for this bug is simply adding your name to the sql/authors.h file of course. Go ahead and add your name to the list in that file and save and close the file. Now, that has to be the easiest bug fix ever. No wait, I take that back.

Before we commit anything, let’s first check to see what changes we have made in the local branch. To do so, we use the bzr status command, like so:

bzr status

If you’ve done everything up until now, you should see something very similar to the below:

[503][jpipes@serialcoder: /home/jpipes/repos/mysql-server-5.1/bug99999-fix-authors-file]$ bzr status
modified:
  sql/authors.h

The next thing we’ll need to do is commit our changes to the local branch. For those of you used to CVS or Subversion, this step will look familiar, however remember that with Bazaar you are committing to the local branch, not a central repository. (This isn’t always the case, but for now, assume it is…)

Like in other source control systems, we’ll use the commit command. There are a number of command options that you can use with the commit command, and I will outline two of them here. The most important is the -m option, which allows you to enter a string which will be the comment for the set of changes in this commit. This is an extremely useful option for smaller changesets. For larger ones, leave off the -m option and your environment’s editor will pop up after hitting enter to allow you to enter in larger comments.

TIP: Remember, a best practice whenever you commit source code to a revision control system is to make the changeset comments as descriptive as possible, so other developers can clearly tell what you were doing.

The second option I’ll tell you about is a nifty one which integrates Bazaar with the Launchpad.net bug tracking system. If you have a Bug report that is managed by Launchpad.net, you can supply the bug number to the --fixes option and automatically close a bug report with your bzr commit. Pretty cool, eh? :)

Below, I show how to use these two options with the bzr commit command:

bzr commit -m "Add my name to authors.h" --fixes lp:99999

You should see something like the following appear:

Committing to: /home/jpipes/repos/mysql-server/bug99999-fix-authors-file/
modified sql/authors.h
Committed revision 2720.

Did you know that Bazaar supports more bug tracking systems than just Launchpad.net? Check out the Bazaar user guide section on Bug Tracking integration for more information.

Pushing Code to Launchpad

Now that you’ve made your code changes, it’s time to push those changes up to a branch on Launchpad.net. Why do we want to push the local branch changes to Launchpad.net, instead of sending the changes to another team member (using the bzr send or bzr export command)? Well, first of all, pushing the changes to Launchpad.net allows anyone to see and review your code changes, making the Launchpad.net website an easy and centralized place to do that. Secondly, having the branch on Launchpad.net allows you to get more out of the Launchpad.net platform and integrate your branch and code with other features of the platform, such as Bug Tracking and the Blueprints task management and milestone system.

To get your local branch to Launchpad.net, you use the bzr push command. The push command takes as an argument the address of the branch to which you wish to push your local changes. If you are pushing changes to a new branch on Launchpad.net, the system will create that new branch for you automatically. If you are pushing to an existing branch, the system simply uploads your changesets and applies them to that branch.

On Launchpad, there are a number of locations where we can push Bazaar branches. Each Launchpad user gets their own http://code.launchpad.net/~username area in which to put branches. Whenever you are a member of a project team, you can also push code into the code.launchpad.net/~username/projectname/ location. Also, each Launchpad user has a “Junk” area (code.launchpad.net/~username/+junk/) that they can post any old branch to. After any of these locations, you put the name of the branch you are pushing to (or creating).

To see an example, let’s use our “Junk” area for right now, and push our local bug99999-fix-authors-file branch to Launchpad.net. We want to push our local bug-fix branch to a general branch in our junk folder which will contain all of our bug fixing efforts. Why? Well, there’s no need to create separate branches for each bug fix on Launchpad.net just so people can see the code in the single bug fix. We can push all our code changes to a general branch and then point reviewers to the specific revision we worked on. This saves a whole lot of time when pushing branches. So, here is how we push:

bzr push lp:~jaypipes/+junk/mysql-server # Of course, replace jaypipes with your own username!

After a while, you’ll see the following:

Created new branch.

You can now go view your branch and its associated code changes by visiting the branch’s code URL, which will be http://code.launchpad.net/~yourusername/+junk/mysql-server if you’ve been following the steps in this article.

Belonging to a Project Team

Having a branch in your junk folder is fine, but Launchpad is all about belonging to a community of developers! When you belong to a project team, you are automatically able to push your branches to the project’s code area. Importantly, if you belong to a project team, then you can use Launchpad’s ability to propose a branch merge — something you cannot do if you push to your “Junk” folder. To push one of your local branches to a project, you would do:

cd ~/repos/projectname/branchname
bzr push lp:~yourusername/projectname/branchname

Once pushed, the branch will be visible to anyone when they look at the project’s code branch listing, which is always at http://code.launchpad.net/projectname. If you click on your branch, you will go to the branch’s main page. You’ll notice links for a number of actions that you can take, including one called “Propose for merging into another branch”. I’ll be talking a lot about these options in future articles in this series, since many of them relate to the other parts of the Launchpad platform.

Notifying a Merge Captain of Your Code Pushes

OK, now that your code is up in a branch, the next step is to ask for a review of your code to be merged into the development branch of the project. In this case, I’m pretty sure our Bug#99999 fix isn’t going to be passing any code reviews to get into the MySQL Server, but I’ll explain the process anyway for reference. :)

A merge captain is someone who chaperones a project’s main development branch by being a gatekeeper for new code being merged into it. This is a critical role that some folks think is a fun or coveted job. It’s not. As a merge captain for the Drizzle project, I can tell you that in fact it’s mainly grunt work comprising tedious pulling, merging, building the code, and running of test suites. So, be nice to your merge captains! The best way to be nice to your merge captain is to follow these two simple rules:

  1. Always make comments for your commits that are descriptive and explain clearly what your code is doing and what tasks or bugs the work addresses
  2. Follow a standard, agreed upon process for notifying a merge captain of merge requests

In the case of the second rule, the standard process uses the Launchpad.net platform to notify the merge captain of a request — instead of, for instance, bugging the crap out of the captain on IRC to merge your branch. ;)

To notify the merge captain about your branch, navigate to the main code page of the branch you pushed. Remember that to do this step, you must be a member of the project team! On the branch’s main page, look for the link “Propose merging into another branch” and click it. You will be taken to the Propose Branch for Merging page. Select the target branch. The default will be the active development branch for the project in question. You may also provide a branch name. In the “whiteboard” text area, you can provide a brief description of the changes contained in your branch. Typically, you will want to keep the option “Needs Review” selected, and then click the Register button.

At this point, an email will be fired off to any subscribers to the project’s trunk branch commits. The merge captain will, of course, be one of those who receive this email and will initiate the review of your code and the merging of it into trunk (or whichever target you specified).

Keeping Trunk Up to Date

The final two parts of this article look at how to keep your own local branches up to date with the development branch of your project and show you how to merge someone else’s branch with your own.

To pull all the latest changes from a remote branch into a local branch, the bzr pull command is used. You navigate to the local branch you wish to pull changes for and then issue the pull command.

This will bring all the merges and changesets of the branch you originally branched from into the local branch. In our case, the local “trunk” branch was branched originally from the branch at lp:mysql-server/5.1, and so if we do:

cd ~/repos/mysql-server-5.1/trunk
bzr pull

We will update our local trunk branch with the changes in the active development branch on Launchpad.
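
Once your local trunk is current, you will usually want to fold those changes into any working branch you still have open. A minimal sketch, using the branch names from earlier in this article:

cd ~/repos/mysql-server-5.1/bug99999-fix-authors-file
bzr merge ../trunk                  # bring the refreshed trunk into the working branch
# resolve any conflicts, then record the merge
bzr commit -m "Merge latest trunk"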

Merging Local Code with a Trunk Branch

A merge, as alluded to above, is simply when you want to combine the code changes of one branch with the code in another. We use the bzr merge command to do so. The merge command takes a single argument: the location of the branch you wish to merge into the current one.

To demonstrate how the merge command is used, I’ll just paste the work I just did for the Drizzle project, merging a contributor’s fixes into trunk:

[509][jpipes@serialcoder: /home/jpipes/repos/drizzle]$ cd trunk/
[510][jpipes@serialcoder: /home/jpipes/repos/drizzle/trunk]$ bzr merge ../grant-bug261687/
 M  drizzled/sql_derived.cc                                            
All changes applied successfully.
[511][jpipes@serialcoder: /home/jpipes/repos/drizzle/trunk]$ bzr commit -m \
"Merged Grant's fixes for sql_derived.  Fixes bug#261687" --fixes lp:261687
Committing to: /home/jpipes/repos/drizzle/trunk/
modified drizzled/sql_derived.cc
Committed revision 373.

The above was executed after I pulled Grant’s Launchpad.net branch called bug261687, built the code, and ran the test suite to ensure no failures. I change to my local branch of the Drizzle project development branch (called “trunk” locally). I then merge Grant’s branch into trunk locally with the bzr merge ../grant-bug261687 command. After merging, I must commit the trunk code. I do so, making a comment that this is a merge of Grant’s work, and noting the bug # which the changeset fixes. Once I do this, I am free to bzr push lp:drizzle and push the local merge to Launchpad…

Conclusion

As you can see, Launchpad and Bazaar together make up a feature-full code management system with a lot of bells and whistles. Hopefully, this article can get you started in your adventures in contributing to projects hosted on Bazaar/Launchpad.net. See you on Launchpad! Up next in this series: how to do task management with Launchpad’s Blueprints system.

A Contributor’s Guide to Launchpad.net – Part 1 – Getting Started

This post is the first in a series of articles which serve to highlight the services of the Launchpad platform, which hosts the MySQL Server, MySQL Forge, MySQL Sandbox and Drizzle Server projects. I will be walking you through the various pieces of the platform and providing examples of using each of the services. I will cover in depth the source code management services which all of these projects now rely upon. The code management services are the critical piece of the development platform. In addition, I will show you how to use the Blueprints, Bugs, Answers and Translations services that many MySQL ecosystem projects, including the MySQL Sandbox and MySQL Forge, use.

In this first article, I will walk through the critical first step of establishing a Launchpad.net account and setting up the OpenSSH and OpenPGP keys for your account. In follow-up posts, I cover the code management system, blueprints system, and more.

Creating Your Account on Launchpad.net

The first thing to do is obviously create your account on Launchpad.net. Doing so is trivially easy. Go to the registration page and enter your email address. Launchpad will then send an email (subject line: “Finish your Launchpad registration”) to that address with a link to start the registration process. Click on the link in that email. You’ll be asked to provide a Display Name and a password. You can choose to have your email address hidden from other users or not. After filling in the information, click the Continue button and you’ll end up in your account profile area.

Once in the profile area, go ahead and fill in any of the information you want to be public about yourself by clicking the “Change Details” link in the right of the page. You can upload an avatar image for yourself and fill in a little “Bio” section. In a sub-navigation area at the top of the page, you will see links to edit your Email Settings, SSH Keys, GPG Keys, and Passwords. We’ll cover the SSH and GPG Keys in a bit. For now, explore this area and set your preferences the way you like.

At the very bottom of the Change Details page, you also have some other links to edit your IRC nicks and other stuff.

Your OpenSSH Keys

One important thing to do when setting up your account is to upload your public SSH key to Launchpad.net. This helps the code management system by facilitating the bzr+ssh protocol and allowing you to push a Bazaar branch to the Launchpad.net supermirror (more on that later). If you have already generated your public key, go ahead and skip the next subsection and proceed to “Upload your public key”.

To generate your SSH key, you will need to have OpenSSH installed. For Ubuntu users, simply do sudo apt-get install openssh-client. For other Linux users, use your package manager of choice. Windows users can use the PuTTY key generator, and should follow the instructions on the excellent Launchpad.net Help wiki. Once you have OpenSSH installed, it’s time to generate your key. Do so with the following in a terminal:

ssh-keygen

When prompted, accept the default key file names (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub), and then enter a password for protecting the key file.

Upload your public key

You can output your public key using the following:

cat ~/.ssh/id_rsa.pub

Go ahead and copy the public key as-is; you’ll need it soon.
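
If you happen to have the xclip utility installed (entirely optional), you can copy the key straight to your clipboard instead of selecting it by hand:

sudo apt-get install xclip    # only needed if xclip isn't already installed
xclip -selection clipboard < ~/.ssh/id_rsa.pub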

Now that you’ve generated your SSH keys, go ahead and click on the “SSH Keys” link in the sub-navigation bar in your profile. Simply copy your public key into the text area marked “Add an SSH Key” and then click “Import Public Key”. OK, all set, let’s tackle the GPG Keys next.

GPG Keys (Optional, but Recommended)

Before we generate a GPG key and upload one to Launchpad.net, you will first want to ensure that you have a mail reader capable of decrypting PGP-encrypted emails. Personally, I use Thunderbird and the excellent Enigmail plugin for this, but you will want to use your own preferred MUA. Use this help article for assistance in setting up PGP for your mail client of choice.

OK, next up we’ll go ahead and generate GnuPG keys for use with email security and the Launchpad.net mailing lists. To generate a GnuPG key pair, issue the following in a terminal:

gpg --gen-key

This will start a series of questions for you to answer, including which email address the key is for, your name, a passphrase and how many bits to make your key file (I chose 2048). Here is what the series will look like in a terminal:

[510][jpipes@serialcoder: /home/jpipes/.gnupg]$ gpg --gen-key
gpg (GnuPG) 1.4.6; Copyright (C) 2006 Free Software Foundation, Inc.
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions. See the file COPYING for details.

Please select what kind of key you want:
   (1) DSA and Elgamal (default)
   (2) DSA (sign only)
   (5) RSA (sign only)
Your selection? 1
DSA keypair will have 1024 bits.
ELG-E keys may be between 1024 and 4096 bits long.
What keysize do you want? (2048) 
Requested keysize is 2048 bits
Please specify how long the key should be valid.
         0 = key does not expire
      <n>  = key expires in n days
      <n>w = key expires in n weeks
      <n>m = key expires in n months
      <n>y = key expires in n years
Key is valid for? (0) 
Key does not expire at all
Is this correct? (y/N) y

You need a user ID to identify your key; the software constructs the user ID
from the Real Name, Comment and Email Address in this form:
    "Heinrich Heine (Der Dichter) "

Real name: Jay Pipes
Email address: xxxxxxxxx
Comment: 
You selected this USER-ID:
    "Jay Pipes "

Change (N)ame, (C)omment, (E)mail or (O)kay/(Q)uit? O
You need a Passphrase to protect your secret key.

We need to generate a lot of random bytes. It is a good idea to perform
some other action (type on the keyboard, move the mouse, utilize the
disks) during the prime generation; this gives the random number
generator a better chance to gain enough entropy.
...
gpg: key 9C5804A8 marked as ultimately trusted
public and secret key created and signed.

gpg: checking the trustdb
gpg: 3 marginal(s) needed, 1 complete(s) needed, PGP trust model
gpg: depth: 0  valid:   1  signed:   0  trust: 0-, 0q, 0n, 0m, 0f, 1u
pub   1024D/9C5804A8 2008-08-22
      Key fingerprint = 16C5 50D4 7061 03A1 48CD  3826 0CAD 7BD9 9C58 04A8
uid                  Jay Pipes 
sub   2048g/F7F4A925 2008-08-22

OK, once generated, verify that everything worked as expected using the following:

gpg --list-keys

which should produce output similar to the following:

[512][jpipes@serialcoder: /home/jpipes/.gnupg]$ gpg --list-keys
/home/jpipes/.gnupg/pubring.gpg
-------------------------------
pub   1024D/9C5804A8 2008-08-22
uid                  Jay Pipes 
sub   2048g/F7F4A925 2008-08-22

The next step is to push your public key to the Ubuntu key server — technically, you could push it to any old key server, but to keep things simple, just use the Ubuntu key server. You’ll need to take note of your public key ID, which is the 8-character hex number following 1024D in the list-keys output. In my case, that public key ID is “9C5804A8”. Issue the following command, substituting your public key ID:

gpg --send-keys --keyserver keyserver.ubuntu.com 9C5804A8
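
If you want to double-check that the key actually reached the key server, you can search for it by its ID (again substituting your own key ID; this step is optional):

gpg --keyserver keyserver.ubuntu.com --search-keys 9C5804A8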

The final step is to make Launchpad.net aware of your new GPG key. To do so, you need to send your GPG fingerprint to Launchpad. To grab your fingerprint, issue the following command:

gpg --fingerprint

which should produce something very similar to:

[514][jpipes@serialcoder: /home/jpipes/.gnupg]$ gpg --fingerprint
/home/jpipes/.gnupg/pubring.gpg
-------------------------------
pub   1024D/9C5804A8 2008-08-22
      Key fingerprint = 16C5 50D4 7061 03A1 48CD  3826 0CAD 7BD9 9C58 04A8
uid                  Jay Pipes 
sub   2048g/F7F4A925 2008-08-22

Copy the key fingerprint and paste it into the text area in your Open GPG keys page in your profile, then click the “Import Key” button. Launchpad.net will email the address of the key with a confirmation message. Click the link in the email under “Please go here to finish adding the key to your Launchpad account:”. And you are all done.

Now you have all the pieces set up to begin working with Launchpad.net effectively. In the next few posts in this series, I’ll walk you through how to best use the platform to be a productive contributor to the MySQL ecosystem. :)

Getting a Working C/C++ Development Environment for Developing Drizzle

This article explains how to set up a properly functioning C/C++ development environment on Linux. The article is aimed at developers interested in contributing to the Drizzle server project, but the vast majority of the content applies equally well to developers wishing to contribute to the MySQL server or any other open source project written in C/C++.

IMPORTANT: This article doesn’t get into any religious battles over IDEs or particular editors. IDEs and editors are what you use to edit code. What this article covers is the surrounding libraries, toolchain, and dependencies needed to get into the development or contribution process. That said, go Vim. ;)

The examples shown use the Debian/Ubuntu methods of obtaining code packages and installing them — specifically, using apt-get install to install packages. If you are running on an RPM-based distro, simply change the commands to use your package manager of preference, for instance yum install. Solaris users should get as far as they can in the installation/setup process and hop over to the Freenode #drizzle channel for help from one of the Solaris experts. The Drizzle wiki has more information on specific packages needed for different distributions.

OK, let’s get started, shall we?

Installing Bazaar and Some Bzr Goodies

So, what is this Bazaar thing and why do you need it? Good question! Bazaar is the revision control system that the Drizzle development team (as well as the MySQL engineering team) uses for source code control. The good folks over at Canonical maintain and enhance the excellent Bazaar-NG system and have an online platform called Launchpad.net which is tightly integrated with Bazaar. Launchpad.net is kind of like SourceForge.net, only focused around Bazaar as the revision control system, and includes a number of nifty little features that make it easier to manage and maintain teams of developers working on the code base. The Drizzle Server project is hosted on Launchpad.net at http://launchpad.net/drizzle.

To install Bazaar, issue the following:

sudo apt-get install bzr

Some Optional Goodies

Once installed, you might want to install a few more things that will make your bzr life easier. The bzrtools package is a collection of command-line and graphical utilities for bzr. meld is a graphical merge conflict resolution utility that I have found invaluable at times. PoEdit is an easy way to work with the GetText translation utilities. To install these tools, do:

sudo apt-get install bzrtools meld poedit

OK, that’s it for Bazaar for now. Let’s move on to getting your development toolchain installed.

The Required Toolchain Packages and Library Dependencies

In order to compile the Drizzle server, you will need a working GNU Toolchain with the C++ development tools.

The Easy Way

The best and most consistent way to get these build dependencies, if you are working on Debian/Ubuntu, is to add the Drizzle Developers Personal Package Archive to your apt path and install the Drizzle dependencies.

Here is how to do this via the terminal:

# For Ubuntu >= 9.10 (Karmic):
sudo add-apt-repository ppa:drizzle-developers/ppa
sudo apt-get update
sudo apt-get install drizzle-dev

For versions of Ubuntu prior to 9.10, simply head over to the Drizzle Developers PPA and follow the instructions on that page.

The Hard Way

The following packages are required for building Drizzle.

  • binutils (includes ld and as)
  • gcc/g++ (4.2+)
  • autoconf (2.59+)
  • automake (1.10+)
  • libtool (1.5.24+)
  • m4 (1.4.8+)
  • bison/yacc (2.3)
  • pkg-config (0.22+)

To install the above toolchain, do the following:

sudo apt-get install libc-dev gcc g++ bison binutils automake autoconf m4 pkg-config libtool
NOTE: Thanks to Tom Hanlon…older versions of Ubuntu may need to specify libc6-dev, not libc-dev!

Once apt-get finishes installing the above, you’ll have a system capable of compiling C/C++ programs, for the most part. The Drizzle server needs some additional libraries and header files in order to compile. I list them here along with a brief description of the library or file.

  • libpcre3 – A standard PCRE regular expression library
  • libpam0g – A pluggable authentication modules library (a horrible package name, no?…)
  • libncurses – A library for displays of terminal applications (used by the drizzle client)
  • libprotobuf – Google Proto Buffers library, used by the server in message communication
  • gettext – i18n and l10n services
  • libevent – socket event handling
  • libz – compression
  • libreadline – Command-line editing utilities
  • uuid-dev – UUID headers
  • libboost-dev – BOOST C++ headers and library
  • libboost-program-options-dev – BOOST program_options library headers
  • libdrizzle-dev – libdrizzle headers
  • gperf – GNU hash function generator
  • libgcrypt11-dev – GNU Crypto library

The following command should install the required libraries with the exception of Google Proto Buffers, which is described in the following section.

sudo apt-get install libpcre3-dev libpam0g libncurses5-dev libpam0g-dev gettext libevent-dev libz-dev libreadline-dev uuid-dev  libgcrypt11-dev libboost-dev libboost-program-options-dev libdrizzle-dev gperf

Installing Google Proto Buffers

After installing the libraries and toolchain, you’ll need to install the Google Proto Buffers library. Again, the easiest way to do so is via the Drizzle Developers PPA, which you should have added to your repositories above. If you did, simply do:

sudo apt-get install libprotobuf-dev protobuf-compiler

Setting Up a Local Bazaar Repository for Drizzle

Now that you’ve installed all the required toolchain and dependencies, it’s time to use Bazaar to pull the development branch of Drizzle and compile the Drizzle server. The first step is to set up that local bzr repository. I have all my bzr repositories in a directory called ~/repos, and that is what the examples below show, but you are of course welcome to put your repos wherever you prefer. To set up a directory and a drizzle repo under your home directory, do the following:

cd ~
mkdir repos
cd repos
bzr init-repo drizzle
cd drizzle

At this point, you have a local bzr repository. Let’s now create a local branch of the development source code trunk that we can play with. To do so, we use the bzr branch command, like so:

bzr branch lp:drizzle trunk

This tells bzr to go grab the main development branch of the “drizzle” project that resides on the Launchpad.net servers (thus, the lp: prefix), and create a local branch called “trunk”. The branch operation may take a little while to complete when you do it for the first time. Subsequent branch and merge operations are much, much quicker than the first branch into a repository. When the branch succeeds, go ahead and look at the files that have been downloaded into your “trunk” branch:

cd trunk
ls -la

You should see something like the following:

jpipes@serialcoder:~/repos/drizzle/trunk$ ls -la
total 264
drwxr-xr-x 13 jpipes jpipes  4096 2010-02-25 12:39 .
drwxr-xr-x 16 jpipes jpipes  4096 2010-03-10 19:44 ..
-rw-r--r--  1 jpipes jpipes 76502 2010-02-24 23:30 ABOUT-NLS
-rw-r--r--  1 jpipes jpipes   377 2010-02-24 23:30 AUTHORS
drwxr-xr-x  5 jpipes jpipes  4096 2010-02-24 23:30 .bzr
-rw-r--r--  1 jpipes jpipes  5983 2010-02-25 12:39 .bzrignore
drwxr-xr-x  2 jpipes jpipes  4096 2010-03-10 15:27 client
drwxr-xr-x  2 jpipes jpipes  4096 2010-02-24 23:30 config
-rw-r--r--  1 jpipes jpipes  5350 2010-02-24 23:30 configure.ac
-rw-r--r--  1 jpipes jpipes 19071 2010-02-24 23:30 COPYING
-rw-r--r--  1 jpipes jpipes 56574 2010-02-24 23:30 Doxyfile
drwxr-xr-x 16 jpipes jpipes 12288 2010-03-10 15:27 drizzled
-rw-r--r--  1 jpipes jpipes  5962 2010-02-24 23:30 DRIZZLE.FAQ
-rw-r--r--  1 jpipes jpipes  5139 2010-02-24 23:30 EXCEPTIONS-CLIENT
drwxr-xr-x  2 jpipes jpipes  4096 2010-02-24 23:30 extra
drwxr-xr-x  2 jpipes jpipes  4096 2010-02-24 23:30 gnulib
drwxr-xr-x  2 jpipes jpipes  4096 2010-02-24 23:30 m4
-rw-r--r--  1 jpipes jpipes  4545 2010-02-24 23:30 Makefile.am
-rw-r--r--  1 jpipes jpipes    41 2010-02-24 23:30 NEWS
drwxr-xr-x 55 jpipes jpipes  4096 2010-03-05 13:11 plugin
drwxr-xr-x  2 jpipes jpipes  4096 2010-02-24 23:30 po
-rw-r--r--  1 jpipes jpipes   821 2010-02-24 23:30 README
drwxr-xr-x  3 jpipes jpipes  4096 2010-02-24 23:30 support-files
drwxr-xr-x  8 jpipes jpipes  4096 2010-03-10 15:27 tests

Compiling Drizzle

OK, you are now ready to compile the server and client tools contained in your branch. The way to do so is the following:

./config/autorun.sh
./configure --with-debug
make
make test

If all goes well, Drizzle will configure and compile, and the test suite will run and output the (hopefully!) passing test results. The output at the end should be similar to the following:

...
collation_dictionary.slap      [ pass ]             54
blackhole.blackhole            [ pass ]             13
archive.archive                [ pass ]            509
archive.archive_aio_posix      [ pass ]            487
archive.archive_basic          [ pass ]              2
archive.archive_discover       [ pass ]              2
-------------------------------------------------------
Stopping All Servers
All 290 tests were successful.
The servers were restarted 36 times
Spent 206.430 of 290 seconds executing testcases

And that’s it. :)

OK, so that’s it for this first article. I hope you’ve found it helpful in getting a development environment set up so that you can feel comfortable contributing to the Drizzle project. Join me on Freenode’s #drizzle IRC channel to get some help. We’re always available.

MySQL Connection Management in PHP – How (Not) To Do Things

I’ll warn you right now, this is going to be a long article. More than likely, I’ll put a version of this up on the MySQL developer zone and PHP zone. This article is intended to highlight various basic topics concerning proper methods of handling connections to MySQL databases in PHP, guidelines for caching dynamic content, and a technique called “lazy loading”. Hopefully by the end of the article you’ll have learned how to combat a very widespread and potentially devastating scalability problem seen in an enormous number of PHP web applications.

An introduction to the problem

Before I start the discussion on connecting to MySQL servers via PHP, it’s worth pointing out that the relative cost of connecting to a MySQL database, versus connecting to a PostgreSQL or Oracle installation, is very, very low. However, the fact that connecting to a MySQL resource is inexpensive does not mean that connection resources can be abused. Anyone who has ever seen the dreaded “Too many connections” error which occurs when MySQL attempts to service more connections than the number of concurrent connections afforded by the max_connections configuration variable, knows what I am talking about.

Connecting to a MySQL resource, while inexpensive, requires PHP to issue a call to the mysql client API via mysql_connect(). Passed as variables to this function are the connection arguments for the database host, user, password, etc. Once the MySQL database server has received the connection information, it will either return a success value, or a failure. Upon failure, the mysql_error() function can be used to determine what went wrong during the connection attempt (typically user credential problems or the max connections issue).

So, where, you ask, is the problem? Well, the issue that I commonly see is that connections are made to MySQL resources when they do not need to be made. But, you say, almost all web applications serve dynamic content, so doesn’t dynamic content virtually require that a connection to a database be made?

Well, not really, in many, many cases.

Let’s take as an example a very, very popular PHP web application installed on hundreds of thousands of servers worldwide: the blogging application WordPress. Now, before I go any further, I want to say that the reader should not think that I am attacking WordPress in this article, or deliberately trying to point out shortcomings in their software. By contrast, I picked WordPress to demonstrate that the problem described in this article is widespread among PHP web applications. At the very end of the article, I’ll present a patch to the current WordPress source code which fixes the issue identified in this article. I will post the patch to the wp-testers mailing list after I complete the article. Promise. :)

Now, a blogging application is indeed a data-driven web application. Typical blog software involves the posting of articles, the management of comments, the display of such articles and comments, and, of course, pages which serve to provide the various RSS and Atom feeds for the blog. So, one might argue that blogging software is highly dynamic, and therefore would necessarily issue calls to connect to the database upon every visit to the blogsite.

This, however, is not particularly true. Even on extremely busy blogs, content doesn’t change on a continual basis. This point is even more relevant when you consider that, a certain number of days after an article is posted, its content becomes almost entirely static. Keep this point in mind as you follow through the next sections, which walk through the page invocation process which WordPress executes upon every hit to a blog page (including feed pages).

Investigating WordPress page invocation

On all PHP pages in the WordPress main directory — which include index.php, wp-atom.php, wp-rss.php, etc. — each file begins with the following include:

require('./wp-blog-header.php');

The wp-blog-header.php page does a couple things but mostly serves to include the following:

require_once( dirname(__FILE__) . '/wp-config.php');

OK, so we’re heading over to wp-config.php… and we find some good stuff, such as the defines for the database connection parameters, and:

define('ABSPATH', dirname(__FILE__).'/');
require_once(ABSPATH.'wp-settings.php');

OK, so a quick :find wp-settings.php later, we open up the first meaty file of our page invocation. The first 72 code lines of wp-settings.php do some housekeeping stuff, like checking if the PHP version is adequate and if the MySQL extension is installed (tangent: is it really necessary to do this on every web page invocation?!). After that, we see the following include:

define('WPINC', 'wp-includes');
require_once (ABSPATH . WPINC . '/wp-db.php');

Into the heart of the beast

OK, so thus far there’s been nothing spectacular or extraordinary about the code. Just a few includes to make file maintenance orderly, but nothing unusual. However, the wp-includes/wp-db.php file contains perhaps the most common PHP/MySQL gotcha seen in today’s web applications.

The file starts out with some defines and then the class definition of the WordPress database abstraction object, called wpdb. The wpdb class contains the very typical methods commonly seen in a DB abstraction layer: get_col(), get_row(), get_results(), etc, which allow a query string to be passed in and executed against a MySQL database. However, there is one major problem in the design, which manifests itself in the last line of the file:

$wpdb = new wpdb(DB_USER, DB_PASSWORD, DB_NAME, DB_HOST);

Well, what’s so wrong with that, you say? Well, what does the new operator do? It creates an object of a class type specified and, during the creation of the object, calls the class constructor, which is the class method with a name identical to the class name — in this case, the method called wpdb, shown here:

	// ==================================================================
	//	DB Constructor - connects to the server and selects a database

	function wpdb($dbuser, $dbpassword, $dbname, $dbhost) {
		$this->dbh = @mysql_connect($dbhost, $dbuser, $dbpassword);
		if (!$this->dbh) {
			$this->bail("... content snipped ... ");
		}

		$this->select($dbname);
	}

Can anyone tell what is happening on every page invocation to a WordPress blog? Yup. A connection is being made to the underlying MySQL database. On a heavily hit blog site, this code can easily lead to the dreaded too many connections error because a connection to the database is being made even for mostly static content. There are a couple ways to combat this problem: Lazy loading and Content Caching. These two techniques can be used together to eliminate a huge portion of the database queries and connections in typical web applications.

Lazy Loading

The code in the wpdb class isn’t fundamentally wrong. It just needs some tweaking to ensure that a connection to the database is only made if a query is executed against the database. A technique called lazy loading essentially delays the connection to the database until the last minute, instead of upon creation of the database abstraction object.

The MySQL Forge database abstraction layer uses lazy loading in just this way. The name of the class is SqlConnection, and it has an empty constructor. Instead of connection logic embedded in the constructor, the object has a Connect() method, which looks like the following. The code has been modified only to remove the logic which automatically handles master/slave replication switching:

    /**
     * Attempt to connect the resource based on supplied parameters. 
     * 
     * @return  boolean 
     * @access  public
     *  
     * @param   string  (optional) Host name (Server name)
     * @param   string  (optional) User Name
     * @param   string  (optional) User Password
     * @param   string  (optional) Database Name
     */  
    function Connect() {
        if (func_num_args() == 4) {
            
            // A different database has been requested other than the 
            // standard global config settings
            $host = func_get_arg(0);
            $user = func_get_arg(1);
            $pass = func_get_arg(2);
            $dbname = func_get_arg(3);
            
        }
        else {
            $host = SQL_HOST;
            $user = SQL_USER;
            $pass = SQL_PASS;
            $dbname = SQL_DB_NAME;
        }

        /**
         * Short circuit out when already
         * connected.  To reconnect, pass
         * args again
         */
        if (is_resource($this->_Cnn) && func_num_args() != 4) {return true;}
            
        if (! $this->_Cnn = mysql_connect($host, $user, $pass)) {
            trigger_error(get_class($this) . 
                          "::Connect() Could not connect to server: " . 
                          mysql_error(), E_USER_ERROR);
            return false;
        }
        else {
            if (! mysql_select_db($dbname, $this->_Cnn)) {
                trigger_error(get_class($this) . 
                              "::Connect() Could not connect to specified database on server: " . 
                              mysql_error(), E_USER_ERROR);
                return false;
            }
            else {
                return true;
            }            
        }
    }

The _Cnn member variable is a reference to a MySQL database resource that is returned upon a successful call to mysql_connect(). Notice that the Connect() method has logic which ensures that if an existing connection has already been made during page execution, then the Connect() method simply returns true.

You may be surprised to find out that, just like WordPress, the MySQL Forge software creates a database abstraction object upon each call to the MySQL Forge website. The following code is included in all page invocations:

/**
 * Fine to establish a global connection
 * here, since connect doesn't occur until
 * SQL execution.
 */
require_once(DIR_LIB . 'class/SqlConnection.php');
$GLOBALS['Db'] =& new SqlConnection();

The difference is that a connection to the database is not made in the SqlConnection class constructor, so having the object instantiated doesn’t consume database resources, unlike the wpdb class. So, if the constructor doesn’t call Connect(), then when exactly will mysql_connect() be called? Here, we see the Execute() method of SqlConnection:

    /**
     * Executes the supplied SQL statement and returns
     * the result of the call.
     * 
     * @return  bool   
     * @access  public
     *  
     * @param   string  SQL to execute
     */  
    function Execute( $Sql ) {
        
        /* Auto-connect to database */
        if (! $this->_Cnn) {
            $this->Connect();
        }
        
        if (!$this->_Res = mysql_query($Sql, $this->_Cnn)) {
            trigger_error(get_class($this) . 
                          "::Execute() Could not execute: " . 
                          mysql_error() . 
                          " (SQL: " . $Sql . ")", E_USER_ERROR);
            return false;
        }
        else {
            return true;
        }
    }

In the Execute() method, you can see that if the _Cnn member variable is not set (meaning, a previous connection has not been made to the database), then the SqlConnection connects, otherwise, it simply executes the supplied string against that connection via the mysql_query() function and stores the returned result resource in the _Res member variable for use by other methods in the class.

Other methods of SqlConnection simply wrap the Execute() call and provide result sets in various forms. What this means is that on page invocations to MySQL Forge, unless dynamic data is actually needed, no connections to the database are actually created. Which leads us nicely to the other technique for handling semi-dynamic content web requests: Content Caching.

Content Caching

Caching is perhaps the most fundamental concept discussed in the field of computer science when it comes to performance of both hardware and software. A cache, simply defined, is a storage area for data that has been parsed, retrieved, calculated, or otherwise generated in an expensive operation. The cache functions to alleviate the need for various resources to regenerate the cached data upon every request for the data.

Caches exist everywhere in both hardware and software. For instance, on a hardware level, modern CPUs usually have at least two levels of hardware caches (usually called the L1 and L2 caches). These CPU-connected fast-access caches exist so that the CPU does not need to fetch data from a page of main memory, which is a relatively expensive operation since the speed of access to a RAM page is much slower than the access speed to the locally connected Lx caches. When speaking about caches, it’s important to recognize that everything is relative to something else. Accessing a hard disk is much more expensive than accessing a page of RAM, which is much more expensive than accessing a line of bytes stored in the L1 cache. Likewise, in application-specific caches (which we’ll be talking about next), the relative cost of accessing cached data is lower than retrieving the same information from the MySQL database.

So, let’s talk a bit about basic content caching for a PHP web application. Although these examples use PHP, the discussion of application caching applies to all languages. Every web scripting language provides similar functionality to implement caching.

Application content caching occurs when a standard call to the database is replaced with a call to an application content cache. In these examples, we’ll implement a simple file-based cache; other solutions are, of course, available, including using memcached or a static content web server proxy to serve web content. WordPress actually implements its own caching mechanism, called ObjectCache. You can take a look at the implementation in the wp-includes/cache.php file. However, this implementation has a couple of design limitations that make it unsuitable for a discussion on general caching. It is tightly coupled with other WordPress functions and objects, which makes the caching mechanism unfriendly for general re-use.

A Simple File Cache Engine

Before we get into the implementation of the CacheEngine class that MySQL Forge uses, let’s first take a look at some code from the /lib/class/ProjectMemberFinder.php class that handles requests to retrieve information about the members involved in a project listed in the MySQL Forge project directory:

    /**
     * Return project members based on project ID value
     * 
     * @return  array
     * @param   int     project ID
     */
    function &GetByProjectId($Project) {
    
        /**
         * ProjectMembers don't change that often, 
         * so cache the output of these calls.
         */
        $cache_id = 'project_members-' . $Project;
        
        if ($cache = $GLOBALS['CEngine']->GetFromCache($cache_id, $Seconds=0, $IsObject=true)) {
            return $cache;
        }
        
        $sql = "SELECT 
                    pm.project
                  , pm.member
                  , fu.display_name
                  , pmr.description as role
                  , pm.can_write
                  , pm.can_read
                  , pm.joined_on
                  , pm.last_source_login
                  , pm.last_source_commit
              FROM " . $GLOBALS['SqlTables']['ProjectMember'] . " pm 
                  INNER JOIN " . $GLOBALS['SqlTables']['ForgeUser'] . " fu
                      ON pm.member = fu.user_id 
                  INNER JOIN " . $GLOBALS['SqlTables']['ProjectMemberRole'] . " pmr
                       ON pm.role = pmr.project_member_role_id
              WHERE pm.project = " . (int) $Project;

        $results = $GLOBALS['Db']->GetRecords($sql);

        $GLOBALS['CEngine']->WriteToCache($cache_id, $results);

        return $results;
    }

OK, so the first thing you will notice is that there’s a comment saying basically, “look, this information really doesn’t change all that much. Let’s go ahead and cache the results of the database query for later re-use”. We first ask the global CacheEngine object ($GLOBALS['CEngine']) if we have a cached version of the supplied Project’s project members list:

if ($cache = $GLOBALS['CEngine']->GetFromCache($cache_id, $Seconds=0, $IsObject=true)) {
    return $cache;
}        

The GetFromCache() method of the CacheEngine class returns either the requested data or FALSE. So, in the above code, we simply return the cached data if it is available in the cache. The $Seconds argument to GetFromCache() is the number of seconds that the cached data should be considered valid; passing zero means we always consider the data valid. The $IsObject argument tells the CacheEngine whether to unserialize the cached data back into an array or object before returning it. We’ll see how this is implemented in a little bit.
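To make the calling convention a bit more concrete, here is a minimal sketch of a couple of calls against the same global CacheEngine instance. The cache IDs and the rendered fragment below are made up purely for illustration; only GetFromCache() and WriteToCache() are the real methods:

<?php
/* Never-expiring entry, unserialized back into an array before being returned */
$members = $GLOBALS['CEngine']->GetFromCache('project_members-42', 0, true);

/* Entry considered stale after 300 seconds, returned as a raw string */
$html = $GLOBALS['CEngine']->GetFromCache('project_page-42', 300, false);

if ($html === false) {
    /* Cache miss (or stale entry): rebuild the fragment and cache it again */
    $html = '<p>freshly rendered page fragment</p>';
    $GLOBALS['CEngine']->WriteToCache('project_page-42', $html);
}
?>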

OK, so if the cached data does not exist in the cache, the ProjectMemberFinder::GetByProjectId() method continues on to request the data from the underlying database. The global Db abstraction layer object (described earlier) has its GetRecords() method called, with a SQL string passed as a parameter:

$results = $GLOBALS['Db']->GetRecords($sql);

It is the next line of code that facilitates the caching of this data in our content cache:

$GLOBALS['CEngine']->WriteToCache($cache_id, $results);

So, upon the first invocation of the GetByProjectId() method of ProjectMemberFinder for a given Project ID value, we issue a request to the database and then cache the results for every subsequent call to the function. This saves us an enormous amount of database interaction and increases the overall scalability of the system: the software can handle more concurrent requests because the database connection is no longer a bottleneck.

There are a couple cases that we need to handle when processing cache requests, including how to invalidate data in the cache. We’ll get to these cases in a minute. First, let’s take a look at the CacheEngine class’ two main methods: WriteToCache() and GetFromCache().

The GetFromCache() Method

As you saw above, the GetFromCache() method takes three arguments and returns either FALSE, or the cached data. Let’s take a closer look at the CacheEngine::GetFromCache() method.

    /**
     * Retrieves a Cache file and returns either an object
     * or a string
     *
     * @return  mixed
     * @param   string  Name of File in Cache
     * @param   int     Number of Seconds until File is considered old
     * @param   bool    Return an object from Cache?
     * @access  public
     */
    function GetFromCache( $FileName , $Seconds = 0, $IsObject = false) {

        $this->_BuildFileName($FileName);

        $return = false;
        if ($Seconds == 0) {
            if (file_exists($this->_CacheFilePath)) {
                $return = $this->_ReadFromCache();
            }
            else {
                return false;
            }
        }
        else {
            $refresh_time = time() - (int) $Seconds;
            if (file_exists($this->_CacheFilePath)
                && filemtime($this->_CacheFilePath) > $refresh_time) {
                $return = $this->_ReadFromCache();
            }
            else {
                /** Cached data missing or no longer valid, remove it */
                $this->RemoveFromCache($FileName);
                return false;
            }
        }
        if ($IsObject) {
            $return = unserialize($return);
        }
        return $return;
    }

The GetFromCache() function should be fairly easy to follow. The meat of the function lies in either checking that the file exists (if there is no time limit on the cached entry) with the PHP file_exists() function or checking the modification time of the file using the filemtime() function otherwise. The $IsObject flag argument simply runs an unserialize() on the data coming back from the internal _ReadFromCache() method, which we will look at next:

    /**
     * Reads the local file from the cache directory
     *
     * @return  mixed
     * @access  private
     */
    function _ReadFromCache() {
        $mq_setting = get_magic_quotes_runtime();
        set_magic_quotes_runtime(0);
        /** Read the cache file whose path was built by _BuildFileName() */
        if (!$return_data = @ file_get_contents($this->_CacheFilePath)) {
            trigger_error(get_class($this) .
                    '::_ReadFromCache(): Unable to read file contents'
                    , E_USER_ERROR);
        }
        set_magic_quotes_runtime($mq_setting);
        return $return_data;
    }

This function handles reading the cached data from a cache file. The magic quotes runtime setting (perhaps the most annoying PHP feature ever) is turned off before reading the file, to prevent automatic escaping of certain characters in the data, and then restored to its original setting immediately afterwards.
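If you ever adapt this class to a PHP release where the magic quotes functions have been deprecated or removed, those calls need to be guarded. Here is one hedged way to do it; this is only a sketch, and $cache_file_path is a stand-in for however you build the path to the cache file:

<?php
/* Only touch the magic quotes runtime setting on PHP versions that still ship it */
$has_mq = function_exists('get_magic_quotes_runtime')
       && function_exists('set_magic_quotes_runtime');

if ($has_mq) {
    $mq_setting = @get_magic_quotes_runtime();
    @set_magic_quotes_runtime(0);
}

$return_data = @file_get_contents($cache_file_path); /* $cache_file_path: path to the cache file */

if ($has_mq) {
    @set_magic_quotes_runtime($mq_setting);
}
?>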

So, reading a cache file is fairly simple. Let’s take a look at the write mechanism of the CacheEngine class. This code is adapted from a technique called file swapping, which George Schlossnagle details in his excellent read, “Advanced PHP Programming” (Developer’s Library, 2004). The technique facilitates lock-less writing of a cache file while allowing simultaneous read requests on the cache file. Let’s take a look:

    /**
     * Writes data to the cache
     *
     * @return  mixed
     * @param   string  File Name (may be encoded)
     * @param   mixed   Data to write
     * @access  public
     */
    function WriteToCache( $FileName, $Data ) {
        if (is_array($Data) || is_object($Data)) {
            $Data = serialize($Data);
        }
        $this->_BuildFileName($FileName);
        /**
         * Use a file swap technique to avoid need
         * for file locks
         */
        if (!$file = fopen($this->_CacheFilePath . getmypid(), "wb")) {
            trigger_error(get_class($this) .
                    '::WriteToCache(): Could not open file for writing.'
                    , E_USER_ERROR);
            return false;
        }
        $len_data = strlen($Data);
        fwrite($file, $Data, $len_data);
        fclose($file);
        /** Handle file swap */
        rename($this->_CacheFilePath . getmypid(), $this->_CacheFilePath);
        return true;
    }

The code above opens a temporary file for writing. Notice that the actual cache file name is appended with the getmypid() function output, which essentially makes the filename a temporary, unique filename. Assuming the fopen() call was able to open the file for writing, the data is written to the file and the file is closed. Finally, the rename() function is used to change the temporary filename to the actual cache filename. Because the rename() function simply acts on the file inode (a structure which stores information about the file, not the file contents itself), the rename() operation is a) very quick, and b) does not block other processes reading an existing cache file under that name.
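To see the file swap idea in isolation, here is a hedged, standalone sketch of the same pattern; the helper name and the paths are invented for the example:

<?php
/* Write to a process-unique temporary file, then atomically rename() it over
   the real cache file. Readers holding the old file keep reading it untouched. */
function swap_write($cache_path, $data) {
    $tmp_path = $cache_path . '.' . getmypid();

    if (file_put_contents($tmp_path, $data) === false) {
        return false;                       /* could not write the temp file */
    }
    return rename($tmp_path, $cache_path);  /* atomic on the same filesystem */
}

swap_write('/tmp/forge_cache/project_members-42', serialize(array('alice', 'bob')));
?>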

Cache Invalidation

OK, so our CacheEngine class now has most of the functionality needed to effectively cache data from the database. However, we still need a method of removing old cache data files. Hence, the very simple RemoveFromCache() method:

    /**
     * Removes a cache file from the cache directory
     *
     * @return  mixed
     * @param   string  File Name to remove (will be encoded)
     * @access  public
     */
    function RemoveFromCache( $file_name ) {
        $this->_BuildFileName($file_name);
        if (!file_exists($this->_CacheFilePath)) {
            return true;
        }
        else {
            if (!unlink($this->_CacheFilePath)) {
                trigger_error(get_class($this) .
                        '::RemoveFromCache(): Unable to remove cache file'
                        , E_USER_ERROR);
                return false;
            }
            else {
                clearstatcache();
                return true;
            }
        }
    }

All the above function does is remove the cached file if it exists. So, when would we call this function? Well, if you were paying attention earlier, you would have already seen one occasion: in the GetFromCache() method, RemoveFromCache() is called when the modification time of the file exceeds the expiration time supplied to that function. Additionally, RemoveFromCache() is called when we want to remove a cache file manually, for instance when the list of project members changes, as this snippet from the /lib/class/ProjectMember.php class illustrates. The snippet comes from the ApproveMembershipRequest() method of that class:

...
            /* Remove from review queue */
            $sql = "UPDATE ... "
            if ($GLOBALS['Db']->Execute($sql)) {
                $GLOBALS['Db']->Commit();
                $GLOBALS['Cache']->RemoveFromCache('project_members-' . $Project);
                return true;
            }
...

As you can see, when an approval is made, the database is updated and the cache for the project members is removed, so that upon the next request for this project’s members, the cache file will be regenerated, ensuring valid, up-to-date data.

Summary

So, this article has been an exploration of some simple steps that you can take to increase the scalability of your web applications through lazy loading and content caching techniques. Below are the full class files for the CacheEngine class and the SqlConnection class used in the MySQL Forge application. Feel free to use them as you wish. Additionally, a patch to the WordPress source code to enable lazy loading is included below.

As always, comments and suggestions for this article are more than encouraged and appreciated.

Managing Many to Many Relationships in MySQL – Part 1

Flexible, Scalable Key Mapping Solutions

In working to answer questions on the MySQL forums, I’ve noticed a few questions that repeatedly come up in a number of the forum areas. One of these deals with how to manage — construct, query, and maintain — many-to-many relationships in your schema. I decided to put together a two-part article series detailing some of the common dilemmas which inevitably arise when relating two entities where one entity can be related to many instances of another, and vice versa.

Hopefully, this article will shed some light on how to structure your schema effectively to produce fast, efficient queries, and also will illustrate how key map tables can be queried for a variety of different purposes. I’ll predominantly be using standard SQL, so although I’m using MySQL as the database of choice here, the code you see is for the most part not limited to running on just MySQL. In this first part, we’ll review the concepts involved in many-to-many relationships, and discuss the most common methods for use in storing the data composing the relationship. In the second part of the article, which I should complete in about another week, we’ll look at some solid performance numbers regarding the various approaches. Also, I’ll show you how to use MySQL 5 stored procedures and views in order to most effectively manage many-to-many relationships.

A Review of Relational Concepts

For those of you unaware of what a many-to-many relationship is, let’s first briefly discuss some basic definitions I’ll be using in the article. First, an entity, in the database world, is simply a singular object or concept. An entity, just like a real-world object, may have one or more attributes which describe different aspects of the entity. A table in a properly normalized database contains records which pertain to a single entity. These records represent instances of the entity.

To illustrate, let’s assume we’re building a website that specializes in used-car sales. We need to design a schema which will handle the searching and storage of a variety of different auto makes and models, and allow customers to filter results based on a set of features they require in the automobile.

figure A – Auto Entity

The primary entity in our schema could be considered the Auto entity. It might have some attributes such as a manufacturer, model, model year, and so forth. Figure A shows a depiction of this simple Auto entity, with the primary key attribute (auto_id) in bold and above all other descriptive attributes like manufacturer, model, etc.

One to Many Relationships

Entities in the database can relate to each other in a number of ways. The most common type of relationship is called a one-to-many relationship. Here, one instance of an entity can relate, or be attached to, many instances of another entity. In our sample schema, the Auto entity can have only one auto manufacturer. However, an auto manufacturer can produce many automobiles. Therefore, we say that the relationship from Auto Manufacturer to Auto is a one-to-many relationship. Figure B depicts a one-to-many relationship between the Auto Manufacturer entity and the Auto entity. In the figure, the line between the two entities represents the relationship. This is a common way to represent relationships, with the “one” side of the relationship having a single line and the “many” side of the relationship having a set of three lines and a circle.

figure B – One to Many Relationship

Relationships are implemented via common attributes in each Entity called key attributes. In a one-to-many relationship, these key attributes take the form of a parent key and a child, or foreign, key. The relationship is maintained by “connecting” the two attributes together via these key attributes. In the case of our Auto Manufacturer to Auto relationship, the key attributes are AutoManufacturer.auto_manufacturer_id and Auto.auto_manufacturer_id. If we wanted to list the Auto’s manufacturer name (not Manufacturer ID), we would use an INNER JOIN from the Auto table to the AutoManufacturer table along this key relationship, as shown below.

SELECT am.name
FROM Auto a
INNER JOIN AutoManufacturer am
ON a.auto_manufacturer_id = am.auto_manufacturer_id
WHERE a.auto_id = 12345;

This simple example shows how a one-to-many relationship is implemented using a simple INNER JOIN to find the intersection of two tables where the key attributes AutoManufacturer.auto_manufacturer_id and Auto.auto_manufacturer_id contain matching entries.
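For reference, a minimal sketch of the two tables behind that join might look something like the DDL below, trimmed down to just the key attributes and the manufacturer name; the index on the foreign key column is my own addition:

CREATE TABLE AutoManufacturer (
  auto_manufacturer_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT
  , name VARCHAR(80) NOT NULL
  , PRIMARY KEY pk_AutoManufacturer (auto_manufacturer_id)
);

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , PRIMARY KEY pk_Auto (auto_id)
  , KEY ix_AutoManufacturer (auto_manufacturer_id)
);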

Many to Many Relationships

A many to many relationship is realized between two entities when either entity may be associated with more than one instance of the other entity. For example, imagine the relationship between an Auto (as in car) entity and an AutoFeature entity representing any of the myriad options a car may come with.

figure C – Many to Many Relationship

In this case, we know that any particular automobile can have many features. Likewise we know that a specific automobile feature, say power windows, may be in any number of automobiles. There’s no way for us to visually represent the association between the two entities without using a third entity, which stores the mapping of the relationship between the Auto entity and the AutoFeature entity. In this case, I use the Auto2AutoFeature entity. For mapping tables, I tend to use this Something2Something naming scheme to clearly show that it is a table which primarily serves to map the relationship from one thing to another, but of course that is merely a stylistic convention, nothing more.

Schema Representations for Many to Many Relationships

There are a few common ways for representing many-to-many relationships within structured SQL data, all of which will be detailed below:

  1. Using multiple fields of an on/off data type to store many “keys” in a single table
  2. Using the INT, BIGINT, or SET data type (or equivalent in other RDBMS) to store a fixed number of key flags in a single table field
  3. Using a CHAR string having one byte of storage per key needed, with the string acting as one long imploded array in a single table field
  4. Using a relationship, or mapping, table, like the one in figure C, to store one or more keys related to an entity

All of these methods have distinct advantages and disadvantages, in both ease of use and performance. We’ll look at each one here, along with some sample code to show how common queries are performed against each storage schema.

The Multiple Field Method

In this method of defining a many-to-many relationship, the concepts of normalization are thrown away in favor of what some consider to be a simpler and more rational schema. Multiple fields, each representing an on/off value for a foreign key, are used in a single table in order to achieve the desired results. Any data type representing an on/off value may be used for the key fields — a CHAR(1) with 'T'/'F' or 'Y'/'N' values, a TINYINT UNSIGNED with 0 and 1 values, an ENUM('Y','N'), and so on. Below, you’ll see what a sample Auto table using this method might look like, with CHAR(1) data types used for the auto option key fields:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , has_air_conditioning CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_windows CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_steering CHAR(1) NOT NULL DEFAULT 'N'
  , has_moonroof CHAR(1) NOT NULL DEFAULT 'N'
  , has_disk_brakes CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_seats CHAR(1) NOT NULL DEFAULT 'N'
  , has_leather CHAR(1) NOT NULL DEFAULT 'N'
  , PRIMARY KEY pk_Auto (auto_id)
);

There are a couple advantages to this type of approach:

  1. Given a simple SELECT * FROM the table, it is fairly easy to understand (immediately) what options the car has.
  2. Only one table is involved in determining options for the car.

This last point is often the reason why many novice developers choose to use this approach; it’s easier and more straightforward to filter a resultset based on a single option:

SELECT *
FROM Auto a
WHERE a.has_leather = 'Y';

or a more complicated request:

SELECT *
FROM Auto a
WHERE (a.has_leather = 'Y' OR a.has_power_windows = 'Y')
AND a.has_power_steering = 'Y';

The code is easy to understand for most anyone looking at it, and there’s no special syntax or SQL tricks needed in order to query on an OR condition or multiple various filters. Unfortunately, this approach has a number of downsides. Most important among them are:

  1. If you need to add an auto feature, you’ve got to ALTER TABLE Auto ADD COLUMN …

    This disadvantage is often overlooked in the early design phases by novice programmers or design teams, because everyone assumes they already know every option the customer might want. However, this is rarely the reality. Customers change their minds and will inevitably ask to add or remove Auto Features from the list. With this method, that requires a fundamental change to the schema: adding or removing columns (an example of the required statement is sketched just after this list). When doing so, especially on large tables, lock contention can easily pile up as read requests wait behind the exclusive write lock held while the table is rebuilt to the new schema. (That’s a bad thing.)

  2. There isn’t any useful place for an index on any of these data fields.

    “Wait a minute!” you say, “You can place an index on any of these fields!” Sure, that is correct. Of course you can place an index on any of these fields. In fact, you can place an index on all of them if you really wanted to. But none of those indexes is likely to be used effectively by MySQL. Why? Well, that’s simple. In the schema above, there are only two values possible for each field — a 'Y' or 'N' (or 0 and 1, 'T' and 'F', etc). Let’s assume a table of 500K auto records. If the only two values in the field are 'Y' and 'N', what use would an index be? For a common auto feature, say has_leather, probably half of the records would contain a 'Y' and half an 'N'. What use would this be to an index? None. In fact, an index would slow MySQL down, as so many index records would have to be read, with a lookup operation to the data record from each index record. The selectivity, or distribution, of key values is extremely low (see my previous article for more explanation on this) and therefore the index has limited use to MySQL.

  3. There is very little flexibility offered by this method.

    Imagine if you were to give the customer the ability to add and remove auto features at will. You would have to GRANT the customer (or the web-based or application user) the ability to ALTER TABLE. This isn’t generally considered to be the most secure or effective method of managing change…

    Also, let’s say you came up with a list of 300 auto features. Are you going to add 300 fields to the table? Ever tried doing a SELECT * from a table with 300 fields? Even if you use the \G option in the mysql client, you’d still have a mess!
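Here is the kind of statement the first drawback implies. The new column name is invented for the example, and on a large, busy table the rebuild this statement triggers is exactly the lock problem described in point 1 above:

ALTER TABLE Auto
  ADD COLUMN has_gps_navigation CHAR(1) NOT NULL DEFAULT 'N';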

So, for all the reasons outlined above, I recommend against this approach for all but the simplest, non-production environments. It isn’t flexible enough to withstand change, and its performance limitations far outweigh any ease-of-use advantage.

Wait a Minute!

“But wait!” you say, “MySQL itself uses this strategy for its own mysql database! The mysql.user table has fields like “Select_priv”, “Insert_priv”, and “Delete_priv”. Don’t you think MySQL knows what they’re doing!?”

Yes, of course MySQL knows what they’re doing. But, you have to remember one thing about the mysql system schema tables. They’re always in memory. There is no performance degradation related to the mysql.user table’s multiple field definitions for privileges because upon startup, the MySQL server actually loads all user information (privileges) into a hash table of structs containing privilege information. When a request is received to perform some operation, this hash table of user privileges is checked using very fast logical OR and AND operations. So, there aren’t any performance issues associated with the mysql tables. This is, by the way, why you must issue a FLUSH PRIVILEGES when manually changing any of the mysql tables. The FLUSH PRIVILEGES reloads this in-memory hash table.
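For instance, if you ever edit the grant tables by hand instead of using GRANT/REVOKE (the user name below is made up for the example), the change only takes effect once that in-memory hash table is rebuilt:

UPDATE mysql.user
   SET Select_priv = 'Y'
 WHERE User = 'report_user' AND Host = 'localhost';

FLUSH PRIVILEGES;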

So, for the mysql schema, the tables are designed for ease of use and simplicity, so this method was used to represent privileges. Do I favor it? Not particularly, but I can see why it was done… :)

The INT, BIGINT, or SET Bitmap Method

With this next method, the INT, BIGINT, or MySQL SET data type is used to store from zero to 64 separate key flags within a single field. Again, with this method, we denormalize the schema by reducing this many-to-many relationship down into a single field in a single table. Below, you’ll see an example of our Auto table converted to use the SET data type instead of the 7 distinct CHAR(1) fields from Method #1:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , auto_options SET('Air Conditioning'
    ,   'Power Windows'
    ,   'Power Steering'
    ,   'Moonroof'
    ,   'Disk Brakes'
    ,   'Power Seats'
    ,   'Leather') NOT NULL
  , PRIMARY KEY pk_Auto (auto_id)
);

Although SET (and ENUM, SET’s single-valued cousin) are listed in the MySQL manual as string column types, they are internally represented as integers. Below, when we cover bitwise operations, you’ll see how you can use the SET data type the same way as you would an INT or BIGINT, though there are advantages to simply sticking with SET for simplicity’s sake. As with the multiple-field method, there are advantages to the SET method. They include:

  1. You can store a large number of key values (up to 64 elements) in a small storage unit.

    For the SET data type, 1 byte of storage is used for up to 8 elements, 2 bytes for up to 16 elements, 3 bytes for up to 24 elements, 4 bytes for up to 32 elements, and 8 bytes for 33 to 64 possible elements. Compared with method #1 (and method #4 below), this does save storage space in the table.

  2. MySQL automatically handles showing the descriptive value of the SET column value

    With the INT and BIGINT data types, MySQL will simply show the numeric representation of the field value. With the SET data type, by default, MySQL shows the descriptive string value of the column instead of its numeric value (more below). This can be a handy feature.

  3. MySQL provides the handy FIND_IN_SET() function to quickly filter for a single key value.

    The FIND_IN_SET() function can be used to query for rows in which a specific key value is turned on for the particular row. If you want to check whether more than one key value is turned on, you can use an AND expression in the WHERE clause. For instance, let’s say we wanted to find all records in which the automobile had both the leather and power windows options. We could use the following:

    SELECT *
    FROM Auto a
    WHERE FIND_IN_SET('Leather', auto_options)>0
    AND FIND_IN_SET('Power Windows', auto_options)>0;

    Although there are some performance issues with SET fields (see below, on lack of indexing), the FIND_IN_SET() function is highly optimized to work specifically with the integer data housed beneath the surface. Bitwise operations are used to determine whether the row matches the specific key flag queried for. These bitwise operations are generally very fast, and we’ll cover them below.

Besides FIND_IN_SET(), any bitwise operator that you would normally use on integer data can be used on SET, INT and BIGINT columns.

Bitwise operations are performed on the actual binary representation of the data. Bits (each representing a key) are turned on or off by setting the corresponding bit in the binary number to 1. The table below shows the binary and decimal representations of the first byte (8 bits) of an integer bitmap. As you can see, each “on” bit simply adds the appropriate power of 2.

Binary      Decimal
0000 0001         1
0000 0010         2
0000 0100         4
0000 1000         8
0001 0000        16
0010 0000        32
0100 0000        64
1000 0000       128

Clearly, the larger the storage unit, the more bits can be used to represent the key flags. A BIGINT can store up to 64 unique values, an INT 32 unique values, and so on. One advantage to using the SET type is that it automatically chooses the smallest storage type needed to store the required number of set elements.

To have more than one key flag turned on within the bitmap field, the flag bits are added together. Therefore, if the third (2^2 = 4) and fourth (2^3 = 8) flags (0000 0100 and 0000 1000) are turned on, the bitmap field would contain 0000 1100, or the number 12 in decimal. If you ever want to see the decimal or binary representation of a SET column, you can use +0 and the BIN() function, as shown below:

SELECT
  auto_id
, auto_options
, auto_options+0 AS 'dec'
, LPAD(BIN(auto_options+0),8,'0') AS 'bin'
FROM Auto a;
+---------+-----------------------------------------------+-----+----------+
| auto_id | auto_options                                  | dec | bin      |
+---------+-----------------------------------------------+-----+----------+
|       1 | Power Windows,Power Steering,Moonroof         |  14 | 00001110 |
|       2 | Power Windows,Power Steering,Moonroof,Leather |  78 | 01001110 |
|       3 | Power Windows,Moonroof,Leather                |  74 | 01001010 |
|       4 | Power Windows,Moonroof                        |  10 | 00001010 |
+---------+-----------------------------------------------+-----+----------+
4 rows in set (0.00 sec)

Use the LPAD() function to pretty up the output to a pre-determined width, with leading zeroes.

This means that you can do more complex querying using the numeric values of the SET elements (key values). For instance, suppose we wanted to find all those records which did not have Leather or Power Steering, but did have Power Windows. From the output above, we can easily see that the auto with ID#4 is the only record which will meet our criteria.

But, how do we structure our WHERE expression in SQL? Looking back at our table schema, we know that the numeric position (starting from the number 1) of the key values in our WHERE clause will be:

Key Value        Numeric Position in SET
Leather          7
Power Steering   3
Power Windows    2

Using the MySQL bitwise functions, we can issue our query like so:

SELECT auto_id, auto_options
FROM Auto a
WHERE auto_options & (POW(2,6)+POW(2,2)) = 0
AND auto_options & POW(2,1) > 0;
+---------+------------------------+
| auto_id | auto_options           |
+---------+------------------------+
|       4 | Power Windows,Moonroof |
+---------+------------------------+
1 row in set (0.00 sec)

In the code above, the POW() function is used to get the correct bit set for each desired element in the query. We subtract 1 from the element’s position because the “on” bits are counted from the right-hand side of the byte and each represents a power of 2, starting at 2^0. For the first part of the WHERE expression:

WHERE auto_options & (POW(2,6)+POW(2,2)) = 0

we ensure that the result of the bitwise & operator (which returns a 1 in each bit position where both operands have that bit turned on) is 0. This ensures that only rows which have neither the Power Steering nor the Leather option are returned. The second part of the WHERE expression applies the bitwise & operator against the Power Windows bit and filters for a result greater than zero, so that the returned rows are known to contain the Power Windows option.
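The same bitwise approach also handles “any of these options” filters. As a quick sketch, here is how you might find autos that have either Leather (position 7) or Moonroof (position 4); any row with at least one of those bits set produces a non-zero result from the & operation:

SELECT auto_id, auto_options
FROM Auto a
WHERE auto_options & (POW(2,6) + POW(2,3)) > 0;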

Besides violating normalization rules (and I won’t get into an idealistic debate about that here…there’s more than enough discussion online about that), there are some concrete reasons why not to use the SET data type for handling many-to-many relationships. Although I began writing this article quite a while before Sheeri Kritzer posted about the ENUM/SET type, some points are worth repeating. Foremost among these are the following:

  1. Using the SET data type for many-to-many relationships imposes a 64-element limit on the number of keys available to relate to the main entity.
  2. Using the SET data type with the built-in SET functions or any of the bitwise operators prohibits the MySQL optimizer from using indexes on the column.

    This is a major performance drawback for scalability, and the primary reason I choose not to use this data type for anything but the smallest projects or for fields containing 3 or fewer elements. Why 3 or fewer? Well, because even if an index could be used against the SET field, the chances of the index selectivity being high enough to filter an adequate number of rows to make the index useful to the optimizer are already very small.

  3. Again, working with the SET data type is not particularly flexible.

    Changing the elements of a SET data type once data is already in the table can be a real pain in the behind! I won’t go into all the details, as the MySQL site covers many of the main points, and Beat Vontobel’s blog post on SETs covers what the MySQL site doesn’t (nice work, Beat!). A quick sketch of what’s involved follows below.
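To give just a flavor of that: the only way to touch the member list is to redefine the whole column, restating every existing element in order. Appending a member at the end is the straightforward case; anything more invasive (removing or reordering members) is where the trouble described in those posts begins. A hedged sketch, with the new member name invented for the example:

ALTER TABLE Auto
  MODIFY auto_options SET('Air Conditioning'
    ,   'Power Windows'
    ,   'Power Steering'
    ,   'Moonroof'
    ,   'Disk Brakes'
    ,   'Power Seats'
    ,   'Leather'
    ,   'GPS Navigation') NOT NULL;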

The Long String Method

A third common approach to many-to-many relationships is to use a single long string field to store essentially a concatenated version of Method #1’s multiple fields. An example of this method would be the schema below:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , auto_options CHAR(7) NOT NULL DEFAULT 'NNNNNNN'
  , PRIMARY KEY pk_Auto (auto_id)
);

There are few advantages to this approach, either performance-wise or maintenance-wise, but it can come in handy in at least one particular circumstance: when application code either relies on, or is made much simpler by, the use of imploded arrays.

Sometimes, particularly if you’ve inherited legacy code which is simply too much of a nuisance to change, it can make sense to work with what you’ve got. If the application code relies heavily on returned field values being concatenated strings, this method might work well. The application gets back a long list of Y or N values representing whether each key is on or off. For instance, if we had an automobile with the Leather and Power Windows options, and we used the same feature order as the SET example above, the returned auto_options field would be: “NYNNNNY”.

Perhaps the application was a PHP program which walked the string using something like the following:

<?php
/* We get the $auto_options string from the database... */
$auto_options = 'NYNNNNY';  /* example value: Power Windows + Leather */

$all_options = array(
'Air Conditioning'
, 'Power Windows'
, 'Power Steering'
, 'Moonroof'
, 'Disk Brakes'
, 'Power Seats'
, 'Leather');
 
$num_options = strlen($auto_options);
for ($i = 0; $i < $num_options; ++$i) {
  if ($auto_options[$i] == 'Y') {
    printf("The car has %s\n", $all_options[$i]);
  }
}
?>

In this case, the PHP program has the entire key table in memory (the $all_options array). All the program needs to do is slice up the long string; the position of the character corresponds to the auto feature in the $all_options array. If there was little complex querying in the application, yet there were hundreds or thousands of these key values for each row, this might be a decent method.

Except for this scenario, however, this method is generally not desirable, for the following reasons:

  1. Too little flexibility

    Again, the flexibility problem arises. In this method, what happens if you want to remove an Auto Feature from the list? Doing so in the PHP code would be fairly simple; just remove the element from the $all_options array. But, unfortunately, once you did that, the offsets into the character string would be skewed. Likewise, if you wanted to add an Auto Feature, you’d have to change the table schema — not a very flexible plan. Making matters worse, you could only add an Auto Feature to the end of the character string. Adding one in the middle (say, if you wanted to keep some sort of alphabetical ordering in the PHP code) would again skew the offsets.

  2. Performance degrades dramatically because indexes can rarely be used

    This disadvantage becomes even more noticeable as queries grow more complex. Sure, an index might be used if you were querying on the first (or the first several, in order) Auto Features. For instance, assuming an index on the auto_options field, the following query would indeed use an index:

    SELECT *
     FROM Auto a
    WHERE a.auto_options LIKE 'Y%';

    But, unfortunately, that’s about the limit of the index’s usefulness with this method. In the example above, we could use an index to find all records having Air Conditioning. But what happens when we want to, say, find all automobiles which have Power Seats and Power Windows (the sixth and second keys)? Now we’re looking at the following SQL statement:

    SELECT *
    FROM Auto a
    WHERE SUBSTRING(a.auto_options, 6,1) = 'Y'
    AND SUBSTRING(a.auto_options, 2,1) = 'Y';

    Besides being a mess, this code will never be able to use an index, because of the SUBSTRING() function on the left side of the WHERE expressions. And, because indexes won’t be used, the performance of this schema will not scale well at all.

The Mapping Table Method

Alright, the final (and my preferred) method for managing many-to-many relationships is to use a key mapping table (sometimes called a relationship table). In the beginning of the article, in figure C, you saw an E-R diagram showing a key mapping table relating the AutoFeature entity to the Auto entity. Note that this method is the only truly normalized method of managing many-to-many relationships. Up until now in the article, the schema organization has been in a single table. Now, through the use of three distinct tables, we are able to normalize the schema into the following DDL:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , PRIMARY KEY pk_Auto (auto_id)
);
 
CREATE TABLE AutoFeature (
  auto_feature_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT
  , feature_name VARCHAR(80) NOT NULL
  , PRIMARY KEY pk_AutoFeature (auto_feature_id)
);
 
CREATE TABLE Auto2AutoFeature (
  auto_id INT UNSIGNED NOT NULL
  , auto_feature_id SMALLINT UNSIGNED NOT NULL
  , PRIMARY KEY pk_Auto2AutoFeature (auto_id, auto_feature_id)
);
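Populating the relationship is then just a matter of one row per (auto, feature) pair in the mapping table. A quick example, with ID values chosen purely for illustration:

INSERT INTO Auto2AutoFeature (auto_id, auto_feature_id)
VALUES (7, 1), (7, 5);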

Before we discuss the advantages to this method, it’s worth pointing out a couple drawbacks to storing many-to-many relationships in this normalized fashion.

  1. Generally, the key mapping table method will use the most overall storage of any of these described approaches.

    As with all matters in the database design world, everything comes with a tradeoff. The key mapping table is no exception. The biggest tradeoff by far is the storage space needed to store the many-to-many relationship. Instead of a single field, or multiple small fields, in a single table, we now must store the foreign keys of each entity multiple times, with each unique combination occupying a single row in the relationship table.

    Additionally, because indexes can be effectively used against the key mapping table, we now need space to store the index records as well as the data records. As with any index, performance of INSERT and UPDATE operations (especially on high-volume mixed OLTP/OLAP applications) can suffer. However, any performance impact on INSERT and UPDATE operations is usually far outweighed by the performance benefits for SELECT operations. As always, however, benchmarking and testing is always a good idea; not only in the beginning of the project, but also at timed intervals as the database grows and matures.

  2. There are now one or two more tables to maintain for the schema

    Whilst having an extra table or two will provide us with the most flexibility, that flexibility comes at the cost of extra tables which must be maintained by both the application and the database administrator. For just a few seldom-changing keys, the hassle of maintaining extra tables and relationships may not be worth the added flexibility.

Now, on to the benefits of this approach, along with some examples of how to retrieve result sets using the key mapping table.

  1. Robust Indexing possibilities are now supported.

    Because both sides of the many-to-many relationship are separate fields in distinct rows of the key mapping table, our queries can support various indexing strategies. It’s best to see this in action, so I’ll demonstrate with two simple query examples.

    Let’s assume that we wish to find all the options available, in a list, for a specific automobile. This is a fairly simple query, but a good starter:

    SELECT af.feature_name
    FROM AutoFeature af
    INNER JOIN Auto2AutoFeature a2af
    ON af.auto_feature_id = a2af.auto_feature_id
    WHERE a2af.auto_id = 7;
    +------------------+
    | feature_name     |
    +------------------+
    | Air Conditioning |
    | Disk Brakes      |
    +------------------+
    2 rows in set (0.00 sec)
    

    Pretty simple query. Just an INNER JOIN from the AutoFeature table into our key mapping table on the auto_feature_id column, then a WHERE condition specifying the needed vehicle’s ID. This kind of output isn’t possible with the previous 3 methods without a lot of headache, but let’s face it: this is generally the format in which an application needs data, correct? A list, or an array. Now, with a simple query like this, that kind of output is easy.
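    If you want that list rolled up into one row per automobile instead of one row per feature, MySQL’s GROUP_CONCAT() aggregate pairs naturally with the mapping table. A hedged sketch:

    SELECT a2af.auto_id
      , GROUP_CONCAT(af.feature_name ORDER BY af.feature_name SEPARATOR ', ') AS features
    FROM AutoFeature af
    INNER JOIN Auto2AutoFeature a2af
    ON af.auto_feature_id = a2af.auto_feature_id
    GROUP BY a2af.auto_id;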

    But, are our indexes in operation for the above query? Let’s find out:

    mysql> EXPLAIN SELECT af.feature_name
        -> FROM AutoFeature af
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON af.auto_feature_id = a2af.auto_feature_id
        -> WHERE a2af.auto_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: const
             rows: 2
            Extra: Using index
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: af
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 2
              ref: test.a2af.auto_feature_id
             rows: 1
            Extra:
    2 rows in set (0.00 sec)
    

    Indeed, they are. In the EXPLAIN output above, you’ll see that the optimizer is using a constant (ref: const) on the PRIMARY KEY to filter records. In the “Extra” column, you’ll note that MySQL helpfully tells us that it’s “Using index”. Many novice (or even intermediate) database developers new to MySQL will assume that if they see “Using index” in the Extra column, they have an optimal query plan in place for the SQL code. This is actually a bit of a misconception. When you see “Using index” in the Extra column, it means that MySQL is able to use the index records (as opposed to the data records which the index is attached to) in order to complete the query. In other words, MySQL doesn’t even have to go to the data records; everything it needs for the query is covered by the index records. This situation is called a covering index, and is something that goes hand-in-hand with key mapping tables. Why? Because, frankly, the entire table IS the index!

    For this reason, and others, which we’ll get to below, key mapping tables, when properly structured, are ideal for joining operations. In EXPLAIN plans for queries which use the key mapping table, you will often see the use of a covering index, because all of the joined columns are available in the index records. Practically speaking, this means that for MyISAM tables, the key_buffer will contain all the information that the query needs already in RAM; there is no need to access the data in the .MYD file, saving disk accesses and making the query performance lightning fast. For InnoDB tables, the access is just as quick. The PRIMARY KEY will be housed in the InnoDB data page buffer pool (since it is a clustered index and is the actual data records). So, likewise, queries will be lightning fast as the records are in memory…

    So, what about other types of queries; do they also benefit from the PRIMARY KEY index? Let’s find out. Here’s another fairly simple query which attempts to find all automobiles having the Leather option:

    SELECT a.*
    FROM Auto a
    INNER JOIN Auto2AutoFeature a2af
    ON a.auto_id = a2af.auto_id
    WHERE a2af.auto_feature_id = 7;
    +---------+----------------------+------------+------------+--------------+
    | auto_id | auto_manufacturer_id | auto_model | model_year | asking_price |
    +---------+----------------------+------------+------------+--------------+
    |       3 |                    1 | 1          |       2003 |      5000.00 |
    |       4 |                    1 | 2          |       2005 |     38000.00 |
    |       5 |                    1 | 2          |       2004 |     31000.00 |
    |       6 |                    1 | 3          |       2003 |     10000.00 |
    |       8 |                    2 | 1          |       2004 |     12000.00 |
    |      10 |                    2 | 2          |       2005 |     32000.00 |
    |      11 |                    2 | 2          |       2005 |     42000.00 |
    +---------+----------------------+------------+------------+--------------+
    7 rows in set (0.00 sec)
    

    Let’s use an EXPLAIN to see if we’ve indeed got an ideal execution plan:

    mysql> EXPLAIN SELECT a.*
        -> FROM Auto a
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON a.auto_id = a2af.auto_id
        -> WHERE a2af.auto_feature_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a
             type: ALL
    possible_keys: PRIMARY
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 16
            Extra:
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 6
              ref: test.a.auto_id,const
             rows: 1
            Extra: Using index
    2 rows in set (0.00 sec)
    

    Uh oh! What happened? We’ve encountered the dreaded ALL access type! As you can see, MySQL chooses to do a table scan of the Auto table, and for each auto_id, do a lookup into the Auto2AutoFeature table along the auto_id key, filtering on the Auto2AutoFeature PRIMARY KEY’s auto_feature_id column (Look for “ref: test.a.auto_id,const”). But, you ask, why didn’t MySQL just find the auto_id records in Auto2AutoFeature that had the Leather option, and then join that smaller resultset to the Auto table?

    This is where the “robust indexing strategies” I spoke about earlier come into play, and where many novices get tripped up when dealing with key mapping tables.

    The reason that MySQL chose the access plan above is because the auto_feature_id column (on which we are supplying a constant filter value of 7) is on the right side of the PRIMARY KEY index. In order for MySQL to use an index effectively, it must be able to apply a constant or range filter value to the columns of an index, from LEFT to RIGHT. This is why, although you do see the term “Using index” in the Extra column of the second EXPLAIN, MySQL actually has chosen a sub-optimal plan. The solution? We add another index to the key mapping table which allows the reverse join direction to be followed.

    CREATE UNIQUE INDEX ix_ReversePK ON Auto2AutoFeature (auto_feature_id, auto_id);

    Now, let’s take another stab at the same EXPLAIN from above:

    mysql> EXPLAIN SELECT a.*
        -> FROM Auto a
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON a.auto_id = a2af.auto_id
        -> WHERE a2af.auto_feature_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: ref
    possible_keys: PRIMARY,ix_ReversePK
              key: ix_ReversePK
          key_len: 2
              ref: const
             rows: 7
            Extra: Using index
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: a
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: test.a2af.auto_id
             rows: 1
            Extra:
    2 rows in set (0.00 sec)
    

    See the difference?! Now, the new reverse direction index we just added is first used to filter the Auto2AutoFeature records, and THEN the Auto table is joined. This is the optimal query plan for this type of query.

    So, to sum up this advantage: with key mapping tables, you’re given much more ability to tune and optimize your queries, but at the same time you’ve got to know what you’re looking for and be willing to put in the time to analyze your queries. But, hey! If you’ve read all the way through this article so far, I’m willing to bet you’ll do just that.

  2. Flexibility to add and remove elements easily, and without schema changes.

    One fatal flaw of the previously covered methods is their inability to deal easily with change. Let’s face it, change happens constantly in the business world. If you’re designing applications for use in this world, you had better make them able to deal with constant change. Otherwise, you will be on the phone 24/7 fulfilling request after request to make small changes to this type of data. Do yourself the favor, and give yourself back that time from the start.

    With key mapping tables, adding new elements to the many-to-many relationship is simple. You just add a record to the master table:

    INSERT INTO AutoFeature (feature_name) VALUES  ('GPS Navigation');

    All done. If you want to remove one, simply issue a DELETE with a join into the key mapping table. This will remove the parent and all child relationships:

    DELETE  AutoFeature, Auto2AutoFeature
    FROM AutoFeature
    INNER JOIN Auto2AutoFeature
    ON AutoFeature.auto_feature_id = Auto2AutoFeature.auto_feature_id
    WHERE AutoFeature.auto_feature_id = 4;
    Query OK, 5 rows affected (0.00 sec)
    

That wraps up this part’s coverage of the key mapping technique. It was just a brief overview. Next time, we’ll cover how to get the most out of your key mapping tables, including how to structure complex joins for various query filters against the many-to-many keys.

Conclusion

In this first part of the series, I’ve tried to thoroughly explain four common approaches to managing many-to-many relationships. Each approach has its distinct benefits and drawbacks, and I’ve tried to present those aspects clearly and fairly. Please use the comments section below to let me know if you think I skimmed over certain methods or aspects, or didn’t give enough weight to one or the other. Also, feel free to point out any methods I may have left out completely; I’m always looking to round out my material in a balanced and thoughtful way, and welcome any constructive criticism!

In the second part of the series, I’ll be examining ways we can use stored procedures, views, and the INFORMATION_SCHEMA in MySQL 5 to construct an easy management system for key mapping tables, and I’ll present some more advanced SQL that enables you to have full control in querying across these key mapping tables, allowing you to drill down accurately into the wealth of information stored there. We’ll also explore methods of increasing the performance of key mapping tables through the use of alternate indexes.

I’ll also be covering some benchmarks I’ll construct this coming week, which will show the relative performance and scalability of these methods, as well as the storage needs of each. Hopefully, that part of the article series will give some quantitative numbers for those looking for that kind of thing! ;) Till then, Cheers.

MySQL 5 Stored Procedures and INFORMATION_SCHEMA

A Handy One-Two Punch for Administrators

In writing Pro MySQL, I’ve become extremely excited about the new features debuting in MySQL 5—features that are already being thoroughly tested by the development community and that bring MySQL level with Oracle, SQL Server, and PostgreSQL on functionality long sought after by developers and administrators.

I figured, partly in response to Arjen Lentz’ call to action, I’d write about two of these functional areas in this article: stored procedures and the INFORMATION_SCHEMA virtual database. Both features are detailed thoroughly in Pro MySQL, Chapters 9 and 21, respectively, but I wanted to do a quick article combining these two features into a practical example of MySQL 5 functionality.

When I worked for RadioShack, a Microsoft shop, we used stored procedures and MS SQL Server’s INFORMATION_SCHEMA views all the time. Since making the move to MySQL, I’ve missed these two features sorely. Well, now the wait’s over.