Managing Many to Many Relationships in MySQL – Part 1

Flexible, Scalable Key Mapping Solutions

In working to answer questions on the MySQL forums, I’ve noticed
a few questions that repeatedly come up on a number of the forum areas. One of these particular questions deals with how to manage — construct, query, and maintain — many to many relationships in your schema. I decided to put together a two-part article series detailing some of the common dilemmas which inevitably arise when tackling the issue of relating two entities where one entity can be related to many instances of another, and vice versa.

Hopefully, this article will shed some light on how to structure your schema effectively to produce fast, efficient queries, and also will illustrate how key map tables can be queried for a variety of different purposes. I’ll predominantly be using standard SQL, so although I’m using MySQL as the database of choice here, the code you see is for the most part not limited to running on just MySQL. In this first part, we’ll review the concepts involved in many-to-many relationships, and discuss the most common methods for use in storing the data composing the relationship. In the second part of the article, which I should complete in about another week, we’ll look at some solid performance numbers regarding the various approaches. Also, I’ll show you how to use MySQL 5 stored procedures and views in order to most effectively manage many-to-many relationships.

A Review of Relational Concepts

For those of you unaware of what a many-to-many relationship is, let’s first briefly discuss some basic
definitions I’ll be using in the article. First, an entity, in the database world, is simply a singular
object or concept. An entity, just like a real world object, may have one or more attributes which describe different aspects of the entity. A table in a properly normalized database contains records which pertain to a single entity. These records represent instances of the entity.

To illustrate, let’s assume we’re building a website that specializes in used-car sales. We need to design a schema which will handle the searching and storage of a variety of different auto makes and models, and allow customers to filter results based on a set of features they require in the automobile.

figure A – Auto Entity

The primary entity in our schema could be considered the Auto entity. It might have some attributes such as a manufacturer, model, model year, and so forth. Figure A shows a depiction of this simple Auto entity, with the primary key attribute (auto_id) in bold and above all other descriptive attributes like manufacturer, model, etc.

One to Many Relationships

Entities in the database can relate to each other in a number of ways. The most common type of relationship is called a one-to-many relationship. Here, one instance of an entity can relate, or be attached to, many instances of another entity. In our sample schema, the Auto entity can have only one auto manufacturer. However, an auto manufacturer can produce many automobiles. Therefore, we say that the relationship from Auto Manufacturer to Auto is a one-to-many relationship. Figure B depicts a one-to-many relationship between the Auto Manufacturer entity and the Auto entity. In the figure, the line between the two entities represents the relationship. This is a common way to represent relationships, with the “one” side of the relationship having a single line and the “many” side of the relationship having a set of three lines and a circle.

figure B – One to Many Relationship

Relationships are implemented via common attributes in each Entity called key attributes. In a one-to-many relationship, these key attributes take the form of a parent key and a child, or foreign, key. The relationship is maintained by “connecting” the two attributes together via these key attributes. In the case of our Auto Manufacturer to Auto relationship, the key attributes are AutoManufacturer.auto_manufacturer_id and Auto.auto_manufacturer_id. If we wanted to list the Auto’s manufacturer name (not Manufacturer ID), we would use an INNER JOIN from the Auto table to the AutoManufacturer table along this key relationship, as shown below.

SELECT am.name
FROM Auto a
INNER JOIN AutoManufacturer am
ON a.auto_manufacturer_id = am.auto_manufacturer_id
WHERE a.auto_id = 12345;

This simple example shows how a one-to-many relationship is implemented using a simple INNER JOIN to find the intersection of two tables where the key attributes AutoManufacturer.auto_manufacturer_id and Auto.auto_manufacturer_id contain matching entries.

Many to Many Relationships

A many to many relationship is realized between two entities when either entity may be associated with more than one instance of the other entity. For example, imagine the relationship between an Auto (as in car) entity and an AutoFeature entity representing any of the myriad options a car may come with.

figure C – Many to Many Relationship

In this case, we know that any particular automobile can have many features. Likewise we know that a specific automobile feature, say power windows, may be in any number of automobiles. There’s no way for us to visually represent the association between the two entities without using a third entity, which stores the mapping of the relationship between the Auto entity and the AutoFeature entity. In this case, I use the Auto2AutoFeature entity. For mapping tables, I tend to use this Something2Something naming scheme to clearly show that it is a table which primarily serves to map the relationship from one thing to another, but of course that is merely a stylistic convention, nothing more.

Schema Representations for Many to Many Relationships

There are a few common ways for representing many-to-many relationships within structured SQL data, all of which will be detailed below:

  1. Using multiple fields of an on/off data type to store many “keys” in a single table
  2. Using the INT, BIGINT, or SET data type (or equivalent in other RDBMS) to store a fixed number of key flags in a single table field
  3. Using a CHAR string having one byte of storage per key needed, with the string acting as one long imploded array in a single table field
  4. Using a relationship, or mapping, table, like the one in figure C, to store one or more keys related to an entity

All of these methods has distinct advantages and disadvantages, in both ease of use and performance. We’ll look at each here, along with some sample code to show how common queries are performed across each storage schema.

The Multiple Field Method

In this method of defining a many-to-many relationship, the concepts of normalization are thrown away in favor of what some consider to be a simpler and more rational schema. Multiple fields, each representing an on/off value for a foreign key, are used in only a single table in order to achieve the results desired. Any data type representing an on/off value may be used to represent the key fields — CHAR(1) with ‘T’ or ‘F’, ‘Y’ or ‘N’, or a TINYINT UNSIGNED with 0 and 1 values, or an ENUM(‘Y’,’N’) etc. Below, you’ll see what a sample Auto table using this method might look like with CHAR(1) data types used for the auto option key fields:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , has_air_conditioning CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_windows CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_steering CHAR(1) NOT NULL DEFAULT 'N'
  , has_moonroof CHAR(1) NOT NULL DEFAULT 'N'
  , has_disk_brakes CHAR(1) NOT NULL DEFAULT 'N'
  , has_power_seats CHAR(1) NOT NULL DEFAULT 'N'
  , has_leather CHAR(1) NOT NULL DEFAULT 'N'
  , PRIMARY KEY pk_Auto (auto_id)
);

There are a couple advantages to this type of approach:

  1. Given a simple SELECT * FROM the table, it is fairly easy to understand (immediately) what options the car has.
  2. Only one table is involved in determining options for the car.

This last point is often the reason why many novice developers choose to use this approach; it’s easier and more straightforward to filter a resultset based on a single option:

SELECT *
FROM Auto a
WHERE a.has_leather = 'Y';

or more complicated request:

SELECT *
FROM Auto a
WHERE (a.has_leather = 'Y' OR a.has_power_windows = 'Y')
AND a.has_power_steering = 'Y';

The code is easy to understand for most anyone looking at it, and there’s no special syntax or SQL tricks needed in order to query on an OR condition or multiple various filters. Unfortunately, this approach has a number of downsides. Most important among them are:

  1. If you need to add an auto feature, you’ve got to ALTER TABLE Auto ADD COLUMN …

    This disadvantage is often overlooked in early design phases by novice programmers or design teams because everyone always assumes that they know every option that the customer might want to use. However, this is rarely the reality. Customers change their minds and will inevitably ask to add or remove Auto Features from the list. With this method, that requires a fundamental change in the schema: removing or adding columns. When doing so, especially on large tables, deadlocks can easily occur while read requests wait for an exclusive write lock to finish while the table is rebuilt to the new schema. (That’s a bad thing.)

  2. There isn’t any useful place for an index on any of these data fields.

    “Wait a minute!” you say, “You can place an index on any of these fields!” Sure, that is correct. Of course you can place an index on any of these fields. In fact, you can place an index on all of them if you really wanted to. But none of those indexes is likely to be used effectively by MySQL. Why? Well, that’s simple. In the schema above, there are only two values possible for each field — a ‘Y’ or ‘N’ (or 0 and 1, ‘T’ and ‘F’, etc). Let’s assume a table of 500K auto records. If the only two values in the field are ‘Y’ and ‘N’, what use would an index be? For a common auto feature, say “has_leather”. probably half of the records would contain a ‘Y’ and half an ‘N’. What use would this be to an index? None. In fact, an index would slow MySQL down, as so many index records would have to be read, with a lookup operation to the data record from each index record. The selectivity, or distribution, of key values is extremely low (see my previous article for more explanation on this) and therefore the index has limited use to MySQL.

  3. There is very little flexibility offered by this method.

    Imagine if you were to give the customer the ability to add and remove auto features at will. You would have to GRANT the customer (or the web-based or application user) the ability to ALTER TABLE. This isn’t generally considered to be the most secure or effective method of managing change…

    Also, Let’s say you came up with a list of 300 auto features. Are you going to add 300 fields to the table? Ever tried doing a SELECT * from a table with 300 fields? Even if you use the /G option in the mysql client, you’d still have a mess!

So, for all the reasons outlined above, I recommend against this approach for all but the simplest and non-production environments. It isn’t flexible enough to withstand change, and the limitations of performance far outweight any advantage to ease of use.

Wait a Minute!

“But wait!” you say, “MySQL itself uses this strategy for its own mysql database! The mysql.user table has fields like “Select_priv”, “Insert_priv”, and “Delete_priv”. Don’t you think MySQL knows what they’re doing!?”

Yes, of course MySQL knows what they’re doing. But, you have to remember one thing about the mysql system schema tables. They’re always in memory. There is no performance degradation related to the mysql.user table’s multiple field definitions for privileges because upon startup, the MySQL server actually loads all user information (privileges) into a hash table of structs containing privilege information. When a request is received to perform some operation, this hash table of user privileges is checked using very fast logical OR and AND operations. So, there aren’t any performance issues associated with the mysql tables. This is, by the way, why you must issue a FLUSH PRIVILEGES when manually changing any of the mysql tables. The FLUSH PRIVILEGES reloads this in-memory hash table.

So, for the mysql schema, the tables are designed for ease of use and simplicity, so this method was used to represent privileges. Do I favor it? Not particularly, but I can see why it was done… :)

The INT, BIGINT, or SET Bitmap Method

With this next method, the INT, BIGINT, or MySQL SET data type is used to store from zero to 64 separate key flags within a single field. Again, with this method, we denormalize the schema by reducing this many-to-many relationship down into a single field in a single table. Below, you’ll see an example of our Auto table converted to use the SET data type instead of the 7 distinct CHAR(1) fields from Method #1:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , auto_options SET('Air Conditioning'
    ,   'Power Windows'
    ,   'Power Steering'
    ,   'Moonroof'
    ,   'Disk Brakes'
    ,   'Power Seats'
    ,   'Leather') NOT NULL
  , PRIMARY KEY pk_Auto (auto_id)
);

Although the SET (and ENUM, SET’s one-to-many column type cousin) are listed in the MySQL manual as string column types, they are internally represented as integers. Below, when we cover bitwise operations, you’ll see how you can use the SET data type the same way as you would an INT or BIGINT, though there are advantages to simply sticking with the SET data type for simplicity’s sake. As with the multiple-field method, there are advantages to the SET method. They include:

  1. You can store a large (64 elements) number of key values in a small storage unit.

    For the SET data type, 1 byte of storage is used for up to 8 elements, 2 bytes for up to 16 elements, 3 bytes for up to 24 elements, 4 bytes of storage for up to 32 elements, and 8 bytes of storage for 33-64 possible elements. Compared with method #1 (and method #4 below), this does save storage space in the table

  2. MySQL automatically handles showing the descriptive value of the SET column value

    With the INT and BIGINT data types, MySQL will simply show the numeric representation of the field value. With the SET data type, by default, MySQL shows the descriptive string value of the column, instead of it’s numeric value (more below). This can be a handy feature.

  3. MySQL provides the handy FIND_IN_SET() function to quickly filter for a single key value.

    The FIND_IN_SET() function can be used to query for rows in which a specific key value is turned on for the particular row. If you want to find whether more than one key values are turned on, you can use an AND expression in the WHERE clause. For instance, let’s say we wanted to find all records in which the automobile had both the leather and power windows options. We could use the following:

    SELECT *
    FROM Auto a
    WHERE FIND_IN_SET('Leather', auto_options)>0
    AND FIND_IN_SET('Power Windows', auto_options)>0;

    Although there are some performance issues with SET fields (see below, on lack of indexing), the FIND_IN_SET() function is highly optimized to work specifically with the integer data housed beneath the surface. Bitwise operations are used to determine whether the row matches the specific key flag queried for. These bitwise operations are generally very fast, and we’ll cover them below.

Besides FIND_IN_SET(), any bitwise operator that you would normally use on integer data can be used on SET, INT and BIGINT columns.

Bitwise operations are performed on the actual binary representation of the data. Bits (each representing a key, are turned on or off by placing the corresponding bit in the binary number to 1. The table below shows the binary and decimal representations of the first byte (8 bits) of an integer bitmap. As you can see, each “on” bit simply turns on the appropriate power of 2.

Binary Decimal
0000 0001 1
0000 0010 2
0000 0100 4
0000 1000 8
0001 0000 16
0010 0000 32
0100 0000 64
1000 0000 128

Clearly, the larger the storage unit, the more bits can be used to represent the key flags. A BIGINT can store up to 64 unique values, an INT 32 unique values, and so on. One advantage to using the SET type is that it automatically chooses the smallest storage type needed to store the required number of set elements.

To have more than one key flag turned on within the bitmap field, the flag bits are added together. Therefore, if the third (23) and fourth (24) flags (0000 0100 and 0000 1000) are turned on, the bitmap field would contain 0000 1100, or the number 12 in decimal. If ever you want to see the decimal or binary representation of a SET field column, you can use +0 and the BIN() function, as shown below:

SELECT
  auto_id
, auto_options
, auto_options+0 AS 'dec'
, LPAD(BIN(auto_options+0),8,'0') AS 'bin'
FROM Auto a;
+---------+-----------------------------------------------+-----+----------+
| auto_id | auto_options                                  | dec | bin      |
+---------+-----------------------------------------------+-----+----------+
|       1 | Power Windows,Power Steering,Moonroof         |  14 | 00001110 |
|       2 | Power Windows,Power Steering,Moonroof,Leather |  78 | 01001110 |
|       3 | Power Windows,Moonroof,Leather                |  74 | 01001010 |
|       4 | Power Windows,Moonroof                        |  10 | 00001010 |
+---------+-----------------------------------------------+-----+----------+
4 rows in set (0.00 sec)

Use the LPAD() function to pretty up the output to a pre-determined width, with leading zeroes.

This means that you can do more complex querying using the numeric values of the SET elements (key values). For instance, suppose we wanted to find all those records which did not have Leather or Power Steering, but did have Power Windows. From the output above, we can easily see that the auto with ID#4 is the only record which will meet our criteria.

But, how do we structure our WHERE expression in SQL? Looking back at our table schema, we know that the numeric position (starting from the number 1) of the key values in our WHERE clause will be:

Key Value Numeric Position in SET
Leather 7
Power Steering 3
Power Windows 2

Using the MySQL bitwise functions, we can issue our query like so:

SELECT auto_id, auto_options
FROM Auto a
WHERE auto_options & (POW(2,6)+POW(2,2)) = 0
AND auto_options & POW(2,1) > 0;
+---------+------------------------+
| auto_id | auto_options           |
+---------+------------------------+
|       4 | Power Windows,Moonroof |
+---------+------------------------+
1 row in set (0.00 sec)

In the code above, the POW() function is used to get the correct bit set for each desired element in the query. We substract 1 from the number of the element, because the “on” bits are counted from the right side of the byte structure and determined as a power of 2. For the first part of the WHERE expression:

WHERE auto_options & (POW(2,6)+POW(2,2)) = 0

we ensure that the result of the bitwise & operator (which returns a 1 for each bit where both sides of the equation have the specified bit or bits turned on) results in 0. This ensures that only rows which do not have the Power Steering and Leather options are returned. The second part of the WHERE expression uses the bitwise & operator against the Power Windows option, and filters where the result is greater than zero, so that the resulting rows are known to contain the Power Windows option.

Besides violating normalization rules (and I won’t get into an idealistic debate about that here…there’s more than enough discussion online about that), there are some concrete reasons why not to use the SET data type for handling many-to-many relationships. Although I began writing this article quite a while before Sheeri Kritzer posted about the ENUM/SET type, some points are worth repeating. Foremost among these are the following:

  1. Using the SET data type for many-to-many relationships imposes a 64-element limit on the number of keys available to relate to the main entity.
  2. Using the SET data type with the built-in SET functions or any of the bitwise operators prohibits the MySQL optimizer from using indexes on the column.

    This is a major performance drawback to scalability and the primary reason I choose not to use this data type for anything but the smallest projects or for fields containing 3 or fewer elements. Why 3 or fewer? Well, because even if an index ould be used against the SET field, the chances of the index selectivity being large enough to filter an adequate amount of rows to make the index useful to the optimizer is already very small.

  3. Again, working with the SET data type is not particularly flexible.

    Changing elements of a SET data type once data is already in the table can be a real pain in the behind! I won’t go into the details, as the MySQL site covers many of the main points, and Beat Vontobel’s blog post on SETs covers what the MySQL site doesn’t (nice work, Beat!)

The Long String Method

A third common approach to many-to-many relationships is to use a single long string field to store essentially a concatenated version of Method #1’s multiple fields. An example of this method would be the schema below:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , auto_options CHAR(7) NOT NULL DEFAULT 'NNNNNNN'
  , PRIMARY KEY pk_Auto (auto_id)
);

There are few advantages to this approach, either performance or maintenance-wise, but it can come in handy in at least one particular circumstance: when application code either relies, or is made much more simple, by the use of imploded arrays.

Sometimes, particularly if you’ve inherited some legacy code which is simply too much of a nuisance to change, it can make sense to work with what you’ve got. If the application code relies heavily on returned field values being concatenated string values, this method might work well. The application returns a long list of either Y or N values representing whether keys are on or off. For instance, if we had an automobile with the Leather and Power Windows options, and we used the same order as the SET example above, the returned auto_options field would be: “NYNNNNNY”.

Perhaps the application was a PHP program which walked the string using something like the following:

<?php
/* We get the $auto_options string from the database... */
$all_options = array(
'Air Conditioning'
, 'Power Windows'
, 'Power Steering'
, 'Moonroof'
, 'Disk Brakes'
, 'Power Seats'
, 'Leather');
 
$num_options = strlen($auto_options);
for ($i=0;$i<$num_options;++$i) {
  if ($auto_options{$i} == 'Y') {
    printf("The car has %s", $all_options[$i]);
  }
}
?>

In this case, the PHP program has the entire key table in memory (the $all_options array). All the program needs to do is slice up the long string; the position of the character corresponds to the auto feature in the $all_options array. If there was little complex querying in the application, yet there were hundreds or thousands of these key values for each row, this might be a decent method.

Except for this scenario, however, this method is generally not desirable, for the following reasons:

  1. Too little flexibility

    Again, flexibility arises. In this method, what happens if you want to remove an Auto Feature from the list? Doing so in the PHP code would be fairly simple; just remove the element from the $all_options array. But, unforunately, once you did that, the offsets into the character string would be skewed. Likewise, if you wanted to add an Auto Feature, you’d have to change the table schema — not a very flexible plan. Making matters worse, you could only add an Auto Feature to the end of the character string. Adding one in the middle (say, if you wanted to keep some sort of alphabetical ordering in the PHP code) would again cause the offsets to be skewed.

  2. Performance degrades dramatically because indexes can rarely be used

    This disadvantage becomes even more noticeable as queries become more complex. Sure, an index might be used if you were querying on the first one (or more ordered) Auto Features. For instance, assuming an index on the auto_options field, the following query would indeed use an index:

    SELECT *
     FROM Auto a
    WHERE a.auto_options LIKE 'Y%';

    But, unfortunately, that’s about the limit of an indexes’ usefulness for this method. In the example above, we could use an index to find all records having Air Conditioning. But, what happens when we want to, say, find all automobiles which have Power Seats and Power Windows (the sixth and second keys)? Now we’re looking at the following SQL statement:

    SELECT *
    FROM Auto a
    WHERE SUBSTRING(a.auto_options, 6,1) = 'Y'
    AND SUBSTRING(a.auto_options, 2,1) = 'Y';

    Besides being a mess, this code will never be able to use an index because of the use of the SUBSTRING() function on the left side of the WHERE equations. And, because indexes won’t be used, the performance of this schema will not scale well at all.

The Mapping Table Method

Alright, the final (and my preferred) method for managing many-to-many relationships is to use a key mapping table (sometimes called a relationship table). In the beginning of the article, in figure C, you saw an E-R diagram showing a key mapping table relating the AutoFeature entity to the Auto entity. Note that this method is the only truly normalized method of managing many-to-many relationships. Up until now in the article, the schema organization has been in a single table. Now, through the use of three distinct tables, we are able to normalize the schema into the following DDL:

CREATE TABLE Auto (
  auto_id INT UNSIGNED NOT NULL AUTO_INCREMENT
  , auto_manufacturer_id SMALLINT UNSIGNED NOT NULL
  , auto_model VARCHAR(20) NOT NULL
  , model_year SMALLINT UNSIGNED NOT NULL
  , asking_price DECIMAL(12,2) UNSIGNED NOT NULL
  , PRIMARY KEY pk_Auto (auto_id)
);
 
CREATE TABLE AutoFeature (
  auto_feature_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT
  , feature_name VARCHAR(80) NOT NULL
  , PRIMARY KEY pk_AutoFeature (auto_feature_id)
);
 
CREATE TABLE Auto2AutoFeature (
  auto_id INT UNSIGNED NOT NULL
  , auto_feature_id SMALLINT UNSIGNED NOT NULL
  , PRIMARY KEY pk_Auto2AutoFeature (auto_id, auto_feature_id)
);

Before we discuss the advantages to this method, it’s worth pointing out a couple drawbacks to storing many-to-many relationships in this normalized fashion.

  1. Generally, the key mapping table method will use the most overall storage of any of these described approaches.

    As with all matters in the database design world, everything comes with a tradeoff. The key mapping table is no exception. The biggest tradeoff by far is the storage space needed to store the many-to-many relationship. Instead of a single field, or multiple small fields, in a single table, we now must store the foreign keys of each entity multiple times, with each unique combination occupying a single row in the relationship table.

    Additionally, because indexes can be effectively used against the key mapping table, we now need space to store the index records as well as the data records. As with any index, performance of INSERT and UPDATE operations (especially on high-volume mixed OLTP/OLAP applications) can suffer. However, any performance impact on INSERT and UPDATE operations is usually far outweighed by the performance benefits for SELECT operations. As always, however, benchmarking and testing is always a good idea; not only in the beginning of the project, but also at timed intervals as the database grows and matures.

  2. There are now one or two more tables to maintain for the schema

    Whilst having an extra table or two will provide us with the most flexibility, that flexibility comes at the cost of extra tables which must be maintained by both the application and the database administrator. For just a few seldom-changing keys, the hassle of maintaining extra tables and relationships may not be worth the added flexibility.

Now, on to the benefits of this approach, along with some examples of how to retrieve result sets using the key mapping table.

  1. Robust Indexing possibilities are now supported.

    Because both sides of the many-to-many relationship are separate fields in distinct rows of the key mapping table, our queries can support various indexing strategies. It’s best to see this in action, so I’ll demonstrate with two simple query examples.

    Let’s assume that we wish to find all the options available, in a list, for a specific automobile. This is a fairly simple query, but a good starter:

    SELECT af.feature_name
    FROM AutoFeature af
    INNER JOIN Auto2AutoFeature a2af
    ON af.auto_feature_id = a2af.auto_feature_id
    WHERE a2af.auto_id = 7;
    +------------------+
    | feature_name     |
    +------------------+
    | Air Conditioning |
    | Disk Brakes      |
    +------------------+
    2 rows in set (0.00 sec)
    

    Pretty simple query. Just an INNER JOIN from the AutoFeature table into our key mapping table on the auto_feature_id column, then a WHERE condition specifying the needed vehicle’s ID. This kind of output isn’t possible with the previous 3 methods without a lot of headache, but let’s face it: this is generally the format that an application needs data, correct? In a list, or an array. Now, with a simple query like this, that kind of output is easy.

    But, are our indexes in operation for the above query? Let’s find out:

    mysql> EXPLAIN SELECT af.feature_name
        -> FROM AutoFeature af
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON af.auto_feature_id = a2af.auto_feature_id
        -> WHERE a2af.auto_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: const
             rows: 2
            Extra: Using index
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: af
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 2
              ref: test.a2af.auto_feature_id
             rows: 1
            Extra:
    2 rows in set (0.00 sec)
    

    Indeed, they are. In the EXPLAIN output above, you’ll see that the optimizer is using a constant (ref: const) on the PRIMARY KEY to filter records. In the “Extra” column, you’ll note that MySQL helpfully tells us that it’s “Using index”. Many novice (or even intermediate) database developers new to MySQL will assume that if they see “Using index” in the Extra column, that they have an optimal query plan in place for the SQL code. This is actually a bit of a misnomer. When you see “Using index” in the Extra column, it means that MySQL is able to use the index records (as opposed to the data records which the index is attached to) in order to complete the query. In other words, MySQL doesn’t have to even go to the data records; everything it needs for the query is covered by the index records. This situation is called a covering index, and is something that goes hand-in-hand with key mapping tables. Why? Because, frankly, the entire table IS the index!

    For this reason, and others, which we’ll get to below, key mapping tables, when properly structured, are ideal for joining operations. In EXPLAIN plans for queries which use the key mapping table, you will often see the use of a covering index, because all of the joined columns are available in the index records. Practically speaking, this means that for MyISAM tables, the key_buffer will contain all the information that the query needs already in RAM; there is no need to access the data in the .MYD file, saving disk accesses and making the query performance lightning fast. For InnoDB tables, the access is just as quick. The PRIMARY KEY will be housed in the InnoDB data page buffer pool (since it is a clustered index and is the actual data records). So, likewise, queries will be lightning fast as the records are in memory…

    So, what about other types of queries; do they also benefit from the PRIMARY KEY index? Let’s find out. Here’s another fairly simple query which attempts to find all automobiles having the Leather option:

    SELECT a.*
    FROM Auto a
    INNER JOIN Auto2AutoFeature a2af
    ON a.auto_id = a2af.auto_id
    WHERE a2af.auto_feature_id = 7;
    +---------+----------------------+------------+------------+--------------+
    | auto_id | auto_manufacturer_id | auto_model | model_year | asking_price |
    +---------+----------------------+------------+------------+--------------+
    |       3 |                    1 | 1          |       2003 |      5000.00 |
    |       4 |                    1 | 2          |       2005 |     38000.00 |
    |       5 |                    1 | 2          |       2004 |     31000.00 |
    |       6 |                    1 | 3          |       2003 |     10000.00 |
    |       8 |                    2 | 1          |       2004 |     12000.00 |
    |      10 |                    2 | 2          |       2005 |     32000.00 |
    |      11 |                    2 | 2          |       2005 |     42000.00 |
    +---------+----------------------+------------+------------+--------------+
    7 rows in set (0.00 sec)
    

    Let’s use an EXPLAIN to see if we’ve indeed got an ideal execution plan:

    mysql> EXPLAIN SELECT a.*
        -> FROM Auto a
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON a.auto_id = a2af.auto_id
        -> WHERE a2af.auto_feature_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a
             type: ALL
    possible_keys: PRIMARY
              key: NULL
          key_len: NULL
              ref: NULL
             rows: 16
            Extra:
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 6
              ref: test.a.auto_id,const
             rows: 1
            Extra: Using index
    2 rows in set (0.00 sec)
    

    Uh oh! What happened? We’ve encountered the dreaded ALL access type! As you can see, MySQL chooses to do a table scan of the Auto table, and for each auto_id, do a lookup into the Auto2AutoFeature table along the auto_id key, filtering on the Auto2AutoFeature PRIMARY KEY’s auto_feature_id column (Look for “ref: test.a.auto_id,const”). But, you ask, why didn’t MySQL just find the auto_id records in Auto2AutoFeature that had the Leather option, and then join that smaller resultset to the Auto table?

    This is where the “robust indexing strategies” I spoke about earlier come into play, and where many novices get tripped up when dealing with key mapping tables.

    The reason that MySQL chose the access plan above is because the auto_feature_id column (on which we are supplying a constant filter value of 7) is on the right side of the PRIMARY KEY index. In order for MySQL to use an index effectively, it must be able to apply a constant or range filter value to the columns of an index, from LEFT to RIGHT. This is why, although you do see the term “Using index” in the Extra column of the second EXPLAIN, MySQL actually has chosen a sub-optimal plan. The solution? We add another index to the key mapping table which allows the reverse join direction to be followed.

    CREATE UNIQUE INDEX ix_ReversePK ON  Auto2AutoFeature (auto_feature_id, auto_id);

    Now, let’s take another stap at the same EXPLAIN from above:

    mysql> EXPLAIN SELECT a.*
        -> FROM Auto a
        -> INNER JOIN Auto2AutoFeature a2af
        -> ON a.auto_id = a2af.auto_id
        -> WHERE a2af.auto_feature_id = 7 \G
    *************************** 1. row ***************************
               id: 1
      select_type: SIMPLE
            table: a2af
             type: ref
    possible_keys: PRIMARY,ix_ReversePK
              key: ix_ReversePK
          key_len: 2
              ref: const
             rows: 7
            Extra: Using index
    *************************** 2. row ***************************
               id: 1
      select_type: SIMPLE
            table: a
             type: eq_ref
    possible_keys: PRIMARY
              key: PRIMARY
          key_len: 4
              ref: test.a2af.auto_id
             rows: 1
            Extra:
    2 rows in set (0.00 sec)
    

    See the difference?! Now, the new reverse direction index we just added is first used to filter the Auto2AutoFeature records, and THEN the Auto table is joined. This is the optimal query plan for this type of query.

    So, to sum up this advantage, it should be said that with key mapping tables, you’re given much more ability to tune and optimize your queries, but at the same time, you’ve got to know what you’re looking for, and got to be willing to put in the time to analyze your queries. But, hey! If you’ve read all the way through this article so far, I’m willing to bet you’ll do just that.

  2. Flexibility to add and remove elements easily, and without schema changes.

    One fatal flaw of the previously covered methods is their inability to deal easily with change. Let’s face it, change happens constantly in the business world. If you’re designing applications for use in this world, you had better make them able to deal with constant change. Otherwise, you will be on the phone 24/7 fulfilling request after request to make small changes to this type of data. Do yourself the favor, and give yourself back that time from the start.

    With key mapping tables, adding new elements to the many-to-many relationship is simple. You just add a record to the master table:

    INSERT INTO AutoFeature (feature_name) VALUES  ('GPS Navigation');

    All done. If you want to remove one, simply issue a DELETE with a join into the key mapping table. This will remove the parent and all child relationships:

    DELETE  AutoFeature, Auto2AutoFeature
    FROM AutoFeature
    INNER JOIN Auto2AutoFeature
    ON AutoFeature.auto_feature_id = Auto2AutoFeature.auto_feature_id
    WHERE AutoFeature.auto_feature_id = 4;
    Query OK, 5 rows affected (0.00 sec)
    

That wraps up this part’s coverage of the key mapping technique. It was just a brief overview. Next time, we’ll cover how to get the most out of your key mapping tables, including how to structure complex joins for various query filters against the many-to-many keys.

Conclusion

In this first part of the series, I’ve tried to thoroughly explain four common approaches for managing many-to-many relationships. Each approach has it’s distinct benefits and drawbacks, and I’ve tried to represent those aspects clearly and fairly. Please use the comments section below to let me know if you think I skimmed over certain methods or aspects, or didn’t give enough weight to one or the other. Also, feel free to point out any methods I may have left out completely; I’m always looking to round out my material in a balanced and thoughful way, and welcome any constructive criticism!

In the second part of the series, I’ll be examining ways we can use stored procedures, views, and the INFORMATION_SCHEMA in MySQL 5 to construct an easy management system for key mapping tables, and I’ll present some more advanced SQL that enables you to have full control in querying across these key mapping tables, allowing you to drill down accurately into the wealth of information stored there. We’ll also explore methods of increasing the performance of key mapping tables through the use of alternate indexes.

I’ll also be covering some benchmarks I’ll construct this coming week which show the relative performance and scalability of these methods, as well as the storage needs of each. Hopefully, that part of the article series will give some qualitative numbers for those looking for that kind of thing! ;) Till then, Cheers.

  • http://rpbouman.blogspot.com Roland Bouman

    Nice one Jay, I enjoyed it very much.

    I just have some comments, hope to see your thoughts on that.

    On many to many relationships, you say that:

    “There’s no way for us to visually represent the association between the two entities without using a third entity, which stores the mapping of the relationship between the Auto entity and the AutoFeature entity.”

    We’ll, I know that you know that there actually is such a method: just a relationship with crowfeet on both relationship ends. This diagram technique is often used in devising the conceptual model, before the data model is designed.

    Now, in most RDBMS’s you’d implement a many to many relationship using the intersection table approach, but what I find so striking about MySQL is that you *CAN* create _a_ _foreign_ _key_ _that_ _has_ _a_ *NON-UNIQUE* _parent_ _key_! It’s not that explicit, but you can find it in the MySQL reference manual:
    http://dev.mysql.com/doc/mysql/en/innodb-foreign-key-constraints.html
    Effectively, this allows for a relational many to many relationship implementation without an intersection table!
    This is what i mean:

    create table A(
    id int unsigned not null auto_increment
    , name varchar(30) not null
    , constraint pk_A primary key(
    id
    )
    , index i_A (
    name
    )
    )
    engine=innodb
    ;

    create table B(
    id int unsigned not null auto_increment
    , a_name varchar(30) not null
    , constraint pk_b primary key(
    id
    )
    , constraint fk_b_a foreign key (
    a_name
    ) references a (
    name
    )
    )
    engine=innodb
    ;

    As you can see, a non unique index in table A is used as parent for a foreign key in table B.

    Now, I posted a question on the MySQL forums (http://forums.mysql.com/read.php?20,47619,47619#msg-47619) to see if anyone has ever used this, but alas, I did not get a conclusive response. (the post contains some more details).

    What do you think of this device? It certainly isn’t common in the RDBMS world, or what?

    One other thing I’d like to add in is that in MySQL, you can use the bitwise shift operator. So, instead of

    POW(2,6)

    you can write

    1<<6 (It’s a matter of taste, performance seems to be quite equivalent as far as i could see) CU, looking forward to the next article!

    • http://jpipes.com Jay Pipes

      Hi Roland!

      Sorry to take so long to reply; I’ve been having issues with Serendipity emailing me when comments come in…

      Anyway, in response to your thoughts, I have a couple things…

      First off, yes, I was aware of the bitshift operator, but I thought that the POW function was a little clearer for the reader than having to explain the bitshift operator. I’ll go back and put a little sidebar in the article explaining the alternate technique a bit; nice suggestion!

      Secondly, as for visually representing the many-to-many relationship, it’s difficult for me to advise the crowfoot to crowfoot method without a mapping table because it doesn’t clearly specify the method by which the mapping shall be stored. I suppose in a conceptual object model that would be OK, but for an Entity-Relationship diagram at the Data level, like you say, the method I use more clearly indicates the relationship between the objects, and how the relationship is stored on the schematic level…that’s all :)

      As for the InnoDB bug/feature, I tend to side with Felix on this one ;) To me, it violates normalization rules of each row containing attributes that pertain to a single unique tuple. In the case of your example, the rows in child table B could not reliably be used to uniquely identify the parent tuples, as non-uniqueness of the parent key means the child keys could refer to many rows of the parent, but without knowing exactly *which* rows… a dilemma indeed! So, while interesting as a theoretical example, it would make for a frightening schema, as you’ve already pointed out…

      Cheers!

      • http://rpbouman.blogspot.com/ Roland Bouman

        Hi Jay,

        thanks for the reply. You know what, if I would’ve bumped into the many to many innodb feature/bug, I would immediately assume that it is indeed a bug. However, I found out that it is possible by just reading the manual.
        In http://dev.mysql.com/doc/refman/5.0/en/innodb-foreign-key-constraints.html, under the first little header “Deviation from SQL standards”, it reads:

        “If in the _parent_ _table_ there are _several_ _rows_ that have _the_ _same_ _referenced_ _key_ _value_, then InnoDB acts in foreign key checks as if the other parent rows with the same key value do not exist.”

        Note that the deviation deals with the cascading behaviour – there’s no mention at all that this type of foreign key is non-standard.

        Now, I agree that this type of relationship is whack and probably useless (I’m still in doubt but I can’t seem to think of a convincing example where to use this), but i do not agree to the assertion that this structure is unnormalized.

        If it is unnormalized as you say, wich of the normal forms is violated by this structure?

      • http://rpbouman.blogspot.com/ Roland Bouman
  • http://rpbouman.blogspot.com Roland Bouman

    Whoops, something went wrong parsing my comment,
    the last part shoudl read

    POW(2,6)

    you can write

    1<%lt;6

    whereas I typed 1 lessthan lessthan 6

  • SBG

    What about storing data in a comma deliminated fashion in a single field based on a lookup table? (Note: I have done this, but with a small problem I will identify at the end)

    For example, I have 2 tables

    Table 1 – People
    ——————————
    PeopleID (PK)(autoincr)

    Type

    Table 2 – People Types
    ——————————–
    PeopleTypeID (PK)(autoincr)
    Type

    So, I set up a multiple select lookup for TYPE in Table 1 that points to the PEOPLETYPEID in Table 2 (with the Type of table 2 as the display).

    Now when I go to ADD a record into table 1, in the Type input I am presented with a box showing all of the values from Type in table 2. I can select 1 or more of these values (using CTRL), and they are stored in the Type field as, say: ‘2,4,5’

    Even though I’m storing the ID values, when I browse table 1, it shows the field with corresponding text as, say: ‘doctor,patient,drugrep’

    And when I edit Table 1, the selection box already has highlighted the items previously selected, making it easy to see what options are left.

    Because I am storing ID #’s and not the actual text, it means I can go into Table 2 and change, say, ‘Doctor” to “Physician”, and now Physician will appear in Table 1 Type anywhere that Doctor did. Also, if I delete Doctor from Table 2, it will in effect no longer appear anywhere in Table 1 – sort of referential integrity (well not really but somewhat)

    I don’t know that this would be considered a ‘many-to-many’ solution, but this could be a legitimate way of replacing what would otherwise take 3 tables to do properly (be able to see all people relating to single type, and all types that relate to single people) in which you have a lookup to just a few values and you want to have the ability to add/modify those values easily…. and I really like how smooth this can appear to the end user with limited coding.

    …….. BUT ……..

    The one thing I have struggled with is performing a query based on the comma-deliminated field. When using the IN function, it only seems to compare my query value to the FIRST item in the list – so 2,3,5,6 would only return values on a 2. If someone knows how to structure the query so that it would allow me to find *any* of the values I would greatly appreciate your thoughts. Without this, the solution is largely invaluable, and cannot product the results that a 3 table solution would.

    Any other comments on this method???

    • http://jpipes.com Jay Pipes

      SBG,

      What you describe is very similar to the Method #3 above; you are describing the user interface of such a system. However, the same disadvantages to Method #3 apply to your system. It’s difficult to use an index on anything but the simplest of queries, and the system does not provide much flexibility.

      By deleting from the parent table (table 2), you don’t delete any records from table 1, therefore there will be stranded IDs in the child table which refer to a non-existent parent (orphan IDs). What happens if you then re-use that ID at a later time in the parent table? Now all the orphan IDs that used to point to a different parent record now point to the new one — not the desired effect, correct?

  • SBG

    Note: In Table 1 of the above post the space between PeopleID and Type was supposed to say ‘…Various fields relating to people…’ – Just FYI.

  • Roger Merritt

    Your article has been very helpful to me already, although nothing about the design of the tables has been new. I’m eagerly looking forward to part 2, so I’ll be checking back frequently for the next few weeks to see if it’s been posted yet, but let me offer (or add to) the WISHLIST ™: Please show at least one example of how to add data to the tables and keep the bridging table up to date. Your example uses a hypothetical used car dealership but didn’t use any dummy data, so just suppose the dealership gets a new used car in the lot with three of the features; how do you add the car to the Auto table and enter the features in the Auto2AutoFeature table?

    • http://jpipes.com Jay Pipes

      Roger,

      Your comments are duly noted, and your wishlist will be granted! When I finish the next article, you’ll have instructions on how to populate the schema with sample data. Also, I’m going to spend the first part of the article going over much more advanced scenarios than the couple I ended this article with. We’ll also use stored procedures to manage the maintenance of the bridging table… so stay tuned, and thanks for reading.

      Cheers,

      Jay

  • Badri

    I need one help in writing query

    i have to write one query

    Division Units Year
    ameerpet 200 2004
    ameerpet 300 2005
    ameerpet 500 2006

    From these values i want to retreive as like this

    Division Units Year
    ameerpet 200 2004
    ameerpet 100 2005
    ameerpet 200 2006
    Means difference of the Units values by year can u give me this query

    • http://jpipes.com Jay Pipes

      I believe you are asking for the year over year net difference in unit sales?

      Assuming a table called "Orders", this should do the trick:

      SELECT
      Division
      , (thisyear.Units – IFNULL(lastyear.Units, 0) as NetDifference
      , Year
      FROM Orders AS thisyear
      LEFT JOIN Order AS lastyear
      ON thisyear.Division = lastyear.Division
      AND thisyear.Year = (lastyear.Year + 1);

      Of course, this *does* assume that Division and Year make up the primary key of your Order table…

      Hope this helps!

  • Anonymous

    this is great stuff! very enlightening. thanks. can’t wait for the second part.

  • Sonny Suri

    Hello

    I have a unique problem at hand pertaining to many-to-many relationships. First off, I must say that one of my tables has markers in it which may change at any time and can go upto 1000 different values. The other parent table is domains. So the resolving table has domain_id, marker_id.

    The problem is to report on domains which hit certain markers and the report could have any number of combination of ANDs and ORs. Like:

    select marker_id, domain_id from domain_marker_link where marker_id 1 is matched and marker_id 2 is matched OR marker_id 3 is matched.

    I don’t want to join the table several times for performance reasons.

    Is there a solution to such a problem in SQL or it has to be done programmatically.

    Please advise.

    Thanks,
    Sonny

    • http://jpipes.com Jay Pipes

      Sonny, have you seen performance problems with multiple joins? If so, perhaps doing a loop in your application code which, every X joins, materializes the query to a temporary table and then continues the joins from there?

  • Timo

    there was never a part 2 was it?

    • Eduardo Mello

      Still no part 2?

      • nargual

        Still no part 2 :)

  • Anjanish

    A very well explained article. Looking for part-2. but unfortunatly can not find it.

  • jihad khorfan

    please post the part 2 …