Yesterday, I posed a question to the ZanyWeb about what exactly a REPLACE statement does behind the scenes in the storage engine. There were many excellent comments and these comments exposed some misunderstandings (including some of my own misconceptions) about the REPLACE statement itself and what goes on behind the scenes in the storage engine.
The question I asked was this: if I execute the following statements in a client, what would you expect would happen behind the scenes in the storage engine?
CREATE TABLE t1 ( id INT NOT NULL AUTO_INCREMENT PRIMARY KEY , padding VARCHAR(200) NOT NULL ); INSERT INTO t1 VALUES (1, "I love testing."); INSERT INTO t1 VALUES (2, "I hate testing."); REPLACE INTO t1 VALUE (2, "I love testing.");
Based purely on the manual, one would expect, as Ryan Thiessen expressed in his comment
try to insert once, detect a failure and then delete/insert in failure case.
or, as Ryan put it in pseudo-code:
INSERT INTO t1 VALUES (1, "I love testing.");
INSERT INTO t1 VALUES (2, "I hate testing.");
INSERT INTO t1 VALUE (2, "I love testing.");
if error() { DELETE FROM t1 WHERE id = 2; INSERT INTO t1 VALUE (2, "I love testing."); }
Unfortunately, this is not the case. At least, it is not always the case.
ha_write_row() vs. ha_update_row()
In MySQL, as well as in Drizzle, there is a pluggable storage engine API. This API consists primarily of two classes: In MySQL, one class is called handlerton and the other is called handler — in Drizzle, these classes are called StorageEngine and Cursor (Current Set Of Records), because that is what they actually represent.
Without going into too many details, the Cursor (handler) interface has four calls which are relevant to this discussion:
- int Cursor::ha_write_row(unsigned char *new_record) — inserts a new record into the table pointed to by the Cursor
- int Cursor::ha_update_row(unsigned char *old_record, unsigned char *new_record) — update an existing record in the table pointed to by the Cursor
- int Cursor::ha_delete_row(unsigned char *old_record) — delete an existing record in the table pointed to by the Cursor
- virtual int Cursor::extra(ha_extra_function operation) — indicates to the Cursor that it should handle an operation in a certain way. If this sounds vague to you, it is, so read on…
OK, So What Does Happen?
What actually happens behind the scenes for the code above is the following “optimized” execution path, again in pseudo-code:
// (0) Enter mysql_insert() and prepare to do an INSERT mysql_insert(); // (1) tell engine that a write can replace an existing record Cursor::extra(HA_EXTRA_WRITE_CAN_REPLACE); // (2) tell engine to ignore duplicate keys Cursor::extra(HA_EXTRA_IGNORE_DUP_KEY); // (3) For each record to be inserted... for each record begin: // (4) Insert the record by calling ha_write_row() Cursor::ha_write_row(); // (5) Cursor tries to "insert" the record error= Cursor::write_row(); // (6) If the write fails, then DELETE, the record and try to INSERT again. if error is true: Cursor::ha_delete_row(); Cursor::ha_write_row();
The above code looks very similar to what Ryan has already said he expected the underlying code to look like, no? The problem is the calls to Cursor::extra() (steps 1 and 2) and what happens inside InnoDB (step 5) when Cursor::write_row() is called.
InnoDB tries to insert the record and realizes that the new record violates an existing primary key value but since MySQL has already told it to ignore duplicate key violations, InnoDB updates the existing row and returns successfully from write_row()!.
Unfortunately, because of this “optimization”, Cursor::ha_write_row(), which is the kernel’s wrapper around the virtualized storage engine’s Cursor::write_row call is completely unaware that an UPDATE and not an INSERT has occurred. Why is this a problem? Well, what happens after a call to Cursor::write_row() succeeds? That’s correct: post-change logging occurs. In other words, logging for row-based replication occurs. Well, Cursor::ha_write_row() then logs an INSERT and not the appropriate UPDATE. Oops.
There are three major things to note about the above logic:
- This is undocumented behaviour. According to the manual, a REPLACE statement works like an INSERT, except if a primary or unique key is violated, the original row is DELETEd and the new row is INSERTed.
- This breaks the defined public Cursor (known as handler in MySQL) interface for ha_write_row() since rows may be updated when ha_update_row() is NOT called.
- The code has side effects, namely that the change to the state of the server is unknown to the Cursor: an update occurred but the Cursor believes an insert occurred.
When Optimizations Aren’t
This kind of coding can be argued to be an “optimization”. I will argue that it is not an optimization, but is overly clever and reduces the clarity of the code thereby making the code unnecessarily difficult to trace and follow. Personally, I wasted days of research time trying to understand how a call to ha_write_row() was actually updating an existing record.
In my opinion, a proper optimization would be to:
- Make the internal code function exactly as the documentation says it does.
- Adapt the documentation for REPLACE and state that REPLACE is not as efficient as INSERT .. ON DUPLICATE KEY UPDATE and whenever possible, prefer the latter statement over the former.
Does REPLACE belong in Drizzle?
One of the reasons that REPLACE may exist in MySQL is to make the INSERT statement idempotent when run on a replication slave when run in mixed or statement-based mode. Don’t believe me? Here is a code snippet and comment from the sql/log_event.cc file in MySQL 5.4:
Write_rows_log_event::do_before_row_operations(const Slave_reporting_capability *const) ... /** todo: to introduce a property for the event (handler?) which forces applying the event in the replace (idempotent) fashion. */ if (bit_is_set(slave_exec_mode, SLAVE_EXEC_MODE_IDEMPOTENT) == 1 || m_table->s->db_type()->db_type == DB_TYPE_NDBCLUSTER) { /* We are using REPLACE semantics and not INSERT IGNORE semantics when writing rows, that is: new rows replace old rows. We need to inform the storage engine that it should use this behaviour. */ /* Tell the storage engine that we are using REPLACE semantics. */ thd->lex->duplicates= DUP_REPLACE; /* Pretend we're executing a REPLACE command: this is needed for InnoDB and NDB Cluster since they are not (properly) checking the lex->duplicates flag. */ thd->lex->sql_command= SQLCOM_REPLACE; /* Do not raise the error flag in case of hitting to an unique attribute */ m_table->file->extra(HA_EXTRA_IGNORE_DUP_KEY); ...
By contrast to MySQL 5.4, Drizzle’s replication system is a reflection of the atomic changes made to the state of a server. These changes are, by definition, deterministic, and applying a transaction log to a Drizzle replica can result in one and only one state in the resulting server. In Drizzle, having the application of the transaction log be idempotent is not necessary, as the transaction log stream is always deterministic. For instance, if an INSERT ... ON DUPLICATE KEY UPDATE is issued against a Drizzle server, the replication system shall create either an InsertStatement or an UpdateStatement message, depending on what precisely occurred on the server, and these messages are logged in Drizzle’s transaction log. There is no fudging of statements. What happened on the server is logged as an exact change in the state of the server, nothing more.
So…anyway, because of the above “optimization”, I’m now left with a tricky problem: continue to pay the interest on this Technical Debt or rework the Cursor interface so that behaviour is always idempotent via the wrapper interface and has no side effects. I’m not sure what I will do…but this has delayed me substantially.






