Yesterday’s article on connection handling and caching techniques spurred a number of comments, both online and off, about whether or not the solution I propose is actually a good one. I wanted to follow up with some benchmark results that show the performance difference between caching in files and relying on the MySQL query cache.
A note about lazy loading
Before I go on, however, I did want to point out that Andrew Magrude had an excellent comment on the previous article, and brought up something I had not thought of regarding the mysql_real_escape_string() function. He correctly states that using the mysql_real_escape_string() function requires a connection to the database, and for folks concerned about SQL injection attacks involving various character sets, you will want to ensure that a connection is made to the database before running mysql_real_escape_string() on passed-in data.
That being said, however, there is still no reason whatsoever not to use lazy loading; simply wrap the mysql_real_escape_string() function in a helper method of the abstraction layer that calls Connect() first. Problem solved. That way, if you don’t use the escape function, you don’t have to connect needlessly to the database.
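Here is a minimal sketch of what that helper could look like. This is my own illustration, not the actual SqlConnection class from the previous article: the EscapeString() method name and the $Link property are assumptions, and the real class certainly contains more than this.

// Hypothetical slice of the SqlConnection abstraction layer.
// Connect() is the lazy-connect method discussed previously; the $Link
// property and EscapeString() name are assumed for illustration only.
class SqlConnection
{
    var $Link = null;

    function Connect()
    {
        // Lazily open the connection only when it is first needed
        if ($this->Link === null) {
            $this->Link = mysql_connect(SQL_MASTER_HOST, SQL_MASTER_USER, SQL_MASTER_PASS);
            mysql_select_db(SQL_MASTER_DB, $this->Link);
        }
        return $this->Link;
    }

    function EscapeString($Value)
    {
        // Make sure a live connection exists before escaping, since
        // mysql_real_escape_string() needs one to honor the connection's charset
        $this->Connect();
        return mysql_real_escape_string($Value, $this->Link);
    }
}

With something like that in place, $GLOBALS['Db']->EscapeString($_GET['user']) is safe to call anywhere; the connection is only opened if an escape or a query actually happens.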
OK, on to the dispute about caching
The majority of comments I received regarding the caching method I went over questioned the performance of a file-based cache versus strictly using the MySQL query cache’s ability to cache result sets. The reasoning made by the various commenters was that since the MySQL query cache operates in memory, which is faster than accessing the disk, it should be faster than the CacheEngine class, which relies on file-based lookups. In theory, this sounds like it makes sense, but in practice there are a couple of things that, as you’ll see, actually make the file-based cache more performant, and certainly more flexible. The remainder of this article seeks to demonstrate how a file-based cache can enhance the performance of certain applications, and at the end of the article I’ll discuss why it is the combination of a file-based cache and the MySQL query cache that really makes site performance top-notch.
Setting up the benchmarks
Someone noted that I had not included details on the hardware or MySQL version in my benchmarks, which is true. Here they are:
- IBM Thinkpad T43
- 1GB RAM, Pentium(R) M processor 1.86GHz, 2MB L2 cache
- Linux 2.6.15-26-386
- MySQL 5.0.22, for pc-linux-gnu (i486) using readline 5.1
- All tables involved are InnoDB tables
- I was also asked to specify the MySQL configuration settings; however, I fail to see how any MySQL settings would affect the outcome of these benchmarks besides query_cache_size, which is explicitly set to 16MB in the benchmark script. If anyone knows of configuration settings that would affect this benchmark, please let me know.
To benchmark the scenario, I set up two simple shell scripts that set up the benchmark environment and use the ApacheBench load-testing utility to perform the actual benchmarking. The key to benchmarking is to, as much as possible, isolate the condition upon which you are trying to determine changes in performance. In this scenario, we are trying to prove whether a file-based cache outperforms the MySQL query cache in a simple SELECT statement for infrequently updated data.
To test this condition, I created two PHP scripts, almost identical in nature, which grab data for the MySQL Forge project members for a specific project. The first script, test-cache.php, deploys the CacheEngine class discussed in the previous article. The second script, test-no-cache.php, does not deploy the CacheEngine class, and therefore relies on the MySQL query cache in order to cache the resultset of project member data. Here is test-cache.php, with passwords obviously removed:
define('SQL_NUM_SLAVES', 0);
define('SQL_FAIL_SLAVE_TO_MASTER', true);
define('SQL_MASTER_HOST', 'localhost');
define('SQL_MASTER_USER', 'xxx');
define('SQL_MASTER_PASS', 'xxx');
define('SQL_MASTER_DB', 'mysqlforge');
define('SQL_MASTER_SOCKET', '/tmp/mysql.socket');
define('DIR_CACHE', '/var/www/apache2-default/benchmark/cache/');

include('SqlConnection.php');
include('CacheEngine.php');

$GLOBALS['Db'] =& new SqlConnection();
$GLOBALS['CEngine'] =& new CacheEngine();

$GLOBALS['SqlTables']['ForgeUser'] = 'ForgeUser';
$GLOBALS['SqlTables']['ProjectMemberRole'] = 'ProjectMemberRole';
$GLOBALS['SqlTables']['ProjectMember'] = 'ProjectMember';

function &GetByProjectId($Project)
{
    $cache_id = 'project_members-' . $Project;

    // Return the cached result set if we have one
    if ($cache = $GLOBALS['CEngine']->GetFromCache($cache_id, $Seconds=0, $IsObject=true)) {
        return $cache;
    }

    $sql = "SELECT pm.project
            , pm.member
            , fu.display_name
            , pmr.description as role
            , pm.can_write
            , pm.can_read
            , pm.joined_on
            , pm.last_source_login
            , pm.last_source_commit
            FROM " . $GLOBALS['SqlTables']['ProjectMember'] . " pm
            INNER JOIN " . $GLOBALS['SqlTables']['ForgeUser'] . " fu
                ON pm.member = fu.user_id
            INNER JOIN " . $GLOBALS['SqlTables']['ProjectMemberRole'] . " pmr
                ON pm.role = pmr.project_member_role_id
            WHERE pm.project = " . (int) $Project;

    $results = $GLOBALS['Db']->GetRecords($sql);
    $GLOBALS['CEngine']->WriteToCache($cache_id, $results);
    return $results;
}

$results =& GetByProjectId(2);
print_r($results);
And here is the corresponding test-no-cache.php page, which, as you can see, differs only in that the calls to the CacheEngine object are removed:
define('SQL_NUM_SLAVES', 0);
define('SQL_FAIL_SLAVE_TO_MASTER', true);
define('SQL_MASTER_HOST', 'localhost');
define('SQL_MASTER_USER', 'xxx');
define('SQL_MASTER_PASS', 'xxx');
define('SQL_MASTER_DB', 'mysqlforge');
define('SQL_MASTER_SOCKET', '/tmp/mysql.socket');
define('DIR_CACHE', '/var/www/apache2-default/benchmark/cache/');

include('SqlConnection.php');
include('CacheEngine.php');

$GLOBALS['Db'] =& new SqlConnection();
$GLOBALS['CEngine'] =& new CacheEngine();

$GLOBALS['SqlTables']['ForgeUser'] = 'ForgeUser';
$GLOBALS['SqlTables']['ProjectMemberRole'] = 'ProjectMemberRole';
$GLOBALS['SqlTables']['ProjectMember'] = 'ProjectMember';

function &GetByProjectId($Project)
{
    $sql = "SELECT pm.project
            , pm.member
            , fu.display_name
            , pmr.description as role
            , pm.can_write
            , pm.can_read
            , pm.joined_on
            , pm.last_source_login
            , pm.last_source_commit
            FROM " . $GLOBALS['SqlTables']['ProjectMember'] . " pm
            INNER JOIN " . $GLOBALS['SqlTables']['ForgeUser'] . " fu
                ON pm.member = fu.user_id
            INNER JOIN " . $GLOBALS['SqlTables']['ProjectMemberRole'] . " pmr
                ON pm.role = pmr.project_member_role_id
            WHERE pm.project = " . (int) $Project;

    $results = $GLOBALS['Db']->GetRecords($sql);
    return $results;
}

$results =& GetByProjectId(2);
print_r($results);
OK, the next step is to create the shell scripts which handle the benchmarking process. Below is benchmark-cache.sh:
#!/bin/sh

# Restart Apache to ensure cleared buffers
sudo apache2ctl restart

# Restart MySQL
sudo /etc/init.d/mysql restart

# Kill any cached files
sudo rm -rf /var/www/apache2-default/benchmark/cache/*

# Warm up Apache with a simple page
ab -c 100 -n 2000 http://localhost/apache2-default/php-info.php >& /dev/null

echo "
CACHE BENCHMARK RUN:
***********************
" > cache-benchmark.results

# Reset the query cache and
# flush tables and status counters
mysql --skip-column-names --user=root < setup-benchmark.sql >> cache-benchmark.results

echo " " >> cache-benchmark.results

# Run the benchmark on the warmed-up server
ab -n 2000 -c 300 \
  http://localhost/apache2-default/benchmark/test-cache.php \
  >> cache-benchmark.results

echo " " >> cache-benchmark.results

# Run the post-benchmark status script
mysql --skip-column-names --user=root < post-benchmark.sql >> cache-benchmark.results
and the corresponding benchmark-no-cache.sh:
#!/bin/sh

# Restart Apache to ensure cleared buffers
sudo apache2ctl restart

# Restart MySQL
sudo /etc/init.d/mysql restart

# Kill any cached files
sudo rm -rf /var/www/apache2-default/benchmark/cache/*

# Warm up Apache with a simple page
ab -c 300 -n 2000 http://localhost/apache2-default/php-info.php >& /dev/null

echo "
NO CACHE BENCHMARK RUN:
***********************
" > no-cache-benchmark.results

# Reset the query cache and
# flush tables and status counters
mysql --skip-column-names --user=root < setup-benchmark.sql >> no-cache-benchmark.results

echo " " >> no-cache-benchmark.results

# Run the benchmark on the warmed-up server
ab -n 2000 -c 300 \
  http://localhost/apache2-default/benchmark/test-no-cache.php \
  >> no-cache-benchmark.results

echo " " >> no-cache-benchmark.results

# Run the post-benchmark status script
mysql --skip-column-names --user=root < post-benchmark.sql >> no-cache-benchmark.results
The shell scripts are pretty self-explanatory. We restart both Apache and MySQL to ensure a clean slate, and clear any cached files from our cache directory. However, to simulate an already-running Apache server, I “warm up” the server with a dummy ApacheBench run against a simple PHP page. This ensures that the Apache subprocesses which will handle our benchmark are already around. Then we run a setup SQL script before the ApacheBench benchmark run, and a post-process SQL script after it. The ab (ApacheBench) run is set to 300 concurrent users and a total of 2000 requests (-c 300 -n 2000). Pretty standard stuff here. Let’s look at the setup-benchmark.sql script:
SET GLOBAL QUERY_CACHE_SIZE=0;
FLUSH TABLES;
FLUSH STATUS;
SET GLOBAL QUERY_CACHE_SIZE=(16*1024*1024);
SHOW VARIABLES LIKE 'query_cache_size%' \G
SHOW STATUS LIKE 'Qcache_insert%' \G
SHOW STATUS LIKE 'Qcache_hits%' \G
Here, all we’re doing is setting the query cache size to zero to empty it, flushing the tables and status counters, and then setting the query cache size back to 16MB. After that, we output a few status variables that will show us whether the MySQL query cache was used during the test run.
Finally, we have a quick SQL script which runs after the benchmark is done which outputs the same query cache status variables, so we can check the results of the benchmark:
SHOW STATUS LIKE 'Qcache_insert%' \G
SHOW STATUS LIKE 'Qcache_hits%' \G
And, then we run the tests. Here is the output from the CacheEngine-enabled test:
CACHE BENCHMARK RUN:
***********************

*************************** 1. row ***************************
Variable_name: query_cache_size
        Value: 16777216
*************************** 1. row ***************************
Variable_name: Qcache_inserts
        Value: 0
*************************** 1. row ***************************
Variable_name: Qcache_hits
        Value: 0

This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)

Server Software:        Apache/2.0.55
Server Hostname:        localhost
Server Port:            80

Document Path:          /apache2-default/benchmark/test-cache.php
Document Length:        362 bytes

Concurrency Level:      300
Time taken for tests:   4.980576 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      1130000 bytes
HTML transferred:       724000 bytes
Requests per second:    401.56 [#/sec] (mean)
Time per request:       747.086 [ms] (mean)
Time per request:       2.490 [ms] (mean, across all concurrent requests)
Transfer rate:          221.46 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    36  178.7     24    3037
Processing:    50   334   88.1    333    1794
Waiting:        3   309   90.7    295    1791
Total:         75   370  230.6    356    4831

Percentage of the requests served within a certain time (ms)
  50%    356
  66%    359
  75%    361
  80%    362
  90%    365
  95%    459
  98%    462
  99%    593
 100%   4831 (longest request)

*************************** 1. row ***************************
Variable_name: Qcache_inserts
        Value: 1
*************************** 1. row ***************************
Variable_name: Qcache_hits
        Value: 0
The crucially important pieces of this output are the requests-per-second figure, the service-time percentiles, and the query cache status counters (although the rest of the results are also informative). You can see that the CacheEngine was able to support 401.56 requests per second. Additionally, you see that the longest request took 4831 ms, which makes sense: the cache file had to be written out on the first request, which undoubtedly took longer than the subsequent read requests. Even so, 99% of all requests were serviced within 593 ms. Finally, you will note that there was only a single insert into the MySQL query cache, and no hits to the cache. This also makes sense: the first query to the database, whose results the CacheEngine then writes out to a file for later use, is cached by MySQL in the query cache. Subsequent reads of the same data never hit the query cache because the CacheEngine serviced them, with no involvement from MySQL at all.
Let’s take a look at the results of the CacheEngine-disabled test run, which shows how the MySQL query cache performed compared to CacheEngine:
NO CACHE BENCHMARK RUN:
***********************

*************************** 1. row ***************************
Variable_name: query_cache_size
        Value: 16777216
*************************** 1. row ***************************
Variable_name: Qcache_inserts
        Value: 0
*************************** 1. row ***************************
Variable_name: Qcache_hits
        Value: 0

This is ApacheBench, Version 2.0.41-dev <$Revision: 1.141 $> apache-2.0
Copyright (c) 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright (c) 1998-2002 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)

Server Software:        Apache/2.0.55
Server Hostname:        localhost
Server Port:            80

Document Path:          /apache2-default/benchmark/test-no-cache.php
Document Length:        362 bytes

Concurrency Level:      300
Time taken for tests:   7.945251 seconds
Complete requests:      2000
Failed requests:        0
Write errors:           0
Total transferred:      1130000 bytes
HTML transferred:       724000 bytes
Requests per second:    251.72 [#/sec] (mean)
Time per request:       1191.788 [ms] (mean)
Time per request:       3.973 [ms] (mean, across all concurrent requests)
Transfer rate:          138.83 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0   157  669.5      0    3004
Processing:   113   434  316.3    436    4560
Waiting:      112   434  316.3    436    4560
Total:        113   592  730.9    439    5916

Percentage of the requests served within a certain time (ms)
  50%    439
  66%    447
  75%    457
  80%    466
  90%    502
  95%   3193
  98%   3382
  99%   3439
 100%   5916 (longest request)

*************************** 1. row ***************************
Variable_name: Qcache_inserts
        Value: 1
*************************** 1. row ***************************
Variable_name: Qcache_hits
        Value: 2012
In this test, you will see that the MySQL query cache was able to service 251.72 requests per second, compared with the CacheEngine’s 401.56. Unless my math is wrong, this gives CacheEngine an overall performance advantage of 59.52%. Additionally, you will notice that CacheEngine was able to service > 98% of requests in ~500ms, versus 90% for the query cache. Finally, note that the query cache status variables do indeed show that the query cache was used to service the SQL requests, as expected.
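To spell out that arithmetic: (401.56 − 251.72) / 251.72 ≈ 0.595, so the CacheEngine run sustained roughly 59.5% more requests per second than the query-cache-only run (the 59.52% figure comes from the unrounded numbers).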
Can we conclude decisively that a file-based cache is more performant?
Of course we can’t. All I’ve proven here is that, for this scenario, a file-based cache performed better than the MySQL query cache. And you’re probably asking: what exactly is this scenario that I’ve benchmarked?
Believe it or not, the above scenario is becoming the fastest growing data retrieval access pattern in modern web applications. It is a model of the AJAX/Web 2.0 XML-RPC request that is taking over modern web applications. In this scenario, you have lots and lots of requests for small amounts of data, much of which is infrequently updated. As AJAX creeps into more and more web applications, this scenario will become more and more important to benchmark as I have done here.
OK, so this profiles an XML-RPC request? What about more complex stuff?
Good question. The reason the file-based cache worked so effectively is that the operating system (or indeed Apache) pulled the data contained in the cache file into RAM and serviced the remaining requests out of RAM, just like our good old query cache does. Of course, no connection to the database was required in order to retrieve the information, which resulted in better overall performance.
However, what happens if the requested cache file isn’t in memory? Say, for instance, on a typical shared web host that cannot possibly keep all requested files in memory? Well, I don’t know. I will work on putting together a benchmark which models this type of environment (PeterZ, perhaps some pointers from you? 🙂 ) and see what happens. But, for right now, I am fairly confident that the caching technique outlined in these last two articles is ideal for the next generation of web applications, particularly those serving XML-RPC requests against data with a heavy read-to-write ratio.
OK, so what about the MySQL Query Cache?
So, does this mean that you should turn off the MySQL query cache? No, not at all. The big advantage of the CacheEngine file-based cache is that it gives you the flexibility to use the file-based cache when it makes sense to do so, and fall back to the MySQL query cache in all other cases. So, what are the cases when I would use a file-based cache?
Well, the scenario I present in this benchmark not only represents one of the most common data access patterns today, but it also beautifully illustrates a weakness of the MySQL query cache: invalidation of stored result sets has only table-level granularity. What does that mean? It means that if I have a result set cached for the project members of project 2, and a project member for any project is inserted, updated, or deleted in the ProjectMember table, then my result set stored in the query cache will be invalidated (thrown away), even though the rows I actually cached, project 2’s members, never changed.
However, my file-based cache allows me to invalidate the project member data for Project #2 more intelligently: only when that data actually changes. Therefore, the file-based cache can have a much higher cache hit rate than the query cache. Remember, just because Project #2’s project members don’t change very often doesn’t mean that new project member data isn’t coming into the ProjectMember table at a steady clip…
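To make that concrete, here is a rough sketch of what a write path with targeted invalidation could look like. The AddProjectMember() function, the Execute() call, and the ExpireFromCache() method are all my own illustrative names (this article only shows GetRecords(), GetFromCache(), and WriteToCache()), and the column names are inferred from the SELECT above; substitute whatever write and cache-removal mechanisms your classes actually provide, even a plain unlink() of the cache file.

// Hypothetical write path for project member data. Only the cache entry
// for the affected project is discarded; cached member lists for every
// other project stay valid. The MySQL query cache, by contrast, would
// throw away every cached result set touching ProjectMember on this INSERT.
function AddProjectMember($Project, $Member, $Role)
{
    $sql = "INSERT INTO " . $GLOBALS['SqlTables']['ProjectMember'] . "
            (project, member, role, joined_on)
            VALUES (" . (int) $Project . ", " . (int) $Member . ", "
                       . (int) $Role . ", NOW())";
    $GLOBALS['Db']->Execute($sql); // Execute() is assumed; use your layer's write method

    // Invalidate only this project's cached member list
    $cache_id = 'project_members-' . $Project;
    $GLOBALS['CEngine']->ExpireFromCache($cache_id); // hypothetical method; unlink()ing
                                                     // the cache file does the same job
}

Meanwhile, Project #2’s cached file keeps being served untouched, no matter how many members the other projects add.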
So, my advice would be to carefully analyze whether a file-based cache is a better fit for certain data than the MySQL query cache. In data warehousing scenarios, where the query cache can routinely store large, non-updated resultsets, the query cache may perform better. Who knows. Hopefully, however, this article has given you the tools you need to test it yourself and determine what works for you. Cheers!