Wednesday, October 01, 2014

MySQL Date Comparison Performance

As the webmaster of a major IT solutions provider, one of my responsibilities involves optimizing queries against our MySQL database in order to ensure fast page loads. Like most programming languages and technologies, there are often multiple ways to perform a certain action or get a certain result, but only a few of those possible approaches are efficient. MySQL is no exception. Hundreds of articles exist that compare the performance of different query strategies. However, not all scenarios have been exhausted yet. Today, I will be examining one of those scenarios: range comparisons against columns of the DATE datatype in InnoDB.

The slow query that posed the original question for me is a compound query responsible for gathering traffic counter statistics for a web form we have. Each time a visitor to our site interacts with the form by viewing, beginning, or submitting it, we update a counter column in our traffic table on the row that contains the current date and interaction condition. Later, an administrator can access these statistics via a page that executes a query that accumulates the counters for each condition in a couple different date ranges: today, yesterday, this week, last week, this month, and this year. The table structure and the broken-out queries look like this:

CREATE TABLE `landing_submission_new_traffic` (
   `traffic_condition` enum('viewed','started','finished') NOT NULL,
   `traffic_date` date NOT NULL,
   `traffic_count` smallint(5) unsigned NOT NULL default '0',
   PRIMARY KEY  (`traffic_condition`,`traffic_date`)
) ENGINE=InnoDB;

# Today's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND traffic_date = CURRENT_DATE), 0) AS today_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

# Yesterday's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND traffic_date = CURRENT_DATE - INTERVAL 1 DAY), 0) AS yesterday_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

# This week's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND YEARWEEK(traffic_date) = YEARWEEK(CURRENT_DATE)), 0) AS week_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

# Last week's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND YEARWEEK(traffic_date) = YEARWEEK(CURRENT_DATE - INTERVAL 1 WEEK)), 0) AS lastweek_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

# This month's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND YEAR(traffic_date) = YEAR(CURRENT_DATE) AND MONTH(traffic_date) = MONTH(CURRENT_DATE)), 0) AS month_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

# This year's traffic
SELECT DISTINCT traffic_condition, 
   IFNULL((SELECT SUM(traffic_count) FROM landing_submission_new_traffic WHERE traffic_condition = lsnt1.traffic_condition AND YEAR(traffic_date) = YEAR(CURRENT_DATE)), 0) AS year_sum 
 FROM landing_submission_new_traffic lsnt1 ORDER BY traffic_condition+0;

At the time of this writing, the table has 1,778 unique rows (401 unique dates) with the compound query taking about 4-7 seconds to execute with caching disabled on a high-load shared database. Obviously, a high-load shared database isn't the best candidate for running time-critical queries, and the primary query could be rewritten to perform better (consider joins and unions), but let's focus on the hypothesis that the subqueries are less than optimal because of the ways they compare dates. Checking each query individually shows that the first two queries, today's and yesterday's traffic, run in less than 1/10 secs each while the other queries run in about 1.5-2 secs each. This makes sense because the first two queries run a simple equality check against the date. So, what's making the others so slow? They all use MySQL internal functions and minimal arithmetic, but could that really be the cause? And if so, what should we use instead?

Let's see how different forms of year, month, and week queries perform. First, we need a sterile test environment where we can get consistent execution times without the added noise of a live production system. I chose to use a low-load VPS where the database isn't being accessed by anything else and created a table containing only the dates and row IDs (for index hit testing. This'll make sense in a bit). I also exploded the dates out to year, month, and day of month for an integer comparison test case. I then populated the table with ~1,000,000 rows of 397 dates. The quantity of dates is unimportant. The dates were generated from a backup of the production table referenced earlier being repeatedly inserted until the table was filled with at least 1 million rows. The table structure follows:

CREATE TABLE `test_dateindex` (
   `test_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
   `test_date` date NOT NULL,
   `test_date_year` year(4) NOT NULL,
   `test_date_month` tinyint(2) unsigned NOT NULL,
   `test_date_day` tinyint(2) unsigned NOT NULL,
   PRIMARY KEY (`test_id`),
   KEY `test_date` (`test_date`),
   KEY `test_date_unfolded` (`test_date_year`,`test_date_month`,`test_date_day`)
) ENGINE=InnoDB;

Before we get to the testing, let's review some important characteristics of MySQL and the InnoDB engine...

  • Test cases for one engine don't necessarily translate to another. In this test case specifically, the index layout wouldn't make much sense for a MyISAM table, but lends to our benchmarking in InnoDB. This is because InnoDB places the primary key at the end of every other index while MyISAM does not. The term for this type of combined index is an extended secondary index. In order to ensure all of our queries only use one index or another, I've separated them with neither of them being the primary key. I chose to make a row ID primary key just to better illustrate that I purposefully avoided making either index the primary -- you could remove the column and primary key entirely without affecting our testing.
  • DATEs are accepted in multiple formats. When MySQL can tell you are trying to use a representation of a datatype, it automatically converts the representation to whatever makes the most sense in the context. For example, the string value "20" can be converted to the integer value 20 when MySQL detects an integer operation or comparison is being attempted. This can be an advantageous optimization because integer operations are usually much faster to perform. The same automatic type casting applies to the DATE datatype. The string '2014-01-24' and the integer 20140124 are both valid DATE expressions (though the string version is preferred as it's valid SQL). See B.5.5.2 Problems Using DATE Columns for more information.
  • Indexes can be accessed by prefix, both by column and value. If a table has a single index on ('column1', 'column2', 'column3'), a query accessing column1 alone or column1 and column2 together can take advantage of the index. However, a query accessing column2 alone, column3 alone, or column2 and column3 together can not. Similarly, a query accessing column1 and column3 together can take advantage of the index, but only for column1. Partial value checks are also subject to this prefix rule. If we want to find all rows with a string in column1 starting with "hello", the index can be used. If we want to find all rows with column1 containing "hello" in the middle or at the end of the string, the index can not be used. We could also use the index in checks on column2 and/or column3 for values beginning with a given string as long as the preceding columns on the index are also accessed by the query within the same constraints. This behavior is discussed further in 8.5.3 How MySQL Uses Indexes.
  • Computed values of a column can not take advantage of indexes. While built-in functions and operators are typically much faster than user-defined functions and stored procedures, anything that has to do some type of computation upon a column can't use the index to quickly find rows matching a given criteria. This is because the index only contains the raw values of the column, not possible results of a function. This is one scenario when denormalization should be considered: another column may be indexed which contains the precomputed result of the function, and the query can be rewritten to check this column for the expected result. I've done this with our test table by breaking the date into three separate columns containing the results of the YEAR(test_date), MONTH(test_date), and DAYOFMONTH(test_date) functions. To illustrate the performance gains of doing so, I've included both strategies in the test queries.

Now for the testing. For each of the ranges (yearly, monthly, and weekly), I have designed a couple different queries that find the same rows through different WHERE clauses and only return the constant TRUE in order to eliminate data retrieval overhead from the benchmarks. Each query is executed sequentially once with caching disabled and MySQL profiling enabled, then the sequence starts over. This occurs 5 times, and the minimum, maximum, and average execution time of each variant is recorded. I've created a PHP script to automate this process. If the minimum and maximum times for any single query appear to be too dispersed, the results are thrown out for that sequence and the script is run again until a good sample is collected. Additionally, the ranking of queries by average execution time must be recreatable by running the script for the sequence again after a few minutes, though the average execution times themselves may differ. In order for a query to be considered to outperform another, their average execution times must differ by at least 1/100 secs.

Yearly Queries

SELECT SQL_NO_CACHE TRUE FROM test_dateindex...
QueryMinMaxAvgRank
...WHERE YEAR(test_date) = 20140.4195150.5251190.47137266 6 6
...WHERE test_date LIKE '2014-%'0.4420240.5349590.50278247 7 7
...WHERE test_date BETWEEN '2014-01-01' AND '2014-12-31'0.4146620.4934430.45227185 5 5
...WHERE test_date BETWEEN 20140101 AND 201412310.3728470.4280810.39401043 2 2
...WHERE test_date >= '2014-01-01' AND test_date <= '2014-12-31'0.3743200.4837260.44037364 4 4
...WHERE test_date >= 20140101 AND test_date <= 201412310.3542980.4440410.40827581 3 3
...WHERE test_date_year = 20140.3627290.4245680.38994682 1 1

These queries were chosen for the following qualities:

  • Query 1 uses the built-in YEAR() and then compares the result to a constant. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.
  • Query 2 performs a string comparison on the prefix of the date (the year). String comparisons are slower than integral ones, but this query was chosen to see if the optimizer can check column prefixes of a date with a string. Remember: DATEs can be accepted in multiple formats.
  • Queries 3-6 use the two common range comparison methods with both DATE (string) constants and integers. This is to clarify which type of range comparison performs the best, even though the SQL standard suggests the string constant form. The integral versions are expected to outperform the string constant ones because integral comparisons are so fast.
  • Query 7 is a control variable in our experiment. It uses the exploded index containing precomputed values for the date column's various parts. This query is expected to be the fastest because it requires a simple, fast integral comparison against the index.

What do the results tell us? Surprisingly, query 1 is not the slowest, being slightly faster than query 2. This means that, like query 1, query 2 cannot use the index, which means the optimizer is unable to check column prefixes of a date with a string. Moving on to the range queries in 3-6, we see that there isn't a huge difference in BETWEEN versus integer inequality operators. Range comparisons involving integers seem faster than DATE (string) constants, however. This is odd since the SQL standard and MySQL both state a preference to the string constant form. In fact, both forms of range comparisons involving integers appear to be nearly as fast as our control query 7!

Monthly Queries

SELECT SQL_NO_CACHE TRUE FROM test_dateindex...
QueryMinMaxAvgRank
...WHERE YEAR(test_date) = 2014 AND MONTH(test_date) = 10.2490900.2864190.26023245 5 5
...WHERE test_date LIKE '2014-01-%'0.2279030.2492120.23527244 4 4
...WHERE test_date BETWEEN '2014-01-01' AND '2014-01-31'0.0444130.0492770.04752423 3 3
...WHERE test_date >= '2014-01-01' AND test_date <= '2014-01-31'0.0413430.0448280.0429452 2 2
...WHERE test_date_year = 2014 AND test_date_month = 10.0272410.0310580.02917661 1 1

These queries were chosen for the following qualities:

  • Query 1 is similar to the yearly query 1. It uses the built-in YEAR() and MONTH(), and then compares the result to a constant. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.
  • Query 2 is similar to the yearly query 2. It performs a string comparison on the prefix of the date (the year and month). String comparisons are slower than integral ones, but this query was chosen to see if the optimizer can check column prefixes of a date with a string. Remember: DATEs can be accepted in multiple formats.
  • Queries 3-4 are similar to yearly queries 3-6. They use the two common range comparison methods with DATE (string) constants. As with the yearly queries, these are expected to be outperformed by integral versions, though I do not include them here for brevity.
  • Query 5 is a control variable in our experiment. It uses the exploded index containing precomputed values for the date column's various parts. This query is expected to be the fastest because it requires a simple, fast integral comparison against the index.

What do the results tell us? As expected, query 1 is the slowest by a difference of 3/100 secs from the second slowest and nearly a quarter of a second from the fastest. The lesson learned from yearly query 2 applies to monthly query 2 as well: string comparisons against dates can't use the index or check prefixes. However, the difference between the two is that the monthly form outperforms monthly query 1, whereas the yearly form was slower than yearly query 1. My belief is that this happens because the yearly query 1 uses only one computational comparison. In other words, string comparisons against dates are faster than two computational comparisons, but slower than one. Looking at queries 3-4, we see another parallel to the yearly queries, being that BETWEEN and integer inequality operators perform about the same (a difference of less than 1/100 secs). We could also infer that either range comparison method would be outperformed by an integer-using counterpart, and that those counterparts would both perform nearly the same as our control query.

Weekly Queries

SELECT SQL_NO_CACHE TRUE FROM test_dateindex...
QueryMinMaxAvgRank
...WHERE YEARWEEK(test_date) = YEARWEEK('2014-01-25')0.2272270.2798950.24871225 5 5
...WHERE test_date BETWEEN ('2014-01-25' - INTERVAL (DAYOFWEEK('2014-01-25')-1) DAY) AND '2014-01-25'0.0130000.0185270.01527784 4 4
...WHERE test_date BETWEEN '2014-01-19' AND '2014-01-25'0.0106430.0129640.01182142 1 1
...WHERE test_date_year = 2014 AND test_date_month = 1 AND test_date_day >= 19 AND test_date_day <= 250.0098180.0148830.0126391 2 2
...WHERE test_date_year = 2014 AND test_date_month = 1 AND test_date_day BETWEEN 19 AND 250.0118370.0156040.01387563 3 3

These queries were chosen for the following qualities:

  • Query 1 uses the built-in YEARWEEK() to compute the week of the date column and the current date. This differs from yearly query 1 in that two computations are performed instead of one for the sake of readability (I consider YEARWEEK('2014-01-25') to be more readable than simply 201403), though the second usage only needs to be calculated once for the lifetime of the query, not for each row. This is a common query because of its readability. This query is expected to be the slowest because it requires a computational comparison and therefore cannot use the date column's index.
  • Queries 2-3 are similar to yearly queries 3-6. They use the BETWEEN range comparison method with DATE (string) constants. As with the yearly queries, these are expected to be outperformed by integral versions, though I do not include them here for brevity. These queries are expected to perform nearly identically because query 2's computations only need to be done once for the lifetime of the query, similar to weekly query 1.
  • Queries 4-5 is a control variable in our experiment and are also similar to yearly queries 3-6. They use the exploded index containing precomputed values for the date column's various parts, each also using one of the two range comparison methods. These queries are expected to be the fastest because they each require a simple, fast integral range comparison against the index.

What do the results tell us? The results for the weekly queries don't tell us much because all queries excluding the first are equal for our purposes (difference of less than 1/100 secs). In nearly every test run, queries 3-5 swap rankings, though query 2 is the second slowest by a small margin. I believe the reason for query 2 being slower than the others is the one-time computational overhead. I had expected queries 4-5 to outperform query 3 because of their use of the exploded index, but that does not appear to be the case. No real advantage is demonstrated among queries 3-5.

...

In summation, we now have a few rules of thumb that we can and probably should apply to production-level queries. First, queries relying on per-row computed values are always the slowest. Unless readability is critical to a query in production, functions and operators acting upon a column should be replaced. Also, for all the debating of DBAs and programmers over BETWEEN versus less-than/greater-than operators, the difference is nearly non-existent. It's all a choice of personal preference. The real performance boost lies in the operand being an integer rather than a string constant, despite the recommendations to use the constants by various documentations. Finally, denormalization of a DATE column has negligible performance advantages over simply running comparisons against the date in whole, though the advantage does exist in most cases.

Happy optimizing!

No comments: