Press enter to see results or esc to cancel.

Optimizing OR’ed WHERE Clauses Between JOIN’ed Tables

OR conditions are normally difficult to optimize when used on different columns, a pain when the columns are of range conditions and worst when done between 2 or more different tables. Look at my example below, the original query  is trying to find rows which conditions are based on columns from the JOIN’ed tables.

mysql [localhost] {msandbox} (employees) > EXPLAIN SELECT
    ->      e.emp_no, birth_date, first_name,
    ->      last_name, gender, hire_date,
    ->      salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00' OR
    ->      s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> ORDER BY e.emp_no, hire_date, from_date \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: ALL
possible_keys: PRIMARY,hire_date
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300547
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: PRIMARY,emp_no,from_date
          key: PRIMARY
      key_len: 4
          ref: employees.e.emp_no
         rows: 4
        Extra: Using where
2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > SHOW CREATE TABLE employees \G
*************************** 1. row ***************************
       Table: employees
Create Table: CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`),
  KEY `hire_date` (`hire_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > SHOW CREATE TABLE salaries \G
*************************** 1. row ***************************
       Table: salaries
Create Table: CREATE TABLE `salaries` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`),
  KEY `from_date` (`from_date`),
  CONSTRAINT `salaries_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

If you look at the execution plan, the optimizer will always do a full table scan of the first table even though there is proper indexes on employees.hire_date and salaries.from_date. This is because it cannot know in advance which rows from the second table will match the rows from the second table, the OR conditions needs to match both tables. For the sake of brevity, observe what happens if I change the OR condition to AND instead.

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: range
possible_keys: PRIMARY,hire_date
          key: hire_date
      key_len: 3
          ref: NULL
         rows: 2101
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: PRIMARY,emp_no,from_date
          key: PRIMARY
      key_len: 4
          ref: employees.e.emp_no
         rows: 4
        Extra: Using where
2 rows in set (0.00 sec)

As you can see, the optimizer now used the hire_date index in which for every matching row, there is about 4 rows matching on the second table.

Notice how many index reads and the full table scan the original query does. Imagine if you are working on a reporting query based on date ranges that has millions of rows for the first table, you might have to wait the next day for the results!

mysql [localhost] {msandbox} (employees) > SELECT
    ->      e.emp_no, birth_date, first_name,
    ->      last_name, gender, hire_date,
    ->      salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00' OR
    ->      s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> ORDER BY e.emp_no, hire_date, from_date;

b5abad9ac152d6289d4cc62b7d71fb83  -
28133 rows in set (1.39 sec)

mysql [localhost] {msandbox} (employees) > NOPAGER; SHOW STATUS LIKE 'Handler%';
PAGER set to stdout
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
...
| Handler_read_key           | 300025  |
| Handler_read_next          | 2844047 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 28133   |
| Handler_read_rnd_next      | 328159  |
...
| Handler_write              | 28133   |
+----------------------------+---------+
15 rows in set (0.00 sec)

Because the queries need to return rows matching both tables, we can rewrite it as UNION instead separating the WHERE clauses into 2 different SELECT queries like below.

mysql [localhost] {msandbox} (employees) > PAGER md5sum; FLUSH STATUS;
PAGER set to 'md5sum'
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (employees) > (
    -> SELECT
    ->      e.emp_no, birth_date, first_name,
    ->      last_name, gender, hire_date,
    ->      salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> )
    -> UNION
    -> (
    -> SELECT e.emp_no, birth_date, first_name,
    ->      last_name, gender, hire_date,
    ->      salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> )
    -> ORDER BY emp_no, hire_date, from_date;
b5abad9ac152d6289d4cc62b7d71fb83  -
28133 rows in set (0.14 sec)

mysql [localhost] {msandbox} (employees) > NOPAGER; SHOW STATUS LIKE 'Handler%';
PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
...
| Handler_read_key           | 11746 |
| Handler_read_next          | 31302 |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 28133 |
| Handler_read_rnd_next      | 28134 |
...
| Handler_write              | 29200 |
+----------------------------+-------+
15 rows in set (0.00 sec)

With this approach, you are actually limiting the number of rows needed to be examined immediately before being JOINed. There is no real downsize with the new query, you just have to watch out that results from both SELECT’s does not grow too large, nearly the size of the total rows of both tables. In which case, you should not be doing the latter anyway i.e. trying to filter with big date ranges especially if you have very large dataset, instead you should implement summary tables that produces the results you need.