Optimizing OR’ed WHERE Clauses Between JOIN’ed Tables
OR conditions are normally difficult to optimize when they apply to different columns, harder still when those columns are used in range conditions, and worst of all when the conditions span two or more different tables. Look at my example below: the original query is trying to find rows based on conditions against columns from both JOIN'ed tables.
mysql [localhost] {msandbox} (employees) > EXPLAIN SELECT
    ->   e.emp_no, birth_date, first_name,
    ->   last_name, gender, hire_date,
    ->   salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00' OR
    ->   s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> ORDER BY e.emp_no, hire_date, from_date \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: ALL
possible_keys: PRIMARY,hire_date
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 300547
        Extra: Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: PRIMARY,emp_no,from_date
          key: PRIMARY
      key_len: 4
          ref: employees.e.emp_no
         rows: 4
        Extra: Using where
2 rows in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > SHOW CREATE TABLE employees \G
*************************** 1. row ***************************
       Table: employees
Create Table: CREATE TABLE `employees` (
  `emp_no` int(11) NOT NULL,
  `birth_date` date NOT NULL,
  `first_name` varchar(14) NOT NULL,
  `last_name` varchar(16) NOT NULL,
  `gender` enum('M','F') NOT NULL,
  `hire_date` date NOT NULL,
  PRIMARY KEY (`emp_no`),
  KEY `hire_date` (`hire_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql [localhost] {msandbox} (employees) > SHOW CREATE TABLE salaries \G
*************************** 1. row ***************************
       Table: salaries
Create Table: CREATE TABLE `salaries` (
  `emp_no` int(11) NOT NULL,
  `salary` int(11) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date NOT NULL,
  PRIMARY KEY (`emp_no`,`from_date`),
  KEY `emp_no` (`emp_no`),
  KEY `from_date` (`from_date`),
  CONSTRAINT `salaries_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
If you look at the execution plan, the optimizer always does a full table scan of the first table even though there are proper indexes on employees.hire_date and salaries.from_date. This is because it cannot know in advance which rows from the first table will end up matching once joined to the second table: a row that fails the hire_date range may still need to be returned if its joined salaries row satisfies the from_date range, so no single index can cover the OR condition. For comparison, observe what happens if I change the OR condition to AND instead.
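For reference, the modified statement is identical to the original except that OR becomes AND in the WHERE clause, something like this:

EXPLAIN SELECT
  e.emp_no, birth_date, first_name,
  last_name, gender, hire_date,
  salary, from_date, to_date
FROM employees e
INNER JOIN salaries s ON (e.emp_no = s.emp_no)
WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00' AND
      s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
ORDER BY e.emp_no, hire_date, from_date \G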
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: e
         type: range
possible_keys: PRIMARY,hire_date
          key: hire_date
      key_len: 3
          ref: NULL
         rows: 2101
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: s
         type: ref
possible_keys: PRIMARY,emp_no,from_date
          key: PRIMARY
      key_len: 4
          ref: employees.e.emp_no
         rows: 4
        Extra: Using where
2 rows in set (0.00 sec)
As you can see, the optimizer now uses the hire_date index, and for every matching row from employees there are about four matching rows in the second table.
Notice how many index reads, on top of the full table scan, the original query does. Imagine a reporting query over date ranges where the first table has millions of rows; you might have to wait until the next day for the results!
mysql [localhost] {msandbox} (employees) > SELECT
    ->   e.emp_no, birth_date, first_name,
    ->   last_name, gender, hire_date,
    ->   salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00' OR
    ->   s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> ORDER BY e.emp_no, hire_date, from_date;
b5abad9ac152d6289d4cc62b7d71fb83 -
28133 rows in set (1.39 sec)

mysql [localhost] {msandbox} (employees) > NOPAGER; SHOW STATUS LIKE 'Handler%';
PAGER set to stdout
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
...
| Handler_read_key           | 300025  |
| Handler_read_next          | 2844047 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 28133   |
| Handler_read_rnd_next      | 328159  |
...
| Handler_write              | 28133   |
+----------------------------+---------+
15 rows in set (0.00 sec)
Because the query needs to return rows matching either condition across the two tables, we can rewrite it as a UNION instead, separating the WHERE clause into two different SELECT queries like below.
mysql [localhost] {msandbox} (employees) > PAGER md5sum; FLUSH STATUS;
PAGER set to 'md5sum'
Query OK, 0 rows affected (0.00 sec)

mysql [localhost] {msandbox} (employees) > (
    -> SELECT
    ->   e.emp_no, birth_date, first_name,
    ->   last_name, gender, hire_date,
    ->   salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE e.hire_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> )
    -> UNION
    -> (
    -> SELECT e.emp_no, birth_date, first_name,
    ->   last_name, gender, hire_date,
    ->   salary, from_date, to_date
    -> FROM employees e
    -> INNER JOIN salaries s ON (e.emp_no = s.emp_no)
    -> WHERE s.from_date BETWEEN '1990-06-01 00:00:00' AND '1990-07-01 00:00:00'
    -> )
    -> ORDER BY emp_no, hire_date, from_date;
b5abad9ac152d6289d4cc62b7d71fb83 -
28133 rows in set (0.14 sec)

mysql [localhost] {msandbox} (employees) > NOPAGER; SHOW STATUS LIKE 'Handler%';
PAGER set to stdout
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
...
| Handler_read_key           | 11746 |
| Handler_read_next          | 31302 |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 28133 |
| Handler_read_rnd_next      | 28134 |
...
| Handler_write              | 29200 |
+----------------------------+-------+
15 rows in set (0.00 sec)
With this approach, you immediately limit the number of rows that need to be examined before they are JOINed. There is no real downside to the new query; you just have to watch out that the results from both SELECTs do not grow too large, approaching the total number of rows in both tables. In that case you should not be doing it that way anyway, i.e. filtering with big date ranges on a very large dataset; instead, implement summary tables that produce the results you need, along the lines of the sketch below.
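As a rough illustration of the summary-table idea only: the table name, columns, and aggregation below are hypothetical and depend entirely on what your report actually needs, but the pattern is to pre-aggregate once and let the report read a handful of small rows instead of re-scanning the base tables.

-- Hypothetical monthly summary, refreshed periodically (e.g. by a cron job).
CREATE TABLE salary_monthly_summary (
  period     date NOT NULL,            -- first day of the month
  emp_count  int NOT NULL,             -- distinct employees with a salary row that month
  avg_salary decimal(10,2) NOT NULL,
  PRIMARY KEY (period)
);

-- Populate it from the base table; re-run (or restrict to recent months) to refresh.
INSERT INTO salary_monthly_summary (period, emp_count, avg_salary)
SELECT DATE_FORMAT(from_date, '%Y-%m-01'),
       COUNT(DISTINCT emp_no),
       AVG(salary)
FROM salaries
GROUP BY DATE_FORMAT(from_date, '%Y-%m-01');

-- The report then touches only a few summary rows instead of millions of base rows.
SELECT period, emp_count, avg_salary
FROM salary_monthly_summary
WHERE period BETWEEN '1990-06-01' AND '1990-07-01';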