I’m having an issue where the optimizer seems to be doing a full table scan and exhibiting a big performance hit when a hash join is used.
Schema and sample data:
CREATE TABLE `px_po_status` (
`po_status_id` tinyint NOT NULL,
`name` varchar(32) NOT NULL,
PRIMARY KEY (`po_status_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `px_po_status` (`po_status_id`, `name`)
VALUES
(0, 'PENDING'),(1, 'SENT'),(3, 'COMPLETE');
CREATE TABLE `px_po` (
`po_id` int NOT NULL AUTO_INCREMENT,
`po_date` date NOT NULL,
`po_status_id` tinyint(1) NOT NULL,
PRIMARY KEY (`po_id`),
KEY `po_date` (`po_date`),
KEY `po_status_id` (`po_status_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `px_po` (`po_id`, `po_date`, `po_status_id`)
VALUES
(39647, '2018-10-09', 3),(39648, '2018-10-09', 3),(39649, '2018-10-09', 3),(39651, '2018-10-09', 3),(39653, '2018-10-09', 3),(39654, '2018-10-09', 3),(39655, '2018-10-09', 3),(39657, '2018-10-09', 3),(39658, '2018-10-09', 3),(39659, '2018-10-09', 3),(39660, '2018-10-09', 3),(39661, '2018-10-09', 3),(39662, '2018-10-10', 3),(39663, '2018-10-10', 3),(39665, '2018-10-10', 3),(39666, '2018-10-10', 3),(39667, '2018-10-10', 3),(39668, '2018-10-10', 3),(39669, '2018-10-10', 3),(39670, '2018-10-10', 3),(39671, '2018-10-10', 3),(39672, '2018-10-10', 3),(39673, '2018-10-10', 3),(39674, '2018-10-10', 3),(39675, '2018-10-11', 3),(39676, '2018-10-11', 3),(39677, '2018-10-11', 3),(39678, '2018-10-11', 3),(39679, '2018-10-11', 3),(39680, '2018-10-11', 3),(39681, '2018-10-15', 3),(39682, '2018-10-15', 3),(39683, '2018-10-15', 3),(39684, '2018-10-15', 3),(39685, '2018-10-15', 3),(39686, '2018-10-15', 3),(39687, '2018-10-15', 3),(39688, '2018-10-15', 3),(39689, '2018-10-15', 3),(39690, '2018-10-15', 3);
Problem query:
SELECT
po.po_id,
pst.name AS po_status
FROM px_po po
LEFT JOIN px_po_status pst ON po.po_status_id = pst.po_status_id
ORDER BY po_date DESC
LIMIT 0,25;
Running EXPLAIN
here shows the following:
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------+
| 1 | SIMPLE | po | NULL | index | NULL | po_date | 4 | NULL | 41 | 100.00 | Using index; Using temporary; Using filesort |
| 1 | SIMPLE | pst | NULL | ALL | PRIMARY | NULL | NULL | NULL | 3 | 100.00 | Using where; Using join buffer (hash join) |
+----+-------------+-------+------------+-------+---------------+---------+---------+------+------+----------+----------------------------------------------+
Workaround #1:
SELECT
po.po_id,
(SELECT pst.name FROM px_po_status pst WHERE po.po_status_id = pst.po_status_id) as po_status
FROM px_po po
ORDER BY po_date DESC
LIMIT 0,25;
Oddly, using a subquery performs much better and gets optimized as I’d expect:
+----+--------------------+-------+------------+--------+---------------+---------+---------+-------------------------+------+----------+---------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------------+-------+------------+--------+---------------+---------+---------+-------------------------+------+----------+---------------------+
| 1 | PRIMARY | po | NULL | index | NULL | po_date | 3 | NULL | 25 | 100.00 | Backward index scan |
| 2 | DEPENDENT SUBQUERY | pst | NULL | eq_ref | PRIMARY | PRIMARY | 1 | sandbox.po.po_status_id | 1 | 100.00 | NULL |
+----+--------------------+-------+------------+--------+---------------+---------+---------+-------------------------+------+----------+---------------------+
Workaround #2:
I’m also aware that I can leverage optimizer hints to circumvent the poor performance as outlined here: https://dev.mysql.com/doc/refman/8.0/en/optimizer-hints.html#optimizer-hints-table-level. This yields an identical explain to Workaround #1:
SELECT /*+ NO_BNL() */
po.po_id,
(SELECT pst.name FROM px_po_status pst WHERE po.po_status_id = pst.po_status_id) as po_status
FROM px_po po
ORDER BY po_date DESC
LIMIT 0,25;
Couple of notes:
My Question:
Why does the optimizer seem to fail such a basic operation? Is this a bug?