MySQL Hash Join前世今生

GreatSQL社区原创内容未经授权不得随意使用，转载请联系小编并注明来源。
GreatSQL是MySQL的国产分支版本，使用上与MySQL一致。
作者：nw

MySQL Hash Join前世今生

因工作需要，对MySQL Hash Join的内部实现做了一些探索和实践，对这个由8.0.18开始引入的连接算法有了一定的了解，写文总结与各位大佬分享，欢迎大家指教。因篇幅较长，这份总结分成若干部分，我们今天先一起来看一下MySQL Hash join的变迁史。

爬了一下 MySQL worklog[1]，并结合源码及各版本的实际使用，个人认为比较重要的worklogs为如下几个，其它的变更一般围绕这些worklogs做的小调整或bugfixes，这里我们就不一一列举。

1. WL#2241: Hash join (变更版本：8.0.18)

主要内容：

新增执行类HashJoinIterator，实现hash join算法原型 (支持on-disk hash)
HASH JOIN 仅支持INNER JOIN，并对使用hash join做了限制：关联表连接条件必须至少包含一条等值条件(equi-join condition, 如t1.a = t2.a)，且join条件中的列不包含索引

注：这里我认为官网的 Release Notes[2] 描述是不太准确的，以如下例子为例，虽然该查询包含了非等值连接条件(non-equi-join condition, 如 t1.a <> t2.a ,t1.a = 1, t2.a > 1 等），但实际查询中还是使用hash join的, 因为查询语句在解析执行过程中，可能会经历语句重写、sql优化等步骤，与表象上会有所不同，建议借助explain工具，查看实际上查询语句选择的join算法。

-- 版本：8.0.18
-- 在创建iterator时，t1.a > 1 会被当成表的过滤条件，而非inner join的join条件
-- 此查询仍使用了hash join（join 条件为空）
EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.i > 1;
-> Inner hash join  (cost=1.17 rows=3)
    -> Table scan on t2  (cost=0.34 rows=2)
    -> Hash
        -> Filter: (t1.i > 1)  (cost=0.65 rows=1)
            -> Table scan on t1  (cost=0.65 rows=4)1
2
3
4
5
6
7
8

(尽量)使用HASH JOIN算法替换BNL(Blocked Nested-Loop)连接算法

2. WL#13377: Add support for hash outer, anti and semi join( 变更版本：8.0.20)

主要内容:

HASH INNER JOIN改进
INNER JOIN中的non-equi-join conditions, 会被附为hash join的过滤条件:

-- 版本：8.0.20
EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.i <> t2.i;
-> Filter: (t1.i <> t2.i)  (cost=1.10 rows=2)
    -> Inner hash join (no condition)  (cost=1.10 rows=2)
        -> Table scan on t2  (cost=0.18 rows=2)
        -> Hash
            -> Table scan on t1  (cost=0.45 rows=2)1
2
3
4
5
6

HAH JOIN支持outer join/anti join/semi join

-- 版本：8.0.20
-- Left outer join
EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.i=t2.i;
-> Left hash join (t2.i = t1.i)  (cost=0.88 rows=4)                                                                                                                                                                          
    -> Table scan on t1  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t2  (cost=0.23 rows=2)

-- Right outer join（注：MySQL在parser阶段会将所有的right join改写为left join
--                      所以我们这里看到的explain为Left hash join
EXPLAIN FORMAT=tree SELECT * FROM t1 RIGHT JOIN t2 ON t1.i=t2.i;
-> Left hash join (t1.i = t2.i)  (cost=0.88 rows=4)                                                                                                                                                                          
    -> Table scan on t2  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t1  (cost=0.23 rows=2)

-- Semijoin
EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i)  (cost=0.83 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=2)

-- Antijoin
EXPLAIN FORMAT=tree 
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.i = t2.i);
-> Hash antijoin (t1.i = t2.i)  (cost=1.10 rows=4)                                                                                                                                                                           
    -> Table scan on t2  (cost=0.45 rows=2)                                                                                                                                                                                    
    -> Hash                                                                                                                                                                                                                    
        -> Table scan on t1  (cost=0.45 rows=2)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29

所有使用BNL的连接，都将被转为使用HASH JOIN
non-equi-join conditions 处理
Hash join iterator引入"extra" condition的概念，部分 non-equi-join conditions会被当成Hash join iterator的extra condition, 在建hash table时，join key的计算不依赖这些条件，但会在hash查找到匹配项后，作为附加的过滤条件：

-- 版本: 8.0.20
-- 注： t1.i > 1 会被放到hash join的 extra conditions中
EXPLAIN FORMAT=tree SELECT * FROM  t1 LEFT JOIN t2 ON t1.i=t2.i AND t1.i > 1;
-> Left hash join (t2.i = t1.i), extra conditions: (t1.i > 1)  (cost=0.88 rows=4)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.23 rows=2)1
2
3
4
5
6

3. WL#13459: Optimize hash table in hash join (变更版本：8.0.23）

主要内容：

优化hash join table的创建方法
这里MySQL所说的“优化”，实际上会更激进一点，这个版本中，MySQL直接使用了一个基于 robin hood hashing[3] 实现的 开源hash table[4] ，更换了原先的hash join table实现( from mem_root_unordered_multimap to robin_hood::unordered_flat_map)
优化内存管理和使用，降低了 on-disk hash 的频率，提高了内存有效使用率等（其他的改进内容及提升的效果可以参考MySQL 8.0.23的release notes[5] 以及 worklog #13459[6] 的Low Level Design页面）

本篇我们对MySQL hash join的3个重要变更做了简要的总结，算是对它的前世今生有了印象，谢谢各位阅读；之后让我们会结合具体的sql查询样例，去跟踪一下对应的代码执行流程，不日更新，敬请期待。

参考文档

[1] MySQL worklog: https://dev.mysql.com/worklog/
[2] MySQL 8.0.18 release notes#optimizer: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-18.html#mysqld-8-0-18-optimizer
[3] robin hood hashing 算法介绍: https://programming.guide/robin-hood-hashing.html
[4] robin hood hasing 开源实现: https://github.com/martinus/robin-hood-hashing
[5] MySQL 8.0.23 release notes#optimizer: https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-23.html#mysqld-8-0-23-optimizer
[6] MySQL worklog #13459: https://dev.mysql.com/worklog/task/?id=13459

Enjoy GreatSQL :)

关于 GreatSQL

GreatSQL是由万里数据库维护的MySQL分支，专注于提升MGR可靠性及性能，支持InnoDB并行查询特性，是适用于金融级应用的MySQL分支版本。

相关链接： GreatSQL社区 Gitee GitHub Bilibili

GreatSQL社区：

捉虫活动详情：https://greatsql.cn/thread-97-1-1.html

社区博客有奖征稿详情：https://greatsql.cn/thread-100-1-1.html

6440

相关阅读:
Python基础：【习题系列】列表、元组、字典和集合
 c++基础知识点总结
 猿创征文｜瑞吉外卖——管理端_后台登录与退出
 Mybatis-动态 SQL详解
 一文彻底搞懂协程（coroutine）是什么，值得收藏
 详解 Serverless 架构的 6 大应用场景
 oracle常见报错问题处理
 Java常用类
 显示今天的年、月、日日期、时间的数据处理timetuple()
【MySQL】（六）DQL表查询操作
原文地址：https://blog.csdn.net/GreatSQL2021/article/details/127974696