• 2022-07-20 mysql-hashJoin说明


    abstract:

    hash join describe

    hash join:

    Hash Join
    The hash join algorithm is a very recent addition to MySQL and is supported in MySQL
    8.0.18 and later. It marks a significant break with the tradition of nested loop joins
    including the block nested loop variant. It is particularly useful for large joins without
    indexes but can in some cases even outperform an index join.
    MySQL implements a hybrid between a classic in-memory hash join and the on-disk
    GRACE hash join algorithm. 2 If it is possible to store all the hashes in memory, then the
    pure in-memory implementation is used. The join buffer is used for the in-memory part,
    so the amount of memory that can be used for the hashes is limited by join_buffer_
    size . When the join does not fit in memory, the join spills over to disk, but the actual
    join operations are still performed in memory.
    The in-memory hash join algorithm consists of two steps:
    1. One of the tables in the join is chosen as the build table . The hash
    is calculated for the columns required for the join and loaded into
    memory. This is known as the build phase .
    2. The other table in the join is the probe input . For this table, rows
    are read one at a time, and the hash is calculated. Then a hash
    key lookup is performed on the hashes calculated from the build
    table, and the result of the join is generated from the matching
    rows. This is known as the probe phase .
    When the hashes of the build table do not fit into memory, MySQL automatically
    switches to use the on-disk implementation (based on the GRACE hash join algorithm).
    The switch from the in-memory to the on-disk algorithm happens if the join buffer
    becomes full during the build phase. The on-disk algorithm consists of three steps:
    1. Calculate the hashes of all rows in both the build and probe tables
    and store them on disk in several small files partitioned by the
    hash. The number of partitions is chosen to make each partition
    of the probe table fit into the join buffer but with a limit of at most
    128 partitions.
    2. Load the first partition of the build table into memory and iterate
    over the hashes from the probe table in the same way as for the
    probe phase for the in-memory algorithm. Since the partitioning
    in step 1 uses the same hash function for both the build and probe
    tables, it is only necessary to iterate over the first partition of the
    probe table.
    3. Clear the in-memory buffer and continue with the rest of the
    partitions one by one.
    Both the in-memory and on-disk algorithms use the xxHash64 hash function which
    is known as being fast while still providing hashes of good quality (reducing the number
    of hash collisions). For the optimal performance, the join buffer needs to be large
    enough to fit all the hashes from the build table. That said, the same considerations for
    join_buffer_size exist for hash joins as for block nested loop joins.
    MySQL will use the hash join whenever the block nested loop would otherwise be
    chosen, and the hash join algorithm is supported for the query. At the time of writing,
    the following requirements exist for the hash join algorithm to be used:
    • The join must be an inner join.
    • The join cannot be performed using an index, either because there is
    no available index or because the indexes have been disabled for the
    query.
    • All joins in the query must have at least one equi-join condition
    between the two tables in the join, and only columns from the two
    tables as well as constants are referenced in the condition.
    • As of 8.0.20, anti, semi, and outer joins are also supported. 3 If you join
    the two tables t1 and t2 , then examples of join conditions that are
    supported for hash join include
    t1.t1_val = t2.t2_val
    t1.t1_val = t2.t2_val + 2
    t1.t1_val1 = t2.t2_val AND t1.t1_val2 > 100
    MONTH(t1.t1_val) = MONTH(t2.t2_val)
    If you consider the recurring example query for this section, you can execute it using
    a hash join by ignoring the indexes on the tables that can be used for the join:
    SELECT CountryCode, country.Name AS Country,
    city.Name AS City, city.District
    FROM world.country IGNORE INDEX (Primary)
    INNER JOIN world.city IGNORE INDEX (CountryCode)
    ON city.CountryCode = country.Code
    WHERE Continent = 'Asia';
    The pseudo code for performing this join is similar to that of a block nested loop
    except that the columns needed for the join are hashed and that there is support for
    overflowing to disk.
    1. result = []
    2. join_buffer = []
    3. partitions = 0
    4. on_disk = False
    5. for country_row in country:
    6. if country_row.Continent == 'Asia':
    7. hash = xxHash64(country_row.Code)
    8. if not on_disk:
    9. join_buffer.append(hash)
    10. if is_full(join_buffer):
    11. # Create partitions on disk
    12. on_disk = True
    13. partitions = write_buffer_to_disk(join_buffer)
    14. join_buffer = []
    15. else
    16. write_hash_to_disk(hash)
    17. if not on_disk:
    18. for city_row in city:
    19. hash = xxHash64(city_row.CountryCode)
    20. if hash in join_buffer:
    21. country_row = get_row(hash)
    22. city_row = get_row(hash)
    23. result.append(join_rows(country_row, city_row))
    24. else:
    25. for city_row in city:
    26. hash = xxHash64(city_row.CountryCode)
    27. write_hash_to_disk(hash)
    28. for partition in range(partitions):
    29. join_buffer = load_build_from_disk(partition)
    30. for hash in load_hash_from_disk(partition):
    31. if hash in join_buffer:
    32. country_row = get_row(hash)
    33. city_row = get_row(hash)
    34. result.append(join_rows(country_row, city_row))
    35. join_buffer = []

    The pseudo code starts out reading the rows from the country table and calculates
    the hash for the Code column and stores it in the join buffer. If the buffer becomes full,
    then the code switches to the on-disk algorithm and writes out the hashes from the
    buffer. This is also where the number of partitions is determined. After this the rest of the
    country table is hashed.
    In the next part, for the in-memory algorithm, there is a simple loop over the rows in
    the city table comparing the hashes to those in the buffer. For the on-disk algorithm, the
    hashes of the city table are first calculated and stored on disk; then the partitions are
    handled one by one.

    The values of the Code column for the matching rows from the country table are
    hashed and stored in the join buffer. Then a table scan is executed for the city table with
    the hash calculated of CountryCode for each row, and the result is constructed from the
    matching rows.

    You can see the query performs very well with the hash join examining the same
    number of rows as the block nested loop, but it is faster than an index join. This is not a
    mistake: in some cases, a hash join can outperform even an index join. You can use the
    following rules to estimate how the hash join algorithm will perform compared to index
    and block nested loop joins:
    • For a join without using an index, the hash join will usually be much
    faster than a block nested join unless a LIMIT clause has been added.
    Improvements of more than a factor of 1000 have been observed. 4
    • For a join without an index where there is a LIMIT clause, a block
    nested loop can exit when enough rows have been found, whereas
    a hash join will complete the entire join (but can skip fetching the
    rows). If the number of rows included due to the LIMIT clause is small
    compared to the total number of rows found by the join, a block
    nested loop may be faster.
    • For joins supporting an index, the hash join algorithm can be faster if
    the index has a low selectivity.
    The biggest benefit using the hash joins is by far for joins without an index and
    without a LIMIT clause. In the end, only testing can prove which join strategy is the
    optimal for your queries.

    You can enable and disable support for hash joins using the hash_join optimizer
    switch. Additionally, the block_nested_loop optimizer switch must be enabled. Both are
    enabled by default. If you want to configure the use of hash joins for specific joins, you
    can use the HASH_JOIN() and NO_HASH_JOIN() optimizer hints.
    That concludes the discussion about the three high-level join strategies supported in
    MySQL. There are some lower-level optimizations as well that are worth considering.

  • 相关阅读:
    【软考】系统架构设计师 - 知识扩展 - “区块链技术“
    优雅的接口防刷处理方案
    介绍五个很实用的IDEA使用技巧
    Linux中sh与bash的区别
    CSS 小球随着椭圆移动
    【计算机网络】poll | epoll
    赞奇科技参与华为云828 B2B企业节,云工作站入选精选产品解决方案
    docker consul集群
    Web基础与http协议
    python——类
  • 原文地址:https://blog.csdn.net/adofsauron/article/details/125901265