phmap 学习和思考

phmap 学习和思考
hashing 分类

 Closed hashing

Closed hashing：也称为开放寻址（open addressing），是 hashmap 的一种实现方式，在这种方法中，所有元素都直接存储在 hashmap 的数组里。当发生hash碰撞时，按照某种方法在 hashmap 的底层数组中寻找另一个空槽位，这个过程称为探测。探测方法包括：
- 线性探测：当发生冲突时，顺序检查表中的下一个槽位，直到找到一个空槽位。这种方法简单，但可能导致聚集（clustering）问题，即连续的槽位被占用，从而影响性能。
- 二次探测：当发生冲突时，探测序列不是简单地检查下一个槽位，而是检查当前位置加上探测次数的平方的位置。这种方法可以减少聚集问题，但仍然可能存在次级聚集。
- 双重散列探测：使用两个哈希函数，当发生冲突时，使用第二个哈希函数计算探测的步长。这种方法通常可以避免聚集，并且分布更加均匀。
Closed hashing的优点非常明显：
1. 不需要额外的指针和链表，占用的内存更少。
2. 不像链表hash中，元素在内存中较为分散，closed hashing 使用连续数组存储的，可以提高缓存命中率（cache-friendly）。
3. 由于连续存储，序列化和反序列化容易。
Closed hashing的缺点也非常明显：
1. 装填因子限制。当哈希表变得过于满时，性能会显著下降（探测开销变大），因此需要定期调整哈希表的大小。
2. 删除复杂。由于探测序列的存在，简单删除探测序列中的某个值，会造成探测序列的中断，因此需要为删除的值添加一个特殊标记以防止探测序列中断。
Open hashing

Open hashing：也称为链表哈希（chaining），其在哈希表的每个槽位上维护一个链表。当多个元素发生hash碰撞时，这些元素会被添加到这个槽位指向的链表中。

phmap

全称为 parallel hashmap。其中的 hashmap 类是基于google开源的 abseil 库中的 absl::flat_hash_map 实现的。它们都使用前文提到的 closed hashing。

相比于 stl 库中的 unordered_hashmap 的 open hashing，其使用的是 closed hashing，将 k-v pair 直接存储在数组中。下图是寻址的过程。

phmap 提供了 phmap::flat_hash_map 和 phmap::parallel_flat_hash_map 。前者的使用场景是，有大量的 hash map（即 hashmap 中有大量 item），而每个item 的 value 长度较小。后者的使用场景是，只有少量的 hash map（即少量的item），而每个 item 的 value 长度非常大（其支持并行处理，能同时分别更新不同item 的 value）。

absl::flat_hash_map 与 std::unordered_map 的比较

上面的图是往 map 中插入1亿个 k-v 对（大小为两个 8-byte 的int 类型）的过程中，所统计的内存占用和耗时。

从图中可以观察到无论是内存占用还是插入耗时，absl::flat_hash_map 的性能都是优于 std:unordered_map 的。

peak memory usage issue

从上图中可以观察到，absl::flat_hash_map 存在 peak memory 问题。该问题存在的原因是 absl::flat_hash_map 底层数组的扩容。当底层数组的容量到达总容量的 87.5% 的时候，就会进行扩容，扩容之后的总容量将会是扩容前总容量的 2 倍，扩容之后，原来数组中的元素会被重新 hash，存到新申请的数组中。当老数组中的元素全部移动到新数组之后，老数组才会被销毁。也就是说当新数组和老数组并存时，占用的内存就是老数组占用内存的3倍。

举个例子，假设运行当前程序的机器的 RAM 为 32 GB，当 absl::flat_hash_map 插入 10GB 数据时想要进行扩容（也就是说老数组容量为 11.42 GB，新分配的数组的容量为 22.85GB）此时的 peak memory 为 34.28 GB，这显然会造成 OOM 问题。

为了避免 peak memory issue，只能使用 sparsepp 或者 google 的 cpp-btree。

phmap 中的 parallel hashmap 是如何解决 peak memory usage 问题的

为了解决 peak memory 问题。phmap 提出了 parallel hashmap ，其内部维护了一个 submap 的列表，每个 submap 自己又是一个 hashmap，当 submap 的容量到达 87.5% 的时候，会自动进行扩容，而不是整个 parallel hashmap 一起进行扩容。下图是对要插入的 key 被分配到哪个 submap 的过程（真是天才）。（不过相对于不使用 submap 来说，增加了确定插入的 submap 的 index 的计算时间，和存储 submap 带来的一些额外开销）。

此处对 phmap::parallel_flat_hash_map 引入 submap 所带来的额外开销进行分析。

以插入为例子，可以拆解为三步：
1. 计算 key 的 hash 值
2. 对 hash 值计算 index，index = (hash ^ (h >> 3)) & 0x7
3. 将 k-v 插入到 submap 中
对于步骤 2 实现仅仅只需要几条指令，执行速度快。
对于步骤 1，开销会比较大，但是其通过给 submap 传递这个 hash 值，来帮助 submap 节省一次 hash 运算（真是天才）。

然而从上图可以发现盲目使用 parallel_flat_hash_map 可能会带来轻微得增大插入耗时的问题。

比较 phmap::flat_hash_map、 phmap::parallel_flat_hash_map 和 std::unordered_hash_map 的性能

上图表明 parallel_flat_hash_map 的性能出奇得好，而且可以观察到 submap 的内存扩容基本上发生在同一个时间点附近，由此可以表明，其提出的将 key 进行 hashing 然后来寻找 submap 的算法也是非常优秀的。

parallel_flat_hash_map 带来的天然的并行优势

因为 parallel_flat_hash_map 内部维护了 16 个 submap，因此其天然就是支持多线程处理的（只要让不同的线程同时写入的是不同的 submap 就可以了）。

但是当线程访问的是同一个 submap 的时候，仍然存在数据竞争问题，因此 parallel_flat_hash_map 模板参数可以传递一个 std::mutex ，利用互斥来解决多线程访问同一个 submap 的问题，当然这又会引入开锁解锁的的开销。

要解决锁带来的开销，实现真正的 lock-free。可以魔改如下代码
```
template <class HT>
void _fill_random_inner_mt(int64_t cnt, HT &hash, RSU &rsu)
{
    constexpr int64_t num_threads = 8;   // has to be a power of two
    std::unique_ptr<std::thread> threads[num_threads];

    auto thread_fn = [&hash, cnt, num_threads](int64_t thread_idx, RSU rsu) {
        size_t modulo = hash.subcnt() / num_threads;        // subcnt() returns the number of submaps

        for (int64_t i=0; i<cnt; ++i)                       // iterate over all values
        {
            unsigned int key = rsu.next();                  // get next key to insert
            size_t hashval = hash.hash(key);                // compute its hash
            size_t idx  = hash.subidx(hashval);             // compute the submap index for this hash
            if (idx / modulo == thread_idx)                 // if the submap is suitable for this thread
            {
                hash.insert(typename HT::value_type(key, 0)); // insert the value
                ++(num_keys[thread_idx]);                     // increment count of inserted values
            }
        }
    };

    // create and start 8 threads - each will insert in their own submaps
    // thread 0 will insert the keys whose hash direct them to submap0 or submap1
    // thread 1 will insert the keys whose hash direct them to submap2 or submap3
    // --------------------------------------------------------------------------
    for (int64_t i=0; i<num_threads; ++i)
        threads[i].reset(new std::thread(thread_fn, i, rsu));

    // rsu passed by value to threads... we need to increment the reference object
    for (int64_t i=0; i<cnt; ++i)
        rsu.next();
    
    // wait for the threads to finish their work and exit
    for (int64_t i=0; i<num_threads; ++i)
        threads[i]->join();
}
```
也就是为每个线程分配其能够处理的 submap 的index，对插入的全量数据计算 submap 的index，发现不是自负责的 submap 就不处理，发现是自己负责的 submap 就插入到 submap 中。使用这种方式的性能开销见下图

这里分析一下，在扩容时 peak memory 问题稍微被放大的原因。这是因为上图是在 8 线程下运行的，也就说同时会有 8 个 submap 进行扩容（一共有 16 个submap），peak memory 将是不使用 submap 的 1/2。

可以看到利用无锁还是比较麻烦的（自己要实现寻找submap的代码），因此我们还是使用phmap::flat_hash_map的互斥来实现吧（只要开多线程往 phmap::flat_hash_map 里面塞数据就好，其能够帮我们解决数据竞争）。

上图比较的是使用 1个线程往 phmap::flat_hash_map 插入1亿 k-v 和使用 8个线程并发得往 phmap::parallel_flat_hash_map 插入1亿 k-v 的性能。

std::unordered_map

unordered_map 的实现与一般的 open hashing 方案（每个 bucket 维护一个单独的链表）还略有不同，其底层只有一个单链表，bucket指向链表的不同位置（这样设计是为了让 iteration 更快）。

参考
1. parallel-hashmap
2. The Parallel Hashmap
本文大部分是翻译的，只加了一点点自己的感受，是自己的学习笔记。
相关阅读:
GIS Office国产基础软件，助力移动通信基础资源管理建设工程
 Java多线程专题之Callable、Future与FutureTask(含源码分析)
HarmonyOS 4.0 实况窗上线！支付宝实现医疗场景智能提醒
 游戏软件开发与应用软件开发有什么不同呢？
[WPF] 使用 Effect 玩玩阴影、内阴影、长阴影
 在Mac m1运行ChatGLM3-6B cpu版本1-3秒出结果，速度明显超过T4 GPU，接近V100。
Qt --- Day01
Java泛型方法与普通成员方法以及案例说明（五）
wait与notify的使用
 大数据学习(11)-hive on mapreduce详解
原文地址：https://blog.csdn.net/weixin_42877425/article/details/139884650

hashing 分类

Closed hashing

Open hashing

phmap

absl::flat_hash_map 与 std::unordered_map 的比较

peak memory usage issue

phmap 中的 parallel hashmap 是如何解决 peak memory usage 问题的

此处对 phmap::parallel_flat_hash_map 引入 submap 所带来的额外开销进行分析。

比较 phmap::flat_hash_map、 phmap::parallel_flat_hash_map 和 std::unordered_hash_map 的性能

parallel_flat_hash_map 带来的天然的并行优势

std::unordered_map

参考