• Big Data Development: Hive Case Study 10 - Optimizing a Cartesian Product on a Large Table


    1. Problem Description

    Requirement:
    Table overview:

    dt            time partition
    data_source   data source category
    start_date    time
    data_count    count at that time
    

    Requirement to implement

    For each data_source, compute the running (cumulative) total of data_count ordered by start_date.
    

    SQL code:

    select dt,
           data_source,
           start_date,
           data_count,
           sum(data_count) over(partition by data_source order by start_date) as data_cum_count
      from table_name
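
    To make the intended result concrete, here is a minimal Python sketch of what the window function is expected to return. The function name `running_totals` and the sample rows are made up for illustration, and the sketch assumes each start_date occurs at most once per data_source (with ties, Hive's default window frame assigns tied rows the same total).

```python
from collections import defaultdict


def running_totals(rows):
    """rows: iterable of (data_source, start_date, data_count).

    Returns (data_source, start_date, data_count, data_cum_count) tuples,
    where data_cum_count is the running total within each data_source
    ordered by start_date.
    """
    by_source = defaultdict(list)
    for source, start_date, count in rows:
        by_source[source].append((start_date, count))

    out = []
    for source in sorted(by_source):
        total = 0
        # Equivalent of "partition by data_source order by start_date".
        for start_date, count in sorted(by_source[source]):
            total += count
            out.append((source, start_date, count, total))
    return out


sample = [
    ("a", "2023-05-01 10:00", 3),
    ("b", "2023-05-01 09:00", 5),
    ("a", "2023-05-01 09:00", 2),
    ("a", "2023-05-02 08:00", 4),
]
for row in running_totals(sample):
    print(row)
```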
    

    Run log:
    The log suggests data skew: the reduce phase stays stuck at 99%, and after a while the job gets cut off.

    2023-05-30 12:05:40,318 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 2693.11 sec
    2023-05-30 12:05:41,349 Stage-1 map = 100%,  reduce = 76%, Cumulative CPU 2716.81 sec
    2023-05-30 12:05:43,411 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 2774.08 sec
    2023-05-30 12:05:45,478 Stage-1 map = 100%,  reduce = 78%, Cumulative CPU 2795.55 sec
    2023-05-30 12:05:46,509 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 2851.83 sec
    2023-05-30 12:05:47,547 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 2880.86 sec
    2023-05-30 12:05:51,678 Stage-1 map = 100%,  reduce = 81%, Cumulative CPU 2935.67 sec
    2023-05-30 12:05:52,710 Stage-1 map = 100%,  reduce = 84%, Cumulative CPU 3031.14 sec
    2023-05-30 12:05:54,772 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 3086.83 sec
    2023-05-30 12:05:56,833 Stage-1 map = 100%,  reduce = 86%, Cumulative CPU 3101.59 sec
    2023-05-30 12:06:00,956 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 3213.04 sec
    2023-05-30 12:06:07,173 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 3332.53 sec
    2023-05-30 12:06:08,209 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 3348.58 sec
    2023-05-30 12:06:09,241 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 3399.05 sec
    2023-05-30 12:06:10,272 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 3456.29 sec
    2023-05-30 12:06:12,334 Stage-1 map = 100%,  reduce = 95%, Cumulative CPU 3503.32 sec
    2023-05-30 12:06:14,406 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 3550.1 sec
    2023-05-30 12:06:15,433 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 3576.75 sec
    2023-05-30 12:06:19,561 Stage-1 map = 100%,  reduce = 98%, Cumulative CPU 3674.46 sec
    2023-05-30 12:06:29,878 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 3860.69 sec
    2023-05-30 12:07:30,726 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 4349.64 sec
    2023-05-30 12:08:31,498 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 4622.97 sec
    2023-05-30 12:09:32,161 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 4857.09 sec
    2023-05-30 12:10:32,788 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5046.44 sec
    2023-05-30 12:11:33,443 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5196.55 sec
    2023-05-30 12:12:34,216 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5325.04 sec
    2023-05-30 12:13:34,952 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5454.34 sec
    2023-05-30 12:14:35,677 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5584.3 sec
    2023-05-30 12:15:36,383 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5722.47 sec
    2023-05-30 12:16:37,011 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5796.86 sec
    2023-05-30 12:17:37,641 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5864.27 sec
    2023-05-30 12:18:38,284 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5929.96 sec
    2023-05-30 12:19:38,916 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 5999.27 sec
    2023-05-30 12:20:39,508 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6066.16 sec
    2023-05-30 12:21:40,153 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6133.75 sec
    2023-05-30 12:22:40,776 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6202.56 sec
    2023-05-30 12:23:41,326 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6271.21 sec
    2023-05-30 12:24:41,947 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6338.7 sec
    2023-05-30 12:25:42,696 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6406.98 sec
    2023-05-30 12:26:43,307 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6474.84 sec
    2023-05-30 12:27:43,873 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6543.65 sec
    2023-05-30 12:28:44,449 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6610.24 sec
    2023-05-30 12:29:45,003 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6679.73 sec
    2023-05-30 12:30:45,623 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6746.93 sec
    2023-05-30 12:31:46,118 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6822.78 sec
    2023-05-30 12:32:46,658 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6890.72 sec
    2023-05-30 12:33:47,212 Stage-1 map = 100%,  reduce = 99%, Cumulative CPU 6959.17 sec
    

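    Web UI log:
    The web UI shows that the reduce attempt died because its container was killed by the ApplicationMaster. The same message says that a speculative attempt of this task (r_000001) succeeded first, which is the usual reason the ApplicationMaster kills the remaining attempt.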

    Speculation: attempt_1680276634497_67940_r_000001_1 succeeded first! [2023-05-30 10:56:47.400]Container killed by the ApplicationMaster. [2023-05-30 10:56:47.422]Container killed on request. Exit code is 143 [2023-05-30 10:56:47.442]Container exited with a non-zero exit code 143.
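
    As a side note on reading this message: exit code 143 follows the common convention that 128 + N means "terminated by signal N", so 143 corresponds to SIGTERM, a requested kill rather than a crash. A small sketch of that decoding (the helper name `describe_exit_code` is only illustrative):

```python
import signal


def describe_exit_code(code):
    """Explain a process exit code using the 128 + N signal convention."""
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited with status {code}"


print(describe_exit_code(143))
```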
    
    

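    After a while the whole job was killed.

    2. Solution

    2.1 Data skew

    Because the reduce phase was stuck at 99%, data skew was the first suspect. A later check confirmed that the data_source column is indeed skewed.

    Parameter tuning:
    This made no difference.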

    -- increase the number of reducers
    set mapred.reduce.tasks = 100;
    set hive.auto.convert.join = true;
    -- treat a join key with more than 100,000 rows as skewed
    -- (only takes effect for joins, and only when hive.optimize.skewjoin is enabled)
    set hive.skewjoin.key=100000;
    
    set mapreduce.map.memory.mb=16384;
    -- overridden by the 5120 setting two lines below
    set mapreduce.reduce.memory.mb=2048;
    set yarn.nodemanager.vmem-pmem-ratio=4.1;
    set mapreduce.reduce.memory.mb=5120;
    set mapred.map.child.java.opts=-Xmx13106M;
    set mapreduce.map.java.opts=-Xmx13106M;
    set mapreduce.reduce.java.opts=-Xmx13106M;
    set mapreduce.task.io.sort.mb=512;
    set mapreduce.job.reduce.slowstart.completedmaps=0.8;
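
    One thing worth checking in settings like these is whether the JVM heap (-Xmx) fits inside the YARN container requested for the same task type; a heap larger than the container can get the container killed for exceeding its memory limit. The sketch below is only an illustrative check: the helper `heap_fits_container` is hypothetical, it handles only -Xmx values written in megabytes, and the dictionary simply repeats the final values from the list above.

```python
import re


def heap_fits_container(settings, task):
    """Return True if the -Xmx heap for `task` ('map' or 'reduce')
    is no larger than the container memory requested for it."""
    container_mb = int(settings[f"mapreduce.{task}.memory.mb"])
    # Only the "-Xmx<number>M" form is handled in this sketch.
    match = re.search(r"-Xmx(\d+)[mM]", settings[f"mapreduce.{task}.java.opts"])
    heap_mb = int(match.group(1))
    return heap_mb <= container_mb


settings = {
    "mapreduce.map.memory.mb": "16384",
    "mapreduce.map.java.opts": "-Xmx13106M",
    "mapreduce.reduce.memory.mb": "5120",
    "mapreduce.reduce.java.opts": "-Xmx13106M",
}

for task in ("map", "reduce"):
    print(task, heap_fits_container(settings, task))
```

    With these values the map side is consistent (13106 MB is about 80% of 16384 MB), while the reduce heap is larger than the 5120 MB reduce container.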
    

    Code changes:
    Run the heavily skewed data_source values in a separate query.
    This did not help either.

    -- only the heavily skewed data_source values
    select dt,
           data_source,
           start_date,
           data_count,
           sum(data_count) over(partition by data_source order by start_date) as data_cum_count
      from table_name
     where data_source in (<skewed data_source values>);
    
    -- everything else
    select dt,
           data_source,
           start_date,
           data_count,
           sum(data_count) over(partition by data_source order by start_date) as data_cum_count
      from table_name
     where data_source not in (<skewed data_source values>);
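
    Finding the skewed values comes down to counting rows per data_source (in Hive, a `group by data_source` with `count(*)`). A minimal Python sketch of the same idea follows; the function name `skewed_keys`, the threshold, and the sample rows are assumptions for illustration. Note that splitting the query this way probably cannot help by itself, since all rows of one window partition are still processed together.

```python
from collections import Counter


def skewed_keys(rows, threshold):
    """rows: iterable of tuples whose first field is data_source.

    Returns {data_source: row_count} for keys with more than
    `threshold` rows.
    """
    counts = Counter(source for source, *_ in rows)
    return {key: n for key, n in counts.items() if n > threshold}


rows = [("a", 1)] * 5 + [("b", 1)] * 2
print(skewed_keys(rows, threshold=3))
```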
    

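    2.2 SQL rewrite 1: from a window function to a plain join

    Unsure whether Hive's window function itself was the problem, I rewrote the original SQL using a table join and temporary tables.

    Code: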

    select t1.dt,t1.data_source,t1.start_date,
           sum(t2.data_count) as data_cum_count
      from table_name t1
      left join table_name t2
        on t1.data_source = t2.data_source
     -- keep only rows of the same data_source that are not later than the current row
     where t1.start_date >= t2.start_date
     group by t1.dt,t1.data_source,t1.start_date;
    

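    Result:
    One of the jobs again stalled at reduce 99%, but after about 20 minutes it completed.
    The whole SQL finished in about 30 minutes.

    The logic is the same, yet the join version succeeds where the window function does not. My guess is that one works in memory while the other writes to disk.

    However:
    This is a test table with only one month of data. The historical backfill covers several years, so this SQL would only get slower.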

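    2.3 Analyzing the data distribution

    The largest data_source has more than 90,000 rows, so the self-join produces about 8.1 billion row pairs (90,000 × 90,000) for that key alone. The cluster has 20 nodes and decent resources, yet the query still takes more than half an hour.

    With a year or several years of data, the Cartesian product would be far larger.

    So the SQL has to be rewritten.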
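
    The arithmetic behind this can be sketched in a few lines of Python. The helper `self_join_pairs` is illustrative, and the one-year figure assumes, purely for the sake of the example, that the row count grows linearly to 12 times the one-month volume.

```python
def self_join_pairs(rows_per_key):
    """Row pairs produced by a self-join on the key before any filter:
    each key with n rows contributes n * n pairs."""
    return sum(n * n for n in rows_per_key.values())


one_month = {"largest_data_source": 90_000}
# Assumption: 12 months of data means 12 times as many rows per key.
one_year = {key: n * 12 for key, n in one_month.items()}

print(self_join_pairs(one_month))
print(self_join_pairs(one_year) // self_join_pairs(one_month))
```

    Because the pair count is quadratic in the rows per key, 12 times the data means 144 times the pairs.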

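    2.4 SQL rewrite 2: restructuring the query

    We need the cumulative count at every start_date. We can first compute the total for each day, then the cumulative total across days, then the running total over start_date within each day. Adding the cumulative total up to the previous day to the within-day running total gives the final result.

    SQL code: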

    with tmp1 as (
    -- total data_count per data_source and day
    select t1.data_source,t1.dt,sum(t1.data_count) as sum_v_dt
        from table_name t1
      group by t1.data_source,t1.dt
     ),
    tmp2 as (
    -- cumulative total over all earlier days (NULL on the first day of a data_source)
    select data_source,
           dt,
           sum(sum_v_dt) over(partition by data_source order by dt
                              rows between unbounded preceding and 1 preceding) as sum_v_cum_dt
      from tmp1
    )
    -- running total within the day plus the total carried over from earlier days
    select t2.data_source,
           t2.dt,
           t2.start_date,
           nvl(sum(t2.data_count) over(partition by t2.data_source,t2.dt order by t2.start_date),0) + nvl(tmp2.sum_v_cum_dt,0) as sum_v_cum_dt_sdate
      from table_name t2
      join tmp2
        on t2.data_source = tmp2.data_source
       and t2.dt = tmp2.dt;
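
    The idea behind this rewrite can be checked on a small example. The Python sketch below compares the direct running total with the two-level version (daily totals, carry-over from earlier days, then a running total within each day). The function names and sample rows are made up, and it assumes that every start_date falls inside its dt partition and is unique per data_source.

```python
from collections import defaultdict


def direct_running_total(rows):
    """Running total per data_source ordered by start_date."""
    out, totals = {}, defaultdict(int)
    for source, _dt, start_date, count in sorted(rows, key=lambda r: (r[0], r[2])):
        totals[source] += count
        out[(source, start_date)] = totals[source]
    return out


def two_level_running_total(rows):
    # Step 1 (tmp1): total per (data_source, dt).
    daily = defaultdict(int)
    for source, dt, _start_date, count in rows:
        daily[(source, dt)] += count

    # Step 2 (tmp2): total of all earlier days per data_source.
    carried, running = {}, defaultdict(int)
    for source, dt in sorted(daily):
        carried[(source, dt)] = running[source]
        running[source] += daily[(source, dt)]

    # Step 3: running total within each day plus the carry-over.
    out, within = {}, defaultdict(int)
    for source, dt, start_date, count in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        within[(source, dt)] += count
        out[(source, start_date)] = within[(source, dt)] + carried[(source, dt)]
    return out


rows = [
    ("a", "2023-05-01", "2023-05-01 09:00", 2),
    ("a", "2023-05-01", "2023-05-01 10:00", 3),
    ("a", "2023-05-02", "2023-05-02 08:00", 4),
    ("a", "2023-05-04", "2023-05-04 08:00", 1),  # a day is skipped on purpose
    ("b", "2023-05-02", "2023-05-02 07:00", 5),
]
print(direct_running_total(rows) == two_level_running_total(rows))
```

    Each step touches every row only once or works on the much smaller per-day totals, so no step pairs every row of a data_source with every other row.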
    

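    Run record:
    The final run time is about 5 minutes.
    Even if the data volume grows several-fold, the join key is now (data_source, dt) instead of data_source alone, which greatly reduces the number of joined rows and the overall amount of computation.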

  • Original article: https://blog.csdn.net/u010520724/article/details/131081536