• hive学习之grouping set /cube/rollup


    GROUPING SETS clause

    The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set. All GROUPING SET clauses can be logically expressed in terms of several GROUP BY queries connected by UNION. Table-1 shows several such equivalent statements. This is helpful in forming the idea of the GROUPING SETS clause. A blank set ( ) in the GROUPING SETS clause calculates the overall aggregate.

    这句话就是说 grouping sets 让我们一句sql可以计算出多个group by的值。

    Table 1 - GROUPING SET queries and the equivalent GROUP BY queries

    Aggregate Query with GROUPING SETS

    Equivalent Aggregate Query with GROUP BY

    SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b) )

    SELECT a, b, SUM(c) FROM tab1 GROUP BY a, b

    SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a,b), a)

    SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

    UNION

    SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

    SELECT a,b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS (a,b)

    SELECT a, null, SUM( c ) FROM tab1 GROUP BY a

    UNION

    SELECT null, b, SUM( c ) FROM tab1 GROUP BY b

    SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b GROUPING SETS ( (a, b), a, b, ( ) )

    SELECT a, b, SUM( c ) FROM tab1 GROUP BY a, b

    UNION

    SELECT a, null, SUM( c ) FROM tab1 GROUP BY a, null

    UNION

    SELECT null, b, SUM( c ) FROM tab1 GROUP BY null, b

    UNION

    SELECT null, null, SUM( c ) FROM tab1

    数据准备

    create table test.cc_groupingset(
    a int ,
    b int ,
    c int );
    insert into test.cc_groupingset values
    (1,1,1),
    (2,2,2),
    (3,3,3),
    (4,4,4),
    (1,null,null),
    (null,2,null),
    (null,null,3)

    需求1 , 算出 根据a,b分组后c的sum值

    select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets((a,b))

    select a,b,sum(c) from test.cc_groupingset group by a,b

     

    需求2,同时算出需求1也就是根据a,b分组的sum(c),和根据a分组后的sum(c)

    select a,b,sum(c) from  test.cc_groupingset group by a,b GROUPING sets((a,b),a)

    select a,b,sum(c) from test.cc_groupingset group by a,b

    union 

    select a,null,sum(c) from test.cc_groupingset group by a

    注意我下面标红的第3、8、12行就是group by a 的值。

     

    需求3,同时根据a分组后的sum(c)和根据b分组后的sum(c)

    select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

    explain 
    select null,b,sum(c) from test.cc_groupingset group by b
    union 
    select a,null,sum(c) from test.cc_groupingset group by a

     需求4,算个各个维度的sum(c),即根据group a / group b/ group a,b/ 不分组直接sum

    select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b,(a,b))

    根据我之前的select *可以看到 

    group by a 有 5条 group by b 有 5条,group by a,b 有 7条 ,直接sum()有1条

    所以总数=5+5+7+1=18条

     现在来总结下用法。

    就是在我们根据多个维度分组的时候,例如group by a,b,c,d 有时候我们也需要这几个维度中部分维度的聚合值 例如group by a,b 。这个时候通过group by a,b,c,d grouping sets((a,b,c,d),(a,b))即可,看起来简洁,使用方便。

    上面的grouping sets 还是要手动指定各个维度。

    Cubes and Rollups

    The general syntax is WITH CUBE/ROLLUP. It is used with the GROUP BY only. CUBE creates a subtotal of all possible combinations of the set of column in its argument. Once we compute a CUBE on a set of dimension, we can get answer to all possible aggregation questions on those dimensions.

    It might be also worth mentioning here that
    GROUP BY a, b, c WITH CUBE is equivalent to
    GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (b, c), (a, c), (a), (b), (c), ( )).

    ROLLUP clause is used with GROUP BY to compute the aggregate at the hierarchy levels of a dimension.
    GROUP BY a, b, c with ROLLUP assumes that the hierarchy is "a" drilling down to "b" drilling down to "c".

    GROUP BY a, b, c, WITH ROLLUP is equivalent to GROUP BY a, b, c GROUPING SETS ( (a, b, c), (a, b), (a), ( )).

    简单来说WITH CUBE 就是各个维度都包含,

    with ROLLUP 就是根据第一个维度去看所有和这个维度一起的聚合结果

     因为cube 和roll up计算的维度比较多,所以会有多个任务使用下列参数优化

    hive.new.job.grouping.set.cardinality

    Whether a new map-reduce job should be launched for grouping sets/rollups/cubes.
    For a query like: select a, b, c, count(1) from T group by a, b, c with rollup;
    4 rows are created per row: (a, b, c), (a, b, null), (a, null, null), (null, null, null)
    This can lead to explosion across map-reduce boundary if the cardinality of T is very high
    and map-side aggregation does not do a very good job.

    This parameter decides if hive should add an additional map-reduce job. If the grouping set
    cardinality (4 in the example above), is more than this value, a new MR job is added under the
    assumption that the orginal group by will reduce the data size

    grouping__id

    还是之前的原始数据

    我们之前写的sql 有个问题。就是结果虽然全部出来了,但是我根本分不清谁是谁,比如

    a=1,b=null ,sum(c)=1  这个到底是根据a 分的组 b默认是null ,还是根据ab 分组,a=1 b=null

    所以多了一个Grouping__ID   

    之前的18条数据,杂乱不堪。

    select a,b,sum(c),grouping__ID from test.cc_groupingset group by a,b with cube order by grouping__ID 

     

    为什么要学这种。因为这个效率高。

    explain select a,b,sum(c) from test.cc_groupingset group by a,b GROUPING sets(a,b)

    explain 
    select null,b,sum(c) from test.cc_groupingset group by b
    union 
    select a,null,sum(c) from test.cc_groupingset group by a

    通过explain可以看到reduce的个数减少了 

     

     

  • 相关阅读:
    RFID产线自动化升级改造管理方案
    重学设计模式之-桥接模式
    《云计算:云端协同,智慧互联》
    Spring AOP的失效场景
    Tomcat 9.0.54源码环境搭建
    2024.6.14刷题记录-KMP记录
    优化算法|MOAVOA:一种新的多目标人工秃鹰优化算法(Matlab代码实现)
    地级市-空气流动系数数据-更新至2019(含10米风速、边界高度等)
    Java 多线程写zip文件遇到的错误 write beyond end of stream!
    基于SVM的车牌识别算法
  • 原文地址:https://blog.csdn.net/cclovezbf/article/details/126705112