• Hive中的explode函数、posexplode函数与later view函数


    1.概述

      在离线数仓处理通过HQL业务数据时,经常会遇到行转列或者列转行之类的操作,就像concat_ws之类的函数被广泛使用,今天这个也是经常要使用的拓展方法。

    2.explode函数

    2.1 函数语法

    -- explode(a) - separates the elements of array a into multiple rows, or the elements of a map into multiple rows and columns 
    Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFExplode
    Function type:BUILTIN
    -- explode()用于array的语法如下
    select explode(arraycol) as newcol from tablename;
    -- explode()用于map的语法如下:
    select explode(mapcol) as (keyname,valuename) from tablename;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    2.2 函数说明

    • explode 函数是UDTF函数,将hive一列中复杂的array或者map结构拆分成多行。
    • Explode函数是不允许在select再有其他字段,
      • explode(ARRAY) 列表中的每个元素生成一行。
      • explode(MAP) map中每个key-value对,生成一行,key为一列,value为一列。

    2.3 使用案例

    -- explode (array)
    select explode(array('A','B','C'));
    select explode(array('A','B','C')) as col;
    select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf;
    select tf.* from (select 0) t lateral view explode(array('A','B','C')) tf as col;
    -- 结果
    col
    A
    B
    C
    -- explode (map)
    select explode(map('A',10,'B',20,'C',30));
    select explode(map('A',10,'B',20,'C',30)) as (key,value);
    select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf;
    select tf.* from (select 0) t lateral view explode(map('A',10,'B',20,'C',30)) tf as key,value;
    -- 结果
    key value
    A	10
    B	20
    C	30
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20

    3.posexplode函数

    2.1 函数语法

    -- posexplode(a) - behaves like explode for arrays, but includes the position of items in the original array
    Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFPosExplode
    Function type:BUILTIN
    select posexplode(ARRAY<T> a)
    -- Explodes an array to multiple rows with additional positional column of int type (position of items in the original array, starting with 0). Returns a row-set with two columns (pos,val), one row for each element from the array.
    
    • 1
    • 2
    • 3
    • 4
    • 5

    2.2 函数说明

    • posexplode 函数,将ARRAY数组a展开,每个Value一行,每行两列分别对应数组从0开始的下标和数组元素。

    2.3 使用案例

    -- posexplode (array)
    select posexplode(array('A','B','C'));
    select posexplode(array('A','B','C')) as (pos,val);
    select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf;
    select tf.* from (select 0) t lateral view posexplode(array('A','B','C')) tf as pos,val;
    -- 结果
    pos val
    0	A
    1	B
    2	C
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10

    4.later view

    4.1 语法

    lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
    fromClause: FROM baseTable (lateralView)*
    -- columnAlias是给udtf(expression)列起的别名。
    -- tableAlias 虚拟表的别名。
    
    • 1
    • 2
    • 3
    • 4

    4.2 用法描述

    • lateral view为侧视图,意义是为了配合UDTF来使用,把某一行数据拆分成多行数据。
    • 不加lateral view的UDTF只能提取单个字段拆分,并不能塞回原来数据表中。
    • 加上lateral view就可以将拆分的单个字段数据与原始表数据关联上。
    • lateral view函数会将UDTF生成的结果放到一个虚拟表中,然后虚拟表中的数据和输入行进行join来达到连接UDTF外的select字段的目的。(本质是笛卡尔积)

    4.3 使用案例

    4.3.1 准备数据

    下表 pageAds. 它有两个字段: pageid (页码) and adid_list (页面上的adid):

    Column nameColumn type
    pageidSTRING
    adid_listArray

    表中数据如下:

    pageidadid_list
    front_page[1, 2, 3]
    contact_page[3, 4, 5]

    需求: 统计各个页面出现的广告的次数

    4.3.2 代码实现

    第一步: 使用 lateral view 和 explore() 函数将 adid_list 列的 list 拆分,sql代码如下:

    select pageid, adid
    FROM pageAds lateral view explode(adid_list) ad_view as adid;
    
    • 1
    • 2

    可的如下结果

    pageidadid
    front_page1
    front_page2
    front_page3
    contact_page4
    contact_page5

    第二步: 使用 count/group by 语句统计出每个adid出现的次数:

    select adid,count(1) as cnt
    FROM pageAds lateral view explode(adid_list) ad_view as adid
    group by adid;
    
    • 1
    • 2
    • 3
    adidcnt
    11
    21
    32
    41
    51

    4.4 Multiple Lateral Views

    FROM子句可以有多个LATERAL VIEW子句。 后面的LATERAL VIEWS子句可以引用出现在LATERAL VIEWS左侧表的任何列。

    例如,如下查询:

    SELECT * FROM exampleTable
    LATERAL VIEW explode(col1) myTable1 AS myCol1
    LATERAL VIEW explode(col2) myTable2 AS myCol2;
    
    • 1
    • 2
    • 3

    例如使用以下基表:

    Array pageid_listArray adid_list
    [1, 2, 3][“a”, “b”, “c”]
    [3, 4][“c”, “d”]

    单个Lateral View查询:

    SELECT pageid_list, adid
    FROM pageAds_1
             LATERAL VIEW explode(adid_list) adTable AS adid;
    [1,2,3]	a
    [1,2,3]	b
    [1,2,3]	c
    [4,5]	c
    [4,5]	d
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8

    多个Lateral View查询:

    select pageid,adid FROM pageAds_1
    lateral view explode(pageid_list) adTable as pageid
    lateral view explode(adid_list) adTable as adid;
    1,a
    1,b
    1,c
    2,a
    2,b
    2,c
    3,a
    3,b
    3,c
    3,c
    3,d
    4,c
    4,d
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16

    4.5 later view json_tuple()

    4.5.1 准备数据
    create table lateral_tal_3
    (
        id   int,
        col1 string,
        col2 string
    );
    
     insert into lateral_tal_3 values(1234,'{"part1" : "61", "total" : "623", "part2" : "560", "part3" : "1", "part4" : "1"}','	{"to_part2" : "0", "to_part4" : "0", "to_up" : "0", "to_part3" : "0", "to_part34" : "0"}'),
    (4567,'{"part1" : "451", "total" : "89928", "part2" : "88653", "part3" : "789", "part4" : "35"}','{"to_part2" : "54", "to_part4" : "6", "to_up" : "65", "to_part3" : "2", "to_part34" : "3"}'),
    (7890,'{"part1" : "142", "total" : "351808", "part2" : "346778", "part3" : "4321", "part4" : "567"}','{"to_part2" : "76", "to_part4" : "23", "to_up" : "65", "to_part3" : "14", "to_part34" : "53"}');
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    idcol1col2
    1234{“part1” : “61”, “total” : “623”, “part2” : “560”, “part3” : “1”, “part4” : “1”}{“to_part2” : “0”, “to_part4” : “0”, “to_up” : “0”, “to_part3” : “0”, “to_part34” : “0”}
    4567{“part1” : “451”, “total” : “89928”, “part2” : “88653”, “part3” : “789”, “part4” : “35”}{“to_part2” : “54”, “to_part4” : “6”, “to_up” : “65”, “to_part3” : “2”, “to_part34” : “3”}
    7890{“part1” : “142”, “total” : “351808”, “part2” : “346778”, “part3” : “4321”, “part4” : “567”}{“to_part2” : “76”, “to_part4” : “23”, “to_up” : “65”, “to_part3” : “14”, “to_part34” : “53”}

    需求: 解析非结构化的json数据类型

    “json_tuple(jsonStr, p1, p2, …, pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.”
    Function class:org.apache.hadoop.hive.ql.udf.generic.GenericUDTFJSONTuple
    Function type:BUILTIN

    json_tuple : 第一个参数是json 字符串所在的列名,其它参数是获取 json 字符串中的哪些key值;

    4.5.2 代码实现
    SELECT id,
           part1,
           part3,
           part4,
           to_part2,
           to_part3,
           to_part4,
           IF(part3 = 0, 0.0, to_part3 / part3) as ratio3,
           IF(part4 = 0, 0.0, to_part4 / part4) as ratio4
    FROM lateral_tal_3
             lateral VIEW json_tuple(col1, 'part3', 'part4', 'part1') json1 AS part3, part4, part1
             lateral VIEW json_tuple(col2, 'to_part2','to_part3', 'to_part4') json2 AS to_part2, to_part3, to_part4
    ;
    
    1234,61,1,1,0,0,0,0,0
    4567,451,789,35,54,2,6,0.0025348542458808617,0.17142857142857143
    7890,142,4321,567,76,14,23,0.0032399907428835918,0.04056437389770723
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18

    5.使用案例

    需求1: 如何产生1-100的连续的数字?

    --方式1: 结合space函数与split函数,posexplode函数,lateral view函数获得
    select id_start + pos as id
    from (
             select 1   as id_start,
                    100 as id_end
         ) m lateral view posexplode(split(space(id_end - id_start), '')) t as pos, val;
      
    -- 方式2:结合space函数与split函数,explode函数,lateral view函数+窗口函数获得
    select row_number() over () as id
    from (select split(space(99), '') as x) t
             lateral view
                 explode(x) ex;
    -- 方式2:结合space函数与split函数,posexplode函数,lateral view函数获取
    from (select split(space(99), ' ') as x) t
             lateral view
                 posexplode(x) ex as pos,val;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16

    需求2: 获取2024-07-15至2024-07-29间所有的日期

    SELECT pos,
           date_add(start_date, pos) dd
    FROM (SELECT '2024-07-15' AS start_date, '2024-07-29' AS end_date) temp
             lateral VIEW
                 posexplode(split(space(datediff(end_date, start_date)), '')) t
             AS pos, val;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
  • 相关阅读:
    Config 分布式配置中心
    leetcode 494.目标和 动态规划背包问题 (c++版本)
    【c++刷题Day2】专题2线性DP
    golang指针和结构体、序列化
    非标准化套利
    自动创建表分区存储
    el-dialog element-ui弹窗
    数据结构 栈Stack的理解与C语言实现
    Linux镜像下载及虚拟机中安装
    Carla自动驾驶仿真八:两种查找CARLA地图坐标点的方法
  • 原文地址:https://blog.csdn.net/weixin_44852067/article/details/136751954