• 【hive】行转列—explode()/posexplode()/lateral view 函数使用场景



    一、lateral view函数

    • 功能:

      • 用于和UDTF函数(explode,split)结合使用,把某一行数据拆分成多行数据,再将多行结果组合成一个支持别名的虚拟表。
      • 主要解决select使用UDTF做查询的过程中查询只能包含单个UDTF,不能包含其它字段以及多个UDTF的情况。
    • 语法:LATERAL VIEW udtf(expression) tableAlias AS columnAlias (‘,’ columnAlias)

      • columnAlias是给udtf(expression)列起的别名。
      • tableAlias 虚拟表的别名。
    • lateral view函数会将lUDTF生成的结果放到一个虚拟表中,然后这个虚拟表和输入行进行join来达到连接UDTF外的select字段的目的。

    使用方式:

    举例1:与explode() 函数连用

    --测试表的数据结构,可以看到第一列是id(string类型),第二列是rule_array(array<string>类型)
    > desc test_2;
    
    --结果:
    +-------------+----------------+----------+
    |  col_name   |   data_type    | comment  |
    +-------------+----------------+----------+
    | id          | string         |          |
    | rule_array  | array<string>  |          |
    +-------------+----------------+----------+
    
    --查看测试表的内容
    > select * from test_2;
    
    --结果:
    +------------+----------------------------------------+
    | test_2.id  |           test_2.rule_array            |
    +------------+----------------------------------------+
    | a          | ["501","502","503","501","512"]        |
    | b          | ["511","512","513","511","512","511"]  |
    | c          | ["512","513","511","512"]              |
    +------------+----------------------------------------+
    
    --拆分rule_array
    > select id, rule_explode 
    > from test_2
    > lateral view explode(rule_array) adTable AS rule_explode;
    
    --结果:
    +-----+---------------+
    | id  | rule_explode  |
    +-----+---------------+
    | a   | 501           |
    | a   | 502           |
    | a   | 503           |
    | a   | 501           |
    | a   | 512           |
    | b   | 511           |
    | b   | 512           |
    | b   | 513           |
    | b   | 511           |
    | b   | 512           |
    | b   | 511           |
    | c   | 512           |
    | c   | 513           |
    | c   | 511           |
    | c   | 512           |
    +-----+---------------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48

    举例2:与parse_url_tuple()函数连用

    • parse_url_tuple()函数
      • 语法:parse_url(string urlString, string partToExtract [, string keyToExtract])
      • 功能:UDTF函数,解析URL字符串,它通过一次指定多个参数,从URL中解析出多个参数值返回多列。
      • 参数:partToExtract的有效值为:HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO。
      • 注意:在使用 select 时不能同时返回其他字段,不能嵌套调用,不能与group by放在一起调用等。为解决此问题,可以通过测试图(Lateral View)搭配调用。
    --测试表的数据结构,可以看到第一列是id(int类型),第二列是url(string类型)
    > desc tb_url;
    
    --结果:
    +-----------+------------+----------+
    | col_name  | data_type  | comment  |
    +-----------+------------+----------+
    | id        | int        |          |
    | url       | string     |          |
    +-----------+------------+----------+
    
    --查看测试表的内容
    > select * from tb_url;
    
    --结果:
    +------------+-------------------------------------------------+
    | tb_url.id  |                   tb_url.url                    |
    +------------+-------------------------------------------------+
    | 1          | http://facebook.com/path/p1.php?query=1         |
    | 2          | http://www.baidu.com/news/index.jsp?uuid=frank  |
    | 3          | http://www.jd.com/index?source=baidu            |
    +------------+-------------------------------------------------+
    
    --提取tb_url的query部分
    > select
    > a.id as id,
    > c.query as query
    > from tb_url a
    > lateral view parse_url_tuple(url,"QUERY") c as query
    > ;
    
    --结果:
    +-----+---------------+
    | id  |     query     |
    +-----+---------------+
    | 1   | query=1       |
    | 2   | uuid=frank    |
    | 3   | source=baidu  |
    +-----+---------------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39

    举例3:多个lateral view连用

    --提取tb_url的HOST、PATH和QUERY部分
    > select
    > a.id as id,
    > b.host as host,
    > b.path as path,
    > c.query as query
    > from tb_url a
    > lateral view parse_url_tuple(url,"HOST","PATH") b as host,path
    > lateral view parse_url_tuple(url,"QUERY") c as query
    > ;
    
    --结果:
    +-----+----------------+------------------+---------------+
    | id  |      host      |       path       |     query     |
    +-----+----------------+------------------+---------------+
    | 1   | facebook.com   | /path/p1.php     | query=1       |
    | 2   | www.baidu.com  | /news/index.jsp  | uuid=frank    |
    | 3   | www.jd.com     | /index           | source=baidu  |
    +-----+----------------+------------------+---------------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19

    注意:

    • lateral view函数的位置在from后where条件前。
    • 生成虚拟表的别名不可省略。
    • from后可带多个 lateral view函数。
    • 如果要拆分的字段有null值,需要使用 lateral view outer替代,避免数据缺失。

    二、explode()函数

    • 功能:UDTF函数,可以将一个array或者map展开
    • 语法:explode(map ,array )
      • explode(array):将array列表里的每个元素生成一行
      • explode(map):每一对元素作为一行,key为一列,value为一列

    使用方式:

    举例1:直接使用

    --explode(array)
    > select explode(array(11,22,33)) as item;
    
    --结果:
    +-------+
    | item  |
    +-------+
    | 11    |
    | 22    |
    | 33    |
    +-------+
    
    --explode(map)
    > select explode(map("id",11,"name","lily","age",18)) as (k,v);
    
    --结果:
    +-------+-------+
    |   k   |   v   |
    +-------+-------+
    | id    | 11    |
    | name  | lily  |
    | age   | 18    |
    +-------+-------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23

    举例2:与lateral view一起使用

    • 语法:lateral view explode(字段) 表别名 as 列别名
    • 如果字段类型是map:lateral view explode(字段) 表别名 as (key别名,value别名)
    > select * from test_2;
    
    --结果:
    +------------+----------------------------------------+
    | test_2.id  |           test_2.rule_array            |
    +------------+----------------------------------------+
    | a          | ["501","502","503","501","512"]        |
    | b          | ["511","512","513","511","512","511"]  |
    | c          | ["512","513","511","512"]              |
    +------------+----------------------------------------+
    
    > select id,rule
    > from test_2 
    > lateral view explode(rule_array) tbl as rule
    > ;
    
    --结果:
    +-----+-------+
    | id  | rule  |
    +-----+-------+
    | a   | 501   |
    | a   | 502   |
    | a   | 503   |
    | a   | 501   |
    | a   | 512   |
    | b   | 511   |
    | b   | 512   |
    | b   | 513   |
    | b   | 511   |
    | b   | 512   |
    | b   | 511   |
    | c   | 512   |
    | c   | 513   |
    | c   | 511   |
    | c   | 512   |
    +-----+-------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36

    举例3:使用局限性

    • 不能关联原有的表中其它字段。例如:
    > select id, explode(rule_array) as rule from test_2;
    
    --报错:
    Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
    
    • 1
    • 2
    • 3
    • 4
    • 不能与group by、cluster by 、distribute by、sort by联用。例如:
    > select explode(rule_array) as rule from test_2 group by explode(rule_array);
    
    --报错:
    Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
    
    • 1
    • 2
    • 3
    • 4
    • 不能进行UDTF嵌套。例如:
    > select explode(explode(rule_array)) from test_2 ;
    
    --报错:
    Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)
    
    • 1
    • 2
    • 3
    • 4
    • 不允许选择其他表达式。例如:
    > select explode("1,2,3,4,5") from test_2 ; 
    
    --报错:
    Error: Error while compiling statement: FAILED: UDFArgumentException explode() takes an array or a map as a parameter (state=42000,code=40000)
    
    • 1
    • 2
    • 3
    • 4

    三、posexplode()函数

    • 功能:UDTF函数,将一个array或者map展开,可以将index(索引)和数据都取出来,使用两次posexplode()并用where语句使两次取到的index相等,从而实现对多列进行多行转换。explode()函数只能将对一列进行行转换。
    • 语法:posexplode(map ,array )

    使用方式:

    举例1:

    > desc test_2;
    
    --结果:
    +-------------+----------------+----------+
    |  col_name   |   data_type    | comment  |
    +-------------+----------------+----------+
    | id          | string         |          |
    | dt_array    | array<string>  |          |
    | rule_array  | array<string>  |          |
    +-------------+----------------+----------+
    
    > select * from test_2;
    
    --结果:
    +------------+----------------------------------------------------+----------------------------------------+
    | test_2.id  |                  test_2.dt_array                   |           test_2.rule_array            |
    +------------+----------------------------------------------------+----------------------------------------+
    | a          | ["20230809","20230811","20230812","20230812","20230813"] | ["501","502","503","501","512"]        |
    | b          | ["20230809","20230811","20230812","20230812","20230813","20230809"] | ["511","512","513","511","512","511"]  |
    | c          | ["20230811","20230812","20230812","20230813"]      | ["512","513","511","512"]              |
    +------------+----------------------------------------------------+----------------------------------------+
    
    --注:语句中b.idxb和c.idxc分别是子视图的两个索引,我们where子句中使用b.idxb=c.idxc保证顺序的一致性。
    > select a.id,b.cur_day,c.rule 
    > from test_2 a
    > lateral view posexplode(dt_array) b as idxb,cur_day,
    > lateral view posexplode(rule_array) c as idxc,rule where b.idxb=c.idxc
    > ;
    
    --结果:
    +-------+------------+---------+
    | a.id  | b.cur_day  | c.rule  |
    +-------+------------+---------+
    | a     | 20230809   | 501     |
    | a     | 20230811   | 502     |
    | a     | 20230812   | 503     |
    | a     | 20230812   | 501     |
    | a     | 20230813   | 512     |
    | b     | 20230809   | 511     |
    | b     | 20230811   | 512     |
    | b     | 20230812   | 513     |
    | b     | 20230812   | 511     |
    | b     | 20230813   | 512     |
    | b     | 20230809   | 511     |
    | c     | 20230811   | 512     |
    | c     | 20230812   | 513     |
    | c     | 20230812   | 511     |
    | c     | 20230813   | 512     |
    +-------+------------+---------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49

    四、行转列使用

    单列转多行

    举例1:使用explode()+lateral view

    > select t.id,a.cur_day
    > from test_2 t
    > lateral view explode(t.dt_array) a as cur_day
    > ;
    
    --结果:
    +-------+------------+
    | t.id  | a.cur_day  |
    +-------+------------+
    | b     | 20230809   |
    | b     | 20230811   |
    | b     | 20230812   |
    | b     | 20230812   |
    | b     | 20230813   |
    | b     | 20230809   |
    | c     | 20230811   |
    | c     | 20230812   |
    | c     | 20230812   |
    | c     | 20230813   |
    | a     | 20230809   |
    | a     | 20230811   |
    | a     | 20230812   |
    | a     | 20230812   |
    | a     | 20230813   |
    +-------+------------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25

    举例2:explode字段是string格式,先split()指定分隔符,如果省略split()则使用默认分隔符是英文逗号。

    > desc test_3;
    
    --结果:
    +-----------+------------+----------+
    | col_name  | data_type  | comment  |
    +-----------+------------+----------+
    | id        | string     |          |
    | dt        | string     |          |
    | rule1     | string     |          |
    +-----------+------------+----------+
    
    > select * from test_3;
    
    --结果:
    +------------+----------------------------------------------------+--------------------------+
    | test_3.id  |                     test_3.dt                      |       test_3.rule1       |
    +------------+----------------------------------------------------+--------------------------+
    | a          | 20230809,20230811,20230812,20230812,20230813       | 501,502,503,501,512      |
    | b          | 20230809,20230811,20230812,20230812,20230813,20230809 | 511,512,513,511,512,511  |
    | c          | 20230811,20230812,20230812,20230813                | 512,513,511,512          |
    +------------+----------------------------------------------------+--------------------------+
    
    > select t.id,a.cur_day
    > from test_3 t
    > lateral view explode(split(t.dt,',')) a as cur_day
    > ;
    
    --结果:
    +-------+------------+
    | t.id  | a.cur_day  |
    +-------+------------+
    | a     | 20230809   |
    | a     | 20230811   |
    | a     | 20230812   |
    | a     | 20230812   |
    | a     | 20230813   |
    | b     | 20230809   |
    | b     | 20230811   |
    | b     | 20230812   |
    | b     | 20230812   |
    | b     | 20230813   |
    | b     | 20230809   |
    | c     | 20230811   |
    | c     | 20230812   |
    | c     | 20230812   |
    | c     | 20230813   |
    +-------+------------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47

    多列转多行

    举例1:使用posexplode()+lateral view

    先测试使用explode(), 看看效果:

    > select  a.id,b.cur_day,c.rule 
    > from test_2 a
    > lateral view explode(dt_array) b as cur_day,
    > lateral view explode(rule_array) c as rule
    > ;
    
    --结果:
    +-------+------------+---------+
    | a.id  | b.cur_day  | c.rule  |
    +-------+------------+---------+
    | b     | 20230809   | 511     |
    | b     | 20230809   | 512     |
    | b     | 20230809   | 513     |
    | b     | 20230809   | 511     |
    | b     | 20230809   | 512     |
    | b     | 20230809   | 511     |
    | b     | 20230811   | 511     |
    | b     | 20230811   | 512     |
    | b     | 20230811   | 513     |
    | b     | 20230811   | 511     |
    | b     | 20230811   | 512     |
    | b     | 20230811   | 511     |
    | b     | 20230812   | 511     |
    | b     | 20230812   | 512     |
    | b     | 20230812   | 513     |
    | b     | 20230812   | 511     |
    | b     | 20230812   | 512     |
    | b     | 20230812   | 511     |
    | b     | 20230812   | 511     |
    | b     | 20230812   | 512     |
    | b     | 20230812   | 513     |
    | b     | 20230812   | 511     |
    | b     | 20230812   | 512     |
    | b     | 20230812   | 511     |
    | b     | 20230813   | 511     |
    | b     | 20230813   | 512     |
    | b     | 20230813   | 513     |
    | b     | 20230813   | 511     |
    | b     | 20230813   | 512     |
    | b     | 20230813   | 511     |
    | b     | 20230809   | 511     |
    | b     | 20230809   | 512     |
    | b     | 20230809   | 513     |
    | b     | 20230809   | 511     |
    | b     | 20230809   | 512     |
    | b     | 20230809   | 511     |
    | c     | 20230811   | 512     |
    | c     | 20230811   | 513     |
    | c     | 20230811   | 511     |
    | c     | 20230811   | 512     |
    | c     | 20230812   | 512     |
    | c     | 20230812   | 513     |
    | c     | 20230812   | 511     |
    | c     | 20230812   | 512     |
    | c     | 20230812   | 512     |
    | c     | 20230812   | 513     |
    | c     | 20230812   | 511     |
    | c     | 20230812   | 512     |
    | c     | 20230813   | 512     |
    | c     | 20230813   | 513     |
    | c     | 20230813   | 511     |
    | c     | 20230813   | 512     |
    | a     | 20230809   | 501     |
    | a     | 20230809   | 502     |
    | a     | 20230809   | 503     |
    | a     | 20230809   | 501     |
    | a     | 20230809   | 512     |
    | a     | 20230811   | 501     |
    | a     | 20230811   | 502     |
    | a     | 20230811   | 503     |
    | a     | 20230811   | 501     |
    | a     | 20230811   | 512     |
    | a     | 20230812   | 501     |
    | a     | 20230812   | 502     |
    | a     | 20230812   | 503     |
    | a     | 20230812   | 501     |
    | a     | 20230812   | 512     |
    | a     | 20230812   | 501     |
    | a     | 20230812   | 502     |
    | a     | 20230812   | 503     |
    | a     | 20230812   | 501     |
    | a     | 20230812   | 512     |
    | a     | 20230813   | 501     |
    | a     | 20230813   | 502     |
    | a     | 20230813   | 503     |
    | a     | 20230813   | 501     |
    | a     | 20230813   | 512     |
    +-------+------------+---------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62
    • 63
    • 64
    • 65
    • 66
    • 67
    • 68
    • 69
    • 70
    • 71
    • 72
    • 73
    • 74
    • 75
    • 76
    • 77
    • 78
    • 79
    • 80
    • 81
    • 82
    • 83
    • 84
    • 85
    • 86
    • 87
    • 88

    出现这种情况,是因为两个并列的explode()的hql没办法识别cur_day对应的rule是什么,对于多个数组的行转列可以使用posexplode()函数。
    例如使用如下查询语句:

    > select a.id,b.cur_day,c.rule 
    > from test_2 a
    > lateral view posexplode(dt_array) b as idxb,cur_day,
    > lateral view posexplode(rule_array) c as idxc,rule where b.idxb=c.idxc
    > ;
    
    --结果:
    +-------+------------+---------+
    | a.id  | b.cur_day  | c.rule  |
    +-------+------------+---------+
    | b     | 20230809   | 511     |
    | b     | 20230811   | 512     |
    | b     | 20230812   | 513     |
    | b     | 20230812   | 511     |
    | b     | 20230813   | 512     |
    | b     | 20230809   | 511     |
    | c     | 20230811   | 512     |
    | c     | 20230812   | 513     |
    | c     | 20230812   | 511     |
    | c     | 20230813   | 512     |
    | a     | 20230809   | 501     |
    | a     | 20230811   | 502     |
    | a     | 20230812   | 503     |
    | a     | 20230812   | 501     |
    | a     | 20230813   | 512     |
    +-------+------------+---------+
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
  • 相关阅读:
    OceanBase 分布式数据库【信创/国产化】- OceanBase Demo 环境搭建
    Spring 01: Spring配置 + 基于xml的IOC控制反转 + Setter注入
    位移运算法的解释与hashMap的应用。
    【Web实战-Tomcat-Servlet-Thymeleaf -JDBC-MySQL】浏览器页面显示数据库数据(水果库存系统)
    120.三角形最小路径和
    SpringCloud 微服务全栈体系(三)
    前端培训丁鹿学堂:vue的性能优化(一)
    .NET Core Web API扩展框架
    OpenCV项目实战(2)— 如何用OpenCV实现弹球动画
    PowerQuery领域的经典之作“猴子书“中文版来啦!
  • 原文地址:https://blog.csdn.net/sodaloveer/article/details/132837340