【Hive】CDPHiveNULL值排序前后的问题

最近公司在从本地大数据环境迁移到CDP的集群。在跑相同的SQL代码时，两边对比发现数据有问题，然后就开始查找原因：在对数据进行分组排序时，两个环境的NULL值默认排序不同。

排错

1、首先我们执行相同的SQL，对比了两边SQL的执行计划，

select id
     , times
     , row_number() over (partition by id order by times desc) as rn
from (select 1 as id, 1659258809386 as times
      union all
      select 1 as id, null as times
      ) t1;
1
2
3
4
5
6
7

CDP Hive的执行计划如下

Plan optimized by CBO.

Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_12]
        Select Operator [SEL_10] (rows=2 width=16)
           Output:["_col0","_col1","_col2"]"
          PTF Operator [PTF_9] (rows=2 width=12)
             Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
            Select Operator [SEL_8] (rows=2 width=12)
               Output:["_col0","_col1"]"
            <-Union 2 [SIMPLE_EDGE]
              <-Map 1 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]"
                    Select Operator [SEL_1] (rows=1 width=12)
                       Output:["_col1"]"
                      TableScan [TS_0] (rows=1 width=1)
              <-Map 4 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]
                    Select Operator [SEL_3] (rows=1 width=8)
                       Output:["_col1"]
                      TableScan [TS_2] (rows=1 width=1)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37

本地的执行计划结果如下：

Plan not optimized by CBO.

Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_12]
        Select Operator [SEL_10] (rows=2 width=16)
           Output:["_col0","_col1","_col2"]"
          PTF Operator [PTF_9] (rows=2 width=12)
             Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
            Select Operator [SEL_8] (rows=2 width=12)
               Output:["_col0","_col1"]"
            <-Union 2 [SIMPLE_EDGE]
              <-Map 1 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]"
                    Select Operator [SEL_1] (rows=1 width=12)
                       Output:["_col1"]"
                      TableScan [TS_0] (rows=1 width=1)
              <-Map 4 [CONTAINS]
                Reduce Output Operator [RS_7]
                  PartitionCols:1
                  Select Operator [SEL_6] (rows=2 width=12)
                     Output:["_col1"]
                    Select Operator [SEL_3] (rows=1 width=8)
                       Output:["_col1"]
                      TableScan [TS_2] (rows=1 width=1)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36

发现是对NULL值的排序顺序两边不一致
对比结果如下：
在这里插入图片描述

2、于是去官网确认默认的配置什么
本地集群用的是Hive的2.7.3版本，在Hive官网发现ISSUE： HIVE-12994

SQL standard does not specify the behavior by default. Currently in Hive, null values sort as if lower than any non-null value; that is, NULLS FIRST is the default for ASC order, and NULLS LAST for DESC order.
即：默认情况下，SQL标准不会指定该行为。目前在Hive中，空值排序似乎低于任何非空值；即ASC顺序为NULLS FIRST， DESC顺序为NULLS LAST。
如下图所示：
在这里插入图片描述
但是CDP的Hive默认NULL值排序是NULLS FIRST。

3、解决办法：
1)改动代码在排序语句后加nulls last ，或者使用nvl()函数对空值做处理
2)修改配置不修改代码：
CDP中默认·hive.default.nulls.last=true·，如果您需要在集群中永久设置·hive.default.nulls.last=false·
您可如下设置：
CM - Hive on tez - 配置 - 搜索hive_hs2_config_safety_valve - 添加hive.default.nulls.last=false - 保存并重启服务。

相关阅读:
文件上传 upload 组件 on-success 事件，添加自定义参数
 把短信验证码储存在Redis
c/c++常见的数据类型表示的范围
 nodejs+vue宁夏旅游景点客流量数据分析
 一、VSCode——免安装
 RedisKey的基本命令
 6 年大厂程序员跟你聊聊，算法其实没那么难，要怎么准备比较好
 SpringCloud——服务注册——Consul
Spring中Bean的作用域和生命周期
 第一章三层交换应用
原文地址：https://blog.csdn.net/weixin_42804692/article/details/126130321