• 【Hive】CDPHiveNULL值排序前后的问题


    最近公司在从本地大数据环境迁移到CDP的集群。在跑相同的SQL代码时,两边对比发现数据有问题,然后就开始查找原因:在对数据进行分组排序时,两个环境的NULL值默认排序不同。

    排错

    • 1、首先我们执行相同的SQL,对比了两边SQL的执行计划,
    select id
         , times
         , row_number() over (partition by id order by times desc) as rn
    from (select 1 as id, 1659258809386 as times
          union all
          select 1 as id, null as times
          ) t1;
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7

    CDP Hive的执行计划如下

    Plan optimized by CBO.
    
    Vertex dependency in root stage
    Map 1 <- Union 2 (CONTAINS)
    Map 4 <- Union 2 (CONTAINS)
    Reducer 3 <- Union 2 (SIMPLE_EDGE)
    
    Stage-0
      Fetch Operator
        limit:-1
        Stage-1
          Reducer 3
          File Output Operator [FS_12]
            Select Operator [SEL_10] (rows=2 width=16)
               Output:["_col0","_col1","_col2"]"
              PTF Operator [PTF_9] (rows=2 width=12)
                 Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
                Select Operator [SEL_8] (rows=2 width=12)
                   Output:["_col0","_col1"]"
                <-Union 2 [SIMPLE_EDGE]
                  <-Map 1 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]"
                        Select Operator [SEL_1] (rows=1 width=12)
                           Output:["_col1"]"
                          TableScan [TS_0] (rows=1 width=1)
                  <-Map 4 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]
                        Select Operator [SEL_3] (rows=1 width=8)
                           Output:["_col1"]
                          TableScan [TS_2] (rows=1 width=1)
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37

    本地的执行计划结果如下:

    Plan not optimized by CBO.
    
    Vertex dependency in root stage
    Map 1 <- Union 2 (CONTAINS)
    Map 4 <- Union 2 (CONTAINS)
    Reducer 3 <- Union 2 (SIMPLE_EDGE)
    
    Stage-0
      Fetch Operator
        limit:-1
        Stage-1
          Reducer 3
          File Output Operator [FS_12]
            Select Operator [SEL_10] (rows=2 width=16)
               Output:["_col0","_col1","_col2"]"
              PTF Operator [PTF_9] (rows=2 width=12)
                 Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
                Select Operator [SEL_8] (rows=2 width=12)
                   Output:["_col0","_col1"]"
                <-Union 2 [SIMPLE_EDGE]
                  <-Map 1 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]"
                        Select Operator [SEL_1] (rows=1 width=12)
                           Output:["_col1"]"
                          TableScan [TS_0] (rows=1 width=1)
                  <-Map 4 [CONTAINS]
                    Reduce Output Operator [RS_7]
                      PartitionCols:1
                      Select Operator [SEL_6] (rows=2 width=12)
                         Output:["_col1"]
                        Select Operator [SEL_3] (rows=1 width=8)
                           Output:["_col1"]
                          TableScan [TS_2] (rows=1 width=1)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36

    发现是对NULL值的排序顺序两边不一致
    对比结果如下:

    在这里插入图片描述

    • 2、于是去官网确认默认的配置什么
      本地集群用的是Hive的2.7.3版本,在Hive官网发现ISSUE: HIVE-12994

    SQL standard does not specify the behavior by default. Currently in Hive, null values sort as if lower than any non-null value; that is, NULLS FIRST is the default for ASC order, and NULLS LAST for DESC order.
    即:默认情况下,SQL标准不会指定该行为。目前在Hive中,空值排序似乎低于任何非空值;即ASC顺序为NULLS FIRST, DESC顺序为NULLS LAST
    如下图所示:
    在这里插入图片描述
    但是CDP的Hive默认NULL值排序是NULLS FIRST。

    • 3、解决办法
        1)改动代码在排序语句后加nulls last ,或者使用nvl()函数对空值做处理
        2)修改配置不修改代码:
      CDP中默认·hive.default.nulls.last=true·,如果您需要在集群中永久设置·hive.default.nulls.last=false·
      您可如下设置:
      CM - Hive on tez - 配置 - 搜索hive_hs2_config_safety_valve - 添加hive.default.nulls.last=false - 保存并重启服务。
  • 相关阅读:
    文件上传 upload 组件 on-success 事件,添加自定义参数
    把短信验证码储存在Redis
    c/c++常见的数据类型表示的范围
    nodejs+vue宁夏旅游景点客流量数据分析
    一、VSCode——免安装
    RedisKey的基本命令
    6 年大厂程序员跟你聊聊,算法其实没那么难,要怎么准备比较好
    SpringCloud——服务注册——Consul
    Spring中Bean的作用域和生命周期
    第一章三层交换应用
  • 原文地址:https://blog.csdn.net/weixin_42804692/article/details/126130321