最近公司在从本地大数据环境迁移到CDP的集群。在跑相同的SQL代码时,两边对比发现数据有问题,然后就开始查找原因:在对数据进行分组排序时,两个环境的NULL值默认排序不同。
select id
, times
, row_number() over (partition by id order by times desc) as rn
from (select 1 as id, 1659258809386 as times
union all
select 1 as id, null as times
) t1;
CDP Hive的执行计划如下
Plan optimized by CBO.
Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_12]
Select Operator [SEL_10] (rows=2 width=16)
Output:["_col0","_col1","_col2"]"
PTF Operator [PTF_9] (rows=2 width=12)
Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
Select Operator [SEL_8] (rows=2 width=12)
Output:["_col0","_col1"]"
<-Union 2 [SIMPLE_EDGE]
<-Map 1 [CONTAINS]
Reduce Output Operator [RS_7]
PartitionCols:1
Select Operator [SEL_6] (rows=2 width=12)
Output:["_col1"]"
Select Operator [SEL_1] (rows=1 width=12)
Output:["_col1"]"
TableScan [TS_0] (rows=1 width=1)
<-Map 4 [CONTAINS]
Reduce Output Operator [RS_7]
PartitionCols:1
Select Operator [SEL_6] (rows=2 width=12)
Output:["_col1"]
Select Operator [SEL_3] (rows=1 width=8)
Output:["_col1"]
TableScan [TS_2] (rows=1 width=1)
本地的执行计划结果如下:
Plan not optimized by CBO.
Vertex dependency in root stage
Map 1 <- Union 2 (CONTAINS)
Map 4 <- Union 2 (CONTAINS)
Reducer 3 <- Union 2 (SIMPLE_EDGE)
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3
File Output Operator [FS_12]
Select Operator [SEL_10] (rows=2 width=16)
Output:["_col0","_col1","_col2"]"
PTF Operator [PTF_9] (rows=2 width=12)
Function definitions:[{},{"name:":"windowingtablefunction","order by:":"_col1 DESC NULLS LAST","partition by:":"_col0"}]"
Select Operator [SEL_8] (rows=2 width=12)
Output:["_col0","_col1"]"
<-Union 2 [SIMPLE_EDGE]
<-Map 1 [CONTAINS]
Reduce Output Operator [RS_7]
PartitionCols:1
Select Operator [SEL_6] (rows=2 width=12)
Output:["_col1"]"
Select Operator [SEL_1] (rows=1 width=12)
Output:["_col1"]"
TableScan [TS_0] (rows=1 width=1)
<-Map 4 [CONTAINS]
Reduce Output Operator [RS_7]
PartitionCols:1
Select Operator [SEL_6] (rows=2 width=12)
Output:["_col1"]
Select Operator [SEL_3] (rows=1 width=8)
Output:["_col1"]
TableScan [TS_2] (rows=1 width=1)
发现是对NULL值的排序顺序两边不一致
对比结果如下:

SQL standard does not specify the behavior by default. Currently in Hive, null values sort as if lower than any non-null value; that is, NULLS FIRST is the default for ASC order, and NULLS LAST for DESC order.
即:默认情况下,SQL标准不会指定该行为。目前在Hive中,空值排序似乎低于任何非空值;即ASC顺序为NULLS FIRST, DESC顺序为NULLS LAST。
如下图所示:

但是CDP的Hive默认NULL值排序是NULLS FIRST。
nulls last ,或者使用nvl()函数对空值做处理