

    Fixing Spark 2.x reading null for a column renamed in a Hive partitioned table

    1. Problem Description

    A column was renamed using Hive syntax; for example, column A55 was changed to A66. The Hive DDL syntax is:

    ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type  [COMMENT col_comment]  [FIRST|AFTER column_name]
    

    The Hive table is an ORC-format partitioned table, and the Spark version is 2.x.

    Symptom: after the column was renamed, spark-sql reads the renamed column as null in all of the table's historical partitions. Newly written partitions are not affected; the renamed column reads correctly there.
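
    A minimal way to reproduce the symptom (the database, table, column, and partition names below are made up for illustration): run the DDL in Hive, then query the pre-rename partition from spark-sql 2.4 with default settings.

    -- In Hive: create an ORC partitioned table and load one partition before the rename
    CREATE TABLE demo_db.orc_part_tbl (A55 STRING) PARTITIONED BY (dt STRING) STORED AS ORC;
    INSERT INTO TABLE demo_db.orc_part_tbl PARTITION (dt='2022-10-01') VALUES ('old value');

    -- In Hive: rename the column
    ALTER TABLE demo_db.orc_part_tbl CHANGE COLUMN A55 A66 STRING;

    -- In spark-sql 2.4 with defaults: the old partition returns NULL for A66,
    -- while partitions written after the rename read correctly
    SELECT A66 FROM demo_db.orc_part_tbl WHERE dt = '2022-10-01';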

    2. Troubleshooting

    During troubleshooting, the same test against a table stored in another format (e.g., TEXTFILE) did not show the problem, so the next step was searching the Spark site and JIRA for issues related to ORC tables.

    It turned out the likely cause is a change in the default behavior behind spark.sql.hive.convertMetastoreOrc: since Spark 2.4, Spark reads Hive ORC tables with its own SerDe instead of Hive's SerDe (SerDe is short for Serializer/Deserializer, used to serialize and deserialize records). The Spark migration guide describes the change:

    In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like TBLPROPERTIES (parquet.compression 'NONE'). This happens for ORC Hive table properties like TBLPROPERTIES (orc.compress 'NONE') in case of spark.sql.hive.convertMetastoreOrc=true, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE') would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.

    Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, CREATE TABLE t(id int) STORED AS ORC would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To restore the previous behavior, set spark.sql.hive.convertMetastoreOrc to false.
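
    Whether a given session is on the new code path can be checked (and, per the note above, switched back) with SET; a small sketch, assuming the conf can be adjusted at session level:

    -- Show the effective value (true by default since Spark 2.4)
    SET spark.sql.hive.convertMetastoreOrc;
    -- Restore the Spark 2.3 behavior (read ORC Hive tables through Hive SerDe)
    SET spark.sql.hive.convertMetastoreOrc=false;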

    The relevant source code is in the RelationConversions rule in HiveStrategies.scala, where, depending on configuration, ORC files are handled by one of two implementations:

    org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
    org.apache.spark.sql.hive.orc.OrcFileFormat

    Of the two, org.apache.spark.sql.execution.datasources.orc.OrcFileFormat is the one that leads to the problem described above.

    case class RelationConversions(
        conf: SQLConf,
        sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
      private def convert(relation: HiveTableRelation): LogicalRelation = {
        val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)
    
        // Consider table and storage properties. For properties existing in both sides, storage
        // properties will supersede table properties.
        if (serde.contains("parquet")) {
          val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
            relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
            conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
          sessionCatalog.metastoreCatalog
            .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
        } else {
          val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
            relation.tableMeta.storage.properties
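          // spark.sql.orc.impl = "native": convert to Spark's built-in ORC data source (vectorized reader)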
          if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
            sessionCatalog.metastoreCatalog.convertToLogicalRelation(
              relation,
              options,
              classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
              "orc")
          } else {
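            // spark.sql.orc.impl = "hive": convert using the Hive-based ORC file format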
            sessionCatalog.metastoreCatalog.convertToLogicalRelation(
              relation,
              options,
              classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
              "orc")
          }
        }
      }
    }
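
    Which branch is taken depends on spark.sql.orc.impl (the key behind SQLConf.ORC_IMPLEMENTATION); its default changed to native in Spark 2.4, so the conversion picks Spark's own OrcFileFormat unless the conf is overridden, e.g.:

    -- "native" -> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat (Spark's built-in reader)
    -- "hive"   -> org.apache.spark.sql.hive.orc.OrcFileFormat
    SET spark.sql.orc.impl=hive;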
    

    3. Summary

    Since Spark 2.4, Spark converts ORC Hive tables by default (spark.sql.hive.convertMetastoreOrc=true) and reads them with its own vectorized ORC reader instead of Hive's SerDe. Because Spark SQL no longer uses the same SerDe as Hive, Parquet/ORC Hive tables can be handled differently, which is what breaks reads of the renamed column in historical partitions. To make Spark use Hive's SerDe again, set the following parameters:

    SET spark.sql.hive.convertMetastoreOrc=false;
    SET spark.sql.hive.convertMetastoreParquet=false;
    
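    These can be set in the spark-sql session (or passed via --conf at submit time) before querying old partitions. Following the conclusion above, the historical partition from the sketch in section 1 then reads correctly again:

    -- With the two settings above in effect, the renamed column is readable in the old partition
    SELECT A66 FROM demo_db.orc_part_tbl WHERE dt = '2022-10-01';
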
  Original article: https://blog.csdn.net/qq_32727095/article/details/127648606