A column was renamed with Hive DDL, e.g. renaming column A55 to A66, using Hive's ALTER TABLE syntax:
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
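Concretely, the rename looked like the following (the table name t_orc and the STRING column type are made up for illustration):

ALTER TABLE t_orc CHANGE COLUMN A55 A66 STRING;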
The Hive table is an ORC-format partitioned table; the Spark version is 2.x.
Symptom: after the column rename, spark-sql reads the renamed column as NULL in all of the table's historical partitions. Partitions written after the rename are unaffected: the renamed column reads back its values normally there.
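A minimal sketch of the scenario, assuming the hypothetical partitioned ORC table t_orc with a date partition dt; the DDL and writes go through Hive, and the two SELECTs are issued from spark-sql with Spark 2.4 defaults:

-- Partition written before the rename: its ORC files carry the old column name A55
CREATE TABLE t_orc (A55 STRING) PARTITIONED BY (dt STRING) STORED AS ORC;
INSERT OVERWRITE TABLE t_orc PARTITION (dt='2020-01-01') SELECT 'x';

-- Rename, then write a new partition: these ORC files carry the new name A66
ALTER TABLE t_orc CHANGE COLUMN A55 A66 STRING;
INSERT OVERWRITE TABLE t_orc PARTITION (dt='2020-01-02') SELECT 'y';

SELECT A66 FROM t_orc WHERE dt='2020-01-01';  -- NULL (historical partition)
SELECT A66 FROM t_orc WHERE dt='2020-01-02';  -- 'y'  (new partition, unaffected)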
While narrowing the problem down, the same test against a table in another format (e.g. textfile) showed no such issue, which pointed at ORC specifically. Searching the Spark site and JIRA for ORC-table changes turned up the likely cause: a change in the default value of the spark.sql.hive.convertMetastoreOrc parameter. Since Spark 2.4, Spark processes Hive ORC tables with its own SerDe rather than Hive's (SerDe is short for Serializer/Deserializer, the component responsible for serializing and deserializing rows). The Spark migration guide describes both changes:
In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like TBLPROPERTIES (parquet.compression 'NONE'). This happens for ORC Hive table properties like TBLPROPERTIES (orc.compress 'NONE') in case of spark.sql.hive.convertMetastoreOrc=true, too. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE') would generate Snappy parquet files during insertion in Spark 2.3, and in Spark 2.4, the result would be uncompressed parquet files.
Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. It means Spark uses its own ORC support by default instead of Hive SerDe. As an example, CREATE TABLE t(id int) STORED AS ORC would be handled with Hive SerDe in Spark 2.3, and in Spark 2.4, it would be converted into Spark's ORC data source table and ORC vectorization would be applied. To restore the previous behavior, set spark.sql.hive.convertMetastoreOrc to false.
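You can check which value a given session is actually running with directly from spark-sql; SET with a key and no value prints the current setting:

SET spark.sql.hive.convertMetastoreOrc;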
The relevant source is the RelationConversions rule in HiveStrategies.scala. Depending on configuration, it routes ORC tables to one of two implementations: org.apache.spark.sql.execution.datasources.orc.OrcFileFormat (Spark's native reader) or org.apache.spark.sql.hive.orc.OrcFileFormat (the Hive path). The native org.apache.spark.sql.execution.datasources.orc.OrcFileFormat is the one that exhibits the problem above: it resolves requested columns against the schema stored inside each ORC file by name, and the files in historical partitions still carry the old name A55, so the renamed column A66 matches nothing and comes back as NULL. Hive's SerDe resolves columns by position instead, which is why new partitions, whose files already contain A66, read fine either way.
case class RelationConversions(
    conf: SQLConf,
    sessionCatalog: HiveSessionCatalog) extends Rule[LogicalPlan] {
  private def convert(relation: HiveTableRelation): LogicalRelation = {
    val serde = relation.tableMeta.storage.serde.getOrElse("").toLowerCase(Locale.ROOT)

    // Consider table and storage properties. For properties existing in both sides, storage
    // properties will supersede table properties.
    if (serde.contains("parquet")) {
      val options = relation.tableMeta.properties.filterKeys(isParquetProperty) ++
        relation.tableMeta.storage.properties + (ParquetOptions.MERGE_SCHEMA ->
        conf.getConf(HiveUtils.CONVERT_METASTORE_PARQUET_WITH_SCHEMA_MERGING).toString)
      sessionCatalog.metastoreCatalog
        .convertToLogicalRelation(relation, options, classOf[ParquetFileFormat], "parquet")
    } else {
      val options = relation.tableMeta.properties.filterKeys(isOrcProperty) ++
        relation.tableMeta.storage.properties
      if (conf.getConf(SQLConf.ORC_IMPLEMENTATION) == "native") {
        // Native path: Spark's own vectorized ORC reader (the problematic one here).
        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
          relation,
          options,
          classOf[org.apache.spark.sql.execution.datasources.orc.OrcFileFormat],
          "orc")
      } else {
        // Hive path: read through the Hive ORC support.
        sessionCatalog.metastoreCatalog.convertToLogicalRelation(
          relation,
          options,
          classOf[org.apache.spark.sql.hive.orc.OrcFileFormat],
          "orc")
      }
    }
  }
}
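As a side note, the if condition above keys off SQLConf.ORC_IMPLEMENTATION, which is exposed to users as spark.sql.orc.impl ("hive" by default in Spark 2.3, "native" since 2.4). Setting it back to hive only selects the second branch when conversion does happen; the fix adopted below instead disables the conversion altogether:

SET spark.sql.orc.impl=hive;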
Spark 2.3 introduced the native vectorized ORC reader, and since Spark 2.4 spark.sql.hive.convertMetastoreOrc defaults to true, so Hive ORC tables are read with it by default. The side effect is that Spark SQL no longer goes through the same SerDe as Hive SQL, so Parquet/ORC Hive tables can be handled differently by the two engines. To make Spark fall back to Hive's SerDe, set the following parameters:
SET spark.sql.hive.convertMetastoreOrc=false;
SET spark.sql.hive.convertMetastoreParquet=false;
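As a quick sanity check (using the hypothetical t_orc table from the reproduction sketch above), the historical partition should return its value again once the conversion is disabled:

SET spark.sql.hive.convertMetastoreOrc=false;
SELECT A66 FROM t_orc WHERE dt='2020-01-01';  -- now read via the Hive SerDe, returns 'x'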