Modifying (Adding and Dropping) Columns in Hive Partitioned Tables


    1. Environment and Test Data

    Environment: CDH 6.3.0, Hive 2.1.1-cdh6.3.0

    The base table is the partitioned table test1, with four columns a, b, c, d plus the partition column p_day. Insert two rows into it:

    create table test1(a int,b bigint,c float,d string) partitioned by(p_day string) stored as parquet;
    
    insert into test1 partition(p_day) values(1,11,1.1,'str1','2022-11-19'),(2,22,2.2,'str2','2022-11-19');
    
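    The INSERT above uses dynamic partitioning: the value '2022-11-19' in the VALUES list supplies the partition column. Depending on the cluster configuration, this may be rejected under Hive's strict dynamic-partition mode; a commonly needed pair of session settings (a hedged reminder, not part of the original setup) is:

    -- Allow a fully dynamic partition insert; both are standard Hive settings.
    set hive.exec.dynamic.partition=true;
    set hive.exec.dynamic.partition.mode=nonstrict;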

    The rows in the table and the Parquet file contents are as follows:

    0: jdbc:hive2://dev-master1:10000> select * from test1;
    +----------+----------+----------+----------+--------------+
    | test1.a  | test1.b  | test1.c  | test1.d  | test1.p_day  |
    +----------+----------+----------+----------+--------------+
    | 1        | 11       | 1.1      | str1     | 2022-11-19   |
    | 2        | 22       | 2.2      | str2     | 2022-11-19   |
    +----------+----------+----------+----------+--------------+
    
    [hive@dev-master1 tmp]$ hdfs dfs -get /user/hive/warehouse/debug_test.db/test1/p_day=2022-11-19/000000_0 ./
    [hive@dev-master1 tmp]$ parquet-tools schema 000000_0
    message hive_schema {
      optional int32 a;
      optional int64 b;
      optional float c;
      optional binary d (STRING);
    }
    
    [hive@dev-master1 tmp]$ parquet-tools cat 000000_0
    a = 1
    b = 11
    c = 1.1
    d = str1
    
    a = 2
    b = 22
    c = 2.2
    d = str2
    

    2. Dropping Columns

    2.1 Test table test2

    create table test2(b bigint,a int,c float) partitioned by(p_day string) stored as parquet;
    

    Table test2 directly reuses test1's data file:

    [hive@dev-master1 tmp]$ hdfs dfs -mkdir /user/hive/warehouse/debug_test.db/test2/p_day=2022-11-19
    [hive@dev-master1 tmp]$ hdfs dfs -cp /user/hive/warehouse/debug_test.db/test1/p_day=2022-11-19/000000_0 /user/hive/warehouse/debug_test.db/test2/p_day=2022-11-19/
    

    Repair the partition metadata and query the data:

    msck repair table test2;
    select * from test2;
    +----------+----------+----------+--------------+
    | test2.b  | test2.a  | test2.c  | test2.p_day  |
    +----------+----------+----------+--------------+
    | 11       | 1        | 1.1      | 2022-11-19   |
    | 22       | 2        | 2.2      | 2022-11-19   |
    +----------+----------+----------+--------------+
    
    Even though test2 declares its columns in a different order (b, a, c) than they appear in the Parquet file, the values line up correctly: by default, Hive's Parquet reader resolves columns by name rather than by position.

    2.2 Dropping a column via DDL?

    To drop column a from test2, the only apparent DDL route is REPLACE COLUMNS, but the statement fails. According to the official documentation, REPLACE COLUMNS can only be used on tables with a native SerDe (DynamicSerDe, MetadataTypedColumnsetSerDe, LazySimpleSerDe, and ColumnarSerDe).

    0: jdbc:hive2://dev-master1:10000> alter table test2 replace columns(b bigint,c float);
    INFO  : Compiling command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd): alter table test2 replace columns(b bigint,c float)
    INFO  : Semantic Analysis Completed
    INFO  : Returning Hive schema: Schema(fieldSchemas:null, properties:null)
    INFO  : Completed compiling command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd); Time taken: 1.479 seconds
    INFO  : Executing command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd): alter table test2 replace columns(b bigint,c float)
    INFO  : Starting task [Stage-0:DDL] in serial mode
    ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table debug_test.test2. SerDe may be incompatible
    INFO  : Completed executing command(queryId=hive_20221119180121_23e7971f-7b2f-4693-90b9-469ec44a97bd); Time taken: 0.018 seconds
    Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Replacing columns cannot drop columns for table debug_test.test2. SerDe may be incompatible (state=42000,code=1)
    
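    For comparison, the same statement is expected to succeed on a table stored with a native SerDe. A minimal sketch (the table test2_text is illustrative, not part of the original test setup):

    -- A TEXTFILE table uses LazySimpleSerDe, one of the native SerDes,
    -- so REPLACE COLUMNS is permitted to drop columns here.
    create table test2_text(b bigint,a int,c float) partitioned by(p_day string) stored as textfile;
    alter table test2_text replace columns(b bigint,c float);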

    2.3 Dropping a column through the Hive MetaStore API

    Main Maven dependencies:

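    A minimal sketch, assuming the Cloudera-packaged artifacts that match the cluster version (adjust the coordinates and versions to your repository):

    <!-- Assumed coordinates: hive-exec bundles HiveConf and HiveMetaStoreClient;
         hadoop-common supplies the Hadoop configuration classes. -->
    <dependency>
        <groupId>org.apache.hive</groupId>
        <artifactId>hive-exec</artifactId>
        <version>2.1.1-cdh6.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>3.0.0-cdh6.3.0</version>
    </dependency>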
    Connecting to the Hive MetaStore from code, the change succeeds:

    package com.bigdata.databasetest.hive.metastore;
    
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Table;
    import org.apache.thrift.TException;
    
    import java.util.List;
    import java.util.stream.Collectors;
    
    
    /**
     * ClassName: HiveMetaStoreClientTest
     * Description:
     *
     * @author 0x3E6
     * @version 1.0.0
     * @date 2022/11/19 17:12
     */
    public class HiveMetaStoreClientTest {
    
        public static void main(String[] args) throws TException {
            HiveConf hiveConf = new HiveConf();
            System.setProperty("HADOOP_USER_NAME", "hive");
            hiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://dev-master1:9083");
            try (HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf)) {
                // Fetch the table metadata and filter column "a" out of the
                // storage descriptor's column list.
                Table table = client.getTable("debug_test", "test2");
                List<FieldSchema> cols = table.getSd().getCols();
                cols = cols.stream().filter(fieldSchema -> !"a".equalsIgnoreCase(fieldSchema.getName())).collect(Collectors.toList());
                table.getSd().setCols(cols);
                // This overload alters only the table-level metadata.
                client.alter_table("debug_test", "test2", table);
            }
        }
    }
    

    After this change, queries return wrong data, whether or not the SQL filters on the partition:

    select * from test2;
    select * from test2 where p_day='2022-11-19';
    

    The query result is:

    +----------+----------+--------------+
    | test2.b  | test2.c  | test2.p_day  |
    +----------+----------+--------------+
    | 11       | NULL     | 2022-11-19   |
    | 22       | NULL     | 2022-11-19   |
    +----------+----------+--------------+
    

    This is because only the table's metadata was changed; the partition's metadata was not:

    0: jdbc:hive2://dev-master1:10000> desc test2;
    +--------------------------+-----------------------+-----------------------+
    |         col_name         |       data_type       |        comment        |
    +--------------------------+-----------------------+-----------------------+
    | b                        | bigint                |                       |
    | c                        | float                 |                       |
    | p_day                    | string                |                       |
    |                          | NULL                  | NULL                  |
    | # Partition Information  | NULL                  | NULL                  |
    | # col_name               | data_type             | comment               |
    |                          | NULL                  | NULL                  |
    | p_day                    | string                |                       |
    +--------------------------+-----------------------+-----------------------+
    0: jdbc:hive2://dev-master1:10000> desc test2 partition(p_day='2022-11-19');
    +--------------------------+-----------------------+-----------------------+
    |         col_name         |       data_type       |        comment        |
    +--------------------------+-----------------------+-----------------------+
    | b                        | bigint                |                       |
    | a                        | int                   |                       |
    | c                        | float                 |                       |
    | p_day                    | string                |                       |
    |                          | NULL                  | NULL                  |
    | # Partition Information  | NULL                  | NULL                  |
    | # col_name               | data_type             | comment               |
    |                          | NULL                  | NULL                  |
    | p_day                    | string                |                       |
    +--------------------------+-----------------------+-----------------------+
    

    The underlying mechanism has not been analyzed here, but the problem can be fixed by keeping each partition's column order consistent with the table's. Since REPLACE COLUMNS cannot be executed, a CHANGE COLUMN statement moves the partition's column c in front of column a (i.e., directly after b):

    alter table test2 partition(p_day='2022-11-19') change column c c float after b;
    select * from test2 where p_day='2022-11-19';
    +----------+----------+--------------+
    | test2.b  | test2.c  | test2.p_day  |
    +----------+----------+--------------+
    | 11       | 1.1      | 2022-11-19   |
    | 22       | 2.2      | 2022-11-19   |
    +----------+----------+--------------+
    

    2.3.1 Updating table and partition metadata together

    An ALTER TABLE statement can carry CASCADE to update the partition metadata along with the table metadata, and the Hive MetaStoreClient API offers the same option. The code above called HiveMetaStoreClient's alter_table(String dbname, String tbl_name, Table new_tbl); the overload alter_table(String defaultDatabaseName, String tblName, Table table, boolean cascade) cascades the change instead. Create a table test3 with the same structure as test2, copy test1's data into it the same way, then drop column a from test3 through the API:

    package com.bigdata.databasetest.hive.metastore;
    
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.Table;
    import org.apache.thrift.TException;
    
    import java.util.List;
    import java.util.stream.Collectors;
    
    
    /**
     * ClassName: HiveMetaStoreClientTest
     * Description:
     *
     * @author 0x3E6
     * @version 1.0.0
     * @date 2022/11/19 17:12
     */
    public class HiveMetaStoreClientTest {
    
        public static void main(String[] args) throws TException {
            HiveConf hiveConf = new HiveConf();
            System.setProperty("HADOOP_USER_NAME", "hive");
            hiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://dev-master1:9083");
            try (HiveMetaStoreClient client = new HiveMetaStoreClient(hiveConf)) {
                Table table = client.getTable("debug_test", "test3");
                List<FieldSchema> cols = table.getSd().getCols();
                cols = cols.stream().filter(fieldSchema -> !"a".equalsIgnoreCase(fieldSchema.getName())).collect(Collectors.toList());
                table.getSd().setCols(cols);
    //            client.alter_table("debug_test", "test2", table);
                // cascade=true propagates the new column list to every partition.
                client.alter_table("debug_test", "test3", table, true);
            }
        }
    }
    
    • 36

    After the update, test3 queries normally:

    0: jdbc:hive2://dev-master1:10000> select * from test3;
    +----------+----------+--------------+
    | test3.b  | test3.c  | test3.p_day  |
    +----------+----------+--------------+
    | 11       | 1.1      | 2022-11-19   |
    | 22       | 2.2      | 2022-11-19   |
    +----------+----------+--------------+
    

    The table and partition metadata also match:

    0: jdbc:hive2://dev-master1:10000> desc test3;
    +--------------------------+-----------------------+-----------------------+
    |         col_name         |       data_type       |        comment        |
    +--------------------------+-----------------------+-----------------------+
    | b                        | bigint                |                       |
    | c                        | float                 |                       |
    | p_day                    | string                |                       |
    |                          | NULL                  | NULL                  |
    | # Partition Information  | NULL                  | NULL                  |
    | # col_name               | data_type             | comment               |
    |                          | NULL                  | NULL                  |
    | p_day                    | string                |                       |
    +--------------------------+-----------------------+-----------------------+
    0: jdbc:hive2://dev-master1:10000> desc test3 partition(p_day='2022-11-19');
    +--------------------------+-----------------------+-----------------------+
    |         col_name         |       data_type       |        comment        |
    +--------------------------+-----------------------+-----------------------+
    | b                        | bigint                |                       |
    | c                        | float                 |                       |
    | p_day                    | string                |                       |
    |                          | NULL                  | NULL                  |
    | # Partition Information  | NULL                  | NULL                  |
    | # col_name               | data_type             | comment               |
    |                          | NULL                  | NULL                  |
    | p_day                    | string                |                       |
    +--------------------------+-----------------------+-----------------------+
    

    3. Adding Columns

    According to the Hive documentation, columns can be added with an ADD COLUMNS statement, which appends the specified columns after the existing columns and before the partition columns:

    ALTER TABLE table_name 
      [PARTITION partition_spec]                 -- (Note: Hive 0.14.0 and later)
      ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
      [CASCADE|RESTRICT]                         -- (Note: Hive 1.1.0 and later)
    
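    For example, a sketch of appending a new column e to test3 (the column name is illustrative) while cascading the change to every partition:

    -- ADD COLUMNS appends e after the existing columns, before p_day;
    -- CASCADE also updates each existing partition's metadata, keeping
    -- table and partition column order aligned.
    alter table test3 add columns(e string) cascade;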

    If the Hive MetaStoreClient API is accidentally used to add a column between existing (non-partition) columns, leaving the table's and partitions' column orders out of sync, a CHANGE COLUMN statement can likewise reorder the table's or a partition's columns to restore consistency, as sketched after the syntax below:

    ALTER TABLE table_name [PARTITION partition_spec] CHANGE [COLUMN] col_old_name col_new_name column_type
      [COMMENT col_comment] [FIRST|AFTER column_name] [CASCADE|RESTRICT];
    
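    For instance, if a hypothetical column e ended up between b and c at the table level while the partitions kept their old order, moving it back to the end would restore the alignment (a sketch):

    -- Reorder only the table metadata: move e after c so the table's
    -- column order again matches each partition's.
    alter table test3 change column e e string after c;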

    Conclusion

    Columns can be added to and dropped from Hive tables through the Hive MetaStoreClient API, but the column order in the table metadata and in every partition's metadata must be kept consistent.
