• Hive日分区表如何快速导入到StarRocks


    1、背景

    业务现状:集团使用FineBI做数据呈现及报表分析工具,经过近两年的BI建设,供应链域及营销域的BI建设已初具规模并体系化。数仓规模60TB,FineBI数据集约8000个,BI挂出报表数约1600个,报表月增幅在40左右。

    技术现状:数据加工链路:业务系统数据库->Hive数仓->PG导出库->FineBI抽取数据集->BI报表。该数据链路下,遇到了一些问题。本文不展开全部问题,仅讨论PG导出库数据规模过大,导致FineBI每日数据更新任务卡住问题。

    问题描述:集团供应链的物料的帐龄表、库龄表按日分区,日增量100W,且存储了所有历史数据,该账龄、库龄数据在一年内单表接近160GB,FineBI每日早晨抽取数据的更新任务卡住(FineBI只配置了4个后台更新线程),导致大部分数据更新延迟。即使开放更多的后台数据更新线程(按CPU核数推荐经验值,过多会影响BI的报表查看性能),该数据的抽取对PG导出库也有压力。

    解决思路:1)从报表的逻辑优化减少数据量;2)切换查询引擎,使用StarRocks替换PG,FineBI直连StarRocks,减少BI数据抽取。本文仅讨论切换查询引擎的方案。

    2、最初实验

    思路:

    使用Broker Load导入Hive中的按日动态分区表数据导出到StarRocks中。StarRocks中的表也按日动态分区。

    相关文档:

    • 参考Broker Load官方文档:Broker load研究导入配置。
    • 参考动态分区官方文档:动态分区研究StarRocks按日动态分区表建表语句。

    问题:

    使用Brocker Load方式遇到以下问题:

    • hdfs parquet文件有几年的分区数据,有的是月分区,有的是天分区
    • 一些分区的字段不同,有的有A BC字段,有的有ABD字段有的ABCD字段
    • StarRocks导入时,不允许StarRocks的字段比原表字段多

    官方给出了以下脚本,创建一张支持动态分区的表,表名为 site_access,动态分区通过 PROPERTIES 进行配置。分区的区间为当前时间的前后 3 天,总共 6 天。

    1. CREATE TABLE site_access(
    2. event_day DATE,
    3. site_id INT DEFAULT '10',
    4. city_code VARCHAR(100),
    5. user_name VARCHAR(32) DEFAULT '',
    6. pv BIGINT DEFAULT '0'
    7. )
    8. DUPLICATE KEY(event_day, site_id, city_code, user_name)
    9. PARTITION BY RANGE(event_day)(
    10. PARTITION p20200321 VALUES LESS THAN ("2020-03-22"),
    11. PARTITION p20200322 VALUES LESS THAN ("2020-03-23"),
    12. PARTITION p20200323 VALUES LESS THAN ("2020-03-24"),
    13. PARTITION p20200324 VALUES LESS THAN ("2020-03-25")
    14. )
    15. DISTRIBUTED BY HASH(event_day, site_id) BUCKETS 32
    16. PROPERTIES(
    17. "dynamic_partition.enable" = "true",
    18. "dynamic_partition.time_unit" = "DAY",
    19. "dynamic_partition.start" = "-3",
    20. "dynamic_partition.end" = "3",
    21. "dynamic_partition.prefix" = "p",
    22. "dynamic_partition.buckets" = "32"
    23. );

    如果是Hive使用经验的用户,第一次使用StarRocks,参考上面的SQL逻辑设计自己的表逻辑时,可能有疑问:PARTITION BY RANGE(event_day)(这里的内容改如何填写呢)?比如,以下需要迁移的Hive表,

    1. -- drop table if exists bda${db_para}.bda_inv_item_age_dtl;
    2. create table if not exists bda${db_para}.bda_inv_item_age_dtl (
    3. stat_date string comment '统计日期',
    4. entr_date string comment '入库日期',
    5. item_id string comment '物料id',
    6. itm_cd string comment '物料编码',
    7. org_id string comment '库存组织id',
    8. org_cd string comment '库存组织编码',
    9. sub_invtr_cd string comment '子库编码',
    10. bch_nbr string comment '批次号',
    11. entr_qty string comment '入库数量',
    12. total_qty string comment '倒算总入库数',
    13. left_qty string comment '剩余库存数',
    14. alloc_qty string comment '分摊库存数',
    15. stock_age string comment '库龄',
    16. item_cost string comment 'pac成本',
    17. actual_cost string comment '实际成本',
    18. item_business string comment '最近三个月使用事业部',
    19. pch_big_ctg string comment '采购品类大类',
    20. pch_med_ctg string comment '采购品类中类',
    21. pch_sml_ctg string comment '采购品类小类',
    22. ship_customer_name string comment '订单收货客户',
    23. team_bu_name string comment '战队',
    24. real_customer_name string comment '真实客户',
    25. item_category string comment '物料类别',
    26. prod_model_id string comment '产品型号id',
    27. prod_model string comment '产品型号',
    28. job_belong_bu string comment '工单所属事业部',
    29. item_bu string comment '近三个使用事业部',
    30. pac_cost string comment 'pac成本单价',
    31. item_category_bg string comment '物料大类')
    32. comment '库龄'
    33. partitioned by(part_dt string)
    34. row format delimited fields terminated by '\036'
    35. stored as parquet;
    36. -- add by tjl 2022.03.10
    37. -- alter table bda${db_para}.bda_inv_item_age_dtl add columns(dept_name string comment '工单所属部门') cascade;
    38. -- 2022.04.18:xxx:新增:工单制单时间、制单人
    39. alter table bda${db_para}.bda_inv_item_age_dtl add columns(wdj_creation_date string comment '工单制单时间') cascade;
    40. alter table bda${db_para}.bda_inv_item_age_dtl add columns(created_by string comment '工单制单人') cascade;

    请注意:这里面有表字段表更。Parquet格式的分区表使用Alter更新字段或新增字段,Parquet文件会如何变化?旧分区的历史数据要如何处理?

    其hdfs文件如下:

    1. drwxrwx--x+ - hive hive 0 2018-08-30 03:51 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-08-29
    2. #此处省略一万行
    3. drwxrwx--x+ - hive hive 0 2022-11-26 03:55 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-25
    4. drwxrwx--x+ - hive hive 0 2022-11-27 03:52 warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-26

    讨论:

    • StarRocks有多种数据表模型,该场景应该选择什么数据表模型呢?
    • 应该如何建StarRocks的表呢?
    • 应该如何建Brocker load任务呢?

    3、问题分析

    • Parquet格式的Hive表,执行add columns的工作流程及底层执行机制如何?

    1、Hive元数据表中,在表末尾,追加一列。

    2、对于Hive分区表,需要cascade修饰词,管理旧分区的元数据信息。

    3、如果是使用Parquet、ORC等存储格式,文件中存储的内容没有重写,则不会变动。

    4、Hive如何识别就分区表中没有新增列的?

    参考:[Hive] Alter Table/Partition/Column - Huawei Enterprise Support Community

    简单而言,如果不使用cascade修饰词,只是修改了Hive中bda_inv_item_age_dtl表元信息(表结构)。写入新分区的数据,hive能自动识别并正常显示。对于旧分区,如果包含了新列,即使旧分区使用了INSERT OVERWRITE重写数据,如果没有使用Drop 并 recreate分区,hive查询有不能正常工作。

    参考:partitioning - how to add columns to existing hive partitioned table? - Stack Overflow

    • Parquet文件中的schema变化情况

    查看最早分区2016-06-30的数据结构:

    [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2016-06-30/000000_0
    [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive20160630.sql
    [cloud@dp-master001 ~]$ vcat hive20160630.sql 
    -bash: vcat: command not found
    [cloud@dp-master001 ~]$ cat hive20160630.sql 
    message hive_schema {
      optional binary stat_date (UTF8);
      optional binary entr_date (UTF8);
      optional binary item_id (UTF8);
      optional binary itm_cd (UTF8);
      optional binary org_id (UTF8);
      optional binary org_cd (UTF8);
      optional binary sub_invtr_cd (UTF8);
      optional binary bch_nbr (UTF8);
      optional binary entr_qty (UTF8);
      optional binary total_qty (UTF8);
      optional binary left_qty (UTF8);
      optional binary alloc_qty (UTF8);
      optional binary stock_age (UTF8);
      optional binary item_cost (UTF8);
      optional binary actual_cost (UTF8);
      optional binary item_business (UTF8);
      optional binary pch_big_ctg (UTF8);
      optional binary pch_med_ctg (UTF8);
      optional binary pch_sml_ctg (UTF8);
      optional binary ship_customer_name (UTF8);
      optional binary team_bu_name (UTF8);
      optional binary real_customer_name (UTF8);
      optional binary item_category (UTF8);
      optional binary prod_model_id (UTF8);
      optional binary prod_model (UTF8);
      optional binary job_belong_bu (UTF8);
      optional binary item_bu (UTF8);
      optional binary pac_cost (UTF8);
      optional binary item_category_bg (UTF8);
    }
    creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})
    
    file schema: hive_schema
    ----------------------------------------------------------------------------------------------------
    stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_category: OPTIONAL BINARY O:UTF8 R:0 D:1
    prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1
    job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
    pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1
    
    row group 1: RC:190899 TS:8780200
    ----------------------------------------------------------------------------------------------------
    stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:204/204/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    entr_date:  BINARY UNCOMPRESSED DO:0 FPO:208 SZ:249018/249018/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    item_id:  BINARY UNCOMPRESSED DO:0 FPO:249226 SZ:245992/245992/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:495218 SZ:409890/409890/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    org_id:  BINARY UNCOMPRESSED DO:0 FPO:905108 SZ:3641/3641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    org_cd:  BINARY UNCOMPRESSED DO:0 FPO:908749 SZ:3637/3637/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:912386 SZ:187698/187698/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:1100084 SZ:286227/286227/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:1386311 SZ:495330/495330/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    total_qty:  BINARY UNCOMPRESSED DO:0 FPO:1881641 SZ:59/59/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    left_qty:  BINARY UNCOMPRESSED DO:0 FPO:1881700 SZ:486445/486445/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:2368145 SZ:536856/536856/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    stock_age:  BINARY UNCOMPRESSED DO:0 FPO:2905001 SZ:243903/243903/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    item_cost:  BINARY UNCOMPRESSED DO:0 FPO:3148904 SZ:2322210/2322210/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,PLAIN,RLE
    actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:5471114 SZ:2922009/2922009/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
    item_business:  BINARY UNCOMPRESSED DO:0 FPO:8393123 SZ:1817/1817/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8394940 SZ:4757/4757/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8399697 SZ:13669/13669/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8413366 SZ:27595/27595/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    ship_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:8440961 SZ:26641/26641/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    team_bu_name:  BINARY UNCOMPRESSED DO:0 FPO:8467602 SZ:12107/12107/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    real_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:8479709 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
    item_category:  BINARY UNCOMPRESSED DO:0 FPO:8479742 SZ:399/399/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    prod_model_id:  BINARY UNCOMPRESSED DO:0 FPO:8480141 SZ:36922/36922/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    prod_model:  BINARY UNCOMPRESSED DO:0 FPO:8517063 SZ:33472/33472/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    job_belong_bu:  BINARY UNCOMPRESSED DO:0 FPO:8550535 SZ:3769/3769/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    item_bu:  BINARY UNCOMPRESSED DO:0 FPO:8554304 SZ:33/33/1.00 VC:190899 ENC:BIT_PACKED,PLAIN,RLE
    pac_cost:  BINARY UNCOMPRESSED DO:0 FPO:8554337 SZ:225484/225484/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    item_category_bg:  BINARY UNCOMPRESSED DO:0 FPO:8779821 SZ:383/383/1.00 VC:190899 ENC:PLAIN_DICTIONARY,BIT_PACKED,RLE
    

    查看最新分区2022-11-26,Parquet文件的数据结构

    [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-11-24/000000_0
    [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0 >> hive.sql
    [cloud@dp-master001 ~]$ cat hive.sql 
    message hive_schema {
      optional binary stat_date (UTF8);
      optional binary entr_date (UTF8);
      optional binary item_id (UTF8);
      optional binary itm_cd (UTF8);
      optional binary org_id (UTF8);
      optional binary org_cd (UTF8);
      optional binary sub_invtr_cd (UTF8);
      optional binary bch_nbr (UTF8);
      optional binary entr_qty (UTF8);
      optional binary total_qty (UTF8);
      optional binary left_qty (UTF8);
      optional binary alloc_qty (UTF8);
      optional binary stock_age (UTF8);
      optional binary item_cost (UTF8);
      optional binary actual_cost (UTF8);
      optional binary item_business (UTF8);
      optional binary pch_big_ctg (UTF8);
      optional binary pch_med_ctg (UTF8);
      optional binary pch_sml_ctg (UTF8);
      optional binary ship_customer_name (UTF8);
      optional binary team_bu_name (UTF8);
      optional binary real_customer_name (UTF8);
      optional binary item_category (UTF8);
      optional binary prod_model_id (UTF8);
      optional binary prod_model (UTF8);
      optional binary job_belong_bu (UTF8);
      optional binary item_bu (UTF8);
      optional binary pac_cost (UTF8);
      optional binary item_category_bg (UTF8);
      optional binary dept_name (UTF8);
      optional binary wdj_creation_date (UTF8);
      optional binary created_by (UTF8);
    }
    
    creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})
    
    file schema: hive_schema
    ----------------------------------------------------------------------------------------------------
    stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    ship_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    team_bu_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    real_customer_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_category: OPTIONAL BINARY O:UTF8 R:0 D:1
    prod_model_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    prod_model: OPTIONAL BINARY O:UTF8 R:0 D:1
    job_belong_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_bu: OPTIONAL BINARY O:UTF8 R:0 D:1
    pac_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_category_bg: OPTIONAL BINARY O:UTF8 R:0 D:1
    dept_name: OPTIONAL BINARY O:UTF8 R:0 D:1
    wdj_creation_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    created_by: OPTIONAL BINARY O:UTF8 R:0 D:1
    
    row group 1: RC:428480 TS:21535478
    ----------------------------------------------------------------------------------------------------
    stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:381/381/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    entr_date:  BINARY UNCOMPRESSED DO:0 FPO:385 SZ:608962/608962/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    item_id:  BINARY UNCOMPRESSED DO:0 FPO:609347 SZ:774164/774164/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:1383511 SZ:1185783/1185783/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    org_id:  BINARY UNCOMPRESSED DO:0 FPO:2569294 SZ:49829/49829/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    org_cd:  BINARY UNCOMPRESSED DO:0 FPO:2619123 SZ:49810/49810/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:2668933 SZ:539203/539203/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:3208136 SZ:582198/582198/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:3790334 SZ:1028459/1028459/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    total_qty:  BINARY UNCOMPRESSED DO:0 FPO:4818793 SZ:141/141/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    left_qty:  BINARY UNCOMPRESSED DO:0 FPO:4818934 SZ:1074391/1074391/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:5893325 SZ:1063130/1063130/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    stock_age:  BINARY UNCOMPRESSED DO:0 FPO:6956455 SZ:599509/599509/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    item_cost:  BINARY UNCOMPRESSED DO:0 FPO:7555964 SZ:5625018/5625018/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,PLAIN,RLE
    actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:13180982 SZ:6410120/6410120/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
    item_business:  BINARY UNCOMPRESSED DO:0 FPO:19591102 SZ:114122/114122/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705224 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
    pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705257 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
    pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:19705290 SZ:33/33/1.00 VC:428480 ENC:BIT_PACKED,PLAIN,RLE
    ship_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:19705323 SZ:37444/37444/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    team_bu_name:  BINARY UNCOMPRESSED DO:0 FPO:19742767 SZ:17370/17370/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    real_customer_name:  BINARY UNCOMPRESSED DO:0 FPO:19760137 SZ:18033/18033/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    item_category:  BINARY UNCOMPRESSED DO:0 FPO:19778170 SZ:917/917/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    prod_model_id:  BINARY UNCOMPRESSED DO:0 FPO:19779087 SZ:148421/148421/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    prod_model:  BINARY UNCOMPRESSED DO:0 FPO:19927508 SZ:113394/113394/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    job_belong_bu:  BINARY UNCOMPRESSED DO:0 FPO:20040902 SZ:16902/16902/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    item_bu:  BINARY UNCOMPRESSED DO:0 FPO:20057804 SZ:91023/91023/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    pac_cost:  BINARY UNCOMPRESSED DO:0 FPO:20148827 SZ:791340/791340/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    item_category_bg:  BINARY UNCOMPRESSED DO:0 FPO:20940167 SZ:673/673/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    dept_name:  BINARY UNCOMPRESSED DO:0 FPO:20940840 SZ:7675/7675/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    wdj_creation_date:  BINARY UNCOMPRESSED DO:0 FPO:20948515 SZ:574110/574110/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE
    created_by:  BINARY UNCOMPRESSED DO:0 FPO:21522625 SZ:12857/12857/1.00 VC:428480 ENC:BIT_PACKED,PLAIN_DICTIONARY,RLE

    4、方案验证

    4.1、StarRocks建表

    • 选择什么样的StarRocks表模型?

    根据BI报表的分析逻辑,主要用于监控日/月维度库龄、帐龄的趋势(影响物料/产成品跌价计提),主要是明细表做分析。故选择StarRocks明细表模型。期望按照Hive的按日分区建表。考虑到Brocker load只支持忽略源表中的字段,不支持目标表多的字段忽略/或者填充默认值的机制。根据探源发现,最早分区2016-06-30与最新分区2022-11-26字段不同(DDL中显示,2022.03.10,2022.04.18增加了三个字段。)可以考虑,03.10前,建立一个StarRocks表,之后建立一个StarRocks表。数据同步到StarRocks后,做数据汇聚。为了简单0310前的存储到一个分区表。

    • 如何建StarRocks的表

    0310前一个分区表:

    drop table  if  exists bda.bda_inv_item_age_dtl0310;
    create table if not exists bda.bda_inv_item_age_dtl0310(
        stat_date date,
    	entr_date date,
    	item_id varchar(10),
    	itm_cd varchar(500),
        org_id varchar(5),
        org_cd varchar(500),
        sub_invtr_cd varchar(20),
        bch_nbr varchar(50),
    	entr_qty decimal,
    	total_qty decimal,
    	left_qty decimal,
    	alloc_qty decimal,
    	stock_age decimal,
    	item_cost decimal,
    	actual_cost decimal,
    	item_business varchar(500),
    	pch_big_ctg varchar(500),
    	pch_med_ctg varchar(500),
    	pch_sml_ctg varchar(500),
    	ship_customer_name varchar(500),
    	team_bu_name varchar(500),
    	real_customer_name varchar(500),
    	item_category varchar(500),
    	prod_model_id varchar(500),
    	prod_model varchar(500),
    	job_belong_bu varchar(500),
    	item_bu varchar(500),
    	pac_cost decimal,
    	item_category_bg varchar(500)
    )
    DUPLICATE KEY(stat_date,entr_date,item_id)
    DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32;

    0310后动态分区

    create table if not exists bda.bda_inv_item_age_dtl_part(
    	stat_date date,
        entr_date date,
        item_id varchar(10),
        itm_cd varchar(500),
        org_id varchar(5),
    	org_cd varchar(500),
        sub_invtr_cd varchar(20),
        bch_nbr varchar(50),
    	entr_qty varchar(500),
    	total_qty varchar(500),
    	left_qty varchar(500),
    	alloc_qty varchar(500),
    	stock_age varchar(500),
    	item_cost varchar(500),
        actual_cost varchar(500),
    	item_business varchar(500),
    	pch_big_ctg varchar(500),
    	pch_med_ctg varchar(500),
    	pch_sml_ctg varchar(500),
    	ship_customer_name varchar(500),
    	team_bu_name varchar(500),
    	real_customer_name varchar(500),
    	item_category varchar(500),
    	prod_model_id varchar(500),
    	prod_model varchar(500),
    	job_belong_bu varchar(500),
    	item_bu varchar(500),
    	pac_cost varchar(500),
    	item_category_bg varchar(500),
    	dept_name varchar(300),
    	wdj_creation_date varchar(200),
    	created_by varchar(200)
    )
    DUPLICATE KEY(stat_date,entr_date,item_id)
    PARTITION BY RANGE(stat_date)()
    DISTRIBUTED BY HASH(entr_date, item_id) BUCKETS 32
    PROPERTIES(
        "dynamic_partition.enable" = "true",
        "dynamic_partition.time_unit" = "DAY",
        "dynamic_partition.end" = "1",
        "dynamic_partition.prefix" = "p",
        "dynamic_partition.buckets" = "32"
    );

    4.1、Brocker 同步任务

    • 如何建Brocker load任务

    参考文档:FileSystem (Apache Hadoop Main 3.3.4 API)

    0310之前分区的抽取任务

    LOAD LABEL bda.bda_inv_item_age_dtl0310_before
    (
    DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=[2016-2021]-*/*")
    INTO TABLE `bda_inv_item_age_dtl_part`
    COLUMNS TERMINATED BY "\036"
    FORMAT AS "parquet"
    where entr_date is not null 
    )
    WITH BROKER broker198
    (
    "hadoop.security.authentication"="kerberos",
    "kerberos_principal"="hdfs",
    "kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab"
    );

    执行发现,依旧报错,字段缺少,通过二分法,逐一排查hdfs分区文件,发现2022-03-10之前的分区,有的是按月分区,有的是按日分区,一些分区的字段各不相同,例如2018年分区情况、2022-02分区情况:

    [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2018_12_02 >> hive20181202.sql
    [cloud@dp-master001 ~]$ cat hive20181202.sql 
    message hive_schema {
      optional binary stat_date (UTF8);
      optional binary entr_date (UTF8);
      optional binary item_id (UTF8);
      optional binary itm_cd (UTF8);
      optional binary org_id (UTF8);
      optional binary org_cd (UTF8);
      optional binary sub_invtr_cd (UTF8);
      optional binary bch_nbr (UTF8);
      optional binary entr_qty (UTF8);
      optional binary total_qty (UTF8);
      optional binary left_qty (UTF8);
      optional binary alloc_qty (UTF8);
      optional binary stock_age (UTF8);
      optional binary item_cost (UTF8);
      optional binary actual_cost (UTF8);
      optional binary item_business (UTF8);
    }
    
    creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})
    
    file schema: hive_schema
    ----------------------------------------------------------------------------------------------------
    stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
    
    row group 1: RC:244224 TS:12749582
    ----------------------------------------------------------------------------------------------------
    stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    entr_date:  BINARY UNCOMPRESSED DO:0 FPO:267 SZ:354158/354158/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    item_id:  BINARY UNCOMPRESSED DO:0 FPO:354425 SZ:746753/746753/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:1101178 SZ:1059151/1059151/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    org_id:  BINARY UNCOMPRESSED DO:0 FPO:2160329 SZ:52235/52235/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    org_cd:  BINARY UNCOMPRESSED DO:0 FPO:2212564 SZ:52222/52222/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:2264786 SZ:279733/279733/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:2544519 SZ:633935/633935/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:3178454 SZ:690868/690868/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    total_qty:  BINARY UNCOMPRESSED DO:0 FPO:3869322 SZ:100/100/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    left_qty:  BINARY UNCOMPRESSED DO:0 FPO:3869422 SZ:686921/686921/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:4556343 SZ:747099/747099/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    stock_age:  BINARY UNCOMPRESSED DO:0 FPO:5303442 SZ:345381/345381/1.00 VC:244224 ENC:PLAIN_DICTIONARY,RLE,BIT_PACKED
    item_cost:  BINARY UNCOMPRESSED DO:0 FPO:5648823 SZ:3316772/3316772/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
    actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:8965595 SZ:3783958/3783958/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
    item_business:  BINARY UNCOMPRESSED DO:0 FPO:12749553 SZ:33/33/1.00 VC:244224 ENC:RLE,PLAIN,BIT_PACKED
    [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0
    get: `warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2018-12-15/000000_0': No such file or directory
    [cloud@dp-master001 ~]$ sudo hdfs dfs -get warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2020-02-26/000000_0
    [cloud@dp-master001 ~]$ mv 000000_0 000000_0_2020_02_26
    [cloud@dp-master001 ~]$ /opt/cloudera/parcels/CDH/lib/parquet/bin/parquet-tools schema -d 000000_0_2020_02_26 >> hive20200226.sql
    [cloud@dp-master001 ~]$ cat hive20200226.sql 
    message hive_schema {
      optional binary stat_date (UTF8);
      optional binary entr_date (UTF8);
      optional binary item_id (UTF8);
      optional binary itm_cd (UTF8);
      optional binary org_id (UTF8);
      optional binary org_cd (UTF8);
      optional binary sub_invtr_cd (UTF8);
      optional binary bch_nbr (UTF8);
      optional binary entr_qty (UTF8);
      optional binary total_qty (UTF8);
      optional binary left_qty (UTF8);
      optional binary alloc_qty (UTF8);
      optional binary stock_age (UTF8);
      optional binary item_cost (UTF8);
      optional binary actual_cost (UTF8);
      optional binary item_business (UTF8);
      optional binary pch_big_ctg (UTF8);
      optional binary pch_med_ctg (UTF8);
      optional binary pch_sml_ctg (UTF8);
    }
    
    creator: parquet-mr version 1.5.0-cdh5.15.2 (build ${buildNumber})
    
    file schema: hive_schema
    ----------------------------------------------------------------------------------------------------
    stat_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_date: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    itm_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_id: OPTIONAL BINARY O:UTF8 R:0 D:1
    org_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    sub_invtr_cd: OPTIONAL BINARY O:UTF8 R:0 D:1
    bch_nbr: OPTIONAL BINARY O:UTF8 R:0 D:1
    entr_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    total_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    left_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    alloc_qty: OPTIONAL BINARY O:UTF8 R:0 D:1
    stock_age: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    actual_cost: OPTIONAL BINARY O:UTF8 R:0 D:1
    item_business: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_big_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_med_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    pch_sml_ctg: OPTIONAL BINARY O:UTF8 R:0 D:1
    
    row group 1: RC:242410 TS:9017086
    ----------------------------------------------------------------------------------------------------
    stat_date:  BINARY UNCOMPRESSED DO:0 FPO:4 SZ:263/263/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    entr_date:  BINARY UNCOMPRESSED DO:0 FPO:267 SZ:317433/317433/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    item_id:  BINARY UNCOMPRESSED DO:0 FPO:317700 SZ:562585/562585/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    itm_cd:  BINARY UNCOMPRESSED DO:0 FPO:880285 SZ:900712/900712/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    org_id:  BINARY UNCOMPRESSED DO:0 FPO:1780997 SZ:42974/42974/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    org_cd:  BINARY UNCOMPRESSED DO:0 FPO:1823971 SZ:42965/42965/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    sub_invtr_cd:  BINARY UNCOMPRESSED DO:0 FPO:1866936 SZ:294784/294784/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    bch_nbr:  BINARY UNCOMPRESSED DO:0 FPO:2161720 SZ:437366/437366/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    entr_qty:  BINARY UNCOMPRESSED DO:0 FPO:2599086 SZ:691103/691103/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    total_qty:  BINARY UNCOMPRESSED DO:0 FPO:3290189 SZ:100/100/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    left_qty:  BINARY UNCOMPRESSED DO:0 FPO:3290289 SZ:707584/707584/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    alloc_qty:  BINARY UNCOMPRESSED DO:0 FPO:3997873 SZ:754179/754179/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    stock_age:  BINARY UNCOMPRESSED DO:0 FPO:4752052 SZ:310402/310402/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    item_cost:  BINARY UNCOMPRESSED DO:0 FPO:5062454 SZ:33/33/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED
    actual_cost:  BINARY UNCOMPRESSED DO:0 FPO:5062487 SZ:3631076/3631076/1.00 VC:242410 ENC:RLE,PLAIN,BIT_PACKED
    item_business:  BINARY UNCOMPRESSED DO:0 FPO:8693563 SZ:67646/67646/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    pch_big_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8761209 SZ:63343/63343/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    pch_med_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8824552 SZ:83151/83151/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    pch_sml_ctg:  BINARY UNCOMPRESSED DO:0 FPO:8907703 SZ:109387/109387/1.00 VC:242410 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED
    [cloud@dp-master001 ~]$ 

    这就导致了同步的数据导入的难度,需要检查每个分区的schema。为什么会出现这种情况呢?出现了这种情况如何快速同步呢?

    0310以后的分区抽取任务

    LOAD LABEL bda.bda_inv_item_age_dtl0310_after
    (
    DATA INFILE("hdfs://10.21.25.161:8020/user/hive/warehouse/bda.db/bda_inv_item_age_dtl/part_dt=2022-[03-12]-*/*")
    INTO TABLE `bda_inv_item_age_dtl_part`
    COLUMNS TERMINATED BY "\036"
    FORMAT AS "parquet"
    where entr_date is not null 
    )
    WITH BROKER broker198
    (
    "hadoop.security.authentication"="kerberos",
    "kerberos_principal"="hdfs",
    "kerberos_keytab"="/opt/StarRocks/kerberos/hdfs.keytab"
    );
    

    最终方案思考:

    方案1:

      修改StarRocks源码,支持StarRocks目标表字段多于原表字段时,填充默认值。

    方案2:

    • 使用python 读取hdfs 分区的schema,根据schema建立StarRocks分区临时表
    • 各分区导入成功后,自动合并成StarRocks目标表

    方案3:

    • 使用DataX同步Hive表,每个分区配置一个同步作业。有点不用考虑表字段问题。但是分区较多时,配置同步任务也比较麻烦。可以使用脚本自动生成。如果数据量较大,同步效率也不如Brocks Load方式。

     

  • 相关阅读:
    细胞膜杂化脂质体载紫杉醇靶向/仿生型细胞膜嵌合脂质体递送KGF-2研究
    Postman抓包浏览器请求--傻瓜式操作
    3分钟轻松实现网关网口远程监控安川PLC
    [数据结构]链表OJ--环形链表判断是否有环(快慢指针法)
    求二进制中1的个数的三种方法
    想要精通算法和SQL的成长之路 - 二叉树的判断问题(子树判断 | 对称性 | 一致性判断)
    react路由传参3种方式
    加入自定义函数共享output数组功能
    一个对C#程序混淆加密,小巧但够用的小工具
    Nginx的重写功能——Rewrite
  • 原文地址:https://blog.csdn.net/zdsx1104/article/details/128062646