• 离线数仓(12):DWD层实现之流量日志解析


    目录

    0. 相关文章链接

    1. DWD层概述 

    1.1. 数仓分层图

    1.2. 数仓数据走向图

    1.3. 实现目的

    2. APP商品点击

    2.1. 建表语句

    2.2. 执行脚本

    2.3. 创建视图

    2.4. 结果展示

    3. APP商品曝光

    3.1. 建表语句

    3.2. 执行脚本

    3.3. 创建视图

    3.4. 结果展示

    4. APP启动

    4.1. 建表语句

    4.2. 执行脚本

    4.3. 创建视图

    4.4. 结果展示

    5. 总结

    5.1. 为什么从ODS层读取当天全量数据

    5.2. DWD层流量表视图的创建

    5.3. 对JSONArray进行炸裂处理


    0. 相关文章链接

     离线数仓文章汇总 

    1. DWD层概述 

    1.1. 数仓分层

    1.2. 数仓数据走向图

    1.3. 实现目的

    DWD层的目的是进行维度建模(尽可能的使用星型模型,减少雪花模型的使用),所以一般情况下一个DWD层的表为一个事实表:

    • 针对流量日志,在ODS层很多流量日志主题会聚合在一起(博主这里ODS层分为9个大主题,例:APP点击、APP曝光、H5所有、路径等),所以到DWD层会将ODS层大主题根据业务分为小主题,比如APP商品点击、APP商品曝光、商品详情页面曝光等;
    • 一般情况下,流量日志使用的是批量上报的方式,所以需要对一些主题的流量日志进行炸裂(例:APP商品曝光等);
    • 流量日志在DWD层为每天一个分区,存储当天的增量数据,并注意要使用压缩;
    • 流量日志在DWD层一个业务主题为一张表;
    • 流量日志在DWD层的粒度为最细(例:一条数据为用户针对一个商品的一次曝光);

    2. APP商品点击

    注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

    2.1. 建表语句

    1. -- 建表语句
    2. create table if not exists ${yishou_data_dbname}.dwd_goods_click_dt (
    3. `user_id` STRING COMMENT '用户id',
    4. `goods_id` STRING COMMENT '商品id',
    5. `is_rec` STRING COMMENT '是否搜索结果推荐商品',
    6. `special_id` STRING COMMENT '专场id',
    7. `os` STRING COMMENT '操作系统',
    8. `goods_no` STRING COMMENT '货号',
    9. `pid` STRING COMMENT '来源',
    10. `ptime` STRING COMMENT '页面停留时间',
    11. `source` STRING COMMENT '专场列表来源',
    12. `report_time` STRING COMMENT '上报时间',
    13. `event_id` STRING COMMENT '事件id',
    14. `search_event_id` STRING COMMENT '搜索事件id',
    15. `keyword` STRING COMMENT '关键词',
    16. `app_version` STRING COMMENT 'APP版本',
    17. `index` STRING COMMENT '商品下标',
    18. `strategy_id` STRING COMMENT '策略id',
    19. `is_default` STRING COMMENT '主被动切换(1主动,0被动)',
    20. `is_operat` STRING COMMENT '是否广告位商品'
    21. )
    22. comment 'APP商品点击分区表'
    23. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
    24. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_goods_click_dt'
    25. ;

    2.2. 执行脚本

    1. -- 写数脚本
    2. insert overwrite table ${yishou_data_dbname}.dwd_goods_click_dt partition(dt)
    3. select
    4. user_id,
    5. coalesce(get_json_object(click_goods, '$.goods_id'), get_json_object(click_goods, '$.good_id')) as goods_id,
    6. get_json_object(click_goods, '$.is_rec') as is_rec,
    7. special_id,
    8. os,
    9. get_json_object(click_goods, '$.goods_no') as goods_no,
    10. pid,
    11. ptime,
    12. source,
    13. time as report_time,
    14. event_id,
    15. search_event_id,
    16. keyword,
    17. app_version,
    18. get_json_object(click_goods, '$.index') as index,
    19. get_json_object(click_goods, '$.strategy_id') as strategy_id,
    20. get_json_object(click_goods, '$.is_default') as is_default,
    21. get_json_object(click_goods, '$.is_operat') as is_operat,
    22. from_unixtime(time - 25200, 'yyyyMMdd') as dt
    23. from (
    24. select
    25. get_json_object(scdata, '$.user_id') as user_id,
    26. get_json_object(scdata, '$.special_id') as special_id,
    27. get_json_object(scdata, '$.os') as os,
    28. get_json_object(scdata, '$.pid') as pid,
    29. get_json_object(scdata, '$.ptime') as ptime,
    30. get_json_object(scdata, '$.source') as source,
    31. __time__ as time,
    32. get_json_object(scdata, '$.event_id') as event_id,
    33. get_json_object(scdata, '$.search_event_id') as search_event_id,
    34. get_json_object(scdata, '$.keyword') as keyword,
    35. get_json_object(scdata, '$.app_version') as app_version,
    36. regexp_replace1(
    37. regexp_replace1(
    38. regexp_replace1(
    39. get_json_object(scdata, '$.click_goods_arr'),
    40. '^\\[',
    41. ''
    42. ),
    43. '\\]$',
    44. ''
    45. ),
    46. '},\\{',
    47. '}|-|{'
    48. ) as click_goods_arr
    49. from
    50. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
    51. where
    52. dt BETWEEN ${one_day_ago} and ${today}
    53. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
    54. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
    55. )
    56. lateral view outer explode (split(click_goods_arr, '\\|-\\|'))t as click_goods
    57. where
    58. click_goods is not null
    59. and click_goods != ''
    60. DISTRIBUTE by floor(rand() * 10)
    61. ;

    2.3. 创建视图

    1. -- 创建视图
    2. drop view if exists ${yishou_data_dbname}.dwd_goods_click;
    3. create view if not exists ${yishou_data_dbname}.dwd_goods_click (
    4. `user_id` COMMENT '用户id',
    5. `goods_id` COMMENT '商品id',
    6. `is_rec` COMMENT '是否搜索结果推荐商品',
    7. `special_id` COMMENT '专场id',
    8. `os` COMMENT '操作系统',
    9. `goods_no` COMMENT '货号',
    10. `pid` COMMENT '来源',
    11. `ptime` COMMENT '页面停留时间',
    12. `source` COMMENT '专场列表来源',
    13. `report_time` COMMENT '上报时间',
    14. `event_id` COMMENT '事件id',
    15. `search_event_id` COMMENT '搜索事件id',
    16. `keyword` COMMENT '关键词',
    17. `app_version` COMMENT 'APP版本',
    18. `index` COMMENT '商品下标',
    19. `strategy_id` COMMENT '策略id',
    20. `is_default` COMMENT '主被动切换(1主动,0被动)',
    21. `is_operat` COMMENT '是否广告位商品',
    22. `dt` COMMENT '日期分区(yyyymmdd)'
    23. ) comment 'APP商品点击视图'
    24. as
    25. select
    26. user_id,
    27. coalesce(get_json_object(click_goods, '$.goods_id'), get_json_object(click_goods, '$.good_id')) as goods_id,
    28. get_json_object(click_goods, '$.is_rec') as is_rec,
    29. special_id,
    30. os,
    31. get_json_object(click_goods, '$.goods_no') as goods_no,
    32. pid,
    33. ptime,
    34. source,
    35. time as report_time,
    36. event_id,
    37. search_event_id,
    38. keyword,
    39. app_version,
    40. get_json_object(click_goods, '$.index') as index,
    41. get_json_object(click_goods, '$.strategy_id') as strategy_id,
    42. get_json_object(click_goods, '$.is_default') as is_default,
    43. get_json_object(click_goods, '$.is_operat') as is_operat,
    44. from_unixtime(time - 25200, 'yyyyMMdd') as dt
    45. from (
    46. select
    47. get_json_object(scdata, '$.user_id') as user_id,
    48. get_json_object(scdata, '$.special_id') as special_id,
    49. get_json_object(scdata, '$.os') as os,
    50. get_json_object(scdata, '$.pid') as pid,
    51. get_json_object(scdata, '$.ptime') as ptime,
    52. get_json_object(scdata, '$.source') as source,
    53. __time__ as time,
    54. get_json_object(scdata, '$.event_id') as event_id,
    55. get_json_object(scdata, '$.search_event_id') as search_event_id,
    56. get_json_object(scdata, '$.keyword') as keyword,
    57. get_json_object(scdata, '$.app_version') as app_version,
    58. regexp_replace1(
    59. regexp_replace1(
    60. regexp_replace1(
    61. get_json_object(scdata, '$.click_goods_arr'),
    62. '^\\[',
    63. ''
    64. ),
    65. '\\]$',
    66. ''
    67. ),
    68. '},\\{',
    69. '}|-|{'
    70. ) as click_goods_arr
    71. from ${yishou_data_dbname}.ods_yishou_log_exposure_dt
    72. where
    73. dt >= ${thirty_day_ago}
    74. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
    75. )
    76. lateral view outer explode (split(click_goods_arr, '\\|-\\|'))t as click_goods
    77. where
    78. click_goods is not null
    79. and click_goods != ''
    80. ;

    2.4. 结果展示

    分区表查询结果展示:

    视图结果展示: 

    3. APP商品曝光

    注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

    3.1. 建表语句

    1. -- 建表语句
    2. create table if not exists ${yishou_data_dbname}.dwd_goods_exposure_dt (
    3. `user_id` STRING COMMENT '用户id',
    4. `goods_id` STRING COMMENT '商品id',
    5. `is_rec` STRING COMMENT '是否搜索结果推荐商品',
    6. `special_id` STRING COMMENT '专场id',
    7. `os` STRING COMMENT '操作系统',
    8. `goods_no` STRING COMMENT '货号',
    9. `pid` STRING COMMENT '来源',
    10. `ptime` STRING COMMENT '页面停留时间',
    11. `source` STRING COMMENT '专场列表来源',
    12. `report_time` STRING COMMENT '上报时间',
    13. `event_id` STRING COMMENT '事件id',
    14. `search_event_id` STRING COMMENT '搜索事件id',
    15. `keyword` STRING COMMENT '关键词',
    16. `app_version` STRING COMMENT 'APP版本',
    17. `exposure_index` STRING COMMENT '曝光商品下标',
    18. `strategy_id` STRING COMMENT '策略id',
    19. `is_default` STRING COMMENT '主被动切换(1主动,0被动)',
    20. `is_operat` STRING COMMENT '是否广告位商品'
    21. )
    22. comment 'APP商品曝光分区表'
    23. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
    24. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_goods_exposure_dt'
    25. ;

    3.2. 执行脚本

    1. -- 写数脚本
    2. insert overwrite table ${yishou_data_dbname}.dwd_goods_exposure_dt partition(dt)
    3. select
    4. user_id,
    5. coalesce(get_json_object(exposure_goods, '$.goods_id'), get_json_object(exposure_goods, '$.good_id')) as goods_id,
    6. get_json_object(exposure_goods, '$.is_rec') as is_rec,
    7. special_id,
    8. os,
    9. get_json_object(exposure_goods, '$.goods_no') as goods_no,
    10. pid,
    11. ptime,
    12. source,
    13. time as report_time,
    14. event_id,
    15. search_event_id,
    16. keyword,
    17. app_version,
    18. get_json_object(exposure_goods, '$.index') as exposure_index,
    19. get_json_object(exposure_goods, '$.strategy_id') as strategy_id,
    20. get_json_object(exposure_goods, '$.is_default') as is_default,
    21. get_json_object(exposure_goods, '$.is_operat') as is_operat,
    22. from_unixtime(time - 25200, 'yyyyMMdd') as dt
    23. from (
    24. select
    25. get_json_object(scdata, '$.user_id') as user_id,
    26. get_json_object(scdata, '$.special_id') as special_id,
    27. get_json_object(scdata, '$.os') as os,
    28. get_json_object(scdata, '$.pid') as pid,
    29. get_json_object(scdata, '$.ptime') as ptime,
    30. get_json_object(scdata, '$.source') as source,
    31. __time__ as time,
    32. get_json_object(scdata, '$.event_id') as event_id,
    33. get_json_object(scdata, '$.search_event_id') as search_event_id,
    34. get_json_object(scdata, '$.keyword') as keyword,
    35. get_json_object(scdata, '$.app_version') as app_version,
    36. regexp_replace1(
    37. regexp_replace1(
    38. regexp_replace1(
    39. get_json_object(scdata, '$.goods_arr'),
    40. '^\\[',
    41. ''
    42. ),
    43. '\\]$',
    44. ''
    45. ),
    46. '},\\{',
    47. '}|-|{'
    48. ) as exposure_goods_arr
    49. from
    50. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
    51. where
    52. dt BETWEEN ${one_day_ago} and ${today}
    53. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
    54. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
    55. )
    56. lateral view outer explode (split(exposure_goods_arr, '\\|-\\|'))t as exposure_goods
    57. where
    58. exposure_goods is not null
    59. and exposure_goods != ''
    60. DISTRIBUTE by floor(rand() * 20)
    61. ;

    3.3. 创建视图

    1. -- 创建视图
    2. drop view if exists ${yishou_data_dbname}.dwd_goods_exposure;
    3. create view if not exists ${yishou_data_dbname}.dwd_goods_exposure (
    4. `user_id` COMMENT '用户id',
    5. `goods_id` COMMENT '商品id',
    6. `is_rec` COMMENT '是否搜索结果推荐商品',
    7. `special_id` COMMENT '专场id',
    8. `os` COMMENT '操作系统',
    9. `goods_no` COMMENT '货号',
    10. `pid` COMMENT '来源',
    11. `ptime` COMMENT '页面停留时间',
    12. `source` COMMENT '专场列表来源',
    13. `report_time` COMMENT '上报时间',
    14. `event_id` COMMENT '事件id',
    15. `search_event_id` COMMENT '搜索事件id',
    16. `keyword` COMMENT '关键词',
    17. `app_version` COMMENT 'APP版本',
    18. `exposure_index` COMMENT '曝光商品下标',
    19. `strategy_id` COMMENT '策略id',
    20. `is_default` COMMENT '主被动切换(1主动,0被动)',
    21. `is_operat` COMMENT '是否广告位商品',
    22. `dt` COMMENT '日期分区(yyyymmdd)'
    23. ) comment 'APP商品曝光视图'
    24. as
    25. select
    26. user_id,
    27. coalesce(get_json_object(exposure_goods, '$.goods_id'), get_json_object(exposure_goods, '$.good_id')) as goods_id,
    28. get_json_object(exposure_goods, '$.is_rec') as is_rec,
    29. special_id,
    30. os,
    31. get_json_object(exposure_goods, '$.goods_no') as goods_no,
    32. pid,
    33. ptime,
    34. source,
    35. time as report_time,
    36. event_id,
    37. search_event_id,
    38. keyword,
    39. app_version,
    40. get_json_object(exposure_goods, '$.index') as exposure_index,
    41. get_json_object(exposure_goods, '$.strategy_id') as strategy_id,
    42. get_json_object(exposure_goods, '$.is_default') as is_default,
    43. get_json_object(exposure_goods, '$.is_operat') as is_operat,
    44. from_unixtime(time - 25200, 'yyyyMMdd') as dt
    45. from (
    46. select
    47. get_json_object(scdata, '$.user_id') as user_id,
    48. get_json_object(scdata, '$.special_id') as special_id,
    49. get_json_object(scdata, '$.os') as os,
    50. get_json_object(scdata, '$.pid') as pid,
    51. get_json_object(scdata, '$.ptime') as ptime,
    52. get_json_object(scdata, '$.source') as source,
    53. __time__ as time,
    54. get_json_object(scdata, '$.event_id') as event_id,
    55. get_json_object(scdata, '$.search_event_id') as search_event_id,
    56. get_json_object(scdata, '$.keyword') as keyword,
    57. get_json_object(scdata, '$.app_version') as app_version,
    58. regexp_replace1(
    59. regexp_replace1(
    60. regexp_replace1(
    61. get_json_object(scdata, '$.goods_arr'),
    62. '^\\[',
    63. ''
    64. ),
    65. '\\]$',
    66. ''
    67. ),
    68. '},\\{',
    69. '}|-|{'
    70. ) as exposure_goods_arr
    71. from
    72. ${yishou_data_dbname}.ods_yishou_log_exposure_dt
    73. where
    74. dt > ${thirty_day_ago}
    75. and (lower(topic) = 'goodsexposure' or lower(__topic__) = 'goodsexposure')
    76. )
    77. lateral view outer explode (split(exposure_goods_arr, '\\|-\\|'))t as exposure_goods
    78. where
    79. exposure_goods is not null
    80. and exposure_goods != ''
    81. ;

    3.4. 结果展示

    分区表查询结果展示:

    视图结果展示: 

    4. APP启动

    注:脚本实现如下所示,关于具体使用和总结可以查看文章最后DWD层流量日志解析总结章节。

    4.1. 建表语句

    1. -- 建表语句
    2. create table if not exists ${yishou_data_dbname}.dwd_app_start_dt (
    3. `report_time` bigint COMMENT '上报时间(10位时间戳,秒值)',
    4. `time` bigint COMMENT '触发时间(10位时间戳,秒值)',
    5. `distinct_id` STRING COMMENT '唯一标识码',
    6. `os` STRING COMMENT '手机操作系统',
    7. `ab_test` STRING COMMENT 'AB版本区分',
    8. `is_first_day` STRING COMMENT '是否首日访问',
    9. `is_first_time` STRING COMMENT '是否首次访问',
    10. `resume_from_background` STRING COMMENT '是否从后台唤起',
    11. `user_id` STRING COMMENT '用户id',
    12. `app_version` STRING COMMENT 'app版本',
    13. `network_type` STRING COMMENT '网络环境',
    14. `manufacturer` STRING COMMENT '设备名称',
    15. `device_id` STRING COMMENT '设备号',
    16. `os_version` STRING COMMENT '手机操作系统版本号',
    17. `carrier` STRING COMMENT '运营商',
    18. `is_pro` STRING COMMENT '是否为Pro版{1:童装货源APP,0:一手APP}',
    19. `model` STRING COMMENT '手机型号'
    20. )
    21. comment 'APP启动分区表'
    22. PARTITIONED BY (`dt` STRING COMMENT '日期分区(yyyymmdd)')
    23. STORED AS orc LOCATION 'obs://yishou-test/yishou_data_qa_test.db/dwd_app_start_dt'
    24. ;

    4.2. 执行脚本

    1. -- 写数脚本
    2. insert overwrite table ${yishou_data_dbname}.dwd_app_start_dt partition(dt)
    3. select
    4. __time__ as report_time,
    5. get_json_object(scdata, '$.time') / 1000 as time,
    6. get_json_object(scdata, '$.distinct_id') as distinct_id,
    7. get_json_object(scdata, '$.properties.os') as os,
    8. get_json_object(scdata, '$.properties.isAorB') as ab_test,
    9. get_json_object(scdata, '$.properties.is_first_day') as is_first_day,
    10. get_json_object(scdata, '$.properties.is_first_time') as is_first_time,
    11. get_json_object(scdata, '$.properties.resume_from_background') as resume_from_background,
    12. get_json_object(scdata, '$.properties.userid') as user_id,
    13. get_json_object(scdata, '$.properties.app_version') as app_version,
    14. get_json_object(scdata, '$.properties.network_type') as network_type,
    15. get_json_object(scdata, '$.properties.manufacturer') as manufacturer,
    16. get_json_object(scdata, '$.properties.device_id') as device_id,
    17. get_json_object(scdata, '$.properties.os_version') as os_version,
    18. get_json_object(scdata, '$.properties.carrier') as carrier,
    19. get_json_object(scdata, '$.properties.isPro') as is_pro,
    20. get_json_object(scdata, '$.properties.model') as model,
    21. from_unixtime(__time__ - 25200, 'yyyyMMdd') as dt
    22. from
    23. ${yishou_data_dbname}.ods_new_app_log_store_dt
    24. where
    25. dt BETWEEN ${one_day_ago} and ${today}
    26. and from_unixtime(__time__ - 25200, 'yyyyMMdd') = ${one_day_ago}
    27. and (lower(topic) = 'appstart' or lower(__topic__) = 'appstart')
    28. DISTRIBUTE by floor(rand() * 3)
    29. ;

    4.3. 创建视图

    1. -- 创建视图
    2. drop view if exists ${yishou_data_dbname}.dwd_app_start;
    3. create view if not exists ${yishou_data_dbname}.dwd_app_start (
    4. `report_time` COMMENT '上报时间(10位时间戳,秒值)',
    5. `time` COMMENT '触发时间(10位时间戳,秒值)',
    6. `distinct_id` COMMENT '唯一标识码',
    7. `os` COMMENT '手机操作系统',
    8. `ab_test` COMMENT 'AB版本区分',
    9. `is_first_day` COMMENT '是否首日访问',
    10. `is_first_time` COMMENT '是否首次访问',
    11. `resume_from_background` COMMENT '是否从后台唤起',
    12. `user_id` COMMENT '用户id',
    13. `app_version` COMMENT 'app版本',
    14. `network_type` COMMENT '网络环境',
    15. `manufacturer` COMMENT '设备名称',
    16. `device_id` COMMENT '设备号',
    17. `os_version` COMMENT '手机操作系统版本号',
    18. `carrier` COMMENT '运营商',
    19. `is_pro` COMMENT '是否为Pro版{1:童装货源APP,0:一手APP}',
    20. `model` COMMENT '手机型号',
    21. `dt` COMMENT '日期分区(yyyymmdd)'
    22. )
    23. as
    24. select
    25. __time__ as report_time,
    26. get_json_object(scdata, '$.time') / 1000 as time,
    27. get_json_object(scdata, '$.distinct_id') as distinct_id,
    28. get_json_object(scdata, '$.properties.os') as os,
    29. get_json_object(scdata, '$.properties.isAorB') as ab_test,
    30. get_json_object(scdata, '$.properties.is_first_day') as is_first_day,
    31. get_json_object(scdata, '$.properties.is_first_time') as is_first_time,
    32. get_json_object(scdata, '$.properties.resume_from_background') as resume_from_background,
    33. get_json_object(scdata, '$.properties.userid') as user_id,
    34. get_json_object(scdata, '$.properties.app_version') as app_version,
    35. get_json_object(scdata, '$.properties.network_type') as network_type,
    36. get_json_object(scdata, '$.properties.manufacturer') as manufacturer,
    37. get_json_object(scdata, '$.properties.device_id') as device_id,
    38. get_json_object(scdata, '$.properties.os_version') as os_version,
    39. get_json_object(scdata, '$.properties.carrier') as carrier,
    40. get_json_object(scdata, '$.properties.isPro') as is_pro,
    41. get_json_object(scdata, '$.properties.model') as model,
    42. from_unixtime(__time__ - 25200, 'yyyyMMdd') as dt
    43. from
    44. ${yishou_data_dbname}.ods_new_app_log_store_dt
    45. where
    46. dt > ${thirty_day_ago}
    47. and (lower(topic) = 'appstart' or lower(__topic__) = 'appstart')
    48. ;

    4.4. 结果展示

    分区表查询结果展示:

    视图结果展示: 

    5. 总结

    5.1. 为什么从ODS层读取当天全量数据

            在上述代码中,会从ODS中获取数据,然后根据不同业务主题进行数据ETL工作,再写入到DWD层对应表中,每次都是获取ODS所有的数据,一个ODS表中会包含几十个DWD的主题;这里可以考虑在ODS层就对以日期和主题名进行分区,但博主这里没有划分,因为ODS层会根据APP曝光、APP点击、H5日志等已经划分为9个大主题了,并且每次读取数据也很快(20秒以内),如下图SparkUI所示:

            在对ODS是否进行主题分区,可以综合考虑读取文件的时间和文件系统中小文件的大小个数;根据 离线数仓(8):ODS层实现之导入流量日志 文章中介绍,会3分钟生成一个文件,每天会有480个文件生成,每个文件大小平均为20M左右,但如果进行分区,那有多少个分区就会有多少倍的小文件;可以对这2方面综合考虑,来看是否在ODS层中对日期和主题进行分区。

    5.2. DWD层流量表视图的创建

            在上述代码中,我们可以看到每天会将前一天的数据进行ELT,并将结果数据写入到DWD层对应表的分区中;那为什么还要创建视图呢?这是因为博主这会有近实时作业,要求使用最新的数据,每小时跑一次,这种作业如果每小时将数据重写一次到文件系统中会比较麻烦;而且现在是每小时,后续可能会要求30分钟,但综合考虑又不需要使用Flink(使用Flink完全实时对资源要求比较高);这时候就能使用视图了,在ODS中为3分钟级别更新,这样就能满足需要了。

            并且注意,视图为每天早晨会删除旧视图,然后创建新的视图,视图中获取的是近30天的数据;这是因为需要当天的数据时,一般来说30天的历史数据就完全足够,这可以根据实际情况增长或者缩短。

    5.3. 对JSONArray进行炸裂处理

            在HQL中(博主这里使用的是华为云的DLI服务,使用的是Spark on Hive,所以编写的为SparkSQL),如果要获取JSONObject中的数据,可以使用get_json_object函数来获取;但没有现成的解析JSONArray的函数,所以一般情况下需要通过编写代码,自定义UDTF函数来实现此功能,然后通过 lateral view outer explode () 来将数据由一行转换成多行。

            博主这里由于历史原因,导致埋点的数据不是很符合JSONArray格式(即商品点击或商品曝光中,既有JSONArray,又有JSONObject);所以博主使用的方法是,通过regexp_replace方法和正则表达式(在代码中是regexp_replace1,是由于原先是阿里云后面迁移到华为云,迁移导致的),将其中的开始的 [ 和最后的 ] 进行替换,再将其中的 },{ 进行替换,最后使用 split 切割成集合,最后炸裂实现,具体可以查看上述代码。


    注:其他 离线数仓 相关文章链接由此进 -> 离线数仓文章汇总


  • 相关阅读:
    ARMday2
    使用Mindspore推理会出现内存泄漏问题
    猿创征文|redis安装(Linux)
    "科来杯"第十届山东省大学生网络安全技能大赛决赛复现WP
    FreeRtos 任务优先级和中断优先级
    gradle8.0或者其他版本下载太慢或者time out超时(完美解决方法)
    web前端-html-css-HACK(CSS Hack 实际上指的就是一个特殊的代码,这段代码只在某些特殊的浏览器中可以识别)
    将可调用对象转换为 c 风格指针
    Linux系统防火墙firewalld
    Mode Field Diameter(MFD)(模场直径)
  • 原文地址:https://blog.csdn.net/yang_shibiao/article/details/126760797