• 尚硅谷大数据项目《在线教育之离线数仓》笔记006


    视频地址:尚硅谷大数据项目《在线教育之离线数仓》_哔哩哔哩_bilibili

    目录

    第11章 数仓开发之ADS层

    P087

    P088

    P089

    P090

    P091

    P092

    P093

    P094

    P095

    P096

    P097

    P098

    P099

    P100

    P101

    P102

    P103

    P104

    P105

    P106

    P107

    P108

    P109

    P110

    P111


    第11章 数仓开发之ADS层

    P087

    第11章 数仓开发之ADS层

    11.1 流量主题

    11.1.1 各来源流量统计

    [atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive &

    [atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive --service hiveserver2 &

    1. [atguigu@node001 hadoop]$ cd /opt/module/hive/hive-3.1.2/
    2. [atguigu@node001 hive-3.1.2]$ nohup bin/hive &
    3. [1] 11485
    4. [atguigu@node001 hive-3.1.2]$ nohup: 忽略输入并把输出追加到"nohup.out"
    5. [atguigu@node001 hive-3.1.2]$ nohup bin/hive --service hiveserver2 &
    6. [2] 11626
    7. [atguigu@node001 hive-3.1.2]$ nohup: 忽略输入并把输出追加到"nohup.out"
    8. [atguigu@node001 hive-3.1.2]$
    1. [atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive &
    2. [1] 3815
    3. [atguigu@node001 ~]$ nohup: 忽略输入并把输出追加到"nohup.out"
    4. [atguigu@node001 ~]$
    5. [atguigu@node001 ~]$ jpsall
    6. ================ node001 ================
    7. 2848 NameNode
    8. 3634 JobHistoryServer
    9. 3014 DataNode
    10. 3815 RunJar
    11. 3933 Jps
    12. 3422 NodeManager
    13. ================ node002 ================
    14. 2113 NodeManager
    15. 2531 Jps
    16. 1989 ResourceManager
    17. 1783 DataNode
    18. ================ node003 ================
    19. 1908 SecondaryNameNode
    20. 2055 NodeManager
    21. 2270 Jps
    22. 1823 DataNode
    23. [atguigu@node001 ~]$
    24. [atguigu@node001 ~]$ nohup /opt/module/hive/hive-3.1.2/bin/hive --service hiveserver2 &
    25. [2] 4038
    26. [atguigu@node001 ~]$ nohup: 忽略输入并把输出追加到"nohup.out"
    27. [atguigu@node001 ~]$
    28. [atguigu@node001 ~]$ jpsall
    29. ================ node001 ================
    30. 2848 NameNode
    31. 4225 Jps
    32. 3634 JobHistoryServer
    33. 4038 RunJar
    34. 3014 DataNode
    35. 3815 RunJar
    36. 3422 NodeManager
    37. ================ node002 ================
    38. 2113 NodeManager
    39. 1989 ResourceManager
    40. 1783 DataNode
    41. 2591 Jps
    42. ================ node003 ================
    43. 1908 SecondaryNameNode
    44. 2055 NodeManager
    45. 2334 Jps
    46. 1823 DataNode
    47. [atguigu@node001 ~]$

    P088

    11.1.2 路径分析

    桑基图

    P089

    11.1.3 各来源下单统计

    P090

    11.2 用户主题

    11.2.1 用户变动统计

    P091

    11.2.2 用户留存率

    P092

    11.2.3 用户新增活跃统计

    [2023-09-04 10:04:35] org.apache.hadoop.hive.ql.parse.SemanticException:Failed to get a spark session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create Spark client for Spark session 1af1be50-bc68-46e3-a27f-ff9bf04c44c0

    1. [atguigu@node001 ~]$ cd /opt/module/hive/hive-3.1.2/
    2. [atguigu@node001 hive-3.1.2]$ bin/hive
    3. hive (default)> set hive.spark.client.server.connect.timeout=300000ms;
    4. 2023-09-04 10:14:03,044 INFO [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] conf.HiveConf (HiveConf.java:getLogIdVar(5040)) - Using the default value passed in for log id: fd295ca1-5b9b-4f12-bcff-c5db47bee932
    5. Query returned non-zero code: 1, cause: Cannot modify hive.spark.client.server.connect.timeout at runtime. It is in the list of parameters that can't be modified at runtime or is prefixed by a restricted variable
    6. 2023-09-04 10:14:03,091 INFO [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] conf.HiveConf (HiveConf.java:getLogIdVar(5040)) - Using the default value passed in for log id: fd295ca1-5b9b-4f12-bcff-c5db47bee932
    7. 2023-09-04 10:14:03,091 INFO [fd295ca1-5b9b-4f12-bcff-c5db47bee932 main] session.SessionState (SessionState.java:resetThreadName(452)) - Resetting thread name to main
    8. hive (default)>
    set hive.spark.client.server.connect.timeout=300000ms;

    P093

    11.2.4 用户行为漏斗分析

    P094

    11.2.5 新增交易用户统计

    P095

    11.2.6 各年龄段下单用户数

    P096

    11.3 商品主题

    11.3.1 各分类课程交易统计

    P097

    11.3.3 各课程交易统计

    P098

    11.3.4 各课程评价统计

    P099

    11.3.5 各分类课程试听留存统计

    P100

    11.3.6 各学科试听留存统计

    P101

    11.4 交易主题

    11.4.1 交易综合统计

    11.4.2 各省份交易统计

    P102

    11.5 考试主题

    11.5.1 各试卷平均统计

    P103

    11.5.2 各课程考试相关指标统计

    P104

    11.5.3 各试卷成绩分布

    P105

    11.5.4 各题正确率

    P106

    11.6 播放主题

    11.6.1 各章节视频播放情况统计

    P107

    11.6.2 各课程视频播放情况统计

    P108

    11.7 完课主题

    11.7.1 各课程完课人数统计

    1. -- 11.7 完课主题
    2. -- 11.7.1 各课程完课人数统计
    3. -- 1)建表语句
    4. DROP TABLE IF EXISTS ads_complete_complete_user_count_per_course;
    5. CREATE EXTERNAL TABLE ads_complete_complete_user_count_per_course
    6. (
    7. `dt` STRING COMMENT '统计日期',
    8. `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    9. `course_id` STRING COMMENT '课程 id',
    10. `user_count` BIGINT COMMENT '完课人数'
    11. ) COMMENT '各课程完课人数'
    12. ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    13. LOCATION '/warehouse/edu/ads/ads_complete_complete_user_count_per_course/';
    14. -- 2)数据装载
    15. -- 统计用户对应一个课程完成多少个章节
    16. insert overwrite table edu2077.ads_complete_complete_user_count_per_course
    17. select dt,
    18. recent_days,
    19. course_id,
    20. user_count
    21. from edu2077.ads_complete_complete_user_count_per_course
    22. union
    23. select '2022-02-21' dt,
    24. recent_days,
    25. course_id,
    26. count(user_id) order_count
    27. from (select course_id,
    28. id,
    29. user_id,
    30. max_chapter_complete_date
    31. from (select course_id,
    32. user_id,
    33. max(first_complete_date) max_chapter_complete_date,
    34. count(first_complete_date) user_chapter_complete_count
    35. from edu2077.dwd_learn_play_stats_full
    36. where dt = '2022-02-21'
    37. group by course_id, user_id
    38. ) chapter_complete
    39. left join
    40. (select id,
    41. chapter_num
    42. from edu2077.dim_course_full
    43. where dt = '2022-02-21') dim_course
    44. on chapter_complete.course_id = dim_course.id
    45. where user_chapter_complete_count = chapter_num
    46. ) course_complete
    47. lateral view explode(array(1, 7, 30)) tmp as recent_days
    48. where max_chapter_complete_date >= date_add('2022-02-21', -recent_days + 1)
    49. group by recent_days, course_id;
    50. select * from ads_complete_complete_user_count_per_course;
    51. insert overwrite table edu2077.ads_complete_complete_user_count_per_course
    52. select *
    53. from edu2077.ads_complete_complete_user_count_per_course
    54. union
    55. select '2022-02-21' dt,
    56. recent_days,
    57. course_id,
    58. count(user_id) user_conunt
    59. from (
    60. -- 完整的用户数据
    61. select course_id,
    62. user_id,
    63. first_complete_date
    64. from (
    65. -- 统计用户对应一个课程完成多少个章节
    66. select course_id,
    67. user_id,
    68. max(first_complete_date) first_complete_date,
    69. count(first_complete_date) user_complete_nums
    70. from dwd_learn_play_stats_full
    71. where dt = '2022-02-21'
    72. group by course_id, user_id
    73. ) t1
    74. left join (
    75. -- 统计课程一共有多少章节
    76. select id,
    77. chapter_num
    78. from dim_course_full
    79. where dt = '2022-02-21'
    80. ) t2
    81. on t1.course_id = t2.id and user_complete_nums = chapter_num
    82. ) t3 lateral view explode(array(1, 7, 30)) tmp as recent_days
    83. where first_complete_date <= '2022-02-21'
    84. and first_complete_date >= date_sub('2022-02-21', recent_days - 1)
    85. group by recent_days, course_id;
    86. select * from ads_complete_complete_user_count_per_course;

    P109

    11.7.2 完课综合指标

    1. -- 11.7.2 完课综合指标
    2. -- 1)建表语句
    3. DROP TABLE IF EXISTS ads_complete_complete_stats;
    4. CREATE EXTERNAL TABLE ads_complete_complete_stats
    5. (
    6. `dt` STRING COMMENT '统计日期',
    7. `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    8. `user_complete_count` BIGINT COMMENT '完课人数',
    9. `user_course_complete_count` BIGINT COMMENT '完课人次'
    10. ) COMMENT '完课综合指标'
    11. ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    12. LOCATION '/warehouse/edu/ads/ads_complete_complete_stats/';
    13. -- 2)数据装载
    14. insert overwrite table edu2077.ads_complete_complete_stats
    15. select dt,
    16. recent_days,
    17. user_complete_count,
    18. user_course_complete_count
    19. from edu2077.ads_complete_complete_stats
    20. union
    21. select '2022-02-21' dt,
    22. recent_days,
    23. count(distinct user_id) user_complete_count,
    24. count(*) user_course_complete_count
    25. from (select course_id,
    26. id,
    27. user_id,
    28. max_chapter_complete_date
    29. from (select course_id,
    30. user_id,
    31. max(first_complete_date) max_chapter_complete_date,
    32. count(first_complete_date) user_chapter_complete_count
    33. from edu2077.dwd_learn_play_stats_full
    34. where dt = '2022-02-21'
    35. group by course_id, user_id
    36. ) chapter_complete
    37. left join
    38. (select id,
    39. chapter_num
    40. from edu2077.dim_course_full
    41. where dt = '2022-02-21') dim_course
    42. on chapter_complete.course_id = dim_course.id
    43. where user_chapter_complete_count = chapter_num
    44. ) course_complete
    45. lateral view explode(array(1, 7, 30)) tmp as recent_days
    46. where max_chapter_complete_date >= date_add('2022-02-21', -recent_days + 1)
    47. group by recent_days;
    48. select * from ads_complete_complete_stats;
    49. insert overwrite table edu2077.ads_complete_complete_stats
    50. select dt,
    51. recent_days,
    52. user_complete_count,
    53. user_course_complete_count
    54. from edu2077.ads_complete_complete_stats
    55. union
    56. select '2022-02-21' dt,
    57. recent_days,
    58. count(distinct user_id) user_complete_count,
    59. count(*) user_course_complete_count
    60. from (
    61. -- 完整的用户数据
    62. select course_id,
    63. user_id,
    64. first_complete_date
    65. from (
    66. -- 统计用户对应一个课程完成多少个章节
    67. select course_id,
    68. user_id,
    69. max(first_complete_date) first_complete_date,
    70. count(first_complete_date) user_complete_nums
    71. from dwd_learn_play_stats_full
    72. where dt = '2022-02-21'
    73. group by course_id, user_id
    74. ) t1
    75. left join (
    76. -- 统计课程一共有多少章节
    77. select id,
    78. chapter_num
    79. from dim_course_full
    80. where dt = '2022-02-21'
    81. ) t2
    82. on t1.course_id = t2.id and user_complete_nums = chapter_num
    83. ) t3 lateral view explode(array(1, 7, 30)) tmp as recent_days
    84. where first_complete_date <= '2022-02-21'
    85. and first_complete_date >= date_sub('2022-02-21', recent_days - 1)
    86. group by recent_days;
    87. select * from ads_complete_complete_stats;

    P110

    11.7.3 各个课程人均完成章节数

    1. -- 11.7.3 各个课程人均完成章节数
    2. -- 1)建表语句
    3. DROP TABLE IF EXISTS ads_complete_complete_chapter_count_per_course;
    4. CREATE EXTERNAL TABLE ads_complete_complete_chapter_count_per_course
    5. (
    6. `dt` STRING COMMENT '统计日期',
    7. `recent_days` BIGINT COMMENT '最近天数,1:最近1天,7:最近7天,30:最近30天',
    8. `course_id` STRING COMMENT '课程 id',
    9. `complete_chapter_count` DECIMAL(16, 2) COMMENT '用户平均完成章节数'
    10. ) COMMENT '各课程人均完成章节视频数'
    11. ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    12. LOCATION '/warehouse/edu/ads/ads_complete_complete_chapter_count_per_course/';
    13. -- 2)数据装载
    14. insert overwrite table edu2077.ads_complete_complete_chapter_count_per_course
    15. select dt,
    16. recent_days,
    17. course_id,
    18. complete_chapter_count
    19. from edu2077.ads_complete_complete_chapter_count_per_course
    20. union
    21. select '2022-02-21' dt,
    22. recent_days,
    23. course_id,
    24. count(first_complete_date) / count(distinct user_id) complete_chapter_count
    25. from edu2077.dwd_learn_play_stats_full
    26. lateral view explode(array(1, 7, 30)) tmp as recent_days
    27. where dt = '2022-02-21'
    28. and first_complete_date is not null
    29. and first_complete_date >= date_add('2022-02-21', -recent_days + 1)
    30. group by recent_days, course_id;
    31. select * from ads_complete_complete_chapter_count_per_course;
    32. insert overwrite table edu2077.ads_complete_complete_chapter_count_per_course
    33. select *
    34. from edu2077.ads_complete_complete_chapter_count_per_course
    35. union
    36. select '2022-02-21' dt,
    37. recent_days,
    38. course_id,
    39. cast(count(first_complete_date) / count(distinct user_id) as decimal(16, 2)) complete_chapter_count
    40. from (
    41. -- 统计用户对应一个课程完成多少个章节
    42. -- 完成的章节数
    43. select user_id,
    44. course_id,
    45. max(first_complete_date) first_complete_date,
    46. count(first_complete_date) user_complete_nums
    47. from dwd_learn_play_stats_full
    48. where dt = '2022-02-21'
    49. group by user_id, course_id
    50. ) t1 lateral view explode(array(1, 7, 30)) tmp as recent_days
    51. group by recent_days, course_id;
    52. select * from ads_complete_complete_chapter_count_per_course;

    P111

    11.8 数据装载脚本

  • 相关阅读:
    javascript关于NaN ,isNaN,Number.isNaN必须知道的知识
    Java发展历史
    ClickHouse—入门
    DBeaver,一款好用的通用数据库管理器
    DBA面试题:MySQL缓存池LRU算法做了哪些改进?
    RedisSyncer 中采用链式策略处理同步数据
    【C++】函数指针 ② ( 数组类型基本语法 | 数组语法 | 数组首元素地址 和 数组地址 | 定义数组类型 | 定义指针数组类型语法 | 直接定义指针数组类型变量语法 )
    聊聊logback的AsyncAppender
    1、7focus伪类选择器
    [附源码]计算机毕业设计JAVAssm酒店综合管理平台
  • 原文地址:https://blog.csdn.net/weixin_44949135/article/details/132575874