• 数仓SQL相关的场景解法


    1.求用户连续登录天数、连续下单次数、连续......等次数等。

    1. use_id event_time
    2. 1 2022-08-20
    3. 1 2022-08-21
    4. 2 2022-08-20
    5. 2 2022-08-22
    6. 2 2022-08-23
    7. 3 2022-08-20
    8. 3 2022-08-23
    9. 3 2022-08-25

    此种场景,一般先按照用户维度开窗,然后拿事件减去当前行的序号,就能将连续的数据得到相同的值,然后进行count:datediff(event_time, row_number over(partition by user_id order by event_time))。或者: row_number over(partition by 1 order by event_time) - row_number over(partition by user order by event_time)等。

    1.2 连续问题的变种

    某游戏公司记录的用户每日登录数据:

    1. id dt
    2. 1001 2021-12-12
    3. 1002 2021-12-12
    4. 1001 2021-12-13
    5. 1001 2021-12-14
    6. 1001 2021-12-16
    7. 1002 2021-12-16
    8. 1001 2021-12-19
    9. 1002 2021-12-17
    10. 1001 2021-12-20

    计算每个用户最大的连续登录天数,可以间隔一天。解释:如果一个用户在 1,3,5,6 登 录游戏,则视为连续 6 天登录。

    1. select
    2. id,
    3. max(days)+1
    4. from
    5. (
    6. select
    7. id,
    8. flag,
    9. datediff(max(dt),min(dt)) days
    10. from
    11. (
    12. select
    13. id,
    14. dt,
    15. sum(if(flag>2,1,0)) over(partition by id order by dt) flag
    16. from
    17. (
    18. select
    19. id,
    20. dt,
    21. datediff(dt,lagdt) flag
    22. from
    23. (
    24. select
    25. id,
    26. dt,
    27. lag(dt,1,'1970-01-01') over(partition by id order by dt) lagdt
    28. from
    29. test3
    30. )t1
    31. )t2
    32. )t3
    33. group by id,flag
    34. )t4
    35. group by id;

    2. 求直播间中并发最高的人数

    1. user_id enter_time out_time
    2. 1 2022-08-23 11:00:11 2022-08-23 12:11:00
    3. 2 2022-08-23 10:00:12 2022-08-23 11:10:00
    4. 1 2022-08-23 09:00:56 2022-08-23 10:10:00
    5. 4 2022-08-23 11:20:32 2022-08-23 14:11:00
    6. 5 2022-08-23 12:10:00 2022-08-23 15:12:00
    7. 6 2022-08-23 13:00:23 2022-08-23 13:13:00
    8. 7 2022-08-23 11:30:00 2022-08-23 12:14:00

    并发指的是:直播过程中直播间某一时刻最大的人数

    这种情况,我们需要将直播的进入时间和离开时间分别拆成两条数据,并且为进入事件打上1的标识,离开事件打上-1的标识,然后开窗sum(flag) over(partition by 1 oder by evnet_time),然后对其求最大的那条数据就OK了。此种方法,其实就是模拟直播开始过程中,来一个人就+1,离开一个人就-1的情况,然后求最大就是某一时刻最大的人数。

    3.求互为好友的用户有哪些

    1. user_id firend_id
    2. 1 2
    3. 3 1
    4. 2 1
    5. 4 3
    6. 3 4

    数据如上,此种有很多种解法,如下:

    1. select
    2. T1.user_id, T1.user_id
    3. from
    4. tmp T1 inner join tmp T2
    5. on T1.user_id = T2.firend_id and T1.firend_id = T2.user_id
    6. where T1.user_id > T2.user_id
    7. select
    8. split(T.flag, '#')[0] as user_id,
    9. split(T.flag, '#')[1] as frend_id
    10. from
    11. (
    12. select
    13. concat(if(user_id > firend_id, user_id, firend_id), '#', if(user_id < firend_id, user_id, firend_id)) as flag
    14. from
    15. tmp
    16. group by flag, having count(1) >= 2
    17. ) as T

    4.有msg_time,from,to, msg等字段,计算聊天两轮(用户消息一来一回表示一轮)以上的用户。

    1. time from to
    2. 2021-11-24 21:23:00 1 2
    3. 2021-11-24 21:23:01 2 1
    4. 2021-11-24 21:23:02 1 2
    5. 2021-11-24 21:23:03 2 1
    6. 2021-11-24 21:23:04 2 1
    7. 2021-11-24 21:23:01 3 1
    8. 2021-11-24 21:23:02 1 3
    9. 2021-11-24 21:23:03 3 1
    10. 2021-11-24 21:23:04 1 3

    这个其实也和上面一样的一样,但是要判断是不是相邻的就比较麻烦点,解法如下:

    1. select
    2. T4.newId,
    3. count(T4.times_flag)/2 as times
    4. from
    5. (
    6. select
    7. T3.newId,
    8. case
    9. when T3.from = nvl(lead(T3.to) over(partition by T3.newId order by T3.time), T3.from)
    10. and T3.to = nvl(lead(T3.from) over(partition by T3.newId order by T3.time), T3.to) then 1
    11. else null end as times_flag
    12. from
    13. (
    14. select
    15. T1.time, T1.from, T1.to,
    16. concat(if(T1.from < T1.to, T1.from, T1.to), "#", if(T1.from > T1.to, T1.from, T1.to)) as newId
    17. from
    18. temp as T1
    19. ) as T3
    20. ) as T4
    21. where T4.times_flag is not null
    22. group by T4.newId having count(T4.times_flag)/2 >= 2;

    5.微信读书,每次进入书籍后会上报init,点击翻页会上报read事件,求每次init事件之后, read事件出现的次数。

    1. uid event_time event
    2. 1 2021-11-24 21:23:10 init
    3. 1 2021-11-24 21:23:11 read
    4. 1 2021-11-24 21:23:12 read
    5. 1 2021-11-24 21:23:13 read
    6. 1 2021-11-24 21:23:14 init
    7. 1 2021-11-24 21:23:15 read
    8. 2 2021-11-24 21:23:11 init
    9. 2 2021-11-24 21:23:12 read

    很简单,解法如下:

    1. select
    2. T2.uid,
    3. T2.event_time,
    4. nvl(lead(T2.rn1, 1) over(partition by T2.uid order by T2.event_time) - 1, T2.cnt) - T2.rn1 as times
    5. from
    6. ( select
    7. T1.uid, T1.event_time, T1.rn1, T1.cnt
    8. from
    9. (
    10. select
    11. uid,
    12. event_time,
    13. event,
    14. row_number() over(partition by uid order by event_time) as rn1,
    15. count(1) over(partition by uid) as cnt
    16. from
    17. temp
    18. ) as T1
    19. where T1.event = 'init'
    20. ) as T2

    6.有一场篮球赛,参赛双方是A队和B队,场边记录员记录下了每次得分的详细信息:
    team:队名
    number:球衣号,
    name:球员姓名,
    score_time:得分时间,
    score:当次得分


    1)输出每一次的比分的反超时刻,以及对应的完成反超的球员姓名

    2)输出连续三次或以上得分的球员姓名,以及那一拨连续得分的数值

    7.打折日期交叉问题

    如下为平台商品促销数据:字段为品牌,打折开始日期,打折结束日期

    1. brand stt edt
    2. oppo 2021-06-05 2021-06-09
    3. oppo 2021-06-11 2021-06-21
    4. vivo 2021-06-05 2021-06-15
    5. vivo 2021-06-09 2021-06-21
    6. redmi 2021-06-05 2021-06-21
    7. redmi 2021-06-09 2021-06-15
    8. redmi 2021-06-17 2021-06-26
    9. huawei 2021-06-05 2021-06-26
    10. huawei 2021-06-09 2021-06-15
    11. huawei 2021-06-17 2021-06-21

    计算每个品牌总的打折销售天数,注意其中的交叉日期,比如 vivo 品牌,第一次活动时 间为 2021-06-05 到 2021-06-15,第二次活动时间为 2021-06-09 到 2021-06-21 其中 9 号到 15 号为重复天数,只统计一次,即 vivo 总打折天数为 2021-06-05 到 2021-06-21 共计 17 天。

    1. select
    2. id,
    3. sum(if(days>=0,days+1,0)) days
    4. from
    5. (select
    6. id,
    7. datediff(edt,stt) days
    8. from
    9. (select
    10. id,
    11. if(maxEdt is null,stt,if(stt>maxEdt,stt,date_add(maxEdt,1))) stt,
    12. edt
    13. from
    14. (select
    15. id,
    16. stt,
    17. edt,
    18. max(edt) over(partition by id order by stt rows between UNBOUNDED PRECEDING and 1 PRECEDING) maxEdt
    19. from test4
    20. )t1
    21. )t2
    22. )t3
    23. group by id;

    1. 行列转换

    描述:表中记录了各年份各部门的平均绩效考核成绩。
    表名:t1
    表结构:

    1. a -- 年份
    2. b -- 部门
    3. c -- 绩效得分

    表内容:

    1. a b c
    2. 2014 B 9
    3. 2015 A 8
    4. 2014 A 10
    5. 2015 B 7

    问题一:多行转多列

    问题描述:将上述表内容转为如下输出结果所示:

    1. a col_A col_B
    2. 2014 10 9
    3. 2015 8 7

    参考答案:

    1. select
    2. a,
    3. max(case when b="A" then c end) col_A,
    4. max(case when b="B" then c end) col_B
    5. from t1
    6. group by a;

    问题二:如何将结果转成源表?(多列转多行)

    问题描述:将问题一的结果转成源表,问题一结果表名为t1_2。

    参考答案:

    1. select
    2. a,
    3. b,
    4. c
    5. from (
    6. select a,"A" as b,col_a as c from t1_2
    7. union all
    8. select a,"B" as b,col_b as c from t1_2
    9. )tmp;

    问题三:同一部门会有多个绩效,求多行转多列结果

    问题描述:2014年公司组织架构调整,导致部门出现多个绩效,业务及人员不同,无法合并算绩效,源表内容如下:

    1. 2014 B 9
    2. 2015 A 8
    3. 2014 A 10
    4. 2015 B 7
    5. 2014 B 6

    输出结果如下所示:

    1. a col_A col_B
    2. 2014 10 6,9
    3. 2015 8 7

    参考答案:

    1. select
    2. a,
    3. max(case when b="A" then c end) col_A,
    4. max(case when b="B" then c end) col_B
    5. from (
    6. select
    7. a,
    8. b,
    9. concat_ws(",",collect_set(cast(c as string))) as c
    10. from t1
    11. group by a,b
    12. )tmp
    13. group by a;

    2. 排名中取他值

    表名:t2
    表字段及内容:

    1. a b c
    2. 2014 A 3
    3. 2014 B 1
    4. 2014 C 2
    5. 2015 A 4
    6. 2015 D 3

    问题一:按a分组取b字段最小时对应的c字段
    问题二:按a分组取b字段排第二时对应的c字段
    问题二:按a分组取b字段最小和最大时的c的值

    问题四:按a分组取b字段第二小和第二大时对应的c字段

    问题五:按a分组取b字段前两小和前两大时对应的c字段

    参考答案:

    1. select
    2. tmp1.a as a,
    3. min_c,
    4. max_c
    5. from
    6. (
    7. select
    8. a,
    9. concat_ws(',', collect_list(c)) as min_c
    10. from
    11. (
    12. select
    13. a,
    14. b,
    15. c,
    16. row_number() over(partition by a order by b) as asc_rn
    17. from t2
    18. )a
    19. where asc_rn <= 2
    20. group by a
    21. )tmp1
    22. join
    23. (
    24. select
    25. a,
    26. concat_ws(',', collect_list(c)) as max_c
    27. from
    28. (
    29. select
    30. a,
    31. b,
    32. c,
    33. row_number() over(partition by a order by b desc) as desc_rn
    34. from t2
    35. )a
    36. where desc_rn <= 2
    37. group by a
    38. )tmp2
    39. on tmp1.a = tmp2.a;

    3. 累计求值

    表名:t3
    表字段及内容:

    1. a b c
    2. 2014 A 3
    3. 2014 B 1
    4. 2014 C 2
    5. 2015 A 4
    6. 2015 D 3

    问题一:按a分组按b字段排序,对c累计求和

    问题二:按a分组按b字段排序,对c取累计平均值

    问题三:按a分组按b字段排序,对b取累计排名比例

    问题四:按a分组按b字段排序,对b取累计求和比例

    1. select
    2. a,
    3. b,
    4. c,
    5. round(sum(c) over(partition by a order by b) / (sum(c) over(partition by a)),2) as ratio_c
    6. from t3
    7. order by a,b;

    4. 窗口大小控制

    表名:t4
    表字段及内容:

    1. a b c
    2. 2014 A 3
    3. 2014 B 1
    4. 2014 C 2
    5. 2015 A 4
    6. 2015 D 3

    问题一:按a分组按b字段排序,对c取前后各一行的和

    问题二:按a分组按b字段排序,对c取平均值

    参考答案:

    1. select
    2. a,
    3. b,
    4. case when lag_c is null then c
    5. else (c+lag_c)/2 end as avg_c
    6. from
    7. (
    8. select
    9. a,
    10. b,
    11. c,
    12. lag(c,1) over(partition by a order by b) as lag_c
    13. from t4
    14. )temp;

    5. 产生连续数值

    参考答案:

    1. select
    2. id_start+pos as id
    3. from(
    4. select
    5. 1 as id_start,
    6. 1000000 as id_end
    7. ) m lateral view posexplode(split(space(id_end-id_start), '')) t as pos, val
    1. select
    2. row_number() over(order by x) as id
    3. from
    4. (select split(space(999999), '') as x) t
    5. lateral view
    6. explode(x) ex;

    6. 数据扩充与收缩

    表名:t6
    表字段及内容:

    1. a
    2. 3
    3. 2
    4. 4

    问题一:数据扩充

    输出结果如下所示:

    1. a b
    2. 3 321
    3. 2 21
    4. 4 4321

    参考答案:

    1. select
    2. t.a,
    3. concat_ws('、',collect_set(cast(t.rn as string))) as b
    4. from
    5. (
    6. select
    7. t6.a,
    8. b.rn
    9. from t6
    10. left join
    11. (
    12. select
    13. row_number() over() as rn
    14. from
    15. (select split(space(5), '') as x) t -- space(5)可根据t6表的最大值灵活调整
    16. lateral view
    17. explode(x) as t1 pe
    18. ) b
    19. on 1 = 1
    20. where t6.a >= b.rn
    21. order by t6.a, b.rn desc
    22. ) t
    23. group by t.a;

    7. 合并与拆分

    表名:t7
    表字段及内容:

    1. a b
    2. 2014 A
    3. 2014 B
    4. 2015 B
    5. 2015 D

    问题一:合并

    输出结果如下所示:

    1. 2014 A、B
    2. 2015 B、D

    参考答案:

    1. select
    2. a,
    3. concat_ws('、', collect_set(t.b)) b
    4. from t7
    5. group by a;

    问题二:拆分

    问题描述:将分组合并的结果拆分出来

    参考答案:

    1. select
    2. t.a,
    3. d
    4. from
    5. (
    6. select
    7. a,
    8. concat_ws('、', collect_set(t7.b)) b
    9. from t7
    10. group by a
    11. )t
    12. lateral view
    13. explode(split(t.b, '、')) table_tmp as d;

    8. 模拟循环操作

    表名:t8
    表字段及内容:

    1. a
    2. 1011
    3. 0101

    问题一:如何将字符’1’的位置提取出来

    1. select
    2. a,
    3. concat_ws(",",collect_list(cast(index as string))) as res
    4. from (
    5. select
    6. a,
    7. index+1 as index,
    8. chr
    9. from (
    10. select
    11. a,
    12. concat_ws(",",substr(a,1,1),substr(a,2,1),substr(a,3,1),substr(a,-1)) str
    13. from t8
    14. ) tmp1
    15. lateral view posexplode(split(str,",")) t as index,chr
    16. where chr = "1"
    17. ) tmp2
    18. group by a;

    9. 不使用distinct或group by去重

    使用row_number()进行去重操作

    10. 容器–反转内容

    表名:t10
    表字段及内容:

    1. a
    2. AB,CA,BAD
    3. BD,EA

    问题一:反转逗号分隔的数据:改变顺序,内容不变

    1. select
    2. a,
    3. concat_ws(",",collect_list(reverse(str)))
    4. from
    5. (
    6. select
    7. a,
    8. str
    9. from t10
    10. lateral view explode(split(reverse(a),",")) t as str
    11. ) tmp1
    12. group by a;

    问题二:反转逗号分隔的数据:改变内容,顺序不变

    1. select
    2. a,
    3. concat_ws(",",collect_list(reverse(str)))
    4. from
    5. (
    6. select
    7. a,
    8. str
    9. from t10
    10. lateral view explode(split(a,",")) t as str
    11. ) tmp1
    12. group by a;

    11. 多容器–成对提取数据

    表名:t11
    表字段及内容:

    1. a b
    2. A/B 1/3
    3. B/C/D 4/5/2

    问题一:成对提取数据,字段一一对应

    输出结果如下所示:

    1. a b
    2. A 1
    3. B 3
    4. B 4
    5. C 5
    6. D 2
    1. select
    2. a_inx,
    3. b_inx
    4. from
    5. (
    6. select
    7. a,
    8. b,
    9. a_id,
    10. a_inx,
    11. b_id,
    12. b_inx
    13. from t11
    14. lateral view posexplode(split(a,'/')) t as a_id,a_inx
    15. lateral view posexplode(split(b,'/')) t as b_id,b_inx
    16. ) tmp
    17. where a_id=b_id;

    12. 多容器–转多行

    表名:t12
    表字段及内容:

    1. a b c
    2. 001 A/B 1/3/5
    3. 002 B/C/D 4/5

    问题一:转多行

    输出结果如下所示:

    1. a d e
    2. 001 type_b A
    3. 001 type_b B
    4. 001 type_c 1
    5. 001 type_c 3
    6. 001 type_c 5
    7. 002 type_b B
    8. 002 type_b C
    9. 002 type_b D
    10. 002 type_c 4
    11. 002 type_c 5
    1. select
    2. a,
    3. d,
    4. e
    5. from
    6. (
    7. select
    8. a,
    9. "type_b" as d,
    10. str as e
    11. from t12
    12. lateral view explode(split(b,"/")) t as str
    13. union all
    14. select
    15. a,
    16. "type_c" as d,
    17. str as e
    18. from t12
    19. lateral view explode(split(c,"/")) t as str
    20. ) tmp
    21. order by a,d;

    13. 抽象分组–断点排序

    表名:t13
    表字段及内容:

    1. a b
    2. 2014 1
    3. 2015 1
    4. 2016 1
    5. 2017 0
    6. 2018 0
    7. 2019 -1
    8. 2020 -1
    9. 2021 -1
    10. 2022 1
    11. 2023 1

    问题一:断点排序

    输出结果如下所示:

    1. a b c
    2. 2014 1 1
    3. 2015 1 2
    4. 2016 1 3
    5. 2017 0 1
    6. 2018 0 2
    7. 2019 -1 1
    8. 2020 -1 2
    9. 2021 -1 3
    10. 2022 1 1
    11. 2023 1 2
    1. select
    2. a,
    3. b,
    4. row_number() over( partition by b,repair_a order by a asc) as c--按照b列和[b的组首]分组,排序
    5. from
    6. (
    7. select
    8. a,
    9. b,
    10. a-b_rn as repair_a--根据b列值出现的次序,修复a列值为b首次出现的a列值,称为b的[组首]
    11. from
    12. (
    13. select
    14. a,
    15. b,
    16. row_number() over( partition by b order by a asc ) as b_rn--按b列分组,按a列排序,得到b列各值出现的次序
    17. from t13
    18. )tmp1
    19. )tmp2--注意,如果不同的b列值,可能出现同样的组首值,但组首值需要和a列值 一并参与分组,故并不影响排序。
    20. order by a asc;

     14. 业务逻辑的分类与抽象–时效

    日期表:d_date
    表字段及内容:

    1. date_id is_work
    2. 2017-04-13 1
    3. 2017-04-14 1
    4. 2017-04-15 0
    5. 2017-04-16 0
    6. 2017-04-17 1

    工作日:周一至周五09:30-18:30

    客户申请表:t14
    表字段及内容:

    1. a b c
    2. 1 申请 2017-04-14 18:03:00
    3. 1 通过 2017-04-17 09:43:00
    4. 2 申请 2017-04-13 17:02:00
    5. 2 通过 2017-04-15 09:42:00

    问题一:计算上表中从申请到通过占用的工作时长

    输出结果如下所示:

    1. a d
    2. 1 0.67h
    3. 2 10.67h

  • 相关阅读:
    Coelho2021_GMGCv1 计算丰度的方法
    7、Mybatis-Plus condition的使用
    面试(五)
    deepstream中Gst-nvmsgconv和Gst-nvmsgbroker两个插件的作用和关系
    MySQL字符串提取
    Jenkins自动构建(Gitee)
    【41. 最短编辑距离(线性DP)】
    回溯算法题型总结
    韩国程序员面试考什么?
    PHP通过sql生成CSV文件并下载,PHP实现文件下载
  • 原文地址:https://blog.csdn.net/qq_32323239/article/details/126493383