• Hive(20):Hive的函数之窗口函数


    目录

    0. 相关文章链接

    1. 窗口函数概述

    1.1. 什么是窗口函数

    1.2. 窗口函数的功能

    1.3. 窗口函数的种类和位置

    1.4. 窗口函数说明

    2. 聚合窗口函数

    3. 排序窗口函数

    3.1. 不同排序窗口函数的区别

    3.2. SQL编写和查询结果展示

    4. 排序截取数据函数

    4.1. 排序截取数据函数的区别

    4.2. 原始数据

    4.3. lag和lead函数的使用和结果

    5. 排序限制分区函数

    5.1. 各排序限制分区函数说明

    5.2. SQL编写和结果展示


    0. 相关文章链接

     Hive文章汇总 

    1. 窗口函数概述

    1.1. 什么是窗口函数

    • 简单理解,就是对查询的结果多出一列,这一列可以是聚合值,也可以是排序值。
    • 开窗函数一般就是说的是over()函数,其窗口是由一个 OVER 子句 定义的多行记录
    • 开窗函数一般分为两类,聚合开窗函数和排序开窗函数。

    1.2. 窗口函数的功能

    <窗口函数> over (partition by <用于分组的列名> order by <用于排序的列名> [rows between ?? and ???])
    
    • 同时具有分组和排序的功能
    • 不减少原表的行数

    1.3. 窗口函数的种类和位置

    • 专用窗口函数,包括rank, dense_rank, row_number等专用窗口函数
    • 聚合函数,如sum(). avg(), count(), max(), min()等,rows between…and…
    • 因为窗口函数是对where或者group by子句处理后的结果进行操作,所以窗口函数原则上只能写在select子句中

    1.4. 窗口函数说明

    • over():是窗口函数的核心,需要开窗的全部都需要使用这个函数
    • partition by:用来对表分组,后续跟上分组的字段,会给该字段中每个不同的值开一个窗(可以添加多个字段)
    • order by:是对分组后的结果进行排序,默认是按照升序(asc)排列,可以进行多个字段降序
    • rows between ?? and ???:用来对窗口中已经排序完成后的上下数据进行筛选

    2. 聚合窗口函数

            聚合窗口函数就是在通过over开窗的情况下,使用sum、max等聚合函数对开窗之后的结果进行聚合汇总;跟普通的聚合函数不同,普通聚合函数只会生成一条结果数据,聚合窗口函数会在每一行中都新增一列,用了存储这聚合结果,这个窗口有多少条数据,那就会有多少条一模一样的结果数据。

    原始数据如下表所示(只列举出了部分数据):

    user_idcat_idscore
    6184880.0320793066566
    732556480.1794971885383
    732556530.0195317941826
    95494720.2734451185567
    95493160.0102923143349
    101875600.09287464288
    160932990.8203353556701
    191094500.0129389094682
    2904355800.0780796261011
    290433600.0445708866509
    302073010.1283038644617
    374693180.0672950094793
    421213640.0285768131938
    450214760.0290241345941
    481613081.3997567045733
    481615300.0280696255429
    530085100.0292976912317
    535683020.6823961209357
    583973000.0356667545944

    SQL编写和查询结果展示:

    1. select
    2. user_id
    3. , third_cat_id as cat_id
    4. , score
    5. , max(score) over(partition by user_id) max_score_of_user
    6. , min(score) over(partition by user_id) min_score_of_user
    7. , count(score) over(partition by user_id) cnt_score_of_user
    8. , avg(score) over(partition by user_id) avg_score_of_user
    9. , sum(score) over(partition by user_id) sum_score_of_user
    10. from yishou_recommendation_system.user_third_category_score_15day
    user_idcat_idscoremax_score_of_usermin_score_of_usercnt_score_of_useravg_score_of_usersum_score_of_user
    16973250.249999999280.249999999280.2499999992810.249999999280.24999999928
    136384010.19244081147352.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383180.84335643795662.31447644475470.0153345765378290.305926986468572478.871882607588601
    136382950.84730768579542.31447644475470.0153345765378290.305926986468572478.871882607588601
    136382990.34134523376742.31447644475470.0153345765378290.305926986468572478.871882607588601
    136385560.46361789494452.31447644475470.0153345765378290.305926986468572478.871882607588601
    136384720.07692307692312.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383140.15312435648972.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383630.01533457653782.31447644475470.0153345765378290.305926986468572478.871882607588601
    136384310.09287464301372.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383030.15120205926572.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383160.12525901674572.31447644475470.0153345765378290.305926986468572478.871882607588601
    136384900.01533457653782.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383010.45500034447242.31447644475470.0153345765378290.305926986468572478.871882607588601
    136384880.49611836048252.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383000.26986804799782.31447644475470.0153345765378290.305926986468572478.871882607588601
    136383610.07692307692312.31447644475470.0153345765378290.305926986468572478.871882607588601
    136384930.01533457653782.31447644475470.0153345765378290.305926986468572478.871882607588601
    136385210.13321107238672.31447644475470.0153345765378290.305926986468572478.871882607588601
    136385610.07692307692312.31447644475470.0153345765378290.305926986468572478.871882607588601

    3. 排序窗口函数

    3.1. 不同排序窗口函数的区别

    • rank():如果有并列名次的行,会占用下一名次的位置,排序结果是 1,1,3
    • dense_rank():如果有并列名次的行,不占用下一名次的位置,排序结果是 1,1,2
    • row_number():不考虑并列名次的情况,排序结果是 1,2,3

    3.2. SQL编写和查询结果展示

    1. select
    2. user_id
    3. , third_cat_id as cat_id
    4. , score
    5. , rank() over(partition by user_id order by score) score_rank_of_user
    6. , dense_rank() over(partition by user_id order by score) score_dense_rank_of_user
    7. , row_number() over(partition by user_id order by score) score_row_number_of_user
    8. from yishou_recommendation_system.user_third_category_score_15day
    user_idcat_idscorescore_rank_of_userscore_dense_rank_of_userscore_row_number_of_user
    16973250.24999999928111
    136384930.0153345765378111
    136383630.0153345765378112
    1363856570.0153345765378113
    136384900.0153345765378114
    136383250.0153345765378115
    136383400.0379776680745626
    136384720.0769230769231737
    136383610.0769230769231738
    136385610.0769230769231739
    136385620.07692307692317310
    136385100.092874643013711411
    136384310.092874643013711412
    136383160.125259016745713513
    136385210.133211072386714614
    136383940.140160736612315715
    136383030.151202059265716816
    136383140.153124356489717917
    136384010.1924408114735181018
    136383000.2698680479978191119
    136382990.3413452337674201220
    136383040.4085346499663211321
    136383080.4296759077854221422
    136383010.4550003444724231523
    136385560.4636178949445241624
    136384880.4961183604825251725
    136382940.4990914022087261826
    136383180.8433564379566271927
    136382950.8473076857954282028
    136383022.3144764447547292129

    4. 排序截取数据函数

    4.1. 排序截取数据函数的区别

    • lag(col,n,default_val):获取往前第n行数据,col是列名,n是往上的行数,当第n行为null的时候取default_val
    • lead(col,n, default_val):往后第n行数据,col是列名,n是往下的行数,当第n行为null的时候取default_val
    • ntile(n):把有序分区中的行分发到指定数据的组中,各个组有编号,编号从1开始,对于每一行,NTILE返回此行所属的组的编号。
    • cume_dist(),计算某个窗口或分区中某个值的累积分布。假定升序排序,则使用以下公式确定累积分布:小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中,x 等于 order by 子句中指定的列的当前行中的值。

    4.2. 原始数据

    user_idrecord_date
    113433252022-09-04 06:17:24
    113433252022-09-04 06:17:24
    113433262022-09-04 06:17:25
    113433262022-09-04 06:17:25
    113433272022-09-04 06:17:26
    113433272022-09-04 06:17:26
    113433262022-09-04 06:17:27
    113433282022-09-04 06:17:27
    113433282022-09-04 06:17:27
    113433292022-09-04 06:17:28
    113433292022-09-04 06:17:28
    46549232022-09-04 06:17:30
    113433302022-09-04 06:17:32
    113433302022-09-04 06:17:32
    113433312022-09-04 06:17:35
    113433312022-09-04 06:17:35
    113433322022-09-04 06:17:37
    113433322022-09-04 06:17:37
    113433332022-09-04 06:17:37
    113433332022-09-04 06:17:37
    113433342022-09-04 06:17:40
    113433342022-09-04 06:17:40

    4.3. lag和lead函数的使用和结果

    1. select
    2. user_id
    3. , record_date
    4. , lag(record_date, 1, '1970-01-01 00:00:00') over(partition by user_id order by record_date) as start_date
    5. , lead(record_date, 1, '9999-99-99 00:00:00') over(partition by user_id order by record_date) as end_date
    6. from user_table

            使用user_id分组,并使用record_date进行升序排序;lag函数会获取之前一行的record_date数据作为start_date字段的值,并回使用 '1970-01-01 00:00:00' 作为默认值;lead函数会获取之后一行的record_date数据作为end_date字段的值,并回使用 '9999-99-99 00:00:00' 作为默认值;结果如下所示:

    user_idrecord_datestart_dateend_date
    10000452022-09-03 11:41:501970-01-01 00:00:009999-99-99 00:00:00
    10000612022-09-03 15:22:211970-01-01 00:00:009999-99-99 00:00:00
    1000132022-09-03 08:14:361970-01-01 00:00:002022-09-03 14:47:43
    1000132022-09-03 14:47:432022-09-03 08:14:362022-09-03 14:48:25
    1000132022-09-03 14:48:252022-09-03 14:47:432022-09-03 14:49:11
    1000132022-09-03 14:49:112022-09-03 14:48:252022-09-03 15:08:41
    1000132022-09-03 15:08:412022-09-03 14:49:112022-09-03 15:09:27
    1000132022-09-03 15:09:272022-09-03 15:08:419999-99-99 00:00:00

    具体赋值如下图所示:

    5. 排序限制分区函数

    5.1. 各排序限制分区函数说明

    • rows必须跟在Order by 子句之后,对排序的结果进行限制,使用固定的行数来限制分区中的数据行数量;
    • OVER():指定分析函数工作的数据窗口大小,这个数据窗口大小可能会随着行的变而变化;
    • current row:当前行;
    • n preceding:往前n行数据;
    • n following:往后n行数据;
    • unbounded preceding:该窗口数据的起点;
    • unbounded following:该窗口数据的终点;

    5.2. SQL编写和结果展示

    1. select
    2. user_id
    3. , third_cat_id as cat_id
    4. , score
    5. -- 以user_id分组,按照score进行升序排序
    6. , row_number() over(partition by user_id order by score) as row_number_id
    7. -- 以user_id分组,按照score进行升序排序,求该窗口中第一行到当前行的和
    8. , sum(score) over(partition by user_id order by score rows between unbounded preceding and current row) sum_1
    9. -- 以user_id分组,按照score进行升序排序,求该窗口中之前一行到当前行的和
    10. , sum(score) over(partition by user_id order by score rows between 1 preceding and current row) sum_2
    11. -- 以user_id分组,按照score进行升序排序,求该窗口中当前行到之后一行的和
    12. , sum(score) over(partition by user_id order by score rows between current row and 1 following) sum_3
    13. -- 以user_id分组,按照score进行升序排序,求该窗口中当前行到最后一行的和
    14. , sum(score) over(partition by user_id order by score rows between current row and unbounded following) sum_3
    15. from yishou_recommendation_system.user_third_category_score_15day
    user_idcat_idscorerow_number_idsum_1sum_2sum_3sum_3
    16973250.2499999992810.249999999280.249999999280.249999999280.24999999928
    136384900.015334576537810.01533457653780.01533457653780.03066915307568.8718826075886
    136384930.015334576537820.03066915307560.03066915307560.03066915307568.8565480310508
    136383630.015334576537830.04600372961340.03066915307560.03066915307568.841213454513
    136383250.015334576537840.06133830615120.03066915307560.03066915307568.8258788779752
    1363856570.015334576537850.0766728826890.03066915307560.0533122446123000048.8105443014374
    136383400.037977668074560.11465055076350.0533122446123000040.11490074499768.7952097248996
    136385610.076923076923170.191573627686600020.11490074499760.15384615384628.7572320568251
    136384720.076923076923180.268496704609700040.15384615384620.15384615384628.680308979902
    136385620.076923076923190.345419781532800060.15384615384620.15384615384628.6033859029789
    136383610.0769230769231100.42234285845590010.15384615384620.16979771993688.5264628260558
    136384310.0928746430137110.51521750146960010.16979771993680.18574928602748.449539749132699
    136385100.0928746430137120.60809214448330010.18574928602740.218133659759400028.356665106119
    136383160.1252590167457130.7333511612290.218133659759400020.25847008913248.2637904631053
    136385210.1332110723867140.86656223361570.25847008913240.2733718089998.1385314463596
    136383940.1401607366123151.0067229702280.2733718089990.291362795877999968.0053203739729
    136383030.1512020592657161.15792502949370.291362795877999960.30432641575547.8651596373606
    136383140.1531243564897171.31104938598340.30432641575540.34556516796327.713957578094901
    136384010.1924408114735181.50349019745690.34556516796320.46230885947137.5608332216052005
    136383000.2698680479978191.77335824545469990.46230885947130.61121328176520017.368392410131699
    136382990.3413452337674202.11470347922210.61121328176520010.74987988373377.098524362133899
    136383040.4085346499663212.52323812918839960.74987988373370.83821055775170016.7571791283665
    136383080.4296759077854222.95291403697380.83821055775170010.88467625225786.3486444784002
    136383010.4550003444724233.40791438144620.88467625225780.91861823941695.918968570614799
    136385560.4636178949445243.87153227639070.91861823941690.9597362554275.4639682261424
    136384880.4961183604825254.36765063687320.9597362554270.99520976269125.0003503311979
    136382940.4990914022087264.86674203908190.99520976269121.34244784016534.5042319707154
    136383180.8433564379566275.71009847703851.34244784016531.6906641237524.0051405685067
    136382950.8473076857954286.55740616283389951.6906641237523.16178413055013.1617841305501
    136383022.3144764447547298.87188260758863.16178413055012.31447644475472.3144764447547
    165303140.083333333093310.08333333309330.08333333309330.16666666618661.6825354114854
    165303180.083333333093320.16666666618660.16666666618660.235710039492700031.5992020783921002
    165304930.152376706399430.3190433725860.235710039492700030.30475341279881.5158687452988
    165303030.152376706399440.471420078985400060.30475341279880.33812599215941.3634920388994
    165303250.1857492857650.65716936474540.33812599215940.371498571521.2111153325
    165304890.1857492857660.84291865050540.371498571521.025366046741.02536604674
    165303240.8396167609871.68253541148541.025366046740.839616760980.83961676098

    注:其他Hive相关系列文章链接由此进 -> Hive文章汇总


  • 相关阅读:
    华为防火墙基础自学系列 | IKE介绍
    Spring Security是什么? - 简单例子(三)
    源码硬讲HashMap结构及数据结构转换过程(图+文)
    matlab之cell数组的详细用法
    react memo判断刷新机制 自定义的比较函数 避免重复渲染
    c# 实验六 菜单栏、工具栏和状态栏的设计
    Cannot install powershell modules
    怎么将客户引到私域?
    SpringBoot+Vue实现前后端的毕业设计管理系统
    C++ Qt开发:运用QJSON模块解析数据
  • 原文地址:https://blog.csdn.net/yang_shibiao/article/details/126721573