业务需要,需要计算同一个用户,通过同一个访问渠道,每次访问的时间间隔。
主要要用到的lag和lead分析函数。这边借这个例子,详细展开来讲一下。
lag(col, n, default) over()
lead(col, n, default) over()
- select
- *,
- lag(col_1, 1, 'none') over(partition by col_2 order by col_3) as lag_col
- from table;
构造原始数据如下,表名为user_visit_table:
使用lag函数,按user_id和visit_channel分组、然后按visit_time排序,获取到相邻的visit_time列,生成一个新的列,命名为time_lag,代码如下:
- SELECT
- user_id
- ,visit_channel
- ,visit_time
- ,LAG(visit_time) OVER(PARTITION BY user_id, visit_channel ORDER BY visit_time) AS time_lag
- FROM user_visit_table;
执行结果如下:
然后再计算visit_time和time_lag的差值,就比较简单了,代码如下:
- SELECT
- *
- ,IF(t.time_lag IS NOT null, unix_timestamp(t.visit_time)-unix_timestamp(t.time_lag), 0) AS time_diff
- FROM
- (
- SELECT
- user_id
- ,visit_channel
- ,visit_time
- ,LAG(visit_time) OVER(PARTITION BY user_id, visit_channel ORDER BY visit_time) AS time_lag
- FROM user_visit_table
- ) t;
执行结果如下:
hive分析函数lead()和lag()的应用_不想起的昵称的博客-CSDN博客_hive lead