• 快照表转换成拉链表的方式(hive)初始化拉链&增量更新拉链


    从快照表到拉链表的构建方式(hive)

    注意:本文主要讲解在没有新增标识(最后修改时间)的前提下,通过每个字段的比对变更将存量+后续增量的快照表转换成对应的拉链表

    初始数据和存量数据如下:

    快照表->拉链表初始化!!!!!!!!!!!!!
    
    NOTE:这次实践主要是针对hive中的拉链表,通过快照表将hive中得到快照维度表一次性初始化转换成拉链表
    AUTHOR:SHUFANGGEGE
    
    start:原始的快照表数据
    id  name inc_day
    A   a1   20220101
    A   a1   20220102
    A   a1   20220104
    A   a2   20220105
    A   a3   20220106
    A   a3   20220107
    A   a2   20220108
    A   a1   20220109
    B   b1   20220102
    A   b2   20220106
    
    
    end:目标结果的快照数据
    id  name  dw_start_date  dw_end_date
    A   a1     20220101       20220105 
    A   a2     20220105       20220106 
    A   a3     20220106       20220108 
    A   a2     20220108       20220109 
    A   a1     20220109       99991231 
    B   b1     20220102       20220106
    B   b2     20220106       99991231
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29

    过程分2步骤:

    • 从存量快照初始化拉链 (1)
    • 增量合并拉链 (2)

    1 从存量快照初始化拉链

    1.1 具体思路与数据变化过程

    在这里插入图片描述

    1.2 初始化代码

    --创建模拟的快照表
    DROP TABLE test_zip;
    CREATE TABLE test_zip stored as parquet as 
    select 'A' as id ,'a1' as name, '20220101' as inc_day 
    union all select 'A' as id , 'a1' as name , '20220102' as inc_day 
    union all select 'B' as id , 'b1' as name , '20220102' as inc_day 
    union all select 'A' as id , 'a1' as name , '20220104' as inc_day 
    union all select 'A' as id , 'a2' as name , '20220105' as inc_day 
    union all select 'B' as id , 'b2' as name , '20220106' as inc_day 
    union all select 'A' as id , 'a3' as name , '20220106' as inc_day 
    union all select 'A' as id , 'a3' as name , '20220107' as inc_day 
    union all select 'A' as id , 'a2' as name , '20220108' as inc_day 
    union all select 'A' as id , 'a1' as name , '20220109' as inc_day 
    
    
    --通过快照表初始化我们需要的拉链
    SELECT 
    	id,
    	max(name) as name,
    	max(dw_start_date) as dw_start_date,
    	max(dw_end_date) as dw_end_date
    FROM(
    	--END_DATE
    	SELECT 
    	id,
    	name,
    	NULL AS dw_start_date,
    	dw_end_date,
    	ROW_NUMBER() OVER(partition by id order by inc_day) AS rn
    	FROM(
    	SELECT 
    	id,
    	name,
    	nvl(lead(name,1) over(partition by id order by inc_day),'NULL') as name_o,
    	inc_day,
    	nvl(lead(inc_day,1) over(partition by id order by inc_day),'99991231') as dw_end_date
    	FROM test_zip
    	) t
    	WHERE name_o <> name
    
    	UNION ALL 
        
    	--START_DATE
    	SELECT 
    	id,
    	name,
    	inc_day AS dw_start_date,
    	NULL AS dw_end_date,
    	ROW_NUMBER() OVER(partition by id order by inc_day) AS rn
    	FROM(
    	SELECT 
    	id,
    	name,
    	nvl(LAG(name,1) over(partition by id order by inc_day),'NULL') as name_o,
    	inc_day
    	--nvl(lag(inc_day,1) over(partition by id order by inc_day),inc_day) as dw_end_date
    	FROM test_zip 
    	) t1
    	WHERE name_o <> name
    ) t2 
    GROUP BY id,rn order by id,rn
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
    • 52
    • 53
    • 54
    • 55
    • 56
    • 57
    • 58
    • 59
    • 60
    • 61
    • 62

    2 增量合并拉链

    在存量快照更新到拉链表中后,需要在调度上每日将新增及变化的数据与历史拉链进行合并,具体的要求如下:

    #历史拉链,start到20220109
    id  name  dw_start_date  dw_end_date
    A   a1     20220101       20220105 
    A   a2     20220105       20220106 
    A   a3     20220106       20220108 
    A   a2     20220108       20220109 
    A   a1     20220109       99991231(最新) 
    B   b1     20220102       20220106
    B   b2     20220106       99991231(最新)
    
    
    #T-1每日新增及变化
    A   a4     20220110(变化)
    B   b3     20220110(变化)
    C   c1     20220110(新增)
    #T-2存量快照(需从拉链表获取,以避免T-2有的数据在T-1的时候已经被删掉了)
    A   a1     20220109       99991231(最新) 
    B   b2     20220106       99991231(最新)
    NOTE:如果从快照T-2中取,加入T-1是20220110,那么T-2只能取到:A   a1     20220109,而B  b3  20220110会被视为新增,最终导致拉链表会重复。
    B   b2     20220106       99991231(最新)
    
    #需求:合并后的结果
    id  name  dw_start_date  dw_end_date
    A   a1     20220101       20220105 
    A   a2     20220105       20220106 
    A   a3     20220106       20220108 
    A   a2     20220108       20220109 
    A   a1     20220109       20220110(修改) 
    B   b1     20220102       20220106
    B   b2     20220106       20220110(修改)
    ----------------------------------
    A   a4     20220110		  99991231(新增)	
    B   b3     20220110       99991231(新增)
    C   c1     20220110       99991231(新增)
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34

    2.1 从存量快照获取到T-2的有效信息(主要是为了支持重跑)

    -- 获取T-2的数据的时候,需要将时效时间为T-1的数据还原成99991231,如:20220110换成99991231
    CREATE TABLE tmp2 AS 
    select
        id,
        name,
        dw_start_date,
        if(dw_end_date = '20220110','99991231',dw_end_date) as dw_end_date --还原当日上次因任务调度被修改的dw_end_date
    from test_zip_chain where dw_start_date <= '20220109'
    
    /***
    id  name  dw_start_date  dw_end_date			id  name dw_start_date dw_end_date			
    A   a1     20220101       20220105 				A   a1  20220101   20220105 		
    A   a2     20220105       20220106          	A   a2  20220105   20220106 
    A   a3     20220106       20220108          	A   a3  20220106   20220108 
    A   a2     20220108       20220109          =>	A   a2  20220108   20220109 
    A   a1     20220109       20220110(被修改)	     A   a1  20220109   99991231(还原)
    B   b1     20220102       20220106          	B   b1  20220102   20220106
    B   b2     20220106       20220110(被修改)	     B   b2  20220106   99991231(还原)
    */
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19

    2.2 从T-1与T-2的拉链有效数据中获取新增及变化

    CREATE TABLE tmp1 AS 
    SELECT
    	t3.id,
    	t3.name,
    	t2.flag
    FROM(
    	SELECT
    		t1.id,
    		if(MAX(old_name) IS NULL,'新增','修改') as flag --打上新增、修改标签
    	FROM(
    		--T-1增量数据
    		SELECT 
    			name as new_name,
    			id,
    			null as old_name
    		FROM test_zip 
    		WHERE a.inc_day = '20220110' 
    		UNION ALL 
    		--T-2存量最新拉链的99991231的有效数据,因为拉链表来源于快照表,当快照的T-1数据准备好后,再与T-2的拉链进行比对筛选出新增及变化
    		SELECT 
    			null as new_name,
    			id,
    			name as old_name
    		FROM tmp2 where dw_end_date = '99991231'  --此时
    	) t1
    	GROUP BY id
    	HAVING NVL(MAX(new_name),'null') <> NVL(MAX(old_name),'null') --找出差异项
    ) t2 
    JOIN test_zip t3 --关联取最新的id的全量信息
    ON t2.id = t3.id
    
    /****
    整体数据变化情况如下:
    a4   A   null      		
    b3   B   null 	 			a4	A  a1  (修改)     			A  a4	(修改)
    c1   C   null 		=> 		b3 	B  b2  (修改) 		=>  	B  b3   (修改)	
    null A   a1		 			c1	C  null(新增)     			C  c1	(新增)
    null B   b2
    */
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39

    2.3 将增量与T-2的存量快照进行合并

    INSERT OVERWRITE TABLE test_zip_chain
    --查询新增
    SELECT 
        id,
        name,
        '20220110' as dw_start_date,
        '99991231' as dw_end_date
    FROM tmp1
    
    UNION ALL 
    --修改存量的99991231为T-1日期:20220110
    SELECT
        id,
        name,
        dw_start_date,
        if(tmp2.dw_end_date = '99991231' and tmp1.id is not null ,'20220110',tmp2.dw_end_date) as dw_end_date
    FROM tmp2
    left join tmp1 
    on tmp2.id = tmp1.id 
    
    /***
    A  a4	(修改)
    B  b3   (修改)	
    C  c1	(新增)
    
    -------
    
    id  name  dw_start_date  dw_end_date		
    A   a1     20220101       20220105 		
    A   a2     20220105       20220106 
    A   a3     20220106       20220108 
    A   a2     20220108       20220109 
    A   a1     20220109       99991231(还原)
    B   b1     20220102       20220106
    B   b2     20220106       99991231(还原)
    -------
    
    合并后结果
    id  name  dw_start_date  dw_end_date		
    A   a1     20220101       20220105 		
    A   a2     20220105       20220106 
    A   a3     20220106       20220108 
    A   a2     20220108       20220109 
    A   a1     20220109       20220110 (修改存量的时效时间)
    B   b1     20220102       20220106
    B   b2     20220106       20220110 (修改存量的时效时间)
    A   a4	   20220110       99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
    B   b3     20220110       99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
    C   c1	   20220110       99991231 (将新增及修改的增量改成从T-1开始生效,时效为99991231)
    */
    
    
    • 1
    • 2
    • 3
    • 4
    • 5
    • 6
    • 7
    • 8
    • 9
    • 10
    • 11
    • 12
    • 13
    • 14
    • 15
    • 16
    • 17
    • 18
    • 19
    • 20
    • 21
    • 22
    • 23
    • 24
    • 25
    • 26
    • 27
    • 28
    • 29
    • 30
    • 31
    • 32
    • 33
    • 34
    • 35
    • 36
    • 37
    • 38
    • 39
    • 40
    • 41
    • 42
    • 43
    • 44
    • 45
    • 46
    • 47
    • 48
    • 49
    • 50
    • 51
  • 相关阅读:
    【C ++基础】迭代器(iterator)在string里面的简单使用
    黑马SpringBoot(P3-P7) 快速创建SpringBoot项目的方式
    香港云服务器搭建个人网站为什么更灵活
    专利:一种基于深度强化学习的机器人工件抓取方法
    如何理解 AnnData ?
    十一、数据结构——多路查找树
    C Primer Plus(6) 中文版 第11章 字符串和字符串函数 11.3 字符串输出
    Mqtt学习笔记--交叉编译移植(1)
    [Games 101] Lecture 11-12 Geometry 2 (Curves and Surfaces)
    开户的期货公司各有擅长的领域
  • 原文地址:https://blog.csdn.net/shufangreal/article/details/126434670