How do I join three slowly changing dimensions, sorted by all start date columns?

Posted 2023-03-31

【Posted】: 2021-08-23 15:23:26

【Question】:

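I am trying to join data from three type 2 slowly changing dimensions (SCD). When I query the result, the rows are not ordered by date across the dimensions as I expect.

I have the following slowly changing dimensions:

Subsidiaries table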

id	name	subsidiary	department	start_date_dep	end_date_dep	last_record_flg
1	John Doe	AL	Engineering	2005-10-01	2013-01-01	0
1	John Doe	AL	Sales	2013-01-01	2014-05-01	0
1	John Doe	NY	Sales	2014-05-01		1
38	Ivy Johnson	NY	Sales	2020-06-01		1

Functions table

id	function	start_date_fun	end_date_fun	last_record_flg
1	operator	2005-10-01	2009-08-01	0
1	leader	2009-08-01	2011-10-01	0
1	manager	2011-10-01	2017-07-01	0
1	director	2017-07-01		1
38	operator	2020-06-01		1

Graduations table

id	university_graduation	conclusion_date	last_record_flg
1	bachelor	2005-12-15	0
1	master	2008-12-15	1
38	bachelor	2014-12-15	1

The desired result is:

id	name	subsidiary	department	start_date_dep	end_date_dep	last_record_flg	function	start_date_fun	end_date_fun	last_record_flg	university_graduation	conclusion_date	last_record_flg	max_date	seq	start	end	last_record_flg
1	John Doe	AL	Engineering	2005-10-01	2013-01-01	0	operator	2005-10-01	2009-08-01	0	bachelor	2005-12-15	0	2005-12-15	1	2005-10-01	2008-12-15	0
1	John Doe	AL	Engineering	2005-10-01	2013-01-01	0	operator	2005-10-01	2009-08-01	0	master	2008-12-15	1	2008-12-15	1	2008-12-15	2009-08-01	0
1	John Doe	AL	Engineering	2005-10-01	2013-01-01	0	leader	2009-08-01	2011-10-01	0	master	2008-12-15	1	2009-08-01	1	2009-08-01	2011-10-01	0
1	John Doe	AL	Engineering	2005-10-01	2013-01-01	0	manager	2011-10-01	2017-07-01	0	master	2008-12-15	1	2011-10-01	1	2011-10-01	2013-01-01	0
1	John Doe	AL	Sales	2013-01-01	2014-05-01	0	manager	2011-10-01	2017-07-01	0	master	2008-12-15	1	2013-01-01	1	2013-01-01	2014-05-01	0
1	John Doe	NY	Sales	2014-05-01	NULL	1	manager	2011-10-01	2017-07-01	0	master	2008-12-15	1	2014-05-01	1	2014-05-01	2017-07-01	0
1	John Doe	NY	Sales	2014-05-01	NULL	1	director	2017-07-01	NULL	1	master	2008-12-15	1	2017-07-01	1	2017-07-01	NULL	1
38	Ivy Johnson	NY	Sales	2020-06-01	NULL	1	operator	2020-06-01	NULL	1	bachelor	2014-12-15	1	2020-06-01	1	2020-06-01	NULL	1

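I tried using CROSS APPLY, but it returns only one row per id. I am now trying CASE WHEN, but the query output does not exactly match the desired result. In my output, the columns 'FUNCTION' and 'START_DATE_FUN' do not follow the order shown in the desired result, and neither do the columns 'UNIVERSITY_GRADUATION' and 'CONCLUSION_DATE'.

Query: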

select 
    *
from(
    select 
        tb.*
        ,row_number() over(partition by tb.id,tb.max_date order by tb.max_date) as seq
        ,tb.max_date as [start]
        ,lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) as [end] 
        ,case when lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) is null then 1 else 0 end as last_record_flg
    from(
        select 
            sb.id
            ,sb.[name]
            ,sb.subsidiary
            ,sb.department
            ,sb.start_date_dep
            ,sb.end_date_dep
            ,sb.last_record_flg as lr_sb
            ,fc.[function]
            ,fc.start_date_fun
            ,fc.end_date_fun
            ,fc.last_record_flg as lr_fc
            ,gd.university_graduation
            ,gd.end_date_grad
            ,gd.last_record_flg as lr_gd
            ,case
                when sb.start_date_dep >= fc.start_date_fun and sb.start_date_dep >= gd.end_date_grad then sb.start_date_dep
                when fc.start_date_fun >= sb.start_date_dep and fc.start_date_fun >= gd.end_date_grad then fc.start_date_fun
                else gd.end_date_grad
            end as max_date
        from 
            #Subsidiaries as sb
            left outer join #Functions as fc
                on sb.id = fc.id
            left outer join #Graduations as gd
                on sb.id = gd.id
    ) as tb
) as tb2
where
    tb2.seq = 1

DDL below:

create table #Subsidiaries (
    id int
    ,[name] varchar(15)
    ,subsidiary varchar(2)
    ,department varchar(15)
    ,start_date_dep date
    ,end_date_dep date
    ,last_record_flg bit
)
go

insert into #Subsidiaries values
(1,'John Doe','AL','Engineering','2005-10-01','2013-01-01',0),
(1,'John Doe','AL','Sales','2013-01-01','2014-05-01',0),
(1,'John Doe','NY','Sales','2014-05-01',null,1),
(38,'Ivy Johnson','NY','Sales','2020-06-01',null,1)
go

create table #Functions (
    id int
    ,[function] varchar(15)
    ,start_date_fun date
    ,end_date_fun date
    ,last_record_flg bit
)
go

insert into #Functions values
(1,'operator','2005-10-01','2009-08-01',0),
(1,'leader','2009-08-01','2011-10-01',0),
(1,'manager','2011-10-01','2017-07-01',0),
(1,'director','2017-07-01',null,1),
(38,'operator','2020-06-01',null,1)
go

create table #Graduations (
    id int
    ,university_graduation varchar(15)
    ,end_date_grad date
    ,last_record_flg bit
)
go

insert into #Graduations values
(1,'bachelor','2005-12-15',0),
(1,'master','2008-12-15',1),
(38,'bachelor','2014-12-15',1)
go

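【Comments】:

I don't follow the ordering you want for the result. Your expected output looks the same as what your query currently returns.

Maybe show us what you want returned, because what you have written is confusing. For example: you have 3 tables, so what should the result look like?

I have heard of others, specifically people working with HR data where the attributes of employee and organization relationships are spread across dozens of SCD tables, who handle this by generating one record per day for each record's validity period and then joining with that date as part of the key. It is very heavy, but it solves the big problem of intersecting date periods. Unfortunately, few RDBMSs support a "period" data type where period intersection is built-in functionality (such as Postgres or Teradata).

I updated the question to make the expected result clearer and to show where my query's output differs from it.

【Answer 1】: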

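If anyone else runs into the same difficulty joining two or more SCD type 2 tables, I found a reference at https://sqlsunday.com/2014/11/30/joining-two-scd2-tables/ (SQL Sunday). It helped me build the query and use range intervals in the join conditions to return the result I needed.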
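
The answer does not include the final query, so the following is only a minimal sketch of what a range-interval join could look like against the temp tables from the question. It is not the query the author ended up with. Several things here are assumptions: #Graduations has no start/end pair, so the sketch derives a validity range with LAG/LEAD; the first degree per id is treated as valid from the earliest possible date (this is what makes the first desired row start at 2005-10-01 rather than 2005-12-15); open-ended rows are compared using a '9999-12-31' sentinel; and the names gd, rng, v, start_date_grad and next_date_grad are made up for illustration. It needs SQL Server 2012 or later for LAG/LEAD.

```sql
-- Sketch only: range-overlap join of three SCD type 2 tables (T-SQL).
-- Assumes #Subsidiaries, #Functions and #Graduations exist as created in the DDL.
with gd as (
    select
        id
        ,university_graduation
        ,end_date_grad
        -- Assumption: the first degree per id applies from the earliest possible date.
        ,case
            when lag(end_date_grad) over (partition by id order by end_date_grad) is null
                then cast('0001-01-01' as date)
            else end_date_grad
        end as start_date_grad
        -- A degree stays "current" until the next one is concluded (null = still current).
        ,lead(end_date_grad) over (partition by id order by end_date_grad) as next_date_grad
    from #Graduations
)
select
    sb.id
    ,sb.[name]
    ,sb.subsidiary
    ,sb.department
    ,fc.[function]
    ,gd.university_graduation
    ,rng.[start]
    ,rng.[end]
    ,case when rng.[end] is null then 1 else 0 end as last_record_flg
from #Subsidiaries as sb
inner join #Functions as fc
    on fc.id = sb.id
    -- Two half-open ranges overlap when each one starts before the other ends.
    and fc.start_date_fun < isnull(sb.end_date_dep, '9999-12-31')
    and sb.start_date_dep < isnull(fc.end_date_fun, '9999-12-31')
inner join gd
    on gd.id = sb.id
    and gd.start_date_grad < isnull(sb.end_date_dep, '9999-12-31')
    and gd.start_date_grad < isnull(fc.end_date_fun, '9999-12-31')
    and sb.start_date_dep < isnull(gd.next_date_grad, '9999-12-31')
    and fc.start_date_fun < isnull(gd.next_date_grad, '9999-12-31')
cross apply (
    -- The combined range is the latest start and the earliest end of the three rows.
    select
        max(v.s) as [start]
        ,nullif(min(v.e), '9999-12-31') as [end]
    from (values
        (sb.start_date_dep, isnull(sb.end_date_dep, '9999-12-31'))
        ,(fc.start_date_fun, isnull(fc.end_date_fun, '9999-12-31'))
        ,(gd.start_date_grad, isnull(gd.next_date_grad, '9999-12-31'))
    ) as v(s, e)
) as rng
order by
    sb.id
    ,rng.[start];
```

Because each row survives only where all three validity ranges intersect, the join no longer multiplies every department row by every function and graduation row for the same id, which is what joining on id alone does. On the sample data this should produce the start/end ranges listed in the desired result. The sketch uses inner joins, so an id with no row in #Functions or #Graduations would drop out; keeping such ids would need outer joins plus NULL handling in the range conditions.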
