How to execute a join between three slowly changing dimensions, sorted by all start date columns?
Posted: 2021-08-23 15:23:26
Question: I am trying to join data between three type 2 slowly changing dimensions (SCD). When I query the result, the ordering by date across the dimensions is not what I expect.
I have the slowly changing dimensions below:
Table Subsidiaries
id | name | subsidiary | department | start_date_dep | end_date_dep | last_record_flg |
---|---|---|---|---|---|---|
1 | John Doe | AL | Engineering | 2005-10-01 | 2013-01-01 | 0 |
1 | John Doe | AL | Sales | 2013-01-01 | 2014-05-01 | 0 |
1 | John Doe | NY | Sales | 2014-05-01 | NULL | 1 |
38 | Ivy Johnson | NY | Sales | 2020-06-01 | NULL | 1 |
Table Functions
id | function | start_date_fun | end_date_fun | last_record_flg |
---|---|---|---|---|
1 | operator | 2005-10-01 | 2009-08-01 | 0 |
1 | leader | 2009-08-01 | 2011-10-01 | 0 |
1 | manager | 2011-10-01 | 2017-07-01 | 0 |
1 | director | 2017-07-01 | NULL | 1 |
38 | operator | 2020-06-01 | NULL | 1 |
Table Graduations
id | university_graduation | conclusion_date | last_record_flg |
---|---|---|---|
1 | bachelor | 2005-12-15 | 0 |
1 | master | 2008-12-15 | 1 |
38 | bachelor | 2014-12-15 | 1 |
The desired result is:
id | name | subsidiary | department | start_date_dep | end_date_dep | last_record_flg | function | start_date_fun | end_date_fun | last_record_flg | university_graduation | conclusion_date | last_record_flg | max_date | seq | start | end | last_record_flg |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | John Doe | AL | Engineering | 2005-10-01 | 2013-01-01 | 0 | operator | 2005-10-01 | 2009-08-01 | 0 | bachelor | 2005-12-15 | 0 | 2005-12-15 | 1 | 2005-10-01 | 2008-12-15 | 0 |
1 | John Doe | AL | Engineering | 2005-10-01 | 2013-01-01 | 0 | operator | 2005-10-01 | 2009-08-01 | 0 | master | 2008-12-15 | 1 | 2008-12-15 | 1 | 2008-12-15 | 2009-08-01 | 0 |
1 | John Doe | AL | Engineering | 2005-10-01 | 2013-01-01 | 0 | leader | 2009-08-01 | 2011-10-01 | 0 | master | 2008-12-15 | 1 | 2009-08-01 | 1 | 2009-08-01 | 2011-10-01 | 0 |
1 | John Doe | AL | Engineering | 2005-10-01 | 2013-01-01 | 0 | manager | 2011-10-01 | 2017-07-01 | 0 | master | 2008-12-15 | 1 | 2011-10-01 | 1 | 2011-10-01 | 2013-01-01 | 0 |
1 | John Doe | AL | Sales | 2013-01-01 | 2014-05-01 | 0 | manager | 2011-10-01 | 2017-07-01 | 0 | master | 2008-12-15 | 1 | 2013-01-01 | 1 | 2013-01-01 | 2014-05-01 | 0 |
1 | John Doe | NY | Sales | 2014-05-01 | NULL | 1 | manager | 2011-10-01 | 2017-07-01 | 0 | master | 2008-12-15 | 1 | 2014-05-01 | 1 | 2014-05-01 | 2017-07-01 | 0 |
1 | John Doe | NY | Sales | 2014-05-01 | NULL | 1 | director | 2017-07-01 | NULL | 1 | master | 2008-12-15 | 1 | 2017-07-01 | 1 | 2017-07-01 | NULL | 1 |
38 | Ivy Johnson | NY | Sales | 2020-06-01 | NULL | 1 | operator | 2020-06-01 | NULL | 1 | bachelor | 2014-12-15 | 1 | 2020-06-01 | 1 | 2020-06-01 | NULL | 1 |
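I tried using CROSS APPLY, but it returns only one row per ID. I am now trying CASE WHEN, but the query output does not exactly match the desired result. In my output, the columns 'FUNCTION' and 'START_DATE_FUN', as well as 'UNIVERSITY_GRADUATION' and 'CONCLUSION_DATE', do not follow the order (sorting) shown in the desired result.
Query: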
select
    *
from(
    select
        tb.*
        ,row_number() over(partition by tb.id,tb.max_date order by tb.max_date) as seq
        ,tb.max_date as [start]
        ,lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) as [end]
        ,case when lead( tb.max_date ) over( partition by tb.id order by tb.max_date ) is null then 1 else 0 end as last_record_flg
    from(
        select
            sb.id
            ,sb.[name]
            ,sb.subsidiary
            ,sb.department
            ,sb.start_date_dep
            ,sb.end_date_dep
            ,sb.last_record_flg as lr_sb
            ,fc.[function]
            ,fc.start_date_fun
            ,fc.end_date_fun
            ,fc.last_record_flg as lr_fc
            ,gd.university_graduation
            ,gd.end_date_grad
            ,gd.last_record_flg as lr_gd
            ,case
                when sb.start_date_dep >= fc.start_date_fun and sb.start_date_dep >= gd.end_date_grad then sb.start_date_dep
                when fc.start_date_fun >= sb.start_date_dep and fc.start_date_fun >= gd.end_date_grad then fc.start_date_fun
                else gd.end_date_grad
            end as max_date
        from
            #Subsidiaries as sb
            left outer join #Functions as fc
                on sb.id = fc.id
            left outer join #Graduations as gd
                on sb.id = gd.id
    ) as tb
) as tb2
where
    tb2.seq = 1
DDL below:
create table #Subsidiaries (
    id int
    ,[name] varchar(15)
    ,subsidiary varchar(2)
    ,department varchar(15)
    ,start_date_dep date
    ,end_date_dep date
    ,last_record_flg bit
)
go
insert into #Subsidiaries values
    (1,'John Doe','AL','Engineering','2005-10-01','2013-01-01',0),
    (1,'John Doe','AL','Sales','2013-01-01','2014-05-01',0),
    (1,'John Doe','NY','Sales','2014-05-01',null,1),
    (38,'Ivy Johnson','NY','Sales','2020-06-01',null,1)
go
create table #Functions (
    id int
    ,[function] varchar(15)
    ,start_date_fun date
    ,end_date_fun date
    ,last_record_flg bit
)
go
insert into #Functions values
    (1,'operator','2005-10-01','2009-08-01',0),
    (1,'leader','2009-08-01','2011-10-01',0),
    (1,'manager','2011-10-01','2017-07-01',0),
    (1,'director','2017-07-01',null,1),
    (38,'operator','2020-06-01',null,1)
go
create table #Graduations (
    id int
    ,university_graduation varchar(15)
    ,end_date_grad date
    ,last_record_flg bit
)
go
insert into #Graduations values
    (1,'bachelor','2005-12-15',0),
    (1,'master','2008-12-15',1),
    (38,'bachelor','2014-12-15',1)
go
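Comments:
I don't follow the order you want the results in. Your expected output looks the same as what your query currently returns.
Maybe show us what you want returned, because what you have put down is confusing. E.g., you have 3 tables; what do you want the result to look like?
I have heard of others, specifically those dealing with HR data where the attributes of employee and organization relationships are spread across dozens of SCD tables, handling this by generating one record per day for each record's validity period and joining with that date as part of the key. It is very heavy, but it solves the big problem of intersecting date periods. Unfortunately, few RDBMSs support a "period" data type with period intersection as a built-in feature (such as Postgres or Teradata).
I updated the question to make the expected result clearer and to show where my query's output differs from it.
Answer 1: In case anyone runs into the same difficulty joining two or more SCD type 2 tables, I found a reference at https://sqlsunday.com/2014/11/30/joining-two-scd2-tables/ (SQL Sunday). It helped me build the query and return the result as needed by using range intervals in the join condition.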
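The answer does not include the final query, so here is a minimal sketch of what "range intervals in the join condition" could look like against the temp tables from the question's DDL. This is not the asker's actual solution or the code from the linked article. The CTE names (change_dates, periods, grad_ranges) and the derived column next_grad_date are made up for illustration, and it assumes half-open validity intervals (start inclusive, end exclusive) and that a graduation stays valid until the next graduation for the same id.

```sql
-- Sketch only: assumes #Subsidiaries, #Functions and #Graduations exist as in the DDL above.
with change_dates as (
    -- every date on which any of the three dimensions changes, per id
    select id, start_date_dep as change_date from #Subsidiaries
    union
    select id, start_date_fun from #Functions
    union
    select id, end_date_grad from #Graduations
),
periods as (
    -- turn consecutive change dates into half-open periods [start, end)
    select
        id
        ,change_date as [start]
        ,lead( change_date ) over( partition by id order by change_date ) as [end]
    from change_dates
),
grad_ranges as (
    -- #Graduations has no end date, so derive one from the next graduation (assumption)
    select
        id
        ,university_graduation
        ,end_date_grad
        ,lead( end_date_grad ) over( partition by id order by end_date_grad ) as next_grad_date
    from #Graduations
)
select
    p.id
    ,sb.[name]
    ,sb.subsidiary
    ,sb.department
    ,fc.[function]
    ,gd.university_graduation
    ,p.[start]
    ,p.[end]
    ,case when p.[end] is null then 1 else 0 end as last_record_flg
from periods as p
    -- a dimension row applies when its validity range contains the period start
    inner join #Subsidiaries as sb
        on sb.id = p.id
        and sb.start_date_dep <= p.[start]
        and ( sb.end_date_dep is null or sb.end_date_dep > p.[start] )
    inner join #Functions as fc
        on fc.id = p.id
        and fc.start_date_fun <= p.[start]
        and ( fc.end_date_fun is null or fc.end_date_fun > p.[start] )
    left outer join grad_ranges as gd
        on gd.id = p.id
        and gd.end_date_grad <= p.[start]
        and ( gd.next_grad_date is null or gd.next_grad_date > p.[start] )
order by
    p.id
    ,p.[start]
```

Because each join matches on a date range instead of only on id, every period picks up exactly the subsidiary, function, and graduation row valid at its start, which avoids the row multiplication of the id-only joins in the question. Note that under these assumptions the period before John Doe's bachelor conclusion date should come out as its own row with a NULL graduation, so the output will not match the desired table row for row; merging that period into the next one would need an extra rule.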