Compare main table with all records from another table to derive the column value of the main table

【Posted】: 2021-12-30 09:37:21

【Question】:

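I have two tables, tb1 and main_tbl; sample data for both is shown below. I am trying to derive the value of column COL_VAL for the main table. I have written a query that returns the expected value, but I am looking for a simpler way to get the same result with fewer lines of code.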

main_tbl Table:

col1        col2        col3         COL_VAL
123          Hi          568           ??

tb1 Table:

col1        col2        col3        col4        col5
123          LN           Y           IP         2021-02-01
123          LN           N           NON-IP     2021-02-01
123          MOB          Y           AP         2021-02-01
123          MOB          N           NON-AP     2021-02-01

Main Query:

SELECT
d.COL1,
d.COL2,
d.COL3,
CAST(COALESCE(FRT_QRY.COL4,SND_QRY.COL4,FIF_QRY.COL4,TRD_QRY.COL4) AS STRING)  AS COL_VAL
FROM 
 (
 SELECT * FROM db.main_tbl)d
LEFT JOIN
(
SELECT * FROM 
    ( SELECT *, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM ( select  *  from    db.tb1  where   col2 IN ('LN') and col3 = 'Y') b
    ) a where a.Rnk =1 
) SND_QRY
on d.col1=SND_QRY.col1
LEFT JOIN
(
SELECT * FROM 
    ( SELECT *, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM ( select  *  from    db.tb1  where   col2 IN ('LN') and col3 = 'N') b
    ) a where a.Rnk =1 
) TRD_QRY
on d.col1=TRD_QRY.col1
LEFT JOIN
(
SELECT * FROM 
    ( SELECT *, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM ( select  *  from    db.tb1  where   col2 IN ('MOB') and col3 = 'Y') b
    ) a where a.Rnk =1 
) FRT_QRY
on d.col1=FRT_QRY.col1
LEFT JOIN
(
SELECT * FROM 
    ( SELECT *, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM ( select  *  from    db.tb1  where   col2 IN ('MOB') and col3 = 'N') b
    ) a where a.Rnk =1 
) FIF_QRY
on d.col1=FIF_QRY.col1

Expected Output - main_tbl Table:

col1        col2        col3         COL_VAL
123          Hi          568           AP

【Answer 1】:

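First, I noticed that all of your subqueries apply different filters to the same columns (col2 and col3), and those columns are also in the PARTITION BY clause. This means the filters do not affect row_number: every partition contains rows with a single (col2, col3) combination, so filtering only removes whole partitions. You can therefore calculate row_number once without any filter, and apply the filter either as a join condition or inside the joined subquery: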

WITH RANKED AS (
SELECT * FROM 
    ( SELECT b.*, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM db.tb1 b
    ) a where a.Rnk =1 
)

SELECT
d.COL1,
d.COL2,
d.COL3,
CAST(COALESCE(FRT_QRY.COL4,SND_QRY.COL4,FIF_QRY.COL4,TRD_QRY.COL4) AS STRING)  AS COL_VAL
FROM 
 (
 SELECT * FROM db.main_tbl)d
LEFT JOIN RANKED SND_QRY on d.col1=SND_QRY.col1 AND SND_QRY.col2 IN ('LN')  AND SND_QRY.col3 = 'Y'
LEFT JOIN RANKED TRD_QRY on d.col1=TRD_QRY.col1 AND TRD_QRY.col2 IN ('LN')  AND TRD_QRY.col3 = 'N'
LEFT JOIN RANKED FRT_QRY on d.col1=FRT_QRY.col1 AND FRT_QRY.col2 IN ('MOB') AND FRT_QRY.col3 = 'Y'
LEFT JOIN RANKED FIF_QRY on d.col1=FIF_QRY.col1 AND FIF_QRY.col2 IN ('MOB') AND FIF_QRY.col3 = 'N'
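The query above puts the filter in the join condition. The other option mentioned, filtering inside the joined subquery, could look roughly like the sketch below. It reuses the table and alias names from the question; the explicit column list in the CTE and the `=` comparisons (instead of single-value `IN`) are simplifications of mine, not part of the original answer.

```sql
-- Sketch: each branch filters the shared RANKED CTE inside its own subquery.
-- For a LEFT JOIN this is equivalent to filtering the right-hand side in ON.
WITH RANKED AS (
    SELECT a.col1, a.col2, a.col3, a.col4
    FROM (
        SELECT b.*,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
        FROM db.tb1 b
    ) a
    WHERE a.Rnk = 1   -- keep only the latest row per (col1, col2, col3)
)
SELECT d.col1,
       d.col2,
       d.col3,
       CAST(COALESCE(FRT_QRY.col4, SND_QRY.col4, FIF_QRY.col4, TRD_QRY.col4) AS STRING) AS COL_VAL
FROM db.main_tbl d
LEFT JOIN (SELECT col1, col4 FROM RANKED WHERE col2 = 'LN'  AND col3 = 'Y') SND_QRY ON d.col1 = SND_QRY.col1
LEFT JOIN (SELECT col1, col4 FROM RANKED WHERE col2 = 'LN'  AND col3 = 'N') TRD_QRY ON d.col1 = TRD_QRY.col1
LEFT JOIN (SELECT col1, col4 FROM RANKED WHERE col2 = 'MOB' AND col3 = 'Y') FRT_QRY ON d.col1 = FRT_QRY.col1
LEFT JOIN (SELECT col1, col4 FROM RANKED WHERE col2 = 'MOB' AND col3 = 'N') FIF_QRY ON d.col1 = FIF_QRY.col1
```

Which of the two forms reads better is largely a matter of taste; the join-condition form is shorter, while the subquery form keeps each branch's filter next to the data it restricts.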

In addition, if you are lucky enough to have a Hive version that supports CTE materialization, use this setting:

set hive.optimize.cte.materialize.threshold=2;--HIVE-11752

With this setting the RANKED CTE is computed only once, and the same result is reused in all four joins.

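You can also try to eliminate the repeated joins to the same table. Compute all four fields in a single query using CASE expressions plus aggregation, and join only once. A single aggregation is typically cheaper than several joins: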

WITH RANKED AS (
SELECT col1,
       --aggregate all in single row per col1 
       max(case when col2 IN ('LN') AND col3 = 'Y' then COL4 else null end) as SND_COL4,
       max(case when col2 IN ('LN') AND col3 = 'N' then COL4 else null end) as TRD_COL4,
       max(case when col2 IN ('MOB') AND col3 = 'Y' then COL4 else null end) as FRT_COL4,
       max(case when col2 IN ('MOB') AND col3 = 'N' then COL4 else null end) as FIF_COL4   
FROM 
    ( SELECT b.*, 
        ROW_NUMBER() OVER(PARTITION BY col1,col2,col3 ORDER BY col5 desc) as Rnk
        FROM db.tb1 b
       WHERE (col2 IN ('LN') AND col3 = 'Y') 
          or (col2 IN ('LN') AND col3 = 'N') 
          or (col2 IN ('MOB') AND col3 = 'Y')
          or (col2 IN ('MOB') AND col3 = 'N')
    ) a where a.Rnk =1 
          
GROUP BY col1
)

SELECT
d.COL1,
d.COL2,
d.COL3,
CAST(COALESCE(R.FRT_COL4,R.SND_COL4,R.FIF_COL4, R.TRD_COL4) AS STRING)  AS COL_VAL
FROM 
 (
 SELECT * FROM db.main_tbl)d
LEFT JOIN RANKED R ON d.col1=R.col1
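To see how the aggregation collapses the four tb1 rows into one row per col1, here is a self-contained sketch that runs the same logic against the sample data from the question. The CTE names `sample_tb1` and `sample_main` are hypothetical stand-ins for `db.tb1` and `db.main_tbl`, and the redundant WHERE filter is left out because the sample contains only the four combinations of interest.

```sql
WITH sample_tb1 AS (           -- hypothetical inline copy of the tb1 sample rows
    SELECT 123 AS col1, 'LN' AS col2, 'Y' AS col3, 'IP' AS col4, '2021-02-01' AS col5
    UNION ALL SELECT 123, 'LN',  'N', 'NON-IP', '2021-02-01'
    UNION ALL SELECT 123, 'MOB', 'Y', 'AP',     '2021-02-01'
    UNION ALL SELECT 123, 'MOB', 'N', 'NON-AP', '2021-02-01'
),
sample_main AS (               -- hypothetical inline copy of the main_tbl sample row
    SELECT 123 AS col1, 'Hi' AS col2, 568 AS col3
),
RANKED AS (
    SELECT col1,
           -- one output row per col1, one column per (col2, col3) combination
           MAX(CASE WHEN col2 = 'LN'  AND col3 = 'Y' THEN col4 END) AS SND_COL4,
           MAX(CASE WHEN col2 = 'LN'  AND col3 = 'N' THEN col4 END) AS TRD_COL4,
           MAX(CASE WHEN col2 = 'MOB' AND col3 = 'Y' THEN col4 END) AS FRT_COL4,
           MAX(CASE WHEN col2 = 'MOB' AND col3 = 'N' THEN col4 END) AS FIF_COL4
    FROM (
        SELECT b.*,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
        FROM sample_tb1 b
    ) a
    WHERE a.Rnk = 1
    GROUP BY col1
)
-- RANKED holds: 123 | SND_COL4=IP | TRD_COL4=NON-IP | FRT_COL4=AP | FIF_COL4=NON-AP
SELECT d.col1,
       d.col2,
       d.col3,
       CAST(COALESCE(R.FRT_COL4, R.SND_COL4, R.FIF_COL4, R.TRD_COL4) AS STRING) AS COL_VAL
FROM sample_main d
LEFT JOIN RANKED R ON d.col1 = R.col1
-- result: 123 | Hi | 568 | AP
```

COALESCE returns the first non-NULL argument, so FRT_COL4 (the MOB / Y row) wins here, which matches the expected output in the question.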
