Compare main table with all records from another table to derive the column value of the main table
【Posted】: 2021-12-30 09:37:21
【Question】:
I have two tables, tb1 and main_tbl. A sample data set is shown below. I am trying to derive the value of the column COL_VAL for the main table. The query below returns the expected value, but I am looking for a way to get the same result with simpler code and fewer lines.
main_tbl Table:
col1  col2  col3  COL_VAL
123   Hi    568   ??
tb1 Table:
col1  col2  col3  col4    col5
123   LN    Y     IP      2021-02-01
123   LN    N     NON-IP  2021-02-01
123   MOB   Y     AP      2021-02-01
123   MOB   N     NON-AP  2021-02-01
Main Query:
SELECT
    d.COL1,
    d.COL2,
    d.COL3,
    CAST(COALESCE(FRT_QRY.COL4, SND_QRY.COL4, FIF_QRY.COL4, TRD_QRY.COL4) AS STRING) AS COL_VAL
FROM
    (SELECT * FROM db.main_tbl) d
LEFT JOIN
    (
        SELECT * FROM
            (SELECT *,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
             FROM (SELECT * FROM db.tb1 WHERE col2 IN ('LN') AND col3 = 'Y') b
            ) a
        WHERE a.Rnk = 1
    ) SND_QRY
    ON d.col1 = SND_QRY.col1
LEFT JOIN
    (
        SELECT * FROM
            (SELECT *,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
             FROM (SELECT * FROM db.tb1 WHERE col2 IN ('LN') AND col3 = 'N') b
            ) a
        WHERE a.Rnk = 1
    ) TRD_QRY
    ON d.col1 = TRD_QRY.col1
LEFT JOIN
    (
        SELECT * FROM
            (SELECT *,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
             FROM (SELECT * FROM db.tb1 WHERE col2 IN ('MOB') AND col3 = 'Y') b
            ) a
        WHERE a.Rnk = 1
    ) FRT_QRY
    ON d.col1 = FRT_QRY.col1
LEFT JOIN
    (
        SELECT * FROM
            (SELECT *,
                    ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
             FROM (SELECT * FROM db.tb1 WHERE col2 IN ('MOB') AND col3 = 'N') b
            ) a
        WHERE a.Rnk = 1
    ) FIF_QRY
    ON d.col1 = FIF_QRY.col1
Expected Output - main_tbl Table:
col1  col2  col3  COL_VAL
123   Hi    568   AP
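【Answer 1】:
First, I noticed that all of your subqueries apply different filters to the same columns (col2 and col3), and that those columns are also in the PARTITION BY clause. Each (col1, col2, col3) group is therefore ranked independently, so the filters do not affect row_number. You can compute row_number once without any filter, and then apply the filters either as join conditions or inside the joined subqueries: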
WITH RANKED AS (
    SELECT * FROM
        (SELECT b.*,
                ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
         FROM db.tb1 b
        ) a
    WHERE a.Rnk = 1
)
SELECT
    d.COL1,
    d.COL2,
    d.COL3,
    CAST(COALESCE(FRT_QRY.COL4, SND_QRY.COL4, FIF_QRY.COL4, TRD_QRY.COL4) AS STRING) AS COL_VAL
FROM
    (SELECT * FROM db.main_tbl) d
LEFT JOIN RANKED SND_QRY ON d.col1 = SND_QRY.col1 AND SND_QRY.col2 IN ('LN')  AND SND_QRY.col3 = 'Y'
LEFT JOIN RANKED TRD_QRY ON d.col1 = TRD_QRY.col1 AND TRD_QRY.col2 IN ('LN')  AND TRD_QRY.col3 = 'N'
LEFT JOIN RANKED FRT_QRY ON d.col1 = FRT_QRY.col1 AND FRT_QRY.col2 IN ('MOB') AND FRT_QRY.col3 = 'Y'
LEFT JOIN RANKED FIF_QRY ON d.col1 = FIF_QRY.col1 AND FIF_QRY.col2 IN ('MOB') AND FIF_QRY.col3 = 'N'
Additionally, if you are lucky enough to have a Hive version that supports CTE materialization, use this setting:
set hive.optimize.cte.materialize.threshold=2;--HIVE-11752
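The RANKED CTE will then be computed only once, and the same result will be reused in all of the joins.
You can also try to eliminate the repeated joins to the same table. Use CASE expressions plus aggregation to compute all four fields in a single query, so that each col1 has one row, and join only once. A single aggregation is usually cheaper than several joins: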
WITH RANKED AS (
    SELECT col1,
           -- aggregate all in single row per col1
           max(case when col2 IN ('LN')  AND col3 = 'Y' then COL4 else null end) as SND_COL4,
           max(case when col2 IN ('LN')  AND col3 = 'N' then COL4 else null end) as TRD_COL4,
           max(case when col2 IN ('MOB') AND col3 = 'Y' then COL4 else null end) as FRT_COL4,
           max(case when col2 IN ('MOB') AND col3 = 'N' then COL4 else null end) as FIF_COL4
    FROM
        (SELECT b.*,
                ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
         FROM db.tb1 b
         WHERE (col2 IN ('LN')  AND col3 = 'Y')
            OR (col2 IN ('LN')  AND col3 = 'N')
            OR (col2 IN ('MOB') AND col3 = 'Y')
            OR (col2 IN ('MOB') AND col3 = 'N')
        ) a
    WHERE a.Rnk = 1
    GROUP BY col1
)
SELECT
    d.COL1,
    d.COL2,
    d.COL3,
    CAST(COALESCE(R.FRT_COL4, R.SND_COL4, R.FIF_COL4, R.TRD_COL4) AS STRING) AS COL_VAL
FROM
    (SELECT * FROM db.main_tbl) d
-- the ON keyword is required before the join condition
LEFT JOIN RANKED R ON d.col1 = R.col1
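To check the single-join approach without touching the real tables, the sample rows from the question can be inlined as CTEs. This is only a sketch: the names main_tbl_sample and tb1_sample are made up for the example, col5 is assumed to be a 'yyyy-MM-dd' string (which sorts correctly as text), and it assumes an engine that accepts SELECT without FROM and UNION ALL inside a CTE, as reasonably recent Hive versions do.

```sql
-- Hypothetical inline copies of the question's sample data.
WITH main_tbl_sample AS (
    SELECT 123 AS col1, 'Hi' AS col2, 568 AS col3
),
tb1_sample AS (
    SELECT 123 AS col1, 'LN' AS col2, 'Y' AS col3, 'IP' AS col4, '2021-02-01' AS col5
    UNION ALL SELECT 123, 'LN',  'N', 'NON-IP', '2021-02-01'
    UNION ALL SELECT 123, 'MOB', 'Y', 'AP',     '2021-02-01'
    UNION ALL SELECT 123, 'MOB', 'N', 'NON-AP', '2021-02-01'
),
RANKED AS (
    SELECT col1,
           -- one row per col1, one column per (col2, col3) combination
           max(case when col2 = 'LN'  AND col3 = 'Y' then col4 end) AS SND_COL4,
           max(case when col2 = 'LN'  AND col3 = 'N' then col4 end) AS TRD_COL4,
           max(case when col2 = 'MOB' AND col3 = 'Y' then col4 end) AS FRT_COL4,
           max(case when col2 = 'MOB' AND col3 = 'N' then col4 end) AS FIF_COL4
    FROM (
        SELECT b.*,
               -- keep only the latest row of each (col1, col2, col3) group
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col5 DESC) AS Rnk
        FROM tb1_sample b
        WHERE col2 IN ('LN', 'MOB') AND col3 IN ('Y', 'N')
    ) a
    WHERE a.Rnk = 1
    GROUP BY col1
)
SELECT d.col1,
       d.col2,
       d.col3,
       -- same precedence as the original query: MOB/Y, LN/Y, MOB/N, LN/N
       CAST(COALESCE(R.FRT_COL4, R.SND_COL4, R.FIF_COL4, R.TRD_COL4) AS STRING) AS COL_VAL
FROM main_tbl_sample d
LEFT JOIN RANKED R ON d.col1 = R.col1;
```

On the sample rows, the MOB/Y value is present, so COL_VAL should come out as AP, matching the expected output in the question.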