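Database query for getting the max and min of a column, the corresponding values from other columns, and the total record count from a single table in Hive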
【Posted】: 2020-05-24 05:00:25
【Question】: I have the following dataset in a Hive table named PUBLISH.
Note that PUBLISH can contain duplicate records.
DATE    |HOUR|SOURCE|COL_TIMESTAMP              |ID
20200101|14  |A     |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14  |A     |2020-01-01 14:18:53.012 GMT|ID_222
20200101|14  |A     |2020-01-01 14:18:53.016 GMT|ID_111
20200101|14  |A     |2020-01-01 14:18:53.019 GMT|ID_333
20200101|15  |C     |2020-01-01 15:18:53.016 GMT|ID_444
20200102|00  |A     |2020-01-01 15:18:53.016 GMT|ID_444
I want to generate the output below for a specific date, hour, and source. For example, for (DATE=20200101 & HOUR=14 & SOURCE=A) the output should be:
DATE    |HOUR|SOURCE|MIN_TIMESTAMP              |START_ID|MAX_TIMESTAMP              |END_ID|RECORD_CNT
20200101|14  |A     |2020-01-01 14:18:53.012 GMT|ID_222  |2020-01-01 14:18:53.019 GMT|ID_333|3
Note that the timestamps end with "GMT". Also, I am running the query from Spark Java code. Please suggest a Hive query that performs well when the data volume is large.
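【Comments】:
Did one of these answers solve your problem? If not, could you provide more information to help answer it? Otherwise, please consider marking the answer that best solved your problem as accepted (the check mark under the up/down vote arrows). See What should I do when someone answers my question? and How does accepting an answer work?
Sounds like a groupwise-max problem. mysql.rjweb.org/doc.php/groupwise_max
【Answer 1】: You should be able to use a subquery to determine the MIN and MAX timestamps and the count of distinct rows for the given hour, then join it back to the main table to get the ID values for those timestamps: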
-- Aggregate once per (DATE, HOUR, SOURCE), then join back to PUBLISH to fetch the IDs.
-- DISTINCT on the outer query collapses the repeats caused by duplicate rows in PUBLISH.
SELECT DISTINCT P.DATE, P.HOUR, P.SOURCE,
       P.MIN_TIMESTAMP, P1.ID AS START_ID,
       P.MAX_TIMESTAMP, P2.ID AS END_ID,
       P.RECORD_CNT
FROM (
    SELECT DATE, HOUR, SOURCE,
           MIN(COL_TIMESTAMP) AS MIN_TIMESTAMP,
           MAX(COL_TIMESTAMP) AS MAX_TIMESTAMP,
           -- duplicate rows are counted once
           COUNT(DISTINCT DATE, HOUR, SOURCE, COL_TIMESTAMP, ID) AS RECORD_CNT
    FROM PUBLISH
    WHERE DATE = '20200101'
      AND HOUR = 14
      AND SOURCE = 'A'
    GROUP BY DATE, HOUR, SOURCE
) P
JOIN PUBLISH P1 ON P1.DATE = P.DATE AND P1.HOUR = P.HOUR AND P1.SOURCE = P.SOURCE AND P1.COL_TIMESTAMP = P.MIN_TIMESTAMP
JOIN PUBLISH P2 ON P2.DATE = P.DATE AND P2.HOUR = P.HOUR AND P2.SOURCE = P.SOURCE AND P2.COL_TIMESTAMP = P.MAX_TIMESTAMP
As long as you have an index on (DATE, HOUR, SOURCE), this should perform well.
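【Comments】:
Thanks for your answer. As mentioned, my data has duplicate records, so the query returned duplicates. I added DISTINCT to the outermost query and it worked.
@user1326784 Have you read What should I do when someone answers my question? and How does accepting an answer work?
【Answer 2】: Use analytic functions to get START_ID and END_ID, then aggregate: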
with PUBLISH as ( --Use your_table instead of this CTE
    select stack(6,
        '20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
        '20200101','14','A','2020-01-01 14:18:53.012 GMT','ID_222',
        '20200101','14','A','2020-01-01 14:18:53.016 GMT','ID_111',
        '20200101','14','A','2020-01-01 14:18:53.019 GMT','ID_333',
        '20200101','15','C','2020-01-01 15:18:53.016 GMT','ID_444',
        '20200102','00','A','2020-01-01 15:18:53.016 GMT','ID_444'
    ) as (DT, HOUR, SOURCE, COL_TIMESTAMP, ID)
)
select DT, HOUR, SOURCE,
       min(COL_TIMESTAMP) as MIN_TIMESTAMP,
       START_ID,
       max(COL_TIMESTAMP) as MAX_TIMESTAMP,
       END_ID,
       sum(case when rn=1 then 1 else 0 end) as RECORD_CNT --unique records have rn=1
from
(
    select DT, HOUR, SOURCE, COL_TIMESTAMP, ID,
           first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP) as START_ID,
           first_value(ID) over(partition by DT, HOUR, SOURCE order by COL_TIMESTAMP desc) as END_ID,
           row_number() over(partition by DT, HOUR, SOURCE, COL_TIMESTAMP, ID) as rn
    from PUBLISH p
) s
group by DT, HOUR, SOURCE, START_ID, END_ID;
Result:
dt        hour  source  min_timestamp                start_id  max_timestamp                end_id  record_cnt
20200101  14    A       2020-01-01 14:18:53.012 GMT  ID_222    2020-01-01 14:18:53.019 GMT  ID_333  3
20200101  15    C       2020-01-01 15:18:53.016 GMT  ID_444    2020-01-01 15:18:53.016 GMT  ID_444  1
20200102  00    A       2020-01-01 15:18:53.016 GMT  ID_444    2020-01-01 15:18:53.016 GMT  ID_444  1
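Since the question mentions running the query from Java, the expected numbers can be cross-checked on a small sample without a cluster. The sketch below is plain Java, not Hive or Spark code, and only mirrors the logic of the queries: drop exact duplicate rows, take the rows with the smallest and largest COL_TIMESTAMP, and count what is left. The class `PublishSummary` and its methods `sampleRows` and `summarize` are hypothetical names introduced for illustration. It also assumes, as the SQL does, that timestamps can be compared as strings, which holds here only because every value has the same fixed-width, zero-padded format and the same "GMT" suffix.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper (not part of Hive or Spark): recomputes one group's summary in memory.
class PublishSummary {

    // Output columns for a single (DATE, HOUR, SOURCE) group.
    static final class Summary {
        final String minTimestamp, startId, maxTimestamp, endId;
        final int recordCnt;

        Summary(String minTimestamp, String startId, String maxTimestamp, String endId, int recordCnt) {
            this.minTimestamp = minTimestamp;
            this.startId = startId;
            this.maxTimestamp = maxTimestamp;
            this.endId = endId;
            this.recordCnt = recordCnt;
        }
    }

    // The six sample rows from the question: DATE, HOUR, SOURCE, COL_TIMESTAMP, ID.
    static List<List<String>> sampleRows() {
        return Arrays.asList(
            Arrays.asList("20200101", "14", "A", "2020-01-01 14:18:53.016 GMT", "ID_111"),
            Arrays.asList("20200101", "14", "A", "2020-01-01 14:18:53.012 GMT", "ID_222"),
            Arrays.asList("20200101", "14", "A", "2020-01-01 14:18:53.016 GMT", "ID_111"),
            Arrays.asList("20200101", "14", "A", "2020-01-01 14:18:53.019 GMT", "ID_333"),
            Arrays.asList("20200101", "15", "C", "2020-01-01 15:18:53.016 GMT", "ID_444"),
            Arrays.asList("20200102", "00", "A", "2020-01-01 15:18:53.016 GMT", "ID_444"));
    }

    // Returns the summary for one group, or null when no row matches the filter.
    static Summary summarize(List<List<String>> rows, String date, String hour, String source) {
        // Lists compare by value, so the set keeps one copy of each exact duplicate row.
        Set<List<String>> unique = new LinkedHashSet<>();
        for (List<String> r : rows) {
            if (r.get(0).equals(date) && r.get(1).equals(hour) && r.get(2).equals(source)) {
                unique.add(r);
            }
        }
        if (unique.isEmpty()) {
            return null;
        }
        // String order equals chronological order only for this fixed-width format.
        Comparator<List<String>> byTimestamp = Comparator.comparing(r -> r.get(3));
        List<String> first = Collections.min(unique, byTimestamp);
        List<String> last = Collections.max(unique, byTimestamp);
        return new Summary(first.get(3), first.get(4), last.get(3), last.get(4), unique.size());
    }

    public static void main(String[] args) {
        Summary s = summarize(sampleRows(), "20200101", "14", "A");
        System.out.println(String.join("|", s.minTimestamp, s.startId,
                s.maxTimestamp, s.endId, String.valueOf(s.recordCnt)));
    }
}
```

If two different IDs share the minimum (or maximum) timestamp, this sketch returns whichever one it encounters first; the join-based query would return both, and the `first_value` query gives no guarantee about which one it picks, so a tie-breaking column in the ordering may be worth adding.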