Hive 多次与自身连接表
Posted
技术标签:
【中文标题】Hive 多次与自身连接表【英文标题】:Hive join table with itself many times 【发布时间】:2019-05-03 22:14:30 【问题描述】:我需要在 Hive 中加入一个表 13 次,这真的很慢。我在 Tez 上使用 Hive。
当我查看 Hive 泳道时,似乎只有 Map 任务正在执行并按顺序执行,这似乎是它需要这么长时间的原因之一。
我怀疑也可能是问题的一件事是我使用 3 列加入,但我不确定这会如何影响时间。
有没有办法加快这个查询的执行速度?
WITH merged AS (
SELECT
mp_0.bp AS bp,
mp_0.name as name,
mp_0.country as country,
mp_0.pos AS pos_0,
mp_0.min_p AS min_p_0,
mp_0.max_p AS max_p_0,
mp_1.pos AS pos_1,
mp_1.min_p AS min_p_1,
mp_1.max_p AS max_p_1,
mp_2.pos AS pos_2,
mp_2.min_p AS min_p_2,
mp_2.max_p AS max_p_2,
mp_3.pos AS pos_3,
mp_3.min_p AS min_p_3,
mp_3.max_p AS max_p_3,
mp_4.pos AS pos_4,
mp_4.min_p AS min_p_4,
mp_4.max_p AS max_p_4,
mp_5.pos AS pos_5,
mp_5.min_p AS min_p_5,
mp_5.max_p AS max_p_5,
mp_6.pos AS pos_6,
mp_6.min_p AS min_p_6,
mp_6.max_p AS max_p_6,
mp_7.pos AS pos_7,
mp_7.min_p AS min_p_7,
mp_7.max_p AS max_p_7,
mp_8.pos AS pos_8,
mp_8.min_p AS min_p_8,
mp_8.max_p AS max_p_8,
mp_9.pos AS pos_9,
mp_9.min_p AS min_p_9,
mp_9.max_p AS max_p_9,
mp_10.pos AS pos_10,
mp_10.min_p AS min_p_10,
mp_10.max_p AS max_p_10,
mp_11.pos AS pos_11,
mp_11.min_p AS min_p_11,
mp_11.max_p AS max_p_11,
mp_12.pos AS pos_12,
mp_12.min_p AS min_p_12,
mp_12.max_p AS max_p_12,
mp_13.pos AS pos_13,
mp_13.min_p AS min_p_13,
mp_13.max_p AS max_p_13
FROM
data.customers mp_0
INNER JOIN data.customers mp_1
ON mp_0.name = mp_1.name
AND mp_0.day = mp_1.day
AND mp_0.identify = mp_1.identify
AND mp_0.bp = mp_1.bp
AND mp_1.position = 1
AND mp_1.day <= 123456
AND mp_1.day > 123456 - 8
INNER JOIN data.customers mp_2
ON mp_0.name = mp_2.name
AND mp_0.day = mp_2.day
AND mp_0.identify = mp_2.identify
AND mp_0.bp = mp_2.bp
AND mp_2.position = 2
AND mp_2.day <= 123456
AND mp_2.day > 123456 - 8
INNER JOIN data.customers mp_3
ON mp_0.name = mp_3.name
AND mp_0.day = mp_3.day
AND mp_0.identify = mp_3.identify
AND mp_0.bp = mp_3.bp
AND mp_3.position = 3
AND mp_3.day <= 123456
AND mp_3.day > 123456 - 8
INNER JOIN data.customers mp_4
ON mp_0.name = mp_4.name
AND mp_0.day = mp_4.day
AND mp_0.identify = mp_4.identify
AND mp_0.bp = mp_4.bp
AND mp_4.position = 4
AND mp_4.day <= 123456
AND mp_4.day > 123456 - 8
INNER JOIN data.customers mp_5
ON mp_0.name = mp_5.name
AND mp_0.day = mp_5.day
AND mp_0.identify = mp_5.identify
AND mp_0.bp = mp_5.bp
AND mp_5.position = 5
AND mp_5.day <= 123456
AND mp_5.day > 123456 - 8
INNER JOIN data.customers mp_6
ON mp_0.name = mp_6.name
AND mp_0.day = mp_6.day
AND mp_0.identify = mp_6.identify
AND mp_0.bp = mp_6.bp
AND mp_6.position = 6
AND mp_6.day <= 123456
AND mp_6.day > 123456 - 8
INNER JOIN data.customers mp_7
ON mp_0.name = mp_7.name
AND mp_0.day = mp_7.day
AND mp_0.identify = mp_7.identify
AND mp_0.bp = mp_7.bp
AND mp_7.position = 7
AND mp_7.day <= 123456
AND mp_7.day > 123456 - 8
INNER JOIN data.customers mp_8
ON mp_0.name = mp_8.name
AND mp_0.day = mp_8.day
AND mp_0.identify = mp_8.identify
AND mp_0.bp = mp_8.bp
AND mp_8.position = 8
AND mp_8.day <= 123456
AND mp_8.day > 123456 - 8
INNER JOIN data.customers mp_9
ON mp_0.name = mp_9.name
AND mp_0.day = mp_9.day
AND mp_0.identify = mp_9.identify
AND mp_0.bp = mp_9.bp
AND mp_9.position = 9
AND mp_9.day <= 123456
AND mp_9.day > 123456 - 8
INNER JOIN data.customers mp_10
ON mp_0.name = mp_10.name
AND mp_0.day = mp_10.day
AND mp_0.identify = mp_10.identify
AND mp_0.bp = mp_10.bp
AND mp_10.position = 10
AND mp_10.day <= 123456
AND mp_10.day > 123456 - 8
INNER JOIN data.customers mp_11
ON mp_0.name = mp_11.name
AND mp_0.day = mp_11.day
AND mp_0.identify = mp_11.identify
AND mp_0.bp = mp_11.bp
AND mp_11.position = 11
AND mp_11.day <= 123456
AND mp_11.day > 123456 - 8
INNER JOIN data.customers mp_12
ON mp_0.name = mp_12.name
AND mp_0.day = mp_12.day
AND mp_0.identify = mp_12.identify
AND mp_0.bp = mp_12.bp
AND mp_12.position = 12
AND mp_12.day <= 123456
AND mp_12.day > 123456 - 8
INNER JOIN data.customers mp_13
ON mp_0.name = mp_13.name
AND mp_0.day = mp_13.day
AND mp_0.identify = mp_13.identify
AND mp_0.bp = mp_13.bp
AND mp_13.position = 13
AND mp_13.day <= 123456
AND mp_13.day > 123456 - 8
WHERE
mp_0.position = 0
AND mp_0.day <= 123456
AND mp_0.day > 123456 - 8
)
INSERT OVERWRITE TABLE data.processed PARTITION (day = 123456)
SELECT
*
FROM
(SELECT m.*, row_number() OVER (PARTITION BY bp ORDER BY RAND()) as rn FROM merged m) t
WHERE
t.rn <= 1000
我按 bp 对数据进行采样,因此我为每个 bp 随机抽取 1000 行。此外,该表是按天分区的,因此该查询需要 8 天的数据。
【问题讨论】:
查询中没有返回day列,你在单日分区中插入了1000条随机的8天记录,对吧? 【参考方案1】:在大多数情况下,自联接可以用聚合或分析函数代替。考虑用 case 语句和聚合替换连接,这将执行得更好。像这样:
WITH merged AS (
SELECT
bp,
name,
country,
day,
0 pos_0, --pos_0 is always=0, pos_1=1... Does it makes sense to have these constants?
max(case when pos=0 then min_p end) AS min_p_0,
max(case when pos=0 then max_p end) AS max_p_0,
1 pos_1,
max(case when pos=1 then min_p end) AS min_p_1,
max(case when pos=1 then max_p end) AS max_p_1,
2 pos_2,
max(case when pos=2 then min_p end) AS min_p_2,
max(case when pos=2 then max_p end) AS max_p_2,
...
and so on ...
FROM
data.customers c
WHERE
c.day <= 123456
AND c.day > 123456 - 8
GROUP BY
bp,
name,
country,
day
)
INSERT OVERWRITE TABLE data.processed PARTITION (day = 123456)
SELECT
* --probably you need to list columns without `day` here, because in the original query you have no `day` column
FROM
(SELECT m.*, row_number() OVER (PARTITION BY bp order by RAND()) as rn
FROM merged m
WHERE rand() <= 0.001 --filter some records before row_number, this may help to improve performance, check it please and adjust
) t
WHERE
t.rn <= 1000
你真的需要order by RAND()
中的row_number
吗? row_number without order by 将在 bp 分区内随机分配数字。如果没有order by rand()
,它将执行得更快。如果您确实需要这 1000 条记录比没有 order by rand()
时“更随机”,请使用 order by rand()
。
【讨论】:
感谢您的回答。您的方法看起来很有趣,无需加入即可加入。问题是,我将其用作 ML 算法的输入,其中所有这些都代表特征,因此我需要将它们全部放在“一行”中,因为它们代表一个用于训练的示例。关于 RAND() 部分,我确实需要该样本是随机的,正如我所读到的,如果您经常执行 LIMIT,您将在 Hive 中获得相同的行,很可能基于文件中的位置。如果您不按 rand 进行排序,我想这与 row_number 相同。每次运行我都需要不同的样本。 @Marko 大小写+聚合方法有问题吗?它应该和你的连接一样工作以上是关于Hive 多次与自身连接表的主要内容,如果未能解决你的问题,请参考以下文章
Hive 连续多次 lateral view explode 踩坑