Hive: GROUP BY after a self join

Posted 2023-04-13

【Title】: Hive Group by after self join 【Posted】: 2014-05-12 15:45:43 【Question】:

Folks,

We have a requirement to apply a GROUP BY clause after joining a Hive table with itself.

Sample data:

CUSTOMER_NAME,PRODUCT_NAME,PURCHASE_PRICE

customer1,product1,20
customer1,product2,30
customer1,product1,25

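We want to pick the top 5 customers by total purchase price summed across all products (so the subquery does not include the product name), and then group the rows for those customers by CUSTOMER_NAME, PRODUCT_NAME.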

select customer_name,product_name,sum(purchase_price)
from customer_prd cprd
Join (select customer_name,sum(purchase_prices) order by sum group by customer_name limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by customer_name,product_name

We get an error message saying that we cannot group like this in Hive. What is wrong?

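【Answer 1】:

After the join, your column names become ambiguous. Hive doesn't know whether you mean the column from the left side of the join or the one from the right side. In this case it doesn't matter, because you are doing an inner join on those columns, but Hive isn't smart enough to figure that out. Try this: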

select cprd.customer_name, cprd.product_name, sum(purchase_price)
from customer_prd cprd
Join (select customer_name, sum(purchase_price) as sum from customer_prd group by customer_name order by sum desc limit 5) cprdd
where cprd.customer_name = cprdd.customer_name group by cprd.customer_name, cprd.product_name;
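Besides qualifying the column names, the query above also gives the subquery a FROM clause and puts GROUP BY before ORDER BY, both of which the question's subquery got wrong. To check the approach against the sample rows from the question, here is a minimal sketch. It assumes a Hive version that supports INSERT ... VALUES (0.14 or later), and the column types are a guess, since the question does not show the table definition. It also writes the join condition with an explicit ON and names the aggregates product_total and total_price; those are stylistic choices for illustration, not something the answer requires.

```sql
-- Hypothetical table definition; the question only lists the column names.
CREATE TABLE IF NOT EXISTS customer_prd (
  customer_name  STRING,
  product_name   STRING,
  purchase_price INT
);

-- The three sample rows from the question.
INSERT INTO TABLE customer_prd VALUES
  ('customer1', 'product1', 20),
  ('customer1', 'product2', 30),
  ('customer1', 'product1', 25);

-- Inner subquery: top 5 customers by total spend.
-- Outer query: per-product totals for just those customers.
SELECT cprd.customer_name,
       cprd.product_name,
       SUM(cprd.purchase_price) AS product_total
FROM customer_prd cprd
JOIN (
  SELECT customer_name, SUM(purchase_price) AS total_price
  FROM customer_prd
  GROUP BY customer_name
  ORDER BY total_price DESC
  LIMIT 5
) cprdd
  ON cprd.customer_name = cprdd.customer_name
GROUP BY cprd.customer_name, cprd.product_name;
```

With only these three rows, customer1 is the single top customer, so the result should contain product1 with a total of 45 and product2 with a total of 30. The order of the result rows is not guaranteed.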

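【Answer 2】:

I think Joe K is correct, but I would reconsider what you are doing: avoid the join altogether and use the 'collect' or 'collect_max' UDF from the Brickhouse library (http://github.com/klout/brickhouse). Sum by product first, then collect and sum in the same pass.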

-- Assumes the Brickhouse JAR is added and its 'collect' UDAF is registered.
SELECT customer_name, sum(purchases) AS total_purchases, collect(product_name, purchases) AS product_map
FROM
  ( SELECT customer_name, product_name, sum(purchase_price) AS purchases
    FROM customer_prd
    GROUP BY customer_name, product_name
  ) sp
GROUP BY customer_name
ORDER BY total_purchases DESC  -- descending, so LIMIT keeps the top 5
LIMIT 5;

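This still incurs a sort to get the top 5 customers. If you have a long tail of small customers but a few big "whale" customers, you can add a "HAVING sum(purchases) >" clause to reduce the number of records that have to be sorted.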
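A sketch of where such a HAVING clause would go is shown here. The threshold 1000 is a purely hypothetical value, since the answer does not give one, and the query assumes the Brickhouse 'collect' UDAF is already registered and that the column is named purchase_price as in the question's sample data.

```sql
-- Assumes Brickhouse's collect UDAF is registered in the session.
-- The threshold 1000 is hypothetical; pick one that fits your own data.
SELECT customer_name,
       sum(purchases) AS total_purchases,
       collect(product_name, purchases) AS product_map
FROM
  ( SELECT customer_name, product_name, sum(purchase_price) AS purchases
    FROM customer_prd
    GROUP BY customer_name, product_name
  ) sp
GROUP BY customer_name
HAVING sum(purchases) > 1000      -- drop small customers before the sort
ORDER BY total_purchases DESC
LIMIT 5;
```

The trade-off is that a threshold set too high can leave fewer than 5 customers in the result, so it is only safe when you know roughly how much the top customers spend.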
