How to write a query that avoids a single reducer in select distinct and size(collect_set) Hive queries?

Posted 2023-04-14


Asked: 2015-07-04 05:20:45

Question:

How can I rewrite these queries so that the reduce phase does not run on a single reducer? It takes a very long time, and I lose the benefit of parallelism.

select id
, count(distinct locations) AS unique_locations
  from
  mytable
;

and

select id
, size(collect_set(locations)) AS unique_locations
  from
  mytable
;

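Comments:

collect_set collects values into a set and removes duplicates. The locations come from the table, and removing duplicates requires scanning the whole table, so I would think of it as a kind of aggregation. Can we do an aggregation without a reducer?

A reduce job is definitely needed; I just want to avoid it being limited to a single reducer.

Answer 1: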

For count(distinct var), use two queries:

SELECT
  count(1)
FROM (
  -- inner query: deduplicate locations (table name as in the question)
  SELECT DISTINCT locations AS unique_locations
  FROM mytable
) t;
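The queries in the question select id, so the intent may be a count per id. A minimal sketch under that assumption, using the same two-stage idea (the GROUP BY clauses are an assumption and do not appear in the original question):

```sql
-- Stage 1 (inner query): deduplicate (id, locations) pairs.
-- Stage 2 (outer query): count the remaining rows for each id.
SELECT
  id,
  count(1) AS unique_locations
FROM (
  SELECT id, locations
  FROM mytable
  GROUP BY id, locations
) t
GROUP BY id;
```

Both stages group by a key, so Hive can partition the rows across several reducers instead of sending them all to one.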

I think the same applies to size(collect_set):

SELECT
  size(unique_locations)
FROM (
  -- inner query: collect the distinct locations into one array (table name as in the question)
  SELECT collect_set(locations) AS unique_locations
  FROM mytable
) t;
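Note that the inner collect_set in the query above has no GROUP BY, so it is still a global aggregation. A hedged variant, assuming the goal is only the number of distinct values, deduplicates in the subquery first:

```sql
-- Inner query: deduplicate locations with a GROUP BY,
-- which can be partitioned across reducers by the grouping key.
-- Outer query: build the set from the already-distinct values and take its size.
SELECT
  size(collect_set(locations)) AS unique_locations
FROM (
  SELECT locations
  FROM mytable
  GROUP BY locations
) t;
```

The outer aggregation is still a single final step, but it now receives only the deduplicated values rather than every row of the table. If only the count is needed, the count(1) form avoids building the array at all.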
