Hive Query with collect_set

Posted 2023-04-17

【Title】: Hive Query with collect_set 【Posted】: 2017-03-20 06:09:03 【Question】:

I have 2 tables: sample_table1, which has two columns, and sample_table2, which also has two columns.

I want to get output like this:

F1    F2
001    1    <as 001 -> [a, b, e] -> [0, 1, 0] -> 1 (if one of the items in the collection ([a, b, e] in this case) is 1, then Column F2 should be 1 )>
002    1    <as 002 -> [c, b] -> [0, 1] -> 1>
003    0    <as 003 -> [a, c] -> [0, 0] -> 0>

I have tried many approaches with Hive's built-in aggregate function collect_set, but could not solve it. Is it possible to do this without writing a custom UDF?

【Answer 1】:

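collect_set is not needed. Assuming c4 only holds 0 or 1, as in the example, max(t2.c4) within each t1.c1 group is 1 exactly when at least one of the joined items is 1.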

select      t1.c1       as f1
           ,max(t2.c4)  as f2

from                sample_table1 t1
            join    sample_table2 t2
            on      t1.c2 = t2.c3

group by    t1.c1      
;

+-----+----+
| f1  | f2 |
+-----+----+
| 001 |  1 |
| 002 |  1 |
| 003 |  0 |
+-----+----+
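
To check the query without a Hive cluster, here is a minimal sketch that runs the same join and max() against an in-memory SQLite database from Python. The sample rows are an assumption: the original tables were not preserved, so they are inferred from the annotations in the question (001 -> [a, b, e], 002 -> [c, b], 003 -> [a, c], with only b mapping to 1). SQLite stands in for Hive here, the function name run_query is hypothetical, and an order by is added only to make the row order deterministic.

```python
import sqlite3

# Hypothetical sample rows, inferred from the annotations in the question:
# 001 -> [a, b, e], 002 -> [c, b], 003 -> [a, c]; only b maps to 1.
TABLE1_ROWS = [
    ("001", "a"), ("001", "b"), ("001", "e"),
    ("002", "c"), ("002", "b"),
    ("003", "a"), ("003", "c"),
]
TABLE2_ROWS = [("a", 0), ("b", 1), ("c", 0), ("e", 0)]

# Same join and aggregation as the answer; "order by" is an addition
# so that the result order does not depend on the engine.
QUERY = """
select      t1.c1       as f1
           ,max(t2.c4)  as f2
from                sample_table1 t1
            join    sample_table2 t2
            on      t1.c2 = t2.c3
group by    t1.c1
order by    t1.c1
"""


def run_query():
    """Load the sample rows into in-memory SQLite and return (f1, f2) rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table sample_table1 (c1 text, c2 text)")
        conn.execute("create table sample_table2 (c3 text, c4 integer)")
        conn.executemany("insert into sample_table1 values (?, ?)", TABLE1_ROWS)
        conn.executemany("insert into sample_table2 values (?, ?)", TABLE2_ROWS)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for f1, f2 in run_query():
        print(f1, f2)
```

With these assumed rows, each group's f2 is the largest c4 among its joined items, which matches the table above. If c4 could hold values other than 0 and 1, max() would return that larger value rather than a 0/1 flag.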
