Join and group-by in Hadoop Pig

Posted 2023-04-18


【Title】: join and group-by in Hadoop Pig 【Posted】: 2016-03-13 22:45:32 【Question】:

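I often see people use group by and join to solve the same problem. Suppose I have a students table and a scores table, and I want to find each student's name together with the scores for the associated courses. It seems this can be solved with either join or group by. What are the pros and cons of the two approaches? The data structures and code are posted below. Thanks.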

table students:

student ID, student name, student email address

score table:

student ID, course ID, score

student_scores = group students by (studentId) inner, scores by (studentId);

student_scores = join students by studentId, scores by studentId;

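【Comments】:

Possible duplicate of Join vs COGROUP in PIG

@rahulbmv, great reference, upvoted. :) But I am asking about group vs. join; do you mean cogroup? Thanks.

@rahulbmv, I am also confused about what "foreign key" means in the comments there: "Both of them need to send all the records forward, with the key being the foreign key." If you could give an example, that would be great.

【Answer 1】: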

The Pig Latin manual's section on JOIN says:

Note the following about the GROUP/COGROUP and JOIN operators:

The GROUP and JOIN operators perform similar functions. GROUP creates a nested set of output tuples while JOIN creates a flat set of output tuples.
The GROUP/COGROUP and JOIN operators handle null values differently (see Nulls and JOIN Operator).

I am not sure whether these count as pros and cons, but the two operators do behave differently.
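To make the nested-versus-flat distinction concrete for the students/scores example, here is a rough model in plain Python. It is not Pig code, only a sketch of the output shapes. The sample rows and the helper names `cogroup_model` and `join_model` are made up for illustration, and null keys are not modeled.

```python
from collections import defaultdict

# Hypothetical sample rows; the question only gives the column names.
students = [
    (1, "Ann", "ann@example.com"),
    (2, "Bob", "bob@example.com"),
]
scores = [
    (1, "math", 90),
    (1, "physics", 85),
    (2, "math", 70),
]


def cogroup_model(left, right):
    """Model of the nested output: one tuple per key, holding one bag per input."""
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[0]][0].append(row)
    for row in right:
        bags[row[0]][1].append(row)
    # Mimic `students by studentId inner`: drop keys whose left bag is empty.
    return [(key, l, r) for key, (l, r) in sorted(bags.items()) if l]


def join_model(left, right):
    """Model of the flat inner-join output: one tuple per matching pair of rows."""
    return [l + r for l in left for r in right if l[0] == r[0]]


grouped = cogroup_model(students, scores)
joined = join_model(students, scores)

for row in grouped:
    print(row)  # (key, bag of student rows, bag of score rows)
for row in joined:
    print(row)  # student fields followed by score fields
```

In this model the grouped form yields one record per student with all of that student's scores collected in a bag, while the joined form repeats the student's fields once per score row. Which shape is more convenient depends on what the next step of the script needs.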

【Comments】:

Thanks Mzf. My question is how they differ in my example; I would like to understand the difference. :)
