How can I do this inner join properly in Apache PIG?

Posted 2023-04-18

【Posted】: 2011-10-17 06:23:54 【Question】:

I have two files. One is called a-records:

123^record1
222^record2
333^record3

And another file called b-records:

123^jim
123^jim
222^mike
333^joe

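As you can see, token 123 appears once in file A but twice in file B. Is there a way, using Apache Pig, to join the data so that I get only one joined record for each record in file A?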

Here is my current script, followed by the output it produces:

arecords = LOAD '$a'  USING PigStorage('^')  as (token:chararray, type:chararray);

brecords =  LOAD '$b'  USING PigStorage('^')  as (token:chararray, name:chararray);


x = JOIN arecords BY token, brecords BY token;

dump x;

Which produces:

(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

What I actually want is this (note that token 123 appears only once after the join):

(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

Any ideas? Many thanks.

【Answer 1】:

I would do something like this:

arecords = LOAD '$a'  USING PigStorage('^')  as (token:chararray, type:chararray);

brecords =  LOAD '$b'  USING PigStorage('^')  as (token:chararray, name:chararray);

bdistinct = DISTINCT brecords;

x = JOIN arecords BY token, bdistinct BY token;

dump x;
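
One caveat worth spelling out: DISTINCT only collapses rows that are identical in every field, which is the case for the two 123^jim rows here. If b-records could contain the same token with different names (a hypothetical case, not present in the question's data), a sketch along these lines might keep one arbitrary row per token instead. The aliases bgrouped, bone, and firstrow are made up for illustration, and this assumes a Pig version that supports LIMIT inside a nested FOREACH:

```pig
-- Same loads as in the script above.
arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Group the right-hand side by token, then keep a single row from each group.
bgrouped = GROUP brecords BY token;
bone = FOREACH bgrouped {
    firstrow = LIMIT brecords 1;   -- which row survives is arbitrary
    GENERATE FLATTEN(firstrow) AS (token, name);
};

-- Each token now occurs at most once on the right-hand side of the join.
x = JOIN arecords BY token, bone BY token;

DUMP x;
```

For the exact data in the question, DISTINCT is the simpler choice. The grouped variant only matters when the duplicates on the right-hand side are not byte-for-byte identical.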

【Comments】:

Exactly right, I should have done a DISTINCT on my right-hand side. I did that and everything works now, thanks!
