Apache PIG - Join followed by projection results in NULLs
Posted: 2015-05-05 23:00:16
Question: The following code works as expected:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
When I inspect the fields, they are populated correctly.
However, as soon as I add a projection to the mix, it stops working.
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
ab = foreach a_b generate a1 as a1, a2 as a2, b2 as b2;
In ab, every cell in the fields that come from b is NULL.
The same thing happens if I do this:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
a2 = foreach a generate a1, a2;
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
b2 = foreach b generate b1, b2;
ab = join a2 by a1, b2 by b1;
I use the following workaround, but I hate being stuck with the store/load round trip:
a = load 'data_a' using PigStorage('\t') as (a1, a2, a3);
b = load 'data_b' using PigStorage('\t') as (b1, b2, b3);
a_b = join a by a1, b by b1; --inner join
store a_b into 'hdfs:///a_b_temp' using PigStorage('\t','-schema');
a_b2 = load 'hdfs:///a_b_temp' using PigStorage('\t');
ab = foreach a_b2 generate a1 as a1, a2 as a2, b2 as b2;
With this, the fields in ab do not become NULL. However, if I then group and run an aggregation, I usually get this error:
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: org.apache.pig.data.DataByteArray cannot be cast to java.lang.Long
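However, if I skip the last projection, the error goes away.
I'm new to Pig. Are there any known bugs or issues that could cause this? I have seen it happen several times on different data sets.
I'm using Pig 0.12 on Amazon AWS EMR.
Thanks for your help!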
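On the `DataByteArray cannot be cast to java.lang.Long` error: fields loaded without a declared type default to bytearray in Pig, so one thing that may be worth trying is declaring types in the load schema before grouping and aggregating. The sketch below is only an illustration under assumptions: the column types, the idea that a3 is numeric, and the aliases `grouped` and `totals` are not from the question.

```pig
-- Assumption: a1/b1 are numeric keys and a3 is a numeric column to be summed.
a = load 'data_a' using PigStorage('\t') as (a1:long, a2:chararray, a3:long);
b = load 'data_b' using PigStorage('\t') as (b1:long, b2:chararray, b3:chararray);

a_b = join a by a1, b by b1; -- inner join

-- Project with relation-qualified names so each field's origin is explicit.
ab = foreach a_b generate a::a1 as a1, a::a3 as a3, b::b2 as b2;

-- Hypothetical aggregation: total of a3 per b2.
grouped = group ab by b2;
totals = foreach grouped generate group as b2, SUM(ab.a3) as total_a3;

dump totals;
```

If the types cannot be declared at load time, an explicit cast such as `(long)a3` inside the `foreach` is another option to consider.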
Answer 1: I tried your second approach; here is the code.
a = load '/user/root/pig/file1.txt' using PigStorage('\t') as (a1:int, a2:chararray, a3:chararray);
b = load '/user/root/pig/file2.txt' using PigStorage('\t') as (b1:int, b2:chararray, b3:chararray);
--inner join
a_b = join a by a1, b by b1;
-- If the goal is to select specific fields from the joined relation:
-- a::a1 means "the field a1 that came from relation a".
ab = foreach a_b generate a::a1, a2, b2;
-- If the goal is to keep all fields of the matching records from both relations:
-- ab = foreach a_b generate $0..;
DUMP ab;
Hope this helps.
Comments:
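To check which field names are actually available after the join, `describe` can be run on the joined relation before projecting. This is a minimal sketch reusing the relation names from the answer; the file names and column types are assumptions carried over from the snippets in this thread.

```pig
a = load 'data_a' using PigStorage('\t') as (a1:int, a2:chararray, a3:chararray);
b = load 'data_b' using PigStorage('\t') as (b1:int, b2:chararray, b3:chararray);

a_b = join a by a1, b by b1;

-- Prints the schema of the joined relation; fields are listed with
-- their relation prefix (a::a1, b::b2, and so on).
describe a_b;

-- Project using the prefixed names and give the outputs short aliases.
ab = foreach a_b generate a::a1 as a1, a::a2 as a2, b::b2 as b2;
describe ab;
```

Comparing the two schemas makes it easier to see whether a projected field refers to the column that was intended.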
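Thanks for the reply. My understanding is that :: is only needed when the two relations have duplicate field names. Is that not true?
It is not required. Identifying field names with the relation prefix after a JOIN is still supported. You can find more details at: pig.apache.org/docs/r0.9.1/basic.html#disambiguate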
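For the case raised in the comments, where both relations share a field name, a small sketch of the disambiguation operator may help. The relations `x` and `y`, their file names, and their fields are hypothetical and only serve to illustrate the point.

```pig
-- Both hypothetical relations have a field called id.
x = load 'data_x' using PigStorage('\t') as (id:int, name:chararray);
y = load 'data_y' using PigStorage('\t') as (id:int, city:chararray);

x_y = join x by id, y by id;

-- A bare "id" would be ambiguous here, so it is qualified with x::.
-- name and city are unique across the two relations and can be
-- referenced without a prefix.
xy = foreach x_y generate x::id as id, name, city;

dump xy;
```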