How to join a header row to detail rows in multiple files with Apache Pig

Posted: 2015-10-22 17:22:00

Question:

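I have several CSV files in an HDFS folder, which I load into a relation as follows:

source = LOAD '$data' USING PigStorage(','); -- $data is passed as a parameter to the pig command.

When I dump it, the source relation has the following structure (note that the data is text-qualified, but I will handle that with the REPLACE function):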

("HEADER","20110118","20101218","20110118","T00002")
("0000000000000000035412","20110107","2699","D","20110107","2315.","","","","","","C")
("0000000000000000035412","20110107","2699","D","20110107","246..","162","74","","","","B")

<.... more records ....>

("HEADER","20110224","20110109","20110224","T00002")
("0000000000000000035412","20110121","2028","D","20110121","a6c3.","","","","","R","P")
("0000000000000000035412","20110217","2619","D","20110217","a6c3.","","","","","R","P")

<.... more records ....>

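So each file has a header that provides some information about the data set that follows it, such as the provider of the data and the date range it covers.

Now, how can I transform the structure above into a new relation like the one below?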


(HEADER,20110118,20101218,20110118,T00002),(0000000000000000035412,20110107,2699,D,20110107,2315.,,,,,,C),(0000000000000000035412,20110107,2699,D,20110107,246..,162,74,,,,B),..more tuples..,
(HEADER,20110224,20110109,20110224,T00002),(0000000000000000035412,20110121,2028,D,20110121,a6c3.,,,,,R,P),(0000000000000000035412,20110217,2619,D,20110217,a6c3.,,,,,R,P),..more tuples..,..more tuples..

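That is, each header tuple followed by a bag of the record tuples that belong to that header? Unfortunately, the header rows and detail rows share no common key field, so I don't think any JOIN operation can be used.

I am quite new to Pig and Hadoop, and this is one of the first proof-of-concept projects I am involved in.

I hope my question is clear, and I look forward to some guidance here.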


Answer 1:

This should get you started. Code:

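Source = LOAD '$data' USING PigStorage(',','-tagFile'); -- -tagFile prepends the source file name as $0
SPLIT Source INTO FileHeaders IF $1 == 'HEADER', FileData OTHERWISE; -- assumes the quotes around "HEADER" have already been stripped
B = GROUP FileData BY $0; -- detail rows, one group per file
C = GROUP FileHeaders BY $0; -- header row, one group per file
D = JOIN B BY group, C BY group; -- the file name serves as the shared key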
...
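
As a possible variation on the same idea, the two GROUP statements and the JOIN could perhaps be collapsed into a single COGROUP on the file name. This is only a sketch, not a tested solution: it assumes a Pig version whose PigStorage supports the '-tagFile' option, that each file contains exactly one header row, and that the fields are still wrapped in double quotes at this point. The alias PerFile is made up for illustration.

```pig
-- Sketch only: assumes PigStorage supports '-tagFile' and one header row per file.
Source = LOAD '$data' USING PigStorage(',', '-tagFile');

-- $0 is the file name added by -tagFile; $1 is the first CSV field.
-- The comparison assumes the double quotes have not been stripped yet.
SPLIT Source INTO FileHeaders IF (chararray)$1 == '"HEADER"', FileData OTHERWISE;

-- PerFile is a hypothetical alias. COGROUP on the file name gives one row per file:
-- (file name, {header tuple}, {detail tuples})
PerFile = COGROUP FileHeaders BY $0, FileData BY $0;

DUMP PerFile;
```

Because the file name is the only key, this relies on every detail row belonging to the header in the same file, which matches the layout described in the question.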

Comments:

Yes, this did get me started, thanks!
