Apache PIG - 分组

Posted

技术标签:

【中文标题】Apache PIG - 分组【英文标题】:Apache PIG - GROUP BY 【发布时间】:2015-10-01 23:27:28 【问题描述】:

我希望在 Pig 中实现以下功能。我有一组这样的示例记录。

请注意,EffectiveDate 列有时为空白,并且对于相同的 CustomerID 也不同。

现在,作为输出,我希望每个 CustomerID 有一条记录,其中 EffectiveDate 为 MAX。因此,对于上面的示例,我希望突出显示的记录如下所示。

我目前使用 PIG 的方式是这样的:

customerdata = LOAD 'customerdata' AS (CustomerID:chararray, CustomerName:chararray, Age:int, Gender:chararray, EffectiveDate:chararray);

--Group customer data by CustomerID
customerdata_grpd = GROUP customerdata BY CustomerID;

--From the grouped data, generate one record per CustomerID that has the maximum EffectiveDate.
customerdata_maxdate = FOREACH customerdata_grpd GENERATE group as CustID, MAX(customerdata.EffectiveDate) as MaxDate;

--Join the above with the original data so that we get the other details like CustomerName, Age etc.
joinwithoriginal = JOIN customerdata by (CustomerID, EffectiveDate), customerdata_maxdate by (CustID, MaxDate);

finaloutput = FOREACH joinwithoriginal GENERATE customerdata::CustomerID as CustomerID, CustomerName as CustomerName, Age as Age, Gender as gender, EffectiveDate as EffectiveDate;

我基本上是对原始数据进行分组以找到最大有效日期的记录。然后我再次将这些“分组”记录与原始数据集结合起来,以获得具有最大有效日期的相同记录,但这次我还将获得其他数据,如客户姓名、年龄和性别。这个数据集很大,所以这种方法需要很多时间。有更好的方法吗?

【问题讨论】:

【参考方案1】:

输入:

1,John,28,M,1-Jan-15
1,John,28,M,1-Feb-15
1,John,28,M,
1,John,28,M,1-Mar-14
2,Jane,25,F,5-Mar-14
2,Jane,25,F,5-Jun-15
2,Jane,25,F,3-Feb-14

猪脚本:

customer_data = LOAD 'customer_data.csv' USING PigStorage(',')  AS  (id:int,name:chararray,age:int,gender:chararray,effective_date:chararray);

customer_data_fmt = FOREACH customer_data GENERATE id..gender,ToDate(effective_date,'dd-MMM-yy') AS date, effective_date;

customer_data_grp_id = GROUP customer_data_fmt BY id;

req_data = FOREACH customer_data_grp_id 
        customer_data_ordered = ORDER customer_data_fmt BY date DESC;
        req_customer_data = LIMIT customer_data_ordered 1;
        GENERATE FLATTEN(req_customer_data.id) AS id, 
                 FLATTEN(req_customer_data.name) AS name,
                 FLATTEN(req_customer_data.gender) AS gender,
                 FLATTEN(req_customer_data.effective_date) AS effective_date;
;  

输出:

(1,John,M,1-Feb-15)
(2,Jane,F,5-Jun-15)

【讨论】:

以上是关于Apache PIG - 分组的主要内容,如果未能解决你的问题,请参考以下文章

Apache PIG - 使用百分比值对 foreach 中的分组数据进行采样

apache pig中的嵌套组

如果存在多个值,Apache Pig Group by 和过滤器?

在 Pig 中投影分组元组

如何使用 Pig 将分组记录存储到多个文件中?

PIG - 从一个大输入优化各种分组结构的最佳方法