Can I generate nested bags using nested FOREACH statements in Pig Latin?
【Posted】2011-02-08 11:53:49
【Question】Suppose I have a dataset of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
I want to generate a list of the average rating per user and city, i.e. the output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I can write a Pig script as follows:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity
GENERATE group.user, group.city, AVG(Data.rating);
But I'm curious whether I can group by the higher level (user) first and then by the next level (city), i.e.:
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
    group: chararray,
    Data: {
        user: chararray,
        city: chararray,
        restaurant: chararray,
        rating: float
    }
}
Has anyone tried this successfully? Is it simply not possible to GROUP inside a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser
FOREACH City
GENERATE user, city, AVG(City.rating)
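【Answer 1】Currently, the operations allowed inside a FOREACH are DISTINCT, FILTER, LIMIT, and ORDER BY.
For now, grouping directly by (user, city), as you already do, is the right way to get this result.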
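To illustrate what a nested block can do with the operators listed above, here is a minimal sketch that picks each user's highest-rated review. It assumes the question's data.txt (header row included); the aliases Raw, sorted, best, and BestPerUser are illustrative names, not from the original post.

```pig
-- Assumes the question's data.txt, including its header row.
Raw = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float);
Data = FILTER Raw BY user != 'User';  -- drop the header row

PerUser = GROUP Data BY user;
BestPerUser = FOREACH PerUser {
    -- ORDER BY and LIMIT operate on the per-user bag named Data
    sorted = ORDER Data BY rating DESC;
    best = LIMIT sorted 1;
    GENERATE group AS user, FLATTEN(best.(city, restaurant, rating));
};
DUMP BestPerUser;
```

The nested block can sort, filter, deduplicate, and truncate the inner bag, but it cannot regroup it, which is why the GROUP in the question is rejected.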
【Answer 2】The release notes for Pig 0.10 indicate that nested FOREACH operations are now supported.
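As a hedged sketch of what that feature looks like: in Pig 0.10 and later, a nested FOREACH is a projection over the inner bag, not a nested GROUP, so it does not by itself produce the per-city grouping the question asks for. The example below counts distinct cities per user; it assumes the question's data.txt, and the aliases Raw, cities, unique_cities, and CitiesPerUser are illustrative.

```pig
-- Requires Pig 0.10 or later. Assumes the question's data.txt with its header row.
Raw = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float);
Data = FILTER Raw BY user != 'User';  -- drop the header row

PerUser = GROUP Data BY user;
CitiesPerUser = FOREACH PerUser {
    -- inner FOREACH ... GENERATE: project each review down to its city
    cities = FOREACH Data GENERATE city;
    unique_cities = DISTINCT cities;
    -- outer GENERATE: emit one row per user
    GENERATE group AS user, COUNT(unique_cities) AS num_cities;
};
DUMP CitiesPerUser;
```

The block contains two GENERATE clauses because the inner FOREACH needs its own GENERATE for the projection, while the final GENERATE produces the output row of the outer FOREACH.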
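【Comments】
Thanks. Why does the inner block need two GENERATEs?
I retract my suggestion. The release notes indicate this can be done, but I couldn't get it to work.
【Answer 3】Try this: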
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
【Comments】
You should add an explanation of what this code accomplishes and how it does so.
This is wrong... it should be:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records by (user, city);
avgRating_Byuser_perCity = foreach grpRecs generate flatten(group), AVG(Records.rating) as average;
Result = dump avgRating_Byuser_perCity;
【Answer 4】
rawdata = load 'data' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity
generate group.user,group.city,AVG(data.rating);
dump average;
【Answer 5】Grouping by both keys and then flattening the structure gives the same result.
Load the data as you did:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float);
Group by user and city:
ByUserByCity = GROUP Data BY (user, city);
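Add the average rating for each group (you could add more, e.g. COUNT(Data) AS count_res), then flatten the group tuple back into separate user and city fields: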
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
Result:
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
User,City,
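If the nested per-user structure from the question's title is what is wanted, one possible workaround (not taken from any answer in this thread) is to group twice: first by (user, city) to compute the averages, then by user to collect them into a bag. This is a sketch under the assumption that the input is the question's data.txt; the aliases Raw, CityAvg, Nested, avg_rating, and cities are illustrative.

```pig
-- Assumes the question's data.txt, including its header row.
Raw = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float);
Data = FILTER Raw BY user != 'User';  -- drop the header row

-- Step 1: average rating per (user, city)
PerUserCity = GROUP Data BY (user, city);
CityAvg = FOREACH PerUserCity GENERATE
    FLATTEN(group) AS (user, city),
    AVG(Data.rating) AS avg_rating;

-- Step 2: regroup by user so each user carries a bag of (city, avg_rating)
PerUser = GROUP CityAvg BY user;
Nested = FOREACH PerUser GENERATE
    group AS user,
    CityAvg.(city, avg_rating) AS cities;
DESCRIBE Nested;
DUMP Nested;
```

Filtering out the header row up front also avoids the trailing `User,City,` line that appears in the result above.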
【Comments】
I guess this doesn't answer the question.