PIG need to find max
Posted: 2021-02-20 06:30:43
Question: I am new to Pig and am working on a problem where I need to find the player with the maximum weight in this data set. Here is a sample of the data:
id, weight,id,year, triples
(bayja01,210,bayja01,2005,6)
(crawfca02,225,crawfca02,2005,15)
(damonjo01,205,damonjo01,2005,6)
(dejesda01,190,dejesda01,2005,6)
(eckstda01,170,eckstda01,2005,7)
Here is my Pig script:
batters = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' using PigStorage(',');
realbatters = FILTER batters BY $1==2005;
triphitters = FILTER realbatters BY $9>5;
tripids = FOREACH triphitters GENERATE $0 AS id,$1 AS YEAR, $9 AS Trips;
names = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv'
using PigStorage(',');
weights = FOREACH names GENERATE $0 AS id, $16 AS weight;
get_ids = JOIN weights BY (id), tripids BY(id);
wts = FOREACH get_ids GENERATE MAX(get_ids.weight)as wgt;
DUMP wts;
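The second-to-last line does not work, of course. Pig tells me I have to use an explicit cast. I have worked out the filtering and so on; I just can't figure out how to get the final answer.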
Comments:
I tried this, but I still get error 1045: id_wght = FOREACH names GENERATE $0 AS id, (double)$16 AS weight; get_ids = JOIN id_wght BY (id), triids BY(id); final = FOREACH get_ids GENERATE MAX($1),$0 AS id;
Answer 1:
The MAX function in Pig expects a bag and returns the highest value in that bag. To create a bag, you must first GROUP your data:
get_ids = JOIN weights BY id, tripids BY id;
-- Drop columns we no longer need and rename for ease
just_ids_weights = FOREACH get_ids GENERATE
    weights::id AS id,
    weights::weight AS weight;
-- Group the data by id value
gp_by_ids = GROUP just_ids_weights BY id;
-- Find maximum weight by id
wts = FOREACH gp_by_ids GENERATE
    group AS id,
    MAX(just_ids_weights.weight) AS wgt;
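To see why MAX accepts the grouped relation but not the joined one, it can help to inspect both schemas with DESCRIBE. This is only a small sketch meant to be appended to the script above; it reuses the aliases get_ids and gp_by_ids from this answer and adds nothing else.

```pig
-- Each row of the join holds plain fields (ids, weight), so there is no bag
-- for MAX to aggregate over; this is the situation behind the 1045 error.
DESCRIBE get_ids;

-- After GROUP, each row holds the key `group` plus a bag named
-- just_ids_weights, which is what MAX(just_ids_weights.weight) consumes.
DESCRIBE gp_by_ids;
```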
If you want the maximum weight across all of the data, you can use GROUP ALL to put all the data into a single bag:
gp_all = GROUP just_ids_weights ALL;
wts = FOREACH gp_all GENERATE
    MAX(just_ids_weights.weight) AS wgt;
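The GROUP ALL form returns only the maximum weight, not the player it belongs to. If the player's id is wanted as well, one possible approach is to sort by weight and keep the top row. The following is a sketch under assumptions rather than a tested solution: the file paths and column positions ($1 = year, $9 = triples, $16 = weight) are taken from the question, the explicit (int) casts are added here so that comparisons and sorting are numeric, and the aliases hitters, players, valid, sorted and heaviest are made up for illustration.

```pig
-- Batters from 2005 with more than 5 triples (column positions as in the question).
batters  = LOAD 'hdfs:/user/maria_dev/pigtest/Batting.csv' USING PigStorage(',');
hitters  = FILTER batters BY (int)$1 == 2005 AND (int)$9 > 5;
tripids  = FOREACH hitters GENERATE $0 AS id;

-- Player weights, cast to int so that ORDER compares numbers, not raw bytes.
names    = LOAD 'hdfs:/user/maria_dev/pigtest/Master.csv' USING PigStorage(',');
weights  = FOREACH names GENERATE $0 AS id, (int)$16 AS weight;

joined   = JOIN weights BY id, tripids BY id;
players  = FOREACH joined GENERATE weights::id AS id, weights::weight AS weight;

-- Values that cannot be cast (empty fields, a header row) become null; drop them.
valid    = FILTER players BY weight IS NOT NULL;

-- Sort heaviest first and keep a single row.
sorted   = ORDER valid BY weight DESC;
heaviest = LIMIT sorted 1;
DUMP heaviest;
```

Note that LIMIT 1 keeps a single row, so if several players share the maximum weight only one of them is returned.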