Find the Most Repeated Value in a Field Using Hive or Pig
【Title】: Find The Most Repeated Value In a Field using Hive or Pig
【Posted】: 2016-04-11 07:46:31
【Question】: How can I find the most repeated value in a field using Hive or Pig? The data is in the following format:
cricket,Football,Basketball,Volleyball
cricket,Football,Basketball
Running cricket,Football
Basketball,Volleyball Football,Basketball,Volleyball,Baseball,Cycling
Running Shooting,Football,Running
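I want to find the most common game in the list.
【Comments】:
What should the output be in this case? Should it be computed per row or for the whole dataset?
【Answer 1】: Do a word count on the data, then take the word(s) with the maximum count.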
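-- Load each line of the input file as a single chararray.
lines = LOAD 'test4.txt' as (line:chararray);
-- Split each line into words and emit one word per tuple.
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
-- Count how many times each word occurs.
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
-- Find the maximum count over all words.
groupedsportcount = GROUP sportcount ALL;
maxvalue = FOREACH groupedsportcount GENERATE MAX(sportcount.total);
-- Keep every word whose count equals the maximum (ties included).
maxsportcount = FILTER sportcount BY (total == maxvalue.$0);
DUMP maxsportcount;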
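As a rough cross-check outside Pig, the same count-then-filter logic can be sketched in Python. This is only an illustration, not part of the answer: the `ROWS` list is the sample data copied from the question, `most_common_all` is a hypothetical helper name, and the regular expression only approximates what `TOKENIZE` does for this particular data (it splits on commas and whitespace).

```python
import re
from collections import Counter

# Sample rows copied from the question.
ROWS = [
    "cricket,Football,Basketball,Volleyball",
    "cricket,Football,Basketball",
    "Running cricket,Football",
    "Basketball,Volleyball Football,Basketball,Volleyball,Baseball,Cycling",
    "Running Shooting,Football,Running",
]


def most_common_all(rows):
    """Return every (word, count) pair whose count equals the maximum."""
    # Split on commas and whitespace; empty strings are dropped.
    words = [w for row in rows for w in re.split(r"[,\s]+", row) if w]
    counts = Counter(words)
    if not counts:
        return []
    top = max(counts.values())
    # Like the FILTER step: keep all words tied for the maximum count.
    return sorted((w, c) for w, c in counts.items() if c == top)


print(most_common_all(ROWS))
```

For the question's data this yields a single pair, Football with a count of 5, which agrees with the counts shown in the second answer.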
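The same result can be obtained by sorting the counts in descending order and limiting the output to 1. However, if several words share the maximum count, this does not return all of them.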
lines = LOAD 'test4.txt' as (line:chararray);
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
-- Sort by count, highest first, and keep only the first row.
orderedsportcount = ORDER sportcount BY total DESC;
maxsportcount = LIMIT orderedsportcount 1;
DUMP maxsportcount;
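The difference between the two scripts only shows up when there is a tie. The sketch below illustrates this in Python with made-up counts (the `tied` dictionary is hypothetical and is not the question's data); `top_one` and `all_top` are illustrative names standing in for the ORDER/LIMIT script and the MAX/FILTER script respectively.

```python
def top_one(counts):
    """Mirror ORDER ... DESC followed by LIMIT 1: exactly one row survives."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return ordered[:1]


def all_top(counts):
    """Mirror MAX + FILTER: every row with the maximum count survives."""
    if not counts:
        return []
    top = max(counts.values())
    return [(word, n) for word, n in counts.items() if n == top]


# Hypothetical counts containing a tie for first place.
tied = {"Football": 5, "cricket": 5, "Basketball": 4}

print(top_one(tied))  # one row only, so one of the tied words is lost
print(all_top(tied))  # both tied words are kept
```

In Pig, which of the tied rows survives `LIMIT 1` should not be relied on, so the filter-based script is the safer choice when ties matter.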
【Comments】:
【Answer 2】: I copied your text into a file named m.txt and ran the following steps to get the desired output.
str = load '/home/abhijit/Downloads/m.txt' AS (str:chararray);
We use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in its own tuple).
tokens = foreach str generate TOKENIZE(str);
dump tokens;
The output is in the form of a bag.
({(cricket),(Football),(Basketball),(Volleyball)})
({(cricket),(Football),(Basketball)})
({(Running),(cricket),(Football)})
({(Basketball),(Volleyball),(Football),(Basketball),(Volleyball),(Baseball),(Cycling)})
({(Running),(Shooting),(Football),(Running)})
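Note that the rows mix commas and spaces, and both act as separators. As far as I recall, the default `TOKENIZE` delimiters in Pig are space, double quote, comma, parentheses, and star; treat that list as an assumption and check it against the Pig documentation for your version. A minimal Python sketch of the same splitting, with `tokenize` as a hypothetical helper name:

```python
import re

# Assumed default TOKENIZE delimiters: space, double quote, comma,
# parentheses, and star. Verify against the Pig docs for your version.
_DELIMS = re.compile(r'[ ",()*]+')


def tokenize(line):
    """Split a line into words, dropping empty strings."""
    return [tok for tok in _DELIMS.split(line) if tok]


print(tokenize("Running cricket,Football"))
```

This is why a row such as `Running cricket,Football` produces three words rather than two.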
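FLATTEN: un-nests tuples and bags. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple. When we un-nest a bag, we create new tuples, one per element of the bag.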
tokens = foreach str generate FLATTEN(TOKENIZE(str));
dump tokens;
(cricket)
(Football)
(Basketball)
(Volleyball)
(cricket)
(Football)
(Basketball)
(Running)
(cricket)
(Football)
(Basketball)
(Volleyball)
(Football)
(Basketball)
(Volleyball)
(Baseball)
(Cycling)
(Running)
(Shooting)
(Football)
(Running)
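For more accurate counts, normalize the words to a single case so that the same word written with different capitalization is counted once. Use LOWER to convert the text to lowercase; you can also use UPPER to convert it to uppercase.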
tokens = foreach str generate FLATTEN(TOKENIZE(LOWER(str)));
The output will be:
(cricket)
(football)
(basketball)
(volleyball)
(cricket)
(football)
(basketball)
(running)
(cricket)
(football)
(basketball)
(volleyball)
(football)
(basketball)
(volleyball)
(baseball)
(cycling)
(running)
(shooting)
(football)
(running)
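To see why case normalization matters, here is a small Python sketch. The `mixed` list is hypothetical input made up for illustration (the question's data spells each sport only one way), and `count_words` is an illustrative helper, not a Pig function.

```python
from collections import Counter


def count_words(words, normalize=True):
    """Count words, optionally lower-casing them first (like Pig's LOWER)."""
    if normalize:
        words = [w.lower() for w in words]
    return Counter(words)


# Hypothetical input with inconsistent capitalization.
mixed = ["Cricket", "cricket", "CRICKET", "Football", "football"]

print(count_words(mixed, normalize=False))  # each spelling counted separately
print(count_words(mixed))                   # spellings merged per word
```

Without normalization the three spellings of cricket are counted as three different words, each with a count of 1; with it they collapse into one word with a count of 3.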
GROUP: groups the data in one or more relations.
grps = group tokens by $0;
dump grps;
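Here GROUP creates two fields, $0 and $1. $0 is the key, and $1 is the bag of tuples that share that group key (that is, the key field $0).
The output shows the fields grouped by key (it was produced from the tokens without LOWER applied, so the original capitalization is kept):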
(Cycling,{(Cycling)})
(Running,{(Running),(Running),(Running)})
(cricket,{(cricket),(cricket),(cricket)})
(Baseball,{(Baseball)})
(Football,{(Football),(Football),(Football),(Football),(Football)})
(Shooting,{(Shooting)})
(Basketball,{(Basketball),(Basketball),(Basketball),(Basketball)})
(Volleyball,{(Volleyball),(Volleyball),(Volleyball)})
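The shape of the grouped relation, a key paired with the bag of all tuples carrying that key, can be pictured with a plain Python dictionary of lists. This is only an analogy: `group_by_word` is a hypothetical helper and the `tokens` list is a shortened, made-up sample.

```python
from collections import defaultdict


def group_by_word(tokens):
    """Map each word (the key, $0) to the list of its occurrences (the bag, $1)."""
    groups = defaultdict(list)
    for tok in tokens:
        groups[tok].append(tok)
    return dict(groups)


# Shortened, made-up sample of flattened tokens.
tokens = ["cricket", "Football", "cricket", "Running"]

grouped = group_by_word(tokens)
# Counting the elements of each bag gives the per-word totals.
counts = {key: len(bag) for key, bag in grouped.items()}

print(grouped)
print(counts)
```

Counting the elements of each list is the Python counterpart of applying COUNT to the bag in the next step.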
COUNT: computes the number of tuples in the bag ($1) for each key field ($0).
cnt = foreach grps generate $0, COUNT($1);
dump cnt;
The output shows the count for each word:
(Cycling,1)
(Running,3)
(cricket,3)
(Baseball,1)
(Football,5)
(Shooting,1)
(Basketball,4)
(Volleyball,3)
ORDER: sorts the tuples, here in descending order of count, so the highest count ends up at the top.
ord = order cnt by $1 desc;
dump ord;
Output after sorting:
(Football,5)
(Basketball,4)
(Running,3)
(cricket,3)
(Volleyball,3)
(Cycling,1)
(Baseball,1)
(Shooting,1)
LIMIT: limits the number of output tuples to the specified count.
maxWord = limit ord 1;
dump maxWord;
The final output is:
(Football,5)
【Comments】:
This will not give the correct answer if another tuple has the same maximum count, say (cricket,5).