Find the Most Repeated Value in a Field Using Hive or Pig

Posted: 2016-04-11 07:46:31

Question:

How can I find the most repeated value in a field using Hive or Pig? The data is in the following format:

cricket,Football,Basketball,Volleyball 
cricket,Football,Basketball
Running cricket,Football
Basketball,Volleyball Football,Basketball,Volleyball,Baseball,Cycling
Running Shooting,Football,Running

I want to find the most common game in the list.

Comments:

What should the output be in this case? Is it per row or for the whole dataset?

Answer 1:

Do a word count over the data, then keep the words with the maximum count.

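-- Load each line of the file as a single string.
lines = LOAD 'test4.txt' as (line:chararray);
-- Split each line into words and emit one word per tuple.
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
-- Count the occurrences of each word.
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
-- Find the maximum count across all words.
groupedsportcount = GROUP sportcount ALL;
maxvalue = FOREACH groupedsportcount  GENERATE MAX(sportcount.total);
-- Keep every word whose count equals the maximum (ties included).
maxsportcount = FILTER sportcount BY (total == maxvalue.$0);
DUMP maxsportcount;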

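The same result can be obtained by sorting the counts in descending order and limiting the output to 1. However, if several words share the maximum count, this returns only one of them rather than all of them.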

lines = LOAD 'test4.txt' as (line:chararray);
sports = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) as sport;
groupedsport = GROUP sports BY sport;
sportcount = FOREACH groupedsport GENERATE group as sport, COUNT(sports) as total;
orderedsportcount = ORDER sportcount BY total DESC;
maxsportcount= LIMIT orderedsportcount 1;
DUMP maxsportcount;
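
If the sort-based approach is preferred but ties must be kept, one option is the RANK operator with DENSE, which gives equal counts the same rank. This is a minimal sketch, assuming Pig 0.11 or later (where RANK is available) and the `sportcount` relation built above. The aliases `ranked`, `topranked` and `topsports` are names introduced here for illustration.

```pig
-- Rank words by count, highest first; DENSE gives tied counts the same rank.
ranked = RANK sportcount BY total DESC DENSE;
-- RANK prepends the rank as the first field ($0); keep every word ranked 1.
topranked = FILTER ranked BY $0 == 1;
-- Drop the rank column and keep only the word and its count.
topsports = FOREACH topranked GENERATE sport, total;
DUMP topsports;
```

Unlike LIMIT 1, this keeps every word whose count equals the highest count.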


Answer 2:

I copied your text into a file, m.txt, and ran the following steps to get the desired output.

str = load '/home/abhijit/Downloads/m.txt' AS (str:chararray);

We use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple).

tokens = foreach str generate TOKENIZE(str);

dump tokens;

The output is in the form of bags.

({(cricket),(Football),(Basketball),(Volleyball)})
({(cricket),(Football),(Basketball)})
({(Running),(cricket),(Football)})
({(Basketball),(Volleyball),(Football),(Basketball),(Volleyball),(Baseball),(Cycling)})
({(Running),(Shooting),(Football),(Running)})

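FLATTEN: un-nests tuples and bags. For a tuple, FLATTEN substitutes the fields of the tuple in place of the tuple. When a bag is un-nested, new tuples are created, one per element of the bag.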

tokens = foreach str generate FLATTEN(TOKENIZE(str));

dump tokens;

(cricket)
(Football)
(Basketball)
(Volleyball)
(cricket)
(Football)
(Basketball)
(Running)
(cricket)
(Football)
(Basketball)
(Volleyball)
(Football)
(Basketball)
(Volleyball)
(Baseball)
(Cycling)
(Running)
(Shooting)
(Football)
(Running)

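For more accurate counts, convert all words to a single case first, so that the same word written in different cases is counted as one word. Use LOWER to convert them to lowercase (or UPPER to convert them to uppercase).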

tokens = foreach str generate FLATTEN(TOKENIZE(LOWER(str)));

The output will be:

(cricket)
(football)
(basketball)
(volleyball)
(cricket)
(football)
(basketball)
(running)
(cricket)
(football)
(basketball)
(volleyball)
(football)
(basketball)
(volleyball)
(baseball)
(cycling)
(running)
(shooting)
(football)
(running)

GROUP: groups the data in one or more relations.

grps = group tokens by $0; 
dump grps;

Here GROUP creates two fields, $0 and $1. $0 is the group key, and $1 is the bag of tuples that share that key.

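The output shows the tuples grouped by key (shown here with the original capitalization, that is, without the LOWER step):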

(Cycling,{(Cycling)})
(Running,{(Running),(Running),(Running)})
(cricket,{(cricket),(cricket),(cricket)})
(Baseball,{(Baseball)})
(Football,{(Football),(Football),(Football),(Football),(Football)})
(Shooting,{(Shooting)})
(Basketball,{(Basketball),(Basketball),(Basketball),(Basketball)})
(Volleyball,{(Volleyball),(Volleyball),(Volleyball)})

The COUNT function counts the number of tuples in the bag ($1) for each key field ($0).

cnt = foreach grps generate $0, COUNT($1);
dump cnt;

The output shows the count for each word:

(Cycling,1)
(Running,3)
(cricket,3)
(Baseball,1)
(Football,5)
(Shooting,1)
(Basketball,4)
(Volleyball,3)
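
The positional references $0 and $1 work, but named fields are easier to follow. Below is a minimal sketch of the same tokenize, group and count steps with named fields, assuming the `str` relation loaded above. The aliases `word` and `total` are names chosen here for illustration and are not part of the original answer.

```pig
-- Name the flattened field so later steps can refer to it.
tokens = foreach str generate FLATTEN(TOKENIZE(LOWER(str))) as word;
-- GROUP names the key field 'group' and names the bag after the grouped relation.
grps = group tokens by word;
-- Count the tuples in each bag and give both output fields a name.
cnt = foreach grps generate group as word, COUNT(tokens) as total;
dump cnt;
```

The later steps can then use `order cnt by total desc` instead of `order cnt by $1 desc`.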

ORDER sorts the tuples, here in descending order of count, so the word with the highest count ends up at the top.

ord = order cnt by $1 desc;
dump ord;

Output after sorting:

(Football,5)
(Basketball,4)
(Running,3)
(cricket,3)
(Volleyball,3)
(Cycling,1)
(Baseball,1)
(Shooting,1)

LIMIT: restricts the number of output tuples to the specified count.

maxWord = limit ord 1;
dump maxWord;

The final output is:

(Football,5)
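
Because LIMIT 1 keeps a single tuple, any other word tied for the highest count would be dropped. Below is a sketch of a tie-safe final step, assuming the `cnt` relation built above with its unnamed fields $0 (word) and $1 (count). The aliases `allcnt`, `maxcnt`, `top` and `maxWords` are names introduced here for illustration.

```pig
-- Put all (word, count) tuples into one group so MAX can see every count.
allcnt = group cnt all;
-- Compute the highest count; the result is a single-row relation.
maxcnt = foreach allcnt generate MAX(cnt.$1) as top;
-- Keep every word whose count equals that maximum, using maxcnt as a scalar.
maxWords = filter cnt by $1 == maxcnt.top;
dump maxWords;
```

With the sample data this should again yield only (Football,5), since no other word appears five times.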

Comments:

This will not give the right answer if another tuple has the same maximum count, say (cricket,5).
