Pig: Count Frequency of distinct Tuples in a file

Posted: 2016-10-08 01:02:09

Question:

I have a file containing JSON entries, one per line, like the following:

{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"}

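I want to count how often each distinct JSON object occurs in the file. I have seen other answers that use GROUP BY and the COUNT() function in Pig. I am not sure whether I am using them correctly, but I am not getting the result I want. My output should look like this:

{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua", "count": "3"}
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin", "count": "2"}

The order does not matter. Can someone give me some pointers?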

Comments:

Please share what you have tried and why you think it is not working.

Answer 1:

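Here is code you can use. It groups on all of the fields together. If you want a different output format, you can read the fields from the group tuple and write them out in whatever format you need.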

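-- Load one JSON object per line using an explicit schema.
A = LOAD '/user/root/test12.json' USING JsonLoader('child_pos:chararray, parent_pos:chararray, parent:chararray, child_dep:chararray, parent_dep:chararray, child:chararray');
-- Group identical records by using all six fields as the key.
B = GROUP A BY (child_pos, parent_pos, parent, child_dep, parent_dep, child);
-- Count the records in each group (COUNT skips rows whose child_pos is null).
C = FOREACH B GENERATE group, COUNT(A.child_pos) AS COUNTX;
STORE C INTO 'user/data/json_out.json' USING JsonStorage();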

The output is:
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"case","child_dep":"nn","parent_dep":"nsubj","child":"martin"},"COUNTX":2}
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"fighter","child_dep":"nn","parent_dep":"nsubj","child":"virtua"},"COUNTX":3}

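For a small file, the same group-and-count logic can be cross-checked outside Pig. The following is a minimal Python sketch, not part of the original answer. It assumes one JSON object per line, and the helper names `count_records` and `to_json_lines` are made up for illustration. It writes the count as a flat "count" key, as a number rather than the quoted string shown in the question.

```python
import json
from collections import Counter

# Field order mirrors the GROUP BY key in the Pig script.
FIELDS = ("child_pos", "parent_pos", "parent", "child_dep", "parent_dep", "child")


def count_records(lines):
    """Count how often each distinct combination of FIELDS occurs."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[tuple(record[name] for name in FIELDS)] += 1
    return counts


def to_json_lines(counts):
    """Render each distinct record as a flat JSON object with a "count" key."""
    rows = []
    for key, n in counts.items():
        row = dict(zip(FIELDS, key))
        row["count"] = n
        rows.append(json.dumps(row))
    return rows


if __name__ == "__main__":
    # Hypothetical two-line sample; replace with the lines of the real file.
    sample = [
        '{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", '
        '"child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"}',
        '{"child_pos": "NN", "parent_pos": "NN", "parent": "case", '
        '"child_dep": "nn", "parent_dep": "nsubj", "child": "martin"}',
    ]
    for out_line in to_json_lines(count_records(sample)):
        print(out_line)
```

This holds every distinct record in memory, so it is only a sanity check for small inputs; the Pig script remains the approach for data stored in HDFS.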