Apache Pig - How to get number of matching elements between multiple bags?

【Posted】: 2013-05-22 08:52:31

【Question】:

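I am a new user of Apache Pig and I have a problem to solve.

I am trying to build a small search engine with Apache Pig. The idea is simple: I have a file that is a concatenation of multiple documents (one document per line). Here is an example with three documents: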

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3
3,word1 word3 word4 word5

Then I create a bag of words for each document with the following lines of code:

docs = LOAD '$documents' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE line;
C = FOREACH B GENERATE TOKENIZE(line) as gu;

Then I remove the duplicate entries from each bag:

filtered = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE uniq;
};

Here is the result of this code:

DUMP filtered;

((word1), (word4),  (word2))
((word2), (word6),  (word1), (word5), (word3))
((word1), (word3),  (word4), (word5))

So I have one bag of words per document, just as I wanted.

Now, let's consider the user query as a file:

word2 word7 word5

I convert the query into a bag of words:

query = LOAD '$query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS quer;

DUMP bag_query;

Here is the result:

((word2), (word7), (word5))

Now, here is my problem: I would like to get the number of matches between the query and each document. For this example, I want this output:

1
2
1

I tried a JOIN between the bags, but it did not work.

Could you help me?

Thanks.

【Answer 1】:

If you are fine without any UDF, this can be done by pivoting the bag and doing everything SQL-style.

docs = LOAD '/input/search.dat' USING PigStorage(',') AS (id:int, line:chararray);
C = FOREACH docs GENERATE id, TOKENIZE(line) as gu;
pivoted = FOREACH C {
    uniq = DISTINCT gu;
    GENERATE id, FLATTEN(uniq) as word;
};
filtered = FILTER pivoted BY word MATCHES '(word2|word7|word5)';
--dump filtered;
count_id_matched = FOREACH (GROUP filtered BY id) GENERATE group as id, COUNT(filtered) as count;

dump count_id_matched;

count_word_matched_in_docs = FOREACH (GROUP filtered BY word) GENERATE group as word, COUNT(filtered) as count;

dump count_word_matched_in_docs;
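
The FILTER above hard-codes the query words in a regular expression. A possible variation, sketched here under the assumption that the documents and the query live in files named 'docs' and 'query' (the aliases doc_words, query_words, matched and counts are illustrative names, not from the original post), flattens both sides to one word per row and uses a plain JOIN:

```pig
-- Assumed inputs: 'docs' holds "id,words..." lines, 'query' holds one line of words.
docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);

-- One (id, word) row per distinct word of each document.
doc_words_all = FOREACH docs GENERATE id, FLATTEN(TOKENIZE(line)) AS word;
doc_words = DISTINCT doc_words_all;

-- One row per distinct query word.
query = LOAD 'query' AS (line_query:chararray);
query_words_all = FOREACH query GENERATE FLATTEN(TOKENIZE(line_query)) AS word;
query_words = DISTINCT query_words_all;

-- Inner join keeps only the (id, word) pairs whose word appears in the query.
matched = JOIN doc_words BY word, query_words BY word;

-- Count the surviving words per document.
counts = FOREACH (GROUP matched BY doc_words::id)
         GENERATE group AS id, COUNT(matched) AS cnt;

DUMP counts;
```

For the three documents in the question this should yield the counts 1, 2 and 1. Because the join is an inner join, a document with no matching word would not appear in the output at all rather than showing a count of 0.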

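【Answer 2】:

Try using SetIntersect (a DataFu UDF - https://github.com/linkedin/datafu) and SIZE to get the number of elements in the resulting bag.

【Comments】:

Thanks for your reply, but it doesn't work. My bags are in different variables, and SetIntersect seems to require the bags to be in the same variable.

【Answer 3】:

As SNeumann pointed out, you can use DataFu's SetIntersect for your example.

Building on your example, given these documents: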

1,word1 word4 word2 word1
2,word2 word6 word1 word5 word3 word7
3,word1 word3 word4 word5

And given this query:

word2 word7 word5

Then this code gives you what you want:

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
};

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
};

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
};

DUMP result;

Resulting values:

(1,1)
(2,3)
(3,1)

Here is a complete working example that you can paste into the DataFu unit tests for SetIntersect:

/**
register $JAR_PATH

define SetIntersect datafu.pig.sets.SetIntersect();

docs = LOAD 'docs' USING PigStorage(',') AS (id:int, line:chararray);
B = FOREACH docs GENERATE id, line;
C = FOREACH B GENERATE id, TOKENIZE(line) as gu;

filtered = FOREACH C {
  uniq = DISTINCT gu;
  GENERATE id, uniq;
};

query = LOAD 'query' AS (line_query:chararray);
bag_query = FOREACH query GENERATE TOKENIZE(line_query) AS query;
-- sort the bag of tokens, since SetIntersect requires it
bag_query = FOREACH bag_query {
  query_sorted = ORDER query BY token;
  GENERATE query_sorted;
};

result = FOREACH filtered {
  -- sort the tokens, since SetIntersect requires it
  tokens_sorted = ORDER uniq BY token;
  GENERATE id, 
           SIZE(SetIntersect(tokens_sorted,bag_query.query_sorted)) as cnt;
};

DUMP result;

 */
@Multiline
private String setIntersectTestExample;

@Test
public void setIntersectTestExample() throws Exception
{
  PigTest test = createPigTestFromString(setIntersectTestExample);

  // Write the sample documents, one "id,words" line per document.
  writeLinesToFile("docs", 
                   "1,word1 word4 word2 word1",
                   "2,word2 word6 word1 word5 word3 word7",
                   "3,word1 word3 word4 word5");

  // Write the query words as a single line.
  writeLinesToFile("query", 
                   "word2 word7 word5");

  test.runScript();

  super.getLinesForAlias(test, "filtered");
  super.getLinesForAlias(test, "query");
  super.getLinesForAlias(test, "result");
}

If you have any other similar use cases, I'd love to hear about them :) We are always looking to contribute more useful UDFs to DataFu.
