bigdata hadoop java code for wordcount modified


【Title】: bigdata hadoop java code for wordcount modified 【Posted】: 2014-10-02 22:46:44 【Problem Description】:

I have to modify the Hadoop wordcount example so that it counts the number of words starting with the prefix "cons", and then the results need to be sorted in descending order of frequency. Can anyone tell me how to write the mapper and reducer code for this?

Code:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Replace all digits and punctuation with an empty string
        String line = value.toString().replaceAll("\\p{Punct}|\\d", "").toLowerCase();
        // Extract the words
        StringTokenizer record = new StringTokenizer(line);
        // Emit each word as a key and one as its value
        while (record.hasMoreTokens()) {
            context.write(new Text(record.nextToken()), new IntWritable(1));
        }
    }
}

【Comments】:

In this code, the part that counts the number of words starting with "cons" needs to be changed. Below is the link I used for the hadoop wordcount code: wiki.apache.org/hadoop/WordCount. I think the mapper code will be the same as the code in that link, and only the reducer will change. Can anyone tell me how to write the reducer code? The reducer code needs some modification.

【Solution 1】:

To count the number of words that start with "cons", you can discard all the other words when emitting from the mapper.

public void map(Object key, Text value, Context context) throws IOException,
        InterruptedException {
    IntWritable one = new IntWritable(1);
    String[] words = value.toString().split(" ");
    for (String word : words) {
        if (word.startsWith("cons")) {
            context.write(new Text("cons_count"), one);
        }
    }
}

The reducer will now receive only one key, cons_count, and you can just add up the values to get the count.
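As a minimal sketch (my own, not part of the original answer), the matching reducer for this counting variant could simply sum the ones emitted by the mapper; the class name ConsCountReducer is made up for illustration, and it reuses the same Hadoop imports as the classes above:

public static class ConsCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get(); // add up the 1s emitted for "cons_count"
        }
        // single output record: cons_count <total number of words starting with "cons">
        context.write(key, new IntWritable(sum));
    }
}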

To sort the words starting with "cons" by frequency, all the words starting with cons should go to the same reducer, and that reducer should sum them up and sort them. For that:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Object, Text, Text, Text> {

    @Override
    public void map(Object key, Text value, Context context) throws IOException,
            InterruptedException {
        String[] words = value.toString().split(" ");
        for (String word : words) {
            if (word.startsWith("cons")) {
                context.write(new Text("cons"), new Text(word));
            }
        }
    }
}

Reducer:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, Text, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Map<String, Integer> wordCountMap = new HashMap<String, Integer>();
        for (Text value : values) {
            String word = value.toString();
            if (wordCountMap.containsKey(word)) {
                wordCountMap.put(word, wordCountMap.get(word) + 1);
            } else {
                wordCountMap.put(word, 1);
            }
        }

        // use some sorting mechanism to sort the map based on values.
        // ...

        for (Map.Entry<String, Integer> entry : wordCountMap.entrySet()) {
            context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
        }
    }
}

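The "sorting mechanism" placeholder is left open in the answer. One possible way to fill it in (my own sketch, not the answer author's code) is to copy the map entries into a list, sort that list by count in descending order, and write it out; it would replace the final loop inside reduce() and additionally needs java.util.ArrayList, Collections, Comparator and List:

List<Map.Entry<String, Integer>> sorted =
        new ArrayList<Map.Entry<String, Integer>>(wordCountMap.entrySet());
Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
    public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
        return b.getValue().compareTo(a.getValue()); // larger counts come first
    }
});
for (Map.Entry<String, Integer> entry : sorted) {
    context.write(new Text(entry.getKey()), new IntWritable(entry.getValue()));
}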
【Discussion】:

The second mapper code is exactly what we need; it drops every word except the ones starting with "cons". Hadoop sorts the intermediate key-value pairs by their keys, and that output comes out in ascending order. Here we have to write a custom sort comparator to get descending order for the words starting with cons. @blackbookstar Does the whole code refer to the sorting? Check this link for how to do it: ***.com/questions/109383/…
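If that descending sort is done in a second MapReduce job whose map output key is the count, the custom comparator mentioned above could look roughly like this (a sketch under that assumption; DescendingIntComparator is a name I made up, not code from this thread):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DescendingIntComparator extends WritableComparator {

    public DescendingIntComparator() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // invert the natural ascending order so larger counts come out first
        return -1 * a.compareTo(b);
    }
}

// plugged into the sort phase of that second job:
// job.setSortComparatorClass(DescendingIntComparator.class);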
