Getting numbers in Hadoop word count example in Cloudera

Posted 2023-02-19

Asked: 2020-03-04 22:55:27

Question:

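We used the code below. The map class is WCMapper and the reduce class is WCReducer.

It is not clear why the output contains numbers as keys instead of words with their counts.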

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = key.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            value.set(tokenizer.nextToken());
            context.write(value, new IntWritable(1));
        }
    }
}

public class WCReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable x : values) {
            sum += x.get();
        }
        result.set(sum);
        System.out.println("Key: " + key + "Value: " + sum);
        context.write(key, result);
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "WordCount");

    job.setJarByClass(WorCount.class);
    job.setMapperClass(WCMapper.class);
    job.setReducerClass(WCReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    Path outputPath = new Path(args[1]);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Remove any previous output directory so the job does not fail on startup
    outputPath.getFileSystem(conf).delete(outputPath, true);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Input file: this is cloudera this is smart

Expected output: this 2 is 2 cloudera 1 smart 1

Obtained output: 0 1 17 1

Comments:

Maybe this question can help you to some extent: ***.com/questions/26208454/…

Answer 1:

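The problem is in your mapper:

String line = key.toString();

The key here is a LongWritable holding the byte offset of the line within the file, not the text of the line. Tokenizing it yields the offset itself as the only token, so the mapper emits one (offset, 1) pair per line. If you change that line to read from value, and then stop reusing value as the output key below, you will get the correct answer.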
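To see where keys like 0 and 17 can come from, here is a minimal plain-Java sketch that does not use Hadoop at all. It assumes the input file holds two lines, "this is cloudera" and "this is smart", which is a guess that happens to match the reported offsets. The class name OffsetKeyDemo and the helpers lineOffsets and tokens are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class OffsetKeyDemo {

    // Hypothetical helper: pairs each line with the byte offset at which it
    // starts, imitating the keys a line-oriented input format hands to map().
    public static List<Long> lineOffsets(String fileContent) {
        List<Long> offsets = new ArrayList<>();
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            offsets.add(offset);
            // +1 accounts for the newline that terminates the line
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return offsets;
    }

    // Splits on whitespace, the same way the mapper's StringTokenizer does.
    public static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: two newline-terminated lines
        String file = "this is cloudera\nthis is smart\n";
        String[] lines = file.split("\n");
        List<Long> offsets = lineOffsets(file);
        for (int i = 0; i < lines.length; i++) {
            // What the original mapper tokenized: the key (a byte offset)
            System.out.println("key tokens:   " + tokens(String.valueOf(offsets.get(i))));
            // What it should have tokenized: the value (the line text)
            System.out.println("value tokens: " + tokens(lines[i]));
        }
    }
}
```

Under that assumption the first line starts at byte 0 and the second at byte 17 (16 characters plus one newline), so tokenizing the key gives exactly the two "words" seen in the obtained output.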

New mapper:

public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    Text word = new Text();

    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, new IntWritable(1));
    }
}
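As a quick sanity check of the corrected logic, the map and reduce steps can be imitated locally in plain Java, without a cluster. This is only a sketch: the class LocalWordCount and its count method are hypothetical stand-ins rather than Hadoop APIs, and the two input lines are an assumption about the file's contents.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class LocalWordCount {

    // Plain-Java stand-in for both phases: for every token in each line's
    // text it records a count of 1, and merge() sums the ones per word.
    public static Map<String, Integer> count(String[] lines) {
        // TreeMap keeps the words in sorted order for stable printing
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // Tokenize the line text (the value), not the byte offset (the key)
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed input lines
        String[] lines = {"this is cloudera", "this is smart"};
        count(lines).forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}
```

With those two lines the sketch reports cloudera 1, is 2, smart 1 and this 2, which matches the counts the question expected.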

