Computing Averages with MapReduce

Posted 2020-09-07 esc_ai


1. Problem Description

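Three files store the students' Chinese, math, and English scores respectively. The job outputs each student's average score.

The data format is as follows: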
Chinese.txt

张三    78
李四    89
王五    96
赵六    67

Math.txt

张三    88
李四    99
王五    66
赵六    77

English.txt

张三    80
李四    82
王五    84
赵六    86

2. The MapReduce Program

package com.javacore.hadoop;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.StringTokenizer;


/**
 * Created by bee on 3/29/17.
 */
public class StudentAvgDouble {

    // Mapper: turns each input line "name score" into a (name, score) pair.
    public static class MyMapper extends Mapper<Object, Text, Text, DoubleWritable> {

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String eachline = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(eachline, "\n");
            while (tokenizer.hasMoreTokens()) {
                // The default delimiters split the line on spaces and tabs.
                StringTokenizer tokenizerLine = new StringTokenizer(tokenizer.nextToken());
                if (tokenizerLine.countTokens() < 2) {
                    continue; // skip blank or malformed lines
                }
                String strName = tokenizerLine.nextToken();
                String strScore = tokenizerLine.nextToken();
                Text name = new Text(strName);
                // The value type must match the DoubleWritable declared on the Mapper.
                DoubleWritable score = new DoubleWritable(Double.parseDouble(strScore));
                context.write(name, score);
            }
        }
    }

    // Reducer: receives all scores of one student and writes their average.
    public static class MyReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        @Override
        public void reduce(Text key, Iterable<DoubleWritable> values, Context
                context) throws IOException, InterruptedException {
            double sum = 0.0;
            int count = 0;
            for (DoubleWritable val : values) {
                sum += val.get();
                count++;
            }
            DoubleWritable avgScore = new DoubleWritable(sum / count);
            context.write(key, avgScore);
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new String[]{"input/studentAvg", "output"};
        if (otherArgs.length != 2) {
            System.out.println("Invalid arguments");
            System.exit(2);
        }

        // Delete the output directory; the job fails if it already exists.
        Path outputPath = new Path(otherArgs[1]);
        FileSystem fs = outputPath.getFileSystem(conf);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }

        Job job = Job.getInstance(conf);
        job.setJarByClass(StudentAvgDouble.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, outputPath);
        System.exit(job.waitForCompletion(true) ? 0 : 1);

    }
}
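To make the data flow easier to follow, here is a minimal sketch that mimics the same map, group-by-key, and reduce steps in plain Java, without Hadoop. The class name LocalAverageSketch and its method names are made up for illustration, and the twelve input lines are simply the sample data from section 1 held in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone class; it does not use any Hadoop API.
class LocalAverageSketch {

    // "Map" and "shuffle": parse each line into (name, score) and group scores by name.
    static Map<String, List<Double>> mapAndGroup(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            if (tokens.countTokens() < 2) {
                continue; // skip blank or malformed lines
            }
            String name = tokens.nextToken();
            double score = Double.parseDouble(tokens.nextToken());
            grouped.computeIfAbsent(name, k -> new ArrayList<>()).add(score);
        }
        return grouped;
    }

    // "Reduce": average all scores collected for one name.
    static double average(List<Double> scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return sum / scores.size();
    }

    public static void main(String[] args) {
        // Contents of Chinese.txt, Math.txt and English.txt, concatenated.
        List<String> lines = Arrays.asList(
                "张三    78", "李四    89", "王五    96", "赵六    67",
                "张三    88", "李四    99", "王五    66", "赵六    77",
                "张三    80", "李四    82", "王五    84", "赵六    86");
        for (Map.Entry<String, List<Double>> e : mapAndGroup(lines).entrySet()) {
            System.out.println(e.getKey() + "\t" + average(e.getValue()));
        }
    }
}
```

In the real job, the grouping step is done by the framework between the map and reduce phases; the sketch only stands in for it with a TreeMap.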

3. StringTokenizer vs. split()

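The map function reads the input line by line and splits each line on whitespace. I previously used split() for this, with the following code.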

            String[] lines = value.toString().split("\n");
            for (String eachline : lines) {
                System.out.println("eachline:\t" + eachline);
                // Split the line into name and score on runs of whitespace.
                String[] words = eachline.trim().split("\\s+");
                Text name = new Text(words[0]);
                DoubleWritable score = new DoubleWritable(Double.parseDouble(words[1]));
                context.write(name, score);
            }

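This approach is simple and clear, but it has a weakness: splitting can fail when the separator is a whitespace character outside the ASCII set, such as a full-width or non-breaking space, which \s does not match by default.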
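The failure can be reproduced with a small sketch. It assumes that the problematic separator is an ideographic (full-width) space, U+3000, which is one plausible case and not necessarily the one the original data contained. The class and method names are hypothetical.

```java
// Hypothetical demo class showing how split("\\s+") treats a full-width space.
class SplitWhitespaceSketch {

    // Sample line whose separator is an ideographic space (U+3000), not an ASCII space.
    static final String LINE = "张三\u300078";

    // Default \s only matches ASCII whitespace: space, \t, \n, \u000B, \f, \r.
    static String[] splitAscii(String s) {
        return s.split("\\s+");
    }

    // Explicitly add the non-breaking space and the ideographic space to the class.
    static String[] splitWide(String s) {
        return s.split("[\\s\\u00A0\\u3000]+");
    }

    public static void main(String[] args) {
        System.out.println(splitAscii(LINE).length); // 1: the line was not split
        System.out.println(splitWide(LINE).length);  // 2: name and score
    }
}
```

With the first pattern, words[1] in the snippet above would throw ArrayIndexOutOfBoundsException. Note that the default delimiter set of StringTokenizer does not contain these characters either, so they would have to be passed explicitly as the delim argument.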
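StringTokenizer is a string-splitting class in the java.util package. It has three constructors:

StringTokenizer(String str): uses Java's default delimiters, which are the space, tab ('\t'), newline ('\n'), carriage return ('\r'), and form feed ('\f') characters.
StringTokenizer(String str, String delim): constructs a StringTokenizer that parses str using the given delimiter characters.
StringTokenizer(String str, String delim, boolean returnDelims): constructs a StringTokenizer that parses str using the given delimiter characters and also specifies whether the delimiters themselves are returned as tokens.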
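The three constructors can be compared side by side with a short sketch. The class name and the drain helper are made up for this example, and the "|" delimiter is just an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class for the three StringTokenizer constructors.
class TokenizerConstructorsSketch {

    // Collect all remaining tokens of a tokenizer into a list.
    static List<String> drain(StringTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Default delimiters: the tab separates name and score.
        System.out.println(drain(new StringTokenizer("张三\t78")));       // [张三, 78]
        // Custom delimiter: each character of delim is a separator.
        System.out.println(drain(new StringTokenizer("张三|78", "|")));   // [张三, 78]
        // returnDelims = true: the delimiter itself comes back as a token.
        System.out.println(drain(new StringTokenizer("张三|78", "|", true))); // [张三, |, 78]
    }
}
```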

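Both StringTokenizer and split() can split strings. StringTokenizer is usually somewhat faster because it does not use regular expressions, and it is easier to get right when the delimiter is a character with a special meaning in a regex, since StringTokenizer treats every delimiter character literally.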
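As a rough illustration of the special-character point, the sketch below splits on "|", which is the alternation operator in a regex. It also shows one behavioral difference worth knowing about: split() keeps empty fields between delimiters, while StringTokenizer skips them. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class comparing split() and StringTokenizer on the "|" delimiter.
class SplitVsTokenizerSketch {

    // split() takes a regex, so "|" must be escaped to be matched literally.
    static List<String> withSplit(String s) {
        return Arrays.asList(s.split("\\|"));
    }

    // StringTokenizer takes plain delimiter characters, so no escaping is needed.
    static List<String> withTokenizer(String s) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(s, "|");
        while (tokenizer.hasMoreTokens()) {
            tokens.add(tokenizer.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(withSplit("张三|78"));      // [张三, 78]
        System.out.println(withTokenizer("张三|78"));  // [张三, 78]
        // Two adjacent delimiters: split() yields an empty field, the tokenizer does not.
        System.out.println(withSplit("张三||78").size());     // 3
        System.out.println(withTokenizer("张三||78").size()); // 2
    }
}
```

Which behavior is preferable depends on the data: if an empty column carries meaning, split() preserves its position, whereas StringTokenizer silently shifts the following fields.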

4. Results

张三  82.0
李四  90.0
王五  82.0
赵六  76.66666666666667
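The fractional result in the last row comes from accumulating the sum as a double, as the reducer does. The following sketch, with a hypothetical class name, contrasts that with integer division on the same three scores (67, 77, 86).

```java
// Hypothetical demo class: double division versus integer division of the same sum.
class AverageDivisionSketch {

    // Same arithmetic as the reducer: a double sum divided by an int count.
    static double average(int... scores) {
        double sum = 0.0;
        for (int s : scores) {
            sum += s;
        }
        return sum / scores.length;
    }

    public static void main(String[] args) {
        System.out.println(average(67, 77, 86)); // 76.66666666666667
        System.out.println((67 + 77 + 86) / 3);  // 76, the fraction is truncated
    }
}
```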
