Hadoop (Part 2: MapReduce)

Posted 2022-07-03 leccoo


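Introduction

MapReduce splits the data up, processes the pieces separately, and then merges the results.

Map handles the "split": it breaks a complex task into a number of simple tasks that are processed in parallel. This only works when the small tasks can be computed in parallel with almost no dependencies between them.
Reduce handles the "merge": it aggregates the results of the map phase globally.
MapReduce jobs run on a YARN cluster.

MapReduce defines two abstract programming interfaces, Map and Reduce, which the user implements.

The data that MapReduce processes always takes the form of key-value pairs.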

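Writing the code

Developing a MapReduce job takes eight steps: 2 in the Map phase, 4 in the Shuffle phase, and 2 in the Reduce phase.

Map phase (2 steps)

1. Set the InputFormat class, which splits the input into key-value pairs (K1 and V1) and feeds them to step 2.
2. Write the custom Map logic, which converts the pairs from step 1 into new key-value pairs (K2 and V2) and emits them.

Shuffle phase (4 steps)

3. Partition the emitted key-value pairs.
4. Sort the data within each partition by key.
5. (Optional) Combine the data locally to reduce the amount copied over the network.
6. Group the data, collecting the values that share a key into one collection.

Reduce phase (2 steps)

7. Sort and merge the results of the multiple Map tasks, then write the Reduce function with your own logic to process the incoming key-value pairs and emit new ones (K3 and V3).
8. Set the OutputFormat, which processes and saves the key-value pairs emitted by Reduce.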
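
The flow of keys and values through the three phases can be pictured with a small in-memory sketch. It is plain Java and does not use the Hadoop API; the class and method names are hypothetical, and it assumes comma-separated words as input. Partitioning and the combiner are left out, so it only shows map (K1/V1 to K2/V2), sort and group, and reduce (K2/V2 to K3/V3).

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the map -> shuffle -> reduce flow; no Hadoop classes involved.
class MapReduceFlowSketch {

    // Map step: one input line (V1) becomes a list of (word, 1) pairs (K2, V2).
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split(",")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return out;
    }

    // Shuffle step: sort by key and collect the values of equal keys into one list.
    static TreeMap<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        TreeMap<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the grouped values of one key to get V3.
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs all three steps over the given lines and returns word -> count, sorted by word.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : shuffle(mapped).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hdfs=1, hello=2, hive=1, world=1}
        System.out.println(run(Arrays.asList("hello,world,hadoop", "hdfs,hive,hello")));
    }
}
```

In a real job the framework performs the shuffle step across machines; only the map and reduce functions are user code.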

Common Maven dependencies

<packaging>jar</packaging>
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-core</artifactId>
        <version>2.7.5</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>RELEASE</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
                <!-- <verbose>true</verbose> -->
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <minimizeJar>true</minimizeJar>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Getting started: word count

Code structure

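import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/*
  The four type parameters:
    KEYIN:    type of K1
    VALUEIN:  type of V1
    KEYOUT:   type of K2
    VALUEOUT: type of V2
*/
public class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // Reused across calls so that no new objects are allocated per record
    private final Text text = new Text();
    private final LongWritable one = new LongWritable(1);

    /*
      map converts K1 and V1 into K2 and V2.
      Parameters:
        key     : K1, the byte offset of the line (almost always LongWritable by default)
        value   : V1, the text of one line
        context : the context object

      Example:
        K1   V1
        0    hello,world,hadoop
        19   hdfs,hive,hello
        ---------------------------
        K2       V2
        hello    1
        world    1
        hadoop   1
        hdfs     1
        hive     1
        hello    1
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1: split the line of text
        String[] split = value.toString().split(",");
        // 2: iterate over the words and build K2 and V2
        for (String word : split) {
            // 3: write K2 and V2 to the context
            text.set(word);
            context.write(text, one);
        }
    }
}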

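import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/*
  The four type parameters:
    KEYIN:    type of K2
    VALUEIN:  type of V2
    KEYOUT:   type of K3
    VALUEOUT: type of V3
*/
public class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    /*
      reduce converts the grouped K2 and V2 into K3 and V3 and writes them to the context.
      Parameters:
        key     : the grouped K2
        values  : the collection of V2 values for that key
        context : the context object
      ----------------------
      Example:
        grouped K2   V2
        hello        <1,1,1>
        world        <1,1>
        hadoop       <1>
      ------------------------
        K3       V3
        hello    3
        world    2
        hadoop   1
     */
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        long count = 0;
        // 1: iterate over the collection and add up the numbers to get V3
        for (LongWritable value : values) {
            count += value.get();
        }
        // 2: write K3 and V3 to the context
        context.write(key, new LongWritable(count));
    }
}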

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
    // Configures and runs one job
    @Override
    public int run(String[] args) throws Exception {
        // 1: create the job object
        Job job = Job.getInstance(super.getConf(), "wordcount");
        // Add this if the job fails when run from a packaged jar
        job.setJarByClass(JobMain.class);

        // 2: configure the job (the eight steps)
        // Step 1: input format and input path
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/wordcount"));
        // Local run: TextInputFormat.addInputPath(job, new Path("file:///D:\\mapreduce\\input"));

        // Step 2: mapper class and the types of K2 and V2
        job.setMapperClass(WordCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Steps 3 to 6 use the defaults

        // Step 7: reducer class and the types of K3 and V3
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Step 8: output format and output path
        job.setOutputFormatClass(TextOutputFormat.class);
        Path path = new Path("hdfs://node01:8020/wordcount_out");
        TextOutputFormat.setOutputPath(job, path);
        // Local run: TextOutputFormat.setOutputPath(job, new Path("file:///D:\\mapreduce\\output"));

        // Delete the output directory if it already exists; the job refuses to overwrite it
        FileSystem fileSystem = FileSystem.get(new URI("hdfs://node01:8020"), new Configuration());
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }

        // Wait for the job to finish
        boolean succeeded = job.waitForCompletion(true);
        return succeeded ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Launch the job
        int exitCode = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(exitCode);
    }
}

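Shuffle phase

Partitioning

The purpose of partitioning is to route different kinds of records to different reduce tasks according to our needs, so that each kind ends up in its own output file.

Implementation

Code structure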

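import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, NullWritable> {
    /*
      1: define the partitioning rule
      2: return the matching partition number
     */
    @Override
    public int getPartition(Text text, NullWritable nullWritable, int numPartitions) {
        // 1: split the line of text (K2) and read the winning-number field
        String[] split = text.toString().split("\t");
        String numStr = split[5];
        // 2: compare the field with 15 and return the matching partition number
        if (Integer.parseInt(numStr) > 15) {
            return 1;
        } else {
            return 0;
        }
    }
}

In the job setup:

// Step 3: set the partitioner class
job.setPartitionerClass(MyPartitioner.class);
// Steps 4, 5 and 6 use the defaults
// Set the number of reduce tasks to match the number of partitions
job.setNumReduceTasks(2);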
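
For comparison, when no partitioner is set Hadoop falls back to HashPartitioner, which masks the sign bit of the key's hash code and takes it modulo the number of reduce tasks. The sketch below is plain Java with hypothetical names; it applies that formula to String.hashCode (Hadoop would use the hash of the Writable key, so the actual partition numbers differ) next to a threshold rule like the one above.

```java
// Plain-Java stand-in for partition selection; no Hadoop classes involved.
class PartitionSketch {

    // Same formula as the default HashPartitioner, applied here to a String hash.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Threshold rule: lines whose sixth tab-separated field is above 15 go to partition 1.
    static int thresholdPartition(String line) {
        String[] fields = line.split("\t");
        return Integer.parseInt(fields[5]) > 15 ? 1 : 0;
    }

    public static void main(String[] args) {
        // The sample lines are made up for illustration.
        System.out.println(thresholdPartition("a\tb\tc\td\te\t20"));
        System.out.println(thresholdPartition("a\tb\tc\td\te\t15"));
        System.out.println(hashPartition("hello", 2));
    }
}
```

The mask keeps the result non-negative even when the hash code is negative, so the partition number always falls between 0 and numReduceTasks - 1.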

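Counters in MapReduce

Counters are an effective way to collect statistics about a job, whether for quality control or for application-level statistics.

They can also help diagnose system failures.

If the occurrence of a particular event can be recorded in a counter, reading that counter is much easier than analyzing a pile of log files.

The following code defines counters with an enum and counts how many keys the reduce side receives.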


import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionerReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
    public enum Counter { MY_INPUT_RECORDS, MY_INPUT_BYTES }

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        // Approach 2: define the counter with an enum
        context.getCounter(Counter.MY_INPUT_RECORDS).increment(1L);
        context.write(key, NullWritable.get());
    }
}
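
The semantics of an enum-keyed counter can be tried out without a cluster. The following is a plain-Java stand-in, not the Hadoop counter API: the class CounterSketch and its methods are hypothetical, and the byte counter simply adds up the UTF-8 length of each key as one possible use of the second enum constant.

```java
import java.nio.charset.StandardCharsets;
import java.util.EnumMap;
import java.util.Map;

// Plain-Java stand-in for enum-keyed counters; no Hadoop classes involved.
class CounterSketch {
    enum Counter { MY_INPUT_RECORDS, MY_INPUT_BYTES }

    private final Map<Counter, Long> counters = new EnumMap<>(Counter.class);

    // Adds the amount to the named counter, starting from zero on first use.
    void increment(Counter counter, long amount) {
        counters.merge(counter, amount, Long::sum);
    }

    // Reads the current value; counters that were never incremented read as zero.
    long get(Counter counter) {
        return counters.getOrDefault(counter, 0L);
    }

    // Simulates the reduce side: one record per key, plus the UTF-8 size of each key.
    static CounterSketch countKeys(String... keys) {
        CounterSketch sketch = new CounterSketch();
        for (String key : keys) {
            sketch.increment(Counter.MY_INPUT_RECORDS, 1L);
            sketch.increment(Counter.MY_INPUT_BYTES, key.getBytes(StandardCharsets.UTF_8).length);
        }
        return sketch;
    }

    public static void main(String[] args) {
        CounterSketch sketch = countKeys("hello", "world", "hdfs");
        System.out.println(sketch.get(Counter.MY_INPUT_RECORDS)); // prints 3
        System.out.println(sketch.get(Counter.MY_INPUT_BYTES));   // prints 14
    }
}
```

In a real job the framework aggregates the per-task counter values and reports the totals when the job finishes.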

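Sorting (including serialization)

Serialization converts a structured object into a byte stream.
Deserialization is the reverse: it converts a byte stream back into a structured object. Objects have to be serialized into byte streams when they are passed between processes or persisted, and byte streams that are received or read from disk have to be deserialized back into objects.
Java serialization (Serializable) is a heavyweight framework. A serialized object carries a lot of extra information (checksums, headers, the inheritance hierarchy, and so on), which makes it inefficient to send over the network. Hadoop therefore developed its own compact and efficient serialization mechanism, Writable. It does not transmit the multi-level parent-child relationships of a Java object; only the attribute values that are needed are written, which greatly reduces network overhead.
Writable is Hadoop's serialization format, defined as the Writable interface. A class becomes serializable simply by implementing this interface.
Writable also has a sub-interface, WritableComparable, which supports both serialization and comparison of keys. Here we implement WritableComparable on a custom key to get the sort order we want.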
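
The write/readFields contract is built on java.io.DataOutput and java.io.DataInput, so it can be exercised without Hadoop. The sketch below is a plain-Java stand-in with hypothetical names (WordNum, roundTrip); it mirrors the two methods of Writable plus compareTo, but does not implement the Hadoop interfaces.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java stand-in mirroring the write/readFields contract; no Hadoop classes involved.
class WordNum implements Comparable<WordNum> {
    String word;
    int num;

    WordNum() { }

    WordNum(String word, int num) {
        this.word = word;
        this.num = num;
    }

    // Serialize only the fields that are needed, in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(num);
    }

    // Deserialize in exactly the same order as write().
    void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        num = in.readInt();
    }

    // Order by word first, then by num ascending.
    @Override
    public int compareTo(WordNum other) {
        int byWord = word.compareTo(other.word);
        return byWord != 0 ? byWord : Integer.compare(num, other.num);
    }

    // Writes the value to a byte array and reads it back into a fresh object.
    static WordNum roundTrip(WordNum original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            WordNum copy = new WordNum();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        WordNum copy = roundTrip(new WordNum("a", 3));
        System.out.println(copy.word + "\t" + copy.num);
    }
}
```

The no-argument constructor matters: the reading side creates an empty object first and then fills it through readFields, so the field order in the two methods must match.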

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class SortBean implements WritableComparable<SortBean> {
    private String word;
    private int num;

    public String getWord() {
        return word;
    }

    public void setWord(String word) {
        this.word = word;
    }

    public int getNum() {
        return num;
    }

    public void setNum(int num) {
        this.num = num;
    }

    @Override
    public String toString() {
        return word + "\t" + num;
    }

    /*
      Comparison rule:
        the first column (word) is sorted in dictionary order, e.g. aac before aad;
        when the first column is equal, the second column (num) is sorted ascending
     */
    @Override
    public int compareTo(SortBean sortBean) {
        // Compare the first column: word
        int result = this.word.compareTo(sortBean.word);
        // If the first column is equal, compare the second column
        if (result == 0) {
            return Integer.compare(this.num, sortBean.num);
        }
        return result;
    }

    // Serialization
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(num);
    }

    // Deserialization, reading the fields in the same order as write
    @Override
    public void readFields(DataInput in) throws IOException {
        this.word = in.readUTF();
        this.num = in.readInt();
    }
}

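import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, SortBean, NullWritable> {
    /*
      map converts K1 and V1 into K2 and V2:
      K1            V1
      0            a  3
      5            b  7
      ----------------------
      K2                     V2
      SortBean(a  3)         NullWritable
      SortBean(b  7)         NullWritable
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1: split the line (V1) and wrap the fields in a SortBean, which becomes K2
        String[] split = value.toString().split("\t");
        SortBean sortBean = new SortBean();
        sortBean.setWord(split[0]);
        sortBean.setNum(Integer.parseInt(split[1]));
        // 2: write K2 and V2 to the context
        context.write(sortBean, NullWritable.get());
    }
}

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<SortBean, NullWritable, SortBean, NullWritable> {
    // reduce converts the grouped K2 and V2 into K3 and V3.
    // Keys arrive already sorted; keys that compare as equal are grouped and written once.
    @Override
    protected void reduce(SortBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}

The job setup is omitted.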

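Combiner

The map phase, the first of the three phases, may produce a lot of output. A combiner performs a local merge on each map node before the data reaches reduce, and reduce then performs the final merge. The goal is to cut the network I/O between map and reduce.

Implementation steps

// Steps 3 (partition) and 4 (sort) use the defaults
// Step 5: combiner
job.setCombinerClass(MyCombiner.class);
// Step 6: grouping uses the default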
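
The effect of the combiner can be seen by counting records before and after the local merge. The sketch below is plain Java with hypothetical names and does not use the Hadoop API; it assumes a word-count style job, where summing is safe to apply in two stages because addition is associative and commutative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in showing how a local merge shrinks the map output; no Hadoop classes involved.
class CombinerSketch {

    // Map output of one task without a combiner: one record per word occurrence.
    static List<String> mapWords(String line) {
        return new ArrayList<>(Arrays.asList(line.split(",")));
    }

    // Combiner: sums the counts per word locally, so each distinct word leaves the node once.
    static Map<String, Long> combine(List<String> words) {
        Map<String, Long> local = new TreeMap<>();
        for (String word : words) {
            local.merge(word, 1L, Long::sum);
        }
        return local;
    }

    // Reducer: merges the partial sums from all map tasks into the final counts.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> words = mapWords("hello,hello,hello,world");
        Map<String, Long> local = combine(words);
        // 4 records before combining, 2 after
        System.out.println(words.size() + " -> " + local.size());
        System.out.println(reduce(Arrays.asList(local, combine(mapWords("world,hdfs")))));
    }
}
```

A combiner is only safe when applying it zero, one, or several times gives the same final result; a plain average, for example, does not have that property.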

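Example: traffic statistics (values with the same key are summed)

public class FlowBean implements Writable {
    private Integer upFlow;        // upstream packet count
    private Integer downFlow;      // downstream packet count
    private Integer upCountFlow;   // total upstream traffic
    private Integer downCountFlow; // total downstream traffic
    // Getters, setters, serialization and deserialization omitted
}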

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    /*
      Converts K1 and V1 into K2 and V2:
      K1              V1
      0               1363157985059     13600217502    00-1F-64-E2-E8-B1:CMCC    120.196.100.55    www.baidu.com    综合门户    19    128    1177    16852    200
     ------------------------------
      K2              V2
      13600217502     FlowBean(19    128    1177    16852)
     */
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1: split the line and take the phone number as K2
        String[] split = value.toString().split("\t");
        String phoneNum = split[1];
        // 2: create a FlowBean and fill it with the four traffic fields from the line
        FlowBean flowBean = new FlowBean();
        flowBean.setUpFlow(Integer.parseInt(split[6]));
        flowBean.setDownFlow(Integer.parseInt(split[7]));
        flowBean.setUpCountFlow(Integer.parseInt(split[8]));
        flowBean.setDownCountFlow(Integer.parseInt(split[9]));
        // 3: write K2 and V2 to the context
        context.write(new Text(phoneNum), flowBean);
    }
}

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        // 1: iterate over the collection and accumulate the four fields
        int upFlow = 0;        // upstream packet count
        int downFlow = 0;      // downstream packet count
        int upCountFlow = 0;   // total upstream traffic
        int downCountFlow = 0; // total downstream traffic
        for (FlowBean value : values) {
            upFlow += value.getUpFlow();
            downFlow += value.getDownFlow();
            upCountFlow += value.getUpCountFlow();
            downCountFlow += value.getDownCountFlow();
        }
        // 2: create a FlowBean and fill it with the totals; this is V3
        FlowBean flowBean = new FlowBean();
        flowBean.setUpFlow(upFlow);
        flowBean.setDownFlow(downFlow);
        flowBean.setUpCountFlow(upCountFlow);
        flowBean.setDownCountFlow(downCountFlow);
        // 3: write K3 and V3 to the context
        context.write(key, flowBean);
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
    // Configures and runs one job
    @Override
    public int run(String[] args) throws Exception {
        // 1: create the job object
        Job job = Job.getInstance(super.getConf(), "mapreduce_flowcount");
        // Add this if the job fails when run from a packaged jar
        job.setJarByClass(JobMain.class);

        // 2: configure the job (the eight steps)
        // Step 1: input format and input path
        job.setInputFormatClass(TextInputFormat.class);
        // Cluster run: TextInputFormat.addInputPath(job, new Path("hdfs://node01:8020/wordcount"));
        TextInputFormat.addInputPath(job, new Path("file:///D:\\input\\flowcount_input"));

        // Step 2: mapper class and the types of K2 and V2
        job.setMapperClass(FlowCountMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        // Steps 3 (partition), 4 (sort), 5 (combiner) and 6 (group) use the defaults

        // Step 7: reducer class and the types of K3 and V3
        job.setReducerClass(FlowCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Step 8: output format and output path
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("file:///D:\\out\\flowcount_out"));

        // Wait for the job to finish
        boolean succeeded = job.waitForCompletion(true);
        return succeeded ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Launch the job
        int exitCode = ToolRunner.run(configuration, new JobMain(), args);
        System.exit(exitCode);
    }
}

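Additional requirement:

Sort by upstream flow in descending order. MapReduce sorts on keys, so FlowBean has to serve as K2 in the sorting job and must implement WritableComparable.

public class FlowBean implements WritableComparable<FlowBean> {
    // Fields, accessors, serialization and deserialization as before

    // Sort rule: upFlow, descending (the arguments are swapped to reverse the order)
    @Override
    public int compareTo(FlowBean flowBean) {
        return Integer.compare(flowBean.upFlow, this.upFlow);
    }
}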
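
A descending comparison can be written by subtraction or with Integer.compare and swapped arguments. The plain-Java sketch below, with hypothetical names, puts the two side by side. For non-negative values such as packet counts both agree; the subtraction form only goes wrong when the difference overflows, which requires values of opposite sign and large magnitude.

```java
// Plain-Java comparison of two ways to express descending order; no Hadoop classes involved.
class DescendingCompareSketch {

    // Subtraction form: negative when thisValue is larger, so larger values sort first.
    // The result wraps around if the true difference does not fit in an int.
    static int bySubtraction(int thisValue, int otherValue) {
        return otherValue - thisValue;
    }

    // Overflow-safe form: Integer.compare with the arguments swapped.
    static int bySafeCompare(int thisValue, int otherValue) {
        return Integer.compare(otherValue, thisValue);
    }

    public static void main(String[] args) {
        // Both negative: 5 sorts before 3 in descending order
        System.out.println(bySubtraction(5, 3) + " " + bySafeCompare(5, 3));
        // Extreme values: the subtraction wraps to a negative number, the safe form stays positive
        System.out.println(bySubtraction(Integer.MIN_VALUE, Integer.MAX_VALUE)
                + " " + bySafeCompare(Integer.MIN_VALUE, Integer.MAX_VALUE));
    }
}
```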

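Requirement: partition by phone number

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FlowCountPartition extends Partitioner<Text, FlowBean> {
    /*
      Partitioning rule:
        numbers starting with 135 go to one partition file
        numbers starting with 136 go to one partition file
        numbers starting with 137 go to one partition file
        everything else goes to a fourth partition
      Parameters:
        text          : K2, the phone number
        flowBean      : V2
        numPartitions : the number of reduce tasks
     */
    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        // 1: get the phone number
        String phoneNum = text.toString();
        // 2: return the partition number (0 to 3) that matches the prefix
        if (phoneNum.startsWith("135")) {
            return 0;
        } else if (phoneNum.startsWith("136")) {
            return 1;
        } else if (phoneNum.startsWith("137")) {
            return 2;
        } else {
            return 3;
        }
    }
}

In the job setup:

// Step 3: set the partitioner class; step 4 (sort) uses the default
job.setPartitionerClass(FlowCountPartition.class);
// Steps 5 (combiner) and 6 (group) use the defaults
// Set the number of reduce tasks to match the four partitions
job.setNumReduceTasks(4);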
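
The prefix rule can be checked locally before running the job. This is a plain-Java sketch with hypothetical names, not the Hadoop Partitioner API; apart from 13600217502, which appears in the sample record earlier, the phone numbers are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for the prefix rule; no Hadoop classes involved.
class PrefixPartitionSketch {
    static final int NUM_PARTITIONS = 4;

    // Returns 0, 1 or 2 for the prefixes 135, 136 and 137, and 3 for everything else.
    static int partitionFor(String phoneNum) {
        if (phoneNum.startsWith("135")) return 0;
        if (phoneNum.startsWith("136")) return 1;
        if (phoneNum.startsWith("137")) return 2;
        return 3;
    }

    // Distributes the numbers into one bucket per partition, like one output file per reduce task.
    static List<List<String>> distribute(List<String> phoneNums) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < NUM_PARTITIONS; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String phoneNum : phoneNums) {
            buckets.get(partitionFor(phoneNum)).add(phoneNum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> nums = Arrays.asList("13500000001", "13600217502", "13700000003", "13900000004");
        System.out.println(distribute(nums));
    }
}
```

The bucket count mirrors the reduce task count: every partition number the rule can return needs a reduce task to receive it, which is why the job sets four reduce tasks for partitions 0 to 3.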
