Implementing a File Inverted Index with a Custom Combiner
Posted by guoziyi
Editor's note (compiled by cha138.com): this article introduces how to implement a file inverted index with a custom combiner in Hadoop MapReduce, and hopefully offers some reference value.
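To make the goal concrete, consider a hypothetical input directory with two small files (names and contents are made up for illustration):

    a.txt: hello world hello
    b.txt: hello hadoop

An inverted index maps each word to the files that contain it, together with a per-file count, so the job should produce output like the following (the order of postings within a line may vary):

    hadoop    b.txt:1
    hello     a.txt:2 b.txt:1
    world     a.txt:1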
package com.zuoyan.hadoop;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Builds a file inverted index with a custom combiner.
 *
 * @author root
 * 1. The input records carry no file name, so the mapper reads the source
 *    file name from its InputSplit and appends it to each word. Splitting
 *    each line on spaces, the mapper emits <"word fileName", "1">; during
 *    the shuffle these values are grouped, e.g. <"word fileName", (1, 1)>.
 * 2. The combiner, an important component that runs on the map side between
 *    mapper and reducer, sums the counts for each <word, file> pair and
 *    moves the file name out of the key into the value, emitting
 *    <"word", "fileName:1"> or <"word", "fileName:2">, and so on.
 * 3. The reducer performs the final step, concatenating all "fileName:count"
 *    entries for a word into its posting list.
 */
public class CombinerTest {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(CombinerTest.class);
        // 1: mapper
        job.setMapperClass(LastSearchMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // 2: combiner
        job.setCombinerClass(LastSearchComb.class);
        // 3: reducer
        job.setReducerClass(LastSearchReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        boolean success = job.waitForCompletion(true);
        System.out.println(success);
    }

    // Mapper: emits <"word fileName", "1"> for every word in the line.
    // Declared static so Hadoop can instantiate it by reflection; a
    // non-static inner class would fail at runtime.
    public static class LastSearchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            InputSplit input = context.getInputSplit();
            // Name of the file this split was read from
            String pathName = ((FileSplit) input).getPath().getName();
            for (String word : words) {
                context.write(new Text(word + " " + pathName), new Text("1"));
            }
        }
    }

    // Combiner: sums the counts for each <word, file> pair and moves the
    // file name from the key into the value, emitting <"word", "fileName:count">.
    public static class LastSearchComb extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (Text value : values) {
                sum += Integer.parseInt(value.toString());
            }
            String[] parts = key.toString().split(" ");
            context.write(new Text(parts[0]), new Text(parts[1] + ":" + sum));
        }
    }

    // Reducer: concatenates every "fileName:count" entry for a word into
    // one posting list.
    public static class LastSearchReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text value : values) {
                postings.append(value.toString()).append(" ");
            }
            context.write(key, new Text(postings.toString()));
        }
    }
}
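Assuming the job has been packaged as combiner-test.jar (a placeholder name) and the input files sit under /input on HDFS, a typical invocation would look like:

    hadoop jar combiner-test.jar com.zuoyan.hadoop.CombinerTest /input /output

One caveat worth noting: Hadoop treats the combiner as an optional optimization that may run zero, one, or several times, while this example both relies on it running exactly once and rewrites the key (from "word fileName" to "word") inside it. If the combiner is skipped, or the job runs with more than one reducer (map output was already partitioned on the original key), the result can be wrong, so the pattern is best read as a demonstration of where a combiner sits in the pipeline rather than as production code.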
For the build, it is enough to add the hadoop-client dependency to your pom, as sketched below.
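A minimal sketch of the dependency block (the version shown is an assumption; match it to your cluster):

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version> <!-- placeholder: use your cluster's version -->
    </dependency>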