MapReduce data deduplication problem

Posted 2023-03-24


I am running this from Eclipse in pseudo-distributed mode. The source code is as follows:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Dedup {

    // The mapper copies each input value (one line) to the output key and emits it directly.
    public static class Map extends Mapper<Object, Text, Text, Text> {

        private static Text line = new Text(); // the current line of data

        // Implements the map function.
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            line = value;
            context.write(line, new Text(""));
        }
    }

    // The reducer copies each input key to the output key and emits it directly.
    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        // Implements the reduce function: one output record per distinct key.
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(""));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // This line is critical.
        conf.set("mapred.job.tracker", "192.168.1.2:9001");

        String[] ioArgs = new String[] {"dedup_in", "dedup_out"};
        String[] otherArgs = new GenericOptionsParser(conf, ioArgs).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Data Deduplication <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "Data Deduplication");
        job.setJarByClass(Dedup.class);

        // Set the Map, Combine and Reduce classes.
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // Set the output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Set the input and output directories.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it produces the following problem:
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1303)
at dedup.main(dedup.java:112)

The run does not produce the dedup_out directory it should.
Could someone please explain?
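
For readers who want to check the intended result without a cluster, here is a minimal sketch of the same dedup idea in plain Java. It is not the Hadoop job from the question and uses no Hadoop classes: the class name DedupSketch and its methods are made up for illustration, and a TreeMap stands in for the shuffle step that groups equal keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the dedup job; no Hadoop classes are involved.
class DedupSketch {

    // "Map" step: each input line becomes a key with an empty value.
    // "Shuffle" step: the TreeMap groups equal keys and keeps them sorted.
    static Map<String, List<String>> mapAndShuffle(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        return grouped;
    }

    // "Reduce" step: emit each key once, however many values it collected.
    static List<String> reduce(Map<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    static List<String> dedup(List<String> lines) {
        return reduce(mapAndShuffle(lines));
    }

    public static void main(String[] args) {
        // Sample lines invented for this sketch.
        List<String> input = Arrays.asList("2012-3-1 a", "2012-3-2 b", "2012-3-1 a");
        for (String line : dedup(input)) {
            System.out.println(line);
        }
    }
}
```

Because every duplicate line maps to the same key, the reduce step sees it only once, which is the whole trick behind the Map and Reduce classes in the question.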

With the default TextInputFormat, the Mapper's input key is a LongWritable, not an Object.

Reference answer A: Please post the full error message. Your code runs without any problem on my machine.
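
As background to the point about TextInputFormat: it hands the mapper one record per line, with the line's byte offset in the file as the key and the line's text as the value. The sketch below imitates that in plain Java, assuming "\n" line endings and UTF-8 text; LineOffsetSketch and lineOffsets are hypothetical names and nothing here calls Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of (byte offset, line) records; not Hadoop code.
class LineOffsetSketch {

    // Returns each line keyed by the byte offset at which it starts.
    static Map<Long, String> lineOffsets(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus the one-byte newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Sample content invented for this sketch; "ab" appears twice.
        for (Map.Entry<Long, String> e : lineOffsets("ab\ncde\nab\n").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Two identical lines get different offset keys, so the input key cannot be used to detect duplicates. That is why the mapper in the question ignores its key and writes the line itself as the output key.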
