Hadoop - what should be the key and value
Posted: 2012-09-11 14:57:14
I am new to Hadoop.
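My objective is to upload a large number of files with different extensions onto the Hadoop cluster and get output like this:
Extension    Number of files
.jpeg        1000
.java        600
.txt         3000
and so on.
I am assuming that the file name must be the key to the mapper method so that I can read the extension (and do other file operations in the future).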
public void map(Text fileName,
        null /* will this do? the value isn't required in this case */,
        OutputCollector<Text, IntWritable> output,
        Reporter reporter)
        throws IOException {
    // Emit (extension, 1) for every file name received as the key.
    Text extension = new Text(FilenameUtils.getExtension(fileName.toString()));
    output.collect(extension, new IntWritable(1));
}
public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
    int sum = 0;
    while (values.hasNext()) {
        sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
}
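Queries:
- How do I send the file name as the key to the Mapper? I am thinking of implementing the RecordReader interface, but I am not sure whether that is required, and I also cannot decide which implementation class to use.
- As per the API and my understanding, the InputFormat implementation is responsible for providing the splits for processing. Do I have to do something here to get my work done?
Please guide me in case I have made any fundamentally incorrect assumptions about the concepts of Hadoop MapReduce.
-------First edit-------
Attaching the code, the output and the queries: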
package com.hadoop.mapred.scratchpad;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Main {

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        // TODO Auto-generated method stub
        Main main = new Main();
        if (args == null || args.length == 0) {
            throw new RuntimeException("Enter path to read files");
        }
        main.groupFilesByExtn(args);
    }

    private void groupFilesByExtn(String[] args) throws IOException {
        // TODO Auto-generated method stub
        JobConf conf = new JobConf(Main.class);
        conf.setJobName("Grp_Files_By_Extn");

        /* InputFormat and OutputFormat from 'mapred' package ! */
        conf.setInputFormat(CustomFileInputFormat.class);
        conf.setOutputFormat(org.apache.hadoop.mapred.TextOutputFormat.class);

        /* No restrictions here ! */
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        /* Mapper and Reducer classes from 'mapred' package ! */
        conf.setMapperClass(CustomMapperClass.class);
        conf.setReducerClass(CustomReducer.class);
        conf.setCombinerClass(CustomReducer.class);

        CustomFileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
Custom FileInputFormat
package com.hadoop.mapred.scratchpad;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CustomFileInputFormat extends
        FileInputFormat<String, NullWritable> {

    @Override
    public RecordReader<String, NullWritable> getRecordReader(InputSplit aFile,
            JobConf arg1, Reporter arg2) throws IOException {
        // TODO Auto-generated method stub
        System.out.println("In CustomFileInputFormat.getRecordReader(...)");
        /* the cast - ouch ! */
        CustomRecordReader custRecRdr = new CustomRecordReader(
                (FileSplit) aFile);
        return custRecRdr;
    }
}
Custom RecordReader
package com.hadoop.mapred.scratchpad;

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.RecordReader;

public class CustomRecordReader implements RecordReader<String, NullWritable> {

    private FileSplit aFile;
    private String fileName;

    public CustomRecordReader(FileSplit aFile) {
        this.aFile = aFile;
        System.out.println("In CustomRecordReader constructor aFile is "
                + aFile.getClass().getName());
    }

    @Override
    public void close() throws IOException {
        // TODO Auto-generated method stub
    }

    @Override
    public String createKey() {
        // TODO Auto-generated method stub
        fileName = aFile.getPath().getName();
        System.out.println("In CustomRecordReader.createKey() " + fileName);
        return fileName;
    }

    @Override
    public NullWritable createValue() {
        // TODO Auto-generated method stub
        return null;
    }

    @Override
    public long getPos() throws IOException {
        // TODO Auto-generated method stub
        return 0;
    }

    @Override
    public float getProgress() throws IOException {
        // TODO Auto-generated method stub
        return 0;
    }

    @Override
    public boolean next(String arg0, NullWritable arg1) throws IOException {
        // TODO Auto-generated method stub
        return false;
    }
}
Mapper
package com.hadoop.mapred.scratchpad;

import java.io.IOException;

import org.apache.commons.io.FilenameUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CustomMapperClass extends MapReduceBase implements
        Mapper<String, NullWritable, Text, IntWritable> {

    private static final int COUNT = 1;

    @Override
    public void map(String fileName, NullWritable value,
            OutputCollector<Text, IntWritable> outputCollector,
            Reporter reporter) throws IOException {
        // TODO Auto-generated method stub
        System.out.println("In CustomMapperClass.map(...) : key " + fileName
                + " value = " + value);
        outputCollector.collect(new Text(FilenameUtils.getExtension(fileName)),
                new IntWritable(COUNT));
        System.out.println("Returning from CustomMapperClass.map(...)");
    }
}
Reducer:
package com.hadoop.mapred.scratchpad;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CustomReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text fileExtn, Iterator<IntWritable> countCollection,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        // TODO Auto-generated method stub
        System.out.println("In CustomReducer.reduce(...)");
        int count = 0;
        while (countCollection.hasNext()) {
            count += countCollection.next().get();
        }
        output.collect(fileExtn, new IntWritable(count));
        System.out.println("Returning CustomReducer.reduce(...)");
    }
}
Output (HDFS) directory:
hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output
Warning: $HADOOP_HOME is deprecated.
Found 3 items
-rw-r--r-- 4 hd supergroup 0 2012-10-11 20:52 /scratchpad/output/_SUCCESS
drwxr-xr-x - hd supergroup 0 2012-10-11 20:51 /scratchpad/output/_logs
-rw-r--r-- 4 hd supergroup 0 2012-10-11 20:52 /scratchpad/output/part-00000
hd@cloudx-538-520:~/hadoop/logs/userlogs$
hd@cloudx-538-520:~/hadoop/logs/userlogs$ hadoop fs -ls /scratchpad/output/_logs
Warning: $HADOOP_HOME is deprecated.
Found 1 items
drwxr-xr-x - hd supergroup 0 2012-10-11 20:51 /scratchpad/output/_logs/history
hd@cloudx-538-520:~/hadoop/logs/userlogs$
hd@cloudx-538-520:~/hadoop/logs/userlogs$
Logs (only one opened):
hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ ls -lrt
total 16
-rw-r----- 1 hd hd 393 2012-10-11 20:52 job-acls.xml
lrwxrwxrwx 1 hd hd 95 2012-10-11 20:52 attempt_201210091538_0019_m_000000_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000000_0
lrwxrwxrwx 1 hd hd 95 2012-10-11 20:52 attempt_201210091538_0019_m_000002_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000002_0
lrwxrwxrwx 1 hd hd 95 2012-10-11 20:52 attempt_201210091538_0019_m_000001_0 -> /tmp/hadoop-hd/mapred/local/userlogs/job_201210091538_0019/attempt_201210091538_0019_m_000001_0
hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$ cat attempt_201210091538_0019_m_000000_0/stdout
In CustomFileInputFormat.getRecordReader(...)
In CustomRecordReader constructor aFile is org.apache.hadoop.mapred.FileSplit
In CustomRecordReader.createKey() ExtJS_Notes.docx
hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
hd@cloudx-538-520:~/hadoop/logs/userlogs/job_201210091538_0019$
As can be seen:
- The output on HDFS is a 0 KB file.
- The logs show the sysouts only up to the point where execution is in the CustomRecordReader.
What am I missing?
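Answer 1:
Kaliyug,
For what you need, there is no need to pass the file name to the mapper. It is already available inside the mapper. Just access it as shown below. The rest is simple: just mimic the simple word-count program.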
FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
String fileName = fileSplit.getPath().getName();
With the new API, reporter needs to be changed to context.
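The word-count analogy can be tried outside a cluster first. The following is only a minimal plain-Java sketch of the logic, not Hadoop code: the class `ExtensionCount` and its helper `extensionOf` are hypothetical names introduced here, and `extensionOf` is a simplified stand-in for `FilenameUtils.getExtension` that may differ from it in edge cases. The "map" step turns each file name into an (extension, 1) pair and the "reduce" step sums the ones per extension.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ExtensionCount {

    // Hypothetical helper: text after the last dot, or "" when there is
    // no dot or the name ends with a dot.
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1);
    }

    // "Map" emits (extension, 1) per file name; "reduce" sums per extension.
    // merge() performs both steps in one pass for this local illustration.
    public static Map<String, Integer> countByExtension(List<String> fileNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : fileNames) {
            counts.merge(extensionOf(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = List.of("a.jpeg", "Main.java", "notes.txt", "b.jpeg");
        System.out.println(countByExtension(names));
    }
}
```

In a real job the file name would come from the input split, as in the two lines above, and the summing would happen in the reducer rather than in a local map.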
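To optimize performance, you can create a record reader that simply supplies the file name as the key to the mapper (obtained the same way as above). Make the record reader not read any file content. Make the value part NullWritable.
The mapper gets the file name as the key. Just ...
The reducer needs to do the same logic as in word count.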
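A record reader of this kind has to hand out its single record exactly once. As far as I understand the old `mapred` API, the framework obtains the key and value objects from `createKey()` and `createValue()` and then calls `map` once for every `next(key, value)` call that returns true, so a `next` that always returns false produces no map calls and an empty output file. The sketch below imitates that contract in plain Java without any Hadoop dependency. The names `KeyReader`, `FileNameReader` and `OneShotReaderDemo` are hypothetical, and a `StringBuilder` stands in for a mutable key type such as `Text` (an immutable `String` key cannot be filled in by `next`).

```java
import java.util.ArrayList;
import java.util.List;

public class OneShotReaderDemo {

    // Hypothetical stand-in for the reader contract: next() fills the key
    // holder and returns true while records remain, then returns false.
    interface KeyReader {
        boolean next(StringBuilder key);
    }

    // Hands out the file name exactly once and reads no file content.
    static final class FileNameReader implements KeyReader {
        private final String fileName;
        private boolean consumed = false;

        FileNameReader(String fileName) {
            this.fileName = fileName;
        }

        @Override
        public boolean next(StringBuilder key) {
            if (consumed) {
                return false; // the single record was already delivered
            }
            key.setLength(0);
            key.append(fileName);
            consumed = true;
            return true;
        }
    }

    // Imitates the driver loop: one "map call" per successful next().
    static List<String> run(KeyReader reader) {
        List<String> mapCalls = new ArrayList<>();
        StringBuilder key = new StringBuilder(); // mutable key holder
        while (reader.next(key)) {
            mapCalls.add(key.toString());
        }
        return mapCalls;
    }

    public static List<String> runOneShot(String fileName) {
        return run(new FileNameReader(fileName));
    }

    // A reader whose next() always returns false never triggers a map call.
    public static List<String> runAlwaysFalse() {
        return run(key -> false);
    }

    public static void main(String[] args) {
        System.out.println(runOneShot("ExtJS_Notes.docx"));
        System.out.println(runAlwaysFalse());
    }
}
```

Applied to a real `RecordReader`, the same idea would mean using `Text` as the key type, returning a non-null value such as `NullWritable.get()` from `createValue()`, and returning true from `next` on the first call only.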
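Comments:
Hi Arun, thanks a lot for the pointers! I have edited my original question; the code I wrote relates to the "optimized approach" mentioned in your comment. I am not clear about the Reporter usage that you suggested. Please evaluate the code I have written.
Basically, the Reporter class comes with the old Hadoop API. With the new API, instead of Reporter, just use a Context object; that is what I meant.
OK. Can you guide me as to what mistake I have made in the code? The sysouts in the Mapper and Reducer are not appearing, and there is no error/exception in the job!
Check the sysouts on the Web UI. They won't appear on the system console. Just click on the attempt ID of the map task or reduce task and you can see the sysouts or logs.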