A Hand-Written Minimal Inverted Index
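This is a small hand-written Python take on Lucene's inverted index; a Hadoop demo of an inverted index follows further below as Java code. Why store the terms in a trie (prefix tree)? Mainly to save space. A hash table stores "app" and "apple" as two separate keys, eight characters in total, whereas a trie stores only the five letters a, p, p, l, e, because "app" is a prefix of "apple" and both terms share one path. The saving is small for two terms, but it grows with the number of terms that share prefixes. The trade-off is lookup speed: finding a term in a trie takes O(len(term)) steps, one node per character, and this is usually slower in practice than a hash-table lookup.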
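The space argument above can be checked with a minimal sketch of a trie-backed inverted index. The class and method names (`TrieNode`, `TrieInvertedIndex`, `add`, `search`, `index_document`) are assumptions made for illustration, not the original author's code, and `node_count` only counts stored characters; it does not measure real memory, since every Python node object carries its own overhead.

```python
class TrieNode:
    """One character position in the term dictionary."""

    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.postings = None  # set of doc ids if a term ends here, else None


class TrieInvertedIndex:
    """Hypothetical inverted index whose terms are stored in a trie."""

    def __init__(self):
        self.root = TrieNode()
        self.node_count = 0  # character nodes created, root excluded

    def add(self, term, doc_id):
        """Record that `term` occurs in document `doc_id`."""
        node = self.root
        for ch in term:
            if ch not in node.children:
                node.children[ch] = TrieNode()
                self.node_count += 1
            node = node.children[ch]
        if node.postings is None:
            node.postings = set()
        node.postings.add(doc_id)

    def search(self, term):
        """Return the doc ids containing `term`: one node hop per character."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.postings) if node.postings is not None else set()

    def index_document(self, doc_id, text):
        """Split on whitespace and index every lower-cased token."""
        for term in text.lower().split():
            self.add(term, doc_id)


# "app" and "apple" share one path, so only 5 character nodes exist.
demo = TrieInvertedIndex()
demo.add("app", 1)
demo.add("apple", 2)
print(demo.node_count)  # 5

index = TrieInvertedIndex()
index.index_document(1, "app store")
index.index_document(2, "apple app")
print(index.search("app"))    # {1, 2}
print(index.search("apple"))  # {2}
print(index.search("ap"))     # set(): "ap" is only a prefix, not an indexed term
```

Keeping the posting set on the node where a term ends lets a prefix such as "ap" be distinguished from an indexed term without a separate lookup table.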
package com.asin.hdp.inverted;

import java.io.IOException;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Hadoop demo: builds an inverted index (word -> file:count ...) with a
 * mapper, a combiner and a reducer.
 */
public class InvertedIndexCombine {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(InvertedIndexCombine.class);

        job.setMapperClass(invertedMapper.class);
        job.setCombinerClass(invertedCombine.class);
        job.setReducerClass(invertedReduce.class);

        // Map output and final output both use Text keys and Text values.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("e:/a.txt"));
        FileInputFormat.addInputPath(job, new Path("e:/b.txt"));
        FileInputFormat.addInputPath(job, new Path("e:/c.txt"));
        FileOutputFormat.setOutputPath(job, new Path("e:/outputCombine"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Sums values of the form "fileName:count", grouped by file name.
    private static Map<String, Integer> sumPerFile(Iterable<Text> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Text text : values) {
            String value = text.toString();
            int sep = value.lastIndexOf(':');
            counts.merge(value.substring(0, sep),
                    Integer.parseInt(value.substring(sep + 1)), Integer::sum);
        }
        return counts;
    }

    // Emits (word, "fileName:1") for every token of the input line.
    public static class invertedMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            FileSplit split = (FileSplit) context.getInputSplit();
            // getName() already returns only the last path component, e.g. "a.txt".
            String name = split.getPath().getName();
            StringTokenizer token = new StringTokenizer(value.toString(), " ");
            while (token.hasMoreTokens()) {
                context.write(new Text(token.nextToken()), new Text(name + ":1"));
            }
        }
    }

    // Pre-aggregates counts on the map side. Hadoop may run a combiner zero,
    // one or several times, so the key and the "fileName:count" value format
    // are left unchanged.
    public static class invertedCombine extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Map.Entry<String, Integer> entry : sumPerFile(values).entrySet()) {
                context.write(key, new Text(entry.getKey() + ":" + entry.getValue()));
            }
        }
    }

    // Writes one line per word: word, then tab-separated "fileName:count" entries.
    public static class invertedReduce extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder val = new StringBuilder();
            for (Map.Entry<String, Integer> entry : sumPerFile(values).entrySet()) {
                if (val.length() > 0) {
                    val.append('\t');
                }
                val.append(entry.getKey()).append(':').append(entry.getValue());
            }
            context.write(key, new Text(val.toString()));
        }
    }
}
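The data flow of such a job (map, optional combine, reduce) can be simulated in memory to see what each stage produces. This is a rough sketch in plain Python, not a translation of the Java job and not Hadoop API code. The function names, the "file:count" value format and the sample file contents are assumptions made for illustration. The combiner here keeps the key unchanged, so the result is the same whether it runs zero, one or two times.

```python
from collections import defaultdict


def map_phase(file_name, text):
    """Emit one (word, "file:1") pair per token."""
    return [(word, f"{file_name}:1") for word in text.split()]


def sum_per_file(values):
    """Add up "file:count" strings, grouped by file name."""
    counts = defaultdict(int)
    for value in values:
        name, _, count = value.rpartition(":")
        counts[name] += int(count)
    return counts


def group_by_word(pairs):
    """Stand-in for the shuffle: collect all values of each word."""
    grouped = defaultdict(list)
    for word, value in pairs:
        grouped[word].append(value)
    return grouped


def combine(pairs):
    """Pre-aggregate counts; same key and same value format in and out.

    For simplicity this runs over the whole map output, whereas Hadoop
    applies a combiner to the output of each map task separately.
    """
    return [(word, f"{name}:{count}")
            for word, values in group_by_word(pairs).items()
            for name, count in sorted(sum_per_file(values).items())]


def reduce_phase(pairs):
    """Merge all values of a word into one posting list."""
    return {word: [f"{name}:{count}"
                   for name, count in sorted(sum_per_file(values).items())]
            for word, values in group_by_word(pairs).items()}


# Hypothetical file contents, made up for this sketch.
files = {"a.txt": "hello world hello", "b.txt": "hello hadoop"}
mapped = [pair for name, text in files.items()
          for pair in map_phase(name, text)]

no_combiner = reduce_phase(mapped)
one_pass = reduce_phase(combine(mapped))
two_passes = reduce_phase(combine(combine(mapped)))

print(one_pass["hello"])                      # ['a.txt:2', 'b.txt:1']
print(no_combiner == one_pass == two_passes)  # True
```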