How to extract a Document Term Vector in Lucene 3.5.0
Posted: 2012-02-05 07:14:17

I am using Lucene 3.5.0 and I want to output the term vector of each document. For example, I want to know the frequency of a term across all documents, and within each specific document. My indexing code is:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName()
                    + " <index dir> <data dir>");
        }
        String indexDir = args[0];
        String dataDir = args[1];

        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took "
                + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true,
                IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                (filter == null || filter.accept(f))) {
                // Read the first line of the file as its URL. Pass the File itself,
                // not f.getName(), which would resolve against the working directory
                // instead of the data directory.
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}
Can anyone help me write a program to do this? Thanks.
Answer 1:

First of all, you don't need to store term vectors just to know the frequency of a term in the documents. Lucene already stores these numbers for its TF-IDF calculations. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
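As a self-contained sketch of that first approach (it assumes lucene-core 3.5.0 on the classpath; the field name, terms, and sample texts are made up for illustration, and the index is built in memory so the example runs on its own):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermDocsDemo {

    // Builds a tiny in-memory index so the sketch is self-contained.
    static IndexReader buildReader() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        String[] texts = { "apache lucene is a search library",
                           "lucene lucene lucene" };
        for (String text : texts) {
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }
        writer.close();
        return IndexReader.open(dir);
    }

    public static void main(String[] args) throws Exception {
        IndexReader reader = buildReader();
        Term term = new Term("contents", "lucene");
        // Number of documents containing the term:
        System.out.println("docFreq = " + reader.docFreq(term));
        // Within-document frequency, one line per matching document:
        TermDocs termDocs = reader.termDocs(term);
        while (termDocs.next()) {
            System.out.println("doc " + termDocs.doc() + " freq " + termDocs.freq());
        }
        termDocs.close();
        reader.close();
    }
}
```

Note that this route gives you per-document frequencies of a term you already know, without storing term vectors at all.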
If you have some other purpose in mind and really do need access to the term vectors, then you have to tell Lucene to store them, by passing Field.TermVector.YES as the last argument of the Field constructor. You can then retrieve the vectors, for example with IndexReader.getTermFreqVector().
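A minimal sketch of that second route, again assuming lucene-core 3.5.0 on the classpath (the field name "contents" matches the question's code; the sample text and class name are illustrative):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermVectorDemo {

    // Indexes a single document with a stored term vector and returns
    // its frequency vector for the "contents" field.
    static TermFreqVector vectorFor(String text) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        // Field.TermVector.YES is what makes getTermFreqVector() work later.
        doc.add(new Field("contents", text,
                Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.addDocument(doc);
        writer.close();
        IndexReader reader = IndexReader.open(dir);
        return reader.getTermFreqVector(0, "contents");
    }

    public static void main(String[] args) throws Exception {
        TermFreqVector tfv = vectorFor("hello lucene hello world");
        String[] terms = tfv.getTerms();        // terms of doc 0
        int[] freqs = tfv.getTermFrequencies(); // parallel frequency array
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " : " + freqs[i]);
        }
    }
}
```

getTerms() and getTermFrequencies() are parallel arrays, so freqs[i] is the within-document frequency of terms[i].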
Comments:

Does it help with finding tf-idf? I mean the feature computed here: ***.com/questions/9189179/…
TF-IDF is an acronym for "term frequency – inverse document frequency", the basic measure used by the default similarity function in Lucene. It is always computed by Lucene for its internal purposes.

Answer 2:

I'm on Lucene core 3.0.3, but I expect the API will be very similar. This method sums up a term frequency map over a given set of document numbers and a list of fields of interest, ignoring stop words.
/**
 * Sums the term frequency vector of each document into a single term frequency map
 * @param indexReader the index reader; the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String, Integer> getTermFrequencyMap(IndexReader indexReader,
        List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
        throws IOException {
    Map<String, Integer> totalTfv = new HashMap<String, Integer>(1024);
    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // skip fields with no stored term vector
                continue;
            }
            String[] terms = tfv.getTerms();
            int termCount = terms.length;
            int[] freqs = tfv.getTermFrequencies();
            for (int t = 0; t < termCount; t++) {
                String term = terms[t];
                int freq = freqs[t];
                // filter out single-letter words and stop words
                // (StringUtils comes from Apache Commons Lang)
                if (StringUtils.length(term) < 2 || stopWords.contains(term)) {
                    continue;
                }
                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }
    return totalTfv;
}
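To show the method above in motion, here is a self-contained driver sketch (assuming lucene-core 3.5.0 on the classpath; the Commons-Lang StringUtils call is replaced with plain String.length() so no extra dependency is needed, and the index, field name, and sample texts are all illustrative):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermFreqMapDemo {

    // Same accumulation idea as the answer's method, with StringUtils.length()
    // swapped for String.length() to keep the sketch dependency-free.
    static Map<String, Integer> sumTermFrequencies(IndexReader reader,
            List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
            throws IOException {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Integer docNum : docNumbers) {
            for (String fieldName : fieldNames) {
                TermFreqVector tfv = reader.getTermFreqVector(docNum, fieldName);
                if (tfv == null) continue; // no term vector stored for this field
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int t = 0; t < terms.length; t++) {
                    if (terms[t].length() < 2 || stopWords.contains(terms[t])) continue;
                    Integer prev = total.get(terms[t]);
                    total.put(terms[t], prev == null ? freqs[t] : prev + freqs[t]);
                }
            }
        }
        return total;
    }

    // Builds a tiny in-memory index with term vectors enabled.
    static IndexReader buildReader() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_35),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        for (String text : new String[] { "lucene search search", "lucene index" }) {
            Document doc = new Document();
            doc.add(new Field("contents", text,
                    Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
            writer.addDocument(doc);
        }
        writer.close();
        return IndexReader.open(dir);
    }

    public static void main(String[] args) throws Exception {
        IndexReader reader = buildReader();
        Map<String, Integer> freqs = sumTermFrequencies(reader,
                Arrays.asList(0, 1), new String[] { "contents" },
                Collections.<String>emptySet());
        System.out.println(freqs); // frequencies summed across both documents
        reader.close();
    }
}
```

The key prerequisite stands either way: the fields must have been indexed with Field.TermVector.YES, or getTermFreqVector() returns null.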
Comments:

P.S. you have to configure each field ahead of time to store a term frequency vector! @Field(index = Index.TOKENIZED, termVector = TermVector.YES) public String getAbstract() { return this.abstract_; }
Thanks a lot, is there any way to compute tf-idf values from these numbers? ***.com/questions/9189179/…