How to extract Document Term Vector in Lucene 3.5.0

Posted: 2012-02-05 07:14:17

[Question]

I am using Lucene 3.5.0 and I want to output the term vector of each document. For example, I want to know the frequency of a term across all documents and within each particular document. My indexing code is:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileFilter;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class Indexer {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new IllegalArgumentException("Usage: java " + Indexer.class.getName() + " <index dir> <data dir>");
        }

        String indexDir = args[0];
        String dataDir = args[1];
        long start = System.currentTimeMillis();
        Indexer indexer = new Indexer(indexDir);
        int numIndexed;
        try {
            numIndexed = indexer.index(dataDir, new TextFilesFilter());
        } finally {
            indexer.close();
        }
        long end = System.currentTimeMillis();
        System.out.println("Indexing " + numIndexed + " files took " + (end - start) + " milliseconds");
    }

    private IndexWriter writer;

    public Indexer(String indexDir) throws IOException {
        Directory dir = FSDirectory.open(new File(indexDir));
        writer = new IndexWriter(dir,
            new StandardAnalyzer(Version.LUCENE_35),
            true,
            IndexWriter.MaxFieldLength.UNLIMITED);
    }

    public void close() throws IOException {
        writer.close();
    }

    public int index(String dataDir, FileFilter filter) throws Exception {
        File[] files = new File(dataDir).listFiles();
        for (File f : files) {
            if (!f.isDirectory() &&
                !f.isHidden() &&
                f.exists() &&
                f.canRead() &&
                (filter == null || filter.accept(f))) {
                // Read the first line (the URL). Open the file itself, not just
                // f.getName(), so this works regardless of the working directory.
                BufferedReader inputStream = new BufferedReader(new FileReader(f));
                String url = inputStream.readLine();
                inputStream.close();
                indexFile(f, url);
            }
        }
        return writer.numDocs();
    }

    private static class TextFilesFilter implements FileFilter {
        public boolean accept(File path) {
            return path.getName().toLowerCase().endsWith(".txt");
        }
    }

    protected Document getDocument(File f, String url) throws Exception {
        Document doc = new Document();
        doc.add(new Field("contents", new FileReader(f)));
        doc.add(new Field("urls", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("filename", f.getName(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("fullpath", f.getCanonicalPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }

    private void indexFile(File f, String url) throws Exception {
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = getDocument(f, url);
        writer.addDocument(doc);
    }
}

Can anyone help me write a program that does this? Thanks.


[Answer 1]

First of all, you don't need to store term vectors just to know the frequency of a term in the documents. Lucene stores these numbers anyway, for use in its TF-IDF calculation. You can access this information by calling IndexReader.termDocs(term) and iterating over the result.
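
A minimal sketch of that approach (Lucene 3.x API; the index directory, the field name "contents", and the sample word are assumptions, not taken from the question):

import java.io.File;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.store.FSDirectory;

// ...
IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
Term term = new Term("contents", "lucene");

// Number of documents containing the term (document frequency).
System.out.println("docFreq = " + reader.docFreq(term));

// Frequency of the term inside each document that contains it.
TermDocs termDocs = reader.termDocs(term);
while (termDocs.next()) {
    System.out.println("doc " + termDocs.doc() + ": freq = " + termDocs.freq());
}
termDocs.close();
reader.close();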

If you have some other purpose in mind and you really do need access to the term vectors, then you need to tell Lucene to store them, by passing Field.TermVector.YES as the last argument of the Field constructor. Then you can retrieve the vectors, e.g. with IndexReader.getTermFreqVector().
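
Applied to the getDocument() method from the question, a sketch could look like this (only the "contents" field changes; reader and docNum stand for an open IndexReader and a valid document number):

// Ask Lucene to record a term vector for the "contents" field.
// Field(String name, Reader reader, Field.TermVector termVector) is the
// reader-based constructor in Lucene 3.5.
doc.add(new Field("contents", new FileReader(f), Field.TermVector.YES));

// After re-indexing, the per-document vector can be read back:
TermFreqVector tfv = reader.getTermFreqVector(docNum, "contents");
if (tfv != null) {
    String[] terms = tfv.getTerms();          // distinct terms in this document
    int[] freqs = tfv.getTermFrequencies();   // parallel array of frequencies
    for (int i = 0; i < terms.length; i++) {
        System.out.println(terms[i] + " -> " + freqs[i]);
    }
}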

[Comments]

Does this help with finding tf-idf? I mean, a feature computed as in this question: ***.com/questions/9189179/…

TF-IDF is an acronym for "term frequency" - "inverse document frequency", the basic measure used by the default similarity function in Lucene. It is always computed by Lucene for internal purposes.

[Answer 2]

I am using Lucene core 3.0.3, but I expect the API will be very similar. This method sums up a term frequency map for a given set of document numbers and a list of fields of interest, ignoring stop words:

/**
 * Sums the term frequency vector of each document into a single term frequency map.
 * @param indexReader the index reader; the document numbers are specific to this reader
 * @param docNumbers document numbers to retrieve frequency vectors from
 * @param fieldNames field names to retrieve frequency vectors from
 * @param stopWords terms to ignore
 * @return a map of each term to its frequency
 * @throws IOException
 */
private Map<String, Integer> getTermFrequencyMap(IndexReader indexReader, List<Integer> docNumbers, String[] fieldNames, Set<String> stopWords)
        throws IOException {
    Map<String, Integer> totalTfv = new HashMap<String, Integer>(1024);

    for (Integer docNum : docNumbers) {
        for (String fieldName : fieldNames) {
            TermFreqVector tfv = indexReader.getTermFreqVector(docNum, fieldName);
            if (tfv == null) {
                // this document stores no term vector for this field
                continue;
            }

            String[] terms = tfv.getTerms();
            int[] freqs = tfv.getTermFrequencies();

            for (int t = 0; t < terms.length; t++) {
                String term = terms[t];
                int freq = freqs[t];

                // filter out single-letter words and stop words
                // (StringUtils is org.apache.commons.lang.StringUtils)
                if (StringUtils.length(term) < 2 || stopWords.contains(term)) {
                    continue;
                }

                Integer totalFreq = totalTfv.get(term);
                totalFreq = (totalFreq == null) ? freq : freq + totalFreq;
                totalTfv.put(term, totalFreq);
            }
        }
    }

    return totalTfv;
}
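A possible call site for this helper, as a sketch (the index directory, document numbers, field name, and stop words are all assumptions):

IndexReader indexReader = IndexReader.open(FSDirectory.open(new File("index")));
List<Integer> docNumbers = Arrays.asList(0, 1, 2);
Set<String> stopWords = new HashSet<String>(Arrays.asList("the", "a", "an", "of"));

Map<String, Integer> tf = getTermFrequencyMap(indexReader, docNumbers,
        new String[] { "contents" }, stopWords);
for (Map.Entry<String, Integer> entry : tf.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue());
}
indexReader.close();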
[Comments]

PS: you have to configure each field up front to store a term frequency vector! @Field(index = Index.TOKENIZED, termVector = TermVector.YES) public String getAbstract() { return this.abstract_; }

Thank you very much. Is there a way to compute tf-idf values from these numbers? ***.com/questions/9189179/…
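
For what it's worth, a plain tf-idf value can be sketched from these numbers (freq is a frequency taken from the vector above and term the corresponding string, both placeholders; Lucene's own DefaultSimilarity instead uses sqrt(tf) and idf = 1 + log(numDocs / (docFreq + 1)), so this variant is only illustrative):

// Sketch: classic tf-idf from a raw in-document frequency and corpus statistics.
// `freq` and `term` are placeholders for values from the loops above.
int numDocs = indexReader.numDocs();
int docFreq = indexReader.docFreq(new Term("contents", term));
double idf = Math.log((double) numDocs / (1 + docFreq)); // +1 guards against division by zero
double tfidf = freq * idf;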
