How can I implement tf-idf and cosine similarity in Lucene?
[Posted]: 2013-04-18 15:51:50
[Question]: I am using Lucene 4.2. The program I wrote does not use tf-idf or cosine similarity explicitly; it only uses TopScoreDocCollector.
import com.mysql.jdbc.Statement;
import java.io.BufferedReader;
import java.io.File;
import java.io.InputStreamReader;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriter;
import java.sql.DriverManager;
import java.sql.Connection;
import java.sql.ResultSet;
import org.apache.lucene.analysis.id.IndonesianAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;
public class IndexMysqlDBStemming {
    public static void main(String[] args) throws Exception {
        // 1. Create Index From Database
        Class.forName("com.mysql.jdbc.Driver").newInstance();
        Connection connection = DriverManager.getConnection("jdbc:mysql://localhost/db_haiquran", "root", "");
        IndonesianAnalyzer analyzer = new IndonesianAnalyzer(Version.LUCENE_42);
        //StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_42);
        QueryParser parser = new QueryParser(Version.LUCENE_42, "result", analyzer);
        Directory INDEX_DIR = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_42, analyzer);
        IndexWriter writer = new IndexWriter(INDEX_DIR, config);
        String query = "SELECT * FROM ayat";
        java.sql.Statement statement = connection.createStatement();
        ResultSet result = statement.executeQuery(query);
        while (result.next()) {
            Document document = new Document();
            document.add(new Field("NO_INDEX_AYAT", result.getString("NO_INDEX_AYAT"), Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.add(new Field("NO_SURAT", result.getString("NO_SURAT"), Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.add(new Field("NO_AYAT", result.getString("NO_AYAT"), Field.Store.YES, Field.Index.NOT_ANALYZED));
            document.add(new Field("TEXT_INDO", result.getString("TEXT_INDO"), Field.Store.YES, Field.Index.ANALYZED));
            document.add(new Field("TEXT_ARAB", result.getString("TEXT_ARAB"), Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.updateDocument(new Term("NO_INDEX_AYAT", result.getString("NO_INDEX_AYAT")), document);
        }
        writer.close();
        // 2. Query
        System.out.println("Enter your search keyword in here : ");
        BufferedReader bufferRead = new BufferedReader(new InputStreamReader(System.in));
        String s = bufferRead.readLine();
        String querystr = args.length > 0 ? args[0] : s;
        try {
            System.out.println(parser.parse(querystr) + "\n"); //amenit
            System.out.println();
        } catch (ParseException ex) {
            // Exception
        }
        Query q = new QueryParser(Version.LUCENE_42, "TEXT_INDO", analyzer).parse(querystr);
        // 3. Search
        int hitsPerPage = 10;
        IndexReader reader = DirectoryReader.open(INDEX_DIR);
        IndexSearcher searcher = new IndexSearcher(reader);
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;
        // 4. Display results
        System.out.println("Found : " + hits.length + " hits.");
        System.out.println("No" + " ID " + "\t" + " Surat " + "\t" + " No Ayat " + "\t" + " Terjemahan Ayat " + "\t" + " Teks Arab ");
        for (int i = 0; i < hits.length; i++) {
            int docID = hits[i].doc;
            Document d = searcher.doc(docID);
            System.out.println((i + 1) + ". " + d.get("NO_INDEX_AYAT") + "\t" + d.get("NO_SURAT") + "\t" + d.get("NO_AYAT") +
                    "\t" + d.get("TEXT_INDO") + "\t" + d.get("TEXT_ARAB"));
        }
        reader.close();
    }
}
How can I display results that are scored using tf-idf and cosine similarity?
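[Comments]:
***.com/a/39186002/8430173
[Answer 1]: Unless I'm missing something, you're already done. Good job!
The similarity algorithm used by default is DefaultSimilarity, but you'll find most of the documentation (and logic) in its base class, TFIDFSimilarity.
And TFIDFSimilarity is indeed an implementation of a TF-IDF and cosine similarity scoring model.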
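For orientation, the practical scoring function described in the TFIDFSimilarity Javadoc can be written roughly as below. This is a sketch from memory of the Lucene 4.x documentation, not a quote; in particular, the tf and idf definitions shown are the ones I understand DefaultSimilarity to use, and `boost(t)` stands in for the query-time term boost.

```latex
\[
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\cdot\mathrm{queryNorm}(q)\cdot
\sum_{t \in q}\Bigl(\mathrm{tf}(t \in d)\cdot\mathrm{idf}(t)^{2}\cdot\mathrm{boost}(t)\cdot\mathrm{norm}(t,d)\Bigr)
\]
% Assumed DefaultSimilarity definitions:
\[
\mathrm{tf}(t \in d) = \sqrt{\mathrm{freq}(t,d)}, \qquad
\mathrm{idf}(t) = 1 + \ln\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}(t) + 1}\right)
\]
```

The sum plays the role of the dot product in a cosine similarity, while queryNorm and norm(t,d) supply approximate query-side and document-side length normalization.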
[Comments]:
Thanks, femtoRgon. Could you give a code example of a program that uses TFIDFSimilarity and DefaultSimilarity? I tried to calculate tf-idf without using Lucene's own modules: TermFreqVector tfv = ir.getTermFreqVector(docNum, "TEXT_INDO"); String terms[] = tfv.getTerms(); int termCount = terms.length; int freqs[] = tfv.getTermFrequencies(); for(int t=0; t
But this works less well, because the values end up stored in variables. How do I use DefaultSimilarity and TFIDFSimilarity instead?
I'm afraid I don't understand what you're trying to do. By default, Lucene applies a scoring algorithm that fits your specification quite well. When you query, you get back an array of ScoreDocs, and you can read each score from ScoreDoc.score. In your case, as you loop through hits, that is hits[i].score.
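To make concrete what a tf-idf plus cosine score measures, here is a minimal standalone Java sketch. It does not use Lucene and is not Lucene's implementation: the class and method names are hypothetical, the tf and idf shapes are assumed to mirror DefaultSimilarity (square root of the term frequency, and 1 + ln(numDocs / (docFreq + 1))), and it leaves out the coord, queryNorm, boost, and stored length-norm factors that Lucene applies.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Standalone illustration only; every name here is hypothetical and none of it is Lucene API.
final class TfIdfCosineSketch {

    // Count how often each term occurs in one tokenized document.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Count, for each term, the number of documents that contain it.
    static Map<String, Integer> documentFrequencies(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Weight each term by sqrt(freq) * (1 + ln(numDocs / (docFreq + 1))).
    // Assumption: these shapes mirror DefaultSimilarity's tf() and idf().
    static Map<String, Double> tfIdfVector(List<String> tokens, Map<String, Integer> df, int numDocs) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFrequencies(tokens).entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0);
            double tf = Math.sqrt(e.getValue());
            double idf = 1.0 + Math.log((double) numDocs / (docFreq + 1));
            vector.put(e.getKey(), tf * idf);
        }
        return vector;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double w : v.values()) {
            sumOfSquares += w * w;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine similarity: dot product divided by the product of the vector lengths.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    public static void main(String[] args) {
        // Toy corpus of already-tokenized documents (made-up data).
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("kitab", "suci"),
                Arrays.asList("kitab", "langit"),
                Arrays.asList("bumi"));
        Map<String, Integer> df = documentFrequencies(corpus);
        Map<String, Double> query = tfIdfVector(Arrays.asList("kitab", "suci"), df, corpus.size());
        for (int i = 0; i < corpus.size(); i++) {
            Map<String, Double> doc = tfIdfVector(corpus.get(i), df, corpus.size());
            System.out.printf("doc %d: %.4f%n", i, cosine(query, doc));
        }
    }
}
```

Documents that share more of the query's rarer terms get a higher cosine value, which is the same intuition behind the ordering of `hits`. The numbers will not match `hits[i].score`, since Lucene's extra factors are omitted here.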