SOLR Source Code Analysis: the edismax Scoring Mechanism
Solr is an enterprise-grade full-text search engine built on Lucene. We use Solr for full-text search at work and have so far relied on its default scoring, which covers ordinary relevance ranking well enough. Recently a customer raised new retrieval requirements, which prompted a deeper look at Solr's scoring mechanism. The details follow.
Lucene's underlying scoring model is the TF/IDF algorithm, which is essentially a term-frequency model. Every document is analyzed (tokenized) against the analyzer's dictionary when the index is built. At query time the search phrase is analyzed the same way, and the resulting terms are used to find the matching documents through the inverted index.
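For readers new to the idea, here is a toy inverted-index sketch (purely illustrative; it has nothing to do with Lucene's actual index format): each term maps to the set of document ids that contain it, and a single-term query is answered by looking up that term's posting list.
import java.util.*;

public class ToyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String term : text.split("\\s+")) {             // naive whitespace "analyzer"
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    Set<Integer> search(String term) {
        return postings.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex idx = new ToyInvertedIndex();
        idx.add(1, "this book is about english");
        idx.add(2, "this book is about chinese");
        idx.add(3, "this book is about japan");
        System.out.println(idx.search("chinese")); // [2]
        System.out.println(idx.search("book"));    // [1, 2, 3]
    }
}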
Note: TF-IDF works at the level of terms, the smallest units produced by the analyzer, so the analyzer matters enormously for statistics-based ranking. If you split Chinese text into single characters you lose all semantic relevance, and search degenerates into an efficient full-text string match. The edismax parameters themselves are easy to look up online and are not repeated here. The source code analyzed below is Solr 6.5.0.
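For context, here is a minimal SolrJ sketch of an edismax request with the usual qf/pf/bf parameters. This is only an illustration: the URL, core name, field names and boost function are made up, not taken from the article.
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxQueryDemo {
    public static void main(String[] args) throws Exception {
        // hypothetical Solr URL and core name; replace with your own
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/demo_core").build();

        SolrQuery q = new SolrQuery("中国人民解放军");
        q.set("defType", "edismax");          // switch to the edismax query parser
        q.set("qf", "title^2 content");       // query fields with per-field boosts (illustrative)
        q.set("pf", "title");                 // phrase fields (illustrative)
        q.set("bf", "recip(ms(NOW,pub_date),3.16e-11,1,1)"); // boost function (illustrative)
        q.set("debugQuery", "true");          // ask Solr to return the score explanation

        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResults().getNumFound());
        // rsp.getDebugMap() holds the per-document score breakdown under "explain"
        client.close();
    }
}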
Scoring factors in Lucene:
1) coord(q,d)
The number of the analyzed query terms that a document actually contains; the more query terms a document hits, the better it matches. For example, the query "中国人民解放军" may be analyzed into terms such as "中国", "中国人民" and "解放军"; a document that hits all of them is more relevant than the other documents. The formula is:
coord(q,d) = hitTerms / totalQueryTerms (hitTerms: number of query terms hit in the document; totalQueryTerms: number of terms in the query)
Example: for QUERY("SanFrancisco" OR "NewYork" OR "Paris"), a document A that contains all three terms gets coord = 3/3; a document that contains only one of them gets coord = 1/3.
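A tiny hand-rolled sketch of that ratio (the equivalent in Lucene's ClassicSimilarity is coord(overlap, maxOverlap)):
public class CoordDemo {
    // coord(q,d) = hitTerms / totalQueryTerms
    static float coord(int hitTerms, int totalQueryTerms) {
        return (float) hitTerms / totalQueryTerms;
    }

    public static void main(String[] args) {
        // query with three terms: "SanFrancisco" OR "NewYork" OR "Paris"
        System.out.println(coord(3, 3)); // document hitting all three terms -> 1.0
        System.out.println(coord(1, 3)); // document hitting only one term  -> 0.33333334
    }
}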
2) queryNorm(q)
The normalization factor derived from the sum of squared weights of the query terms. It does not affect ranking; it only makes scores comparable across different queries. In other words, for a given query it affects every document identically, so it never changes the result order; its only purpose is to put different queries on a comparable scale.
queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
sumOfSquaredWeights = ∑ ( idf(t) · t.getBoost() )²
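A small worked sketch of these two formulas (hand-rolled; the idf values and boosts below are made-up numbers, not taken from the article):
public class QueryNormDemo {
    public static void main(String[] args) {
        // assume the query was analyzed into two terms with these (made-up) idf values and boosts
        double[] idf   = {1.40, 2.10};
        double[] boost = {1.0, 1.0};

        double sumOfSquaredWeights = 0;
        for (int i = 0; i < idf.length; i++) {
            double w = idf[i] * boost[i];
            sumOfSquaredWeights += w * w;       // sum of (idf(t) * t.getBoost())^2
        }
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);
        System.out.println(queryNorm);          // ~0.396 for the numbers above
    }
}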
3) tf(t in d)
Short for term frequency: the number of times a single term occurs in a single document. tf is computed and stored at indexing time; the stored value is the square root of the raw count. For example:
Given a document "this is book about chinese book" and the search term "book", the frequency of "book" in this document is 2, so tf = sqrt(2) ≈ 1.4142135.
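Reproducing that number directly (the square-root formula matches ClassicSimilarity's tf):
public class TfDemo {
    public static void main(String[] args) {
        int freq = 2; // "book" occurs twice in "this is book about chinese book"
        double tf = Math.sqrt(freq);
        System.out.println(tf); // 1.4142135623730951
    }
}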
4) idf(t)
Inverse document frequency. As the name suggests it works in the opposite direction to tf, and it is obtained at query time. It reflects how many documents contain the term: when a term appears in many documents it contributes little to telling those documents apart, so the more documents a term hits, the less important it is for the relevance score. The formula is:
idf(t) = 1 + log(numDocs / (docFreq + 1)), where numDocs is the total number of documents and docFreq is the number of documents that contain the term. The larger docFreq is, the smaller idf becomes, and vice versa.
For example, given the following three documents:
1)this book is about english
2)this book is about chinese
3)this book is about japan
Searching for "chinese": for the second document docFreq is 1, because only one document matches, and numDocs is 3. The resulting idf is (Math.log(numDocs / (double)(docFreq + 1)) + 1.0) = ln(3/(1+1)) + 1 = ln(1.5) + 1 ≈ 1.40546510810816.
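The same arithmetic as a small runnable sketch (following the idf formula quoted above):
public class IdfDemo {
    public static void main(String[] args) {
        long numDocs = 3;  // total number of documents
        long docFreq = 1;  // number of documents containing "chinese"
        double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
        System.out.println(idf); // 1.4054651081081644
    }
}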
5) norm(t,d)
This is the length-related weighting factor; its purpose is to rank a shorter document ahead of a longer one that matches equally well. For example, with the two documents "chinese" and "chinese book", a search for "chinese" puts the first document higher because it is closer to an exact match. It combines the document boost, the field boost and lengthNorm. Unlike t.getBoost(), which can be set dynamically at query time, the f.getBoost() and d.getBoost() inside norm can only be set at indexing time; changing them requires rebuilding the index. Their values are stored in the .nrm file. The formula is:
norm(t,d) = d.getBoost() · lengthNorm(f) · f.getBoost(), where d.getBoost() is the document boost and f.getBoost() is the field boost. If both are set to 1, norm(t,d) equals lengthNorm.
For example, for the document:
chinese book
a search for "chinese" gives numTerms = 2, so lengthNorm = 1/sqrt(2) ≈ 0.7071067811865475.
6) lengthNorm
Based on the number of terms a document (field) contains: the more terms a field has, the less relevant it is compared with a shorter field that matches the same query. For example, which matches better, a 10-term title that mentions 北京 once or a 200-word body that mentions it twice? The title, of course. The formula is the reciprocal of the square root of the number of terms:
lengthNorm = 1 / Math.sqrt(numTerms) (numTerms is the number of terms in the field). The larger numTerms is, the smaller lengthNorm becomes and the less each term counts, and vice versa, which is intuitive. A small sketch tying lengthNorm and norm(t,d) together is shown below.
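A hand-rolled sketch (doc boost and field boost default to 1, matching the description above; note that Lucene actually stores the norm in a lossy single-byte encoding, so the value decoded at search time is slightly coarser than this):
public class LengthNormDemo {
    public static void main(String[] args) {
        int numTerms = 2;                                  // "chinese book" has two terms
        double lengthNorm = 1.0 / Math.sqrt(numTerms);
        double docBoost = 1.0, fieldBoost = 1.0;           // both default to 1 unless set at index time
        double norm = docBoost * fieldBoost * lengthNorm;  // norm(t,d)
        System.out.println(lengthNorm);                    // 0.7071067811865475
        System.out.println(norm);                          // equal to lengthNorm when both boosts are 1
    }
}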
Now to the source code.
First set up a remote-debugging environment for Solr with ant + IntelliJ IDEA: download the Solr source, generate an IDEA project from it with ant, import the project into IDEA, and debug via IDEA's remote-debugging support. Since the code base is large, only the key code paths are analyzed here; consult the source for further details.
1. The entry point for a search is the handleRequestBody method of the SearchHandler class:
// components is a List; the scoring code lives in QueryComponent
// QueryComponent.prepare does the preliminary work on the request parameters,
// mainly building the Query object
for( SearchComponent c : components ) {
c.prepare(rb);
}
// afterwards QueryComponent.process is called to handle the actual search request
for( SearchComponent c : components ) {
c.process(rb);
}
2. QueryComponent.process is invoked:
// which calls SolrIndexSearcher.search
searcher.search(result, cmd);
rb.setResult(result);
3. SolrIndexSearcher.search is invoked, which in turn calls getDocListC:
private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
// ... part of the code omitted ...
if (useFilterCache) {
if (out.docSet == null) {
out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
DocSet bigFilt = getDocSet(cmd.getFilterList());
if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
}
sortDocSet(qr, cmd);
} else {
if ((flags & GET_DOCSET) != 0) {
DocSet qDocSet = getDocListAndSetNC(qr, cmd);
if (qDocSet != null && filterCache != null && !qr.isPartialResults())
filterCache.put(cmd.getQuery(), qDocSet);
} else {
// getDocListNC retrieves the matching documents
getDocListNC(qr, cmd);
}
assert null != out.docList : "docList is null";
}
}
// getDocListNC calls buildAndRunCollectorChain
private void getDocListNC(QueryResult qr, QueryCommand cmd) throws IOException {
// ...
final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
Collector collector = topCollector;
// build the necessary collector chain and run the query against it
buildAndRunCollectorChain(qr, query, collector, cmd, pf.postFilter);
totalHits = topCollector.getTotalHits();
TopDocs topDocs = topCollector.topDocs(0, len);
populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);
// ...
}
// buildAndRunCollectorChain calls super.search(query, collector)
private void buildAndRunCollectorChain(QueryResult qr, Query query, Collector collector, QueryCommand cmd,
DelegatingCollector postFilter) throws IOException {
// ... omitted ...
try {
// run the search
super.search(query, collector);
} catch (TimeLimitingCollector.TimeExceededException | ExitableDirectoryReader.ExitingReaderException x) {
log.warn("Query: [{}]; {}", query, x.getMessage());
qr.setPartialResults(true);
}
// ... omitted ...
}
4. IndexSearcher.search is then invoked; the key point is its call to createNormalizedWeight, which builds the Weight tree.
// create the normalized weight via createNormalizedWeight
public void search(Query query, Collector results) throws IOException {
search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
}
// create the Weight
public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
query = rewrite(query);
// the Weight objects are created recursively here
Weight weight = createWeight(query, needsScores);
float v = weight.getValueForNormalization();
float norm = getSimilarity(needsScores).queryNorm(v);
if (Float.isInfinite(norm) || Float.isNaN(norm)) {
norm = 1.0f;
}
weight.normalize(norm, 1.0f);
return weight;
}
// the query instance here is a BooleanQuery, which holds a list of clauses:
// the query is split into sub-queries and one Weight object is created per sub-query
public Weight createWeight(Query query, boolean needsScores) throws IOException {
final QueryCache queryCache = this.queryCache;
Weight weight = query.createWeight(this, needsScores);
if (needsScores == false && queryCache != null) {
weight = queryCache.doCache(weight, queryCachingPolicy);
}
return weight;
}
BooleanQuery.createWeight is then called to build the Weight for this query: it fetches the idf value, creates a TermWeight, and computes each term's queryWeight. This is a recursive process: BooleanQuery is a tree structure, the query tree is traversed, and a Weight object is created for every leaf node, yielding the Weight tree.
// create the Weight by invoking the BooleanWeight constructor
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
BooleanQuery query = this;
if (needsScores == false) {
query = rewriteNoScoring();
}
// initialize BooleanWeight
return new BooleanWeight(query, searcher, needsScores, disableCoord);
}
// BooleanWeight constructor: recursively creates the child Weights
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
super(query);
this.query = query;
this.needsScores = needsScores;
// defaults to ClassicSimilarity
this.similarity = searcher.getSimilarity(needsScores);
weights = new ArrayList<>();
int i = 0;
int maxCoord = 0;
for (BooleanClause c : query) {
// recursively create the Weight for each clause's sub-query
Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
weights.add(w);
if (c.isScoring()) {
maxCoord++;
}
i += 1;
}
// ...
}
// next the weight of a single term is initialized; note that TermWeight is an inner class of TermQuery
public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates)
throws IOException {
super(TermQuery.this);
if (needsScores && termStates == null) {
throw new IllegalStateException("termStates are required when scores are needed");
}
this.needsScores = needsScores;
this.termStates = termStates;
this.similarity = searcher.getSimilarity(needsScores);
final CollectionStatistics collectionStats;
final TermStatistics termStats;
if (needsScores) {
collectionStats = searcher.collectionStatistics(term.field());
// fetch docFreq and the total document count, from which idf is computed
termStats = searcher.termStatistics(term, termStates);
} else {
final int maxDoc = searcher.getIndexReader().maxDoc();
collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
termStats = new TermStatistics(term.bytes(), maxDoc, -1);
}
this.stats = similarity.computeWeight(collectionStats, termStats);
}
// compute idf and return an IDFStats
public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
final Explanation idf = termStats.length == 1
? idfExplain(collectionStats, termStats[0])
: idfExplain(collectionStats, termStats);
return new IDFStats(collectionStats.field(), idf);
}
public IDFStats(String field, Explanation idf) {
this.field = field;
this.idf = idf;
normalize(1f, 1f);
}
// compute queryWeight
public void normalize(float queryNorm, float boost) {
this.boost = boost;
this.queryNorm = queryNorm;
queryWeight = queryNorm * boost * idf.getValue();
value = queryWeight * idf.getValue();// idf for document
}
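To make the normalize arithmetic concrete, here is a hand-rolled sketch of what IDFStats effectively does for a single-term query (the idf value is the one from the earlier "chinese" example; this is an illustration, not the real class):
public class IdfStatsSketch {
    final float idf;
    float boost, queryNorm, queryWeight, value;

    IdfStatsSketch(float idf) {
        this.idf = idf;
        normalize(1f, 1f);                     // constructor pass: queryNorm and boost both start at 1
    }

    void normalize(float queryNorm, float boost) {
        this.boost = boost;
        this.queryNorm = queryNorm;
        queryWeight = queryNorm * boost * idf; // weight of the term within the query
        value = queryWeight * idf;             // idf ends up squared in the final term weight
    }

    public static void main(String[] args) {
        IdfStatsSketch stats = new IdfStatsSketch(1.4054651f);
        System.out.println(stats.queryWeight); // 1.4054651 after the first pass
        // second pass: for a single-term query, queryNorm = 1 / sqrt(idf^2) = 1 / idf
        stats.normalize(1f / 1.4054651f, 1f);
        System.out.println(stats.value);       // (1/idf) * idf * idf ≈ idf = 1.4054651
    }
}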
5. Back in the createNormalizedWeight method of step 4, getValueForNormalization walks the Weight tree, squares the queryWeight of each node and sums the results, i.e. sumOfSquaredWeights = ∑ queryWeight².
public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
query = rewrite(query);
// the Weight tree is built recursively here
Weight weight = createWeight(query, needsScores);
// v is sumOfSquaredWeights = ∑ queryWeight^2
float v = weight.getValueForNormalization();
// norm = 1.0 / Math.sqrt(sumOfSquaredWeights), the reciprocal of the square root
float norm = getSimilarity(needsScores).queryNorm(v);
if (Float.isInfinite(norm) || Float.isNaN(norm)) {
norm = 1.0f;
}
// recursively recompute queryWeight
// queryWeight = norm * boost * idf.getValue(), boost defaults to 1
// the computation lives in TermWeight >> IDFStats >> normalize
weight.normalize(norm, 1.0f);
return weight;
}
// getValueForNormalization in BooleanWeight
public float getValueForNormalization() throws IOException {
float sum = 0.0f;
int i = 0;
for (BooleanClause clause : query) {
// call sumOfSquaredWeights for all clauses in case of side effects
float s = weights.get(i).getValueForNormalization(); // sum sub weights
if (clause.isScoring()) {
// only add to sum for scoring clauses
sum += s;
}
i += 1;
}
return sum ;
}
// getValueForNormalization in TermQuery.TermWeight
public float getValueForNormalization() {
return stats.getValueForNormalization();
}
// getValueForNormalization in IDFStats, i.e. queryWeight^2
public float getValueForNormalization() {
return queryWeight * queryWeight; // sum of squared weights
}
6. Compute each hit's score and sort the hits by that score. The path is IndexSearcher >> search >> weight.bulkScorer(ctx) >> scorer.score, which computes the score of every document (this is the core part).
protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
throws IOException {
for (LeafReaderContext ctx : leaves) {
final LeafCollector leafCollector;
try {
leafCollector = collector.getLeafCollector(ctx);
} catch (CollectionTerminatedException e) {
continue;
}
// build the DefaultBulkScorer
BulkScorer scorer = weight.bulkScorer(ctx);
if (scorer != null) {
try {
scorer.score(leafCollector, ctx.reader().getLiveDocs());
} catch (CollectionTerminatedException e) {
}
}
}
}
//DefaultBulkScorer>>score>>scoreAll
public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
collector.setScorer(scorer);
if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
scoreAll(collector, iterator, twoPhase, acceptDocs);
return DocIdSetIterator.NO_MORE_DOCS;
} else {
int doc = scorer.docID();
if (doc < min) {
if (twoPhase == null) {
doc = iterator.advance(min);
} else {
doc = twoPhase.approximation().advance(min);
}
}
return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
}
}
//scoreAll>>collector.collect(doc)
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
if (twoPhase == null) {
// iterator.nextDoc() is the key here: it walks the iterator over the term's posting list
for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
if (acceptDocs == null || acceptDocs.get(doc)) {
collector.collect(doc);
}
}
} else {
final DocIdSetIterator approximation = twoPhase.approximation();
for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
collector.collect(doc);
}
}
}
}
// SimpleTopScoreDocCollector >> collect
// SimpleTopScoreDocCollector.getLeafCollector returns an anonymous subclass of ScorerLeafCollector
public LeafCollector getLeafCollector(LeafReaderContext context)
throws IOException {
final int docBase = context.docBase;
return new ScorerLeafCollector() {
public void collect(int doc) throws IOException {
float score = scorer.score();
assert score != Float.NEGATIVE_INFINITY;
assert !Float.isNaN(score);
totalHits++;
if (score <= pqTop.score) {
return;
}
pqTop.doc = doc + docBase;
pqTop.score = score;
pqTop = pq.updateTop();
}
};
}
// ReqOptSumScorer >> score: combines the term-relevance score with the boost-query score and returns their sum
public float score() throws IOException {
int curDoc = reqScorer.docID();
// term-relevance score
float score = reqScorer.score();
int optScorerDoc = optIterator.docID();
if (optScorerDoc < curDoc) {
optScorerDoc = optIterator.advance(curDoc);
}
if (optScorerDoc == curDoc) {
// boost-function score, added on top of the term score above
score += optScorer.score();
}
return score;
}
//reqScorer.score()>>TermScorer.score>>docScorer.score
public float score() throws IOException {
assert docID() != DocIdSetIterator.NO_MORE_DOCS;
return docScorer.score(postingsEnum.docID(), postingsEnum.freq());
}
// term-relevance score computation: TFIDFSimScorer.score
public float score(int doc, float freq) {
final float raw = tf(freq) * weightValue; // compute tf(f)*weight
// norms.get(doc) corresponds to the lengthNorm described earlier: the fewer terms the document has, the larger this value
return norms == null ? raw : raw * decodeNormValue(norms.get(doc)); // normalize for field
}
// boost-function score computation: FunctionQuery >> AllScorer.score
public float score() throws IOException {
float score = qWeight * vals.floatVal(docID());
return score>Float.NEGATIVE_INFINITY ? score : -Float.MAX_VALUE;
}
To sum up, the per-term scoring formula that falls out of the code above is:
score(q,d) = tf(t in d) * idf(t)^2 * boost * queryNorm * lengthNorm, where queryNorm = 1 / Math.sqrt(∑ (idf(t) * boost)^2), and the boost-function (bf) score from ReqOptSumScorer is added on top.
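Putting the numbers from the earlier examples together, here is a hand-rolled end-to-end calculation for the single-term query "chinese" against the document "this book is about chinese" (illustrative only: it ignores coord, the lossy norm encoding and any bf function score):
public class ScoreWalkthrough {
    public static void main(String[] args) {
        // statistics from the three-document example earlier in the article
        long numDocs = 3, docFreq = 1;
        int freq = 1;       // "chinese" occurs once in the matching document
        int numTerms = 5;   // "this book is about chinese" has five terms
        double boost = 1.0;

        double idf = Math.log(numDocs / (double) (docFreq + 1)) + 1.0;  // 1.4054651...
        double sumOfSquaredWeights = (idf * boost) * (idf * boost);     // single-term query
        double queryNorm = 1.0 / Math.sqrt(sumOfSquaredWeights);        // = 1 / idf here
        double tf = Math.sqrt(freq);                                    // 1.0
        double lengthNorm = 1.0 / Math.sqrt(numTerms);                  // 0.4472136...

        // score(q,d) per term = tf * idf^2 * boost * queryNorm * lengthNorm
        double score = tf * idf * idf * boost * queryNorm * lengthNorm;
        System.out.println(score); // ~0.6285, which reduces to idf * lengthNorm in this single-term case
    }
}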
This was written in a hurry, so some details may not be explained as clearly as they should be; to be refined later...