Solr Source Code Analysis: The edismax Scoring Mechanism

Posted by JAVA技术大揭底



Solr is an enterprise-grade full-text search engine built on Lucene. We use Solr for full-text retrieval at work, and so far we have relied on its default scoring, which is good enough for ordinary relevance ranking. Recently a customer raised new retrieval requirements, which prompted a closer look at Solr's scoring mechanism. The details follow.

Lucene's underlying scoring model is TF/IDF, which is essentially a term-frequency algorithm. When the index is built, every document is tokenized against the base dictionary. At query time the search phrase is tokenized the same way, and the resulting terms are used, through the inverted index, to find the documents that contain them.

Note: TF-IDF works at the level of terms, the smallest tokenization unit, so the tokenizer matters enormously for any statistics-based ranking. If you split Chinese into single characters, all semantic relevance is lost and search degenerates into a fast exact-match scan. The individual edismax parameters are easy to look up online and are not repeated here; a small request example is sketched below. The source code version analyzed in this article is Solr 6.5.0.
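For orientation, here is a minimal SolrJ sketch of an edismax request with the most common parameters. It is a hedged illustration, not something taken from this article: the core URL, the field names (title, content, pub_date) and the boost values are assumptions you would replace with your own schema.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EdismaxQueryDemo {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr core; adjust the URL to your environment.
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build()) {
      SolrQuery q = new SolrQuery("中国人民解放军");
      q.set("defType", "edismax");          // switch the query parser to edismax
      q.set("qf", "title^10 content^1");    // query fields with boosts (assumed fields)
      q.set("pf", "title^20");              // phrase boost on the title field
      q.set("mm", "75%");                   // minimum-should-match
      q.set("bf", "recip(ms(NOW,pub_date),3.16e-11,1,1)"); // additive boost function (assumed field)
      q.set("debugQuery", "true");          // ask Solr to explain each hit's score
      QueryResponse rsp = client.query(q);
      System.out.println(rsp.getResults().getNumFound());
    }
  }
}

Running with debugQuery=true makes Solr return the score explanation for every hit, which is the easiest way to see the factors described below in action.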

The scoring factors in Lucene:

    1) coord(q,d)

This is the number of query terms (after tokenization) that appear in a document; the more of them a document hits, the better it matches. For example, the query "中国人民解放军" might tokenize into terms such as "中国", "中国人民" and "解放军"; a document that hits all of them is more relevant than one that does not. The formula is:

coord(q,d) = hitTerms / totalQueryTerms (hitTerms: the number of query terms that occur in the document; totalQueryTerms: the total number of query terms)

Example: for QUERY("SanFrancisco" OR "NewYork" OR "Paris"), a document A containing all 3 terms gets coord = 3/3; a document containing only 1 of them gets coord = 1/3.
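The ratio can be reproduced by calling ClassicSimilarity directly, the TF-IDF similarity this article traces; a minimal sketch, assuming lucene-core 6.5.0 is on the classpath:

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class CoordDemo {
  public static void main(String[] args) {
    ClassicSimilarity sim = new ClassicSimilarity();
    // query has 3 terms: "SanFrancisco" OR "NewYork" OR "Paris"
    System.out.println(sim.coord(3, 3)); // document matching all 3 terms -> 1.0
    System.out.println(sim.coord(1, 3)); // document matching only 1 term -> 0.33333334
  }
}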

    2) queryNorm(q)

This is the normalization factor derived from the sum of squared weights of the query clauses. It does not affect ranking; it only makes scores from different queries comparable. In other words, for a given query it multiplies every document's score by the same constant, so it never changes the result order; it exists purely to normalize across different queries.

queryNorm(q) = 1 / Math.sqrt(sumOfSquaredWeights), where sumOfSquaredWeights = ∑ ( idf(t) • t.getBoost() )²
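ClassicSimilarity exposes this computation as queryNorm; a minimal sketch with made-up term weights, purely for illustration:

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class QueryNormDemo {
  public static void main(String[] args) {
    ClassicSimilarity sim = new ClassicSimilarity();
    // pretend the query has two terms whose idf * boost weights are 1.4 and 2.0
    float sumOfSquaredWeights = 1.4f * 1.4f + 2.0f * 2.0f;   // = 5.96
    System.out.println(sim.queryNorm(sumOfSquaredWeights));  // 1/sqrt(5.96) ≈ 0.4096
  }
}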

    3) tf(t in d)

Short for term frequency: the number of times a single term occurs in a single document. The frequency is computed and stored when the index is built; the value used for scoring is the square root of that count. For example:

Given a document "this is book about chinese book" and the search term "book", the term's frequency in that document is 2, so tf = sqrt(2) ≈ 1.4142135.
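The same value can be checked against ClassicSimilarity.tf; a minimal sketch:

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class TfDemo {
  public static void main(String[] args) {
    ClassicSimilarity sim = new ClassicSimilarity();
    // "book" occurs twice in "this is book about chinese book"
    System.out.println(sim.tf(2f)); // sqrt(2) = 1.4142135
  }
}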

    4) idf(t)

Inverse document frequency. As the name suggests, it works in the opposite direction from tf and is obtained at query time. It reflects how many documents a term occurs in: a term that appears in many documents contributes little to distinguishing them, so the higher the document frequency, the less the term weighs in the relevance score. The formula is:

idf(t) = 1 + log( numDocs / (docFreq + 1) ), where numDocs is the total number of documents and docFreq is the number of documents containing the term; the larger docFreq is, the smaller idf becomes, and vice versa.

For example, suppose there are three documents:

 1)this book is about english

 2)this book is about chinese

 3)this book is about japan

If the search term is "chinese", then for the second document docFreq = 1 (only one document matches) and numDocs = 3, so idf = Math.log(numDocs / (double)(docFreq + 1)) + 1.0 = ln(3/(1+1)) + 1 = ln(1.5) + 1 ≈ 1.4054651081.
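ClassicSimilarity.idf reproduces this number; a minimal sketch (the arguments are docFreq and the total document count):

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class IdfDemo {
  public static void main(String[] args) {
    ClassicSimilarity sim = new ClassicSimilarity();
    // 3 documents in total, "chinese" occurs in 1 of them
    System.out.println(sim.idf(1, 3)); // ln(3/(1+1)) + 1 = 1.4054651
  }
}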

    5) norm(t,d)

This is the length-related weighting factor; its purpose is to rank shorter documents ahead of longer ones that match equally well. Given two documents, "chinese" and "chinese book", a search for "chinese" ranks the first one higher because it is closer to an exact match. The factor combines the document boost, the field boost and lengthNorm. Unlike t.getBoost(), which can be set dynamically at query time, the f.getBoost() and d.getBoost() inside norm can only be set at index time; changing them requires rebuilding the index. Their values are stored in the .nrm file. The formula is:

norm(t,d) = d.getBoost() • lengthNorm(f) • f.getBoost(), where d.getBoost() is the document boost and f.getBoost() is the field boost; if both are set to 1, norm(t,d) equals lengthNorm.

For example, take the document:

 chinese book

Searching for "chinese", numTerms is 2, so lengthNorm = 1/sqrt(2) ≈ 0.7071067811865475.

6) lengthNorm

This reflects the number of terms a field contains: the more terms it has, the less relevant it is at search time compared with a shorter field that matches the same way. For example, "北京" occurring once in a 10-term title matches better than "北京" occurring twice in a 200-term body; the title wins. The formula is the reciprocal of the square root of the number of terms:

lengthNorm = 1 / Math.sqrt(numTerms) (numTerms is the number of terms in the field). The larger numTerms is, the smaller lengthNorm becomes and the less each term weighs, and vice versa, which is intuitive.
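Since lengthNorm is just the reciprocal square root of the term count, the two length-related values above can be illustrated in plain Java without touching the Lucene API; the boosts default to 1 at index time:

public class LengthNormDemo {
  public static void main(String[] args) {
    int numTerms = 2;                                  // "chinese book" has 2 terms
    double lengthNorm = 1.0 / Math.sqrt(numTerms);     // ≈ 0.7071
    double docBoost = 1.0;                             // d.getBoost(), index-time default
    double fieldBoost = 1.0;                           // f.getBoost(), index-time default
    double norm = docBoost * fieldBoost * lengthNorm;  // norm(t,d) == lengthNorm when boosts are 1
    System.out.println(norm);                          // 0.7071067811865475
  }
}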

Now on to the source code analysis:

First set up a remote debugging environment for Solr with ant and IntelliJ IDEA: download the Solr source, package it into an IDEA project with ant, import it, and step through the code with IDEA's remote debugging. The code base is large, so only the key paths are analyzed here; the rest can be explored on your own.

  1. The entry point of the search code is the handleRequestBody method of the SearchHandler class:

// components is a List; the scoring code lives in QueryComponent.
// First each component's prepare method does the preliminary work on the
// request parameters (mainly building the Query):
for (SearchComponent c : components) {
  c.prepare(rb);
}
// then each component's process method handles the search request:
for (SearchComponent c : components) {
  c.process(rb);
}

  2. QueryComponent's process method is called:

// call SolrIndexSearcher's search method
searcher.search(result, cmd);
rb.setResult(result);

  3. SolrIndexSearcher's search method is called, which in turn calls getDocListC:

private void getDocListC(QueryResult qr, QueryCommand cmd) throws IOException {
  // ... (part of the code omitted) ...
  if (useFilterCache) {
    if (out.docSet == null) {
      out.docSet = getDocSet(cmd.getQuery(), cmd.getFilter());
      DocSet bigFilt = getDocSet(cmd.getFilterList());
      if (bigFilt != null) out.docSet = out.docSet.intersection(bigFilt);
    }
    sortDocSet(qr, cmd);
  } else {
    if ((flags & GET_DOCSET) != 0) {
      DocSet qDocSet = getDocListAndSetNC(qr, cmd);
      if (qDocSet != null && filterCache != null && !qr.isPartialResults())
        filterCache.put(cmd.getQuery(), qDocSet);
    } else {
      // getDocListNC fetches the matched documents
      getDocListNC(qr, cmd);
    }
    assert null != out.docList : "docList is null";
  }
}

// getDocListNC calls buildAndRunCollectorChain
private void getDocListNC(QueryResult qr, QueryCommand cmd) throws IOException {
  // ......
  final TopDocsCollector topCollector = buildTopDocsCollector(len, cmd);
  Collector collector = topCollector;
  // build the required collector chain and run the query against it
  buildAndRunCollectorChain(qr, query, collector, cmd, pf.postFilter);
  totalHits = topCollector.getTotalHits();
  TopDocs topDocs = topCollector.topDocs(0, len);
  populateNextCursorMarkFromTopDocs(qr, cmd, topDocs);
  // ......
}

// buildAndRunCollectorChain calls super.search(query, collector)
private void buildAndRunCollectorChain(QueryResult qr, Query query, Collector collector, QueryCommand cmd,
    DelegatingCollector postFilter) throws IOException {
  // ......
  try {
    // run the search
    super.search(query, collector);
  } catch (TimeLimitingCollector.TimeExceededException | ExitableDirectoryReader.ExitingReaderException x) {
    log.warn("Query: [{}]; {}", query, x.getMessage());
    qr.setPartialResults(true);
  }
  // ......
}

  4. IndexSearcher's search method is then called; the key point is the createNormalizedWeight call inside search, which builds the Weight tree:

// createNormalizedWeight builds the Weight
public void search(Query query, Collector results) throws IOException {
  search(leafContexts, createNormalizedWeight(query, results.needsScores()), results);
}

// create the normalized Weight
public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
  query = rewrite(query);
  // the Weight objects are created recursively here
  Weight weight = createWeight(query, needsScores);
  float v = weight.getValueForNormalization();
  float norm = getSimilarity(needsScores).queryNorm(v);
  if (Float.isInfinite(norm) || Float.isNaN(norm)) {
    norm = 1.0f;
  }
  weight.normalize(norm, 1.0f);
  return weight;
}

// here query is a BooleanQuery, which splits one query into several sub-queries
// (clauses); a Weight object is created for each sub-query
public Weight createWeight(Query query, boolean needsScores) throws IOException {
  final QueryCache queryCache = this.queryCache;
  Weight weight = query.createWeight(this, needsScores);
  if (needsScores == false && queryCache != null) {
    weight = queryCache.doCache(weight, queryCachingPolicy);
  }
  return weight;
}
  • BooleanQuery.createWeight is called to create the Weight for this query; the idf value is fetched, a TermWeight is created, and the queryWeight of each individual term is computed. This is a recursive process: BooleanQuery is a tree structure, the query tree is traversed, and a Weight object is created for every leaf node, producing a Weight tree.

// create the Weight by calling the BooleanWeight constructor
public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
  BooleanQuery query = this;
  if (needsScores == false) {
    query = rewriteNoScoring();
  }
  // initialize the BooleanWeight
  return new BooleanWeight(query, searcher, needsScores, disableCoord);
}

// the BooleanWeight constructor creates the sub-Weights recursively
BooleanWeight(BooleanQuery query, IndexSearcher searcher, boolean needsScores, boolean disableCoord) throws IOException {
  super(query);
  this.query = query;
  this.needsScores = needsScores;
  // ClassicSimilarity by default
  this.similarity = searcher.getSimilarity(needsScores);
  weights = new ArrayList<>();
  int i = 0;
  int maxCoord = 0;
  for (BooleanClause c : query) {
    // recursively create the Weight of each clause's query
    Weight w = searcher.createWeight(c.getQuery(), needsScores && c.isScoring());
    weights.add(w);
    if (c.isScoring()) {
      maxCoord++;
    }
    i += 1;
  }
  // ......
}

// next the weight of a single term is initialized; note that TermWeight is an
// inner class of TermQuery
public TermWeight(IndexSearcher searcher, boolean needsScores, TermContext termStates) throws IOException {
  super(TermQuery.this);
  if (needsScores && termStates == null) {
    throw new IllegalStateException("termStates are required when scores are needed");
  }
  this.needsScores = needsScores;
  this.termStates = termStates;
  this.similarity = searcher.getSimilarity(needsScores);
  final CollectionStatistics collectionStats;
  final TermStatistics termStats;
  if (needsScores) {
    collectionStats = searcher.collectionStatistics(term.field());
    // fetch docFreq and the total document count, from which idf is computed
    termStats = searcher.termStatistics(term, termStates);
  } else {
    final int maxDoc = searcher.getIndexReader().maxDoc();
    collectionStats = new CollectionStatistics(term.field(), maxDoc, -1, -1, -1);
    termStats = new TermStatistics(term.bytes(), maxDoc, -1);
  }
  this.stats = similarity.computeWeight(collectionStats, termStats);
}

// compute idf and return an IDFStats
public final SimWeight computeWeight(CollectionStatistics collectionStats, TermStatistics... termStats) {
  final Explanation idf = termStats.length == 1
      ? idfExplain(collectionStats, termStats[0])
      : idfExplain(collectionStats, termStats);
  return new IDFStats(collectionStats.field(), idf);
}

public IDFStats(String field, Explanation idf) {
  this.field = field;
  this.idf = idf;
  normalize(1f, 1f);
}

// compute queryWeight
public void normalize(float queryNorm, float boost) {
  this.boost = boost;
  this.queryNorm = queryNorm;
  queryWeight = queryNorm * boost * idf.getValue();
  value = queryWeight * idf.getValue(); // idf for document
}
  • Back in the createNormalizedWeight method from step 4, getValueForNormalization walks the Weight tree and sums the squared queryWeight of every node, i.e. sumOfSquaredWeights = ∑ queryWeight^2 (a small numeric illustration follows after the code below).

public Weight createNormalizedWeight(Query query, boolean needsScores) throws IOException {
  query = rewrite(query);
  // recursively build the Weight tree
  Weight weight = createWeight(query, needsScores);
  // v is sumOfSquaredWeights = ∑ queryWeight^2
  float v = weight.getValueForNormalization();
  // norm = 1.0 / Math.sqrt(sumOfSquaredWeights), the reciprocal of the square root
  float norm = getSimilarity(needsScores).queryNorm(v);
  if (Float.isInfinite(norm) || Float.isNaN(norm)) {
    norm = 1.0f;
  }
  // recurse again to compute queryWeight = norm * boost * idf.getValue(); boost defaults to 1.
  // The actual computation is in TermWeight >> IDFStats >> normalize.
  weight.normalize(norm, 1.0f);
  return weight;
}

// BooleanWeight.getValueForNormalization
public float getValueForNormalization() throws IOException {
  float sum = 0.0f;
  int i = 0;
  for (BooleanClause clause : query) {
    // call sumOfSquaredWeights for all clauses in case of side effects
    float s = weights.get(i).getValueForNormalization(); // sum sub weights
    if (clause.isScoring()) {
      // only add to sum for scoring clauses
      sum += s;
    }
    i += 1;
  }
  return sum;
}

// TermQuery.TermWeight.getValueForNormalization
public float getValueForNormalization() {
  return stats.getValueForNormalization();
}

// IDFStats.getValueForNormalization, i.e. queryWeight^2
public float getValueForNormalization() {
  return queryWeight * queryWeight; // sum of squared weights
}
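To make the normalization step concrete, here is a plain-Java illustration of the arithmetic performed by getValueForNormalization, queryNorm and IDFStats.normalize for a single-term query. The numbers reuse the earlier "chinese" example and are not captured from a debugger:

public class NormalizeDemo {
  public static void main(String[] args) {
    float idf = 1.4054651f;                        // from the earlier idf example
    float boost = 1f;                              // query-time boost, default 1
    // sumOfSquaredWeights for one term is (idf * boost)^2, so queryNorm = 1 / (idf * boost)
    float queryNorm = (float) (1.0 / Math.sqrt(idf * boost * idf * boost));
    float queryWeight = queryNorm * boost * idf;   // ≈ 1.0 for a single-term query
    float value = queryWeight * idf;               // the weightValue used at scoring time, ≈ 1.4054651
    System.out.println(queryWeight + " " + value);
  }
}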

6. Compute the score of each matched document and sort the hits by that score.

IndexSearcher >> search >> weight.bulkScorer(ctx) >> scorer.score computes the score of every document (this is the core of the process):

protected void search(List<LeafReaderContext> leaves, Weight weight, Collector collector)
    throws IOException {
  for (LeafReaderContext ctx : leaves) {
    final LeafCollector leafCollector;
    try {
      leafCollector = collector.getLeafCollector(ctx);
    } catch (CollectionTerminatedException e) {
      continue;
    }
    // build a DefaultBulkScorer
    BulkScorer scorer = weight.bulkScorer(ctx);
    if (scorer != null) {
      try {
        scorer.score(leafCollector, ctx.reader().getLiveDocs());
      } catch (CollectionTerminatedException e) {
      }
    }
  }
}

// DefaultBulkScorer >> score >> scoreAll
public int score(LeafCollector collector, Bits acceptDocs, int min, int max) throws IOException {
  collector.setScorer(scorer);
  if (scorer.docID() == -1 && min == 0 && max == DocIdSetIterator.NO_MORE_DOCS) {
    scoreAll(collector, iterator, twoPhase, acceptDocs);
    return DocIdSetIterator.NO_MORE_DOCS;
  } else {
    int doc = scorer.docID();
    if (doc < min) {
      if (twoPhase == null) {
        doc = iterator.advance(min);
      } else {
        doc = twoPhase.approximation().advance(min);
      }
    }
    return scoreRange(collector, iterator, twoPhase, acceptDocs, doc, max);
  }
}

// scoreAll >> collector.collect(doc)
static void scoreAll(LeafCollector collector, DocIdSetIterator iterator, TwoPhaseIterator twoPhase, Bits acceptDocs) throws IOException {
  if (twoPhase == null) {
    // iterator.nextDoc() is the key: it walks the term's postings (inverted) list
    for (int doc = iterator.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = iterator.nextDoc()) {
      if (acceptDocs == null || acceptDocs.get(doc)) {
        collector.collect(doc);
      }
    }
  } else {
    final DocIdSetIterator approximation = twoPhase.approximation();
    for (int doc = approximation.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = approximation.nextDoc()) {
      if ((acceptDocs == null || acceptDocs.get(doc)) && twoPhase.matches()) {
        collector.collect(doc);
      }
    }
  }
}

// SimpleTopScoreDocCollector >> collect
// SimpleTopScoreDocCollector instantiates an anonymous ScorerLeafCollector subclass
public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
  final int docBase = context.docBase;
  return new ScorerLeafCollector() {
    @Override
    public void collect(int doc) throws IOException {
      float score = scorer.score();
      assert score != Float.NEGATIVE_INFINITY;
      assert !Float.isNaN(score);
      totalHits++;
      if (score <= pqTop.score) {
        return;
      }
      pqTop.doc = doc + docBase;
      pqTop.score = score;
      pqTop = pq.updateTop();
    }
  };
}

// ReqOptSumScorer >> score: the term-relevance score and the boost-function score
// are computed and then summed
public float score() throws IOException {
  int curDoc = reqScorer.docID();
  // term-relevance score
  float score = reqScorer.score();
  int optScorerDoc = optIterator.docID();
  if (optScorerDoc < curDoc) {
    optScorerDoc = optIterator.advance(curDoc);
  }
  if (optScorerDoc == curDoc) {
    // boost-function score, added to the term score above
    score += optScorer.score();
  }
  return score;
}

// reqScorer.score() >> TermScorer.score >> docScorer.score
public float score() throws IOException {
  assert docID() != DocIdSetIterator.NO_MORE_DOCS;
  return docScorer.score(postingsEnum.docID(), postingsEnum.freq());
}

// term-relevance score: TFIDFSimScorer.score
public float score(int doc, float freq) {
  final float raw = tf(freq) * weightValue; // compute tf(f)*weight
  // norms.get(doc) corresponds to the lengthNorm above: the fewer terms the field has, the larger it is
  return norms == null ? raw : raw * decodeNormValue(norms.get(doc)); // normalize for field
}

// boost-function score: FunctionQuery >> AllScorer.score
public float score() throws IOException {
  float score = qWeight * vals.floatVal(docID());
  return score > Float.NEGATIVE_INFINITY ? score : -Float.MAX_VALUE;
}

To summarize, the scoring formula implied by the code above is:

score(q,d) = ∑ over matched terms t of ( tf(t in d) • idf(t)² • boost • queryNorm • lengthNorm ), where queryNorm = 1/Math.sqrt( ∑ ( idf(t) • boost )² ). idf appears twice per term because queryWeight = queryNorm • boost • idf and the document-side value multiplies by idf again (see IDFStats.normalize above); when a boost function is supplied, ReqOptSumScorer adds its value for the document on top of this sum.
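Putting the pieces together for the earlier "chinese" example (3 documents, docFreq 1, a 2-term field, all boosts left at 1), here is a minimal sketch of the per-term score under ClassicSimilarity; it is an illustration of the formula above, not output captured from Solr:

import org.apache.lucene.search.similarities.ClassicSimilarity;

public class ScoreDemo {
  public static void main(String[] args) {
    ClassicSimilarity sim = new ClassicSimilarity();

    float boost = 1f;                             // query-time boost, default 1
    float idf = sim.idf(1, 3);                    // docFreq=1 out of 3 docs -> 1.4054651
    // single-term query: sumOfSquaredWeights = (idf * boost)^2
    float sumOfSquaredWeights = (idf * boost) * (idf * boost);
    float queryNorm = sim.queryNorm(sumOfSquaredWeights);  // = 1 / (idf * boost)

    float tf = sim.tf(1f);                                  // "chinese" occurs once -> 1.0
    float lengthNorm = (float) (1.0 / Math.sqrt(2));        // "chinese book" has 2 terms

    // per-term score; at search time Lucene decodes a stored 8-bit norm instead of
    // the exact lengthNorm, so the real score differs slightly
    float score = tf * idf * idf * boost * queryNorm * lengthNorm;
    System.out.println(score);                              // ≈ 0.9938
  }
}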

Time was short, so some details may not be explained as clearly as they deserve; this will be refined in a follow-up...
