Lucene custom scoring for numeric fields

【Title】: Lucene custom scoring for numeric fields 【Posted】: 2011-08-20 23:06:27 【Question】:

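Besides standard term search with tf-idf similarity over a text content field, I would like scoring based on the "similarity" of numeric fields. This similarity would depend on the distance between the value in the query and the value in the document (for example, a Gaussian with m = [user input], s = 0.5).

For example, suppose the documents represent people, and a person document has two fields:

description (full text); age (numeric).

I want to find documents like

description:(x y z) age:30

but with age acting as part of the score rather than as a filter (a multiplier of 1.0 for a 30-year-old, 0.8 for a 25-year-old, and so on).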
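
To make the multipliers concrete: if the weighting is a Gaussian centred on the queried age, a single target such as "0.8 at a distance of 5 years" fixes its width. A minimal sketch in plain Java (no Lucene involved; the class and method names `AgeMultiplier`, `multiplier`, and `sigmaFor` are made up for illustration):

```java
// Hypothetical helper, not part of Lucene: relates a desired multiplier
// at a given distance to the width (sigma) of a Gaussian weighting.
final class AgeMultiplier {

    // Gaussian-shaped multiplier: 1.0 at the peak, decreasing with distance.
    static double multiplier(double age, double peak, double sigma) {
        double d = age - peak;
        return Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    // Solves exp(-distance^2 / (2 * sigma^2)) = m for sigma.
    // Requires distance != 0 and 0 < m < 1.
    static double sigmaFor(double distance, double m) {
        if (distance == 0.0 || m <= 0.0 || m >= 1.0) {
            throw new IllegalArgumentException("need distance != 0 and 0 < m < 1");
        }
        return Math.abs(distance) / Math.sqrt(-2.0 * Math.log(m));
    }

    public static void main(String[] args) {
        double sigma = sigmaFor(5.0, 0.8); // "0.8 for a 25-year-old" when the peak is 30
        System.out.println("sigma = " + sigma);
        for (int age : new int[] {30, 25, 20}) {
            System.out.println(age + " -> " + multiplier(age, 30.0, sigma));
        }
    }
}
```

With these targets the width comes out at roughly 7.5 years, and an age of 20 would get 0.8^4, about 0.41.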

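Can this be achieved in a sensible way?

EDIT: I finally found out that this can be done by wrapping a ValueSourceQuery and a TermQuery in a CustomScoreQuery. See my solution below.

EDIT 2: Given how quickly Lucene versions change, I should add that this was tested on Lucene 3.0 (Java).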

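【Comments】:

【Answer 1】:

OK, here is a (somewhat verbose) proof of concept as a complete JUnit test. I have not yet tested its efficiency on a large index, but from what I have read it should perform well after warm-up, provided there is enough RAM available to cache the numeric field.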

  package tests;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.WhitespaceAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.NumericField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.queryParser.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.search.function.CustomScoreQuery;
  import org.apache.lucene.search.function.IntFieldSource;
  import org.apache.lucene.search.function.ValueSourceQuery;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.RAMDirectory;
  import org.apache.lucene.util.Version;

  import junit.framework.TestCase;

  public class AgeAndContentScoreQueryTest extends TestCase
  {
     public class AgeAndContentScoreQuery extends CustomScoreQuery
     {
        protected float peakX;
        protected float sigma;

        public AgeAndContentScoreQuery(Query subQuery, ValueSourceQuery valSrcQuery, float peakX, float sigma) {
           super(subQuery, valSrcQuery);
           this.setStrict(true); // do not normalize score values from ValueSourceQuery!
           this.peakX = peakX;   // age for which the age-relevance is best
           this.sigma = sigma;   // width (standard deviation) of the age Gaussian, in years
        }

        @Override
        public float customScore(int doc, float subQueryScore, float valSrcScore) {
           // subQueryScore is the tf-idf score from the content query
           float contentScore = subQueryScore;

           // valSrcScore is the value of the date-of-birth field, represented as a float
           // let's convert it to an age and then to a gaussian-like age relevance score
           float x = (2011 - valSrcScore); // age (reference year hard-coded for this test)
           // the divisor 2*sigma^2 must be parenthesized: "/ 2*sigma*sigma" would divide by 2 only
           float ageScore = (float) Math.exp(-Math.pow(x - peakX, 2) / (2 * sigma * sigma));

           float finalScore = ageScore * contentScore;

           System.out.println("#contentScore: " + contentScore);
           System.out.println("#dobValue:     " + (int)valSrcScore);
           System.out.println("#ageScore:     " + ageScore);
           System.out.println("#finalScore:   " + finalScore);
           System.out.println("+++++++++++++++++");

           return finalScore;
        }
     }

     protected Directory directory;
     protected Analyzer analyzer = new WhitespaceAnalyzer();
     protected String fieldNameContent = "content";
     protected String fieldNameDOB = "dob";

     protected void setUp() throws Exception
     {
        directory = new RAMDirectory();
        analyzer = new WhitespaceAnalyzer();

        // indexed documents
        String[] contents = {"foo baz1", "foo baz2 baz3", "baz4"};
        int[] dobs = {1991, 1981, 1987}; // date of birth

        IndexWriter writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < contents.length; i++)
        {
           Document doc = new Document();
           doc.add(new Field(fieldNameContent, contents[i], Field.Store.YES, Field.Index.ANALYZED)); // store & index
           doc.add(new NumericField(fieldNameDOB, Field.Store.YES, true).setIntValue(dobs[i]));      // store & index
           writer.addDocument(doc);
        }
        writer.close();
     }

     public void testSearch() throws Exception
     {
        String inputTextQuery = "foo bar";
        float peak = 27.0f;
        float sigma = 10.0f; // in years; exponent is -(age - peak)^2 / 200

        QueryParser parser = new QueryParser(Version.LUCENE_30, fieldNameContent, analyzer);
        Query contentQuery = parser.parse(inputTextQuery);

        ValueSourceQuery dobQuery = new ValueSourceQuery( new IntFieldSource(fieldNameDOB) );
         // or: FieldScoreQuery dobQuery = new FieldScoreQuery(fieldNameDOB,Type.INT);

        CustomScoreQuery finalQuery = new AgeAndContentScoreQuery(contentQuery, dobQuery, peak, sigma);

        IndexSearcher searcher = new IndexSearcher(directory);
        TopDocs docs = searcher.search(finalQuery, 10);

        System.out.println("\nDocuments found:\n");
        for(ScoreDoc match : docs.scoreDocs)
        {
           Document d = searcher.doc(match.doc);
           System.out.println("CONTENT: " + d.get(fieldNameContent) );
           System.out.println("D.O.B.:  " + d.get(fieldNameDOB) );
           System.out.println("SCORE:   " + match.score );
           System.out.println("-----------------");
        }
        searcher.close();
     }
  }
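
One detail of the exponent is easy to get wrong in Java: `/` and `*` have the same precedence and associate left to right, so `a / 2*sigma*sigma` means `(a / 2) * sigma * sigma`, and the divisor 2σ² needs its own parentheses. The sketch below checks the age scores for the three indexed birth years outside Lucene. The class name `AgeScoreCheck`, the helper names, and the width of 10 years are assumptions for illustration only.

```java
// Stand-alone check of the Gaussian age score; no Lucene required.
// AgeScoreCheck and its method names are hypothetical.
final class AgeScoreCheck {

    // Gaussian in age: exp(-(age - peak)^2 / (2 * sigma^2)).
    static float ageScore(int dob, int referenceYear, float peakAge, float sigma) {
        double d = (referenceYear - dob) - peakAge;
        return (float) Math.exp(-(d * d) / (2.0 * sigma * sigma));
    }

    // Same expression without parentheses around the divisor. Java evaluates
    // it as (-(d*d) / 2) * sigma * sigma, which is not the Gaussian above.
    static float ageScoreUnparenthesized(int dob, int referenceYear, float peakAge, float sigma) {
        double d = (referenceYear - dob) - peakAge;
        return (float) Math.exp(-(d * d) / 2 * sigma * sigma);
    }

    public static void main(String[] args) {
        int[] dobs = {1991, 1981, 1987}; // birth years used in the test index
        for (int dob : dobs) {
            System.out.println(dob
                + ": parenthesized=" + ageScore(dob, 2011, 27.0f, 10.0f)
                + " unparenthesized=" + ageScoreUnparenthesized(dob, 2011, 27.0f, 10.0f));
        }
    }
}
```

For sigma = 1 the two forms happen to agree, which is why the slip can go unnoticed in a quick test.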

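【Comments】:

This generalizes to any number of ValueSourceQuery instances, because CustomScoreQuery has a varargs constructor. The scoring method to override is then public float customScore(int doc, float subQueryScore, float[] valSrcScores).

【Answer 2】:

This can be achieved with Solr's FunctionQuery.

【Comments】: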
