Elasticsearch: Rank feature query

Posted 2022-02-04 Elastic China Community Official Blog

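In an earlier article, I used the distance_feature query to boost documents in search results by location and time. Both are very useful in practice. When we search for news, for example, we are most interested in what happened recently, so we want the most recent documents ranked first. When we search for a place, we want documents close to our own location ranked first, such as nearby restaurants, even though the index may contain two restaurants with the same name.

So if an index has a numeric field, how can we boost documents by it in a similar way? For example, we may want to rank documents that have never taken part in an auction ahead of the others, so that those documents get more exposure.

The way to meet this kind of requirement is the rank feature query.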

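Introduction to the rank feature query

The rank_feature query boosts the relevance score of documents based on the numeric value of a rank_feature or rank_features field.

The rank_feature query is typically used in the should clause of a bool query, so its relevance score is added to the other scores of the bool query.

If you set positive_score_impact to false for a rank_feature or rank_features field, we recommend that every document that participates in the query has a value for that field. Otherwise, if a rank_feature query is used in the should clause, it adds nothing to the score of a document with a missing value but adds some boost to a document that contains the feature. This is the opposite of what we want: because we consider these features negative, we want documents that contain them to rank lower than documents that lack them. The examples below illustrate this further.

Unlike the function_score query or other ways of changing relevance scores, the rank_feature query efficiently skips non-competitive hits when the track_total_hits parameter is not true. This can significantly improve query speed.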

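Rank feature functions

To calculate relevance scores from rank feature fields, the rank_feature query supports the following mathematical functions:

saturation
log
sigmoid
linear

If you don't know where to start, we recommend the saturation function. If no function is provided, the rank_feature query uses the saturation function by default.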
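
To make these functions more concrete, here is a minimal Python sketch of the formulas as I understand them from the Elasticsearch reference documentation, where S is the value of the rank feature field. The helper names (saturation, log_score, sigmoid, linear) and the sample parameter values are chosen only for illustration and are not part of any Elasticsearch API; the real scoring happens inside Elasticsearch and Lucene.

```python
import math


def saturation(value: float, pivot: float) -> float:
    """S / (S + pivot): always between 0 and 1, and 0.5 when S equals pivot."""
    return value / (value + pivot)


def log_score(value: float, scaling_factor: float) -> float:
    """log(scaling_factor + S): keeps growing with S, but slowly."""
    return math.log(scaling_factor + value)


def sigmoid(value: float, pivot: float, exponent: float) -> float:
    """S^exp / (S^exp + pivot^exp): like saturation, with tunable steepness."""
    return value**exponent / (value**exponent + pivot**exponent)


def linear(value: float) -> float:
    """S itself: the feature value is used directly as the score."""
    return value


# pagerank values of the three sample documents indexed in this article.
# pivot=60, scaling_factor=4 and exponent=0.6 are arbitrary illustration values.
for pagerank in (50.3, 60.3, 70.3):
    print(
        pagerank,
        round(saturation(pagerank, pivot=60), 4),
        round(log_score(pagerank, scaling_factor=4), 4),
        round(sigmoid(pagerank, pivot=60, exponent=0.6), 4),
        linear(pagerank),
    )
```

In this sketch, saturation and sigmoid stay below 1 no matter how large the feature value gets, which is one reason saturation is a safe starting point. With an exponent of 1, sigmoid reduces to saturation.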

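Example

To use the rank_feature query, your index must include a rank_feature or rank_features field mapping. To see how to set up an index for the rank_feature query, try the following example.

Create a test index with the following field mappings:

pagerank, a rank_feature field that measures the importance of a website
url_length, a rank_feature field that contains the length of the website's URL. For this example, a long URL correlates negatively with relevance: the longer the URL, the less relevant the page and the lower the final score. This is indicated by a positive_score_impact value of false.
topics, a rank_features field that contains a list of topics and a measure of how strongly each document is connected to each topic

First, we create the test index as follows: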

PUT /test
{
  "mappings": {
    "properties": {
      "pagerank": {
        "type": "rank_feature"
      },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      },
      "topics": {
        "type": "rank_features"
      }
    }
  }
}

Above, we defined the pagerank, url_length and topics fields. Their types are rank_feature, rank_feature and rank_features, respectively.

Next, we index a few documents as follows:

PUT /test/_doc/1?refresh
{
  "url": "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
  "content": "Rio 2016",
  "pagerank": 50.3,
  "url_length": 42,
  "topics": {
    "sports": 50,
    "brazil": 30
  }
}

PUT /test/_doc/2?refresh
{
  "url": "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
  "content": "Formula One motor race held on 13 November 2016",
  "pagerank": 60.3,
  "url_length": 47,
  "topics": {
    "sports": 35,
    "formula one": 65,
    "brazil": 20
  }
}

PUT /test/_doc/3?refresh
{
  "url": "https://en.wikipedia.org/wiki/Deadpool_(film)",
  "content": "Deadpool is a 2016 American superhero film",
  "pagerank": 70.3,
  "url_length": 37,
  "topics": {
    "movies": 60,
    "super hero": 65
  }
}

From the documents above, we can see that although pagerank is defined as a rank_feature field, its actual values are floating-point numbers.
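
One practical consequence: as far as I know, rank_feature and rank_features fields only accept strictly positive numbers, and Elasticsearch rejects other values at index time. The following is a small Python sketch of a client-side check you could run before indexing. The function name invalid_rank_features and the way the field list is passed in are hypothetical helpers made up for this article, not part of any Elasticsearch client.

```python
from typing import Iterable, List, Tuple


def invalid_rank_features(doc: dict, feature_fields: Iterable[str]) -> List[Tuple[str, object]]:
    """Return (name, value) pairs that are not strictly positive numbers."""
    problems = []
    for field in feature_fields:
        if field not in doc:
            continue  # a missing feature is allowed; it simply adds no boost
        raw = doc[field]
        # A rank_features field holds an object of feature name -> value.
        items = raw.items() if isinstance(raw, dict) else [(None, raw)]
        for key, value in items:
            name = field if key is None else f"{field}.{key}"
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if not is_number or value <= 0:
                problems.append((name, value))
    return problems


# Document 1 from this article: every feature value is positive.
doc1 = {
    "content": "Rio 2016",
    "pagerank": 50.3,
    "url_length": 42,
    "topics": {"sports": 50, "brazil": 30},
}
print(invalid_rank_features(doc1, ["pagerank", "url_length", "topics"]))
```

An empty list means the document passes this check; anything else names the offending feature and its value.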

Let's first run the following search:

GET /test/_search?filter_path=**hits
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank"
          }
        }
      ]
    }
  }
}

The search results are as follows:

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.6440702,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.6440702,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/Deadpool_(film)",
          "content" : "Deadpool is a 2016 American superhero film",
          "pagerank" : 70.3,
          "url_length" : 37,
          "topics" : {
            "movies" : 60,
            "super hero" : 65
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.5969556,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
          "content" : "Rio 2016",
          "pagerank" : 50.3,
          "url_length" : 42,
          "topics" : {
            "sports" : 50,
            "brazil" : 30
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.59679806,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
          "content" : "Formula One motor race held on 13 November 2016",
          "pagerank" : 60.3,
          "url_length" : 47,
          "topics" : {
            "sports" : 35,
            "formula one" : 65,
            "brazil" : 20
          }
        }
      }
    ]
  }
}

These results are not very telling on their own. Let's run another search, this time without the should clause:

GET /test/_search?filter_path=**hits
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ]
    }
  }
}

The results of this search are:

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.18360566,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.18360566,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
          "content" : "Rio 2016",
          "pagerank" : 50.3,
          "url_length" : 42,
          "topics" : {
            "sports" : 50,
            "brazil" : 30
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.12500812,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/Deadpool_(film)",
          "content" : "Deadpool is a 2016 American superhero film",
          "pagerank" : 70.3,
          "url_length" : 37,
          "topics" : {
            "movies" : 60,
            "super hero" : 65
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.110856235,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
          "content" : "Formula One motor race held on 13 November 2016",
          "pagerank" : 60.3,
          "url_length" : 47,
          "topics" : {
            "sports" : 35,
            "formula one" : 65,
            "brazil" : 20
          }
        }
      }
    ]
  }
}

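These scores should look familiar. Because all three documents contain 2016, all three are returned, but the content of the document with id "1" is the shortest, so it gets the highest score. Likewise, the content of the document with id "3" is shorter than that of the document with id "2", so it ranks second. For how the score is calculated, see my article “Elasticsearch：分布式计分” (Elasticsearch: distributed scoring).

Compare this with the earlier search that included the should clause: introducing should changed the order of the results from 1, 3, 2 to 3, 1, 2. This is understandable, because the should clause contains a rank_feature query. By default, positive_score_impact is true, which means that the larger the pagerank value, the more relevant the document. Of our three documents, the one with id "3" has the largest pagerank value, so it gets the largest boost to its overall score.

In fact, we can even add the boost parameter to increase the importance of this rank_feature: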

GET /test/_search?filter_path=**hits
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank",
            "boost": 2.0
          }
        }
      ]
    }
  }
}

Above, we added the boost parameter and set it to 2.0. The results of this search are:

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.2109984,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.2109984,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/Deadpool_(film)",
          "content" : "Deadpool is a 2016 American superhero film",
          "pagerank" : 70.3,
          "url_length" : 37,
          "topics" : {
            "movies" : 60,
            "super hero" : 65
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.1202803,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
          "content" : "Formula One motor race held on 13 November 2016",
          "pagerank" : 60.3,
          "url_length" : 47,
          "topics" : {
            "sports" : 35,
            "formula one" : 65,
            "brazil" : 20
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.1024628,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
          "content" : "Rio 2016",
          "pagerank" : 50.3,
          "url_length" : 42,
          "topics" : {
            "sports" : 50,
            "brazil" : 30
          }
        }
      }
    ]
  }
}

We can see that the final scores are now sorted by pagerank in descending order. This clearly shows that rank_feature does take effect in the search.
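
To see why the boost changes the order, here is a rough Python sketch that adds boost * saturation(pagerank) to the scores of the match-only search. It rests on two assumptions of mine: that the default saturation function is S / (S + pivot), and that the default pivot is close to the geometric mean of the pagerank values in the index. The names ranked_ids, PAGERANK and MATCH_SCORE are made up for this illustration. Only the order is meant to line up with Elasticsearch, not the exact scores.

```python
from statistics import geometric_mean

# Values copied from this article: pagerank per document id, and the _score
# of the match-only search (the one without the should clause).
PAGERANK = {"1": 50.3, "2": 60.3, "3": 70.3}
MATCH_SCORE = {"1": 0.18360566, "2": 0.110856235, "3": 0.12500812}

# Assumption: the default pivot is roughly the geometric mean of the values.
PIVOT = geometric_mean(list(PAGERANK.values()))


def saturation(value: float, pivot: float) -> float:
    """S / (S + pivot), the assumed default rank feature function."""
    return value / (value + pivot)


def ranked_ids(boost: float) -> list:
    """Document ids ordered by match score + boost * saturation(pagerank)."""
    total = {
        doc_id: MATCH_SCORE[doc_id] + boost * saturation(PAGERANK[doc_id], PIVOT)
        for doc_id in PAGERANK
    }
    return sorted(total, key=total.get, reverse=True)


print(ranked_ids(1.0))
print(ranked_ids(2.0))
```

Under these assumptions the order is 3, 1, 2 with a boost of 1.0 and 3, 2, 1 with a boost of 2.0, the same orders as in the two result lists above. With the larger boost, the pagerank part outweighs the small differences between the match scores.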

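In practice, we do not always want a larger value to mean higher relevance. Sometimes a smaller value means higher relevance, as in the case mentioned at the start of this article, where we wanted to boost documents that had never taken part in an auction. In this example, we consider a smaller url_length to mean higher relevance. For this case, we must set positive_score_impact to false in the field mapping, as we did for url_length when creating the index.

We run the following search: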

GET /test/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 3.0
          }
        }
      ]
    }
  }
}

The results of this search are:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.7130321,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.7130321,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/Deadpool_(film)",
          "content" : "Deadpool is a 2016 American superhero film",
          "pagerank" : 70.3,
          "url_length" : 37,
          "topics" : {
            "movies" : 60,
            "super hero" : 65
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.6778586,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
          "content" : "Rio 2016",
          "pagerank" : 50.3,
          "url_length" : 42,
          "topics" : {
            "sports" : 50,
            "brazil" : 30
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.519763,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
          "content" : "Formula One motor race held on 13 November 2016",
          "pagerank" : 60.3,
          "url_length" : 47,
          "topics" : {
            "sports" : 35,
            "formula one" : 65,
            "brazil" : 20
          }
        }
      }
    ]
  }
}

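From the results above, we can see that the documents are ranked 3, 1, 2, and their url_length values are 37, 42 and 47 respectively. The ranking follows url_length in ascending order; in other words, the smaller the url_length value, the higher the relevance.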
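
One way to picture the effect of positive_score_impact being false is to imagine that the reciprocal 1 / url_length is scored instead of the value itself, so that shorter URLs get a larger feature. The Python sketch below models it that way. This reciprocal model, the geometric-mean pivot, and the names INVERTED and ranked_ids are my assumptions for illustration, not a description of the exact Elasticsearch internals.

```python
from statistics import geometric_mean

# Values copied from this article: url_length per document id, and the _score
# of the match-only search (the one without the should clause).
URL_LENGTH = {"1": 42, "2": 47, "3": 37}
MATCH_SCORE = {"1": 0.18360566, "2": 0.110856235, "3": 0.12500812}

# Assumption: a field with positive_score_impact false behaves as if the
# reciprocal of the value were the feature, so smaller values score higher.
INVERTED = {doc_id: 1.0 / length for doc_id, length in URL_LENGTH.items()}

# Assumption: the default pivot is roughly the geometric mean of the features.
PIVOT = geometric_mean(list(INVERTED.values()))


def saturation(value: float, pivot: float) -> float:
    """S / (S + pivot), the assumed default rank feature function."""
    return value / (value + pivot)


def ranked_ids(boost: float) -> list:
    """Document ids ordered by match score + boost * saturation(1 / url_length)."""
    total = {
        doc_id: MATCH_SCORE[doc_id] + boost * saturation(INVERTED[doc_id], PIVOT)
        for doc_id in URL_LENGTH
    }
    return sorted(total, key=total.get, reverse=True)


print(ranked_ids(3.0))
```

With the boost of 3.0 used in the query above, this model gives the order 3, 1, 2, which agrees with the search result: the shortest URL receives the largest boost.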

Finally, we can also query a rank_features field in combination with the others:

GET /test/_search?filter_path=**hits
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "2016"
          }
        }
      ],
      "should": [
        {
          "rank_feature": {
            "field": "pagerank"
          }
        },
        {
          "rank_feature": {
            "field": "url_length",
            "boost": 0.1
          }
        },
        {
          "rank_feature": {
            "field": "topics.sports",
            "boost": 0.4
          }
        }
      ]
    }
  }
}

Above, we use the form topics.sports to access an entry in the rank_features field. The results of this search are:

{
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.90905887,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.90905887,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Summer_Olympics",
          "content" : "Rio 2016",
          "pagerank" : 50.3,
          "url_length" : 42,
          "topics" : {
            "sports" : 50,
            "brazil" : 30
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.843177,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/2016_Brazilian_Grand_Prix",
          "content" : "Formula One motor race held on 13 November 2016",
          "pagerank" : 60.3,
          "url_length" : 47,
          "topics" : {
            "sports" : 35,
            "formula one" : 65,
            "brazil" : 20
          }
        }
      },
      {
        "_index" : "test",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.7209374,
        "_source" : {
          "url" : "https://en.wikipedia.org/wiki/Deadpool_(film)",
          "content" : "Deadpool is a 2016 American superhero film",
          "pagerank" : 70.3,
          "url_length" : 37,
          "topics" : {
            "movies" : 60,
            "super hero" : 65
          }
        }
      }
    ]
  }
}

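These results are also fairly easy to understand. Document "3" does not contain topics.sports, so it gets no boost from that clause and scores lowest. Document "1" has the shortest content text and the highest topics.sports value, so it scores highest.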
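
The combined query can be approximated with a short Python sketch that sums the match score and the three should clauses. As before, this is a model built on my assumptions rather than the actual Elasticsearch implementation: the default function is taken to be saturation, each pivot is taken to be the geometric mean of the values present for that feature, and url_length is scored through its reciprocal because of positive_score_impact being false. The names DOCS and feature_scores are hypothetical.

```python
from statistics import geometric_mean

# Per document: the match-only _score from this article plus the feature values.
DOCS = {
    "1": {"match": 0.18360566, "pagerank": 50.3, "url_length": 42, "sports": 50},
    "2": {"match": 0.110856235, "pagerank": 60.3, "url_length": 47, "sports": 35},
    "3": {"match": 0.12500812, "pagerank": 70.3, "url_length": 37},  # no topics.sports
}


def saturation(value: float, pivot: float) -> float:
    """S / (S + pivot), the assumed default rank feature function."""
    return value / (value + pivot)


def feature_scores(field: str, boost: float, invert: bool = False) -> dict:
    """boost * saturation(feature) per document; 0.0 where the feature is missing."""
    values = {
        doc_id: (1.0 / doc[field] if invert else float(doc[field]))
        for doc_id, doc in DOCS.items()
        if field in doc
    }
    # Assumption: default pivot ~ geometric mean of the values that exist.
    pivot = geometric_mean(list(values.values()))
    return {
        doc_id: (boost * saturation(values[doc_id], pivot) if doc_id in values else 0.0)
        for doc_id in DOCS
    }


clauses = [
    feature_scores("pagerank", 1.0),
    feature_scores("url_length", 0.1, invert=True),
    feature_scores("sports", 0.4),
]
TOTAL = {
    doc_id: DOCS[doc_id]["match"] + sum(clause[doc_id] for clause in clauses)
    for doc_id in DOCS
}
RANKED = sorted(TOTAL, key=TOTAL.get, reverse=True)
print(RANKED)
```

In this model the order comes out as 1, 2, 3, the same as in the search result, and the totals land roughly near the reported scores. Document "3" has the largest pagerank and the shortest URL, but the missing topics.sports feature contributes nothing, which is enough to push it to the bottom.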
