Elasticsearch: unexpected relevancy score for optional fields in documents

Posted 2023-03-12

Asked: 2022-01-08 08:35:34

Question:

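I may be missing something trivial here, but I am running into a relevancy score problem in search results when documents have optional fields. Consider the following example:

Test data: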

DELETE /my-index

PUT /my-index

POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}

Search query:

If I run a search query like the following:

GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}

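Expectation

End users would expect documents #9 and #10 to score higher than the others, because their optional_field contains exactly the two words of the search query.

Reality

Document #1 scores higher than #10, even though it contains only one of the two words in the search query; this is the opposite of what end users would most likely expect.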

A closer look at _explain

Here is the _explain result of running the same search query against document #1:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4816045,
    "description" : "max of:",
    "details" : [
      {
        "value" : 1.4816045,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 1.4816045,
            "description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 1.4816045,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.4816046,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 10,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

Here is the _explain result of running the same search query against document #10:

{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "10",
  "matched" : true,
  "explanation" : {
    "value" : 0.36464313,
    "description" : "max of:",
    "details" : [
      {
        "value" : 0.36464313,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

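As you can see, document #10 scores worse, mainly because of its much lower IDF value (0.18232156 versus 1.4816046 for document #1). Looking closer, this happens because the IDF formula uses N, the total number of documents that have the field (2 for optional_field), rather than the total number of documents in the index (10).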
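To make the arithmetic concrete, here is a small Python sketch that re-computes the scores from the formulas printed in the _explain output. The function names are made up for illustration, and the last value is purely hypothetical: it shows what document #10 would score if N were the index-wide document count (10) instead of the per-field count (2). It is not something Elasticsearch reports.

```python
import math

BOOST = 2.2  # "boost" value reported in the _explain output


def bm25_idf(n: int, N: int) -> float:
    """idf as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """tf as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n: int, N: int, freq: float, dl: float, avgdl: float) -> float:
    """Score of a single term: boost * idf * tf."""
    return BOOST * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document #1: one matching term in required_field (N = 10 documents with the field).
doc1 = term_score(n=2, N=10, freq=1.0, dl=1.0, avgdl=1.0)

# Document #10: two matching terms in optional_field (N = 2 documents with the field).
doc10 = 2 * term_score(n=2, N=2, freq=1.0, dl=2.0, avgdl=2.0)

# Hypothetical: the same two terms scored with the index-wide N = 10.
doc10_if_index_wide = 2 * term_score(n=2, N=10, freq=1.0, dl=2.0, avgdl=2.0)

print(round(doc1, 4), round(doc10, 4), round(doc10_if_index_wide, 4))
# prints: 1.4816 0.3646 2.9632
```

The first two numbers match the _explain values above, which confirms that the gap comes entirely from the idf term: the tf factor is identical (0.4545) for both documents.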

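Question

Is there any way to force the multi_match query to take all documents into account (not just the ones that contain the field) when computing the IDF for an optional field, so that the relevancy scores are closer to what end users expect? Or is there a better way to write the search query so that I get the expected results?

Any help would be greatly appreciated. Thank you.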

Answer 1:

Your case looks similar to the one described for the cross_fields query type, so you should give it a try:

{
  "multi_match": {
    "query": "RareWord AnotherRareWord",
    "fields": ["required_field","optional_field"],
    "type": "cross_fields",
    "operator": "and"
  }
}

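Comments:

Thanks for the reply. cross_fields behaves the same way; if you run the query above, document #1 is missing from the search results only because of the "and" operator. If we change the operator to "or", the problem I described is still there. Your suggestion did remind me to try the combined_fields query, which does not have the same problem but has other limitations that are not ideal for my use case (for example, all fields must use the same analyzer, and I am not sure how to use it to match phrases).

Yes, that makes sense. Another approach is to add a third field to the mapping that collects all the data at index time:

PUT /my-index
{"mappings": {"properties": {"required_field": {"type": "text", "copy_to": "all_fields"}, "optional_field": {"type": "text", "copy_to": "all_fields"}}}}

and then search on that field:

GET /my-index/_search
{"query": {"match": {"all_fields": "RareWord AnotherRareWord"}}}

Using copy_to is a good idea and should work in most cases where the query does not change often. Unfortunately, in my case the list of fields that need to be included in the search query changes depending on the search filters the user selects. So I cannot copy them all into one field, because I do not know in advance which fields need to be searched, if that makes sense. By the way, thank you very much for replying and suggesting solutions.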
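Since the searchable fields are only known at request time, one option consistent with the comments above is to build the request body per request with the combined_fields query, which the asker reports does not show the same scoring problem. The following is a minimal Python sketch under those assumptions: the helper name build_combined_fields_query is hypothetical, it only builds the JSON body (sending it to Elasticsearch is left out), and the limitations mentioned above still apply (all fields must share the same analyzer, and phrase matching is not covered).

```python
import json


def build_combined_fields_query(text, fields, operator="or"):
    """Build a search body that queries the given fields as one combined field.

    Hypothetical helper: `fields` is whatever list the user's filters select.
    """
    if not fields:
        raise ValueError("at least one field is required")
    return {
        "query": {
            "combined_fields": {
                "query": text,
                "fields": list(fields),
                "operator": operator,
            }
        }
    }


# Example: the two fields from the question, selected at request time.
selected = ["required_field", "optional_field"]
body = build_combined_fields_query("RareWord AnotherRareWord", selected)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET /my-index/_search to compare the resulting ranking with the multi_match results from the question.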
