Elasticsearch: unexpected relevancy score for optional fields in documents

Posted: 2022-01-08 08:35:34

Question:

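I may be missing something trivial here, but I am running into a problem with the relevancy scores of search results when documents contain optional fields. Consider the following example:

Test data: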

DELETE /my-index

PUT /my-index

POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}

Search query:

If I run a search query like the following:

GET /my-index/_search
{
  "query": {
    "multi_match": {
      "query": "RareWord AnotherRareWord",
      "fields": ["required_field", "optional_field"]
    }
  }
}

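Expectation

An end user would expect documents #9 and #10 to score higher than the others, because their optional_field contains exactly the two words of the search query.

Reality

Document #1 scores higher than #10, even though it contains only one of the two words in the search query; this is the opposite of what an end user would most likely expect.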

A closer look at _explain

Here is the _explain result for the same search query run against document #1:


{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4816045,
    "description" : "max of:",
    "details" : [
      {
        "value" : 1.4816045,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 1.4816045,
            "description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 1.4816045,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.4816046,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 10,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

Here is the _explain result for the same search query run against document #10:


{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "10",
  "matched" : true,
  "explanation" : {
    "value" : 0.36464313,
    "description" : "max of:",
    "details" : [
      {
        "value" : 0.36464313,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 2,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 2,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

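As you can see, document #10 scores worse, mainly because of its lower IDF value (0.18232156). Looking closer, this is because the IDF uses N, the total number of documents with the field (2), rather than simply the total number of documents in the index (10).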
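To make the arithmetic concrete, here is a minimal Python sketch that recomputes both scores from the formulas printed in the _explain output. The function names (bm25_idf, bm25_tf, term_score) are made up for illustration and are not part of any Elasticsearch API. The last calculation is purely hypothetical: it shows what document #10 would score if N for optional_field were 10, which is not what Elasticsearch does.

```python
import math

BOOST = 2.2  # boost value reported by _explain


def bm25_idf(n, N):
    """idf as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def term_score(n, N, freq, dl, avgdl):
    """Score of one term in one field: boost * idf * tf."""
    return BOOST * bm25_idf(n, N) * bm25_tf(freq, dl, avgdl)


# Document #1: one matching term in required_field (n=2, N=10).
score_doc1 = term_score(n=2, N=10, freq=1, dl=1, avgdl=1)

# Document #10: two matching terms in optional_field (n=2, N=2 each), summed.
score_doc10 = 2 * term_score(n=2, N=2, freq=1, dl=2, avgdl=2)

# Hypothetical only: the same two terms if N were the index size (10).
hypothetical_doc10 = 2 * term_score(n=2, N=10, freq=1, dl=2, avgdl=2)

print(f"doc #1:  {score_doc1:.7f}")
print(f"doc #10: {score_doc10:.7f}")
print(f"doc #10 with N=10 (hypothetical): {hypothetical_doc10:.7f}")
```

The first two values reproduce the scores in the _explain output (about 1.4816 and 0.3646). Since tf and boost are the same in both cases, the gap comes entirely from the N used in the IDF term.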

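Question

Is there any way to force the multi_match query to consider all documents (not only those that contain the field) when computing the IDF value for an optional field, and so produce relevancy scores closer to what an end user expects? Alternatively, is there a better way to write the search query so that I get the expected results?

Any help would be greatly appreciated. Thank you.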

Answer 1:

Your case looks similar to the one described for the cross_fields query type, so you should give that a try:


{
  "multi_match": {
    "query": "RareWord AnotherRareWord",
    "fields": ["required_field", "optional_field"],
    "type": "cross_fields",
    "operator": "and"
  }
}

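Comments:

Thanks for the reply. cross_fields behaves the same way; the only reason document #1 does not appear in the results when you run the query above is the "and" operator. If we change the operator to "or", the problem I described is still there. Your suggestion did remind me to try the combined_fields query, which does not have the same problem but has other limitations that are not ideal for my use case (for example, all fields need to use the same analyzer, and I am not sure how to use it to match phrases).

Yes, that makes sense. Another approach is to use a third field in the mapping to collect all the data at index time:

PUT /my-index
{
  "mappings": {
    "properties": {
      "required_field": {
        "type": "text",
        "copy_to": "all_fields"
      },
      "optional_field": {
        "type": "text",
        "copy_to": "all_fields"
      }
    }
  }
}

and then use it for searching:

GET /my-index/_search
{
  "query": {
    "match": {
      "all_fields": "RareWord AnotherRareWord"
    }
  }
}

Using copy_to is a good idea and should work in most cases where the query does not change often. Unfortunately, in my case the list of fields that need to be included in the search query changes depending on the search filters the user selects. So I cannot copy them all into one field, because I do not know in advance which fields need to be searched, if that makes sense. By the way, thanks a lot for replying and suggesting solutions.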
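For the case raised in the comments, where the list of fields is only known at query time, a combined_fields request body can be assembled per search instead of relying on a fixed copy_to target. The following is a rough Python sketch under that assumption; build_combined_fields_query is a hypothetical helper, not part of any client library, and the limitations mentioned in the comments (same analyzer on every field, unclear phrase matching) still apply.

```python
import json


def build_combined_fields_query(text, fields, operator="or"):
    """Build a search body that uses the combined_fields query.

    `fields` is whatever list the user's filters selected for this search.
    """
    if not fields:
        raise ValueError("at least one field is required")
    return {
        "query": {
            "combined_fields": {
                "query": text,
                "fields": list(fields),
                "operator": operator,
            }
        }
    }


# Example: the two fields from the question, chosen at query time.
selected_fields = ["required_field", "optional_field"]
body = build_combined_fields_query("RareWord AnotherRareWord", selected_fields)

# The printed JSON can be sent as the body of GET /my-index/_search.
print(json.dumps(body, indent=2))
```

Whether the resulting ranking matches the expectation for documents #9 and #10 would still need to be checked with _explain on the test index.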
