Elasticsearch: unexpected relevancy score for optional fields in documents
Asked: 2022-01-08 08:35:34
I might be missing something trivial here, but I am running into a problem with the relevancy scores of search results when documents contain optional fields. Consider the following example:
Test data:
DELETE /my-index
PUT /my-index
POST /my-index/_bulk
{"index":{"_id":"1"}}
{"required_field":"RareWord"}
{"index":{"_id":"2"}}
{"required_field":"RareWord"}
{"index":{"_id":"3"}}
{"required_field":"CommonWord"}
{"index":{"_id":"4"}}
{"required_field":"CommonWord"}
{"index":{"_id":"5"}}
{"required_field":"CommonWord"}
{"index":{"_id":"6"}}
{"required_field":"CommonWord"}
{"index":{"_id":"7"}}
{"required_field":"CommonWord"}
{"index":{"_id":"8"}}
{"required_field":"CommonWord"}
{"index":{"_id":"9"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
{"index":{"_id":"10"}}
{"required_field":"CommonWord","optional_field":"RareWord AnotherRareWord"}
Search query:
If I run a search query similar to the following:
GET /my-index/_search
{"query":{"multi_match":{"query":"RareWord AnotherRareWord","fields":["required_field","optional_field"]}}}
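Expectation
An end user would expect documents #9 and #10 to score higher than the others, because their optional_field contains exactly the two words of the search query.
Reality
Document #1 scores higher than #10, even though it contains only one of the two words in the search query. This is the opposite of what an end user would most likely expect.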
A closer look at _explain
Here is the _explain result for the same search query run against document #1:
{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "1",
  "matched" : true,
  "explanation" : {
    "value" : 1.4816045,
    "description" : "max of:",
    "details" : [
      {
        "value" : 1.4816045,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 1.4816045,
            "description" : "weight(required_field:rareword in 0) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 1.4816045,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  { "value" : 2.2, "description" : "boost", "details" : [ ] },
                  {
                    "value" : 1.4816046,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] },
                      { "value" : 10, "description" : "N, total number of documents with field", "details" : [ ] }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                      { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                      { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                      { "value" : 1.0, "description" : "dl, length of field", "details" : [ ] },
                      { "value" : 1.0, "description" : "avgdl, average length of field", "details" : [ ] }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
Here is the _explain result for the same search query run against document #10:
{
  "_index" : "my-index",
  "_type" : "_doc",
  "_id" : "10",
  "matched" : true,
  "explanation" : {
    "value" : 0.36464313,
    "description" : "max of:",
    "details" : [
      {
        "value" : 0.36464313,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:rareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  { "value" : 2.2, "description" : "boost", "details" : [ ] },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] },
                      { "value" : 2, "description" : "N, total number of documents with field", "details" : [ ] }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                      { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                      { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                      { "value" : 2.0, "description" : "dl, length of field", "details" : [ ] },
                      { "value" : 2.0, "description" : "avgdl, average length of field", "details" : [ ] }
                    ]
                  }
                ]
              }
            ]
          },
          {
            "value" : 0.18232156,
            "description" : "weight(optional_field:anotherrareword in 9) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 0.18232156,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  { "value" : 2.2, "description" : "boost", "details" : [ ] },
                  {
                    "value" : 0.18232156,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      { "value" : 2, "description" : "n, number of documents containing term", "details" : [ ] },
                      { "value" : 2, "description" : "N, total number of documents with field", "details" : [ ] }
                    ]
                  },
                  {
                    "value" : 0.45454544,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      { "value" : 1.0, "description" : "freq, occurrences of term within document", "details" : [ ] },
                      { "value" : 1.2, "description" : "k1, term saturation parameter", "details" : [ ] },
                      { "value" : 0.75, "description" : "b, length normalization parameter", "details" : [ ] },
                      { "value" : 2.0, "description" : "dl, length of field", "details" : [ ] },
                      { "value" : 2.0, "description" : "avgdl, average length of field", "details" : [ ] }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}
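As you can see, document #10 scores worse, mainly because of its much lower IDF value (0.18232156 versus 1.4816046 for document #1). Looking closer, this happens because the IDF formula uses N, the total number of documents with the field (2 for optional_field), rather than the total number of documents in the index (10).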
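The arithmetic behind the two _explain outputs can be reproduced by hand. The following is a minimal Python sketch that plugs the reported values (boost, n, N, freq, dl, avgdl) into the two formulas printed by _explain. The function names and the "hypothetical" last case are illustrative assumptions, not anything Elasticsearch exposes.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """IDF as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, dl: float, avgdl: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """TF as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


BOOST = 2.2  # boost value reported in both _explain outputs

# Document #1: required_field exists in all 10 documents, 2 of them contain "rareword".
score_doc1 = BOOST * bm25_idf(n=2, N=10) * bm25_tf(freq=1.0, dl=1.0, avgdl=1.0)

# Document #10: optional_field exists in only 2 documents, and both contain each term.
per_term = BOOST * bm25_idf(n=2, N=2) * bm25_tf(freq=1.0, dl=2.0, avgdl=2.0)
score_doc10 = 2 * per_term  # two matching terms, summed

# Hypothetical only: what document #10 would score if N were the index size (10).
hypothetical_doc10 = 2 * BOOST * bm25_idf(n=2, N=10) * bm25_tf(freq=1.0, dl=2.0, avgdl=2.0)

print(f"doc #1:  {score_doc1:.7f}")
print(f"doc #10: {score_doc10:.7f}")
print(f"doc #10 with N=10 (hypothetical): {hypothetical_doc10:.7f}")
```

The first two values agree with the scores in the _explain outputs (1.4816045 and 0.36464313) up to floating-point rounding, and the third shows that the per-field N alone accounts for the ranking inversion.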
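Question
Is there any way to force the multi_match query to consider all documents (not only those that contain the field) when computing the IDF for an optional field, so that the relevancy scores come closer to what an end user expects? Alternatively, is there a better way to write the search query so that I get the expected results?
Any help would be greatly appreciated. Thanks.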
Answer 1:
Your case looks similar to the one described for the cross_fields query type, so you should give it a try:
{
  "multi_match": {
    "query": "RareWord AnotherRareWord",
    "fields": ["required_field","optional_field"],
    "type": "cross_fields",
    "operator": "and"
  }
}
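Comments:
Thanks for your reply. cross_fields behaves the same way. If you run the query above, the only reason document #1 does not appear in the search results is the "and" operator. If we change the operator to "or", the problem I described is still there. Your suggestion did remind me to try the combined_fields query, which does not have the same problem but has other limitations that are not ideal in my use case (for example, all fields need to use the same analyzer, and I am not sure how to use it to match phrases).

Yes, that makes sense. Another approach is to use a third field in the mapping that collects all the data at index time:
PUT /my-index
{
  "mappings": {
    "properties": {
      "required_field": { "type": "text", "copy_to": "all_fields" },
      "optional_field": { "type": "text", "copy_to": "all_fields" }
    }
  }
}
and then search on that field:
GET /my-index/_search
{"query":{"match":{"all_fields":"RareWord AnotherRareWord"}}}

Using copy_to is a good idea and should work in most cases where the query does not change often. Unfortunately, in my case the list of fields that have to be included in the search query changes depending on the search filters the user selects. So I cannot copy them all into a single field, because I do not know in advance which fields need to be searched, if that makes sense. By the way, thank you very much for replying and proposing solutions.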
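To illustrate why the copy_to approach discussed above changes the ranking, here is a rough Python sketch that applies the same BM25 formulas from the _explain output to a single combined field. It assumes the all_fields field simply holds the lowercased tokens of both source fields, and it ignores shard-level statistics and Lucene's compressed field-length encoding, so the numbers are illustrative and not what a real cluster is guaranteed to return. The helper names are hypothetical.

```python
import math

K1, B, BOOST = 1.2, 0.75, 2.2  # values taken from the _explain output


def build_combined_field() -> dict:
    """Tokens of the assumed all_fields field for the ten test documents."""
    docs = {}
    for doc_id in (1, 2):
        docs[doc_id] = ["rareword"]
    for doc_id in range(3, 9):
        docs[doc_id] = ["commonword"]
    for doc_id in (9, 10):
        docs[doc_id] = ["commonword", "rareword", "anotherrareword"]
    return docs


def bm25_score(query_terms: list, doc_id: int, docs: dict) -> float:
    """Sum of boost * idf * tf over the query terms present in the document."""
    N = len(docs)  # every document has the combined field, so N is the index size
    avgdl = sum(len(tokens) for tokens in docs.values()) / N
    dl = len(docs[doc_id])
    score = 0.0
    for term in query_terms:
        freq = docs[doc_id].count(term)
        if freq == 0:
            continue
        n = sum(1 for tokens in docs.values() if term in tokens)
        idf = math.log(1 + (N - n + 0.5) / (n + 0.5))
        tf = freq / (freq + K1 * (1 - B + B * dl / avgdl))
        score += BOOST * idf * tf
    return score


docs = build_combined_field()
query = ["rareword", "anotherrareword"]
scores = {doc_id: bm25_score(query, doc_id, docs) for doc_id in docs}

# Print documents from highest to lowest score.
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 4))
```

Under these assumptions documents #9 and #10 rank above #1 and #2, because both terms are now scored against the same N of 10. The trade-off raised in the last comment still applies: the set of copied fields is fixed at index time.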