How to make shorter (closer) token match more relevant? (edge_ngram)

Posted 2023-02-24

【Asked】: 2021-02-08 07:35:03

【Question】:

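I'm getting odd results when using the edge_ngram tokenizer for autocomplete, and I'm trying to figure out how to make my results more relevant. I copied the example from the Elasticsearch documentation.

I have documents with the following descriptions:

"Apples, raw, without skin"
"Apples, raw, golden delicious, with skin"
"APPLEBEE'S, chili"
"Babyfood, fruit, applesauce, junior"

If I search for apple, "APPLEBEE'S, chili" scores higher than "Apples, raw, without skin".

If I search for apples, "Babyfood, fruit, applesauce, junior" scores higher than "Apples, raw, golden delicious, with skin".

In both cases I would expect the closer/shorter match to be more relevant and score higher (i.e., when I search for apple or apples, results containing the word apples should score higher than APPLEBEE'S or applesauce).

My settings are: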


{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 20,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "autocomplete",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

Query:

{
  "query": {
    "match": {
      "description": {
        "query": "apple",
        "operator": "and"
      }
    }
  }
}

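How can I make the more relevant results score higher?

【Comments】:

Could you also share your mapping, so we know which analyzer your field uses?

I updated the question, but I used the autocomplete example code for the edge_ngram tokenizer from the documentation linked above.

【Answer 1】:

This happens because of the length of the matched field, called dl, in the BM25 algorithm that is now used for scoring. You can easily see the details by adding the explain parameter to your query: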

http://hostname:port//_search?explain=true
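To see where the field length comes from, the question's `autocomplete` analyzer can be approximated in a few lines of Python. This is a rough sketch, not Elasticsearch's implementation: the helper name `edge_ngrams` is hypothetical, a regular expression stands in for the `letter` character class, and the lowercase and asciifolding filters are reduced to a plain `lower()`. It also assumes that dl is simply the number of indexed tokens.

```python
import re

def edge_ngrams(text, min_gram=2, max_gram=20):
    """Approximate the question's `autocomplete` analyzer (hypothetical helper)."""
    tokens = []
    # token_chars: ["letter"] means any non-letter character splits the text.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        # Emit every prefix from min_gram up to max_gram characters.
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

docs = [
    "Apples, raw, without skin",
    "APPLEBEE'S, chili",
    "Babyfood, fruit, applesauce, junior",
    "Apples, raw, golden delicious, with skin",
]

for doc in docs:
    tokens = edge_ngrams(doc)
    # Every document contains the token "apple", so all four match the query.
    print(len(tokens), "apple" in tokens, doc)
```

Under these assumptions "APPLEBEE'S, chili" produces 11 tokens, fewer than any of the other three documents, while all four contain the token `apple` exactly once.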

Because your "APPLEBEE'S, chili" document has the shortest field length, it scores higher. Here is the tf part of the score for this document:

 
    {
        "value": 0.5344296,
        "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
        "details": [
            {
                "value": 1.0,
                "description": "freq, occurrences of term within document",
                "details": []
            },
            {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
            },
            {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
            },
            {
                "value": 11.0,
                "description": "dl, length of field", ---> note this
                "details": []
            },
            {
                "value": 17.333334,
                "description": "avgdl, average length of field",
                "details": []
            }
        ]
    }
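Plugging the values from the explain output into the quoted formula reproduces the tf value. The snippet below is only a small arithmetic check: the function name `bm25_tf` is illustrative, and the second call uses a hypothetical longer field (dl = 16) purely to show how a larger dl lowers tf.

```python
def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf factor, computed as described in the explain output."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Values taken from the explain output for "APPLEBEE'S, chili".
short_field = bm25_tf(freq=1.0, dl=11.0, avgdl=17.333334)

# Hypothetical longer field with the same term frequency.
longer_field = bm25_tf(freq=1.0, dl=16.0, avgdl=17.333334)

print(short_field)
print(short_field > longer_field)  # True: a shorter field gets a higher tf
```

The first value agrees with the 0.5344296 reported by Elasticsearch to within rounding. With freq, k1, b, and avgdl held fixed, dl is the only term that separates the documents, which is why the shortest field wins.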

Solution

You need to create another field that uses the english analyzer, as shown in the multi-fields example. Here is a complete example.

Index example


{
    "settings": {
        "analysis": {
            "analyzer": {
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": [
                        "lowercase",
                        "asciifolding"
                    ]
                },
                "autocomplete_search": {
                    "tokenizer": "lowercase"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 20,
                    "token_chars": [
                        "letter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "name": {
                "type": "text",
                "analyzer": "autocomplete",
                "search_analyzer": "autocomplete_search",
                "fields": {
                    "english": {
                        "type": "text",
                        "analyzer": "english"
                    }
                }
            }
        }
    }
}

Then index your sample documents:


{
    "name" : "Apples, raw, without skin"
}

{
    "name" : "APPLEBEE'S, chili"
}

{
    "name" : "Babyfood, fruit, applesauce, junior"
}

{
    "name" : "Apples, raw, golden delicious, with skin"
}

Search query


{
    "query": {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": "apple",
                        "fields": [
                            "name.english",
                            "name"
                        ]
                    }
                }
            ]
        }
    }
}

In the search results, note that the documents containing apple score higher:

"hits": [
    {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.6747451,
        "_source": {
            "name": "Apples, raw, without skin"
        }
    },
    {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.60996956,
        "_source": {
            "name": "Apples, raw, golden delicious, with skin"
        }
    },
    {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.12822598,
        "_source": {
            "name": "APPLEBEE'S, chili"
        }
    },
    {
        "_index": "edgelow",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.09446116,
        "_source": {
            "name": "Babyfood, fruit, applesauce, junior"
        }
    }
]

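【Discussion】:

This is great, thank you for the detailed explanation. Is it possible to do this without relying on a language analyzer? This text field can contain words in any language, so I need a solution that is not language-specific.

@orszaczky, yes, you can definitely use the standard analyzer, but it does not stem words (so apple would not match apples). The english analyzer was used to get around that, but only for your sample data. If you use a text field without specifying a language analyzer, the standard analyzer is applied by default, and it works for most use cases.

@orszaczky It has been a while. If the answer helped, it would be great if you could upvote and accept it. Thanks in advance :)

@orszaczky Let me know if you need more information; otherwise it would be great if you could upvote and accept the answer.

I am still looking for a more general solution that is not language-specific, where the closer (shorter) match apples gets higher relevance than the longer match applesauce for my search term. By the way, I have already upvoted the answer; I am just hoping to get (or come up with) a better solution.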
