Elasticsearch: Keep words token filter

Posted by Elastic 中国社区官方博客 (the official Elastic China community blog)

The keep words token filter keeps only the tokens that appear in a specified word list, no matter how many other tokens the input text produces. In some cases a field contains many words, but turning every one of them into a token may not be useful. The filter uses Lucene's KeepWordFilter, and it behaves as the exact opposite of the familiar stop filter: stop removes the listed words, while keep removes everything except the listed words. For more on the stop filter, see my earlier article “Elasticsearch:分词器中的 token 过滤器使用示例”.
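
To make the contrast concrete, here is a minimal sketch of the inverse operation: the same word list passed to the stop filter removes exactly the tokens that keep would retain (the shortened text below is illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets"
}

This request returns the tokens A, who, steals and secrets, that is, everything except the listed words.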

Example

The following _analyze API request uses the keep filter to keep only the "thief", "corporate", "technology" and "project" tokens. Note that the list also contains "elephant", but since that word never occurs in the text, it cannot appear in the output:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

The above command returns:


  "tokens": [
    
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    ,
    
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    ,
    
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    ,
    
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    
  ]

As the result shows, even though the text field contains a long piece of text, the response includes only those keep_words from the keep filter that actually occur in the text. Without the keep filter, the same request returns the following:

GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

The above command returns:


  "tokens": [
    
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    ,
    
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    ,
    
      "token": "who",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    ,
    
      "token": "steals",
      "start_offset": 12,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    ,
    
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    ,
    
      "token": "secrets",
      "start_offset": 29,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    ,
    
      "token": "through",
      "start_offset": 37,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 6
    ,
    
      "token": "the",
      "start_offset": 45,
      "end_offset": 48,
      "type": "<ALPHANUM>",
      "position": 7
    ,
    
      "token": "use",
      "start_offset": 49,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 8
    ,
    
      "token": "of",
      "start_offset": 53,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 9
    ,
    
      "token": "dream",
      "start_offset": 56,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 10
    ,
    
      "token": "sharing",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 11
    ,
    
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    ,
    
      "token": "is",
      "start_offset": 81,
      "end_offset": 83,
      "type": "<ALPHANUM>",
      "position": 13
    ,
    
      "token": "given",
      "start_offset": 84,
      "end_offset": 89,
      "type": "<ALPHANUM>",
      "position": 14
    ,
    
      "token": "the",
      "start_offset": 90,
      "end_offset": 93,
      "type": "<ALPHANUM>",
      "position": 15
    ,
    
      "token": "inverse",
      "start_offset": 94,
      "end_offset": 101,
      "type": "<ALPHANUM>",
      "position": 16
    ,
    
      "token": "task",
      "start_offset": 102,
      "end_offset": 106,
      "type": "<ALPHANUM>",
      "position": 17
    ,
    
      "token": "of",
      "start_offset": 107,
      "end_offset": 109,
      "type": "<ALPHANUM>",
      "position": 18
    ,
    
      "token": "planting",
      "start_offset": 110,
      "end_offset": 118,
      "type": "<ALPHANUM>",
      "position": 19
    ,
    
      "token": "an",
      "start_offset": 119,
      "end_offset": 121,
      "type": "<ALPHANUM>",
      "position": 20
    ,
    
      "token": "idea",
      "start_offset": 122,
      "end_offset": 126,
      "type": "<ALPHANUM>",
      "position": 21
    ,
    
      "token": "into",
      "start_offset": 127,
      "end_offset": 131,
      "type": "<ALPHANUM>",
      "position": 22
    ,
    
      "token": "the",
      "start_offset": 132,
      "end_offset": 135,
      "type": "<ALPHANUM>",
      "position": 23
    ,
    
      "token": "mind",
      "start_offset": 136,
      "end_offset": 140,
      "type": "<ALPHANUM>",
      "position": 24
    ,
    
      "token": "of",
      "start_offset": 141,
      "end_offset": 143,
      "type": "<ALPHANUM>",
      "position": 25
    ,
    
      "token": "a",
      "start_offset": 144,
      "end_offset": 145,
      "type": "<ALPHANUM>",
      "position": 26
    ,
    
      "token": "C.E.O",
      "start_offset": 146,
      "end_offset": 151,
      "type": "<ALPHANUM>",
      "position": 27
    ,
    
      "token": "but",
      "start_offset": 154,
      "end_offset": 157,
      "type": "<ALPHANUM>",
      "position": 28
    ,
    
      "token": "his",
      "start_offset": 158,
      "end_offset": 161,
      "type": "<ALPHANUM>",
      "position": 29
    ,
    
      "token": "tragic",
      "start_offset": 162,
      "end_offset": 168,
      "type": "<ALPHANUM>",
      "position": 30
    ,
    
      "token": "past",
      "start_offset": 169,
      "end_offset": 173,
      "type": "<ALPHANUM>",
      "position": 31
    ,
    
      "token": "may",
      "start_offset": 174,
      "end_offset": 177,
      "type": "<ALPHANUM>",
      "position": 32
    ,
    
      "token": "doom",
      "start_offset": 178,
      "end_offset": 182,
      "type": "<ALPHANUM>",
      "position": 33
    ,
    
      "token": "the",
      "start_offset": 183,
      "end_offset": 186,
      "type": "<ALPHANUM>",
      "position": 34
    ,
    
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    ,
    
      "token": "and",
      "start_offset": 195,
      "end_offset": 198,
      "type": "<ALPHANUM>",
      "position": 36
    ,
    
      "token": "his",
      "start_offset": 199,
      "end_offset": 202,
      "type": "<ALPHANUM>",
      "position": 37
    ,
    
      "token": "team",
      "start_offset": 203,
      "end_offset": 207,
      "type": "<ALPHANUM>",
      "position": 38
    ,
    
      "token": "to",
      "start_offset": 208,
      "end_offset": 210,
      "type": "<ALPHANUM>",
      "position": 39
    ,
    
      "token": "disaster",
      "start_offset": 211,
      "end_offset": 219,
      "type": "<ALPHANUM>",
      "position": 40
    
  ]

Clearly, this token list is much longer than the one produced with the keep filter in place.

Adding the filter to an index

We can define an index as follows:

PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "keep_words": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

We can test it with the following command:

GET keep_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

Configuration parameters

keep_words

(Required*, array of strings) A list of words to keep. Only tokens that match words in this list are included in the output.

Either this parameter or keep_words_path must be specified.

keep_words_path

(Required*, string) Path to a file that contains the list of words to keep. Only tokens that match words in this list are included in the output.

The path must be absolute or relative to the Elasticsearch config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.

Either this parameter or keep_words must be specified.

keep_words_case

(Optional, Boolean) If true, lowercase all words in the keep list. Defaults to false.
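
keep_words_case matters when the keep list is written in mixed case. In the following sketch (the mixed-case list is illustrative), the lowercase filter lowercases the tokens, and "keep_words_case": true lowercases the keep words as well, so "Thief" and "Corporate" still match:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": [ "Thief", "Corporate" ],
      "keep_words_case": true
    }
  ],
  "text": "A thief who steals corporate secrets"
}

Without "keep_words_case": true, the lowercased tokens would not match the mixed-case list entries and the output would be empty.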

In practice, a keep_words list can get fairly long. Inlining it in the request is inconvenient and hard to read, so we can instead put the list into a file referenced by keep_words_path, for example:

PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}

As shown above, we place example_word_list.txt at the following location inside the Elasticsearch installation directory:

$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt 
thief
corporate
technology
project
elephant
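
Once the file is in place, we can verify the file-based filter the same way. Assuming the word list above, only "thief" and "corporate" should survive:

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief who steals corporate secrets"
}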
