Elasticsearch: Keep words token filter
Posted by Elastic 中国社区官方博客 (the official Elastic China Community blog)
The keep words token filter keeps only the tokens that appear in a specified word list, no matter how many tokens the text actually produces. This is useful when a field contains many words but only a handful of them are worth turning into tokens. The filter uses Lucene's KeepWordFilter, and its behavior is exactly the opposite of the familiar stop filter. For the stop filter, see my earlier article "Elasticsearch:分词器中的 token 过滤器使用示例".
Example
The following _analyze API request uses the keep filter to keep only the "thief", "corporate", "technology" and "project" tokens:
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
The command above returns:
"tokens": [
"token": "thief",
"start_offset": 2,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
,
"token": "corporate",
"start_offset": 19,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 4
,
"token": "technology",
"start_offset": 70,
"end_offset": 80,
"type": "<ALPHANUM>",
"position": 12
,
"token": "project",
"start_offset": 187,
"end_offset": 194,
"type": "<ALPHANUM>",
"position": 35
]
From this result we can see that even though the text field contains a long passage, the response includes only the tokens that match entries in the keep filter's keep_words list ("elephant" is in the list but not in the text, so it does not appear). Without the keep filter, the request looks like this:
GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
The command above returns:
"tokens": [
"token": "A",
"start_offset": 0,
"end_offset": 1,
"type": "<ALPHANUM>",
"position": 0
,
"token": "thief",
"start_offset": 2,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
,
"token": "who",
"start_offset": 8,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
,
"token": "steals",
"start_offset": 12,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
,
"token": "corporate",
"start_offset": 19,
"end_offset": 28,
"type": "<ALPHANUM>",
"position": 4
,
"token": "secrets",
"start_offset": 29,
"end_offset": 36,
"type": "<ALPHANUM>",
"position": 5
,
"token": "through",
"start_offset": 37,
"end_offset": 44,
"type": "<ALPHANUM>",
"position": 6
,
"token": "the",
"start_offset": 45,
"end_offset": 48,
"type": "<ALPHANUM>",
"position": 7
,
"token": "use",
"start_offset": 49,
"end_offset": 52,
"type": "<ALPHANUM>",
"position": 8
,
"token": "of",
"start_offset": 53,
"end_offset": 55,
"type": "<ALPHANUM>",
"position": 9
,
"token": "dream",
"start_offset": 56,
"end_offset": 61,
"type": "<ALPHANUM>",
"position": 10
,
"token": "sharing",
"start_offset": 62,
"end_offset": 69,
"type": "<ALPHANUM>",
"position": 11
,
"token": "technology",
"start_offset": 70,
"end_offset": 80,
"type": "<ALPHANUM>",
"position": 12
,
"token": "is",
"start_offset": 81,
"end_offset": 83,
"type": "<ALPHANUM>",
"position": 13
,
"token": "given",
"start_offset": 84,
"end_offset": 89,
"type": "<ALPHANUM>",
"position": 14
,
"token": "the",
"start_offset": 90,
"end_offset": 93,
"type": "<ALPHANUM>",
"position": 15
,
"token": "inverse",
"start_offset": 94,
"end_offset": 101,
"type": "<ALPHANUM>",
"position": 16
,
"token": "task",
"start_offset": 102,
"end_offset": 106,
"type": "<ALPHANUM>",
"position": 17
,
"token": "of",
"start_offset": 107,
"end_offset": 109,
"type": "<ALPHANUM>",
"position": 18
,
"token": "planting",
"start_offset": 110,
"end_offset": 118,
"type": "<ALPHANUM>",
"position": 19
,
"token": "an",
"start_offset": 119,
"end_offset": 121,
"type": "<ALPHANUM>",
"position": 20
,
"token": "idea",
"start_offset": 122,
"end_offset": 126,
"type": "<ALPHANUM>",
"position": 21
,
"token": "into",
"start_offset": 127,
"end_offset": 131,
"type": "<ALPHANUM>",
"position": 22
,
"token": "the",
"start_offset": 132,
"end_offset": 135,
"type": "<ALPHANUM>",
"position": 23
,
"token": "mind",
"start_offset": 136,
"end_offset": 140,
"type": "<ALPHANUM>",
"position": 24
,
"token": "of",
"start_offset": 141,
"end_offset": 143,
"type": "<ALPHANUM>",
"position": 25
,
"token": "a",
"start_offset": 144,
"end_offset": 145,
"type": "<ALPHANUM>",
"position": 26
,
"token": "C.E.O",
"start_offset": 146,
"end_offset": 151,
"type": "<ALPHANUM>",
"position": 27
,
"token": "but",
"start_offset": 154,
"end_offset": 157,
"type": "<ALPHANUM>",
"position": 28
,
"token": "his",
"start_offset": 158,
"end_offset": 161,
"type": "<ALPHANUM>",
"position": 29
,
"token": "tragic",
"start_offset": 162,
"end_offset": 168,
"type": "<ALPHANUM>",
"position": 30
,
"token": "past",
"start_offset": 169,
"end_offset": 173,
"type": "<ALPHANUM>",
"position": 31
,
"token": "may",
"start_offset": 174,
"end_offset": 177,
"type": "<ALPHANUM>",
"position": 32
,
"token": "doom",
"start_offset": 178,
"end_offset": 182,
"type": "<ALPHANUM>",
"position": 33
,
"token": "the",
"start_offset": 183,
"end_offset": 186,
"type": "<ALPHANUM>",
"position": 34
,
"token": "project",
"start_offset": 187,
"end_offset": 194,
"type": "<ALPHANUM>",
"position": 35
,
"token": "and",
"start_offset": 195,
"end_offset": 198,
"type": "<ALPHANUM>",
"position": 36
,
"token": "his",
"start_offset": 199,
"end_offset": 202,
"type": "<ALPHANUM>",
"position": 37
,
"token": "team",
"start_offset": 203,
"end_offset": 207,
"type": "<ALPHANUM>",
"position": 38
,
"token": "to",
"start_offset": 208,
"end_offset": 210,
"type": "<ALPHANUM>",
"position": 39
,
"token": "disaster",
"start_offset": 211,
"end_offset": 219,
"type": "<ALPHANUM>",
"position": 40
]
Clearly, this list is much longer than the one we got with the keep filter applied.
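As noted at the start, keep is the inverse of stop. For comparison, here is a sketch that feeds the same word list to a stop filter instead; it removes exactly the four tokens that the keep filter retained and keeps everything else:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

This returns every token from the long list above except thief, corporate, technology and project.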
Adding the filter to an index
We can define an index as follows:
PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "keep_words": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
We can test it with the following command:
GET keep_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
Configuration parameters
Parameter | Description
---|---
keep_words | (Required*, array of strings) List of words to keep. Only tokens that match words in this list are included in the output. Either this parameter or keep_words_path must be specified.
keep_words_path | (Required*, string) Path to a file that contains a list of words to keep. Only tokens that match words in this list are included in the output. This path must be absolute or relative to the Elasticsearch config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break. Either this parameter or keep_words must be specified.
keep_words_case | (Optional, Boolean) If true, lowercase all keep words. Defaults to false.
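keep_words_case deserves a quick illustration. In the sketch below (the word list is chosen just for this example), setting it to true lowercases the mixed-case entries in keep_words, so they match the lowercase tokens produced by the standard tokenizer:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "Thief", "Corporate" ],
      "keep_words_case": true
    }
  ],
  "text": "A thief who steals corporate secrets"
}

This returns the tokens thief and corporate. With the default keep_words_case of false, the mixed-case list would match neither token.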
In practice the keep_words list can be quite long. Putting it inline in the request is inconvenient and hard to read, so we can place the list in a file and reference it with keep_words_path, for example:
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
As shown above, we place example_word_list.txt at the following location under the Elasticsearch installation's config directory (the file must already exist when the index is created):
$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt
thief
corporate
technology
project
elephant
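We can then verify that the file-based filter behaves exactly like the inline list:

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

As with keep_example, only the tokens thief, corporate, technology and project come back.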