Elasticsearch: Prefix queries
Posted by: Elastic China Community Official Blog
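Prefix queries return documents that contain a given prefix in the specified field. Sometimes we want to search for words by their prefix, such as Leo for Leonardo, or Mar for Marlon Brando, Mark Hamill, or Martin Balsam. Elasticsearch provides the prefix query to fetch records that match the beginning (the prefix) of a word.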
Preparing the data
Example
Let's prepare the data first. We create a movies index as follows:
PUT movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        },
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle_filter"]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "suggest": { "type": "text", "analyzer": "shingle_analyzer" }
        }
      },
      "actors": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "director": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "genre": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "metascore": { "type": "long" },
      "rating": { "type": "float" },
      "revenue": { "type": "float" },
      "runtime": { "type": "long" },
      "votes": { "type": "long" },
      "year": { "type": "long" },
      "title_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      }
    }
  }
}
Next, we use the _bulk command to write some documents into this index. We use the content from this link, loaded as follows:
POST movies/_bulk
{"index": {}}
{"title": "Guardians of the Galaxy", "genre": "Action,Adventure,Sci-Fi", "director": "James Gunn", "actors": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana", "description": "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.", "year": 2014, "runtime": 121, "rating": 8.1, "votes": 757074, "revenue": 333.13, "metascore": 76}
{"index": {}}
{"title": "Prometheus", "genre": "Adventure,Mystery,Sci-Fi", "director": "Ridley Scott", "actors": "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron", "description": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.", "year": 2012, "runtime": 124, "rating": 7, "votes": 485820, "revenue": 126.46, "metascore": 65}
....
For brevity, the remaining documents are omitted above. You need to copy the entire movies.txt file and write all of it to Elasticsearch. It contains 1000 documents.
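The _bulk body is newline-delimited JSON: an action line followed by a document line for each movie, with a newline after the last line. If the documents are available as a list of dictionaries, a minimal Python sketch for producing that body could look like the following. The function name build_bulk_body and the two trimmed sample documents are illustrative assumptions, not part of the original data file.

```python
import json

def build_bulk_body(docs):
    """Return an NDJSON _bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        # Action line; the target index is taken from the request URL (POST movies/_bulk).
        lines.append(json.dumps({"index": {}}))
        # Document source line, kept on a single line as NDJSON requires.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"

# Hypothetical, trimmed sample documents for illustration only.
sample_docs = [
    {"title": "Guardians of the Galaxy", "director": "James Gunn", "year": 2014},
    {"title": "Prometheus", "director": "Ridley Scott", "year": 2012},
]

body = build_bulk_body(sample_docs)
print(body, end="")
```

The resulting string can be pasted under POST movies/_bulk or sent as the request body by any HTTP client.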
Prefix query
We run the query with the following example:
GET movies/_search?filter_path=**.hits
{
  "_source": false,
  "fields": [
    "actors"
  ],
  "query": {
    "prefix": {
      "actors.keyword": {
        "value": "Mar"
      }
    }
  }
}
When we search for the prefix Mar, the query above fetches the movies whose actors value starts with Mar. Note that we run the prefix query on the actors.keyword field, which is a keyword field, so the prefix is compared against the beginning of the whole stored string rather than against each actor's name. The returned result is:
{
  "hits": {
    "hits": [
      {
        "_index": "movies",
        "_id": "RgJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Wahlberg, Michelle Monaghan, J.K. Simmons, John Goodman"] }
      },
      {
        "_index": "movies",
        "_id": "SQJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Wahlberg, Kurt Russell, Douglas M. Griffin, James DuMont"] }
      },
      {
        "_index": "movies",
        "_id": "awJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mario Casas, Ana Wagener, José Coronado, Bárbara Lennie"] }
      },
      {
        "_index": "movies",
        "_id": "ggJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Wahlberg, Nicola Peltz, Jack Reynor, Stanley Tucci"] }
      },
      {
        "_index": "movies",
        "_id": "mgJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Rylance, Ruby Barnhill, Penelope Wilton,Jemaine Clement"] }
      },
      {
        "_index": "movies",
        "_id": "xAJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Ruffalo, Michael Keaton, Rachel McAdams, Liev Schreiber"] }
      },
      {
        "_index": "movies",
        "_id": "3gJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "actors": ["Mark Huberman, Susan Loughnane, Steve Oram,Catherine Walker"] }
      },
      {
        "_index": "movies",
        "_id": "EwJfWIYBfOmyc7Qq5giX",
        "_score": 1,
        "fields": { "actors": ["Martin Freeman, Ian McKellen, Richard Armitage,Andy Serkis"] }
      },
      {
        "_index": "movies",
        "_id": "MQJfWIYBfOmyc7Qq5giX",
        "_score": 1,
        "fields": { "actors": ["Mark Wahlberg, Taylor Kitsch, Emile Hirsch, Ben Foster"] }
      },
      {
        "_index": "movies",
        "_id": "tgJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "fields": { "actors": ["Marilyn Manson, Mark Boone Junior, Sam Quartin, Niko Nicotera"] }
      }
    ]
  }
}
Clearly, every actors value in the list starts with Mar.
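Every hit begins with Mar because a keyword field is indexed as a single unanalyzed term. The following Python sketch is a simplified model of that behavior, not Elasticsearch's implementation: it assumes the default case-sensitive matching and no normalizer on the field, and the helper name keyword_prefix_match is made up for illustration. The sample values are taken from the documents shown earlier.

```python
def keyword_prefix_match(value, prefix):
    """Model of a prefix query on a keyword field.

    The whole string is one term, so the prefix is compared,
    case-sensitively, against the start of the entire value.
    """
    return value.startswith(prefix)

actors_values = [
    "Mark Wahlberg, Michelle Monaghan, J.K. Simmons, John Goodman",
    "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana",
    # Contains "Marshall" in the middle, but does not start with "Mar".
    "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron",
    "Marilyn Manson, Mark Boone Junior, Sam Quartin, Niko Nicotera",
]

matches = [v for v in actors_values if keyword_prefix_match(v, "Mar")]
lowercase_matches = [v for v in actors_values if keyword_prefix_match(v, "mar")]

for value in matches:
    print(value)
```

In this model the Prometheus cast is not matched even though it contains Marshall, and the lowercase prefix mar matches nothing, which is why the query uses the capitalized prefix.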
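Note: a prefix query is an expensive query and can sometimes destabilize the cluster.
We do not have to add an object containing value at the field level. For brevity, you can instead write a shortened version, as shown below: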
GET movies/_search?filter_path=**.hits
{
  "_source": false,
  "fields": [
    "actors"
  ],
  "query": {
    "prefix": {
      "actors.keyword": "Mar"
    }
  }
}
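Because we want to see which fields matched, we add highlighting to the query. We add a highlight block to the prefix query, which marks the matching field or fields, as shown in the listing below.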
GET movies/_search?filter_path=**.hits
{
  "_source": false,
  "query": {
    "prefix": {
      "actors.keyword": "Mar"
    }
  },
  "highlight": {
    "fields": {
      "actors.keyword": {}
    }
  }
}
The search above returns:
{
  "hits": {
    "hits": [
      {
        "_index": "movies",
        "_id": "RgJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Wahlberg, Michelle Monaghan, J.K. Simmons, John Goodman</em>"] }
      },
      {
        "_index": "movies",
        "_id": "SQJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Wahlberg, Kurt Russell, Douglas M. Griffin, James DuMont</em>"] }
      },
      {
        "_index": "movies",
        "_id": "awJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mario Casas, Ana Wagener, José Coronado, Bárbara Lennie</em>"] }
      },
      {
        "_index": "movies",
        "_id": "ggJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Wahlberg, Nicola Peltz, Jack Reynor, Stanley Tucci</em>"] }
      },
      {
        "_index": "movies",
        "_id": "mgJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Rylance, Ruby Barnhill, Penelope Wilton,Jemaine Clement</em>"] }
      },
      {
        "_index": "movies",
        "_id": "xAJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Ruffalo, Michael Keaton, Rachel McAdams, Liev Schreiber</em>"] }
      },
      {
        "_index": "movies",
        "_id": "3gJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Huberman, Susan Loughnane, Steve Oram,Catherine Walker</em>"] }
      },
      {
        "_index": "movies",
        "_id": "EwJfWIYBfOmyc7Qq5giX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Martin Freeman, Ian McKellen, Richard Armitage,Andy Serkis</em>"] }
      },
      {
        "_index": "movies",
        "_id": "MQJfWIYBfOmyc7Qq5giX",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Mark Wahlberg, Taylor Kitsch, Emile Hirsch, Ben Foster</em>"] }
      },
      {
        "_index": "movies",
        "_id": "tgJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "highlight": { "actors.keyword": ["<em>Marilyn Manson, Mark Boone Junior, Sam Quartin, Niko Nicotera</em>"] }
      }
    ]
  }
}
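As noted earlier, prefix queries put extra computational pressure on the cluster when they run. Fortunately, there is a way to speed up these slow prefix queries, covered in the next section.
Speeding up prefix queries
Prefix queries are slow because the engine has to derive the results from a prefix, which can be the beginning of any word. There is, however, a mechanism to speed them up: the index_prefixes parameter on the field.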
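To see where the cost comes from, consider a toy inverted index. This Python sketch is a simplified model under stated assumptions (whitespace tokenization, lowercase terms, an in-memory sorted term list), not how Elasticsearch or Lucene is actually implemented; the function names are illustrative. The point is that a prefix has to be expanded into every term that starts with it, and the postings of all those terms have to be merged.

```python
from bisect import bisect_left

def build_inverted_index(docs):
    """Map each lowercased, whitespace-separated term to the doc ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def prefix_search(index, sorted_terms, prefix):
    """Visit every term that starts with `prefix` and union their postings."""
    start = bisect_left(sorted_terms, prefix)  # first term >= prefix
    matched_terms, doc_ids = [], set()
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms are sorted, so the matching range has ended
        matched_terms.append(term)
        doc_ids |= index[term]
    return matched_terms, doc_ids

docs = {
    1: "Guardians of the Galaxy",
    2: "The Great Gatsby",
    3: "The Hunger Games",
    4: "Prometheus",
}
index = build_inverted_index(docs)
sorted_terms = sorted(index)
terms, hits = prefix_search(index, sorted_terms, "ga")
print(terms, sorted(hits))
```

The shorter the prefix and the larger the vocabulary, the more terms fall into the matching range, so the work grows with the number of matching terms rather than staying constant.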
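We can set the index_prefixes parameter on a field when designing the mapping. For example, the mapping definition in the listing below sets the title field with the additional index_prefixes parameter on new_movies, a new index created for this exercise (remember that title is a text field). We create the new index with the following command: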
PUT new_movies
{
  "settings": {
    "analysis": {
      "analyzer": {
        "en_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        },
        "shingle_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle_filter"]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 3
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_prefixes": {}
      },
      "actors": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "en_analyzer",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "director": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "genre": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "metascore": { "type": "long" },
      "rating": { "type": "float" },
      "revenue": { "type": "float" },
      "runtime": { "type": "long" },
      "votes": { "type": "long" },
      "year": { "type": "long" },
      "title_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "preserve_separators": true,
        "preserve_position_increments": true,
        "max_input_length": 50
      }
    }
  }
}
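The relevant change for new_movies is the index_prefixes entry on the title field. Combined with the analyzer and suggest sub-field from the original movies mapping, the title definition looks like this:
"title": {
  "type": "text",
  "index_prefixes": {},
  "analyzer": "en_analyzer",
  "fields": {
    "suggest": {
      "type": "text",
      "analyzer": "shingle_analyzer"
    }
  }
}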
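As the listing shows, the title property carries an additional index_prefixes property. This tells the engine that, during indexing, it should build the field's prefixes in advance and store them. We use the following request to write data into this index: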
POST _reindex
{
  "source": {
    "index": "movies"
  },
  "dest": {
    "index": "new_movies"
  }
}
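We use reindex to copy the documents from the earlier movies index into new_movies.
Because we set index_prefixes on the title field in the listing above, Elasticsearch by default indexes prefixes with a minimum length of 2 characters and a maximum of 5. That way, when we run a prefix query, the prefixes do not have to be computed at query time; they are read from the index instead.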
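The idea of trading index size for query speed can be sketched in a few lines of Python. This is a simplified model only: it assumes whitespace tokenization and lowercase terms, uses the default lengths 2 and 5 mentioned above, and the function names edge_prefixes and build_prefix_index are made up for illustration. It does not reproduce how Elasticsearch stores prefixes internally.

```python
def edge_prefixes(term, min_chars=2, max_chars=5):
    """Prefixes of `term` with lengths from min_chars to max_chars (inclusive)."""
    upper = min(max_chars, len(term))
    return [term[:n] for n in range(min_chars, upper + 1)]

def build_prefix_index(docs, min_chars=2, max_chars=5):
    """Precompute a prefix -> doc ids table while indexing."""
    prefix_index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            for prefix in edge_prefixes(term, min_chars, max_chars):
                prefix_index.setdefault(prefix, set()).add(doc_id)
    return prefix_index

docs = {1: "Guardians of the Galaxy", 2: "The Great Gatsby", 3: "Prometheus"}
prefix_index = build_prefix_index(docs)

# At query time a prefix within the configured length range is a single lookup.
ga_hits = prefix_index.get("ga", set())
print(edge_prefixes("galaxy"), sorted(ga_hits))
```

In this model a prefix shorter than min_chars or longer than max_chars is simply absent from the table, so such a query cannot use the precomputed entries and would need the slower term expansion.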
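Of course, we can change the default minimum and maximum lengths of the prefixes that Elasticsearch creates for us during indexing. This is done by setting min_chars and max_chars in the index_prefixes object, as shown in the listing below.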
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "index_prefixes": {
          "min_chars": 1,
          "max_chars": 10
        }
      }
    }
  }
}
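In the listing, we ask the engine to pre-build prefixes with a minimum length of 1 character and a maximum of 10. Note that min_chars must be greater than 0 and max_chars must be less than 20. In this way we can customize the prefixes that Elasticsearch pre-builds during indexing.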
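When mappings are generated from code, the two limits can be checked before the request is sent. The sketch below is a hypothetical helper (index_prefixes_mapping is not an Elasticsearch API); it builds the same mapping body as the listing and enforces the limits stated above. The extra check that min_chars does not exceed max_chars is an assumption added here as a sanity check.

```python
import json

def index_prefixes_mapping(field, min_chars=2, max_chars=5):
    """Build a text-field mapping with index_prefixes, validating the limits first."""
    if min_chars <= 0:
        raise ValueError("min_chars must be greater than 0")
    if max_chars >= 20:
        raise ValueError("max_chars must be less than 20")
    if min_chars > max_chars:
        # Assumption: added as a sanity check in this sketch.
        raise ValueError("min_chars must not exceed max_chars")
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "index_prefixes": {
                        "min_chars": min_chars,
                        "max_chars": max_chars,
                    },
                }
            }
        }
    }

mapping = index_prefixes_mapping("full_name", min_chars=1, max_chars=10)
print(json.dumps(mapping, indent=2))
```

The printed JSON can be used as the body of the PUT request for the index.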
We can then run a search like the following on the title field:
GET new_movies/_search?filter_path=**.hits
{
  "_source": false,
  "fields": [
    "title"
  ],
  "query": {
    "prefix": {
      "title": {
        "value": "ga"
      }
    }
  }
}
The search above looks for documents whose title field contains a word starting with ga. It returns the following result:
{
  "hits": {
    "hits": [
      {
        "_index": "new_movies",
        "_id": "BAJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["Guardians of the Galaxy"] }
      },
      {
        "_index": "new_movies",
        "_id": "jQJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["The Great Gatsby"] }
      },
      {
        "_index": "new_movies",
        "_id": "lQJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["Ah-ga-ssi"] }
      },
      {
        "_index": "new_movies",
        "_id": "mwJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["The Hunger Games"] }
      },
      {
        "_index": "new_movies",
        "_id": "sAJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["Beyond the Gates"] }
      },
      {
        "_index": "new_movies",
        "_id": "ygJfWIYBfOmyc7Qq5geX",
        "_score": 1,
        "fields": { "title": ["The Imitation Game"] }
      },
      {
        "_index": "new_movies",
        "_id": "jQJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "fields": { "title": ["Whisky Galore"] }
      },
      {
        "_index": "new_movies",
        "_id": "nAJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "fields": { "title": ["The Hunger Games: Mockingjay - Part 2"] }
      },
      {
        "_index": "new_movies",
        "_id": "1QJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "fields": { "title": ["Sherlock Holmes: A Game of Shadows"] }
      },
      {
        "_index": "new_movies",
        "_id": "2gJfWIYBfOmyc7Qq5giY",
        "_score": 1,
        "fields": { "title": ["American Gangster"] }
      }
    ]
  }
}
Clearly, every returned title contains a word that starts with "ga".
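Unlike the keyword example earlier, title is a text field, so the prefix is compared against each analyzed term rather than against the whole string. That is why titles such as "Ah-ga-ssi" and "The Hunger Games" match. The Python sketch below approximates this; the regular expression is only a rough stand-in for the standard tokenizer plus the lowercase filter, and the function names are illustrative.

```python
import re

def analyze(text):
    """Rough approximation of the standard tokenizer followed by lowercasing."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def text_prefix_match(text, prefix):
    """True if any analyzed term of `text` starts with `prefix`."""
    return any(term.startswith(prefix) for term in analyze(text))

titles = [
    "Guardians of the Galaxy",
    "Ah-ga-ssi",
    "The Hunger Games",
    "Prometheus",
]

matched = [title for title in titles if text_prefix_match(title, "ga")]
print(analyze("Ah-ga-ssi"), matched)
```

In this approximation "Guardians of the Galaxy" matches through the term galaxy, not guardians, and "Prometheus" does not match at all.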