Elasticsearch:创建一个简单的 “你的意思是?” 推荐搜索
Posted Elastic 中国社区官方博客
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了Elasticsearch:创建一个简单的 “你的意思是?” 推荐搜索相关的知识,希望对你有一定的参考价值。
“你的意思是” 是搜索引擎中一个非常重要的功能,因为它们通过显示建议的术语来帮助用户,以便他可以进行更准确的搜索。比如,在百度中,我们进行搜索时,它通常会显示一些更为常用推荐的搜索选项来供我们选择:
为了创建 “你的意思是”,我们将使用 phrase suggester,因为通过它我们将能够建议句子更正,而不仅仅是术语。在我之前的文章 “Elasticsearch:如何实现短语建议 - phrase suggester”,我有涉及到这个问题。
首先,我们将使用一个 shingle 过滤器,因为它将提供一个分词,短语建议器将使用该标记来进行匹配并返回更正。有关 shingle 过滤器的描述,请阅读之前的文章 “Elasticsearch: Ngrams, edge ngrams, and shingles”。
准备数据
我们首先来定义映射:
PUT movies
"settings":
"analysis":
"analyzer":
"en_analyzer":
"tokenizer": "standard",
"filter": [
"lowercase",
"stop"
]
,
"shingle_analyzer":
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"shingle_filter"
]
,
"filter":
"shingle_filter":
"type": "shingle",
"min_shingle_size": 2,
"max_shingle_size": 3
,
"mappings":
"properties":
"title":
"type": "text",
"analyzer": "en_analyzer",
"fields":
"suggest":
"type": "text",
"analyzer": "shingle_analyzer"
,
"actors":
"type": "text",
"analyzer": "en_analyzer",
"fields":
"keyword":
"type": "keyword",
"ignore_above": 256
,
"description":
"type": "text",
"analyzer": "en_analyzer",
"fields":
"keyword":
"type": "keyword",
"ignore_above": 256
,
"director":
"type": "text",
"fields":
"keyword":
"type": "keyword",
"ignore_above": 256
,
"genre":
"type": "text",
"fields":
"keyword":
"type": "keyword",
"ignore_above": 256
,
"metascore":
"type": "long"
,
"rating":
"type": "float"
,
"revenue":
"type": "float"
,
"runtime":
"type": "long"
,
"votes":
"type": "long"
,
"year":
"type": "long"
,
"title_suggest":
"type": "completion",
"analyzer": "simple",
"preserve_separators": true,
"preserve_position_increments": true,
"max_input_length": 50
我们接下来使用 _bulk 命令来写入一些文档到这个索引中去。我们使用这个链接中的内容。我们使用如下的方法:
POST movies/_bulk
"index":
"title": "Guardians of the Galaxy", "genre": "Action,Adventure,Sci-Fi", "director": "James Gunn", "actors": "Chris Pratt, Vin Diesel, Bradley Cooper, Zoe Saldana", "description": "A group of intergalactic criminals are forced to work together to stop a fanatical warrior from taking control of the universe.", "year": 2014, "runtime": 121, "rating": 8.1, "votes": 757074, "revenue": 333.13, "metascore": 76
"index":
"title": "Prometheus", "genre": "Adventure,Mystery,Sci-Fi", "director": "Ridley Scott", "actors": "Noomi Rapace, Logan Marshall-Green, Michael Fassbender, Charlize Theron", "description": "Following clues to the origin of mankind, a team finds a structure on a distant moon, but they soon realize they are not alone.", "year": 2012, "runtime": 124, "rating": 7, "votes": 485820, "revenue": 126.46, "metascore": 65
....
在上面,为了说明的方便,我省去了其它的文档。你需要把整个 movies.txt 的文件拷贝过来,并全部写入到 Elasticsearch 中。它共有1000 个文档。
搜索数据
现在让我们运行一个基本查询来查看 suggest 的结果:
GET movies/_search?filter_path=suggest
"suggest":
"text": "transformers revenge of the falen",
"did_you_mean":
"phrase":
"field": "title.suggest",
"size": 5
上面命令显示的结果为:
"suggest":
"did_you_mean": [
"text": "transformers revenge of the falen",
"offset": 0,
"length": 33,
"options": [
"text": "transformers revenge of the fallen",
"score": 0.004467494
,
"text": "transformers revenge of the fall",
"score": 0.00020402104
,
"text": "transformers revenge of the face",
"score": 0.00006419608
]
]
请注意,在几行中你已经获得了一些有希望的结果。
现在让我们通过使用更多短语建议功能来增加我们的查询。让我们使用 max_errors = 2,这样我们希望句子中最多有两个术语。 添加了 highlight 显示以突出显示建议的术语。
GET movies/_search?filter_path=suggest
"suggest":
"text": "transformer revenge of the falen",
"did_you_mean":
"phrase":
"field": "title.suggest",
"size": 5,
"confidence": 1,
"max_errors":2,
"highlight":
"pre_tag": "<strong>",
"post_tag": "</strong>"
上面命令返回的结果为:
"suggest":
"did_you_mean": [
"text": "transformer revenge of the falen",
"offset": 0,
"length": 32,
"options": [
"text": "transformers revenge of the fallen",
"highlighted": "<strong>transformers</strong> revenge of the <strong>fallen</strong>",
"score": 0.004382903
,
"text": "transformers revenge of the fall",
"highlighted": "<strong>transformers</strong> revenge of the <strong>fall</strong>",
"score": 0.00020015794
,
"text": "transformers revenge of the face",
"highlighted": "<strong>transformers</strong> revenge of the <strong>face</strong>",
"score": 0.00006298054
,
"text": "transformers revenge of the falen",
"highlighted": "<strong>transformers</strong> revenge of the falen",
"score": 0.00006159308
,
"text": "transformer revenge of the fallen",
"highlighted": "transformer revenge of the <strong>fallen</strong>",
"score": 0.000048000533
]
]
我们再改进一点好吗? 我们添加了 “collate”,我们可以对每个结果执行查询,改进建议的结果。 我使用了带有 “and” 运算符的匹配项,以便在同一个句子中匹配所有术语。 如果我仍然想要不符合查询条件的结果,我使用 prune = true。
GET movies/_search?filter_path=suggest
"suggest":
"text": "transformer revenge of the falen",
"did_you_mean":
"phrase":
"field": "title.suggest",
"size": 5,
"confidence": 1,
"max_errors":2,
"collate":
"query":
"source" :
"match":
"field_name":
"query": "suggestion",
"operator": "and"
,
"params": "field_name" : "title",
"prune" :true
,
"highlight":
"pre_tag": "<strong>",
"post_tag": "</strong>"
现在的结果是:
请注意,答案已更改,我有一个新字段 “collate_match”,它指示结果中是否匹配整理规则(这是因为 prune = true)。
让我们设置 prune 为 false:
GET movies/_search?filter_path=suggest
"suggest":
"text": "transformer revenge of the falen",
"did_you_mean":
"phrase":
"field": "title.suggest",
"size": 5,
"confidence": 1,
"max_errors":2,
"collate":
"query":
"source" :
"match":
"field_name":
"query": "suggestion",
"operator": "and"
,
"params": "field_name" : "title",
"prune" :false
,
"highlight":
"pre_tag": "<strong>",
"post_tag": "</strong>"
这次我们得到的结果是:
我们可以看到只有一个结果是最相关的建议。
以上是关于Elasticsearch:创建一个简单的 “你的意思是?” 推荐搜索的主要内容,如果未能解决你的问题,请参考以下文章