Chinese Character Completion and Spelling Correction with Elasticsearch
Posted by 赵广陆
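1 What We Build with ES
Chinese character completion
Spelling correction
2 Product Search and Autocomplete
Elasticsearch offers four kinds of suggester:
Term suggester: tokenizes the input text and offers term suggestions for each token.
Phrase suggester: builds on the term suggester and also considers how the individual terms relate to each other.
Completion suggester: aimed squarely at the "auto completion" use case.
Context suggester: a completion suggester whose results can be filtered or boosted by context.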
GET product_completion_index/_search
{
  "from": 0,
  "size": 100,
  "suggest": {
    "czbk-suggest": {
      "prefix": "小米",
      "completion": {
        "field": "searchkey",
        "size": 20,
        "skip_duplicates": true
      }
    }
  }
}
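To make the `prefix`, `size` and `skip_duplicates` options more concrete, here is a minimal in-memory sketch of what a prefix completion returns. It is an illustration only and not how Elasticsearch implements the completion suggester internally; the class name `PrefixCompletionSketch`, the method `suggest` and the sample corpus are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixCompletionSketch {

    /** Returns up to `size` distinct corpus entries that start with `prefix`, in sorted order. */
    static List<String> suggest(List<String> corpus, String prefix, int size) {
        // A TreeSet sorts the entries and drops duplicates (the role of skip_duplicates).
        TreeSet<String> sorted = new TreeSet<>(corpus);
        List<String> result = new ArrayList<>();
        // tailSet(prefix) starts at the first entry >= prefix, so all matches are contiguous.
        for (String entry : sorted.tailSet(prefix)) {
            if (!entry.startsWith(prefix) || result.size() >= size) {
                break;
            }
            result.add(entry);
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes for the two-character brand prefix used in the query above.
        String xiaomi = "\u5c0f\u7c73";
        // Hypothetical corpus with one duplicate entry and one non-matching entry.
        List<String> corpus = List.of(xiaomi + "10", xiaomi + "8", "huawei", xiaomi + "10");
        System.out.println(suggest(corpus, xiaomi, 20));
    }
}
```

The sketch orders matches lexically; in a real index the suggester also takes per-document weights into account.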
2.1 Chinese Character Completion OpenAPI
2.1.1 Define the autocomplete interface
package com.oldlu.service;

import com.oldlu.commons.pojo.CommonEntity;
import org.elasticsearch.action.DocWriteResponse;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.rest.RestStatus;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.suggest.completion.CompletionSuggestion;
import java.util.List;
import java.util.Map;

/**
 * @Class: ElasticsearchDocumentService
 * @Package com.oldlu.service
 * @Description: Document operations interface
 * @Company: http://www.oldlu.com/
 */
public interface ElasticsearchDocumentService {
    // Autocomplete (completion suggester)
    public List<String> cSuggest(CommonEntity commonEntity) throws Exception;
}
2.1.2 Define the autocomplete implementation
/*
 * @Description: Autocomplete. Suggests words or phrases based on what the user has typed.
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.util.List<java.lang.String>
 *
 */
@Override
public List<String> cSuggest(CommonEntity commonEntity) throws Exception {
    // Result list
    List<String> suggestList = new ArrayList<>();
    // Build the search request against the target index
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Sort by score, descending
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the completion suggester on the field to search
    CompletionSuggestionBuilder completionSuggestionBuilder =
            new CompletionSuggestionBuilder(commonEntity.getSuggestFileld());
    // The prefix typed by the user
    completionSuggestionBuilder.prefix(commonEntity.getSuggestValue());
    // Drop duplicate suggestions
    completionSuggestionBuilder.skipDuplicates(true);
    // Maximum number of suggestions
    completionSuggestionBuilder.size(commonEntity.getSuggestCount());
    // "czbk-suggest" is the key under which all suggestions are returned; it can be hard-coded
    searchSourceBuilder.suggest(
            new SuggestBuilder().addSuggestion("czbk-suggest", completionSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the completion suggestion from the response
    CompletionSuggestion completionSuggestion =
            suggestResponse.getSuggest().getSuggestion("czbk-suggest");
    // Guard against a missing suggestion or an empty entry list before reading the first entry
    if (completionSuggestion == null || completionSuggestion.getEntries().isEmpty()) {
        return suggestList;
    }
    List<CompletionSuggestion.Entry.Option> optionsList =
            completionSuggestion.getEntries().get(0).getOptions();
    // Copy the suggestion texts into the result list
    if (!CollectionUtils.isEmpty(optionsList)) {
        optionsList.forEach(item -> suggestList.add(item.getText().toString()));
    }
    return suggestList;
}
2.1.3 Define the autocomplete controller
/*
 * @Description: Autocomplete
 * @Method: cSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: com.oldlu.commons.result.ResponseData
 *
 */
// Note: this endpoint reads a JSON body on a GET request; some HTTP clients do not send a body with GET.
@GetMapping(value = "/csuggest")
public ResponseData cSuggest(@RequestBody CommonEntity commonEntity) {
    // Response wrapper
    ResponseData rData = new ResponseData();
    // Reject the request if any required parameter is missing
    if (StringUtils.isEmpty(commonEntity.getIndexName()) ||
            StringUtils.isEmpty(commonEntity.getSuggestFileld()) ||
            StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.PARAM_ISNULL);
        return rData;
    }
    // Suggestions returned by the service
    List<String> result = null;
    try {
        // Call the completion suggester through the service
        result = elasticsearchDocumentService.cSuggest(commonEntity);
        // Fill in the data, the status and the number of suggestions
        rData.setResultEnum(result, ResultEnum.SUCCESS, result.size());
        // Log success
        logger.info(TipsEnum.CSUGGEST_GET_DOC_SUCCESS.getMessage());
    } catch (Exception e) {
        // Log the failure
        logger.error(TipsEnum.CSUGGEST_GET_DOC_FAIL.getMessage(), e);
        // Build the error response
        rData.setResultEnum(ResultEnum.ERROR);
    }
    return rData;
}
2.1.4 Verify the autocomplete call
http://localhost:8888/v1/docs/csuggest
Parameters
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "小米",
  "suggestCount": 13
}
indexName: name of the index
suggestFileld: field the completion suggester searches
suggestValue: text typed by the user
suggestCount: number of suggestions to return (JD.com shows 13)
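As a rough illustration of how these four parameters map onto the Elasticsearch query shown earlier, the sketch below assembles the `_search` path and suggest body by hand. The class `SuggestRequestSketch` and its methods are hypothetical helpers for this article only; the service code in section 2.1.2 uses the client's builder classes instead, which is the safer choice in practice.

```java
public class SuggestRequestSketch {

    /** Minimal JSON string escaping for backslashes and double quotes only. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** indexName selects the index, so it becomes part of the request path. */
    static String path(String indexName) {
        return "/" + indexName + "/_search";
    }

    /** suggestFileld -> "field", suggestValue -> "prefix", suggestCount -> "size". */
    static String body(String suggestField, String suggestValue, int suggestCount) {
        return String.format(
                "{\"suggest\":{\"czbk-suggest\":{\"prefix\":\"%s\",\"completion\":"
                        + "{\"field\":\"%s\",\"size\":%d,\"skip_duplicates\":true}}}}",
                escape(suggestValue), escape(suggestField), suggestCount);
    }

    public static void main(String[] args) {
        System.out.println("GET " + path("product_completion_index"));
        System.out.println(body("searchkey", "xiaomi", 13));
    }
}
```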
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": [
    "小米10",
    "小米10Pro",
    "小米8",
    "小米9",
    "小米充电宝",
    "小米手机",
    "小米摄像头",
    "小米电视",
    "小米电饭煲",
    "小米笔记本",
    "小米耳环",
    "小米路由器"
  ],
  "count": 12
}
Tip: duplicate suggestions are removed automatically, because the request sets skip_duplicates to true.
2.2 Pinyin Completion OpenAPI
Look up 【小米】 by typing its pinyin:
http://localhost:8888/v1/docs/csuggest
Full pinyin
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xiaomi",
  "suggestCount": 13
}
Full pinyin (space-separated)
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xiao mi",
  "suggestCount": 13
}
First letters only
{
  "indexName": "product_completion_index",
  "suggestFileld": "searchkey",
  "suggestValue": "xm",
  "suggestCount": 13
}
2.2.1 Download the pinyin plugin
wget https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.4.0/elasticsearch-analysis-pinyin-7.4.0.zip
or download it from
https://github.com/medcl/elasticsearch-analysis-pinyin/releases/tag/v7.4.0
When creating the index we can define a custom analyzer and point the field mapping at it:
{
  "indexName": "product_completion_index",
  "map": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 2,
      "analysis": {
        "analyzer": {
          "ik_pinyin_analyzer": {
            "type": "custom",
            "tokenizer": "ik_smart",
            "filter": ["pinyin_filter"]
          }
        },
        "filter": {
          "pinyin_filter": {
            "type": "pinyin",
            "keep_first_letter": true,
            "keep_separate_first_letter": false,
            "keep_full_pinyin": true,
            "keep_original": true,
            "limit_first_letter_length": 16,
            "lowercase": true,
            "remove_duplicated_term": true
          }
        }
      }
    },
    "mapping": {
      "properties": {
        "name": {
          "type": "text"
        },
        "searchkey": {
          "type": "completion",
          "analyzer": "ik_pinyin_analyzer"
        }
      }
    }
  }
}
Call the "add document" API to index the data.
Pinyin completion is then ready to use.
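To see why "xiaomi", "xiao mi" and "xm" can all reach 【小米】, it helps to look at the kinds of tokens the filter settings above ask for. The following is a simplified sketch with a hypothetical two-character lookup table; the real plugin ships its own dictionary and has more options, so treat the class `PinyinTokenSketch` and its output as an approximation of the idea rather than the plugin's actual behavior.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class PinyinTokenSketch {

    // Hypothetical lookup table: U+5C0F -> "xiao", U+7C73 -> "mi" (the two characters of the brand name).
    static final Map<Character, String> PINYIN = Map.of('\u5c0f', "xiao", '\u7c73', "mi");

    // Mirrors limit_first_letter_length from the filter settings.
    static final int LIMIT_FIRST_LETTER_LENGTH = 16;

    /** Produces full-pinyin tokens, one first-letter token, and the original term. */
    static Set<String> tokens(String term) {
        // LinkedHashSet keeps insertion order and drops repeats (remove_duplicated_term).
        Set<String> out = new LinkedHashSet<>();
        StringBuilder firstLetters = new StringBuilder();
        for (char c : term.toCharArray()) {
            String pinyin = PINYIN.get(c);
            if (pinyin == null) {
                continue; // this sketch simply skips characters it has no reading for
            }
            out.add(pinyin);                       // keep_full_pinyin
            firstLetters.append(pinyin.charAt(0)); // collected for keep_first_letter
        }
        if (firstLetters.length() > 0) {
            int end = Math.min(firstLetters.length(), LIMIT_FIRST_LETTER_LENGTH);
            out.add(firstLetters.substring(0, end)); // keep_first_letter
        }
        out.add(term);                               // keep_original
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("\u5c0f\u7c73"));
    }
}
```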
3 What Is Language Processing (Spelling Correction)?
Scenario
For example, the mistyped input 【adidaas官方旗舰店】 is corrected to 【adidas官方旗舰店】.
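The core idea behind this kind of correction can be sketched with plain edit distance: among known terms, pick the one that needs the fewest single-character edits. The sketch below is an assumption-laden illustration, not the phrase suggester itself, which also scores candidates in the context of the surrounding terms. The class `EditDistanceSketch`, the vocabulary and the `maxEdits` parameter are hypothetical.

```java
import java.util.List;

public class EditDistanceSketch {

    /** Levenshtein distance: minimum number of insertions, deletions and substitutions. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest vocabulary term within maxEdits, or the input itself if none is close enough. */
    static String correct(String term, List<String> vocabulary, int maxEdits) {
        String best = term;
        int bestDistance = maxEdits + 1;
        for (String candidate : vocabulary) {
            int d = distance(term, candidate);
            if (d < bestDistance) {
                best = candidate;
                bestDistance = d;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary standing in for the terms stored in the index.
        List<String> vocabulary = List.of("nike", "adidas", "puma");
        System.out.println(correct("adidaas", vocabulary, 2));
    }
}
```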
3.1 Language Processing OpenAPI
GET product_completion_index/_search
{
  "suggest": {
    "czbk-suggestion": {
      "text": "adidaas官方旗舰店",
      "phrase": {
        "field": "name",
        "size": 13
      }
    }
  }
}
3.1.1 Define the spelling correction interface
// Spelling correction
public String pSuggest(CommonEntity commonEntity) throws Exception;
3.1.2 Define the spelling correction implementation
/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: java.lang.String
 *
 */
@Override
public String pSuggest(CommonEntity commonEntity) throws Exception {
    // Result; stays empty when no correction is found
    String pSuggestString = "";
    // Build the search request against the target index
    SearchRequest searchRequest = new SearchRequest(commonEntity.getIndexName());
    // Build the search source
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    // Sort by score, descending
    searchSourceBuilder.sort(new ScoreSortBuilder().order(SortOrder.DESC));
    // Build the phrase suggester (the argument is the field to match against)
    PhraseSuggestionBuilder pSuggestionBuilder =
            new PhraseSuggestionBuilder(commonEntity.getSuggestFileld());
    // The text to be corrected
    pSuggestionBuilder.text(commonEntity.getSuggestValue());
    // Only the best correction is needed
    pSuggestionBuilder.size(1);
    searchSourceBuilder.suggest(
            new SuggestBuilder().addSuggestion("czbk-suggest", pSuggestionBuilder));
    searchRequest.source(searchSourceBuilder);
    // Execute the search
    SearchResponse suggestResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    // Read the phrase suggestion from the response
    PhraseSuggestion phraseSuggestion =
            suggestResponse.getSuggest().getSuggestion("czbk-suggest");
    // Guard against a missing suggestion or an empty entry list before reading the first entry
    if (phraseSuggestion == null || phraseSuggestion.getEntries().isEmpty()) {
        return pSuggestString;
    }
    List<PhraseSuggestion.Entry.Option> optionsList =
            phraseSuggestion.getEntries().get(0).getOptions();
    // Take the first option and strip the spaces between its tokens
    if (!CollectionUtils.isEmpty(optionsList) && optionsList.get(0).getText() != null) {
        pSuggestString = optionsList.get(0).getText().string().replaceAll(" ", "");
    }
    return pSuggestString;
}
3.1.3 Define the spelling correction controller
/*
 * @Description: Spelling correction
 * @Method: pSuggest
 * @Param: [commonEntity]
 * @Update:
 * @since: 1.0.0
 * @Return: com.oldlu.commons.result.ResponseData
 *
 */
@GetMapping(value = "/psuggest")
public ResponseData pSuggest(@RequestBody CommonEntity commonEntity) {
    // Response wrapper
    ResponseData rData = new ResponseData();
    // Reject the request if any required parameter is missing
    if (StringUtils.isEmpty(commonEntity.getIndexName()) ||
            StringUtils.isEmpty(commonEntity.getSuggestFileld()) ||
            StringUtils.isEmpty(commonEntity.getSuggestValue())) {
        rData.setResultEnum(ResultEnum.PARAM_ISNULL);
        return rData;
    }
    // Corrected text returned by the service
    String result = null;
    try {
        // Call the phrase suggester through the service
        result = elasticsearchDocumentService.pSuggest(commonEntity);
        // Fill in the data and the status; no count applies to a single string
        rData.setResultEnum(result, ResultEnum.SUCCESS, null);
        // Log success
        logger.info(TipsEnum.PSUGGEST_GET_DOC_SUCCESS.getMessage());
    } catch (Exception e) {
        // Log the failure
        logger.error(TipsEnum.PSUGGEST_GET_DOC_FAIL.getMessage(), e);
        // Build the error response
        rData.setResultEnum(ResultEnum.ERROR);
    }
    return rData;
}
3.1.4 Verify the language processing call
http://localhost:8888/v1/docs/psuggest
Parameters
{
  "indexName": "product_completion_index",
  "suggestFileld": "name",
  "suggestValue": "adidaas官方旗舰店"
}
indexName: name of the index
suggestFileld: field the phrase suggester matches against
suggestValue: text to be corrected
Response
{
  "code": "200",
  "desc": "操作成功!",
  "data": "adidas官方旗舰店"
}
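4 Summary
- Keep a dedicated search-term corpus, separate from the business index, so that the corpus is easy to maintain and upgrade.
- Query the corpus with the analyzed input plus any other search conditions and return a handful of records (JD.com returns 13, Taobao/Tmall 10, Baidu 4).
- To improve accuracy, prefix search is the usual choice.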