ElasticSearch05: Background of Fuzzy Matching, Core fuzzy Parameters, and Java Code for Query Correction

Posted by 所得皆惊喜

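①. Background of fuzzy matching

  • ①. In es, fuzzy can be understood as approximate (fuzzy) querying. Search is often imprecise: we frequently need to recall the correct results even when part of the user's query is misspelled. A computer cannot understand natural language, so we rely on algorithms that stand in for language understanding. Prefix queries are simple to implement but their results are rarely satisfying; es's fuzzy support offers a reasonable trade-off between complexity and result quality.

  • ②. Character similarity, edit distance: a quantification of how different two strings are, namely the minimum number of single-character operations needed to turn one string into the other. For example, lucene and lucece differ by one character, so their edit distance is 1.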

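  • ③. Levenshtein distance: one kind of edit distance, defined as the minimum number of edit operations needed to turn one string into the other. The allowed edits are:

  1. Replacing one character with another
  2. Inserting a character
  3. Deleting a character
  4. Swapping two adjacent characters is not a primitive operation here; it counts as two edits
  • ④. Damerau–Levenshtein distance: an extension of Levenshtein distance that counts swapping two adjacent characters as one edit, whereas the classic Levenshtein distance counts the swap as 2 edits.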

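  • ⑤. ElasticSearch supports both the classic Levenshtein distance and the Damerau–Levenshtein distance. Fuzzy matching is available in es in two ways: the match query and the fuzzy query.
    Note: the fuzzy query works like the term query; the query text is not analyzed.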
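The two distances described above can be sketched in a few lines of Java. This is a minimal illustration, not Lucene's or Elasticsearch's implementation: the class name EditDistance and its method names are made up for this example, and the transposition-aware variant is the restricted form often called optimal string alignment.

```java
public class EditDistance {

    /** Classic Levenshtein distance: substitution, insertion and deletion cost one edit each. */
    public static int levenshtein(String a, String b) {
        return distance(a, b, false);
    }

    /** Restricted Damerau-Levenshtein distance: an adjacent swap also costs one edit. */
    public static int damerauLevenshtein(String a, String b) {
        return distance(a, b, true);
    }

    private static int distance(String a, String b, boolean transpositions) {
        // d[i][j] = edits needed to turn the first i chars of a into the first j chars of b
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // match or substitution
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);     // adjacent swap
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucece"));  // 1
        System.out.println(levenshtein("ab", "ba"));          // 2
        System.out.println(damerauLevenshtein("ab", "ba"));   // 1
    }
}
```

The only difference between the two methods is the extra transposition branch, which is why ab -> ba costs 2 in one and 1 in the other.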

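②. fuzzy - parameter reference

  • ①. fuzziness: the maximum edit distance allowed for this query. In a match query fuzzy matching is off by default, which is equivalent to fuzziness=0. Supported formats:
  1. A number: 0, 1 or 2 sets a fixed maximum edit distance. The largest allowed value is 2
  2. Automatic mode, written AUTO:[low],[high] (plain AUTO selects the default automatic mode, equivalent to AUTO:3,6), which derives the distance from the term length:
    [0, 2] - terms of 0 to 2 characters must match exactly (edit distance 0)
    [3, 5] - terms of 3 to 5 characters allow a maximum edit distance of 1
    [6+] - terms longer than 5 characters allow a maximum edit distance of 2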
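  • ②. prefix_length: the minimum length of the common prefix the two strings must share. The first n characters cannot be edited and must equal the query term. The default is 0; a value greater than 0 can noticeably improve query performance. Note that prefix_length applies at the term level after analysis, that is, to each analyzed term rather than to the whole query string. For the query elastic search in the example that follows, a prefix_length of 2 would require both elastic and search to match their first two characters exactly, which may be surprising.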
GET index_name/_search
{
  "query": {
    "match": {
      "name": {
        "query": "elastic search",
        "fuzziness": 0,
        "prefix_length": 0,
        "max_expansions": 50,
        "fuzzy_transpositions": true
      }
    }
  }
}

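  • ③. max_expansions: the maximum number of terms the fuzzy query will expand to. The default is 50.

  • ④. transpositions: counts swapping two adjacent characters as one edit, e.g. ab -> ba, which means the Damerau–Levenshtein distance is used. Enabled by default; setting transpositions=false switches to the classic Levenshtein distance. In a match query the parameter is named fuzzy_transpositions.

  • ⑤. Note: if prefix_length is 0 and max_expansions is set to a very large number, the query becomes very expensive and may end up checking every term in the index.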
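
The AUTO:[low],[high] rule from the fuzziness parameter can be written as a small function. This is a minimal sketch rather than Elasticsearch code: the class name AutoFuzziness and the method maxEdits are hypothetical, and counting term length in Unicode code points is an assumption made for this example.

```java
public class AutoFuzziness {

    /**
     * Maximum edit distance for a term under AUTO:low,high.
     * Terms shorter than low must match exactly, terms shorter than high
     * may differ by one edit, and longer terms may differ by two edits.
     */
    public static int maxEdits(String term, int low, int high) {
        if (low < 0 || high < low) {
            throw new IllegalArgumentException("expected 0 <= low <= high");
        }
        // Assumption: length is measured in code points, not UTF-16 chars.
        int length = term.codePointCount(0, term.length());
        if (length < low) {
            return 0;
        }
        if (length < high) {
            return 1;
        }
        return 2;
    }

    /** Plain AUTO, treated here as AUTO:3,6. */
    public static int maxEdits(String term) {
        return maxEdits(term, 3, 6);
    }

    public static void main(String[] args) {
        System.out.println(maxEdits("es"));       // 0 (length 2)
        System.out.println(maxEdits("fuzzy"));    // 1 (length 5)
        System.out.println(maxEdits("search"));   // 2 (length 6)
        System.out.println(maxEdits("search", 4, 8)); // 1 under AUTO:4,8
    }
}
```

Raising low and high, as in AUTO:4,8, makes short terms stricter, which is one way to cut down on false positives for short product names.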

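③. How to use fuzzy queries

  • ①. The fuzzy query works like the term query: it does not analyze its input. The match query analyzes the input first and then fuzzy-matches each resulting term.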
GET index_name/_search
{
  "query": {
    "match": {
      "name": {
        "query": "elastic search",
        "fuzziness": 0,
        "prefix_length": 0,
        "max_expansions": 50,
        "fuzzy_transpositions": true
      }
    }
  }
}

GET /test-mapping/_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "elastic",
        "fuzziness": 0,
        "prefix_length": 0,
        "max_expansions": 50,
        "transpositions": true
      }
    }
  }
}

  • ②. Fuzzy query flow (figure not available)
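
As a rough illustration of how the two query types differ, the sketch below runs both against a toy in-memory term list. Everything here is an assumption made for the example: the class FuzzyFlowSketch, its methods, the whitespace-and-lowercase "analyzer", and the rule of keeping the first max_expansions candidates in dictionary order. Elasticsearch's real term expansion and scoring are more involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

public class FuzzyFlowSketch {

    // Hypothetical stand-in for the sorted term dictionary of one indexed field.
    private final TreeSet<String> termDictionary = new TreeSet<>();

    public FuzzyFlowSketch(List<String> indexedTerms) {
        termDictionary.addAll(indexedTerms);
    }

    /** Expands one query term into the indexed terms that fuzzy-match it. */
    public List<String> expand(String term, int maxEdits, int prefixLength,
                               int maxExpansions, boolean transpositions) {
        String prefix = term.substring(0, Math.min(prefixLength, term.length()));
        List<String> matches = new ArrayList<>();
        for (String candidate : termDictionary) {
            if (matches.size() >= maxExpansions) {
                break; // stop expanding once the cap is reached
            }
            if (candidate.startsWith(prefix)
                    && distance(term, candidate, transpositions) <= maxEdits) {
                matches.add(candidate);
            }
        }
        return matches;
    }

    /** fuzzy query: the input is not analyzed and is treated as a single term. */
    public List<String> fuzzyQuery(String input, int maxEdits, int prefixLength) {
        return expand(input, maxEdits, prefixLength, 50, true);
    }

    /** match query with fuzziness: analyze first, then expand every term. */
    public Map<String, List<String>> matchQuery(String input, int maxEdits, int prefixLength) {
        Map<String, List<String>> expansions = new LinkedHashMap<>();
        // Lowercasing and splitting on whitespace stands in for a real analyzer.
        for (String token : input.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            expansions.put(token, expand(token, maxEdits, prefixLength, 50, true));
        }
        return expansions;
    }

    // Edit distance with an optional adjacent-transposition step.
    private static int distance(String a, String b, boolean transpositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        FuzzyFlowSketch index = new FuzzyFlowSketch(
                Arrays.asList("elastic", "plastic", "search", "starch"));
        System.out.println(index.fuzzyQuery("elastic search", 1, 0)); // []
        System.out.println(index.matchQuery("elastik serach", 1, 0)); // {elastik=[elastic], serach=[search]}
        System.out.println(index.expand("plastic", 1, 0, 50, true));  // [elastic, plastic]
        System.out.println(index.expand("plastic", 1, 1, 50, true));  // [plastic]
    }
}
```

The first call returns nothing because the unanalyzed string elastic search is compared as one term, while the match-style call corrects each word separately. The last two calls show prefix_length at work: with a fixed first character, plastic no longer expands to elastic.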

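④. Writing the Java correction code

  • ①. Requirement: apply fuzzy correction to spuName and indication; the first character must not be corrected and stays fixed

  • ②. Kibana DSL statement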

GET /ssm-retail-goods-spu-new/_search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "shop_code": {
              "value": "YD-5e81e1c21e591400010c2ff9"
            }
          }
        },
        {
          "term": {
            "is_shelves": {
              "value": 1
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "term": {
                  "spu_code": {
                    "value": "8C80B5BD009B439FAC1BC5D5A4E9C438"
                  }
                }
              },
              {
                "match_phrase": {
                  "spu_name": {
                    "query": "8C80B5BD009B439FAC1BC5D5A4E9C438"
                  }
                }
              },
              {
                "match_phrase": {
                  "indication": {
                    "query": "8C80B5BD009B439FAC1BC5D5A4E9C438"
                  }
                }
              },
              {
                "fuzzy": {
                  "spu_name": {
                    "value": "8C80B5BD009B439FAC1BC5D5A4E9C438",
                    "fuzziness": "1",
                    "prefix_length": 1,
                    "max_expansions": 50,
                    "transpositions": false
                  }
                }
              },
              {
                "match": {
                  "spu_name": {
                    "query": "8C80B5BD009B439FAC1BC5D5A4E9C438",
                    "operator": "OR",
                    "fuzziness": "1",
                    "prefix_length": 1,
                    "max_expansions": 50,
                    "fuzzy_transpositions": false
                  }
                }
              },
              {
                "fuzzy": {
                  "indication": {
                    "value": "8C80B5BD009B439FAC1BC5D5A4E9C438",
                    "fuzziness": "1",
                    "prefix_length": 1,
                    "max_expansions": 50,
                    "transpositions": false
                  }
                }
              },
              {
                "match": {
                  "indication": {
                    "query": "8C80B5BD009B439FAC1BC5D5A4E9C438",
                    "operator": "OR",
                    "fuzziness": "1",
                    "prefix_length": 1,
                    "max_expansions": 50,
                    "fuzzy_transpositions": false
                  }
                }
              }
            ]
          }
        },
        {
          "terms": {
            "spu_code": [
              "E7509FD59242425F88C1906E3F76610E",
              "D3CC7ECD055D46828914C93F6CD1E461",
              "D0D2ACF4B66C4029A34533B31ECC6016",
              "A3798FC0E2C640058BF566E1696B14E5",
              "F86F9F0C5B6A493E9DE47ED0A46656D4",
              "7F86E2606E304EE9A1C5CC8AF95389B7",
              "A23D9A62E1F44DB1BB3D8B3DAF68E6BB",
              "C7A396A9E9D249A1A373639F540506BD",
              "A49EF94F3E6D44EA9FA5F971DCED0258",
              "AA9536B291E544C0B6D0BB8C191F4D2D",
              "8C80B5BD009B439FAC1BC5D5A4E9C438"
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

  • ③. Java implementation
import javax.validation.constraints.NotNull;
import lombok.Data;

@Data
public class SearchParam {

    /**
     * Search criteria: goods code or goods name
     */
    private String searchCriteria;

    /**
     * Shop code
     */
    private String shopCode;

    /**
     * Channel code
     */
    private String channelCode;

    /**
     * Whether to filter out prescription drugs
     */
    private Boolean rxPass = false;

    /** Zero-based page number */
    private int pageNo = 0;

    private int pageSize = 10;

    private String categoryId;

    private String categoryCode;

    private String advertId;

    @NotNull(message = "Sort type must not be empty")
    private SearchSortEnums sortType;
}

public List<SearchGoodsResult> searchGoods(SearchParam searchParam) {
    logger.info("Goods search: {}", JSON.toJSONString(searchParam));
    MallInfo mallInfo = AuthorUtil.getMallInfo();
    String shopCode = mallInfo.getShopCode();

    searchParam.setShopCode(shopCode);
    // Turn the zero-based page number into the offset that from(...) expects
    searchParam.setPageNo(searchParam.getPageSize() * searchParam.getPageNo());
    QueryBuilder query = QueryBuilders.boolQuery()
            .should(QueryBuilders.termQuery("spu_code", searchParam.getSearchCriteria()))
            .should(QueryBuilders.matchPhraseQuery("spu_name", searchParam.getSearchCriteria()))
            .should(QueryBuilders.matchPhraseQuery("indication", searchParam.getSearchCriteria()))
            // Fuzzy correction on the whole, unanalyzed input
            .should(QueryBuilders.fuzzyQuery("spu_name", searchParam.getSearchCriteria()).fuzziness(Fuzziness.ONE).transpositions(false).prefixLength(1))
            // Fuzzy correction on each analyzed term
            .should(QueryBuilders.matchQuery("spu_name", searchParam.getSearchCriteria()).fuzziness(Fuzziness.ONE).fuzzyTranspositions(false).prefixLength(1))

            .should(QueryBuilders.fuzzyQuery("indication", searchParam.getSearchCriteria()).fuzziness(Fuzziness.ONE).transpositions(false).prefixLength(1))
            .should(QueryBuilders.matchQuery("indication", searchParam.getSearchCriteria()).fuzziness(Fuzziness.ONE).fuzzyTranspositions(false).prefixLength(1));

    BoolQueryBuilder boolQueryBuilder = QueryBuilders.boolQuery();

    boolQueryBuilder
            .must(QueryBuilders.termQuery("shop_code", searchParam.getShopCode()))
            .must(QueryBuilders.termQuery("is_shelves", 1))
            .must(query);
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(searchParam.getPageNo());
    searchSourceBuilder.size(searchParam.getPageSize());

    searchSourceBuilder.sort(SortBuilders.scoreSort().order(SortOrder.DESC));
    searchSourceBuilder.query(boolQueryBuilder);
    String advertId = searchParam.getAdvertId();
    SearchResponse searchResponse = null;
    if (StringUtils.isNotBlank(advertId)) {
        // Look up the spu codes attached to the advert and restrict the goods query to them
        SearchSourceBuilder advSearchSourceBuilder = new SearchSourceBuilder();
        BoolQueryBuilder advBoolQueryBuilder = QueryBuilders.boolQuery();
        SearchRequest advSearchRequest = new SearchRequest("ssm-retail-adv-spu-new");
        advBoolQueryBuilder.must(QueryBuilders.termQuery("adv_id", advertId));
        advSearchSourceBuilder.query(advBoolQueryBuilder);
        advSearchSourceBuilder.size(1000);
        advSearchRequest.source(advSearchSourceBuilder);
        try {
            searchResponse = restHighLevelClient.search(advSearchRequest, RequestOptions.DEFAULT);
            if (null == searchResponse) {
                return null;
            }

            SearchHits hits = searchResponse.getHits();
            // TODO: total hits can never be negative
            // (older clients returned a long here, so the check was hits.getTotalHits() <= 0)
            if (hits.getTotalHits().equals(new TotalHits(0, TotalHits.Relation.EQUAL_TO))) {
                return null;
            }
            List<String> spuCodeList = new ArrayList<>();
            hits.forEach(i -> {
                JSONObject sourceAsMap = JSONObject.parseObject(i.getSourceAsString());
                spuCodeList.add(sourceAsMap.getString(EsRetailGoodsColumn.SPU_CODE));
            });
            if (CollectionUtils.isNotEmpty(spuCodeList)) {
                boolQueryBuilder.must(QueryBuilders.termsQuery("spu_code", spuCodeList));
            }
        } catch (Exception e) {
            logger.error("Goods search failed", e);
            DistributionException.throwException(ErrorCode.SEARCH_GOODS_ERROR);
        }
    }
    SearchRequest searchRequest = new SearchRequest("ssm-retail-goods-spu-new");
    String categoryCode = searchParam.getCategoryCode();
    if (StringUtils.isNotBlank(categoryCode)) {
        boolQueryBuilder.must(QueryBuilders.termQuery("category_code", categoryCode));
    }
    searchRequest.source(searchSourceBuilder);
    try {
        logger.info("searchRequest : {}", searchRequest);
        searchResponse = restHighLevelClient.search(searchRequest, RequestOptions.DEFAULT);
    } catch (Exception e) {
        logger.error("Goods search failed", e);
        DistributionException.throwException(ErrorCode.SEARCH_GOODS_ERROR);
    }
    if (null == searchResponse) {
        return null;
    }

    SearchHits hits = searchResponse.getHits();

    if (hits.getTotalHits().equals(new TotalHits(0, TotalHits.Relation.EQUAL_TO))) {
        return null;
    }

    List<SearchGoodsResult> searchGoodsResultList = new ArrayList<>();
    logger.info("hits: {}", JSON.toJSON(hits));
    hits.forEach(i -> {
        SearchGoodsResult searchGoodsResult = new SearchGoodsResult();
        JSONObject sourceAsMap = JSONObject.parseObject(i.getSourceAsString());
        logger.info("sourceAsMap: {}", i.getSourceAsString());
        searchGoodsResult.setShopCode(sourceAsMap.getString(EsRetailGoodsColumn.SHOP_CODE));
        searchGoodsResult.setSpuCode(sourceAsMap.getString(EsRetailGoodsColumn.SPU_CODE));
        searchGoodsResult.setSpuName(sourceAsMap.getString(EsRetailGoodsColumn.SPU_NAME));
        searchGoodsResult.setIsShelves(sourceAsMap.getBoolean(EsRetailGoodsColumn.IS_SHELVES));
        searchGoodsResult.setFirstRetailCharge(sourceAsMap.getBigDecimal(EsRetailGoodsColumn.FIRST_RETAIL_CHARGE));
        // The original listing is cut off here; any further field mappings are not shown.
        // Assumed completion: collect the mapped result and return the list.
        searchGoodsResultList.add(searchGoodsResult);
    });
    return searchGoodsResultList;
}
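
The method above overwrites pageNo with the computed offset before passing it to from(...), which makes the field mean two different things. A minimal sketch of the same calculation kept in its own value object is shown here; the class PageWindow is hypothetical, and it assumes a zero-based page number as suggested by the default pageNo = 0 in SearchParam.

```java
public class PageWindow {

    /** Offset of the first hit to return (the "from" value of the search). */
    public final int from;

    /** Number of hits to return (the "size" value of the search). */
    public final int size;

    public PageWindow(int pageNo, int pageSize) {
        if (pageNo < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNo must be >= 0 and pageSize > 0");
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing
        this.from = Math.multiplyExact(pageNo, pageSize);
        this.size = pageSize;
    }

    public static void main(String[] args) {
        PageWindow first = new PageWindow(0, 10);
        PageWindow third = new PageWindow(2, 10);
        System.out.println(first.from + ", " + first.size); // 0, 10
        System.out.println(third.from + ", " + third.size); // 20, 10
    }
}
```

With such a helper, the request object keeps its original page number and the values handed to the search source builder are derived rather than stored back.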
