Elasticsearch Search: An Analysis of most_fields

Posted by 虾米&老黄牛


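     As the name suggests, most_fields gives a document a higher score the more of its fields match the query terms. A weight (boost) can also be set per field.

     Below is a simplified formula (for the detailed scoring algorithm, see http://m.blog.csdn.net/article/details?id=50623948):

     score = match_field1_score*boost + match_field2_score*boost + ... + match_fieldN_score*boost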
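
As a rough illustration of the simplified formula, the Python sketch below sums per-field scores multiplied by per-field boosts. The function name, the dict layout, and the sample scores are all hypothetical, and real Elasticsearch scoring involves further factors, so this is a sketch under those assumptions rather than the engine's implementation.

```python
def most_fields_score(field_scores, boosts=None):
    """Sum each matching field's score times its boost (boost defaults to 1.0)."""
    boosts = boosts or {}
    return sum(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())


# Hypothetical per-field scores: one document repeats the term in three
# fields, the other matches strongly in a single field.
redundant_doc = {"title": 0.3, "address": 0.1, "businessDistrict": 0.3}
concise_doc = {"address": 0.5}

print(most_fields_score(redundant_doc))
print(most_fields_score(concise_doc))
# Boosting the address field lifts the concise document above the redundant one.
print(most_fields_score(concise_doc, {"address": 2.0}))
```

With equal boosts, the document that repeats the term across three fields wins even though each individual match is weaker, which is the weakness discussed next.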

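     In many cases this kind of search works well, but it has a weakness: when a document repeats the same information across many fields, it outscores documents that are more concise yet more complete in meaning.

     operator and minimum_should_match cannot be used to trim the long tail of low-relevance docs, since both are applied to each field separately. Put simply, the document with the most term matches wins.

    Example:

    Search for the keyword "北京东路" (Beijing East Road). From the analyzer output below, we can see that its terms are "北京" and "东路". (Judging by its offsets, the extra "text" token in the output is the JSON key of the request body, which was tokenized along with the value.)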

curl 'localhost:9200/fullbiz_index/_analyze?analyzer=ik_smart&pretty=true' -d '{"text":"北京东路"}'
{
   "tokens" : [
      {
         "token" : "text",
         "start_offset" : 2,
         "end_offset" : 6,
         "type" : "ENGLISH",
         "position" : 1
      },
      {
         "token" : "北京",
         "start_offset" : 9,
         "end_offset" : 11,
         "type" : "CN_WORD",
         "position" : 2
      },
      {
         "token" : "东路",
         "start_offset" : 11,
         "end_offset" : 13,
         "type" : "CN_WORD",
         "position" : 3
      }
   ]
}
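
To pull just the Chinese terms out of a response like the one above, one option is to filter on the token type. The snippet below is a sketch that hardcodes a trimmed copy of that response; `cn_terms` is a hypothetical helper name, not part of any Elasticsearch client.

```python
import json

# Trimmed copy of the _analyze response shown above.
response = json.loads("""
{"tokens": [
  {"token": "text", "start_offset": 2, "end_offset": 6, "type": "ENGLISH", "position": 1},
  {"token": "北京", "start_offset": 9, "end_offset": 11, "type": "CN_WORD", "position": 2},
  {"token": "东路", "start_offset": 11, "end_offset": 13, "type": "CN_WORD", "position": 3}
]}
""")


def cn_terms(analyze_response):
    """Return the tokens the analyzer classified as Chinese words."""
    return [t["token"] for t in analyze_response["tokens"]
            if t["type"] == "CN_WORD"]


print(cn_terms(response))
```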

 

curl 'localhost:9200/fullbiz1/fullbizinfo/_search?pretty' -d '
{
  "from" : 0,
  "size" : 20,
  "query" : {
    "multi_match" : {
      "query" : "北京东路",
      "fields" : [ "title", "highlight", "tags", "address", "businessDistrict", "cuisineStyle" ],
      "type" : "most_fields",
      "minimum_should_match" : "70%", // minimum share of query terms that must match; the computed count is rounded down, so with three terms at least two must match (2.1 -> 2), and with two terms one is enough (1.4 -> 1). For "北京东路", matching either "北京" or "东路" is therefore enough to score
      "analyzer" : "ik_smart" // ik has two modes: ik_max_word (finest-grained segmentation) and ik_smart (coarsest-grained segmentation); we use the second here to get closer to the business results
    }
  },
  "post_filter" : {
    "bool" : {
      "must" : [ {
        "term" : {
          "status" : 0
        }
      }, {
        "term" : {
          "hostDisplay" : 1
        }
      }, {
        "term" : {
          "cityId" : 2
        }
      }, {
        "term" : {
          "productType" : 3
        }
      } ]
    }
  }
}'
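
How a percentage value of minimum_should_match translates into a required number of terms can be sketched in a few lines of Python. This assumes the round-down behavior for positive percentages; negative percentages and combined forms are not covered, and `required_matches` is a hypothetical helper name.

```python
def required_matches(term_count, percent):
    """Terms that must match for a positive percentage, rounded down."""
    return (term_count * percent) // 100


for n in (2, 3, 4):
    print(n, "terms ->", required_matches(n, 70), "must match")
```

For the two-term query "北京东路" at 70%, a single matching term in a field is enough, which is why documents containing only "东路" still appear in the results.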
 
    "hits" : [ {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "324239",
      "_score" : 0.33371,
      "_source":{"boost":1,"productId":24239,"productType":3,"subType":2,"title":"城市公牛(南京东路店)","viceTitle":"城市公牛(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"meal/2016/08/11/1470892987880.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":null,"status":0,"isFree":-1,"duration":"10:00:00-22:30:00","onlineTime":1470280723,"updateTime":1486951326,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":24239,"contactNumber":"13764741956","hostName":"城市公牛(南京东路店)","address":"南京东路300号L221-222室(河南中路口)","hostDisplay":1,"hostPicUrl":"meal/2016/08/11/1470892987880.jpg","hostSharePicUrl":"meal/2016/08/11/1470892987880.jpg","hostLatitude":"31.243455970586","hostLongitude":"121.49099099941","location":{"lat":"31.243455970586","lon":"121.49099099941"},"hostLatitudeGD":"31.237701","hostLongitudeGD":"121.484409","locationGD":{"lat":"31.237701","lon":"121.484409"},"headPics":"","catalogIds":null,"cuisineStyleId":41,"cuisineStyle":"西餐","hideMask":0,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":1,"orderNums":3,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":16000,"hostProductLabelIds":",1,2,4,5,7,8,9,12,13,14,15,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"2010年世博会加拿大馆特约餐厅\",\"加拿大简约西部乡村风格小酒馆餐厅\",\"家庭式的用餐氛围 80%均是外国食客\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-13T10:02:06.000+08:00"}
    }, {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "392659",
      "_score" : 0.31962717,
      "_source":{"boost":1,"productId":92659,"productType":3,"subType":4,"title":"THAIBEAUTY美容连锁机构(南京东路店)","viceTitle":"THAIBEAUTY美容连锁机构(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2017/01/11/1484121279773528.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1484121281,"updateTime":1484202471,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":92659,"contactNumber":"021-63511876","hostName":"THAIBEAUTY美容连锁机构(南京东路店)","address":"南京东路580号6楼","hostDisplay":1,"hostPicUrl":"hostInfo/2017/01/11/1484121279773528.jpg","hostSharePicUrl":"hostInfo/2017/01/11/1484121279773528.jpg","hostLatitude":"31.241721400027","hostLongitude":"121.48585125776","location":{"lat":"31.241721400027","lon":"121.48585125776"},"hostLatitudeGD":"31.235887","hostLongitudeGD":"121.479289","locationGD":{"lat":"31.235887","lon":"121.479289"},"headPics":"","catalogIds":null,"cuisineStyleId":0,"cuisineStyle":"美容/SPA","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":284500,"hostProductLabelIds":",60,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"高端局部瘦身\",\"环境舒适 按摩师手法专业\",\"使用高品质产品\"]","isSeatBook":1,"lastUTCTimestamp":"2017-01-12T14:27:51.000+08:00"}
    }, {
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "364804",
      "_score" : 0.31002828,
      "_source":{"boost":1,"productId":64804,"productType":3,"subType":2,"title":"斗牛士(南京东路店)","viceTitle":"斗牛士(南京东路店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2016/12/26/1482718008927949.png","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1482718014,"updateTime":1486569730,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"南京东路","businessDistrictId":73,"hostId":64804,"contactNumber":"021-33317136","hostName":"斗牛士(南京东路店)","address":"南京东路353号悦荟广场(原353店)7F","hostDisplay":1,"hostPicUrl":"hostInfo/2016/12/26/1482718008927949.png","hostSharePicUrl":"hostInfo/2016/12/26/1482718008927949.png","hostLatitude":"31.24210523683","hostLongitude":"121.49020262932","location":{"lat":"31.24210523683","lon":"121.49020262932"},"hostLatitudeGD":"31.236339","hostLongitudeGD":"121.483623","locationGD":{"lat":"31.236339","lon":"121.483623"},"headPics":"","catalogIds":null,"cuisineStyleId":41,"cuisineStyle":"西餐","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":12200,"hostProductLabelIds":",1,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"精选进口澳洲安格斯牛排\",\"严控0度低温 保证牛肉鲜嫩\",\"进口原切牛排保证牛肉口感与外观\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-09T00:02:10.000+08:00"}
.....
      "_index" : "fullbiz1",
      "_type" : "fullbizinfo",
      "_id" : "353771",
      "_score" : 0.7784657,
      "_source":{"boost":1,"productId":53771,"productType":3,"subType":2,"title":"九储堂创意中国菜(外滩店)","viceTitle":"九储堂创意中国菜(外滩店)","personMax":"-1","personMin":"-1","picUrl":"hostInfo/2016/12/26/1482744127546461.jpg","recommand":-1,"needReserveTime":-1,"priceStr":"-1","price":"-1","originalPrice":"-1","leadingMinutes":-1,"tags":"","status":0,"isFree":-1,"duration":null,"onlineTime":1482744132,"updateTime":1486738928,"applyExpiredTime":0,"beginTime":0,"endTime":0,"isCourse":-1,"isTour":-1,"supportParty":0,"interestedNum":0,"cityId":2,"cityName":"上海","categoryId":"0","categoryName":"","categoryIconUrl":"","businessDistrict":"外滩","businessDistrictId":71,"hostId":53771,"contactNumber":"021-63308900","hostName":"九储堂创意中国菜(外滩店)","address":"北京东路398号新协通国际大酒店18楼","hostDisplay":1,"hostPicUrl":"hostInfo/2016/12/26/1482744127546461.jpg","hostSharePicUrl":"hostInfo/2016/12/26/1482744127546461.jpg","hostLatitude":"31.246247363994","hostLongitude":"121.48894308136","location":{"lat":"31.246247363994","lon":"121.48894308136"},"hostLatitudeGD":"31.240463","hostLongitudeGD":"121.48237","locationGD":{"lat":"31.240463","lon":"121.48237"},"headPics":"","catalogIds":null,"cuisineStyleId":25,"cuisineStyle":"创意菜","hideMask":-1,"referenceAgeMin":0,"referenceAgeMax":0,"userLimit":-1,"todayReservable":0,"orderNums":0,"pvConversionRate":"-1","interestNums":0,"hotPoints":0,"hostAvgPrice":19100,"hostProductLabelIds":",1,","shopPay":0,"hostVipEquities":"0","isHostSale":0,"highlight":"[\"新加坡同乐餐饮总厨胡于保先生主理\",\"大厅可容纳150人的宴会 包房5间\",\"靠窗座位亦可欣赏浦江两岸美景\"]","isSeatBook":1,"lastUTCTimestamp":"2017-02-10T23:02:08.000+08:00"}

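Yet the document that contains the complete phrase "北京东路" ranks behind them, which makes no sense. Why this result? Let's use explain to look at how the score is computed:

 curl 'localhost:9200/fullbiz1/fullbizinfo/_search?pretty&explain'  ....

The rest of the request is omitted: it is the same as the one above, with only explain added and size limited to the first hit. Because there is so much output, we analyze just one document and go straight to the scoring part: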

      "_explanation" : {
        "value" : 0.33371,
        "description" : "product of:",
        "details" : [ {
          "value" : 0.66742,
          "description" : "sum of:",
          "details" : [ {
            "value" : 0.28481156,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.5696231,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.5696231,
                "description" : "weight(title:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.5696231,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.25448462,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 7.1626873,
                      "description" : "idf(docFreq=244, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 2.23834,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 7.1626873,
                      "description" : "idf(docFreq=244, maxDocs=116302)"
                    }, {
                      "value" : 0.3125,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          }, {
            "value" : 0.067192085,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.13438417,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.13438417,
                "description" : "weight(address:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.13438417,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.1477382,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 4.158218,
                      "description" : "idf(docFreq=4942, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 0.90961015,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 4.158218,
                      "description" : "idf(docFreq=4942, maxDocs=116302)"
                    }, {
                      "value" : 0.21875,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          }, {
            "value" : 0.3154164,
            "description" : "product of:",
            "details" : [ {
              "value" : 0.6308328,
              "description" : "sum of:",
              "details" : [ {
                "value" : 0.6308328,
                "description" : "weight(businessDistrict:东路 in 7321) [PerFieldSimilarity], result of:",
                "details" : [ {
                  "value" : 0.6308328,
                  "description" : "score(doc=7321,freq=1.0), product of:",
                  "details" : [ {
                    "value" : 0.22633977,
                    "description" : "queryWeight, product of:",
                    "details" : [ {
                      "value" : 6.3705263,
                      "description" : "idf(docFreq=540, maxDocs=116302)"
                    }, {
                      "value" : 0.03552921,
                      "description" : "queryNorm"
                    } ]
                  }, {
                    "value" : 2.7871053,
                    "description" : "fieldWeight in 7321, product of:",
                    "details" : [ {
                      "value" : 1.0,
                      "description" : "tf(freq=1.0), with freq of:",
                      "details" : [ {
                        "value" : 1.0,
                        "description" : "termFreq=1.0"
                      } ]
                    }, {
                      "value" : 6.3705263,
                      "description" : "idf(docFreq=540, maxDocs=116302)"
                    }, {
                      "value" : 0.4375,
                      "description" : "fieldNorm(doc=7321)"
                    } ]
                  } ]
                } ]
              } ]
            }, {
              "value" : 0.5,
              "description" : "coord(1/2)"
            } ]
          } ]
        }, {
          "value" : 0.5,
          "description" : "coord(3/6)"
        } ]
      }
    } ]
  }
}

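As the analysis above shows, the documents containing "南京东路" (Nanjing East Road) rank first not because they match the query better, but because more of their fields match: "东路" hits title, address, and businessDistrict. They therefore outscore the document further down, in which only one field contains "北京东路".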
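
The arithmetic in the explain output can be retraced by hand. The Python sketch below plugs in the idf, queryNorm, tf, fieldNorm, and coord values copied from that output; the function and variable names are only illustrative, and the structure (queryWeight times fieldWeight per term, summed over fields, then scaled by coord) is read off the explain tree rather than taken from the Elasticsearch source.

```python
def term_score(idf, query_norm, tf, field_norm):
    """Score of one term in one field, as laid out in the explain tree."""
    query_weight = idf * query_norm        # "queryWeight, product of:"
    field_weight = tf * idf * field_norm   # "fieldWeight in 7321, product of:"
    return query_weight * field_weight


QUERY_NORM = 0.03552921
COORD_TERMS = 0.5    # coord(1/2): one of the two query terms matched per field
COORD_FIELDS = 0.5   # coord(3/6): three of the six queried fields matched

# Per-field scores for the term "东路", values copied from the explain output.
fields = {
    "title": term_score(7.1626873, QUERY_NORM, 1.0, 0.3125),
    "address": term_score(4.158218, QUERY_NORM, 1.0, 0.21875),
    "businessDistrict": term_score(6.3705263, QUERY_NORM, 1.0, 0.4375),
}

total = sum(s * COORD_TERMS for s in fields.values()) * COORD_FIELDS
print(total)  # close to the _score of 0.33371 reported for this document
```

Each field contributes only the partial match "东路", yet three such contributions added together are what produce the final score.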

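Summary: most_fields is suited to searches where the fields differ substantially in the information they carry. In the example above, "东路" appears in the title and again in the business district and the address, so there is a lot of redundant information.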
