Some Pitfalls of MatchPhrase Search in Elasticsearch

Posted 2020-10-19 EvilTuzki

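Analyzed (tokenized) fields are usually searched with a match query, and phrases with a match_phrase query. However, match_phrase does not search an analyzed field without tokenizing the query first (what business users usually call an "exact match"). The example below shows why; keep it in mind if you use Elasticsearch.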
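
For orientation, here is a minimal sketch of the two query bodies, written as Python dictionaries. The field name `fulltext` and the search text are taken from the example that follows; how the body is sent to the `_search` endpoint (client library, curl, Kibana) is left open.

```python
import json

# match: the query text is analyzed, and a document matches if it
# contains any of the resulting tokens.
match_query = {"query": {"match": {"fulltext": "亚马逊卓越"}}}

# match_phrase: the query text is analyzed as well, but every token must
# be present and their relative positions must line up.
match_phrase_query = {"query": {"match_phrase": {"fulltext": "亚马逊卓越"}}}

for body in (match_query, match_phrase_query):
    print(json.dumps(body, ensure_ascii=False))
```

The two bodies differ only in the query type, which is why it is easy to assume that match_phrase simply skips analysis.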

An index contains the following document:

  {
      "id": "1",
      "fulltext": "亚马逊卓越有限公司诉讼某某公司"
  }

The fulltext field is analyzed and stored with the ik analyzer, which produces the following tokens:

  "tokens": [
      {
        "token": "亚马逊",
        "start_offset": 0,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 0
      },
      {
        "token": "亚",
        "start_offset": 0,
        "end_offset": 1,
        "type": "CN_WORD",
        "position": 1
      },
      {
        "token": "马",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_CHAR",
        "position": 2
      },
      {
        "token": "逊",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_WORD",
        "position": 3
      },
      {
        "token": "卓越",
        "start_offset": 3,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 4
      },
      {
        "token": "卓",
        "start_offset": 3,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 5
      },
      {
        "token": "越有",
        "start_offset": 4,
        "end_offset": 6,
        "type": "CN_WORD",
        "position": 6
      },
      {
        "token": "有限公司",
        "start_offset": 5,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 7
      },
      {
        "token": "有限",
        "start_offset": 5,
        "end_offset": 7,
        "type": "CN_WORD",
        "position": 8
      },
      {
        "token": "公司",
        "start_offset": 7,
        "end_offset": 9,
        "type": "CN_WORD",
        "position": 9
      },
      {
        "token": "诉讼",
        "start_offset": 9,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 10
      },
      {
        "token": "讼",
        "start_offset": 10,
        "end_offset": 11,
        "type": "CN_WORD",
        "position": 11
      },
      {
        "token": "某某",
        "start_offset": 11,
        "end_offset": 13,
        "type": "CN_WORD",
        "position": 12
      },
      {
        "token": "某公司",
        "start_offset": 12,
        "end_offset": 15,
        "type": "CN_WORD",
        "position": 13
      },
      {
        "token": "公司",
        "start_offset": 13,
        "end_offset": 15,
        "type": "CN_WORD",
        "position": 14
      }
    ]
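
A token listing like the one above can be requested from the `_analyze` API. The sketch below only builds the request body. The analyzer name `ik_max_word` is an assumption (the post only says "ik", but the overlapping tokens suggest the fine-grained mode), and the index name `my_index` in the comment is hypothetical.

```python
import json

# Assumption: the field uses IK's fine-grained mode, "ik_max_word".
analyze_body = {
    "analyzer": "ik_max_word",
    "text": "亚马逊卓越有限公司诉讼某某公司",
}

# The body would be sent as: POST /my_index/_analyze
# ("my_index" is a hypothetical index name).
print(json.dumps(analyze_body, ensure_ascii=False, indent=2))
```

Running the same request with each search phrase as `text` gives the query-side token listings shown in the rest of this post.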

Given the tokens above, a match_phrase query for "亚马逊卓越" returns no results, because analyzing "亚马逊卓越" produces:

    {
      "tokens": [
        {
          "token": "亚马逊",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亚",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "马",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "逊",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越",
          "start_offset": 4,
          "end_offset": 5,
          "type": "CN_CHAR",
          "position": 6
        }
      ]
    }

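Comparing this with the indexed tokens, the stored text contains the token "越有", whereas the query contains "越" rather than "越有". The phrase match therefore fails, and nothing is retrieved.
Staying with the same example, search for "亚马逊卓越有" instead, and the document is found. Analyzing "亚马逊卓越有" produces: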

     {
      "tokens": [
        {
          "token": "亚马逊",
          "start_offset": 0,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "亚",
          "start_offset": 0,
          "end_offset": 1,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "马",
          "start_offset": 1,
          "end_offset": 2,
          "type": "CN_CHAR",
          "position": 2
        },
        {
          "token": "逊",
          "start_offset": 2,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 3
        },
        {
          "token": "卓越",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 4
        },
        {
          "token": "卓",
          "start_offset": 3,
          "end_offset": 4,
          "type": "CN_WORD",
          "position": 5
        },
        {
          "token": "越有",
          "start_offset": 4,
          "end_offset": 6,
          "type": "CN_WORD",
          "position": 6
        }
      ]
    }

Notice that the token "越有" now appears. The query tokens are now identical to the first seven tokens of the indexed text, so match_phrase finds the document.
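
The two outcomes so far can be reproduced with a small Python sketch. It is a simplified model of exact phrase matching (slop 0) over the token lists printed in this post, not Lucene's actual implementation; the function name `phrase_matches` is made up for illustration, and token positions are assumed to be consecutive, as in the listings.

```python
from collections import defaultdict

# Indexed tokens in position order (0..14), copied from the listing above.
DOC_TOKENS = ["亚马逊", "亚", "马", "逊", "卓越", "卓", "越有", "有限公司",
              "有限", "公司", "诉讼", "讼", "某某", "某公司", "公司"]

def phrase_matches(doc_tokens, query_tokens):
    """Exact phrase match: every query token must occur in the document
    at the same relative position it has in the query."""
    positions = defaultdict(set)
    for position, token in enumerate(doc_tokens):
        positions[token].add(position)
    # Try every occurrence of the first query token as the phrase start.
    for start in positions.get(query_tokens[0], ()):
        if all(start + offset in positions.get(token, ())
               for offset, token in enumerate(query_tokens)):
            return True
    return False

query_1 = ["亚马逊", "亚", "马", "逊", "卓越", "卓", "越"]    # "亚马逊卓越"
query_2 = ["亚马逊", "亚", "马", "逊", "卓越", "卓", "越有"]  # "亚马逊卓越有"

print(phrase_matches(DOC_TOKENS, query_1))  # False: "越" is not an indexed token
print(phrase_matches(DOC_TOKENS, query_2))  # True: all seven tokens line up
```

The only difference between the two queries is the last token, which is enough to flip the result.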

For a more extreme example, search for "越有限公司". Surprisingly, this also matches. Analyzing "越有限公司" produces:

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限公司",
          "start_offset": 1,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 1
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 2
        },
        {
          "token": "公司",
          "start_offset": 3,
          "end_offset": 5,
          "type": "CN_WORD",
          "position": 3
        }
      ]
    }

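These tokens again line up exactly with the indexed ones (from "越有" onward), so the document matches. Note the token "有限公司" here: if the search term is changed to "越有限", nothing is found, because "越有限" is analyzed as: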

    {
      "tokens": [
        {
          "token": "越有",
          "start_offset": 0,
          "end_offset": 2,
          "type": "CN_WORD",
          "position": 0
        },
        {
          "token": "有限",
          "start_offset": 1,
          "end_offset": 3,
          "type": "CN_WORD",
          "position": 1
        }
      ]
    }

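Both "越有" and "有限" are in the index, but in the indexed text the token "有限公司" sits between them, so the positions do not line up and nothing matches. Setting slop=1 on the match_phrase query makes "越有限" match. The takeaway: slop is the position difference that the query tolerates, and match_phrase does not match by gluing the words of the phrase together. The query is still analyzed, and the resulting tokens must then match by position.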
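
The effect of slop can be sketched the same way. The function below is a simplified model, not Lucene's sloppy phrase matcher: it picks one document position per query token, subtracts the token's position in the query, and accepts the match if the spread of the shifted values is at most `slop`. The name `sloppy_phrase_matches` is hypothetical, and the last dictionary shows how the slop value would be passed in the query body.

```python
from itertools import product

# Indexed tokens in position order (0..14), copied from the listing above.
DOC_TOKENS = ["亚马逊", "亚", "马", "逊", "卓越", "卓", "越有", "有限公司",
              "有限", "公司", "诉讼", "讼", "某某", "某公司", "公司"]

def sloppy_phrase_matches(doc_tokens, query_tokens, slop=0):
    """Simplified phrase match that tolerates a position spread of `slop`."""
    candidates = []
    for token in query_tokens:
        found = [i for i, t in enumerate(doc_tokens) if t == token]
        if not found:
            return False  # a token missing from the document can never match
        candidates.append(found)
    for combo in product(*candidates):
        # Perfectly aligned tokens all shift to the same value (spread 0).
        shifted = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False

print(sloppy_phrase_matches(DOC_TOKENS, ["越有", "有限公司", "有限", "公司"]))  # True
print(sloppy_phrase_matches(DOC_TOKENS, ["越有", "有限"]))          # False
print(sloppy_phrase_matches(DOC_TOKENS, ["越有", "有限"], slop=1))  # True

# Query body with slop set on the match_phrase query.
slop_query = {"query": {"match_phrase": {"fulltext": {"query": "越有限", "slop": 1}}}}
```

In the index, "越有" is at position 6 and "有限" at position 8, while the query has them at 0 and 1, so the tokens are off by exactly one position and slop=1 is enough.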
