Elasticsearch: Using the Path Hierarchy Tokenizer

Posted 2023-03-08 by the Elastic China Community Official Blog

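Imagine that your documents have a hierarchical structure and you want to search at a particular level of that hierarchy. In that case, I believe the Path Hierarchy Tokenizer is the tokenizer you need to know. The path_hierarchy tokenizer takes a hierarchical value such as a file system path, splits it on the path separator, and emits a term for each component in the tree. A simple example: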

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}

The output is:

{
  "tokens": [
    {
      "token": "/one",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    },
    {
      "token": "/one/two/three",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}
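To make the splitting rule concrete, here is a minimal Python sketch that reproduces the tokens and offsets shown above. It is only an emulation for illustration, not the Lucene implementation: the function name path_prefix_tokens is hypothetical, and edge cases such as repeated or trailing delimiters are not claimed to match Elasticsearch.

```python
def path_prefix_tokens(text, delimiter="/"):
    """Hypothetical helper: return (token, start_offset, end_offset) per path prefix."""
    tokens = []
    # A token ends just before each delimiter found after position 0,
    # and one final token ends at the end of the text.
    for i in range(1, len(text) + 1):
        if i == len(text) or text[i] == delimiter:
            tokens.append((text[:i], 0, i))
    return tokens


for token in path_prefix_tokens("/one/two/three"):
    print(token)
# ('/one', 0, 4)
# ('/one/two', 0, 8)
# ('/one/two/three', 0, 14)
```

Every token starts at offset 0 and only the end offset moves, which is why each emitted term is a prefix of the full path.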

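Configuration

Parameter	Description
delimiter	The character to use as the path separator. Defaults to /.
replacement	An optional replacement character to use for the delimiter. Defaults to the value of delimiter.
buffer_size	The number of characters read into the term buffer in a single pass. Defaults to 1024. The term buffer grows by this size until all the text has been consumed. Changing this setting is not recommended.
reverse	If set to true, emits the tokens in reverse order. Defaults to false.
skip	The number of initial tokens to skip. Defaults to 0.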

Configuration example

In this example, we configure the path_hierarchy tokenizer to split on - characters and replace them with /. The first two tokens are skipped:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

We test it with the following example:

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}

The command above returns:

{
  "tokens": [
    {
      "token": "/three",
      "start_offset": 7,
      "end_offset": 13,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four",
      "start_offset": 7,
      "end_offset": 18,
      "type": "word",
      "position": 0
    },
    {
      "token": "/three/four/five",
      "start_offset": 7,
      "end_offset": 23,
      "type": "word",
      "position": 0
    }
  ]
}

From the result above, we can see that:

It skipped the first two tokens, one and two
The text was split on the - character
Each - was replaced by /
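The three observations above can be sketched in a few lines of Python. This is an emulation written for this article, assuming the forward (non-reversed) mode only; the function forward_tokens and its parameter handling are hypothetical and are not the tokenizer's actual code.

```python
def forward_tokens(text, delimiter="/", replacement=None, skip=0):
    """Hypothetical emulation of forward path_hierarchy tokenization."""
    if replacement is None:
        replacement = delimiter  # replacement defaults to the delimiter
    parts = text.split(delimiter)
    # Every component after the first carries its leading delimiter,
    # already swapped for the replacement character.
    pieces = [parts[0]] + [replacement + part for part in parts[1:]]
    if pieces[0] == "":
        pieces = pieces[1:]  # the text started with a delimiter
    tokens, current = [], ""
    for piece in pieces[skip:]:  # drop the first `skip` components
        current += piece
        tokens.append(current)
    return tokens


print(forward_tokens("one-two-three-four-five", delimiter="-", replacement="/", skip=2))
# ['/three', '/three/four', '/three/four/five']
```

Calling forward_tokens("/one/two/three") with the defaults yields the three prefixes from the first example in this article.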

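We can also try the following configuration. Because my-index-000001 already exists, delete it first with DELETE my-index-000001, then run:

PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2,
          "reverse": true
        }
      }
    }
  }
}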

Here we set the reverse option to true. Run the command above, then test it with the following example:

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}

The command above returns:

{
  "tokens": [
    {
      "token": "one/two/three/",
      "start_offset": 0,
      "end_offset": 14,
      "type": "word",
      "position": 0
    },
    {
      "token": "two/three/",
      "start_offset": 4,
      "end_offset": 14,
      "type": "word",
      "position": 0
    },
    {
      "token": "three/",
      "start_offset": 8,
      "end_offset": 14,
      "type": "word",
      "position": 0
    }
  ]
}

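This clearly differs from the earlier result: with reverse set to true, the tokens are suffixes rather than prefixes, skip drops components from the end (four and five instead of one and two), and each remaining token keeps its trailing replacement character.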
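The reversed output above can be reproduced with a small Python sketch. It is an emulation under stated assumptions, not the real tokenizer: reverse_tokens is a hypothetical name, and the handling of a trailing delimiter is a simplification of my own.

```python
def reverse_tokens(text, delimiter="/", replacement=None, skip=0):
    """Hypothetical emulation of path_hierarchy tokenization with reverse=true."""
    if replacement is None:
        replacement = delimiter  # replacement defaults to the delimiter
    parts = text.split(delimiter)
    # Every component except the last carries its trailing delimiter,
    # already swapped for the replacement character.
    pieces = [part + replacement for part in parts[:-1]] + [parts[-1]]
    if pieces[-1] == "":
        pieces.pop()  # assumption: ignore the empty piece after a trailing delimiter
    # In reverse mode, skipping removes components from the end.
    pieces = pieces[: max(len(pieces) - skip, 0)]
    # Emit every suffix, longest first.
    return ["".join(pieces[i:]) for i in range(len(pieces))]


print(reverse_tokens("one-two-three-four-five", delimiter="-", replacement="/", skip=2))
# ['one/two/three/', 'two/three/', 'three/']
```

With the defaults, reverse_tokens("/a/b") returns ['/a/b', 'a/b', 'b'], so the last token is the bare file name.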

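Detailed example

A common use case for the path_hierarchy tokenizer is filtering results by file path. If you index a file path along with your data, analyzing the path with the path_hierarchy tokenizer lets you filter results by different parts of the file path string.

This example configures an index with two custom analyzers and applies them to multi-fields of the file_path text field, which stores file names. One of the two analyzers uses reverse tokenization. Some sample documents are then indexed to represent file paths for photos inside the photo folders of two different users.

We first define the index: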

PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "file_path": {
        "type": "text",
        "fields": {
          "tree": {
            "type": "text",
            "analyzer": "custom_path_tree"
          },
          "tree_reversed": {
            "type": "text",
            "analyzer": "custom_path_tree_reversed"
          }
        }
      }
    }
  }
}

We index the following documents for testing:

POST file-path-test/_bulk
{ "index" : { "_id" : "1" } }
{ "file_path" : "/User/alice/photos/2017/05/16/my_photo1.jpg" }
{ "index" : { "_id" : "2" } }
{ "file_path" : "/User/alice/photos/2017/05/16/my_photo2.jpg" }
{ "index" : { "_id" : "3" } }
{ "file_path" : "/User/alice/photos/2017/05/16/my_photo3.jpg" }
{ "index" : { "_id" : "4" } }
{ "file_path" : "/User/alice/photos/2017/05/15/my_photo1.jpg" }
{ "index" : { "_id" : "5" } }
{ "file_path" : "/User/bob/photos/2017/05/16/my_photo1.jpg" }

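Searching the text field for a particular file path string matches all of the example documents. Bob's document ranks highest, because bob is also one of the terms created by the standard analyzer, which boosts the relevance of Bob's document.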

GET file-path-test/_search?filter_path=**.hits
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}

The response is:

{
  "hits": {
    "hits": [
      {
        "_index": "file-path-test",
        "_id": "5",
        "_score": 1.7343397,
        "_source": {
          "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "1",
        "_score": 0.34804547,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "2",
        "_score": 0.34804547,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "3",
        "_score": 0.34804547,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "4",
        "_score": 0.34804547,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
        }
      }
    ]
  }
}

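The search above does not use the path_hierarchy tokenizer at all. These results come from the standard analyzer.

With the file_path.tree field, it is easy to match or filter documents whose file paths lie within a specific directory.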

GET file-path-test/_search?filter_path=**.hits
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}

The command above returns:

{
  "hits": {
    "hits": [
      {
        "_index": "file-path-test",
        "_id": "1",
        "_score": 0.83005464,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "2",
        "_score": 0.83005464,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "3",
        "_score": 0.83005464,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
        }
      }
    ]
  }
}
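Why exactly these three documents match can be illustrated without a cluster. A term query does not analyze its search text, so it matches only when the text equals one of the indexed tokens. The Python sketch below emulates that with plain sets; DOCS mirrors the sample documents, while prefix_tokens and term_match are hypothetical helpers, not Elasticsearch APIs.

```python
# Sample documents from this article, keyed by _id.
DOCS = {
    "1": "/User/alice/photos/2017/05/16/my_photo1.jpg",
    "2": "/User/alice/photos/2017/05/16/my_photo2.jpg",
    "3": "/User/alice/photos/2017/05/16/my_photo3.jpg",
    "4": "/User/alice/photos/2017/05/15/my_photo1.jpg",
    "5": "/User/bob/photos/2017/05/16/my_photo1.jpg",
}


def prefix_tokens(path, delimiter="/"):
    """Hypothetical stand-in for the tokens indexed in file_path.tree."""
    return {
        path[:i]
        for i in range(1, len(path) + 1)
        if i == len(path) or path[i] == delimiter
    }


def term_match(term):
    """Return the ids whose token set contains the exact, unanalyzed term."""
    return sorted(doc_id for doc_id, path in DOCS.items() if term in prefix_tokens(path))


print(term_match("/User/alice/photos/2017/05/16"))  # ['1', '2', '3']
print(term_match("alice"))                          # []
```

The second call returns nothing because alice on its own is never emitted as a token; only whole path prefixes are.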

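Using the reverse parameter of this tokenizer, it is also possible to match from the other end of the file path, such as an individual file name or a deeply nested subdirectory. The following example searches for all files named my_photo1.jpg in any directory, via the file_path.tree_reversed field, which the mapping configures to use the reverse parameter.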

GET file-path-test/_search?filter_path=**.hits
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}

The response is:

{
  "hits": {
    "hits": [
      {
        "_index": "file-path-test",
        "_id": "1",
        "_score": 0.839499,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "4",
        "_score": 0.839499,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "5",
        "_score": 0.839499,
        "_source": {
          "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
        }
      }
    ]
  }
}
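The reversed field works the same way, except that the indexed tokens are path suffixes. As a rough illustration, the Python sketch below emulates the match with sets; suffix_tokens and term_match_reversed are hypothetical helpers written for this article, and DOCS repeats the sample documents.

```python
# Sample documents from this article, keyed by _id.
DOCS = {
    "1": "/User/alice/photos/2017/05/16/my_photo1.jpg",
    "2": "/User/alice/photos/2017/05/16/my_photo2.jpg",
    "3": "/User/alice/photos/2017/05/16/my_photo3.jpg",
    "4": "/User/alice/photos/2017/05/15/my_photo1.jpg",
    "5": "/User/bob/photos/2017/05/16/my_photo1.jpg",
}


def suffix_tokens(path, delimiter="/"):
    """Hypothetical stand-in for the tokens indexed in file_path.tree_reversed."""
    # A suffix starts at position 0 and right after every delimiter.
    starts = [0] + [i + 1 for i, ch in enumerate(path) if ch == delimiter]
    return {path[start:] for start in starts if path[start:]}


def term_match_reversed(term):
    """Return the ids whose suffix set contains the exact, unanalyzed term."""
    return sorted(doc_id for doc_id, path in DOCS.items() if term in suffix_tokens(path))


print(term_match_reversed("my_photo1.jpg"))     # ['1', '4', '5']
print(term_match_reversed("16/my_photo1.jpg"))  # ['1', '5']
```

The second call shows the "deep subdirectory" case: adding the parent directory to the term narrows the match to files named my_photo1.jpg inside a folder named 16.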

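Being able to filter by file path is also useful when combined with other types of search, as in this example, which looks for any file path containing 16 that must also be in Alice's photo directory.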

GET file-path-test/_search?filter_path=**.hits
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}

The command above returns:

{
  "hits": {
    "hits": [
      {
        "_index": "file-path-test",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
        }
      },
      {
        "_index": "file-path-test",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
        }
      }
    ]
  }
}
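The combined query can be read as an intersection of two conditions. The Python sketch below is a rough emulation of that logic only, with no scoring: word_tokens is a crude stand-in for the standard analyzer (it merely lowercases and splits on /, which is an assumption and not how the real analyzer works in general), and all function names are hypothetical.

```python
# Sample documents from this article, keyed by _id.
DOCS = {
    "1": "/User/alice/photos/2017/05/16/my_photo1.jpg",
    "2": "/User/alice/photos/2017/05/16/my_photo2.jpg",
    "3": "/User/alice/photos/2017/05/16/my_photo3.jpg",
    "4": "/User/alice/photos/2017/05/15/my_photo1.jpg",
    "5": "/User/bob/photos/2017/05/16/my_photo1.jpg",
}


def word_tokens(path):
    """Crude stand-in for the standard analyzer on file_path (assumption)."""
    return {part.lower() for part in path.split("/") if part}


def prefix_tokens(path, delimiter="/"):
    """Hypothetical stand-in for the tokens indexed in file_path.tree."""
    return {
        path[:i]
        for i in range(1, len(path) + 1)
        if i == len(path) or path[i] == delimiter
    }


def bool_search(must_word, filter_prefix):
    """Keep documents that satisfy both the must clause and the filter clause."""
    return sorted(
        doc_id
        for doc_id, path in DOCS.items()
        if must_word in word_tokens(path) and filter_prefix in prefix_tokens(path)
    )


print(bool_search("16", "/User/alice"))  # ['1', '2', '3']
```

Document 4 fails the must clause (its day folder is 15) and document 5 fails the filter (it belongs to bob), which leaves the same three hits as the response above.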
