ES: Pipeline Configuration and Document Mapping for ingest attachment, the Plugin for Searching Word and PDF Documents

Posted by catoop

I. Install the ingest attachment plugin

Installation instructions: https://blog.csdn.net/catoop/article/details/124468788

II. Define the text extraction pipeline

1. Single attachment (example)

PUT _ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data",
                "ignore_missing": true
            }
        },
        {
            "remove": {
                "field": "data"
            }
        }
    ]
}

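The remove processor deletes the attachment itself once the pipeline has processed it: only the text extracted from the attachment is stored in ES, and the attachment's base64 data is discarded.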
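
To go through this pipeline, a document has to carry the file content as base64 in its data field. The following is a minimal Python sketch of building such a request body with the standard library only. The helper name build_single_attachment_doc and the index name my_index in the comment are hypothetical and do not come from the original post.

```python
import base64
import json


def build_single_attachment_doc(file_bytes, **fields):
    """Return the JSON body for a document with one base64-encoded attachment.

    The pipeline reads the `data` field, so the encoded file goes there.
    Extra keyword arguments become ordinary document fields.
    """
    doc = dict(fields)
    doc["data"] = base64.b64encode(file_bytes).decode("ascii")
    return json.dumps(doc, ensure_ascii=False)


# Body for a request such as: PUT my_index/_doc/1?pipeline=attachment
# (my_index is a placeholder index name, assumed for illustration)
body = build_single_attachment_doc(b"hello attachment", title="demo")
print(body)
```

In practice the bytes would come from reading a Word or PDF file in binary mode; the literal bytes here only keep the sketch self-contained.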

2. Multiple attachments (example)

PUT _ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "field": "_ingest._value.data",
                        "target_field": "_ingest._value.attachment"
                    }
                }
            }
        },
        {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "remove": {
                        "field": "_ingest._value.data"
                    }
                }
            }
        }
    ]
}

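Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processors cannot match the correct fields.
From ES 8.0 on, deleting the binary file content only requires setting the remove_binary property to true on the attachment processor; a separate remove processor like the one above is no longer needed.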
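
As an illustration of the remove_binary variant, the sketch below builds the pipeline body in Python and prints it as JSON. It assumes ES 8.0 or later; the function name build_attachment_pipeline and its multi flag are hypothetical names introduced here, while the field names are the ones used in the examples above.

```python
import json


def build_attachment_pipeline(multi=False):
    """Build an attachment pipeline body that relies on remove_binary (ES 8.0+).

    With remove_binary set to true, no separate remove processor is added.
    """
    if multi:
        # Multi-attachment case: paths must go through _ingest._value
        processors = [
            {
                "foreach": {
                    "field": "attachments",
                    "processor": {
                        "attachment": {
                            "field": "_ingest._value.data",
                            "target_field": "_ingest._value.attachment",
                            "remove_binary": True,
                        }
                    },
                }
            }
        ]
    else:
        processors = [
            {
                "attachment": {
                    "field": "data",
                    "ignore_missing": True,
                    "remove_binary": True,
                }
            }
        ]
    return {
        "description": "Extract attachment information",
        "processors": processors,
    }


# Body for: PUT _ingest/pipeline/attachment
print(json.dumps(build_attachment_pipeline(multi=True), indent=4))
```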

III. Create the document mapping

1. Single attachment (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}

2. Multiple attachments (example)

PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantid": {
        "type": "keyword"
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_smart"
              }
            }
          }
        }
      }
    }
  }
}

The code in the project is the multi-attachment example; for the object that the mapping corresponds to, see ESDispatchDocumentVo.
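
To show how a document and a query line up with the multi-attachment mapping, here is a rough Python sketch. It is not taken from the project code: the helper names build_multi_attachment_doc and build_content_query are hypothetical, only a few of the mapped fields are filled in, and the match query with highlighting is one assumed way of searching the extracted text.

```python
import base64
from datetime import datetime


def build_multi_attachment_doc(title, files, dispatch_time):
    """Build a document whose `attachments` array feeds the foreach pipeline.

    Each element carries one file as base64 in `data`; the pipeline writes
    the extracted text to `attachments[i].attachment.content`.
    """
    return {
        "title": title,
        # Matches the "yyyy-MM-dd HH:mm:ss" format declared in the mapping
        "dispatchTime": dispatch_time.strftime("%Y-%m-%d %H:%M:%S"),
        "attachments": [
            {"data": base64.b64encode(content).decode("ascii")}
            for content in files
        ],
    }


def build_content_query(keyword):
    """Build a search body that matches the keyword against the extracted text."""
    field = "attachments.attachment.content"
    return {
        "query": {"match": {field: keyword}},
        "highlight": {"fields": {field: {}}},
    }


doc = build_multi_attachment_doc(
    "demo", [b"first file", b"second file"], datetime(2022, 4, 28, 9, 30, 0)
)
query = build_content_query("demo")
print(doc["dispatchTime"], len(doc["attachments"]), list(query["query"]["match"]))
```

The document body would be sent with the pipeline parameter set to attachment, and the query body to the _search endpoint of the newdoc_dispatch index.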

Official reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
Other reference: https://www.cnblogs.com/ncore/p/10475909.html
Project code: https://gitee.com/catoop/es-attachment


(END)
