Searching Word/PDF documents in ES with the ingest attachment plugin: pipeline configuration and index mapping
Posted by catoop
I. Install the ingest attachment plugin
Installation instructions: https://blog.csdn.net/catoop/article/details/124468788
II. Define the text-extraction pipeline
1. Single attachment (example)
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}
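The remove processor deletes the attachment itself once the pipeline has processed it: only the text extracted from the attachment is stored in ES, and the attachment's own base64 data is discarded.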
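To show how a document reaches the pipeline above, here is a minimal Python sketch that base64-encodes a file's bytes into the data field and builds the index request with the pipeline parameter. The helper names, the document id, and the http://localhost:9200 address are assumptions for illustration; the index name newdoc_dispatch is the one used in the mapping section. The request is only built, not sent.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured test node


def build_attachment_doc(raw_bytes: bytes, **fields) -> dict:
    """Return a document whose 'data' field holds the base64-encoded file."""
    doc = dict(fields)
    doc["data"] = base64.b64encode(raw_bytes).decode("ascii")
    return doc


def build_index_request(index: str, doc_id: str, doc: dict,
                        pipeline: str = "attachment") -> urllib.request.Request:
    """Build (but do not send) a PUT request that routes the doc through the pipeline."""
    url = f"{ES_URL}/{index}/_doc/{doc_id}?pipeline={pipeline}"
    return urllib.request.Request(
        url,
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


doc = build_attachment_doc(b"hello attachment", title="demo")
request = build_index_request("newdoc_dispatch", "1", doc)
# To actually send it against a running node: urllib.request.urlopen(request)
```

With the pipeline defined as above, the stored document would be expected to carry the extracted text under attachment.content and no longer contain the data field.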
2. Multiple attachments (example)
PUT _ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "field": "_ingest._value.data",
            "target_field": "_ingest._value.attachment"
          }
        }
      }
    },
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "remove": {
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}
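Note that with multiple attachments, field and target_field must be written as _ingest._value.*; otherwise the processor does not match the correct fields.
Starting with ES 8.0, to delete the binary file content you only need to set the remove_binary property of the attachment processor to true; there is no need to write a separate remove processor as above.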
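For the foreach pipeline to have something to iterate over, the document must carry an attachments array whose elements each have a data field. The following Python sketch builds such a document; the helper name and the filename key are hypothetical and only serve the illustration.

```python
import base64


def build_multi_attachment_doc(files: dict, **fields) -> dict:
    """Build a document with one 'attachments' element per file.

    files maps a file name to that file's raw bytes.
    """
    doc = dict(fields)
    doc["attachments"] = [
        {
            "filename": name,  # hypothetical extra field, not required by the pipeline
            "data": base64.b64encode(raw).decode("ascii"),
        }
        for name, raw in files.items()
    ]
    return doc


multi_doc = build_multi_attachment_doc(
    {"a.txt": b"first file", "b.txt": b"second file"},
    businessId="demo-1",
)
```

Inside the foreach processor, _ingest._value refers to the current element of attachments, which is why the paths are written as _ingest._value.data and _ingest._value.attachment.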
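As a sketch of the remove_binary variant, the Python function below builds the pipeline body for both the single-attachment and the multi-attachment case with one processor each. It assumes an ES version that supports remove_binary, as stated above; the function name is hypothetical, and the printed JSON would be sent as the body of PUT _ingest/pipeline/attachment.

```python
import json


def attachment_pipeline(multi: bool = False) -> dict:
    """Pipeline body that relies on remove_binary instead of a remove processor."""
    if multi:
        processor = {
            "foreach": {
                "field": "attachments",
                "processor": {
                    "attachment": {
                        "field": "_ingest._value.data",
                        "target_field": "_ingest._value.attachment",
                        "remove_binary": True,
                    }
                },
            }
        }
    else:
        processor = {
            "attachment": {
                "field": "data",
                "ignore_missing": True,
                "remove_binary": True,
            }
        }
    return {
        "description": "Extract attachment information",
        "processors": [processor],
    }


print(json.dumps(attachment_pipeline(multi=True), indent=2))
```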
III. Create the index mapping
1. Single attachment (example)
PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantId": {
        "type": "keyword"
      },
      "attachment": {
        "properties": {
          "content": {
            "type": "text",
            "analyzer": "ik_smart"
          }
        }
      }
    }
  }
}
2. Multiple attachments (example)
PUT newdoc_dispatch
{
  "mappings": {
    "properties": {
      "businessId": {
        "type": "keyword"
      },
      "title": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "fullDocNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "drafterUser": {
        "type": "keyword"
      },
      "dispatchNO": {
        "type": "text",
        "analyzer": "ik_smart"
      },
      "dispatchDept": {
        "type": "keyword"
      },
      "dispatchTime": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
      },
      "abolish": {
        "type": "keyword"
      },
      "tenantid": {
        "type": "keyword"
      },
      "attachments": {
        "properties": {
          "attachment": {
            "properties": {
              "content": {
                "type": "text",
                "analyzer": "ik_smart"
              }
            }
          }
        }
      }
    }
  }
}
The code in the project uses the multi-attachment example; for the object that the mapping describes, see ESDispatchDocumentVo.
Official reference: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html
Other reference: https://www.cnblogs.com/ncore/p/10475909.html
Code project: https://gitee.com/catoop/es-attachment
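In the multi-attachment mapping, attachments is a plain object field (not nested), so ES flattens the array and the extracted text of every attachment is searchable under the single path attachments.attachment.content. The Python sketch below illustrates that flattening locally and builds a matching search body. The function names and the sample document are hypothetical, and this is not the project's code; for the single-attachment mapping, pass attachment.content as content_field.

```python
import json


def values_at(doc, path):
    """Collect every value found at a dotted path, descending into lists."""
    nodes = [doc]
    for key in path.split("."):
        next_nodes = []
        for node in nodes:
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    # Flatten a trailing list level so the result is a flat list of values.
    flat = []
    for node in nodes:
        flat.extend(node if isinstance(node, list) else [node])
    return flat


def content_query(keyword, content_field="attachments.attachment.content"):
    """Search body that matches the keyword in the extracted text and highlights it."""
    return {
        "query": {"match": {content_field: keyword}},
        "highlight": {"fields": {content_field: {}}},
    }


# Hypothetical shape of a document after the multi-attachment pipeline has run.
processed = {
    "title": "demo",
    "attachments": [
        {"attachment": {"content": "first file text"}},
        {"attachment": {"content": "second file text"}},
    ],
}

print(values_at(processed, "attachments.attachment.content"))
print(json.dumps(content_query("file"), indent=2))
```

The search body would be sent to newdoc_dispatch/_search; because the array is flattened, a hit does not tell which attachment matched unless the highlight fragments are inspected.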
(END)