Indexing PDF files in Apache Solr with an XML document

Posted 2021-04-09


How can I index PDF files in Apache Solr (version 8) using an XML document? For example:

<add>
<doc>
<field name="id">filePath</field>
<field name="title">the title</field>
<field name="description">description of the pdf file</field>
<field name="Creator">jhone doe</field>
<field name="Language">English</field>
<field name="Publisher">Publisher_name</field>
<field name="tags">some_tag</field>
<field name="is_published">true</field>
<field name="year">2002</field>
<field name="file">path_to_the_file/file_name.pdf</field>
</doc>
</add>

Update

How do I set literal.id to the filePath?

Answer

OK, here is what I did.

I am using the Solr DIH (DataImportHandler), configured in solrconfig.xml:

<requestHandler name="/dataimport_fromXML" class="org.apache.solr.handler.dataimport.DataImportHandler">

        <lst name="defaults">
            <str name="config">data.import.xml</str>
            <str name="update.chain">dedupe</str>
        </lst>
 </requestHandler>
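
The handler above can only load if the DataImportHandler classes are on the core's classpath. A minimal sketch of the `<lib>` directives for solrconfig.xml, assuming the stock Solr 8 directory layout (the `dist/` and `contrib/extraction/lib` locations and the `../../../..` fallback are assumptions based on that layout, not part of the original answer; adjust them to your installation):

```xml
<!-- Assumed stock Solr 8 layout; adjust the paths to your installation. -->
<!-- DataImportHandler, plus the "extras" jar that provides TikaEntityProcessor -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
<!-- Tika and its dependencies, needed to parse PDF and Office files -->
<lib dir="${solr.install.dir:../../../..}/contrib/extraction/lib" regex=".*\.jar" />
```

The single regex in the first directive is meant to match both the core DIH jar and the extras jar, since both file names start with `solr-dataimporthandler-`.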

and the data.import.xml file:

<dataConfig>
    <dataSource type="BinFileDataSource" name="data"/>
    <dataSource type="FileDataSource" name="main"/>
    <document>
        <!-- url: location of the XML file that holds the metadata -->
        <entity name="rec" processor="XPathEntityProcessor" url="${solr.install.dir:}solr/solr_core_name/filestore/docs_metaData/metaData.xml" forEach="/docs/doc" dataSource="main" transformer="RegexTransformer,DateFormatTransformer">
            <field column="resourcename" xpath="//resourcename" name="resourceName" />
            <field column="title" xpath="//title" name="title" />
            <field column="subject" xpath="//subject" name="subject"/>
            <field column="description" xpath="//description" name="description"/>
            <field column="comments" xpath="//comments" name="comments"/>
            <field column="author" xpath="//author" name="author"/>
            <field column="keywords" xpath="//keywords" name="keywords"/>
            <!-- baseDir: path to the folder that contains the files (pdf | doc | docx | ...) -->
            <entity name="files" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="${solr.install.dir:}solr/solr_core_name/filestore/docs_folder" fileName="${rec.resourcename}" onError="skip" recursive="false">
                <field column="fileAbsolutePath" name="filePath" />
                <field column="resourceName" name="resourceName" />
                <field column="fileSize" name="size" />
                <field column="fileLastModified" name="lastModified" />
                <!-- for each file, extract metadata with Tika if it is not in the XML metadata file -->
                <entity name="file" processor="TikaEntityProcessor" dataSource="data" format="text" url="${files.fileAbsolutePath}" onError="skip" recursive="false">
                    <field column="title" name="title" meta="true"/>
                    <field column="subject" name="subject" meta="true"/>
                    <field column="description" name="description" meta="true"/>
                    <field column="comments" name="comments" meta="true"/>
                    <field column="Author" name="author" meta="true"/>
                    <field column="Keywords" name="keywords" meta="true"/>
                </entity>
            </entity>
        </entity>
    </document>
</dataConfig>

After that, all you have to do is create the XML file (metaData.xml):

<docs>
    <doc>
        <resourcename>fileName.pdf</resourcename>
        <title></title>
        <subject></subject>
        <description></description>
        <comments></comments>
        <author></author>
        <keywords></keywords>
    </doc>
</docs>

and put all the files in one folder:

"${solr.install.dir:}solr/solr_core_name/filestore/docs_folder"

${solr.install.dir:} resolves to the Solr installation folder.

Update regarding the question

How do I set literal.id to the filePath?

In data.import.xml, map fileAbsolutePath to id:

<field column="fileAbsolutePath" name="id" />
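
For context, here is a sketch of where that mapping would sit: inside the `files` entity of the data.import.xml shown earlier. The entity attributes are copied from that config. Keeping the original `filePath` mapping next to the new `id` mapping is an assumption about what you want stored; drop it if the path is only needed as the key.

```xml
<entity name="files" dataSource="null" rootEntity="false" processor="FileListEntityProcessor" baseDir="${solr.install.dir:}solr/solr_core_name/filestore/docs_folder" fileName="${rec.resourcename}" onError="skip" recursive="false">
    <!-- use the absolute file path as the document's unique key -->
    <field column="fileAbsolutePath" name="id" />
    <!-- (assumption) keep the original filePath mapping as well -->
    <field column="fileAbsolutePath" name="filePath" />
    <field column="fileSize" name="size" />
    <field column="fileLastModified" name="lastModified" />
    <!-- the nested TikaEntityProcessor entity stays unchanged here -->
</entity>
```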

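One last thing

In this example, the ID I used is generated automatically by the update processor chain

<updateRequestProcessorChain name="dedupe">

which creates a unique ID based on a hash of the content, to avoid duplicates.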
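
The answer shows only the opening tag of that chain. A fuller sketch, modelled on the de-duplication example in the Solr Reference Guide, might look like the following. The `fields` list (`title,author,description`) and the choice of `Lookup3Signature` are assumptions for illustration, not the author's actual settings; the chain name has to match the `update.chain` value in the request handler.

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- the computed hash is written into this field -->
    <str name="signatureField">id</str>
    <bool name="overwriteDupes">false</bool>
    <!-- (assumption) fields whose values feed the hash; pick the ones that define a duplicate for you -->
    <str name="fields">title,author,description</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

Because the signature is stored in `id` in this sketch, it would likely replace an `id` taken from fileAbsolutePath, so treat the path-based ID and the hash-based ID as alternatives rather than combining them.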
