Apache Tika extract scanned PDF files

Posted 2023-04-17

[Title]: Apache Tika extract scanned PDF files [Posted]: 2015-11-28 00:34:34 [Question]:

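I'm having some problems with Apache Tika (version 1.10). I have some PDF files that are just scanned sheets of paper, which means every page is a single image. My goal is to extract the text from these PDF files.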

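My Tesseract setup is correct, and extracting text from JPG and PNG files works like a charm. The code I'm using looks like this (never mind the missing exception handling):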

public String extractText(InputStream stream)
        throws IOException, SAXException, TikaException {
    AutoDetectParser parser = new AutoDetectParser();
    // Integer.MAX_VALUE raises the handler's write limit on extracted characters
    BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();
    parser.parse(stream, handler, metadata, context);
    return handler.toString();
}

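I searched a lot but didn't find any solution that works for me. I already tried the setExtractInlineImages method of the PDFParserConfig class, but that didn't change anything. Extracting embedded documents with a custom ParsingEmbeddedDocumentExtractor did extract the embedded resources of doc files, but it doesn't work for my PDF files.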

If any of you could offer some help, that would be awesome :)

[Comments]:

Did you attach the PDFParserConfig, with that option set, to the context?

Yes, I did. But it has no effect :/

Can you post the code you used to do that, so we can check that it is correct?

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
ParseContext context = new ParseContext();
context.set(PDFParserConfig.class, config);
PDFParser pdfParser = new PDFParser();
pdfParser.setPDFParserConfig(config);
pdfParser.parse(stream, handler, metadata, context);

There you go, and thanks for your help so far :)

Does running the Tika app with the -z (extract) flag get the scanned images out of the file?

[Answer 1]:

Tim Allison came up with the solution:

Parser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

TesseractOCRConfig config = new TesseractOCRConfig();
PDFParserConfig pdfConfig = new PDFParserConfig();
pdfConfig.setExtractInlineImages(true);

ParseContext parseContext = new ParseContext();
parseContext.set(TesseractOCRConfig.class, config);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(Parser.class, parser); //need to add this to make sure recursive parsing happens!

parser.parse(stream, handler, new Metadata(), parseContext);

This works for me :)

Edit: here is the complete solution:

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

import java.io.FileInputStream;
import java.io.IOException;

/**
 * @since 8/26/16
 */
public class Sample {
    public static void main(String[] args)
            throws IOException, TikaException, SAXException {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);

        TesseractOCRConfig config = new TesseractOCRConfig();
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);

        ParseContext parseContext = new ParseContext();
        parseContext.set(TesseractOCRConfig.class, config);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //need to add this to make sure recursive parsing happens!
        parseContext.set(Parser.class, parser);

        Metadata metadata = new Metadata();
        // try-with-resources closes the file even if parsing fails
        try (FileInputStream stream = new FileInputStream("samplepdf.pdf")) {
            parser.parse(stream, handler, metadata, parseContext);
        }
        System.out.println(metadata);
        String content = handler.toString();
        System.out.println("===============");
        System.out.println(content);
        System.out.println("Done");
    }
}

Maven dependencies:

<dependencies>
    <dependency>
      <groupId>org.apache.tika</groupId>
      <artifactId>tika-parsers</artifactId>
      <version>1.13</version>
    </dependency>
    <dependency>
      <groupId>com.levigo.jbig2</groupId>
      <artifactId>levigo-jbig2-imageio</artifactId>
      <version>1.6.5</version>
    </dependency>
  </dependencies>

[Discussion]:

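I have tried the solution and followed the Apache Tika Jira, but it doesn't work. I don't get any errors, but the output is empty.

My problem got solved. See: ***.com/questions/39762841/…

Thamme, thank you. Please update the answer to include the following dependency (thanks to Rana's link above), along with a warning about the licensing implications of levigo and jai.

<dependency>
  <groupId>com.github.jai-imageio</groupId>
  <artifactId>jai-imageio-core</artifactId>
  <version>1.3.1</version>
</dependency>

Hi, I used the code above and found that the extraction result is the same whether or not I include tesseract. Can you tell me why tesseract is needed? Thanks in advance.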
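A note on the symptoms reported above (empty output, or no difference with and without tesseract): one possible cause, though not the only one, is that the Java process cannot launch the tesseract executable at all, so no OCR takes place. The sketch below uses only the JDK and is not part of Tika. The class name TesseractCheck and the method isCommandAvailable are made up for illustration. It assumes Java 9 or later and that "tesseract --version" exits with code 0 on your installation.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class TesseractCheck {

    /**
     * Returns true if "<command> --version" can be started and exits with
     * code 0 within ten seconds. For this thread, command would be "tesseract".
     * Hypothetical helper, not a Tika API.
     */
    static boolean isCommandAvailable(String command) {
        ProcessBuilder builder = new ProcessBuilder(command, "--version");
        builder.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe (Java 9+).
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        try {
            Process process = builder.start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            // Usually means the executable was not found on the PATH.
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + isCommandAvailable("tesseract"));
    }
}
```

If this prints false in the same environment that runs the Tika code, fixing the PATH (or the Tesseract installation) is a reasonable first step. If it prints true, the cause is probably elsewhere, for example a missing image codec dependency such as the ones discussed above.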
