How can I read a PDF in Python? [duplicate]

Posted 2023-02-24


【Title】How can I read pdf in python? [duplicate] 【Posted】2018-01-29 09:49:22 【Question】:

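How can I read a PDF in Python? I know one way is to convert it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best suited to PDF extraction?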

【Comments on the question】:

【Answer 1】:

You can use the PyPDF2 package.

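# Install PyPDF2 (run this in a shell, not in Python):
#   pip install PyPDF2

# Import the required module
import PyPDF2

# Open the PDF in binary mode; the with-block closes the file afterwards
with open('example.pdf', 'rb') as file:
    # Create a PDF reader object
    # (PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0)
    fileReader = PyPDF2.PdfReader(file)

    # Print the number of pages in the PDF file
    # (len(reader.pages) replaces the old numPages attribute)
    print(len(fileReader.pages))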

See the documentation: http://pythonhosted.org/PyPDF2/
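The snippet above only counts pages. To get at the text itself, a minimal sketch might look like the following. It assumes PyPDF2 3.x, where the reader class is `PdfReader` and each page object has `extract_text()`; the helper names `extract_pages` and `empty_page_numbers` are made up for this illustration. Pages that come back blank may be scanned images, which this kind of extraction cannot read without OCR.

```python
def extract_pages(path):
    """Return a list with the extracted text of each page (one string per page)."""
    # Imported lazily so the helper below can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 3.x

    with open(path, "rb") as f:
        reader = PdfReader(f)
        # extract_text() can yield an empty result for image-only pages
        return [page.extract_text() or "" for page in reader.pages]


def empty_page_numbers(page_texts):
    """Return the 1-based numbers of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts, start=1) if not text.strip()]
```

Calling `empty_page_numbers(extract_pages('example.pdf'))` would list the pages where nothing could be extracted, which is a quick way to spot a PDF that needs a different tool.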

【Comments】:

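Is there a workaround for the "PyPDF2.utils.PdfReadError: EOF marker not found" error?

You're not really showing how to get the actual text of the PDF here. Your code only creates a reader object at 0x10d31f278.

PyPDF2, PyPDF3 and PyPDF4 are not maintained. I recommend using pymupdf.

I tried this package on an order PDF from Amazon. It found 33 pages, but the extractText() API returned nothing for every page.

Yes, I have tested several PDFs and the extractText() API skips some text. It does not print all the text in the PDF.

【Answer 2】: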

Try PyPDF2.

There is a good tutorial here: https://automatetheboringstuff.com/chapter13/
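A comment on the first answer reports "EOF marker not found" when opening a file with PyPDF2. That error often means the file is truncated or is not really a PDF (for example, an HTML error page saved with a .pdf extension). A quick standard-library check can tell those cases apart before trying a parser. This is only a sketch: `check_pdf_markers`, `check_pdf_file` and the 1024-byte search window are choices made for this illustration, not part of any library.

```python
def check_pdf_markers(data: bytes) -> dict:
    """Report whether raw bytes carry the usual PDF header and end-of-file marker."""
    return {
        # A PDF normally begins with "%PDF-" followed by the version number
        "has_header": data.lstrip().startswith(b"%PDF-"),
        # The "%%EOF" marker is expected near the very end of the file
        "has_eof": b"%%EOF" in data[-1024:],
    }


def check_pdf_file(path: str) -> dict:
    """Run the same check on a file on disk."""
    with open(path, "rb") as f:
        return check_pdf_markers(f.read())
```

If `has_header` is false, the download is probably not a PDF at all; if only `has_eof` is false, the file may have been cut off and is worth fetching again.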

【Comments】:

【Answer 3】:

You can use the textract module in Python.

To install it:

pip install textract

To read a PDF:

import textract
text = textract.process('path/to/pdf/file', method='pdfminer')

For details, see Textract.
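`textract.process` returns bytes rather than `str`, so the result usually needs decoding before it can be used as text. A minimal sketch, assuming the output is UTF-8 encoded and using the hypothetical helper names `clean_text` and `pdf_to_text`:

```python
import re


def clean_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode extracted bytes and tidy the whitespace line by line."""
    text = raw.decode(encoding, errors="replace")
    # Collapse runs of spaces and tabs inside each line, then trim the ends
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    # Drop blank lines left over from the PDF layout
    return "\n".join(line for line in lines if line)


def pdf_to_text(path: str) -> str:
    """Extract text from a PDF with textract (assumed installed) and tidy it."""
    import textract  # third-party; imported lazily so clean_text works without it

    return clean_text(textract.process(path, method="pdfminer"))
```

Keeping the decoding step separate from the extraction call makes it easy to swap in another extractor if textract does not install or run in a given environment.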

【Comments】:

As far as I can tell, textract is broken.

Textract also seems to be dead: github.com/deanmalmgren/textract/issues/350
