Word count PDF files when walking directory

Posted 2023-03-07


【Title】Word count PDF files when walking directory 【Posted】2018-03-05 18:29:55 【Question】:

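Hello *** community!

I am trying to build a Python program that walks a directory (and all of its subdirectories) and produces a cumulative word count across all .html, .txt, and .pdf files. Reading a .pdf file requires something extra (PdfFileReader) to parse it. When a .pdf file is parsed, the following error occurs and the program stops:

AttributeError: 'PdfFileReader' object has no attribute 'startswith'

When no .pdf files are parsed, the program completes successfully.

Code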

#!/usr/bin/python

import re
import os
import sys
import os.path
import fnmatch
import collections
from PyPDF2 import PdfFileReader


ignore = [<lots of words>]

def extract(file_path, counter):
    words = re.findall('\w+', open(file_path).read().lower())
    counter.update([x for x in words if x not in ignore and len(x) > 2])

def search(path):
    print path
    counter = collections.Counter()

    if os.path.isdir(path):
        for root, dirs, files in os.walk(path):
            for file in files:
                if file.lower().endswith(('.html', '.txt')):
                        print file
                        extract(os.path.join(root, file), counter)
                if file.lower().endswith(('.pdf')):
                    file_path = os.path.abspath(os.path.join(root, file))
                    print file_path

                    with open(file_path, 'rb') as f:
                        reader = PdfFileReader(f)
                        extract(os.path.join(root, reader), counter)
                        contents = reader.getPage(0).extractText().split('\n')
                        extract(os.path.join(root, contents), counter)
                        pass
    else:
        extract(path, counter)

    print(counter.most_common(50))

search(sys.argv[1])

Full error

Traceback (most recent call last):
  File line 50, in <module>
    search(sys.argv[1])
  File line 36, in search
    extract(os.path.join(root, reader), counter)
  File line 68, in join
    if b.startswith('/'):
AttributeError: 'PdfFileReader' object has no attribute 'startswith'

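The failure occurs when the extract function is called for a .pdf file. Any help or guidance would be greatly appreciated!

Expected result (this works when there are no .pdf files)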

[('cyber', 5101), ('2016', 5095), ('date', 4912), ('threat', 4343)]

【Comments】:

If you are going to abandon the exact same question, please delete it.

【Answer 1】:

The problem is this line:

reader = PdfFileReader(f)

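It returns an object of type PdfFileReader. You then pass this object to os.path.join() and on to extract(), both of which expect a file path string rather than a PdfFileReader object. The traceback shows the failure inside join, which calls startswith on its argument.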
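To make the failure easier to see in isolation, here is a minimal sketch that reproduces it without PyPDF2. `FakeReader` and `try_join` are hypothetical names used only for illustration; `FakeReader` stands in for a PdfFileReader instance. On Python 2 the call fails with AttributeError as in the question, while Python 3 reports TypeError for the same mistake.

```python
import os.path


class FakeReader(object):
    """Hypothetical stand-in for a PdfFileReader instance (not a path string)."""


def try_join(root, item):
    """Join root and item, returning the exception name if item is not a path."""
    try:
        return os.path.join(root, item)
    except (AttributeError, TypeError) as exc:
        # Python 2 raises AttributeError ('startswith'), Python 3 raises TypeError.
        return type(exc).__name__


print(try_join("docs", "report.pdf"))   # a normal joined path
print(try_join("docs", FakeReader()))   # the name of the raised exception
```

The fix is therefore to join only strings (the directory and the file name) and to build the reader from the resulting path.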

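I suggest moving the PDF-related processing that you currently do in search() into extract(). Inside extract(), you then check whether the file is a PDF and handle it accordingly. Something like this: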

def extract(file_path, counter):
    if file_path.lower().endswith('.pdf'):
        # Open the PDF in binary mode and keep it open while the text is extracted.
        with open(file_path, 'rb') as f:
            reader = PdfFileReader(f)
            # Only the first page is read here, as in the question.
            text = reader.getPage(0).extractText()
        words = re.findall(r'\w+', text.lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    elif file_path.lower().endswith(('.html', '.txt')):
        words = re.findall(r'\w+', open(file_path).read().lower())
        counter.update([x for x in words if x not in ignore and len(x) > 2])
    else:
        pass  # some other file type...

I haven't tested the code snippet above, but hopefully you get the idea.
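For completeness, the whole approach might be sketched as follows in Python 3. This is an untested-against-real-PDFs sketch under several assumptions: `IGNORE`, `words_from_text`, and `read_pdf_text` are hypothetical names introduced here, the stop-word set is left empty as a placeholder, and the PDF branch assumes a PyPDF2 version that still provides the `PdfFileReader`, `getNumPages`, `getPage`, and `extractText` names used in the question. PyPDF2 is imported lazily so that .txt and .html counting works even when it is not installed.

```python
import collections
import os
import re

# Placeholder for the asker's stop-word list (assumption: left empty here).
IGNORE = set()


def words_from_text(text):
    """Return lowercase words longer than two characters that are not ignored."""
    return [w for w in re.findall(r'\w+', text.lower())
            if w not in IGNORE and len(w) > 2]


def read_pdf_text(file_path):
    """Extract text from every page of a PDF using the question's PyPDF2 API."""
    from PyPDF2 import PdfFileReader  # lazy import; only needed for .pdf files
    with open(file_path, 'rb') as f:
        reader = PdfFileReader(f)
        pages = [reader.getPage(i).extractText()
                 for i in range(reader.getNumPages())]
    return '\n'.join(pages)


def extract(file_path, counter):
    """Update counter with the words of one file, chosen by its extension."""
    lower = file_path.lower()
    if lower.endswith('.pdf'):
        counter.update(words_from_text(read_pdf_text(file_path)))
    elif lower.endswith(('.html', '.txt')):
        with open(file_path, errors='ignore') as f:
            counter.update(words_from_text(f.read()))
    # Any other file type is skipped.


def search(path):
    """Walk path (a file or a directory) and return the cumulative Counter."""
    counter = collections.Counter()
    if os.path.isdir(path):
        for root, _dirs, files in os.walk(path):
            for name in files:
                # Join only strings: the directory and the file name.
                extract(os.path.join(root, name), counter)
    else:
        extract(path, counter)
    return counter
```

It could be called as `print(search(sys.argv[1]).most_common(50))` after `import sys`. Returning the counter instead of printing inside search() keeps the function easy to reuse and test.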

【Comments】:
