Python Study Notes: 20 Reading and Writing Word and PDF Files with Python

Posted 2021-09-05 better meˇ:)

Writing Word documents

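Writing Word documents from Python requires the third-party library python-docx (installed with pip install python-docx and imported as docx). The following example writes a simple Word document: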

from docx import Document
from docx.shared import Cm, Pt
from docx.document import Document as Doc
# Create a Word document object
document = Document()  # type:Doc

# font = document.styles['Normal'].font
# font.size = Pt(22)
# Add the document title (heading level 0)
document.add_heading('快快乐乐学Python', 0)
# Add a paragraph
p = document.add_paragraph('Python是一门非常流行的编程语言，它')
run = p.add_run('简单')
run.bold = True
# Set the font size
run.font.size = Pt(18)
run = p.add_run('而且')
# Set the font name
run.font.name = 'HYj1gf'
p.add_run('优雅。').italic = True
# Add a level-1 heading
document.add_heading('Heading, level 1', level=1)
document.add_paragraph('Intense quote', style='Intense Quote')
# Bulleted list item
document.add_paragraph(
    'first item in unordered list', style='List Bullet'
)
# Numbered list item
document.add_paragraph(
    'first item in ordered list', style='List Number'
)
# Add a picture
document.add_picture('resources/beauty.png', width=Cm(3.2))
# Add a section break
document.add_section()

records = (
    ('小龙', '男', '1999-02-15'),
    ('小英', '女', '2000-10-20'),
    ('小白', '女', '1998-07-18')
)

table = document.add_table(rows=1, cols=3)
# Apply a built-in table style
table.style = 'Colorful List Accent 1'
hdr_cells = table.rows[0].cells
hdr_cells[0].text = '姓名'
hdr_cells[1].text = '性别'
hdr_cells[2].text = '生日'
for name, sex, birthday in records:
    row_cells = table.add_row().cells
    row_cells[0].text = name
    row_cells[1].text = sex
    row_cells[2].text = birthday
# Add a page break
document.add_page_break()

document.save('resources/demo.docx')
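The example above only writes a document. Reading one back could look roughly like the sketch below. It relies on the `paragraphs` and `tables` attributes that python-docx documents expose; the helper name `read_docx_text` is made up for illustration and is not part of the library.

```python
def read_docx_text(document):
    """Collect paragraph text and table rows from a python-docx Document.

    `document` is expected to expose `.paragraphs` and `.tables`,
    as objects returned by docx.Document(...) do.
    """
    # Keep only paragraphs that contain something other than whitespace
    paragraphs = [p.text for p in document.paragraphs if p.text.strip()]
    # Each table becomes a list of rows, each row a list of cell strings
    tables = [
        [[cell.text for cell in row.cells] for row in table.rows]
        for table in document.tables
    ]
    return paragraphs, tables


# Typical use (requires python-docx and an existing file):
#   from docx import Document
#   paragraphs, tables = read_docx_text(Document('resources/demo.docx'))
```

Note that `document.paragraphs` covers only body-level paragraphs, so text inside tables shows up in the second return value rather than the first.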

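In everyday work you may need to produce many similar documents, and Python can take over the repetitive part. For example, to write employment separation certificates in bulk, we can load a certificate template, fill in the details that differ from person to person, and save a separate certificate for each employee. The code is shown below: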

from docx import Document
from docx.document import Document as Doc

employees = [
    {
        'name': '小龙',
        'id': '100200198011280001',
        'sdate': '2008年3月1日',
        'edate': '2012年2月29日',
        'department': '产品研发',
        'position': '架构师'
    },
    {
        'name': '小青',
        'id': '510210199012125566',
        'sdate': '2019年1月1日',
        'edate': '2021年4月30日',
        'department': '产品研发',
        'position': 'Python开发工程师'
    }
]


for emp_dict in employees:
    doc = Document('resources/离职证明模板.docx')  # type: Doc
    for p in doc.paragraphs:
        if '{' not in p.text:
            continue
        for run in p.runs:
            if '{' not in run.text:
                continue
            # Replace the placeholder with the actual value (one placeholder per run)
            start, end = run.text.find('{'), run.text.find('}')
            key, place_holder = run.text[start + 1:end], run.text[start:end + 1]
            run.text = run.text.replace(place_holder, emp_dict[key])
    doc.save(f'resources/{emp_dict["name"]}离职证明.docx')

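In the template, each field to be filled in is marked with a placeholder wrapped in braces, such as {name}; the name inside the braces must match a key in the employee dictionary. Because the code works run by run, each placeholder has to sit entirely inside a single run of the template.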
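The loop above replaces only the first placeholder it finds in each run. If a run may contain several placeholders, a small helper built on `re.sub` is one way to handle them all. This is a sketch only: `fill_placeholders` and `PLACEHOLDER` are hypothetical names, and the choice to leave unknown keys untouched (instead of raising `KeyError`) is an assumption, not the original behavior.

```python
import re

# Matches {key}, where key is made of word characters
PLACEHOLDER = re.compile(r'\{(\w+)\}')


def fill_placeholders(text, values):
    """Replace every {key} in text with values[key]; unknown keys stay as-is."""
    def substitute(match):
        key = match.group(1)
        # Fall back to the original placeholder text when the key is missing
        return str(values.get(key, match.group(0)))
    return PLACEHOLDER.sub(substitute, text)


# Inside the loop over runs it could be used as:
#   run.text = fill_placeholders(run.text, emp_dict)
```

A placeholder that Word has split across two runs is still not handled; the template has to keep each placeholder in one run.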

Working with PDF files

Reading a PDF and extracting text

In Python, the third-party library PyPDF2 can be used to read PDF files.

import PyPDF2

# Uses the PyPDF2 3.x names; the 1.x names (PdfFileReader, getPage, ...) were removed in 3.0
reader = PyPDF2.PdfReader('resources/XGBoost.pdf')
writer = PyPDF2.PdfWriter()
for current_page in reader.pages:
    print(current_page.extract_text())
    current_page.rotate(90)   # rotate 90 degrees clockwise
    writer.add_page(current_page)
    writer.add_blank_page()   # add a blank page
with open('resources/XGBoost-modified.pdf', 'wb') as file:
    writer.write(file)

Adding a password to a PDF file

The writer's encrypt method encrypts a PDF file, so that anyone who wants to open it must enter the correct password first.

import PyPDF2

# Uses the PyPDF2 3.x names; the 1.x names (PdfFileReader, getPage, ...) were removed in 3.0
reader = PyPDF2.PdfReader('resources/XGBoost.pdf')
writer = PyPDF2.PdfWriter()
for page in reader.pages:
    writer.add_page(page)
# Encrypt the PDF file
writer.encrypt('123456')
with open('resources/XGBoost-encrypted.pdf', 'wb') as file:
    writer.write(file)
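To read the encrypted file back, the reader has to be unlocked with the same password. The sketch below assumes PyPDF2 2.x or later (or its successor pypdf), where a reader exposes `is_encrypted` and `decrypt`, and where `decrypt` returns a zero value when the password is wrong. The helper name `unlock` is made up for illustration.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if it is encrypted.

    `reader` is expected to behave like PyPDF2.PdfReader: it has an
    `is_encrypted` attribute and a `decrypt(password)` method whose
    result is falsy (0) when the password does not match.
    """
    if reader.is_encrypted and not reader.decrypt(password):
        raise ValueError('wrong password')
    return reader


# Typical use (requires PyPDF2 and the file written above):
#   import PyPDF2
#   reader = unlock(PyPDF2.PdfReader('resources/XGBoost-encrypted.pdf'), '123456')
#   print(len(reader.pages))
```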

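Adding a watermark to a PDF file

The idea is to merge the watermark page onto every page of the PDF that needs the watermark. The page's merge method (mergePage in PyPDF2 1.x, merge_page in later releases) overlays one page on top of another.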

import PyPDF2

# Uses the PyPDF2 3.x names; the 1.x names (PdfFileReader, mergePage, ...) were removed in 3.0
reader1 = PyPDF2.PdfReader('resources/XGBoost.pdf')
reader2 = PyPDF2.PdfReader('resources/watermark.pdf')
writer = PyPDF2.PdfWriter()

# The first page of watermark.pdf is the watermark
watermark_page = reader2.pages[0]
for current_page in reader1.pages:
    # Overlay the watermark on the current page
    current_page.merge_page(watermark_page)
    writer.add_page(current_page)

with open('resources/XGBoost-watermarked.pdf', 'wb') as file:
    writer.write(file)
