Split a dataframe with pandas groupby and generate multiple PDFs

Posted 2023-02-24

【Title】Split dataframe based on pandas groupby and generate multiple PDFs 【Asked】2022-01-16 02:22:33 【Question】:

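I have a table listing 3 employees and the 3-4 courses each of them should take. I want to create a separate PDF for each employee in this table. The first PDF would list the 3 courses Emp1 will take, the second PDF would list the courses Emp2 will take, and so on.

The code below creates only one PDF, which contains the full course list for all employees.

My idea is to first split/group the data by EmpNo and then create the individual PDFs, which means I need a for loop to iterate over the groups. However, I can't figure out how to do this...

DataFrame code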

pip install fpdf #To generate PDF

import pandas as pd

# Raw data as a dict of columns
data = {
  'EmpNo': ['123','123','123','456','456', '456','456','789','789','789'],
  'First Name': ['John', 'John', 'John', 'Jane', 'Jane', 'Jane', 'Jane', 'Danny', 'Danny', 'Danny'],
  'Last Name': ['Doe', 'Doe' ,'Doe', 'Doe' ,'Doe', 'Doe', 'Doe', 'Roberts', 'Roberts', 'Roberts'],
  'Activity Code': ['HR-CONF-1', 'HR-Field-NH-ONB','COEATT-2021','HR-HBK-CA-1','HR-WD-EMP','HR-LIST-1','HS-Guide-3','HR-WD-EMP','HR-LIST-1','HS-Guide-3'],
  'RegistrationDate': ['11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021','11/22/2021', '11/22/2021', '11/22/2021','11/22/2021']
}

df = pd.DataFrame(data = data, columns = ['EmpNo','First Name', 'Last Name', 'Activity Code', 'RegistrationDate'])
# One row per employee
employees = df.drop_duplicates(subset=['EmpNo'])

print(df)

The input looks like the dataframe printed above.

PDF generation code

from fpdf import FPDF

class PDF(FPDF):
    # Page header
    def header(self):
        # Helvetica bold 15
        self.set_font('Helvetica', 'B', 15)
        # Move to the right
        self.cell(80)
        # Title
        self.cell(42, 2, 'Plan', 0, 0, 'C')
        # Line break
        self.ln(20)

    # Page footer
    def footer(self):
        # Position at 1.5 cm from bottom
        self.set_y(-15)
        # Helvetica italic 8
        self.set_font('Helvetica', 'I', 8)
        # Page number; '{nb}' is replaced with the page count by alias_nb_pages()
        self.cell(0, 10, 'Page ' + str(self.page_no()) + '/{nb}', 0, 0, 'C')

# This loop only prints each EmpNo; the PDF below is still built once, outside the loop
for EmpNo in employees['EmpNo']:
    print(EmpNo)

# Instantiation of inherited class
pdf = PDF()
pdf.alias_nb_pages()
pdf.add_page()
pdf.set_font('Helvetica', '', 11)
pdf.cell(80, 6, 'Employee ID: ' + str(df.loc[0]['EmpNo']), 0, 1, 'L')
pdf.ln(2.5)
# multi_cell takes (w, h, txt, border, align); it has no ln argument
pdf.multi_cell(160, 5, 'Dear ' + str(df.loc[0]['First Name']) + ' ' + str(df.loc[0]['Last Name']) + ', Please find below your Plan.', 0, 'L')
pdf.cell(80, 6, '', 0, 1, 'C')
pdf.set_font('Helvetica', 'B', 13)
pdf.cell(80, 6, 'Name', 0, 0, 'L')
pdf.cell(40, 6, 'Date', 0, 0, 'L')
pdf.cell(40, 6, 'Link', 0, 1, 'L')
pdf.cell(80, 6, '', 0, 1, 'C')
pdf.set_font('Helvetica', '', 8)
# Every row of the full dataframe is written, so all employees end up in one PDF
for i in range(len(df)):
    pdf.set_font('Helvetica', '', 8)
    pdf.cell(80, 6, df.loc[i]['Activity Code'], 0, 0, 'L')
    #pdf.cell(40, 6, df.loc[i]['Activity Link'], 0, 1, 'L')
    pdf.cell(40, 6, df.loc[i]['RegistrationDate'], 0, 0, 'L')
    pdf.set_font('Helvetica', 'U', 8)
    pdf.cell(40, 6, 'Click Here', 0, 1, 'L', link = 'www.google.com')
pdf.set_font('Helvetica', 'B', 10)
pdf.cell(80, 6, '', 0, 1, 'C')
pdf.cell(80, 6, 'IF YOU REQUIRE ANY HELP, PLEASE CONTACT US', 0, 0, 'L')
pdf.output(str(df.loc[0]['First Name']) + ' ' + str(df.loc[0]['Last Name'])+ '.pdf', 'F')

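Here is a snapshot of the generated PDF.

I can split the data with the following code, but I don't know how to access each individual split and then create multiple PDFs from them.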

splits = list(df.groupby('EmpNo'))

Any help would be greatly appreciated. Thanks.

【Comments】:

try "for i, g in df.groupby('EmpNo'): pdf.output(str(g.loc[0]['First Name']) + ' ' + str(g.loc[0]['Last Name'])+ '.pdf', 'F')"

【Answer 1】:

I would write the groupby like this:

for EmpNo, data in df.groupby("EmpNo"):

For each group, groupby yields the value of the variable it grouped on, together with the sub-dataframe whose rows match that value.
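As a rough illustration of that (key, sub-dataframe) pairing, here is a small sketch using a shortened version of the question's data. The names emp_no, group and groups are only for this example.

```python
import pandas as pd

# Shortened version of the question's data: three employees.
df = pd.DataFrame({
    "EmpNo": ["123", "123", "456", "456", "456", "789"],
    "Activity Code": ["HR-CONF-1", "COEATT-2021", "HR-HBK-CA-1",
                      "HR-WD-EMP", "HR-LIST-1", "HS-Guide-3"],
})

# Iterating over a groupby yields (key, sub-dataframe) pairs, one per EmpNo.
groups = {}
for emp_no, group in df.groupby("EmpNo"):
    groups[emp_no] = group
    print(emp_no, list(group["Activity Code"]))
```

Each sub-dataframe keeps the index labels of the original rows, so the group for "456" is indexed 2, 3, 4 rather than 0, 1, 2.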

Next, I would extract the first row of that dataframe. This makes it easier to get the name and similar attributes.

first_row = data.iloc[0]

(What's the difference between iloc and loc?)
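In short, iloc selects by position and loc selects by index label. A minimal sketch of why that matters for a grouped subset, assuming a small made-up frame with the question's column names:

```python
import pandas as pd

df = pd.DataFrame({
    "EmpNo": ["123", "123", "456", "456"],
    "First Name": ["John", "John", "Jane", "Jane"],
})

# The rows for employee 456 keep their original index labels, 2 and 3.
group = df[df["EmpNo"] == "456"]

first_by_position = group.iloc[0]  # position 0 within the subset
first_by_label = group.loc[2]      # label 2, carried over from df

try:
    group.loc[0]                   # label 0 does not exist in this subset
    label_zero_found = True
except KeyError:
    label_zero_found = False
```

Here iloc[0] always returns the first row of the subset, whereas loc[0] raises a KeyError for every group whose labels do not include 0.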

Since we already have the employee ID, we can skip looking it up in the dataframe. The other attributes can be looked up like first_row['First Name'].

pdf.cell(80, 6, 'Employee ID: ' + str(EmpNo), 0, 1, 'L')
# ...
# multi_cell takes (w, h, txt, border, align); it has no ln argument
pdf.multi_cell(160, 5, 'Dear ' + str(first_row['First Name']) + ' ' + str(first_row['Last Name']) + ', Please find below your Plan.', 0, 'L')

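Next, for the loop over the subset, I would use .iterrows() instead of range() and .loc. This is simpler and does not break when the dataframe's index does not start at zero. (After grouping, the second group's index no longer starts at zero.)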
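To show the row loop on its own, without the PDF calls, here is a small sketch that collects one text line per course for each employee. The frame is a cut-down stand-in for the question's data, and lines_per_employee is a name invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "EmpNo": ["123", "123", "456", "456"],
    "Activity Code": ["HR-CONF-1", "COEATT-2021", "HR-WD-EMP", "HR-LIST-1"],
    "RegistrationDate": ["11/22/2021"] * 4,
})

lines_per_employee = {}
for emp_no, group in df.groupby("EmpNo"):
    lines = []
    # iterrows() yields (index label, row) pairs; the label is not needed here.
    for _, row in group.iterrows():
        lines.append(f"{row['Activity Code']} on {row['RegistrationDate']}")
    lines_per_employee[emp_no] = lines
```

Because the loop never looks rows up by label, it behaves the same for the first group and for every later one.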

Here is the final source code after the changes:

import pandas as pd
from fpdf import FPDF

# Raw data as a dict of columns
data = {
  'EmpNo': ['123','123','123','456','456', '456','456','789','789','789'],
  'First Name': ['John', 'John', 'John', 'Jane', 'Jane', 'Jane', 'Jane', 'Danny', 'Danny', 'Danny'],
  'Last Name': ['Doe', 'Doe' ,'Doe', 'Doe' ,'Doe', 'Doe', 'Doe', 'Roberts', 'Roberts', 'Roberts'],
  'Activity Code': ['HR-CONF-1', 'HR-Field-NH-ONB','COEATT-2021','HR-HBK-CA-1','HR-WD-EMP','HR-LIST-1','HS-Guide-3','HR-WD-EMP','HR-LIST-1','HS-Guide-3'],
  'RegistrationDate': ['11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021', '11/22/2021','11/22/2021', '11/22/2021', '11/22/2021','11/22/2021']
}
df = pd.DataFrame(data = data, columns = ['EmpNo','First Name', 'Last Name', 'Activity Code', 'RegistrationDate'])


class PDF(FPDF):
    # Page header
    def header(self):
        # Helvetica bold 15
        self.set_font('Helvetica', 'B', 15)
        # Move to the right
        self.cell(80)
        # Title
        self.cell(42, 2, 'Plan', 0, 0, 'C')
        # Line break
        self.ln(20)

    # Page footer
    def footer(self):
        # Position at 1.5 cm from bottom
        self.set_y(-15)
        # Helvetica italic 8
        self.set_font('Helvetica', 'I', 8)
        # Page number; '{nb}' is replaced with the page count by alias_nb_pages()
        self.cell(0, 10, 'Page ' + str(self.page_no()) + '/{nb}', 0, 0, 'C')

# One iteration, and therefore one PDF, per employee
for EmpNo, data in df.groupby("EmpNo"):
    # Get first row of grouped dataframe
    first_row = data.iloc[0]

    # Instantiation of inherited class
    pdf = PDF()
    pdf.alias_nb_pages()
    pdf.add_page()
    pdf.set_font('Helvetica', '', 11)
    pdf.cell(80, 6, 'Employee ID: ' + str(EmpNo), 0, 1, 'L')
    pdf.ln(2.5)
    # multi_cell takes (w, h, txt, border, align); it has no ln argument
    pdf.multi_cell(160, 5, 'Dear ' + str(first_row['First Name']) + ' ' + str(first_row['Last Name']) + ', Please find below your Plan.', 0, 'L')
    pdf.cell(80, 6, '', 0, 1, 'C')
    pdf.set_font('Helvetica', 'B', 13)
    pdf.cell(80, 6, 'Name', 0, 0, 'L')
    pdf.cell(40, 6, 'Date', 0, 0, 'L')
    pdf.cell(40, 6, 'Link', 0, 1, 'L')
    pdf.cell(80, 6, '', 0, 1, 'C')
    pdf.set_font('Helvetica', '', 8)
    # Write one table line per course of this employee
    for _, row in data.iterrows():
        pdf.set_font('Helvetica', '', 8)
        pdf.cell(80, 6, row['Activity Code'], 0, 0, 'L')
        #pdf.cell(40, 6, row['Activity Link'], 0, 1, 'L')
        pdf.cell(40, 6, row['RegistrationDate'], 0, 0, 'L')
        pdf.set_font('Helvetica', 'U', 8)
        pdf.cell(40, 6, 'Click Here', 0, 1, 'L', link = 'www.google.com')
    pdf.set_font('Helvetica', 'B', 10)
    pdf.cell(80, 6, '', 0, 1, 'C')
    pdf.cell(80, 6, 'IF YOU REQUIRE ANY HELP, PLEASE CONTACT US', 0, 0, 'L')
    # Save as "<First Name> <Last Name>.pdf"
    pdf.output(str(first_row['First Name']) + ' ' + str(first_row['Last Name'])+ '.pdf', 'F')

I tested it, and it works.
