Find the number of RegEx matches x, duplicate some columns of the dataframe x times + Unicode Error

Posted 2023-03-11


Asked: 2021-12-16 17:56:40

Question:

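I am parsing a multi-page PDF to extract each order's string data and the list of products on each page (each page is one order), before compiling the values into a table with a Pandas DataFrame.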

[Images in the original post: the current output; my desired output]

As you can see, it only captures the first product on each page (that is, in each order) and then moves on.

My code is as follows:

import re

import pdfplumber
import pandas as pd
from collections import namedtuple

path = "/Users/mymacbook/Downloads/onefolder/123.pdf"

Line = namedtuple('Line', 'PO DeliveryDate PODate ShipTo Barcode Description SMCode Quantity Price')

PO_re = re.compile(r'(\d+\.\d+).+(\d{2}\.\d{2}\.\d{4})')
ord_re = re.compile(r'(\d+) (.*) (\d{5,6}) (\d+\,\d+) (\d+\,\d+)')
ShipTo_re = re.compile(r'(\w+) (Company Name MM) (.*)')
PODate_re = re.compile(r'\d{1,2}\.\d{1,2}\.\d{4}')
lines = []

dts = []
with pdfplumber.open(path) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            if PO_re.search(line):
            #line.startswith('Số'):
                a = line.split()
                b = str(a)
                c = PO_re.search(str(b))
                PO = c.group(1)
                DeliveryDate = c.group(2)
                print(DeliveryDate, PO)
                
            elif PODate_re.search(line):
                r = line.split()
                date2 = re.findall(r"\d{1,2}\.\d{1,2}\.\d{4}", str(r))
                print(date2)
                PODate = date2[1]
                print(PODate)
                
            elif line.startswith('Name', 30):
                print(line)
                c = line.split(' Name ')
                print(c)
                ShipTo = c[-1]
                print(ShipTo)
                
            elif ord_re.search(line):
                print(line)
                l = ord_re.search(str(line))
                haha = l.group(1)
                Barcode = haha[-13:]
                print(Barcode)
                Description = l.group(2)
                print(Description)
                SMCode = l.group(3)
                print(SMCode)
                Quantity = l.group(4)
                print(Quantity)
                Price = l.group(5)
                print(Price)
            
        dts = Line(PO, DeliveryDate, PODate, ShipTo, Barcode, Description, SMCode, Quantity, Price)
        for item in dts:
            if ord_re.match(line):
                pass  # unfinished attempt: the intent was one row per matching product line
        lines.append(dts)
          

df = pd.DataFrame(lines)
df.apply(lambda x: pd.api.types.infer_dtype(x.values))
df

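After converting the current output to a CSV file, the non-English text gets strange characters in place of normal spaces, and all the Vietnamese characters turn into something completely different, even though the output shows no error.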

Any suggestions on how to solve these problems?
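
For the first problem, one possible approach is to keep the order-level fields (PO, delivery date, ship-to) in variables while scanning a page and to append a row every time a product line matches, so the order fields are repeated once per product. The sketch below is only an illustration under assumptions: `SAMPLE_PAGE` and `parse_page` are hypothetical, the real layout of the PDF text is not shown in the question, and `PODate` is left out for brevity. It runs on a plain string so it can be tried without a PDF.

```python
import re
from collections import namedtuple

Line = namedtuple("Line", "PO DeliveryDate ShipTo Barcode Description SMCode Quantity Price")

# Hypothetical page text; the actual PDF layout is an assumption.
SAMPLE_PAGE = """\
Order 4500.123 delivery 20.12.2021
Ship to Name Store A
8930000000011 Green tea 12345 10,00 25,50
8930000000028 Black tea 123456 2,00 30,00
"""

po_re = re.compile(r"(\d+\.\d+).+(\d{2}\.\d{2}\.\d{4})")
ord_re = re.compile(r"(\d+) (.*) (\d{5,6}) (\d+,\d+) (\d+,\d+)")


def parse_page(text):
    """Return one Line per product line, repeating the order-level fields."""
    rows = []
    po = delivery_date = ship_to = None
    for line in text.split("\n"):
        po_match = po_re.search(line)
        ord_match = ord_re.search(line)
        if po_match:
            # Order header: remember the values for the product lines that follow.
            po, delivery_date = po_match.group(1), po_match.group(2)
        elif " Name " in line:
            ship_to = line.split(" Name ")[-1]
        elif ord_match:
            # Append here, inside the loop, so every product gets its own row.
            rows.append(Line(
                po, delivery_date, ship_to,
                ord_match.group(1)[-13:],  # last 13 digits as the barcode
                ord_match.group(2),
                ord_match.group(3),
                ord_match.group(4),
                ord_match.group(5),
            ))
    return rows


rows = parse_page(SAMPLE_PAGE)
for row in rows:
    print(row)
```

With pdfplumber, the same function could be called on `page.extract_text()` for each page and the results collected with `lines.extend(...)` before building the DataFrame. The number of rows returned per page is the number of product matches x.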

Comments:

Please provide a minimal reproducible example. A sample of the PDF text!

Answer 1:

"All the Vietnamese characters turn into something completely different"

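If the characters look right in df, the problem is in the code that saves the CSV, which is not shown here. Try adding the encoding="utf-8" argument to your df.to_csv() call. If that does not work, look for an encoding that supports your characters.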
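
A minimal sketch of that suggestion follows. The sample data, the `save_and_reload` helper, and the file name are hypothetical. It also uses `encoding="utf-8-sig"` rather than plain `"utf-8"`, on the assumption that the CSV is opened in a spreadsheet program that relies on a byte order mark to detect UTF-8; if the file is only read back by code, `"utf-8"` should be enough.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows containing Vietnamese text.
df = pd.DataFrame({
    "Description": ["Trà xanh", "Cà phê sữa"],
    "Quantity": ["10,00", "2,00"],
})


def save_and_reload(frame, encoding="utf-8-sig"):
    """Write the frame to a CSV file, then return the raw bytes and the reloaded frame."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "orders.csv")
        frame.to_csv(path, index=False, encoding=encoding)
        with open(path, "rb") as fh:
            raw = fh.read()
        # dtype=str keeps values such as "10,00" as text instead of parsing them.
        reloaded = pd.read_csv(path, encoding=encoding, dtype=str)
    return raw, reloaded


raw, reloaded = save_and_reload(df)
print(raw[:3])                          # the first bytes of the file (the BOM)
print(reloaded["Description"].tolist())
```

If the text still looks wrong after this, the program opening the file is probably decoding it with a different encoding than the one used to write it.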

