使用正则表达式解析多个文本字段并编译成 Pandas DataFrame

Posted

技术标签:

【中文标题】使用正则表达式解析多个文本字段并编译成 Pandas DataFrame【英文标题】:Parsing Multiple Text Fields Using Regex and Compiling into Pandas DataFrame 【发布时间】:2020-04-21 11:19:25 【问题描述】:

我正在尝试使用 python 和正则表达式解析文本文件以构建特定的 pandas 数据框。下面是我正在解析的文本文件和我正在寻找的理想 pandas DataFrame 的示例。

Sample Text

Washington, DC  November 27, 2019
USDA Truck Rate Report

WA_FV190 

FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019                                                                                   
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019                                                                                    

PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().                                                                     

In areas where rates are based on package rates, per-load rates were                                                                
derived by multiplying the package rate by the number of packages in                                                                
the most usual load in a 48-53 foot trailer.

CENTRAL AND WESTERN ARIZONA                                                                                                         
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE   SLIGHT SHORTAGE 
--                                                    

ATLANTA           5100   5500                                                                                                       
BALTIMORE         6300   6600                                                                                                       
BOSTON            7000   7300                                                                                                       
CHICAGO           4500   4900                                                                                                       
DALLAS            3400   3800                                                                                                       
MIAMI             6400   6700                                                                                                       
NEW YORK          6600   6900                                                                                                       
PHILADELPHIA      6400   6700 

                   2019           2018                                                                                              

              NOV 17-23      NOV 18-24                                                                                              

U.S.             25,701         22,956                                                                                              
IMPORTS          13,653         15,699                                                                                              
           ------------ --------------                                                                                              
sum              39,354         38,655 

理想的输出应该是这样的:

Region                        CommodityGroup          InboundCity  Low   High   
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   ATLANTA      5100  5500
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   BALTIMORE    6300  6600
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   BOSTON       7000  7300
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   CHICAGO      4500  4900
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   DALLAS       3400  3800
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   MIAMI        6400  6700
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   NEW YORK     6600  6900
CENTRAL AND WESTERN ARIZONA   LETTUCE, BROCCOLI,ETC   PHILADELPHIA 6400  6700

由于我对创建正则表达式的理解有限,这是我最接近成功隔离所需文本的方法:regex tester for USDA data

我一直在尝试从How to parse complex text files using Python?1 复制适用的解决方案,但我的正则表达式经验严重缺乏。您能提供的任何帮助将不胜感激!

【问题讨论】:

你应该在这里寻找解析器解决方案(我是链接问题的回答者)。 【参考方案1】:

我想出了this regex(txt 是您的问题文本):

import re
import numpy as np
import pandas as pd

data = 'Region':[], 'CommodityGroup':[], 'InboundCity':[], 'Low':[], 'High':[]
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):

    for val in values.strip().splitlines():
        val = re.sub(r'(\d)\s8,.*', r'\1', val)
        inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
        data['Region'].append(region)
        data['CommodityGroup'].append(commodity_group)
        data['InboundCity'].append(inbound_city)
        data['Low'].append(np.nan if low == '' else int(low))
        data['High'].append(int(high))

df = pd.DataFrame(data)
print(df)

打印:

                        Region                                     CommodityGroup   InboundCity   Low  High
0  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...       ATLANTA  5100  5500
1  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...     BALTIMORE  6300  6600
2  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...        BOSTON  7000  7300
3  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...       CHICAGO  4500  4900
4  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...        DALLAS  3400  3800
5  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...         MIAMI  6400  6700
6  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...      NEW YORK  6600  6900
7  CENTRAL AND WESTERN ARIZONA  LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE...  PHILADELPHIA  6400  6700

编辑:现在应该适用于来自 regex101 的大文档

【讨论】:

感谢您的快速解决方案。这肯定涵盖了示例文本。我似乎无法在较大的文件中返回任何匹配项。我将在较大文件的上下文中重新发布问题,并将此问题中的两个文本引用更改为示例文本。再次感谢!

以上是关于使用正则表达式解析多个文本字段并编译成 Pandas DataFrame的主要内容,如果未能解决你的问题,请参考以下文章

数据科学入门必读:如何使用正则表达式?

详解正则表达式(re) 一

使用正则表达式c#替换文档中的文本字段

如何最好地使用正则表达式将层次文本文件转换为 XML?

正则表达式

多个正则表达式字符串模式(不同的字段)