使用正则表达式解析多个文本字段并编译成 Pandas DataFrame
Posted
技术标签:
【中文标题】使用正则表达式解析多个文本字段并编译成 Pandas DataFrame【英文标题】:Parsing Multiple Text Fields Using Regex and Compiling into Pandas DataFrame 【发布时间】:2020-04-21 11:19:25 【问题描述】:我正在尝试使用 python 和正则表达式解析文本文件以构建特定的 pandas 数据框。下面是我正在解析的文本文件和我正在寻找的理想 pandas DataFrame 的示例。
Sample Text
Washington, DC November 27, 2019
USDA Truck Rate Report
WA_FV190
FIRST PRICE RANGE FOR WEEK OF NOVEMBER 20-26 2019
SECOND PRICE MOSTLY FOR TUESDAY NOVEMBER 26 2019
PERCENTAGE OF CHANGE FROM TUESDAY NOVEMBER 19 2019 SHOWN IN ().
In areas where rates are based on package rates, per-load rates were
derived by multiplying the package rate by the number of packages in
the most usual load in a 48-53 foot trailer.
CENTRAL AND WESTERN ARIZONA
-- LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LEAF LETTUCE SLIGHT SHORTAGE
--
ATLANTA 5100 5500
BALTIMORE 6300 6600
BOSTON 7000 7300
CHICAGO 4500 4900
DALLAS 3400 3800
MIAMI 6400 6700
NEW YORK 6600 6900
PHILADELPHIA 6400 6700
2019 2018
NOV 17-23 NOV 18-24
U.S. 25,701 22,956
IMPORTS 13,653 15,699
------------ --------------
sum 39,354 38,655
理想的输出应该是这样的:
Region CommodityGroup InboundCity Low High
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC ATLANTA 5100 5500
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BALTIMORE 6300 6600
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC BOSTON 7000 7300
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC CHICAGO 4500 4900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC DALLAS 3400 3800
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC MIAMI 6400 6700
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC NEW YORK 6600 6900
CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI,ETC PHILADELPHIA 6400 6700
由于我对创建正则表达式的理解有限,这是我最接近成功隔离所需文本的方法:regex tester for USDA data
我一直在尝试从How to parse complex text files using Python?1 复制适用的解决方案,但我的正则表达式经验严重缺乏。您能提供的任何帮助将不胜感激!
【问题讨论】:
你应该在这里寻找解析器解决方案(我是链接问题的回答者)。 【参考方案1】:我想出了this regex(txt
是您的问题文本):
import re
import numpy as np
import pandas as pd
data = 'Region':[], 'CommodityGroup':[], 'InboundCity':[], 'Low':[], 'High':[]
for region, commodity_group, values in re.findall(r'([A-Z ]+)\n--(.*?)--\n(.*?)\n\n', txt, flags=re.S|re.M):
for val in values.strip().splitlines():
val = re.sub(r'(\d)\s8,.*', r'\1', val)
inbound_city, low, high = re.findall(r'([A-Z ]+)\s*(\d*)\s+(\d+)', val)[0]
data['Region'].append(region)
data['CommodityGroup'].append(commodity_group)
data['InboundCity'].append(inbound_city)
data['Low'].append(np.nan if low == '' else int(low))
data['High'].append(int(high))
df = pd.DataFrame(data)
print(df)
打印:
Region CommodityGroup InboundCity Low High
0 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... ATLANTA 5100 5500
1 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BALTIMORE 6300 6600
2 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... BOSTON 7000 7300
3 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... CHICAGO 4500 4900
4 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... DALLAS 3400 3800
5 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... MIAMI 6400 6700
6 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... NEW YORK 6600 6900
7 CENTRAL AND WESTERN ARIZONA LETTUCE, BROCCOLI, CAULIFLOWER, ROMAINE AND LE... PHILADELPHIA 6400 6700
编辑:现在应该适用于来自 regex101 的大文档
【讨论】:
感谢您的快速解决方案。这肯定涵盖了示例文本。我似乎无法在较大的文件中返回任何匹配项。我将在较大文件的上下文中重新发布问题,并将此问题中的两个文本引用更改为示例文本。再次感谢!以上是关于使用正则表达式解析多个文本字段并编译成 Pandas DataFrame的主要内容,如果未能解决你的问题,请参考以下文章