在熊猫中将文本读入表格格式
Posted
技术标签:
【中文标题】在熊猫中将文本读入表格格式【英文标题】:Reading text into table format in pandas 【发布时间】:2021-07-27 22:49:55 【问题描述】:我有一个文本形式的表格,我想读入 pandas
我可以使用\n
分隔行,但我如何分隔它们在格式中的列(2 x 文本字段,然后 6 x 数字)。
有没有使用正则表达式或类似方法的方法?
table_text = '''Name AIC sector Price (last close) Price (bid) Price (offer) NAV Total assets (£m) Market cap (£m)
3i Infrastructure Plc Infrastructure GBX 296.00 2.96 2.96 254.50 2,268.700 2,638.645
Aberdeen Asian Income Fund Limited Asia Pacific Income GBX 227.50 2.26 2.29 252.51 479.110 399.796
Aberdeen Diversified Income & Growth Ord Flexible Investment GBX 95.20 0.95 0.96 115.34 379.030 294.985
Aberdeen Emerging Markets Investment Company Limited Global Emerging Markets GBX 704.00 6.98 7.10 829.47 391.268 323.595
Aberdeen Japan Investment Trust Plc Japan GBX 712.50 7.00 7.25 784.79 114.957 94.198
Aberdeen Latin American Income Latin America GBX 57.00 0.54 0.57 62.13 40.985 32.555
Aberdeen New Dawn Asia Pacific GBX 322.00 3.22 3.26 365.56 431.544 350.752
Aberdeen New India Investment Trust Plc India GBX 516.00 5.16 5.18 601.47 375.170 301.268
Aberdeen New Thai Investment Trust Plc Country Specialist GBX 445.00 4.40 4.50 516.30 92.585 71.180
Aberdeen Smaller Companies Income Trust UK Smaller Companies GBX 358.00 3.56 3.60 397.45 95.028 79.153
Aberdeen Standard Asia Focus 2025 CULS Asia Pacific Smaller Companies GBX 100.95 1.01 1.01 97.25 391.484 37.026
Aberdeen Standard Asia Focus PLC Asia Pacific Smaller Companies GBX 1,280.00 12.75 13.00 1,440.65 483.841 402.730
Aberdeen Standard Equity Inc Trust plc UK Equity Income GBX 353.00 3.50 3.56 379.60 203.368 170.598
Aberdeen Standard European Logistics Income PLC Property - Europe GBX 116.00 1.15 1.16 117.82 309.808 305.022
Aberforth Smaller Companies Trust Plc UK Smaller Companies GBX 1,496.00 14.94 15.00 1,613.41 1,513.467 1,327.297
Aberforth Split Level Income Trust Plc UK Smaller Companies GBX 80.10 0.80 0.81 91.46 228.143 152.390
Aberforth Split Level Income ZDP 2024 UK Smaller Companies GBX 111.50 1.10 1.13 113.83 227.713 53.032
Acorn Income Fund Ltd UK Equity & Bond Income GBX 351.00 3.46 3.56 415.97 100.206 55.517
Acorn Income Fund ZDP 2022 UK Equity & Bond Income GBX 161.00 1.61 1.61 162.09 34.413 34.182
AEW UK REIT Ord Property - UK Commercial GBX 92.40 0.92 0.92 97.85 194.107 146.384'''
df = pd.DataFrame([x.split(';') for x in table_text.split('\n')])
print(df)
输出:
0
0 Name AIC sector Price (last close) Price (bid)...
1 3i Infrastructure Plc Infrastructure GBX 296...
2 Aberdeen Asian Income Fund Limited Asia Paci...
3 Aberdeen Diversified Income & Growth Ord Fle...
4 Aberdeen Emerging Markets Investment Company...
5 Aberdeen Japan Investment Trust Plc Japan GB...
6 Aberdeen Latin American Income Latin America...
7 Aberdeen New Dawn Asia Pacific GBX 322.00 3....
8 Aberdeen New India Investment Trust Plc Indi...
9 Aberdeen New Thai Investment Trust Plc Count...
10 Aberdeen Smaller Companies Income Trust UK S...
11 Aberdeen Standard Asia Focus 2025 CULS Asia ...
12 Aberdeen Standard Asia Focus PLC Asia Pacifi...
13 Aberdeen Standard Equity Inc Trust plc UK Eq...
14 Aberdeen Standard European Logistics Income ...
15 Aberforth Smaller Companies Trust Plc UK Sma...
16 Aberforth Split Level Income Trust Plc UK Sm...
17 Aberforth Split Level Income ZDP 2024 UK Sma...
18 Acorn Income Fund Ltd UK Equity & Bond Incom...
19 Acorn Income Fund ZDP 2022 UK Equity & Bond ...
20 AEW UK REIT Ord Property - UK Commercial GBX...
编辑:
这是我的 hacky 方式。依赖于有一个填充了“GBX”的货币列。
欢迎任何关于更好的方法的想法吗?
是否有一种正则表达式方法可以找到三个大写字母前面有一个空格,然后是一个空格,然后是数字?这将找到没有硬编码“GBX”的货币。
def convert_rows(df):
sector_name = "GBX"
for index, row in df.iterrows():
if sector_name in row[0]:
name = row[0].split(sector_name)[0]
numbers = row[0].split(sector_name)[1]
df.at[index, ['Name']] = name
df.at[index, ['AIC sector']] = sector_name
df.at[index,['Price (last close)', 'Price (bid)', 'Price (offer)', 'NAV', 'Total assets (£m)', 'Market cap (£m)']] = numbers.split()
return df
df = convert_rows(df)
【问题讨论】:
【参考方案1】:你可以试试这个:
import re
def convert_rows(df):
for index, row in df.iterrows():
# Search for the pattern
sector_name = re.match(r".+\s([A-Z]3)\s\d+.+", row[0])
if sector_name:
sector_name = sector_name.group(1) # GBX for instance
name = row[0].split(sector_name)[0]
numbers = row[0].split(sector_name)[1]
df.at[index, ['Name']] = name
df.at[index, ['AIC sector']] = sector_name
df.at[index,['Price (last close)', 'Price (bid)', 'Price (offer)', 'NAV', 'Total assets (£m)', 'Market cap (£m)']] = numbers.split()
return df
【讨论】:
以上是关于在熊猫中将文本读入表格格式的主要内容,如果未能解决你的问题,请参考以下文章