Python | Regex split into rows, not columns
I have a dataframe with 5 nested rows (all containing data like the following):
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94
What I want to do is split this into new rows, not columns.
I have tried things like this:
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True)
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).melt()
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).stack().to_frame()
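The regex splits at each new rank (e.g. 2The, 3Get, 4The). I just want the split to create new rows rather than columns. The regex needs some work, but I'm happy to sort that out myself.
I could melt the dataframe to create rows, but then the cleanup becomes very time-consuming (I'm happy to go down that road if there is no other way).
Stacking gets closer, but it splits into different rows (which is naturally down to my regex). This feels closest, but I can't find a regex to capture this [yet].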
The ideal result would look like this, although all I really need is Title and Gross:
Rank  Title         Studio  Gross         Theatres  Date
1     IT            WB      $327,481,748  4,138     9/8/17
2     The Exorcist  WB      $232,906,145  NA        12/26/73
The following gets closer:
df["Box_Office"].str.split(r'(\$[0-9,/]*)', expand=True).stack().to_frame()
Can extract or split expand across rows instead of across columns?
Answer
Here is what I would do:
(?P<title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<studio>WB|Par|Art|Uni)
[^$]*
(?P<gross>\$\d+(?:,\d{3})*)
(?P<theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})
Which in Python would be:
import re
import pandas as pd
junk = """
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94"""
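rx = re.compile(r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)  # title, stopping before a studio name
(?P<Studio>WB|Par|Art|Uni)
[^$]*                                               # skip "(NL)", "." and similar
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?                                            # skip the opening gross and theatres
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})''', re.VERBOSE)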
def replacer(d):
    # Titles may wrap across lines in the raw text; join them with a space.
    d['Title'] = d['Title'].replace('\n', ' ')
    return d
records = (replacer(m.groupdict()) for m in rx.finditer(junk))
df = pd.DataFrame(records)
# reorder the columns if necessary
df = df[['Title', 'Studio', 'Gross', 'Theatres', 'Date']]
print(df)
This yields:
                        Title  Studio         Gross  Theatres      Date
0                          It      WB  $327,481,748     4,148    9/8/17
1                The Exorcist      WB  $232,906,145     -n/a-  12/26/73
2                     Get Out     Uni  $176,040,665     3,143  12/24/17
3     The Blair Witch Project     Art  $140,539,099     2,538   7/16/99
4               The Conjuring      WB  $137,400,141     3,115   7/19/13
5         Paranormal Activity     Par  $107,918,810     2,712   9/25/09
6  Interview with the Vampire      WB  $105,264,608     2,604  11/11/94
See a demo for the expression on regex101.com.
As for your original question: you could extract into columns and then transpose the dataframe (i.e. turn it around). However, where does this data come from in the first place? Was it scraped from somewhere? You might want to rethink that step!
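If the text has to stay inside the dataframe, one possible alternative to extracting and transposing is pandas' `Series.str.extractall`, which returns one row per regex match rather than one column per match. The following is only a minimal sketch: the small `raw` frame and the way the text is divided between its cells are assumptions made for illustration, since the real layout of `Box_Office` is not shown in the question.

```python
import re
import pandas as pd

# Same pattern as in the answer, kept as a string so it can be passed
# to extractall together with flags=re.VERBOSE.
PATTERN = r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<Studio>WB|Par|Art|Uni)
[^$]*
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})
'''

# Hypothetical input: how the text is split across cells is an assumption.
raw = pd.DataFrame({"Box_Office": [
    "1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The\n"
    "ExorcistWB$232,906,145-n/a-12/26/73",
    "3Get\nOutUni.$176,040,6653,143$33,377,0602,7812/24/17",
]})

# extractall yields one row per match, indexed by (original row, match number).
rows = raw["Box_Office"].str.extractall(PATTERN, flags=re.VERBOSE)

# Titles that wrapped across lines still contain a newline; replace it.
rows["Title"] = rows["Title"].str.replace("\n", " ", regex=False)

# Drop the two-level index to get a flat frame with one film per row.
rows = rows.reset_index(drop=True)
print(rows[["Title", "Gross"]])
```

With this sample input the first cell contributes two rows and the second cell one, so no melting or stacking is needed afterwards.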