Python | Regex: split into rows, not columns


I have a dataframe with 5 nested rows (all containing data like the following):

1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94

What I want to do is split this into new rows, not columns.

I have tried things like this:

df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True)
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).melt()
df["Box_Office"].str.split(r'([\d][A-Z][a-z]*)', expand=True).stack().to_frame()

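The regex splits at each new rank (e.g. 2The, 3Get, 4The). I just want the split to create new rows rather than columns. The regex needs some work, but I'm happy to figure that part out myself.

I can melt the dataframe to create rows, but the cleanup then becomes very time-consuming (I'm happy to go down that road if there is no other way).

Stacking gets closer, but it splits into different rows (which is naturally down to my regex). This feels the closest, but I haven't found a regex that captures this [yet].

The ideal result is shown below, but all I really need is Title and Gross: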

Rank      Title         Studio      Gross         Theatres       Date
1         IT            WB          $327,481,748  4,138          9/8/17
2         The Exorcist  WB          $232,906,145  NA             12/26/73

The following gets closer:

df["Box_Office"].str.split(r'(\$[0-9,/]*)', expand=True).stack().to_frame()


Can extract or split expand across rows instead of across columns?

Answer

Here is what I would do:

(?P<title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)
(?P<studio>WB|Par|Art|Uni)
[^$]*
(?P<gross>\$\d+(?:,\d{3})*)
(?P<theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?
(?P<date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})


Which in Python would be:
import re

import pandas as pd

junk = """
1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The
ExorcistWB$232,906,145-n/a-12/26/733Get
OutUni.$176,040,6653,143$33,377,0602,7812/24/174The Blair Witch
ProjectArt.$140,539,0992,538$1,512,054277/16/995The ConjuringWB
(NL)$137,400,1413,115$41,855,3262,9037/19/136Paranormal
ActivityPar.$107,918,8102,712$77,873129/25/097Interview with the
VampireWB$105,264,6082,604$36,389,7052,60411/11/94"""

# One match per film; re.VERBOSE lets the pattern span several lines.
rx = re.compile(r'''
(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)  # title, up to the studio name
(?P<Studio>WB|Par|Art|Uni)
[^$]*                                               # skip e.g. " (NL)" or "."
(?P<Gross>\$\d+(?:,\d{3})*)
(?P<Theatres>(?:\d+(?:,\d{3})*)|-n/a-)
[$,\d]*?                                            # skip opening gross / theatres
(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})''', re.VERBOSE)

def replacer(d):
    # titles can wrap across lines in the raw text; join them with a space
    d['Title'] = d['Title'].replace('\n', ' ')
    return d

records = [replacer(m.groupdict()) for m in rx.finditer(junk)]
df = pd.DataFrame(records)

# reorder the columns if necessary
df = df[['Title', 'Studio', 'Gross', 'Theatres', 'Date']]
print(df)


This yields
                        Title Studio         Gross Theatres      Date
0                          It     WB  $327,481,748    4,148    9/8/17
1                The Exorcist     WB  $232,906,145    -n/a-  12/26/73
2                     Get Out    Uni  $176,040,665    3,143  12/24/17
3     The Blair Witch Project    Art  $140,539,099    2,538   7/16/99
4               The Conjuring     WB  $137,400,141    3,115   7/19/13
5         Paranormal Activity    Par  $107,918,810    2,712   9/25/09
6  Interview with the Vampire     WB  $105,264,608    2,604  11/11/94
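Since the question only really needs Title and Gross as usable values, the string columns can be turned into numbers afterwards. The following is a minimal sketch, not part of the original answer: the small frame stands in for the result above, and the column names `Gross_USD` and `Theatres_N` are made up for illustration.

```python
import pandas as pd

# Stand-in for the first two rows of the frame produced above.
df = pd.DataFrame({
    "Title": ["It", "The Exorcist"],
    "Gross": ["$327,481,748", "$232,906,145"],
    "Theatres": ["4,148", "-n/a-"],
})

# Strip "$" and "," and convert the gross to integers.
# (Gross_USD is a hypothetical column name.)
df["Gross_USD"] = df["Gross"].str.replace(r"[$,]", "", regex=True).astype("int64")

# "-n/a-" cannot be parsed as a number, so errors="coerce" turns it into NaN.
# (Theatres_N is a hypothetical column name.)
df["Theatres_N"] = pd.to_numeric(
    df["Theatres"].str.replace(",", "", regex=False), errors="coerce"
)

print(df[["Title", "Gross_USD", "Theatres_N"]])
```

Keeping the original string columns next to the numeric ones makes it easy to spot rows where the regex captured something unexpected.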

See a demo for the expression on regex101.com.


As for your original question: you could extract the columns and then transpose the dataframe (i.e. swap rows and columns). But where does this data come from in the first place? Was it scraped from somewhere? You might want to rethink that step!
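If the text has to stay inside the dataframe, pandas can also expand matches across rows directly: `Series.str.extractall` returns one row per match, with named groups becoming columns. This is a rough sketch under assumptions, not the approach used above: it assumes the raw text sits in a single cell of a column named `Box_Office` (as in the question) and uses only the first three records as sample data.

```python
import pandas as pd

# Hypothetical one-row frame mirroring the question's "Box_Office" column.
raw = (
    "1ItWB (NL)$327,481,7484,148$123,403,4194,1039/8/172The\n"
    "ExorcistWB$232,906,145-n/a-12/26/733Get\n"
    "OutUni.$176,040,6653,143$33,377,0602,7812/24/17"
)
df = pd.DataFrame({"Box_Office": [raw]})

# Same expression as above, written without re.VERBOSE.
pattern = (
    r"(?P<Title>[A-Z](?:(?!WB|Par|Art|Uni)[-\sA-Za-z])+)"
    r"(?P<Studio>WB|Par|Art|Uni)"
    r"[^$]*"
    r"(?P<Gross>\$\d+(?:,\d{3})*)"
    r"(?P<Theatres>\d+(?:,\d{3})*|-n/a-)"
    r"[$,\d]*?"
    r"(?P<Date>(?:1[0-2]|[1-9])/\d{1,2}/\d{2})"
)

# extractall yields one row per match, indexed by (original row, match number);
# dropping that index leaves a plain table of films.
rows = df["Box_Office"].str.extractall(pattern).reset_index(drop=True)
rows["Title"] = rows["Title"].str.replace("\n", " ", regex=False)

print(rows[["Title", "Gross"]])
```

Unlike `str.split(..., expand=True)`, which spreads the pieces across columns, this keeps the rows-per-match shape without needing `melt` or `stack`.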
