How to split a pandas string to extract middle names?
【Posted】: 2019-05-27 12:45:33 【Question】: I want to split people's names into several strings. I can easily extract the first and last names, but I'm having trouble extracting the middle name or names, because they differ completely from case to case.
The data looks like this:
ID| Complete_Name | Type
1 | JERRY, Ben | "I"
2 | VON HELSINKI, Olga | "I"
3 | JENSEN, James Goodboy Dean | "I"
4 | THE COMPANY | "C"
5 | CRUZ, Juan S. de la | "I"
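So some names have only a first and a last name, while others have one or two middle names in between. How can I extract the middle names from a pandas DataFrame? I can already extract the first and last names.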
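import numpy as np
import pandas as pd
df = pd.read_csv("list.pip", sep="|")
# Only rows of type "I" (individuals) are split; all other rows get an empty string.
# Text after the comma: the first name plus any middle names.
df["First Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(1), "")
# Text before the comma: the last name.
df["Last Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(0), "")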
The desired result should look like this:
ID| Complete_Name | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben | "I" | Ben | | JERRY
2 | VON HELSINKI, Olga | "I" | Olga | | VON HELSINKI
3 | JENSEN, James Goodboy Dean | "I" | James | Goodboy Dean| JENSEN
4 | THE COMPANY | "C" | | |
5 | CRUZ, Juan S. de la | "I" | Juan | S. de la | CRUZ
【Comments】:
Possible duplicate of "splitting a column by delimiter pandas python". ^ No, it's not that one.
【Answer 1】: A single str.extract call will work here:
p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')
ID Complete_Name Type Last_Name First_Name Middle_Name
0 1 JERRY, Ben I JERRY Ben
1 2 VON HELSINKI, Olga I VON HELSINKI Olga
2 3 JENSEN, James Goodboy Dean I JENSEN James Goodboy Dean
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I CRUZ Juan S. de la
Regex breakdown
^ # Start-of-line
(?P<Last_Name> # First named capture group - Last Name
.* # Match anything until...
)
, # ...we see a comma
\s # whitespace
(?P<First_Name> # Second capture group - First Name
\S+ # Match all non-whitespace characters
)
\b # Word boundary
\s* # Optional whitespace chars (mostly housekeeping)
(?P<Middle_Name> # Third capture group - Zero or more middle names
.* # Match everything till the end of string
)
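To see how the three groups behave on individual names, the same pattern can be tried with Python's built-in re module, outside pandas. This is only an illustrative sketch; the helper name split_name and the constant NAME_PATTERN are hypothetical and not part of the answer.

```python
import re

# Same pattern as in the answer above, compiled once.
NAME_PATTERN = re.compile(
    r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'
)

def split_name(complete_name):
    """Return the named groups as a dict, or None when the pattern does not match."""
    match = NAME_PATTERN.match(complete_name)
    return match.groupdict() if match else None

for name in ['JERRY, Ben', 'CRUZ, Juan S. de la', 'THE COMPANY', 'DOE, J. Smith']:
    print(name, '->', split_name(name))
```

A name without a comma, such as 'THE COMPANY', does not match at all. Because of the \b after the first-name group, a first name that ends in punctuation (the made-up 'DOE, J. Smith') is cut before the dot, so the dot lands in the middle-name group. Dropping \b is one way to adapt the pattern if that matters for your data.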
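【Comments】:
Very concise, but could you explain the regex in more detail for future visitors, so they can adapt it to their own needs? @mrPy Edited, hope that helps.
【Answer 2】: I think you can do: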
# take the Complete_Name column for type "I" rows and split it on the comma
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
       .split(',', expand=True)
       .fillna(''))
# remove extra spaces
for col in df2.columns:
    df2[col] = df2[col].str.strip()
# split the remaining name on the first space only and join it to the last name
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']
# join the data frames
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)
# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]
# replace missing values with empty strings
df = df.fillna('')
ID Complete_Name Type first middle last
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
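【Comments】:
This seems to work, YOLO, except that you didn't account for np.where(df.Type=="I"), so it also splits the "C" type. @mrPy Sorry, I missed that part; it's fixed now.
【Answer 3】: Here is another answer that uses some simple lambda functions.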
import numpy as np
import pandas as pd
""" Create data and data frame """
info_dict = {
    'ID': [1, 2, 3, 4, 5],
    'Complete_Name': [
        'JERRY, Ben',
        'VON HELSINKI, Olga',
        'JENSEN, James Goodboy Dean',
        'THE COMPANY',
        'CRUZ, Juan S. de la',
    ],
    'Type': ['I', 'I', 'I', 'C', 'I'],
}
data = pd.DataFrame(info_dict, columns=list(info_dict.keys()))
""" List of columns to add """
name_cols = [
'First Name',
'Middle Name',
'Last Name',
]
"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())
# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]
# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES
# Optional: We can replace all blank values with numpy.NaN values using regular expressions.
data = data.replace(r'^$', np.nan, regex=True)
You should then get a result like this:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben NaN JERRY
1 2 VON HELSINKI, Olga I Olga NaN VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C NaN NaN NaN
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
Alternatively, replace the NaN values with empty strings:
data = data.replace(np.nan, '', regex=False)
Then you have:
ID Complete_Name Type First Name Middle Name Last Name
0 1 JERRY, Ben I Ben JERRY
1 2 VON HELSINKI, Olga I Olga VON HELSINKI
2 3 JENSEN, James Goodboy Dean I James Goodboy Dean JENSEN
3 4 THE COMPANY C
4 5 CRUZ, Juan S. de la I Juan S. de la CRUZ
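For readers unfamiliar with partition(), here is a minimal plain-Python sketch of the three-element tuple it returns and of how the two partition steps in this answer combine. The function name split_complete_name is hypothetical, introduced only for illustration.

```python
def split_complete_name(complete_name):
    """Split 'LAST, First Middle...' into (first, middle, last) using partition()."""
    # partition() returns (text before the separator, the separator, text after it).
    last, _comma, rest = complete_name.partition(',')
    # Split the remainder on the first space only: first name, then any middle names.
    first, _space, middle = rest.strip().partition(' ')
    return first, middle.strip(), last.strip()

print('JERRY, Ben'.partition(','))
print(split_complete_name('JENSEN, James Goodboy Dean'))
print(split_complete_name('THE COMPANY'))
```

When the separator is missing, partition() returns the whole string followed by two empty strings, so 'THE COMPANY' would end up entirely in the last-name slot. That is why the answer filters on Type == 'I' before splitting.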