How to split a pandas string to extract middle names?


【Title】: How to split a pandas string to extract middle names? 【Posted】: 2019-05-27 12:45:33 【Question】:

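I want to split people's names into several strings. I can easily extract the first and last names, but I am having trouble extracting the middle name or names, because they differ so much from case to case.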

The data looks like this:

ID| Complete_Name               | Type
1 | JERRY, Ben                  | "I"
2 | VON HELSINKI, Olga          | "I"
3 | JENSEN, James Goodboy Dean  | "I"
4 | THE COMPANY                 | "C"
5 | CRUZ, Juan S. de la         | "I"

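So some names have only a first and a last name, while others have one or more middle names in between. How can I extract the middle names from a pandas DataFrame? I can already extract the first and last names.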

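df = pd.read_csv("list.pip", sep="|")
# First name: first word after the comma (individuals only)
df["First Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(1).str.split().str.get(0), "")
# Last name: everything before the comma (individuals only)
df["Last Name"] = np.where(df["Type"] == "I", df["Complete_Name"].str.split(",").str.get(0), "")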

The desired result should look like this:

ID| Complete_Name               | Type | First Name | Middle Name | Last Name
1 | JERRY, Ben                  | "I"  | Ben        |             | JERRY
2 | VON HELSINKI, Olga          | "I"  | Olga       |             | VON HELSINKI
3 | JENSEN, James Goodboy Dean  | "I"  | James      | Goodboy Dean| JENSEN
4 | THE COMPANY                 | "C"  |            |             |
5 | CRUZ, Juan S. de la         | "I"  | Juan       | S. de la    | CRUZ

【Comments】:

Possible duplicate of "splitting a column by delimiter pandas python"

^ No, not that one.

【Answer 1】:

A single str.extract call will work here:

p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)' 
u = df.loc[df.Type == "I", 'Complete_Name'].str.extract(p)
pd.concat([df, u], axis=1).fillna('')

   ID               Complete_Name Type     Last_Name First_Name   Middle_Name
0   1                  JERRY, Ben    I         JERRY        Ben              
1   2          VON HELSINKI, Olga    I  VON HELSINKI       Olga              
2   3  JENSEN, James Goodboy Dean    I        JENSEN      James  Goodboy Dean
3   4                 THE COMPANY    C                                       
4   5         CRUZ, Juan S. de la    I          CRUZ       Juan      S. de la
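
To try this without the original file, here is a minimal self-contained sketch. The in-memory DataFrame is an assumption standing in for the contents of list.pip (with the quotes around Type already removed). It also shows why the fillna('') step is there: u only contains the rows where Type is "I", and pd.concat aligns on the index, so the company row comes back as NaN in the three name columns.

```python
import pandas as pd

# Assumed stand-in for the data that the question reads from list.pip
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Complete_Name": [
        "JERRY, Ben",
        "VON HELSINKI, Olga",
        "JENSEN, James Goodboy Dean",
        "THE COMPANY",
        "CRUZ, Juan S. de la",
    ],
    "Type": ["I", "I", "I", "C", "I"],
})

p = r'^(?P<Last_Name>.*), (?P<First_Name>\S+)\b\s*(?P<Middle_Name>.*)'

# Extract only from rows describing individuals; index 3 is absent from u
u = df.loc[df["Type"] == "I", "Complete_Name"].str.extract(p)

# concat aligns on the index, so index 3 gets NaN in the name columns
combined = pd.concat([df, u], axis=1)

# Replace the missing values with empty strings for display
result = combined.fillna("")
print(result)
```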

Regex breakdown

^                # Start-of-line
(?P<Last_Name>   # First named capture group - Last Name
    .*           # Match anything until...
)
,                # ...we see a comma
\s               # whitespace 
(?P<First_Name>  # Second capture group - First Name
    \S+          # Match all non-whitespace characters
)
\b               # Word boundary 
\s*              # Optional whitespace chars (mostly housekeeping) 
(?P<Middle_Name> # Third capture group - Zero or more middle names
    .*           # Match everything till the end of string
)
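
If you want to adapt the pattern, it can help to test it on single strings with Python's re module before running it over a whole column. The sketch below compiles the same pattern with re.VERBOSE, which ignores whitespace inside the pattern and allows inline comments. The names NAME_RE and split_name are hypothetical helpers for illustration and are not part of the answer.

```python
import re

# Hypothetical name; same pattern as above, written in verbose form
NAME_RE = re.compile(
    r"""
    ^
    (?P<Last_Name>.*)      # everything before the comma
    ,\s                    # a comma followed by one whitespace character
    (?P<First_Name>\S+)    # first run of non-whitespace characters
    \b\s*                  # word boundary, then optional whitespace
    (?P<Middle_Name>.*)    # whatever is left (may be empty)
    """,
    re.VERBOSE,
)


def split_name(full_name):
    """Hypothetical helper: return a dict of name parts, or None if there is no match."""
    match = NAME_RE.match(full_name)
    return match.groupdict() if match else None


juan = split_name("CRUZ, Juan S. de la")
ben = split_name("JERRY, Ben")
company = split_name("THE COMPANY")  # no comma, so the pattern does not match

print(juan)
print(ben)
print(company)
```

A string without a comma, such as the company name, does not match at all, which is why those rows end up with missing values in the pandas version.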

【Comments】:

Very concise, but could you explain the regex in more detail for future visitors, so they can adapt it to their own needs?

@mrPy Edited, hope that helps.

【Answer 2】:

I think you can do this:

# take the Complete_Name column (individuals only) and split it on the comma
df2 = (df.loc[df['Type'].eq('I'), 'Complete_Name'].str
       .split(',', expand=True)
       .fillna(''))

# remove extra spaces
for col in df2.columns:
    df2[col] = df2[col].str.strip()

# split the remainder on the first space and join it with the last name
df2 = pd.concat([df2[0], df2[1].str.split(' ', n=1, expand=True)], axis=1)
df2.columns = ['last', 'first', 'middle']

# join the data frames (keep Type so it appears in the result)
df = pd.concat([df[['ID', 'Complete_Name', 'Type']], df2], axis=1)

# rearrange columns - not necessary though
df = df[['ID', 'Complete_Name', 'Type', 'first', 'middle', 'last']]

# replace missing values with empty strings
df = df.fillna('')

   ID                  Complete_Name Type  first        middle          last
0   1   JERRY, Ben                      I    Ben                       JERRY
1   2   VON HELSINKI, Olga              I   Olga                VON HELSINKI
2   3   JENSEN, James Goodboy Dean      I  James  Goodboy Dean        JENSEN
3   4   THE COMPANY                     C                                   
4   5   CRUZ, Juan S. de la             I   Juan      S. de la          CRUZ
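
The approach above relies on two options of Series.str.split: n limits the number of splits, and expand=True returns a DataFrame with one column per piece. A small sketch on two of the sample names shows what each step produces (the variable names here are illustrative only):

```python
import pandas as pd

names = pd.Series(["JERRY, Ben", "JENSEN, James Goodboy Dean"])

# Split on the first comma only; expand=True gives a DataFrame with columns 0 and 1
last_rest = names.str.split(",", n=1, expand=True)

# Strip the leading space, then split on the first space only, so that
# everything after the first name stays together as the middle name(s)
first_middle = last_rest[1].str.strip().str.split(" ", n=1, expand=True)

print(last_rest)
print(first_middle)
```

Rows that have no second piece (here "Ben", which has no middle name) get a missing value in the extra column, which is why the answer finishes by replacing missing values with empty strings.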

【Comments】:

Seems to work, YOLO, except that you did not take np.where(df.Type=="I") into account, so it also splits the "C" type.

@mrPy Sorry, I missed that part; fixed now.

【Answer 3】:

Here is another answer that uses some simple lambda functions.

import numpy as np
import pandas as pd


""" Create data and data frame """

info_dict = {
    'ID': [1,2,3,4,5,],
    'Complete_Name':[
        'JERRY, Ben',
        'VON HELSINKI, Olga',
        'JENSEN, James Goodboy Dean',
        'THE COMPANY',
        'CRUZ, Juan S. de la',
        ],
    'Type':['I','I','I','C','I',],
    }

data = pd.DataFrame(info_dict, columns = info_dict.keys())


""" List of columns to add """
name_cols = [
    'First Name',
    'Middle Name',
    'Last Name',
    ]

"""
Use partition() to separate first and middle names into Pandas series.
Note: data[data['Type'] == 'I']['Complete_Name'] will allow us to target only the
values that we want.
"""
NO_LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[2].strip())
LAST_NAMES = data[data['Type'] == 'I']['Complete_Name'].apply(lambda x: str(x).partition(',')[0].strip())

# We can use index positions to quickly add columns to the dataframe.
# The partition() function will keep the delimited value in the 1 index, so we'll use
# the 0 and 2 index positions for first and middle names.
data[name_cols[0]] = NO_LAST_NAMES.str.partition(' ')[0]
data[name_cols[1]] = NO_LAST_NAMES.str.partition(' ')[2]

# Finally, we'll add our Last Names column
data[name_cols[2]] = LAST_NAMES

# Optional: We can replace all blank values with numpy.nan values using regular expressions.
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0.)
data = data.replace(r'^$', np.nan, regex=True)

Then you should get a result like this:

   ID               Complete_Name Type First Name   Middle Name     Last Name
0   1                  JERRY, Ben    I        Ben           NaN         JERRY
1   2          VON HELSINKI, Olga    I       Olga           NaN  VON HELSINKI
2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN
3   4                 THE COMPANY    C        NaN           NaN           NaN
4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ

Or, to replace the NaN values with empty strings instead:

data = data.replace(np.nan, r'', regex=False)

Then you have:

   ID               Complete_Name Type First Name   Middle Name     Last Name
0   1                  JERRY, Ben    I        Ben                       JERRY
1   2          VON HELSINKI, Olga    I       Olga                VON HELSINKI
2   3  JENSEN, James Goodboy Dean    I      James  Goodboy Dean        JENSEN
3   4                 THE COMPANY    C                                       
4   5         CRUZ, Juan S. de la    I       Juan      S. de la          CRUZ
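
One detail worth spelling out: the code above uses two different partition functions. The built-in str.partition (inside the lambdas) returns a 3-tuple, while Series.str.partition returns a DataFrame with columns 0, 1 and 2. A short sketch of the difference, using illustrative variable names:

```python
import pandas as pd

# Built-in str.partition returns a 3-tuple: (before, separator, after)
before, sep, after = "JENSEN, James Goodboy Dean".partition(",")

# Series.str.partition returns a DataFrame with columns 0, 1 and 2 instead
rest = pd.Series(["Ben", "James Goodboy Dean"])
parts = rest.str.partition(" ")

print(before, repr(sep), repr(after))
print(parts)
```

When the separator is not found, both variants leave the last two pieces as empty strings rather than missing values, which is why the answer then converts blank strings to NaN with a regex replace.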
