Regular expression using a pandas DataFrame

Posted 2023-03-12

[Title]: regular expression using pandas dataframe [Posted]: 2020-10-23 06:41:15 [Question]:

Input CSV file:

_id,field_name,field_friendly_name,purpose_of_use,category,data_source,schema,table,attribute_type,sample_values,mask_it,is_included_in_report
5e95a49b0985567430f8fc00,FullName,,,,,,,,,,
5e95a4dd0985567430f9ef16,xyz,,,,,,,,,,
5e95a4dd0985567430f9ef17,FullNm,,,,,,,,,,
5e95a4dd0985567430f9ef18,FirstName,,,,,,,,,,
5e95a49b0985567430f8fc01,abc,,,,,,,,,,
5e95a4dd0985567430f9ef19,FirstNm,,,,,,,,,,
5e95a4dd0985567430f9ef20,LastName,,,,,,,,,,
5e95a4dd0985567430f9ef21,LastNm,,,,,,,,,,
5e95a49b0985567430f8fc02,LegalName,,,,,,,,,,
5e95a4dd0985567430f9ef22,LegalNm,,,,,,,,,,
5e95a4dd0985567430f9ef23,NickName,,,,,,,,,,
5e95a4dd0985567430f9ef24,pqr,,,,,,,,,,
5e95a49b0985567430f8fc03,NickNm,,,,,,,,,,

Regex CSV table:

Personal_Inforamtion,regex,addiitional_grep
Full Name,full|name|nm|txt|dsc,full
First Name,first|name|nm|txt|dsc,first
Last Name,last|name|nm|txt|dsc,last
Legal Name,legal|name|nm|txt|dsc,legal
Nick Name,nick|name|nm|txt|dsc,nick

My code

Import the Python modules

import pandas as pd
import re

Define the dataframe from the CSV file

df = pd.read_csv("Default-Profile.csv")

Remove underscores (_) and hyphens (-) from the field_name series in df

df.field_name = df.field_name.str.replace("[_-]", "", regex=True)

Convert all characters of the field_name series in df to lowercase

df.field_name = df.field_name.str.lower()

Define the regex table

regex_table = pd.read_csv("regex.csv")

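The code should update field_friendly_name and is_included_in_report

For each regex in the regex table, look for the pattern in df.field_name. If a proper match is found, update the field_friendly_name column with the Personal_Inforamtion value; otherwise set it to not_found. Also set the last column to True if a match is found and to False otherwise.

Example: the word should consist only of full|name|nm|txt|dsc and must contain full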

Personal_Inforamtion,regex,addiitional_grep
Full Name,full|name|nm|txt|dsc,full

Then update df as follows:

_id,field_name,field_friendly_name,purpose_of_use,category,data_source,schema,table,attribute_type,sample_values,mask_it,is_included_in_report
5e95a49b0985567430f8fc00,FullName,Full Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef16,xyz,not_found,,,,,,,,,FALSE
5e95a4dd0985567430f9ef17,FullNm,Full Name,,,,,,,,,TRUE
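The rule described above can be sketched on plain strings with the standard `re` module. This is only one possible reading of the requirement, not code from the question: `classify_field` and `RULES` are hypothetical names, and the rules are copied by hand from the regex table instead of being read from regex.csv.

```python
import re

# Rules copied from the regex table in the question:
# (friendly name, allowed tokens, required token)
RULES = [
    ("Full Name", "full|name|nm|txt|dsc", "full"),
    ("First Name", "first|name|nm|txt|dsc", "first"),
    ("Last Name", "last|name|nm|txt|dsc", "last"),
    ("Legal Name", "legal|name|nm|txt|dsc", "legal"),
    ("Nick Name", "nick|name|nm|txt|dsc", "nick"),
]


def classify_field(field_name, rules=RULES):
    """Return (friendly_name, included) for one field name (hypothetical helper)."""
    # Same normalisation as in the question: drop "_" and "-", then lowercase.
    word = re.sub(r"[_-]", "", str(field_name)).lower()
    for friendly, allowed, required in rules:
        # The whole word must be built only from the allowed tokens ...
        only_allowed = re.fullmatch(rf"(?:{allowed})+", word) is not None
        # ... and the required token must appear somewhere in it.
        if only_allowed and re.search(required, word):
            return friendly, True
    return "not_found", False


for name in ["FullName", "xyz", "FullNm"]:
    print(name, classify_field(name))
```

Under this reading, a field such as `name` on its own is reported as not_found, because it contains none of the required tokens. The helper could be applied to the dataframe column with something like `df["field_name"].map(classify_field)`, which yields a series of tuples to split into the two target columns.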

Expected output

_id,field_name,field_friendly_name,purpose_of_use,category,data_source,schema,table,attribute_type,sample_values,mask_it,is_included_in_report
5e95a49b0985567430f8fc00,FullName,Full Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef16,xyz,not_found,,,,,,,,,FALSE
5e95a4dd0985567430f9ef17,FullNm,Full Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef18,FirstName,First Name,,,,,,,,,TRUE
5e95a49b0985567430f8fc01,abc,not_found,,,,,,,,,FALSE
5e95a4dd0985567430f9ef19,FirstNm,First Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef20,LastName,Last Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef21,LastNm,Last Name,,,,,,,,,TRUE
5e95a49b0985567430f8fc02,LegalName,Legal Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef22,LegalNm,Legal Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef23,NickName,Nick Name,,,,,,,,,TRUE
5e95a4dd0985567430f9ef24,pqr,not_found,,,,,,,,,FALSE
5e95a49b0985567430f8fc03,NickNm,Nick Name,,,,,,,,,TRUE

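[Comments]:

Please don't dismiss this as complicated. I find it very difficult, even after several Python trainings. Python experts, I need your help; a chat with a Python expert would help a lot.

[Answer 1]:

As an alternative, you can build a set of regex groups from the words in the last column of the regex table file: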

(full)|(first)|(last)|(legal)|(nick)

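You can still adjust the last column of the regex table to get output that is more specific to your needs. Then append a not_found case to the regex dataframe to prepare the data for use with str.extract, which extracts the groups from the first match of the pattern. Once the groups are matched, idxmax along the row axis gives the index of the regex group that matched. After that, use this group index to map the first column of the regex table onto the df dataframe.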

import pandas as pd
import re

df = pd.read_csv("data.csv")
print(df)

regxt = pd.read_csv("regex_table.csv")
print(regxt)

# append the not_found catch-all row (DataFrame.append was removed in pandas 2.0)
not_found = pd.DataFrame([["not_found", "", ""]], columns=regxt.columns)
regxt = pd.concat([regxt, not_found], ignore_index=True)

# create regex groups with last column csv words
regxl = regxt.iloc[:, 2].to_list()
regx_grps = "|".join(["(" + i + ")" for i in regxl])

# get regex group match index
grp_match = df["field_name"].str.extract(regx_grps, flags=re.IGNORECASE)
grp_idx = (~grp_match.isnull()).idxmax(axis=1)

df["field_friendly_name"] = grp_idx.map(lambda r: regxt.loc[r, "Personal_Inforamtion"])
df["is_included_in_report"] = grp_idx.map(lambda r: str(r!=len(regxt)-1).upper())

print(df)

Output of df

                         _id field_name field_friendly_name ... mask_it  is_included_in_report
0   5e95a49b0985567430f8fc00   FullName           Full Name ...     NaN                   TRUE
1   5e95a4dd0985567430f9ef16        xyz           not_found ...     NaN                  FALSE
2   5e95a4dd0985567430f9ef17     FullNm           Full Name ...     NaN                   TRUE
3   5e95a4dd0985567430f9ef18  FirstName          First Name ...     NaN                   TRUE
4   5e95a49b0985567430f8fc01        abc           not_found ...     NaN                  FALSE
5   5e95a4dd0985567430f9ef19    FirstNm          First Name ...     NaN                   TRUE
6   5e95a4dd0985567430f9ef20   LastName           Last Name ...     NaN                   TRUE
7   5e95a4dd0985567430f9ef21     LastNm           Last Name ...     NaN                   TRUE
8   5e95a49b0985567430f8fc02  LegalName          Legal Name ...     NaN                   TRUE
9   5e95a4dd0985567430f9ef22    LegalNm          Legal Name ...     NaN                   TRUE
10  5e95a4dd0985567430f9ef23   NickName           Nick Name ...     NaN                   TRUE
11  5e95a4dd0985567430f9ef24        pqr           not_found ...     NaN                  FALSE
12  5e95a49b0985567430f8fc03     NickNm           Nick Name ...     NaN                   TRUE
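To see the mechanics of this answer in isolation, here is a toy sketch of the group-index trick. The series, the labels and the keywords are made up for illustration and are not part of the original data; the empty keyword plays the role of the not_found catch-all.

```python
import re
import pandas as pd

# Toy data; the field names and keywords are illustrative only.
fields = pd.Series(["FullName", "xyz", "LastNm", "MyFullName"])
labels = ["Full Name", "Last Name", "not_found"]
keywords = ["full", "last", ""]  # the empty keyword is the catch-all

# One capture group per keyword: "(full)|(last)|()"
pattern = "|".join(f"({k})" for k in keywords)

# One column per group; only the group that matched is non-null.
groups = fields.str.extract(pattern, flags=re.IGNORECASE)

# Column label of the first non-null group in each row.
group_idx = groups.notna().idxmax(axis=1)

# Translate the group index into the friendly label.
result = group_idx.map(lambda i: labels[i]).tolist()
print(result)
```

One design consequence worth knowing: the empty group can match at the very first character, so a keyword is only recognised when the field name starts with it. In this sketch `MyFullName` is therefore reported as not_found. That is fine for the sample data, where every keyword is a prefix, but may need a different pattern for other inputs.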
