Filter Pandas Dataframe by Removing matching email domains from multiple columns
【Posted】: 2020-02-10 00:04:09
【Question】: I have the dataframe shown below. I want to filter out the rows where the domain in the "Email__c" column matches the domain in the "Internal_Email" or "Alt_Email" column. For example, in the first row the substring/domain "compx" in "doug@compx.com" matches the substring/domain in "ruda@compx.com" and "sales@compx.com", so I want that row to be filtered out. All rows in the dataframe below should be filtered out.
Company   Email__c          Action  Internal_Email    Alt_Email
CompX     doug@compx.com    View    ruda@compx.com    sales@compx.com
Doit Inc  try@doit.com.au   View    pop@doit.com      info@doit.com
PIA       mbosi@pia.com     Sell    voss@pia.com      info@pia.com
Techy     pat@techy.com.br  Buy     tra@techy.com.br  contat@techy.com.br
Techy     pat@techy.com.br  Buy     tra@techy.com.br  contat@techy.com.br
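To try the answers on this data, the table can be loaded into a DataFrame. The following is a minimal sketch, assuming the rows are stored as semicolon-separated text (the `raw` string is an illustration, not part of the original post):

```python
import io

import pandas as pd

# The question's table, written out as semicolon-separated text (assumed layout).
raw = """Company;Email__c;Action;Internal_Email;Alt_Email
CompX;doug@compx.com;View;ruda@compx.com;sales@compx.com
Doit Inc;try@doit.com.au;View;pop@doit.com;info@doit.com
PIA;mbosi@pia.com;Sell;voss@pia.com;info@pia.com
Techy;pat@techy.com.br;Buy;tra@techy.com.br;contat@techy.com.br
Techy;pat@techy.com.br;Buy;tra@techy.com.br;contat@techy.com.br
"""

# Parse the text exactly as if it were a CSV file on disk.
df = pd.read_csv(io.StringIO(raw), sep=";")
print(df.shape)  # (5, 5)
```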
【Answer 1】: Try this:
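# Capture the first domain label after "@" (e.g. "compx" in "doug@compx.com"); the dot is escaped so it matches literally
s1 = df.Email__c.str.extract(r'@(\w+)\.', expand=False)
s2 = df.Internal_Email.str.extract(r'@(\w+)\.', expand=False)
s3 = df.Alt_Email.str.extract(r'@(\w+)\.', expand=False)
# Keep only rows whose Email__c label differs from both other columns
df[s1.ne(s2) & s1.ne(s3)]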
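One caveat with the `\w+` capture used above: `\w` does not match a hyphen, so a hyphenated domain would not be captured. The sketch below illustrates this on three sample addresses; the `my-corp.com` address and the `[^@.]+` pattern are assumptions added for illustration, not part of the original answer.

```python
import pandas as pd

# Two addresses from the question plus one hypothetical hyphenated domain.
emails = pd.Series(["doug@compx.com", "try@doit.com.au", "joe@my-corp.com"])

# \w+ stops at "-", so the hyphenated address yields NaN with this pattern.
word_label = emails.str.extract(r"@(\w+)\.", expand=False)

# Alternative (assumption): take everything after "@" up to the first dot.
first_label = emails.str.extract(r"@([^@.]+)\.", expand=False)

print(word_label.tolist())
print(first_label.tolist())
```

Whether hyphenated domains matter depends on the data; for the table in the question both patterns give the same labels.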
【Answer 2】: Try:
email_domain = df["Email__c"].str.split("@").str[1]
internal_domain = df["Internal_Email"].str.split("@").str[1]
alt_domain = df["Alt_Email"].str.split("@").str[1]
df.loc[~((email_domain == internal_domain) | (email_domain == alt_domain))]
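【Comments】:
The whole string does not have to match, only the domain, i.e. the substring after the @.
【Answer 3】: I think you want something like this: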
import pandas as pd
df = pd.read_csv('data.csv', sep=';')  # I saved your data as a CSV file and read it back
df
Out:
   Company   Email__c          Action  Internal_Email    Alt_Email
0  CompX     doug@compx.com    View    ruda@compx.com    sales@compx.com
1  Doit Inc  try@doit.com.au   View    pop@doit.com      info@doit.com
2  PIA       mbosi@pia.com     Sell    voss@pia1.com     info@pia.com
3  Techy     pat@techy.com.br  Buy     tra@techy.com.br  contat@techy1.com.br
4  Techy     pat@techy.com.br  Buy     tra@techy.com.br  contat@techy.com.br
Building the conditions:
df['email_c_domain'] = [x.split('@')[1] for x in df['Email__c']]  # helper column: domain part of Email__c
df['filter_out_1'] = [x.split('@')[1] for x in df['Internal_Email']]  # helper column: domain part of Internal_Email
df['filter_out_2'] = [x.split('@')[1] for x in df['Alt_Email']]  # helper column: domain part of Alt_Email
df['match_1'] = df['email_c_domain'] == df['filter_out_1']  # Email__c domain equals Internal_Email domain
df['match_2'] = df['email_c_domain'] == df['filter_out_2']  # Email__c domain equals Alt_Email domain
df['filtered_out'] = df['match_1'] | df['match_2']  # True if either comparison matches
Now df looks like this:
Company Email__c Action Internal_Email Alt_Email email_c_domain filter_out_1 filter_out_2 match_1 match_2 filtered_out
0 CompX doug@compx.com View ruda@compx.com sales@compx.com compx.com compx.com compx.com True True True
1 Doit Inc try@doit.com.au View pop@doit.com info@doit.com doit.com.au doit.com doit.com False False False
2 PIA mbosi@pia.com Sell voss@pia1.com info@pia.com pia.com pia1.com pia.com False True True
3 Techy pat@techy.com.br Buy tra@techy.com.br contat@techy1.com.br techy.com.br techy.com.br techy1.com.br True False True
4 Techy pat@techy.com.br Buy tra@techy.com.br contat@techy.com.br techy.com.br techy.com.br techy.com.br True True True
Now let's filter it:
df[~df['filtered_out']]
The output is:
Company Email__c Action Internal_Email Alt_Email email_c_domain filter_out_1 filter_out_2 match_1 match_2 filtered_out
1 Doit Inc try@doit.com.au View pop@doit.com info@doit.com doit.com.au doit.com doit.com False False False
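The helper columns above compare the full domain after "@", which is why row 1 (doit.com.au versus doit.com) survives. If the intent is the looser match described in the question, on the first label only ("doit"), a minimal sketch could look like the following. The `first_label` helper is hypothetical, and the data is the question's original table rather than the modified copy used in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["CompX", "Doit Inc", "PIA", "Techy", "Techy"],
    "Email__c": ["doug@compx.com", "try@doit.com.au", "mbosi@pia.com",
                 "pat@techy.com.br", "pat@techy.com.br"],
    "Action": ["View", "View", "Sell", "Buy", "Buy"],
    "Internal_Email": ["ruda@compx.com", "pop@doit.com", "voss@pia.com",
                       "tra@techy.com.br", "tra@techy.com.br"],
    "Alt_Email": ["sales@compx.com", "info@doit.com", "info@pia.com",
                  "contat@techy.com.br", "contat@techy.com.br"],
})


def first_label(col: pd.Series) -> pd.Series:
    """Return the text between '@' and the first following dot."""
    return col.str.split("@").str[1].str.split(".").str[0]


target = first_label(df["Email__c"])
others = df[["Internal_Email", "Alt_Email"]].apply(first_label)

# A row is dropped when the Email__c label equals the label in any other column.
matches = others.eq(target, axis=0).any(axis=1)
kept = df[~matches]
print(len(kept))  # 0: every row in the question's table matches
```

This avoids adding helper columns to `df`, and extending the comparison to more email columns only requires adding their names to the list.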