pandas, merge duplicates if row contains wildcard text


【Title】pandas, merge duplicates if row contains wildcard text 【Posted】2021-01-30 09:15:09 【Question】:

I have a dataset with duplicate rows (by ID). The dataset contains general information plus email addresses. I am trying to concatenate the emails (if the row contains the character @) and then remove the duplicates.

My original dataset:

What I want to achieve:

My current code is a modification of Eric Ed Lohmar's code and gives the output below. My problem is that I cannot exclude the "noise" values (nan, '0', '-') from the final result.

Current output:

How can I append only the rows that actually contain an email address? I figured I could use a wildcard to skip every row that does not contain the character @, by replacing this part:

if row['Store1_Email']: # <- not working

I tried both of the following, but neither works:

1.

if str('**@**') in row['Store1_Email']: # <- not working

Error:

Traceback (most recent call last):
  File "g:/Till/till_duplicate.py", line 35, in <module>
    if str('**@**') in row['Store1_Email']:
TypeError: argument of type 'float' is not iterable
PS G:\Till>

2.

Error:

Traceback (most recent call last):
  File "g:/Till/till_duplicate.py", line 35, in <module>
    if df_merged_duplicates[df_merged_duplicates.loc[i, 'Store1_Email'].str.contains('@')]:
AttributeError: 'str' object has no attribute 'str'
PS G:\Till>
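Both tracebacks share one root cause: a missing cell comes back as float NaN, so `'@' in row[...]` tries to iterate over a float (first error), while a single scalar pulled out of the frame is a plain str, not a Series, so it has no `.str` accessor (second error). A minimal sketch of a NaN-safe check (my own suggestion, not code from the original post):

```python
import numpy as np
import pandas as pd

row = {'Store1_Email': np.nan}  # a missing cell comes back as float NaN

def has_email(value):
    # pd.notna guards against NaN before the substring test runs
    return pd.notna(value) and '@' in str(value)

print(has_email(row['Store1_Email']))   # False
print(has_email('Email@company1.com'))  # True
print(has_email('-'))                   # noise value -> False
```

The `pd.notna` guard short-circuits, so the `in` test never sees a float.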

Full code:

import pandas as pd
import os
from datetime import datetime
import time
from shutil import copyfile
from functools import reduce
import numpy as np
import glob


# # Settings
path_data_sources = 'G:/Till/'

# print(path_data_sources + 'test_duplicates - Copy.xlsx')

## https://***.com/questions/36271413/pandas-merge-nearly-duplicate-rows-based-on-column-value
# df_merged_duplicates = pd.read_excel(path_data_sources + 'test_duplicates - Source.xlsx', sheet_name="Sheet1", dtype=str)

data = {'ID': ['001', '001', '002', '002', '003', '003', '004', '004', '005', '005', '006', '006', '007', '007', '008', '008', '009', '009', '010', '010', '011', '011', '012', '012', '013', '013', '014', '014'],
        'Header 1': ['AA', 'AA', 'BB', 'BB', 'CC', 'CC', 'DD', 'DD', 'EE', 'EE', 'FF', 'FF', 'GG', 'GG', 'HH', 'HH', 'II', 'II', 'JJ', 'JJ', 'KK', 'KK', 'LL', 'LL', 'MM', 'MM', 'NN', 'NN'],
        'Header 2': [np.nan] * 28,
        'Header 3': [np.nan] * 28,
        'Header 4': [np.nan] * 28,
        'Header 5': [np.nan] * 28,
        'Store1_Email': ['Email@company1.com'] + [np.nan] * 7
                        + ['Email@company2.com'] * 2 + ['Email@company3.com'] * 2
                        + [np.nan] * 10 + ['Email@company4.com'] * 2 + [np.nan] * 4,
        'Header 7': [np.nan] * 28,
        'Header 8': [np.nan] * 28,
        'Header 9': [np.nan] * 28,
        'Store2_Email': [np.nan] * 28,
        'Header 11': [np.nan] * 28,
        'Header 12': [np.nan] * 28,
        'Store3_Email': [np.nan] * 28,
        'Header 14': [np.nan] * 28,
        'Header 15': [np.nan] * 28,
        'Header 16': [np.nan] * 28,
        'Header 17': [np.nan] * 28,
        'Store4_Email': ['Email2@company2.com', '0'] + [np.nan] * 14
                        + ['Email2@company3.com', '-'] + [np.nan] * 2
                        + ['-', 'Email2@company4.com'] + [np.nan] * 6,
        'Header 19': [np.nan] * 28}
df_merged_duplicates = pd.DataFrame(data)

print(df_merged_duplicates)

df_merged_duplicates = df_merged_duplicates.sort_values(by=['ID']) # sort ID column

# Store 1 emails, merge
cmnts = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:
            if row['Store1_Email']: # <- not working
                cmnts[row['ID']].append(row['Store1_Email'])
            else:
                cmnts[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts[row['ID']] = []

# Store 2 emails, merge
cmnts2 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:
            if row['Store2_Email']: # <- not working
                cmnts2[row['ID']].append(row['Store2_Email'])
            else:
                cmnts2[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts2[row['ID']] = []

# Store 3 emails, merge
cmnts3 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:           
            if row['Store3_Email']: # <- not working
                cmnts3[row['ID']].append(row['Store3_Email'])
            else:
                cmnts3[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts3[row['ID']] = []

# Store 4 emails, merge
cmnts4 = {}
for i, row in df_merged_duplicates.iterrows():
    while True:
        try:           
            if row['Store4_Email']: # <- not working
                cmnts4[row['ID']].append(row['Store4_Email'])
            else:
                cmnts4[row['ID']].append(np.nan)

            break

        except KeyError:
            cmnts4[row['ID']] = []

df_merged_duplicates.drop_duplicates('ID', inplace=True)
df_merged_duplicates['Store1_Email'] = [', '.join(map(str, v)) for v in cmnts.values()]
df_merged_duplicates['Store2_Email'] = [', '.join(map(str, v)) for v in cmnts2.values()]
df_merged_duplicates['Store3_Email'] = [', '.join(map(str, v)) for v in cmnts3.values()]
df_merged_duplicates['Store4_Email'] = [', '.join(map(str, v)) for v in cmnts4.values()]



print(df_merged_duplicates)
df_merged_duplicates.to_excel(path_data_sources + 'test_duplicates_ny.xlsx', index=False)
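A hedged sketch of how the loops above could skip the noise values: only append values that are strings and actually contain '@' (shown for one column on toy data; the names mirror the question's):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['001', '001'],
                   'Store1_Email': ['Email@company1.com', np.nan]})

cmnts = {}
for _, row in df.iterrows():
    cmnts.setdefault(row['ID'], [])
    val = row['Store1_Email']
    # append only genuine email strings; NaN, '0' and '-' are skipped
    if isinstance(val, str) and '@' in val:
        cmnts[row['ID']].append(val)

print(cmnts)  # {'001': ['Email@company1.com']}
```

`setdefault` also replaces the `while True` / `except KeyError` retry pattern.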

【Discussion】:

Please try to give a minimal working example of your problem. It is also important not to post pictures of your dataframe, since people cannot work with images. 【Solution 1】:

I would use a "split-apply-combine" approach. In pandas you can do this with the groupby function, then apply a function that combines the email addresses for each group (in this case, grouping by the ID column).
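As a minimal illustration of split-apply-combine on toy data (not the question's frame):

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2], 'val': ['a', 'b', 'c']})

# split on ID, apply ','.join to each group's values, combine into a Series
out = df.groupby('ID')['val'].agg(','.join)
print(out.loc[1])  # a,b
```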

I wrote a function that combines the email addresses for a given column:

def combine_emails(series):
    strs = [s for s in series.astype(str).values if '@' in s]
    combined_emails = ",".join(strs)
    if combined_emails !='':
        return combined_emails
    else:
        return np.nan
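For instance, run against a column that mixes a real address with the question's noise values ('0', '-', NaN), only the address survives (self-contained sketch repeating the function above):

```python
import numpy as np
import pandas as pd

def combine_emails(series):
    # keep only values that look like emails (contain '@'), join with commas
    strs = [s for s in series.astype(str).values if '@' in s]
    combined_emails = ",".join(strs)
    return combined_emails if combined_emails != '' else np.nan

col = pd.Series(['Email2@company2.com', '0', np.nan, '-'])
print(combine_emails(col))  # Email2@company2.com
```

Note that `astype(str)` turns NaN into the string 'nan', which the '@' filter then discards along with '0' and '-'.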

Then I wrote a function that takes the first row of each grouped dataframe and calls the combining function on the email columns to fill in that row's email values:

def combine_duplicate_rows(df):
    first_row = df.iloc[0]
    for email_col in ['Store1_Email', 'Store2_Email', 'Store3_Email', 'Store4_Email']:
        first_row[email_col] = combine_emails(df[email_col])
    return first_row

Then you can apply combine_duplicate_rows to your groups and you get the solution:

In [71]: df.groupby('ID').apply(combine_duplicate_rows)
Out[71]:
    ID Header 1  Header 2  Header 3  Header 4  Header 5                           Store1_Email  Header 9  Store2_Email  Header 12  Store3_Email  Header 17         Store4_Email
ID
1    1       AA       NaN       NaN       NaN       NaN                     Email@company1.com       NaN           NaN        NaN           NaN        NaN  Email2@company2.com
2    2       BB       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
3    3       CC       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
4    4       DD       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
5    5       EE       NaN       NaN       NaN       NaN  Email@company2.com,Email@company2.com       NaN           NaN        NaN           NaN        NaN                  NaN
6    6       FF       NaN       NaN       NaN       NaN  Email@company3.com,Email@company3.com       NaN           NaN        NaN           NaN        NaN                  NaN
7    7       GG       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
8    8       HH       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
9    9       II       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company3.com
10  10       JJ       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
11  11       KK       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN  Email2@company4.com
12  12       LL       NaN       NaN       NaN       NaN  Email@company4.com,Email@company4.com       NaN           NaN        NaN           NaN        NaN                  NaN
13  13       MM       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN
14  14       NN       NaN       NaN       NaN       NaN                                    NaN       NaN           NaN        NaN           NaN        NaN                  NaN

You then have a duplicate ID column, but you can drop it:

del df['ID']
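Alternatively (my suggestion, not part of the original answer), you can drop the leftover column and restore ID from the group index, which keeps ID as a regular column:

```python
import pandas as pd

# toy stand-in for the grouped result: after groupby('ID').apply(...),
# ID is both the index and a leftover column
result = pd.DataFrame({'ID': [1, 2], 'val': ['a', 'b']},
                      index=pd.Index([1, 2], name='ID'))

clean = result.drop(columns='ID').reset_index()
print(list(clean.columns))  # ['ID', 'val']
```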

【Comments】:

Thanks!! The functions seem to be the solution. Maybe it's because I'm new at this, but I'm struggling to get the line for email_col in [....] to work. The grouping works fine for Store1_Email, but for the other columns nothing is merged... any ideas?

I finally found it. return first_row was indented one level too deep. Once I removed that, the code works perfectly :)!

Good catch/fix. That got mangled in the copy-paste.
