Replace a value in a column by vlookup another dataframe only if the value exists

Posted 2023-02-23

【Title】: Replace a value in a column by vlookup another dataframe only if the value exists 【Posted】: 2018-06-24 11:43:32 【Question】:

I want to overwrite the values in df1.Name based on the mapping table in (df2.Name1, df2.Name2). However, not all values in df1.Name exist in df2.Name1.

df1:

Name
Alex
Maria 
Marias
Pandas
Coala

df2:

Name1   Name2
Alex    Alexs
Marias  Maria
Coala   Coalas

Expected result:

Name
Alexs
Maria
Maria
Pandas
Coalas

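I have tried several solutions found online, such as the map function. After turning df2 into a dictionary, I use df1.Name = df1.Name.map(Dictionary), but this produces NaN for every value that is not in df2, as shown below.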

Name
Alexs
Maria
Maria
NaN
Coalas

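I am not sure how to use an IF statement to replace only the values that actually exist in df2 and keep the rest as they are in df1. I also tried writing a function with an if statement, but it failed.

How can I solve this?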

【Comments】:

The expected result is the updated df1.Name column.

【Answer 1】:

Using replace:

df1.Name.replace(df2.set_index('Name1').Name2.to_dict())
Out[437]: 
0     Alexs
1     Maria
2     Maria
3    Pandas
4    Coalas
Name: Name, dtype: object
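
The one-liner above returns a new Series and leaves df1 unchanged. As a minimal self-contained sketch (the variable name `mapping` is only illustrative, and the frames are rebuilt from the question's data), the result can be assigned back to the column like this:

```python
import pandas as pd

# Data from the question
df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

# Build an {old: new} dictionary from the mapping table
mapping = df2.set_index('Name1')['Name2'].to_dict()

# replace() only touches values that appear as keys; everything else is kept
df1['Name'] = df1['Name'].replace(mapping)
print(df1)
```

Values such as Pandas that have no entry in the dictionary pass through untouched, which is the behaviour asked for in the question.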

【Comments】:

【Answer 2】:

Let's use a pandas solution with map and combine_first:

df1['Name'].map(df2.set_index('Name1')['Name2']).combine_first(df1['Name'])

Output:

0     Alexs
1     Maria
2     Maria
3    Pandas
4    Coalas
Name: Name, dtype: object
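
Mapping against a Series requires its index (here Name1) to be unique. The following is a sketch of one way to cope with repeated keys, assuming it is acceptable to keep the first Name2 seen for each Name1. The duplicated 'Alex' row is made up for illustration, and fillna is used here in place of combine_first; for this case both fill the unmatched rows from the original column.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
# Hypothetical mapping table in which 'Alex' appears twice
df2 = pd.DataFrame({'Name1': ['Alex', 'Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Alexs', 'Maria', 'Coalas']})

# Keep the first row per Name1 so the lookup index is unique
lookup = df2.drop_duplicates(subset='Name1').set_index('Name1')['Name2']

# map() yields NaN for names without a match; fillna() restores the original
df1['Name'] = df1['Name'].map(lookup).fillna(df1['Name'])
print(df1)
```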

【Comments】:

This gives me the following error: InvalidIndexError: Reindexing only valid with uniquely valued Index objects

Ah.. so your df2 has duplicated 'Name1' values?

【Answer 3】:

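Python's dict.get() accepts a default argument. So if you build a translation dictionary, it is easy to return the original value whenever the lookup finds nothing, like:

Code: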

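# Build an {old: new} dictionary from the two mapping columns
translate = {x: y for x, y in df2[['Name1', 'Name2']].values}
# Fall back to the original name when there is no entry for it
new_names = [translate.get(x, x) for x in df1['Name']]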

Test code:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

print(df1)
print(df2)

translate = {x: y for x, y in df2[['Name1', 'Name2']].values}
print([translate.get(x, x) for x in df1['Name']])

Test results:

     Name
0    Alex
1   Maria
2  Marias
3  Pandas
4   Coala

    Name1   Name2
0    Alex   Alexs
1  Marias   Maria
2   Coala  Coalas

['Alexs', 'Maria', 'Maria', 'Pandas', 'Coalas']
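
The list comprehension produces a plain Python list. A minimal sketch of the same dict.get() fallback applied through Series.map, so the result lands directly in the df1 column (building the dictionary with dict(zip(...)) is just one possible choice):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

translate = dict(zip(df2['Name1'], df2['Name2']))

# Passing a function (not the dict itself) lets us supply the fallback value
df1['Name'] = df1['Name'].map(lambda name: translate.get(name, name))
print(df1)
```

Passing the dictionary itself to map would give NaN for missing keys, which is the problem described in the question; wrapping the lookup in a function avoids that.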

【Comments】:

Thanks Stephen, but how can I get it back as a dataframe, since I want df1.Name to hold the updated names (where available)?

df1.Name = new_names

【Answer 4】:

You can also use merge:

In [27]: df1['Name'] = df1.merge(df2.rename(columns={'Name1':'Name'}), how='left') \
                          .ffill(axis=1)['Name2']

In [28]: df1
Out[28]:
     Name
0   Alexs
1   Maria
2   Maria
3  Pandas
4  Coalas
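
A variant of the merge approach that makes the fallback explicit, sketched under the assumption that Name1 is unique in df2 (otherwise the left merge would return more rows than df1 has). The intermediate name `merged` is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

# A left merge keeps every row of df1; unmatched rows get NaN in Name2
merged = df1.merge(df2, how='left', left_on='Name', right_on='Name1')

# Use Name2 where a match exists, otherwise keep the original Name
df1['Name'] = merged['Name2'].fillna(merged['Name']).to_numpy()
print(df1)
```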

【Comments】:

【Answer 5】:

You can also use replace:

import pandas as pd

df1 = pd.DataFrame({'Name': ['Alex', 'Maria', 'Marias', 'Pandas', 'Coala']})
df2 = pd.DataFrame({'Name1': ['Alex', 'Marias', 'Coala'],
                    'Name2': ['Alexs', 'Maria', 'Coalas']})

# Create the dictionary from df2
d = {"Name": {k: v for k, v in zip(df2["Name1"], df2["Name2"])}}
# Suggestion from Wen to create the dictionary
# d = {"Name": df2.set_index('Name1').Name2.to_dict()}

df1.replace(d)   # Use df1.replace(d, inplace=True) if you want this in place

    Name
0   Alexs
1   Maria
2   Maria
3   Pandas
4   Coalas

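replace can take a dictionary in which you specify the column to operate on, here "Name", together with the mapping to apply within that particular column.

{"Name": {old_1: new_1, old_2: new_2...}}

-> Replace the values in the "Name" column so that old_1 is replaced with new_1, old_2 is replaced with new_2, and so on.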
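
To illustrate that the nested dictionary is scoped to a single column, here is a small sketch with a made-up second column, Alias, holding the same strings. Only the "Name" column is changed.

```python
import pandas as pd

# Hypothetical frame: both columns contain the same values
df = pd.DataFrame({'Name': ['Alex', 'Coala'],
                   'Alias': ['Alex', 'Coala']})

# The outer key selects the column, the inner dict is the {old: new} mapping
out = df.replace({'Name': {'Alex': 'Alexs', 'Coala': 'Coalas'}})
print(out)
```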

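Thanks to Stephen Rauch for the setup. Thanks to Wen for a clean way to create the dictionary.

【Comments】:

Check mine, you don't need to build the dictionary with a loop :-)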
