How to replace 'any strings' with nan in a pandas DataFrame using a boolean mask?

Posted 2023-03-11

【Title】: How to replace 'any strings' with nan in pandas DataFrame using a boolean mask? 【Posted】: 2018-04-10 14:46:54 【Question】:

I have a 227x4 DataFrame containing country names and numeric values that I need to clean (wrangle?).

Here is an abstraction of the DataFrame:

import pandas as pd
import random
import string
import numpy as np

# Six random three-letter strings standing in for country names
pdn = pd.DataFrame(["".join(random.choice(string.ascii_letters) for _ in range(3)) for _ in range(6)], columns=['Country Name'])
# np.random.random_integers is deprecated; randint(1, 11) draws from the same 1..10 range
measures = pd.DataFrame(np.random.randint(1, 11, size=(6, 2)), columns=['Measure1', 'Measure2'])
df = pdn.merge(measures, how='inner', left_index=True, right_index=True)

df.iloc[4,1] = 'str'
df.iloc[1,2] = 'stuff'
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4

How can I replace the string values in all columns with np.nan without touching the country names?

I tried using a boolean mask:

mask = df.loc[:,measures.columns].applymap(lambda x: isinstance(x, (int, float))).values
print(mask)

[[ True  True]
 [ True False]
 [ True  True]
 [ True  True]
 [False  True]
 [ True  True]]

# I thought the following would replace by default false with np.nan in place, but it didn't
df.loc[:,measures.columns].where(mask, inplace=True)
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4


# this give a good output, unfortunately it's missing the country names
print(df.loc[:,measures.columns].where(mask))

  Measure1 Measure2
0        6        3
1        3      NaN
2        7        2
3        7        8
4      NaN        3
5        7        4

I looked at several related questions ([1], [2], [3], [4], [5], [6], [7], [8]), but could not find one that addressed my problem.

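【Comments】:

"A meta question: is it normal that it takes me more than three hours to formulate a question here (research included)?" - Yes. The success of Stack Overflow and of the whole Stack Exchange network depends on its high-quality content, both questions and answers. You cannot produce a high-quality question in a few minutes. Personally, I would put the required effort more on the order of days than hours. I have certainly spent a full day or more answering a question, and I would expect the asker to put in at least an order of magnitude more effort, since they are the one who benefits. Side note: meta questions should be asked on Meta Stack Overflow.

@JörgWMittag I was only counting the time spent writing the question, after I had given up trying on my own. If I had to count everything, it would indeed be days. When I have a few hours to spare, I will ask a question on meta. I felt stupid for spending so much time asking my question, but I feel better now: the quality of the answers shows the effort was worth it. Thanks!

【Answer 1】: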

Assign back only to the columns of interest:

cols = ['Measure1','Measure2']
mask = df[cols].applymap(lambda x: isinstance(x, (int, float)))

df[cols] = df[cols].where(mask)
print (df)
  Country Name Measure1 Measure2
0          uFv        7        8
1          vCr        5      NaN
2          qPp        2        6
3          QIC       10       10
4          Suy      NaN        8
5          eFS        6        4

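"A meta question: is it normal that it takes me more than 3 hours to formulate a question here (research included)?"

In my opinion, yes. Writing a good question is really hard.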
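A minimal sketch of why assigning back works while the in-place call from the question leaves df unchanged. The data copies the sample printed in the question; the names subset and before are illustrative, and Series.map is used in place of applymap only as an assumption to keep the sketch portable across pandas versions.

```python
import pandas as pd

# Sample copied from the question's printed DataFrame
df = pd.DataFrame({
    "Country Name": ["tua", "MDK", "RJU", "WyB", "Nnr", "rVN"],
    "Measure1": [6, 3, 7, 7, "str", 7],
    "Measure2": [3, "stuff", 2, 8, 3, 4],
})
cols = ["Measure1", "Measure2"]

# True where a cell holds a number, False where it holds anything else
mask = df[cols].apply(lambda s: s.map(lambda x: isinstance(x, (int, float))))

# Selecting several columns produces a separate object, so an
# in-place where() changes that object and not df itself.
subset = df.loc[:, cols]
subset.where(mask, inplace=True)
before = df.loc[4, "Measure1"]  # df still holds the string here

# Assigning the masked result back is what updates df.
df[cols] = df[cols].where(mask)
print(df)
```

The country names are never part of the selection, so they pass through untouched.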

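【Comments】:

I like it, but why does df2= df.loc[:,measures.columns].where(mask, inplace=True) not do the replacement, while df.loc[:,measures.columns].where(mask) prints correctly?

Because inplace always returns None, df2 is None.

I have edited the question. I don't understand why df.loc[:,measures.columns].where(mask, inplace=True) does not modify df?

I think the problem is that the assignment goes to a copy of df, the same issue as with fillna in this linked question. If you change your code to df[measures.columns].where(mask), you get a warning.

【Answer 2】: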

cols = ['Measure1','Measure2']
df[cols] = df[cols].applymap(lambda x: x if not isinstance(x, str) else np.nan)

or

df[cols] = df[cols].applymap(lambda x: np.nan if isinstance(x, str) else x)

Result:

In [22]: df
Out[22]:
  Country Name  Measure1  Measure2
0          nBl      10.0       9.0
1          Ayp       8.0       NaN
2          diz       4.0       1.0
3          aad       7.0       3.0
4          JYI       NaN      10.0
5          BJO       9.0       8.0
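The two predicates discussed for this answer ("replace strings" versus "keep numbers") agree on the question's data but differ on cells that are neither strings nor numbers. A small sketch, using a made-up Series s that is not part of the original question:

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing numbers, a string, None and a tuple
s = pd.Series([1, 2.5, "x", None, (1, 2)], dtype=object)

# Variant used in this answer: only strings become NaN
drop_strings = s.map(lambda x: np.nan if isinstance(x, str) else x)

# Variant from the question's mask: everything that is not int/float becomes NaN
keep_numbers = s.map(lambda x: x if isinstance(x, (int, float)) else np.nan)

print(drop_strings.tolist())
print(keep_numbers.tolist())
```

Here drop_strings leaves the tuple in place, while keep_numbers turns it into NaN. Which behaviour is wanted depends on what else can appear in the measure columns.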

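【Comments】:

But why negate with x if not isinstance(x, str) rather than x if isinstance(int,float) else np.nan?

That would replace all the numbers with nan. If you don't want the negation, then use x: np.nan if isinstance(x, str) else x

I don't want to replace the numbers.. I want to replace the non-numbers with nan

@MalikKoné, I think you want to use Bharath shetty's solution

All three answers are very interesting to me... My focus is on understanding; I don't need to optimize physical resources yet. :o)

【Answer 3】: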

Use to_numeric with errors='coerce', i.e.

cols = ['Measure1','Measure2']
df[cols] = df[cols].apply(pd.to_numeric,errors='coerce')

  Country Name  Measure1  Measure2
0          酒吧       7.0       6.0
1          JHq       2.0       NaN
2          欧培       4.0       3.0
3          像素       3.0       6.0
4          ouP       NaN       4.0
5          qZR       4.0       6.0
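One behavioural difference from the isinstance-based mask may be worth keeping in mind: pd.to_numeric parses strings that look like numbers instead of discarding them. A short sketch on a made-up Series s (the value "12" is an assumption added for illustration and does not occur in the question's data):

```python
import pandas as pd

# Hypothetical column with a numeric-looking string
s = pd.Series([6, "12", "stuff", 3.5], dtype=object)

# Coercion: "12" is parsed to 12.0, "stuff" becomes NaN
coerced = pd.to_numeric(s, errors="coerce")

# Type-based mask: every string becomes NaN, including "12"
masked = s.where(s.map(lambda x: isinstance(x, (int, float))))

print(coerced.tolist())
print(masked.tolist())
```

The coerced result also has a float dtype, whereas the masked column stays object, which matches the 7.0-style values in the output above.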

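【Comments】:

I think we can get rid of the lambda in this case: df[cols] = df[cols].apply(pd.to_numeric, errors='corece')

@Bharathshetty, your answer is too good (if that is possible). I would indeed coerce the strings to numeric values, but that was not clear to me when I asked the question. My focus was on how to use a boolean mask and why inplace does not work.

@Bharathshetty I think it should read errors=coerce rather than errors=corece

That was a small typo. Sorry.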
