Conditional data selection with text string data in pandas dataframe

Posted 2023-03-12


【Title】Conditional data selection with text string data in pandas dataframe 【Asked】2017-03-29 13:44:03 【Question】:

I have looked around but can't seem to find an answer to the following question.

I have a pandas dataframe similar to this (call it 'df'):

        Type              Set
    1   theGreen          Z
    2   andGreen          Z           
    3   yellowRed         X
    4   roadRed           Y

I want to add another column to the dataframe that holds a flag: (1) if Type contains the string 'Green', (0) otherwise.

Essentially, I'm trying to find a way to do this:

   df['color'] = np.where(df['Type'] == 'Green', 1, 0)

Except that instead of the usual numpy operators (<, >, ==, !=, etc.) I need a way to say "in" or "contains". Is this possible? Any and all help is appreciated!


【Answer 1】:

Use str.contains:

df['color'] = np.where(df['Type'].str.contains('Green'), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0
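
As a small variation on the str.contains approach, the sketch below rebuilds the sample frame from the question and uses two optional parameters of `Series.str.contains`: `regex=False` (match a literal substring) and `case=False` (ignore case). The column name `color_any_case` is made up for illustration, and casting the mask with `astype(int)` is offered only as an alternative to `np.where`, not as the answer's method.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

# regex=False treats the pattern as a literal substring rather than a regex.
mask = df["Type"].str.contains("Green", regex=False)

# A boolean mask can be cast directly instead of going through np.where.
df["color"] = mask.astype(int)

# case=False also matches "green", "GREEN", and so on.
# "color_any_case" is a hypothetical column name.
df["color_any_case"] = (
    df["Type"].str.contains("green", case=False, regex=False).astype(int)
)

print(df)
```

Literal matching matters when the search text contains regex metacharacters such as `.` or `(`, which would otherwise be interpreted as a pattern.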

Another solution with apply:

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
print (df)
        Type Set  color
1   theGreen   Z      1
2   andGreen   Z      1
3  yellowRed   X      0
4    roadRed   Y      0

The second solution is faster, but it does not work if the Type column contains NaN; in that case it raises an error:

TypeError: argument of type 'float' is not iterable
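
The failure can be reproduced with a small frame that has a missing value in Type. The sketch below is an illustration only: the frame is made up to mirror the question's data, and `raised` is a hypothetical helper variable. It also shows one NaN-safe option, the `na` parameter of `Series.str.contains`, which sets the result used for missing values.

```python
import numpy as np
import pandas as pd

# Question's data with the last Type replaced by NaN (illustrative).
df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

# NaN is a float, and the `in` operator is not defined for floats,
# so the lambda raises TypeError when it reaches the missing row.
try:
    df["Type"].apply(lambda x: "Green" in x)
    raised = False
except TypeError:
    raised = True

# na=False makes str.contains report missing values as "no match".
df["color"] = np.where(df["Type"].str.contains("Green", na=False), 1, 0)

print(raised)
print(df)
```

With `na=False` the missing row gets 0. If missing values should stay missing in the result, a different treatment is needed.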

Timings:

#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)  

In [276]: %timeit df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x), 1, 0)
10 loops, best of 3: 94.1 ms per loop

In [277]: %timeit df['color1'] = np.where(df['Type'].str.contains('Green'), 1, 0)
1 loop, best of 3: 256 ms per loop
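
`%timeit` is only available inside IPython. A rough equivalent with the standard-library `timeit` module might look like the sketch below. The function names `with_apply` and `with_contains`, the 40,000-row size, and the repeat counts are arbitrary choices for illustration. Absolute numbers, and possibly which approach wins, depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

small = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", "roadRed"],
     "Set": ["Z", "Z", "X", "Y"]}
)
# 40,000 rows: an arbitrary size, smaller than above, to keep the run short.
big = pd.concat([small] * 10_000).reset_index(drop=True)


def with_apply():
    return np.where(big["Type"].apply(lambda x: "Green" in x), 1, 0)


def with_contains():
    return np.where(big["Type"].str.contains("Green"), 1, 0)


# Both approaches should agree before their speed is compared.
same_result = bool((with_apply() == with_contains()).all())

# Best of 3 repeats with 5 calls each, reported as time per call.
t_apply = min(timeit.repeat(with_apply, number=5, repeat=3)) / 5
t_contains = min(timeit.repeat(with_contains, number=5, repeat=3)) / 5

print(f"apply:        {t_apply * 1e3:.2f} ms per call")
print(f"str.contains: {t_contains * 1e3:.2f} ms per call")
```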

【Comments】:

Can you write a function that handles NaN and apply it instead of the lambda?

@wwii I am only on my phone. I'll add a solution tomorrow.

@wwii - it is more complicated. For me,

df['color'] = np.where(df['Type'].apply(lambda x: 'Green' in x if pd.notnull(x) else False), 1, np.where(df['Type'].isnull(), np.nan, 0))

and

df = pd.DataFrame({'Set': {1: 'Z', 2: 'Z', 3: 'X', 4: 'Y'}, 'Type': {1: 'theGreen', 2: 'andGreen', 3: 'yellowRed', 4: np.nan}}, columns=['Type', 'Set'])

work together.

My thinking was more like - def is_green(thing): try: return 'Green' in thing; except (ValueError, TypeError) as e: return False - and then df['color'] = np.where(df['Type'].apply(is_green), 1, 0)
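
The `is_green` idea from the last comment, laid out as runnable Python. This is a sketch of the commenter's suggestion. The sample frame with a NaN in Type is assumed from the comment thread, and the docstring and comments are additions.

```python
import numpy as np
import pandas as pd


def is_green(thing):
    """Return True if 'Green' occurs in thing; unusable values count as no match."""
    try:
        return "Green" in thing
    except (ValueError, TypeError):
        # TypeError is what NaN (a float) triggers, since `in` needs an iterable.
        return False


df = pd.DataFrame(
    {"Type": ["theGreen", "andGreen", "yellowRed", np.nan],
     "Set": ["Z", "Z", "X", "Y"]},
    index=[1, 2, 3, 4],
)

df["color"] = np.where(df["Type"].apply(is_green), 1, 0)
print(df)
```

This version maps a missing Type to 0. The nested `np.where` variant in the comment above keeps NaN in the color column instead. Which behaviour is preferable depends on how missing data should be treated downstream.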
