Block text in pandas column based on names

【Title】: Block text in pandas column based on names 【Posted】: 2019-12-22 04:24:03 【Question】:

Background

This question is a variation of "Alter text in pandas column based on names".

I have the following df, which deliberately contains various issues

import pandas as pd

# Sample data; the casing and name mismatches are deliberate
df = pd.DataFrame({'Text' : ['But now Smith,J J is Here from Smithsville', 
                                   'Maryland is HYDER,A MARY Found here ', 
                                   'hey here is Annual Doe,Jane Ann until ',
                                'The tuckered was Tucker,Tom is Not here but'], 

                      'P_ID': [1,2,3,4], 
                      'P_Name' : ['SMITH,J J', 'HYDER,A MARY', 'DOE,JANE ANN', 'TUCKER,TOM T'],
                      'N_ID' : ['A1', 'A2', 'A3', 'A4']

                     })

Output

   N_ID P_ID P_Name         Text
0   A1  1   SMITH,J J       But now Smith,J J is Here from Smithsville
1   A2  2   HYDER,A MARY    Maryland is HYDER,A MARY Found here
2   A3  3   DOE,JANE ANN    hey here is Annual Doe,Jane Ann until
3   A4  4   TUCKER,TOM T    The tuckered was Tucker,Tom is Not here but

Goal

1) For each name in P_Name, e.g. SMITH,J J, block the name with **BLOCK** in the corresponding Text column

2) Create New_Text

Desired output

    N_ID P_ID P_Name Text   New_Text
0                           But now **BLOCK** is Here from Smithsville
1                           Maryland is **BLOCK**  Found here
2                           hey here is Annual **BLOCK**  until
3                           The tuckered was **BLOCK** is Not here but

Question

How do I achieve my desired output?

【Answer 1】:

If you remove the whitespace inconsistencies, you can use the replace function with regex=True

import pandas as pd

# new data frame without the whitespace inconsistencies
df = pd.DataFrame({'Text' : ['But now Smith,J J is Here from Smithsville', 
                                   'Maryland is HYDER,A MARY Found here ', 
                                   'hey here is Annual Doe,Jane Ann until ',
                                'The tuckered was Tucker,Tom T is Not here but'], 

                      'P_ID': [1,2,3,4], 
                      'P_Name' : ['SMITH,J J', 'HYDER,A MARY', 'DOE,JANE ANN', 'TUCKER,TOM T'],
                      'N_ID' : ['A1', 'A2', 'A3', 'A4']

                     })

# Lowercase both columns, then replace each name (treated as a regex pattern)
# (if your pandas version rejects a Series here, passing a list via .tolist() is one alternative)
print(df.Text.str.lower().replace(df.P_Name.str.lower(), '**BLOCK**', regex=True))

0    but now **BLOCK** is here from smithsville
1             maryland is **BLOCK** found here 
2           hey here is annual **BLOCK** until 
3    the tuckered was **BLOCK** is not here but
Name: Text, dtype: object
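The approach above lowercases the whole text and treats each name as a regular expression. As a rough sketch of a variation (not the answerer's code), the replacement can be done row by row with Python's `re` module, so that the original casing of the surrounding text is kept and the name is matched as literal text. The helper name `block_name` is made up for this illustration.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['But now Smith,J J is Here from Smithsville',
             'Maryland is HYDER,A MARY Found here ',
             'hey here is Annual Doe,Jane Ann until ',
             'The tuckered was Tucker,Tom T is Not here but'],
    'P_Name': ['SMITH,J J', 'HYDER,A MARY', 'DOE,JANE ANN', 'TUCKER,TOM T'],
})

def block_name(row):
    # Hypothetical helper: re.escape makes the name literal text,
    # IGNORECASE lets it match regardless of casing in Text
    pattern = re.escape(row['P_Name'])
    return re.sub(pattern, '**BLOCK**', row['Text'], flags=re.IGNORECASE)

# Each row's Text is only searched for that same row's P_Name
df['New_Text'] = df.apply(block_name, axis=1)
print(df['New_Text'].tolist())
```

Because only the matched span is replaced, words such as "Smithsville" and "Maryland" are left alone and keep their capitalisation.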

【Answer 2】:

This should work:

df['New_Text'] = df.apply(lambda x:x['Text'].lower().replace(x['P_Name'].lower(), '**BLOCK**'), axis=1)

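Your example has some whitespace issues, but this should work for a properly constructed example

Output (with the whitespace issues fixed; the last row has no exact match)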

0          but now BLOCK is here from smithsville
1                   maryland is BLOCK found here 
2                 hey here is annual BLOCK until 
3    the tuckered was tucker, tom is not here but
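Since an exact substring replacement silently leaves a row unchanged when the name is not found, it can help to flag those rows before trusting New_Text. The following is a minimal sketch using the question's original data; the `Matched` column and the `unmatched` variable are names invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['But now Smith,J J is Here from Smithsville',
             'Maryland is HYDER,A MARY Found here ',
             'hey here is Annual Doe,Jane Ann until ',
             'The tuckered was Tucker,Tom is Not here but'],
    'P_Name': ['SMITH,J J', 'HYDER,A MARY', 'DOE,JANE ANN', 'TUCKER,TOM T'],
})

# True where the lowercased name occurs verbatim in the lowercased text
df['Matched'] = df.apply(
    lambda r: r['P_Name'].lower() in r['Text'].lower(), axis=1
)

# Rows whose name was not found and would therefore stay unblocked
unmatched = df.loc[~df['Matched'], ['P_Name', 'Text']]
print(unmatched)
```

Here only the last row is reported, because the text contains "Tucker,Tom" while P_Name is "TUCKER,TOM T".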

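【Comments】:

The whitespace issues are intentional. My actual data looks very much like the data above, including the whitespace. Would the code above change drastically because of the whitespace?

Well, that was not part of the original question. If that is the case, then you need fuzzy matching. Either remove all the whitespace, or do some very creative whitespace insertion. But your new question is harder, so please be patient!

Yes, but I guess I was not clear in my original background statement. I can adjust the question above to remove the whitespace issues. Thanks!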
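One way to approach the whitespace problem raised in these comments, short of full fuzzy matching, is to build a pattern from the pieces of each name and allow any run of commas or spaces between them. This is only a sketch under assumptions: names are non-empty, their pieces are separated by commas or whitespace, and the sample rows below are made up to show irregular spacing. The helpers `name_pattern` and `block_name` are hypothetical.

```python
import re
import pandas as pd

def name_pattern(name):
    # Split the name on commas/whitespace and escape each piece as literal text
    tokens = [re.escape(t) for t in re.split(r'[,\s]+', name.strip()) if t]
    # Allow any run of commas/whitespace between pieces; the lookarounds
    # keep the match from starting or ending inside a longer word
    return r'(?<!\w)' + r'[,\s]+'.join(tokens) + r'(?!\w)'

def block_name(text, name):
    return re.sub(name_pattern(name), '**BLOCK**', text, flags=re.IGNORECASE)

# Hypothetical rows with irregular spacing, invented for illustration
df = pd.DataFrame({
    'Text': ['But now Smith, J  J is Here from Smithsville',
             'Maryland is HYDER ,A MARY Found here ',
             'The tuckered was Tucker,Tom is Not here but'],
    'P_Name': ['SMITH,J J', 'HYDER,A MARY', 'TUCKER,TOM T'],
})

df['New_Text'] = [block_name(t, n) for t, n in zip(df['Text'], df['P_Name'])]
print(df['New_Text'].tolist())
```

This tolerates extra or missing spaces around the comma, but it does not handle a missing name part: the last row stays unchanged because the trailing "T" of "TUCKER,TOM T" never appears in the text. Cases like that would need a looser rule or genuine fuzzy matching.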
