Extract substring from text in a pandas DataFrame as new column

Posted 2023-03-12

[Title] Extract substring from text in a pandas DataFrame as new column
[Posted] 2018-04-05 21:31:19
[Question]:

I have a list of "words" that I want to look for, shown below:

word_list = ['one','three']

I also have a column in a pandas DataFrame that contains the text below.

TEXT                                       |
-------------------------------------------|
"Perhaps she'll be the one for me."        |
"Is it two or one?"                        |
"Mayhaps it be three afterall..."          |
"Three times and it's a charm."            |
"One fish, two fish, red fish, blue fish." |
"There's only one cat in the hat."         |
"One does not simply code into pandas."    |
"Two nights later..."                      |
"Quoth the Raven... nevermore."            |

The desired output is shown below. It keeps the original TEXT column and extracts only the words from word_list into a new column.

TEXT                                       | EXTRACT
-------------------------------------------|---------------
"Perhaps she'll be the one for me."        | one
"Is it two or one?"                        | one
"Mayhaps it be three afterall..."          | three
"Three times and it's a charm."            | three
"One fish, two fish, red fish, blue fish." | one
"There's only one cat in the hat."         | one
"One does not simply code into pandas."    | one
"Two nights later..."                      | 
"Quoth the Raven... nevermore."            |

Is there a way to do this in Python 2.7?

[Comments on the question]:

[Answer 1]:

Use str.extract:

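import re

# Build an alternation such as '(one|three)'; str.extract returns what the group captures.
pattern = '({})'.format('|'.join(word_list))
df['EXTRACT'] = df.TEXT.str.extract(pattern, flags=re.IGNORECASE, expand=False).str.lower().fillna('')
df['EXTRACT']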

0      one
1      one
2    three
3    three
4      one
5      one
6      one
7         
8         
Name: EXTRACT, dtype: object

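The words in word_list are joined with the regex alternation operator |, wrapped in a capture group, and passed to str.extract for regex pattern matching. str.extract returns the first match in each row, or NaN when nothing matches.

The re.IGNORECASE flag makes the comparison case-insensitive. The matches are then lower-cased to agree with your expected output, and fillna('') turns rows without a match into empty strings.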
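The pattern above would also match "one" inside a longer word such as "someone". A minimal sketch of a stricter variant follows, assuming whole-word matching is wanted; the `\b` word boundaries and the `re.escape` call are additions that are not part of the original answer, the sample rows are shortened illustrations, and the code is written for Python 3 with a recent pandas.

```python
import re
import pandas as pd

word_list = ['one', 'three']
df = pd.DataFrame({'TEXT': [
    "Perhaps she'll be the one for me.",
    "Three times and it's a charm.",
    "Someone left two nights later...",  # illustrative row, not from the question
]})

# re.escape guards against regex metacharacters in the words;
# \b restricts matches to whole words (an assumption about the desired behaviour).
pattern = r'\b(' + '|'.join(re.escape(w) for w in word_list) + r')\b'

df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match or NaN
    .str.lower()
    .fillna('')
)
print(df)
```

With the word boundaries in place, the third row yields an empty string because "Someone" is not a whole-word match for "one".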

[Comments]:

What about extracting multiple words from word_list?

@GurselKaracor You can look at findall or extractall.

extracted = df['TEXT'].str.findall('(' + '|'.join(word_list) + ')', flags=re.IGNORECASE)
df['EXTRACT'] = extracted.str.join(',')
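The findall suggestion can be tried end to end with the sketch below. It assumes Python 3 and a recent pandas; the sample rows are illustrative, and the lower-casing step is an addition carried over from the answer rather than part of the comment.

```python
import re
import pandas as pd

word_list = ['one', 'three']
df = pd.DataFrame({'TEXT': [
    "One fish, two fish, three fish.",  # illustrative row with two hits
    "Is it two or one?",
    "Two nights later...",
]})

pattern = '(' + '|'.join(word_list) + ')'

# str.findall returns a list of every match per row (an empty list if none).
matches = df['TEXT'].str.findall(pattern, flags=re.IGNORECASE)

# Lower-case each hit and join the list into one comma-separated string.
df['EXTRACT'] = matches.apply(lambda found: ','.join(w.lower() for w in found))
print(df)
```

Unlike str.extract, which stops at the first match, this keeps every occurrence in the order it appears in the text.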

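I get the following warning: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. See the caveats in the documentation."

@Z.LI You only run into that warning if df was created in a certain way. See my post on the topic for a clearer understanding: ***.com/questions/20625582/…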
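One common way this warning arises is when df is itself a filtered subset of another DataFrame. The sketch below assumes that situation, which may not be the commenter's actual case; the `full` frame and its `KEEP` column are hypothetical names used only for illustration. Taking an explicit `.copy()` before adding the column is one way to make the intent unambiguous.

```python
import re
import pandas as pd

word_list = ['one', 'three']
full = pd.DataFrame({  # hypothetical parent frame
    'TEXT': [
        "Is it two or one?",
        "Two nights later...",
        "Mayhaps it be three afterall...",
    ],
    'KEEP': [True, False, True],
})

# An explicit copy makes `df` an independent frame, so adding a column
# to it cannot be mistaken for an attempt to modify `full`.
df = full[full['KEEP']].copy()

pattern = '({})'.format('|'.join(word_list))
df['EXTRACT'] = (
    df['TEXT']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()
    .fillna('')
)
print(df)
```

Whether the warning is shown at all depends on the pandas version and its copy-on-write settings, so treat this as a defensive pattern rather than a required fix.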
