How to improve the speed of list comprehension on a pandas dataframe

Posted 2023-03-11


【Title】: How to improve the speed of list comprehension on a pandas dataframe 【Posted】: 2021-07-13 18:42:16 【Question】:

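Is there a faster way to filter items against a set than a list comprehension? The list comprehension is somewhat slow on large datasets.

I have already converted list_stopwords to a set, which takes less time than using a list.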

             date      description
0        2018-07-18    payment receipt
1        2018-07-18    ogsg s.u.b.e.b june 2018 salar
2        2018-07-18    sal admin charge
3        2018-07-19    sms alert charge outstanding
4        2018-07-19    vat onverve*issuance 


import stop_words  # the stop-words package from PyPI

list_stopwords = set(stop_words.get_stop_words('en'))  # set: fast membership tests

data['description'] = data['description'].apply(lambda x: " ".join([word for word in x.split() if word not in list_stopwords]))
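For reference, the filtering step itself can be tried without pandas. The following is only a minimal sketch: `SAMPLE_STOPWORDS` is a hypothetical stand-in for the set returned by `stop_words.get_stop_words('en')`, `remove_stopwords` is a helper name introduced here for illustration, and the third description is a made-up row that actually contains stop words.

```python
# Hypothetical small stop-word set, standing in for
# set(stop_words.get_stop_words('en')) from the question.
SAMPLE_STOPWORDS = frozenset({"a", "an", "the", "of", "on", "in"})


def remove_stopwords(text, stopwords=SAMPLE_STOPWORDS):
    """Drop whitespace-separated tokens that appear in `stopwords`."""
    return " ".join(word for word in text.split() if word not in stopwords)


# The first two rows come from the sample data; the third is invented
# so that there is something to remove.
descriptions = ["payment receipt", "sal admin charge", "payment of the charge"]

cleaned = [remove_stopwords(d) for d in descriptions]
print(cleaned)
```

With a DataFrame, the same helper could be used as `data['description'] = [remove_stopwords(d) for d in data['description']]`, which iterates over the column values directly instead of going through `apply`. Whether that is measurably faster depends on the data, so it is worth timing on the real dataset.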

【Comments】:

【Answer 1】:

A regular expression may be faster.

First, build a regular expression that matches the stop words:


import re

list_stopwords = set(stop_words.get_stop_words('en'))
# Alternation of escaped stop words, bounded by \b so only whole words match
re_stopwords = r"\b(?:" + "|".join(re.escape(word) for word in list_stopwords) + r")\b"

Now apply it to the column:

data['description'] =  data['description'].apply(lambda x: re.sub(re_stopwords,'',x))

This replaces every stop word with '' (the empty string). The spaces around each removed word are left in place.
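Because substituting the empty string leaves the surrounding spaces behind, the result can contain runs of spaces where the split-and-join version would not. A minimal sketch of one way to tidy that up, assuming a whole-word alternation pattern and a hypothetical `SAMPLE_STOPWORDS` set in place of the real stop-word list (`strip_stopwords` is a name introduced here for illustration):

```python
import re

# Hypothetical stop words; the real list would come from the stop_words package.
SAMPLE_STOPWORDS = {"of", "the", "on"}

# One compiled pattern: \b(?:of|on|the)\b, with each word escaped.
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(SAMPLE_STOPWORDS)) + r")\b"
)


def strip_stopwords(text):
    """Remove stop words, then collapse the leftover whitespace."""
    without_stopwords = PATTERN.sub("", text)
    # split() with no arguments discards empty pieces, so joining again
    # leaves exactly one space between the remaining words.
    return " ".join(without_stopwords.split())


print(strip_stopwords("payment of the charge"))
```

Compiling the pattern once avoids looking it up again for every row when the function is passed to `apply`.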

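I believe it is faster because the regular expression operates directly on the string, whereas your code loops over every word produced by the split.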
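The speed claim is easy to check rather than take on faith. The sketch below times both approaches with `timeit` on synthetic rows; the stop-word set, the row contents, and the function names are all assumptions made for illustration, and the relative timings will vary with the size of the stop-word list and the length of the strings.

```python
import re
import timeit

# Hypothetical stop words and synthetic rows, for illustration only.
STOPWORDS = {"of", "the", "on", "a"}
PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(STOPWORDS)) + r")\b"
)
ROWS = ["sms alert charge outstanding", "payment of the charge on a loan"] * 5000


def with_split(rows):
    """Split each row and keep the words that are not stop words."""
    return [" ".join(w for w in r.split() if w not in STOPWORDS) for r in rows]


def with_regex(rows):
    """Remove stop words with one regex pass, then normalise whitespace."""
    return [" ".join(PATTERN.sub("", r).split()) for r in rows]


# On these rows both approaches should agree before their speed is compared.
assert with_split(ROWS) == with_regex(ROWS)

for fn in (with_split, with_regex):
    seconds = timeit.timeit(lambda: fn(ROWS), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

The two functions agree on space-separated words like these, but they can differ on text with punctuation, since `\b` also matches next to characters such as `.` or `*`.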

To learn more about the regex library, see w3schools. For more on the \b expression, see regular-expressions.

【Discussion】:
