Fastest way to apply multiple regular expressions sequentially for text cleaning in PySpark

Posted 2023-04-14

[Asked] 2019-11-02 22:17:51

[Question]:

I have a column that I want to clean with a large number of regular expressions, and I want to apply them sequentially.

Even in pandas this is a time-consuming process, but there I can at least get away with applying it as a function.

As a concrete example:

import pandas as pd
import re


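# Each rule is (pattern, replacement, flags); rules are applied in list order.
# Both patterns must be raw strings, otherwise '\b' is a backspace character
# rather than a word boundary.
regex_tuples_list = [(r'\bMR\b', 'middle right', re.I),
                     (r'\bmiddle right area\b', 'center', re.I),
                    ]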

def apply_regex(text):
    # Apply every rule in order; each rule sees the output of the previous one.
    for (to_repl, value, re_flags) in regex_tuples_list:
        to_repl_compiled = re.compile(to_repl, re_flags)
        text = to_repl_compiled.sub(value, text)
    return text

s = pd.Series(['Install the part in the MR', 
               'Check the MR area before using the tool', 
               'Always begin from the middle right area',
              ])

print(s.apply(apply_regex))

## Prints...
#      Install the part in the middle right
#    Check the center before using the tool
#              Always begin from the center
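
For comparison, here is a minimal pure-Python sketch of the same sequential cleaning with the patterns compiled once up front instead of inside the loop. The names `COMPILED_RULES` and `apply_compiled` are hypothetical and not part of the question. Python's `re` module already caches recently compiled patterns, so the speed-up from this alone may be modest.

```python
import re

# Compile each rule once; entries are (compiled pattern, replacement).
# Order matters: the second rule depends on the output of the first.
COMPILED_RULES = [
    (re.compile(r'\bMR\b', re.I), 'middle right'),
    (re.compile(r'\bmiddle right area\b', re.I), 'center'),
]


def apply_compiled(text):
    """Apply every rule in list order, feeding each result into the next rule."""
    for pattern, replacement in COMPILED_RULES:
        text = pattern.sub(replacement, text)
    return text


samples = [
    'Install the part in the MR',
    'Check the MR area before using the tool',
    'Always begin from the middle right area',
]
cleaned = [apply_compiled(t) for t in samples]
```

With pandas, the same function can be passed to `s.apply(apply_compiled)` exactly as in the example above.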

What is the best way to do this with PySpark?

[Answer 1]:

First, your pandas code can run on Databricks as it is, without any changes (the original answer showed this in a screenshot).

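Next, since you want better performance for the same task, I think you can try the koalas package for PySpark, which gives you a pandas-like API. Installing the koalas package on a Databricks cluster is very simple (the original answer showed the installation in a screenshot).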

Note: at the time of writing, the latest version of the koalas package requires spark-2.4.4, so you must select Spark 2.4.4 when creating the cluster.

Finally, you only need to use import databricks.koalas as ks instead of import pandas as pd; the same code then runs on PySpark with no further changes, which should give better performance.

Don't worry about the user warning about calling a deprecated function; see the SO thread UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings for details.
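
If Koalas is not an option, one possible alternative is to stay in native Spark column expressions and chain one `regexp_replace` call per rule, which avoids running a Python function on every row. The following is only a sketch under stated assumptions: the helper names `to_java_pattern` and `clean_column` and the column name `text` are hypothetical, and only the `re.I` flag is handled. Spark's `regexp_replace` uses Java regular expressions and takes no flags argument, so case-insensitivity is emulated with an inline `(?i)` prefix.

```python
import re
from functools import reduce

# Same rules as in the question: (pattern, replacement, Python re flags).
regex_tuples_list = [
    (r'\bMR\b', 'middle right', re.I),
    (r'\bmiddle right area\b', 'center', re.I),
]


def to_java_pattern(pattern, flags=0):
    """Emulate re.I with an inline (?i) prefix; other flags are ignored (assumption)."""
    return '(?i)' + pattern if flags & re.I else pattern


def clean_column(df, col_name):
    """Return df with col_name rewritten by every rule, applied in list order."""
    # Imported lazily so that to_java_pattern can be used without Spark installed.
    from pyspark.sql import functions as F

    # Each step wraps the previous column expression in one more regexp_replace.
    # Note: '$' and '\' in a replacement have special meaning in Java regex.
    cleaned = reduce(
        lambda col, rule: F.regexp_replace(
            col, to_java_pattern(rule[0], rule[2]), rule[1]
        ),
        regex_tuples_list,
        F.col(col_name),
    )
    return df.withColumn(col_name, cleaned)


# Usage sketch, assuming an active SparkSession named `spark`:
# df = spark.createDataFrame([('Check the MR area before using the tool',)], ['text'])
# clean_column(df, 'text').show(truncate=False)
```

Because all rules end up in a single column expression, Spark can evaluate them inside the JVM in one pass over the data. Whether this is actually faster than the Koalas route depends on the data and the cluster, so it is worth measuring both.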
