PySpark regex to extract string with two conditions

Posted 2023-04-15


[Title]: PySpark regex to extract string with two conditions [Posted]: 2021-12-13 06:15:06 [Question]:

I have a dataframe that looks like this:

id  col1
1   ACC 12-34-11-123-122-A
2   ACC TASKS 12-34-11-123-122-B
3   ABB 12-34-11-123-122-C

I want to extract the codes from the first and second rows, the ones preceded by ACC (12-34-11-123-122-A, 12-34-11-123-122-B).

I found this answer, and here is my attempt:

F.regexp_extract(F.col("col_1"), r'(.)(ACC)(\s+)(\b\d{2}\-\d{2}\-\d{2}\-\d{3}\-[A-Z0-9]{0,3}\b)', 4)

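I had to add the second group (ACC) because the ABB codes have the same format.

How can I fix my regex so that it extracts the codes for both ACC and ACC TASKS from this dataframe?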


[Answer 1]:

With your shown samples, please try the following regex.

^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)


Explanation: a detailed breakdown of the regex above.

^ACC(?:\s+TASKS)?\s+        ##Match ACC at the start of the value, optionally followed by 1 or more spaces and TASKS
                            ##(an optional non-capturing group), then 1 or more spaces.
\d{2}(?:-\d{2}){2}-\d{3}-   ##Match 2 digits, then a non-capturing group that matches 2 occurrences of - followed by 2 digits;
                            ##the group is followed by -, 3 digits, and another -.
[A-Z0-9]{0,3}(?=-[A-Z]$)    ##Match 0 to 3 characters from A-Z or 0-9, then assert that what follows is
                            ##a dash and one capital letter A-Z at the end of the line/value.
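To see what this pattern returns on the sample rows, here is a minimal sketch using Python's `re` module instead of Spark. It assumes the constructs used here behave the same way in Spark's Java regex engine; the names `PATTERN`, `samples`, and `whole_match` are illustrative and not part of the answer.

```python
import re

# Pattern from the answer above, with the {n} quantifiers written out.
PATTERN = re.compile(
    r"^ACC(?:\s+TASKS)?\s+\d{2}(?:-\d{2}){2}-\d{3}-[A-Z0-9]{0,3}(?=-[A-Z]$)"
)

# Sample values copied from the question's dataframe.
samples = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]


def whole_match(value: str) -> str:
    """Return the full match, or "" when the pattern does not match."""
    m = PATTERN.search(value)
    return m.group(0) if m else ""


results = [whole_match(s) for s in samples]
print(results)
# ['ACC 12-34-11-123-122', 'ACC TASKS 12-34-11-123-122', '']
```

Because the pattern has no capturing group, the whole match still contains the ACC / ACC TASKS prefix, and the lookahead leaves the trailing letter out of the match. In PySpark the same pattern string would be passed to `F.regexp_extract` with group index 0, which returns the whole match.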


[Answer 2]:

You can use this regex:

(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})


Here (\bACC(?:\s+TASKS)?) matches ACC or ACC TASKS before the code pattern.

For your Python code:

F.regexp_extract(F.col("col_1"), r'(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})', 2)
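The pattern can be checked against the sample rows with Python's `re` module. This is a rough sketch rather than the answerer's code: `extract` is a hypothetical helper that mimics `regexp_extract` by returning an empty string when nothing matches, and `WITH_SUFFIX` is an assumed variant (not from the answer) that also captures the trailing letter shown in the question's expected output.

```python
import re

# Pattern from the answer above; the code is in capturing group 2.
ANSWER_PATTERN = r"(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3})"

# Assumed variant, not from the answer: also capture the final "-<letter>".
WITH_SUFFIX = r"(\bACC(?:\s+TASKS)?)\s+(\d{2}-\d{2}-\d{2}-\d{3}-[A-Z0-9]{0,3}-[A-Z])\b"

# Sample values copied from the question's dataframe.
samples = [
    "ACC 12-34-11-123-122-A",
    "ACC TASKS 12-34-11-123-122-B",
    "ABB 12-34-11-123-122-C",
]


def extract(pattern: str, value: str, idx: int) -> str:
    """Return capturing group idx of the first match, or "" if there is none."""
    m = re.search(pattern, value)
    return m.group(idx) if m else ""


codes = [extract(ANSWER_PATTERN, s, 2) for s in samples]
codes_with_suffix = [extract(WITH_SUFFIX, s, 2) for s in samples]

print(codes)              # ['12-34-11-123-122', '12-34-11-123-122', '']
print(codes_with_suffix)  # ['12-34-11-123-122-A', '12-34-11-123-122-B', '']
```

The ABB row yields an empty string in both cases because the pattern requires ACC before the code. The pattern contains two capturing groups, so the code is read from group 2.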

[Comments]:

I was going to change my third group (\s+) to (.*), but your answer is better.
