REGEXP_REPLACE for spark.sql()

Posted 2023-03-28

【Title】: REGEXP_REPLACE for spark.sql() 【Posted】: 2021-06-07 06:49:45 【Question】:

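I need to write a REGEXP_REPLACE query for a spark.sql() job. If the value follows the pattern below, only the word before the first hyphen should be extracted and assigned to the target column "name". If the pattern does not match, the whole "name" should be reported unchanged.

Pattern:

The value should be separated by hyphens. Anything may appear before the first hyphen (digits, letters, special characters, even spaces). The first hyphen should be followed immediately by two words separated by a hyphen; each word may contain only digits, letters, or a mix of both (note: special characters and spaces are not allowed). The two words should be followed by a hyphen and one or more digits, then another hyphen. The last part must be one or more digits.

For example:

If name = abc45-dsg5-gfdvh6-9890-7685, then the output of REGEXP_REPLACE = abc45

If name = abc, then the output of REGEXP_REPLACE = abc

If name = abc-gf5-dfg5-asd5-98-00, then the output of REGEXP_REPLACE = abc-gf5-dfg5-asd5-98-00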
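To make the rule above concrete, here is a minimal sketch in plain Python that checks the parts one by one without any regex. It is only one reading of the description: the function name `extract_name` and its helpers are hypothetical, and it assumes "letters and digits" means ASCII letters and digits and that the part before the first hyphen may be empty.

```python
def extract_name(name: str) -> str:
    """Return the text before the first hyphen if `name` fits the pattern, else `name`."""
    # The pattern has exactly five hyphen-separated parts:
    # <anything> - <word> - <word> - <digits> - <digits>
    parts = name.split("-")
    if len(parts) != 5:
        return name
    first, word1, word2, num1, num2 = parts

    def is_word(s: str) -> bool:
        # Non-empty, ASCII letters and digits only (assumption: ASCII only).
        return s.isascii() and s.isalnum()

    def is_digits(s: str) -> bool:
        # Non-empty, ASCII digits only.
        return s.isascii() and s.isdigit()

    if is_word(word1) and is_word(word2) and is_digits(num1) and is_digits(num2):
        return first
    return name


for sample in ["abc45-dsg5-gfdvh6-9890-7685", "abc", "abc-gf5-dfg5-asd5-98-00"]:
    print(sample, "->", extract_name(sample))
```

The three sample values are the ones from the examples above; any regex used in the query should agree with this function on them.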

I have

spark.sql("SELECT REGEXP_REPLACE(name , '-[^-]+-\\w2-\\d+-\\d+$','',1,1,'i')  AS name").show();

but it doesn't work.

【Comments】:

Try using regexp_extract with the pattern ^(?:[^-\s]+(?=-\w+-\w+\d+-\d+-\d+$)|\S+) to match the required string. See regex101.com/r/BeCTRF/1

【Answer 1】:

Use

^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$

See proof. Replace with $1. If $1 does not work, use \1. If \1 does not work, use \\1.

Explanation

--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [^-]*                    any character except: '-' (0 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
--------------------------------------------------------------------------------
  (                        group and capture to \2 (2 times):
--------------------------------------------------------------------------------
    -                        '-'
--------------------------------------------------------------------------------
    [a-zA-Z0-9]+             any character of: 'a' to 'z', 'A' to
                             'Z', '0' to '9' (1 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
  ){2}                     end of \2 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \2)
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  -                        '-'
--------------------------------------------------------------------------------
  [0-9]+                   any character of: '0' to '9' (1 or more
                           times (matching the most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
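
As a quick check of the pattern explained above, the sketch below runs it against the three sample values from the question using Python's `re` module. This is not the Spark query itself: the helper name `first_word_if_match` is hypothetical, and Python writes the group reference as `\1`, whereas Spark's regex functions are built on Java regular expressions, where the replacement is written `$1` as the answer says.

```python
import re

# The pattern from the answer: group 1 captures everything before the first hyphen.
PATTERN = re.compile(r"^([^-]*)(-[a-zA-Z0-9]+){2}-[0-9]+-[0-9]+$")


def first_word_if_match(name: str) -> str:
    # When the whole string matches, it is replaced by group 1;
    # otherwise re.sub leaves the input unchanged.
    return PATTERN.sub(r"\1", name)


for sample in ["abc45-dsg5-gfdvh6-9890-7685", "abc", "abc-gf5-dfg5-asd5-98-00"]:
    print(sample, "->", first_word_if_match(sample))
```

Because the pattern is anchored with ^ and $, a value that does not fit the whole shape is returned as-is, which is the "report the whole name" behaviour the question asks for.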

【Discussion】:

This looks like a better solution, +1
