Extract values from spark dataframe column into new derived column

Posted 2023-04-15

【Posted】: 2020-10-30 03:52:23 【Question】:

I have a dataframe with the following schema:

        root
         |-- SOURCE: string (nullable = true)
         |-- SYSTEM_NAME: string (nullable = true)
         |-- BUCKET_NAME: string (nullable = true)
         |-- LOCATION: string (nullable = true)
         |-- FILE_NAME: string (nullable = true)
         |-- LAST_MOD_DATE: string (nullable = true)
         |-- FILE_SIZE: string (nullable = true)

I want to derive a new column by extracting values from some of the existing columns. The data in the LOCATION column looks like this:

example 1: prod/docs/Folder1/AA160039/Folder2/XXX.pdf
example 2: prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx

Question 1: I want to derive a new column named "folder_num" and extract the following into it:

1. The 2 letters followed by 6 digits between slashes. Output is "AA160039". This mask never changes: it is always 2 letters followed by 6 digits.
2. Digits only, if they sit between slashes. Output is "355" from the example above. The number can be a single digit such as "8", two digits "55", three "444", up to 5 digits "12345". As long as the digits are between slashes, they need to be extracted into the new column.

How can I achieve this in Spark? I am new to this technology, so any help is much appreciated.

df1 = df0.withColumn("LOCATION", trim(col('LOCATION')))
# pseudocode for what I am after:
# if LOCATION like '%/[A-Z]{2}[0-9]{6}/%' -- extract value and add to new derived column
# if LOCATION like '%/[0-9]{1,5}/%'       -- extract value and add to new derived column

Thanks for your help.

Added code:

from pyspark.sql.functions import col, lit, regexp_extract, trim, when

df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
         .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
         .withColumn("LOCATION", trim(col("LOCATION")))\
         .withColumn("FOLDER_NUM", when(regexp_extract(col("FILE_NAME"), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""),
                                        regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                   .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1)))



+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|SOURCE|SYSTEM_NAME|BUCKET_NAME|            LOCATION|          FILE_NAME|      LAST_MOD_DATE|FILE_SIZE|FOLDER_NUM|
+------+-----------+-----------+--------------------+-------------------+-------------------+---------+----------+
|    s3|        xxx|    bucket1|production/Notifi...|AA120068_Letter.pdf|2020-07-20 15:51:21|    13124|          |
|    s3|        xxx|    bucket1|production/Notifi...|ZZ120093_Letter.pdf|2020-07-20 15:51:21|    61290|          |
|    s3|        xxx|    bucket1|production/Notifi...|XC120101_Letter.pdf|2020-07-20 15:51:21|    61700|          |

【Question discussion】:

【Answer 1】:

Well, you are on the right track:

from pyspark.sql.functions import regexp_extract, trim

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([("ex@mple trimed",)], 'old_column string')

# trim the column, then keep capture group 1 of the regex match
df.withColumn('new_column', regexp_extract(trim('old_column'), '(e.*@)', 1)).show()

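This trims the column and extracts group 1 of the pattern matched by the regular expression.

【Discussion】:

【Answer 2】:

You can use regexp_extract together with when. See the sample Scala Spark code below.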

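import org.apache.spark.sql.functions.{col, lit, regexp_extract, when}

df.withColumn("folder_num",
  // use the 6 digits of the 2-letter + 6-digit code when that pattern is present ...
  when(regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1) =!= lit(""),
    regexp_extract(col("LOCATION"), ".*/[A-Z]{2}([0-9]{6})/.*", 1))
    // ... otherwise fall back to a 1-5 digit folder name
    .otherwise(regexp_extract(col("LOCATION"), ".*/([0-9]{1,5})/.*", 1))
).show(false)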

+------------------------------------------------------+----------+
|LOCATION                                              |folder_num|
+------------------------------------------------------+----------+
|prod/docs/Folder1/AA160039/Folder2/XXX.pdf            |160039    |
|prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx|355       |
+------------------------------------------------------+----------+

If you need the output for the first row to be AA160039, just move the capture group in the regular expression as follows.

regexp_extract(col("LOCATION"), ".*/([A-Z]{2}[0-9]{6})/.*", 1)
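
The capture-group placement and the empty-string fallback can be tried outside Spark. The following is only a rough sketch in plain Python using the `re` module, not Spark code: `extract_or_empty` and `folder_num` are hypothetical helper names, and the sketch assumes these simple patterns behave the same in Python's regex engine as in the Java regex engine that Spark uses. In PySpark itself, the Column "not equal" operator is `!=`.

```python
import re

# Hypothetical helper: mimics regexp_extract returning "" when nothing matches.
def extract_or_empty(text, pattern, group=1):
    match = re.search(pattern, text)
    return match.group(group) if match else ""

CODE = r".*/([A-Z]{2}[0-9]{6})/.*"         # group 1 = whole code, e.g. AA160039
CODE_DIGITS = r".*/[A-Z]{2}([0-9]{6})/.*"  # group 1 = digits only, e.g. 160039
NUMERIC = r".*/([0-9]{1,5})/.*"            # group 1 = 1-5 digit folder name

def folder_num(location, code_pattern=CODE):
    # "when": use the first pattern if it produced a non-empty value ...
    first = extract_or_empty(location, code_pattern)
    if first != "":
        return first
    # ... "otherwise": fall back to the second pattern (may also be "")
    return extract_or_empty(location, NUMERIC)

paths = [
    "prod/docs/Folder1/AA160039/Folder2/XXX.pdf",
    "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx",
]
results = [folder_num(p) for p in paths]
digits_only = folder_num(paths[0], CODE_DIGITS)
print(results, digits_only)
```

Moving the parentheses is the only difference between `CODE` and `CODE_DIGITS`; the text that must match is identical, only the part that is returned changes.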

【Discussion】:

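What does "=!=" mean in this case? Or is it a typo?

@AJR, "=!=" is the "not equal" operator for columns in Scala Spark. In Python you can replace it with the appropriate column "not equal" operator. Basically, when your first regex does not match the pattern, regexp_extract gives a column with an empty string. We just check whether it gave a non-empty string and, if so, use it; otherwise we use the next regex.

Thanks @SD3. Another question, so that I understand your code... why do you check whether the value of the first expression is "not equal" to the empty string, and not the second?

@AJR, you have two expressions to match, and either the first or the second will be present (I assume). If both could be present, one has to take priority (I assume the first). So this is an if/else: if the first expression matches, extract it; otherwise (the first returned an empty string) match the second expression and extract that. If the second is not found either, we get an empty string column, because there is no third expression to check.

Sorry to bother you again @SD3... but my dataframe shows nothing in folder num; could you take a look?

【Answer 3】: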

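The information provided was really helpful. Thank you all for putting me on the right track. The final version of the code is below.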

from pyspark.sql.functions import col, lit, regexp_extract, trim, when

# priority: code in FILE_NAME, then code in LOCATION, then 1-5 digit folder in LOCATION
df1 = df0.withColumn("LAST_MOD_DATE", col("LAST_MOD_DATE").cast("timestamp"))\
         .withColumn("FILE_SIZE", col("FILE_SIZE").cast("integer"))\
         .withColumn("LOCATION", trim(col("LOCATION")))\
         .withColumn("FOLDER_NUM", when(regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1) != lit(""), regexp_extract(trim(col("FILE_NAME")), "([A-Z]{2}[0-9]{6}).*", 1))
                                   .when(regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1) != lit(""), regexp_extract(trim(col("LOCATION")), ".*/([A-Z]{2}[0-9]{6})/.*", 1))
                                   .when(regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1) != lit(""), regexp_extract(trim(col("LOCATION")), ".*/([0-9]{1,5})/.*", 1))
                                   .otherwise("Unknown"))
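
The chained when calls above act as an ordered list of rules where the first non-empty extraction wins. As a quick way to check that ordering without a cluster, here is a minimal sketch in plain Python (`re` module, not Spark). `RULES`, `folder_num_for`, and all sample rows are hypothetical and made up for illustration; only the patterns and the "Unknown" default come from the code above.

```python
import re

# (column, pattern) pairs in priority order, mirroring the chained when() calls.
RULES = [
    ("FILE_NAME", r"([A-Z]{2}[0-9]{6}).*"),
    ("LOCATION", r".*/([A-Z]{2}[0-9]{6})/.*"),
    ("LOCATION", r".*/([0-9]{1,5})/.*"),
]

def folder_num_for(row, default="Unknown"):
    # Return the first non-empty capture, or the default if no rule matches.
    for column, pattern in RULES:
        match = re.search(pattern, row[column].strip())
        if match and match.group(1) != "":
            return match.group(1)
    return default

# Hypothetical sample rows, not taken from the real dataset.
rows = [
    {"FILE_NAME": "AA120068_Letter.pdf", "LOCATION": "prod/docs/Folder1/Folder2"},
    {"FILE_NAME": "XXX.pdf", "LOCATION": "prod/docs/Folder1/AA160039/Folder2/XXX.pdf"},
    {"FILE_NAME": "zzz.docx", "LOCATION": "prod/docs/Folder1/FolderX/Folder3/355/Folder2/zzz.docx"},
    {"FILE_NAME": "notes.txt", "LOCATION": "prod/docs/Folder1"},
]
folder_nums = [folder_num_for(r) for r in rows]
print(folder_nums)
```

Keeping the rules in one ordered list makes the priority explicit, which is the same decision the chained when calls encode.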

Thanks.

【Discussion】:

Shout-out to @SD3.
