PySpark 2.2中数组列的每个元素的子串

Posted 2023-04-17

技术标签:

【中文标题】PySpark 2.2中数组列的每个元素的子串【英文标题】：Substring each element of an array column in PySpark 2.2 【发布时间】：2021-09-09 03:00:56 【问题描述】：

我想在 PySpark 2.2 中对数组列的每个元素进行子串化。我的 df 看起来像下面的那个，它是类似于this，尽管我的 df 中的每个元素在连字符分隔符之前的长度相同。

+---------------------------------+----------------------+
|col1                             |new_column            |
+---------------------------------+----------------------+
|[hello-123, abcde-111]           |[hello, abcde]        |
|[hello-234, abcde-221, xyzhi-333]|[hello, abcde, xyzhi] |
|[hiiii-111, abbbb-333, xyzhu-222]|[hiiii, abbbb, xyzhu] |
+---------------------------------+----------------------+

我尝试根据this 答案调整上一个问题中的udf 以获得上面new_column 中的输出，但到目前为止还没有运气。有没有办法在 PySpark 2.2 中完成这项工作？

import pyspark.sql.functions as F
import pyspark.sql.types as T 

cust_udf = F.udf(lambda arr: [x[0:4] for x in arr], T.ArrayType(T.StringType()))
df1.withColumn('new_column', cust_udf(col("col1")))

【问题讨论】：

【参考方案1】：

您的 udf 方法对我有用。此外，您还可以将transform 与substring 一起使用：

import pyspark.sql.functions as f

df.withColumn('new_column', f.expr('transform(col1, x -> substring(x, 0, 5))')).show()

+--------------------+--------------------+
|                col1|          new_column|
+--------------------+--------------------+
|[hello-123, abcde...|      [hello, abcde]|
|[hello-234, abcde...|[hello, abcde, xy...|
|[hiiii-111, abbbb...|[hiiii, abbbb, xy...|
+--------------------+--------------------+

【讨论】：

使用我的 udf 时出现错误“作业因阶段失败而中止”。你使用的是 Pyspark 2.2.0 之后的版本吗？转换来自 Spark 2.4 嗯，很有趣，2.2 可能还不支持从 udf 返回字符串数组？我使用的是 Pyspark 3+ 最终使用了不同的方法，无法修复 udf 错误。不过谢谢！【参考方案2】：

使用不同的方法解决了这个问题：分解数组，对元素进行子串化，然后收集回数组。

import pyspark.sql.functions as F
    
df1\
   .withColumn('idx', F.monotonically_increasing_id())\
   .withColumn('exploded_col', F.explode(col('col1')))\
   .withColumn('substr_col', F.substring(col('exploded_col'),1,5))\
   .groupBy(col('idx'))\
   .agg(F.collect_set('substr_col').alias('new_column'))

【讨论】：

以上是关于PySpark 2.2中数组列的每个元素的子串的主要内容，如果未能解决你的问题，请参考以下文章