How to extract an array column by selecting one field of a struct-array column in PySpark

Posted 2023-02-18

Asked: 2022-01-12 15:58:27

Question:

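I have a dataframe df containing a struct-array column properties (an array column whose elements are structs with the keys x and y), and I want to create a new array column by extracting the x values from the properties column.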

A sample input dataframe looks like this:

import pyspark.sql.functions as F
from pyspark.sql.types import *

data = [
    (1, [{'x': 11, 'y': 'str1a'}]),
    # The key 'z' is not part of the schema below; y is declared as a string
    (2, [{'x': 21, 'y': 'str2a'}, {'x': 22, 'y': 0.22, 'z': 'str2b'}]),
]
my_schema = StructType([
    StructField('id', LongType()),
    StructField('properties', ArrayType(
        StructType([
            StructField('x', LongType()),
            StructField('y', StringType()),
        ])
    )),
])

df = spark.createDataFrame(data, schema=my_schema)
df.show()
# +---+--------------------+
# | id|          properties|
# +---+--------------------+
# |  1|       [[11, str1a]]|
# |  2|[[21, str2a], [22...|
# +---+--------------------+

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)

The desired output df_new, on the other hand, should look like this:

df_new.show()
# +---+--------------------+--------+
# | id|          properties|x_values|
# +---+--------------------+--------+
# |  1|       [[11, str1a]]|    [11]|
# |  2|[[21, str2a], [22...|[21, 22]|
# +---+--------------------+--------+

df_new.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- properties: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- x: long (nullable = true)
#  |    |    |-- y: string (nullable = true)
#  |-- x_values: array (nullable = true)
#  |    |-- element: long (containsNull = true)

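Does anyone know a solution for this kind of task?

Ideally, I am looking for a solution that works row by row without relying on F.explode. In my actual database I have not yet identified a column equivalent to the id column, and after calling F.explode I am not sure how I would merge the exploded values back together.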

Answer 1:

Try selecting properties.x; this extracts the x value of every element in the properties array.

Example:

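from pyspark.sql.functions import col, expr

# Selecting a struct field through an array column returns an array of that field
df.withColumn("x_values", col("properties.x")).show(10, False)

# Or use a higher-order function (Spark 2.4+)
df.withColumn("x_values", expr("transform(properties, p -> p.x)")).show(10, False)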

#+---+-------------------------+--------+
#|id |properties               |x_values|
#+---+-------------------------+--------+
#|1  |[[11, str1a]]            |[11]    |
#|2  |[[21, str2a], [22, 0.22]]|[21, 22]|
#+---+-------------------------+--------+
