How to explode inner arrays in a struct inside a struct in pyspark/

Posted 2023-04-15


[Asked]: 2018-09-26 23:57:57 [Question]:

I am new to Spark. I am trying to explode an array that sits inside a struct. The JSON is somewhat deeply nested, as shown below.


{
    "id": 1,
    "firstfield": "abc",
    "secondfield": "zxc",
    "firststruct": {
        "secondstruct": {
            "firstarray": [{
                "firstarrayfirstfield": "asd",
                "firstarraysecondfield": "dasd",
                "secondarray": [{
                    "score": " 7 "
                }]
            }]
        }
    }
}

I am trying to access the score field under the secondarray field so that I can compute some metrics and get the average score for each id.

[Comments on the question]:

[Answer 1]:

If you are using Glue, convert the DynamicFrame to a Spark DataFrame and then use the explode function, once for each array level:

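from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, explode

# Convert the Glue DynamicFrame to a Spark DataFrame, then explode the outer
# array first and the inner array second.
scoresDf = (
    dynamicFrame.toDF()
    .withColumn("firstExplode", explode(col("firststruct.secondstruct.firstarray")))
    .withColumn("secondExplode", explode(col("firstExplode.secondarray")))
    .select("secondExplode.score")
)

# Convert the result back to a DynamicFrame for the rest of the Glue job.
scoresDyf = DynamicFrame.fromDF(scoresDf, glueContext, "scoresDyf")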

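[Discussion]:

18/09/27 13:43:20 INFO MemoryStore: MemoryStore cleared 18/09/27 13:43:20 INFO BlockManager: BlockManager stopped 18/09/27 13:43:20 INFO ShutdownHookManager: Shutdown hook called End of LogType:stderr This is what the script returns in the logs, along with the printed schema.

The code I provided does not print the schema. If you want, you can add scoresDf.printSchema(). The logs look fine.

Sorry for the confusion. I have already added scoresDf.printSchema(), and that is what prints the schema.

Does it print anything if you remove .select("secondExplode.score")?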
