Apache Spark dataframe column explode to multiple columns

Posted 2023-04-15

【Title】: Apache Spark dataframe column explode to multiple columns 【Posted】: 2018-01-16 17:30:50 【Question】:

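I am currently using Apache Spark 2.1.1 to convert XML files to CSV. My goal is to flatten the XML, but the problem I am facing is elements that can occur an unbounded number of times. Spark automatically infers these repeated elements as arrays. What I want to do now is explode an array column.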

 Sample Schema

 |-- Instrument_XREF_Identifier: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- @bsid: string (nullable = true)
 |    |    |-- @exch_code: string (nullable = true)
 |    |    |-- @id_bb_sec_num: string (nullable = true)
 |    |    |-- @market_sector: string (nullable = true)

I know I can explode the array like this:

result = result.withColumn(p.name, explode(col(p.name)))

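This produces multiple rows, one per array element, each holding a struct. But the output I want spreads the array across multiple columns instead of rows.

Based on the schema above, this is my expected output:

Assume the array contains two struct values.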

bsid1   exch_code1   id_bb_sec_num1   market_sector1   bsid2   exch_code2   id_bb_sec_num2   market_sector2
123     3            1                13               234     12           212              221

【Comments】:

How would a variable-length array map to a fixed number of columns? Please post sample input and expected output.

【Answer 1】:

Assuming Instrument_XREF_Identifier is a column of type array<struct<..>>, you have to do it in two steps:

result
.withColumn("tmp",explode(col("Instrument_XREF_Identifier")))
.select("tmp.*")

This gives you one column per struct field.

There does not seem to be a way to do this in a single select/withColumn statement; see "Explode array of structs to columns in Spark".

【Comments】:

But this still explodes into multiple rows. What I am after is for the explode to create new columns instead.
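One way to get columns rather than rows is to skip explode and index into the array by position instead. Below is a minimal PySpark sketch of that idea, not something taken from the answer above. The helper names (positional_exprs, max_array_len, explode_to_columns) are hypothetical, and it assumes the struct fields are listed by hand and that a second pass over the data to find the longest array is acceptable. As far as I know, indexing past the end of a shorter array yields null under Spark 2.x default settings, which is how a variable-length array ends up in a fixed set of columns.

```python
def positional_exprs(array_col, fields, n):
    """Build Spark SQL expressions that pull the first n structs of
    array_col into columns named <field><position>, e.g. bsid1."""
    exprs = []
    for i in range(n):
        for f in fields:
            alias = f.lstrip("@") + str(i + 1)  # "@bsid" -> "bsid1"
            # Backticks are needed because the field names start with "@".
            exprs.append(f"{array_col}[{i}].`{f}` AS {alias}")
    return exprs


def max_array_len(df, array_col):
    """Length of the longest array in the column (one extra pass over the data)."""
    n = df.selectExpr(f"max(size({array_col})) AS n").first()["n"]
    # size() may report -1 for a null array, and max() is None on empty input.
    return max(n or 0, 0)


def explode_to_columns(df, array_col, fields):
    """Keep one row per input row and spread the array across columns.
    Note: only the generated columns are selected; other columns are dropped."""
    n = max_array_len(df, array_col)
    return df.selectExpr(*positional_exprs(array_col, fields, n))
```

With the schema from the question, the call would look like explode_to_columns(result, "Instrument_XREF_Identifier", ["@bsid", "@exch_code", "@id_bb_sec_num", "@market_sector"]), which should give the bsid1 ... market_sector2 layout when the longest array has two elements. The number of output columns depends on the data, so the CSV header can differ from file to file.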
