Flattening and reading a value from a struct-type DataFrame column in Spark

Posted: 2020-03-17 12:07:21

I have a Parquet-format dataset like this:

parquetFile.toDF().registerTempTable("tempTable")
val PDataFrame = sqlContext.sql("SELECT * FROM tempTable")
PDataFrame.show()
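As an aside, registerTempTable and sqlContext have been deprecated since Spark 2.0; on a newer version the equivalent setup would look roughly like this, assuming a SparkSession named spark:

parquetFile.toDF().createOrReplaceTempView("tempTable")
val PDataFrame = spark.sql("SELECT * FROM tempTable")
PDataFrame.show()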

+--------------------+--------------------+-------------------+-----+--------+-------------------+--------------------+
|                 _id|     VehicleDetailId|             PlanID| Type| SubType|          CreatedOn|                Date|
+--------------------+--------------------+-------------------+-----+--------+-------------------+--------------------+
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,164)]|Quote|Response|5/1/2019 5:30:39 AM|[($date,155666883...|
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,168)]|Quote|Response|5/1/2019 5:30:39 AM|[($date,155666883...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,102)]|  IDV| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,105)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,112)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,134)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,114)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,115)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,113)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,185)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,108)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,149)]|Quote| Request|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,135)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,167)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,116)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,156)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,125)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,102)]|  IDV|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,144)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,171)]|Quote|Response|5/1/2019 5:30:44 AM|[($date,155666884...|
+--------------------+--------------------+-------------------+-----+--------+-------------------+--------------------+
only showing top 20 rows

The schema of this dataset is:

PDataFrame.printSchema()
root
 |-- _id: struct (nullable = true)
 |    |-- $oid: string (nullable = true)
 |-- VehicleDetailId: struct (nullable = true)
 |    |-- $numberLong: string (nullable = true)
 |-- PlanID: struct (nullable = true)
 |    |-- $numberLong: string (nullable = true)
 |-- Type: string (nullable = true)
 |-- SubType: string (nullable = true)
 |-- CreatedOn: string (nullable = true)
 |-- Date: struct (nullable = true)
 |    |-- $date: string (nullable = true)
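
For reference, nested struct fields are addressed with dot notation, and because these field names begin with $, they must be quoted with backticks inside a SQL statement, while the Scala DataFrame API needs $$ to escape the dollar sign in an interpolated column name. A minimal illustration against this schema — note that, as the answers below show, the field value here is the whole string ($numberLong,164), so this alone does not yield the bare number:

sqlContext.sql("SELECT PlanID.`$numberLong` AS PlanID FROM tempTable").show()
// equivalent in the DataFrame API:
PDataFrame.select($"PlanID.$$numberLong".as("PlanID")).show()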

I am trying to write Spark SQL code in Scala that reads the data filtered on the PlanID value in a WHERE clause; that is why I want to use a SQL query rather than the DataFrame API. This is my expected output structure (a sample view of 10 rows):

+------------------------+---------------+------+-----+--------+-------------------+-------------+
|                     _id|VehicleDetailId|PlanID| Type| SubType|          CreatedOn|         Date|
+------------------------+---------------+------+-----+--------+-------------------+-------------+
|5ae7ae00b07ccf35c020e5ba|       10220998|   135|Quote|Response|5/1/2018 5:30:00 AM|1525132800096|
|5ae7ae00b07ccf35c020e5bb|       10220998|   134|Quote|Response|5/1/2018 5:30:00 AM|1525132800139|
|5ae7ae00b07ccf35c020e5bc|       10220998|   104|Quote|Response|5/1/2018 5:30:00 AM|1525132800516|
|5ae7ae00b07ccf35c020e5bd|       10220998|   104|Quote|Response|5/1/2018 5:30:00 AM|1525132800519|
|5ae7ae00b07ccf35c020e5be|       10220998|   101|Quote|Response|5/1/2018 5:30:00 AM|1525132800539|
|5ae7ae00b07ccf35c020e5bf|       10220998|   103|  IDV| Request|5/1/2018 5:30:00 AM|1525132800546|
|5ae7ae00b07ccf35c020e5c0|       10220998|   105|Quote|Response|5/1/2018 5:30:00 AM|1525132800577|
|5ae7ae00b07ccf35c020e5c1|       10220998|   103|  IDV| Request|5/1/2018 5:30:00 AM|1525132800581|
|5ae7ae00b07ccf35c020e5c2|       10220998|   103|  IDV|Response|5/1/2018 5:30:00 AM|1525132800702|
|5ae7ae00b07ccf35c020e5c3|       10220998|   128|Quote|Response|5/1/2018 5:30:00 AM|1525132800709|
+------------------------+---------------+------+-----+--------+-------------------+-------------+

So far I have tried various approaches to get the expected output, such as:

PDataFrame.withColumn("first", $"PlanID.$$numberLong").show

sqlContext.sql(s""" select _id["$$oid"] as col1, PlanID["$$numberLong"] as col2 from tempTable """)

Unfortunately, I have not been able to produce the expected output. Any help would be greatly appreciated.


Answer 1:

Looking at your DataFrame schema,

 |-- PlanID: struct (nullable = true)
 |    |-- $numberLong: string (nullable = true)

$"PlanID.$$numberLong" 的值为($numberLong,164),这是一个字符串。所以你必须拆分并选择你想要的。

import org.apache.spark.sql.functions.split
PDataFrame.withColumn("first", split($"PlanID.$$numberLong", ",")(1)).show
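
As the comment below notes, splitting on "," keeps the trailing parenthesis, yielding 164) rather than 164. A variant using regexp_extract — not part of the original answer, just a sketch of an alternative — would capture only the digits:

import org.apache.spark.sql.functions.regexp_extract

// Capture the digits between the comma and the closing parenthesis of "($numberLong,164)".
PDataFrame.withColumn("first", regexp_extract($"PlanID.$$numberLong", ",(\\d+)\\)", 1)).show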

Discussion:

Thanks @Lamanus. There is still one issue: I get the column value as 164), so I followed a trim approach instead. Also, I could not drop the existing columns; I only want to keep the new ones.

Answer 2:

I achieved this using the trim function.

import org.apache.spark.sql.functions.trim
parquetFile.withColumn("first", trim($"PlanID.$$numberLong", "($numberLong,)")).show

Output:

+--------------------+--------------------+-------------------+-----+--------+--------------------+-------------------+--------------------+-----+
|                 _id|     VehicleDetailId|             PlanID| Type| SubType|                 XML|          CreatedOn|                Date|first|
+--------------------+--------------------+-------------------+-----+--------+--------------------+-------------------+--------------------+-----+
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,164)]|Quote|Response|<?xml version="1....|5/1/2019 5:30:39 AM|[($date,155666883...|  164|
|[($oid,5cc8e1a72f...|[($numberLong,219...|[($numberLong,168)]|Quote|Response|<?xml version="1....|5/1/2019 5:30:39 AM|[($date,155666883...|  168|
|[($oid,5cc8e1ac2f...|[($numberLong,219...|[($numberLong,102)]|  IDV| Request|<IDV><policy_star...|5/1/2019 5:30:44 AM|[($date,155666884...|  102|
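
To produce the fully flattened layout from the question, keep only the new columns (the issue raised in the comment above), and filter on PlanID, a select with aliases can replace withColumn. The sketch below is not from either answer; the unwrap helper is introduced purely for illustration, and regexp_extract is used instead of trim because a trim character set such as "($oid,)" would also strip a trailing hex digit like d from the ObjectId strings. The filter value "164" is just an example:

import org.apache.spark.sql.functions.{col, regexp_extract}

// Hypothetical helper: pull the payload out of the "($oid,...)" / "($numberLong,...)" wrapper text.
def unwrap(field: String) = regexp_extract(col(field), "\\(\\$\\w+,(.*)\\)", 1)

val flat = PDataFrame.select(
  unwrap("_id.$oid").as("_id"),
  unwrap("VehicleDetailId.$numberLong").as("VehicleDetailId"),
  unwrap("PlanID.$numberLong").as("PlanID"),
  col("Type"),
  col("SubType"),
  col("CreatedOn"),
  unwrap("Date.$date").as("Date")
)

flat.where(col("PlanID") === "164").show()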

