Serialize table to nested JSON using Apache Spark SQL

Posted: 2020-06-11 18:44:14

Question:

The question is the same as this one, but can the following:

df.withColumn("VEHICLE", struct("VEHICLENUMBER", "CUSTOMERID")).
  select("VEHICLE", "ACCOUNTNO").              // only select required columns
  groupBy("ACCOUNTNO").
  agg(collect_list("VEHICLE").as("VEHICLE")).  // for each group, collect the vehicles into a list
  toJSON.                                      // convert to JSON
  show(false)

be rewritten in pure SQL? I mean something like this:

val sqlDF = spark.sql("SELECT VEHICLE, ACCOUNTNO as collect_list(ACCOUNTNO) FROM VEHICLES group by ACCOUNTNO")
sqlDF.show()

Is that possible?


Answer 1:

The SQL equivalent of your DataFrame example is:

scala> val df = Seq((10003014,"MH43AJ411",20000000),
     |   (10003014,"MH43AJ411",20000001),
     |   (10003015,"MH12GZ3392",20000002)
     | ).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID").withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID"))
df: org.apache.spark.sql.DataFrame = [ACCOUNTNO: int, VEHICLENUMBER: string ... 2 more fields]

scala> df.createOrReplaceTempView("vehicles")

scala> val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) as ACCOUNT_LIST FROM VEHICLES group by ACCOUNTNO").toJSON
sqlDF: org.apache.spark.sql.Dataset[String] = [value: string]

scala> sqlDF.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"ACCOUNTNO":10003014,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
|{"ACCOUNTNO":10003015,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]}                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
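For completeness, the struct can also be built inside the SQL itself, so the `withColumn` step is not needed at all. A minimal sketch, assuming a temp view named `vehicles` with the same columns as above (`named_struct`, `collect_list`, and `to_json` are all built-in Spark SQL functions):

```sql
-- Build the nested struct and the list entirely in SQL,
-- then render each grouped row as a JSON string.
SELECT to_json(
         named_struct(
           'ACCOUNTNO', ACCOUNTNO,
           'ACCOUNT_LIST', collect_list(
             named_struct('VEHICLENUMBER', VEHICLENUMBER,
                          'CUSTOMERID',    CUSTOMERID)
           )
         )
       ) AS value
FROM vehicles
GROUP BY ACCOUNTNO
```

With this form the query can be run as a single `spark.sql(...)` call and the `.toJSON` step on the Dataset is no longer required, since `to_json` already produces the JSON string per row.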

