Serialize table to nested JSON using Apache Spark SQL
Posted: 2020-06-11 18:44:14

The question is the same as this one, but can this:
df.withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID")).
select("VEHICLE","ACCOUNTNO"). //only select required columns
groupBy("ACCOUNTNO").
agg(collect_list("VEHICLE").as("VEHICLE")). //for the same group create a list of vehicles
toJSON. //convert to json
show(false)
be rewritten in pure SQL? I mean something like this:
val sqlDF = spark.sql("SELECT VEHICLE, ACCOUNTNO as collect_list(ACCOUNTNO) FROM VEHICLES group by ACCOUNTNO")
sqlDF.show()
Is that possible?
Answer 1: The SQL equivalent of your DataFrame example is:
scala> val df = Seq((10003014,"MH43AJ411",20000000),
| (10003014,"MH43AJ411",20000001),
| (10003015,"MH12GZ3392",20000002)
| ).toDF("ACCOUNTNO","VEHICLENUMBER","CUSTOMERID").withColumn("VEHICLE",struct("VEHICLENUMBER","CUSTOMERID"))
df: org.apache.spark.sql.DataFrame = [ACCOUNTNO: int, VEHICLENUMBER: string ... 2 more fields]
scala> df.createOrReplaceTempView("vehicles") // registerTempTable is deprecated since Spark 2.0
scala> val sqlDF = spark.sql("SELECT ACCOUNTNO, collect_list(VEHICLE) as ACCOUNT_LIST FROM VEHICLES group by ACCOUNTNO").toJSON
sqlDF: org.apache.spark.sql.Dataset[String] = [value: string]
scala> sqlDF.show(false)
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                          |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
|{"ACCOUNTNO":10003014,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000000},{"VEHICLENUMBER":"MH43AJ411","CUSTOMERID":20000001}]}|
|{"ACCOUNTNO":10003015,"ACCOUNT_LIST":[{"VEHICLENUMBER":"MH12GZ3392","CUSTOMERID":20000002}]}                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------+
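For completeness, the JSON serialization itself can also be pushed into the SQL statement, so no `.toJSON` call on the resulting Dataset is needed. A minimal sketch using the built-in `to_json`, `named_struct`, and `collect_list` SQL functions (assuming a running Spark session and the same `vehicles` temp view as above):

```scala
// Build the nested structure and serialize it to a JSON string entirely in SQL.
// Assumes the "vehicles" view registered above; column names match the example data.
val jsonDF = spark.sql("""
  SELECT to_json(
           named_struct(
             'ACCOUNTNO', ACCOUNTNO,
             'ACCOUNT_LIST', collect_list(
               named_struct('VEHICLENUMBER', VEHICLENUMBER,
                            'CUSTOMERID',    CUSTOMERID))
           )
         ) AS value
  FROM vehicles
  GROUP BY ACCOUNTNO
""")
jsonDF.show(false)
```

Each output row is then already a complete JSON string, which can be written out directly (e.g. with `jsonDF.write.text(...)`).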