Combining multiple structs into a single struct in Spark SQL

Posted: 2022-01-21 20:26:14

Here is my input:

val df = Seq(
  ("Adam","Angra", "Anastasia"),
  ("Boris","Borun", "Bisma"),
  ("Shawn","Samar", "Statham")
).toDF("fname", "mname", "lname")
df.createOrReplaceTempView("df")

I want the Spark SQL output to look like this:

struct
"data_description":"fname","data_details":"Adam","data_description":"mname","data_details":"Angra","data_description":"lname","data_details":"Anastasia"
"data_description":"fname","data_details":"Boris","data_description":"mname","data_details":"Borun","data_description":"lname","data_details":"Bisma"
"data_description":"fname","data_details":"Shawn","data_description":"mname","data_details":"Samar","data_description":"lname","data_details":"Statham"

This is what I have tried so far:

val df1 = spark.sql("""select concat(fname, ':', mname, ':', lname) as name from df""")
df1.createOrReplaceTempView("df1")

val df2 = spark.sql("""
  select
    named_struct('data_description', 'fname', 'data_details', split(name, ':')[0]) as struct1,
    named_struct('data_description', 'mname', 'data_details', split(name, ':')[1]) as struct2,
    named_struct('data_description', 'lname', 'data_details', split(name, ':')[2]) as struct3
  from df1
""")
df2.createOrReplaceTempView("df2")

Output of the above:

struct1 struct2 struct3
"data_description":"fname","data_details":"Adam"  "data_description":"mname","data_details":"Angra" "data_description":"lname","data_details":"Anastasia"
"data_description":"fname","data_details":"Boris" "data_description":"mname","data_details":"Borun" "data_description":"lname","data_details":"Bisma"
"data_description":"fname","data_details":"Shawn" "data_description":"mname","data_details":"Samar" "data_description":"lname","data_details":"Statham"

But I end up with 3 separate structs. I need them combined into a single struct, separated by commas.

Answer 1:

The SQL statement is below; the rest is a matter of personal preference.

val sql = """
    select
        concat_ws(
            ','
            ,concat('"data_description":"fname","data_details":"',fname,'"')
            ,concat('"data_description":"mname","data_details":"',mname,'"')
            ,concat('"data_description":"lname","data_details":"',lname,'"')
        ) as struct
    from df
"""

Answer 2:

You can create an array of structs, then use to_json if you want the output as a string:

spark.sql("""
select  to_json(array(
          named_struct('data_description','fname','data_details', fname),
          named_struct('data_description','mname','data_details', mname), 
          named_struct('data_description','lname','data_details', lname) 
        )) as struct
from  df
""").show()

//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|struct                                                                                                                                                          |
//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
//|["data_description":"fname","data_details":"Adam","data_description":"mname","data_details":"Angra","data_description":"lname","data_details":"Anastasia"]|
//|["data_description":"fname","data_details":"Boris","data_description":"mname","data_details":"Borun","data_description":"lname","data_details":"Bisma"]   |
//|["data_description":"fname","data_details":"Shawn","data_description":"mname","data_details":"Samar","data_description":"lname","data_details":"Statham"] |
//+----------------------------------------------------------------------------------------------------------------------------------------------------------------+

If you have many columns, you can generate the struct SQL expressions dynamically like this:

val structs = df.columns.map(c => s"named_struct('data_description','$c','data_details', $c)").mkString(",")

val df2 = spark.sql(s"""
  select  to_json(array($structs)) as struct
  from  df
""")

If you don't want to use an array, you can simply apply to_json to each of the 3 structs and concatenate the results:

val structs = df.columns.map(c => s"to_json(named_struct('data_description','$c','data_details', $c))").mkString(",")

val df2 = spark.sql(s"""
  select  concat_ws(',', $structs) as struct
  from  df
""")
