How to explode structs array?

Posted: 2018-12-20 14:40:38

Question:

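I am working with JSON objects and want to convert object.hours into a relational table using Spark SQL DataFrames/Datasets.

I tried to use "explode", but it does not really support this "structs array": hours is a struct of structs, and explode only accepts array or map columns.

The JSON object looks like this: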


{
  "business_id": "abc",
  "full_address": "random_address",
  "hours": {
    "Monday": {
      "close": "02:00",
      "open": "11:00"
    },
    "Tuesday": {
      "close": "02:00",
      "open": "11:00"
    },
    "Friday": {
      "close": "02:00",
      "open": "11:00"
    },
    "Wednesday": {
      "close": "02:00",
      "open": "11:00"
    },
    "Thursday": {
      "close": "02:00",
      "open": "11:00"
    },
    "Sunday": {
      "close": "00:00",
      "open": "11:00"
    },
    "Saturday": {
      "close": "02:00",
      "open": "11:00"
    }
  }
}

I want to turn it into a relational table like the following:

CREATE TABLE "business_hours" (
     "id" integer NOT NULL PRIMARY KEY,
     "business_id" integer NOT NULL FOREIGN KEY REFERENCES "businesses",
     "day" integer NOT NULL,
     "open_time" time,
     "close_time" time
)

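Answer 1:

You can do this with the following trick: build one struct per day, collect the structs into an array, and explode that array.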

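import org.apache.spark.sql.functions.{array, col, explode, lit, struct}
import org.apache.spark.sql.types.StructType
// The $"..." syntax needs spark.implicits._ (already imported in spark-shell)

// Names of the nested fields under "hours", i.e. the day names
val days = df.schema("hours")
  .dataType
  .asInstanceOf[StructType]
  .fieldNames

// One struct per day, wrapped in an array so that explode can turn it into rows
val solution = df
  .select(
    $"business_id",
    $"full_address",
    explode(
      array(
        days.map(d => struct(
          lit(d).as("day"),
          col(s"hours.$d.open").as("open_time"),
          col(s"hours.$d.close").as("close_time")
        )): _*
      )
    )
  )
  // explode names its output column "col"; expand the struct into top-level columns
  .select($"business_id", $"full_address", $"col.*")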

scala> solution.show
+-----------+--------------+---------+---------+----------+
|business_id|  full_address|      day|open_time|close_time|
+-----------+--------------+---------+---------+----------+
|        abc|random_address|   Friday|    11:00|     02:00|
|        abc|random_address|   Monday|    11:00|     02:00|
|        abc|random_address| Saturday|    11:00|     02:00|
|        abc|random_address|   Sunday|    11:00|     00:00|
|        abc|random_address| Thursday|    11:00|     02:00|
|        abc|random_address|  Tuesday|    11:00|     02:00|
|        abc|random_address|Wednesday|    11:00|     02:00|
+-----------+--------------+---------+---------+----------+
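
The target table in the question stores "day" as an integer and has an "id" column, while the result above keeps the day name as a string. The following is a minimal sketch of closing that gap, not a definitive implementation. It assumes a Monday = 1 to Sunday = 7 numbering (the question does not define one) and Spark 2.4+ for element_at. The names dayNumber and businessHours, the local SparkSession, and the shortened two-day sample record are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, element_at, explode, lit, monotonically_increasing_id, struct, typedLit}
import org.apache.spark.sql.types.StructType

// Local session for the sketch; in spark-shell `spark` already exists
val spark = SparkSession.builder().master("local[*]").appName("business-hours-sketch").getOrCreate()
import spark.implicits._

// Shortened sample record (two days only) in the shape shown in the question
val json =
  """{"business_id":"abc","full_address":"random_address","hours":{"Monday":{"close":"02:00","open":"11:00"},"Sunday":{"close":"00:00","open":"11:00"}}}"""
val df = spark.read.json(Seq(json).toDS())

// Day names are the nested field names under "hours"
val days = df.schema("hours").dataType.asInstanceOf[StructType].fieldNames

// ASSUMPTION: Monday = 1 ... Sunday = 7; the question does not specify a numbering
val dayNumber = typedLit(Map(
  "Monday" -> 1, "Tuesday" -> 2, "Wednesday" -> 3, "Thursday" -> 4,
  "Friday" -> 5, "Saturday" -> 6, "Sunday" -> 7))

val businessHours = df
  .select(
    $"business_id",
    // Same trick as the answer: array of per-day structs, then explode
    explode(array(days.map(d => struct(
      lit(d).as("day_name"),
      col(s"hours.$d.open").as("open_time"),
      col(s"hours.$d.close").as("close_time")
    )): _*)).as("h"))
  .select(
    // Unique but not consecutive; a stand-in for the "id" primary key
    monotonically_increasing_id().as("id"),
    $"business_id",
    // Look up the integer for each day name in the literal map
    element_at(dayNumber, $"h.day_name").as("day"),
    $"h.open_time",
    $"h.close_time")

businessHours.show()
```

The open and close times stay as strings in this sketch. Converting them to a SQL time type, and replacing the generated ids with consecutive ones if that is required, would typically be handled when the rows are written to the target database.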
