Read a single-line JSON in Spark where the column key is variable

Posted 2023-04-18


【Title】: read a single line json in spark where column key is variable 【Posted】: 2017-05-13 10:22:22 【Question】:

I have a single-line JSON file that looks like this:

{"Hotel Dream":{"Guests":20,"Address":"14 Naik Street","City":"Manila"},"Serenity Stay":{"Guests":35,"Address":"10 St Marie Road","City":"Manila"}....

If I read the JSON into Spark with the following, I get this schema:

val hotelDF = sqlContext.read.json("file")
hotelDF.printSchema

root
 |-- Hotel Dream: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)
 |-- Serenity Stay: struct (nullable = true)
 |    |-- Address: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- Guests: long (nullable = true)

I want to transpose the distinct columns (Hotel Dream, Serenity Stay, etc.) so that the DataFrame ends up with a normalized schema:

Hotel: string (nullable = true)
Guests: string (nullable = true)
Address: string (nullable = true)
City: string (nullable = true)

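I also tried ingesting the JSON with textFile or wholeTextFiles, but since there are no newline delimiters, I could not map over the contents with a map function.

Any input on how to read this kind of data format?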


【Answer 1】:

From what I understand of your question, the following could work for you (although it is not a perfect solution):

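import org.apache.spark.sql.functions.{col, lit}
import sqlContext.implicits._ // needed for toDF on a Seq

// The seed row only fixes the column layout; it is filtered out at the end.
var newDataFrame = Seq(("test", "test", "test", "test")).toDF("Hotel", "Address", "City", "Guests")
for (name <- hotelDF.schema.fieldNames) {
  // Lift the struct's fields to top-level columns and tag them with the hotel name.
  val tempdf = hotelDF.withColumn("Hotel", lit(name))
    .withColumn("Address", hotelDF(name + ".Address"))
    .withColumn("City", hotelDF(name + ".City"))
    .withColumn("Guests", hotelDF(name + ".Guests"))
  val tdf = tempdf.select("Hotel", "Address", "City", "Guests")
  newDataFrame = newDataFrame.union(tdf) // union matches columns by position
}

newDataFrame.filter(!(col("Hotel") === "test")).show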
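The same idea can probably be written without the dummy "test" row by building one small DataFrame per top-level key and reducing them with union. The following is only a sketch under some assumptions that are not in the question: Spark 2.2 or later (for reading JSON from a Dataset[String]), a local SparkSession created just for the demo, an inline two-hotel string standing in for the real file, and every hotel struct having Guests, Address and City fields. The names spark, json, perHotel and hotels are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Local session for illustration only; in a real job reuse the existing one.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("transpose-hotels")
  .getOrCreate()

// Stand-in for the single-line file from the question (two hotels only).
val json =
  """{"Hotel Dream":{"Guests":20,"Address":"14 Naik Street","City":"Manila"},"Serenity Stay":{"Guests":35,"Address":"10 St Marie Road","City":"Manila"}}"""
val hotelDF = spark.read.json(spark.createDataset(Seq(json))(Encoders.STRING))

// One small DataFrame per top-level key, each with the same four columns.
val perHotel: Seq[DataFrame] = hotelDF.schema.fieldNames.toSeq.map { name =>
  // Backticks keep names containing spaces or dots as a single column reference.
  val hotel = col(s"`$name`")
  hotelDF.select(
    lit(name).as("Hotel"),
    hotel.getField("Guests").cast("string").as("Guests"),
    hotel.getField("Address").as("Address"),
    hotel.getField("City").as("City")
  )
}

// union matches columns by position, so every select above uses the same order.
// reduce throws on an empty Seq, so this assumes at least one hotel in the file.
val hotels: DataFrame = perHotel.reduce(_ union _)
hotels.show(false)
```

Like the loop above, this still creates one projection per hotel, so it may get slow if the file contains a very large number of keys. Hotel names that themselves contain a backtick would need extra escaping.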
