Flatten Spark JSON Data Frame with nested children attributes having the same names

Posted 2023-02-23

【Published】: 2017-12-22 03:09:45 【Question】:

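As a Scala / Spark newbie, I'm a bit stuck and would appreciate any help!

I am importing JSON data into a Spark data frame. In the process, I end up with a data frame that has the same nested structure as the JSON input.

My goal is to flatten the entire data frame recursively (down to the innermost child attributes inside arrays / dictionaries) using Scala.

In addition, child attributes may share the same name, so they need to be told apart as well.

A somewhat similar solution (same child attributes under different parents) is shown here: https://***.com/a/38460312/3228300

An example of what I hope to achieve is as follows: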


    {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters":
            {
                "batter":
                    [
                        { "id": "1001", "type": "Regular" },
                        { "id": "1002", "type": "Chocolate" },
                        { "id": "1003", "type": "Blueberry" },
                        { "id": "1004", "type": "Devil's Food" }
                    ]
            },
        "topping":
            [
                { "id": "5001", "type": "None" },
                { "id": "5002", "type": "Glazed" },
                { "id": "5005", "type": "Sugar" },
                { "id": "5007", "type": "Powdered Sugar" },
                { "id": "5006", "type": "Chocolate with Sprinkles" },
                { "id": "5003", "type": "Chocolate" },
                { "id": "5004", "type": "Maple" }
            ]
    }

The corresponding flattened output Spark DF structure would be:


    {
        "id": "0001",
        "type": "donut",
        "name": "Cake",
        "ppu": 0.55,
        "batters_batter_id_0": "1001",
        "batters_batter_type_0": "Regular",
        "batters_batter_id_1": "1002",
        "batters_batter_type_1": "Chocolate",
        "batters_batter_id_2": "1003",
        "batters_batter_type_2": "Blueberry",
        "batters_batter_id_3": "1004",
        "batters_batter_type_3": "Devil's Food",
        "topping_id_0": "5001",
        "topping_type_0": "None",
        "topping_id_1": "5002",
        "topping_type_1": "Glazed",
        "topping_id_2": "5005",
        "topping_type_2": "Sugar",
        "topping_id_3": "5007",
        "topping_type_3": "Powdered Sugar",
        "topping_id_4": "5006",
        "topping_type_4": "Chocolate with Sprinkles",
        "topping_id_5": "5003",
        "topping_type_5": "Chocolate",
        "topping_id_6": "5004",
        "topping_type_6": "Maple"
    }
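
The naming rule implied by this output (join the attribute path with underscores, then append the array indices) can be sketched in plain Scala, independent of Spark. This is only an illustration of the rule, not a Spark solution: it assumes the record has already been parsed into ordinary Scala collections (objects as `Map[String, Any]`, arrays as `Seq[Any]`), and the names `FlattenNames` and `flatten` are made up for the example.

```scala
object FlattenNames {
  // Assumption for this sketch: JSON objects are Map[String, Any],
  // JSON arrays are Seq[Any], and anything else is a leaf value.
  def flatten(value: Any,
              path: List[String] = Nil,
              indices: List[Int] = Nil): Map[String, Any] =
    value match {
      // Object: descend into each attribute, extending the name path.
      case obj: Map[_, _] =>
        obj.flatMap { case (k, v) => flatten(v, path :+ k.toString, indices) }
      // Array: descend into each element, remembering its position.
      case arr: Seq[_] =>
        arr.zipWithIndex.flatMap { case (v, i) => flatten(v, path, indices :+ i) }.toMap
      // Leaf: key is the path followed by the collected array indices.
      case leaf =>
        Map((path ++ indices.map(_.toString)).mkString("_") -> leaf)
    }
}
```

With this rule, the first batter's `id` in the record above ends up under the key `batters_batter_id_0`, and children with the same name under different parents (`batters_batter_id_0` versus `topping_id_0`) stay distinct because the parent path is part of the key.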

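I haven't used Scala and Spark much before, so I'm not sure how to proceed.

Finally, I would be very grateful if someone could provide code for a generic / schema-independent solution, as I need to apply it to many different collections.

Many thanks :)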

【Comments on the question】:

【Answer 1】:

Here is one way we handled it in a project.

Define a case class that maps one row of the data frame:

    // `type` is a reserved word in Scala, so the field name needs backticks.
    case class BattersTopics(id: String, `type`: String, ..., batters_batter_id_0: String, ..., topping_id_0: String)

Map every row of the data frame to the case class:

    // Provides the Encoder for the case class; assumes a SparkSession named `spark`.
    import spark.implicits._

    val dataSet = df.map(row => BattersTopics(id = row.getAs[String]("id"), ...,
      batters_batter_id_0 = row.getAs[String]("batters_batter_id_0"), ...))

Collect into a list and build a Map[String, Any] from each row:

    val rows = dataSet.collect().toList
    rows.map(bt => Map(
      "id" -> bt.id,
      "type" -> bt.`type`,
      "batters" -> Map(
        "batter" -> List(Map("id" -> bt.batters_batter_id_0, "type" -> bt.batters_batter_type_0), ...) // same for the other ids and types
      ),
      "topping" -> List(Map("id" -> bt.topping_id_0, "type" -> bt.topping_type_0), ...) // same for the other ids and types
    ))

Use Jackson to convert the Map[String, Any] to JSON.

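【Comments】:

Hi @dumitru, first of all, thanks for your help :) As far as I can tell, you have written a mapping from the existing schema to the output schema. However, my JSON files may contain additional (dynamically generated) dictionary elements beyond those I have considered so far. So I need code that flattens recursively over an unknown schema, e.g. ***.com/a/37473765/3228300, but it also has to handle child attributes, including ones that share the same name.

【Answer 2】: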

Sample data, which contains all the different kinds of JSON elements (nested JSON map, JSON array, long, string, etc.):

    {"name":"Akash","age":16,"watches":{"name":"Apple","models":["Apple Watch Series 5","Apple Watch Nike"]},"phones":[{"name":"Apple","models":["iphone X","iphone XR","iphone XS","iphone 11","iphone 11 Pro"]},{"name":"Samsung","models":["Galaxy Note10","Galaxy S10e","Galaxy S10"]},{"name":"Google","models":["Pixel 3","Pixel 3a"]}]}

    root
     |-- age: long (nullable = true)
     |-- name: string (nullable = true)
     |-- phones: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- models: array (nullable = true)
     |    |    |    |-- element: string (containsNull = true)
     |    |    |-- name: string (nullable = true)
     |-- watches: struct (nullable = true)
     |    |-- models: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
     |    |-- name: string (nullable = true)

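This sample JSON data has values of both ArrayType and StructType (map).

We can handle each of these two types with its own switch case (the first two cases), then repeat the process until the data is flattened into the required DataFrame.

A solution using the Spark Java API is described here:

https://medium.com/@ajpatel.bigdata/flatten-json-data-with-apache-spark-java-api-5f6a8e37596b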
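
Since the question asks for Scala, the "one case per type, repeat until flat" idea can be sketched roughly as follows. This is a minimal sketch, not the code from the linked article: the object and method names are made up, struct children are renamed to `parent_child`, and arrays are expanded with `explode_outer`, so each array element becomes a separate row rather than an index-suffixed column as in the question's desired output. `MapType` columns are left untouched, and a clash between a generated `parent_child` name and an existing column is not handled.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

object FlattenDataFrame {
  // Repeatedly expands the first struct or array column until none is left.
  @annotation.tailrec
  def flatten(df: DataFrame): DataFrame = {
    val nested = df.schema.fields.find { f =>
      f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]
    }
    nested match {
      case None => df // nothing nested remains
      case Some(field) =>
        val quoted = s"`${field.name}`" // backticks guard against dots in names
        val expanded: Seq[Column] = field.dataType match {
          // Struct: pull each child up one level, prefixed with the parent name.
          case st: StructType =>
            st.fields.toSeq.map { c =>
              col(s"$quoted.`${c.name}`").as(s"${field.name}_${c.name}")
            }
          // Array: one output row per element (null or empty arrays keep a row).
          case _ =>
            Seq(explode_outer(col(quoted)).as(field.name))
        }
        val others = df.columns.toSeq.filter(_ != field.name).map(c => col(s"`$c`"))
        flatten(df.select((others ++ expanded): _*))
    }
  }
}
```

Only one array is exploded per pass, which avoids Spark's restriction on multiple generators in a single `select`. Note that exploding several independent arrays multiplies rows: a record with 4 batters and 7 toppings would yield 28 rows. Getting index-suffixed columns instead would require knowing the maximum array length up front, which this sketch does not attempt.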

【Comments】:
