Spark Read Json: how to read a field that alternates between integer and struct

Posted 2023-04-15

【Title】Spark Read Json: how to read field that alternates between integer and struct 【Posted】2020-06-06 07:36:14 【Question】:

I am trying to read multiple JSON files into a single DataFrame. Both files have a "Value" node, but the type of this node alternates between integer and struct:

File 1:

    {
       "Value": 123
    }

File 2:

    {
       "Value": {
          "Value": "On",
          "ValueType": "State",
          "IsSystemValue": true
       }
    }

My goal is to read the files into a DataFrame like this:

|---------------------|------------------|---------------------|------------------|
|         File        |       Value      |      ValueType      |   IsSystemValue  |
|---------------------|------------------|---------------------|------------------|
|      File1.json     |        123       |        null         |       null       |
|---------------------|------------------|---------------------|------------------|
|      File2.json     |        On        |        State        |       true       |
|---------------------|------------------|---------------------|------------------|

It is possible that all of the files read look like File 1 and none like File 2, or vice versa, or a mix of both. This is not known in advance. Any ideas?

【Question discussion】:

【Answer 1】:

See if this helps.

Load the test data

    /**
      * test/File1.json
      * -----
      * {
      *   "Value": 123
      * }
      */
    /**
      * test/File2.json
      * ---------
      * {
      *   "Value": {
      *     "Value": "On",
      *     "ValueType": "State",
      *     "IsSystemValue": true
      *   }
      * }
      */
    // `spark` is assumed to be an existing SparkSession
    val path = getClass.getResource("/test").getPath
    // multiLine lets Spark parse JSON documents that span several lines
    val df = spark.read
      .option("multiLine", true)
      .json(path)

    df.show(false)
    df.printSchema()

    /**
      * +-------------------------------------------------------+
      * |Value                                                  |
      * +-------------------------------------------------------+
      * |{"Value":"On","ValueType":"State","IsSystemValue":true}|
      * |123                                                    |
      * +-------------------------------------------------------+
      *
      * root
      * |-- Value: string (nullable = true)
      */
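
The `string` type shown above is the result of schema inference over a mix of both file shapes. If every file looked like File1.json, or every file looked like File2.json, inference would likely produce a numeric or a struct column instead, and the later string-based extraction might not apply as written. A minimal sketch of one way to avoid depending on inference, assuming it is acceptable to declare `Value` as a string up front. The object name `ExplicitSchemaSketch`, the helper `readValueAsString`, and the inline records are hypothetical stand-ins, not part of the original answer:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object ExplicitSchemaSketch {
  // Declare Value as a plain string so its type never depends on inference.
  val schema: StructType =
    StructType(Seq(StructField("Value", StringType, nullable = true)))

  // Hypothetical helper: reads one JSON record per Dataset element.
  def readValueAsString(spark: SparkSession, records: Dataset[String]): DataFrame =
    spark.read.schema(schema).json(records)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-schema-sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-ins for File1.json and File2.json (assumed data).
    val records = Seq(
      """{"Value": 123}""",
      """{"Value": {"Value": "On", "ValueType": "State", "IsSystemValue": true}}"""
    ).toDS()

    val df = readValueAsString(spark, records)
    df.printSchema()
    df.show(false)
    spark.stop()
  }
}
```

For files on disk, the same idea would be `spark.read.schema(schema).option("multiLine", true).json(path)`, which keeps `Value` as a string whichever file shapes happen to be present.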

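Transform the JSON string

    import org.apache.spark.sql.functions._

    // get_json_object returns null when the path is absent, so rows holding a
    // plain integer keep their raw Value through coalesce.
    df.withColumn("File", substring_index(input_file_name(), "/", -1))
      .withColumn("ValueType", get_json_object(col("Value"), "$.ValueType"))
      .withColumn("IsSystemValue", get_json_object(col("Value"), "$.IsSystemValue"))
      .withColumn("Value", coalesce(get_json_object(col("Value"), "$.Value"), col("Value")))
      .show(false)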

    /**
      * +-----+----------+---------+-------------+
      * |Value|File      |ValueType|IsSystemValue|
      * +-----+----------+---------+-------------+
      * |On   |File2.json|State    |true         |
      * |123  |File1.json|null     |null         |
      * +-----+----------+---------+-------------+
      */
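
Note that `get_json_object` returns strings, so `IsSystemValue` ends up as text rather than a boolean. If typed columns are wanted, one possible variation is to parse the string with `from_json` and a declared schema. This is only a sketch under assumptions: the nested schema is guessed from the File2.json sample, and `TypedValueSketch` and `flatten` are hypothetical names:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, from_json}
import org.apache.spark.sql.types.{BooleanType, StringType, StructField, StructType}

object TypedValueSketch {
  // Assumed shape of the nested object, taken from the File2.json sample.
  val nestedSchema: StructType = StructType(Seq(
    StructField("Value", StringType),
    StructField("ValueType", StringType),
    StructField("IsSystemValue", BooleanType)
  ))

  // Expects a DataFrame whose "Value" column is a string holding either a
  // bare number or a JSON object.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("parsed", from_json(col("Value"), nestedSchema))
      .withColumn("ValueType", col("parsed.ValueType"))
      .withColumn("IsSystemValue", col("parsed.IsSystemValue"))
      // Rows that are not JSON objects yield no parsed.Value, so keep the raw text.
      .withColumn("Value", coalesce(col("parsed.Value"), col("Value")))
      .drop("parsed")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("typed-value-sketch")
      .getOrCreate()
    import spark.implicits._

    // In-memory stand-ins for the string column produced by the load step.
    val df = Seq(
      "123",
      """{"Value":"On","ValueType":"State","IsSystemValue":true}"""
    ).toDF("Value")

    val result = flatten(df)
    result.printSchema()
    result.show(false)
    spark.stop()
  }
}
```

The trade-off is that the nested field names and types must be known ahead of time, whereas the `get_json_object` version only needs the paths.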

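【Discussion】:

I had a similar solution that was working, but it suddenly stopped. I don't know what I did or didn't do, but this option works for me now. Thanks, Someshwar.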
