Reading Nested Json file in Pyspark code. pyspark.sql.utils.AnalysisException:


Posted: 2021-07-09 12:14:30

Question:

I am trying to read a nested JSON file, but I cannot explode the nested column and read the JSON file correctly.

My JSON file:


    "Univerity": "JNTU",
    "Department": 
        "DepartmentID": "101",
        "Student": 
            "lastName": "Fraun",
            "address": "23 hyd 500089",
            "email": "ss.fraun@yahoo.co.in",
            "Subjects": [
                
                    "subjectId": "12592",
                    "subjectName": "Boyce"            
                ,
                
                   "subjectId": "12592",
                    "subjectName": "Boyce"
                
            ]
        
    


Code:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

if __name__ == '__main__':
    spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

    # "multiline" is needed because each JSON record spans multiple lines
    df = spark.read.option("multiline", "true").json(r"C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # Fails: Department is a struct, not an array, so explode() raises the error below
    df.withColumn("Department", explode(col("Department")))
    df.show()
```

My output and error are below:

```
+--------------------+---------+
|          Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...|     JNTU|
+--------------------+---------+
```

```
root
 |-- Department: struct (nullable = true)
 |    |-- DepartmentID: string (nullable = true)
 |    |-- Student: struct (nullable = true)
 |    |    |-- Subjects: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- subjectId: string (nullable = true)
 |    |    |    |    |-- subjectName: string (nullable = true)
 |    |    |-- address: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |-- Univerity: string (nullable = true)
```

```
Traceback (most recent call last):
  File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
    df.withColumn("Department", explode(col("Department")))
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
    return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json
```

Comments:

Hi, welcome to ***. Could you include the code you are running that generates this output?

Answer 1:

You can only explode an array column, so select the Subjects column to explode instead.

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

if __name__ == '__main__':
    spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

    df = spark.read.option("multiline", "true").json(r"C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # Department.Student.Subjects is an array, so explode() is valid here;
    # assign the result back, since withColumn returns a new DataFrame
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
```
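If the goal is a fully flattened table, one possible follow-up (a minimal sketch, assuming the schema printed above; column names are taken from the question's JSON) is to pull the remaining struct fields out with dot notation after the explode:

```
# Sketch only: builds on the df from the answer above, where "Subjects"
# is already the exploded struct. Struct fields need no explode, just dots.
flat = df.select(
    col("Univerity"),                                     # spelling matches the source JSON
    col("Department.DepartmentID").alias("DepartmentID"),
    col("Department.Student.lastName").alias("lastName"),
    col("Department.Student.email").alias("email"),
    col("Subjects.subjectId").alias("subjectId"),
    col("Subjects.subjectName").alias("subjectName"),
)
flat.show(truncate=False)
```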

Comments:
