Reading Nested Json file in Pyspark code. pyspark.sql.utils.AnalysisException:
Posted: 2021-07-09 12:14:30
I am trying to read a nested JSON file. I am unable to explode the nested columns and read the JSON file correctly.
My Json file:
```
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun@yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
```
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split
spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()
if __name__ == '__main__':
    df = spark.read.option("multiline","true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```
My output and error below:
```
+--------------------+---------+
|          Department|Univerity|
+--------------------+---------+
|  101, [12592, B...|     JNTU|
+--------------------+---------+
root
|-- Department: struct (nullable = true)
| |-- DepartmentID: string (nullable = true)
| |-- Student: struct (nullable = true)
| | |-- Subjects: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- subjectId: string (nullable = true)
| | | | |-- subjectName: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json
```
Comments:
Hi, welcome to ***. Could you include the code you are running that produces this output?

Answer 1:
You can only explode an array column, so select the Subjects column to explode.
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline","true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # explode() accepts only array or map columns; Department.Student.Subjects is the array here
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
```
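If the goal is a fully flat table, the same dotted-path selection can be pushed further. A minimal sketch, assuming the schema shown above (the column aliases here are illustrative, not from the original answer):

```
# Continues from the answer's code: df is the DataFrame read above,
# and explode/col come from pyspark.sql.functions.
flat_df = (
    df.withColumn("Subject", explode(col("Department.Student.Subjects")))
      .select(
          col("Univerity"),
          col("Department.DepartmentID").alias("DepartmentID"),
          col("Department.Student.lastName").alias("lastName"),
          col("Department.Student.address").alias("address"),
          col("Department.Student.email").alias("email"),
          col("Subject.subjectId").alias("subjectId"),
          col("Subject.subjectName").alias("subjectName"),
      )
)
flat_df.show(truncate=False)
```

Each exploded array element becomes its own row, so the two Subjects entries produce two rows sharing the same Univerity, DepartmentID, and Student values.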