PySpark: getting a schema from a JSON file

Posted 2023-04-15

Title: Pyspark get Schema from JSON file
Published: 2018-07-05 07:12:27

Question:

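I am trying to get a PySpark schema from a JSON file. When I build the schema from a variable in my Python code, the variable's type is <class 'pyspark.sql.types.StructType'>, but when I read the schema from the JSON file, its type is unicode.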

Is there a way to get a PySpark schema from a JSON file?

JSON file content:

"tediasessionclose_schema" : "StructType([ StructField('@timestamp', StringType()), StructField('message' , StructType([ StructField('componentAddress', StringType()), StructField('values', StructType([ StructField('confNum', StringType()), StructField('day', IntegerType())])"

PySpark code:

df = sc.read.json(hdfs_path, schema = jsonfile['tediasessionclose_schema'])
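The type mismatch described above can be reproduced without Spark. The following is a minimal sketch, assuming Python 3 (where the type is reported as str rather than Python 2's unicode) and a shortened, hypothetical config string in place of the real file. The json module only produces plain Python values, so the schema comes back as text, not as a StructType:

```python
import json

# Hypothetical, shortened config text: the schema is stored as a JSON string value.
config_text = """
{"tediasessionclose_schema": "StructType([StructField('day', IntegerType())])"}
"""

data = json.loads(config_text)
value = data["tediasessionclose_schema"]

# The value is plain text that merely looks like Python code.
print(type(value).__name__)
print(value)
```

The string has to be converted into a StructType in a separate step before it can act as a schema object.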

Comments:

tediasessionclose_schema = StructType([ StructField('@timestamp', StringType()), StructField('message' , StructType([StructField('componentAddress', StringType()), StructField('values', StructType([StructField('confNum', StringType())]))])),StructField('day', IntegerType())])

Yes @RameshMaharjan, I have multiple lines in the JSON file, but I only included one for testing.

Yes @RameshMaharjan

Answer 1:

config_json file:

"json_data_schema": ["contactId", "firstName", "lastName"]

PySpark application:

schema = StructType().add("contactId", StringType()).add("firstName", StringType()).add("lastName", StringType())

Reference: https://www.python-course.eu/lambda.php

from functools import reduce

# Fold the column names into one StructType; add() returns the schema itself.
schema = reduce(lambda s, name: s.add(name, StringType(), True), data["json_data_schema"], StructType())

Hope this solution works for you!
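As an alternative to building a StructType by hand, the same list of column names can be turned into a DDL-formatted schema string. This is a sketch under the assumption that the Spark version in use accepts a DDL string for the schema argument of the JSON reader (documented for recent PySpark releases). The helper name to_ddl and the inline config text are hypothetical:

```python
import json

# Hypothetical config text, matching the "json_data_schema" key used above.
config_text = '{"json_data_schema": ["contactId", "firstName", "lastName"]}'
data = json.loads(config_text)


def to_ddl(names, type_name="STRING"):
    """Build a DDL schema string; backticks guard unusual column names."""
    return ", ".join("`{}` {}".format(name, type_name) for name in names)


ddl_schema = to_ddl(data["json_data_schema"])
print(ddl_schema)

# With a SparkSession named spark (not created here), it would be passed as:
# df = spark.read.json(hdfs_path, schema=ddl_schema)
```

This keeps the config file free of Python code, at the cost of giving every column the same type unless the config also stores type names.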

Answer 2:

You can get the schema by evaluating the string read from the JSON file:

import json
from pyspark.sql.types import StructField, StringType, IntegerType, StructType

with open('test.json') as f:
    data = json.load(f)

df = sqlContext.createDataFrame([], schema = eval(data['tediasessionclose_schema']))
print(df.schema)

Output:

StructType(List(StructField(@timestamp,StringType,true),StructField(message,StructType(List(StructField(componentAddress,StringType,true),StructField(values,StructType(List(StructField(confNum,StringType,true),StructField(day,IntegerType,true))),true))),true)))

where test.json is:

"tediasessionclose_schema" : "StructType([ StructField('@timestamp', StringType()), StructField('message' , StructType([ StructField('componentAddress', StringType()), StructField('values', StructType([ StructField('confNum', StringType()), StructField('day', IntegerType())]))]))])"

Hope this helps!

Comments:

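Thanks @Florian. The general idea is that I have defined the schema in a JSON config file and pass it from that file when reading the data. I tried to do the same thing, but it does not work for me.

@SumitGupta I'm not sure what you mean. In the answer above, a JSON file containing the schema is read, and the schema is passed to create a DataFrame. This should also work when reading data from HDFS. Why doesn't it work, and what error do you get? Isn't your JSON schema simply malformed?

Thanks @Florian, let me debug it.

@SumitGupta Note that I added some brackets to your JSON file content to make it work.

Thanks @Florian, I don't understand why we need the extra )]))])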
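The extra )]))]) in the last comment can be checked mechanically. Below is a minimal sketch, not part of the original answer, that counts unmatched opening brackets in the schema string from the question and reports the closers needed to balance it. The helper name missing_closers is hypothetical, and the scan assumes that no brackets appear inside the quoted field names:

```python
PAIRS = {"(": ")", "[": "]"}


def missing_closers(text):
    """Return the closing brackets needed to balance text."""
    expected = []  # stack of closers still owed
    for ch in text:
        if ch in PAIRS:
            expected.append(PAIRS[ch])
        elif ch in PAIRS.values():
            if not expected or expected.pop() != ch:
                raise ValueError("unexpected closing bracket: " + ch)
    return "".join(reversed(expected))


# Schema string exactly as posted in the question.
original = (
    "StructType([ StructField('@timestamp', StringType()), "
    "StructField('message' , StructType([ "
    "StructField('componentAddress', StringType()), "
    "StructField('values', StructType([ "
    "StructField('confNum', StringType()), "
    "StructField('day', IntegerType())])"
)

missing = missing_closers(original)
print(missing)                              # )]))])
print(missing_closers(original + missing))  # prints an empty line
```

The question's string closes only the innermost StructType. The StructField for 'values', the 'message' StructType and its StructField, and the outermost StructType are all still open, which is why eval would fail on it.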
