How to convert a JSON string to a JSON object in PySpark

Posted: 2018-04-11 10:26:22

Question:

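I have a DataFrame with a single column of type string, but the column actually contains JSON objects that follow 4 different schemas with only a few fields in common. I need to convert it into JSON objects.

Here is the schema of the DataFrame: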

query.printSchema()

root
 |-- test: string (nullable = true)

The values in the DataFrame look like this:

query.show(10)

+--------------------+
|                test|
+--------------------+
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"Interaction":"...|
|"PurchaseActivit...|
|"Interaction":"...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
+--------------------+
only showing top 10 rows

The solution I applied:

    Write the DataFrame out as text files

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

    Read them back as JSON

df = spark.read.json("s3a://bucketname/temp/")

    Now print the schema; the JSON string in each row has been converted to a JSON object

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

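Is there a better way to get the expected output, one that does not require writing the DataFrame out as text files and reading it back as JSON?

Comments:

Have you tried the solutions in this answer or this answer?

Answer 1: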

You can use from_json() before writing the data out, but you need to define the schema first.

The code looks like this:

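from pyspark.sql.functions import from_json

# schema must be a StructType describing the JSON, defined beforehand
data = query.select(from_json("test", schema=schema).alias("value")).selectExpr("value.*")

# the "text" writer accepts only a single string column, so write the parsed columns as JSON
data.write.format("json").mode('overwrite').save("s3://bucketname/temp/")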
