How to convert a JSON string to a JSON object in PySpark
[Posted]: 2018-04-11 10:26:22
[Question]: I have a DataFrame with a single column of type string, but the column actually contains JSON objects that follow 4 different schemas, with only a few fields in common. I need to convert it into JSON objects.
This is the schema of the DataFrame:
query.printSchema()
root
|-- test: string (nullable = true)
The values in the DataFrame look like this:
query.show(10)
+--------------------+
| test|
+--------------------+
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"Interaction":"...|
|"PurchaseActivit...|
|"Interaction":"...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
|"PurchaseActivit...|
+--------------------+
only showing top 10 rows
What I tried:
- Write the DataFrame out as a text file:
query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")
- Read it back as JSON:
df = spark.read.json("s3a://bucketname/temp/")
- Print the schema; the JSON string in each row has now been parsed into a JSON object:
df.printSchema()
root
|-- EventDate: string (nullable = true)
|-- EventId: string (nullable = true)
|-- EventNotificationType: long (nullable = true)
|-- Interaction: struct (nullable = true)
|    |-- ContextId: string (nullable = true)
|    |-- Created: string (nullable = true)
|    |-- Description: string (nullable = true)
|    |-- Id: string (nullable = true)
|    |-- ModelContextId: string (nullable = true)
|-- PurchaseActivity: struct (nullable = true)
|    |-- BillingCity: string (nullable = true)
|    |-- BillingCountry: string (nullable = true)
|    |-- ShippingAndHandlingAmount: double (nullable = true)
|    |-- ShippingDiscountAmount: double (nullable = true)
|    |-- SubscriberId: long (nullable = true)
|    |-- SubscriptionOriginalEndDate: string (nullable = true)
|-- SubscriptionChurn: struct (nullable = true)
|    |-- PaymentTypeCode: long (nullable = true)
|    |-- PaymentTypeName: string (nullable = true)
|    |-- PreviousPaidAmount: double (nullable = true)
|    |-- SubscriptionRemoved: string (nullable = true)
|    |-- SubscriptionStartDate: string (nullable = true)
|-- TransactionDetail: struct (nullable = true)
|    |-- Amount: double (nullable = true)
|    |-- OrderShipToCountry: string (nullable = true)
|    |-- PayPalUserName: string (nullable = true)
|    |-- PaymentSubTypeCode: long (nullable = true)
|    |-- PaymentSubTypeName: string (nullable = true)
Is there a better way to get the expected output, so that I don't have to write the DataFrame out as a text file and read it back in as JSON?
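[Comments]:
Have you tried the solutions in this answer or this answer?
[Answer 1]: You can use from_json() before writing the data out, but you need to define the schema first.
The code looks like this: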
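from pyspark.sql.functions import from_json
# `schema` must be defined beforehand, e.g. as a StructType
data = query.select(from_json("test", schema=schema).alias("value")).selectExpr("value.*")
data.write.format("json").mode('overwrite').save("s3://bucketname/temp/")  # "text" accepts only a single string column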