Nested column values from a JSON file come back as null in PySpark

Posted 2023-04-15


Original title: json file nested column value is coming as null in pyspark
Asked: 2021-08-19 18:37:45

Question:

JSON file:

"products":[
  {"productId":"8d809e13-fdc5-4d15-9271-953750f6d592","quantity":500,"soldPrice":104.852},
  {"productId":"ec15ba1d-53b6-44b0-8a22-1e498485f1b8","quantity":300,"soldPrice":94.3668},
  {"productId":"e672483e-57a8-434a-bc42-ecf827c8a8d4","quantity":1000,"soldPrice":109.57034}
],
"shippingAddress":{"attention":"Khaleesi Frost","address":"493 Augustine Drive N ","city":"Miramar","state":"FL","zip":"33785"}

I can access "products" with the explode method because it is an ArrayType, but the values under "shippingAddress" come back as null.

`select(F.col("shippingAddress.attention").alias("shipping_address_attention"),
   F.col("shippingAddress.address").alias("shipping_address_address"))`

JSON schema:

StructField('shippingAddress', StructType([
    StructField('attention', StringType(), True),
    StructField('address', StringType(), True),
    StructField('city', StringType(), True),
    StructField('state', StringType(), True),
    StructField('zip', IntegerType(), True)
]))


Answer 1:

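First, check that your JSON is valid.

If the JSON you are using is the same as the one posted in the question, the whole document needs to be wrapped in {}. When reading it, set the multiline option to True.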
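The validity check can also be done locally with Python's standard json module before involving Spark. The sketch below uses a shortened fragment in the same shape as the question's file (an assumption for illustration, not the full data), and is_valid_json is a hypothetical helper name.

```python
import json

# Shortened fragment shaped like the question's file: key/value pairs
# without the enclosing braces. Illustrative data only.
fragment = (
    '"products":[{"productId":"p1","quantity":500,"soldPrice":104.852}],'
    '"shippingAddress":{"attention":"Khaleesi Frost","zip":"33785"}'
)


def is_valid_json(text):
    """Return True if text parses as a single JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Wrapping the fragment in braces turns it into one JSON object.
wrapped = "{" + fragment + "}"

print(is_valid_json(fragment))  # False
print(is_valid_json(wrapped))   # True
```

If the wrapped form parses but the raw file does not, the missing outer braces are the first thing to fix in the file itself.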

>>> from pyspark.sql.functions import *
>>> df = spark.read.option('multiline', True).json('file.json')
>>> df.printSchema()
root
 |-- products: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productId: string (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 |    |    |-- soldPrice: double (nullable = true)
 |-- shippingAddress: struct (nullable = true)
 |    |-- address: string (nullable = true)
 |    |-- attention: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- zip: string (nullable = true)

>>> df.select(col("shippingAddress.attention").alias("shipping_address_attention"),col("shippingAddress.address").alias("shipping_address_address")).show(truncate=False)
+--------------------------+------------------------+
|shipping_address_attention|shipping_address_address|
+--------------------------+------------------------+
|Khaleesi Frost            |493 Augustine Drive N   |
+--------------------------+------------------------+

Update

You declared the zip column as IntegerType(), but in the JSON it is a string. Because of this type mismatch you get null in the column. Change the type to StringType() in your custom schema.

Your schema should be:

StructType([
    StructField('shippingAddress', StructType([
        StructField('attention', StringType(), True),
        StructField('address', StringType(), True),
        StructField('city', StringType(), True),
        StructField('state', StringType(), True),
        StructField('zip', StringType(), True)
    ]))
])
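A mismatch like this can be spotted before the schema reaches Spark by parsing one sample record with the standard json module and comparing each value's Python type with the type the schema is meant to declare. This is only a rough sketch: DECLARED is a hand-written stand-in for the question's schema (str for StringType, int for IntegerType), and find_mismatches is a hypothetical helper, not part of PySpark.

```python
import json

# Hand-written stand-in for the question's schema (an assumption):
# str plays the role of StringType, int the role of IntegerType.
DECLARED = {
    "attention": str,
    "address": str,
    "city": str,
    "state": str,
    "zip": int,  # declared as IntegerType in the question
}

# One sample record with the shippingAddress values from the question.
record = json.loads(
    '{"shippingAddress": {"attention": "Khaleesi Frost", '
    '"address": "493 Augustine Drive N ", "city": "Miramar", '
    '"state": "FL", "zip": "33785"}}'
)


def find_mismatches(struct, declared):
    """List (field, declared type, actual type) for fields whose value type differs."""
    return [
        (name, declared[name].__name__, type(value).__name__)
        for name, value in struct.items()
        if name in declared and not isinstance(value, declared[name])
    ]


print(find_mismatches(record["shippingAddress"], DECLARED))
# [('zip', 'int', 'str')]
```

The only reported field is zip, which is quoted in the file and therefore a string, matching the fix of declaring it as StringType().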

Comments:

The JSON file is valid. I am reading the JSON files as a stream and loading them into a table. Read:

orders = (
    spark
    .readStream
    .format("json")
    .schema(jason_schema)
    .option("maxFilesPerTrigger", 1)
    .option("path", f"stream_path")
    .load()
)

Write:

df6 = (
    df5.writeStream
    .format("delta")
    .outputMode("append")
    .partitionBy("submitted_yyyy_mm")
    .queryName("orders")
    .trigger(processingTime='20 seconds')
    .option("checkpointLocation", f"orders_checkpoint_path")
    .table("$orders_table")
)

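In the table, these shipping values are null.

I have updated the answer, please take a look.

Yes, after changing it to StringType it works now. Thanks a lot :-)

@Uttam_Singh Consider marking this answer as accepted; it helps us keep answering for the community.

@Kafels Thanks. I wasn't aware of that. I have accepted it now.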
