Parsing a JSON column of a PySpark DataFrame where one key has the value None

Posted 2023-04-15

【Title】: Parsing json column of Pyspark dataframe that has one of the key value as None 【Published】: 2021-07-03 19:29:05 【Question】:

I have a dataframe with two columns: id and cast. The 'cast' column holds values in JSON array format, as shown below.

JSON structure of the 'cast' column for id=862

"['cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks ', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg']"

JSON structure of the 'cast' column for id=8844

"['cast_id': 1, 'character': 'Alan Parrish', 'credit_id': '52fe44bfc3a36847f80a7c73', 'gender': 2, 'id': 2157, 'name': 'Robin Williams', 'order': 0, 'profile_path': None]"

To parse the 'cast' column, I have the following code:

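from pyspark.sql.functions import col, from_json
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType

# Each element of the 'cast' array is a struct with these fields
cast_schema = ArrayType(StructType([
    StructField('cast_id', IntegerType(), nullable=True),
    StructField('character', StringType(), nullable=True),
    StructField('credit_id', StringType(), nullable=True),
    StructField('gender', IntegerType(), nullable=True),
    StructField('id', IntegerType(), nullable=True),
    StructField('name', StringType(), nullable=True),
    StructField('order', IntegerType(), nullable=True),
    StructField('profile_path', StringType(), nullable=True)
]))

# Parse the JSON string and collect the 'name' of every cast member into an array
credits_upd.withColumn('movies_cast', from_json(col('cast'), cast_schema).getField('name')).show()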

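It returns the following output:

As shown above, the 'cast' column is not parsed correctly for id 8844. My guess is that the JSON is not parsed because the 'profile_path' key of the 'cast' column has the value None for id 8844.

I am not sure how to define the schema so that parsing still works when one of the JSON keys has the value None.

My actual file has 45k records.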

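【Discussion】:

Please check this and let me know if it helps.

【Answer 1】:

The problem is the value None for profile_path in "['cast_id': 1, 'character': 'Alan Parrish', 'credit_id': '52fe44bfc3a36847f80a7c73', 'gender': 2, 'id': 2157, 'name': 'Robin Williams', 'order': 0, 'profile_path': None]". None is a Python literal, not a valid JSON value.

If you change None to null, as shown below, it works as expected: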

"['cast_id': 1, 'character': 'Alan Parrish', 'credit_id': '52fe44bfc3a36847f80a7c73', 'gender': 2, 'id': 2157, 'name': 'Robin Williams', 'order': 0, 'profile_path': null]"

Output after making this change:

+--------------------+----+----------------+
|                cast|  id|     movies_cast|
+--------------------+----+----------------+
|['cast_id': 14, ...| 862|     [Tom Hanks]|
|['cast_id': 1, '...|8844|[Robin Williams]|
+--------------------+----+----------------+
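
The difference between None and null can also be seen outside Spark. The following is a minimal sketch using Python's standard json module, not Spark's parser. It assumes each array element is an object wrapped in braces, which the samples as shown here do not include, and it uses double-quoted keys because json.loads rejects single quotes, whereas Spark's JSON reader accepts them by default. The helper try_parse and the shortened records are hypothetical, made up for illustration.

```python
import json

# Shortened, double-quoted variants of the id=8844 record (illustrative only)
with_none = '[{"name": "Robin Williams", "profile_path": None}]'
with_null = '[{"name": "Robin Williams", "profile_path": null}]'


def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


parsed_none = try_parse(with_none)  # None is not a JSON token, so parsing fails
parsed_null = try_parse(with_null)  # null is valid JSON and becomes Python None

print(parsed_none)
print(parsed_null)
```

A strict JSON parser rejects the whole record as soon as it meets the None token, which is consistent with the id 8844 row failing to parse as a whole rather than just its profile_path field.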

In response to your comment, here is how to update the 45K records from None to null:

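from pyspark.sql.functions import col, from_json, udf

@udf
def update_cast(value):
    # Leave null rows untouched; calling replace on None would raise an error
    if value is None:
        return None
    # None is a Python literal, not valid JSON, so swap it for null
    return value.replace("'profile_path': None", "'profile_path': null")


credits_upd = credits_upd.withColumn("cast", update_cast(col("cast")))

credits_upd.withColumn('movies_cast', from_json(col('cast'), cast_schema).getField("name")).show()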
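
The replacement above targets only the profile_path key. If other keys can also hold None, a pattern-based replacement may be more robust. The following is a rough sketch in plain Python, not part of the original answer; the names NONE_VALUE and none_to_null are hypothetical, and the sample string assumes brace-wrapped array elements.

```python
import re

# Matches a None that appears as a value, i.e. directly after a key's colon
NONE_VALUE = re.compile(r":\s*None\b")


def none_to_null(text):
    """Replace Python None values with JSON null; pass missing input through."""
    if text is None:
        return None
    return NONE_VALUE.sub(": null", text)


sample = "[{'name': 'Robin Williams', 'order': 0, 'profile_path': None}]"
fixed = none_to_null(sample)
print(fixed)
```

In Spark, the same pattern could probably be applied without a Python UDF through regexp_replace(col('cast'), r":\s*None\b", ": null") from pyspark.sql.functions, which avoids the overhead of shipping every row to Python. One caveat: a string value that itself contains text such as ": None" would be rewritten too, so it is worth checking the data before relying on this.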

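【Discussion】:

My file has 45k records. Could you suggest how to replace None with null in a column that stores the JSON array as a string?

Is this a JSON file or just a text file?

Please see my updated answer, which addresses your comment.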
