Convert column with string json string to column with dictionary in pyspark
Posted: 2020-05-29 08:13:00
Question: I have a column in my DataFrame with the following structure.
+--------------------+
|                data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows
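The data in the column is a JSON string. I want to convert the column to another type (map, struct, ...). How can I do this with a udf? I wrote a function like the one below, but I can't work out what the return type should be. I tried StructType and MapType, and both threw errors. Here is my code.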
import json
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StructType
udf_getDict = F.udf(lambda x: json.loads(x), StructType)
subset.select(udf_getDict(F.col('data'))).printSchema()
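For reference, the `returnType` argument of `F.udf` expects a `DataType` instance (or a DDL type string), which is likely why passing the bare `StructType` class fails. Below is a minimal sketch of a udf that does return a map. It is not the asker's code: it assumes each value is a JSON object and that a `map<string,string>` is acceptable, so nested values are kept as JSON strings. The names `json_to_str_map`, `add_map_column`, and `data_map` are made up for illustration.

```python
import json


def json_to_str_map(text):
    """Parse a JSON object string into a flat dict of str -> str.

    MapType needs a single value type, so any non-string value is
    re-serialized as a JSON string. Returns None for null input,
    malformed JSON, or JSON that is not an object.
    """
    if text is None:
        return None
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(parsed, dict):
        return None
    return {
        str(key): value if isinstance(value, str) else json.dumps(value)
        for key, value in parsed.items()
    }


def add_map_column(df, source_col="data", target_col="data_map"):
    """Return df with an extra map<string,string> column (hypothetical helper)."""
    # Imported here so the parsing helper above can be used without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    # The return type is an instance, MapType(...), not the class itself.
    to_map = F.udf(json_to_str_map, MapType(StringType(), StringType()))
    return df.withColumn(target_col, to_map(F.col(source_col)))
```

With the `subset` DataFrame from the question, something like `add_map_column(subset).printSchema()` should then show the new map column next to `data`.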
Answer 1: You can use spark.read.json together with df.rdd.map, for example:
json_string = """
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
"""
df2 = spark.createDataFrame(
    [
        (1, json_string),
    ],
    ['id', 'txt']
)
df2.dtypes
[('id', 'bigint'), ('txt', 'string')]
new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
new_df.printSchema()
root
 |-- glossary: struct (nullable = true)
 |    |-- GlossDiv: struct (nullable = true)
 |    |    |-- GlossList: struct (nullable = true)
 |    |    |    |-- GlossEntry: struct (nullable = true)
 |    |    |    |    |-- Abbrev: string (nullable = true)
 |    |    |    |    |-- Acronym: string (nullable = true)
 |    |    |    |    |-- GlossDef: struct (nullable = true)
 |    |    |    |    |    |-- GlossSeeAlso: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- para: string (nullable = true)
 |    |    |    |    |-- GlossSee: string (nullable = true)
 |    |    |    |    |-- GlossTerm: string (nullable = true)
 |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |-- SortAs: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |-- title: string (nullable = true)
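Note that spark.read.json builds a new DataFrame from the JSON strings alone, so other columns such as id are not carried over. If they need to be kept, one possible variation (a sketch, not part of the original answer) is to use spark.read.json only to infer the schema and then apply `F.from_json` to the string column. The function name `parse_json_column` and the column name `parsed` are assumptions chosen for illustration.

```python
def parse_json_column(spark, df, json_col="txt", parsed_col="parsed"):
    """Return df with an extra struct column parsed from a JSON string column.

    Hypothetical helper: `spark` is an active SparkSession and `df` has a
    string column named `json_col` holding one JSON document per row.
    """
    # Imported inside the function so that defining it does not require Spark.
    from pyspark.sql import functions as F

    # Infer the schema once from the JSON strings (this triggers a Spark job).
    inferred_schema = spark.read.json(
        df.rdd.map(lambda row: row[json_col])
    ).schema

    # Parse the string column in place; all existing columns are kept.
    return df.withColumn(parsed_col, F.from_json(F.col(json_col), inferred_schema))
```

With the df2 defined above, `parse_json_column(spark, df2)` would keep id and txt and add a parsed struct column, whose nested fields can then be selected with dotted paths such as `parsed.glossary.title`.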