Convert a column of JSON strings to a column of dictionaries in PySpark

Posted 2023-04-14

Asked: 2020-05-29 08:13:00

Question:

A column in my DataFrame has the following structure.

+--------------------+
|                data|
+--------------------+
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
|{"sbar":{"_id":"5...|
+--------------------+
only showing top 5 rows

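The data in the column is a JSON string. I want to convert the column to another type (map, struct, ...). How can I do this with a UDF? I have written a function like the one below, but I cannot work out what the return type should be. I tried StructType and MapType, and both raised errors. Here is my code.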

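import json
from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StructType

# Failing attempt: the return type is passed as the bare StructType class
udf_getDict = F.udf(lambda x: json.loads(x), StructType)

subset.select(udf_getDict(F.col('data'))).printSchema()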

Answer 1:

You can combine spark.read.json with df.rdd.map, for example:

json_string = """
{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}
"""
df2 = spark.createDataFrame(
    [
        (1, json_string), 
    ],
    ['id', 'txt'] 
)
df2.dtypes
[('id', 'bigint'), ('txt', 'string')]


new_df = spark.read.json(df2.rdd.map(lambda r: r.txt))
new_df.printSchema()
root
 |-- glossary: struct (nullable = true)
 |    |-- GlossDiv: struct (nullable = true)
 |    |    |-- GlossList: struct (nullable = true)
 |    |    |    |-- GlossEntry: struct (nullable = true)
 |    |    |    |    |-- Abbrev: string (nullable = true)
 |    |    |    |    |-- Acronym: string (nullable = true)
 |    |    |    |    |-- GlossDef: struct (nullable = true)
 |    |    |    |    |    |-- GlossSeeAlso: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = true)
 |    |    |    |    |    |-- para: string (nullable = true)
 |    |    |    |    |-- GlossSee: string (nullable = true)
 |    |    |    |    |-- GlossTerm: string (nullable = true)
 |    |    |    |    |-- ID: string (nullable = true)
 |    |    |    |    |-- SortAs: string (nullable = true)
 |    |    |-- title: string (nullable = true)
 |    |-- title: string (nullable = true)
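
The approach above builds a new DataFrame with an inferred struct schema. For the UDF route the question asks about, the following is a minimal sketch, not a definitive implementation. It assumes every row holds a top-level JSON object and that a map of string keys to string values is acceptable. The names json_to_string_map and make_json_map_udf are hypothetical helpers introduced here for illustration. Nested objects and arrays are kept as JSON text so that every value fits the declared value type.

```python
import json


def json_to_string_map(text):
    """Parse a JSON object string into a flat dict of str -> str.

    Nested objects, arrays, numbers, booleans and nulls are re-serialized
    as JSON text so that every value fits MapType(StringType(), StringType()).
    """
    if text is None:
        return None
    parsed = json.loads(text)
    if not isinstance(parsed, dict):
        return None  # assumption: only top-level JSON objects are mapped
    return {
        str(key): value if isinstance(value, str) else json.dumps(value)
        for key, value in parsed.items()
    }


def make_json_map_udf():
    """Wrap the helper in a Spark UDF.

    pyspark is imported inside the function so the pure-Python helper
    above can be exercised without a Spark installation.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    # The return type is an instance, MapType(...), not the bare class.
    return F.udf(json_to_string_map, MapType(StringType(), StringType()))


# Hypothetical usage with the `subset` DataFrame from the question:
# udf_get_dict = make_json_map_udf()
# subset.select(udf_get_dict("data").alias("data_map")).printSchema()
```

If a UDF is not a hard requirement, the built-in pyspark.sql.functions.from_json with a MapType or StructType schema is usually the more direct option, since it avoids running Python code for each row.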
