PySpark read from excel with only one column in json format
Posted: 2020-05-25 11:14:51

Question: I have data stored in an excel file, but one of its columns is in json format. I want to flatten this column. So far I have tried the approach below.
First, here is the input data I am starting from and the output I would like to get:
Input data:
[Row(point='["\\"data\\":\\"state\\":\\"IL\\"","\\"data\\":\\"state\\":\\"CA\\"","\\"data\\":\\"pop\\":\\"100\\",\\"band\\":\\"Rock\\"","\\"data\\":\\"pop\\":\\"200\\",\\"band\\":\\"Melody\\"","\\"data\\":\\"pop\\":\\"300\\",\\"band\\":\\"Race\\""]', id='1abc'),
Row(point='["\\"data\\":\\"state\\":\\"IL\\"","\\"data\\":\\"state\\":\\"CA\\"","\\"data\\":\\"pop\\":\\"400\\",\\"band\\":\\"Rock\\"","\\"data\\":\\"pop\\":\\"500\\",\\"band\\":\\"Jazz\\"","\\"data\\":\\"pop\\":\\"500\\",\\"band\\":\\"Loops\\""]', id='2cde')]
Expected output:
id = 1abc, state = IL, pop = None, band = None
id = 1abc, state = CA, pop = None, band = None
id = 1abc, state = None, pop = 100, band = Rock
id = 1abc, state = None, pop = 200, band = Melody
id = 1abc, state = None, pop = 300, band = Race
id = 2cde, state = IL, pop = None, band = None
id = 2cde, state = CA, pop = None, band = None
id = 2cde, state = None, pop = 400, band = Rock
id = 2cde, state = None, pop = 500, band = Jazz
id = 2cde, state = None, pop = 500, band = Loops
The code so far:
# Imports assumed by the snippets below
import pandas as pd
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, regexp_replace, from_json

# Read as pandas
pd_df = pd.read_excel('test.xlsx')

# Convert to spark df
schema = StructType([StructField("point", StringType(), True),
                     StructField("id", StringType(), True)])
df = spark.createDataFrame(pd_df, schema=schema)
# Do some cleaning to remove \\ and quotes
a = df.withColumn('point', regexp_replace(col('point'), "\\\\", ""))
b = a.withColumn('point', regexp_replace(col('point'), '","', ','))
c = b.withColumn('point', regexp_replace(col('point'), '\\["', '['))
d = c.withColumn('point', regexp_replace(col('point'), '\\"]', ']'))
# after cleaning
d.take(2)
[Row(point='["data":"state":"IL","data":"state":"CA","data":"pop":"100","band":"Rock","data":"pop":"200","band":"Melody","data":"pop":"300","band":"Race"]', id='1abc'),
Row(point='["data":"state":"IL","data":"state":"CA","data":"pop":"400","band":"Rock","data":"pop":"500","band":"Jazz","data":"pop":"500","band":"Loops"]', id='2cde')]
# Flatten the point column
point_schema = StructType([StructField("state", StringType(), True),
                           StructField("band", StringType(), True),
                           StructField("pop", IntegerType(), True)])
final_df = d.withColumn('point', from_json('point', point_schema))
Even though point_schema is specified, the resulting point column in final_df is always None. I am not sure why it returns None. Any help would be appreciated.
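(A minimal sketch of the symptom, assuming a SparkSession named spark is available: from_json does not raise an error when a string cannot be parsed against the supplied schema; in the default PERMISSIVE mode it simply yields null, or a struct of nulls depending on the Spark version, so a mismatch between the schema and the actual shape of the JSON silently turns the whole column into None.)

from pyspark.sql.functions import from_json, lit

# Hypothetical illustration: a string that cannot be parsed against the schema
demo = spark.range(1).select(
    from_json(lit('not valid json'), 'state STRING, pop INT').alias('point'))
demo.show()  # the parsed column comes back empty instead of raising an error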
Solution 1: Use this -
final_df = d.withColumn('point', from_json('point', lit('array<struct<data:struct<band:string,pop:string,state:string>>>')))
Alternatively, you can change your schema as below -
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

point_schema = ArrayType(StructType([
    StructField("data",
                StructType([
                    StructField("state", StringType(), True),
                    StructField("band", StringType(), True),
                    StructField("pop", StringType(), True)
                ]), True)
]))
final_df = d.withColumn('point', from_json('point', point_schema))
Note: do not change the type of pop to int in the schema, as that will make the whole from_json(...) expression evaluate to null, because the value of the pop field is a string inside the JSON string.
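To go from the parsed array all the way to the flat rows shown in the expected output, one option (a sketch that assumes the ArrayType schema above; the alias names are illustrative) is to explode the array and pull the nested fields up to top-level columns, casting pop to int only after parsing:

from pyspark.sql.functions import col, explode

flat_df = (final_df
           .withColumn('point', explode('point'))  # one row per struct in the array
           .select('id',
                   col('point.data.state').alias('state'),
                   col('point.data.pop').cast('int').alias('pop'),  # cast after parsing as string
                   col('point.data.band').alias('band')))
flat_df.show()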