PySpark: reading from Excel when only one column is in JSON format



【Title】: PySpark read from excel with only one column in json format 【Posted】: 2020-05-25 11:14:51 【Question】:

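I have data stored in Excel in which only one column is in JSON format. I want to flatten this column. The data, the output I expect, and what I have tried so far are shown below.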



The data looks like this:

[Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"100\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"200\\",\\"band\\":\\"Melody\\"}}","{\\"data\\":{\\"pop\\":\\"300\\",\\"band\\":\\"Race\\"}}"]', id='1abc'),
 Row(point='["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"state\\":\\"CA\\"}}","{\\"data\\":{\\"pop\\":\\"400\\",\\"band\\":\\"Rock\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Jazz\\"}}","{\\"data\\":{\\"pop\\":\\"500\\",\\"band\\":\\"Loops\\"}}"]', id='2cde')]


Expected output:

id = 1abc, state = IL, pop = None, band = None
id = 1abc, state = CA, pop = None, band = None
id = 1abc, state = None, pop = 100, band = Rock
id = 1abc, state = None, pop = 200, band = Melody
id = 1abc, state = None, pop = 300, band = Race
id = 2cde, state = IL, pop = None, band = None
id = 2cde, state = CA, pop = None, band = None
id = 2cde, state = None, pop = 400, band = Rock
id = 2cde, state = None, pop = 500, band = Jazz
id = 2cde, state = None, pop = 500, band = Loops


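import pandas as pd
from pyspark.sql.functions import col, from_json, regexp_replace
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

# Read as pandas
pd_df = pd.read_excel('test.xlsx')

# Convert to spark df (`spark` is an existing SparkSession)
schema = StructType([StructField("point", StringType(), True),
                     StructField("id", StringType(), True)])
df = spark.createDataFrame(pd_df, schema=schema)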

# Do some cleaning to remove the backslashes and the quotes around each object
a = df.withColumn('point', regexp_replace(col('point'), "\\\\", ""))
# Match the braces too, so the "," between fields inside an object is left alone
b = a.withColumn('point', regexp_replace(col('point'), '\\}","\\{', '},{'))
c = b.withColumn('point', regexp_replace(col('point'), '\\["', '['))
d = c.withColumn('point', regexp_replace(col('point'), '\\"]', ']'))

# after cleaning

[Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"100","band":"Rock"}},{"data":{"pop":"200","band":"Melody"}},{"data":{"pop":"300","band":"Race"}}]', id='1abc'),
 Row(point='[{"data":{"state":"IL"}},{"data":{"state":"CA"}},{"data":{"pop":"400","band":"Rock"}},{"data":{"pop":"500","band":"Jazz"}},{"data":{"pop":"500","band":"Loops"}}]', id='2cde')]
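To see what the cleaning step does to a single cell, here is a minimal plain-Python sketch using only the standard library. It assumes each raw cell is a JSON array whose elements are JSON-encoded strings such as {"data":{"state":"IL"}}, and that the separator pattern includes the braces so that the "," between fields inside one object is not touched. `clean_cell` and `decode_cell` are hypothetical helper names, not part of the question.

```python
import json
import re

# Assumed shape of one raw cell: a JSON array whose elements are JSON-encoded strings.
raw = '["{\\"data\\":{\\"state\\":\\"IL\\"}}","{\\"data\\":{\\"pop\\":\\"100\\",\\"band\\":\\"Rock\\"}}"]'

# (pattern, replacement) pairs mirroring a regexp_replace chain, applied in order.
STEPS = [
    ("\\\\", ""),            # drop every backslash
    ('\\}","\\{', "},{"),    # join neighbouring objects; the braces keep this selective
    ('\\["', "["),           # drop the quote after the opening bracket
    ('\\"]', "]"),           # drop the quote before the closing bracket
]


def clean_cell(cell):
    """Hypothetical helper: apply the four replacements to one cell."""
    for pattern, replacement in STEPS:
        cell = re.sub(pattern, replacement, cell)
    return cell


def decode_cell(cell):
    """Alternative without regexes: decode the outer array, then each inner string."""
    return [json.loads(item) for item in json.loads(cell)]


cleaned = clean_cell(raw)
print(cleaned)
print(json.loads(cleaned) == decode_cell(raw))
```

Stripping every backslash is only safe while no value contains an escaped character; decoding in two passes, as `decode_cell` does, avoids that assumption.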

# Flatten the point column

point_schema = StructType([StructField("state", StringType(), True),
                           StructField("band", StringType(), True),
                           StructField("pop", IntegerType(), True)])

final_df = d.withColumn('point', from_json('point', point_schema))

Even though I specify point_schema, the point column in final_df is always None. I am not sure why it returns None. Any help would be appreciated.



Use this (lit comes from pyspark.sql.functions):

final_df = d.withColumn('point', from_json('point', lit('array<struct<data:struct<band:string,pop:string,state:string>>>')))
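The shape of that schema follows from the shape of the cleaned text: the top level is an array, and every element wraps its fields in a data object. A flat struct of state, band and pop does not describe that document, which is one likely reason the question's schema produced None. The plain-Python sketch below walks the same structure with the standard library; it assumes a shortened cleaned cell, and `flatten` is a hypothetical helper used only for illustration.

```python
import json

# Assumed cleaned cell (shortened): an array of objects, each with one "data" field.
cleaned = '[{"data":{"state":"IL"}},{"data":{"pop":"100","band":"Rock"}}]'


def flatten(cell_id, cell):
    """Hypothetical helper: one flat dict per array element, None for absent keys."""
    rows = []
    for element in json.loads(cell):   # top level: array
        data = element["data"]         # each element: struct with a single field, data
        rows.append({
            "id": cell_id,
            "state": data.get("state"),
            "pop": data.get("pop"),    # still a string at this point
            "band": data.get("band"),
        })
    return rows


rows = flatten("1abc", cleaned)
for row in rows:
    print(row)
```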


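The schema can also be written with explicit types (ArrayType comes from pyspark.sql.types):

point_schema = ArrayType(StructType([
      StructField("data", StructType([
          StructField("state", StringType(), True),
          StructField("band", StringType(), True),
          StructField("pop", StringType(), True)
      ]), True)
]))

final_df = d.withColumn('point', from_json('point', point_schema))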

Note: do not change the type of pop to int in the schema. Doing so makes the whole from_json(...) expression return null, because the pop values are strings in the JSON text.

