Read each JSON object as a single row in a DataFrame using PySpark?
Posted: 2020-05-13 08:01:59

Question: I have the following JSON file:
"name":"John", "age":31, "city":"New York"
"name":"Henry", "age":41, "city":"Boston"
"name":"Dave", "age":26, "city":"New York"
So I need to read each JSON line into the DataFrame as a single row, keeping the raw JSON string of each object in one column.
I have tried the following code:
from pyspark.sql import SparkSession
spark = SparkSession \
.builder \
.appName('Read Json') \
.getOrCreate()
df = spark.read.format('json').load('sample_json')
df.show()
But this only gives me the parsed name, age and city columns, without the raw JSON string per row. Please help me. Thanks in advance.
Answer 1:
Read the file as json, then use the to_json function to build a Json_column.
1. Using the to_json function:
from pyspark.sql.functions import *
spark.read.json("sample.json").\
    withColumn("Json_column", to_json(struct(col("age"), col("city"), col("name")))).\
    show(10, False)
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+
#or, more dynamically:
df = spark.read.json("sample.json")
df.withColumn("Json_column", to_json(struct([col(c) for c in df.columns]))).show(10, False)
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+
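What to_json(struct(...)) emits per row can be reproduced without Spark. A minimal pure-Python sketch, assuming Spark's alphabetical field order (age, city, name, as in the output above) and its compact separators:

```python
import json

# Sketch of what to_json(struct(...)) produces for one row: a compact JSON
# string built from the selected columns. Spark's inferred schema orders
# fields alphabetically (age, city, name), which sorted() mimics here.
row = {"name": "John", "age": 31, "city": "New York"}
json_column = json.dumps(
    {k: row[k] for k in sorted(row)},  # age, city, name
    separators=(",", ":"),             # no spaces, like Spark's to_json
)
print(json_column)  # {"age":31,"city":"New York","name":"John"}
```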
2. Another approach, using the get_json_object function:
Read the json file as text, then create the name, age and city columns by extracting each field from the json object in every line.
from pyspark.sql.functions import *
spark.read.text("sample.json").\
    withColumn("name", get_json_object(col("value"), "$.name")).\
    withColumn("city", get_json_object(col("value"), "$.city")).\
    withColumn("age", get_json_object(col("value"), "$.age")).\
    withColumnRenamed("value", "Json_column").\
    select("age", "city", "name", "Json_column").\
    show(10, False)
#+---+--------+-----+--------------------------------------------+
#|age|city    |name |Json_column                                 |
#+---+--------+-----+--------------------------------------------+
#|31 |New York|John |{"name":"John", "age":31, "city":"New York"}|
#|41 |Boston  |Henry|{"name":"Henry", "age":41, "city":"Boston"} |
#|26 |New York|Dave |{"name":"Dave", "age":26, "city":"New York"}|
#+---+--------+-----+--------------------------------------------+
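The second approach can also be sketched without Spark: keep each raw line as Json_column and pull individual fields out of it, the way get_json_object(col, "$.name") extracts one field from a JSON string. The helper get_json_field below is hypothetical, introduced only to mirror get_json_object's string-typed results:

```python
import json

raw_lines = [
    '{"name":"John", "age":31, "city":"New York"}',
    '{"name":"Henry", "age":41, "city":"Boston"}',
    '{"name":"Dave", "age":26, "city":"New York"}',
]

def get_json_field(json_str, field):
    # get_json_object returns a string (or null); mimic that by stringifying.
    value = json.loads(json_str).get(field)
    return None if value is None else str(value)

# One dict per row: extracted columns plus the untouched raw line.
rows = [
    {
        "age": get_json_field(line, "age"),
        "city": get_json_field(line, "city"),
        "name": get_json_field(line, "name"),
        "Json_column": line,
    }
    for line in raw_lines
]
print(rows[0]["name"], rows[0]["age"])  # John 31
```

Note that the extracted age is a string ("31"), matching get_json_object's behavior; cast it if you need a numeric column.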
Comments:
You could also use a list comprehension: to_json(struct([df[col] for col in df.columns]))
@BeardAspirant, yes.. added that part as well!