pyspark 数据框连接操作和写为 json 非常慢
Posted
技术标签:
【中文标题】pyspark 数据框连接操作和写为 json 非常慢【英文标题】:pyspark data frame join operation and write as json is very slow 【发布时间】:2021-07-03 20:45:30 【问题描述】:我有两个大型 spark 数据框,用户详细信息和用户关系。 两个数据框都有超过 20M 的记录。
数据框joinJoin操作非常非常慢。请帮助我提高加入性能。
DF 详细信息
user_df
一个用户应该有多个地址。 一个用户+地址应该有多个网络架构
root
|-- USERIDENTIFIER: string (nullable = true)
|-- ADDRESSIDENTIFIER: string (nullable = true)
|-- LATITUDE: decimal(15,12) (nullable = true)
|-- LONGITUDE: decimal(15,12) (nullable = true)
|-- USERNAME: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
样本数据
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|USERIDENTIFIER |ADDRESSIDENTIFIER |LATITUDE |LONGITUDE |USERNAME |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|C9BBB242202692B589DC5E6AD1040229|0B60DA9CB69084711BC119CB7DB5A120|33.779730000000|-117.867278000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0EBFEB7F15B503D7F34BA4650E561D4B|33.804552000000|-118.067973000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0DC22C71A345C6750158E88D98D6671D|33.701665000000|-117.956545000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|086E9420C60A7D037FB127727967337B|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0E30A48D4829E093E60C7026351DFA04|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|33.610987000000|-117.712710000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|33.569658000000|-117.726847000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|33.439137000000|-117.621570000000|Mary A Wilkinson |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson |06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
user_relationship_df
一个用户地址可以与另一个用户有关系(所有地址)架构
root
|-- PARENTUSERIDENTIFIER: string (nullable = true)
|-- PARENTADDRESSIDENTIFIER: string (nullable = true)
|-- CHILDUSERIDENTIFIER: string (nullable = true)
|-- NETWORKIDENTIFIER: string (nullable = true)
样本数据
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|PARENTUSERIDENTIFIER |PARENTADDRESSIDENTIFIER |CHILDUSERIDENTIFIER |NETWORKIDENTIFIER |
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
我必须执行波纹管连接才能获取所有父子关系详细信息。
user_relationship_df = user_df.alias('U1').join(
user_relationship_df.alias('R'),
[
f_col('U1.UserIdentifier') == f_col('R.parentUserIdentifier'),
f_col('U1.addressIdentifier') == f_col('R.parentAddressIdentifier'),
f_col('U1.networkIdentifier') == f_col('R.networkIdentifier'),
],
'inner'
).join(
user_df.alias('U2'),
[
f_col('U2.UserIdentifier') == f_col('R.childUserIdentifier'),
f_col('U2.networkIdentifier') == f_col('R.networkIdentifier')
]
).select(
f_col('U1.UserIdentifier').alias('parentUserIdentifier'),
f_col('U1.AddressIdentifier').alias('parentAddressIdentifier'),
f_col('U2.UserName').alias('parentUserName'),
f_col('U1.latitude').alias('parentlatitude'),
f_col('U1.longitude').alias('parentlongitude'),
f_col('U2.UserIdentifier').alias('childUserIdentifier'),
f_col('U2.AddressIdentifier').alias('childadressIdentifier'),
f_col('U2.latitude').alias('childlatitude'),
f_col('U2.longitude').alias('childlongitude'),
f_col('U2.UserName').alias('childUserName'),
f_col('R.networkIdentifier').alias('networkIdentifier')
)
上面的join操作很慢。如何提高性能?
下面是join后的写操作。
加入后的预期输出
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|parentUserIdentifier|parentAddressIdentifier| parentUserName| parentlatitude| parentlongitude| childUserIdentifier|childadressIdentifier| childlatitude| childlongitude| childUserName| networkIdentifier|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| 0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...| A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
user_relationship_details_df.coalesce(1000).write.option(
'maxRecordsPerFile', 100000).mode('overwrite').format('json').save('hdfs_path')
用来执行的pyspark命令。
spark-submit script.py
【问题讨论】:
合并 1 .... Coalesce 导致性能过低。请尝试根据数据量或大小增加。 @Srinivas 我已经将 coalesce 增加到 1000。但是没有太大的改进 @mck 你能帮帮我吗? 你能发布一些示例数据吗?两张桌子 【参考方案1】:我可以建议你尝试两件事
-
尝试将合并中的分区数量增加一个以上,以启用并行性,这将提高写入操作的性能。
2.尝试使用partitionby并给出分区列,这将提高写操作的性能
【讨论】:
写没什么大不了的。听说加入速度越来越慢。【参考方案2】:如果问题在于写入性能,则不要使用coalesce()
,而是使用repartition()
。 Coalesce() 旨在减少分区数量,其中repartition()
可用于增加分区数量,从而提高并行度。如果您将重新分区计数增加得非常大,除非您有足够的资源,否则不会真正有帮助。关于coalesce vs repartition的更多详情
【讨论】:
加入运行缓慢,不是写操作。 这里有一些关于优化连接的指南:databricks.com/session/optimizing-apache-spark-sql-joins 另外,如果您的一张表很小,那么您可以进行 broadcast 连接。以上是关于pyspark 数据框连接操作和写为 json 非常慢的主要内容,如果未能解决你的问题,请参考以下文章
带有 json 的 Pyspark 数据框,迭代以创建新的数据框