pyspark 数据框连接操作和写为 json 非常慢

Posted

技术标签:

【中文标题】pyspark 数据框连接操作和写为 json 非常慢【英文标题】:pyspark data frame join operation and write as json is very slow 【发布时间】:2021-07-03 20:45:30 【问题描述】:

我有两个大型 spark 数据框,用户详细信息和用户关系。 两个数据框都有超过 20M 的记录。

数据框joinJoin操作非常非常慢。请帮助我提高加入性能。

DF 详细信息

user_df

一个用户应该有多个地址。 一个用户+地址应该有多个网络

架构

root
 |-- USERIDENTIFIER: string (nullable = true)
 |-- ADDRESSIDENTIFIER: string (nullable = true)
 |-- LATITUDE: decimal(15,12) (nullable = true)
 |-- LONGITUDE: decimal(15,12) (nullable = true)
 |-- USERNAME: string (nullable = true)
 |-- NETWORKIDENTIFIER: string (nullable = true)

样本数据

+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|USERIDENTIFIER                  |ADDRESSIDENTIFIER               |LATITUDE       |LONGITUDE        |USERNAME                  |NETWORKIDENTIFIER               |
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+
|C9BBB242202692B589DC5E6AD1040229|0B60DA9CB69084711BC119CB7DB5A120|33.779730000000|-117.867278000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0EBFEB7F15B503D7F34BA4650E561D4B|33.804552000000|-118.067973000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0DC22C71A345C6750158E88D98D6671D|33.701665000000|-117.956545000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|086E9420C60A7D037FB127727967337B|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|C9BBB242202692B589DC5E6AD1040229|0E30A48D4829E093E60C7026351DFA04|33.780334000000|-117.863353000000|CHOC CHILDRENS SPECIALISTS|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|33.610987000000|-117.712710000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|33.569658000000|-117.726847000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|33.439137000000|-117.621570000000|Mary A Wilkinson          |0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|33.665445000000|-117.761503000000|Mary A Wilkinson          |06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+---------------+-----------------+--------------------------+--------------------------------+

user_relationship_df

一个用户地址可以与另一个用户有关系(所有地址)

架构

 root
     |-- PARENTUSERIDENTIFIER: string (nullable = true)
     |-- PARENTADDRESSIDENTIFIER: string (nullable = true)
     |-- CHILDUSERIDENTIFIER: string (nullable = true)
     |-- NETWORKIDENTIFIER: string (nullable = true)

样本数据

+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|PARENTUSERIDENTIFIER            |PARENTADDRESSIDENTIFIER         |CHILDUSERIDENTIFIER             |NETWORKIDENTIFIER               |
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+
|97B177D33281DF30AFC9924294D1973D|A8EFFB0D29B8628B9A3E993490FF6F8F|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|33CF41F3F7AC69029EDD664DF569AE41|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|97612FEFFD5EA7664E566161CE9318EF|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|0364142E829F4B9384C8023C3BD7194B|
|97B177D33281DF30AFC9924294D1973D|0779CA3DCA30B801B55AB8FE8EFE8E77|C9BBB242202692B589DC5E6AD1040229|06702F3EAF8846A450AB8A6DF93E8227|
+--------------------------------+--------------------------------+--------------------------------+--------------------------------+

我必须执行波纹管连接才能获取所有父子关系详细信息。

user_relationship_df = user_df.alias('U1').join(
        user_relationship_df.alias('R'),
              [
                  f_col('U1.UserIdentifier') == f_col('R.parentUserIdentifier'),
                  f_col('U1.addressIdentifier') == f_col('R.parentAddressIdentifier'),
                  f_col('U1.networkIdentifier') == f_col('R.networkIdentifier'),
              ],
        'inner'
    ).join(
        user_df.alias('U2'),
        [
            f_col('U2.UserIdentifier') == f_col('R.childUserIdentifier'),
            f_col('U2.networkIdentifier') == f_col('R.networkIdentifier')
        ]
    ).select(
        f_col('U1.UserIdentifier').alias('parentUserIdentifier'),
        f_col('U1.AddressIdentifier').alias('parentAddressIdentifier'),
        f_col('U2.UserName').alias('parentUserName'),
        f_col('U1.latitude').alias('parentlatitude'),
        f_col('U1.longitude').alias('parentlongitude'),
        f_col('U2.UserIdentifier').alias('childUserIdentifier'),
        f_col('U2.AddressIdentifier').alias('childadressIdentifier'),
        f_col('U2.latitude').alias('childlatitude'),
        f_col('U2.longitude').alias('childlongitude'),
        f_col('U2.UserName').alias('childUserName'),
        f_col('R.networkIdentifier').alias('networkIdentifier')
    )

上面的join操作很慢。如何提高性能?

下面是join后的写操作。

加入后的预期输出

+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|parentUserIdentifier|parentAddressIdentifier|      parentUserName| parentlatitude|  parentlongitude| childUserIdentifier|childadressIdentifier|  childlatitude|   childlongitude|       childUserName|   networkIdentifier|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   33CF41F3F7AC69029...|CHOC CHILDRENS SP...|33.610987000000|-117.712710000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   97612FEFFD5EA7664...|CHOC CHILDRENS SP...|33.569658000000|-117.726847000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   0779CA3DCA30B801B...|CHOC CHILDRENS SP...|33.665445000000|-117.761503000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0E30A48D4829E093E...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 086E9420C60A7D037...|33.780334000000|-117.863353000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0DC22C71A345C6750...|33.701665000000|-117.956545000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0EBFEB7F15B503D7F...|33.804552000000|-118.067973000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
|97B177D33281DF30A...|   A8EFFB0D29B8628B9...|CHOC CHILDRENS SP...|33.439137000000|-117.621570000000|C9BBB242202692B58...| 0B60DA9CB69084711...|33.779730000000|-117.867278000000|CHOC CHILDRENS SP...|0364142E829F4B938...|
+--------------------+-----------------------+--------------------+---------------+-----------------+--------------------+---------------------+---------------+-----------------+--------------------+--------------------+
user_relationship_details_df.coalesce(1000).write.option(
            'maxRecordsPerFile', 100000).mode('overwrite').format('json').save('hdfs_path')

用来执行的pyspark命令。

spark-submit script.py

【问题讨论】:

合并 1 .... Coalesce 导致性能过低。请尝试根据数据量或大小增加。 @Srinivas 我已经将 coalesce 增加到 1000。但是没有太大的改进 @mck 你能帮帮我吗? 你能发布一些示例数据吗?两张桌子 【参考方案1】:

我可以建议你尝试两件事

    尝试将合并中的分区数量增加一个以上,以启用并行性,这将提高写入操作的性能。 2.尝试使用partitionby并给出分区列,这将提高写操作的性能

【讨论】:

写没什么大不了的。听说加入速度越来越慢。【参考方案2】:

如果问题在于写入性能,则不要使用coalesce(),而是使用repartition()。 Coalesce() 旨在减少分区数量,其中repartition() 可用于增加分区数量,从而提高并行度。如果您将重新分区计数增加得非常大,除非您有足够的资源,否则不会真正有帮助。关于coalesce vs repartition的更多详情

【讨论】:

加入运行缓慢,不是写操作。 这里有一些关于优化连接的指南:databricks.com/session/optimizing-apache-spark-sql-joins 另外,如果您的一张表很小,那么您可以进行 broadcast 连接。

以上是关于pyspark 数据框连接操作和写为 json 非常慢的主要内容,如果未能解决你的问题,请参考以下文章

将嵌套的 Json 转换为 Pyspark 中的数据框

如何对 pyspark 数据框执行连接操作?

将 JSON 多行文件加载到 pyspark 数据框中

带有 json 的 Pyspark 数据框,迭代以创建新的数据框

如何将 json 对象列表转换为单个 pyspark 数据框?

使用 json 字符串值和模式创建 pyspark 数据框