How to join DataFrames in PySpark

Posted 2023-04-15


【Title】: How to join DataFrames in PySpark 【Posted】: 2021-03-08 16:48:52 【Question】:

I have two DataFrames, df1 and df2, with the following contents:

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter
account1        100000000001        cluster1        111111      11111111
account2        200000000001        cluster2        222222      22222222

I need to join them on df1.line_item_usage_account_id and df2.accountproviderid.

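When both fields hold the same ID, the value of df1's line_item_unblended_cost column must be added to the matching df2 row. When a line_item_usage_account_id value from df1 does not appear in df2's accountproviderid column, the df1 row must still be included in the result, as follows: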

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

The account3 row is appended at the end of the new DataFrame, with the columns that come from df2 filled with "NA".

Thanks in advance for any help.


【Answer 1】:

from pyspark.sql import SparkSession   
spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    [100000000001, 12.05, 'account1'], 
    [200000000001, 52.00, 'account2'], 
    [300000000003, 12.03, 'account3']], 
    schema=['line_item_usage_account_id',  'line_item_unblended_cost', 'name' ])

df1.show()
df1.printSchema()

df2 = spark.createDataFrame([
    ['account1', 100000000001, 'cluster1', 111111, 11111111],
    ['account2', 200000000001, 'cluster2', 222222, 22222222]], 
    schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter'])

df2.printSchema()
df2.show()

# A left outer join keeps every df1 row; df2 columns are null where no match exists.
cols = ['name', 'line_item_usage_account_id', 'clustername', 'app_pmo', 'app_costcenter', 'line_item_unblended_cost']
resDF = (
    df1.join(df2, df1.line_item_usage_account_id == df2.accountproviderid, "leftouter")
    .select(*cols)
    .withColumnRenamed('name', 'accountname')
    .withColumnRenamed('line_item_usage_account_id', 'accountproviderid')
    .orderBy('accountname')
)

resDF.printSchema()
#  |-- accountname: string (nullable = true)
#  |-- accountproviderid: long (nullable = true)
#  |-- clustername: string (nullable = true)
#  |-- app_pmo: long (nullable = true)
#  |-- app_costcenter: long (nullable = true)
#  |-- line_item_unblended_cost: double (nullable = true)

resDF.show()
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
# |   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
# |   account3|     300000000003|       null|   null|          null|                   12.03|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
