How to make a join between DataFrames in PySpark
Posted: 2021-03-08 16:48:52

Question: I have two DataFrames, DF1 and DF2, with the following contents:
df1:
line_item_usage_account_id  line_item_unblended_cost  name
100000000001                12.05                     account1
200000000001                52                        account2
300000000003                12.03                     account3
df2:
accountname  accountproviderid  clustername  app_pmo  app_costcenter
account1     100000000001       cluster1     111111   11111111
account2     200000000001       cluster2     222222   22222222
I need to join on the fields df1.line_item_usage_account_id and df2.accountproviderid. When the two fields hold the same ID, the value of DF1's line_item_unblended_cost column must be appended to the matching DF2 row. When a value of DF1's line_item_usage_account_id does not appear in DF2's accountproviderid column, the df1 fields must still be included, like this:
accountname  accountproviderid  clustername  app_pmo  app_costcenter  line_item_unblended_cost
account1     100000000001       cluster1     111111   11111111        12.05
account2     200000000001       cluster2     222222   22222222        52
account3     300000000003       NA           NA       NA              12.03
The account3 row is appended to the end of the new DataFrame, with DF2's columns filled in as "NA".
Any help appreciated, thanks in advance.
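The requested behavior is a left outer join keyed on the account ID, with DF2's columns defaulting to "NA" for unmatched IDs. A minimal plain-Python sketch of that logic (using hypothetical in-memory rows mirroring the question's sample data, not Spark):

```python
# df1 rows: (id, cost, name); df2 keyed by accountproviderid.
df1 = [
    {"line_item_usage_account_id": 100000000001, "line_item_unblended_cost": 12.05, "name": "account1"},
    {"line_item_usage_account_id": 200000000001, "line_item_unblended_cost": 52.00, "name": "account2"},
    {"line_item_usage_account_id": 300000000003, "line_item_unblended_cost": 12.03, "name": "account3"},
]
df2 = {
    100000000001: {"clustername": "cluster1", "app_pmo": 111111, "app_costcenter": 11111111},
    200000000001: {"clustername": "cluster2", "app_pmo": 222222, "app_costcenter": 22222222},
}

result = []
for row in df1:
    # Take df2's columns when the ID matches; otherwise fill with "NA".
    extra = df2.get(row["line_item_usage_account_id"],
                    {"clustername": "NA", "app_pmo": "NA", "app_costcenter": "NA"})
    result.append({
        "accountname": row["name"],
        "accountproviderid": row["line_item_usage_account_id"],
        **extra,
        "line_item_unblended_cost": row["line_item_unblended_cost"],
    })
```

Every df1 row survives (that is the "left" side of the join); only the unmatched account3 row gets the "NA" placeholders.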
Comments:
Answer 1:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    [100000000001, 12.05, 'account1'],
    [200000000001, 52.00, 'account2'],
    [300000000003, 12.03, 'account3']],
    schema=['line_item_usage_account_id', 'line_item_unblended_cost', 'name'])
df1.show()
df1.printSchema()

df2 = spark.createDataFrame([
    ['account1', 100000000001, 'cluster1', 111111, 11111111],
    ['account2', 200000000001, 'cluster2', 222222, 22222222]],
    schema=['accountname', 'accountproviderid', 'clustername', 'app_pmo', 'app_costcenter'])
df2.printSchema()
df2.show()

cols = ['name', 'line_item_usage_account_id', 'clustername', 'app_pmo',
        'app_costcenter', 'line_item_unblended_cost']

# Left outer join keeps every df1 row; df2 columns are null where no ID matches.
resDF = (df1.join(df2, df1.line_item_usage_account_id == df2.accountproviderid, 'leftouter')
            .select(*cols)
            .withColumnRenamed('name', 'accountname')
            .withColumnRenamed('line_item_usage_account_id', 'accountproviderid')
            .orderBy('accountname'))
resDF.printSchema()
# |-- accountname: string (nullable = true)
# |-- accountproviderid: long (nullable = true)
# |-- clustername: string (nullable = true)
# |-- app_pmo: long (nullable = true)
# |-- app_costcenter: long (nullable = true)
# |-- line_item_unblended_cost: double (nullable = true)
resDF.show()
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# |accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
# | account1| 100000000001| cluster1| 111111| 11111111| 12.05|
# | account2| 200000000001| cluster2| 222222| 22222222| 52.0|
# | account3| 300000000003| null| null| null| 12.03|
# +-----------+-----------------+-----------+-------+--------------+------------------------+
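One gap between this output and the expected table in the question: the unmatched account3 row shows null where the question asked for the string "NA". In PySpark, string columns can be filled with `resDF.na.fill('NA')` (a string fill only applies to string columns, so numeric columns such as app_pmo would first need a cast). The fill semantics themselves, sketched row-wise in plain Python on hypothetical rows:

```python
# Replace None placeholders with the string "NA", matching the question's
# expected table. This mimics what na.fill('NA') does for string columns.
rows = [
    {"accountname": "account1", "clustername": "cluster1", "app_pmo": 111111},
    {"accountname": "account3", "clustername": None, "app_pmo": None},
]

def fill_na(row, placeholder="NA"):
    return {k: placeholder if v is None else v for k, v in row.items()}

filled = [fill_na(r) for r in rows]
```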
Comments: