Merging two dataframes in PySpark

Posted 2023-04-17

【Title】: Merging two dataframes pyspark 【Posted】: 2017-01-26 15:45:31 【Question】:

I have 2 input files:

a) An original file (orig_file.json) that contains records like these:

"id": 1, "app": test_a, "description": test_app_a 
"id": 2, "app": test_b, "description": test_app_b 
"id": 3, "app": test_c, "description": test_app_c 
"id": 4, "app": test_d, "description": test_app_d 
"id": 5, "app": test_e, "description": test_app_e

b) A "deltas" file (deltas_file.json) that contains records like these:

"id": 1, "app": test_aaaxxx, "description": test_app_aaaxxx 
"id": 6, "app": test_ffffff, "description": test_app_ffffff

I am trying to merge the two files (original file + deltas file) so that the result looks like this:

"id": 1, "app": test_aaaxxx, "description": test_app_aaaxxx 
"id": 2, "app": test_b, "description": test_app_b 
"id": 3, "app": test_c, "description": test_app_c 
"id": 4, "app": test_d, "description": test_app_d 
"id": 5, "app": test_e, "description": test_app_e 
"id": 6, "app": test_ffffff, "description": test_app_ffffff

In other words, merge the original file with the deltas file by adding any new apps and updating only the records that already exist.

I have tried different kinds of joins but could not get to a solution.

Could someone point me to a way of solving this? Thanks.
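
Stated without Spark, the requirement is an upsert keyed on id: a delta replaces the original record with the same id, and a delta with an unseen id is appended. A minimal plain-Python sketch of those semantics follows. It assumes each record is a dict and that id is unique within each file; the helper name upsert_by_id is hypothetical and only used for illustration.

```python
def upsert_by_id(orig_records, delta_records):
    """Return the original records updated with the deltas, keyed on "id"."""
    merged = {rec["id"]: rec for rec in orig_records}
    # A later assignment wins, so a delta replaces an original with the same id
    # and a delta with a new id is simply added.
    for rec in delta_records:
        merged[rec["id"]] = rec
    return [merged[key] for key in sorted(merged)]


# Sample data taken from the question (values quoted as strings).
orig = [
    {"id": 1, "app": "test_a", "description": "test_app_a"},
    {"id": 2, "app": "test_b", "description": "test_app_b"},
    {"id": 3, "app": "test_c", "description": "test_app_c"},
    {"id": 4, "app": "test_d", "description": "test_app_d"},
    {"id": 5, "app": "test_e", "description": "test_app_e"},
]
deltas = [
    {"id": 1, "app": "test_aaaxxx", "description": "test_app_aaaxxx"},
    {"id": 6, "app": "test_ffffff", "description": "test_app_ffffff"},
]

result = upsert_by_id(orig, deltas)
for rec in result:
    print(rec)
```

Any Spark solution should reproduce this result: six rows, with id 1 carrying the delta values.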

【Answer 1】:

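Full outer join and coalesce (a left outer join starting from deltas would drop the origin rows that have no delta):

from pyspark.sql.functions import coalesce

# Keep ids from either frame; prefer the delta value when one exists.
deltas.join(origin, ["id"], "outer") \
  .select("id",
      coalesce(deltas["app"], origin["app"]).alias("app"),
      coalesce(deltas["description"], origin["description"]).alias("description"))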

【Comments】:

When I run the above command, it fails with an error saying the DataFrame object is not callable.

【Answer 2】:

Try a pandas merge in Python.

import pandas as pd
# create your data frames here
pd.merge(delta_frame, orig_frame, on="id", how="outer")  # outer merge on id; adjust on= / how= as needed

Hope this helps!
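
If the data is small enough to fit in pandas, one way to get the requested result is to concatenate the frames with the deltas first and keep the first row per id. This is a sketch, not the answer author's exact method; the frame names orig_frame and delta_frame follow the snippet above and the sample rows come from the question.

```python
import pandas as pd

orig_frame = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "app": ["test_a", "test_b", "test_c", "test_d", "test_e"],
        "description": [
            "test_app_a", "test_app_b", "test_app_c", "test_app_d", "test_app_e",
        ],
    }
)
delta_frame = pd.DataFrame(
    {
        "id": [1, 6],
        "app": ["test_aaaxxx", "test_ffffff"],
        "description": ["test_app_aaaxxx", "test_app_ffffff"],
    }
)

merged = (
    # Deltas go first so that keep="first" prefers them over the originals.
    pd.concat([delta_frame, orig_frame])
    .drop_duplicates(subset="id", keep="first")
    .sort_values("id")
    .reset_index(drop=True)
)
print(merged)
```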

【Comments】:

PySpark has several options for doing this merge that do not require a toPandas() call. .join() is probably the most appropriate.
