How to group unioned dataframe to combine same rows

Posted 2023-04-15

Asked: 2019-06-04 20:49:25

Question:

I just unioned two dataframes in pyspark. Instead of combining the rows that share a date, the union stacked them on top of each other, like so:

df1:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
+----------+------------+--------------+

df2:

+----------+-------------+-------------+
|      date| captureCount|  bounceCount|
+----------+-------------+-------------+ 
|  20190516|         null|            3|
|  20190518|         null|            2|
|  20190519|         null|            1|
|  20190524|         null|            5|
+----------+-------------+-------------+

union:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|          null|
|  20190521|           1|          null|
|  20190519|           1|          null|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190518|        null|             2|
|  20190519|        null|             1|
|  20190524|        null|             5|
+----------+------------+--------------+

I want it grouped so that rows with the same date are combined into one row with the correct bounceCount and captureCount:

+----------+------------+--------------+
|      date| bounceCount|  captureCount|
+----------+------------+--------------+ 
|  20190518|           2|             2|
|  20190521|           1|          null|
|  20190519|           1|             1|
|  20190522|           1|          null|
|  20190516|        null|             3|
|  20190524|        null|             5|
+----------+------------+--------------+

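I have tried unioning them in different ways and grouping the dataframe in different ways, but I can't figure it out. I will also be appending several more columns to this dataframe, so I would like to know the best way to do this. Does anyone know a simple way?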

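Comments:

You want to do a join, not a union. See "What are the various join types in Spark?" and "update a dataframe column with new values". Alternatively, you can do a groupBy after the union, but a join is probably more efficient. The column names of df2 in the example may be wrong.

Answer 1:

You can achieve this with an outer join.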

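# Full outer join on date: bounceCount comes from df1, captureCount from df2.
# This assumes df2's counts are in captureCount (the df2 column names shown
# in the question appear to be swapped).
df = (
    df1.select('date', 'bounceCount')
    .join(
        df2.select('date', 'captureCount'),
        on='date', how='outer'
    )
)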

Comments:

Answer 2:

Try this -

Do a full join of the two dataframes and use the coalesce function.

from pyspark.sql.functions import coalesce

joining_condition = [df1.date == df2.date]

df1\
    .join(df2,joining_condition,'full')\
    .select(coalesce(df1.date,df2.date).alias('date')
            ,df1.bounceCount
            ,df2.bounceCount.alias('captureCount'))\
    .show()

#+--------+-----------+------------+
#|    date|bounceCount|captureCount|
#+--------+-----------+------------+
#|20190518|          2|           2|
#|20190519|          1|           1|
#|20190521|          1|        null|
#|20190524|       null|           5|
#|20190522|          1|        null|
#|20190516|       null|           3|
#+--------+-----------+------------+

I think the columns of the df2 dataframe are interchanged. Please check. If that is the case, change the column names in the solution.
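
The comment under the question also mentions grouping after the union as an alternative to a join. As a rough sketch of that idea that needs no Spark, the plain-Python model below groups the unioned rows from the question by date and keeps the first non-null value in each column. The function name `combine_by_date` and the list-of-dicts representation are illustrative assumptions, not part of the original thread. In PySpark, the counterpart would presumably be a `groupBy('date')` with a null-skipping aggregate such as `first(..., ignorenulls=True)` or `max`.

```python
def combine_by_date(rows, key="date"):
    """Group rows by `key`, keeping the first non-None value per column."""
    combined = {}  # dicts preserve insertion order (Python 3.7+)
    for row in rows:
        merged = combined.setdefault(row[key], {key: row[key]})
        for column, value in row.items():
            if column == key:
                continue
            # Fill a column only if no non-null value has been seen for it yet.
            if merged.get(column) is None:
                merged[column] = value
    return list(combined.values())


# The unioned rows from the question (None stands in for null).
unioned = [
    {"date": 20190518, "bounceCount": 2, "captureCount": None},
    {"date": 20190521, "bounceCount": 1, "captureCount": None},
    {"date": 20190519, "bounceCount": 1, "captureCount": None},
    {"date": 20190522, "bounceCount": 1, "captureCount": None},
    {"date": 20190516, "bounceCount": None, "captureCount": 3},
    {"date": 20190518, "bounceCount": None, "captureCount": 2},
    {"date": 20190519, "bounceCount": None, "captureCount": 1},
    {"date": 20190524, "bounceCount": None, "captureCount": 5},
]

result = combine_by_date(unioned)
for row in result:
    print(row)
```

Each date ends up in a single row, and a column stays null only when no input row for that date had a value for it. This matches the desired table in the question.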
