Pivot row to column level

Posted 2023-04-15


【Title】Pivot row to column level 【Posted】2020-09-01 09:27:01 【Question】:

I have a Spark dataframe t, which is the result of a spark.sql("...") query. Here are the first few rows of t:

| yyyy_mm_dd | x_id | x_name      | b_app   | status        | has_policy | count |
|------------|------|-------------|---------|---------------|------------|-------|
| 2020-08-18 | 1    | first_name  | content | no_contact    | 1          | 23    |
| 2020-08-18 | 1    | first_name  | content | no_contact    | 0          | 346   |
| 2020-08-18 | 2    | second_name | content | implemented   | 1          | 64    |
| 2020-08-18 | 2    | second_name | content | implemented   | 0          | 5775  |
| 2020-08-18 | 3    | third_name  | content | implemented   | 1          | 54    |
| 2020-08-18 | 3    | third_name  | content | implemented   | 0          | 368   |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 1          | 88    |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 0          | 659   |

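There are two rows per x_id because of the grouping on has_policy. I would like to pivot has_policy and count into columns so that I get one row per x_id. This is what the output should look like: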

| yyyy_mm_dd | x_id | x_name      | b_app   | status        | has_policy_count | has_no_policy_count |
|------------|------|-------------|---------|---------------|------------------|---------------------|
| 2020-08-18 | 1    | first_name  | content | no_contact    | 23               | 346                 |
| 2020-08-18 | 2    | second_name | content | implemented   | 64               | 5775                |
| 2020-08-18 | 3    | third_name  | content | implemented   | 54               | 368                 |
| 2020-08-18 | 4    | fourth_name | content | first_contact | 88               | 659                 |

I'm not sure whether this is easier to achieve by converting to Pandas first, or whether we can operate on the Spark df directly to get the same result.

Data types:

t.dtypes
[('yyyy_mm_dd', 'date'),
 ('xml_id', 'int'),
 ('xml_name', 'string'),
 ('b_app', 'string'),
 ('status', 'string'),
 ('has_policy', 'bigint'),
 ('count', 'bigint')]

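【Comments】:

Since you tagged this with pandas, would a pandas-based solution work for you?

@ShubhamSharma Yes, that's fine. Either a pyspark or a pandas solution would work here.

Note, however, that t = t.toPandas() adds an index to the dataframe, so be careful.

【Answer 1】: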

Assume df is your dataframe. Once you have read the documentation, pivot is quite straightforward to use.

df.groupBy(
    "yyyy_mm_dd", "x_id", "x_name", "b_app", "status"
).pivot("has_policy", [0, 1]).sum("count")

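【Comments】:

I got "string indices must be integers" on .pivot(). I have added the dtypes to the original question.

@Someguywhocodes You probably changed something in my code, because it works fine with your sample data... Try removing the , [0, 1] part.

@Someguywhocodes By the way, this is pyspark code, not pandas.

I had forgotten the comma before the []. It works great, thanks!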
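The error in the comments can be reproduced without Spark. Without the comma, Python reads "has_policy"[0, 1] as indexing the string with the tuple (0, 1), which raises a TypeError before pivot is ever called. A minimal sketch, where `pivot_args` is a hypothetical stand-in used only to show how the arguments are parsed, not part of the Spark API:

```python
column = "has_policy"

# Missing comma: this indexes the string with the tuple (0, 1).
try:
    column[0, 1]
    message = None
except TypeError as err:
    message = str(err)  # begins with "string indices must be integers"


def pivot_args(name, values=None):
    """Hypothetical stand-in for .pivot(); it only echoes its arguments."""
    return name, values


# With the comma, the column name and the value list are separate arguments.
args = pivot_args(column, [0, 1])
```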
