Spark-sql pivoting not working as expected, at least for the given bulk data


Posted: 2019-07-04 17:01:20

【Question】

Pivoting does not work as expected in most cases, i.e., once the number of source-table records grows.


source_df
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|model_family_id|classification_type|classification_value|benchmark_type_code|          data_date|data_item_code|data_item_value_numeric|data_item_value_string|fiscal_year|fiscal_quarter|        create_date|last_update_date|create_user_txt|update_user_txt|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|              1|            COUNTRY|                 HKG|               MEAN|2017-12-31 00:00:00|   CREDITSCORE|                     13|                   bb-|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|            OBS_CNT|2017-12-31 00:00:00|   CREDITSCORE|                    649|                    aa|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|         OBS_CNT_CA|2017-12-31 00:00:00|   CREDITSCORE|                    649|                  null|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|       PERCENTILE_0|2017-12-31 00:00:00|   CREDITSCORE|                      3|                    aa|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_10|2017-12-31 00:00:00|   CREDITSCORE|                      8|                  bbb+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|     PERCENTILE_100|2017-12-31 00:00:00|   CREDITSCORE|                     23|                     d|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_25|2017-12-31 00:00:00|   CREDITSCORE|                     11|                   bb+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_50|2017-12-31 00:00:00|   CREDITSCORE|                     14|                    b+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_75|2017-12-31 00:00:00|   CREDITSCORE|                     15|                     b|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
|              1|            COUNTRY|                 HKG|      PERCENTILE_90|2017-12-31 00:00:00|   CREDITSCORE|                     17|                  ccc+|       2017|             4|2018-03-31 14:04:18|            null|           LOAD|           null|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+

I tried the code below:

val pivot_df = source_df
  .groupBy("model_family_id", "classification_type", "classification_value", "data_item_code",
    "data_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")
  .pivot("benchmark_type_code",
    Seq("mean", "obs_cnt", "obs_cnt_ca", "percentile_0", "percentile_10", "percentile_25",
      "percentile_50", "percentile_75", "percentile_90", "percentile_100"))
  .agg(first(
    when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
      .otherwise(col("data_item_value_numeric"))
  ))

My result, shown below, is not what I expected, and I'm not sure what is wrong with my code.


+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|model_family_id|classification_type|classification_value|data_item_code|          data_date|fiscal_year|fiscal_quarter|create_user_txt|        create_date|mean|obs_cnt|obs_cnt_ca|percentile_0|percentile_10|percentile_25|percentile_50|percentile_75|percentile_90|percentile_100|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|              1|            COUNTRY|                 HKG|   CREDITSCORE|2017-12-31 00:00:00|       2017|             4|           LOAD|2018-03-31 14:04:18|null|   null|      null|        null|         null|         null|         null|         null|         null|          null|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+

I also tried calling pivot without the Seq of column values, but it still does not pivot as expected. Please help.

2) In the when clause, if the pivot column is $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT_CA', then it should take $"data_item_value_numeric" instead. How can this be done?

【Comments】:

@Derek Kaknes please help.

【Answer 1】:

I'm not sure whether your Spark version is 2.x. My environment is Spark 2.2.1 with Scala 2.11, and with it I get the correct result:

+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
|model_family_id|classification_type|classification_value|data_item_code|          data_date|fiscal_year|fiscal_quarter|create_user_txt|        create_date|MEAN|OBS_CNT|OBS_CNT_CA|PERCENTILE_0|PERCENTILE_10|PERCENTILE_100|PERCENTILE_25|PERCENTILE_50|PERCENTILE_75|PERCENTILE_90|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
|              1|            COUNTRY|                 HKG|   CREDITSCORE|2017-12-31 00:00:00|       2017|             4|           LOAD|2018-03-31 14:04:18| bb-|     aa|          |          aa|         bbb+|             d|          bb+|           b+|            b|         ccc+|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+

Here is my code, which you can try:

import spark.implicits._
import org.apache.spark.sql.functions.{first, when}

source_df
  .groupBy($"model_family_id", $"classification_type", $"classification_value", $"data_item_code",
    $"data_date", $"fiscal_year", $"fiscal_quarter", $"create_user_txt", $"create_date")
  .pivot("benchmark_type_code")
  .agg(
    first(
      when($"data_item_code" === "CREDITSCORE", $"data_item_value_string")
        .otherwise($"data_item_value_numeric")
    )
  ).show()
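
A likely explanation for the all-null columns in the question, rather than a pivot bug: when pivot is given an explicit value list, Spark matches those values against the column contents literally, and the benchmark_type_code values in the sample data are uppercase (MEAN, OBS_CNT, ...), while the Seq in the question used lowercase ones. A sketch that keeps the explicit list (which spares Spark a pass to collect the distinct pivot values) but with matching casing, assuming the same source_df:

import org.apache.spark.sql.functions.{col, first, when}

// Same pivot as the question, but the explicit value list matches the data's casing exactly.
val pivot_df = source_df
  .groupBy("model_family_id", "classification_type", "classification_value", "data_item_code",
    "data_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")
  .pivot("benchmark_type_code",
    Seq("MEAN", "OBS_CNT", "OBS_CNT_CA", "PERCENTILE_0", "PERCENTILE_10", "PERCENTILE_25",
      "PERCENTILE_50", "PERCENTILE_75", "PERCENTILE_90", "PERCENTILE_100"))
  .agg(first(
    when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
      .otherwise(col("data_item_value_numeric"))
  ))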

【Comments】:

Thanks. Could you also answer the other question? 2) In the when clause, if the pivot column is $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT_CA', then it should take $"data_item_value_numeric". How can this be done?

@Chandan Ray any clue on the second issue above?

@Shyam sorry, could you state your question more clearly? I'm afraid I don't understand what you want.

【Answer 2】:

We can nest one when condition inside another when, as below, and it works fine.

.agg(first(
  when(col("data").isin("x", "a", "y", "z"),
    when(col("code").isin("aa", "bb"), col("numeric")).otherwise(col("string")))
    .otherwise(col("numeric"))
))
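
For question 2) from the post, the same nested-when pattern can be mapped onto the actual columns. A sketch, assuming (based on the sample data, not stated by the author) that the OBS_CNT and OBS_CNT_CA benchmarks should always take the numeric value, while everything else keeps the CREDITSCORE string/numeric rule:

import org.apache.spark.sql.functions.{col, first, when}

source_df
  .groupBy("model_family_id", "classification_type", "classification_value", "data_item_code",
    "data_date", "fiscal_year", "fiscal_quarter", "create_user_txt", "create_date")
  .pivot("benchmark_type_code")
  .agg(first(
    // Assumption: OBS_CNT / OBS_CNT_CA always carry a numeric count, so take it unconditionally.
    when(col("benchmark_type_code").isin("OBS_CNT", "OBS_CNT_CA"), col("data_item_value_numeric"))
      // All other benchmark codes keep the original rule: string for CREDITSCORE, numeric otherwise.
      .otherwise(when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
        .otherwise(col("data_item_value_numeric")))
  ))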

