Spark-sql Pivoting not working as expected at least for taken bulk data

Posted: 2019-07-04 17:01:20

Pivoting does not work properly in most cases, i.e., once the number of source-table records grows. Here is the source data:
source_df
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
|model_family_id|classification_type|classification_value|benchmark_type_code| data_date|data_item_code|data_item_value_numeric|data_item_value_string|fiscal_year|fiscal_quarter| create_date|last_update_date|create_user_txt|update_user_txt|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
| 1| COUNTRY| HKG| MEAN|2017-12-31 00:00:00| CREDITSCORE| 13| bb-| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| OBS_CNT|2017-12-31 00:00:00| CREDITSCORE| 649| aa| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| OBS_CNT_CA|2017-12-31 00:00:00| CREDITSCORE| 649| null| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_0|2017-12-31 00:00:00| CREDITSCORE| 3| aa| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_10|2017-12-31 00:00:00| CREDITSCORE| 8| bbb+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_100|2017-12-31 00:00:00| CREDITSCORE| 23| d| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_25|2017-12-31 00:00:00| CREDITSCORE| 11| bb+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_50|2017-12-31 00:00:00| CREDITSCORE| 14| b+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_75|2017-12-31 00:00:00| CREDITSCORE| 15| b| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
| 1| COUNTRY| HKG| PERCENTILE_90|2017-12-31 00:00:00| CREDITSCORE| 17| ccc+| 2017| 4|2018-03-31 14:04:18| null| LOAD| null|
+---------------+-------------------+--------------------+-------------------+-------------------+--------------+-----------------------+----------------------+-----------+--------------+-------------------+----------------+---------------+---------------+
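For anyone who wants to run the snippets below locally, here is a minimal sketch that rebuilds a slice of source_df (assuming a SparkSession named spark is in scope; dates are kept as strings and the unused last_update_date / update_user_txt columns are dropped for brevity):

import spark.implicits._

// A few rows from the table above, with only the columns the pivot uses.
val source_df = Seq(
  (1, "COUNTRY", "HKG", "MEAN",          "2017-12-31 00:00:00", "CREDITSCORE", 13,  "bb-", 2017, 4, "2018-03-31 14:04:18", "LOAD"),
  (1, "COUNTRY", "HKG", "OBS_CNT",       "2017-12-31 00:00:00", "CREDITSCORE", 649, "aa",  2017, 4, "2018-03-31 14:04:18", "LOAD"),
  (1, "COUNTRY", "HKG", "PERCENTILE_50", "2017-12-31 00:00:00", "CREDITSCORE", 14,  "b+",  2017, 4, "2018-03-31 14:04:18", "LOAD")
).toDF("model_family_id", "classification_type", "classification_value",
  "benchmark_type_code", "data_date", "data_item_code",
  "data_item_value_numeric", "data_item_value_string",
  "fiscal_year", "fiscal_quarter", "create_date", "create_user_txt")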
I tried the code below:
val pivot_df = source_df
  .groupBy("model_family_id", "classification_type", "classification_value",
    "data_item_code", "data_date", "fiscal_year", "fiscal_quarter",
    "create_user_txt", "create_date")
  .pivot("benchmark_type_code",
    Seq("mean", "obs_cnt", "obs_cnt_ca", "percentile_0", "percentile_10",
      "percentile_25", "percentile_50", "percentile_75", "percentile_90",
      "percentile_100"))
  .agg(first(
    when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
      .otherwise(col("data_item_value_numeric"))))
The result I get is below, which is not what I expected, and I am not sure what is wrong with my code.
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
|model_family_id|classification_type|classification_value|data_item_code| data_date|fiscal_year|fiscal_quarter|create_user_txt| create_date|mean|obs_cnt|obs_cnt_ca|percentile_0|percentile_10|percentile_25|percentile_50|percentile_75|percentile_90|percentile_100|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
| 1| COUNTRY| HKG| CREDITSCORE|2017-12-31 00:00:00| 2017| 4| LOAD|2018-03-31 14:04:18|null| null| null| null| null| null| null| null| null| null|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+-------------+-------------+-------------+-------------+--------------+
I also tried calling pivot without the Seq of values, but it still did not pivot as expected. Please help.

2) In the when clause, if the pivot column is $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT' then it should take $"data_item_value_numeric". How can this be done?
Question comments:
@Derek Kaknes please help

Answer 1:

I am not sure whether your Spark version is 2.x. My software versions are as follows: Spark ==> 2.2.1, Scala ==> 2.11. With that setup I got the correct result:
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
|model_family_id|classification_type|classification_value|data_item_code| data_date|fiscal_year|fiscal_quarter|create_user_txt| create_date|MEAN|OBS_CNT|OBS_CNT_CA|PERCENTILE_0|PERCENTILE_10|PERCENTILE_100|PERCENTILE_25|PERCENTILE_50|PERCENTILE_75|PERCENTILE_90|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
| 1| COUNTRY| HKG| CREDITSCORE|2017-12-31 00:00:00| 2017| 4| LOAD|2018-03-31 14:04:18| bb-| aa| | aa| bbb+| d| bb+| b+| b| ccc+|
+---------------+-------------------+--------------------+--------------+-------------------+-----------+--------------+---------------+-------------------+----+-------+----------+------------+-------------+--------------+-------------+-------------+-------------+-------------+
Here is my code, you can try it:
import spark.implicits._

source_df
  .groupBy($"model_family_id", $"classification_type", $"classification_value",
    $"data_item_code", $"data_date", $"fiscal_year", $"fiscal_quarter",
    $"create_user_txt", $"create_date")
  .pivot("benchmark_type_code")
  .agg(first(
    when($"data_item_code" === "CREDITSCORE", $"data_item_value_string")
      .otherwise($"data_item_value_numeric")))
  .show()
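A likely explanation for the all-null columns in the question: when pivot is given an explicit value list, Spark matches those values against the column contents literally, and the Seq in the question is lowercase ("mean", "obs_cnt", ...) while the data holds uppercase codes (MEAN, OBS_CNT, ...), so no row ever matches. If you still want the explicit list — it skips the extra job that computes distinct values and fixes the order of the output columns — a sketch with the case corrected:

import org.apache.spark.sql.functions.{col, first, when}

// Explicit pivot values must match the column contents exactly, case included.
val pivot_df = source_df
  .groupBy("model_family_id", "classification_type", "classification_value",
    "data_item_code", "data_date", "fiscal_year", "fiscal_quarter",
    "create_user_txt", "create_date")
  .pivot("benchmark_type_code",
    Seq("MEAN", "OBS_CNT", "OBS_CNT_CA", "PERCENTILE_0", "PERCENTILE_10",
      "PERCENTILE_25", "PERCENTILE_50", "PERCENTILE_75", "PERCENTILE_90",
      "PERCENTILE_100"))
  .agg(first(
    when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
      .otherwise(col("data_item_value_numeric"))))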
Comments:
Thanks. Could you also look at the other question? 2) In the when clause, if the pivot column is $"benchmark_type_code" === 'OBS_CNT' | 'OBS_CNT' then it should take $"data_item_value_numeric". How can this be done?

@Chandan Ray any clue on the second issue above?

@Shyam sorry, could you state your question clearly? I am afraid I do not understand what you want.

Answer 2:

We can nest a when condition inside another when condition, as below, and it works fine.
// Placeholder column names: nest one when inside another to pick the
// numeric or string column depending on two conditions.
.agg(first(
  when(col("data").isin("x", "a", "y", "z"),
    when(col("code").isin("aa", "bb"), col("numeric")).otherwise(col("string")))
    .otherwise(col("numeric"))))
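Mapped back to the columns in the question, a sketch of the same idea (assuming OBS_CNT and OBS_CNT_CA are the benchmark codes that should always take the numeric value; the duplicated 'OBS_CNT' in the question reads like a typo for one of these):

import org.apache.spark.sql.functions.{col, first, when}

// Assumed mapping: OBS_CNT / OBS_CNT_CA always come from the numeric column;
// every other benchmark code uses the string column for CREDITSCORE rows.
val valueExpr = first(
  when(col("benchmark_type_code").isin("OBS_CNT", "OBS_CNT_CA"),
    col("data_item_value_numeric"))
    .otherwise(
      when(col("data_item_code") === "CREDITSCORE", col("data_item_value_string"))
        .otherwise(col("data_item_value_numeric"))))

Pass valueExpr to the .agg(...) call of the pivot above; referencing the pivot column inside the aggregate is fine here, since each pivoted output column only aggregates the rows carrying that benchmark code.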