How to calculate group-based quantiles?

Posted: 2020-03-06 11:54:53

Question:

I am using spark-sql-2.4.1v and I am trying to find quantiles, i.e. percentile 0, percentile 25, etc., on each column of the given data.

My dataframe df:

+----+---------+-------------+----------+-----------+--------+
|  id|     date|      revenue|con_dist_1| con_dist_2| state  |
+----+---------+-------------+----------+-----------+--------+
|  10|1/15/2018|  0.010680705|         6|0.019875458|   TX   |
|  10|1/15/2018|  0.006628853|         4|0.816039063|   AZ   |
|  10|1/15/2018|   0.01378215|         4|0.082049528|   TX   |
|  10|1/15/2018|  0.010680705|         6|0.019875458|   TX   |
|  10|1/15/2018|  0.006628853|         4|0.816039063|   AZ   |
+----+---------+-------------+----------+-----------+--------+

How do I find the quantiles of the "con_dist_1" and "con_dist_2" columns for each state?

Comments:

Related question: ***.com/questions/46845672/…

Answer 1:

A possible solution is:

scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id|     date|    revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705|         6|0.019875458|   TX|
| 10|1/15/2018|0.006628853|         4|0.816039063|   AZ|
| 10|1/15/2018| 0.01378215|         4|0.082049528|   TX|
| 10|1/15/2018|0.010680705|         6|0.019875458|   TX|
| 10|1/15/2018|0.006628853|         4|0.816039063|   AZ|
+---+---------+-----------+----------+-----------+-----+

scala> val df1 = input.groupBy("state").agg(collect_list("con_dist_1").as("combined_1"), collect_list("con_dist_2").as("combined_2"))
df1: org.apache.spark.sql.DataFrame = [state: string, combined_1: array<int> ... 1 more field]

scala> df1.show
+-----+----------+--------------------+                                         
|state|combined_1|          combined_2|
+-----+----------+--------------------+
|   AZ|    [4, 4]|[0.816039063, 0.8...|
|   TX| [6, 4, 6]|[0.019875458, 0.0...|
+-----+----------+--------------------+

scala> df1.
     | withColumn("comb1_Q1", sort_array($"combined_1")(((size($"combined_1")-1)*0.25).cast("int"))).
     | withColumn("comb1_Q2", sort_array($"combined_1")(((size($"combined_1")-1)*0.5).cast("int"))).
     | withColumn("comb1_Q3", sort_array($"combined_1")(((size($"combined_1")-1)*0.75).cast("int"))).
     | withColumn("comb_2_Q1", sort_array($"combined_2")(((size($"combined_2")-1)*0.25).cast("int"))).
     | withColumn("comb_2_Q2", sort_array($"combined_2")(((size($"combined_2")-1)*0.5).cast("int"))).
     | withColumn("comb_2_Q3", sort_array($"combined_2")(((size($"combined_2")-1)*0.75).cast("int"))).
     | show
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
|state|combined_1|          combined_2|comb1_Q1|comb1_Q2|comb1_Q3|  comb_2_Q1|  comb_2_Q2|  comb_2_Q3|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
|   AZ|    [4, 4]|[0.816039063, 0.8...|       4|       4|       4|0.816039063|0.816039063|0.816039063|
|   TX| [6, 4, 6]|[0.019875458, 0.0...|       4|       6|       6|0.019875458|0.019875458|0.019875458|
+-----+----------+--------------------+--------+--------+--------+-----------+-----------+-----------+
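The withColumn chain above implements a nearest-rank quantile: sort the collected per-group values and pick the element at index int((n - 1) * q). A minimal pure-Python sketch of that same index arithmetic (an illustration only, not the Spark code itself):

```python
# Nearest-rank quantile: sort the group's values and take the element
# at the floored index (n - 1) * q, matching sort_array(...)(cast("int")).
def quantile_nearest_rank(values, q):
    s = sorted(values)
    return s[int((len(s) - 1) * q)]

# The same per-state groups that collect_list produced above.
groups = {"AZ": [4, 4], "TX": [6, 4, 6]}
for state, vals in groups.items():
    print(state, [quantile_nearest_rank(vals, q) for q in (0.25, 0.5, 0.75)])
# AZ [4, 4, 4]
# TX [4, 6, 6]
```

For TX, sorted [4, 6, 6] gives indices int(0.5)=0, int(1.0)=1, int(1.5)=1, i.e. 4, 6, 6, which matches the comb1_Q1..Q3 columns in the output above.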

EDIT

I don't think we can do this with the approxQuantile method, because you need to group by the state column and aggregate the con_dist columns, while approxQuantile expects a whole column of integers or doubles, not an array type.

Another solution is to use Spark SQL, as shown below:

scala> input.show
+---+---------+-----------+----------+-----------+-----+
| id|     date|    revenue|con_dist_1| con_dist_2|state|
+---+---------+-----------+----------+-----------+-----+
| 10|1/15/2018|0.010680705|         6|0.019875458|   TX|
| 10|1/15/2018|0.006628853|         4|0.816039063|   AZ|
| 10|1/15/2018| 0.01378215|         4|0.082049528|   TX|
| 10|1/15/2018|0.010680705|         6|0.019875458|   TX|
| 10|1/15/2018|0.006628853|         4|0.816039063|   AZ|
+---+---------+-----------+----------+-----------+-----+


scala> input.createOrReplaceTempView("input")

scala> :paste
// Entering paste mode (ctrl-D to finish)

val query = "select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, " +
  "percentile_approx(con_dist_1,0.5) as col1_quantile_2," +
  "percentile_approx(con_dist_1,0.75) as col1_quantile_3, " +
  "percentile_approx(con_dist_2,0.25) as col2_quantile_1,"+
  "percentile_approx(con_dist_2,0.5) as col2_quantile_2," +
  "percentile_approx(con_dist_2,0.75) as col2_quantile_3 " +
  "from input group by state"

// Exiting paste mode, now interpreting.

query: String = select state, percentile_approx(con_dist_1,0.25) as col1_quantile_1, percentile_approx(con_dist_1,0.5) as col1_quantile_2,percentile_approx(con_dist_1,0.75) as col1_quantile_3, percentile_approx(con_dist_2,0.25) as col2_quantile_1,percentile_approx(con_dist_2,0.5) as col2_quantile_2,percentile_approx(con_dist_2,0.75) as col2_quantile_3 from input group by state

scala> val df2 = spark.sql(query)
df2: org.apache.spark.sql.DataFrame = [state: string, col1_quantile_1: int ... 5 more fields]

scala> df2.show
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|state|col1_quantile_1|col1_quantile_2|col1_quantile_3|col2_quantile_1|col2_quantile_2|col2_quantile_3|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
|   AZ|              4|              4|              4|    0.816039063|    0.816039063|    0.816039063|
|   TX|              4|              6|              6|    0.019875458|    0.019875458|    0.082049528|
+-----+---------------+---------------+---------------+---------------+---------------+---------------+
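For groups this small, percentile_approx is exact: it returns the smallest value v such that at least ceil(p * n) of the group's values are <= v. A hedged pure-Python sketch of that definition (not Spark's actual approximation algorithm), checked against the TX row for con_dist_2:

```python
import math

# Exact percentile for small groups: the smallest value v such that
# at least ceil(p * n) of the values are <= v.
def percentile_exact(values, p):
    s = sorted(values)
    k = max(1, math.ceil(p * len(s)))
    return s[k - 1]

tx_con_dist_2 = [0.019875458, 0.082049528, 0.019875458]
for p in (0.25, 0.5, 0.75):
    print(p, percentile_exact(tx_con_dist_2, p))
# 0.25 0.019875458
# 0.5 0.019875458
# 0.75 0.082049528
```

These match the col2_quantile_1..3 values in the TX row above.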

Let me know if it helps!

Comments:

@BdEngineer Yes, you can keep it inside the loop, but normally you would create the temp view outside the loop. Since it is just a view, it can be replaced. Yes, the input view will be overwritten; it depends on which dataframe you create the view from.
