BigQuery: apply rank / percent_rank to a column with a WHERE clause

Posted: 2019-11-11 00:50:05

【Question】:

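I have a fairly wide BigQuery table with about 20-30 different columns, each of which needs a companion percentile column showing how that column's value compares with all other rows in the table. However, a row should receive a percentile value for a given column only if the value in another column meets a certain threshold. To demonstrate, I created a reproducible example below: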

WITH
  correct_games_played AS
    (
      SELECT "a" as name, 7 as num1, 0.4 as num2, 0.55 as num3
      UNION ALL SELECT "b" as name, 13 as num1, 0.53 as num2, 0.37 as num3
      UNION ALL SELECT "c" as name, 4 as num1, 0.42 as num2, 0.32 as num3
      UNION ALL SELECT "d" as name, 17 as num1, 0.6 as num2, 0.23 as num3
      UNION ALL SELECT "e" as name, 7 as num1, 0.3 as num2, 0.25 as num3
      UNION ALL SELECT "f" as name, 16 as num1, 0.7 as num2, 0.43 as num3
      UNION ALL SELECT "g" as name, 10 as num1, 0.53 as num2, 0.52 as num3
      UNION ALL SELECT "h" as name, 5 as num1, 0.54 as num2, 0.21 as num3
      UNION ALL SELECT "i" as name, 9 as num1, 0.56 as num2, 0.17 as num3
      UNION ALL SELECT "j" as name, 3 as num1, 0.75 as num2, 0.53 as num3
    )

  SELECT 
    a.*,
    -- RANK() OVER(ORDER BY a.num1 DESC) AS num1_rank,
    -- RANK() OVER(ORDER BY a.num2 DESC) AS num2_rank,
    -- RANK() OVER(ORDER BY a.num3 DESC) AS num3_rank
    RANK() OVER(ORDER BY a.num1 DESC) AS num1_rank,
    RANK() OVER(ORDER BY a.num2 WHERE a.num1 > 4 DESC) AS num2_rank
    RANK() OVER(ORDER BY a.num3 WHERE a.num1 > 3 DESC) AS num3_rank
  FROM correct_games_played as a

This script throws the error Syntax error: Expected ")" but got keyword WHERE at [22:37], but if I replace the RANK() calls with the commented-out ones, it works. My goal is actually quite simple:

num2_rank: rank the value in a.num2 only if a.num1 is greater than 4; otherwise show null.
num3_rank: rank the value in a.num3 only if a.num1 is greater than 3; otherwise show null.

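My table is wide, and each column may need its own condition to decide whether a row's value in that column should be ranked. Any help with this would be greatly appreciated, thanks!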

【Answer 1】:

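The following is for BigQuery Standard SQL. A window specification accepts only PARTITION BY, ORDER BY, and a frame clause, so the condition is moved into an IF() inside ORDER BY (non-qualifying rows get a NULL sort key, which sorts last under DESC) and into an outer IF() that returns NULL for those rows: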

#standardSQL
WITH correct_games_played AS (
  SELECT "a" AS name, 7 AS num1, 0.4 AS num2, 0.55 AS num3 UNION ALL 
  SELECT "b" AS name, 13 AS num1, 0.53 AS num2, 0.37 AS num3 UNION ALL 
  SELECT "c" AS name, 4 AS num1, 0.42 AS num2, 0.32 AS num3 UNION ALL 
  SELECT "d" AS name, 17 AS num1, 0.6 AS num2, 0.23 AS num3 UNION ALL 
  SELECT "e" AS name, 7 AS num1, 0.3 AS num2, 0.25 AS num3 UNION ALL 
  SELECT "f" AS name, 16 AS num1, 0.7 AS num2, 0.43 AS num3 UNION ALL 
  SELECT "g" AS name, 10 AS num1, 0.53 AS num2, 0.52 AS num3 UNION ALL 
  SELECT "h" AS name, 5 AS num1, 0.54 AS num2, 0.21 AS num3 UNION ALL 
  SELECT "i" AS name, 9 AS num1, 0.56 AS num2, 0.17 AS num3 UNION ALL 
  SELECT "j" AS name, 3 AS num1, 0.75 AS num2, 0.53 AS num3
)
SELECT *,
  RANK() OVER(ORDER BY num1 DESC) AS num1_rank,
  IF(num1 > 4, RANK() OVER(ORDER BY IF(num1 > 4, num2, NULL) DESC), NULL)  AS num2_rank,
  IF(num1 > 3, RANK() OVER(ORDER BY IF(num1 > 3, num3, NULL) DESC), NULL) AS num3_rank
FROM correct_games_played
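
As an aside that is not part of the original answer: the query above yields the intended RANK() values, but the same trick applied to PERCENT_RANK() would likely be off, because PERCENT_RANK() is computed as (rank - 1) / (rows in partition - 1) and the non-qualifying rows would still count toward the partition size. A minimal sketch of one possible workaround, assuming it is acceptable to partition on the condition itself so that qualifying rows form their own partition (the output column name num2_pct_rank is hypothetical):

```sql
#standardSQL
WITH correct_games_played AS (
  SELECT "a" AS name, 7 AS num1, 0.4 AS num2, 0.55 AS num3 UNION ALL
  SELECT "b" AS name, 13 AS num1, 0.53 AS num2, 0.37 AS num3 UNION ALL
  SELECT "c" AS name, 4 AS num1, 0.42 AS num2, 0.32 AS num3 UNION ALL
  SELECT "d" AS name, 17 AS num1, 0.6 AS num2, 0.23 AS num3 UNION ALL
  SELECT "e" AS name, 7 AS num1, 0.3 AS num2, 0.25 AS num3 UNION ALL
  SELECT "f" AS name, 16 AS num1, 0.7 AS num2, 0.43 AS num3 UNION ALL
  SELECT "g" AS name, 10 AS num1, 0.53 AS num2, 0.52 AS num3 UNION ALL
  SELECT "h" AS name, 5 AS num1, 0.54 AS num2, 0.21 AS num3 UNION ALL
  SELECT "i" AS name, 9 AS num1, 0.56 AS num2, 0.17 AS num3 UNION ALL
  SELECT "j" AS name, 3 AS num1, 0.75 AS num2, 0.53 AS num3
)
SELECT *,
  -- PARTITION BY the boolean condition: rows with num1 > 4 are ranked
  -- only against each other, so the partition size excludes the rest.
  IF(num1 > 4,
     RANK() OVER(PARTITION BY num1 > 4 ORDER BY num2 DESC),
     NULL) AS num2_rank,
  -- Hypothetical companion column: percent rank among qualifying rows only.
  IF(num1 > 4,
     PERCENT_RANK() OVER(PARTITION BY num1 > 4 ORDER BY num2 DESC),
     NULL) AS num2_pct_rank
FROM correct_games_played
```

With the sample data, 8 of the 10 rows satisfy num1 > 4, so both columns are computed over those 8 rows and are NULL for "c" and "j". Note that with ORDER BY ... DESC the largest value receives a percent rank of 0; if a higher value should map to a higher percentile, ordering ascending is probably what you want.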

【Comments】:

I missed this obvious one.
