BigQuery row_number: group some rows based on a specific field


[Title]: bigquery row_number group by some rows based on specific field
[Posted]: 2015-05-05 07:10:23
[Question]:

I have data like this:

cityName    stateId
cityText1   52
cityText2   52
cityText3   52
cityExp1    72
cityExp2    72
cityExp3    72
city1       21
city2       21

I am using a subquery to fetch the data. Now, using BigQuery, I want the data to look like this:

cityName    rowNumber
cityText1   1
cityText2   1
cityText3   1
cityExp1    2
cityExp2    2
cityExp3    2
city1       3
city2       3

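I tried using row_number(), but it assigns a unique number to every row, whereas I want all rows with the same stateId to share one number. Is this possible?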


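[Answer 1]:

First reduce the data to one row per stateId (GROUP BY stateId) and put those rows into a single partition by giving them all the same scalar value; you can then apply ROW_NUMBER over that partition. Note that in BigQuery legacy SQL, which the queries below use, a comma between the subqueries in FROM acts as UNION ALL.

Update: scroll to the bottom of the answer for a suggestion that does not use the scalar.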

SELECT stateId,
       row_number() over (partition BY scalar) AS INDEX
FROM
  (SELECT stateId,
          1 AS scalar
   FROM
     (SELECT 'cityText1' AS cityName,
             52 AS stateId),
     (SELECT 'cityText2' AS cityName,
             52 AS stateId),
     (SELECT 'cityText3' AS cityName,
             52 AS stateId),
     (SELECT 'cityExp1' AS cityName,
             72 AS stateId),
     (SELECT 'cityExp2' AS cityName,
             72 AS stateId),
     (SELECT 'cityExp3' AS cityName,
             72 AS stateId),
     (SELECT 'city1' AS cityName,
             21 AS stateId),
     (SELECT 'city2' AS cityName,
             21 AS stateId)
   GROUP BY stateId) d

This returns:

+-----+---------+-------+
| Row | stateId | index |
+-----+---------+-------+
|   1 |      52 |     1 |
|   2 |      72 |     2 |
|   3 |      21 |     3 |
+-----+---------+-------+

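You can then join this result back to the original table on stateId to build the final output. With our static table, this is a rather long query: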

SELECT t.cityName,
       t.stateId,
       d.index
FROM
  (SELECT *
   FROM
     (SELECT 'cityText1' AS cityName,
             52 AS stateId),
     (SELECT 'cityText2' AS cityName,
             52 AS stateId),
     (SELECT 'cityText3' AS cityName,
             52 AS stateId),
     (SELECT 'cityExp1' AS cityName,
             72 AS stateId),
     (SELECT 'cityExp2' AS cityName,
             72 AS stateId),
     (SELECT 'cityExp3' AS cityName,
             72 AS stateId),
     (SELECT 'city1' AS cityName,
             21 AS stateId),
     (SELECT 'city2' AS cityName,
             21 AS stateId)) t
JOIN
  (SELECT stateId,
          row_number() over (partition BY scalar) AS INDEX
   FROM
     (SELECT stateId,
             1 AS scalar
      FROM
        (SELECT 'cityText1' AS cityName,
                52 AS stateId),
        (SELECT 'cityText2' AS cityName,
                52 AS stateId),
        (SELECT 'cityText3' AS cityName,
                52 AS stateId),
        (SELECT 'cityExp1' AS cityName,
                72 AS stateId),
        (SELECT 'cityExp2' AS cityName,
                72 AS stateId),
        (SELECT 'cityExp3' AS cityName,
                72 AS stateId),
        (SELECT 'city1' AS cityName,
                21 AS stateId),
        (SELECT 'city2' AS cityName,
                21 AS stateId)
      GROUP BY stateId)) d ON d.stateId=t.stateId

This returns the final output:

+-----+------------+-----------+---------+
| Row | t_cityName | t_stateId | d_index |
+-----+------------+-----------+---------+
|   1 | cityText1  |        52 |       1 |
|   2 | cityText2  |        52 |       1 |
|   3 | cityText3  |        52 |       1 |
|   4 | cityExp1   |        72 |       2 |
|   5 | cityExp2   |        72 |       2 |
|   6 | cityExp3   |        72 |       2 |
|   7 | city1      |        21 |       3 |
|   8 | city2      |        21 |       3 |
+-----+------------+-----------+---------+
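
To make the two steps easier to follow outside BigQuery, here is a minimal Python sketch of the same idea: number each distinct stateId once, then attach that number to every row. This is only an illustration, not part of the original answer; the function name number_groups is hypothetical, and numbering the groups in order of first appearance is an assumption made so that the output matches the table above (BigQuery itself does not guarantee an order without ORDER BY).

```python
# Same sample data as the static table used in the queries above.
rows = [
    ("cityText1", 52), ("cityText2", 52), ("cityText3", 52),
    ("cityExp1", 72), ("cityExp2", 72), ("cityExp3", 72),
    ("city1", 21), ("city2", 21),
]


def number_groups(rows):
    """Return (cityName, stateId, index) with one index per distinct stateId."""
    # Step 1 (GROUP BY stateId + ROW_NUMBER): give each distinct stateId
    # the next free number. First-seen order is an assumption of this sketch.
    index_by_state = {}
    for _, state_id in rows:
        if state_id not in index_by_state:
            index_by_state[state_id] = len(index_by_state) + 1

    # Step 2 (JOIN ... ON d.stateId = t.stateId): attach the group number
    # to every original row.
    return [(city, state_id, index_by_state[state_id]) for city, state_id in rows]


result = number_groups(rows)
for row in result:
    print(row)
```

All rows that share a stateId end up with the same index, which is the behaviour the question asks for.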

Update: without the scalar, the query becomes:

SELECT stateId,
       row_number() over () AS INDEX
FROM
  (SELECT 'cityText1' AS cityName,
          52 AS stateId),
  (SELECT 'cityText2' AS cityName,
          52 AS stateId),
  (SELECT 'cityText3' AS cityName,
          52 AS stateId),
  (SELECT 'cityExp1' AS cityName,
          72 AS stateId),
  (SELECT 'cityExp2' AS cityName,
          72 AS stateId),
  (SELECT 'cityExp3' AS cityName,
          72 AS stateId),
  (SELECT 'city1' AS cityName,
          21 AS stateId),
  (SELECT 'city2' AS cityName,
          21 AS stateId)
GROUP BY stateId

[Comments]:

Nice solution, @Pentium10. One small suggestion: you don't need PARTITION BY scalar, or to introduce scalar at all; you can use an empty OVER() clause and it will work just fine.
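
The equivalence of the two forms can be tried locally. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for BigQuery, which is an assumption of this example: it needs SQLite 3.25 or newer for window functions, and the table name cities and column alias idx are made up for illustration. Since neither query has an ORDER BY, only the set of assigned numbers is compared, not which stateId receives which number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (cityName TEXT, stateId INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [
        ("cityText1", 52), ("cityText2", 52), ("cityText3", 52),
        ("cityExp1", 72), ("cityExp2", 72), ("cityExp3", 72),
        ("city1", 21), ("city2", 21),
    ],
)

# Form 1: put every distinct stateId into one partition via a constant column.
with_scalar = conn.execute(
    """
    SELECT stateId, ROW_NUMBER() OVER (PARTITION BY scalar) AS idx
    FROM (SELECT stateId, 1 AS scalar FROM cities GROUP BY stateId)
    """
).fetchall()

# Form 2: an empty OVER() already treats the whole result as one partition.
without_scalar = conn.execute(
    """
    SELECT stateId, ROW_NUMBER() OVER () AS idx
    FROM (SELECT stateId FROM cities GROUP BY stateId)
    """
).fetchall()

print(sorted(idx for _, idx in with_scalar))
print(sorted(idx for _, idx in without_scalar))
```

Both queries number the three distinct stateId values from 1 to 3, so the constant scalar column adds nothing beyond what the empty OVER() already provides.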
