How to get the most frequent value in Google's BigQuery

Posted: 2019-05-08 19:43:38

【Question】:

Postgres has a simple way to do this: the mode() function returns the most frequent value. Is there anything similar in Google's BigQuery?

How would I write a query like this one in BigQuery?

select count(*),
       avg(vehicles)                                         as mean,
       percentile_cont(0.5) within group (order by vehicles) as median,
       mode() within group (order by vehicles)               as most_frequent_value
FROM "driver"
WHERE vehicles is not null;

【Answer 1】:

The following are for BigQuery Standard SQL.

Option 1

#standardSQL
SELECT * FROM (
  SELECT COUNT(*) AS cnt,
    AVG(vehicles) AS mean,
    APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
) CROSS JOIN (
  SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  LIMIT 1
)

Option 2

#standardSQL
SELECT * FROM (
  SELECT COUNT(*) cnt,
    AVG(vehicles) AS mean
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
) CROSS JOIN (
  SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  LIMIT 1
) CROSS JOIN (
  SELECT vehicles AS most_frequent_value
  FROM `project.dataset.table`
  WHERE vehicles IS NOT NULL
  GROUP BY vehicles
  ORDER BY COUNT(1) DESC
  LIMIT 1
)  

Option 3

#standardSQL
CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
  SELECT PERCENTILE_CONT(x, 0.5) OVER() 
  FROM UNNEST(arr) x LIMIT 1 
));
CREATE TEMP FUNCTION most_frequent_value(arr ANY TYPE) AS ((
  SELECT x 
  FROM UNNEST(arr) x
  GROUP BY x
  ORDER BY COUNT(1) DESC
  LIMIT 1  
));
SELECT COUNT(*) cnt,
  AVG(vehicles) AS mean,
  median(ARRAY_AGG(vehicles)) AS median,
  most_frequent_value(ARRAY_AGG(vehicles)) AS most_frequent_value
FROM `project.dataset.table`
WHERE vehicles IS NOT NULL   

And so on...

【Answer 2】:

You can use APPROX_TOP_COUNT to get the top values, for example:

SELECT APPROX_TOP_COUNT(vehicles, 5) AS top_five_vehicles
FROM dataset.driver

If you only want the top value, you can select it from the array:

SELECT APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)] AS most_frequent_value
FROM dataset.driver

【Comments】:

If you only want the value, append .value; the function returns an array of structs, each with a value and a count field.

【Answer 3】:

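My preferred approach is to query from an array, because you can easily adjust the criteria for the mode. Below are examples using both the LIMIT and the OFFSET approach. With OFFSET, you can get the N-th most or least frequent value.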

WITH t AS (SELECT 18 AS length, 
'HIGH' as amps, 
99.95 price UNION ALL
SELECT 18,  "HIGH", 99.95 UNION ALL
SELECT 18,  "HIGH", 5.95 UNION ALL
SELECT 18,  "LOW", 33.95 UNION ALL
SELECT 18,  "LOW", 33.95 UNION ALL
SELECT 18,  "LOW", 4.5 UNION ALL
SELECT 3,  "HIGH", 77.95 UNION ALL
SELECT 3,  "HIGH", 77.95 UNION ALL
SELECT 3,  "HIGH", 9.99 UNION ALL
SELECT 3,  "LOW", 44.95 UNION ALL
SELECT 3,  "LOW", 44.95 UNION ALL
SELECT 3,  "LOW", 5.65 
)

SELECT
length,
amps,

-- By Limit
(SELECT x FROM UNNEST(price_array) x 
    GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1 ) most_freq_price,
(SELECT x FROM UNNEST(price_array) x 
    GROUP BY x ORDER BY COUNT(*) ASC  LIMIT 1 ) least_freq_price,

-- By Offset
ARRAY((SELECT x FROM UNNEST(price_array) x 
    GROUP BY x ORDER BY COUNT(*) DESC))[OFFSET(0)] most_freq_price_offset,
ARRAY((SELECT x FROM UNNEST(price_array) x 
    GROUP BY x ORDER BY COUNT(*) ASC))[OFFSET(0)] least_freq_price_offset

FROM (
SELECT 
    length,
    amps,
    ARRAY_AGG(price) price_array
FROM t
GROUP BY 1,2
)

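【Answer 4】:

No, BigQuery has no equivalent of the mode() function, but you can define one yourself using any of the logic from the other answers in this thread. You could then call it like this: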

SELECT mode(`an_array`) AS top_count FROM `somewhere_with_arrays`

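However, this approach results in a subquery per row, which is bad for performance, so if you have never managed to stall BigQuery before, these functions can do it. I would only use them as a quick fix for readability on very small datasets.

See the two UDFs below. A third approach would be to implement a JS function, in which case this one-liner should be useful: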

return arr.sort((a,b) => arr.filter(v => v===a).length - arr.filter(v => v===b).length).pop();
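
The one-liner above sorts the input array in place and re-scans the whole array for every comparison. As a rough alternative, here is a minimal sketch that counts each value in a single pass and leaves the input untouched. The function name mostFrequent is hypothetical (it is not a BigQuery built-in and does not come from the answer), and the tie-breaking rule is an assumption made for illustration.

```javascript
// Hypothetical helper; not part of BigQuery or of the original answer.
// Returns the most frequent element of arr, or null for an empty array.
// Assumption: on a tie, the value that reaches the winning count first is kept.
function mostFrequent(arr) {
  const counts = new Map();   // value -> number of occurrences seen so far
  let best = null;
  let bestCount = 0;
  for (const v of arr) {
    const c = (counts.get(v) || 0) + 1;
    counts.set(v, c);
    if (c > bestCount) {      // strict '>' keeps the earlier winner on ties
      best = v;
      bestCount = c;
    }
  }
  return best;
}

// Same test data as the SQL example in this answer.
console.log(mostFrequent(['Erdős', 'Turán', 'Erdős', 'Turán', 'Euler', 'Erdős'])); // Erdős
console.log(mostFrequent(['Euler', 'Euler', 'Gauss', 'Euler']));                   // Euler
```

To use something like this inside BigQuery, the loop would go into the body of a JavaScript UDF (CREATE TEMP FUNCTION ... LANGUAGE js), with the array passed in as the function argument.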

The SQL below defines two mode()-like functions that take an array and return the most common string:

CREATE TEMPORARY FUNCTION mode1(mystring ANY TYPE)
RETURNS STRING
AS
(
    (
        /* Unnest the input array and count occurrences of each value.   */
        /* ORDER BY and LIMIT sit in the same query so the ordering is   */
        /* guaranteed to apply to the row that LIMIT keeps.              */
        SELECT var
        FROM UNNEST(mystring) var
        GROUP BY var             /* Output is one of the existing values */
        ORDER BY COUNT(*) DESC   /* Output is value with HIGHEST count   */
        LIMIT 1                  /* Only ONE string is the output        */
    )
);

CREATE TEMPORARY FUNCTION mode2(inp ANY TYPE)
RETURNS STRING
AS
(
    (
        SELECT result.value FROM UNNEST( (SELECT APPROX_TOP_COUNT(v,1) AS result FROM UNNEST(inp) v)) result
    )
);

SELECT
    inp,
    mode1(inp) AS first_logic_output,
    mode2(inp) AS second_logic_output
FROM
(
    /* Test data */
    SELECT ['Erdős','Turán', 'Erdős','Turán','Euler','Erdős'] AS inp
    UNION ALL 
    SELECT ['Euler','Euler', 'Gauss', 'Euler'] AS inp
)
