Trouble querying tables in multiple datasets in BigQuery

Posted: 2020-05-25 16:52:59

Question:

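I am trying to query two BigQuery tables from two different datasets to get two separate columns. I have tried both UNION and JOIN, but neither gives me what I need. Here is the query I tried: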

with abagrowth as (
SELECT
  session abas,
  term abat,
  COUNT(distinct studentid) AS acount,
  ROUND(100 * (COUNT(distinct studentid) - LAG(COUNT(distinct studentid), 1) OVER (ORDER BY session)) / LAG(COUNT(distinct studentid), 1) OVER (ORDER BY session),0) || '%' AS agrowth
FROM
  aba.abaresult
GROUP BY
  1,
  2
ORDER BY
  1,
  2),

bidagrowth as (
SELECT
  session bidas,
  term bidat,
  COUNT(distinct studentid) AS bcount,
  ROUND(100 * (COUNT(distinct studentid) - LAG(COUNT(distinct studentid), 1) OVER (ORDER BY session)) / LAG(COUNT(distinct studentid), 1) OVER (ORDER BY session),0) || '%' AS bgrowth
FROM
  bida.bidaresult
GROUP BY
  1,
  2
ORDER BY
  1,
  2)

select abas, agrowth from abagrowth
union all
select bidas, bgrowth from bidagrowth

The data looks similar to this:

name  subject  session      totalscore
-------------------------------------------
jack  maths    2013/2014         70
jane  maths    2013/2014         65
jill  maths    2013/2014         80
jack  maths    2014/2015         72
jack  eng      2014/2015         87
jane  science  2014/2015         67
jill  maths    2014/2015         70
jerry eng      2014/2015         70
jaasp science  2014/2015         85

The table I want to get should be in this format, or something similar:

session    agrowth  bgrowth
2013/2014   null     null
2014/2015   10%       11%
2015/2016   5%        2%

The numbers above are hypothetical, for the sake of example.

Questions

    Can BigQuery do this?

    If so, how?

Thanks

Answer 1:

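About the datasets: yes, you can query two datasets in a single query. See this answer. Basically, you just need to name the project (optional), the dataset, and the table you are using, in the form project.dataset.table.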
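As a minimal sketch of that naming scheme, the query below reads from both datasets in one statement. The project ID my-project is a placeholder and not something taken from the question; the dataset, table, and column names are the ones the question uses. If both datasets live in the project the query runs in, the project prefix can be dropped, as in the question's own aba.abaresult.

```sql
-- `my-project` is a hypothetical project ID; replace it with your own.
-- Each table is addressed as `project.dataset.table`.
SELECT
  'aba' AS dataset_name,
  COUNT(DISTINCT studentid) AS students
FROM
  `my-project.aba.abaresult`
UNION ALL
SELECT
  'bida' AS dataset_name,
  COUNT(DISTINCT studentid) AS students
FROM
  `my-project.bida.bidaresult`;
```

This returns one row per dataset, which is only meant to show that tables from different datasets can appear in the same query.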

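About the data you want to get: you can achieve it with a JOIN instead of a UNION. Joining the tables on session gives you one row per session, and you can then choose which columns to include in the SELECT.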

WITH abagrowth AS (
-- One row per (session, term), as in the question
SELECT
  session,
  term AS abat,
  COUNT(DISTINCT studentid) AS acount,
  -- Explicit cast so that || receives two STRING operands
  CAST(ROUND(100 * (COUNT(DISTINCT studentid) - LAG(COUNT(DISTINCT studentid), 1) OVER (ORDER BY session)) / LAG(COUNT(DISTINCT studentid), 1) OVER (ORDER BY session), 0) AS STRING) || '%' AS agrowth
FROM
  aba.abaresult
GROUP BY
  1,
  2),

bidagrowth AS (
-- Same calculation for the table in the second dataset
SELECT
  session,
  term AS bidat,
  COUNT(DISTINCT studentid) AS bcount,
  CAST(ROUND(100 * (COUNT(DISTINCT studentid) - LAG(COUNT(DISTINCT studentid), 1) OVER (ORDER BY session)) / LAG(COUNT(DISTINCT studentid), 1) OVER (ORDER BY session), 0) AS STRING) || '%' AS bgrowth
FROM
  bida.bidaresult
GROUP BY
  1,
  2)

-- Put the two growth columns side by side, matched on session.
-- An ORDER BY inside a CTE does not order the final result, so sort here.
SELECT aba.session, aba.agrowth, bida.bgrowth
   FROM abagrowth aba
   JOIN bidagrowth bida
        ON aba.session = bida.session
   ORDER BY aba.session

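UNION, by contrast, stacks the results of the two queries on top of each other, so the growth values end up in a single column instead of side by side.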
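One thing to watch: the CTEs above group by both session and term, so a join on session alone can return several rows for the same session. The following is a sketch of a per-session variant, under assumptions that are mine rather than the answer's: the counts are taken per session only, FORMAT is used to build the percentage string, and a FULL OUTER JOIN keeps sessions that exist in only one table. The inline rows are made-up sample data standing in for aba.abaresult and bida.bidaresult so that the query can be run on its own.

```sql
-- Made-up sample rows; in practice, read from aba.abaresult and bida.bidaresult.
WITH abaresult AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<session STRING, studentid INT64>>[
    ('2013/2014', 1), ('2013/2014', 2),
    ('2014/2015', 1), ('2014/2015', 2), ('2014/2015', 3)
  ])
),
bidaresult AS (
  SELECT * FROM UNNEST(ARRAY<STRUCT<session STRING, studentid INT64>>[
    ('2013/2014', 10), ('2013/2014', 11),
    ('2014/2015', 10)
  ])
),
-- Distinct students per session: exactly one row per session.
abacount AS (
  SELECT session, COUNT(DISTINCT studentid) AS n
  FROM abaresult
  GROUP BY session
),
bidacount AS (
  SELECT session, COUNT(DISTINCT studentid) AS n
  FROM bidaresult
  GROUP BY session
),
-- Percentage change versus the previous session; NULL for the first session.
abagrowth AS (
  SELECT
    session,
    FORMAT('%.0f%%',
           100 * SAFE_DIVIDE(n - LAG(n) OVER (ORDER BY session),
                             LAG(n) OVER (ORDER BY session))) AS agrowth
  FROM abacount
),
bidagrowth AS (
  SELECT
    session,
    FORMAT('%.0f%%',
           100 * SAFE_DIVIDE(n - LAG(n) OVER (ORDER BY session),
                             LAG(n) OVER (ORDER BY session))) AS bgrowth
  FROM bidacount
)
-- FULL OUTER JOIN keeps sessions that appear in only one of the two tables.
SELECT session, a.agrowth, b.bgrowth
FROM abagrowth AS a
FULL OUTER JOIN bidagrowth AS b USING (session)
ORDER BY session;
```

With these sample rows the first session has no predecessor, so both growth columns are NULL; the second session works out to 50% for aba (2 to 3 students) and -50% for bida (2 to 1). If a plain inner JOIN is preferred, as in the answer, only sessions present in both tables are returned.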
