Merge into a BigQuery Partitioned Table via a Join Without Scanning Entire Table

Posted 2023-03-24

[Title]: Merge into a BigQuery Partitioned Table via a Join Without Scanning Entire Table [Posted]: 2019-09-26 05:12:14 [Question]:

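Example scenario: I have a "BigTable" with millions of rows and a "TinyTable" with only a few rows. I need to merge some information from TinyTable into BigTable. BigTable is partitioned on the column "date_time". My merge joins on date_time and ID. I really only need the ID column to perform the join, but I thought that also including date_time would let BQ prune partitions and look only at the necessary dates. It does not. It does a full scan of BigTable (billing me for gigabytes of data), even when TinyTable contains only a single value (i.e., rows from a single date).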

BigTable
+---------------------------+---------+-------+
|         date_time         |      ID | value |
+---------------------------+---------+-------+
| '2019-03-13 00:00:00 UTC' |     100 | .2345 |
| '2019-03-13 00:00:00 UTC' |     101 |   .65 |
| '2019-03-14 00:00:00 UTC' |     102 |  .648 |
|  [+50 million rows...]    |         |       |
+---------------------------+---------+-------+


TinyTable
+---------------------------+---------+-------+
|         date_time         |      ID | value |
+---------------------------+---------+-------+
| '2019-03-13 00:00:00 UTC' |     100 |  .555 |
| '2019-03-14 00:00:00 UTC' |     102 |  .666 |
|                           |         |       |
+---------------------------+---------+-------+

...

This uses 8 GB...

 MERGE BigTable
    USING TinyTable
    ON BigTable.date_time = TinyTable.date_time and BigTable.id = TinyTable.id 
    WHEN MATCHED THEN
      UPDATE SET date_time = TinyTable.date_time, value = TinyTable.value
    WHEN NOT MATCHED THEN
      INSERT  (date_time, id , value) values (date_time, id , value);

This also uses 8 GB...

update BigTable 
set value = TinyTable.value 
from 
TinyTable where 
BigTable.date_time = TinyTable.date_time 
and 
BigTable.id = TinyTable.id

If I hard-code a timestamp literal instead of using the value from the join, it works as expected (only 12 MB), but that is not what I'm after...

update BigTable 
set value = TinyTable.value 
from 
TinyTable where 
BigTable.date_time = '2019-03-13 00:00:00 UTC' 
and 
BigTable.id = TinyTable.id

I need to run something like this hundreds of times a day. As it stands, this is unsustainable cost-wise. What am I missing?

Thanks!

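[Question discussion]:

[Answer 1]:

With BigQuery scripting (in beta at the time of writing), there is a way to reduce the cost.

Basically, a scripting variable is declared to capture the dynamic part of the subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.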

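-- Collect the distinct partition values present in TinyTable.
-- TIMESTAMP matches the UTC values shown in the question; use ARRAY<DATETIME> if date_time is a DATETIME column.
DECLARE date_filter ARRAY<TIMESTAMP>
  DEFAULT (SELECT ARRAY_AGG(DISTINCT date_time) FROM TinyTable);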

update BigTable 
set value = TinyTable.value 
from 
TinyTable where 
BigTable.date_time in UNNEST(date_filter) --This prunes the partition to be scanned
AND
BigTable.date_time = TinyTable.date_time 
and 
BigTable.id = TinyTable.id;

[Discussion]:

[Answer 2]:

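Possible solution 1:

1. Fetch all the data from the affected partition and save it to a temporary table.
2. Run the UPDATE/MERGE statement against the temporary table.
3. Overwrite the partition with the contents of the temporary table.

For step 3, you can address a specific partition with the $ decorator: Dataset.BigTable$20190926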
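As a rough sketch of the naming involved in the steps above, the following Python helpers build the partition-decorated table name for step 3 and a SELECT for step 1 that is restricted to one day. The helper names `partition_decorator` and `partition_select_sql` are made up for illustration, and the sketch assumes daily partitioning on a TIMESTAMP column named `date_time`. It only builds strings; actually running the statements is left to whichever client you use.

```python
from datetime import date, timedelta


def partition_decorator(table: str, day: date) -> str:
    """Return the table name with a daily partition decorator appended.

    Assumes daily partitioning, so the suffix has the form $YYYYMMDD.
    """
    return f"{table}${day:%Y%m%d}"


def partition_select_sql(table: str, day: date, column: str = "date_time") -> str:
    """Build a SELECT limited to one day using constant timestamp bounds.

    Hypothetical helper: the constant literals are what allow partition pruning.
    """
    next_day = day + timedelta(days=1)
    return (
        f"SELECT * FROM {table} "
        f"WHERE {column} >= TIMESTAMP '{day:%Y-%m-%d} 00:00:00 UTC' "
        f"AND {column} < TIMESTAMP '{next_day:%Y-%m-%d} 00:00:00 UTC'"
    )


if __name__ == "__main__":
    day = date(2019, 9, 26)
    print(partition_decorator("Dataset.BigTable", day))
    print(partition_select_sql("Dataset.BigTable", day))
```

The decorated name would be used as the destination table of the job that overwrites the partition.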

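Possible solution 2:

You can schedule a Python script that runs SQL queries like the last one in the question (the one with the hard-coded timestamp). Google provides a nice library for this. You can even run the queries in parallel using ThreadPoolExecutor from concurrent.futures or any other threading library.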
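A minimal sketch of that idea, under some assumptions: `build_update_sql` and `run_all` are hypothetical helpers, the table and column names are taken from the question, and `run_query` is whatever callable actually submits the SQL. With Google's Python client it could be something like `lambda sql: client.query(sql).result()`; here it is injected so the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Callable, Iterable, List


def build_update_sql(ts: datetime) -> str:
    """Build the UPDATE for one partition value, using a timestamp literal.

    `ts` must be timezone-aware; it is converted to UTC before formatting.
    """
    literal = ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")
    return (
        "UPDATE BigTable SET value = TinyTable.value FROM TinyTable "
        f"WHERE BigTable.date_time = TIMESTAMP '{literal}' "
        "AND BigTable.date_time = TinyTable.date_time "
        "AND BigTable.id = TinyTable.id"
    )


def run_all(
    timestamps: Iterable[datetime],
    run_query: Callable[[str], object],
    max_workers: int = 4,
) -> List[object]:
    """Run one UPDATE per timestamp in a thread pool; results keep input order."""
    queries = [build_update_sql(ts) for ts in timestamps]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_query, queries))


if __name__ == "__main__":
    dates = [
        datetime(2019, 3, 13, tzinfo=timezone.utc),
        datetime(2019, 3, 14, tzinfo=timezone.utc),
    ]
    # Stand-in runner that just echoes the SQL instead of calling BigQuery.
    for sql in run_all(dates, run_query=lambda q: q):
        print(sql)
```

Each generated statement contains a constant timestamp, which is the form the question reports as scanning only 12 MB.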

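[Discussion]:

Thanks for the tips. I'm currently doing something like solution 2 with the Java client library. I maintain a table of the dates that need updating, query it, and then dynamically generate a query containing the date literals. It just seems silly. I thought I was simply missing something. I hadn't thought of solution 1. That's interesting... it gives me something new to think about.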
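The approach described in this comment (query the list of dates, then generate one statement containing those dates as literals) might look roughly like the following in Python. This is only an illustration: `timestamp_in_list` and `build_update_with_literals` are hypothetical names, the dates are assumed to have been fetched already, and the table and column names come from the question.

```python
from datetime import datetime, timezone
from typing import Iterable


def timestamp_in_list(timestamps: Iterable[datetime]) -> str:
    """Render distinct, sorted, timezone-aware datetimes as TIMESTAMP literals."""
    unique = sorted({ts.astimezone(timezone.utc) for ts in timestamps})
    if not unique:
        raise ValueError("at least one timestamp is required")
    return ", ".join(
        "TIMESTAMP '" + ts.strftime("%Y-%m-%d %H:%M:%S") + " UTC'" for ts in unique
    )


def build_update_with_literals(timestamps: Iterable[datetime]) -> str:
    """Build a single UPDATE whose partition filter is a constant IN list."""
    literals = timestamp_in_list(timestamps)
    return (
        "UPDATE BigTable SET value = TinyTable.value FROM TinyTable "
        f"WHERE BigTable.date_time IN ({literals}) "
        "AND BigTable.date_time = TinyTable.date_time "
        "AND BigTable.id = TinyTable.id"
    )


if __name__ == "__main__":
    dates = [
        datetime(2019, 3, 14, tzinfo=timezone.utc),
        datetime(2019, 3, 13, tzinfo=timezone.utc),
        datetime(2019, 3, 13, tzinfo=timezone.utc),  # duplicates are dropped
    ]
    print(build_update_with_literals(dates))
```

Compared with one statement per date, this produces a single job for all affected partitions.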
