Scaling concurrent exports of BigQuery tables to Google Cloud Storage

Asked: 2018-03-01 19:23:47

Question:

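I'm trying to run queries in BigQuery and store the results in Cloud Storage. With the BigQuery API this is fairly straightforward.

The problem appears when I process several queries at once. The more tables I try to extract, the slower the "extract" of the result tables to Cloud Storage becomes. Below is a summary of an experiment I ran with 20 concurrent jobs. All times are in seconds.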

job 013 done. Query: 012.0930221081. Extract: 009.8582818508. Signed URL: 000.3398022652
job 000 done. Query: 012.1677722931. Extract: 010.7060177326. Signed URL: 000.3358650208
job 002 done. Query: 009.5634860992. Extract: 014.2841088772. Signed URL: 000.3027939796
job 004 done. Query: 011.7068181038. Extract: 012.5938670635. Signed URL: 000.2734949589
job 020 done. Query: 009.8888399601. Extract: 015.4054799080. Signed URL: 000.3903510571
job 022 done. Query: 012.9012901783. Extract: 013.9143507481. Signed URL: 000.3490731716
job 014 done. Query: 012.8500978947. Extract: 015.0055649281. Signed URL: 000.2981300354
job 006 done. Query: 011.6835210323. Extract: 016.2601530552. Signed URL: 000.2789318562
job 001 done. Query: 013.4435272217. Extract: 015.2819819450. Signed URL: 000.2984759808
job 005 done. Query: 012.0956349373. Extract: 018.9619371891. Signed URL: 000.3134548664
job 018 done. Query: 013.6754779816. Extract: 020.0537509918. Signed URL: 000.3496448994
job 011 done. Query: 011.9627509117. Extract: 025.1803772449. Signed URL: 000.3009829521
job 008 done. Query: 015.7373569012. Extract: 136.8249070644. Signed URL: 000.3158171177
job 023 done. Query: 013.7817242146. Extract: 148.2014479637. Signed URL: 000.4145238400
job 012 done. Query: 014.5390141010. Extract: 151.3171939850. Signed URL: 000.3226230145
job 007 done. Query: 014.1386809349. Extract: 160.1254091263. Signed URL: 000.2966897488
job 021 done. Query: 013.6751790047. Extract: 162.8383400440. Signed URL: 000.3162341118
job 019 done. Query: 013.5642910004. Extract: 163.2161693573. Signed URL: 000.2765989304
job 003 done. Query: 013.8807480335. Extract: 165.1014308929. Signed URL: 000.3309218884
job 024 done. Query: 013.5861997604. Extract: 182.0707099438. Signed URL: 000.3331830502
job 009 done. Query: 013.5025639534. Extract: 199.4397711754. Signed URL: 000.4156360626
job 015 done. Query: 013.7611100674. Extract: 230.2218120098. Signed URL: 000.2913899422
job 016 done. Query: 013.4659759998. Extract: 285.7284781933. Signed URL: 000.3109869957
job 017 done. Query: 019.2001299858. Extract: 322.5298812389. Signed URL: 000.2890429497
job 010 done. Query: 014.7132742405. Extract: 363.8596160412. Signed URL: 000.6748869419

Each job does three things:

1. Submit a query to BigQuery
2. Extract the result table to Cloud Storage
3. Generate a signed URL for the blob in Cloud Storage
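For readers who want to reproduce the measurement, here is a minimal sketch of one such job in Python. It is not the code used in the question. It assumes `bq_client` and `gcs_client` behave like `google.cloud.bigquery.Client` and `google.cloud.storage.Client`, and that `job_config` writes the query output to `result_table`; the function name `run_job` and all parameter names are hypothetical.

```python
import datetime
import time


def run_job(bq_client, gcs_client, sql, job_config, result_table,
            bucket_name, blob_name):
    """Run the three steps once and return (timings_in_seconds, signed_url).

    Assumption: job_config directs the query result into result_table,
    so that the same table can be extracted in step 2.
    """
    timings = {}

    # Step 1: submit the query and block until it finishes.
    start = time.perf_counter()
    bq_client.query(sql, job_config=job_config).result()
    timings["query"] = time.perf_counter() - start

    # Step 2: extract the result table to Cloud Storage.
    # The measured time includes any time the extract job spends queued.
    start = time.perf_counter()
    destination_uri = f"gs://{bucket_name}/{blob_name}"
    bq_client.extract_table(result_table, destination_uri).result()
    timings["extract"] = time.perf_counter() - start

    # Step 3: create a signed URL for the exported blob.
    start = time.perf_counter()
    blob = gcs_client.bucket(bucket_name).blob(blob_name)
    signed_url = blob.generate_signed_url(
        expiration=datetime.timedelta(hours=1), method="GET"
    )
    timings["signed_url"] = time.perf_counter() - start

    return timings, signed_url
```

Passing the clients in as arguments keeps the function independent of how credentials are set up. With the real libraries, `job_config` would typically be a `bigquery.QueryJobConfig` whose `destination` is set to `result_table`.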

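As the results show, the first group of extracts takes 9 to 25 seconds. After that, extracts start taking much longer, from about 137 to 364 seconds.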
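The split between the two groups can be checked mechanically. The sketch below parses lines in the log format shown above and separates fast extracts from slow ones. The 60-second threshold and the names `parse_line` and `split_by_extract_time` are assumptions made for illustration, not part of the original experiment.

```python
import re

# Matches lines such as:
# job 013 done. Query: 012.0930221081. Extract: 009.8582818508. Signed URL: 000.3398022652
LINE_RE = re.compile(
    r"job (\d+) done\. Query: (\d+\.\d+)\. "
    r"Extract: (\d+\.\d+)\. Signed URL: (\d+\.\d+)"
)


def parse_line(line):
    """Return a dict of timings in seconds, or None if the line does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    job, query, extract, signed_url = match.groups()
    return {
        "job": job,
        "query": float(query),
        "extract": float(extract),
        "signed_url": float(signed_url),
    }


def split_by_extract_time(lines, threshold_seconds=60.0):
    """Split parsed records into (fast, slow) by extract time.

    threshold_seconds is an assumed cut-off; any value between the two
    groups seen in the log would give the same split.
    """
    records = [r for r in map(parse_line, lines) if r is not None]
    fast = [r for r in records if r["extract"] < threshold_seconds]
    slow = [r for r in records if r["extract"] >= threshold_seconds]
    return fast, slow
```

Feeding the log above through `split_by_extract_time` makes the jump easy to see, since the query and signed URL times stay flat while only the extract time changes.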

Any ideas why this happens? Is this the cause: https://cloud.google.com/storage/docs/request-rate ? Is there a way around it?

Edit: here is some additional information I found.

| job | Local Extract timed | Google Extract timed | Google's Extract started | Google's Extract ended | Local Extract start | Local Extract end   |
| --- | ------------------- | -------------------- | ------------------------ | ---------------------- | ------------------- | ------------------- |
| 026 | 009.26328           | 008.84300            | 13:39:00.441000          | 13:39:09.284000        | 07:39:00.235970     | 07:39:09.498784     |
| 009 | 011.52299           | 008.04000            | 13:39:00.441000          | 13:39:08.481000        | 07:39:00.234297     | 07:39:11.756788     |
| 004 | 010.35730           | 008.66700            | 13:39:03.436000          | 13:39:12.103000        | 07:39:03.240466     | 07:39:13.597328     |
| 011 | 011.86404           | 009.29900            | 13:39:03.055000          | 13:39:12.354000        | 07:39:02.893600     | 07:39:14.756887     |
| 006 | 012.50416           | 011.75400            | 13:39:02.854000          | 13:39:14.608000        | 07:39:02.623032     | 07:39:15.126790     |
| 000 | 013.30535           | 008.77000            | 13:39:02.056000          | 13:39:10.826000        | 07:39:01.863548     | 07:39:15.168434     |
| 002 | 011.47199           | 008.53700            | 13:39:04.443000          | 13:39:12.980000        | 07:39:04.236455     | 07:39:15.708005     |
| 032 | 015.68229           | 009.69200            | 13:39:02.915000          | 13:39:12.607000        | 07:39:02.768185     | 07:39:18.450160     |
| 001 | 017.46480           | 009.35800            | 13:39:01.313000          | 13:39:10.671000        | 07:39:01.071540     | 07:39:18.535896     |
| 012 | 019.02242           | 008.65700            | 13:39:00.903000          | 13:39:09.560000        | 07:39:00.727101     | 07:39:19.749070     |
| 018 | 016.95632           | 009.75800            | 13:39:03.259000          | 13:39:13.017000        | 07:39:03.080580     | 07:39:20.036199     |
| 019 | 017.24428           | 008.51100            | 13:39:03.773000          | 13:39:12.284000        | 07:39:03.575118     | 07:39:20.819042     |
| 008 | 019.55018           | 009.83600            | 13:39:02.110000          | 13:39:11.946000        | 07:39:01.905548     | 07:39:21.455273     |
| 023 | 016.64131           | 008.94500            | 13:39:05.282000          | 13:39:14.227000        | 07:39:05.041235     | 07:39:21.682086     |
| 017 | 019.39104           | 007.12700            | 13:39:03.118000          | 13:39:10.245000        | 07:39:02.896256     | 07:39:22.286485     |
| 020 | 019.96283           | 010.05000            | 13:39:03.115000          | 13:39:13.165000        | 07:39:02.942562     | 07:39:22.904864     |
| 036 | 022.05831           | 010.51200            | 13:39:02.626000          | 13:39:13.138000        | 07:39:02.461061     | 07:39:24.518903     |
| 024 | 028.39538           | 008.79600            | 13:39:05.151000          | 13:39:13.947000        | 07:39:04.916194     | 07:39:33.311248     |
| 007 | 107.36010           | 010.68900            | 13:40:31.555000          | 13:40:42.244000        | 07:39:03.050049     | 07:40:50.409359     |
| 028 | 120.63134           | 009.52400            | 13:40:49.915000          | 13:40:59.439000        | 07:39:02.941202     | 07:41:03.572094     |
| 033 | 120.78268           | 009.54200            | 13:40:27.147000          | 13:40:36.689000        | 07:39:04.152378     | 07:41:04.934602     |
| 037 | 122.64949           | 008.80400            | 13:40:33.298000          | 13:40:42.102000        | 07:39:06.500587     | 07:41:09.149629     |
| 035 | 125.35254           | 009.13200            | 13:40:27.600000          | 13:40:36.732000        | 07:39:04.295941     | 07:41:09.647836     |
| 015 | 139.13287           | 011.17800            | 13:40:27.116000          | 13:40:38.294000        | 07:39:03.406321     | 07:41:22.538701     |
| 029 | 141.21037           | 008.23700            | 13:40:24.271000          | 13:40:32.508000        | 07:39:03.816588     | 07:41:25.026438     |
| 013 | 145.94239           | 009.19400            | 13:40:33.809000          | 13:40:43.003000        | 07:39:03.375451     | 07:41:29.317454     |
| 039 | 149.92807           | 009.72300            | 13:40:33.090000          | 13:40:42.813000        | 07:39:03.635156     | 07:41:33.562607     |
| 016 | 166.26505           | 010.12000            | 13:40:39.999000          | 13:40:50.119000        | 07:39:03.383215     | 07:41:49.647907     |
| 010 | 210.61908           | 011.37900            | 13:42:20.287000          | 13:42:31.666000        | 07:39:03.702486     | 07:42:34.321079     |
| 027 | 227.83011           | 010.00900            | 13:42:25.845000          | 13:42:35.854000        | 07:39:02.953435     | 07:42:50.783106     |
| 025 | 228.48326           | 009.71000            | 13:42:20.845000          | 13:42:30.555000        | 07:39:03.673122     | 07:42:52.155934     |
| 022 | 244.57685           | 010.06900            | 13:42:53.712000          | 13:43:03.781000        | 07:39:03.963936     | 07:43:08.540307     |
| 021 | 263.74717           | 009.81400            | 13:42:40.211000          | 13:42:50.025000        | 07:39:04.505016     | 07:43:28.251864     |
| 031 | 273.96990           | 008.55100            | 13:43:18.645000          | 13:43:27.196000        | 07:39:03.618419     | 07:43:37.587862     |
| 034 | 280.96174           | 010.53300            | 13:42:58.364000          | 13:43:08.897000        | 07:39:04.313498     | 07:43:45.274962     |
| 030 | 281.76029           | 008.27100            | 13:42:49.448000          | 13:42:57.719000        | 07:39:03.832644     | 07:43:45.592592     |
| 005 | 288.15577           | 009.85300            | 13:43:04.825000          | 13:43:14.678000        | 07:39:04.006553     | 07:43:52.161888     |
| 003 | 296.52279           | 009.65300            | 13:43:24.041000          | 13:43:33.694000        | 07:39:03.831264     | 07:44:00.353715     |
| 038 | 380.01783           | 008.45000            | 13:44:57.326000          | 13:45:05.776000        | 07:39:03.055733     | 07:45:23.073209     |
| 014 | 397.05841           | 008.99800            | 13:44:48.577000          | 13:44:57.575000        | 07:39:03.132323     | 07:45:40.190302     |

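The table shows how long I had to wait locally for each extract and how long Google reports the extract took. Judging by the timestamps, Google does not take long to perform an extract (roughly 7 to 12 seconds each), but it does not run all of these jobs concurrently, so some extracts are forced to wait several minutes before they start.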
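The wait before an extract starts can be computed from the two start columns. The sketch below assumes that Google's timestamps are in UTC and the local clock is 6 hours behind, which is inferred from the table rather than stated anywhere, and it ignores clock skew between the two machines. The function names are hypothetical. With the BigQuery Python client, the server-side times would typically come from the job's `created`, `started`, and `ended` properties.

```python
from datetime import datetime, timedelta


def parse_clock(text):
    """Parse a time of day such as '13:40:31.555000' (no date part)."""
    return datetime.strptime(text, "%H:%M:%S.%f")


def queue_wait_seconds(local_start, google_start, offset_hours=6):
    """Seconds between the local submit time and Google's reported start.

    offset_hours is added to the local time to align it with Google's
    timestamps. The default of 6 is an assumption read off the table.
    """
    local = parse_clock(local_start) + timedelta(hours=offset_hours)
    google = parse_clock(google_start)
    return (google - local).total_seconds()


# Job 026 (started almost immediately) and job 007 (first delayed job).
wait_026 = queue_wait_seconds("07:39:00.235970", "13:39:00.441000")
wait_007 = queue_wait_seconds("07:39:03.050049", "13:40:31.555000")
print(f"job 026 waited {wait_026:.3f} s, job 007 waited {wait_007:.3f} s")
```

Under these assumptions job 026 waits a fraction of a second while job 007 waits close to a minute and a half, even though both extracts run for about the same time once started.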

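Comments:

Some improvements to the pipeline rolled out yesterday. If you are still blocked, please try again and see whether it helps. Thanks!

I re-ran some of my tests and it is much faster now. Thanks!

Answer 1: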

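You're right: there is currently an internal limit on how fast export jobs are processed. It was originally put in place to protect the system from too many long-running, expensive exports executing in parallel. However, as you point out, the limit does not seem helpful in a case like yours, where many export jobs each finish in under a minute.

We have an open (internal) bug to address this and improve the situation for small exports like yours. In the meantime, if you believe you are blocked by this, please file a bug or let me know your project ID, and we can help raise the limit for your project.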
