Scaling concurrent exports of BigQuery tables to Google Cloud Storage
Posted: 2018-03-01 19:23:47
Question:
I am trying to run a query in BigQuery and store the results in Cloud Storage. This is straightforward with the BigQuery API.
The problem appears when I try to process several queries at the same time. The more tables I try to extract, the slower the "extract" of the result tables to Cloud Storage becomes. Below is a summary of an experiment I ran with 20 concurrent jobs. All times are in seconds.
job 013 done. Query: 012.0930221081. Extract: 009.8582818508. Signed URL: 000.3398022652
job 000 done. Query: 012.1677722931. Extract: 010.7060177326. Signed URL: 000.3358650208
job 002 done. Query: 009.5634860992. Extract: 014.2841088772. Signed URL: 000.3027939796
job 004 done. Query: 011.7068181038. Extract: 012.5938670635. Signed URL: 000.2734949589
job 020 done. Query: 009.8888399601. Extract: 015.4054799080. Signed URL: 000.3903510571
job 022 done. Query: 012.9012901783. Extract: 013.9143507481. Signed URL: 000.3490731716
job 014 done. Query: 012.8500978947. Extract: 015.0055649281. Signed URL: 000.2981300354
job 006 done. Query: 011.6835210323. Extract: 016.2601530552. Signed URL: 000.2789318562
job 001 done. Query: 013.4435272217. Extract: 015.2819819450. Signed URL: 000.2984759808
job 005 done. Query: 012.0956349373. Extract: 018.9619371891. Signed URL: 000.3134548664
job 018 done. Query: 013.6754779816. Extract: 020.0537509918. Signed URL: 000.3496448994
job 011 done. Query: 011.9627509117. Extract: 025.1803772449. Signed URL: 000.3009829521
job 008 done. Query: 015.7373569012. Extract: 136.8249070644. Signed URL: 000.3158171177
job 023 done. Query: 013.7817242146. Extract: 148.2014479637. Signed URL: 000.4145238400
job 012 done. Query: 014.5390141010. Extract: 151.3171939850. Signed URL: 000.3226230145
job 007 done. Query: 014.1386809349. Extract: 160.1254091263. Signed URL: 000.2966897488
job 021 done. Query: 013.6751790047. Extract: 162.8383400440. Signed URL: 000.3162341118
job 019 done. Query: 013.5642910004. Extract: 163.2161693573. Signed URL: 000.2765989304
job 003 done. Query: 013.8807480335. Extract: 165.1014308929. Signed URL: 000.3309218884
job 024 done. Query: 013.5861997604. Extract: 182.0707099438. Signed URL: 000.3331830502
job 009 done. Query: 013.5025639534. Extract: 199.4397711754. Signed URL: 000.4156360626
job 015 done. Query: 013.7611100674. Extract: 230.2218120098. Signed URL: 000.2913899422
job 016 done. Query: 013.4659759998. Extract: 285.7284781933. Signed URL: 000.3109869957
job 017 done. Query: 019.2001299858. Extract: 322.5298812389. Signed URL: 000.2890429497
job 010 done. Query: 014.7132742405. Extract: 363.8596160412. Signed URL: 000.6748869419
Each job does three things:
1. Submits a query to BigQuery
2. Extracts the result table to Cloud Storage
3. Generates a signed URL for the blob in Cloud Storage
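The three steps above can be sketched roughly as follows. This is not the asker's actual code, which is not shown in the post. It assumes the `google-cloud-bigquery` and `google-cloud-storage` Python clients are passed in, that the query's own destination table is what gets extracted, and that thread-based concurrency is used. The function names, the `export-NNN.csv` blob naming, and the one-hour URL expiry are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta


def run_job(bq_client, storage_client, sql, bucket_name, blob_name):
    """Run query -> extract -> signed URL; return the URL and per-step timings (s)."""
    timings = {}

    # Step 1: submit the query and block until it finishes.
    t0 = time.monotonic()
    query_job = bq_client.query(sql)
    query_job.result()
    timings["query"] = time.monotonic() - t0

    # Step 2: extract the query's destination table to Cloud Storage.
    # (Assumption: the result table is the query job's destination.)
    t0 = time.monotonic()
    uri = f"gs://{bucket_name}/{blob_name}"
    extract_job = bq_client.extract_table(query_job.destination, uri)
    extract_job.result()  # includes any time the job spends queued server-side
    timings["extract"] = time.monotonic() - t0

    # Step 3: generate a signed URL for the exported blob.
    t0 = time.monotonic()
    blob = storage_client.bucket(bucket_name).blob(blob_name)
    url = blob.generate_signed_url(expiration=timedelta(hours=1))
    timings["signed_url"] = time.monotonic() - t0

    return url, timings


def run_concurrently(bq_client, storage_client, sql, bucket_name, n_jobs):
    """Run n_jobs copies of the job in parallel threads, one blob per job."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [
            pool.submit(run_job, bq_client, storage_client, sql,
                        bucket_name, f"export-{i:03d}.csv")
            for i in range(n_jobs)
        ]
        return [f.result() for f in futures]
```

With the real libraries, the two clients would be created as `bigquery.Client()` and `storage.Client()`; generating signed URLs typically requires service account credentials. Because the extract timing is measured around `extract_job.result()`, it covers both queueing and execution on Google's side.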
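As the results show, the first group of extracts takes 9 to 25 seconds, after which extracts start taking much longer.
Any ideas as to why this happens? Could this be the reason? https://cloud.google.com/storage/docs/request-rate Is there a way to work around it?
Edit: here is some additional information I found.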
| job | Local Extract timed (s) | Google Extract timed (s) | Google's Extract started | Google's Extract ended | Local Extract start | Local Extract end |
| --- | ------------------- | -------------------- | ------------------------ | ---------------------- | ------------------- | ------------------- |
| 026 | 009.26328 | 008.84300 | 13:39:00.441000 | 13:39:09.284000 | 07:39:00.235970 | 07:39:09.498784 |
| 009 | 011.52299 | 008.04000 | 13:39:00.441000 | 13:39:08.481000 | 07:39:00.234297 | 07:39:11.756788 |
| 004 | 010.35730 | 008.66700 | 13:39:03.436000 | 13:39:12.103000 | 07:39:03.240466 | 07:39:13.597328 |
| 011 | 011.86404 | 009.29900 | 13:39:03.055000 | 13:39:12.354000 | 07:39:02.893600 | 07:39:14.756887 |
| 006 | 012.50416 | 011.75400 | 13:39:02.854000 | 13:39:14.608000 | 07:39:02.623032 | 07:39:15.126790 |
| 000 | 013.30535 | 008.77000 | 13:39:02.056000 | 13:39:10.826000 | 07:39:01.863548 | 07:39:15.168434 |
| 002 | 011.47199 | 008.53700 | 13:39:04.443000 | 13:39:12.980000 | 07:39:04.236455 | 07:39:15.708005 |
| 032 | 015.68229 | 009.69200 | 13:39:02.915000 | 13:39:12.607000 | 07:39:02.768185 | 07:39:18.450160 |
| 001 | 017.46480 | 009.35800 | 13:39:01.313000 | 13:39:10.671000 | 07:39:01.071540 | 07:39:18.535896 |
| 012 | 019.02242 | 008.65700 | 13:39:00.903000 | 13:39:09.560000 | 07:39:00.727101 | 07:39:19.749070 |
| 018 | 016.95632 | 009.75800 | 13:39:03.259000 | 13:39:13.017000 | 07:39:03.080580 | 07:39:20.036199 |
| 019 | 017.24428 | 008.51100 | 13:39:03.773000 | 13:39:12.284000 | 07:39:03.575118 | 07:39:20.819042 |
| 008 | 019.55018 | 009.83600 | 13:39:02.110000 | 13:39:11.946000 | 07:39:01.905548 | 07:39:21.455273 |
| 023 | 016.64131 | 008.94500 | 13:39:05.282000 | 13:39:14.227000 | 07:39:05.041235 | 07:39:21.682086 |
| 017 | 019.39104 | 007.12700 | 13:39:03.118000 | 13:39:10.245000 | 07:39:02.896256 | 07:39:22.286485 |
| 020 | 019.96283 | 010.05000 | 13:39:03.115000 | 13:39:13.165000 | 07:39:02.942562 | 07:39:22.904864 |
| 036 | 022.05831 | 010.51200 | 13:39:02.626000 | 13:39:13.138000 | 07:39:02.461061 | 07:39:24.518903 |
| 024 | 028.39538 | 008.79600 | 13:39:05.151000 | 13:39:13.947000 | 07:39:04.916194 | 07:39:33.311248 |
| 007 | 107.36010 | 010.68900 | 13:40:31.555000 | 13:40:42.244000 | 07:39:03.050049 | 07:40:50.409359 |
| 028 | 120.63134 | 009.52400 | 13:40:49.915000 | 13:40:59.439000 | 07:39:02.941202 | 07:41:03.572094 |
| 033 | 120.78268 | 009.54200 | 13:40:27.147000 | 13:40:36.689000 | 07:39:04.152378 | 07:41:04.934602 |
| 037 | 122.64949 | 008.80400 | 13:40:33.298000 | 13:40:42.102000 | 07:39:06.500587 | 07:41:09.149629 |
| 035 | 125.35254 | 009.13200 | 13:40:27.600000 | 13:40:36.732000 | 07:39:04.295941 | 07:41:09.647836 |
| 015 | 139.13287 | 011.17800 | 13:40:27.116000 | 13:40:38.294000 | 07:39:03.406321 | 07:41:22.538701 |
| 029 | 141.21037 | 008.23700 | 13:40:24.271000 | 13:40:32.508000 | 07:39:03.816588 | 07:41:25.026438 |
| 013 | 145.94239 | 009.19400 | 13:40:33.809000 | 13:40:43.003000 | 07:39:03.375451 | 07:41:29.317454 |
| 039 | 149.92807 | 009.72300 | 13:40:33.090000 | 13:40:42.813000 | 07:39:03.635156 | 07:41:33.562607 |
| 016 | 166.26505 | 010.12000 | 13:40:39.999000 | 13:40:50.119000 | 07:39:03.383215 | 07:41:49.647907 |
| 010 | 210.61908 | 011.37900 | 13:42:20.287000 | 13:42:31.666000 | 07:39:03.702486 | 07:42:34.321079 |
| 027 | 227.83011 | 010.00900 | 13:42:25.845000 | 13:42:35.854000 | 07:39:02.953435 | 07:42:50.783106 |
| 025 | 228.48326 | 009.71000 | 13:42:20.845000 | 13:42:30.555000 | 07:39:03.673122 | 07:42:52.155934 |
| 022 | 244.57685 | 010.06900 | 13:42:53.712000 | 13:43:03.781000 | 07:39:03.963936 | 07:43:08.540307 |
| 021 | 263.74717 | 009.81400 | 13:42:40.211000 | 13:42:50.025000 | 07:39:04.505016 | 07:43:28.251864 |
| 031 | 273.96990 | 008.55100 | 13:43:18.645000 | 13:43:27.196000 | 07:39:03.618419 | 07:43:37.587862 |
| 034 | 280.96174 | 010.53300 | 13:42:58.364000 | 13:43:08.897000 | 07:39:04.313498 | 07:43:45.274962 |
| 030 | 281.76029 | 008.27100 | 13:42:49.448000 | 13:42:57.719000 | 07:39:03.832644 | 07:43:45.592592 |
| 005 | 288.15577 | 009.85300 | 13:43:04.825000 | 13:43:14.678000 | 07:39:04.006553 | 07:43:52.161888 |
| 003 | 296.52279 | 009.65300 | 13:43:24.041000 | 13:43:33.694000 | 07:39:03.831264 | 07:44:00.353715 |
| 038 | 380.01783 | 008.45000 | 13:44:57.326000 | 13:45:05.776000 | 07:39:03.055733 | 07:45:23.073209 |
| 014 | 397.05841 | 008.99800 | 13:44:48.577000 | 13:44:57.575000 | 07:39:03.132323 | 07:45:40.190302 |
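The table shows how long I had to wait locally for each extract to finish, alongside how long Google took to run it. The timings indicate that the extract itself does not take Google long, but the jobs are not all run concurrently, so some extracts are forced to wait several minutes before they start.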
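The comparison in the table can be reproduced with a small helper that splits the locally observed time into server-side run time and time spent waiting before the job started. This is a minimal sketch: it assumes the job object exposes timezone-aware `started` and `ended` datetimes (as a finished `ExtractJob` in `google-cloud-bigquery` does), and that the six-hour gap between the local and Google columns is a UTC-6 local clock. The function name and the calendar date in the example are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace


def extract_wait_breakdown(job, local_start, local_end):
    """Split the local wait for an extract into queue time and server run time.

    All datetimes must be timezone-aware so that local and server clocks
    can be subtracted directly.
    """
    return {
        # Wall-clock time the caller waited for the extract.
        "local_total": (local_end - local_start).total_seconds(),
        # Time Google reports the extract actually ran.
        "server_run": (job.ended - job.started).total_seconds(),
        # Time between the local submit and the server-side start.
        "queued": (job.started - local_start).total_seconds(),
    }


# Row for job 007 from the table. The date is a placeholder, and the
# local timezone (UTC-6) is an assumption inferred from the six-hour offset.
local_tz = timezone(timedelta(hours=-6))
job_007 = SimpleNamespace(
    started=datetime(2018, 3, 1, 13, 40, 31, 555000, tzinfo=timezone.utc),
    ended=datetime(2018, 3, 1, 13, 40, 42, 244000, tzinfo=timezone.utc),
)
breakdown = extract_wait_breakdown(
    job_007,
    local_start=datetime(2018, 3, 1, 7, 39, 3, 50049, tzinfo=local_tz),
    local_end=datetime(2018, 3, 1, 7, 40, 50, 409359, tzinfo=local_tz),
)
print({k: round(v, 3) for k, v in breakdown.items()})
```

Under those assumptions, job 007 ran for about 10.7 seconds on Google's side but did not start until roughly 88.5 seconds after it was submitted locally, which accounts for most of the 107-second local wait.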
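Comments:
Some pipeline improvements rolled out yesterday. If you are still blocked, you could try again and see whether it helps. Thanks!
I reran some of my tests and they are much faster now. Thanks!
Answer 1:
You are right: there is currently an internal limit on the rate at which export jobs are processed. It was originally put in place to protect the system from too many long-running, expensive exports running in parallel. However, as you point out, the limit does not seem helpful in a case like yours, where many export jobs each finish in under a minute.
We have an open (internal) bug to address this and improve the situation for small exports like yours. In the meantime, if you believe you are blocked by this, please file a bug or let me know your project ID, and we can help raise the limit for your project.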