How to load large datasets to R from BigQuery?

Posted 2023-03-25

Asked: 2018-09-02 14:42:42

Question:

I tried two approaches using the bigrquery package:

library(bigrquery)
library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "YOUR PROJECT ID HERE",
  dataset = "YOUR DATASET"
)
test<- dbGetQuery(con, sql, n = 10000, max_pages = Inf)

and

sql <- "YOUR LARGE QUERY HERE" # the long query is saved as a view; this string is a SELECT on that view
tb <- bigrquery::bq_project_query(project, sql)
bq_table_download(tb, max_results = 1000)

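but both fail with the error "Error: Requested Resource Too Large to Return [responseTooLarge]". A possibly related question is here, but I am interested in any tool that gets the job done: I have already tried the solutions outlined here, and they failed.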

How can I load large datasets from BigQuery into R?

Comments:

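Why the downvote? I can do this easily in other languages such as Python, but with R there seems to be no easy way unless one finds a sharding option or something similar.

That error is coming specifically from Google/BigQuery. Are you sure you read everything you should have in the docs?

@hrbrmstr I see. I did not expect that this tool would not use batched or sharded downloads the way Pandas read_gbq does in Python. I currently use the mentioned method for large datasets, but it would be more convenient to do this directly from R. Is there really no batch download option in R comparable to read_gbq?

Answer 1: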

As @hrbrmstr suggested, the documentation specifically mentions:

> #' @param page_size The number of rows returned per page. Make this smaller
> #'   if you have many fields or large records and you are seeing a
> #'   'responseTooLarge' error.

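In this documentation from r-project.org, you will find a different recommendation in the explanation of this function (page 13):

This retrieves rows in chunks of page_size. It is most suitable for the results of smaller queries (e.g.,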
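The advice to make page_size smaller when 'responseTooLarge' appears can be automated. Below is a minimal sketch, not part of bigrquery: `download_with_backoff`, `fetch`, and `min_page_size` are hypothetical names introduced here for illustration. It assumes the error message contains the text "responseTooLarge", as in the error quoted in the question, and halves page_size until the download succeeds.

```r
# Sketch (hypothetical helper, not a bigrquery function): retry a download
# with a smaller page_size whenever the error mentions 'responseTooLarge'.
# `fetch` is any function that takes a page_size and returns the rows.
download_with_backoff <- function(fetch, page_size = 10000L, min_page_size = 100L) {
  repeat {
    attempt <- tryCatch(
      list(ok = TRUE, value = fetch(page_size)),
      error = function(e) {
        # Only handle the size error; re-raise anything else unchanged.
        if (!grepl("responseTooLarge", conditionMessage(e), fixed = TRUE)) stop(e)
        list(ok = FALSE)
      }
    )
    if (attempt$ok) return(attempt$value)
    if (page_size <= min_page_size) {
      stop("Response still too large at page_size = ", page_size, call. = FALSE)
    }
    # Halve the page size, but never go below the floor.
    page_size <- max(min_page_size, page_size %/% 2L)
  }
}

# Usage with bigrquery (needs credentials; `tb` is assumed to be a bq_table,
# for example the result of bq_project_query()):
# rows <- download_with_backoff(
#   function(ps) bigrquery::bq_table_download(tb, page_size = ps)
# )
```

Passing the download as a function keeps the retry logic independent of how the table is actually fetched, so it can be tried out with a stub before pointing it at a real table.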

Answer 2:

I saw that someone has created a way to make this easier. There is some setup involved, but you can download through the BigQuery Storage API like so:

## Auth is done automagically using Application Default Credentials.
## Use the following command once to set it up :
## gcloud auth application-default login --billing-project=project
library(bigrquerystorage)

# TODO(developer): Set the project_id variable.
# project_id <- 'your-project-id'
#
# The read session is created in this project. This project can be
# different from that which contains the table.

rows <- bqs_table_download(
  x = "bigquery-public-data:usa_names.usa_1910_current"
  , parent = project_id
  # , snapshot_time = Sys.time() # a POSIX time
  , selected_fields = c("name", "number", "state")
  , row_restriction = 'state = "WA"'
  # , as_tibble = TRUE # FALSE : arrow, TRUE : arrow->as.data.frame
)

sprintf("Got %d unique names in states: %s",
        length(unique(rows$name)),
        paste(unique(rows$state), collapse = " "))

# Replace bigrquery::bq_table_download
library(bigrquery)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Downloading 6,122,890 rows in 613 pages.
overload_bq_table_download(project_id)
rows <- bigrquery::bq_table_download("bigquery-public-data.usa_names.usa_1910_current")
# Streamed 6122890 rows in 5980 messages.

Answer 3:

I have also just started using BigQuery. I think it should look something like this.

The current bigrquery release can be installed from CRAN:

install.packages("bigrquery")

The latest development version can be installed from GitHub:

install.packages('devtools')
devtools::install_github("r-dbi/bigrquery")

Usage

Low-level API

library(bigrquery)
billing <- bq_test_project() # replace this with your project ID 
sql <- "SELECT year, month, day, weight_pounds FROM `publicdata.samples.natality`"

tb <- bq_project_query(billing, sql)
#> Auto-refreshing stale OAuth token.
bq_table_download(tb, max_results = 10)

DBI

library(DBI)

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)
con 
#> <BigQueryConnection>
#>   Dataset: publicdata.samples
#>   Billing: bigrquery-examples

dbListTables(con)
#> [1] "github_nested"   "github_timeline" "gsod"            "natality"       
#> [5] "shakespeare"     "trigrams"        "wikipedia"

dbGetQuery(con, sql, n = 10)



dplyr

library(dplyr)

natality <- tbl(con, "natality")

natality %>%
  select(year, month, day, weight_pounds) %>% 
  head(10) %>%
  collect()

Comments:

I demonstrated this approach in the question, but it fails with "Error: Requested Resource Too Large to Return [responseTooLarge]". Have you found a way around this error for larger datasets?

Answer 4:

This worked for me.

# Make page_size some value greater than the default (10000)
x <- 50000

bq_table_download(tb, page_size=x)

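Note that if you set page_size to an arbitrarily high value (100000 in my case), you will start seeing a lot of empty rows.

I still have not found a good "rule of thumb" for what the right page_size value should be for a given table size.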
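One possible starting point for such a rule of thumb is to size pages by bytes rather than rows. The sketch below is an assumption, not documented bigrquery behaviour: `estimate_page_size` is a hypothetical helper, and both `target_bytes` (a per-page byte budget) and `max_page_size` are made-up defaults that would need tuning. The stored table size is also only a rough proxy for the size of the API response.

```r
# Sketch (hypothetical helper): pick a page_size so that one page of rows
# stays under an assumed byte budget. `target_bytes` is NOT a documented
# BigQuery limit; treat it as a knob to tune.
estimate_page_size <- function(num_bytes, num_rows,
                               target_bytes = 5 * 1024^2,
                               max_page_size = 10000L) {
  stopifnot(num_bytes >= 0, num_rows > 0, target_bytes > 0)
  bytes_per_row <- num_bytes / num_rows
  # An empty-looking table gives no size signal; fall back to the cap.
  if (bytes_per_row == 0) return(as.integer(max_page_size))
  rows_that_fit <- floor(target_bytes / bytes_per_row)
  # Keep the result between 1 row and the cap.
  as.integer(max(1, min(max_page_size, rows_that_fit)))
}

# Usage with bigrquery (needs credentials; `tb` is assumed to be a bq_table):
# ps <- estimate_page_size(
#   num_bytes = as.numeric(bigrquery::bq_table_size(tb)),
#   num_rows  = as.numeric(bigrquery::bq_table_nrow(tb))
# )
# rows <- bigrquery::bq_table_download(tb, page_size = ps)
```

Wide tables with large records get small pages, while narrow tables are capped at `max_page_size`, which is in line with the documentation's advice to reduce page_size when there are many fields or large records.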
