Return quantiles within a summarise() call through dbplyr/bigRquery to BigQuery SQL database


Posted: 2020-07-24 19:39:09

Question:

I am trying to get quantiles of a variable in a grouped BigQuery table, but I get the following error:

Error: Job 'xxxxx' failed
Syntax error: Expected end of input but got keyword WITHIN at [1:45] [invalidQuery]

A reprex is below.

# NOTE: for reprex to work, you must have BIGQUERY_TEST_PROJECT envvar set to name of project which has billing set up and to which you have write access

library(DBI)
library(bigrquery)
library(dplyr)

billing <- bq_test_project()

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)

natality <- tbl(con, "natality")
   
natality %>%
  group_by(year) %>%
  summarize(q25 = quantile(weight_pounds,0.25),
            q50 = median(weight_pounds),
            q75 = quantile(weight_pounds,0.75)
  )

Does anyone know a workaround, perhaps by supplying SQL code via sql() inside the summarise() call?

Thanks!

Comments:

Did you find a workaround? I'm facing exactly the same problem. median() gives me the same error too...

@Ploulack See the answer below.

Answer 1:

A colleague found the answer: supply the SQL code with sql() inside the summarize() call:

# NOTE: for reprex to work, you must have BIGQUERY_TEST_PROJECT envvar set to name of project which has billing set up and to which you have write access

library(DBI)
library(bigrquery)
library(dplyr)

billing <- bq_test_project()

con <- dbConnect(
  bigrquery::bigquery(),
  project = "publicdata",
  dataset = "samples",
  billing = billing
)

natality <- tbl(con, "natality")
   
natality %>%
  group_by(year) %>%
  summarize(q25 = sql("approx_quantiles(weight_pounds,4)[offset(1)]"),
            q50 = sql("approx_quantiles(weight_pounds,2)[offset(1)]"),
            q75 = sql("approx_quantiles(weight_pounds,4)[offset(3)]")
  )
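
For anyone wondering why those offsets work: BigQuery's APPROX_QUANTILES(x, n) returns an array of n + 1 values (the minimum, the n - 1 interior boundaries, and the maximum), so offset(1) of 4 is the first quartile and offset(1) of 2 is the median. The values are approximate rather than exact. Below is a minimal sketch of a helper that builds the same kind of fragment for other probabilities. The name bq_quantile_sql is hypothetical (it is not part of dbplyr or bigrquery), and it assumes prob * buckets comes out as a whole number.

```r
# Hypothetical helper (not part of dbplyr or bigrquery): builds the BigQuery
# expression for one approximate quantile as a plain character string.
bq_quantile_sql <- function(column, prob, buckets = 100L) {
  stopifnot(is.character(column), length(column) == 1L,
            is.numeric(prob), length(prob) == 1L, prob >= 0, prob <= 1)
  # APPROX_QUANTILES(x, n) returns n + 1 elements: offset 0 is the minimum,
  # offset n is the maximum, so probability p sits at offset p * n.
  idx <- prob * buckets
  # Assumption: p * n must be a whole number (up to floating-point noise).
  stopifnot(abs(idx - round(idx)) < 1e-8)
  sprintf("approx_quantiles(%s, %d)[offset(%d)]",
          column, as.integer(buckets), as.integer(round(idx)))
}

# Build the fragments for the first quartile and the 90th percentile.
q25_sql <- bq_quantile_sql("weight_pounds", 0.25, 4)
q90_sql <- bq_quantile_sql("weight_pounds", 0.9)
```

Inside the pipeline this should be usable as summarize(q25 = !!sql(q25_sql)), where !! makes dplyr evaluate the expression locally instead of trying to translate it to SQL.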
