Using purrr::map2() with dbplyr

Posted 2023-03-25

Asked: 2018-03-26 18:04:28

Question:

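I am trying to select rows from one table ("positions") whose values in a particular column ("position") fall within ranges defined in another table ("my_ranges"), and then add a grouping tag taken from the "my_ranges" table.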

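I can do this with tibbles and a couple of purrr::map2 calls, but the same approach does not work with dbplyr database-backed tibbles. Is this the expected behavior? If so, should I take a different approach to this kind of task with dbplyr?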

Here is my example:

library("tidyverse")
set.seed(42)

my_ranges <-
  tibble(
    group_id = c("a", "b", "c", "d"),
    start = c(1, 7, 2, 25),
    end = c(5, 23, 7, 29)
    )

positions <-
  tibble(
    position = as.integer(runif(n = 100, min = 0, max = 30)),
    annotation = stringi::stri_rand_strings(n = 100, length = 10)
  )

# note: this works as I expect and returns a tibble with 106 obs of 3 variables:
result <- map2(.x = my_ranges$start, .y = my_ranges$end,
             .f = function(x, y) between(positions$position, x, y)) %>%
  map2(.y = my_ranges$group_id,
              .f = function(x, y)
                positions %>%
                  filter(x) %>%
                  mutate(group_id = y)
) %>% bind_rows()

# next, make an in-memory db for testing:
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# copy data to db
copy_to(con, my_ranges, "my_ranges", temporary = FALSE)
copy_to(con, positions, "positions", temporary = FALSE)

# get db-backed tibbles:
my_ranges_db <- tbl(con, "my_ranges")
positions_db <- tbl(con, "positions")

# note: this does not work as I expect, and instead returns a tibble with 0 observations of 0 variables:
# database range-based query:
db_result <- map2(.x = my_ranges_db$start, .y = my_ranges_db$end,
                  .f = function(x, y) 
                    between(positions_db$position, x, y)
                    ) %>%
  map2(.y = my_ranges_db$group_id,
       .f = function(x, y)
         positions_db %>%
           filter(x) %>%
           mutate(group_id = y)
  ) %>% bind_rows()

Comments:

I can use SQL directly: range_query

Answer 1:

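As long as each iteration produces a table with the same columns, there may be a neat way to push the whole operation to the database. The idea is to use map() and reduce() from purrr together. Every operation on a tbl_sql is lazy, so we can iterate over them without sending a batch of queries to the database. We can then use union(), which appends the SQL generated for each iteration to the next with a UNION clause. Here is an example: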

library(dbplyr, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)
library(DBI, warn.conflicts = FALSE)
library(rlang, warn.conflicts = FALSE)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

db_mtcars <- copy_to(con, mtcars)

cyls <- c(4, 6, 8)

all <- cyls %>%
  map(~
    db_mtcars %>%
      filter(cyl == .x) %>%
      summarise(mpg = mean(mpg, na.rm = TRUE)
      )
  ) %>%
  reduce(function(x, y) union(x, y)) 

all
#> # Source:   lazy query [?? x 1]
#> # Database: sqlite 3.22.0 []
#>     mpg
#>   <dbl>
#> 1  15.1
#> 2  19.7
#> 3  26.7

show_query(all)
#> <SQL>
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 4.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 6.0))
#> UNION
#> SELECT AVG(`mpg`) AS `mpg`
#> FROM (SELECT *
#> FROM (SELECT *
#> FROM `mtcars`)
#> WHERE (`cyl` = 8.0))

dbDisconnect(con)
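
The same map-then-reduce pattern can be applied to the range problem from the question. The following is a minimal sketch, not the answerer's code. It assumes the range table is small enough to collect() into R, uses pmap() rather than map2() so that all three range columns are iterated together, and uses union_all() rather than union() so that repeated positions within a range are not deduplicated. The names ranges_local and range_queries are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

my_ranges <- tibble(
  group_id = c("a", "b", "c", "d"),
  start = c(1, 7, 2, 25),
  end = c(5, 23, 7, 29)
)
set.seed(42)
positions <- tibble(
  position = as.integer(runif(n = 100, min = 0, max = 30))
)

my_ranges_db <- copy_to(con, my_ranges, "my_ranges")
positions_db <- copy_to(con, positions, "positions")

# Assumption: the range table is small, so pull it into R first.
ranges_local <- collect(my_ranges_db)

# Build one lazy query per range; nothing is sent to the database yet.
# `!!` inlines the local values into the generated SQL.
range_queries <- pmap(ranges_local, function(group_id, start, end) {
  positions_db %>%
    filter(between(position, !!start, !!end)) %>%
    mutate(group_id = !!group_id)
})

# Stitch the lazy queries together with UNION ALL, then run the combined query.
db_result <- reduce(range_queries, union_all)
result <- collect(db_result)

DBI::dbDisconnect(con)
```

Calling show_query(db_result) before collect() prints the combined SQL, which is a quick way to check that the filtering happens in the database.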

Answer 2:

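dbplyr translates R into SQL. Lists do not exist in SQL. map() creates lists. It is therefore impossible to translate map() into SQL.

Mostly dplyr functions and some base functions are translated, and as far as I know they are also working on tidyr functions. When using dbplyr, try to follow SQL logic in your approach, otherwise it will break easily.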
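
One way to follow SQL logic for the range problem in the question is a join whose condition is a range test rather than an equality. The sketch below is an illustration, not part of the original answer. It sends hand-written SQL through DBI::dbGetQuery(), and the small tables and the name range_join are made up for the example. The column names start and end are double-quoted because END is an SQL keyword.

```r
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical toy data, small enough to check by hand.
DBI::dbWriteTable(con, "my_ranges", data.frame(
  group_id = c("a", "b", "c", "d"),
  start = c(1, 7, 2, 25),
  end = c(5, 23, 7, 29)
))
DBI::dbWriteTable(con, "positions", data.frame(
  position = c(0L, 3L, 7L, 24L, 26L)
))

# Range join: each position is paired with every range that contains it.
range_join <- DBI::dbGetQuery(con, '
  SELECT p.position, r.group_id
  FROM positions AS p
  INNER JOIN my_ranges AS r
    ON p.position BETWEEN r."start" AND r."end"
  ORDER BY p.position, r.group_id
')

DBI::dbDisconnect(con)
```

Because ranges may overlap, a position can appear once per matching range, which mirrors what the map2() version in the question produces on local tibbles.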

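Comments:

Thanks! That helps clarify my thinking! So is there a way to remotely run something as simple as iris_db %>% sapply(unique) (or iris_db %>% map(unique))?

Try map(names(iris_db), ~select_at(iris_db,.x) %>% distinct() %>% collect()). It will run a separate SQL query for each column.

Thanks a lot. I could only get it to work with this approach, based on the vignette example: map(dbListFields(con, "flights"), ~select_at(flights_db,.x) %>% distinct() %>% collect()). I had to replace names(flights_db) with dbListFields(con, "flights"). Is there a way to use the names command on a database dataset to print the column names? It is more intuitive than typing dbListFields.

Oh yes, sorry. I think colnames() works too.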
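
The per-column approach discussed in these comments can be sketched as follows. This assumes colnames() returns the column names of a database-backed table, as the last comment suggests, and it swaps select_at() for select(all_of()). The table toy_db and its contents are invented for the example. Each column triggers its own SQL query.

```r
library(dplyr, warn.conflicts = FALSE)
library(purrr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

# Hypothetical toy table standing in for iris_db or flights_db.
toy_db <- copy_to(con, data.frame(x = c(1, 1, 2), y = c("a", "b", "b")), "toy")

# Column names of the remote table.
cols <- colnames(toy_db)

# One SELECT DISTINCT query per column, each collected into a local tibble.
distinct_values <- map(set_names(cols), function(col) {
  toy_db %>%
    select(all_of(col)) %>%
    distinct() %>%
    collect()
})

DBI::dbDisconnect(con)
```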
