Error accessing columns when reading data from sparklyr tibble objects

Posted 2023-03-23

Asked: 2020-06-28 20:27:30

Question:

I am trying to reproduce the basic ALS example in Spark from this link:

https://rdrr.io/cran/sparklyr/man/ml_als.html

movies <- data.frame(
  user   = c(1, 2, 0, 1, 2, 0),
  item   = c(1, 1, 1, 2, 2, 0),
  rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies)

model <- ml_als(movies_tbl, rating ~ user + item)

ml_predict(model, movies_tbl)

ml_recommend(model, type = "item", 1)

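This code runs without problems for me. The problem is that I cannot manipulate the values in the prediction table, which has the following format: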

prediction = ml_recommend(model, type = "item", 1)


> prediction
# Source: spark<?> [?? x 4]
   user recommendations  item rating
  <int> <list>          <int>  <dbl>
1     1 <list [2]>          2   3.98
2     2 <list [2]>          2   4.86
3     0 <list [2]>          0   3.88

I cannot select a column; I get an empty response:

> prediction$prediction
NULL

Nor can I filter them:

> prediction %>%
+   select  (user)
Error in select(., user) : object 'user' not found

I cannot even read data from the original data frame this way:

movies_tbl %>%
  select  (user)

This returns the same error as above.

Comments:

prediction %>% select(user) and movies_tbl %>% select(user) work for me. Did you correctly run sc <- spark_connect(master = "local") and load dplyr?

Answer 1:

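When working with Spark from R, you can use (1) SQL, e.g. via the DBI package, or (2) the dplyr package. You cannot use base R operations such as subsetting with $, because the object is a reference to a Spark table rather than an in-memory data frame.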

# Using DBI
library(DBI)
dbGetQuery(sc, "SELECT count(*) FROM movies")
  count(1)
1        6

# Using dplyr
library(dplyr)
select(prediction, user)
# Source: spark<?> [?? x 1]
   user
  <int>
1     2
2     0
3     1
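
The comment's question about loading dplyr can be checked directly. The sketch below is a minimal illustration, assuming a local Spark installation is available; the idea that another package's select() is masking dplyr's is only a guess at the cause of the "object 'user' not found" error, not something the question confirms.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

movies <- data.frame(
  user   = c(1, 2, 0, 1, 2, 0),
  item   = c(1, 1, 1, 2, 2, 0),
  rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies, overwrite = TRUE)

# Which package does the bare name `select` resolve to in this session?
select_source <- environmentName(environment(select))
print(select_source)

# Calling the verb with an explicit namespace sidesteps any masking.
users <- dplyr::select(movies_tbl, user)
print(users)
```

If select_source is anything other than "dplyr", a package attached later has masked the verb, and writing dplyr::select() explicitly avoids the problem.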

To bring the data back into R, use collect(), a dplyr verb that works on sparklyr tables.
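
As a rough sketch of that last step, again assuming a local Spark installation: once collect() has pulled the rows into a local tibble, base R subsetting with $ behaves as usual. The variable name movies_local is hypothetical and used only for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (for example via spark_install()).
sc <- spark_connect(master = "local")

movies <- data.frame(
  user   = c(1, 2, 0, 1, 2, 0),
  item   = c(1, 1, 1, 2, 2, 0),
  rating = c(3, 1, 2, 4, 5, 4)
)
movies_tbl <- sdf_copy_to(sc, movies, overwrite = TRUE)

# collect() runs the query in Spark and returns an ordinary local tibble.
movies_local <- collect(movies_tbl)

# Base R subsetting now works, because the data lives in R memory.
print(movies_local$user)
print(mean(movies_local$rating))

spark_disconnect(sc)
```

Collecting copies every row to the R session, so on large tables it is usually better to filter or aggregate with dplyr verbs first and collect only the result.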
