R (arules): Convert a data frame into transactions and remove NA

Asked: 2017-08-19 16:30:37

Question:

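I have a data frame that I want to convert into transaction data so that I can run a market basket analysis with the arules package in R. I searched online for ways to convert a data frame into transactions, for example (How to prep transaction data into basket for arules) and (Transform csv into transactions for arules), but the result I get is different from theirs.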

Input (df)

structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
Other = c(NA, NA, NA, NA, "Promo", NA)), 
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

This is what my data frame looks like

Transaction_ID  Fruits  Vegetables  Personal  Drink  Other
      A001        NA        NA       ToothP   Coff    NA
      A002       Apple      NA       ToothP    NA     NA
      A003      Orange      NA         NA     Coff    NA
      A004        NA      Potato     ToothB   Milk    NA
      A005       Pear       NA       ToothB   Milk   Promo
      A006      Grape      Yam         NA     Coff    NA

Class of each column

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
"character"    "character"    "character"    "character"    "character"    "character"

Converting the data frame into transaction data

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

The result I get

[1] NA,NA,ToothP,Coff,NA
[2] Apple,NA,ToothP,NA,NA
[3] Orange,NA,NA,Coff,NA
[4] NA,Potato,ToothB,Milk,NA
[5] Pear,NA,ToothB,Milk,Promo
[6] Grape,Yam,NA,Coff,NA

The conversion into transaction data works, but is there a way to remove the NA items? If they stay in the transaction list, NA will be treated as an item.

Comments:

I can't reproduce your example. Can you provide dput(df)?

Hi Steven, I edited my post and added dput(df) :)

Answer 1:

Ogustari is right. Below is the complete code, which also takes care of the transaction IDs.

library("arules")
library("dplyr")  ### for tbl_df
df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
  Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
  Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
  Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
  Other = c(NA, NA, NA, NA, "Promo", NA)), 
  .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

### remove transaction IDs
tid <- as.character(df[["Transaction_ID"]])
df <- df[,-1]

### make all columns factors
for (i in seq_along(df)) df[[i]] <- as.factor(df[[i]])

trans <- as(df, "transactions")

### set transactionIDs
transactionInfo(trans)[["transactionID"]] <- tid

inspect(trans)

    items                                              transactionID
[1] Personal=ToothP,Drink=Coff                         A001
[2] Fruits=Apple,Personal=ToothP                       A002
[3] Fruits=Orange,Drink=Coff                           A003
[4] Vegetables=Potato,Personal=ToothB,Drink=Milk       A004
[5] Fruits=Pear,Personal=ToothB,Drink=Milk,Other=Promo A005
[6] Fruits=Grape,Vegetables=Yam,Drink=Coff             A006
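
If plain item labels (ToothP rather than Personal=ToothP) are preferred, one possible alternative is to build a list with one character vector per row, drop the NA entries, and coerce that list. This is only a sketch and not part of the answer above: it assumes a plain data.frame with character columns and Transaction_ID in the first column, and the names item_matrix, baskets and trans_plain are made up for illustration.

```r
library(arules)

# Same values as the question's data, as a plain data.frame (assumption:
# all item columns are character and Transaction_ID is the first column).
df <- data.frame(
  Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"),
  Fruits     = c(NA, "Apple", "Orange", NA, "Pear", "Grape"),
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"),
  Personal   = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA),
  Drink      = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"),
  Other      = c(NA, NA, NA, NA, "Promo", NA),
  stringsAsFactors = FALSE
)

# Character matrix holding only the item columns.
item_matrix <- as.matrix(df[, -1])

# One character vector per row, keeping only the non-NA entries.
baskets <- lapply(seq_len(nrow(item_matrix)), function(i) {
  row <- item_matrix[i, ]
  unname(row[!is.na(row)])
})

# The list names are used as transaction IDs by the coercion.
names(baskets) <- df$Transaction_ID

trans_plain <- as(baskets, "transactions")
inspect(trans_plain)
```

The trade-off is that the column name is no longer part of the item label, so identical values in two different columns would collapse into one item.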

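Comments:

Hi Michael, thanks for sharing!! It works for my case, but I have one question about the (### remove transaction IDs) part. After I removed the transaction IDs and set them back on trans, I found that my transactionID values were missing. Maybe the temporary variable got lost? tid

@yc.koong I added the two library statements so that the example is self-contained and loads arules and dplyr (needed because your data is a tbl_df). It now runs for me in a fresh session without problems.

Answer 2: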

I can suggest the following solution, although I am not sure it is what you are looking for.

Input (df)

df <- data.frame(structure(list(Transaction_ID = as.factor(c("A001", "A002", "A003", "A004", "A005", "A006")), 
               Fruits = as.factor(c(NA, "Apple", "Orange", NA, "Pear", "Grape")), 
               Vegetables = as.factor(c(NA, NA, NA, "Potato", NA, "Yam")), 
               Personal = as.factor(c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA)), 
               Drink = as.factor(c("Coff", NA, "Coff", "Milk", "Milk", "Coff")), 
               Other = as.factor(c(NA, NA, NA, NA, "Promo", NA))), 
          .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
          class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L)))

Class of each column. Note that every class is "factor"

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
      "factor"       "factor"       "factor"       "factor"       "factor"       "factor"

Converting the data frame into transaction data

data <- as(df, "transactions")
inspect(data)

The result I get

     items                 transactionID
[1] Transaction_ID=A001,              
     Personal=ToothP,                  
     Drink=Coff                      1
[2] Transaction_ID=A002,              
     Fruits=Apple,                     
     Personal=ToothP                 2
[3] Transaction_ID=A003,              
     Fruits=Orange,                    
     Drink=Coff                      3
[4] Transaction_ID=A004,              
     Vegetables=Potato,                
     Personal=ToothB,                  
     Drink=Milk                      4
[5] Transaction_ID=A005,              
     Fruits=Pear,                      
     Personal=ToothB,                  
     Drink=Milk,                       
     Other=Promo                     5
[6] Transaction_ID=A006,              
     Fruits=Grape,                     
     Vegetables=Yam,                   
     Drink=Coff                      6
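
In the result above, every transaction also carries a Transaction_ID=... item, which is usually not wanted for basket analysis. One hedged way to clean this up after the coercion is to store the IDs as transaction metadata and then drop the item columns whose labels start with Transaction_ID=. The sketch below uses a shortened, three-row version of the data; the names trans_all, is_id_item and trans_clean are illustrative only.

```r
library(arules)

# Shortened version of the data; every column is a factor, as in this answer.
df <- data.frame(
  Transaction_ID = factor(c("A001", "A002", "A003")),
  Fruits = factor(c(NA, "Apple", "Orange")),
  Drink  = factor(c("Coff", NA, "Coff"))
)

# NA values produce no item; every other value becomes "Column=Value".
trans_all <- as(df, "transactions")

# Keep the IDs as transaction metadata instead of as items.
transactionInfo(trans_all)[["transactionID"]] <- as.character(df$Transaction_ID)

# Drop the items that were generated from the Transaction_ID column.
is_id_item <- grepl("^Transaction_ID=", itemLabels(trans_all))
trans_clean <- trans_all[, which(!is_id_item)]

itemLabels(trans_clean)
inspect(trans_clean)
```

Checking itemLabels() is also a quick way to confirm that no NA item was created.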

I found a partial solution here: convert data frame in r to transaction or an itemMatrix. It also seems that your command

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

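works only for a data.frame with exactly two columns (a transaction ID and an item), because split() expects the values to split as its first argument and the grouping as its second; further positional arguments are not treated as additional item columns.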
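
To make the split() approach usable with several item columns, the wide table can first be reshaped into that two-column (ID, item) form. The following is a minimal sketch using base R only for the reshape, on a shortened version of the data; the names wide, item_cols, long and trans_long are assumptions made for this example.

```r
library(arules)

# Shortened wide table: one row per transaction, one column per category.
wide <- data.frame(
  Transaction_ID = c("A001", "A002", "A003"),
  Fruits = c(NA, "Apple", "Orange"),
  Drink  = c("Coff", NA, "Coff"),
  stringsAsFactors = FALSE
)

item_cols <- setdiff(names(wide), "Transaction_ID")

# Stack the item columns on top of each other (column by column) and
# repeat the IDs once per stacked column so the rows stay aligned.
long <- data.frame(
  Transaction_ID = rep(wide$Transaction_ID, times = length(item_cols)),
  Item = unlist(wide[item_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Remove the NA rows so that NA never becomes an item.
long <- long[!is.na(long$Item), ]

# Now split() has exactly one value vector and one grouping vector.
trans_long <- as(split(long$Item, long$Transaction_ID), "transactions")
inspect(trans_long)
```

Note that a transaction whose item columns are all NA would disappear entirely with this approach, because no row for it remains in the long table.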
