Handling big data with ff

Posted 2023-04-18

【Title】: Handling Big data with ff 【Posted】: 2017-11-26 06:52:45 【Question】:

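I am working with a 16 GB dataset. That is too large to load into RAM, so I need some kind of big-data approach in R. My dataset consists of many variables, most of which are character variables such as names and addresses. I want to do data cleaning/editing, such as creating new variables from existing ones and geocoding addresses. I have tried the ff package, but I cannot get it to work. First, I cannot load my dataset into an ffdf properly. Second, when I more or less manage to, I cannot clean the data the way I used to on regular data frames.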

An example of the problem on a sample dataset:

#create example dataset similar to mine with strings 
df2 <- read.table(text='npi dier  getal  mubilair
             51  "aap"  een  tafel
             52  vis  twee stoel
             53 paard  twee  zetel
             54  kip  drie  fouton
             55  beer vier   fouton
             56  aap  vijf   bureau
             57  tijger  zes bank
             58  zebra  zeven  sofa
             59  olifant  acht  wastafel
             60  mens acht  spiegel', header=T, sep='')
dfstring <- df2[,-1]
rownames(dfstring) <- df2[,1]
write.csv(dfstring, "~/UC Berkeley/Research/dfstring.csv")

library(ff)

# creating the ff file
headset = read.csv(file="~/UC Berkeley/Research/dfstring.csv", header = TRUE, nrows = 5000)
headclasses = sapply(headset, class)
str(headclasses)
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
#doesn't work error:scan() expected 'an integer', got '"51"'

headclasses [c(1)] = "factor"
dfstring.ff <- read.csv.ffdf(file="~/UC Berkeley/Research/dfstring.csv", first.rows=5000, colClasses=headclasses)
dfstring.ff
#set all variables to factor

dfstring.ff$getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
#doesn't work error: assigned value must be ff

getalmubilair <- paste(dfstring.ff$getal, dfstring.ff$mubilair, sep = ' ')
getalmubilair
#doesn't work creates an empty object

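My questions:

First, is the ff package the right tool in my case, where the big data contains many character variables?

If so, how do I load my file into a proper ff file? (For example, how should I handle first.rows or colClasses?)

Which operations can be performed on ff files, and how do they differ from the ones you use on regular data frames?

Where can I find an accessible manual or walkthrough for the ff package? I have looked at a few, but they are very technical and I could not get through them.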

As a side note, I tried to drop unnecessary variables through the colClasses argument like this:

#Delete the unnecessary variables:
headclasses[c(1,2)]= "NULL"

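However, I get the following error:

Error in repnam(colClasses, colnames(x), default = NA): the following argument names do not match

If I could drop the unnecessary variables from the real dataset right at load time, things would probably run faster. How can I do that?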
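For comparison, base R's read.csv (not ff) accepts a named colClasses vector in which "NULL" skips a column while the file is being read. A minimal sketch, assuming a small temporary CSV as a stand-in for the real file; whether read.csv.ffdf treats the same named vector identically is not confirmed here:

```r
# Write a small CSV to a temporary file (a stand-in for the real data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("npi,dier,getal,mubilair",
             "51,aap,een,tafel",
             "52,vis,twee,stoel",
             "53,paard,twee,zetel"), csv_path)

# Named colClasses: "NULL" skips a column, columns not named are auto-detected.
drop_cols <- c(npi = "NULL", dier = "NULL")
kept <- read.csv(csv_path, colClasses = drop_cols, stringsAsFactors = FALSE)

print(names(kept))
```

Naming the entries avoids having to know each column's position, which matters when a row-names column shifts the indices.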

【Comments】:

Have a look here and try to give a reproducible example.

@Christoph I tried to give a reproducible example in my last edit.

This might help you: ***.com/questions/1727772/…

【Answer 1】:

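Since your file is "huge", I suggest storing it in a database (for example SQLite) and then processing it with the RSQLite package. Another option would be to use RHadoop directly on files stored in HDFS.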
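The database route could look roughly like the sketch below. It assumes the DBI and RSQLite packages are installed; the table name records and the two small data frames are made up for illustration and stand in for pieces of the real file loaded one after another.

```r
library(DBI)

# A file-backed SQLite database, so the data lives on disk rather than in RAM.
db <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".sqlite"))

# Two small pieces standing in for successive parts of the large file.
part1 <- data.frame(npi = 51:52, getal = c("een", "twee"),
                    mubilair = c("tafel", "stoel"), stringsAsFactors = FALSE)
part2 <- data.frame(npi = 53:54, getal = c("twee", "drie"),
                    mubilair = c("zetel", "fouton"), stringsAsFactors = FALSE)

# append = TRUE adds rows to the table, creating it on the first call.
dbWriteTable(db, "records", part1, append = TRUE)
dbWriteTable(db, "records", part2, append = TRUE)

# Build the new variable inside the database; only the result comes back to R.
res <- dbGetQuery(db, "
  SELECT npi, getal || ' ' || mubilair AS getalmubilair
  FROM records
  WHERE getal = 'twee'
  ORDER BY npi")
print(res)

dbDisconnect(db)
```

SQLite stores text natively, so character columns such as names and addresses need no conversion to factors.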

You can also use read.table to read a large file in a loop, holding only a small chunk in memory at a time. You could try the code snippet below.

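chunkSize <- 1000000
testFile <- "testFile.csv"
con <- file(description = testFile, open = "r")

# column headers, read separately from the first line of the file
headers <- strsplit(readLines(testFile, n = 1), split = ",")[[1]]

# first chunk: header = TRUE consumes the header line on the connection
df <- read.table(con, nrows = chunkSize, header = TRUE, fill = TRUE, sep = ",", col.names = headers)

repeat {
  if (nrow(df) == 0)
    break
  print(head(df))

  ####
  # add code to process chunk data
  ####

  # a short chunk means the end of the file was reached
  if (nrow(df) != chunkSize)
    break

  # read next chunk; the open connection remembers its position
  df <- tryCatch({
    read.table(con, nrows = chunkSize, skip = 0, header = FALSE, fill = TRUE, sep = ",", col.names = headers)
  }, error = function(e) {
    # treat "end of input" as an empty chunk, re-raise anything else
    if (identical(conditionMessage(e), "no lines available in input"))
      data.frame()
    else stop(e)
  })
}

close(con)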
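The same loop can be wrapped in a function so that the per-chunk work is passed in as a callback. This is only a sketch: read_in_chunks and its arguments are hypothetical names, and the example cleans a tiny temporary file by adding a pasted variable and appending each chunk to an output CSV.

```r
# Hypothetical helper: apply `process` to each chunk and collect its results.
read_in_chunks <- function(path, chunk_size, process) {
  con <- file(path, open = "r")
  on.exit(close(con))
  results <- list()
  chunk <- read.csv(con, nrows = chunk_size, stringsAsFactors = FALSE)
  col_names <- names(chunk)
  while (nrow(chunk) > 0) {
    results[[length(results) + 1]] <- process(chunk)
    chunk <- tryCatch(
      read.csv(con, nrows = chunk_size, header = FALSE,
               col.names = col_names, stringsAsFactors = FALSE),
      error = function(e) {
        # Treat "end of input" as an empty chunk, re-raise anything else.
        if (identical(conditionMessage(e), "no lines available in input")) {
          data.frame()
        } else stop(e)
      }
    )
  }
  results
}

# Tiny stand-in input file.
in_path <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(data.frame(getal = c("een", "twee", "twee", "drie", "vier"),
                     mubilair = c("tafel", "stoel", "zetel", "fouton", "fouton")),
          in_path, row.names = FALSE)

rows_done <- read_in_chunks(in_path, chunk_size = 2, process = function(chunk) {
  chunk$getalmubilair <- paste(chunk$getal, chunk$mubilair)
  first <- !file.exists(out_path)  # write the header only once
  write.table(chunk, out_path, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  nrow(chunk)
})

cleaned <- read.csv(out_path, stringsAsFactors = FALSE)
print(cleaned$getalmubilair)
```

Because each chunk is an ordinary data frame, the usual cleaning code (paste, sub, merge and so on) works unchanged inside the callback.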

If you want to learn about the ff package, you can refer to this presentation, which is available on the official website.
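As a rough illustration of how ff differs from a regular data frame, the sketch below assumes the ff package is installed and uses a tiny temporary file. ff has no character type, so text columns are stored as factors, and a value assigned to an ffdf column must itself be an ff object. Pulling whole columns into RAM with [] is only reasonable for small data; for a 16 GB file the same step would have to be done in pieces.

```r
library(ff)

# Tiny stand-in file with two text columns.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(getal = c("een", "twee", "twee"),
                     mubilair = c("tafel", "stoel", "zetel")),
          csv_path, row.names = FALSE)

# Text columns are read as factors because ff cannot store character vectors.
dfstring.ff <- read.csv.ffdf(file = csv_path, header = TRUE,
                             colClasses = c("factor", "factor"))

# x[] copies an ff column into RAM as an ordinary factor, so paste() works on it.
combined <- paste(dfstring.ff$getal[], dfstring.ff$mubilair[], sep = " ")

# Wrap the result in ff() before assigning it back to the ffdf.
dfstring.ff$getalmubilair <- ff(factor(combined))
print(dfstring.ff$getalmubilair[])
```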
