Avoid hang-ups while processing big csv files in R

Posted 2023-03-12

【Title】Avoid hang-ups while processing big csv files in R 【Posted】2018-01-22 06:02:26 【Question】:

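My task is to load a big csv file (9 GB), extract some specific rows, and save those rows in a new csv file. I do all of this inside a function: in my console I load the function with source() and then run it with myfun().

Once more than 6 GB of the csv file has been processed, my computer hangs.

Workarounds I tried without success: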

How can I remove all objects but one from the workspace in R?

Because I work inside a function, my variables are not in my workspace, so I cannot delete them from there…
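As a side note on the point above: objects created inside a function live in that function's own environment, and rm() called inside the function removes them from there. The following is only a minimal sketch under that assumption; `myfun_sketch` and `big` are hypothetical names, and `big` merely stands in for a large chunk such as DT.

```r
myfun_sketch <- function() {
  big <- numeric(1e6)   # hypothetical stand-in for a large chunk such as DT
  total <- sum(big)     # keep only the small result that is still needed
  rm(big)               # removes `big` from this function's own environment
  gc()                  # the memory behind `big` can now be collected
  # report the kept result and whether `big` still exists locally
  list(total = total, big_exists = exists("big", inherits = FALSE))
}

result <- myfun_sketch()
```

Variables that are still needed (such as start, quantity and l in the question) are simply left out of the rm() call.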

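The gc() command

- There are various posts on *** about this topic.
- I recently started using gc() to free my memory, and it works.
- But I still need three of my variables (start, quantity and l), which means that not all variables can be deleted.
- The computer hangs in the fifth or sixth iteration of the for loop.
- Note: without the gc() command I only reach the second or third iteration, so gc() does have an effect.

Additional notes on my csv file:

- It has 6 columns.
- I need to extract every fortieth or every hundredth row.
- I have to detect the distance between the rows.
- I am not sure whether the distance between the rows stays constant throughout the csv file.

My computer is a 64-bit Win7 machine with 16 GB of RAM.

Now my question: is there a way to avoid the hang-ups? Perhaps a better position for gc() in my code, or some other arguments to gc()?

If you need more information, please leave a comment and I will edit my post.

Many thanks!

And now my code: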

    library(data.table)  # provides fread()

    myfun = function() {
      cols = c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6")
      start = 0  # rows to skip before the first chunk (assumed value; the original assigned an undefined `i`)
      quantity = 2.2 * 10^7  # this is the number of rows and this amount is about 1.2 gb of the csv file
      for (l in 1:12) {  # the 12 is guessed… perhaps here exists also a better solution

        # read the next chunk of `quantity` rows, dropping the 7th column
        DT = fread("C:\\user1\\AllRows.csv", sep = ";", stringsAsFactors = FALSE, drop = 7, header = FALSE, nrows = quantity, skip = start, data.table = FALSE)
        colnames(DT) = cols

        # Detect the distance of rows and extract the corresponding rows
        # and save it in data.df

        # and now data.df will be saved
        file = file.path("C:\\user1\\ExtractedRows.csv")
        # write the header only for the first chunk, then append without it
        if (l == 1) write.table(data.df, file = file, sep = ";", dec = ",", row.names = FALSE, col.names = cols, append = FALSE)
        if (l != 1) write.table(data.df, file = file, sep = ";", dec = ",", row.names = FALSE, col.names = FALSE, append = TRUE)

        # release the internal memory
        gc(reset = TRUE)

        # incrementing start
        start = start + quantity

      }  # end of for loop

    }  # end of function

【Answer 1】:

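When dealing with data that is too large to fit in memory, you can use SQLite to keep the data on disk and then query out only what you need.

Here is a reference to get you started: https://www.r-bloggers.com/r-and-sqlite-part-1/

This approach is probably a better fit for your problem than the one outlined in your question.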
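To make the idea a bit more concrete, here is a minimal sketch of that approach, assuming the DBI and RSQLite packages are installed. The small temporary csv file, the table name `all_rows`, the two column names, and the chunk size are all made up for illustration. The fixed step of 40 is also only an assumption, since the question says the row distance still has to be detected.

```r
library(DBI)      # assumption: DBI and RSQLite are installed
library(RSQLite)

# Hypothetical small stand-in for the large csv file (1000 rows, ";" separated).
csv_path <- tempfile(fileext = ".csv")
write.table(data.frame(a = 1:1000, b = (1:1000) / 2), csv_path,
            sep = ";", row.names = FALSE, col.names = FALSE)

# On-disk SQLite database, so the full data never has to sit in memory.
con <- dbConnect(SQLite(), tempfile(fileext = ".sqlite"))

# Read the csv in chunks through one open connection and append each chunk.
input <- file(csv_path, open = "r")
chunk_size <- 250  # illustrative value
repeat {
  lines <- readLines(input, n = chunk_size)
  if (length(lines) == 0) break  # end of file reached
  chunk <- read.table(text = lines, sep = ";", header = FALSE,
                      col.names = c("col_1", "col_2"))
  dbWriteTable(con, "all_rows", chunk, append = TRUE)
}
close(input)

# Query only the rows that are needed, here every 40th row by SQLite's rowid.
picked <- dbGetQuery(con, "SELECT * FROM all_rows WHERE rowid % 40 = 0")
dbDisconnect(con)
```

Only one chunk is held in memory at a time, and the final query returns just the selected rows, which can then be written out with write.table() as in the question.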
