R如何在将csv文件写入磁盘之前估计它的大小

Posted

技术标签:

【中文标题】R如何在将csv文件写入磁盘之前估计它的大小【英文标题】:R How to estimate csv file size prior to writing it to disk 【发布时间】:2016-06-25 08:53:56 【问题描述】:

在通过write.csvreadr::write_csv 将 csv 文件实际写入磁盘之前,R 中是否有任何方法可以估计 csv 文件的文件大小?如果用户不小心尝试在函数中将大文件写入磁盘,我想发出警告。

数据帧的内存占用 (object.size) 与磁盘大小之间似乎存在某种关系,后者要大得多。但是,内存中的对象越大,差异越小。此外,数据框的结构可能存在差异。

我不想强迫人们下载大量数据,所以请原谅缺乏可重复的例子。

【问题讨论】:

对于警告,您可以简单地比较对象在R内部的内存占用情况。如果大于阈值,您可以发出警告。 您可能会发现this article 相关:关于对象大小与各种文件格式的文件大小。对于基准测试中使用的对象大小,与 csv 文件的差异“相当大”(343.3 MB vs 2885 MB) 【参考方案1】:

您可以使用以下三个函数来计算将分别由write.table()write.csv()write.csv2() 写入的文件的确切大小,而不必一次生成整个字节流,或者在内存或磁盘上:

size.write.table <- function(x,...) 
    x <- as.data.frame(x); ## write.table() coerces to data.frame
    args <- list(...);
    defaults <- formals(write.table);
    ## get write specs as locals
    for (name in names(defaults)[-1])
        assign(name,if (is.null(args[[name]])) defaults[[name]] else args[[name]]);
    ## normalize quote arg to logical, quoteIndexes as columns to quote
    ## note: regardless of qmethod, does not touch characters other than double-quote, and only adds one byte per embedded double-quote for either qmethod
    quoteIndexesGiven <- F; ## assumption
    if (is.logical(quote) && quote) 
        quoteIndexes <- seq_along(x);
     else if (is.numeric(quote)) 
        quoteIndexes <- quote;
        quote <- T;
        quoteIndexesGiven <- T;
    ; ## end if
    ## normalize col.names arg to logical T/F, colNames as actual column names
    emptyColNameForRowNames <- F; ## assumption
    if (is.logical(col.names)) 
        if (is.na(col.names)) 
            emptyColNameForRowNames <- T;
            col.names <- T;
        ; ## end if
        if (col.names) colNames <- names(x);
     else 
        colNames <- as.character(col.names);
        col.names <- T;
    ; ## end if
    ## normalize row.names arg to logical, rowNames as actual row names
    if (is.logical(row.names)) 
        if (row.names) rowNames <- rownames(x);
     else 
        rowNames <- as.character(row.names);
        row.names <- T;
    ; ## end if (else must be F)
    ## start building up file size
    size <- 0L;
    ## 1: column header
    if (col.names) 
        ## special case for zero columns: write.table() behaves as if there's one empty-string column name, weirdly
        if (ncol(x)==0L) 
            if (quote) size <- size + 2L;
         else 
            if (emptyColNameForRowNames) 
                if (quote) size <- size + 2L; ## two double-quotes
                size <- size + nchar(sep,'bytes'); ## separator
            ; ## end if
            size <- size + sum(nchar(colNames,'bytes')); ## names (note: NA works with this; nchar() returns 2)
            if (quote) size <- size + ncol(x)*2L + sum(do.call(c,gregexpr(perl=T,'"',colNames[quoteIndexes]))>0L); ## quotes and escapes
            size <- size + nchar(sep,'bytes')*(ncol(x)-1L); ## separators
        ; ## end if
        size <- size + nchar(eol,'bytes'); ## eol; applies to both zero-columns special case and otherwise
    ; ## end if
    ## 2: row names
    if (row.names) 
        ## note: missing values are not allowed in row names
        size <- size + sum(nchar(rowNames,'bytes')); ## names
        size <- size + nchar(sep,'bytes')*nrow(x); ## separator (note: always present after row names, even for zero-column data.frame)
        if (quote) size <- size + nrow(x)*2L + sum(do.call(c,gregexpr(perl=T,'"',rowNames))>0L); ## quotes and escapes (can ignore quoteIndexes, since row names are always quoted if any column is quoted)
    ; ## end if
    ## 3: column content
    for (ci in seq_along(x)) 
        ## calc depends on class
        cl <- class(x[[ci]]);
        ## normalize date/time classes
        if (identical(cl,c('POSIXct','POSIXt')) || identical(cl,c('POSIXlt','POSIXt')))
            cl <- 'POSIXt';
        ## branch on normalized class
        ## note: can't write list type to file, so don't bother supporting list columns
        if (length(cl)==1L && cl=='raw') 
            size <- size + nrow(x)*2L;
            ## note: cannot have raw NAs
         else  ## remaining types can have NAs
            size <- size + sum(is.na(x[[ci]]))*nchar(na,'bytes'); ## NAs
            if (length(cl)==1L && cl=='logical') 
                size <- size + sum((5:4)[na.omit(x[[ci]])+1L]); ## non-NAs
             else if (length(cl)==1L && cl%in%c('integer','numeric','complex','ts')) 
                size <- size + sum(nchar(as.character(na.omit(x[[ci]])),'bytes')); ## non-NAs
             else if (length(cl)==1L && cl%in%c('character','factor')) 
                size <- size + sum(nchar(as.character(na.omit(x[[ci]])),'bytes')); ## non-NAs, values -- as.character() required for factors to work
                if (quote && ci%in%quoteIndexes) size <- size + sum(!is.na(x[[ci]]))*2L + sum(do.call(c,gregexpr(perl=T,'"',na.omit(x[[ci]])))>0L); ## quotes and escapes
             else if (length(cl)==1L && cl=='POSIXt') 
                size <- size + sum(nchar(as.character(na.omit(x[[ci]])),'bytes')); ## non-NAs
                ## special case for POSIXt: only quoted if explicitly specified by index in quote arg
                if (quoteIndexesGiven && ci%in%quoteIndexes) size <- size + sum(!is.na(x[[ci]]))*2L; ## quotes (can't be any escapes)
             else 
                stop(sprintf('unsupported class(es) %s.',paste(collapse=',',cl)));
            ; ## end if
        ; ## end if
    ; ## end for
    ## 4: separators between columns
    size <- size + nchar(sep,'bytes')*(ncol(x)-1L)*nrow(x);
    ## 5: eols
    size <- size + nchar(eol,'bytes')*nrow(x);
    size;
; ## end size.write.table()
## note: documentation should say "col.names to NA if row.names = TRUE (the default) or given as a character vector" for csv functions
size.write.csv <- function(x,...) do.call(size.write.table,c(list(x),sep=',',dec='.',qmethod='double',col.names= row.names <- list(...)$row.names; if (!identical(F,row.names)) NA else T; ,list(...)));
size.write.csv2 <- function(x,...) do.call(size.write.table,c(list(x),sep=';',dec=',',qmethod='double',col.names= row.names <- list(...)$row.names; if (!identical(F,row.names)) NA else T; ,list(...)));

以下是一些证明正确性的测试:

size.write.func.test.impl <- function(funcName,x,...,tf='/tmp/size.write.func.test.impl.txt') 
    writeFunc <- match.fun(funcName);
    sizeFunc <- match.fun(paste0('size.',funcName));
    writeFunc(x,tf,...);
    expected <- file.info(tf)$size;
    actual <- sizeFunc(x,tf,...);
    cat(sprintf('%s: %d %s %d\n',if (expected==actual) 'SUCCESS' else 'FAILURE',actual,if (expected==actual) '==' else '!=',expected));
; ## end size.write.func.test.impl()
size.write.table.test <- function(...) size.write.func.test.impl('write.table',...);
size.write.csv.test <- function(...) size.write.func.test.impl('write.csv',...);
size.write.csv2.test <- function(...) size.write.func.test.impl('write.csv2',...);
size.all.test <- function(...) 
    size.write.table.test(...);
    size.write.csv.test(...);
    size.write.csv2.test(...);
; ## end size.all.test()

size.all.test(data.frame(),quote=F);
## SUCCESS: 1 == 1
## SUCCESS: 1 == 1
## SUCCESS: 1 == 1
size.all.test(data.frame());
## SUCCESS: 3 == 3
## SUCCESS: 3 == 3
## SUCCESS: 3 == 3
size.all.test(data.frame(a=1:3));
## SUCCESS: 22 == 22
## SUCCESS: 25 == 25
## SUCCESS: 25 == 25
set.seed(1L);
df <- data.frame(raw=as.raw(0:255),logical=rep(c(F,T),len=256L),integer=0:255,double1=runif(256L),double2=runif(256L,-.Machine$double.xmax*0.5,.Machine$double.xmax*0.5),character=paste(sapply(0:255,intToUtf8),sample(c('','x','x"x','"x""x"'),256L,replace=T)),factor=factor(rep(letters,len=256L)),dtΩ=as.POSIXct('1970-01-01 00:00:00',tz='England/London'),stringsAsFactors=F);
for (ci in seq(2,ncol(df))) df[[ci]][sample(256L,10L)] <- NA;
head(df);
##   raw logical integer   double1        double2 character factor        dtΩ
## 1  00   FALSE       0 0.2655087 -4.535097e+307                a 1970-01-01
## 2  01    TRUE       1 0.3721239 -2.670418e+305  \001 x"x      b 1970-01-01
## 3  02   FALSE       2 0.5728534 -2.285466e+307  \002 x"x      c 1970-01-01
## 4  03    TRUE       3 0.9082078  7.814417e+307     \003       d 1970-01-01
## 5  04      NA       4 0.2016819  4.311961e+306  \004 x"x      e 1970-01-01
## 6  05    TRUE       5 0.8983897 -3.287178e+307  \005 x"x      f 1970-01-01
size.all.test(df);
## SUCCESS: 20634 == 20634
## SUCCESS: 20637 == 20637
## SUCCESS: 20637 == 20637
size.all.test(df,eol='zzz');
## SUCCESS: 21148 == 21148
## SUCCESS: 21151 == 21151
## SUCCESS: 21151 == 21151
size.all.test(df,sep='///'); ## csv incarnations take ownership of their overridden arguments
## SUCCESS: 24744 == 24744
## SUCCESS: 20637 == 20637
## SUCCESS: 20637 == 20637
## Warning messages:
## 1: In writeFunc(x, tf, ...) : attempt to set 'sep' ignored
## 2: In writeFunc(x, tf, ...) : attempt to set 'sep' ignored
size.all.test(df,quote=F);
## SUCCESS: 18807 == 18807
## SUCCESS: 18808 == 18808
## SUCCESS: 18808 == 18808
size.all.test(df,quote=seq(2L,ncol(df),by=2L));
## SUCCESS: 20634 == 20634
## SUCCESS: 20637 == 20637
## SUCCESS: 20637 == 20637
size.all.test(df,row.names=F);
## SUCCESS: 19206 == 19206
## SUCCESS: 19206 == 19206
## SUCCESS: 19206 == 19206
size.all.test(df,row.names=seq(1234,len=nrow(df)));
## SUCCESS: 20998 == 20998
## SUCCESS: 21001 == 21001
## SUCCESS: 21001 == 21001
size.all.test(df,na='blah');
## SUCCESS: 20774 == 20774
## SUCCESS: 20777 == 20777
## SUCCESS: 20777 == 20777
size.all.test(iris);
## SUCCESS: 4818 == 4818
## SUCCESS: 4821 == 4821
## SUCCESS: 4821 == 4821
size.all.test(USAccDeaths);
## SUCCESS: 724 == 724
## SUCCESS: 727 == 727
## SUCCESS: 727 == 727
size.all.test(USArrests);
## SUCCESS: 1384 == 1384
## SUCCESS: 1387 == 1387
## SUCCESS: 1387 == 1387
size.all.test(USArrests,eol='\r\n'); ## you're probably on Windows
## SUCCESS: 1435 == 1435
## SUCCESS: 1438 == 1438
## SUCCESS: 1438 == 1438

【讨论】:

【参考方案2】:

受@lukeA 回答的启发,我想出了以下估计文件大小的良好结果:

csv_write <- function(df = idata, filename)
  if(interactive)
    if(dim(df)[1] > 10^4)
      divisor <- 1000
      if(dim(df)[1] > 10^5)
        divisor <- 10000
      

      df_sample <- sample_n(df, size = dim(df)[1] %/% divisor)

      test <- paste(capture.output(write.csv(df_sample)), collapse="\n")

      cat("Writing", capture.output(print(object.size(test)*divisor,
                                          units="auto")), "to disk. \n")
    
  

  write_csv(df, path = filename)

剩下的一个问题是我需要使用write.csv 而不是readr::write_csv 来计算近似输出大小,因为readr::write_csv 需要path 参数。不需要路径的替代 readr::format_csv 大大低估了文件大小(例如,估计为 2.7 Gb,而磁盘上为 3.3 Gb)。

【讨论】:

考虑readr 的好主意 - 试试readr::format_csv(USArrests),它不需要路径参数。 是的,但是就像我写的那样,使用read::format_csv()gets 估计的文件大小有点错误@lukaA【参考方案3】:

这里有一个想法

to <- paste(capture.output(write.csv(USArrests)), collapse="\n")
write.csv(USArrests, tf <- tempfile(fileext = ".csv"))
file.info(tf)$size
# [1] 1438
print(object.size(to), units="b")
# 1480 bytes

【讨论】:

不确定我是否正确,但这确实写入磁盘,或者? 他正在写入 (1) 内存和 (2) 文件并显示它们的大小相同。我认为出于您估算的目的,您可以将前 20 条记录写入内存(或临时文件),然后按与总行数的比率估算为 20。 就像@A.Webb 说的:我(1)将一个虚拟csv文件写入内存并获取它的大小。然后 (2) 我将另一个 csv 写入磁盘以比较大小相似 - 至少对于 USArrests 数据集。【参考方案4】:

尝试使用object.size 方法:

 object.size(m) # where m is your data object

【讨论】:

以上是关于R如何在将csv文件写入磁盘之前估计它的大小的主要内容,如果未能解决你的问题,请参考以下文章

在R语言 中如何把list对象逐行写入csv文件中

CHCSVWriter用于写入更大的CSV文件的内存使用情况

如何在 R 中下载之前确定在线文件大小?

使用 R 在一个 .csv 文件中写入不同的数据帧

使用 R 在一个 .csv 文件中写入不同的数据帧

如何使用 R 文件中的常量值写入 csv 文件