R: Read in random rows from file using fread or equivalent?

Posted 2023-04-18

【Title】R: Read in random rows from file using fread or equivalent? 【Posted】2017-10-17 21:37:23 【Question】:

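I have a very large multi-gigabyte file that is too costly to load into memory. However, the rows in the file are not in random order. Is there a way to read in a random subset of the rows using something like fread?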

For example, something like this?

data <- fread("data_file", nrows_sample = 90000)

This GitHub post suggests that one possibility is to do something like this:

fread("shuf -n 5 data_file")

However, this does not work for me. Any ideas?
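For reference, the shell route can be sketched as follows. In recent data.table releases the documented way to hand fread a shell command is the cmd argument. The sketch assumes GNU shuf and tail are on the PATH (they may be missing on macOS or Windows), and it builds a small hypothetical demo file so that it runs on its own. Whether this addresses the failure described above is not known.

```r
library(data.table)

# Hypothetical demo file so the sketch is self-contained
demo_file <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:1000, value = rnorm(1000)), demo_file)

# Read the column names separately: shuf treats the header as just another line
header <- names(fread(demo_file, nrows = 0))

# Drop the header with tail, then let shuf pick 50 lines at random
cmd <- sprintf("tail -n +2 %s | shuf -n 50", shQuote(demo_file))
sampled <- fread(cmd = cmd, header = FALSE, col.names = header)
```

Note that shuf still has to read the whole file once, but only the sampled lines are passed on to R.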

【Answer 1】:

Using the tidyverse (rather than data.table), you could do:

library(readr)
library(purrr)
library(dplyr)

# Approximate number of data rows in the file.
# Placeholder value: replace it with your own ballpark figure.
n_rows_in_your_file <- 1e6

# Read the column names once; skipping lines below also skips the header
col_names <- names(read_csv("data_file", n_max = 0))

# Generate 900 random start positions. 10 rows are read at each start,
# giving 9000 rows in total (windows may overlap, so duplicates are possible)
start_at <- floor(runif(900, min = 1, max = n_rows_in_your_file - 10))

# Sort the start positions in ascending order
start_at <- sort(start_at)

# Read 10 rows at a time, starting at each random position,
# binding the results row-wise into a single data frame
sample_of_rows <- map_dfr(start_at, ~ read_csv("data_file", col_names = col_names, n_max = 10, skip = .x))
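The approach above reopens the file once per start position. A different technique, shown here only as a sketch, is to stream the file once with readr's read_csv_chunked and keep each row independently with a fixed probability (Bernoulli sampling). The demo file, keep_prob, and keep_random_rows are hypothetical names introduced for illustration, and the number of rows returned is random rather than exact.

```r
library(readr)

set.seed(42)

# Hypothetical demo file so the sketch runs on its own
demo_file <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:10000, value = rnorm(10000)), demo_file)

# Expected fraction of rows to keep (an assumption; tune it to your target size)
keep_prob <- 0.09

# Callback applied to each chunk: keep each row independently with keep_prob
keep_random_rows <- function(chunk, pos) {
  chunk[runif(nrow(chunk)) < keep_prob, , drop = FALSE]
}

# Stream the file in chunks of 1000 rows; only the kept rows are accumulated
sample_of_rows <- read_csv_chunked(
  demo_file,
  callback = DataFrameCallback$new(keep_random_rows),
  chunk_size = 1000,
  col_types = cols(id = col_integer(), value = col_double())
)
```

Fixing col_types up front keeps the column types identical across chunks, so the pieces bind together cleanly. Only one chunk is held in memory at a time.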

【Answer 2】:

If your data file happens to be a text file, this solution using the LaF package may be useful:

library(LaF)

# Prepare dummy data
mat <- matrix(sample(letters, 10 * 1000000, replace = TRUE), nrow = 1000000)

dim(mat)
#[1] 1000000      10

# Write without a header line so the file has exactly 1,000,000 lines
write.table(mat, "tmp.csv",
    row.names = FALSE,
    col.names = FALSE,
    sep = ",",
    quote = FALSE)

# Read 90'000 random lines
start <- Sys.time()
random_mat <- sample_lines(filename = "tmp.csv",
    n = 90000,
    nlines = 1000000)
random_mat <- do.call("rbind",strsplit(random_mat,","))
Sys.time() - start
#Time difference of 1.135546 secs    

dim(random_mat)
#[1] 90000    10
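If the number of lines is not known in advance, or LaF is not available, a similar result can be had in base R with reservoir sampling (Algorithm R). This is a swapped-in technique rather than what LaF does internally. The function name sample_lines_reservoir, its parameters, and the demo file are hypothetical. The per-line loop is simple but will be slow on multi-gigabyte files.

```r
# Reservoir sampling: choose n lines uniformly at random from a file
# whose length is unknown, reading it once in chunks
sample_lines_reservoir <- function(path, n, chunk_size = 10000L, skip = 0L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  if (skip > 0L) readLines(con, n = skip)  # e.g. drop a header line

  reservoir <- character(0)
  seen <- 0  # numeric, so very long files do not overflow an integer
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    for (line in lines) {
      seen <- seen + 1
      if (seen <= n) {
        # Fill the reservoir with the first n lines
        reservoir[seen] <- line
      } else {
        # Replace a random slot with probability n / seen
        j <- sample.int(seen, 1L)
        if (j <= n) reservoir[j] <- line
      }
    }
  }
  reservoir
}

# Hypothetical demo file with a header and 5000 data lines
demo_file <- tempfile(fileext = ".csv")
writeLines(c("id,value", paste(1:5000, letters[(1:5000 %% 26) + 1], sep = ",")),
           demo_file)

picked <- sample_lines_reservoir(demo_file, n = 100, skip = 1L)

# Parse the sampled lines with read.csv, which also handles quoted fields
sampled <- read.csv(text = picked, header = FALSE, col.names = c("id", "value"))
```

Parsing the sampled lines with read.csv instead of strsplit means quoted fields containing commas are split correctly.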
