Reading a text file with multiple spaces as the delimiter in R

Posted 2023-02-16


[Title] Reading text file with multiple space as delimiter in R

[Published] 2013-06-03 12:14:44

[Question]:

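I have a large data set with about 94 columns and 3 million rows. The file uses both single and multiple spaces as the delimiter between columns. I need to read some of the columns from this file into R. To do this I tried read.table() with the options shown in the code below: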

    # Define the columns to read: keep the first 5 columns, skip the next 24,
    # keep the following 5, and skip the last 60.
    col_classes = c(rep("character", 2), rep("numeric", 3), rep("NULL", 24), rep("numeric", 5), rep("NULL", 60))

    # Read the first 100 rows of the data
    data <- read.table(file, sep = " ", header = F, nrows = 100, na.strings = "", stringsAsFactors = F)

Because the file has multiple spaces as the delimiter between some columns, the approach above does not work. Is there an efficient way to read this file?

[Comments on the question]:

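Just remove the sep=" " argument. read.table knows how to handle multiple spaces by default.

I have a very similar problem, but I need a more general solution because some of my fields contain single spaces. That means I should be able to set the minimum number of consecutive spaces (2 in my case) that counts as a delimiter, with no upper limit.

Related post: ***.com/questions/30955464/…

@HongOoi: Yes, but only because the default for read.table/read.csv is sep="", which means "multiple whitespace"; we might expect it to be a regular expression such as "\w*" or "\w+" rather than "".

[Answer 1]: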

If you would rather use the tidyverse (or readr) package, you can use read_table instead.

    read_table(file, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
      guess_max = min(n_max, 1000), progress = show_progress(), comment = "")

The description says:

read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) columns of space.
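As a rough illustration of the answer above, here is a minimal sketch of calling read_table on a small file. It assumes the readr package is installed; the sample data, the column names and the compact col_types string "icd" (integer, character, double) are made up for the example and are not from the original post.

```r
# Sketch only: runs the example when readr is available.
if (requireNamespace("readr", quietly = TRUE)) {
  # Hypothetical sample file: columns are separated by a varying number of spaces.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("id   name   value",
               "1    a      2.5",
               "2    bb     3.5",
               "10   ccc    4.0"), tmp)

  # col_types = "icd" asks for integer, character and double columns.
  df <- readr::read_table(tmp, col_types = "icd")
  print(df)
}
```

Passing col_types explicitly avoids type guessing, which can matter on a file with millions of rows.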


[Answer 2]:

If your fields have a fixed width, you should consider using read.fwf(), which may handle missing values better.
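A minimal sketch of the fixed-width approach, using base R only. The field widths (2, 4, 4), the column names and the sample lines are assumptions invented for illustration; a real file needs its own widths.

```r
# Hypothetical fixed-width data: 2-char code, 4-char count, 4-char value.
# The count field is blank on the third line.
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB  12 3.5",
             "CD 345 1.0",
             "EF     2.0"), tmp)

df <- read.fwf(tmp,
               widths = c(2, 4, 4),
               col.names = c("code", "count", "value"),
               strip.white = TRUE,          # trim the padding around each field
               stringsAsFactors = FALSE)
print(df)

# The blank count field is read as a missing value.
is.na(df$count[3])
```

Because fields are cut by position rather than by delimiter, a blank field does not shift the columns that follow it, which is where a whitespace-delimited reader would go wrong.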


[Answer 3]:

You need to change the separator. " " means exactly one space character. "" treats whitespace of any length as the separator.

    data <- read.table(file, sep = "", header = F, nrows = 100,
                       na.strings = "", stringsAsFactors = F)

From the manual:

If sep = "" (the default for read.table) the separator is "white space", that is one or more spaces, tabs, newlines or carriage returns.
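To make the manual's point concrete, here is a small self-contained sketch in base R. The sample data is made up, and the 5-column col_classes vector is a shortened stand-in for the 94-column one in the question; it also shows the vector being passed through colClasses, so that "NULL" entries are skipped while reading.

```r
# Made-up sample: 5 fields per line, separated by 1 to 4 spaces.
tmp <- tempfile(fileext = ".txt")
writeLines(c("a  b 1   2.5  9",
             "c d  3 4.5    8"), tmp)

# sep = "" collapses each run of whitespace into a single separator.
count.fields(tmp, sep = "")   # 5 fields on every line

# "NULL" in colClasses drops a column while reading (here the 4th one).
col_classes <- c("character", "character", "numeric", "NULL", "numeric")
df <- read.table(tmp, sep = "", header = FALSE,
                 colClasses = col_classes, stringsAsFactors = FALSE)
str(df)
```

With sep = " " every extra space would instead start a new, empty field, which is why the call in the question misaligns the columns.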

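Also, for a large data file you may want to consider data.table::fread to quickly read the data straight into a data.table. I was using this function myself this morning. It is still experimental, but I find it works very well.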

[Comments]:

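How does 'fread' handle multiple spaces? It was the first read function I tried, but it failed for me because of the multiple spaces. Is there a workaround?

@user2412678 Did you try fread(..., sep = "")? You could also try fread(..., sep = "\s"), but I do not know whether that will work. Could you try both and report back? If either one works, we can update the answer for fread.

fread(..., sep = "") does not work in fread; it gives the following error: Error in fread(file, sep = "", : 'sep' must be 'auto' or a single character. fread(..., sep = "\s") does not work either; in that case the error is: Error: '\s' is an unrecognized escape in character string starting ""\s". However, fread(..., sep = " ") does run, but it does not solve the problem of multiple spaces as the delimiter; instead it treats the extra spaces as columns.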
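One possible workaround for a reader that insists on a single-character separator, sketched here with base R only, is to squeeze each run of spaces down to one space before parsing. This is an illustration with made-up data, not something proposed in the thread, and it assumes no field contains a space of its own. Reading all lines into memory first may also be costly for a file with 3 million rows.

```r
# Made-up lines with 1 to 3 spaces between fields.
lines <- c("a  b 1   2.5",
           "c d  3 4.5")

# Collapse every run of spaces into a single space (and trim the ends),
# so that exactly one separator sits between neighbouring fields.
squeezed <- gsub(" +", " ", trimws(lines))

# A single-space separator now parses cleanly.
df <- read.table(text = squeezed, sep = " ", header = FALSE,
                 stringsAsFactors = FALSE)
print(df)
```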
