How to properly read fixed-width format files

Posted: 2019-03-26 19:18:46

Question:

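I'm relatively new to R and RStudio, and I'm trying to reformat a text file so I can run some analysis on the data in it. I'm currently trying to use read.fwf to tidy the data, but I seem to be doing something wrong, and it produces various errors. I've tried to resolve them by working through references and guides, but I'm still struggling. Any suggestions? (The code, a sample of the text file, and the desired format are below.)

Current code: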

library(readr)
library(tidyr)
read.fwf("AK_JAN_2017_TMAS_", widths =c(1,2,6,1,1,2,2,2,2,2,3,4,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3), header = false, sep = "", row.names = c("Record Type", "FIPS", "Station ID", "Direction of Travel code", "Lane of Travel", "Year of Data", "Month of Data", "Day of Data", "Hour of Data", "Vehicle Class", "Open", "Total Weight of Vehicle", "Number of Axles", "A-axle Weight", "A-B Axle Spacing", "B-axle Weight", "B-C Axle Spacing", "C-axle Weight", "C-D Axle Spacing", "D-axle Spacing", "D-E Axle Spacing", "E-axle Weight", "E-F Axle Spacing", "F-axle Weight", "F-G Axle Spacing", "G-axle Weight", "G-H Axle Spacing", "H-axle Weight", "H-I Axle Spacing", "I-axle Weight", "I-J Axle Spacing", "J-axle Weight", "J-K Axle Spacing", "K-axle Weight", "K-L Axle Spacing", "L-axle Weight", "L-M Axle Spacing", "M-axle Weight"), col.names = NULL, n = -1, buffersize = 2000, fileencoding = "" )

Sample of the text file:

W02000103311701021610061031206057054056013054096054015053015038
W02000103311701021606055024403039038084005121
W02000103311701021609067028505040038054013065104062012064
W02000103311701021705073004302024043019
W02000103311701021710066045606055070075015088094085018086018067
W02000103311701021710080044706052069075015087096083018085018065
W02000103311701021805076007402034056040
W02000103311701021805076004802025043023
W02000103311701021905077002402010051014
W02000103311701021905072004702026042021
W02000103311701021906044020303068053067015068
W02000103311701022006066014803057045049014042
W02000103311701022006053012903058041038014033
W02000103311701022005060003702020043017
W02000103311701022006063009503046047023014026
W02000103311701022105072006602036060030
W02000103311701022206068017703045050059015073
W02000103311701022305065006902033037036
W02000103311701030005066008802032038056
W02000103311701030305066008202037063045

Desired data format:

Problems:

 library(readr)
 library(tidyr)
 AK_JAN_2017_TMAS_ <- read.table("~/R Studio Sessions/AK_JAN_2017_TMAS_.txt", quote="\"", comment.char="")
   View(AK_JAN_2017_TMAS_)
 left<-c(2,4,10,11,12,14,16,18,20,22,25,29,31,34,37,40,43,46,49,52,55,58,61,64,67,70,73,76,79,82,85,88,91,94,97,100,103)
 right<-c(3,9,10,11,13,15,17,19,21,24,28,30,33,36,39,42,45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96,99,102,105)
 df <- data.frame(matrix(numeric(length(x)*length(left)),ncol=length(left)))
Error in numeric(length(x) * length(left)) : object 'x' not found
 for (i in 1:length(input.set)) {
     stop <- nchar(x[i])
     for (j in 1:length(left)) {
         df[i,j] <- as.numeric(substr(input.set[i], left[j], right[j]))
         if (right[j] ==  stop) break
     }
 }
Error in length(input.set) : object 'input.set' not found
 df <- data.frame(AK_JAN_2017_TMAS_(numeric(length(x)*length(left)),ncol=length(left)))
Error in AK_JAN_2017_TMAS_(numeric(length(x) * length(left)), ncol = length(left)) : 
  could not find function "AK_JAN_2017_TMAS_"
 for (i in 1:length(input.set)) {
     stop <- nchar(x[i])
     for (j in 1:length(left)) {
         df[i,j] <- as.numeric(substr(input.set[i], left[j], right[j]))
         if (right[j] ==  stop) break
     }
 }

 df <- data.frame(matrix(numeric(length(x)*length(left)),ncol=length(left)))
Error in numeric(length(x) * length(left)) : object 'x' not found
 for (i in 1:length('AK_JAN_2017_TMAS_'.set)) {
Error: unexpected symbol in "for (i in 1:length('AK_JAN_2017_TMAS_'.set"
     stop <- nchar(x[i])
Error in nchar(x[i]) : object 'x' not found
     for (j in 1:length(left)) {
         df[i,j] <- as.numeric(substr(input.set[i], left[j], right[j]))
         if (right[j] ==  stop) break
     }
Error in substr(input.set[i], left[j], right[j]) : 
  object 'input.set' not found
 }
Error: unexpected '}' in "}"


And a few changes I attempted to make:

     df <- data.frame(matrix(numeric(length(x)*length(left)),ncol=length(left)))
    for (i in 1:length('AK_JAN_2017_TMAS_'.set)) {
        stop <- nchar(x[i])
        for (j in 1:length(left)) {
            df[i,j] <- as.numeric(substr(input.set[i], left[j], right[j]))
            if (right[j] ==  stop) break
        }
    }

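Comments:

Please provide a reproducible example in R. The link I provided will tell you how to do that. Please take the tour and review How to Ask, then edit the question accordingly. You need to provide a Minimal, Complete, and Verifiable Example and show us some effort. Cheers.

What error messages did you get?

Are you sure you want to use row.names rather than column names, and have you defined false?

Did you mean read_fwf rather than read.fwf? You are loading readr, after all.

I made some changes to the post to reflect (to some extent) the example you linked.

Answer 1:

The varying line lengths may be the problem. You can handle this line by line as follows:

Construct two vectors giving the start and end positions of each value (we skip the "W"). These will be used to extract the substring for each variable.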

left<-c(2,4,10,11,12,14,16,18,20,22,25,29,31,34,37,40,
        43,46,49,52,55,58,61,64,67,70,73,76,79,82,85,88,91,94,97,100,103)

right<-c(3,9,10,11,13,15,17,19,21,24,28,30,33,36,39,42,
         45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96,99,102,105)
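The two position vectors do not have to be typed by hand. As a minimal sketch, assuming the field widths implied by the vectors above (a 1-character record type followed by 2, 6, 1, 1, 2, 2, 2, 2, 2, 3, 4, 2 and then twenty-five 3-character fields), they can be derived with cumsum. The names widths, starts, ends, left_derived and right_derived are illustrative only and do not appear in the original answer.

```r
# Field widths in file order; the first width (1) is the leading "W" record type.
# These widths are inferred from the left/right vectors above (an assumption).
widths <- c(1, 2, 6, 1, 1, 2, 2, 2, 2, 2, 3, 4, 2, rep(3, 25))

ends   <- cumsum(widths)        # last character position of each field
starts <- ends - widths + 1     # first character position of each field

# Drop the first field so that the "W" is skipped, as in the answer.
left_derived  <- starts[-1]
right_derived <- ends[-1]
```

Deriving the positions this way keeps a single source of truth (the widths), so a changed field width only has to be edited in one place.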

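Loop over the input lines, generating numbers from the substrings. To prevent NAs, you only want to process each line up to its last available value, so create a sentinel variable (stop) from the number of characters in each data string.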

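# input.set is a character vector with one record per element,
# e.g. input.set <- readLines("filename.txt")
df <- data.frame(matrix(numeric(length(input.set)*length(left)),ncol=length(left)))
for (i in seq_along(input.set)) {
    stop <- nchar(input.set[i])    # position of the last character on this line
    for (j in seq_along(left)) {
        df[i,j] <- as.numeric(substr(input.set[i], left[j], right[j]))
        if (right[j] == stop) break    # last field on this line; the rest stay 0
    }
}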

You can then add names to the columns.

nvals <- c("FIPS","StaID","Dir","Lane","Year","Month","Day","Hour","Class",
           "Open", "TotW", "Axles",
           "AW","ASp","BW","BSp","CW","CSp","DW","DSp",
           "EW","ESp","FW","FSp","GW","GSp","HW","HSp",
           "IW","ISp","JW","JSp","KW","KSp","LW","LSp","MW")
names(df) <- nvals

Here are a few rows of the resulting data frame:

  FIPS StaID Dir Lane Year Month Day Hour Class Open TotW Axles AW ASp BW BSp  CW
1    2   103   3    1   17     1   2   16    10   61  312     6 57  54 56  13  54
2    2   103   3    1   17     1   2   16     6   55  244     3 39  38 84   5 121
3    2   103   3    1   17     1   2   16     9   67  285     5 40  38 54  13  65
  CSp DW DSp EW ESp FW FSp GW GSp HW HSp IW ISp JW JSp KW KSp LW LSp MW
1  96 54  15 53  15 38   0  0   0  0   0  0   0  0   0  0   0  0   0  0
2   0  0   0  0   0  0   0  0   0  0   0  0   0  0   0  0   0  0   0  0
3 104 62  12 64   0  0   0  0   0  0   0  0   0  0   0  0   0  0   0  0
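For comparison, the same extraction can be sketched without the inner loop, because substring accepts whole vectors of start and end positions. This is a swapped-in variant rather than the answer's own code: the helper name parse_fixed and the two sample records (copied from the question) are illustrative. Fields that lie beyond the end of a short line come back as empty strings, which as.numeric turns into NA; the sketch then sets them to 0 to match the output above.

```r
left  <- c(2,4,10,11,12,14,16,18,20,22,25,29,31,34,37,40,
           43,46,49,52,55,58,61,64,67,70,73,76,79,82,85,88,91,94,97,100,103)
right <- c(3,9,10,11,13,15,17,19,21,24,28,30,33,36,39,42,
           45,48,51,54,57,60,63,66,69,72,75,78,81,84,87,90,93,96,99,102,105)

# Hypothetical helper: parse every line with one vectorized substring() call.
parse_fixed <- function(lines, left, right) {
  rows <- lapply(lines, function(line) {
    vals <- as.numeric(substring(line, left, right))  # "" becomes NA
    vals[is.na(vals)] <- 0                            # mimic the zero-filled result
    vals
  })
  as.data.frame(do.call(rbind, rows))
}

# Two records taken from the sample in the question.
sample_lines <- c(
  "W02000103311701021610061031206057054056013054096054015053015038",
  "W02000103311701021705073004302024043019"
)
df_sketch <- parse_fixed(sample_lines, left, right)
```

If missing axle data should stay distinguishable from a genuine zero, dropping the line that replaces NA with 0 is probably the better choice.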

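Comments:

This is very helpful. Thanks, Edward!

I ran into some problems when trying to enter the code above. It may be a simple mistake on my part, I'm not sure. I've added my problems at the bottom of the post.

Basically, I'm running into the object 'x' not being defined in length(x), and input.set not being defined either.

Sorry. I've edited it to change x to input.set; I had failed to change every occurrence. By the way, you can use the readLines function to get the file contents, as in input.set <- readLines("filename.txt")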
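To tie the comments together, here is a hedged sketch of both ways of getting the data in. It writes two sample records to a temporary file (a stand-in for the real file, whose path is not known here), reads them with readLines, and also shows a read.fwf call with the issues raised in the comments addressed: FALSE instead of false, and no row.names argument. The negative first width, which tells read.fwf to skip the "W", is documented behaviour; that short lines yield NA for the missing fields is how read.fwf is expected to behave and should be checked against the full file.

```r
# Stand-in for the real data file: two records from the question.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "W02000103311701021610061031206057054056013054096054015053015038",
  "W02000103311701021705073004302024043019"
), path)

# Option 1: read raw lines for the substring-based loop in the answer.
input.set <- readLines(path)

# Option 2: let read.fwf split the fields; -1 skips the leading "W".
widths <- c(-1, 2, 6, 1, 1, 2, 2, 2, 2, 2, 3, 4, 2, rep(3, 25))
fwf <- read.fwf(path, widths = widths, header = FALSE)
```

Column names can then be assigned with names(fwf) in the same way as for df in the answer.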
