Import txt file in R ignoring first few lines
Question (posted 2016-03-04 05:13:52): I downloaded the Scotland rainfall data from the Met Office.
The first few lines:
Scotland Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from February 2015 & Winter 2015. Last updated 26/11/2015
JAN Year FEB Year MAR Year APR Year MAY Year JUN Year JUL Year AUG Year SEP Year OCT Year NOV Year DEC Year WIN Year SPR Year SUM Year AUT Year ANN Year
293.8 1993 278.1 1990 238.5 1994 191.1 1947 191.4 2011 155.0 1938 185.6 1940 216.5 1985 267.6 1950 258.1 1935 262.0 2009 300.7 2013 743.6 2014 409.5 1986 455.6 1985 661.2 1981 1886.4 2011
292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
I am trying to read this txt file into R and tried the following:
fileURL <- "http://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Rainfall/ranked/Scotland.txt"
if (!file.exists("scotland_rainfall.txt")) {
  # this will download the file into the current working directory
  download.file(fileURL, destfile = "scotland_rainfall.txt")
}
dateDownload <- Sys.Date() # 30-11-2015
scotland_weather <- read.table("scotland_rainfall.txt", skip = 8, header = F, sep = "\t", na.strings = "")
It reads every row into a single factor column, with a separate level for each row:
> head(scotland_weather)
V1
1 293.8 1993 278.1 1990 238.5 1994 191.1 1947 191.4 2011 155.0 1938 185.6 1940 216.5 1985 267.6 1950 258.1 1935 262.0 2009 300.7 2013 743.6 2014 409.5 1986 455.6 1985 661.2 1981 1886.4 2011
2 292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
3 275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
4 252.3 2015 227.9 1989 200.2 1967 142.1 1949 149.5 2015 137.7 1931 165.8 2010 191.4 1962 189.7 2011 247.7 1938 231.3 1917 265.4 2011 638.3 2007 393.2 1967 422.6 1956 594.5 1935 1735.8 1938
5 246.2 1974 224.9 2014 180.2 1979 133.5 1950 137.4 2003 135.0 1966 162.9 1956 190.3 2014 189.7 1927 242.3 1983 229.9 1981 264.0 2006 608.9 1990 391.7 1992 397.0 2004 590.6 1982 1720.0 2008
6 245.0 1975 195.6 1995 180.0 1989 132.9 1932 129.7 2007 131.7 2004 159.9 1985 189.1 2004 189.6 1985 240.9 2001 224.9 1951 261.0 1912 592.8 2015 389.1 1913 390.1 1938 589.2 2006 1716.5 1954
> str(scotland_weather)
'data.frame': 106 obs. of 1 variable:
$ V1: Factor w/ 106 levels " 38.6 1963 10.3 1932 28.7 1929 14.0 1974 22.5 1984 30.1 1988 32.7 1913 5.1 1947 31.7 1972 "| __truncated__,..: 106 105 104 103 102 101 100 99 98 97 ...
I also tried header = T:
> scotland_weather <- read.table("scotland_rainfall.txt",skip = 8,header = T,sep = "\t",na.strings = "")
> head(scotland_weather)
X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011
1 292.2 1928 258.8 1997 233.4 1990 149.0 1910 168.7 1986 137.9 2002 181.4 1988 211.9 1992 221.2 1981 254.0 1954 244.8 1938 268.5 1986 649.5 1995 401.3 2015 435.6 1948 633.8 1954 1828.1 1990
2 275.6 2008 244.7 2002 201.3 1992 146.8 1934 155.9 1925 137.8 1948 170.1 1939 202.3 2009 193.9 1982 248.8 2014 242.2 2006 267.2 1929 645.4 2000 393.7 1994 427.8 2009 615.8 1938 1756.8 2014
3 252.3 2015 227.9 1989 200.2 1967 142.1 1949 149.5 2015 137.7 1931 165.8 2010 191.4 1962 189.7 2011 247.7 1938 231.3 1917 265.4 2011 638.3 2007 393.2 1967 422.6 1956 594.5 1935 1735.8 1938
4 246.2 1974 224.9 2014 180.2 1979 133.5 1950 137.4 2003 135.0 1966 162.9 1956 190.3 2014 189.7 1927 242.3 1983 229.9 1981 264.0 2006 608.9 1990 391.7 1992 397.0 2004 590.6 1982 1720.0 2008
5 245.0 1975 195.6 1995 180.0 1989 132.9 1932 129.7 2007 131.7 2004 159.9 1985 189.1 2004 189.6 1985 240.9 2001 224.9 1951 261.0 1912 592.8 2015 389.1 1913 390.1 1938 589.2 2006 1716.5 1954
6 241.9 2005 194.8 1998 179.6 1921 132.3 1927 129.6 1920 130.4 1980 158.0 1953 188.8 1948 187.5 1935 238.1 2008 223.2 1986 260.8 1949 580.6 1920 386.5 1947 387.5 2012 587.8 1984 1696.7 2004
> str(scotland_weather)
'data.frame': 105 obs. of 1 variable:
$ X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011: Factor w/ 105 levels " 38.6 1963 10.3 1932 28.7 1929 14.0 1974 22.5 1984 30.1 1988 32.7 1913 5.1 1947 31.7 1972 "| __truncated__,..: 105 104 103 102 101 100 99 98 97 96 ...
I would like to keep the same column names as in the txt file.
Any other ideas would be much appreciated.
Thanks
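Comments:
Have you tried skip = 7, header = TRUE?
Yes. Please see the updated post.
Use check.names = FALSE. Normally R does not tolerate multiple columns with the same name.
Maybe it is not really tab-delimited. Try omitting the sep argument entirely (the default is any whitespace).
The question could be rephrased: how to import a file whose columns are independent and possibly of different lengths.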
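The suggestions in the comments (skip to the header row, keep duplicate names, let sep default to whitespace) can be sketched on a made-up miniature of the file. The sample lines and the names `ws_file` and `ws` are assumptions for illustration only. Note that whitespace splitting relies on every row having every field; if a field is blank in some row, the remaining values would shift left or the read would fail, which is where a fixed-width read becomes the safer choice.

```r
# Hypothetical miniature of the file: two preamble lines, a header row,
# then complete data rows separated by runs of spaces.
ws_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Scotland Rainfall (mm)",
  "Areal series, starting from 1910",
  "     JAN  Year     FEB  Year",
  "   293.8  1993   278.1  1990",
  "   292.2  1928   258.8  1997"
), ws_file)

# No sep argument: fields are split on any run of whitespace.
# check.names = FALSE keeps the repeated "Year" name unchanged.
ws <- read.table(ws_file, skip = 2, header = TRUE, check.names = FALSE)
str(ws)
```

With duplicate names kept, `ws$Year` returns only the first matching column, so positional indexing such as `ws[[4]]` is the more reliable way to reach the later "Year" columns.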
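Answer 1:
It looks like the file does have fixed-width fields, but the header does not line up with the data rows, so read the header and the data separately, like this. No packages are needed.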
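# Read only the header row (skip the 7 lines above it) as character strings
hdr <- read.table(fileURL, skip = 7, nrows = 1, as.is = TRUE)
# Each value/year pair occupies 8 + 6 characters, repeated for 17 pairs
widths <- rep(c(8, 6), times = 17) # 8, 6, 8, 6, ..., 8, 6
# Read the data rows by position; check.names = FALSE keeps the repeated "Year" names
dd <- read.fwf(fileURL, widths, skip = 8, col.names = hdr, check.names = FALSE)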
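To try the same idea without downloading anything, here is a minimal sketch on a made-up miniature of the file. The sample lines, the temporary file, and the names `tmp`, `mini_hdr`, and `mini` are assumptions for illustration; only the 8/6 field widths and the separate header read follow the answer. The sketch also passes the header through `unlist()` so that `col.names` receives a plain character vector, which is a cautious choice here rather than something the answer requires.

```r
# Hypothetical miniature of the rainfall file: two preamble lines, a header
# row, then fixed-width data (8 characters per value, 6 per year).
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Scotland Rainfall (mm)",
  "Areal series, starting from 1910",
  sprintf("%8s%6s%8s%6s", "JAN", "Year", "FEB", "Year"),
  sprintf("%8.1f%6d%8.1f%6d", 293.8, 1993L, 278.1, 1990L),
  sprintf("%8.1f%6d%8.1f%6d", 292.2, 1928L, 258.8, 1997L),
  sprintf("%8.1f%6d", 275.6, 2008L)  # short row: the FEB fields are absent
), tmp)

# Read the header row on its own, as character strings.
mini_hdr <- read.table(tmp, skip = 2, nrows = 1, as.is = TRUE)

# Read the data by character position; absent fields come back as NA.
mini <- read.fwf(tmp, widths = rep(c(8, 6), times = 2), skip = 3,
                 col.names = unlist(mini_hdr, use.names = FALSE),
                 check.names = FALSE)
str(mini)
```

Because the fields are cut by position instead of by separator, the short last row still lines up under the right columns, with NA in the fields it lacks.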
Note: widths can be computed from the first line of data like this:
one.line <- readLines(fileURL, n = 9)[9] # char string with 1st line of data
widths <- diff(c(0, gregexpr("\\S(?=\\s)", paste(one.line, ""), perl = TRUE)[[1]]))
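To see how that width calculation works, here is a small worked sketch on a made-up two-pair line. The names `demo_line`, `field_ends`, and `demo_widths` are assumptions for illustration. The pattern `\\S(?=\\s)` matches every non-space character that is immediately followed by whitespace, that is, the last character of each field; `paste(x, "")` appends one trailing space so the final field also qualifies. Differencing those end positions (starting from 0) gives the field widths. This presumes the line used has no blank fields.

```r
# Hypothetical data line with two value/year pairs (8 + 6 characters each).
demo_line <- sprintf("%8.1f%6d%8.1f%6d", 293.8, 1993L, 278.1, 1990L)

# Positions of the last character of every field.
field_ends <- gregexpr("\\S(?=\\s)", paste(demo_line, ""), perl = TRUE)[[1]]

# Width of each field = distance between consecutive end positions.
demo_widths <- diff(c(0, field_ends))
print(demo_widths)
```

For this line the result matches `rep(c(8, 6), times = 2)`, which is the same alternating 8, 6 pattern that the answer builds with `times = 17` for the full file.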
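Comments:
This solves the problem perfectly. Can you explain how you calculated widths? I am trying to understand it... the rest seems straightforward.
That is exactly what I was looking at... is it times, length.out, each, or something else?
Now I understand it completely... thanks... accepting this as the answer.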