Import txt file in R ignoring first few lines

Posted 2023-03-21

Asked: 2016-03-04 05:13:52

Question:

I downloaded Scotland rainfall data from the Met Office.

The first few lines:

Scotland Rainfall (mm)
Areal series, starting from 1910
Allowances have been made for topographic, coastal and urban effects where relationships are found to exist.
Seasons: Winter=Dec-Feb, Spring=Mar-May, Summer=June-Aug, Autumn=Sept-Nov. (Winter: Year refers to Jan/Feb).
Values are ranked and displayed to 1 dp. Where values are equal, rankings are based in order of year descending.
Data are provisional from February 2015 & Winter 2015. Last updated 26/11/2015

     JAN  Year     FEB  Year     MAR  Year     APR  Year     MAY  Year     JUN  Year     JUL  Year     AUG  Year    SEP   Year     OCT  Year     NOV  Year     DEC  Year     WIN  Year     SPR  Year     SUM  Year     AUT  Year     ANN  Year
   293.8  1993   278.1  1990   238.5  1994   191.1  1947   191.4  2011   155.0  1938   185.6  1940   216.5  1985   267.6  1950   258.1  1935   262.0  2009   300.7  2013   743.6  2014   409.5  1986   455.6  1985   661.2  1981  1886.4  2011
   292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
   275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014

I am trying to read this txt file into R and tried the following:

fileURL <- "http://www.metoffice.gov.uk/pub/data/weather/uk/climate/datasets/Rainfall/ranked/Scotland.txt"

if (!file.exists("scotland_rainfall.txt")) {
        # this will download the file into the current working directory
        download.file(fileURL, destfile = "scotland_rainfall.txt")
        dateDownload <- Sys.Date() # 30-11-2015
}


scotland_weather <- read.table("scotland_rainfall.txt",skip = 8,header = F,sep = "\t",na.strings = "")

It interprets everything as a single factor with different levels:

> head(scotland_weather)
                                                                                                                                                                                                                                              V1
1    293.8  1993   278.1  1990   238.5  1994   191.1  1947   191.4  2011   155.0  1938   185.6  1940   216.5  1985   267.6  1950   258.1  1935   262.0  2009   300.7  2013   743.6  2014   409.5  1986   455.6  1985   661.2  1981  1886.4  2011
2    292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
3    275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014
4    252.3  2015   227.9  1989   200.2  1967   142.1  1949   149.5  2015   137.7  1931   165.8  2010   191.4  1962   189.7  2011   247.7  1938   231.3  1917   265.4  2011   638.3  2007   393.2  1967   422.6  1956   594.5  1935  1735.8  1938
5    246.2  1974   224.9  2014   180.2  1979   133.5  1950   137.4  2003   135.0  1966   162.9  1956   190.3  2014   189.7  1927   242.3  1983   229.9  1981   264.0  2006   608.9  1990   391.7  1992   397.0  2004   590.6  1982  1720.0  2008
6    245.0  1975   195.6  1995   180.0  1989   132.9  1932   129.7  2007   131.7  2004   159.9  1985   189.1  2004   189.6  1985   240.9  2001   224.9  1951   261.0  1912   592.8  2015   389.1  1913   390.1  1938   589.2  2006  1716.5  1954

> str(scotland_weather)
'data.frame':   106 obs. of  1 variable:
 $ V1: Factor w/ 106 levels "    38.6  1963    10.3  1932    28.7  1929    14.0  1974    22.5  1984    30.1  1988    32.7  1913     5.1  1947    31.7  1972 "| __truncated__,..: 106 105 104 103 102 101 100 99 98 97 ...

I also tried header = T:

> scotland_weather <- read.table("scotland_rainfall.txt",skip = 8,header = T,sep = "\t",na.strings = "")
> head(scotland_weather)
    X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011
1    292.2  1928   258.8  1997   233.4  1990   149.0  1910   168.7  1986   137.9  2002   181.4  1988   211.9  1992   221.2  1981   254.0  1954   244.8  1938   268.5  1986   649.5  1995   401.3  2015   435.6  1948   633.8  1954  1828.1  1990
2    275.6  2008   244.7  2002   201.3  1992   146.8  1934   155.9  1925   137.8  1948   170.1  1939   202.3  2009   193.9  1982   248.8  2014   242.2  2006   267.2  1929   645.4  2000   393.7  1994   427.8  2009   615.8  1938  1756.8  2014
3    252.3  2015   227.9  1989   200.2  1967   142.1  1949   149.5  2015   137.7  1931   165.8  2010   191.4  1962   189.7  2011   247.7  1938   231.3  1917   265.4  2011   638.3  2007   393.2  1967   422.6  1956   594.5  1935  1735.8  1938
4    246.2  1974   224.9  2014   180.2  1979   133.5  1950   137.4  2003   135.0  1966   162.9  1956   190.3  2014   189.7  1927   242.3  1983   229.9  1981   264.0  2006   608.9  1990   391.7  1992   397.0  2004   590.6  1982  1720.0  2008
5    245.0  1975   195.6  1995   180.0  1989   132.9  1932   129.7  2007   131.7  2004   159.9  1985   189.1  2004   189.6  1985   240.9  2001   224.9  1951   261.0  1912   592.8  2015   389.1  1913   390.1  1938   589.2  2006  1716.5  1954
6    241.9  2005   194.8  1998   179.6  1921   132.3  1927   129.6  1920   130.4  1980   158.0  1953   188.8  1948   187.5  1935   238.1  2008   223.2  1986   260.8  1949   580.6  1920   386.5  1947   387.5  2012   587.8  1984  1696.7  2004
> str(scotland_weather)
'data.frame':   105 obs. of  1 variable:
 $ X293.8..1993...278.1..1990...238.5..1994...191.1..1947...191.4..2011...155.0..1938...185.6..1940...216.5..1985...267.6..1950...258.1..1935...262.0..2009...300.7..2013...743.6..2014...409.5..1986...455.6..1985...661.2..1981..1886.4..2011: Factor w/ 105 levels "    38.6  1963    10.3  1932    28.7  1929    14.0  1974    22.5  1984    30.1  1988    32.7  1913     5.1  1947    31.7  1972 "| __truncated__,..: 105 104 103 102 101 100 99 98 97 96 ...

I would like to keep the same column names as in the txt file.

Any other ideas would be greatly appreciated.

Thanks

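Comments:

Have you tried skip = 7, header = TRUE?

Yes. Please see the updated post.

check.names=FALSE. Normally R does not tolerate multiple columns with the same name.

Maybe it is not really tab-delimited. Try omitting the sep argument entirely (it defaults to generic whitespace).

The question could be rephrased: how to import a file with independent columns, possibly of different lengths.

Answer 1: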

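It looks like the file does have fixed-width fields, but the header does not line up with the data rows, so read the header and the data separately, like this. No packages are needed.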

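# Header row: skip the 7 preamble lines and read a single row of names
hdr <- read.table(fileURL, skip = 7, nrows = 1, as.is = TRUE)
widths <- rep(c(8, 6), times = 17) # 8, 6, 8, 6, ..., 8, 6
# Data rows: fixed-width fields, keeping the duplicated "Year" names
dd <- read.fwf(fileURL, widths, skip = 8, col.names = hdr, check.names = FALSE)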
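
For anyone who wants to try the same approach without downloading anything, here is a minimal sketch on a cut-down table held in a character vector. The `sample_lines` vector, the `col_names` helper variable, and the blank cell in the last row are assumptions made for illustration; they are not taken from the real file, which has 17 value/year pairs and a longer preamble.

```r
# Hypothetical miniature of the file (an assumption for illustration):
# a title line, a blank line, a header row, then fixed-width data rows
# where each value takes 8 characters and each year takes 6.
sample_lines <- c(
  "Scotland Rainfall (mm)",
  "",
  "     JAN  Year     FEB  Year     ANN  Year",
  "   293.8  1993   278.1  1990  1886.4  2011",
  "   292.2  1928   258.8  1997  1828.1  1990",
  # FEB value and year left blank (14 spaces) to mimic a shorter column
  paste0("   275.6  2008", strrep(" ", 14), "  1756.8  2014")
)

# Header: skip the title and the blank line, then read one row as character.
hdr <- read.table(text = sample_lines, skip = 2, nrows = 1, as.is = TRUE)
col_names <- unlist(hdr, use.names = FALSE)

widths <- rep(c(8, 6), times = 3)  # 8, 6, 8, 6, 8, 6

# Data: read.fwf() accepts a connection, so wrap the lines in a text
# connection and skip the title, the blank line, and the header row.
con <- textConnection(sample_lines)
dd <- read.fwf(con, widths, skip = 3, col.names = col_names, check.names = FALSE)
close(con)

print(dd)
```

Because the fields are cut by position rather than by whitespace, the blank cell should come through as a missing value in its own column instead of shifting the later values to the left, which is the main reason to prefer `read.fwf` over whitespace splitting for a layout like this.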

Note: widths can be computed from the first line of data like this:

one.line <- readLines(fileURL, n = 9)[9] # char string with 1st line of data
widths <- diff(c(0, gregexpr("\\S(?=\\s)", paste(one.line, ""), perl = TRUE)[[1]]))
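
To see what that expression does, here is a small worked example on a shortened data line (the `sample_line` string is an illustrative assumption, trimmed to three value/year pairs). The pattern matches every non-whitespace character that is immediately followed by whitespace, which is the last character of each field; the fields are right-aligned, so differencing those end positions gives the field widths. It relies on the chosen line having no blank fields.

```r
# Shortened data line, assumed for illustration (three value/year pairs).
sample_line <- "   293.8  1993   278.1  1990  1886.4  2011"

# Append a space so the final field is also followed by whitespace,
# then find the position of the last character of every field.
ends <- gregexpr("\\S(?=\\s)", paste(sample_line, ""), perl = TRUE)[[1]]

# Distance from one field end to the next is the width of each field.
widths <- diff(c(0, ends))

print(as.integer(ends))
print(widths)
```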

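Comments:

This solves the problem perfectly. Could you explain how you calculated the widths? I am trying to understand that part; the rest seems straightforward.

That is exactly what I am looking at... is it times, length.out, each, or something else?

Now I understand completely... thanks... accepting this as the answer.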
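
On the `rep()` question raised in the comments: the answer's code uses `times`, which repeats the whole vector end to end. A quick sketch of how the three arguments differ; the variable names and the count of 3 are only for illustration, and the answer itself uses 17 repeats.

```r
# times: repeat the whole vector, giving alternating value/year widths
by_times <- rep(c(8, 6), times = 3)        # 8 6 8 6 8 6

# each: repeat every element in turn, which would NOT match the file layout
by_each <- rep(c(8, 6), each = 3)          # 8 8 8 6 6 6

# length.out: recycle the vector until the requested length is reached
by_length <- rep(c(8, 6), length.out = 5)  # 8 6 8 6 8

print(by_times)
print(by_each)
print(by_length)
```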
