Summarising daily data that lacks an explicit grouping variable (month)

I have a data frame covering 6000 locations. For each location, I have 36 years of daily rainfall data.

Sample data:

      library(data.table)
      set.seed(123)

      mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
      dat <- data.table(mat)
      names(dat) <- paste0("d_",1:365)

      dat$loc.id <- rep(1:6000, each = 36)
      dat$year <- rep(1980:2015, times = 6000)

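What I want to do is compute, for each location, the long-term mean rainfall for each month. For example, for loc.id = 1, the mean rainfall in January, February, March, and so on through December.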
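
The missing grouping variable can be derived from the column position. As a minimal sketch, assuming every row follows a 365-day (non-leap) calendar, the day-of-year index maps to a month as shown here; `doy` and `month_of_day` are illustrative names that do not appear in the question, and 2017 is only a convenient non-leap reference year:

```r
# Map day-of-year (1..365) to a month number, assuming a non-leap calendar.
# 2017 is used only as a convenient non-leap reference year (assumption).
doy <- 1:365
month_of_day <- as.integer(format(as.Date(doy - 1, origin = "2017-01-01"), "%m"))

# Number of columns (days) that fall into each month
table(month_of_day)

# Spot checks around month boundaries
month_of_day[c(1, 31, 32, 59, 60, 365)]   # 1 1 2 2 3 12
```

A vector like this can then serve as the grouping variable for the 365 day columns.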

The data is called dat, and it is a data.table.

    library(dplyr)
    library(tidyr)  # gather() comes from tidyr

Here is what I did:

      loc.list <- unique(dat$loc.id)
      my.list <- list() # a list to store results

      ptm <- proc.time()

      for(i in seq_along(loc.list)){

          n <- loc.list[i]
          df1 <- dat[dat$loc.id == n,]
          # melt to long format; loc.id is excluded as well, otherwise it is melted as a bogus "day" that turns into NA below
          df2 <- gather(df1, day, rain, -year, -loc.id)

          df3 <- df2 %>% mutate(day = gsub("d_","", day)) %>% # strip the "d_" prefix so that day reads "1", "2", ..., "365"
                         mutate(day = as.numeric(day)) %>%  # make sure the day column is numeric
                         arrange(year,day) %>% # make sure the rows are in order
                         mutate(month = strptime(paste(year, day), format = "%Y %j")$mon + 1) %>% # assign each day to a month
                         group_by(year,month) %>%  # group by year and month
                         summarise(month.rain = sum(rain)) %>% # total rainfall for each year and month at this location
                         group_by(month) %>% # group by month
                         summarise(month.mean = round(mean(month.rain), digits = 2)) # long-term mean for each month

          my.list[[i]] <- df3
          }
      proc.time() - ptm

      user  system elapsed 
      1036.17    0.20 1040.68

I would like to ask whether there is a more efficient and faster way to do this.

Answer

Another data.table alternative:

# change column names to month, grabbed from 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b"),
                "loc.id", "year"))

# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
          variable.name = "month", value.name = "rain")

# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month)]

This appears to be roughly 7 times faster than caw5cs's answer. Martin Morgan's result is in a different format, so the timings cannot be compared directly.


If you prefer unique column names in 'dat', you can use %b_%d (month_day) instead of %b. Then use substr in by to extract the month part:

# change column names to month_day, using 365 dates of a non-leap year
setnames(dat, c(format(as.Date("2017-01-01") + 0:364, "%b_%d"),
                "loc.id", "year"))

# melt to long format
d <- melt(dat, id.vars = c("loc.id", "year"),
          variable.name = "month_day", value.name = "rain")

# calculate mean rain by location and month
d2 <- d[ , .(mean_rain = mean(rain)), by = .(loc.id, month = substr(month_day, 1, 3))]
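
Note that `%b` depends on the session locale, so the three-character `substr()` trick assumes English-style month abbreviations. As a hedged alternative (not part of the original answer), a numeric month code from `%m` yields unique, locale-independent names; `nm` and `month_part` are illustrative names:

```r
# Build 365 unique column names of the form "MM_DD" from a non-leap year.
# "%m" is numeric, so the result does not depend on the locale (unlike "%b").
nm <- format(as.Date("2017-01-01") + 0:364, "%m_%d")

anyDuplicated(nm)               # 0 means all 365 names are unique

# The month part is always the first two characters
month_part <- substr(nm, 1, 2)
table(month_part)               # days per month, keyed "01" .. "12"
```

With names like these, the grouping would read `by = .(loc.id, month = substr(month_day, 1, 2))`.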

Another answer

Use the cryptically named rowsum() to sum the daily rainfall at each site

loc.id = rep(1:6000, each = 36)
daily.by.loc = rowsum(mat, loc.id)

and use the same trick on the transposed matrix to sum by month (leap years must have been ignored, since there are 365 columns).

month = factor(
    months(as.Date(0:364, origin="1970-01-01")),
    levels = month.name
)
loc.by.month = rowsum(t(daily.by.loc), month)
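
To see what the two rowsum() calls do, here is a toy-sized sketch; the matrix `m` and the grouping vectors are made up for illustration and stand in for the rainfall matrix, the site ids and the month factor:

```r
# Toy data: 6 rows (2 sites x 3 years) and 4 "day" columns
m <- matrix(1:24, nrow = 6, ncol = 4)

# Step 1: sum rows that share a site id (rows 1-3 are site 1, rows 4-6 are site 2)
row_group <- rep(1:2, each = 3)
by_row <- rowsum(m, row_group)            # 2 x 4, one row per site

# Step 2: rowsum() only groups rows, so transpose to group the day columns
col_group <- c("a", "a", "b", "b")        # stand-in for the month factor
by_col <- rowsum(t(by_row), col_group)    # 2 x 2: rows = column groups, cols = sites

by_col
sum(by_col) == sum(m)                     # nothing is lost by the two-step aggregation
```

The result is oriented with groups of columns in rows and sites in columns, which is why a final transpose is needed to get one row per site.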

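Calculate the mean by dividing by the number of observations; R's column-major matrix representation and recycling rules apply. The transpose restores the orientation of the original data.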

days.per.month = tabulate(month)
ans = t(loc.by.month / (36 * days.per.month))
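
The division works because a vector is recycled down the columns of a matrix. A small sketch with made-up numbers (`x` and `days` are illustrative names) shows why the division has to happen before the transpose:

```r
# Rows = months, columns = sites, as in loc.by.month
x <- matrix(c(31, 28, 62, 56), nrow = 2)
days <- c(31, 28)                 # one divisor per row (month)

# The vector is recycled down each column, so row i is divided by days[i]
x / days

# Dividing the transposed matrix by the same vector would pair values wrongly;
# sweep() makes the intended margin explicit and gives the same answer as above
sweep(t(x), 2, days, "/")
```

Here `x / days` yields a column of ones for the first site and a column of twos for the second.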

The result is a 6000 x 12 matrix

> dim(ans)
[1] 6000   12
> head(ans, 3)
      January     February       March       April         May         June
1  0.01554659  0.002043651 -0.02950717 -0.02700926 0.003521505 -0.011268519
2  0.04953405  0.032926587 -0.04959677  0.02808333 0.022051971  0.009768519
3 -0.01125448 -0.023343254 -0.02672939  0.04012963 0.018530466  0.035583333
          July       August   September     October    November    December
1  0.009874552 -0.030824373 -0.04958333 -0.03366487 -0.07390741 -0.07899642
2 -0.011630824 -0.003369176 -0.00100000 -0.00594086 -0.02817593 -0.01161290
3  0.031810036  0.059641577 -0.01109259  0.04646953 -0.01601852  0.03103943

All in under a second.

Another answer

I badly misread the question the first time, oops! This version seems to work as intended.

library(data.table)
set.seed(123)

mat <- matrix(round(rnorm(6000*36*365), digits = 2),nrow = 6000*36, ncol = 365)
dat <- data.table(mat)
names(dat) <- rep(paste0("d_",1:365))

dat$loc.id <- rep(1:6000, each = 36)
dat$year <- rep(1980:2015, times = 6000)


system.time({

# convert to long format, with the day number (1..365) as the column name
date_cols <- colnames(dat)[1:365]
setnames(dat, date_cols, as.character(1:365))
dat.long <- melt(dat, measure.vars=as.character(1:365), variable.name="day", value.name="rainfall")

# as.Date() offsets start at 0 for Jan 1, so we shift the day by 1
# (day is a factor here; its level codes 1..365 match the day numbers)
dat.long[, day := as.numeric(day) - 1]
setkey(dat.long, year, day)

# Make table for merging year/day/month
months <- CJ(year=1980:2015, day=0:365)
months[, date := as.Date(day, origin=paste(year, "-01-01", sep=""))]
months[, month := tstrsplit(date, "-")[2]]
setkey(months, year, day)

# Merge tables to get month column
dat.merge <- merge(dat.long, months)



# aggregate by location and month
dat.ag <- dat.merge[, list(mean_rainfall = mean(rainfall)), by=list(loc.id, month)]
})

which produces:

  user  system elapsed
14.420   4.205  18.626

> dat.ag
       loc.id month mean_rainfall
    1:      1    01   0.015546595
    2:      2    01   0.049534050
    3:      3    01  -0.011254480
    4:      4    01  -0.019453405
    5:      5    01   0.005860215
   ---
71996:   5996    12   0.027407407
71997:   5997    12   0.020334237
71998:   5998    12   0.043360434
71999:   5999    12  -0.006856369
72000:   6000    12   0.040542005
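
One thing to keep in mind when comparing this result with the other answers: this approach builds real dates for each year, so in leap years the 60th column is treated as 29 February rather than 1 March. A small base R sketch of that difference (the offsets and years are chosen purely for illustration):

```r
# Zero-based offset 59 corresponds to the 60th day column
day_index <- 59

format(as.Date(day_index, origin = "1980-01-01"), "%m-%d")  # leap year:     "02-29"
format(as.Date(day_index, origin = "1981-01-01"), "%m-%d")  # non-leap year: "03-01"

# The last column (offset 364) likewise lands on different dates
format(as.Date(364, origin = "1980-01-01"), "%m-%d")        # "12-30"
format(as.Date(364, origin = "1981-01-01"), "%m-%d")        # "12-31"
```

Answers that map the 365 columns onto a single non-leap calendar will therefore assign a few days per leap year to a different month than this one does.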
