Speeding up R apply on a data frame

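I have a horse-racing dataset. For each race record, provided the track value is not missing, I want to count the number of times the horse has won in the past two years at the same venue, on the same track, and over a similar distance. I use apply to loop over every row, but it is very slow. How can I speed up the loop?

rdate: race date (year, month, day). venue: ST, HV. track: turf, all-weather track. distance: 1200, 1400, 1600, 1800, etc. ind_win: 0 (the horse did not finish first), 1 (the horse finished first).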

jc.data <- structure(list(rdate = structure(c(17450, 17475, 17481, 17496, 
17510, 17517, 17532, 17566, 17593, 17615, 17629, 17657, 17667, 
17796, 17817, 17839, 17856, 17860, 17881, 17881, 17902, 17902
), class = "Date"), venue = c("HV", "ST", "ST", "ST", "ST", "ST", 
"ST", "ST", "ST", "ST", "ST", "ST", "HV", "ST", "ST", "ST", "HV", 
"ST", "ST", "ST", "ST", "ST"), track = c("TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", "TURF", 
"TURF", "TURF", "TURF"), horsenum = c("A366", "A366", "A366", 
"A366", "A366", "A366", "A366", "A366", "A366", "A366", "A366", 
"A366", "A366", "B440", "B440", "B440", "A366", "B440", "A366", 
"B440", "A366", "B440"), distance = c(1800L, 1800L, 1600L, 1600L, 
1800L, 1600L, 1800L, 1800L, 1800L, 1600L, 1800L, 2000L, 1800L, 
1200L, 1400L, 1400L, 1650L, 1400L, 1600L, 1400L, 1800L, 1400L
), ind_win = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L)), row.names = c(NA, -22L
), class = "data.frame")

library(tidyverse)
library(lubridate)

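# Count the horse's wins over the previous two years at the same venue and
# track, at distances within 200 of the current race.
# apply() passes each row as a named character vector, hence as.integer().
HWinCountF <- function(df) {
  if (!is.na(df["track"])) {
    tmp <- subset(jc.data, horsenum == df["horsenum"] & rdate < df["rdate"] &
                    rdate > ymd(df["rdate"]) - years(2) &
                    venue == df["venue"] & track == df["track"] &
                    distance >= as.integer(df["distance"]) - 200 &
                    distance <= as.integer(df["distance"]) + 200)
    if (nrow(tmp) > 0) {
      # At least one earlier comparable race: count the wins among them
      return(nrow(tmp[tmp$ind_win == 1, ]))
    } else {
      return(NA)
    }
  } else {
    return(NA)
  }
}

jc.data['h_win_count'] <- apply(jc.data, 1, HWinCountF)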
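Answer

I want to count the number of times the horse has won in the past two years at the same venue, on the same track, and over a similar distance

Since this is a straightforward aggregation, avoid looping. Consider a merge of the data frame with itself, followed by subset, since you appear to need to compare observations with each other. Then run aggregate to total the wins. The code below runs on the posted data example.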

# MERGE BY COMMON VARIABLES AND SUBSET RESULTS BY DATE AND DISTANCE
compare_df <- subset(merge(jc.data, jc.data, by=c("horsenum", "venue", "track")),
                     rdate.x < rdate.y &
                     rdate.x > lubridate::ymd(rdate.y) - lubridate::years(2) &
                     distance.x >= as.integer(distance.y) - 200 &
                     distance.x <= as.integer(distance.y) + 200
              )

# SUM ind_win GROUPED BY COMMON VARIABLES
agg_df <- aggregate(cbind(h_win_count = ind_win.x) ~ horsenum + venue + track,
                    data = compare_df, FUN=sum)

agg_df
#   horsenum venue track h_win_count
# 1     A366    HV  TURF           0
# 2     A366    ST  TURF           0
# 3     B440    ST  TURF           2
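
The aggregation above returns one total per horse, venue and track. If you want a value for every race record, as the question's h_win_count column does, one possible variation is to tag each row with an id before the self-merge and sum the wins per id. The following is only a sketch under some assumptions: the eight rows are copied from the posted B440 and A366 records so the block runs on its own, `row_id` and `two_years_before` are names introduced here for illustration, and the two-year cutoff is computed in base R rather than with lubridate, so it may differ on leap days.

```r
# Eight rows copied from the posted sample so this block runs on its own
jc.data <- data.frame(
  rdate    = as.Date(c(17796, 17817, 17839, 17860, 17881, 17881, 17902, 17902),
                     origin = "1970-01-01"),
  venue    = "ST",
  track    = "TURF",
  horsenum = c("B440", "B440", "B440", "B440", "A366", "B440", "A366", "B440"),
  distance = c(1200L, 1400L, 1400L, 1400L, 1600L, 1400L, 1800L, 1400L),
  ind_win  = c(0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row identifier so results can be mapped back
jc.data$row_id <- seq_len(nrow(jc.data))

# Hypothetical helper: the date two calendar years earlier (base R only)
two_years_before <- function(d) {
  lt <- as.POSIXlt(d)
  lt$year <- lt$year - 2L
  as.Date(lt)
}

# Self-merge: ".cur" is the race being scored, ".past" a candidate earlier race
pairs <- merge(jc.data, jc.data,
               by = c("horsenum", "venue", "track"),
               suffixes = c(".cur", ".past"))

# Keep only earlier races inside the two-year window and within 200 of the distance
pairs <- subset(pairs,
                rdate.past < rdate.cur &
                  rdate.past > two_years_before(rdate.cur) &
                  abs(distance.past - distance.cur) <= 200)

# Sum past wins for each current row; rows with no qualifying past race get NA
wins <- tapply(pairs$ind_win.past, pairs$row_id.cur, sum)
jc.data$h_win_count <- as.integer(wins[as.character(jc.data$row_id)])

# Mirror the original function: a missing track yields NA
jc.data$h_win_count[is.na(jc.data$track)] <- NA

jc.data$h_win_count
```

The merge grows with the square of the number of races per horse, venue and track, so on a large dataset it may be worth filtering to the rows you need before merging.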
