Extracting first value from series of grouped variables in R to calculate travel time
【Posted】: 2017-03-07 16:45:53
【Question】:
I wrote code to calculate travel times for a small group of tagged fish. My data frame d, which holds each fish's "path", looks like this:
  TagID Station             arrival           departure
1  2051   I80_1 2012-04-04 20:20:04 2012-04-04 20:35:04
2  2051  Lisbon 2012-04-05 09:06:18 2012-04-05 09:11:36
3  2051    Rstr 2012-04-05 18:46:34 2012-04-05 19:03:21
4  2051    Rstr 2012-04-05 22:31:59 2012-04-05 22:51:09
5  2051    Rstr 2012-04-06 02:30:31 2012-04-06 02:54:01
6  2051 Base_TD 2012-04-06 06:52:39 2012-04-06 08:24:11
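My code extracts the first arrival time at the last station each fish reached (in the snippet above, that would be 2012-04-06 06:52:39 at station Base_TD).
Once I have identified that final arrival time, I summarise ttime as the total time elapsed for each fish since releasetime (a preset value), along with each fish's final station. I did this on the whole dataset with the dplyr pipeline below, but dplyr is the only way I know to do this task, and I'm worried that I'm propagating invisible errors through all the grouping and ungrouping. Is this a valid concern? How would I write equivalent code in base R to make sure I get the same results?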
library(dplyr)

releasetime <- as.POSIXct('2012-03-30 18:00:00', tz = 'Pacific/Pitcairn')
releasetime <- lubridate::with_tz(releasetime, tzone = 'UTC')

tt <- d %>%
  group_by(TagID, Station) %>%
  arrange(arrival) %>%
  slice(1) %>% # cuts df down to first detection of fish at each station
  ungroup() %>%
  group_by(TagID) %>% # get back up to full path level
  arrange(arrival) %>% # arrange path by arrival time
  summarise(ttime = last(arrival) - releasetime,
            laststation = Station[arrival == last(arrival)]) # now the last arrival should be the only arrival at the last station; summarize travel time for each fish.
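One detail worth checking in a pipeline like this is the unit of ttime. Subtracting two POSIXct values returns a difftime whose unit R picks automatically from the size of the difference. The sketch below uses two made-up arrival times (they are assumptions, not values from the dataset) to show the effect, and one way to pin the unit with difftime():

```r
# Hypothetical release and arrival times, all in UTC
release <- as.POSIXct("2012-03-30 18:00:00", tz = "UTC")
short_trip <- as.POSIXct("2012-03-30 20:00:00", tz = "UTC")  # 2 hours later
long_trip  <- as.POSIXct("2012-04-05 18:00:00", tz = "UTC")  # 6 days later

# With `-`, the unit depends on the magnitude of the difference
units(short_trip - release)
units(long_trip - release)

# Asking for days explicitly gives a plain number with a known unit
days_short <- as.numeric(difftime(short_trip, release, units = "days"))
days_long  <- as.numeric(difftime(long_trip, release, units = "days"))
```

Fixing the unit this way also makes it easier to compare the result against a base R version that computes days by hand.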
If you want to work with a sample dataset, here is the dput() of three different individual tracks:
d <- structure(list(TagID = c(2059L, 2059L, 2059L, 2059L, 2059L, 2059L,
2059L, 2059L, 2059L, 2059L, 2059L, 2062L, 2062L, 2062L, 2062L,
2062L, 2062L, 2062L, 2062L, 2062L, 2062L, 2066L, 2066L, 2066L,
2066L, 2066L, 2066L, 2066L, 2066L, 2066L, 2066L, 2066L, 2066L,
2066L), Station = c("I80_1", "Lisbon", "Rstr", "Rstr", "Base_TD",
"BCE", "MAE", "MAW", "MAW", "MAE", "MAE", "I80_1", "Lisbon",
"Rstr", "Base_TD", "BCE", "BCE", "BCE", "BCE", "BCE", "BCE",
"I80_1", "Lisbon", "Rstr", "BCE", "BCE", "BCE", "MAE", "MAW",
"MAW", "MAE", "MAE", "MAW", "MAE"), arrival = structure(c(1333451872,
1333562100, 1333607351, 1333626207, 1333642897, 1333725713, 1334092156,
1334092450, 1334102208, 1334102426, 1334169836, 1333232026, 1333301118,
1333364285, 1333383477, 1333729987, 1333746859, 1333788503, 1333844040,
1333857104, 1333884034, 1333184935, 1333229762, 1333270977, 1333503027,
1333533868, 1333542226, 1333822681, 1333823087, 1333832661, 1333832863,
1333861226, 1333861662, 1333877063), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), departure = structure(c(1333452194, 1333562472,
1333608264, 1333626844, 1333643196, 1333725773, 1334092599, 1334093077,
1334102905, 1334103169, 1334169868, 1333232307, 1333301776, 1333366712,
1333385467, 1333730036, 1333746859, 1333788634, 1333844585, 1333857123,
1333884226, 1333185124, 1333230300, 1333272832, 1333503224, 1333535705,
1333542296, 1333823638, 1333824235, 1333832964, 1333833171, 1333861898,
1333862298, 1333877179), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-34L), .Names = c("TagID", "Station", "arrival", "departure"))
The correct output should be:
TagID          ttime laststation
 2059 10.801505 days         MAW
 2062  6.606331 days         BCE
 2066  7.683877 days         MAW
Thank you very much for your help.
【Comments on the question】:
【Answer 1】:
do.call(rbind,
  lapply(split(d, d$TagID), function(a) {   # split by 'TagID' and loop over sub-groups
    a = a[!duplicated(a$Station), ]         # retain only the first row for each 'Station' (first in row order, so rows must already be in 'arrival' order)
    a = a[order(a$arrival), ]               # sort each sub-group by 'arrival'
    cbind(TagID = a$TagID[1],               # obtain TagID, station, and ttime of the sub-group
          Last_Station = a$Station[NROW(a)],
          ttime = (as.numeric(as.POSIXct(a$arrival[NROW(a)])) - as.numeric(releasetime))/(60*60*24))
  })
)
# TagID Last_Station ttime
#[1,] "2059" "MAW" "10.8015046296296"
#[2,] "2062" "BCE" "6.60633101851852"
#[3,] "2066" "MAW" "7.68387731481481"
【Comments】:
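Quick question: how do I know that a = a[!duplicated(a$Station),] won't get rid of the rows with the arrivals I need? How does duplicated decide which rows to keep and which to drop? I read the documentation but still don't understand it, my apologies.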
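duplicated does not mark the first occurrence as TRUE. It marks TRUE only from the second occurrence onward (run duplicated(c(2,2,3))). So it should not get rid of the arrivals you need.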
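The behaviour described in this comment can be checked on a small vector and on a toy data frame. The toy detections below are hypothetical, not rows from the question's dataset. Note that "first" means first in row order, so the rows have to be sorted by arrival before duplicated() is applied:

```r
# duplicated() flags an element only if an equal element appeared earlier
x <- c(2, 2, 3)
dup <- duplicated(x)    # FALSE TRUE FALSE
kept <- x[!dup]         # 2 3

# Hypothetical detections, already sorted by arrival
toy <- data.frame(Station = c("A", "B", "B", "C"),
                  arrival = c(10, 20, 30, 40))

# Default: keep the first row seen for each station
first_rows <- toy[!duplicated(toy$Station), ]

# fromLast = TRUE scans from the end, so the last row per station is kept
last_rows <- toy[!duplicated(toy$Station, fromLast = TRUE), ]
```

Here first_rows holds the arrivals 10, 20 and 40, while last_rows holds 10, 30 and 40.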
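I think I've got it now. So if for some reason I needed to extract the second-to-last arrival, or any index other than the first, duplicated wouldn't be the way to go, but it works for this particular problem. Thanks!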
@Von, right. If the requirement were different and I still took this approach, I would probably try combining duplicated with head and tail or something similar.
【Answer 2】:
We can try split from base R:
r1 <- do.call(rbind, lapply(split(d, list(d$TagID, d$Station),
         drop = TRUE), function(x) head(x[order(x$arrival), ], 1)))
r2 <- do.call(rbind, lapply(split(r1, r1$TagID), function(x) {
  x1 <- x[order(x$arrival), ]   # sort the first detections of this fish by arrival
  data.frame(TagID = x1$TagID[1],
             ttime = x1$arrival[nrow(x1)] - releasetime,
             laststation = x1$Station[x1$arrival == x1$arrival[nrow(x1)]])
}))
row.names(r2) <- NULL
r2
# TagID ttime laststation
#1 2059 10.801505 days MAW
#2 2062 6.606331 days BCE
#3 2066 7.683877 days MAW
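Since the question asks how to confirm that two implementations agree, one option is to compute the summary by two independent routes and compare them with all.equal(). The sketch below is only an illustration on made-up data: the toy data frame, the release time, and the names res1 and res2 are all assumptions, and neither route is the exact code from the answers.

```r
# Hypothetical detections for two fish (not the question's data)
toy <- data.frame(
  TagID   = c(1L, 1L, 1L, 2L, 2L),
  Station = c("A", "B", "B", "A", "C"),
  arrival = as.POSIXct(c("2012-04-01 00:00:00", "2012-04-02 00:00:00",
                         "2012-04-03 00:00:00", "2012-04-01 12:00:00",
                         "2012-04-04 00:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
release <- as.POSIXct("2012-03-31 00:00:00", tz = "UTC")

# Route 1: sort once, keep the first visit per fish and station,
# then keep the last remaining row per fish
s <- toy[order(toy$TagID, toy$arrival), ]
first_visits <- s[!duplicated(s[c("TagID", "Station")]), ]
last_rows <- first_visits[!duplicated(first_visits$TagID, fromLast = TRUE), ]
res1 <- data.frame(
  TagID = last_rows$TagID,
  ttime = as.numeric(difftime(last_rows$arrival, release, units = "days")),
  laststation = last_rows$Station,
  stringsAsFactors = FALSE
)

# Route 2: split by fish and summarise each piece
res2 <- do.call(rbind, lapply(split(toy, toy$TagID), function(x) {
  x <- x[order(x$arrival), ]
  x <- x[!duplicated(x$Station), ]
  data.frame(
    TagID = x$TagID[nrow(x)],
    ttime = as.numeric(difftime(x$arrival[nrow(x)], release, units = "days")),
    laststation = x$Station[nrow(x)],
    stringsAsFactors = FALSE
  )
}))
row.names(res2) <- NULL

# TRUE when both routes give the same table
same <- isTRUE(all.equal(res1, res2))
```

The same kind of comparison could be run between a dplyr result and a base R result, after converting ttime to a plain number in a fixed unit so that class differences do not mask a real match.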
【Comments】:
This solution works well too. I accepted d.b's answer above because it is slightly more concise, but both approaches are a good check on the dplyr pipeline. Thanks!
@Von Thanks for the comments. No problem with your choice. My solution returns different classes for the columns.
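The class difference mentioned in this comment comes from how the rows are assembled. A minimal sketch with made-up values (not taken from the dataset) shows that cbind() on plain vectors builds a matrix, so mixing numbers with a string turns every cell into text, whereas data.frame() keeps one type per column:

```r
# cbind() on atomic vectors returns a matrix; one string forces
# every cell to character
m <- cbind(TagID = 2059L, Last_Station = "MAW", ttime = 10.8)
m_type <- typeof(m)          # "character"

# data.frame() stores each column with its own type
df <- data.frame(TagID = 2059L, Last_Station = "MAW", ttime = 10.8,
                 stringsAsFactors = FALSE)
col_classes <- sapply(df, class)
```

With the matrix form, ttime would need as.numeric() before any arithmetic. With the data frame form it is already numeric.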