如何计算数据通过R中某个阈值的次数(频率)?
Posted
技术标签:
【中文标题】如何计算数据通过R中某个阈值的次数(频率)?【英文标题】:How to count the amount of times (frequency) that data passes through a certain threshold in R? 【发布时间】:2021-12-02 15:28:07 【问题描述】:我正在尝试计算我的数据(时间序列上的一条线)在 R 中越过某个阈值的次数,最好使用dplyr
。对于这个例子,我试图弄清楚我的数据有多少次超过 50m 的阈值。
所以在这种情况下,它会超过 50m 阈值的 3 倍。我的数据框有数百万个点长,因此手动执行此操作不是一种选择。任何帮助将不胜感激!
输入数据样本:
structure(list(Date.time = structure(c(1458626300, 1458626310,
1458626320, 1458626330, 1458626340, 1458626350, 1458626360, 1458626370,
1458626380, 1458626390, 1458626400, 1458626410, 1458626420, 1458626430,
1458626440, 1458626450, 1458626460, 1458626470, 1458626480, 1458626490,
1458626500, 1458626510, 1458626520, 1458626530, 1458626540, 1458627840,
1458627850, 1458627860, 1458627870, 1458627880, 1458627890, 1458627900,
1458627910, 1458627920, 1458627930, 1458627940, 1458627950, 1458627960,
1458627970, 1458627980, 1458627990, 1458628000, 1458628010, 1458628020,
1458628030, 1458628040, 1458628050, 1458628060, 1458628070, 1458628080,
1458628090, 1458628100, 1458628110, 1458628120, 1458628130, 1458628140,
1458628150, 1458628160, 1458630830, 1458630840, 1458630850, 1458630860,
1458630870, 1458630880, 1458630890, 1458630900, 1458630910, 1458630920,
1458630930, 1458630940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Depth = c(67.5, 66, 63.5, 62, 59.5, 57.5, 56, 54.5,
53.5, 52.5, 51, 49.5, 48, 46.5, 45, 44, 42, 40.5, 37.5, 35, 33,
33, 32.5, 30, 27, 26.5, 28, 29, 31.5, 33, 32.5, 32.5, 32.5, 32,
32, 32.5, 33, 34.5, 36.5, 38.5, 40.5, 41.5, 43, 44.5, 46.5, 49.5,
51.5, 53, 54.5, 56, 58, 59, 60.5, 62, 64.5, 66, 67, 69, 69.5,
67, 64, 62.5, 60, 57, 55.5, 51.5, 48.5, 46, 43, 40.5)), row.names = c(NA,
-70L), class = c("data.table", "data.frame")
【问题讨论】:
是否可以按条件depth >= threshold
来统计?
【参考方案1】:
您可以使用data.table
的rleid
来识别哪些数据组低于50,哪些高于。如果然后从组数中减去 1,则得到数据通过阈值的次数:
library(data.table)
library(dplyr)
df %>%
summarize(rleid(Depth > 50)) %>%
n_distinct()-1
【讨论】:
【参考方案2】:dplyr
也应该可以解决问题:
您基本上检查Depth
的当前值和下一个(使用lead
)在阈值的两侧,然后计算此条件匹配的次数。
library(dplyr)
df <-
tibble(
Date.time =
structure(
c(
1458626300, 1458626310, 1458626320, 1458626330, 1458626340, 1458626350,
1458626360, 1458626370, 1458626380, 1458626390, 1458626400, 1458626410,
1458626420, 1458626430, 1458626440, 1458626450, 1458626460, 1458626470,
1458626480, 1458626490, 1458626500, 1458626510, 1458626520, 1458626530,
1458626540, 1458627840, 1458627850, 1458627860, 1458627870, 1458627880,
1458627890, 1458627900, 1458627910, 1458627920, 1458627930, 1458627940,
1458627950, 1458627960, 1458627970, 1458627980, 1458627990, 1458628000,
1458628010, 1458628020, 1458628030, 1458628040, 1458628050, 1458628060,
1458628070, 1458628080, 1458628090, 1458628100, 1458628110, 1458628120,
1458628130, 1458628140, 1458628150, 1458628160, 1458630830, 1458630840,
1458630850, 1458630860, 1458630870, 1458630880, 1458630890, 1458630900,
1458630910, 1458630920, 1458630930, 1458630940
),
tzone = "UTC", class = c("POSIXct", "POSIXt")
),
Depth =
c(
67.5, 66, 63.5, 62, 59.5, 57.5, 56, 54.5, 53.5, 52.5, 51, 49.5, 48,
46.5, 45, 44, 42, 40.5, 37.5, 35, 33, 33, 32.5, 30, 27, 26.5, 28, 29,
31.5, 33, 32.5, 32.5, 32.5, 32, 32, 32.5, 33, 34.5, 36.5, 38.5, 40.5,
41.5, 43, 44.5, 46.5, 49.5, 51.5, 53, 54.5, 56, 58, 59, 60.5, 62, 64.5,
66, 67, 69, 69.5, 67, 64, 62.5, 60, 57, 55.5, 51.5, 48.5, 46, 43, 40.5
)
)
## define the threshold
threshold <- 50
## keep only date just before threshold is crossed
df.cross <-
df %>%
filter(
## cross the treshold with positive trend
(Depth <= threshold & lead(Depth) > threshold) |
## or cross the treshold with negative trend
(Depth >= threshold & lead(Depth) < threshold)
)
df.cross
## count the number of time threshold is crossed
nrow(df.cross)
这应该给你:
> df.cross
# A tibble: 3 × 2
Date.time Depth
<dttm> <dbl>
1 2016-03-22 06:00:00 51
2 2016-03-22 06:27:20 49.5
3 2016-03-22 07:15:00 51.5
> ## count the number of time threshold is crossed
> nrow(df.cross)
[1] 3
【讨论】:
【参考方案3】:这是dplyr
方式。
# dat1 is your data
thre <- 50
dat2 <- dat1 %>%
as_tibble() %>%
mutate(a1 = Depth > thre,
a2 = lag(Depth) < thre,
cross = a1 == a2)
sum(dat2$cross, na.rm = T)
【讨论】:
【参考方案4】:Tidyverse 解决方案:
# Load the tidyverse package:
library(tidyverse)
# Set the threshold: th => integer scalar
th <- 50
# Calculate the number of times the threshold has been crossed:
# integer scalar => stdout(console)
df %>%
summarise(cnt = sum(
Depth >= th & lag(Depth) < th | Depth < th & lag(Depth) >= th,
na.rm = TRUE
)
) %>%
pull(cnt)
基础 R 解决方案:
# Set the threshold: th => integer scalar
th <- 50
# Calculate the number of times the threshold is crossed:
# integer scalar => stdout(console)
with(
df,
sum(Depth[-nrow(df)] >= th & Depth[-1] < th |
Depth[-nrow(df)] < th & Depth[-1] >= th)
)
【讨论】:
以上是关于如何计算数据通过R中某个阈值的次数(频率)?的主要内容,如果未能解决你的问题,请参考以下文章