Applying adjusted boxplot method adjboxStats() with group_by in R?
Posted: 2020-11-04 09:36:20
Question:
I am a beginner and would like to
- generate adjboxStats() for each code in my data (see below)
- remove the outliers for each code
Some dummy data:
data <- data.frame(
code=c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
duration=c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)
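Currently I can produce the result across all codes, see below. However, I am having trouble i) running the statistics per code (would group_by(code) work?) and then ii) excluding the detected outliers ($out) for each code.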
library(robustbase)
adjboxStats(data$duration, coef = 1.5, a = -4, b = 3, do.conf = TRUE, do.out = TRUE)
$stats
[1] 2 100 262 23523 573012
$n
[1] 50
$conf
[1] -4971.77 5495.77
$fence
[1] -571.2153 707257.8400
$out
[1] 3846516 3846516
Thanks a lot for your help!
Answer 1:
We can group by code and wrap the adjboxStats() result in a list inside summarise:
library(dplyr)
library(robustbase)
data1 <- data %>%
group_by(code) %>%
summarise(out = list(adjboxStats(duration, coef = 1.5,
a = -4, b = 3, do.conf = TRUE, do.out = TRUE)))
data1
# A tibble: 3 x 2
# code out
# <chr> <list>
#1 A1 <named list [5]>
#2 A2 <named list [5]>
#3 A3 <named list [5]>
data1$out[[1]]
#$stats
#[1] 5.0 53.0 216.5 23368.0 573012.0
#$n
#[1] 8
#$conf
#[1] -12807.59 13240.59
#$fence
#[1] -624.4143 696935.1967
#$out
#numeric(0)
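For readers who prefer to stay in base R, the same per-code statistics can be sketched with split() and lapply(). This is only an illustrative alternative, not part of the original answer; the names stats_by_code and out_by_code are made up for the example, and the data is the dummy data from the question.

```r
library(robustbase)

# Dummy data from the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3",
           "A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Split duration into one vector per code, then run adjboxStats() on each
# (stats_by_code is a hypothetical name chosen for this sketch)
stats_by_code <- lapply(split(data$duration, data$code), adjboxStats,
                        coef = 1.5, a = -4, b = 3)

# Keep only the flagged outliers of each code
out_by_code <- lapply(stats_by_code, `[[`, "out")
out_by_code
```

The result is a plain named list with one element per code, so out_by_code$A1 gives the outliers found for A1.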
If we are interested in excluding the outliers with filter, extract the "out" component and use %in% together with !:
data %>%
group_by(code) %>%
filter(!duration %in% adjboxStats(duration, coef = 1.5,
a = -4, b = 3, do.conf = TRUE, do.out = TRUE)$out)
# A tibble: 24 x 2
# Groups: code [3]
# code duration
# <chr> <dbl>
# 1 A1 100
# 2 A2 100
# 3 A3 100
# 4 A1 200
# 5 A2 200
# 6 A3 200
# 7 A1 23523
# 8 A2 213123
# 9 A3 12
#10 A1 23213
# … with 14 more rows
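Note that the pipeline above returns a new tibble and leaves data itself untouched, so the result has to be assigned to keep it. A minimal sketch, assuming you also want to see which rows were flagged before dropping them; the column is_outlier and the objects flagged and cleaned are hypothetical names introduced for this example, and adjboxStats() is called with its defaults, which match the coef, a and b values used above.

```r
library(dplyr)
library(robustbase)

# Dummy data from the question
data <- data.frame(
  code = c("A1","A2","A3","A1","A2","A3","A1","A2","A3","A1","A2","A3",
           "A1","A2","A3","A1","A2","A3","A2","A3","A1","A2","A3","A1","A2"),
  duration = c(100,100,100,200,200,200,23523,213123,12,23213,968,37253,
               573012,472662,3846516,233,262,5737,3038,2,5,123,969,6,40582)
)

# Mark, per code, the rows whose duration appears in that group's $out
flagged <- data %>%
  group_by(code) %>%
  mutate(is_outlier = duration %in% adjboxStats(duration)$out) %>%
  ungroup()

# Rows that would be removed
filter(flagged, is_outlier)

# Store the cleaned data under a new name; `data` is not modified
cleaned <- flagged %>%
  filter(!is_outlier) %>%
  select(-is_outlier)

nrow(data)
nrow(cleaned)
```

Comparing nrow(data) with nrow(cleaned) shows how many observations were dropped.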
Data
data <- structure(list(code = c("A1", "A2", "A3", "A1", "A2", "A3", "A1",
"A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3", "A1", "A2", "A3",
"A2", "A3", "A1", "A2", "A3", "A1", "A2"), duration = c(100,
100, 100, 200, 200, 200, 23523, 213123, 12, 23213, 968, 37253,
573012, 472662, 3846516, 233, 262, 5737, 3038, 2, 5, 123, 969,
6, 40582)), class = "data.frame", row.names = c(NA, -25L))
Comments:
Hi! First of all, thanks a lot. This currently does not exclude the outliers from my data frame; do you have any hint on this?
@Kaya I see 24 observations in the output and 25 in the input.