删除数据框中列的每个因子的异常值

Posted

技术标签:

【中文标题】删除数据框中列的每个因子的异常值【英文标题】:Remove outliers for every factor of a column in a dataframe 【发布时间】:2022-01-22 20:30:13 【问题描述】:

我有下面的数据框,我想查找并删除我的数据框 Area 列的每个因素的异常值。通常因素大于 2。因此,我想要新的数据框,而每个因素都没有异常值。

    subs<-structure(list(Sold_Pr = c(6500, 173000, 60000, 73000, 155000, 
105000, 140000, 39900, 73500, 46000, 99900, 180000, 164000, 120000, 
206000, 160000, 67400, 215000, 145000, 175000, 350000, 425000, 
435000, 490000, 545000, 585000), Area = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("411", "415", "981", 
"8001", "8002", "8003", "8004", "8005", "8006", "8007", "8008", 
"8009", "8010", "8011", "8012", "8013", "8014", "8015", "8016", 
"8017", "8018", "8019", "8020", "8021", "8022", "8023", "8024", 
"8025", "8026", "8027", "8028", "8029", "8030", "8031", "8034", 
"8035", "8037", "8038", "8039", "8040", "8041", "8042", "8043", 
"8044", "8045", "8046", "8047", "8048", "8049", "8050", "8051", 
"8052", "8053", "8055", "8056", "8057", "8058", "8059", "8060", 
"8061", "8062", "8063", "8064", "8065", "8066", "8067", "8068", 
"8069", "8070", "8071", "8072", "8073", "8074", "8075", "8076", 
"8077"), class = "factor"), Closed_Date = structure(c(18668, 
18933, 18716, 18740, 18639, 18845, 18708, 18676, 18733, 18695, 
18715, 18709, 18794, 18803, 18750, 18787, 18906, 18810, 18855, 
18870, 18626, 18786, 18808, 18864, 18961, 18914), class = "Date")), row.names = c(NA, 
-26L), class = c("tbl_df", "tbl", "data.frame"))

remove_outliers <- function(x, na.rm = TRUE, ...) 
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y

【问题讨论】:

您想删除异常值,还是像您的函数一样将它们替换为NA 完全删除它们 【参考方案1】:

您可以更改函数以返回逻辑值并在filtergroup_by 中使用它们-

library(dplyr)

remove_outliers <- function(x, na.rm = TRUE, ...) 
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  !(x < (qnt[1] - H) | x > (qnt[2] + H))


subs %>%
  group_by(Area) %>%
  filter(remove_outliers(Sold_Pr)) %>%
  ungroup

【讨论】:

【参考方案2】:

您可以使用ave 将分组添加到函数中。对于截止值的计算,您可以使用mapply

remove_outliers <- \(x, g, na.rm=TRUE, ...) 
  q <- \(z) 
    qnt <- quantile(z, probs=c(.25, .75), na.rm=na.rm, ...)
    H <- 1.5 * IQR(z, na.rm=na.rm)
    r <- mapply(`+`, qnt, c(-H, H))
    z > r[1] & z < r[2]
  
  return(as.logical(ave(x, as.character(g), FUN=q)))


subs[26, 1] <- 1e9  ## fabricate outlier

给予

with(subs, remove_outliers(Sold_Pr, Area))
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
# [14]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

应用于子集

subs[with(subs, remove_outliers(Sold_Pr, Area)), ]
#    Sold_Pr Area Closed_Date
# 1     6500  411  2021-02-10
# 2   173000  411  2021-11-02
# 3    60000  411  2021-03-30
# 4    73000  411  2021-04-23
# 5   155000  411  2021-01-12
# 6   105000  411  2021-08-06
# 7   140000  411  2021-03-22
# 8    39900  411  2021-02-18
# 9    73500  411  2021-04-16
# 10   46000  411  2021-03-09
# 11   99900  411  2021-03-29
# 12  180000  411  2021-03-23
# 13  164000  411  2021-06-16
# 14  120000  411  2021-06-25
# 15  206000  411  2021-05-03
# 16  160000  411  2021-06-09
# 17   67400  411  2021-10-06
# 18  215000  411  2021-07-02
# 19  145000  411  2021-08-16
# 20  175000  411  2021-08-31
# 21  350000  415  2020-12-30
# 22  425000  415  2021-06-08
# 23  435000  415  2021-06-30
# 24  490000  415  2021-08-25
# 25  545000  415  2021-11-30

注意: R 版本 4.1.2 (2021-11-01)

【讨论】:

以上是关于删除数据框中列的每个因子的异常值的主要内容,如果未能解决你的问题,请参考以下文章

获取数据框中列的唯一值的计数,这些值最终出现在决策树的每个叶节点中?

使用Python中的循环一一删除数据框中列中具有最小值的观察值

数据框中列的匹配值

比较两个数据框中列的值

将定义集中的值设置为 Pandas 数据框中列的给定值(f.e. NaN)

检索任意行以获取数据框中列的唯一组合