用于过滤和替换异常值的循环
Posted
技术标签:
【中文标题】用于过滤和替换异常值的循环【英文标题】:Loops for filter and replace outliers 【发布时间】:2022-01-13 03:10:22 【问题描述】:我想知道如何在重复过滤器的数据集上编写循环:
我的示例数据集1:
df= structure(list(system = c("1-Jan-16", "2-Jan-16", "3-Jan-16",
"4-Jan-16"), evi1500 = c(0.437, 0.408, 0.429, NA), evi21500 = c(0.3891771,
0.38915543, 0.389133761, 0.389112091), kndvi1500 = c(0.493, 0.471,
0.769, 0.223), ndvi1500 = c(0.261, 0.698, 0.645, 0.627), nirv1500 = c(0.444426458,
0.444472048, 0.444517639, 0.444563229), evi2500 = c(0.366, 0.33,
0.367, 0.608), evi22500 = c(0.74, 0.241, 0.424, 0.398), kndvi2500 = c(0.41,
0.384, 0.684, 0.173), ndvi2500 = c(0.474621566, 0.474655555,
0.474689544, 0.474723532), nirv2500 = c(0.362, 0.596, 0.145,
0.442)), row.names = c(NA, 4L), class = "data.frame")
代码1
outliersevi1500=hampel_outlier(df$evi1500,k_mad_value = 3)
outliersevi1500
outliersevi21500=hampel_outlier(df$evi21500,k_mad_value = 3)
outliersevi21500
outlierskndvi1500=hampel_outlier(df$kndvi1500,k_mad_value = 3)
outlierskndvi1500
df$evi1500[df$evi1500 < 0.1992968 | df$evi1500 > 0.5907032 ] <- NA
df$evi21500[df$evi21500 < 0.2243160 | df$evi21500 > 0.5534532 ] <- NA
df$kndvi1500[df$kndvi1500 < 0.1596835 | df$kndvi1500 > 0.7749794 ] <- NA
提前感谢您的帮助。
【问题讨论】:
您有不同的过滤条件。这不仅仅是对许多行重复应用一个函数。您想将hampel_outlier
仅应用于所提及的 3 列,还是仅应用于名称中包含 1500
的每一列?
我想将 hampel_outlier
应用于我的数据集中包含 1500
和 2500
的名称的十列。 @danlooo
【参考方案1】:
hampel_outlier
返回异常值检测的上限和下限。应通过将其值设置为 NA
来删除此间隔之外的值。 Invervals 是为df
的每一列单独确定的。此过滤应仅应用于名称中包含 1500
或 2500
的列。
然后你可以计算你的阈值并像这样进行异常值替换:
library(tidyverse)
library(funModeling)
#> Loading required package: Hmisc
#> Loading required package: lattice
#> Loading required package: survival
#> Loading required package: Formula
#>
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:dplyr':
#>
#> src, summarize
#> The following objects are masked from 'package:base':
#>
#> format.pval, units
#> funModeling v.1.9.4 :)
#> Examples and tutorials at livebook.datascienceheroes.com
#> / Now in Spanish: librovivodecienciadedatos.ai
df <- structure(list(system = c(
"1-Jan-16", "2-Jan-16", "3-Jan-16",
"4-Jan-16"
), evi1500 = c(0.437, 0.408, 0.429, NA), evi21500 = c(
0.3891771,
0.38915543, 0.389133761, 0.389112091
), kndvi1500 = c(
0.493, 0.471,
0.769, 0.223
), ndvi1500 = c(0.261, 0.698, 0.645, 0.627), nirv1500 = c(
0.444426458,
0.444472048, 0.444517639, 0.444563229
), evi2500 = c(
0.366, 0.33,
0.367, 0.608
), evi22500 = c(0.74, 0.241, 0.424, 0.398), kndvi2500 = c(
0.41,
0.384, 0.684, 0.173
), ndvi2500 = c(
0.474621566, 0.474655555,
0.474689544, 0.474723532
), nirv2500 = c(
0.362, 0.596, 0.145,
0.442
)), row.names = c(NA, 4L), class = "data.frame")
thresholds <-
df %>%
pivot_longer(-system) %>%
group_by(name) %>%
summarise(outlieres = hampel_outlier(value, k_mad_value = 3) %>% list()) %>%
deframe()
df %>%
mutate(across(matches("1500|2500"), ~
(
.x < thresholds[[cur_column()]][["bottom_threshold"]] |
.x > thresholds[[cur_column()]][["top_threshold"]]
) %>%
ifelse(NA, .x)
)) %>%
pivot_wider()
#> # A tibble: 4 x 11
#> system evi1500 evi21500 kndvi1500 ndvi1500 nirv1500 evi2500 evi22500 kndvi2500
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1-Jan… 0.437 0.389 0.493 NA 0.444 0.366 0.74 0.41
#> 2 2-Jan… 0.408 0.389 0.471 0.698 0.444 0.33 0.241 0.384
#> 3 3-Jan… 0.429 0.389 0.769 0.645 0.445 0.367 0.424 0.684
#> 4 4-Jan… NA 0.389 0.223 0.627 0.445 NA 0.398 0.173
#> # … with 2 more variables: ndvi2500 <dbl>, nirv2500 <dbl>
由reprex package 创建于 2021-12-09 (v2.0.1)
【讨论】:
感谢您的帮助!你能帮我解决另一个问题吗?我还想编写另一个循环以使过程顺利进行。提前致谢! ***.com/questions/70276439/…@danlooo以上是关于用于过滤和替换异常值的循环的主要内容,如果未能解决你的问题,请参考以下文章
C# 中图表的 System.ExecutionEngine 异常