Loops for filtering and replacing outliers

[Title]: Loops for filter and replace outliers [Posted]: 2022-01-13 03:10:22 [Question]:

I would like to know how to write a loop that repeats the same outlier filter over the columns of my dataset.

My sample dataset 1:

df=  structure(list(system = c("1-Jan-16", "2-Jan-16", "3-Jan-16", 
    "4-Jan-16"), evi1500 = c(0.437, 0.408, 0.429, NA), evi21500 = c(0.3891771, 
    0.38915543, 0.389133761, 0.389112091), kndvi1500 = c(0.493, 0.471, 
    0.769, 0.223), ndvi1500 = c(0.261, 0.698, 0.645, 0.627), nirv1500 = c(0.444426458, 
    0.444472048, 0.444517639, 0.444563229), evi2500 = c(0.366, 0.33, 
    0.367, 0.608), evi22500 = c(0.74, 0.241, 0.424, 0.398), kndvi2500 = c(0.41, 
    0.384, 0.684, 0.173), ndvi2500 = c(0.474621566, 0.474655555, 
    0.474689544, 0.474723532), nirv2500 = c(0.362, 0.596, 0.145, 
    0.442)), row.names = c(NA, 4L), class = "data.frame")

Code 1

outliersevi1500=hampel_outlier(df$evi1500,k_mad_value = 3)
outliersevi1500
outliersevi21500=hampel_outlier(df$evi21500,k_mad_value = 3)
outliersevi21500
outlierskndvi1500=hampel_outlier(df$kndvi1500,k_mad_value = 3)
outlierskndvi1500
df$evi1500[df$evi1500 < 0.1992968  | df$evi1500 >  0.5907032 ] <- NA
df$evi21500[df$evi21500 < 0.2243160  | df$evi21500 >  0.5534532 ] <- NA
df$kndvi1500[df$kndvi1500 < 0.1596835  | df$kndvi1500 >  0.7749794 ] <- NA

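Thanks in advance for your help.

[Comments]:

You have different filter conditions. This is not just a matter of applying one function repeatedly over many rows. Do you want to apply hampel_outlier only to the 3 columns mentioned, or to every column whose name contains 1500?

I want to apply hampel_outlier to the ten columns in my dataset whose names contain 1500 or 2500. @danlooo

[Answer 1]:

hampel_outlier returns the lower and upper thresholds used for outlier detection. Values outside this interval should be removed by setting them to NA. The intervals are determined separately for each column of df. The filter should be applied only to columns whose names contain 1500 or 2500.

You can then compute the thresholds and replace the outliers like this: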
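To make the idea of "lower and upper thresholds" concrete, here is a minimal base R sketch of a Hampel-style interval, median ± k × MAD. Note that `hampel_bounds` is a hypothetical helper written for illustration, not a funModeling function. It assumes that NAs are dropped and that the MAD comes from `stats::mad` with its default scaling constant, which may or may not match what `hampel_outlier` does internally.

```r
# Hypothetical helper (not from funModeling): Hampel-style bounds,
# median +/- k * MAD, with missing values dropped first.
hampel_bounds <- function(x, k = 3) {
  x <- x[!is.na(x)]
  center <- median(x)
  spread <- mad(x)  # stats::mad, scaled by the default constant 1.4826
  c(
    bottom_threshold = center - k * spread,
    top_threshold = center + k * spread
  )
}

# The ndvi1500 column from the question's sample data
ndvi1500 <- c(0.261, 0.698, 0.645, 0.627)
bounds <- hampel_bounds(ndvi1500)

# TRUE marks a value outside the interval
flags <- ndvi1500 < bounds[["bottom_threshold"]] |
  ndvi1500 > bounds[["top_threshold"]]
print(bounds)
print(flags)
```

With this sample column, only the first value (0.261) falls outside the interval.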

library(tidyverse)
library(funModeling)
#> Loading required package: Hmisc
#> Loading required package: lattice
#> Loading required package: survival
#> Loading required package: Formula
#> 
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:dplyr':
#> 
#>     src, summarize
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units
#> funModeling v.1.9.4 :)
#> Examples and tutorials at livebook.datascienceheroes.com
#>  / Now in Spanish: librovivodecienciadedatos.ai

df <- structure(list(system = c(
  "1-Jan-16", "2-Jan-16", "3-Jan-16",
  "4-Jan-16"
), evi1500 = c(0.437, 0.408, 0.429, NA), evi21500 = c(
  0.3891771,
  0.38915543, 0.389133761, 0.389112091
), kndvi1500 = c(
  0.493, 0.471,
  0.769, 0.223
), ndvi1500 = c(0.261, 0.698, 0.645, 0.627), nirv1500 = c(
  0.444426458,
  0.444472048, 0.444517639, 0.444563229
), evi2500 = c(
  0.366, 0.33,
  0.367, 0.608
), evi22500 = c(0.74, 0.241, 0.424, 0.398), kndvi2500 = c(
  0.41,
  0.384, 0.684, 0.173
), ndvi2500 = c(
  0.474621566, 0.474655555,
  0.474689544, 0.474723532
), nirv2500 = c(
  0.362, 0.596, 0.145,
  0.442
)), row.names = c(NA, 4L), class = "data.frame")

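# Compute the Hampel thresholds once per column; deframe() turns the
# two-column result into a named list keyed by column name.
thresholds <-
  df %>%
  pivot_longer(-system) %>%
  group_by(name) %>%
  summarise(outliers = hampel_outlier(value, k_mad_value = 3) %>% list()) %>%
  deframe()

# Set values outside each column's own interval to NA.
df %>%
  mutate(across(matches("1500|2500"), ~
    (
      .x < thresholds[[cur_column()]][["bottom_threshold"]] |
      .x > thresholds[[cur_column()]][["top_threshold"]]
    ) %>%
      ifelse(NA, .x)
  )) %>%
  as_tibble() # the data is already wide, so no pivot_wider() is needed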
#> # A tibble: 4 x 11
#>   system evi1500 evi21500 kndvi1500 ndvi1500 nirv1500 evi2500 evi22500 kndvi2500
#>   <chr>    <dbl>    <dbl>     <dbl>    <dbl>    <dbl>   <dbl>    <dbl>     <dbl>
#> 1 1-Jan…   0.437    0.389     0.493   NA        0.444   0.366    0.74      0.41 
#> 2 2-Jan…   0.408    0.389     0.471    0.698    0.444   0.33     0.241     0.384
#> 3 3-Jan…   0.429    0.389     0.769    0.645    0.445   0.367    0.424     0.684
#> 4 4-Jan…  NA        0.389     0.223    0.627    0.445  NA        0.398     0.173
#> # … with 2 more variables: ndvi2500 <dbl>, nirv2500 <dbl>

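Created on 2021-12-09 by the reprex package (v2.0.1)

[Discussion]:

Thanks for your help! Could you help me with another problem? I would also like to write another loop to make the process run smoothly. Thanks in advance! ***.com/questions/70276439/… @danlooo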
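Since the question asked for a loop, the same filtering can also be sketched with an explicit `for` loop in base R. This is only an illustration under stated assumptions: `hampel_bounds` is a hypothetical stand-in for `hampel_outlier` (median ± 3 × MAD via `stats::mad`, NAs dropped), and the data frame holds just two of the question's columns to keep the example short.

```r
# Hypothetical stand-in for funModeling::hampel_outlier (assumption:
# median +/- k * MAD with stats::mad defaults, NAs dropped).
hampel_bounds <- function(x, k = 3) {
  x <- x[!is.na(x)]
  c(
    bottom_threshold = median(x) - k * mad(x),
    top_threshold = median(x) + k * mad(x)
  )
}

# Two columns taken from the question's sample data
df <- data.frame(
  system = c("1-Jan-16", "2-Jan-16", "3-Jan-16", "4-Jan-16"),
  ndvi1500 = c(0.261, 0.698, 0.645, 0.627),
  evi2500 = c(0.366, 0.33, 0.367, 0.608)
)

# Select the columns by name, then filter each one with its own bounds
target_cols <- grep("1500|2500", names(df), value = TRUE)
for (col in target_cols) {
  b <- hampel_bounds(df[[col]])
  is_outlier <- !is.na(df[[col]]) &
    (df[[col]] < b[["bottom_threshold"]] | df[[col]] > b[["top_threshold"]])
  df[[col]][is_outlier] <- NA
}
print(df)
```

Computing the bounds inside the loop keeps each column's thresholds independent, which avoids hard-coding numbers as in the question's Code 1.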
