Replace all negative values conditional on a variable attribute, independent of type


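I have a very large mixed dataset (character variables, numeric variables, factors) in which negative values usually indicate missing values (see Scale), but not always (see Profit).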

     Country Ccode  Year Profit Scale    ID Happiness_d Power_d  ID_d
  <chr>   <fct> <dbl>     <dbl> <labelled>    <dbl>    <dbl>   <dbl>  <dbl>
1 France  FR     2000      1000  NA        1      40000. 160000.  1.67
2 France  FR     2001     -1200   1        1      80000. 320000.  1.67
3 France  FR     2000      1400   0        2      40000. 160000.  1.67
4 France  FR     2001      1600   3        2      80000. 320000.  1.67
5 UK      UK     2000     -1000  -9        3      40000. 160000.  1.67
6 UK      UK     2001      1000   2        3      80000. 320000.  1.67
7 UK      UK     2000      1000   4        4      40000. 160000.  1.67
8 UK      UK     2001      1000   0        4      80000. 320000.  1.67

I would like to replace all negative values with NA:

df[df< 0] <- NA

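The problem is that, although this is meant to remove the negative values that stand for NA (as in Scale), in the example dataset it also removes the negative values in Profit, which are clearly not missing.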
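To make the problem concrete, here is a minimal sketch on a hypothetical four-row data frame that mirrors only the Profit and Scale columns of the question. The object names `df` and `blanket` are made up for illustration.

```r
# Hypothetical toy data mirroring the question's Profit and Scale columns
df <- data.frame(
  Profit = c(1000, -1200, 1400, -1000),
  Scale  = c(NA, 1, 0, -9)
)

blanket <- df
blanket[blanket < 0] <- NA   # replaces every negative cell, in every column

print(blanket)
```

Both genuine losses in Profit are turned into NA along with the -9 code in Scale, which is the unwanted side effect described above.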

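I would therefore like to make the replacement conditional on the variable's range of values. The Scale variable has the following structure: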

Class 'labelled'  atomic [1:135894] NA NA 2 NA NA NA NA NA NA NA ...
  ..- attr(*, "label")= chr "Do You Use Technology Licensed From A Foreign-Owned Company?"
  ..- attr(*, "format.stata")= chr "%24.0g"
  ..- attr(*, "labels")= Named num [1:3] -9 1 2
  .. ..- attr(*, "names")= chr [1:3] "Don't Know (Spontaneous)" "Yes" "No"
> names(New_Comprehensive_June_25_2018$e6)

I have already found that, using haven, you can retrieve the factor levels stored in

   ..- attr(*, "labels")= Named num [1:3] -9 1 2

with get_values():

get_values(df$Scale)
[1] -9 1 2

Is there a solution that removes only these labelled negative values and leaves all other negative values alone?

..- attr(*, "labels")= Named num [1:3] -9 1 2
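One possible way to do this is sketched below in base R. It assumes that every coded column carries a `labels` attribute like the one shown above and that columns such as Profit carry none. The helper name `negative_labels_to_na` and the toy data are hypothetical, not part of the original question.

```r
# Hypothetical toy data: Scale carries a "labels" attribute, Profit does not
df <- data.frame(
  Profit = c(1000, -1200, 1400, -1000),
  Scale  = c(NA, 1, 0, -9)
)
attr(df$Scale, "labels") <- c("Don't Know (Spontaneous)" = -9, "Yes" = 1, "No" = 2)

# Replace a value with NA only when it is one of the negative codes listed
# in the column's "labels" attribute; columns without labels are returned as-is.
negative_labels_to_na <- function(x) {
  labs <- attr(x, "labels", exact = TRUE)
  if (is.null(labs) || !is.numeric(x)) return(x)
  coded <- labs[!is.na(labs) & labs < 0]   # negative codes declared in labels
  x[x %in% coded] <- NA
  x
}

cleaned <- df
cleaned[] <- lapply(cleaned, negative_labels_to_na)
print(cleaned)
```

This only helps when the negative codes are actually listed in `labels`. In the dput sample further down, the two "missing" labels are stored as NA rather than -9 and -7, so there the lookup would find nothing to replace.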

To be clear, the desired output would be:

  Country Ccode  Year Profit Scale    ID Happiness_d Power_d  ID_d
  <chr>   <fct> <dbl>     <dbl> <dbl>    <dbl>    <dbl>   <dbl>  <dbl>
1 France  FR     2000      1000  NA        1      40000. 160000.  1.67
2 France  FR     2001     -1200   1        1      80000. 320000.  1.67
3 France  FR     2000      1400   0        2      40000. 160000.  1.67
4 France  FR     2001      1600   3        2      80000. 320000.  1.67
5 UK      UK     2000     -1000  **NA**    3      40000. 160000.  1.67
6 UK      UK     2001      1000   2        3      80000. 320000.  1.67
7 UK      UK     2000      1000   4        4      40000. 160000.  1.67
8 UK      UK     2001      1000   0        4      80000. 320000.  1.67

dput example (note that the variable Scale does not actually exist):

h7a = structure(c(1, -9, 2, 3, 1, 3, -9, 2, 3, 1, 2, 1, 3, 
    3, 2, 2, 1, 2, 2, 1, 2, -9, 1, 4, 3, 3, 1, 1, 1, 1, 3, 4, 
    3, 1, 2, 2, 1, 2, 1, NA, 2, 1, 2, 4, 3, 1, 3, 4, 4, 3, 2, 
    4, 1, 1, 2, 3, 2, 2, 2, 2, 1, 2, 1, 3, 4, 3, 1, 3, 1, 2, 
    3, 3, 3, 1, 1, 4, -9, 4, 3, 1, 2, 3, 1, -9, 1, 4, 1, 3, 1, 
    -9, 1, 1, 1, 1, 2, 3, 1, 3, 1, 2, 1, 2, 3, 4, 3, 3, 2, 4, 
    3, 3, 1, -9, 1, -7, 3, 1, 1, 2, 1, 2, -7, 2, 3, 1, 3, -7, 
    3, 4, 3, 2, 3, NA, 3, 3, 3, 1, 1, 2, 2, -9, 3, 1, 1, 2, 1, 
    1, -9, -9, -9, 2, -9, 1, 2, 3, 2, 3, 3, 3, 3, 1, 2, -9, 4, 
    3, 3, 1, 2, 2, 4, 4, 4, 3, 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 
    -9, 4, 4, 4, 2, 1, -7, 2, 2, 1, 1, 2, 1, 2, 2, 4, 2, 3, -7, 
    3, 3, 3, 4, 2, 4, 2, NA, 1, 3, 1, 2, 3, 4, 3, -9, 3, 3, 4, 
    3, 2, 4, 1, 3, 1, 3, 4, 3, 1, 3, 3, 3, NA, 1, 3, 3, -7, 1, 
    1, 3, 2, 1, 4), label = "The Court System Is Fair, Impartial And Uncorrupted", format.stata = "%24.0g", class = "labelled", labels = structure(c(NA, 
    NA, 1, 2, 3, 4), .Names = c("Don't Know (Spontaneous)", "Does Not Apply", 
    "Strongly disagree", "Tend to disagree", "Tend to agree", 
    "Strongly agree"))),
Answer

Here is a simple example that you can adapt to your dataset.

# example data
df = data.frame(a = c("A","A","B"),
                x = c(1,2,3),
                y = c(NA,3,-7),
                z = c(200,300,-400))

library(dplyr)

# replace negatives with NA only in numeric columns whose minimum is between -9 and -1
df %>% mutate_if(is.numeric, ~ifelse(between(min(., na.rm = TRUE), -9, -1) & . < 0, NA, .))

#   a x  y    z
# 1 A 1 NA  200
# 2 A 2  3  300
# 3 B 3 NA -400

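A column is updated (mutated) only if it is numeric and its minimum value lies between -9 and -1. The update replaces the column's negative values with NA.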

This assumes you only have integer values. If not, you can use between(..., -9, 0).
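For readers who prefer to avoid the dplyr dependency, the same rule can be sketched in base R. The helper name `na_if_coded_negative` and its `lower`/`upper` arguments are assumptions introduced here for illustration. The example data is the one from the answer above.

```r
# Hypothetical helper: blank out negatives only in numeric columns whose
# minimum falls inside [lower, upper], i.e. looks like a missing-value code.
na_if_coded_negative <- function(x, lower = -9, upper = -1) {
  if (!is.numeric(x) || all(is.na(x))) return(x)
  col_min <- min(x, na.rm = TRUE)
  if (col_min >= lower && col_min <= upper) {
    x[!is.na(x) & x < 0] <- NA
  }
  x
}

df <- data.frame(a = c("A", "A", "B"),
                 x = c(1, 2, 3),
                 y = c(NA, 3, -7),
                 z = c(200, 300, -400))

df[] <- lapply(df, na_if_coded_negative)
print(df)
```

The rule is a heuristic: a column whose genuine values happen to bottom out between -9 and -1 (a small loss, for example) would be blanked as well, so the bounds should be checked against the data before relying on it.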

Another answer

Base R solution:

# Replace negative values with NA in every column except Country, Ccode
# and Profit (columns 1, 2 and 4), then bind the result back onto those three columns.
cbind(df[,c(1,2,4)],do.call(cbind, lapply(df[,-c(1,2,4)], function(x) ifelse(x<0,NA,x))))

Output:

     Country Ccode Profit Year Scale ID Happiness_d Power_d ID_d
  1  France    FR   1000 2000    NA  1       40000  160000 1.67
  2  France    FR  -1200 2001     1  1       80000  320000 1.67
  3  France    FR   1400 2000     0  2       40000  160000 1.67
  4  France    FR   1600 2001     3  2       80000  320000 1.67
  5      UK    UK  -1000 2000    NA  3       40000  160000 1.67
  6      UK    UK   1000 2001     2  3       80000  320000 1.67
  7      UK    UK   1000 2000     4  4       40000  160000 1.67
  8      UK    UK   1000 2001     0  4       80000  320000 1.67
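As the output shows, binding the columns back together moves Profit ahead of Year. A variant that selects columns by name and assigns in place keeps the original column order; the sketch below uses a hypothetical reduced data frame, and the vector `keep_negative` is an assumed, hand-maintained list of columns in which negatives are real values.

```r
# Hypothetical reduced version of the question's data
df <- data.frame(
  Country = c("France", "France", "UK", "UK"),
  Year    = c(2000, 2001, 2000, 2001),
  Profit  = c(1000, -1200, -1000, 1000),
  Scale   = c(NA, 1, -9, 2)
)

keep_negative <- c("Profit")   # columns where negative values are genuine
numeric_cols  <- names(df)[vapply(df, is.numeric, logical(1))]
target_cols   <- setdiff(numeric_cols, keep_negative)

# which() drops NA comparisons, so existing NAs are left alone
df[target_cols] <- lapply(df[target_cols],
                          function(x) replace(x, which(x < 0), NA))
print(df)
```

Selecting by name also avoids silently hitting the wrong column if the column positions change.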
