Wrong result of mean with dplyr

Posted: 2019-08-08 06:14:18

Question:

I am a beginner in R, and I have a large data.frame (more than 300,000 observations) that looks like this:

Dados <- data.frame(stringsAsFactors=FALSE,
               id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
                      14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,
                      25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L,
                      37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L,
                      49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L,
                      61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L,
                      72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L,
                      84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L,
                      96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L, 105L,
                      106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L,
                      116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L,
                      126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L,
                      136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L),
   Identification = "LONNIE POOL FIELD WEAVERVILLE",
            Dates = c("1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
                      "2/01/2014", "2/01/2014", "2/01/2014"),
     TEMP_Celcius = c(13L, 10L, 8L, 7L, 5L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 0L, 0L,
                      0L, 0L, 0L, 0L, 0L, 0L, -1L, -1L, -2L, -1L, -2L, -2L,
                      -2L, -2L, -2L, -2L, -2L, -2L, -3L, -3L, -3L, -3L, -3L, -3L,
                      -3L, -3L, -4L, -4L, -3L, -4L, -4L, -4L, -4L, -4L, -4L, -3L,
                      -3L, -2L, 0L, 1L, 2L, 3L, 4L, 6L, 6L, 8L, 9L, 9L, 10L, 11L,
                      12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 12L, 10L, 9L, 8L,
                      6L, 5L, 5L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 0L, 1L, 0L, 0L,
                      0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, -1L, -1L, -1L, -2L, -2L,
                      -1L, -2L, -2L, -2L, -2L, -2L, -2L, -2L, -2L, -3L, -3L, -3L,
                      -3L, -3L, -3L, -3L, -3L, -2L, -2L, 0L, 0L, 1L, 3L, 4L, 5L,
                      6L, 7L, 8L, 9L, 10L, 10L, 12L, 13L, 13L, 13L, 13L, 14L, 14L,
                      14L))

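I need to compute additional columns for each day, such as the mean, minimum, and maximum temperature, and the mean, maximum, and minimum dew point. There are many readings per day because the data is hourly. I have tried several approaches, but I keep getting wrong results.

First I tried to get the mean with this code:

tapply(Dados$TEMP_Celcius, Dados$Dates, mean)

But I get the wrong result. For example, for the date 01-01-2014 I get 27.8, while the correct result is 1.97.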

I also tried each of the following:

tapply(Dados$TEMP_Celcius, Dados$Dates, mean, na.rm = TRUE)

aggregate(Dados$TEMP_Celcius, by=list(TMEDIA=Dados$Dates), mean)

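But I got the same result. I don't know what I'm doing wrong. Can you help me?

I have already checked the class of the Dates column, which is "Date", and the class of the temperature variable, which is "numeric".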
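For reference, column classes can be confirmed with class() or str(). In the dput output above, Dates is stored as character strings and TEMP_Celcius as integers, so a check along the following lines may be worth running on the full data. This is only a sketch on a small made-up data frame: the name toy and its values are hypothetical, and the day/month/year format string is an assumption that should be adjusted if the dates are month/day/year.

```r
# Hypothetical toy data in the same shape as Dados (values are made up).
toy <- data.frame(
  Dates = c("1/01/2014", "1/01/2014", "2/01/2014", "2/01/2014"),
  TEMP_Celcius = c(13L, -3L, 4L, 10L),
  stringsAsFactors = FALSE
)

# Inspect the class of every column; in this toy data Dates is character.
sapply(toy, class)

# Convert the text to real Date values.
# Assumption: the strings are day/month/year, hence "%d/%m/%Y".
toy$Dates <- as.Date(toy$Dates, format = "%d/%m/%Y")
class(toy$Dates)

# Count the readings per day; an unexpected count hints at a grouping problem.
table(toy$Dates)
```

If the per-day counts are far from what hourly data should produce, the grouping column is probably not what it appears to be.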

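Comments:

Welcome. It will be easier for everyone to answer this question if you provide a reproducible example. Also, your example image does not load, and in any case you should provide code rather than an image.

Sure. How do I do that?

I tried pasting the table here, but it does not display correctly.

The point of asking for dput is that it lets others recreate a sample of your data exactly, formatting and all. If your data frame is named Dados and you want to share the first 100 rows, you can type dput(head(Dados, 100)) in the console and paste its output into your question.

Answer 1: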

Without actually seeing your data, maybe you can try this? It uses the tidyverse (which is worth learning, because it makes everything easier).

library(tidyverse)
Dados %>% 
  group_by(Dates) %>% 
  summarise(mean = mean(TEMP_Celcius), min = min(TEMP_Celcius), max = max(TEMP_Celcius))

This gives me the following output:

# A tibble: 2 x 4
  Dates      mean   min   max
  <chr>     <dbl> <dbl> <dbl>
1 1/01/2014  1.97    -4    13
2 2/01/2014  2.75    -3    14
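The same per-day summary can also be cross-checked in base R, which helps separate a problem in the data from a problem in the code. The sketch below uses a small made-up data frame (the name toy and its values are hypothetical, not the asker's readings) so that the means are easy to verify by hand.

```r
# Hypothetical toy data (values are made up for easy checking by hand).
toy <- data.frame(
  Dates = c("1/01/2014", "1/01/2014", "1/01/2014", "2/01/2014", "2/01/2014"),
  TEMP_Celcius = c(13L, 2L, -3L, 4L, 10L),
  stringsAsFactors = FALSE
)

# Mean per day with tapply(); the result has one named entry per date.
daily_mean <- tapply(toy$TEMP_Celcius, toy$Dates, mean)
daily_mean

# Mean, min and max per day in one call with aggregate().
# The TEMP_Celcius column of the result is a matrix with one column per statistic.
daily_stats <- aggregate(
  TEMP_Celcius ~ Dates,
  data = toy,
  FUN = function(x) c(mean = mean(x), min = min(x), max = max(x))
)
daily_stats
```

Here the first day averages (13 + 2 - 3) / 3 = 4 and the second (4 + 10) / 2 = 7, and tapply() and aggregate() agree on both.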

Update, following @Jon Spring's suggestion:

library(tidyverse)
Dados %>% 
  group_by(Identification, Dates) %>% 
  summarise(mean = mean(TEMP_Celcius), min = min(TEMP_Celcius), max = max(TEMP_Celcius))

Output:

# A tibble: 2 x 5
# Groups:   Identification [?]
  Identification                Dates      mean   min   max
  <chr>                         <chr>     <dbl> <dbl> <dbl>
1 LONNIE POOL FIELD WEAVERVILLE 1/01/2014  1.97    -4    13
2 LONNIE POOL FIELD WEAVERVILLE 2/01/2014  2.75    -3    14
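To illustrate why the extra grouping column matters when the data covers more than one station, here is a minimal sketch. The station names and temperatures are invented for illustration, and the .groups argument assumes dplyr 1.0.0 or later, which is newer than this answer.

```r
library(dplyr)

# Hypothetical data: two made-up stations measured on the same day.
toy <- data.frame(
  Identification = c("STATION A", "STATION A", "STATION B", "STATION B"),
  Dates = c("1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014"),
  TEMP_Celcius = c(0L, 2L, 20L, 22L),
  stringsAsFactors = FALSE
)

# Grouping by date only pools both stations into a single mean.
by_date <- toy %>%
  group_by(Dates) %>%
  summarise(mean = mean(TEMP_Celcius))

# Grouping by station and date keeps one row per station per day.
# .groups = "drop" (dplyr >= 1.0.0) returns an ungrouped tibble.
by_station <- toy %>%
  group_by(Identification, Dates) %>%
  summarise(mean = mean(TEMP_Celcius), .groups = "drop")

by_date
by_station
```

With these values the pooled mean is 11, while the per-station means are 1 and 21, so a daily figure that looks too high or too low can simply be an average across several stations.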

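Comments:

Do you have multiple locations in your data? If so, you should use group_by(Identification, Dates) %>% in the code above; otherwise you will get the mean across all locations for each day.

Works for me, see the results above. (Sorry, I accidentally deleted the comment.)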
