Wrong result of mean with dplyr
【Posted】: 2019-08-08 06:14:18
【Question】:
I am a beginner in R, and I have a large data.frame (more than 300,000 observations) that looks like this:
Dados <- data.frame(stringsAsFactors=FALSE,
id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L,
25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L,
37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L,
49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L, 57L, 58L, 59L, 60L,
61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 69L, 70L, 71L,
72L, 73L, 74L, 75L, 76L, 77L, 78L, 79L, 80L, 81L, 82L, 83L,
84L, 85L, 86L, 87L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L,
96L, 97L, 98L, 99L, 100L, 101L, 102L, 103L, 104L, 105L,
106L, 107L, 108L, 109L, 110L, 111L, 112L, 113L, 114L, 115L,
116L, 117L, 118L, 119L, 120L, 121L, 122L, 123L, 124L, 125L,
126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 134L, 135L,
136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L),
Identification = "LONNIE POOL FIELD WEAVERVILLE",
Dates = c("1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014", "1/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014", "2/01/2014",
"2/01/2014", "2/01/2014", "2/01/2014"),
TEMP_Celcius = c(13L, 10L, 8L, 7L, 5L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, -1L, -1L, -2L, -1L, -2L, -2L,
-2L, -2L, -2L, -2L, -2L, -2L, -3L, -3L, -3L, -3L, -3L, -3L,
-3L, -3L, -4L, -4L, -3L, -4L, -4L, -4L, -4L, -4L, -4L, -3L,
-3L, -2L, 0L, 1L, 2L, 3L, 4L, 6L, 6L, 8L, 9L, 9L, 10L, 11L,
12L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 12L, 10L, 9L, 8L,
6L, 5L, 5L, 4L, 4L, 3L, 3L, 2L, 2L, 2L, 2L, 0L, 1L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, -1L, -1L, -1L, -2L, -2L,
-1L, -2L, -2L, -2L, -2L, -2L, -2L, -2L, -2L, -3L, -3L, -3L,
-3L, -3L, -3L, -3L, -3L, -2L, -2L, 0L, 0L, 1L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 10L, 12L, 13L, 13L, 13L, 13L, 14L, 14L,
14L))
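I need to derive further columns per day: the mean, minimum and maximum temperature, and the mean, maximum and minimum dew point. There are many rows for each day because the readings are hourly. I have tried several approaches, but I get wrong results.
First I tried to compute the mean with this code:
tapply(Dados$TEMP_Celcius, Dados$Dates, mean)
But the result is wrong. For example, for the date 01-01-2014 I get 27.8, while the correct result is 1.97.
I also tried each of the following: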
tapply(Dados$TEMP_Celcius, Dados$Dates, mean, na.rm = TRUE)
aggregate(Dados$TEMP_Celcius, by=list(TMEDIA=Dados$Dates), mean)
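But I got the same results. I don't know what I am doing wrong; can you help me?
I have already checked the class of the date column, which is "Date", and the class of the temperature variable, which is "numeric".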
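One way to check a result like this is to inspect the column classes and compare the grouped mean with a mean computed by hand for a single day. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the asker's data). It only illustrates the checks and does not explain the 27.8 reported above.

```r
# Hypothetical toy data: two days, three readings each (not the asker's data)
toy <- data.frame(
  Dates = c("1/01/2014", "1/01/2014", "1/01/2014",
            "2/01/2014", "2/01/2014", "2/01/2014"),
  TEMP_Celcius = c(13L, 10L, -2L, 4L, 0L, 5L),
  stringsAsFactors = FALSE
)

# Inspect the class of every column; a temperature stored as factor or
# character would not be averaged the way you expect
col_classes <- sapply(toy, class)
print(col_classes)

# Grouped mean, as in the question
by_day <- tapply(toy$TEMP_Celcius, toy$Dates, mean)
print(by_day)

# Mean computed by hand for one day, to compare against the grouped value
manual <- mean(toy$TEMP_Celcius[toy$Dates == "1/01/2014"])
print(manual)
```

If the two numbers disagree on the real data, the difference usually points to the data that is actually being grouped (for instance, more rows per date than expected) and not to `tapply` itself.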
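【Comments】:
Welcome! If you provide a reproducible example, it will be easier for everyone to answer this question. Also, your example image does not load, and in any case you should provide code instead of an image.
Sure. How do I do that?
I tried pasting the table here, but it does not display correctly.
The point of asking you to use dput is that it lets other people re-create a sample of your data exactly, formatting and all. If your data frame is named Dados and you want to share the first 100 rows, you can type dput(head(Dados, 100)) in the console and then paste its output into your question.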
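The dput round trip described in the comment can be sketched as follows. The data frame `Dados_small` is a hypothetical stand-in for Dados, and capturing the text with `capture.output` is only there to demonstrate that the pasted output rebuilds the same object.

```r
# Hypothetical small data frame standing in for Dados
Dados_small <- data.frame(
  id = 1:4,
  Dates = c("1/01/2014", "1/01/2014", "2/01/2014", "2/01/2014"),
  TEMP_Celcius = c(13L, 10L, 4L, 0L),
  stringsAsFactors = FALSE
)

# Print R code that re-creates the first 3 rows; this text is what gets pasted
dput(head(Dados_small, 3))

# Round trip: capture the dput() text and evaluate it again
txt <- paste(capture.output(dput(head(Dados_small, 3))), collapse = "\n")
rebuilt <- eval(parse(text = txt))

# Check that the rebuilt object matches the original sample
same <- isTRUE(all.equal(rebuilt, head(Dados_small, 3)))
print(same)
```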
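【Answer 1】:
Without seeing the data you actually have, maybe you could try this? It uses the tidyverse (which is worth learning, because it makes this kind of task much easier).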
library(tidyverse)
Dados %>%
group_by(Dates) %>%
summarise(mean = mean(TEMP_Celcius), min = min(TEMP_Celcius), max = max(TEMP_Celcius))
This gives me the following output:
# A tibble: 2 x 4
Dates mean min max
<chr> <dbl> <dbl> <dbl>
1 1/01/2014 1.97 -4 13
2 2/01/2014 2.75 -3 14
Update, following @Jon Spring's suggestion:
library(tidyverse)
Dados %>%
group_by(Identification, Dates) %>%
summarise(mean = mean(TEMP_Celcius), min = min(TEMP_Celcius), max = max(TEMP_Celcius))
Output:
# A tibble: 2 x 5
# Groups: Identification [?]
Identification Dates mean min max
<chr> <chr> <dbl> <dbl> <dbl>
1 LONNIE POOL FIELD WEAVERVILLE 1/01/2014 1.97 -4 13
2 LONNIE POOL FIELD WEAVERVILLE 2/01/2014 2.75 -3 14
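For comparison, the same kind of per-station, per-day summary can be sketched in base R with `aggregate`, so that it runs without extra packages. This is not the answer's method, only an equivalent under stated assumptions: `sample_df` and its values are made up, and the date strings are assumed to be day/month/year.

```r
# Hypothetical sample in the same shape as Dados (values are made up)
sample_df <- data.frame(
  Identification = "LONNIE POOL FIELD WEAVERVILLE",
  Dates = c("1/01/2014", "1/01/2014", "1/01/2014",
            "2/01/2014", "2/01/2014", "2/01/2014"),
  TEMP_Celcius = c(13L, -4L, 3L, 14L, -3L, 4L),
  stringsAsFactors = FALSE
)

# Assumption: the strings are day/month/year; adjust `format` if they are not
sample_df$Day <- as.Date(sample_df$Dates, format = "%d/%m/%Y")

# One row per station and day, with mean, min and max of the temperature
daily <- aggregate(
  TEMP_Celcius ~ Identification + Day,
  data = sample_df,
  FUN = function(x) c(mean = mean(x), min = min(x), max = max(x))
)

# aggregate() stores the three statistics in one matrix column; flatten it
# into ordinary columns named TEMP_Celcius.mean, .min and .max
daily <- do.call(data.frame, daily)
print(daily)
```

Parsing the dates into the Date class also makes the groups sort chronologically, which plain character strings such as "10/01/2014" would not.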
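【Comments】:
Do you have more than one location in your data? If so, you should use group_by(Identification, Dates) %>% in the code above; otherwise you will get the mean across all locations for each day.
Works for me, see the results above. (Sorry, I accidentally deleted the comment.)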
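The point about grouping by Identification as well as Dates can be illustrated with a small sketch. The station names and temperatures below are invented, and base R `aggregate` is used in place of dplyr so the example runs without extra packages.

```r
# Hypothetical data: two stations reporting on the same day (names invented)
two_stations <- data.frame(
  Identification = c("STATION A", "STATION A", "STATION B", "STATION B"),
  Dates = "1/01/2014",
  TEMP_Celcius = c(0, 2, 20, 22),
  stringsAsFactors = FALSE
)

# Grouping by date only pools both stations into a single mean
pooled <- aggregate(TEMP_Celcius ~ Dates, data = two_stations, FUN = mean)
print(pooled)

# Grouping by station and date keeps one mean per station
per_station <- aggregate(TEMP_Celcius ~ Identification + Dates,
                         data = two_stations, FUN = mean)
print(per_station)
```

With data like this, the pooled daily mean matches neither station, which is one way a per-day average can look wrong even though the calculation itself is correct.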