Mean value in ggplot grouped box plot (R)

Posted: 2017-06-30 11:10:30

Question:

Although this question has been asked before (here), I have a new data frame and therefore a new problem. A sample of the data follows:

ID,Region,Dimension,BlogsInd.,BlogsNews,BlogsTech,Columns
1,PK,Dim1,-4.75,NA,NA,NA
2,PK,Dim1,-5.69,NA,NA,NA
3,PK,Dim1,-0.27,NA,NA,NA
4,PK,Dim1,-2.76,NA,NA,NA
5,PK,Dim1,-8.24,NA,NA,NA
6,PK,Dim1,-12.51,NA,NA,NA
7,PK,Dim1,-1.28,NA,NA,NA
8,PK,Dim1,0.95,NA,NA,NA
9,PK,Dim1,-5.96,NA,NA,NA
10,PK,Dim1,-8.81,NA,NA,NA
11,PK,Dim1,-8.46,NA,NA,NA
12,PK,Dim1,-6.15,NA,NA,NA
13,PK,Dim1,-13.98,NA,NA,NA
14,PK,Dim1,-16.43,NA,NA,NA
15,PK,Dim1,-4.09,NA,NA,NA
16,PK,Dim1,-11.06,NA,NA,NA
17,PK,Dim1,-9.04,NA,NA,NA
18,PK,Dim1,-8.56,NA,NA,NA
19,PK,Dim1,-8.13,NA,NA,NA
20,PK,Dim2,-14.46,NA,NA,NA
21,PK,Dim2,-4.21,NA,NA,NA
22,PK,Dim2,-4.96,NA,NA,NA
23,PK,Dim2,-5.48,NA,NA,NA
24,PK,Dim2,-4.53,NA,NA,NA
25,PK,Dim2,6.31,NA,NA,NA
26,PK,Dim2,-11.16,NA,NA,NA
27,PK,Dim2,-1.27,NA,NA,NA
28,PK,Dim2,-11.49,NA,NA,NA
29,PK,Dim2,-0.9,NA,NA,NA
30,PK,Dim2,-12.27,NA,NA,NA
31,PK,Dim2,6.85,NA,NA,NA
32,PK,Dim2,-5.21,NA,NA,NA
33,PK,Dim2,-1.06,NA,NA,NA
34,PK,Dim2,-2.6,NA,NA,NA
35,PK,Dim2,-0.95,NA,NA,NA
36,PK,Dim3,-0.82,NA,NA,NA
37,PK,Dim3,-7.65,NA,NA,NA
38,PK,Dim3,0.64,NA,NA,NA
39,PK,Dim3,-2.25,NA,NA,NA
40,PK,Dim3,-1.58,NA,NA,NA
41,PK,Dim3,-5.73,NA,NA,NA
42,PK,Dim3,0.37,NA,NA,NA
43,PK,Dim3,-5.46,NA,NA,NA
44,PK,Dim3,-3.48,NA,NA,NA
45,PK,Dim3,0.88,NA,NA,NA
46,PK,Dim3,-2.11,NA,NA,NA
47,PK,Dim3,-10.13,NA,NA,NA
48,PK,Dim3,-2.08,NA,NA,NA
49,PK,Dim3,-4.33,NA,NA,NA
50,PK,Dim3,1.09,NA,NA,NA
51,PK,Dim3,-4.23,NA,NA,NA
52,PK,Dim3,-1.46,NA,NA,NA
53,PK,Dim3,9.37,NA,NA,NA
54,PK,Dim3,5.84,NA,NA,NA
55,PK,Dim3,8.21,NA,NA,NA
56,PK,Dim3,7.34,NA,NA,NA
57,PK,Dim4,1.83,NA,NA,NA
58,PK,Dim4,14.39,NA,NA,NA
59,PK,Dim4,22.02,NA,NA,NA
60,PK,Dim4,4.83,NA,NA,NA
61,PK,Dim4,-3.24,NA,NA,NA
62,PK,Dim4,-5.69,NA,NA,NA
63,PK,Dim4,-22.92,NA,NA,NA
64,PK,Dim4,0.41,NA,NA,NA
65,PK,Dim4,-4.42,NA,NA,NA
66,PK,Dim4,-10.72,NA,NA,NA
67,PK,Dim4,-11.29,NA,NA,NA
68,PK,Dim4,-2.89,NA,NA,NA
69,PK,Dim4,-7.59,NA,NA,NA
70,PK,Dim4,-7.45,NA,NA,NA
71,US,Dim1,-12.49,NA,NA,NA
72,US,Dim1,-11.59,NA,NA,NA
73,US,Dim1,-4.6,NA,NA,NA
74,US,Dim1,-22.83,NA,NA,NA
75,US,Dim1,-4.83,NA,NA,NA
76,US,Dim1,-14.76,NA,NA,NA
77,US,Dim1,-15.93,NA,NA,NA
78,US,Dim1,-2.78,NA,NA,NA
79,US,Dim1,-16.39,NA,NA,NA
80,US,Dim1,-15.22,NA,NA,NA
81,US,Dim1,3.25,NA,NA,NA
82,US,Dim1,-2.73,NA,NA,NA
83,US,Dim1,0.96,NA,NA,NA
84,US,Dim1,-1.12,NA,NA,NA
85,US,Dim1,-0.33,NA,NA,NA
86,US,Dim1,-6.45,NA,NA,NA
87,US,Dim1,2.52,NA,NA,NA
88,US,Dim1,3.18,NA,NA,NA
89,US,Dim1,4.65,NA,NA,NA
90,US,Dim2,-1.75,NA,NA,NA
91,US,Dim2,-0.22,NA,NA,NA
92,US,Dim2,8.16,NA,NA,NA
93,US,Dim2,1.89,NA,NA,NA
94,US,Dim2,4.31,NA,NA,NA
95,US,Dim2,-0.41,NA,NA,NA
96,US,Dim2,-23.02,NA,NA,NA
97,US,Dim2,3.87,NA,NA,NA
98,US,Dim2,-4.76,NA,NA,NA
99,US,Dim2,4.95,NA,NA,NA
100,US,Dim2,4.78,NA,NA,NA
101,US,Dim2,-15.11,NA,NA,NA
102,US,Dim2,-3.74,NA,NA,NA
103,US,Dim2,-6.15,NA,NA,NA
104,US,Dim2,-8.33,NA,NA,NA
105,US,Dim2,-5.55,NA,NA,NA
106,US,Dim3,-5.1,NA,NA,NA
107,US,Dim3,-0.41,NA,NA,NA
108,US,Dim3,-8,NA,NA,NA
109,US,Dim3,-11.8,NA,NA,NA
110,US,Dim3,-10.39,NA,NA,NA
111,US,Dim3,-14.98,NA,NA,NA
112,US,Dim3,-13.14,NA,NA,NA
113,US,Dim3,-16.06,NA,NA,NA
114,US,Dim3,-16.75,NA,NA,NA
115,US,Dim3,-17.58,NA,NA,NA
116,US,Dim3,-13.12,NA,NA,NA
117,US,Dim3,-15.69,NA,NA,NA
118,US,Dim3,-9.29,NA,NA,NA
119,US,Dim3,-14.93,NA,NA,NA
120,US,Dim3,-18.75,NA,NA,NA
121,US,Dim3,-16.15,NA,NA,NA
122,US,Dim3,-14.38,NA,NA,NA
123,US,Dim3,-11.33,NA,NA,NA
124,US,Dim3,2.06,NA,NA,NA
125,US,Dim3,1.55,NA,NA,NA
126,US,Dim3,3.17,NA,NA,NA
127,US,Dim4,3.33,NA,NA,NA
128,US,Dim4,-3.31,NA,NA,NA
129,US,Dim4,5.67,NA,NA,NA
130,US,Dim4,-1.94,NA,NA,NA
131,US,Dim4,-4.2,NA,NA,NA
132,US,Dim4,-13.53,NA,NA,NA
133,US,Dim4,-10.84,NA,NA,NA
134,US,Dim4,-1.04,NA,NA,NA
135,US,Dim4,-8.02,NA,NA,NA
136,US,Dim4,-14.65,NA,NA,NA
137,US,Dim4,-6.39,NA,NA,NA
138,US,Dim4,-3.69,NA,NA,NA
139,US,Dim4,-11.62,NA,NA,NA
140,US,Dim4,-3.02,NA,NA,NA
141,US,Dim4,-28.84,NA,NA,NA

I am trying to create a grouped box plot (using a function) with the mean value displayed on the box of each group. The code is as follows:

library(ggplot2)
library(reshape2)  # provides melt()

attach(data_Blogs)
plotgraph <- function(x, y, colour, min, max) {
  plot1 <- ggplot(dims_Blog, aes_string(x = x, y = y, fill = colour)) +
    geom_boxplot() +
    labs(color = colour) +
    # scale_y_continuous(breaks = c(seq(min, max, 5)), limits = c(min, max)) +
    labs(x = "Dimensions", y = "Dimension Score") +
    scale_fill_grey(start = 0.3, end = 0.7) +
    theme_grey() +
    theme(legend.justification = c(1, 1), legend.position = c(1, 1)) +
    # Label layer: tapply() computes the mean of y for each x/colour combination,
    # and melt() turns the resulting matrix into a long data frame with a "med" column
    geom_text(data = melt(with(dims_Blog, tapply(eval(parse(text = y)), list(eval(parse(text = x)), eval(parse(text = colour))), mean)), varnames = c("Dimension", "Region"), value.name = "med"),
              aes_string(y = "med", x = x, label = "round(med,3)"), position = position_dodge(width = 0.8), size = 3, vjust = -0.5, colour = "white")
  return(plot1)
}

plot1 <- plotgraph("Dimension", "BlogsInd.", "Region")

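I cannot understand the part starting with "geom_text", where the data passed in are the means. The data frame is being melted (long to wide format), which I think is unnecessary here because the data are already in wide format. I tried using the "stats_summary" function without success. Your help would go a long way toward finding a solution.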

Answer 1:

Indeed, melting the data seems redundant. Instead, you should summarise the data, for example with dplyr:

library(ggplot2)
library(dplyr)

ggplot(dims_Blog, aes(x = Dimension, y = BlogsInd., fill = Region)) +
  geom_boxplot() +
  # One label per Dimension/Region group, placed at the group mean
  geom_text(data = dims_Blog %>% group_by(Dimension, Region) %>% summarise(mean = mean(BlogsInd.)),
            aes(x = Dimension, y = mean, label = round(mean, 2)),
            position = position_dodge(width = .7))

Then fine-tune the positioning and formatting.
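The question mentions an unsuccessful attempt with a summary stat. As a hedged alternative to pre-summarising, the sketch below lets ggplot2 compute the group means itself through stat_summary(). It assumes ggplot2 3.3.0 or later (for the fun argument and after_stat()); the data frame demo and its Score column are made up for illustration, using a few values copied from the sample data.

```r
library(ggplot2)

# Illustrative stand-in for dims_Blog (hypothetical name `demo`, column `Score`)
demo <- data.frame(
  Dimension = rep(c("Dim1", "Dim2"), each = 4),
  Region    = rep(c("PK", "US"), times = 4),
  Score     = c(-4.75, -12.49, -5.69, -11.59, -14.46, -1.75, -4.21, -0.22)
)

p <- ggplot(demo, aes(x = Dimension, y = Score, fill = Region)) +
  geom_boxplot() +
  # stat_summary() computes the mean per Dimension/Region group;
  # after_stat(y) refers to that computed mean when building the label
  stat_summary(
    fun = mean, geom = "text",
    aes(label = round(after_stat(y), 2)),
    position = position_dodge(width = 0.75),  # 0.75 matches the default box width
    vjust = -0.5, size = 3
  )
```

Printing p draws the plot. The dodge width here is a starting point and may need adjusting if the box width is changed.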

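Edit: I had not clicked through to your previous question, which already extends the example above to guard against NSE in a programming context. So inside your function, use group_by_ and aes_string.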
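Both group_by_() and aes_string() have since been deprecated in dplyr and ggplot2. A minimal sketch of the same idea with current tools follows, assuming dplyr 1.0.0 or later and ggplot2 3.0.0 or later. The helper names group_means and plot_group_means, the column name mean_value, and the demo data frame are hypothetical and not part of the original thread.

```r
library(ggplot2)
library(dplyr)

# Hypothetical helper: mean of column `y` for every combination of `x` and `fill`.
# All three arguments are column names passed as strings.
group_means <- function(df, x, y, fill) {
  df %>%
    group_by(across(all_of(c(x, fill)))) %>%
    summarise(mean_value = mean(.data[[y]], na.rm = TRUE), .groups = "drop")
}

# Hypothetical helper: grouped box plot with the group mean printed on each box.
plot_group_means <- function(df, x, y, fill) {
  ggplot(df, aes(x = .data[[x]], y = .data[[y]], fill = .data[[fill]])) +
    geom_boxplot() +
    geom_text(
      data = group_means(df, x, y, fill),
      # group is set explicitly so the labels dodge the same way as the boxes
      aes(y = mean_value, label = round(mean_value, 2), group = .data[[fill]]),
      position = position_dodge(width = 0.75), vjust = -0.5, size = 3
    ) +
    labs(x = x, y = y, fill = fill)
}

# Illustrative data with a few values copied from the sample
demo <- data.frame(
  Dimension = rep(c("Dim1", "Dim2"), each = 4),
  Region    = rep(c("PK", "US"), times = 4),
  BlogsInd. = c(-4.75, -12.49, -5.69, -11.59, -14.46, -1.75, -4.21, -0.22)
)

p <- plot_group_means(demo, "Dimension", "BlogsInd.", "Region")
```

The .data pronoun looks a column up by its string name, and na.rm = TRUE keeps missing values from turning a group mean into NA.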

Comments:

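Thanks for the answer. There is a problem with NAs. The data frame is structured so that only one column holds numeric values at a time while the other columns are NA. That is why a whole group of rows is non-numeric. I tried summarise(mean = mean(y = y %>% na.omit()) with no luck.

I don't fully understand the problem you are referring to: there are no NAs in the BlogsInd. column of the data sample. To exclude NAs from the mean, you can use mean(y, na.rm = TRUE). Likewise, if you want to pick (only?) the numeric value among several columns, you can use max(col1, col2, etc, na.rm = TRUE).

Sorry. There was a problem with passing a non-numeric value to round(). Now it does not recognise the function at all. There seems to be something wrong with the R runtime; I will report back once that is solved.

plotgraph <- function(x, y, colour) {
  plot1 <- ggplot(dims_Blog, aes_string(x = x, y = y, fill = colour)) +
    geom_boxplot() +
    labs(color = colour) +
    labs(x = "Dimensions", y = "Dimension Score") +
    scale_fill_grey(start = 0.3, end = 0.7) +
    theme_grey() +
    theme(legend.justification = c(1, 1), legend.position = c(1, 1)) +
    geom_text(data = dims_Blog %>% group_by_(x, colour) %>% summarise(mean = mean(y)),
              aes_string(x = x, y = "mean", label = "round(mean,3)"),
              position = position_dodge(width = 0.8), size = 3, vjust = -0.5, colour = "white")
  return(plot1)
}
plot1 <- plotgraph("Dimension", "BlogsInd.", "Region")

It returns blank. If summarise_ is used, I get "argument is not numeric in mean.default".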
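A likely reason for the blank labels in the last comment: inside the function, y holds the string "BlogsInd." rather than the column, so mean(y) averages a column name and returns NA. The small sketch below, in base R only, shows that behaviour next to the effect of na.rm; the vector scores is illustrative.

```r
# A numeric vector with one missing value, as in the columns that are mostly NA
scores <- c(-4.75, NA, -5.69)

mean_with_na    <- mean(scores)                # NA: one missing value poisons the mean
mean_without_na <- mean(scores, na.rm = TRUE)  # -5.22: missing value dropped first

# What summarise(mean = mean(y)) computes when y is a column name held as a string
y <- "BlogsInd."
mean_of_name <- suppressWarnings(mean(y))      # NA, with the warning
                                               # "argument is not numeric or logical"
```

Referring to the column itself, for example through .data[[y]] inside summarise(), avoids averaging the name.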
