R geom_histogram position="identity" inconsistent

Posted 2023-02-16


【Title】R geom_histogram position="identity" inconsistent 【Posted】2021-09-23 18:01:35 【Question】:

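I'm working in R and trying to create a panel of plots, each containing two overlaid histograms: a red histogram beneath a blue one. The red histogram uses the same data set in every plot, so it should look identical across the whole panel. I'm finding that it doesn't: the red histogram varies from plot to plot even though its data are exactly the same. Is there a way to fix this? Am I missing something in my code that causes the inconsistency?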

Here is the code I'm using to create the plots:

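  library(data.table)  # as.data.table, setnames, :=
  library(ggplot2)     # ggplot, geom_histogram
  library(cowplot)     # plot_grid

  test<-rnorm(1000)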
  test<-as.data.table(test)
  test[, type:="Sample"]
  setnames(test, old="test", new="value")
  
  test_2<-rnorm(750)
  test_2<-as.data.table(test_2)
  test_2[, type:="Sub Sample"]
  setnames(test_2, old="test_2", new="value")
  test_2_final<-rbind(test, test_2, fill=TRUE)
  
  
  test_3<-rnorm(500)
  test_3<-as.data.table(test_3)
  test_3[, type:="Sub Sample"]
  setnames(test_3, old="test_3", new="value")
  test_3_final<-rbind(test, test_3, fill=TRUE)
  
  test_4<-rnorm(250)
  test_4<-as.data.table(test_4)
  test_4[, type:="Sub Sample"]
  setnames(test_4, old="test_4", new="value")
  test_4_final<-rbind(test, test_4, fill=TRUE)
  
  test_5<-rnorm(100)
  test_5<-as.data.table(test_5)
  test_5[, type:="Sub Sample"]
  setnames(test_5, old="test_5", new="value")
  test_5_final<-rbind(test, test_5, fill=TRUE)
  
  test_6<-rnorm(50)
  test_6<-as.data.table(test_6)
  test_6[, type:="Sub Sample"]
  setnames(test_6, old="test_6", new="value")
  test_6_final<-rbind(test, test_6, fill=TRUE)
  
  draws_750_p<-ggplot(data = test_2_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
  draws_500_p<-ggplot(data = test_3_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
  draws_250_p<-ggplot(data = test_4_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
  draws_100_p<-ggplot(data = test_5_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
  draws_50_p<-ggplot(data = test_6_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
  
  
  full_plot<-plot_grid(draws_750_p, draws_500_p, draws_250_p, draws_100_p, draws_50_p, ncol = 3, nrow = 2)

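Here is a picture of the odd result I'm describing. Note how the distribution of the red histogram differs even though its data set is exactly the same in every plot (in this example it is most visible in the draws_250_p plot, at the top right).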

【Comments】:

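I haven't looked closely, but I think the problem is that each plot uses different bins, so the same value can end up in a different bin from one plot to the next. By default, ggplot2 guesses reasonable bin boundaries from the requested number of bins and the range of the data, but because the sub-sample differs in each plot (and may start earlier or later than the main sample), the resulting boundaries differ.

@CalumYou That's really helpful. Do I have to specify the bins manually?

You can specify the bins manually with the breaks argument, or use a combination of binwidth and center/boundary to make sure the bins line up. For example, binwidth = 0.05, boundary = 0 would produce bins 0-0.05, 0.05-0.1, and so on, for as far as the data extends.

【Answer 1】: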

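As I mentioned in the comments, the problem is that each plot uses different bins, so the same value can end up in a different bin from one plot to the next. By default, ggplot2 guesses reasonable bin boundaries from the requested number of bins and the range of the data, but because the sub-sample differs in each plot (and may start earlier or later than the main sample), the resulting boundaries differ.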
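The effect described above can be checked directly. The following is a minimal sketch, not part of the original answer: it builds two combined data sets that share the same main sample and uses ggplot2's layer_data() to read back the bin edges chosen with bins = 30. The data frame names, the seed, and the artificial extreme value 6 (added only so that the two data ranges are guaranteed to differ) are assumptions made for illustration.

```r
library(ggplot2)

set.seed(1)  # illustrative seed, not from the original post

# The main sample is identical in both combined data sets
sample_main <- data.frame(type = "Sample", value = rnorm(1000))

# Two hypothetical sub-samples; the second contains an artificial
# extreme value (6) so that the overall data range is different
sub_a <- data.frame(type = "Sub Sample", value = rnorm(250))
sub_b <- data.frame(type = "Sub Sample", value = c(rnorm(49), 6))

# Build the histogram with default binning and return the lower bin edges
bin_edges <- function(df) {
  p <- ggplot(df, aes(x = value, fill = type)) +
    geom_histogram(position = "identity", bins = 30)
  sort(unique(layer_data(p)$xmin))
}

edges_a <- bin_edges(rbind(sample_main, sub_a))
edges_b <- bin_edges(rbind(sample_main, sub_b))

# Compare the first few edges of each plot
print(head(edges_a))
print(head(edges_b))
```

If the two sets of edges differ, the identical main-sample values are being counted against different bins, which is why the red histogram changes shape between plots.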

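The solution is to specify the bin boundaries directly so that they are the same in every plot. Below is an example that specifies them implicitly through a combination of binwidth and boundary. I also took the liberty of combining all the values into a single data frame so they can be plotted at once with facet_wrap, which has the advantage of aligning the axes across facets and labelling each facet with the size of its sub-sample. The key point, though, is the call to geom_histogram. You can now see that the red distribution is identical in every facet.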

library(tidyverse)

test <- tibble(type = "Sample", value = rnorm(1000))

# Append a sub-sample of size n to df and tag every row with that size
add_sub_sample <- function(n, df) {
  sub_sample <- tibble(type = "Sub Sample", value = rnorm(n))
  df %>%
    rbind(sub_sample) %>%
    mutate(sub_sample_n = n)
}

test_final <- c(750, 500, 250, 100, 50) %>%
  map(add_sub_sample, test) %>%
  bind_rows()

ggplot(test_final, aes(x = value, fill = type, colour = type)) +
  geom_histogram(position = "identity", alpha = 0.2, binwidth = 0.2, boundary = 0) +
  facet_wrap(~sub_sample_n) +
  theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))

Created on 2021-07-14 by the reprex package (v1.0.0)
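For separate plots like those in the question (rather than facets), the same idea can be applied by passing one shared vector of break points to every geom_histogram call through the breaks argument mentioned in the comments. This is a hedged sketch, not the answerer's code: make_data, shared_breaks, the seed, the three sub-sample sizes, and the 0.2 bin width are illustrative choices.

```r
library(ggplot2)

set.seed(42)  # illustrative seed, not from the original post

# The main sample is reused unchanged in every plot
main <- data.frame(type = "Sample", value = rnorm(1000))

# Hypothetical helper: combine the main sample with a fresh sub-sample
make_data <- function(n) {
  rbind(main, data.frame(type = "Sub Sample", value = rnorm(n)))
}
datasets <- lapply(c(750, 250, 50), make_data)

# One set of break points that covers the values of all data sets
all_values <- unlist(lapply(datasets, function(df) df$value))
shared_breaks <- seq(floor(min(all_values)), ceiling(max(all_values)), by = 0.2)

# Every plot uses exactly the same bin boundaries
plots <- lapply(datasets, function(df) {
  ggplot(df, aes(x = value, fill = type, colour = type)) +
    geom_histogram(position = "identity", alpha = 0.2, breaks = shared_breaks)
})

# Read back the per-bin counts of the main sample from each plot.
# Group 1 is "Sample" because groups follow alphabetical order of type.
sample_counts <- lapply(plots, function(p) {
  d <- layer_data(p)
  d$count[d$group == 1]
})

# TRUE when the main sample is binned identically in the first two plots
print(identical(sample_counts[[1]], sample_counts[[2]]))
```

The resulting list of plots could then be arranged with plot_grid, as in the question.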
