Consistent way to overlay data on histogram (extracting the binned data from geom_histogram?)

Posted 2023-02-16


【Title】: Consistent way to overlay data on histogram (extracting the binned data from geom_histogram?) 【Posted】: 2021-12-28 04:50:45 【Question】:

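My goal is to create this plot in ggplot2:

After a lot of fiddling, I managed to create it for this one dataset, as in the screenshot above, using the following rather fragile code (note width=63, boundary=410, which took a lot of trial and error):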

library(ggplot2)
library(dplyr)

ex = data.frame(year=c(1971,1973,1975,1977,1979,1981,1983,1985,1987,1989,1991,1993,1995,1997,1999,2001,2003,2005,2007,2009,2011,2013,2015,2017,2019), occurances=c(347,773,589,462,280,455,1037,707,663,746,531,735,751,666,642,457,411,286,496,467,582,577,756,557,373))
ex_bin = mutate(ex, range=cut_width(occurances, width=63, boundary=410)) # bin the data
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range)) # extract range lower bound
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range)) # extract range upper bound
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,]) # extract our visual x position, based on the number of years already in this bin
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) + coord_flip() + geom_histogram(binwidth=63, boundary=410) + geom_text(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin) # plot!

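Note the hard-coded boundary and binwidth. This is very fragile and has to be tuned for each dataset. How can I make it work consistently? I care less about highlighting the selected year (2019 here, just to show the misalignment in the bins) than about correct label placement. My earlier attempts with stat_bin, cut_number, bins=13, and other approaches all ended in misaligned plots like this one (I have switched from text to label to show the alignment errors more clearly):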

ex_bin = mutate(ex, range=cut_number(occurances, n=13)) # I've also tried cut_interval
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range))
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range))
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,])
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) +coord_flip() + geom_histogram(bins=13) + geom_label(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin)

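Why? Is there any way to extract and use the same data that geom_histogram uses? I tried reading the ggplot code, but could not follow the execution flow. To add to the confusion, changing the label-placement code often re-bins the geom_histogram as well, even though it uses the original data frame. This surprised me, because every tweak to the labels throws the positions off as the histogram shifts again (note the three years in the bin below the highlighted bin, versus the two years above):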

ex_bin = mutate(ex, range=cut_width(occurances, width=63, boundary=410))
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range))
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range))
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,])
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) +coord_flip() + geom_histogram(bins=13) + geom_label(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin)

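So my questions are about making this work reliably with bins=13 (or similar) and about reusing the binning that geom_histogram computes.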

【Comments】:

【Answer 1】:

We can precompute fixed-width bins and then plot them as tiles:

library(dplyr)
library(ggplot2)

# make fixed-width bins: length.out = 10 breaks give 9 bins
d <- ex %>% 
  mutate(X = cut(occurances, seq(min(occurances) - 1, max(occurances) + 1, length.out = 10))) %>%
  group_by(X) %>% 
  arrange(year) %>% 
  mutate(Y = row_number())  # stacking position of each year within its bin

# plot with tiles
ggplot(d, aes(x = X, y = Y, label = year, fill = year == 2019)) +
  geom_tile() + 
  geom_text() +
  scale_x_discrete(drop = FALSE) +
  coord_flip()

Edit: create pretty breaks for the x-axis, and adjust the vline to match the x-axis:

# set the sequence breaks
seqBy = 100
rr = range(ex$occurances)
cutBreaks <- seq(from = rr[ 1 ] %/% seqBy * seqBy,
                 to = (rr[ 2 ] + seqBy) %/% seqBy * seqBy, 
                 by = seqBy)

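# adjust vline to match factors on X axis: level i of X is drawn centered at i,
# spanning i - 0.5 to i + 0.5, so shift by -0.5 to land inside the right bin
vline <- 650
vlineAdjust <- findInterval(vline, cutBreaks) - 0.5 + vline %% seqBy / seqBy
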
# convert X to factor
d <- ex %>% 
  mutate(X = cut(occurances, breaks = cutBreaks, dig.lab = 5)) %>%
  group_by(X) %>% 
  arrange(year) %>% 
  mutate(Y = row_number())

#plot with tiles
ggplot(d, aes(x = X, y = Y, label = year, fill = year == 2019)) +
  geom_tile() + 
  geom_text() +
  geom_vline(xintercept = vlineAdjust, col = "blue") +
  scale_x_discrete(drop = FALSE) +
  coord_flip() +
  theme_minimal()
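
If a truly linear axis is wanted, one possible variation (not part of the answer above) is to keep X numeric by using each bin's midpoint and drawing tiles of width seqBy. This is only a sketch under stated assumptions: the names binLower and seqBy follow the answer's naming loosely, and the bins here are left-closed, [lower, lower + seqBy), unlike the right-closed default of cut(). With a continuous axis, geom_vline can take the raw value directly.

```r
library(ggplot2)
library(dplyr)

ex <- data.frame(
  year = seq(1971, 2019, by = 2),
  occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751,
                 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373)
)

seqBy <- 100  # bin width (assumption: same value as in the edit above)

d <- ex %>%
  arrange(year) %>%
  mutate(binLower = occurances %/% seqBy * seqBy,  # left edge of the bin
         X = binLower + seqBy / 2) %>%             # numeric bin midpoint
  group_by(X) %>%
  mutate(Y = row_number()) %>%                     # stacking position within the bin
  ungroup()

# tiles are seqBy wide, so the x-axis stays continuous and linear
p <- ggplot(d, aes(x = X, y = Y, label = year, fill = year == 2019)) +
  geom_tile(width = seqBy, height = 1) +
  geom_text() +
  geom_vline(xintercept = 650, col = "blue") +     # raw data value, no adjustment
  coord_flip() +
  theme_minimal()
```

Printing p draws the plot. The trade-off is that the bin edges are now computed by hand rather than by cut(), so the closed side of each interval has to be chosen explicitly.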

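【Comments】:

Very nice, no magic numbers! row_number within the group is exactly the fix for the awkward counting I was attempting. However, is there a way to make the x-axis linear again, so I can use geom_vline and the like? Or at least get nicer labels, even if they do not correspond to the bins?

@byteit101 See the edit; you can change seqBy and vline as needed.

【Answer 2】:

One option to achieve your desired result is to use stat="bin" in geom_text as well. In addition, we have to group by year so that each year becomes a separate "block". The tricky part is getting the year labels, for which I use after_stat. However, because the groups are stored internally as an integer sequence, we have to map them back to the corresponding years, for which I use a helper vector.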

library(ggplot2)
library(dplyr)

ex <- data.frame(year = c(1971, 1973, 1975, 1977, 1979, 1981, 1983, 1985, 1987, 1989, 1991, 1993, 1995, 1997, 1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019), 
                occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751, 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373))

years <- levels(factor(ex$year))
                
ggplot(ex, aes(y = occurances, fill = year == 2019, group = as.character(year), label = year)) +
  geom_histogram(binwidth = 63, boundary = 410, position = position_stack(reverse = TRUE)) +
  geom_text(color = "white", aes(label = after_stat(if_else(count > 0, as.character(years[group]), ""))), stat = "bin", 
            binwidth = 63, boundary = 410, position = position_stack(vjust = .5, reverse = TRUE))

EDIT: The approach also works fine when using bins instead of binwidth and boundary:

ggplot(ex, aes(y = occurances, fill = year == 2019, group = as.character(year), label = year)) +
  geom_histogram(bins=13, position = position_stack(reverse = TRUE)) +
  geom_text(color = "white", aes(label = after_stat(if_else(count > 0, as.character(years[group]), ""))), stat = "bin", 
            bins=13, position = position_stack(vjust = .5, reverse = TRUE))
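
On the question of extracting the binned data itself: ggplot2 exports layer_data(), which returns the data frame computed for a given layer of a built plot. The following is a minimal sketch, not part of the answer above; the variable names bins and breaks and the column bin are illustrative only. It reads the bin edges that geom_histogram actually used and reuses them with cut(), assuming stat_bin's default right-closed intervals.

```r
library(ggplot2)

ex <- data.frame(
  year = seq(1971, 2019, by = 2),
  occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751,
                 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373)
)

p <- ggplot(ex, aes(x = occurances)) +
  geom_histogram(bins = 13)

# one row per bin, as computed by stat_bin for layer 1
bins <- layer_data(p, 1)[, c("xmin", "xmax", "count")]

# reuse exactly the same edges to assign every observation to its bin
breaks <- c(bins$xmin, max(bins$xmax))
ex$bin <- cut(ex$occurances, breaks = breaks, include.lowest = TRUE)
```

Because the breaks come from the plot itself, changing bins = 13 to another value keeps the hand-made bins and the histogram in step, which is the consistency the question asks for.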

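【Comments】:

Nice, I did not know about after_stat. Is there a difference between using it with count and ..count..? I got very close by trying if_else, but why is after_stat needed? Also, can the fragile binwidth = 63, boundary = 410 be replaced with bins=13 or similar?

Not much of a difference. after_stat is just the new way, meant to replace the old .. notation. See ggplot2.tidyverse.org/reference/aes_eval.html. Regarding your second question: this definitely works with bins. See my edit.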
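
To illustrate the after_stat point from the comments, here is a small sketch (an illustration, not code from the thread) that labels each bar with its computed count. The label and y aesthetics refer to count, a variable that exists only after stat_bin has run, which is why they are wrapped in after_stat().

```r
library(ggplot2)

ex <- data.frame(
  year = seq(1971, 2019, by = 2),
  occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751,
                 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373)
)

p <- ggplot(ex, aes(x = occurances)) +
  geom_histogram(bins = 13) +
  geom_text(
    aes(y = after_stat(count),
        # blank label for empty bins, the count otherwise
        label = after_stat(ifelse(count > 0, as.character(count), ""))),
    stat = "bin", bins = 13, vjust = -0.5
  )

# computed data for the text layer (layer 2)
ld <- layer_data(p, 2)
```

The older ..count.. spelling refers to the same computed variable; after_stat(count) is the form documented on the aes_eval page linked above.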
