Understanding histogram_quantile based on rate in Prometheus

Posted 2023-02-15

【Title】: Understanding histogram_quantile based on rate in Prometheus 【Published】: 2019-08-05 07:44:42 【Question】:

According to the Prometheus documentation, to get the 95th percentile from a histogram metric I can use the following query:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Source: https://prometheus.io/docs/practices/histograms/#quantiles
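To make the `sum(...) by (le)` part of the query concrete, here is a minimal Python sketch of that aggregation step. The instance names and rate values are made up for illustration; only the grouping logic matters. Rates from different series are added together whenever they share the same `le` label, so histogram_quantile sees one combined set of buckets.

```python
from collections import defaultdict

# Hypothetical per-series bucket rates, keyed by (instance, le).
per_series_rates = {
    ("app-1", "0.1"): 2.0,
    ("app-1", "0.5"): 3.0,
    ("app-1", "+Inf"): 3.5,
    ("app-2", "0.1"): 1.0,
    ("app-2", "0.5"): 4.0,
    ("app-2", "+Inf"): 4.5,
}


def sum_by_le(rates):
    """Add up rates that share the same `le` label, dropping other labels."""
    totals = defaultdict(float)
    for (_instance, le), rate in rates.items():
        totals[le] += rate
    return dict(totals)


print(sum_by_le(per_series_rates))
# {'0.1': 3.0, '0.5': 7.0, '+Inf': 8.0}
```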

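Since each bucket of a histogram is a counter, we can calculate the rate of each bucket as:

the per-second average rate of increase of the time series in the range vector.

See: https://prometheus.io/docs/prometheus/latest/querying/functions/#rate

For example, if bucket value[t-5m] = 100 and bucket value[t] = 200, then bucket rate[t] = (200-100)/(5*60) = 0.333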
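For a `[5m]` range the divisor is 300 seconds. The sketch below shows only that arithmetic; `simple_rate` is a hypothetical helper, and it deliberately ignores what the real `rate()` also does (extrapolation to the window boundaries and counter-reset handling).

```python
def simple_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter between two samples."""
    return (last_value - first_value) / window_seconds


# The bucket counter went from 100 to 200 over a 5-minute window.
rate_per_second = simple_rate(100, 200, 5 * 60)
print(round(rate_per_second, 3))  # 0.333
```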

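Finally, the most confusing part: how does the histogram_quantile function find the 95th percentile of a given metric when all it knows is the rate of each bucket?

Is there any code or algorithm I can look at to understand it better?

【Comments】:

You can refer to my reply here.

【Answer 1】: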

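A solid example will explain histogram_quantile well.

Assumptions:

For simplicity, there is only one series.

There are 10 buckets for the metric http_request_duration_seconds:

10ms, 50ms, 100ms, 200ms, 300ms, 500ms, 1s, 2s, 3s, 5s

http_request_duration_seconds is treated as a COUNTER-type metric (each bucket series is a counter).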

time	value	delta	rate (quantity of items)
t-10m	50	N/A	N/A
t-5m	100	50	50 / (5*60)
t	200	100	100 / (5*60)
...	...	...	...

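We have at least two scrapes of the series covering 5 minutes, so rate() can calculate the quantity for each bucket.

rate_xxx(t) = (value_xxx[t] - value_xxx[t-5m]) / (5*60) is the quantity of items for [t-5m, t].

We are looking at 2 samples here (value(t) and value(t-5m)). About 10000 HTTP request durations (items) were recorded, and the example uses 10000 as a round total: 10000 ≈ rate_10ms(t) + rate_50ms(t) + rate_100ms(t) + ... + rate_5s(t) + rate_+Inf(t). (The counts listed in the table below add up to 9980.)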

bucket(le)	10ms	50ms	100ms	200ms	300ms	500ms	1s	2s	3s	5s	+Inf
range	~10ms	10~50ms	50~100ms	100~200ms	200~300ms	300~500ms	500ms~1s	1~2s	2s~3s	3~5s	5s~
rate_xxx(t)	3000	3000	1500	1000	800	400	200	40	30	5	5

Buckets are the essence of a histogram. We only need the per-bucket numbers in the rate_xxx(t) row to do the quantile calculation.
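One detail worth keeping in mind: the table lists how many items fall into each range, whereas Prometheus exposes `_bucket` series cumulatively, so each `le` bucket counts every observation less than or equal to its bound. Converting between the two views is a running sum, sketched here with the numbers from the table (the variable names are only illustrative).

```python
from itertools import accumulate

bounds = ["10ms", "50ms", "100ms", "200ms", "300ms", "500ms",
          "1s", "2s", "3s", "5s", "+Inf"]
per_range = [3000, 3000, 1500, 1000, 800, 400, 200, 40, 30, 5, 5]

# Cumulative view: what a `le`-labelled bucket series would hold.
cumulative = list(accumulate(per_range))

# Per-range view recovered from the cumulative one by taking differences.
recovered = [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]

for bound, count in zip(bounds, cumulative):
    print(f"le={bound}: {count}")
```

With these numbers the cumulative count at le=300ms is 9300, which is the "items before bucket=500ms" figure used in the walkthrough.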

Let's take a closer look at this expression (aggregations such as sum() are omitted for simplicity):

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

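We are actually looking for the 95th-percentile item among the rate_xxx(t) values, from bucket=10ms to bucket=+Inf. Here the 95th percentile means the 9500th item, since we have 10000 items in total (10000 * 0.95). From the table above, there are 9300 = 3000+3000+1500+1000+800 items altogether before bucket=500ms.

So the 9500th item is the 200th item (9500-9300) in bucket=500ms (range=300~500ms), which holds 400 items.

Prometheus assumes that the items in a bucket are spread evenly in a linear pattern. The metric value for the 200th item in bucket=500ms is therefore 400ms = 300+(500-300)*(200/400).

That is, the 95th percentile is 400ms.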
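The walkthrough above can be sketched in Python as follows. This is an illustration of the linear-interpolation idea under the answer's assumptions (per-range counts, first bucket starting at 0), not a copy of the Prometheus source. The `total` parameter is hypothetical: it exists only to mirror the rounded total of 10000, because the counts in the table add up to 9980.

```python
import math

# Per-range counts from the table: (upper bound in ms, items in that range).
BUCKETS = [
    (10, 3000), (50, 3000), (100, 1500), (200, 1000), (300, 800),
    (500, 400), (1000, 200), (2000, 40), (3000, 30), (5000, 5),
    (math.inf, 5),
]


def quantile_from_buckets(q, buckets, total=None):
    """Approximate the q-quantile by linear interpolation inside one bucket.

    `buckets` must be sorted by ascending upper bound. `total` overrides the
    item count (hypothetical parameter, used to mirror the rounded total).
    """
    if total is None:
        total = sum(count for _, count in buckets)
    rank = q * total      # e.g. 0.95 * 10000 -> the 9500th item
    seen = 0              # items in all lower buckets
    lower_bound = 0.0     # assumption: the first bucket starts at 0
    for upper_bound, count in buckets:
        if count > 0 and seen + count >= rank:
            if math.isinf(upper_bound):
                # No finite upper bound to interpolate towards, so fall
                # back to the highest finite bound.
                return lower_bound
            return lower_bound + (upper_bound - lower_bound) * (rank - seen) / count
        seen += count
        lower_bound = upper_bound
    return math.nan


print(quantile_from_buckets(0.95, BUCKETS, total=10_000))  # 400.0
```

Leaving `total` unset uses the actual sum of the counts (9980), which moves the rank to the 9481st item and gives a slightly lower estimate inside the same 300~500ms bucket.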

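A few things to bear in mind:

A metric of the histogram type should be a COUNTER in nature.

Series used for quantile calculation should always have the le label defined.

Items (data) in a specific bucket (e.g. 300~500ms) are assumed to be spread evenly in a linear pattern. At least, that is the assumption Prometheus makes.

Quantile calculation requires the buckets to be sorted (defined) in ascending/descending order (e.g. 1ms < ...).

The result of histogram_quantile is an approximation.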

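P.S. The metric value is not always accurate, because of the assumption that items (data) in a specific bucket are spread evenly in a linear pattern.

Say the actual longest duration in bucket=500ms (range=300~500ms), e.g. as seen in the nginx access log, is 310ms. With the setup above we would still get 400ms from histogram_quantile, which can be quite confusing.

The smaller the bucket width, the more accurate the approximation. So choose bucket boundaries that fit your needs.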
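A small sketch of why bucket width matters, reusing the 200th-of-400 position from the example. The 300~320ms bucket is hypothetical and is there only for comparison; the estimate can never leave the bucket it falls in, so the error is bounded by the bucket width.

```python
def interpolate(lower_ms, upper_ms, rank_in_bucket, count):
    """Linear interpolation of an item's value inside a single bucket."""
    return lower_ms + (upper_ms - lower_ms) * rank_in_bucket / count


# Same position (200th of 400 items), two different bucket layouts.
coarse = interpolate(300, 500, 200, 400)  # bucket from the example
fine = interpolate(300, 320, 200, 400)    # hypothetical narrower bucket

print(coarse)  # 400.0
print(fine)    # 310.0
```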

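【Answer 2】:

I believe this is the code for it in Prometheus. The general idea is that you use the data in the buckets to infer/approximate the quantiles. Elasticsearch's rollup feature also does something similar (but different/much simpler).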

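【Answer 3】:

You can refer to my reply here.

In fact, the rate() function is only used to specify the time window; its denominator has no effect on the calculated percentile value.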
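This can be checked numerically: dividing every bucket count by the same number of seconds leaves the interpolated quantile unchanged, because only the proportions between buckets matter. The sketch below uses a simplified interpolation with hypothetical per-range counts and finite bounds only (the +Inf bucket is left out for brevity).

```python
import math


def quantile(q, bounds_ms, counts):
    """Linear-interpolation quantile over per-range counts (finite bounds only)."""
    rank = q * sum(counts)
    seen, lower = 0.0, 0.0
    for upper, count in zip(bounds_ms, counts):
        if count > 0 and seen + count >= rank:
            return lower + (upper - lower) * (rank - seen) / count
        seen += count
        lower = upper
    return math.nan


bounds_ms = [10, 50, 100, 200, 300, 500, 1000, 2000, 3000, 5000]
increases = [3000, 3000, 1500, 1000, 800, 400, 200, 40, 30, 5]  # items per 5 minutes
rates = [n / (5 * 60) for n in increases]                        # items per second

from_increases = quantile(0.95, bounds_ms, increases)
from_rates = quantile(0.95, bounds_ms, rates)
print(math.isclose(from_increases, from_rates))  # True
```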

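【Answer 4】:

You have to use rate because counters can reset; rate automatically accounts for resets and gives you the correct per-second count. Remember to always apply rate to a counter before using it.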
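A minimal sketch of what "accounting for resets" means, assuming the usual rule that any decrease between consecutive samples is treated as a restart from zero. `increase_with_resets` is a hypothetical helper and the sample values are made up; the real `rate()` additionally extrapolates to the window boundaries.

```python
def increase_with_resets(samples):
    """Total increase of a counter, treating any drop as a reset to zero."""
    total = 0.0
    for previous, current in zip(samples, samples[1:]):
        if current < previous:
            # Counter restarted: everything counted since the reset is new.
            total += current
        else:
            total += current - previous
    return total


samples = [100, 150, 20, 70]          # the process restarted after 150
naive = samples[-1] - samples[0]      # -30, clearly wrong for a counter
corrected = increase_with_resets(samples)

print(naive)      # -30
print(corrected)  # 120.0
```

Dividing the corrected increase by the window length in seconds then gives a per-second rate that stays non-negative across restarts.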
