Refining Ceph Cluster Health Status Monitoring
Posted by ygtff
Requirements
When building the alerting system for Ceph, cluster health monitoring initially distinguished only OK, WARN, and ERROR, judged from the output of `ceph status`. On reflection, that is not enough: WARN and ERROR each cover many different conditions. If an alert about Ceph health arrives in the middle of the night, you only learn that the cluster has a problem, not what the problem actually is. During working hours this is easy to handle: just log in to the Ceph environment and take a look. At night, however, some alerts are not urgent and can wait until the next day. So these health states need to be broken down in finer detail.
Therefore, the messages associated with HEALTH_OK, HEALTH_WARN, and HEALTH_ERR are pulled out of the Ceph source code, examined and classified, and then turned into levels expressed as status codes.
Health status messages produced by Ceph itself:
HEALTH_WARN:
Cluster health status message | What it indicates |
---|---|
Monitor clock skew detected | Clock skew between monitors |
mons down, quorum | A Ceph monitor is down |
some monitors are running older code | Visible right after deployment; does not appear during normal operation |
in osds are down | Appears after an OSD goes down |
flag(s) set | Cluster flags have been set; can be ignored |
crush map has legacy tunables | Visible right after deployment; does not appear during normal operation |
crush map has straw_calc_version=0 | Visible right after deployment; does not appear during normal operation |
cache pools are missing hit_sets | Appears when a cache tier is in use |
no legacy OSD present but 'sortbitwise' flag is not set | Visible right after deployment; does not appear during normal operation |
has mon_osd_down_out_interval set to 0 | Appears when mon_osd_down_out_interval is set to 0; this has the same effect as the noout flag |
'require_jewel_osds' osdmap flag is not set | Visible right after deployment; does not appear during normal operation |
is full | Appears when a pool is full |
near full osd | Warning when an OSD is nearly full |
unscrubbed pgs | Some PGs have not been scrubbed |
pgs stuck | Shown when PGs are stuck in an unhealthy state |
requests are blocked | Warning for slow requests |
osds have slow requests | Warning for slow requests |
recovery | Reported when recovery is needed |
at/near target max | Warning when a cache tier is in use |
too few PGs per OSD | Too few PGs per OSD |
too many PGs per OSD | Too many PGs per OSD |
> pgp_num | pg_num is greater than pgp_num |
has many more objects per pg than average (too few pgs?) | Too many objects per PG |
HEALTH_ERR:
Cluster health status message | What it indicates |
---|---|
no osds | Visible right after deployment; does not appear during normal operation |
full osd | Appears when an OSD is full |
pgs are stuck inactive for more than | PGs are stuck inactive; neither reads nor writes to those PGs succeed |
scrub errors | Scrub errors were detected (a scrub failure itself, or PG inconsistencies found by scrub?) |
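Before these messages can be classified, the monitoring script first has to capture the health output itself. Below is a minimal sketch, assuming the `ceph` CLI and a readable keyring are available on the monitoring host and that the plain-text output of `ceph health detail` is what gets parsed (the function name and timeout are illustrative, not part of the original code):

```python
import subprocess


def get_ceph_health_detail(timeout=10):
    """Run `ceph health detail` and return its plain-text output."""
    result = subprocess.run(
        ["ceph", "health", "detail"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

The same text could also be taken from `ceph status`; plain-text substring matching on these messages is enough for the level scheme described next.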
Handling in the current monitoring code
From the output above, pick out the key items and give each one its own status code; in other words, only these are tracked. Everything else either does not appear during normal operation or is not currently in use, and is therefore ignored.
Ceph Health Status Code:
Code | Decimal value |
---|---|
Other warnings | 0 |
HEALTH_OK | 1 |
HEALTH_CLOCK_SKEW = 1 << 1 | 2 |
HEALTH_NEAR_FULL = 1 << 2 | 4 |
HEALTH_FULL = 1 << 3 | 8 |
HEALTH_SLOW_REQUEST = 1 << 4 | 16 |
HEALTH_PG_STALE = 1 << 5 | 32 |
HEALTH_SCRUB_ERROR = 1 << 6 | 64 |
Note: a basic status-code legend is added to the alert description:
ceph cluster not health; clock skew:2,nearfull:4,full:8,slow_request:16,pg_stale:32,scrub_error:64,others:0
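As a rough illustration of how these codes can be combined, here is a minimal Python sketch that matches the `ceph health` / `ceph health detail` text against keywords taken from the tables above and ORs the corresponding bits together. The keyword choices, class name, and function names are illustrative assumptions, not the actual code linked below:

```python
from enum import IntFlag


class CephHealth(IntFlag):
    """Bit-flag status codes from the table above ("other warnings" stay at 0)."""
    HEALTH_OK = 1
    HEALTH_CLOCK_SKEW = 1 << 1    # 2
    HEALTH_NEAR_FULL = 1 << 2     # 4
    HEALTH_FULL = 1 << 3          # 8
    HEALTH_SLOW_REQUEST = 1 << 4  # 16
    HEALTH_PG_STALE = 1 << 5      # 32
    HEALTH_SCRUB_ERROR = 1 << 6   # 64


def classify_health(health_text):
    """Map ceph health (detail) text to a combined bit-flag status code.

    The substrings below come from the HEALTH_WARN / HEALTH_ERR tables above;
    text that matches none of them is reported as 0 ("other warnings").
    """
    lowered = health_text.lower()
    if lowered.startswith("health_ok"):
        return int(CephHealth.HEALTH_OK)

    code = 0
    if "clock skew" in lowered:
        code |= CephHealth.HEALTH_CLOCK_SKEW
    if "near full osd" in lowered:
        code |= CephHealth.HEALTH_NEAR_FULL
    # Strip the "near full" wording first so it does not also set the FULL bit.
    remainder = lowered.replace("near full osd", "")
    if "full osd" in remainder or " is full" in remainder:
        code |= CephHealth.HEALTH_FULL
    if "requests are blocked" in lowered or "slow requests" in lowered:
        code |= CephHealth.HEALTH_SLOW_REQUEST
    if "stale" in lowered:
        code |= CephHealth.HEALTH_PG_STALE
    if "scrub errors" in lowered:
        code |= CephHealth.HEALTH_SCRUB_ERROR
    return int(code)


def alert_message(code):
    """Build the alert line with the status-code legend described above."""
    if code == int(CephHealth.HEALTH_OK):
        return "ceph cluster health OK"
    return ("ceph cluster not health; status code=%d; clock skew:2,nearfull:4,"
            "full:8,slow_request:16,pg_stale:32,scrub_error:64,others:0" % code)
```

For example, a health line such as `HEALTH_WARN 1 near full osd(s); 30 requests are blocked > 32 sec` would classify to 4 + 16 = 20, so the on-call engineer can see at a glance that this is a capacity-plus-latency warning rather than an ERROR.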
Links
Full code
Appendix:
Note: lines ending in [detail] are messages that appear in the output of `ceph health detail`.
HEALTH_WARN:
【Monitor.cc:】
Monitor clock skew detected
【MonmapMonitor.cc:】
mons down, quorum
is down (out of quorum) [detail]
some monitors are running older code
only supports the "classic" command set [detail]
【OSDMonitor.cc:】
osd." << i << " is down since epoch [detail]
in osds are down
flag(s) set
crush map has legacy tunables (require
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
crush map has straw_calc_version=0
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
with cache_mode needs hit_set_type to be set but it is not [detail]
cache pools are missing hit_sets
no legacy OSD present but 'sortbitwise' flag is not set
has mon_osd_down_out_interval set to 0
this has the same effect as the 'noout' flag [detail]
'require_jewel_osds' osdmap flag is not set
is full
near full osd
【PGMonitor.cc:】
current state/last acting [detail]
ops are blocked > [detail]
deep-scrubbed, last_deep_scrub_stamp [detail]
unscrubbed pgs
pgs stuck
min_size from / may help; search ceph.com/docs for 'incomplete [detail]
requests are blocked >
osds have slow requests
recovery
objects at/near target max [detail]
B at/near target max [detail]
at/near target max
too few PGs per OSD
too many PGs per OSD
> pgp_num
has many more objects per pg than average (too few pgs?)
HEALTH_ERR:
【OSDMonitor.cc:】
no osds
full osd
【PGMonitor.cc:】
pgs are stuck inactive for more than
scrub errors