Monitoring metrics and Prometheus rules (work in progress)

Posted 2021-02-15


(1) node exporter standard performance metrics

1) Metrics
CPU usage: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage: 100 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100)
Disk usage: (1 - (node_filesystem_free_bytes{fstype=~"ext3|ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs"})) * 100
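The arithmetic behind the three expressions can be checked by hand. The following is a minimal Python sketch, not part of node exporter or Prometheus; the function names and sample numbers are made up for illustration. It assumes the idle values are per-core idle fractions, which is what irate() over node_cpu_seconds_total{mode="idle"} yields.

```python
def cpu_usage_percent(idle_fractions):
    """CPU usage from per-core idle fractions (idle seconds per second)."""
    avg_idle = sum(idle_fractions) / len(idle_fractions)
    return 100 - avg_idle * 100


def memory_usage_percent(mem_free, buffers, cached, mem_total):
    """Memory usage, counting free + buffers + cached as available."""
    return 100 - (mem_free + buffers + cached) / mem_total * 100


def disk_usage_percent(free_bytes, size_bytes):
    """Disk usage as the share of the filesystem that is not free."""
    return (1 - free_bytes / size_bytes) * 100


if __name__ == "__main__":
    # Hypothetical host: two cores, 8 GiB RAM, 100 GiB filesystem.
    print(cpu_usage_percent([0.75, 0.25]))
    print(memory_usage_percent(2, 1, 1, 8))
    print(disk_usage_percent(25, 100))
```

Each function returns a percentage, so the 80 threshold used in the alert rules applies to it directly.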

2) Prometheus rules

groups:
- name: alert-rule
  rules:
  - alert: NodeFilesystemUsage-high
    expr: (1 - (node_filesystem_free_bytes{fstype=~"ext3|ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext3|ext4|xfs"})) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Node Filesystem usage detected"
      description: "{{$labels.instance}}: Node Filesystem usage is above 80% (current value is: {{ $value }})"
  - alert: NodeMemoryUsage
    expr: (100 - ((node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: High Node Memory usage detected"
      description: "{{$labels.instance}}: Node Memory usage is above 80% (current value is: {{ $value }})"
  - alert: NodeCPUUsage
    expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Node High CPU usage detected"
      description: "{{$labels.instance}}: Node CPU usage is above 80% (current value is: {{ $value }})"

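(2) MySQL performance metrics

1) MySQL metrics

MySQL availability: mysql_up (0 means MySQL is down)

Queries per second: rate(mysql_global_status_questions[5m])

Connections: mysql_global_status_threads_connected > 200
    or available connections: mysql_global_variables_max_connections - mysql_global_status_threads_connected < 200

Slow queries: rate(mysql_global_status_slow_queries[5m])

Replication SQL thread: mysql_slave_status_slave_sql_running
Replication lag: mysql_slave_status_seconds_behind_master

threads_connected and seconds_behind_master are gauges (current values), so they are compared against the threshold directly. rate() only makes sense for counters such as questions and slow_queries.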
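To see why rate() suits counters but not gauges, here is a minimal Python sketch. The simple_rate helper and the sample series are hypothetical; the helper is a simplification that takes the per-second increase between the first and last sample, whereas Prometheus's real rate() also handles counter resets and extrapolates to the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase between the first and last (timestamp, value) sample.

    Simplified on purpose: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter: total questions served, sampled every 60 seconds.
questions = [(0, 1000), (60, 31000), (120, 61000)]

# Hypothetical gauge: connected threads, steady at 250.
threads_connected = [(0, 250), (60, 250), (120, 250)]

qps = simple_rate(questions)                # per-second query rate
threads_rate = simple_rate(threads_connected)  # change per second of the gauge
current_threads = threads_connected[-1][1]  # latest gauge value

if __name__ == "__main__":
    print(qps, threads_rate, current_threads)
```

With these numbers the gauge's rate is zero even though 250 connections are open, so a condition of the form rate(...) > 200 would never fire, while comparing the current value against 200 would.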

2) Prometheus rules

groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: Mysql_High_QPS
    expr: rate(mysql_global_status_questions[5m]) > 500
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql_High_QPS detected"
      description: "{{$labels.instance}}: Mysql operations are more than 500 per second (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_Connections
    expr: mysql_global_status_threads_connected > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql Too Many Connections detected"
      description: "{{$labels.instance}}: Mysql connections are more than 200 (current value is: {{ $value }})"
  - alert: Mysql_Too_Many_slow_queries
    expr: rate(mysql_global_status_slow_queries[5m]) > 3
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Mysql_Too_Many_slow_queries detected"
      description: "{{$labels.instance}}: Mysql slow_queries is more than 3 per second (current value is: {{ $value }})"
  - alert: SQL thread stopped
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"

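(3) Pod performance metrics
1) Container metrics

Pod memory usage: container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""} * 100 != +Inf
Pod CPU usage: sum by (pod_name) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) * 100

A container without a memory limit reports a limit of 0, so the ratio evaluates to +Inf; the != +Inf filter drops those series. On Kubernetes 1.16 and later, cAdvisor exposes these labels as container and pod instead of container_name and pod_name.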
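The effect of the != +Inf filter on the memory expression can be sketched in Python. This is an illustration only: the function names and container figures are hypothetical, and the division-by-zero branch mimics PromQL's float semantics (a positive value divided by 0 gives +Inf, 0 divided by 0 gives NaN) because plain Python would raise an exception instead.

```python
import math


def pod_memory_percent(usage_bytes, limit_bytes):
    """Memory usage as a percentage of the limit, with PromQL-like division."""
    if limit_bytes == 0:
        # Mimic PromQL: x / 0 is +Inf for x > 0, and 0 / 0 is NaN.
        return math.inf if usage_bytes > 0 else math.nan
    return usage_bytes / limit_bytes * 100


def over_threshold(containers, threshold=80):
    """Keep containers whose usage is finite and above the threshold."""
    alerts = {}
    for name, (usage, limit) in containers.items():
        pct = pod_memory_percent(usage, limit)
        if pct != math.inf and pct > threshold:
            alerts[name] = pct
    return alerts


# Hypothetical containers: (usage bytes, limit bytes); a limit of 0 means no limit set.
containers = {
    "app": (900, 1000),
    "sidecar": (100, 1000),
    "nolimit": (500, 0),
}

if __name__ == "__main__":
    print(over_threshold(containers))
```

Only the container that has a limit and exceeds 80% of it remains; the one without a limit is filtered out rather than alerting forever.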

2) Prometheus rules

groups:
- name: noah_pod.rules
  rules:
  - alert: PodMemUsage
    expr: (container_memory_usage_bytes{container_name!=""} / container_spec_memory_limit_bytes{container_name!=""} * 100 != +Inf) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.name}}: Pod High Mem usage detected"
      description: "{{$labels.name}}: Pod Mem is above 80% (current value is: {{ $value }})"
  - alert: PodCpuUsage
    expr: sum by (pod_name) (rate(container_cpu_usage_seconds_total{image!=""}[1m])) * 100 > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.pod_name}}: Pod High CPU usage detected"
      description: "{{$labels.pod_name}}: Pod CPU is above 80% (current value is: {{ $value }})"

References:

http://ylzheng.com/2018/04/02/use-prometheus-monitor-mysql/
https://www.cnblogs.com/zengkefu/p/5658252.html
https://blog.csdn.net/qq_25934401/article/details/82594478
https://blog.csdn.net/qq_39570637/article/details/81711328
https://blog.csdn.net/ichglauben/article/details/82381438
