Prometheus alerting via a WeCom (企业微信) group robot


1. Prometheus alerting logic

[Figure: Prometheus rules → Alertmanager → webhook adapter → WeCom group]

The Prometheus server evaluates alerting rules (rules) against the metrics it collects, and pushes an alert to Alertmanager whenever an expression crosses its configured threshold or condition. On receipt, Alertmanager processes each alert and routes it according to its labels; once a route is matched, Alertmanager calls the configured webhook, which delivers the message to the WeCom group.

2. Prometheus server configuration

prometheus.yml

# my global config
global: # global defaults
  scrape_interval: 5s # scrape targets every 5 seconds; the default is 1 minute
  evaluation_interval: 15s # evaluate rules every 15 seconds; the default is 1 minute
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting: # where alerts are pushed
  alertmanagers:
    - static_configs:
        - targets: ['127.0.0.1:9093'] # 9093 is the Alertmanager port

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - /export/prometheus/rules/*.yml # load the alerting rules
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # - job_name: prometheus

  # metrics_path defaults to '/metrics'
  # scheme defaults to 'http'.

  - job_name: qinghotel_report # name of this monitoring job
    metrics_path: /report/actuator/prometheus # the service endpoint; it can be looked up in nacos
    file_sd_configs:
      - files:
          - /export/prometheus/conf/report.json # monitored hosts are loaded from this JSON file
        refresh_interval: 10s
  - job_name: qinghotel-erp-server # every job_name must be indented consistently, or Prometheus rejects the config
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['10.11.0.10:19008']
  - job_name: push-metrics
    static_configs:
      - targets: ['127.0.0.1:9099']
    honor_labels: true
  - job_name: qinghotel-hotel-member
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ['10.0.0.3:9099', '10.11.0.36:9099']

Here is the JSON file it references; when there are many monitored hosts, the targets are usually kept in a separate file like this:

/export/prometheus/conf/report.json

[
  {
    "targets": ["10.11.0.8:8900", "10.11.0.29:8900"]
  }
]
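Because file_sd re-reads this file on the refresh_interval set above, edits to it take effect without restarting Prometheus. Before saving changes, the JSON can be sanity-checked with jq (assuming jq is installed; any JSON validator works):

# jq pretty-prints the file and exits non-zero on malformed JSON
jq . /export/prometheus/conf/report.json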


Validating the Prometheus configuration

Run the bundled promtool binary against the config file:

./promtool check config prometheus.yml
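promtool can validate the rule files referenced by rule_files as well; running it over the same glob catches YAML or PromQL mistakes before a reload:

./promtool check rules /export/prometheus/rules/*.yml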

3. Alertmanager configuration

global:
  resolve_timeout: 5m

templates:
  - /export/alertmanager/template/*.tmpl # the message template files must be loaded here

# routing tree
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 30m
  receiver: prometheus # must match a receiver name defined below
  routes:
    - receiver: prometheus # same name as above
      group_wait: 60s
      match:
        level: 1

receivers:
  - name: prometheus # the name referenced by the routes above
    webhook_configs:
      # this URL is the adapter service's endpoint, which maps to the
      # WeCom "prometheus" robot
      - url: http://10.11.0.16:8089/adapter/wx
        send_resolved: true

# a firing critical alert suppresses warning alerts that carry the same
# alertname, dev and instance labels
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ['alertname', 'dev', 'instance']

The template file

wechat.tmpl

 define "wechat.default.message" 
- if gt (len .Alerts.Firing) 0 -
- range $index, $alert := .Alerts -
- if eq $index 0
==========异常告警==========
告警类型: $alert.Labels.alertname
告警级别: $alert.Labels.severity
告警详情: $alert.Annotations.message $alert.Annotations.description;$alert.Annotations.summary
故障时间: ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05"
- if gt (len $alert.Labels.instance) 0
实例信息: $alert.Labels.instance
- end
- if gt (len $alert.Labels.namespace) 0
命名空间: $alert.Labels.namespace
- end
- if gt (len $alert.Labels.node) 0
节点信息: $alert.Labels.node
- end
- if gt (len $alert.Labels.pod) 0
实例名称: $alert.Labels.pod
- end
============END============
- end
- end
- end
- if gt (len .Alerts.Resolved) 0 -
- range $index, $alert := .Alerts -
- if eq $index 0
==========异常恢复==========
告警类型: $alert.Labels.alertname
告警级别: $alert.Labels.severity
告警详情: $alert.Annotations.message $alert.Annotations.description;$alert.Annotations.summary
故障时间: ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05"
恢复时间: ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05"
- if gt (len $alert.Labels.instance) 0
实例信息: $alert.Labels.instance
- end
- if gt (len $alert.Labels.namespace) 0
命名空间: $alert.Labels.namespace
- end
- if gt (len $alert.Labels.node) 0
节点信息: $alert.Labels.node
- end
- if gt (len $alert.Labels.pod) 0
实例名称: $alert.Labels.pod
- end
============END============
- end
- end
- end
- end
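A note on the .Add 28800e9 calls: Go's time.Time.Add takes a duration in nanoseconds, and 28800e9 ns is exactly 8 hours, so this shifts the UTC timestamps Alertmanager stores to UTC+8 (Beijing time) before formatting them.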

The adapter service and Alertmanager are started together with Docker:

docker-compose.yml

version: "3"
services:
  webhook-adapter:
    image: guyongquan/webhook-adapter:latest
    container_name: webhook-adapter
    hostname: webhook-adapter
    ports:
      - "8089:80"
    restart: always
    command:
      - "--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=ddcebdbc-*******"
      # the URL after /wx= is the WeCom robot's webhook address
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    hostname: alertmanager
    restart: always
    volumes:
      - /export/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      # absolute host path of alertmanager.yml; change it to your own path
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "9093:9093"

Start both with: docker-compose up -d  # run from the directory containing docker-compose.yml

Copy the WeCom robot's webhook address and paste it into the command line of the adapter service above.

[Screenshot: copying the robot's webhook address in WeCom]

To check that the webhook adapter started successfully, browse to port 8089 on the host's public address, and make sure the security group allows that port.
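From the host itself, a plain HTTP probe of the mapped port is enough to confirm the container is serving (the exact response body depends on the adapter image, so treat this as a smoke test only):

# host port 8089 is mapped to the adapter's port 80; any HTTP response confirms it is up
curl -i http://127.0.0.1:8089/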

[Screenshot: the webhook adapter answering on port 8089]

Check that the Alertmanager service is healthy:

[Screenshot: Alertmanager responding on port 9093]

Validating the alertmanager.yml syntax:

./amtool check-config  /export/alertmanager/alertmanager.yml
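Recent amtool versions can also exercise the routing tree and report which receiver a given label set would reach; a sketch for the level=1 sub-route above:

# show the receiver an alert labelled level=1 would be routed to
./amtool config routes test --config.file=/export/alertmanager/alertmanager.yml level=1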


Here are a few example alerting-rule files. They live in the rules directory under the Prometheus installation (the path loaded by rule_files above).

1. hoststats-alert.yml

groups:
  - name: hostStatsAlert
    rules:
      - alert: hostCpuUsageAlert
        expr: sum(avg without (cpu) (irate(node_cpu{mode!="idle"}[5m]))) by (instance) > 0.85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"

      - alert: hostMemUsageAlert
        expr: (node_memory_MemTotal - node_memory_MemAvailable) / node_memory_MemTotal > 0.85
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} MEM usage high"
          description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"
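Before committing an expression to a rule file, it can be evaluated ad hoc against the Prometheus query API; a sketch using the CPU expression above (--data-urlencode takes care of the braces and quotes in the PromQL):

# POST the expression to the query API and get the current values back as JSON
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=sum(avg without (cpu) (irate(node_cpu{mode!="idle"}[5m]))) by (instance)'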


2. jvm_alert.yml

groups:
  - name: jvm-alerting
    rules:

      # heap usage above 80%
      - alert: heap-usage-too-much
        # the expression was lost in the original post; a typical JMX-exporter
        # expression (an assumption; adjust to your metric names) is:
        expr: jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} * 100 > 80
        for: 60m
        labels:
          level: 3 # alert level: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
          name: prometheusalertcenter
        annotations:
          summary: "JVM Instance {{ $labels.instance }} memory usage > 80%"
          runbook: "详情请参考:http://1.1.1.1:9093/#/alerts" # public address of the Alertmanager UI

      # Old GC taking more than 50% of the last 5 minutes; the alert name and
      # expression were dropped in the original post, so this entry mirrors the
      # 80% rule below with a 0.5 threshold (a reconstruction, not the original)
      - alert: old-gc-time-warning
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.5
        for: 5m
        labels:
          level: 3 # alert level: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
          name: prometheusalertcenter
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 50% running time"
          description: "{{ $labels.instance }} of application {{ $labels.application }} has been in status [Old GC time > 50% running time] for more than 5 minutes. current seconds ({{ $value }}%)"
          runbook: "详情请参考:http://1.1.1.1:9093/#/alerts"

      # Old GC taking more than 80% of the last 5 minutes
      - alert: old-gc-time-too-much
        expr: increase(jvm_gc_collection_seconds_sum{gc="PS MarkSweep"}[5m]) > 5 * 60 * 0.8
        for: 5m
        labels:
          level: 3 # alert level: 0 info, 1 warning, 2 moderate, 3 severe, 4 disaster
          name: prometheusalertcenter
        annotations:
          summary: "JVM Instance {{ $labels.instance }} Old GC time > 80% running time"
          description: "{{ $labels.instance }} of application {{ $labels.application }} has been in status [Old GC time > 80% running time] for more than 5 minutes. current seconds ({{ $value }}%)"
          runbook: "详情请参考:http://1.1.1.1:9093/#/alerts"

3. service_status.yml

groups:
  - name: 实例存活告警规则
    rules:
      - alert: 实例存活告警
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "主机宕机 !!!"
          description: "该实例已经宕机超过一分钟了"
  - name: 内存报警规则
    rules:
      # the alert name and expression were lost in the original post; a common
      # node_exporter memory expression (a reconstruction) is used here
      - alert: 内存使用率告警
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "服务器可用内存不足"
          description: "内存使用率已超过80%(当前值: {{ $value }}%)"
  - name: CPU报警规则
    rules:
      - alert: CPU使用率告警
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "CPU使用率正在飙升。"
          description: "CPU使用率超过80%(当前值: {{ $value }}%)"
  - name: 磁盘使用率报警规则
    rules:
      - alert: 磁盘使用率告警
        expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
        for: 80m
        labels:
          severity: warning
        annotations:
          summary: "硬盘分区使用率过高"
          description: "分区使用大于80%(当前值: {{ $value }}%)"
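Rule files can also be unit-tested offline with promtool. A minimal sketch for the up == 0 rule above, using promtool's rule-test format (the job and instance values are taken from the scrape config earlier and are only illustrative):

# write a throwaway rule-test file, then run it with promtool
cat > /tmp/test_service_status.yml <<'EOF'
rule_files:
  - /export/prometheus/rules/service_status.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # a target that is down for the whole test window
      - series: 'up{job="qinghotel_report", instance="10.11.0.8:8900"}'
        values: '0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: 实例存活告警
        exp_alerts:
          - exp_labels:
              user: prometheus
              severity: warning
              job: qinghotel_report
              instance: 10.11.0.8:8900
            exp_annotations:
              summary: "主机宕机 !!!"
              description: "该实例已经宕机超过一分钟了"
EOF
./promtool test rules /tmp/test_service_status.yml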

4. Starting the services

Start the main Prometheus server:

./prometheus --config.file=prometheus.yml --web.enable-lifecycle 2> /dev/null &

Hot-reload Prometheus after a config change (this requires the --web.enable-lifecycle flag used in the start command above):

curl -XPOST http://localhost:9090/-/reload

Once everything is up, ports 8089, 9090, and 9093 should all be listening.
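A quick way to confirm from the shell (assuming ss is available; netstat -tnlp works the same way):

# list listening TCP sockets and keep only the three expected ports
ss -tnlp | grep -E '8089|9090|9093'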

5. Testing the WeCom push from the shell

#!/usr/bin/env bash
alerts_message='[
  {
    "labels": {
      "alertname": "磁盘已满",
      "dev": "sda1",
      "instance": "实例sda1",
      "msgtype": "testing"
    },
    "annotations": {
      "info": "程序员小王提示您:这是测试消息",
      "summary": "testing"
    }
  }
]'

# post the $alerts_message payload defined above to Alertmanager's alerts API
curl -XPOST -d "$alerts_message" http://127.0.0.1:9093/api/v1/alerts
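Note that the v1 alerts API used above was removed in Alertmanager 0.27; on newer releases the same payload can be posted to the v2 endpoint instead:

# same payload, v2 endpoint; v2 expects an explicit JSON content type
curl -XPOST -H "Content-Type: application/json" -d "$alerts_message" http://127.0.0.1:9093/api/v2/alerts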

The alert notification in the group looks like this:

[Screenshot: the test alert rendered by the WeCom robot]


That's all for this post. Likes, bookmarks, and comments are welcome.
