5. Prometheus Alertmanager alerting mechanism

Posted by 都市侠客行

First, the environment and overall configuration:

1. Configure prometheus.yml and enable Alertmanager


cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'localhost:9093' # point Prometheus at the local Alertmanager

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files: # alerting rules, i.e. the alert triggers
  # - "rules/host_rules.yml"
  # - "second_rules.yml"
  - "/usr/local/prometheus/rules/node_down.yml"
  - "/usr/local/prometheus/rules/memory_over.yml"
  - "/usr/local/prometheus/rules/disk_over.yml"
  - "/usr/local/prometheus/rules/cpu_over.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090", "localhost:9100"]

  - job_name: "hosts_exporters" # file-based service discovery
    file_sd_configs:
      - files: ["./hosts.json"] # file listing the monitored targets

2. Alert rule files in detail


[root@wind-k8s-prom prometheus]# ls rules/
cpu_over.yml  disk_over.yml  host_rules.yml  memory_over.yml  node_down.yml

[root@wind-k8s-prom prometheus]# cat rules/cpu_over.yml
groups:
  - name: CPU报警规则
    rules:
      - alert: CPU使用率告警
        expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 85
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "服务器: CPU使用超过85%!(当前值: {{ $value }}%)"

[root@wind-k8s-prom prometheus]# cat rules/memory_over.yml
groups:
  - name: 内存报警规则
    rules:
      - alert: 内存使用率告警
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "服务器: 内存在1分钟内使用超过80%!(当前值: {{ $value }}%)"

[root@wind-k8s-prom prometheus]# cat rules/disk_over.yml
groups:
  - name: 磁盘报警规则
    rules:
      - alert: 磁盘使用率告警
        expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "服务器: 磁盘设备: 使用超过80%!(挂载点: {{ $labels.mountpoint }} 当前值: {{ $value }}%)"

[root@wind-k8s-prom prometheus]# cat rules/node_down.yml
groups:
  - name: 监控实例存活告警规则
    rules:
      - alert: 监控实例存活告警
        expr: up == 0
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          description: "{{ $labels.instance }} of job {{ $labels.job }} 已经停止超过1分钟"
[root@wind-k8s-prom prometheus]#
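Before pointing Prometheus at new or changed rule files, it is worth validating them and then triggering a reload. A minimal sketch, assuming Prometheus was started with --web.enable-lifecycle (otherwise send SIGHUP to the Prometheus process instead of the curl call):

# validate all rule files
promtool check rules /usr/local/prometheus/rules/*.yml

# ask Prometheus to re-read prometheus.yml and the rule files
curl -X POST http://localhost:9090/-/reload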


3. Configure alertmanager.yml and add WeChat Work (企业微信) alerting

[root@wind-k8s-prom alertmanager]# cat alertmanager.yml
global:
  resolve_timeout: 5m
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
templates:
  - '/usr/local/alertmanager/template.tmp1'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 10m
  receiver: 'wechat' # send alerts to the WeChat Work receiver
receivers:
  - name: 'wechat'
    wechat_configs: # WeChat Work notification settings
      - corp_id: 'ww383025ea********' # corp (enterprise) ID
        to_party: '1' # department ID to notify
        agent_id: '1000002' # agent ID of the in-house app
        api_secret: 'WN*******************************************' # secret of the app
        send_resolved: true
inhibit_rules:
  - equal: ['alertname', 'cluster', 'service']
    source_match:
      severity: 'high'
    target_match:
      severity: 'warning'
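The Alertmanager configuration and the referenced template can be checked before (re)starting, and a running Alertmanager can re-read its configuration without a full restart. A minimal sketch, run from /usr/local/alertmanager (amtool ships alongside Alertmanager; SIGHUP also works instead of the curl call):

# validate alertmanager.yml and the templates it references
amtool check-config alertmanager.yml

# ask the running Alertmanager to reload its configuration
curl -X POST http://localhost:9093/-/reload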


4. Configure the alert notification template, template.tmp1

[root@wind-k8s-prom alertmanager]# cat template.tmp1 

{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常告警==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts -}}
{{- if eq $index 0 }}
==========异常恢复==========
告警类型: {{ $alert.Labels.alertname }}
告警级别: {{ $alert.Labels.severity }}
告警详情: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};{{$alert.Annotations.summary}}
故障时间: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
恢复时间: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
实例信息: {{ $alert.Labels.instance }}
{{- end }}
{{- if gt (len $alert.Labels.namespace) 0 }}
命名空间: {{ $alert.Labels.namespace }}
{{- end }}
{{- if gt (len $alert.Labels.node) 0 }}
节点信息: {{ $alert.Labels.node }}
{{- end }}
{{- if gt (len $alert.Labels.pod) 0 }}
实例名称: {{ $alert.Labels.pod }}
{{- end }}
============END============
{{- end }}
{{- end }}
{{- end }}
{{- end }}
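One detail worth calling out in this template: $alert.StartsAt and $alert.EndsAt are UTC timestamps, and `.Add 28800e9` adds 28800e9 nanoseconds, i.e. 8 hours, so the rendered message shows Beijing time (UTC+8). Only that constant needs to change for a different offset; for example, a UTC+9 variant of the firing-time line would be:

故障时间: {{ ($alert.StartsAt.Add 32400e9).Format "2006-01-02 15:04:05" }}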

5. Then create the self-built app (小应用) in the WeChat Work admin console


6. Test the result


Stop any one of the node_exporter instances.
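For example, assuming node_exporter runs as a systemd service on that host:

systemctl stop node_exporter

Once the rule's for: 1m window has elapsed, the up == 0 rule (监控实例存活告警) moves to firing and the WeChat Work message should arrive. The alerts currently held by Alertmanager can also be listed directly:

curl -s http://localhost:9093/api/v2/alerts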


To make testing easier, also lower the memory and CPU usage thresholds, as in the sketch below.

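For instance, temporarily dropping the memory rule's threshold from 80 to a value below the host's current usage (a throwaway test value, not one to keep in production) makes the alert fire within the 1-minute for window:

expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 10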


This achieves the goal: alerts are now delivered through WeChat Work.


Next up: sending alerts through a QQ mailbox (email).

