K8s Series: Email Alerting with Prometheus

Posted by lihanlin

Thanks to the original author for sharing: http://bjbsair.com/2020-04-07/tech-info/30650.html

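1. Specify the Alerting Service and Rule Files

Tell Prometheus which alert management service to send alerts to and which alerting rule files to load. Here the alerting service (Alertmanager) is deployed in Kubernetes and exposed as a service named alertmanager on port 9093. The rule files are all files matching *.yml under the "/etc/prometheus/rules/" directory.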

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alerting service to send alerts to
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Alerting rule files to load
rule_files:
  - "/etc/prometheus/rules/*.yml"
  # - "second_rules.yml"

# Scrape configurations. The first job scrapes Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter-np:9121']
  - job_name: 'node'
    static_configs:
      - targets: ['prometheus-prometheus-node-exporter:9100']
  - job_name: 'windows-node-001'
    static_configs:
      - targets: ['10.0.32.148:9182']
  - job_name: 'windows-node-002'
    static_configs:
      - targets: ['10.0.34.4:9182']
  - job_name: 'rabbit'
    static_configs:
      - targets: ['prom-rabbit-prometheus-rabbitmq-exporter:9419']

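2. Define the Alerting Rule

Define the alerting rule. Prometheus evaluates it and sends the resulting alerts to the alerting service. The rule below reports every instance whose up metric is 0, so the recipient learns which instances are not running.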

# rules
groups:
  - name: node-rules
    rules:
      - alert: InstanceDown # alert name
        expr: up == 0 # alert condition
        for: 3s # how long the condition must hold before the alert fires
        labels: # labels attached to the alert
          team: k8s
        annotations: # alert details
          summary: "{{$labels.instance}}: has been down"
          description: "{{$labels.instance}}: job {{$labels.job}} has been down"
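
The effect of the for clause can be sketched in Python. This is a simplified model of the inactive, pending, firing progression, not Prometheus's actual code; the names AlertState, evaluate, and simulate are made up for illustration. It assumes the 15s evaluation_interval from the Prometheus configuration and one up sample per evaluation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertState:
    """Hypothetical container for the state of one alert instance."""
    state: str = "inactive"
    active_since: Optional[float] = None


def evaluate(alert: AlertState, now: float, condition_true: bool,
             hold_seconds: float) -> str:
    """Apply one rule evaluation and return the resulting state."""
    if not condition_true:
        # The expression no longer matches: the alert resets.
        alert.state = "inactive"
        alert.active_since = None
        return alert.state
    if alert.active_since is None:
        # First evaluation at which the expression matched.
        alert.active_since = now
    held = now - alert.active_since
    alert.state = "firing" if held >= hold_seconds else "pending"
    return alert.state


def simulate(up_samples: List[int], eval_interval: float = 15.0,
             hold_seconds: float = 3.0) -> List[str]:
    """Evaluate `up == 0` once per interval and collect the states."""
    alert = AlertState()
    return [
        evaluate(alert, i * eval_interval, up == 0, hold_seconds)
        for i, up in enumerate(up_samples)
    ]


# One healthy sample, two failed scrapes, then recovery.
print(simulate([1, 0, 0, 1]))
```

Under these assumptions, the first failing evaluation only marks the alert as pending. It fires at the next evaluation, about 15 seconds later, because the 3s hold time is shorter than the evaluation interval.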

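3. Configure Alert Routing and Receivers

Alerts are delivered by email here: when the alerting service receives an alert, it emails the alert to the configured recipient.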

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25' # SMTP server (host:port) of the sending mailbox
  smtp_from: 'xxx@163.com' # sender address
  smtp_auth_username: 'xxx' # mailbox username
  smtp_auth_password: 'SYNUNQBZMIWUQXGZ' # mailbox password or authorization code

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email'
receivers:
  - name: 'email'
    email_configs:
      - to: 'xxxxxx@aliyun.com' # address that receives the alerts
        headers: { Subject: "[WARN] 报警邮件" } # email subject ("alert email")

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
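
How the inhibit rule decides whether to mute an alert can be sketched in Python. This is an illustrative model only, not Alertmanager's implementation; the function names and the HighLoad alert are hypothetical. It follows the documented behavior that a missing label and an empty label value are treated as the same when comparing the labels listed in equal.

```python
from typing import Dict, List

Labels = Dict[str, str]

# The inhibit rule from the configuration, written as plain data.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "dev", "instance"],
}


def matches(labels: Labels, matchers: Labels) -> bool:
    """True if every matcher label has exactly the given value."""
    return all(labels.get(name, "") == value for name, value in matchers.items())


def is_inhibited(target: Labels, firing: List[Labels], rule: dict) -> bool:
    """True if some other firing alert mutes `target` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if source is target or not matches(source, rule["source_match"]):
            continue
        # A missing label compares equal to an empty one.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


critical = {"alertname": "HighLoad", "instance": "a", "severity": "critical"}
warning = {"alertname": "HighLoad", "instance": "a", "severity": "warning"}
instance_down = {"alertname": "InstanceDown", "instance": "redis-exporter-np:9121",
                 "job": "redis", "team": "k8s"}

print(is_inhibited(warning, [critical, warning], RULE))
print(is_inhibited(instance_down, [critical, instance_down], RULE))
```

Note that the InstanceDown rule shown earlier only sets the team label. Since it carries no severity label, this inhibit rule neither mutes it nor lets it mute other alerts.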

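4. Verification

Among the instances monitored by Prometheus in this setup, redis and windows-node-002 were not running, so under the alerting rule above their alerts should be emailed to the recipient.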
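
Before checking the mailbox, it can help to confirm which targets Prometheus itself sees as down. The sketch below queries the Prometheus HTTP API (/api/v1/query) with the same expression as the rule. The base URL http://localhost:9090 and the function names are assumptions for illustration; adjust the address to however Prometheus is reachable in your cluster.

```python
import json
import urllib.parse
import urllib.request
from typing import List, Tuple


def down_targets(response: dict) -> List[Tuple[str, str]]:
    """Extract sorted (job, instance) pairs from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response!r}")
    return sorted(
        (series["metric"].get("job", ""), series["metric"].get("instance", ""))
        for series in response["data"]["result"]
    )


def query_down_targets(base_url: str = "http://localhost:9090") -> List[Tuple[str, str]]:
    """Run `up == 0` against Prometheus. base_url is an assumed address."""
    params = urllib.parse.urlencode({"query": "up == 0"})
    with urllib.request.urlopen(f"{base_url}/api/v1/query?{params}", timeout=5) as resp:
        return down_targets(json.load(resp))


# Example response shape for the two instances that were down in this setup.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "windows-node-002",
                        "instance": "10.0.34.4:9182"}, "value": [0, "0"]},
            {"metric": {"__name__": "up", "job": "redis",
                        "instance": "redis-exporter-np:9121"}, "value": [0, "0"]},
        ],
    },
}
print(down_targets(sample))
```

Calling query_down_targets() against a live Prometheus should list the same instances that later appear in the alert email.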

The recipient's mailbox received the following alert email.

[Image: the alert email received in the recipient's mailbox]
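
To exercise the email path without waiting for a real outage, a test alert can be posted straight to the alerting service. The sketch below builds a payload for Alertmanager's v2 API (POST /api/v2/alerts) that mirrors the labels and annotations of the InstanceDown rule. The function names are hypothetical, and the base URL assumes the alertmanager:9093 service address is reachable from where the script runs.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import List


def build_test_alert(instance: str, job: str) -> List[dict]:
    """Build a one-alert payload shaped like the InstanceDown rule's output."""
    return [{
        "labels": {
            "alertname": "InstanceDown",
            "instance": instance,
            "job": job,
            "team": "k8s",
        },
        "annotations": {
            "summary": f"{instance}: has been down",
            "description": f"{instance}: job {job} has been down",
        },
        # RFC 3339 timestamp in UTC.
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def post_alerts(alerts: List[dict], base_url: str = "http://alertmanager:9093") -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status


payload = build_test_alert("10.0.34.4:9182", "windows-node-002")
print(json.dumps(payload, indent=2))
# To actually send it (requires network access to Alertmanager):
# post_alerts(payload)
```

If routing and SMTP settings are correct, the email should arrive shortly after the group_wait of 10s has elapsed.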

