AlertManager - Send alerts to different receivers based on routes for particular Jobnames

Posted 2023-02-16

[Published]: 2021-11-04 01:52:17

[Question]:

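I have configured Prometheus Alertmanager on an Ubuntu server to monitor multiple Azure VMs. Currently, alerts for all VM instances are sent to a default email group. I need alerts to go to Team A (user1, user2, user3) and the default group when server A (identified by its job name) goes down, and to Team B (User1, User2) and the default group when server B goes down.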

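I tried a few combinations of route configs in alertmanager.yml, such as the one below, but they did not work as expected. I would appreciate it if someone could explain the logic behind sending group-specific alert notifications in Alertmanager. Thanks for your time!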

route:
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h

  receiver: 'default-receiver'

  routes:
  - match:
      alertname: A_down
    receiver: TeamA
  - match:
      alertname: B_down
    receiver: TeamB

My current alertmanager.yml file:

global:
 resolve_timeout: 1m

route:
 receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: alertgroups@example.com
    from: default@example.com
    smarthost: smtp.gmail.com:587
    auth_username: default@example.com
    auth_identity: default@example.com
    auth_password: password
    send_resolved: true

alertrule.yml file:

groups:
- name: alert.rules
  rules:
  - alert: InstanceDown
   # Condition for alerting
    expr: up == 0
    for: 1m
   # Annotation - additional informational labels to store more information
    annotations:
      title: 'Instance {{ $labels.instance }} down'
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
   # Labels - additional labels to be attached to the alert
    labels:
        severity: 'critical'

  - alert: HostOutOfMemory
   # Condition for alerting
    expr: node_memory_MemAvailable / node_memory_MemTotal * 100 < 80
    for: 5m
   # Annotation - additional informational labels to store more information
    annotations:
      title: 'Host out of memory (instance {{ $labels.instance }})'
      description: 'Node memory is filling up (< 25% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'
   # Labels - additional labels to be attached to the alert
    labels:
        severity: 'warning'

  - alert: HostHighCpuLoad
   # Condition for alerting
    expr: (sum by (instance) (irate(node_cpu{job="node_exporter_metrics",mode="idle"}[5m]))) > 80
    for: 5m
   # Annotation - additional informational labels to store more information
    annotations:
      title: 'Host high CPU load (instance {{ $labels.instance }})'
      description: 'CPU load is > 30%\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'
   # Labels - additional labels to be attached to the alert
    labels:
        severity: 'warning'

  - alert: HostOutOfDiskSpace
   # Condition for alerting
    expr: (node_filesystem_avail{mountpoint="/"} * 100) / node_filesystem_size{mountpoint="/"} < 70
    for: 5m
   # Annotation - additional informational labels to store more information
    annotations:
      title: 'Host out of disk space (instance {{ $labels.instance }})'
      description: 'Disk is almost full (< 50% left)\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}'

[Question comments]:

[Answer 1]:

Use this configuration:

  routes:
  - match:
      alertname: A_down
    receiver:
    - default-receiver
    - TeamA
  - match:
      alertname: B_down
    receiver: 
    - default-receiver
    - TeamB

Don't forget to define default-receiver, TeamA, and TeamB in the "receivers" block.
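One caveat on the routes above: Alertmanager's documented route schema declares `receiver` as a single receiver name, so a list of names under `receiver` may be rejected when the config is loaded. A minimal sketch of one way to express the same intent is shown here, assuming a catch-all child route with `continue: true` so that every alert reaches the default group and is then also evaluated against the team routes. The addresses `team-a@example.com` and `team-b@example.com` are hypothetical; the SMTP values are copied from the question.

```yaml
global:
  resolve_timeout: 1m
  # SMTP settings shared by every email receiver (values from the question)
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'default@example.com'
  smtp_auth_username: 'default@example.com'
  smtp_auth_identity: 'default@example.com'
  smtp_auth_password: 'password'

route:
  # Fallback receiver, used if no child route matches
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
  # No matchers: this route matches every alert and notifies the default group.
  # continue: true makes Alertmanager keep evaluating the sibling routes below.
  - receiver: 'default-receiver'
    continue: true
  - match:
      alertname: A_down
    receiver: 'TeamA'
  - match:
      alertname: B_down
    receiver: 'TeamB'

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'alertgroups@example.com'
    send_resolved: true
- name: 'TeamA'
  email_configs:
  - to: 'team-a@example.com'   # hypothetical Team A address
    send_resolved: true
- name: 'TeamB'
  email_configs:
  - to: 'team-b@example.com'   # hypothetical Team B address
    send_resolved: true
```

Without `continue: true`, evaluation stops at the first matching sibling route, which is why an alert matched by a team route would otherwise never reach the default group.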

[Comments]:

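Hi Marcelo, thanks for the reply. I fully understand your solution, but I have a small query: the alertrule.yml file is configured for all targets added to Prometheus. How do I specify rules for a particular job name, so that email alerts are sent to a group only when that specific target instance goes down?

Is this a different question? You didn't mention anything about routing by job name, did you? It is possible, but first it is necessary to understand exactly what you want to accomplish.

It is not a different question. Apologies for the confusion; I have edited the question title. The initial requirement is that when a particular instance (job name) meets the global rules defined in alertrule.yml, Prometheus should send the alert to the default group (monitoring team) plus the server-specific team (server owners, selected leads). For example, suppose server A, out of the listed servers, goes down. Then the default monitoring team and the team associated with server A should both get the alerts specific to that server instance.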
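The requirement described in these comments can be sketched by routing on the `job` label instead of `alertname`, since an alert produced by `up == 0` carries the `job` and `instance` labels of the scraped target. The fragment below is only an illustration: the job names `serverA` and `serverB` are hypothetical stand-ins for the `job_name` values in the asker's Prometheus scrape config, and it assumes receivers named `default-receiver`, `TeamA`, and `TeamB` are defined in the `receivers` block.

```yaml
route:
  receiver: 'default-receiver'
  routes:
  # Catch-all: every alert goes to the default (monitoring) group first,
  # then evaluation continues with the job-specific routes.
  - receiver: 'default-receiver'
    continue: true
  - match:
      job: serverA          # hypothetical job_name for server A
    receiver: 'TeamA'
  - match:
      job: serverB          # hypothetical job_name for server B
    receiver: 'TeamB'
```

This keeps the rules in alertrule.yml global; only the routing tree decides who is notified. One thing to watch: a rule whose expression aggregates with `sum by (instance)` drops the `job` label, so alerts from such a rule would match only the catch-all route unless `job` is kept in the `by` clause or added under the rule's `labels`.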
