Prometheus alert manager doesn't send alerts (k8s)
Posted: 2020-04-15
I'm using prometheus operator 0.3.4 and alertmanager 0.20 and it doesn't work: I can see the alert firing (in the Prometheus UI, under the Alerts tab), but I never receive an email for it. Looking at the logs I see the following, any idea? Note the warnings, maybe that's the reason, but I'm not sure how to fix it...
Here is the prometheus-operator helm chart I am using: https://github.com/helm/charts/tree/master/stable/prometheus-operator
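For reference, a minimal sketch of how that chart would typically be installed with a custom values file; the release name monitoring and the namespace monitoring are assumptions inferred from the service and DNS names later in the question, not something stated there:

# Hypothetical install of the stable/prometheus-operator chart (helm 3 syntax);
# release name and namespace are assumed, adjust to your environment.
helm install monitoring stable/prometheus-operator \
  --namespace monitoring \
  -f values.yaml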
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:231 msg="Starting Alertmanager" version="(version=0.20.0, branch=HEAD, revision=f74be0400a6243d10bb53812d6fa408ad71ff32d)"
level=info ts=2019-12-23T15:42:28.039Z caller=main.go:232 build_context="(go=go1.13.5, user=root@00c3106655f8, date=20191211-14:13:14)"
level=warn ts=2019-12-23T15:42:28.109Z caller=cluster.go:228 component=cluster msg="failed to join cluster" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:230 component=cluster msg="will retry joining cluster every 10s"
level=warn ts=2019-12-23T15:42:28.109Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
level=info ts=2019-12-23T15:42:28.109Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2019-12-23T15:42:28.131Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.132Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=/etc/alertmanager/config/alertmanager.yaml
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail
level=info ts=2019-12-23T15:42:28.134Z caller=main.go:416 component=configuration msg="skipping creation of receiver not referenced by any route" receiver=AlertMail2
level=info ts=2019-12-23T15:42:28.135Z caller=main.go:497 msg=Listening address=:9093
level=info ts=2019-12-23T15:42:30.110Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.00011151s
level=info ts=2019-12-23T15:42:38.110Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.000659096s
Here is my config yaml:
global:
  imagePullSecrets: []
prometheus-operator:
  defaultRules:
  grafana:
    enabled: true
  prometheusOperator:
    tolerations:
    - key: "WorkGroup"
      operator: "Equal"
      value: "operator"
      effect: "NoSchedule"
    - key: "WorkGroup"
      operator: "Equal"
      value: "operator"
      effect: "NoExecute"
    tlsProxy:
      image:
        repository: squareup/ghostunnel
        tag: v1.4.1
        pullPolicy: IfNotPresent
      resources:
        limits:
          cpu: 8000m
          memory: 2000Mi
        requests:
          cpu: 2000m
          memory: 2000Mi
    admissionWebhooks:
      patch:
        priorityClassName: "operator-critical"
        image:
          repository: jettech/kube-webhook-certgen
          tag: v1.0.0
          pullPolicy: IfNotPresent
    serviceAccount:
      name: prometheus-operator
    image:
      repository: quay.io/coreos/prometheus-operator
      tag: v0.34.0
      pullPolicy: IfNotPresent
  prometheus:
    prometheusSpec:
      replicas: 1
      serviceMonitorSelector:
        role: observeable
      tolerations:
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoSchedule"
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoExecute"
      ruleSelector:
        matchLabels:
          role: alert-rules
          prometheus: prometheus
      image:
        repository: quay.io/prometheus/prometheus
        tag: v2.13.1
  alertmanager:
    alertmanagerSpec:
      image:
        repository: quay.io/prometheus/alertmanager
        tag: v0.20.0
      resources:
        limits:
          cpu: 500m
          memory: 1000Mi
        requests:
          cpu: 500m
          memory: 1000Mi
    serviceAccount:
      name: prometheus
    config:
      global:
        resolve_timeout: 1m
        smtp_smarthost: 'smtp.gmail.com:587'
        smtp_from: 'alertmanager@vsx.com'
        smtp_auth_username: 'ds.monitoring.grafana@gmail.com'
        smtp_auth_password: 'mypass'
        smtp_require_tls: false
      route:
        group_by: ['alertname', 'cluster']
        group_wait: 45s
        group_interval: 5m
        repeat_interval: 1h
        receiver: default-receiver
        routes:
        - receiver: str
          match_re:
            cluster: "canary|canary2"
      receivers:
      - name: default-receiver
      - name: str
        email_configs:
        - to: 'rayndoll007@gmail.com'
          from: alertmanager@vsx.com
          smarthost: smtp.gmail.com:587
          auth_identity: ds.monitoring.grafana@gmail.com
          auth_username: ds.monitoring.grafana@gmail.com
          auth_password: mypass
      - name: 'AlertMail'
        email_configs:
        - to: 'rayndoll007@gmail.com'
https://codebeautify.org/yaml-validator/cb6a2781
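With the operator, it is also worth checking that this values block actually ended up in the generated configuration secret; a sketch, assuming the operator's usual alertmanager-&lt;name&gt; secret naming and the monitoring namespace:

# Dump the alertmanager.yaml that the operator actually mounted into the pod.
# The secret name is an assumption based on the alertmanager-<name> convention.
kubectl -n monitoring get secret alertmanager-monitoring-prometheus-oper-alertmanager \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d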
The error says the lookup failed. The pod named alertmanager-monitoring-prometheus-oper-alertmanager-0 is up and running, but Alertmanager tries to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc and fails.
Not sure why...
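To check whether that name is resolvable inside the cluster at all, a quick sketch using a throwaway busybox pod (busybox:1.28 is picked because its nslookup is well behaved for this kind of check):

# Try to resolve the exact name Alertmanager is failing on.
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc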
Here is the output of kubectl get svc -n mon (see further below).

UPDATE: here are the warning logs:
level=warn ts=2019-12-24T12:10:21.293Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.323Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-1.alertmanager-operated.monitoring.svc:9094
level=warn ts=2019-12-24T12:10:21.326Z caller=cluster.go:438 component=cluster msg=refresh result=failure addr=alertmanager-monitoring-prometheus-oper-alertmanager-2.alertmanager-operated.monitoring.svc:9094
And here is kubectl get svc -n mon:
NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None             <none>        9093/TCP,9094/TCP,9094/UDP   6m4s
monitoring-grafana                        ClusterIP   100.11.215.226   <none>        80/TCP                       6m13s
monitoring-kube-state-metrics             ClusterIP   100.22.248.232   <none>        8080/TCP                     6m13s
monitoring-prometheus-node-exporter       ClusterIP   100.33.130.77    <none>        9100/TCP                     6m13s
monitoring-prometheus-oper-alertmanager   ClusterIP   100.33.228.217   <none>        9093/TCP                     6m13s
monitoring-prometheus-oper-operator       ClusterIP   100.21.229.204   <none>        8080/TCP,443/TCP             6m13s
monitoring-prometheus-oper-prometheus     ClusterIP   100.22.93.151    <none>        9090/TCP                     6m13s
prometheus-operated                       ClusterIP   None             <none>        9090/TCP                     5m54s
【Comments】:
Apparently you have created a StatefulSet for Alertmanager. In a StatefulSet, a pod's IP can be resolved through the domain name 'pod-name.service-name.namespace.svc', so make sure you created a headless service named 'alertmanager-operated' and that it is working.

@KunLi - thanks, I'm not sure how to do that; it would be great if you could post your suggestion as an answer. I use github.com/helm/charts/tree/master/stable/prometheus-operator with the values shown in the question - what should I change?

I'm not very familiar with Alertmanager's configuration, so I don't understand why you don't get any alerts. To me the Alertmanager logs look normal and Alertmanager seems to be running fine. You can check the Alertmanager UI to make sure it has received all of these alerts, then check whether it has sent them out. If necessary, use tcpdump to help you track where the alert data goes.

@KunLi - so the warning is not the reason I don't receive emails? level=warn ts=2019-12-23T15:42:28.109Z caller=main.go:322 msg="unable to join gossip mesh" err="1 error occurred:\n\t* Failed to resolve alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc:9094: lookup alertmanager-monitoring-prometheus-oper-alertmanager-0.alertmanager-operated.monitoring.svc on 100.64.0.10:53: no such host\n\n"
Is this Mandarin?
Please provide the output of the following command: $ kubectl get svc, and please describe the services related to your Prometheus deployment.
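As a quick check of the headless service mentioned in the comments (the monitoring namespace is assumed from the DNS names in the logs; the question also uses -n mon, so adjust accordingly):

# A ClusterIP of "None" confirms alertmanager-operated is headless, which is what
# gives each StatefulSet pod its pod-name.service-name.namespace.svc DNS entry.
kubectl -n monitoring get svc alertmanager-operated -o jsonpath='{.spec.clusterIP}'

# The endpoints object should list one address per Alertmanager pod.
kubectl -n monitoring get endpoints alertmanager-operated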
【Answer 1】:

Proper debugging steps to help with scenarios like this:

1. Enable Alertmanager debug logging: add the argument --log.level=debug (a sketch of doing this through the helm chart follows this list).
2. Verify that the Alertmanager cluster has formed correctly (check the /status endpoint and verify that all peers are listed).
3. Verify that Prometheus is sending alerts to all Alertmanager peers (check its /status endpoint and verify that all Alertmanager peers are listed).
4. Test end to end: generate a test alert; you should see the alert in the Prometheus UI, then in the Alertmanager UI, and finally you should receive the alert notification (see the sketch after this list for firing a test alert directly against the Alertmanager API).
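A minimal sketch of steps 1-3, assuming the release name monitoring, the namespace monitoring, and the service names shown earlier in the question; alertmanagerSpec.logLevel and the endpoints used below are standard for the operator chart and for Alertmanager v0.20 / Prometheus v2, but verify them against your versions:

# Step 1: raise Alertmanager's log verbosity through the chart values
# (alertmanagerSpec.logLevel is applied by the operator as --log.level on the container).
helm upgrade monitoring stable/prometheus-operator \
  --namespace monitoring \
  --reuse-values \
  --set alertmanager.alertmanagerSpec.logLevel=debug

# Step 2: check that the Alertmanager cluster has formed (the "cluster" section
# of the status response should list every peer).
kubectl -n monitoring port-forward svc/monitoring-prometheus-oper-alertmanager 9093 &
curl -s http://localhost:9093/api/v2/status

# Step 3: check that Prometheus knows about every Alertmanager peer.
kubectl -n monitoring port-forward svc/monitoring-prometheus-oper-prometheus 9090 &
curl -s http://localhost:9090/api/v1/alertmanagers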
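And a sketch of step 4, pushing a synthetic alert straight into Alertmanager; the alert name is made up, and the cluster=canary label is chosen only because it matches the match_re route in the question's config, so the "str" email receiver should fire for it:

# Fire a test alert at the Alertmanager API (the v1 endpoint still exists in v0.20).
curl -s -XPOST http://localhost:9093/api/v1/alerts -H "Content-Type: application/json" -d '[
  {
    "labels":      { "alertname": "TestAlert", "cluster": "canary", "severity": "warning" },
    "annotations": { "summary": "Synthetic alert to verify email delivery" }
  }
]'

# Then watch what Alertmanager does with it: it should show up in the UI/API,
# and the debug log should show the notification attempt (or the SMTP error).
curl -s http://localhost:9093/api/v2/alerts
kubectl -n monitoring logs alertmanager-monitoring-prometheus-oper-alertmanager-0 \
  -c alertmanager | grep -i notif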
【Discussion】: