Prometheus + Grafana + Alertmanager + Email + DingTalk Alerting


This article simulates a production environment.

1. Ansible deployment

Online installation:
yum install ansible -y



Offline installation

# In an offline environment, download the required packages ahead of time on a server that has network access
mkdir -p /home/ansible
yum install ansible -y --downloadonly --downloaddir=/home/ansible/

Install:

cd /home/ansible

# Install the ansible rpm packages
rpm -ivh *.rpm --force --nodeps


[root@zcsnode1 ansible]# ansible --version
ansible 2.9.27


# Modify the ansible configuration file
[root@zcsmaster1 ~]# cat /etc/ansible/ansible.cfg
[defaults]
host_key_checking=False
[inventory]
[privilege_escalation]
become=True
become_method=su
become_user=root
[paramiko_connection]
[ssh_connection]
[persistent_connection]
[accelerate]
[selinux]
[colors]
[diff]

# Configure the ansible host inventory
[root@zcsmaster1 ~]# cd node_exporter/
[root@zcsmaster1 node_exporter]# vim hosts

[root@zcsmaster1 node_exporter]# cat hosts   (the hosts below allow root login, with passwordless SSH)
[node_exporter]
192.168.40.180
192.168.40.181
192.168.40.182
[root@zcsmaster1 node_exporter]#


Deploy node_exporter

Log in to the ansible host

mkdir node_exporter && cd node_exporter

# Download the agent; if the download fails, download it manually and upload it
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz

# Edit the systemd unit file
[root@zcsmaster1 node_exporter]# cat node_exporter.service
[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
User=root
ExecStart=/app/prometheus/node_exporter-1.5.0.linux-amd64/node_exporter --web.listen-address=:19100
Restart=on-failure

[Install]
WantedBy=multi-user.target

#

[root@zcsmaster1 node_exporter]# cat node_exporter.yaml
---
- name: install node_exporter
  hosts: all
  remote_user: root
  become: yes
  become_method: su
  tasks:
    - name: mkdir dir
      shell: mkdir -p /app/prometheus
    - name: copy package
      copy: src=./node_exporter-1.5.0.linux-amd64.tar.gz dest=/app/prometheus/
    - name: copy service
      copy: src=./node_exporter.service dest=/etc/systemd/system/
    - name: unzip package
      shell: cd /app/prometheus/ && tar -zxvf node_exporter-1.5.0.linux-amd64.tar.gz
    - name: install service
      shell: systemctl daemon-reload && systemctl start node_exporter && systemctl enable node_exporter && systemctl status node_exporter
[root@zcsmaster1 node_exporter]#



#####
Configure the hosts file

# The k8s hosts already have passwordless SSH set up (production forbids root login, so the hosts file needs to be adjusted accordingly)
[root@zcsmaster1 node_exporter]# cat hosts
192.168.40.180
192.168.40.181
192.168.40.182


Run the playbook against all hosts in one batch

ansible-playbook -i hosts  ./node_exporter.yaml
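After the playbook finishes, each node in the inventory should answer on port 19100. The loop below is a hypothetical verification helper, not part of the original procedure: it reads the same inventory format shown above (a `[node_exporter]` group header followed by bare IPs) and prints the endpoint it would probe; the actual `curl` is left commented out as a sketch.

```shell
# Recreate the inventory from above in a throwaway location so the loop is
# self-contained; on the real ansible host you would point at ./hosts instead.
cat > /tmp/hosts <<'EOF'
[node_exporter]
192.168.40.180
192.168.40.181
192.168.40.182
EOF
# Skip group headers and blank lines, then check each node's metrics endpoint.
grep -Ev '^\[|^$' /tmp/hosts | while read -r ip; do
  echo "would check http://${ip}:19100/metrics"
  # curl -sf "http://${ip}:19100/metrics" | head -n 1
done
```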

Deploy Prometheus + Alertmanager

Create the ccse namespace
kubectl create ns ccse




Create a ServiceAccount and grant it RBAC permissions

# Create a ServiceAccount named monitor
[root@xianchaomaster1 ~]# kubectl create serviceaccount monitor -n ccse
# Bind the monitor ServiceAccount to the cluster-admin ClusterRole with a ClusterRoleBinding
[root@xianchaomaster1 ~]# kubectl create clusterrolebinding monitor-clusterrolebinding -n ccse --clusterrole=cluster-admin --serviceaccount=ccse:monitor

Create the Prometheus data storage directories

# Create the data directories on node1 of the k8s cluster (production can use remote storage instead)
mkdir -p /data/app/grafana_data && chmod 777 -R /data/app/grafana_data
mkdir -p /data/app/alertmanager-storage && chmod 777 -R /data/app/alertmanager-storage
mkdir -p /data/app/prometheus_data && chmod 777 -R /data/app/prometheus_data

# Pull the images on the node
docker pull grafana/grafana:8.5.15
docker pull prom/prometheus:v2.40.6
docker pull prom/alertmanager:v0.24.0

Install the kube-state-metrics component

What is kube-state-metrics?

kube-state-metrics listens to the API server and generates state metrics about resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes these metrics; it does not store them, so we use Prometheus to scrape and store the data. It mainly covers business-level metadata such as Deployment and Pod replica status: how many replicas are scheduled, how many are currently available, how many Pods are running/stopped/terminated, how many times a Pod has restarted, and how many jobs are running.
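Once Prometheus is scraping kube-state-metrics, the questions above map onto queries such as the following (these are standard kube-state-metrics series names; the queries themselves are illustrative and not part of the original procedure):

```promql
# How many times has each container restarted?
kube_pod_container_status_restarts_total

# Pods currently not in the Running phase
kube_pod_status_phase{phase!="Running"} == 1

# Deployments whose available replicas lag behind the desired count
kube_deployment_spec_replicas - kube_deployment_status_replicas_available > 0
```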

Installation steps:

mkdir -p kube-state-metrics
cd kube-state-metrics


cat kube-state-metrics-deploy.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
      - name: kube-state-metrics
        image: quay.io/coreos/kube-state-metrics:v1.9.0
        ports:
        - containerPort: 8080
[root@zcsmaster1 2]# cat kube-state-metrics-rbac.yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
- apiGroups: [""]
  resources: ["nodes", "pods", "services", "resourcequotas", "replicationcontrollers", "limitranges", "persistentvolumeclaims", "persistentvolumes", "namespaces", "endpoints"]
  verbs: ["list", "watch"]
- apiGroups: ["extensions"]
  resources: ["daemonsets", "deployments", "replicasets"]
  verbs: ["list", "watch"]
- apiGroups: ["apps"]
  resources: ["statefulsets"]
  verbs: ["list", "watch"]
- apiGroups: ["batch"]
  resources: ["cronjobs", "jobs"]
  verbs: ["list", "watch"]
- apiGroups: ["autoscaling"]
  resources: ["horizontalpodautoscalers"]
  verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
- kind: ServiceAccount
  name: kube-state-metrics
  namespace: kube-system
[root@zcsmaster1 2]# cat  kube-state-metrics-svc.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  ports:
  - name: kube-state-metrics
    port: 8080
    protocol: TCP
  selector:
    app: kube-state-metrics
[root@zcsmaster1 2]#

etcd-certs

Generate an etcd-certs secret; it is needed when deploying Prometheus

kubectl -n ccse create secret generic etcd-certs --from-file=/etc/kubernetes/pki/etcd/server.key  --from-file=/etc/kubernetes/pki/etcd/server.crt --from-file=/etc/kubernetes/pki/etcd/ca.crt

Deploy prometheus-alertmanager-svc.yaml

kubectl apply -f prometheus-alertmanager-svc.yaml

---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: prometheus
    kubernetes.io/cluster-service: 'true'
  name: alertmanager
  namespace: ccse
spec:
  ports:
  - name: alertmanager
    nodePort: 30066
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    app: prometheus
  sessionAffinity: None
  type: NodePort
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: ccse
  labels:
    app: prometheus
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    protocol: TCP
  selector:
    app: prometheus
    component: server

alertmanager-cm.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  name: alertmanager
  namespace: ccse
data:
  alertmanager.yml: |-
    global:
      resolve_timeout: 1m
      smtp_smarthost: smtp.163.com:25
      smtp_from: zhaochengsheng_666@163.com
      smtp_auth_username: zhaochengsheng_666
      smtp_auth_password: DYTRIRAIWRYANDMC
      smtp_require_tls: false
    route:
      group_by: [alertname]
      group_wait: 1s
      group_interval: 5s
      repeat_interval: 24h
      receiver: 5gcmp
    receivers:
    - name: 5gcmp
      email_configs:
      - to: 838032955@qq.com
        send_resolved: true

prometheus-alertmanager-cfg.yaml

kind: ConfigMap
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus-config
  namespace: ccse
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["alertmanager.ccse.svc.cluster.local:9093"]
    global:
      scrape_interval: 15s
      scrape_timeout: 10s
      evaluation_interval: 1m
    scrape_configs:
    #- job_name: kubernetes-node
    #  kubernetes_sd_configs:
    #  - role: node
    #  relabel_configs:
    #  - source_labels: [__address__]
    #    regex: (.*):10250
    #    replacement: $1:9100
    #    target_label: __address__
    #    action: replace
    #  - action: labelmap
    #    regex: __meta_kubernetes_node_label_(.+)
    - job_name: kubernetes-node-cadvisor
      metric_relabel_configs:
      - source_labels: [instance]
        separator: ;
        regex: (.+)
        target_label: node
        replacement: $1
        action: replace
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/$1/proxy/metrics/cadvisor
    - job_name: kubernetes-apiserver
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https
    - job_name: kubernetes-service-endpoints
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - action: keep
        regex: true
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_scrape
      - action: replace
        regex: (.+)
        source_labels:
        - __meta_kubernetes_pod_annotation_prometheus_io_path
        target_label: __metrics_path__
      - action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        source_labels:
        - __address__
        - __meta_kubernetes_pod_annotation_prometheus_io_port
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - action: replace
        source_labels:
        - __meta_kubernetes_namespace
        target_label: kubernetes_namespace
      - action: replace
        source_labels:
        - __meta_kubernetes_pod_name
        target_label: kubernetes_pod_name
    - job_name: kubernetes-schedule
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:10251']
    - job_name: kubernetes-controller-manager
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:10252']
    - job_name: ccse-node-exporter
      static_configs:
      - targets: ["192.168.40.180:19100","192.168.40.181:19100","192.168.40.182:19100"]
    #- job_name: kubernetes-kube-proxy
    #  scrape_interval: 5s
    #  static_configs:
    #  - targets: ['192.168.40.180:10249','192.168.40.181:10249']
    - job_name: kubernetes-etcd
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/ca.crt
        cert_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.crt
        key_file: /var/run/secrets/kubernetes.io/k8s-certs/etcd/server.key
      scrape_interval: 5s
      static_configs:
      - targets: ['192.168.40.180:2379']
  rules.yml: |
    groups:
    - name: example
      rules:
      - alert: kube-proxy CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: kube-proxy CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-kube-proxy"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: scheduler CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: scheduler CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-schedule"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: controller-manager CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-controller-manager"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: apiserver CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: apiserver CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-apiserver"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: etcd CPU usage above 80%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 80%"
      - alert: etcd CPU usage above 90%
        expr: rate(process_cpu_seconds_total{job=~"kubernetes-etcd"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.job }} component on {{ $labels.instance }} exceeds 90%"
      - alert: kube-state-metrics CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: kube-state-metrics CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-state-metrics"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: coredns CPU usage above 80%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 80%"
          value: "{{ $value }}%"
          threshold: "80%"
      - alert: coredns CPU usage above 90%
        expr: rate(process_cpu_seconds_total{k8s_app=~"kube-dns"}[1m]) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "CPU usage of the {{ $labels.k8s_app }} component on {{ $labels.instance }} exceeds 90%"
          value: "{{ $value }}%"
          threshold: "90%"
      - alert: kube-proxy open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-kube-proxy"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-schedule"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-schedule open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-schedule"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-controller-manager"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-apiserver"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 600
        expr: process_open_fds{job=~"kubernetes-etcd"} > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: kubernetes-etcd open file descriptors > 1000
        expr: process_open_fds{job=~"kubernetes-etcd"} > 1000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 600
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): more than 600 open file descriptors"
          value: "{{ $value }}"
      - alert: coredns
        expr: process_open_fds{k8s_app=~"kube-dns"} > 1000
        for: 2s
        labels:
          severity: critical
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): more than 1000 open file descriptors"
          value: "{{ $value }}"
      - alert: kube-proxy
        expr: process_virtual_memory_bytes{job=~"kubernetes-kube-proxy"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: scheduler
        expr: process_virtual_memory_bytes{job=~"kubernetes-schedule"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-controller-manager
        expr: process_virtual_memory_bytes{job=~"kubernetes-controller-manager"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-apiserver
        expr: process_virtual_memory_bytes{job=~"kubernetes-apiserver"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kubernetes-etcd
        expr: process_virtual_memory_bytes{job=~"kubernetes-etcd"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: kube-dns
        expr: process_virtual_memory_bytes{k8s_app=~"kube-dns"} > 2000000000
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Plugin {{ $labels.k8s_app }} ({{ $labels.instance }}): virtual memory usage exceeds 2G"
          value: "{{ $value }}"
      - alert: HttpRequestsAvg
        expr: sum(rate(rest_client_requests_total{job=~"kubernetes-kube-proxy|kubernetes-kubelet|kubernetes-schedule|kubernetes-control-manager|kubernetes-apiservers"}[1m])) > 1000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): TPS exceeds 1000"
          value: "{{ $value }}"
          threshold: "1000"
      - alert: Pod_restarts
        expr: kube_pod_container_status_restarts_total{namespace=~"ccse"} > 0
        for: 2s
        labels:
          severity: warning
        annotations:
          description: "Container {{ $labels.container }} of pod {{ $labels.pod }} in namespace {{ $labels.namespace }} was restarted (metric collected from {{ $labels.instance }})"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Pod_waiting
        expr: kube_pod_container_status_waiting_reason{namespace=~"kube-system|default"} == 1
        for: 10s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): container {{ $labels.container }} of pod {{ $labels.pod }} is stuck waiting to start"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Pod_terminated
        expr: kube_pod_container_status_terminated_reason{namespace=~"ccse"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): container {{ $labels.container }} of pod {{ $labels.pod }} has been terminated"
          value: "{{ $value }}"
          threshold: "1"
      - alert: Etcd_leader
        expr: etcd_server_has_leader{job="kubernetes-etcd"} == 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): currently has no leader"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_leader_changes
        expr: rate(etcd_server_leader_changes_seen_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): the leader has changed"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_failed
        expr: rate(etcd_server_proposals_failed_total{job="kubernetes-etcd"}[1m]) > 0
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): proposals are failing"
          value: "{{ $value }}"
          threshold: "0"
      - alert: Etcd_db_total_size
        expr: etcd_debugging_mvcc_db_total_size_in_bytes{job="kubernetes-etcd"} > 10000000000
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Component {{ $labels.job }} ({{ $labels.instance }}): db size exceeds 10G"
          value: "{{ $value }}"
          threshold: "10G"
      - alert: Endpoint_ready
        expr: kube_endpoint_address_not_ready{namespace=~"kube-system|default"} == 1
        for: 2s
        labels:
          team: admin
        annotations:
          description: "Namespace {{ $labels.namespace }} ({{ $labels.instance }}): endpoint {{ $labels.endpoint }} is unavailable"
          value: "{{ $value }}"
          threshold: "1"
    - name: physical node status alerts
      rules:
      - alert: Physical node CPU usage
        expr: 100 - avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} CPU usage is too high"
          description: "CPU usage on {{ $labels.instance }} exceeds 90% (current value: {{ $value }}); please investigate"
      - alert: Physical node memory usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} memory usage is too high"
          description: "Memory usage on {{ $labels.instance }} exceeds 90% (current value: {{ $value }}); please investigate"
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }}: server is down"
          description: "{{ $labels.instance }}: unreachable for more than 5 minutes"
      - alert: Physical node disk IO performance
        expr: 100 - (avg(irate(node_disk_io_time_seconds_total[1m])) by(instance) * 100) < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} disk IO usage is too high!"
          description: "{{ $labels.mountpoint }} disk IO is above 90% (current value: {{ $value }})"
      - alert: Inbound network bandwidth
        expr: ((sum(rate(node_network_receive_bytes_total{device!~"tap.*|veth.*|br.*|docker.*|virbr*|lo*"}[5m])) by (instance)) / 100) > 1024000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} inbound network bandwidth is too high!"
          description: "{{ $labels.mountpoint }} inbound bandwidth has stayed above 1000M for 5 minutes. RX bandwidth usage: {{ $value }}"
      - alert: Outbound network bandwidth
        expr: ((sum(rate(node_network_transmit_bytes_total{device!~"tap.*|veth.*|br.*|docker.*|virbr*|lo*"}[5m])) by (instance)) / 100) > 1024000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} outbound network bandwidth is too high!"
          description: "{{ $labels.mountpoint }} outbound bandwidth has stayed above 1000M for 5 minutes. TX bandwidth usage: {{ $value }}"
      #- alert: TCP sessions
      #  expr: node_netstat_Tcp_CurrEstab > 1000
      #  for: 2s
      #  labels:
      #    severity: critical
      #  annotations:
      #    summary: "{{ $labels.mountpoint }} too many TCP_ESTABLISHED sessions!"
      #    description: "{{ $labels.mountpoint }} TCP_ESTABLISHED above 1000 (current value: {{ $value }})"
      - alert: Disk capacity
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.mountpoint }} disk partition usage is too high!"
          description: "{{ $labels.mountpoint }} disk partition usage is above 80% (current value: {{ $value }}%)"
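The address-rewriting relabel rule above (regex `([^:]+)(?::\d+)?;(\d+)`, replacement `$1:$2`) joins the discovered address with the port annotation: Prometheus concatenates the two source labels with `;`, then the rewrite keeps the host and swaps in the annotated port. The same rewrite can be illustrated locally with sed (the sample address is made up; sed has no non-capturing `(?:...)` groups, so an ordinary group and `\3` stand in for it):

```shell
# __address__ is '10.244.1.5:8080', the prometheus.io/port annotation is
# '9100'; after concatenation with ';' the rule should yield 10.244.1.5:9100.
echo '10.244.1.5:8080;9100' | sed -E 's/^([^:]+)(:[0-9]+)?;([0-9]+)$/\1:\3/'
# → 10.244.1.5:9100
```

The optional `(:[0-9]+)?` group means the rewrite also works when the discovered address carries no port at all.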

prometheus-alertmanager-deploy.yaml

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-server
  namespace: ccse
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
      component: server
  template:
    metadata:
      labels:
        app: prometheus
        component: server
      annotations:
        prometheus.io/scrape: 'false'
    spec:
      nodeName: zcsnode1
      serviceAccountName: monitor
      containers:
      - name: prometheus
        image: prom/prometheus:v2.40.6
        imagePullPolicy: IfNotPresent
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=360h"
        - "--web.enable-lifecycle"
        ports:
        - containerPort: 9090
          protocol: TCP
        resources:
          requests:
            memory: "500Mi"
            cpu: "0.5"
          limits:
            memory: "8Gi"
            cpu: "8"
        volumeMounts:
        - mountPath: /etc/prometheus
          name: prometheus-config
        - mountPath: /prometheus/
          name: prometheus-storage-volume
        - name: k8s-certs
          mountPath: /var/run/secrets/kubernetes.io/k8s-certs/etcd/
      - name: alertmanager
        image: prom/alertmanager:v0.24.0
        imagePullPolicy: IfNotPresent
        args:
        - "--config.file=/etc/alertmanager/alertmanager.yml"
        - "--log.level=debug"
        ports:
        - containerPort: 9093
          protocol: TCP
          name: alertmanager
        resources:
          requests:
            memory: "500Mi"
            cpu: "0.5"
          limits:
            memory: "4Gi"
            cpu: "4"
        volumeMounts:
        - name: alertmanager-config
          mountPath: /etc/alertmanager
        - name: alertmanager-storage
          mountPath: /alertmanager
        - name: localtime
          mountPath: /etc/localtime
      volumes:
      - name: prometheus-config
        configMap:
          name: prometheus-config
      - name: prometheus-storage-volume
        hostPath:
          path: /data/app/prometheus_data
          type: Directory
      - name: k8s-certs
        secret:
          secretName: etcd-certs
      - name: alertmanager-config
        configMap:
          name: alertmanager
      - name: alertmanager-storage
        hostPath:
          path: /data/app/alertmanager-storage
          type: DirectoryOrCreate
      - name: localtime
        hostPath:
          path: /usr/share/zoneinfo/Asia/Shanghai

Deploy Grafana

Set the memory limits according to your actual needs

[root@zcsmaster1 ~]# cat grafana.yaml 
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: ccse
  labels:
    app: grafana
  name: grafana
spec:
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      securityContext:
        fsGroup: 472
        supplementalGroups:
        - 0
      nodeName: zcsnode1
      containers:
      - name: grafana
        image: grafana/grafana:8.5.15
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
          name: http-grafana
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /robots.txt
            port: 3000
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 30
          successThreshold: 1
          timeoutSeconds: 2
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 30
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 3000
          timeoutSeconds: 1
        resources:
          requests:
            cpu: 250m
            memory: 750Mi
          limits:
            memory: "4Gi"
            cpu: "4"
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage-volume
      volumes:
      - name: grafana-storage-volume
        hostPath:
          path: /data/app/grafana_data
          type: Directory
---
apiVersion: v1
kind: Service
metadata:
  namespace: ccse
  name: grafana
spec:
  ports:
  - nodePort: 32673
    port: 3000
    protocol: TCP
    targetPort: http-grafana
  selector:
    app: grafana
  sessionAffinity: None
  type: LoadBalancer

Access verification

[root@zcsmaster1 ok]# kubectl  get svc -n ccse | grep -E "prometheus|grafana|alertmanager"
alertmanager NodePort 10.103.186.163 <none> 9093:30066/TCP 119s
grafana LoadBalancer 10.106.90.16 <pending> 3000:32673/TCP 11h
prometheus NodePort 10.99.102.44 <none> 9090:30263/TCP 119s


http://192.168.40.180:30263/
http://192.168.40.180:32673/login
http://192.168.40.180:30066/#/alerts


Problem 1

From the targets page above, kubernetes-controller-manager and kubernetes-schedule both show that their ports cannot be reached.

This can be fixed as follows:
vim /etc/kubernetes/manifests/kube-scheduler.yaml
Make the following changes:
Change --bind-address=127.0.0.1 to --bind-address=192.168.40.180
Under the httpGet: field, change host from 127.0.0.1 to 192.168.40.180
Delete --port=0
# Note: 192.168.40.180 is the IP of the k8s control node xianchaomaster1
vim /etc/kubernetes/manifests/kube-controller-manager.yaml
Change --bind-address=127.0.0.1 to --bind-address=192.168.40.180
Under the httpGet: field, change host from 127.0.0.1 to 192.168.40.180
Delete --port=0
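The manual edits above can also be scripted. The sketch below demonstrates the same three changes on a throwaway copy in /tmp so it can run anywhere; on a real control-plane node you would point the same sed at /etc/kubernetes/manifests/kube-scheduler.yaml (and the controller-manager manifest), after backing it up. The sample snippet is a made-up fragment, not the full manifest.

```shell
# Minimal stand-in for the relevant lines of the kube-scheduler manifest.
cat > /tmp/kube-scheduler.yaml <<'EOF'
    - --bind-address=127.0.0.1
    - --port=0
        host: 127.0.0.1
EOF
# Rewrite the bind address, drop --port=0, and fix the probe host in place.
sed -i \
  -e 's/--bind-address=127\.0\.0\.1/--bind-address=192.168.40.180/' \
  -e '/--port=0/d' \
  -e 's/host: 127\.0\.0\.1/host: 192.168.40.180/' \
  /tmp/kube-scheduler.yaml
cat /tmp/kube-scheduler.yaml
```

Because the manifests live under /etc/kubernetes/manifests, the kubelet re-creates the static pods automatically once the files change.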

After the changes, run the following on every k8s node:
systemctl restart kubelet

kubectl get cs
The output looks like this:
NAME                 STATUS    MESSAGE   ERROR
controller-manager   Healthy   ok
scheduler            Healthy   ok
etcd-0               Healthy   "health":"true"

ss -antulp | grep :10251
ss -antulp | grep :10252
You can see that the corresponding ports are now listening on the host.
Click Status -> Targets, and the targets show up as below.


Grafana configuration

Data source configuration

http://prometheus.ccse.svc:9090


Import a Prometheus dashboard



Alertmanager configuration

Email alerting


The next article will explain DingTalk alerting in detail.

