k8s-prometheus
Prometheus on Kubernetes
Collecting data
node-exporter
vi node-exporter-ds.yml
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
      - image: prom/node-exporter
        name: node-exporter
        ports:
        - containerPort: 9100
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime
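A quick way to verify the collectors (not part of the original post) is to apply the DaemonSet and confirm one node-exporter pod per node is running; since it uses hostNetwork, every node should answer on port 9100. The node IP below is a placeholder.
kubectl apply -f node-exporter-ds.yml
kubectl get pods -o wide -l app=node-exporter
curl -s http://<node-ip>:9100/metrics | head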
Storage: create a 10Gi NFS-backed PersistentVolume.
vi prometheus-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-prometheus
  labels:
    app: gwj-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-prometheus
    server: 10.1.99.1
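Assuming the NFS export /storage/gwj-prometheus already exists on 10.1.99.1, apply the manifest and check that the PV shows up as Available:
kubectl apply -f prometheus-pv.yaml
kubectl get pv gwj-pv-prometheus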
PersistentVolumeClaim: request 5Gi from the PV created above.
vi prometheus-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-prometheus-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      app: gwj-pv
  storageClassName: slow
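Apply it and confirm the claim binds to the PV above (both use storageClassName slow and the app: gwj-pv label selector):
kubectl apply -f prometheus-pvc.yaml
kubectl get pvc gwj-prometheus-pvc -n gwj    # STATUS should become Bound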
Set up RBAC permissions for Prometheus. Applying the manifest below with kubectl apply -f prometheus-rbac.yml produces:
clusterrole.rbac.authorization.k8s.io/gwj-prometheus-clusterrole created
serviceaccount/gwj-prometheus created
clusterrolebinding.rbac.authorization.k8s.io/gwj-prometheus-rolebinding created
vi prometheus-rbac.yml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: gwj-prometheus-clusterrole
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: gwj
  name: gwj-prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: gwj-prometheus-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gwj-prometheus-clusterrole
subjects:
- kind: ServiceAccount
  name: gwj-prometheus
  namespace: gwj
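As an optional sanity check (not from the original post), ask the API server whether the new ServiceAccount is allowed to list pods:
kubectl auth can-i list pods --as=system:serviceaccount:gwj:gwj-prometheus    # expect: yes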
Create the Prometheus configuration as a ConfigMap.
vi prometheus-cm.yml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gwj-prometheus-cm
  namespace: gwj
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["gwj-alertmanger-svc:80"]
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    scrape_configs:
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
      - target_label: __address__
        replacement: kubernetes.default.svc:443
    - job_name: 'kubernetes-node-exporter'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
  rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - alert: InstanceDown
        expr: up{job="kubernetes-node-exporter"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
      - alert: StatefulSetReplicasMismatch
        annotations:
          summary: "Replicas mismatch"
          description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 3 minutes.
        expr: label_join(kube_statefulset_status_replicas_ready != kube_statefulset_replicas, "instance", "/", "namespace", "statefulset")
        for: 3m
        labels:
          severity: critical
      - alert: PodFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
          or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
          unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: DaemonSetRolloutStuck
        expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Only {{ $value }}% of desired pods scheduled and ready for daemonset {{ $labels.namespace }}/{{ $labels.daemonset }}
          summary: DaemonSet is missing pods
      - alert: DaemonSetsNotScheduled
        expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
          summary: Daemonsets are not scheduled correctly
      - alert: DaemonSetsMissScheduled
        expr: kube_daemonset_status_number_misscheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
          summary: Daemonsets are not scheduled correctly
      - alert: Node_Boot_Time
        expr: (node_time_seconds - node_boot_time_seconds) <= 150
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} was rebooted less than 150s ago"
      - alert: Available_Percent
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes <= 0.2
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has less than 20% memory available"
      - alert: FD_Used_Percent
        expr: (node_filefd_allocated / node_filefd_maximum) >= 0.8
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has more than 80% of file descriptors in use"
Create Alertmanager for alerting; its Service name must match the gwj-alertmanger-svc target referenced in the ConfigMap above.
vi alertmanger.yml
kind: Service
apiVersion: v1
metadata:
  name: gwj-alertmanger-svc
  namespace: gwj
spec:
  selector:
    app: gwj-alert-pod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9093
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-alert-sts
  namespace: gwj
  labels:
    app: gwj-alert-sts
spec:
  replicas: 1
  serviceName: gwj-alertmanger-svc
  selector:
    matchLabels:
      app: gwj-alert-pod
  template:
    metadata:
      labels:
        app: gwj-alert-pod
    spec:
      containers:
      - image: prom/alertmanager:v0.14.0
        name: gwj-alert-pod
        ports:
        - containerPort: 9093
          protocol: TCP
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime
kubectl apply -f alertmanger.yml
service/gwj-alertmanger-svc created
statefulset.apps/gwj-alert-sts created
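Check that the Alertmanager pod is running and reachable through its Service (names as defined above):
kubectl get pods -n gwj -l app=gwj-alert-pod
kubectl get svc -n gwj gwj-alertmanger-svc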
Create Prometheus itself as a StatefulSet. The pod mounts the PVC gwj-prometheus-pvc at /prometheus for TSDB data and the ConfigMap gwj-prometheus-cm at /etc/prometheus/ for the configuration and rules.
vi prometheus-sts.yml
kind: Service
apiVersion: v1
metadata:
  name: gwj-prometheus-svc
  namespace: gwj
  labels:
    app: gwj-prometheus-svc
spec:
  ports:
  - port: 80
    targetPort: 9090
  selector:
    app: gwj-prometheus-pod
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-prometheus-sts
  namespace: gwj
  labels:
    app: gwj-prometheus-sts
spec:
  replicas: 1
  serviceName: gwj-prometheus-svc
  selector:
    matchLabels:
      app: gwj-prometheus-pod
  template:
    metadata:
      labels:
        app: gwj-prometheus-pod
    spec:
      containers:
      - image: prom/prometheus:v2.9.2
        name: gwj-prometheus-pod
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus/"
          name: config-volume
        - mountPath: "/etc/localtime"
          name: timezone
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2000Mi
      serviceAccountName: gwj-prometheus
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-prometheus-pvc
      - name: config-volume
        configMap:
          name: gwj-prometheus-cm
      # note: this volume is not mounted by the container; rules.yml already ships in gwj-prometheus-cm
      - name: gwj-prometheus-rule-cm
        configMap:
          name: gwj-prometheus-rule-cm
      - name: timezone
        hostPath:
          path: /etc/localtime
kubectl apply -f prometheus-sts.yml
service/gwj-prometheus-svc created
statefulset.apps/gwj-prometheus-sts created
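Before the Ingress exists you can verify the pod and peek at the UI with a port-forward (an optional check, not in the original post):
kubectl get pods -n gwj -l app=gwj-prometheus-pod
kubectl port-forward -n gwj statefulset/gwj-prometheus-sts 9090:9090
# then open http://localhost:9090/targets locally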
Create an Ingress to route each hostname to its Service.
vi prometheus-ingress.yml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: gwj
  annotations:
  name: gwj-ingress-prometheus
spec:
  rules:
  - host: gwj.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-prometheus-svc
          servicePort: 80
  - host: gwj-alert.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-alertmanger-svc
          servicePort: 80
kubectl apply -f prometheus-ingress.yml
ingress.extensions/gwj-ingress-prometheus created
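Verify the host rules were admitted and point both hostnames at your ingress controller via DNS or /etc/hosts; the controller address below is a placeholder:
kubectl get ingress -n gwj gwj-ingress-prometheus
curl -I -H 'Host: gwj.syncbug.com' http://<ingress-controller-ip>/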
Browse to the corresponding hostnames:
gwj.syncbug.com
Check that the scrape targets are up: http://gwj.syncbug.com/targets
Check that the loaded configuration is correct: http://gwj.syncbug.com/config
gwj-alert.syncbug.com shows the Alertmanager UI.
Grafana
vi grafana-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-grafana
  labels:
    app: gwj-pv-gra
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-grafana
    server: 10.1.99.1
vi grafana-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-grafana-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      app: gwj-pv-gra
  storageClassName: slow
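Apply both Grafana storage manifests and make sure the claim binds before creating the Deployment:
kubectl apply -f grafana-pv.yaml -f grafana-pvc.yaml
kubectl get pvc gwj-grafana-pvc -n gwj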
vi grafana-deployment.yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    name: grafana
  name: grafana
  namespace: gwj
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
      name: grafana
    spec:
      containers:
      - env:
        - name: GF_PATHS_DATA
          value: /var/lib/grafana/
        - name: GF_PATHS_PLUGINS
          value: /var/lib/grafana/plugins
        image: grafana/grafana:6.2.4
        imagePullPolicy: IfNotPresent
        name: grafana
        ports:
        - containerPort: 3000
          name: grafana
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana/
          name: data
        - mountPath: /etc/localtime
          name: localtime
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-grafana-pvc
      - name: localtime
        hostPath:
          path: /etc/localtime
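Apply the Deployment and wait for the rollout to finish:
kubectl apply -f grafana-deployment.yaml
kubectl rollout status deployment/grafana -n gwj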
vi grafana-ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: gwj
  annotations:
  name: gwj-ingress-grafana
spec:
  rules:
  - host: gwj-grafana.syncbug.com
    http:
      paths:
      - path: /
        backend:
          serviceName: gwj-grafana-svc
          servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: gwj-grafana-svc
  namespace: gwj
spec:
  selector:
    app: grafana
  ports:
  - protocol: TCP
    port: 80
    targetPort: 3000
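Apply the Ingress and Service, then confirm the host rule and endpoints:
kubectl apply -f grafana-ingress.yaml
kubectl get ingress gwj-ingress-grafana -n gwj
kubectl get svc gwj-grafana-svc -n gwj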
Open Grafana at gwj-grafana.syncbug.com.
Default credentials: admin / admin.
Add a Prometheus data source pointing at http://gwj-prometheus-svc:80.
Import a dashboard template.
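If you prefer to script the setup instead of clicking through the UI, Grafana's HTTP API can create the data source; the credentials and URL below simply follow the defaults above:
curl -u admin:admin -X POST -H 'Content-Type: application/json' \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://gwj-prometheus-svc:80","access":"proxy","isDefault":true}' \
  http://gwj-grafana.syncbug.com/api/datasources
A common starting point for node-exporter data is the "Node Exporter Full" dashboard from grafana.com (import by ID 1860).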