Collecting Monitoring Metrics for Kubernetes Components
Posted by kevingrace
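A Kubernetes cluster is deployed in production, and the state of its components needs to be monitored from Zabbix. The component metrics fall into two groups: fixed metrics and dynamic metrics. Fixed metrics can be read on the command line from each component's metrics endpoint and are set up in Zabbix through auto-discovery. Dynamic metrics are collected by a Python script that returns a JSON string; in Zabbix, add a template or configure an auto-discovery rule on the host.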
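I. Fixed metric collection (Zabbix auto-discovery; recommended collection interval: 5 min)
1. Master metrics [scope: the 3 nodes of the Master cluster; 192.168.10.93/94/95 in the test environment]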
1) Metric key: kube_apiserver_process_cpu_seconds_total
   Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
2) Metric key: kube_apiserver_process_open_fds
   Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
3) Metric key: kube_apiserver_process_virtual_memory_bytes
   Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
4) Metric key: kube_apiserver_rest_client_requests_total_200_put
   Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'
5) Metric key: kube_apiserver_rest_client_requests_total_200_get
   Example command: curl -s --cacert kubernetes-ca/ca.pem --cert kubernetes-ca/admin.pem --key kubernetes-ca/admin-key.pem https://192.168.10.93:6443/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
6) Metric key: etcd_debugging_mvcc_db_total_size_in_bytes
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_debugging_mvcc_db_total_size_in_bytes | grep -v '#' | awk '{print $2}'
7) Metric key: etcd_server_has_leader
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_has_leader | grep -v '#' | awk '{print $2}'
8) Metric key: etcd_server_leader_changes_seen_total
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_leader_changes_seen_total | grep -v '#' | awk '{print $2}'
9) Metric key: etcd_server_proposals_failed_total
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep etcd_server_proposals_failed_total | grep -v '#' | awk '{print $2}'
10) Metric key: etcd_process_cpu_seconds_total
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
11) Metric key: etcd_process_open_fds
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
12) Metric key: etcd_process_virtual_memory_bytes
   Example command: curl -s --cacert etcd/ca.pem --cert etcd/healthcheck-client.pem --key etcd/healthcheck-client-key.pem https://192.168.10.93:2379/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
13) Metric key: kube_controller_manager_process_cpu_seconds_total
   Example command: curl -s 192.168.10.93:10252/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
14) Metric key: kube_controller_manager_process_open_fds
   Example command: curl -s 192.168.10.93:10252/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
15) Metric key: kube_controller_manager_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.93:10252/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
16) Metric key: kube_controller_manager_rest_client_requests_total_200_put
   Example command: curl -s 192.168.10.93:10252/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'
17) Metric key: kube_controller_manager_rest_client_requests_total_200_get
   Example command: curl -s 192.168.10.93:10252/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
18) Metric key: kube_scheduler_process_cpu_seconds_total
   Example command: curl -s 192.168.10.93:10251/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
19) Metric key: kube_scheduler_process_open_fds
   Example command: curl -s 192.168.10.93:10251/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
20) Metric key: kube_scheduler_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.93:10251/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
21) Metric key: kube_scheduler_rest_client_requests_total_200_put
   Example command: curl -s 192.168.10.93:10251/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'
22) Metric key: kube_scheduler_rest_client_requests_total_200_get
   Example command: curl -s 192.168.10.93:10251/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
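The unlabelled commands in this list all follow one pattern: keep lines containing the metric name, drop lines containing '#', and print the second whitespace-separated field. A minimal Python sketch of that same filter is shown here for readers who prefer to do the extraction inside a script. The function name metric_values and the SAMPLE text are hypothetical; the sample values are made up for illustration and are not taken from the cluster described in this article.

```python
def metric_values(exposition_text, name_fragment):
    """Roughly mimic: grep <fragment> | grep -v '#' | awk '{print $2}'."""
    values = []
    for line in exposition_text.splitlines():
        # grep <fragment>: substring match; grep -v '#': drop HELP/TYPE lines
        if name_fragment not in line or "#" in line:
            continue
        fields = line.split()
        if len(fields) >= 2:
            values.append(fields[1])  # awk '{print $2}'
    return values


# Hypothetical excerpt of a /metrics response (illustrative values only)
SAMPLE = """\
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1234.56
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 42
"""

if __name__ == "__main__":
    print(metric_values(SAMPLE, "process_open_fds"))
```

Because the match is a substring match, just like grep, a fragment that occurs in several metric names returns several values; passing the full metric name keeps the result to a single line where the endpoint allows it.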
2. Node metrics [scope: the 5 Node machines; 192.168.10.230/231/232/233/234 in the test environment]
1) Metric key: kubelet_docker_operations_errors_inspect_container
   Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep inspect_container | awk '{print $2}'
2) Metric key: kubelet_docker_operations_errors_inspect_image
   Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep inspect_image | awk '{print $2}'
3) Metric key: kubelet_docker_operations_errors_start_container
   Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep start_container | awk '{print $2}'
4) Metric key: kubelet_docker_operations_errors_stop_container
   Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_docker_operations_errors | grep -v '#' | grep stop_container | awk '{print $2}'
5) Metric key: kubelet_node_config_error
   Example command: curl -s 192.168.10.230:10255/metrics | grep kubelet_node_config_error | grep -v '#' | awk '{print $2}'
6) Metric key: kubelet_process_cpu_seconds_total
   Example command: curl -s 192.168.10.230:10255/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
7) Metric key: kubelet_process_open_fds
   Example command: curl -s 192.168.10.230:10255/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
8) Metric key: kubelet_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.230:10255/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
9) Metric key: kubelet_rest_client_requests_total_200_put
   Example command: curl -s 192.168.10.230:10255/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'
10) Metric key: kubelet_rest_client_requests_total_200_get
   Example command: curl -s 192.168.10.230:10255/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
11) Metric key: kube_proxy_process_cpu_seconds_total
   Example command: curl -s 192.168.10.230:10249/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
12) Metric key: kube_proxy_process_open_fds
   Example command: curl -s 192.168.10.230:10249/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
13) Metric key: kube_proxy_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.230:10249/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
14) Metric key: kube_proxy_rest_client_requests_total_200_put
   Example command: curl -s 192.168.10.230:10249/metrics | grep rest_client_requests_total | grep -v '#' | grep PUT | grep 200 | awk '{print $2}'
15) Metric key: kube_proxy_rest_client_requests_total_200_get
   Example command: curl -s 192.168.10.230:10249/metrics | grep rest_client_requests_total | grep -v '#' | grep GET | grep 200 | awk '{print $2}'
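The rest_client_requests_total commands select a series with grep PUT | grep 200, which matches those strings anywhere on the line, including in the host label or in the sample value itself. The sketch below shows one possible alternative that compares label values exactly. It is not the author's method: select_samples and SAMPLE are hypothetical names, the sample values are invented, and the label names code, host and method are assumed to be the ones the component exposes.

```python
import re

# Matches label pairs such as code="200" inside the braces of a sample line
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def select_samples(exposition_text, metric_name, **wanted_labels):
    """Return values of metric_name samples whose labels equal wanted_labels."""
    results = []
    for line in exposition_text.splitlines():
        if line.startswith("#") or not line.startswith(metric_name + "{"):
            continue
        label_part, _, value = line.rpartition("} ")
        labels = dict(LABEL_RE.findall(label_part))
        if all(labels.get(k) == v for k, v in wanted_labels.items()):
            results.append(value.strip())
    return results


# Hypothetical excerpt (illustrative values only); the last sample has the
# value 200, so a plain "grep GET | grep 200" would also pick it up.
SAMPLE = """\
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="192.168.10.93:6443",method="GET"} 5200
rest_client_requests_total{code="200",host="192.168.10.93:6443",method="PUT"} 317
rest_client_requests_total{code="404",host="192.168.10.93:6443",method="GET"} 200
"""

if __name__ == "__main__":
    print(select_samples(SAMPLE, "rest_client_requests_total",
                         code="200", method="GET"))
```

Matching on parsed labels keeps each Zabbix item tied to exactly one series, at the cost of a few more lines than the grep pipeline.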
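3. Cluster-wide metrics [collecting from any one node of the Node cluster is enough; in the test environment, use 192.168.10.230. If the node being queried goes down, collection of these metrics fails. To guard against that, consider changing the target IP to the nginx-Ingress IP or an internal Service IP]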
1) Metric key: coredns_process_cpu_seconds_total
   Example command: curl -s 192.168.10.230:9153/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
2) Metric key: coredns_process_open_fds
   Example command: curl -s 192.168.10.230:9153/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
3) Metric key: coredns_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.230:9153/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
4) Metric key: kube_state_metrics_metrics_process_cpu_seconds_total
   Example command: curl -s 192.168.10.230:8081/metrics | grep process_cpu_seconds_total | grep -v '#' | awk '{print $2}'
5) Metric key: kube_state_metrics_metrics_process_open_fds
   Example command: curl -s 192.168.10.230:8081/metrics | grep process_open_fds | grep -v '#' | awk '{print $2}'
6) Metric key: kube_state_metrics_metrics_process_virtual_memory_bytes
   Example command: curl -s 192.168.10.230:8081/metrics | grep process_virtual_memory_bytes | grep -v '#' | awk '{print $2}'
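The note in the heading of this subsection recommends pointing the collector at an Ingress or Service IP so that one dead node does not break collection. Where that is not yet available, a different stopgap is to try several node addresses in order. The sketch below illustrates only that fallback idea; it is not the approach the article recommends. The names first_reachable, http_fetch and fake_fetch are hypothetical, and it assumes the same port answers on the other node addresses.

```python
import urllib.request


def http_fetch(url, timeout=5):
    """Fetch a metrics page over plain HTTP and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def first_reachable(endpoints, fetch=http_fetch):
    """Return (endpoint, body) for the first endpoint that answers."""
    errors = {}
    for endpoint in endpoints:
        try:
            return endpoint, fetch(endpoint)
        except OSError as exc:  # URLError and socket timeouts are OSError
            errors[endpoint] = exc
    raise RuntimeError("no endpoint reachable: %r" % errors)


def fake_fetch(url):
    """Stand-in fetcher for the demo: pretends 192.168.10.230 is down."""
    if "192.168.10.230" in url:
        raise OSError("node down")
    return "process_open_fds 12\n"


if __name__ == "__main__":
    candidates = [
        "http://192.168.10.230:9153/metrics",
        "http://192.168.10.231:9153/metrics",
    ]
    print(first_reachable(candidates, fetch=fake_fetch))
```

The fetcher is passed in as a parameter so the fallback logic can be exercised without a live cluster; in real use the default http_fetch would be left in place.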
II. Dynamic metric collection
Python script for dynamic metric collection (the individual dynamic-metric collection scripts have been merged into this one script)
# cat zabbix-metrics-find.py
#!/usr/bin/env python
# coding:utf-8
import json
import os
import re
import sys

# kube-state-metrics auto-discovery for Zabbix
# Pass value/values (case-insensitive) to print the monitored values;
# any other argument, or no argument, prints the monitoring keys
# Scope: any Node machine; 192.168.10.230 can be used for testing. This IP
# should later be changed to the Nginx-Ingress load-balancer IP or an internal Service IP
# Recommended collection interval: 5 min
# Author: GaoKan
# Created: 2019-5-22
# Updated:

def main():
    ip = '192.168.10.230'
    flag = 'key'
    if len(sys.argv) > 1:
        if sys.argv[1].lower() in ('value', 'values'):
            flag = 'value'
    keys = []
    values = []
    # Metric name -> short key prefix and the labels appended to the key
    metrics_dict = {
        # DaemonSet-Metrics
        'kube_daemonset_status_number_misscheduled': {
            'forshort': 'ds_misscheduled',
            'tags': ['namespace', 'daemonset'],
        },
        'kube_daemonset_status_number_unavailable': {
            'forshort': 'ds_unavailable',
            'tags': ['namespace', 'daemonset'],
        },
        # Deployment-Metrics
        'kube_deployment_status_replicas_unavailable': {
            'forshort': 'deploy_unavailable',
            'tags': ['namespace', 'deployment'],
        },
        # Pod-Metrics
        'kube_pod_container_status_waiting_reason': {
            'forshort': 'po_cntr_waiting_reason',
            'tags': ['namespace', 'pod', 'container', 'reason'],
        },
        'kube_pod_container_status_terminated_reason': {
            'forshort': 'po_cntr_terminated_reason',
            'tags': ['namespace', 'pod', 'container', 'reason'],
        },
        'kube_pod_container_status_restarts_total': {
            'forshort': 'po_cntr_restarts_total',
            'tags': ['namespace', 'pod', 'container'],
        },
        # ReplicaSet-Metrics
        'kube_replicaset_status_ready_replicas': {
            'forshort': 'rs_ready_replicas',
            'tags': ['namespace', 'replicaset'],
        },
        'kube_replicaset_status_replicas': {
            'forshort': 'rs_replicas',
            'tags': ['namespace', 'replicaset'],
        },
        # Endpoint-Metrics
        'kube_endpoint_address_not_ready': {
            'forshort': 'ep_not_ready',
            'tags': ['namespace', 'endpoint'],
        },
    }
    metrics = os.popen('curl -s ' + ip + ':8080/metrics')
    for row in metrics:
        # Skip HELP/TYPE comment lines
        if row.startswith('#'):
            continue
        pos1 = row.find('{')
        pos2 = row.find('}')
        name = row[:pos1]
        if name in metrics_dict:
            # Build the key: short prefix followed by each label value
            key = metrics_dict[name]['forshort']
            for tag in metrics_dict[name]['tags']:
                key += '_' + re.search(tag + r'="(.*?)"', row[pos1 + 1:pos2]).group(1)
            keys.append({"{#METRICSNAME}": key})
            # The value follows "} " and runs up to the trailing newline
            values.append({"{#METRICSVALUE}": row[pos2 + 2:-1]})
    if flag == 'value':
        print(json.dumps({"data": values}, indent=4))
    else:
        print(json.dumps({"data": keys}, indent=4))

if __name__ == "__main__":
    main()
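To make the key-building step of the script easier to follow, here is a small standalone sketch that turns a single metrics row into a (key, value) pair. It follows the same idea as the script but is not a copy of it: row_to_item, METRICS and SAMPLE_ROW are hypothetical names, the sample row is written by hand, and unlike the script it returns None instead of raising when an expected label is missing.

```python
import re

# One entry in the same shape as the script's metrics_dict
METRICS = {
    "kube_pod_container_status_restarts_total": {
        "forshort": "po_cntr_restarts_total",
        "tags": ["namespace", "pod", "container"],
    },
}


def row_to_item(row, metrics=METRICS):
    """Return (key, value) for a known metric row, else None."""
    if row.startswith("#") or "{" not in row:
        return None
    name, _, rest = row.partition("{")
    spec = metrics.get(name)
    if spec is None:
        return None
    labels, _, value = rest.rpartition("}")
    key = spec["forshort"]
    for tag in spec["tags"]:
        # Anchor on start-of-string or a comma so "pod" cannot match
        # inside a longer label name
        match = re.search(r'(?:^|,)' + re.escape(tag) + r'="(.*?)"', labels)
        if match is None:
            return None
        key += "_" + match.group(1)
    return key, value.strip()


# Hand-written sample row (illustrative only)
SAMPLE_ROW = ('kube_pod_container_status_restarts_total{namespace="kube-system",'
              'pod="coredns-5cbf6655f-6wxqz",container="coredns"} 0\n')

if __name__ == "__main__":
    print(row_to_item(SAMPLE_ROW))
```

Returning the key and value together from one row keeps the two from drifting apart, which matters when keys and values are later matched up.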
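Run the script; it returns a JSON string. The result shows the running state of all Kubernetes object resources (pod, deploy, service, and so on); depending on how many workloads are running, there may be hundreds or thousands of entries.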
# python zabbix-metrics-find.py |head -30
{
    "data": [
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-005"
        },
        {
            "{#METRICSNAME}": "ds_misscheduled_cattle-system_cattle-node-agent"
        },
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-001"
        },
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-002"
        },
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-003"
        },
        {
            "{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-004"
        },
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-003"
        },
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-004"
        },
        {
            "{#METRICSNAME}": "ds_unavailable_test-rg_test-rg-005"
        },
        ...................
        ...................
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-jvkm6_test-rg-005"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_cattle-system_cattle-node-agent-mdl9x_agent"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-wpsbq_test-rg-005"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-004-9s57x_test-rg-004"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_test-rg_test-rg-005-wxk54_test-rg-005"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_cattle-system_cattle-node-agent-r46bz_agent"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_default_mysql-ceph-test-76697d98d6-4gj9v_mysql-ceph-test"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_coredns-5cbf6655f-6wxqz_coredns"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_kube-state-metrics-576fbb446d-ctl4p_addon-resizer"
        },
        {
            "{#METRICSNAME}": "po_cntr_restarts_total_kube-system_kube-state-metrics-576fbb446d-ctl4p_kube-state-metrics"
        },
        ...................
        ...................
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-5c689d88bb"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_two-test_aicase-docker-5784b5749b"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_cattle-system_cattle-cluster-agent-d59dbdb55"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-589dcbcbd6"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_test_nginx-5b677cdf4f"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_default_mysql-ceph-test-76697d98d6"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_kube-system_kube-state-metrics-75bbc44548"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_kube-system_traefik-ingress-controller-6db4877748"
        },
        {
            "{#METRICSNAME}": "rs_ready_replicas_two-test_aicase-docker-57d445cbf"
        }
    ]
}
Querying the values:
# python zabbix-metrics-find.py values
{
    "data": [
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        .................
        .................
        {
            "{#METRICSVALUE}": "1"
        },
        {
            "{#METRICSVALUE}": "27"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "3"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        .................
        .................
        {
            "{#METRICSVALUE}": "1"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "2"
        },
        {
            "{#METRICSVALUE}": "1"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "0"
        },
        {
            "{#METRICSVALUE}": "2"
        },
        {
            "{#METRICSVALUE}": "0"
        }
    ]
}
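The key list and the value list are printed by separate runs, and entries correspond only by position. The sketch below shows one way the two JSON documents could be joined into a name-to-value mapping. pair_discovery_output, KEYS_JSON and VALUES_JSON are hypothetical names, and the pairing of these particular names with these values is invented for illustration. Since pods and ReplicaSets can change between two runs, the sketch refuses to pair lists of different length; equal length alone does not prove the order is unchanged.

```python
import json


def pair_discovery_output(keys_json, values_json):
    """Join the script's key output and value output by position."""
    keys = [item["{#METRICSNAME}"] for item in json.loads(keys_json)["data"]]
    values = [item["{#METRICSVALUE}"] for item in json.loads(values_json)["data"]]
    if len(keys) != len(values):
        raise ValueError("key/value counts differ: %d vs %d"
                         % (len(keys), len(values)))
    return dict(zip(keys, values))


# Illustrative inputs in the same shape as the two outputs of the script
KEYS_JSON = json.dumps({"data": [
    {"{#METRICSNAME}": "ds_misscheduled_test-rg_test-rg-005"},
    {"{#METRICSNAME}": "po_cntr_restarts_total_kube-system_coredns-5cbf6655f-6wxqz_coredns"},
]})
VALUES_JSON = json.dumps({"data": [
    {"{#METRICSVALUE}": "0"},
    {"{#METRICSVALUE}": "27"},
]})

if __name__ == "__main__":
    for name, value in pair_discovery_output(KEYS_JSON, VALUES_JSON).items():
        print(name, value)
```

A more robust variant would have a single run emit each name together with its value, so that no positional matching is needed at all.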