Unable to add a K8s service as a Prometheus target

Posted: 2021-09-07 21:14:46

I want my Prometheus server to scrape metrics from a pod.

I followed these steps:

1. Created a pod via a Deployment: kubectl apply -f sample-app.deploy.yaml
2. Exposed it with a Service: kubectl apply -f sample-app.service.yaml
3. Deployed the Prometheus server: helm upgrade -i prometheus prometheus-community/prometheus -f prometheus-values.yaml
4. Created a ServiceMonitor to add a target for Prometheus: kubectl apply -f service-monitor.yaml
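For a ServiceMonitor to select a Service, the Service must carry the labels the ServiceMonitor matches on and expose a named port. A hypothetical sample-app.service.yaml consistent with the ServiceMonitor shown later in this question might look like this (the pod selector and port numbers are assumptions, since the actual file is not shown):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sample-app
  namespace: prom
  labels:
    app: sample-app
    release: prometheus   # must match the ServiceMonitor's matchLabels
spec:
  selector:
    app: sample-app       # assumed pod label from the Deployment
  ports:
  - name: http            # the ServiceMonitor references this port by name
    port: 8080
    targetPort: 8080
```

If the labels or the port name do not line up, the ServiceMonitor silently matches nothing.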

All pods are running, but when I open the Prometheus dashboard I do not see the sample-app service as a Prometheus target under Status > Targets in the dashboard UI.

I have verified the following:

- When I run kubectl get servicemonitors, I can see sample-app
- sample-app exposes metrics in Prometheus format at /metrics
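After kubectl port-forward, a quick way to sanity-check that an endpoint really serves the Prometheus text exposition format is a small line checker. This is an illustrative sketch, not the official parser: it only recognizes the common "# HELP/# TYPE" comment lines and "name{labels} value" sample lines.

```python
import re

# One sample line: metric name, optional {labels}, then a numeric value
# (or Inf/NaN). This is a simplification of the real exposition grammar.
METRIC_LINE = re.compile(
    r"^[a-zA-Z_:][a-zA-Z0-9_:]*"            # metric name
    r"(\{[^}]*\})?"                          # optional label set
    r"\s+(-?\d+(\.\d+)?([eE][+-]?\d+)?|[+-]?Inf|NaN)$"  # value
)

def looks_like_prometheus_exposition(text: str) -> bool:
    """Return True if every non-empty line is a comment or a metric sample."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    for ln in lines:
        if ln.startswith("#"):  # HELP/TYPE comments
            continue
        if not METRIC_LINE.match(ln):
            return False
    return True

sample = """# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
process_cpu_seconds_total 12.47
"""
print(looks_like_prometheus_exposition(sample))  # True
```

A failing check usually means the endpoint is serving JSON or HTML rather than the text format Prometheus expects.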

To debug further, I exec'd into the Prometheus pod with kubectl exec -it pod/prometheus-server-65b759cb95-dxmkm -c prometheus-server sh and saw that the Prometheus config (/etc/config/prometheus.yml) did not have sample-app as one of its jobs. So I edited the ConfigMap with

kubectl edit cm prometheus-server -o yaml

and added:

    - job_name: sample-app
      static_configs:
      - targets:
        - sample-app:8080

Assume all other fields, such as scrape_interval and scrape_timeout, keep their default values.
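With those defaults written out explicitly, the intended scrape_configs entry would look roughly like this (the interval/timeout values shown are the Prometheus global defaults, included here only for illustration):

```yaml
scrape_configs:
- job_name: sample-app
  scrape_interval: 1m     # Prometheus default
  scrape_timeout: 10s     # Prometheus default
  metrics_path: /metrics  # default
  static_configs:
  - targets:
    - sample-app:8080     # bare Service name resolves only from the same namespace
```

Note that a bare target like sample-app:8080 relies on cluster DNS resolving the Service name from the Prometheus pod's namespace; from another namespace it would need to be sample-app.<namespace>.svc:8080.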

I can see the same change reflected in /etc/config/prometheus.yml, but the Prometheus dashboard still does not show sample-app as a target under Status > Targets.

Below are the yamls for the prometheus-server Deployment and the ServiceMonitor.

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    autopilot.gke.io/resource-adjustment: '{"input":{"containers":[{"name":"prometheus-server-configmap-reload"},{"name":"prometheus-server"}]},"output":{"containers":[{"limits":{"cpu":"500m","ephemeral-storage":"1Gi","memory":"2Gi"},"requests":{"cpu":"500m","ephemeral-storage":"1Gi","memory":"2Gi"},"name":"prometheus-server-configmap-reload"},{"limits":{"cpu":"500m","ephemeral-storage":"1Gi","memory":"2Gi"},"requests":{"cpu":"500m","ephemeral-storage":"1Gi","memory":"2Gi"},"name":"prometheus-server"}],"modified":true}}'
    deployment.kubernetes.io/revision: "1"
    meta.helm.sh/release-name: prometheus
    meta.helm.sh/release-namespace: prom
  creationTimestamp: "2021-06-24T10:42:31Z"
  generation: 1
  labels:
    app: prometheus
    app.kubernetes.io/managed-by: Helm
    chart: prometheus-14.2.1
    component: server
    heritage: Helm
    release: prometheus
  name: prometheus-server
  namespace: prom
  resourceVersion: "6983855"
  selfLink: /apis/apps/v1/namespaces/prom/deployments/prometheus-server
  uid: <some-uid>
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: prometheus
      component: server
      release: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: prometheus
        chart: prometheus-14.2.1
        component: server
        heritage: Helm
        release: prometheus
    spec:
      containers:
      - args:
        - --volume-dir=/etc/config
        - --webhook-url=http://127.0.0.1:9090/-/reload
        image: jimmidyson/configmap-reload:v0.5.0
        imagePullPolicy: IfNotPresent
        name: prometheus-server-configmap-reload
        resources:
          limits:
            cpu: 500m
            ephemeral-storage: 1Gi
            memory: 2Gi
          requests:
            cpu: 500m
            ephemeral-storage: 1Gi
            memory: 2Gi
        securityContext:
          capabilities:
            drop:
            - NET_RAW
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
          readOnly: true
      - args:
        - --storage.tsdb.retention.time=15d
        - --config.file=/etc/config/prometheus.yml
        - --storage.tsdb.path=/data
        - --web.console.libraries=/etc/prometheus/console_libraries
        - --web.console.templates=/etc/prometheus/consoles
        - --web.enable-lifecycle
        image: quay.io/prometheus/prometheus:v2.26.0
        imagePullPolicy: IfNotPresent
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/healthy
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 15
          successThreshold: 1
          timeoutSeconds: 10
        name: prometheus-server
        ports:
        - containerPort: 9090
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /-/ready
            port: 9090
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          successThreshold: 1
          timeoutSeconds: 4
        resources:
          limits:
            cpu: 500m
            ephemeral-storage: 1Gi
            memory: 2Gi
          requests:
            cpu: 500m
            ephemeral-storage: 1Gi
            memory: 2Gi
        securityContext:
          capabilities:
            drop:
            - NET_RAW
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
        - mountPath: /data
          name: storage-volume
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 65534
        runAsGroup: 65534
        runAsNonRoot: true
        runAsUser: 65534
        seccompProfile:
          type: RuntimeDefault
      serviceAccount: prometheus-server
      serviceAccountName: prometheus-server
      terminationGracePeriodSeconds: 300
      volumes:
      - configMap:
          defaultMode: 420
          name: prometheus-server
        name: config-volume
      - name: storage-volume
        persistentVolumeClaim:
          claimName: prometheus-server
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2021-06-24T10:43:25Z"
    lastUpdateTime: "2021-06-24T10:43:25Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2021-06-24T10:42:31Z"
    lastUpdateTime: "2021-06-24T10:43:25Z"
    message: ReplicaSet "prometheus-server-65b759cb95" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1

The ServiceMonitor yaml:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"monitoring.coreos.com/v1","kind":"ServiceMonitor","metadata":{"annotations":{},"creationTimestamp":"2021-06-24T07:55:58Z","generation":1,"labels":{"app":"sample-app","release":"prometheus"},"name":"sample-app","namespace":"prom","resourceVersion":"6884573","selfLink":"/apis/monitoring.coreos.com/v1/namespaces/prom/servicemonitors/sample-app","uid":"34644b62-eb4f-4ab1-b9df-b22811e40b4c"},"spec":{"endpoints":[{"port":"http"}],"selector":{"matchLabels":{"app":"sample-app","release":"prometheus"}}}}
  creationTimestamp: "2021-06-24T07:55:58Z"
  generation: 2
  labels:
    app: sample-app
    release: prometheus
  name: sample-app
  namespace: prom
  resourceVersion: "6904642"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/prom/servicemonitors/sample-app
  uid: <some-uid>
spec:
  endpoints:
  - port: http
  selector:
    matchLabels:
      app: sample-app
      release: prometheus 

Comments:

- Have you tried port-forwarding your sample app and fetching the /metrics endpoint that Prometheus needs to scrape? Is your /metrics endpoint reachable and working?
- Yes. The pod serves metrics in Prometheus format at the /metrics endpoint; verified via port forwarding.
- Does your service have endpoints? Try kubectl get endpoints and check the output.
- @meaningqo Yes, the service has endpoints. I can curl --request GET --url 'http://my_endpoint_ip:8080/metrics'
- If you are running the Prometheus Operator with ServiceMonitors, you do not need to edit the ConfigMap manually.

Answer 1:

You need to use the prometheus-community/kube-prometheus-stack chart, which includes the Prometheus Operator, in order to have Prometheus's configuration updated automatically based on ServiceMonitor resources.

The prometheus-community/prometheus chart you used does not include the Prometheus Operator, which watches ServiceMonitor resources in the Kubernetes API and updates the Prometheus server's ConfigMap accordingly.

It seems you have the necessary CustomResourceDefinitions (CRDs) installed in your cluster, otherwise you would not have been able to create a ServiceMonitor resource. They are not included in the prometheus-community/prometheus chart, so they were probably added to your cluster at some earlier point.
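A minimal migration sketch, assuming the release name "prometheus" and namespace "prom" from the question (these commands need a live cluster and your values may differ):

```shell
# Remove the operator-less chart first, then install the stack
# that bundles the Prometheus Operator.
helm uninstall prometheus -n prom
helm install prometheus prometheus-community/kube-prometheus-stack -n prom

# The operator watches ServiceMonitor objects; confirm yours exists:
kubectl get servicemonitors -n prom
```

The existing ServiceMonitor's release: prometheus label should match the stack's default serviceMonitorSelector when the release is named prometheus, so the target should then appear under Status > Targets without any manual ConfigMap edits.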

Discussion:

- I am running these workloads on a GKE Autopilot cluster, and deploying prometheus-community/kube-prometheus-stack fails with a "mutatingwebhookconfigurations access denied" error. It looks like that is a GKE Autopilot limitation. Let me try a Standard cluster.
- I tried your suggestion on a Standard cluster and it works.
