将 EKS 版本更新到 1.16 后，Prometheus & Alert Manager 不断崩溃

Posted 2023-02-16

技术标签:

【中文标题】将 EKS 版本更新到 1.16 后，Prometheus & Alert Manager 不断崩溃【英文标题】：Prometheus & Alert Manager keeps crashing after updating the EKS version to 1.16 【发布时间】：2021-07-24 00:21:31 【问题描述】：

prometheus-prometheus-kube-prometheus-prometheus-0 0/2 终止 0 4s alertmanager-prometheus-kube-prometheus-alertmanager-0 0/2 终止 0 10s

在将 EKS 集群从 1.15 更新到 1.16 后，除了这两个 pod 之外，一切正常，它们继续终止并且无法初始化。因此，prometheus 监控不起作用。在描述 pod 时，我遇到了以下错误。

Error: failed to start container "prometheus": Error response from daemon: OCI runtime create failed: container_linux.go:362: creating new parent process caused: container_linux.go:1941: running lstat on namespace path "/proc/29271/ns/ipc" caused: lstat /proc/29271/ns/ipc: no such file or directory: unknown
Error: failed to start container "config-reloader": Error response from daemon: cannot join network of a non running container: 7e139521980afd13dad0162d6859352b0b2c855773d6d4062ee3e2f7f822a0b3
Error: cannot find volume "config" to mount into container "config-reloader"
Error: cannot find volume "config" to mount into container "prometheus"

这是我用于部署的 yaml 文件：

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2021-04-30T16:39:14Z"
  deletionGracePeriodSeconds: 600
  deletionTimestamp: "2021-04-30T16:49:14Z"
  generateName: prometheus-prometheus-kube-prometheus-prometheus-
  labels:
    app: prometheus
    app.kubernetes.io/instance: prometheus-kube-prometheus-prometheus
    app.kubernetes.io/managed-by: prometheus-operator
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/version: 2.26.0
    controller-revision-hash: prometheus-prometheus-kube-prometheus-prometheus-56d9fcf57
    operator.prometheus.io/name: prometheus-kube-prometheus-prometheus
    operator.prometheus.io/shard: "0"
    prometheus: prometheus-kube-prometheus-prometheus
    statefulset.kubernetes.io/pod-name: prometheus-prometheus-kube-prometheus-prometheus-0
  name: prometheus-prometheus-kube-prometheus-prometheus-0
  namespace: mo
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: prometheus-prometheus-kube-prometheus-prometheus
    uid: 326a09f2-319c-449d-904a-1dd0019c6d80
  resourceVersion: "9337443"
  selfLink: /api/v1/namespaces/monitoring/pods/prometheus-prometheus-kube-prometheus-prometheus-0
  uid: e2be062f-749d-488e-a6cc-42ef1396851b
spec:
  containers:
  - args:
    - --web.console.templates=/etc/prometheus/consoles
    - --web.console.libraries=/etc/prometheus/console_libraries
    - --config.file=/etc/prometheus/config_out/prometheus.env.yaml
    - --storage.tsdb.path=/prometheus
    - --storage.tsdb.retention.time=10d
    - --web.enable-lifecycle
    - --storage.tsdb.no-lockfile
    - --web.external-url=http://prometheus-kube-prometheus-prometheus.monitoring:9090
    - --web.route-prefix=/
    image: quay.io/prometheus/prometheus:v2.26.0
    imagePullPolicy: IfNotPresent
    name: prometheus
    ports:
    - containerPort: 9090
      name: web
      protocol: TCP
    readinessProbe:
      failureThreshold: 120
      httpGet:
        path: /-/ready
        port: web
        scheme: HTTP
      periodSeconds: 5
      successThreshold: 1
      timeoutSeconds: 3
    resources: 
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/prometheus/config_out
      name: config-out
      readOnly: true
    - mountPath: /etc/prometheus/certs
      name: tls-assets
      readOnly: true
    - mountPath: /prometheus
      name: prometheus-prometheus-kube-prometheus-prometheus-db
    - mountPath: /etc/prometheus/rules/prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
      name: prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: prometheus-kube-prometheus-prometheus-token-mh66q
      readOnly: true
  - args:
    - --listen-address=:8080
    - --reload-url=http://localhost:9090/-/reload
    - --config-file=/etc/prometheus/config/prometheus.yaml.gz
    - --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml
    - --watched-dir=/etc/prometheus/rules/prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
    command:
    - /bin/prometheus-config-reloader
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: metadata.name
    - name: SHARD
      value: "0"
    image: quay.io/prometheus-operator/prometheus-config-reloader:v0.47.0
    imagePullPolicy: IfNotPresent
    name: config-reloader
    ports:
    - containerPort: 8080
      name: reloader-web
      protocol: TCP
    resources:
      limits:
        cpu: 100m
        memory: 50Mi
      requests:
        cpu: 100m
        memory: 50Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: FallbackToLogsOnError
    volumeMounts:
    - mountPath: /etc/prometheus/config
      name: config
    - mountPath: /etc/prometheus/config_out
      name: config-out
    - mountPath: /etc/prometheus/rules/prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
      name: prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: prometheus-kube-prometheus-prometheus-token-mh66q
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: prometheus-prometheus-kube-prometheus-prometheus-0
  nodeName: ip-10-1-49-45.ec2.internal
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 2000
    runAsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccount: prometheus-kube-prometheus-prometheus
  serviceAccountName: prometheus-kube-prometheus-prometheus
  subdomain: prometheus-operated
  terminationGracePeriodSeconds: 600
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: config
    secret:
      defaultMode: 420
      secretName: prometheus-prometheus-kube-prometheus-prometheus
  - name: tls-assets
    secret:
      defaultMode: 420
      secretName: prometheus-prometheus-kube-prometheus-prometheus-tls-assets
  - emptyDir: 
    name: config-out
  - configMap:
      defaultMode: 420
      name: prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
    name: prometheus-prometheus-kube-prometheus-prometheus-rulefiles-0
  - emptyDir: 
    name: prometheus-prometheus-kube-prometheus-prometheus-db
  - name: prometheus-kube-prometheus-prometheus-token-mh66q
    secret:
      defaultMode: 420
      secretName: prometheus-kube-prometheus-prometheus-token-mh66q
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-04-30T16:39:14Z"
    status: "True"
    type: PodScheduled
  phase: Pending
  qosClass: Burstable

【问题讨论】：

嗨，错误说，它在你的 ns 中没有找到秘密 prometheus-prometheus-kube-prometheus-prometheus 来挂载到这些容器中，你能检查一下秘密 kubectl get secrets 我尝试了这些秘密，但它仍然给我同样的错误。 【参考方案1】：

如果有人需要知道答案，在我的情况下（上述情况），有 2 个 Prometheus 运算符在不同的不同命名空间中运行，1 个在默认命名空间中，另一个在监控命名空间中。所以我从默认命名空间中删除了一个，它解决了我的 pod 崩溃问题。

【讨论】：

以上是关于将 EKS 版本更新到 1.16 后，Prometheus & Alert Manager 不断崩溃的主要内容，如果未能解决你的问题，请参考以下文章