Scaling GKE pods based on number of active connections per pod

Posted: 2020-04-19 06:49:35

[Question]

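I have a running GKE cluster with an HPA that uses a target CPU utilization metric. This works, but CPU utilization is not the best scaling metric for us. Our analysis shows that the number of active connections is a good indicator of general platform load, so we would like to use it as our primary scaling metric.

To this end, I have enabled custom metrics for the NGINX ingress that we use. From these we can see the active connection count, request rate, and so on.

Here is the HPA specification using the NGINX custom metric: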

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-uat-active-connections
  namespace: default
spec:
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Pods
      pods:
        metricName: custom.googleapis.com|nginx-ingress-controller|nginx_ingress_controller_nginx_process_connections
        selector: 
          matchLabels:
            metric.labels.state: active
            resource.labels.cluster_name: "[redacted]"
        targetAverageValue: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "[redacted]"

However, although this specification deploys fine, I always get this output from the HPA:

NAME                         REFERENCE                                 TARGETS       MINPODS   MAXPODS   REPLICAS   AGE
hpa-uat-active-connections   Deployment/[redacted]                     <unknown>/5   3         6         3          31s

In short, the target value is "<unknown>", and so far I have been unable to understand or resolve why. The custom metric does exist:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/custom.googleapis.com|nginx-ingress-controller|nginx_ingress_controller_nginx_process_connections?labelSelector=metric.labels.state%3Dactive,resource.labels.cluster_name%3D[redacted]" | jq

This gives:


{
  "kind": "ExternalMetricValueList",
  "apiVersion": "external.metrics.k8s.io/v1beta1",
  "metadata": {
    "selfLink": "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/custom.googleapis.com%7Cnginx-ingress-controller%7Cnginx_ingress_controller_nginx_process_connections"
  },
  "items": [
    {
      "metricName": "custom.googleapis.com|nginx-ingress-controller|nginx_ingress_controller_nginx_process_connections",
      "metricLabels": {
        "metric.labels.controller_class": "nginx",
        "metric.labels.controller_namespace": "ingress-nginx",
        "metric.labels.controller_pod": "nginx-ingress-controller-54f84b8dff-sml6l",
        "metric.labels.state": "active",
        "resource.labels.cluster_name": "[redacted]",
        "resource.labels.container_name": "",
        "resource.labels.instance_id": "[redacted]-eac4b327-stqn",
        "resource.labels.namespace_id": "ingress-nginx",
        "resource.labels.pod_id": "nginx-ingress-controller-54f84b8dff-sml6l",
        "resource.labels.project_id": "[redacted]",
        "resource.labels.zone": "[redacted]",
        "resource.type": "gke_container"
      },
      "timestamp": "2019-12-30T14:11:01Z",
      "value": "1"
    }
  ]
}
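The metric name and label selector in the query above must be percent-encoded when they go into the raw API path (`|` becomes `%7C`, `=` becomes `%3D`), which is also what the `selfLink` in the response shows. A minimal Python sketch of assembling such a path follows. The helper name `external_metric_path` and the cluster name `my-cluster` are placeholders invented for illustration; they are not part of the original setup.

```python
from urllib.parse import quote


def external_metric_path(namespace, metric_name, labels):
    """Build a raw external-metrics API path for `kubectl get --raw`.

    Hypothetical helper, for illustration only.
    """
    # Join the label pairs as key=value, separated by commas (no spaces).
    selector = ",".join(f"{key}={value}" for key, value in labels.items())
    # safe="" forces '|', '=' and ',' to be percent-encoded as well.
    return (
        "/apis/external.metrics.k8s.io/v1beta1"
        f"/namespaces/{quote(namespace, safe='')}"
        f"/{quote(metric_name, safe='')}"
        f"?labelSelector={quote(selector, safe='')}"
    )


path = external_metric_path(
    "default",
    "custom.googleapis.com|nginx-ingress-controller|"
    "nginx_ingress_controller_nginx_process_connections",
    {
        "metric.labels.state": "active",
        "resource.labels.cluster_name": "my-cluster",  # placeholder value
    },
)
print(path)
```

The printed path can be passed to `kubectl get --raw "<path>" | jq` to confirm that the metric is visible to the cluster before wiring it into an HPA.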

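So I have two questions, really:

    1. (Main one) What am I doing wrong here that prevents the HPA from reading the metric?
    2. Is this the right way to scale on an average active-connection load across a number of pods?

Many thanks in advance, Ben

Edit 1

kubectl get all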

NAME                                                READY   STATUS    RESTARTS   AGE
pod/[redacted]-deployment-7f5fbc9ddf-l9tqk          1/1     Running   0          34h
pod/[redacted]-uat-deployment-7f5fbc9ddf-pbcns      1/1     Running   0          34h
pod/[redacted]-uat-deployment-7f5fbc9ddf-tjfrm      1/1     Running   0          34h

NAME                                TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
service/[redacted]-webapp-service   NodePort    [redacted]     <none>        [redacted]                   57d
service/kubernetes                  ClusterIP   [redacted]     <none>        [redacted]                   57d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/[redacted]-uat-deployment      3/3     3            3           57d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/[redacted]-uat-deployment-54b6bd5f9c      0         0         0       12d
replicaset.apps/[redacted]-uat-deployment-574c778cc9      0         0         0       35h
replicaset.apps/[redacted]-uat-deployment-66546bf76b      0         0         0       11d
replicaset.apps/[redacted]-uat-deployment-698dfbb6c4      0         0         0       4d
replicaset.apps/[redacted]-uat-deployment-69b5c79d54      0         0         0       6d17h
replicaset.apps/[redacted]-uat-deployment-6f67ff6599      0         0         0       10d
replicaset.apps/[redacted]-uat-deployment-777bfdbb9d      0         0         0       3d23h
replicaset.apps/[redacted]-uat-deployment-7f5fbc9ddf      3         3         3       34h
replicaset.apps/[redacted]-uat-deployment-9585454ff       0         0         0       6d21h
replicaset.apps/[redacted]-uat-deployment-97cbcfc6        0         0         0       17d
replicaset.apps/[redacted]-uat-deployment-c776f648d       0         0         0       10d

NAME                                                               REFERENCE                                 TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/[redacted]-uat-deployment      Deployment/[redacted]-uat-deployment      4%/80%    3         6         3          9h

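[Comments]

Does $ kubectl get all list any pods shown as "Completed"? Just curious whether this issue applies here.

Thanks Nick - I have added the output of that command to my question. There are no pods in the Completed state. But I do have what appear to be some old deployments. I suspect my label matchers may also be incorrect...

[Answer 1]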

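OK, I managed to figure this out by looking up the schema for the HPA (https://docs.okd.io/latest/rest_api/apis-autoscaling/v2beta1.HorizontalPodAutoscaler.html).

In short, I was using the wrong metric type: as you can see above, I was using "Pods", but I should have been using "External". This is consistent with the kubectl query in the question, which reads the metric from the external.metrics.k8s.io API, and with the metric being reported by the ingress controller pod rather than by the pods of the Deployment being scaled. Note that the selector field is also named differently: "selector" under "pods" versus "metricSelector" under "external".

The correct HPA specification is: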

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-uat-active-connections
  namespace: default
spec:
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: External
      external:
        metricName: custom.googleapis.com|nginx-ingress-controller|nginx_ingress_controller_nginx_process_connections
        metricSelector: 
          matchLabels:
            metric.labels.state: active
            resource.labels.cluster_name: "[redacted]"
        targetAverageValue: 5
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: "[redacted]"

Once I did this, things worked right away:

NAME                         REFERENCE                                 TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
hpa-uat-active-connections   Deployment/bustle-webapp-uat-deployment   334m/5 (avg)   3         6         3          30s
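The `334m/5 (avg)` reading is consistent with how the HPA appears to treat an External metric with `targetAverageValue`: it divides the metric total by the current replica count and reports the result in milli-units, so one active connection spread over 3 replicas shows up as roughly 0.334. Below is a rough Python sketch of that arithmetic. It assumes the default 10% tolerance and ignores stabilization and other controller details; the function name and parameters are illustrative and are not a Kubernetes API.

```python
import math


def external_per_pod_replicas(metric_values_milli, target_avg_milli,
                              current_replicas, min_replicas, max_replicas,
                              tolerance=0.1):
    """Approximate the HPA decision for an External metric with a
    per-pod average target. All metric quantities are in milli-units.

    Returns (desired_replicas, reported_average_milli).
    """
    usage = sum(metric_values_milli)
    # Ratio of actual usage to what the current replicas may carry in total.
    usage_ratio = usage / (target_avg_milli * current_replicas)
    if abs(1.0 - usage_ratio) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(usage / target_avg_milli)
    # Average shown in the TARGETS column, rounded up to whole milli-units.
    reported_average = math.ceil(usage / current_replicas)
    # The result is always clamped to the minReplicas/maxReplicas bounds.
    desired = max(min_replicas, min(max_replicas, desired))
    return desired, reported_average


# Values from the question: metric value "1" (1000m), target 5, 3 replicas.
print(external_per_pod_replicas([1000], 5000, 3, 3, 6))
```

With the values from the question this yields 3 replicas and an average of 334m, matching the output above. Under the same assumptions, 40 active connections would ask for 8 replicas and be capped at `maxReplicas: 6`.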

[Discussion]

Glad you found the answer and posted it.
