How to run a pod based on a Prometheus alert

Posted 2023-02-15


【Title】: How to run a pod based on a Prometheus alert 【Posted】: 2021-12-26 17:19:53 【Question】:

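Is there a way to run a pod based on an alert fired by Prometheus? We have a scenario where we need to run a pod when a disk pressure threshold is crossed. I can create the alert, but I also need to run a pod in response. How can I achieve this?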

groups:
  - name: node_memory_MemAvailable_percent
    rules:
    - alert: node_memory_MemAvailable_percent_alert
      annotations:
        description: Memory on node {{ $labels.instance }} currently at {{ $value }}%
          is under pressure
        summary: Memory usage is under pressure, system may become unstable.
      expr: |
        100 - ((node_memory_MemAvailable_bytes{job="node-exporter"} * 100) / node_memory_MemTotal_bytes{job="node-exporter"}) > 80
      for: 2m
      labels:
        severity: warning
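
To make the threshold concrete, here is a minimal sketch of the arithmetic the expression performs for a single node. The function names and the sample byte counts are illustrative assumptions, and the `for: 2m` hold time is not modelled.

```python
def memory_used_percent(mem_available_bytes, mem_total_bytes):
    """Mirror of the rule expression: 100 - (available * 100 / total)."""
    return 100 - (mem_available_bytes * 100) / mem_total_bytes


def should_alert(mem_available_bytes, mem_total_bytes, threshold=80):
    """True when used memory is above the threshold (80 in the rule above)."""
    return memory_used_percent(mem_available_bytes, mem_total_bytes) > threshold


GIB = 1024 ** 3
# A node with 2 GiB available out of 16 GiB is 87.5% used, so it would match.
print(memory_used_percent(2 * GIB, 16 * GIB))
print(should_alert(2 * GIB, 16 * GIB))
```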

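【Question comments】:

First, please do not post duplicate questions: ***.com/questions/69975448/…. You wrote about acting on disk pressure, yet your alert is defined on memory pressure. Please clarify this.

【Answer 1】: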

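I think Alertmanager can help you here, using its webhook receiver (documentation).

This way, when the alert fires, Prometheus sends it to Alertmanager, and Alertmanager then makes a POST request to your custom webhook.

Of course, you need to implement a service that handles the alert and runs your action.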
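
Such a service could look roughly like the sketch below. It relies only on the documented shape of the Alertmanager webhook payload (a JSON body with an `alerts` list, each alert carrying `status` and `labels`). Everything else is an assumption made for illustration: the function names, port 8080, the `busybox` image, the `df -h` command and the annotation key. Submitting the manifest to the cluster is left as a placeholder.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def pods_for_payload(payload, image="busybox:1.36"):
    """Build one Pod manifest per firing alert in an Alertmanager webhook payload."""
    manifests = []
    for alert in payload.get("alerts", []):
        if alert.get("status") != "firing":
            continue  # ignore "resolved" notifications
        alert_name = alert.get("labels", {}).get("alertname", "unknown")
        manifests.append({
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {
                "generateName": "alert-task-",  # hypothetical name prefix
                "annotations": {"triggered-by-alert": alert_name},
            },
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "task",
                    "image": image,  # assumed image
                    "command": ["sh", "-c", "df -h"],  # assumed command
                }],
            },
        })
    return manifests


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for manifest in pods_for_payload(payload):
            # Placeholder: submit the manifest to the cluster here, for example
            # with a Kubernetes client library or by piping it to kubectl.
            print(json.dumps(manifest))
        self.send_response(200)
        self.end_headers()


def serve(port=8080):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

To try it, call `serve()` and point the `url` of a `webhook_configs` entry in the Alertmanager receiver at this service. Filtering on `status` keeps the service from launching a second pod when the alert resolves.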

【Comments】:

【Answer 2】:

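In general, your question mentions disk pressure, but the code shows the amount of available memory. If you want to scale your replicas up and down based on memory, you can use a Horizontal Pod Autoscaler: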

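The Horizontal Pod Autoscaler is implemented as a control loop, with a period controlled by the controller manager's --horizontal-pod-autoscaler-sync-period flag (default value of 15 seconds).

During each period, the controller manager queries the resource utilization against the metrics specified in each HorizontalPodAutoscaler definition. The controller manager obtains the metrics from either the resource metrics API (for per-pod resource metrics) or the custom metrics API (for all other metrics).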
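
On each pass of that loop, the replica count follows the rule given in the Kubernetes documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). A minimal sketch of that calculation follows. The function name and the clamping arguments are illustrative, and the real controller also accounts for pod readiness, missing metrics and stabilization windows, which are not modelled here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Simplified HPA rule: scale by the ratio of observed to target metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to the target, no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Two replicas averaging 20Mi against a 10Mi target would be doubled.
print(desired_replicas(2, 20, 10))
```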

You can create your own HPA based on memory utilization. For example:

apiVersion: autoscaling/v2  # use autoscaling/v2beta2 on clusters older than v1.23
kind: HorizontalPodAutoscaler
metadata:
  name: php-memory-scale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: AverageValue  # averageValue needs this target type, not Utilization
        averageValue: 10Mi

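You can also set up a Kubernetes HPA with custom metrics from Prometheus:

Autoscaling is an approach to automatically scale workloads up or down based on resource usage. The K8s Horizontal Pod Autoscaler is implemented as a control loop that periodically queries the Resource Metrics API (metrics.k8s.io) for core metrics such as CPU and memory, and the Custom Metrics API for application-specific metrics (external.metrics.k8s.io or custom.metrics.k8s.io). The latter are served by "adapter" API servers provided by metrics solution vendors. There are some known solutions, but none of them is officially part of Kubernetes.
Based on the observed metrics, it automatically scales the number of pods in a Deployment or ReplicaSet.
In what follows we focus on custom metrics, because the Custom Metrics API lets monitoring systems such as Prometheus expose application-specific metrics to the HPA controller.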

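Another solution could be to use KEDA. Take a look at this guide. Here is an example YAML that scales an nginx Deployment on a Prometheus metric (waiting connections, with a threshold of 500):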

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
 name: nginx-scale
 namespace: keda-hpa
spec:
 scaleTargetRef:
   kind: Deployment
   name: nginx-server
 minReplicaCount: 1
 maxReplicaCount: 5
 cooldownPeriod: 30
 pollingInterval: 1
 triggers:
 - type: prometheus
   metadata:
     serverAddress: https://prometheus_server/prometheus
     metricName: nginx_connections_waiting_keda
     query: |
       sum(nginx_connections_waiting{job="nginx"})
     threshold: "500"
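
To see what such a trigger works with, the sketch below runs the same kind of instant query against the Prometheus HTTP API (`/api/v1/query`) and compares the result with a threshold. It is a simplification: KEDA feeds the value into an HPA rather than making a yes/no decision, and the function names here are illustrative assumptions. The server address is the placeholder from the YAML above.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(server, query):
    """Build the URL for the Prometheus instant-query endpoint."""
    params = urllib.parse.urlencode({"query": query})
    return server.rstrip("/") + "/api/v1/query?" + params


def first_value(response):
    """Extract the first sample value from an instant-query JSON response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    result = response["data"]["result"]
    # Each vector sample looks like {"metric": {...}, "value": [timestamp, "730"]}.
    return float(result[0]["value"][1]) if result else 0.0


def over_threshold(server, query, threshold):
    """Run the query over HTTP and compare the first sample with the threshold."""
    with urllib.request.urlopen(instant_query_url(server, query)) as resp:
        return first_value(json.load(resp)) > threshold
```

For example, `over_threshold("https://prometheus_server/prometheus", 'sum(nginx_connections_waiting{job="nginx"})', 500)` would perform the check, given a reachable server at that address.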

【Comments】:

@WytrzymałyWiktor Yes, but we achieved this by using am executor.

【Answer 3】:

Yes, we have the webhook, but we implemented the service by using am executor as the custom service. From am executor's custom script we run the required job through an ADO pipeline.
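
A custom script of this kind might look like the sketch below. It assumes that am executor passes alert details to the script as `AMX_*` environment variables (such as `AMX_STATUS`), as described in the prometheus-am-executor README. For illustration it starts a pod with `kubectl run` instead of calling an ADO pipeline, which is a substitution and not what this answer did. The pod name, image and command are hypothetical.

```python
import os
import subprocess


def command_for_alert(env, pod_name="alert-task", image="busybox:1.36"):
    """Return the kubectl command to run for this alert, or None to do nothing."""
    if env.get("AMX_STATUS") != "firing":
        return None  # nothing to do for resolved alerts
    return [
        "kubectl", "run", pod_name,
        "--image=" + image,      # assumed image
        "--restart=Never",       # run once as a bare pod
        "--", "sh", "-c", "df -h",  # assumed command
    ]


def main():
    cmd = command_for_alert(os.environ)
    if cmd is not None:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main()
```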

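【Comments】:

【Answer 4】:

You can do this with an open-source project called Robusta. (Disclaimer: I am the maintainer.)

First, define the Prometheus alert that should act as the trigger: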

customPlaybooks:
- triggers:
  - on_prometheus_alert:
      alert_name: DiskSpaceAlertName
  actions:
  - disk_watcher:

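Second, we need to write the actual action that runs when the trigger fires (called disk_watcher above). If someone has already written an action that fits your needs, you can skip this step, since there are already more than 50 built-in actions.

In this case there is no built-in action, so we need to write one in Python. (I would be happy to add a built-in one, though :)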

from robusta.api import *  # provides action, DeploymentEvent, RobustaPod and the block types

@action
def disk_watcher(event: DeploymentEvent):
    deployment = event.get_deployment()

    # read / modify the resources here
    print(deployment.spec.template.spec.containers[0].resources)
    # here you would do the actual update to the resources you like
    ...
    # afterwards, save the change
    deployment.update()

    # fetch the relevant pod
    pod = RobustaPod.find_pod(deployment.metadata.name, deployment.metadata.namespace)

    # see what is using up disk space
    output = pod.exec("df -h")

    # create another pod
    other_output = RobustaPod.exec_in_debugger_pod("my-new-pod", pod.spec.nodeName, "cmd to run", "my-image")

    # send details to slack or any other destination
    event.add_enrichment([
        MarkdownBlock("the output from df is attached"),
        FileBlock("df.txt", output.encode()),
        FileBlock("other.txt", other_output.encode())
    ])

【Comments】:
