ECS Fargate autoscaling more rapidly?

Posted 2023-03-04

【Title】: ECS Fargate autoscaling more rapidly? 【Published】: 2021-07-23 23:13:16 【Question】:

I'm load testing my autoscaling AWS ECS Fargate stack, which consists of:

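an Application Load Balancer (ALB) with a target group pointing to ECS,
an ECS cluster, service, task, ApplicationAutoScaling::ScalableTarget, and ApplicationAutoScaling::ScalingPolicy,
an application autoscaling policy that defines a target tracking policy: type: TargetTrackingScaling, PredefinedMetricType: ALBRequestCountPerTarget, threshold = 1000 requests. The alarm is triggered when 1 datapoint breaches the threshold in the past 1-minute evaluation period.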
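For readers who prefer the API view over CloudFormation, the target tracking policy described above could be sketched roughly as the arguments to Application Auto Scaling's PutScalingPolicy call. This is only an illustrative sketch: the helper name `build_target_tracking_policy` is hypothetical, the `ResourceLabel` value is a placeholder (the real one identifies the ALB and target group), and the policy, cluster, and service names are taken from the logs quoted further down in this question.

```python
import json


def build_target_tracking_policy(cluster, service, resource_label, target_value=1000.0):
    """Build keyword arguments for Application Auto Scaling's PutScalingPolicy.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    return {
        "PolicyName": "alb-requests-per-target-per-minute",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            # Target of 1000 requests per target, as in the question.
            "TargetValue": target_value,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ALBRequestCountPerTarget",
                # Placeholder: must identify the real ALB and target group.
                "ResourceLabel": resource_label,
            },
        },
    }


policy = build_target_tracking_policy(
    "my-ecs-cluster",
    "my-service",
    "app/my-alb/PLACEHOLDER/targetgroup/my-tg/PLACEHOLDER",
)
print(json.dumps(policy, indent=2))
```

With boto3 installed and credentials configured, a dict like this could be passed as `boto3.client("application-autoscaling").put_scaling_policy(**policy)`; that call is not made here.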

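This all works fine. The alarm does get triggered and I see the scale-out actions happening. But detecting the "threshold breach" feels slow. Here is the timing of my load test and the AWS events (compiled from the JMeter logs and from different places in the AWS console):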

10:44:32 start load test (this is the first request timestamp entry in JMeter logs)
10:44:36 4 seconds later (in the JMeter logs), we see that the load test reaches its 1000th request to the ALB. At this point in time, we're above the threshold and waiting for AWS to detect that...
10:46:10 1m34s later, I can finally see the spike show up in the alarm graph on the CloudWatch alarm detail page BUT the alarm is still in OK state!
    NOTE: notice the 1m34s delay in detecting the spike. If it gets a datapoint every 60 seconds, it should take at MOST 60 seconds to detect it: my load test blasts out 1000 requests every 4 seconds!!
10:46:50 the alarm finally goes from OK to ALARM state
    NOTE: at this point, we're 2m14s past the moment when requests started pounding the server at a rate of 1000 requests every 4 seconds!
    NOTE: 3 seconds later, after the alarm finally went off, the "scale out" action gets called (awesome, that part is quick):
14:46:53 Action Successfully executed action    arn:aws:autoscaling:us-east-1:MYACCOUNTID:scalingPolicy:51f0a780-28d5-4005-9681-84244912954d:resource/ecs/service/my-ecs-cluster/my-service:policyName/alb-requests-per-target-per-minute:createdBy/ffacb0ac-2456-4751-b9c0-b909c66e9868
After that, I follow the actions in the ECS "events tab":
10:46:53 Message: Successfully set desired count to 6. Waiting for change to be fulfilled by ecs. Cause: monitor alarm TargetTracking-service/my-ecs-cluster-cce/my-service-AlarmHigh-fae560cc-e2ee-4c6b-8551-9129d3b5a6d3 in state ALARM triggered policy alb-requests-per-target-per-minute
10:47:08 service my-service has started 5 tasks: task 7e9612fa981c4936bd0f33c52cbded72 task e6cd126f265842c1b35c0186c8f9b9a6 task ba4ffa97ceeb49e29780f25fe6c87640 task 36f9689711254f0e9d933890a06a9f45 task f5dd3dad76924f9f8f68e0d725a770c0.
10:47:41 service my-service registered 3 targets in target-group my-tg
10:47:52 service my-service registered 2 targets in target-group my-tg
10:49:05 service my-service has reached a steady state.
    NOTE: starting the tasks took 33 seconds, which is very acceptable because I set HealthCheckGracePeriodSeconds to 30 seconds and the health check interval is 30 seconds as well
    NOTE: 3m09s between the time the load started pounding the server and the time the first new ECS tasks are up
    NOTE: most of this time (3m09s) is due to waiting for the alarm to go off (2m20s)!! The rest is normal: waiting for the new tasks to start.
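The intervals quoted in the timeline can be re-derived from its timestamps. The following is a small sketch that only redoes that arithmetic; the helper name `elapsed` is made up for this example, and all events are assumed to fall on the same day.

```python
from datetime import datetime


def elapsed(start, end):
    """Return the time between two HH:MM:SS stamps as a string like '1m34s'."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    minutes, seconds = divmod(int(delta.total_seconds()), 60)
    return f"{minutes}m{seconds:02d}s"


# Timestamps copied from the timeline above.
load_start = "10:44:32"
threshold_crossed = "10:44:36"
spike_visible = "10:46:10"
alarm_fired = "10:46:50"
tasks_started = "10:47:08"
targets_registered = "10:47:41"

print(elapsed(threshold_crossed, spike_visible))   # 1m34s
print(elapsed(threshold_crossed, alarm_fired))     # 2m14s
print(elapsed(load_start, alarm_fired))            # 2m18s (about the 2m20s quoted)
print(elapsed(tasks_started, targets_registered))  # 0m33s
print(elapsed(load_start, targets_registered))     # 3m09s
```

Roughly two and a quarter of the just over three minutes pass before the alarm changes state, which matches the last NOTE above.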

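Q1: Is there a way to make the alarm trigger faster and/or as soon as the threshold is breached? To me, this takes over a minute too long. It should really scale out in about 1m30s (at most 1m to detect the ALARM HIGH state + 30 seconds to start the tasks)...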

Note: I documented my CloudFormation stack in another question I opened today: Cloudformation ECS Fargate autoscaling target tracking: 1 custom alarm in 1 minute: Failed to execute action

【Answer 1】:

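There is not much you can do about it. The ALB sends metrics to CloudWatch in 1 minute intervals. Moreover, these metrics are not real-time, so delays are expected, even of a few minutes, as explained by AWS support and reported in the comments here: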

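Some delay in metrics is expected, which is inherent to any monitoring system, as they depend on several variables such as the delay of the service publishing the metric, propagation delays and ingestion delays within CloudWatch, to name a few. I do understand that a consistent 3 or 4 minute delay for ALB metrics is on the higher side.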

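You either have to over-provision ECS so it can sustain the increased load until the alarm fires and the scale-out executes, or lower your threshold.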

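Alternatively, you can create your own custom metrics, for example from your application. These metrics can even be at 1-second intervals. Your app could also trigger the alarm "manually". This would let you reduce the delay you observe.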
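As a rough illustration of the custom-metric idea, an application could count its own requests and publish the count as a high-resolution CloudWatch metric. This is a minimal sketch, not a tested production setup: the namespace `MyApp/Scaling`, the metric name `RequestCount`, the `ServiceName` dimension, and all function names are assumptions made up for this example. `put_metric_data` (with `StorageResolution=1`) and `set_alarm_state` are boto3 CloudWatch client calls; boto3 is imported lazily so the payload builder runs without it.

```python
from datetime import datetime, timezone


def build_request_count_datum(count, service_name, timestamp=None):
    """Build one MetricData entry for a high-resolution custom metric.

    Metric and dimension names are hypothetical, chosen for this example.
    """
    return {
        "MetricName": "RequestCount",
        "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(count),
        "Unit": "Count",
        "StorageResolution": 1,  # 1 = high resolution (per-second granularity)
    }


def publish(datum, namespace="MyApp/Scaling"):
    """Send the datum to CloudWatch (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the builder works without it

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=[datum])


def force_alarm(alarm_name, reason):
    """Set an alarm to ALARM 'manually'; the forced state is temporary."""
    import boto3

    boto3.client("cloudwatch").set_alarm_state(
        AlarmName=alarm_name, StateValue="ALARM", StateReason=reason
    )


datum = build_request_count_datum(1000, "my-service")
print(datum["MetricName"], datum["Value"], datum["StorageResolution"])
```

A scaling policy would then need to track this custom metric instead of ALBRequestCountPerTarget. Note that a state set through `set_alarm_state` only lasts until CloudWatch next evaluates the alarm.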

【Comments】:

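The key part is the comment in the linked SO question: "ALB metric delays are due to a 3-minute ingestion delay, and this delay cannot be reduced at this stage" + "AWS is working on it". That is definitely what I am experiencing: a delay of almost 3 minutes. This is bad, because many people will end up coding it themselves: polling the ALB to try to get the "number of requests in the past minute", or coding it in their application (counting requests). This should be something AWS provides at the ALB level (access logs/metrics), not something we code in our applications... Great link! Thanks!

For the record, the Azure documentation (to compare with AWS) mentions: "For Web Apps, the averaging period is much shorter, allowing new instances to be available in about five minutes after a change to the average trigger measure". It feels like AWS could win this race. It feels like most of us could code it to scale faster (if we could parse "fresh" access logs). Come on AWS! :) docs.microsoft.com/en-us/azure/architecture/best-practices/…

@Pierre No problem. If the answer was helpful, accepting it would be appreciated.

Yes, of course, I just want to leave it open for a few more days to encourage more responses. Maybe from AWS, who might confirm that they are working on it and maybe give an ETA, so that we don't have to code this ourselves (they already provide ALB metrics, already push them to CloudWatch, and already have CloudWatch alarms that trigger ECS actions: they just need to make it higher resolution to avoid waiting 1 minute for the ALB metrics to reach CW + another 1 minute for CW to trigger the alarm).

@Pierre Yes, no problem. Thanks.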
