AWS Batch CloudFormation - "CannotPullContainerError"
【Title】AWS batch cloudformation - "CannotPullContainerError" 【Posted】2021-04-16 16:49:51
【Question】:
I have a CloudFormation template for an AWS Batch POC with 6 resources:
- 3 AWS::IAM::Role
- 1 AWS::Batch::ComputeEnvironment
- 1 AWS::Batch::JobQueue
- 1 AWS::Batch::JobDefinition
Each AWS::IAM::Role has the policy "arn:aws:iam::aws:policy/AdministratorAccess" (to rule out permission problems).
The roles are used as follows:
- 1 in AWS::Batch::ComputeEnvironment
- 2 in AWS::Batch::JobDefinition
But even with the policy "arn:aws:iam::aws:policy/AdministratorAccess", I get "CannotPullContainerError: Error response from daemon: Get https://********.dkr.ecr.eu-west-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" when I run a job.
Disclaimer: everything is FARGATE (compute environment and job), not EC2.
AWSTemplateFormatVersion: '2010-09-09'
Description: Creates a POC AWS Batch environment.
Parameters:
  Environment:
    Type: String
    Description: 'Environment Name'
    Default: TEST
  Subnets:
    Type: List<AWS::EC2::Subnet::Id>
    Description: 'List of Subnets to boot into'
  ImageName:
    Type: String
    Description: 'Name and tag of Process Container Image'
    Default: 'upload:6.0.0'
Resources:
  BatchServiceRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchServiceRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: 'Allow'
            Principal:
              Service: 'batch.amazonaws.com'
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchContainerRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchContainerRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: 'Allow'
            Principal:
              Service:
                - 'ecs-tasks.amazonaws.com'
            Action:
              - 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchJobRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Join ['', ['Demo', BatchJobRole]]
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: 'Allow'
            Principal:
              Service: 'ecs-tasks.amazonaws.com'
            Action: 'sts:AssumeRole'
      ManagedPolicyArns:
        - 'arn:aws:iam::aws:policy/AdministratorAccess'
  BatchCompute:
    Type: "AWS::Batch::ComputeEnvironment"
    Properties:
      ComputeEnvironmentName: DemoContentInput
      ComputeResources:
        MaxvCpus: 256
        SecurityGroupIds:
          - sg-0b33333333333333
        Subnets: !Ref Subnets
        Type: FARGATE
      ServiceRole: !Ref BatchServiceRole
      State: ENABLED
      Type: Managed
  Queue:
    Type: "AWS::Batch::JobQueue"
    DependsOn: BatchCompute
    Properties:
      ComputeEnvironmentOrder:
        - ComputeEnvironment: DemoContentInput
          Order: 1
      Priority: 1
      State: "ENABLED"
      JobQueueName: DemoContentInput
  ContentInputJob:
    Type: "AWS::Batch::JobDefinition"
    Properties:
      Type: Container
      ContainerProperties:
        Command:
          - -v
          - process
          - new-file
          - -o
          - s3://contents/content_id/content_id.mp4
        Environment:
          - Name: SECRETS
            Value: !Join [ ':', [ 'resolve:secretsmanager:common.secrets:SecretString:aws_access_key_id', 'resolve:secretsmanager:common.secrets:SecretString:aws_secret_access_key' ] ]
          - Name: APPLICATION
            Value: upload
          - Name: API_KEY
            Value: 'resolve:secretsmanager:common.secrets:SecretString:fluzo.api_key'
          - Name: CLIENT
            Value: upload-container
          - Name: ENVIRONMENT
            Value: !Ref Environment
          - Name: SETTINGS
            Value: !Join [ ':', [ 'resolve:secretsmanager:common.secrets:SecretString:aws_access_key_id', 'resolve:secretsmanager:common.secrets:SecretString:aws_secret_access_key', 'upload-container' ] ]
        ExecutionRoleArn: 'arn:aws:iam::**********:role/DemoBatchJobRole'
        Image: !Join ['', [!Ref 'AWS::AccountId','.dkr.ecr.', !Ref 'AWS::Region', '.amazonaws.com/', !Ref ImageName ] ]
        JobRoleArn: !Ref BatchContainerRole
        ResourceRequirements:
          - Type: VCPU
            Value: 1
          - Type: MEMORY
            Value: 2048
      JobDefinitionName: DemoContentInput
      PlatformCapabilities:
        - FARGATE
      RetryStrategy:
        Attempts: 1
      Timeout:
        AttemptDurationSeconds: 600
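In AWS::Batch::JobDefinition:ContainerProperties:ExecutionRoleArn I hard-coded the ARN, because writing !Ref BatchJobRole gives me an error. But that is not the problem I am asking about.
The question is how to avoid "CannotPullContainerError: Error response from daemon: Get https://********.dkr.ecr.eu-west-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" when I run the job.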
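On the side issue of the hard-coded `ExecutionRoleArn` mentioned above: for an `AWS::IAM::Role`, `!Ref` returns the role name rather than the ARN, which is a likely cause of that error. A minimal sketch, assuming the logical IDs `BatchJobRole` and `BatchContainerRole` from the template and with unrelated properties left out, would use `!GetAtt` instead:

```yaml
ContentInputJob:
  Type: "AWS::Batch::JobDefinition"
  Properties:
    Type: Container
    ContainerProperties:
      # !GetAtt <role>.Arn yields the full ARN; !Ref would yield only the role name
      ExecutionRoleArn: !GetAtt BatchJobRole.Arn
      JobRoleArn: !GetAtt BatchContainerRole.Arn
```

This removes the hard-coded account ID but does not address the pull timeout itself.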
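【Comments】:
I think the connection timeout points to a network problem. Have you checked the routes, the NAT gateway, and the security groups?
Are the subnets passed via !Ref Subnets public or private? How is your VPC configured?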
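【Answer 1】:
It sounds like you don't have Internet access from within your subnets.
Make sure that:
- An Internet gateway is attached to your VPC (create one if not, even if you only use a NAT gateway for egress).
- The route table associated with your subnets has a default route (0.0.0.0/0) to the Internet gateway, or to a NAT gateway with an attached Elastic IP.
- The attached security group has rules allowing outbound Internet traffic (0.0.0.0/0) on the required ports and protocols (e.g. 80/HTTP, 443/HTTPS).
- The network access control list (network ACL) associated with the subnets has rules allowing both outbound and inbound traffic to and from the Internet.
Reference:
https://aws.amazon.com/premiumsupport/knowledge-center/ec2-connect-internet-gateway/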
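As a rough illustration of the checklist above, the fragment below sketches the corresponding CloudFormation resources for a public-subnet setup. It is only a sketch under assumptions: the `VpcId` and `RouteTableId` parameters and the logical names `InternetGateway`, `GatewayAttachment`, `DefaultRoute`, and `BatchEgressSecurityGroup` are hypothetical and do not appear in the original template. An existing VPC may already have some of these pieces, in which case they should be checked rather than created again.

```yaml
Parameters:
  VpcId:                       # hypothetical: the VPC that contains the Batch subnets
    Type: AWS::EC2::VPC::Id
  RouteTableId:                # hypothetical: the route table associated with those subnets
    Type: String
Resources:
  InternetGateway:
    Type: AWS::EC2::InternetGateway
  GatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      VpcId: !Ref VpcId
      InternetGatewayId: !Ref InternetGateway
  DefaultRoute:
    Type: AWS::EC2::Route
    DependsOn: GatewayAttachment   # the route is only valid once the gateway is attached
    Properties:
      RouteTableId: !Ref RouteTableId
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway
  BatchEgressSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow outbound HTTPS so Fargate tasks can reach ECR
      VpcId: !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 0.0.0.0/0
```

For private subnets, the default route would point to a NAT gateway through `NatGatewayId` instead of `GatewayId`. With Fargate, a task in a public subnet also needs a public IP to use the Internet gateway route; in `AWS::Batch::JobDefinition` this is set through `ContainerProperties.NetworkConfiguration.AssignPublicIp: ENABLED`.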