AWS Image Builder: understanding Image Builder and using it to build a custom AMI for pcluster

Posted zhojiew

Understanding Image Builder

In a pcluster image build, Image Builder configures the instance with cinc-client. CINC stands for "CINC Is Not Chef"; it is a free distribution of Chef.

https://cinc.sh/about/

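[Figure: overall logic of an Image Builder pipeline]

The core concepts relate to each other as follows:

  • recipe: contains one parent image and one or more components

  • component: the building block of a recipe; describes how to build, validate, and test an image

  • infrastructure configuration: defines the environment in which images are built and tested

  • distribution configuration: specifies the AWS Regions, accounts, or organizations the image is distributed to

For details on the commands that run and the logs they produce, see "Under the Hood".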

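Building a custom AMI for pcluster

Using an official pcluster AMI as the source

An earlier pcluster article covered creating an AMI with the pcluster tool; under the hood, that tool uses Image Builder.

Image Builder uses SSM Automation to orchestrate the image build. To get more detail when troubleshooting a failed build, search the Systems Manager console for the execution ID that Image Builder reports, then inspect that Automation execution. A failed build surfaces in CloudFormation like this: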

Resource handler returned message: "Error occurred during operation 'SSM execution 'a13bc224-150b-47ae-8e9d-47f3bdc4dc48' failed for image arn: 'arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:image/parallelclusterimage-myubuntu1804/3.1.4/1' with status = 'Failed' in state = 'BUILDING' and failure message = 'Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-de178710-9674-11ed-b264-0e2b2c28fce2/3.1.4/1 failed!''." (RequestToken: 273970de-d749-1216-1215-06466707ae47, HandlerErrorCode: GeneralServiceException)
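
The execution ID in a message like the one above can be extracted and handed to the Systems Manager CLI. This is a minimal sketch, assuming the message keeps the "SSM execution '<uuid>'" form shown above; the function name extract_ssm_execution_id is hypothetical, while aws ssm get-automation-execution is the standard AWS CLI command.

```shell
#!/bin/bash
# Hypothetical helper: pull the SSM Automation execution ID out of an
# Image Builder failure message of the form "SSM execution '<uuid>' failed".
extract_ssm_execution_id() {
  local message="$1"
  # First match "SSM execution '<uuid>'", then keep only the 36-character UUID.
  printf '%s\n' "$message" \
    | grep -oE "SSM execution '[0-9a-f-]{36}'" \
    | grep -oE "[0-9a-f-]{36}" \
    | head -n 1
}

# Sample message, shortened from the error above.
msg="Error occurred during operation 'SSM execution 'a13bc224-150b-47ae-8e9d-47f3bdc4dc48' failed for image arn: ..."
execution_id="$(extract_ssm_execution_id "$msg")"
echo "$execution_id"

# With AWS credentials configured, inspect the Automation execution
# (the region is the one that appears in the ARN):
# aws ssm get-automation-execution \
#   --automation-execution-id "$execution_id" \
#   --region cn-north-1
```

The function prints nothing when the message contains no execution ID, so the caller can test for an empty result before calling the AWS CLI.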

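The detailed error matches what CloudFormation reported; the specifics are in the error log of the corresponding document.

The CloudWatch Logs for that document (written by Image Builder) show the error from the custom AMI build.

The cause is a version mismatch: the pcluster CLI is version 3.1.4, while the AMI was built for pcluster 3.2.1.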

================================================================================
Stdout: Recipe Compile Error in /etc/chef/local-mode-cache/cache/cookbooks/aws-parallelcluster/attributes/conditions.rb
Stdout: ================================================================================
Stdout: 
Stdout: RuntimeError
Stdout: ------------
Stdout: This AMI was created with aws-parallelcluster-cookbook-3.2.1, but is trying to be used with aws-parallelcluster-cookbook-3.1.4. Please either use an AMI created with aws-parallelcluster-cookbook-3.1.4 or change your ParallelCluster to aws-parallelcluster-cookbook-3.2.1
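
A mismatch like this can be caught before a build is started. The sketch below compares a CLI version string with the version embedded in an AMI name. It assumes official ParallelCluster AMIs are named like aws-parallelcluster-3.2.1-ubuntu-1804-lts-hvm-x86_64-202209270835; the function names ami_pcluster_version and check_versions are hypothetical.

```shell
#!/bin/bash
# Hypothetical pre-flight check: compare the pcluster CLI version with the
# version embedded in an official ParallelCluster AMI name.

# Print "3.2.1" for a name like "aws-parallelcluster-3.2.1-ubuntu-1804-...".
# Prints nothing if the name does not follow that (assumed) pattern.
ami_pcluster_version() {
  local ami_name="$1"
  printf '%s\n' "$ami_name" \
    | sed -nE 's/^aws-parallelcluster-([0-9]+\.[0-9]+\.[0-9]+)-.*/\1/p'
}

# Return 0 when both versions match, 1 otherwise (with a message on stderr).
check_versions() {
  local cli_version="$1" ami_name="$2" ami_version
  ami_version="$(ami_pcluster_version "$ami_name")"
  if [ -z "$ami_version" ]; then
    echo "Cannot read a pcluster version from AMI name: $ami_name" >&2
    return 1
  fi
  if [ "$cli_version" != "$ami_version" ]; then
    echo "Mismatch: CLI $cli_version vs AMI $ami_version" >&2
    return 1
  fi
  echo "OK: CLI and AMI are both $cli_version"
}

check_versions "3.2.1" "aws-parallelcluster-3.2.1-ubuntu-1804-lts-hvm-x86_64-202209270835"
```

In practice the first argument would come from the output of pcluster version, and the AMI name from aws ec2 describe-images with --query 'Images[0].Name' --output text.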

Once the versions match, the build succeeds, and the custom AMI can be used to create a cluster:

Region: cn-north-1
Image:
  Os: ubuntu1804
  CustomAmi: ami-003819348308f4f4f
HeadNode:
  InstanceType: m5.large
...

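Using a public AMI as the source

The parent image chosen above was the official pcluster AMI, aws-parallelcluster-3.2.1-ubuntu-1804-lts-hvm-x86_64-202209270835. The next attempt checks whether a plain Ubuntu AMI also builds cleanly: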

Build:
  InstanceType: c5.4xlarge
  ParentImage: ami-07356f2da3fd22521
  SubnetId: subnet-xxxxxxxxx
  SecurityGroupIds:
    - sg-xxxxxxxx
  UpdateOsPackages:
    Enabled: true

The CloudFormation stack fails with the following error:

Resource handler returned message: "Error occurred during operation 'SSM execution 'cb055f7d-7c07-471a-9d3a-06a900926f8e' failed for image arn: 'arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:image/parallelclusterimage-myubuntu1804raw/3.2.1/1' with status = 'Failed' in state = 'BUILDING' and failure message = 'Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-f78ad100-9685-11ed-89e5-06b4c2e890aa/3.2.1/1 failed!''." (RequestToken: ea6df8f2-d076-43b7-8893-44c567a70a34, HandlerErrorCode: GeneralServiceException)

The same steps as before lead to the cause:

Command 9647e5df-dfe4-49f5-aab2-f6843bf55c16 returns unexpected invocation result: 
Status=[Failed], ResponseCode=[1], Output=[
    "executionId": "c0466b39-9686-11ed-8042-0651be0b5200",
    "status": "failed",
    "failedStepCount": 1,
    "executedStepCount": 24,
    "ignoredFailedStepCount": 0,
    "failureMessage": "Document arn:aws-cn:imagebuilder:cn-north-1:xxxxxxx:component/parallelclusterimage-f78ad100-9685-11ed-89e5-06b4c2e890aa/3.2.1/1 failed!",
    "logUrl": "/var/lib/amazon/toe/TOE_2023-01-17_16-48-21_UTC-0_c0466b39-9686-11ed-8042-0651be0b5200"

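The CloudWatch Logs show a rather awkward failure: the build instance cannot complete a TLS handshake with github.com.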

STDERR: fatal: unable to access 'https://github.com/pyenv/pyenv-virtualenv/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
Ran git ls-remote "https://github.com/pyenv/pyenv-virtualenv" "master*" returned 128

I could not find anywhere to configure a proxy for the build, so I gave up on this route for now.
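
Before retrying, it can help to confirm whether an instance in the build subnet can reach GitHub at all. The sketch below reruns the git ls-remote call from the log with a timeout. The function name probe_git_remote and the 20-second default are assumptions, and it relies on git and the coreutils timeout command being installed.

```shell
#!/bin/bash
# Hypothetical probe: report whether a git remote answers within a time limit.
probe_git_remote() {
  local url="$1" seconds="${2:-20}"
  # git ls-remote exits non-zero when the remote cannot be reached;
  # timeout aborts the call if it hangs.
  if timeout "$seconds" git ls-remote "$url" >/dev/null 2>&1; then
    echo "reachable: $url"
  else
    echo "unreachable: $url"
    return 1
  fi
}

# Run on an instance in the same subnet and security group as the build:
# probe_git_remote "https://github.com/pyenv/pyenv-virtualenv"
```

An "unreachable" result only says the network path failed; it does not show where a proxy would have to be configured for the build.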

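Analyzing the error through the userdata

After a successful build, the userdata that runs when the pcluster head node starts looks like this, reduced to its main logic:

  • check that the cookbook and pcluster versions match
  • check that the AMI is supported by pcluster
  • run chef to configure the node

#!/bin/bash -x
...
# Unpack the cookbook archive and vendor each cookbook's dependencies with Berkshelf.
function vendor_cookbook
{
  mkdir /tmp/cookbooks
  cd /tmp/cookbooks
  tar -xzf /etc/chef/aws-parallelcluster-cookbook.tgz
  # berks needs a writable HOME; point it at /tmp and restore it afterwards.
  HOME_BAK="${HOME}"
  export HOME="/tmp"
  for d in `ls /tmp/cookbooks`; do
    cd /tmp/cookbooks/$d
    LANG=en_US.UTF-8 /opt/cinc/embedded/bin/berks vendor /etc/chef/cookbooks --delete || error_exit 'Vendoring cookbook failed.'
  done;
  export HOME="${HOME_BAK}"
}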
...
custom_cookbook=NONE
export _region=cn-north-1
s3_url=amazonaws.com.cn
if [ "$custom_cookbook" != "NONE" ]; then
  if [[ "$custom_cookbook" =~ ^s3://([^/]*)(.*) ]]; then
    bucket_region=$(aws s3api get-bucket-location --bucket "${BASH_REMATCH[1]}" | jq -r '.LocationConstraint')
    if [[ "$bucket_region" == null ]]; then
      bucket_region="us-east-1"
    fi
    cookbook_url=$(aws s3 presign "$custom_cookbook" --region "$bucket_region")
  else
    cookbook_url=$custom_cookbook
  fi
fi
export parallelcluster_version=aws-parallelcluster-3.2.1
export cookbook_version=aws-parallelcluster-cookbook-3.2.1
export chef_version=17.2.29
export berkshelf_version=7.2.0
if [ -f /opt/parallelcluster/.bootstrapped ]; then
  installed_version=$(cat /opt/parallelcluster/.bootstrapped)
  if [ "$cookbook_version" != "$installed_version" ]; then
    error_exit "This AMI was created with $installed_version, but is trying to be used with $cookbook_version. Please either use an AMI created with $cookbook_version or change your ParallelCluster to $installed_version"
  fi
else
  error_exit "This AMI was not baked by ParallelCluster. Please use pcluster build-image command to create an AMI by providing your AMI as parent image."
fi
if [ "$custom_cookbook" != "NONE" ]; then
  curl --retry 3 -v -L -o /etc/chef/aws-parallelcluster-cookbook.tgz $cookbook_url
  vendor_cookbook
fi

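So the error during the custom AMI build comes from this kind of version check: the cookbook version recorded in the image does not match the version in use.

The /etc/chef/cookbooks directory holds the cookbooks (collections of recipes):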

$ tree -L 1
/etc/chef/cookbooks
├── apt
├── aws-parallelcluster
├── aws-parallelcluster-awsbatch
├── aws-parallelcluster-config
├── aws-parallelcluster-install
├── aws-parallelcluster-scheduler-plugin
├── aws-parallelcluster-slurm
├── aws-parallelcluster-test
├── iptables
├── line
├── nfs
├── openssh
├── pyenv
├── selinux
├── yum
└── yum-epel

Pinning down the exact error requires reading the Ruby code inside these cookbooks.
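
One way to start is to search the cookbooks for a fragment of the error message. This is a minimal sketch, assuming the message text appears literally in a Ruby source file; the function name find_error_source is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the Ruby source lines under a cookbook directory
# that contain a given fragment of an error message.
find_error_source() {
  local message="$1" cookbook_dir="${2:-/etc/chef/cookbooks}"
  # -r: recurse, -n: print line numbers, -F: treat the message as a fixed string,
  # --include: restrict the search to Ruby files.
  grep -rnF --include='*.rb' -- "$message" "$cookbook_dir"
}

# On a node baked by pcluster:
# find_error_source "This AMI was created with"
```

Each match is printed as path:line:text, which points at the file to read next; the compile error earlier in this post named aws-parallelcluster/attributes/conditions.rb.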
