Semantic Segmentation failing in SageMaker using AugmentedManifest

Posted: 2021-08-17 03:52:23

**Question:** I am using an augmented manifest, all the labeling was done with mTurk, and I am trying to train a model with these files.
I have a Jupyter Notebook, Python 3.7, and TensorFlow 2.
First, I do some basic initialization and configure the manifest file location.
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
import time
from time import gmtime, strftime
import json
role = get_execution_role()
sess = sagemaker.Session()
s3 = boto3.resource("s3")
training_image = sagemaker.image_uris.retrieve(
    "semantic-segmentation", boto3.Session().region_name
)
augmented_manifest_filename_train = "output.manifest"
bucket_name = "<private>"
s3_output_path = "s3://{}/output".format(bucket_name)
s3_train_data_path = "s3://{}/output/trees-and-houses/manifests/output/{}".format(
    bucket_name, augmented_manifest_filename_train
)
augmented_manifest_s3_key = s3_train_data_path.split(bucket_name)[1][1:]
s3_obj = s3.Object(bucket_name, augmented_manifest_s3_key)
augmented_manifest = s3_obj.get()["Body"].read().decode("utf-8")
augmented_manifest_lines = augmented_manifest.split("\n")
num_training_samples = len(augmented_manifest_lines)
This all works fine. I can print my manifest file and inspect its attributes. Next, I configure the job:
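One detail worth checking at this point: if the manifest file ends with a trailing newline, `split("\n")` yields an empty last element, so the `num_training_samples` computed above is off by one. A small self-contained sketch (the manifest content below is made up for illustration; the real records come from `output.manifest`):

```python
import json

# A two-record augmented manifest with a trailing newline, standing in
# for the real output.manifest content.
augmented_manifest = (
    '{"source-ref": "s3://bucket/img1.jpg"}\n'
    '{"source-ref": "s3://bucket/img2.jpg"}\n'
)

# Keep only non-empty lines before counting.
lines = [line for line in augmented_manifest.split("\n") if line.strip()]
num_training_samples = len(lines)

# Each manifest line is a standalone JSON object.
first_record = json.loads(lines[0])
print(num_training_samples)  # 2, not 3
```

An inflated `num_training_samples` is passed straight into the hyperparameters below, so it is worth filtering before counting.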
# Create unique job name
job_name_prefix = "groundtruth-augmented-manifest-demo"
timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
job_name = job_name_prefix + timestamp
s3_output_location = "s3://{}/training_outputs/".format(bucket_name)
and create the estimator and hyperparameters:
# Create a model object set to use "Pipe" mode.
model = sagemaker.estimator.Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    volume_size=50,
    max_run=360000,
    input_mode='Pipe',
    output_path=s3_output_location,
    job_name=job_name,
    sagemaker_session=sess,
)
model.set_hyperparameters(
backbone="resnet-101",
algorithm="psp",
use_pretrained_model="False",
crop_size=240,
num_classes=3,
epochs=10,
base_size=540,
learning_rate=0.0001,
optimizer="rmsprop",
lr_scheduler="poly",
mini_batch_size=4,
early_stopping=True,
early_stopping_patience=2,
early_stopping_min_epochs=10,
num_training_samples=num_training_samples
)
Finally, since my files are large, I use a "Pipe" training input.
# Create a train data channel with S3_data_type as 'AugmentedManifestFile' and attribute names.
# attribute_names was set earlier to the manifest's attribute names,
# e.g. ["source-ref", "<labeling-job-name>-ref"].
train_data = sagemaker.inputs.TrainingInput(
    s3_data=s3_train_data_path,
    distribution='FullyReplicated',
    content_type='application/x-recordio',
    s3_data_type='AugmentedManifestFile',
    compression='Gzip',
    attribute_names=attribute_names,
    input_mode='Pipe',
    record_wrapping='RecordIO',
)
data_channels = {'train': train_data}
Finally, I try to train my model, just like in the AWS examples. Since I am using an augmented manifest, I shouldn't need a validation channel.
# Train a model.
model.fit(inputs=data_channels, logs=True, wait=True)
However, I get the following error when training starts:
UnexpectedStatusException: Error for Training job semantic-segmentation-2021-05-28-23-53-46-966: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)
Caused by: 'validation' is a required property
Failed validating 'required' in schema['allOf'][2]:
    {'required': ['validation']}
On instance:
    {'train': {'ContentType': 'application/x-recordio',
               'RecordWrapperType': 'RecordIO',
               'S3DistributionType': 'FullyReplicated',
               'TrainingInputMode': 'Pipe'}}
**Answer 1:** You can use the lower-level API on the augmented manifest to run your training job:
# Create unique job name
import time
role = sagemaker.get_execution_role()
nn_job_name_prefix = "labeljob-86-chain-augmented-manifest"
timestamp = time.strftime("-%Y-%m-%d-%H-%M", time.gmtime())
nn_job_name = nn_job_name_prefix + timestamp
training_params = {
    "AlgorithmSpecification": {"TrainingImage": training_image, "TrainingInputMode": "Pipe"},
    "RoleArn": role,
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/output/".format(bucket, s3_prefix)},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.p3.2xlarge", "VolumeSizeInGB": 50},
    "TrainingJobName": nn_job_name,
    "HyperParameters": {
        "backbone": "resnet-50",          # Other option is resnet-101.
        "algorithm": "deeplab",           # fcn, psp, deeplab.
        "use_pretrained_model": "True",   # Use the pre-trained model.
        "crop_size": "240",               # Size of image random crop.
        "num_classes": "2",               # IMPORTANT! Number of classes (starting at 0).
        "epochs": "45",                   # Number of epochs to run (small for testing).
        "learning_rate": "0.001",
        "optimizer": "adam",              # 'sgd', 'adam', 'rmsprop', 'nag', 'adagrad'.
        "lr_scheduler": "poly",           # Other options include 'cosine' and 'step'.
        "mini_batch_size": "16",          # Small mini-batch size for this data set size.
        "validation_mini_batch_size": "4",
        "early_stopping": "True",         # Turn on early stopping. If off, the other early-stopping parameters are ignored.
        "early_stopping_patience": "2",   # Tolerate this many epochs if the mIoU doesn't increase.
        "early_stopping_min_epochs": "5", # No matter what, run at least this many epochs.
        "num_training_samples": str(num_training_samples),  # IMPORTANT!
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": "s3://{}/{}/{}".format(bucket, s3_prefix, "train.manifest"),
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names,
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": "s3://{}/{}/{}".format(bucket, s3_prefix, "validation.manifest"),
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names,
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None",
        },
    ],
}

print("Training job name: {}".format(nn_job_name))
print(
    "\nInput Data Location: {}".format(
        training_params["InputDataConfig"][0]["DataSource"]["S3DataSource"]
    )
)
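The root cause is the same either way: the built-in semantic-segmentation algorithm's input schema requires both a `train` and a `validation` channel. Since the question starts from a single Ground Truth manifest, one way to produce the second channel is to split that manifest into train and validation manifests. A minimal, self-contained sketch of the split (the 90/10 ratio and the stand-in records are assumptions for illustration, not from the original post):

```python
import json
import random

# Stand-in manifest lines; in practice these come from output.manifest.
manifest_lines = [
    json.dumps({"source-ref": "s3://bucket/img{}.jpg".format(i)}) for i in range(10)
]

random.seed(0)  # reproducible shuffle
random.shuffle(manifest_lines)

# 90/10 train/validation split.
split = int(len(manifest_lines) * 0.9)
train_lines = manifest_lines[:split]
validation_lines = manifest_lines[split:]

# Augmented manifests are newline-delimited JSON.
train_manifest = "\n".join(train_lines) + "\n"
validation_manifest = "\n".join(validation_lines) + "\n"

print(len(train_lines), len(validation_lines))  # 9 1
```

After uploading both manifests to S3, create a second `sagemaker.inputs.TrainingInput` for the validation manifest and pass `{'train': ..., 'validation': ...}` to `model.fit`, which satisfies the `'validation' is a required property` check from the error above.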