View train error metrics for Hugging Face Sagemaker model

Posted 2023-03-29


Published: 2022-01-15 06:18:42

Question:

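I have trained a model using Hugging Face's integration with Amazon Sagemaker and their Hello World example.

By calling training_job_analytics on the trained model, I can easily compute and view the metrics generated on the evaluation test set (accuracy, f-score, precision, recall, etc.): huggingface_estimator.training_job_analytics.dataframe()

How can I see the same metrics on the training set (or even the training error for each epoch)?

The training code is essentially the same as in the link, with the extra part from the docs added: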

from sagemaker.huggingface import HuggingFace

# optionally parse logs for key metrics
# from the docs: https://huggingface.co/docs/sagemaker/train#sagemaker-metrics
metric_definitions = [
    {'Name': 'loss', 'Regex': "'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': "'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': "'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': "'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': "'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': "'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}
]
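To see what these definitions capture, here is a minimal sketch that applies two of the patterns to sample log lines using Python's re module. The log lines are hypothetical, shaped like the metric dicts the training script prints, and the helper name extract is made up for illustration. It suggests that the 'loss' pattern only fires on lines that actually contain a 'loss' key, and does not pick up 'eval_loss'.

```python
import re

# Same pattern shape as the metric definitions above, written as raw strings.
LOSS_REGEX = r"'loss': ([0-9]+(.|e\-)[0-9]+),?"
EVAL_LOSS_REGEX = r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"

# Hypothetical log lines (assumed format, not taken from a real job).
train_line = "{'loss': 0.4523, 'learning_rate': 4.5e-05, 'epoch': 1.0}"
eval_line = "{'eval_loss': 0.3811, 'eval_accuracy': 0.91, 'epoch': 1.0}"


def extract(pattern, line):
    """Return the first captured group as a float, or None if nothing matches."""
    match = re.search(pattern, line)
    return float(match.group(1)) if match else None


train_loss = extract(LOSS_REGEX, train_line)
eval_loss = extract(EVAL_LOSS_REGEX, eval_line)
# The quote before "loss" in the pattern keeps it from matching 'eval_loss'.
loss_in_eval_line = extract(LOSS_REGEX, eval_line)

print(train_loss, eval_loss, loss_in_eval_line)
```

If no line containing a 'loss' key is ever written to the log during the job, the 'loss' metric has nothing to match and will not appear in the analytics.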

# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 5,
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}

# init the model (but not yet trained)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters = hyperparameters,
    metric_definitions=metric_definitions
)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})

# does not return metrics on training - only on eval!
huggingface_estimator.training_job_analytics.dataframe()
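A quick way to check which metrics were actually recorded is to group the rows by metric name. The sketch below works on plain dicts and assumes the dataframe has timestamp, metric_name and value columns; that layout, the sample rows, and the helper name metrics_by_name are assumptions for illustration, not output from a real job.

```python
from collections import defaultdict


def metrics_by_name(records):
    """Group (timestamp, value) pairs by metric name, sorted by timestamp."""
    grouped = defaultdict(list)
    for row in records:
        grouped[row["metric_name"]].append((row["timestamp"], row["value"]))
    return {name: sorted(points) for name, points in grouped.items()}


# Hypothetical rows; with a real job they could come from
# huggingface_estimator.training_job_analytics.dataframe().to_dict("records")
records = [
    {"timestamp": 0.0, "metric_name": "eval_loss", "value": 0.41},
    {"timestamp": 0.0, "metric_name": "eval_accuracy", "value": 0.90},
    {"timestamp": 300.0, "metric_name": "loss", "value": 0.35},
    {"timestamp": 300.0, "metric_name": "eval_loss", "value": 0.38},
]

grouped = metrics_by_name(records)
# Keep only the metrics that do not come from the evaluation pass.
train_metrics = {k: v for k, v in grouped.items() if not k.startswith("eval_")}
print(train_metrics)
```

If train_metrics comes back empty for a real job, no training-side metric was captured, which matches the symptom described in the question.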


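Answer 1:

This can be solved by increasing the number of epochs in training to a more realistic value.

Currently the model trains in under 300 seconds, which is when the next timestamp (and presumably the training loss) would be logged, so the job ends before any training metric is recorded.

Change to make: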

hyperparameters = {
    'epochs': 100, # increase the number of epochs to realistic value!
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}

