Saving best metrics based on Custom metrics failing (WARNING:tensorflow:Can save best model only with CUSTOM METRICS available, skipping)

【Posted】: 2021-12-29 04:31:16 【Question】:

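I have defined a callback that runs at the end of each epoch and computes metrics. It computes the required metrics correctly. The code is below for reference.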

Callback to compute metrics at the end of each epoch

class Metrics(tf.keras.callbacks.Callback):
    def __init__(self, train_tf_data, val_tf_data, model, CLASSES, logs=None, **kwargs):
        super().__init__(**kwargs)
        self.train_tf_data = train_tf_data
        self.val_tf_data = val_tf_data
        self.model = model
        self.CLASSES = CLASSES
        # for train data
        self.train_f1_after_epoch = 0
        self.train_prec_after_epoch = 0
        self.train_recall_after_epoch = 0
        # for val data
        self.val_f1_after_epoch = 0
        self.val_prec_after_epoch = 0
        self.val_recall_after_epoch = 0

    def on_train_begin(self, logs=None):
        self.train_reports = None
        self.val_reports = None
        self.val_f1_after_epoch = 0

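    def on_epoch_end(self, epoch, logs=None):
        # Keras passes the epoch's logs dict here; fall back to an empty dict if it is missing
        logs = {} if logs is None else logs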
        # for train data
        self.train_reports = test_model(model=self.model, data=self.train_tf_data, 
                                        CLASSES=self.CLASSES)
        self.train_f1_after_epoch = self.train_reports['f1_score']
        self.train_recall_after_epoch = self.train_reports['recall']
        self.train_prec_after_epoch = self.train_reports['precision']

        # for val data
        self.val_reports = test_model(model=self.model, data=self.val_tf_data, 
                                      CLASSES=self.CLASSES)
        self.val_f1_after_epoch = self.val_reports['f1_score']
        self.val_recall_after_epoch = self.val_reports['recall']
        self.val_prec_after_epoch = self.val_reports['precision']

        # saving train results to log dir
        logs["train_f1_after_epoch"]=self.train_f1_after_epoch
        logs['train_precision_after_epoch'] = self.train_prec_after_epoch
        logs['train_recall_after_epoch'] = self.train_recall_after_epoch
        
        # saving val results to log dir
        logs['val_f1_after_epoch'] = self.val_f1_after_epoch
        logs['val_precision_after_epoch'] = self.val_prec_after_epoch
        logs['val_recall_after_epoch'] = self.val_recall_after_epoch


        print('train_reports_after_epoch', self.train_reports)
        print('val_reports_after_epoch', self.val_reports)

Code for test_model

def test_model(model, data, CLASSES, label_one_hot=True, average="micro"):
    images_ds = data.map(lambda image, label: image)
    labels_ds = data.map(lambda image, label: label).unbatch()
    NUM_VALIDATION_IMAGES = count_data_items(tf_records_filenames=data)
    cm_correct_labels = next(iter(labels_ds.batch(NUM_VALIDATION_IMAGES))).numpy() # get everything as one batch
    if label_one_hot is True:
        cm_correct_labels = np.argmax(cm_correct_labels, axis=-1)
    cm_probabilities = model.predict(images_ds)
    cm_predictions = np.argmax(cm_probabilities, axis=-1)
    
    # cmat = confusion_matrix(cm_correct_labels, cm_predictions, labels=range(len(CLASSES)))

    warnings.filterwarnings('ignore')
    score = f1_score(cm_correct_labels, cm_predictions, labels=range(len(CLASSES)), average=average)
    precision = precision_score(cm_correct_labels, cm_predictions, labels=range(len(CLASSES)), average=average)
    recall = recall_score(cm_correct_labels, cm_predictions, labels=range(len(CLASSES)), average=average)
    # cmat = (cmat.T / cmat.sum(axis=1)).T # normalized
    # print('f1 score: {:.3f}, precision: {:.3f}, recall: {:.3f}'.format(score, precision, recall))
    test_results = {'f1_score': score, 'precision': precision, 'recall': recall}
    warnings.filterwarnings('always')
    return test_results
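To make clear what `average="micro"` computes in `test_model`, here is a minimal pure-Python sketch (no scikit-learn). It pools true positives, false positives and false negatives over all classes before dividing. The function name `micro_precision_recall_f1` and the sample labels are made up for illustration. For single-label multiclass data where every class is counted, the three micro-averaged values coincide with plain accuracy, so the three custom metrics logged by the callback would be expected to be equal to each other.

```python
def micro_precision_recall_f1(y_true, y_pred, num_classes):
    """Micro-averaged precision, recall and F1 (hypothetical helper)."""
    tp = fp = fn = 0
    # Pool the counts over every class instead of averaging per-class scores.
    for c in range(num_classes):
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c and t != c:
                fp += 1
            elif p != c and t == c:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only (not from the question's dataset).
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 1]

precision, recall, f1 = micro_precision_recall_f1(y_true, y_pred, num_classes=3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, f1, accuracy)
```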

Some model code.....

Model code

m1 = tf.keras.metrics.CategoricalAccuracy()
m2 = tf.keras.metrics.Recall()
m3 = tf.keras.metrics.Precision()
m4 = Metrics(train_tf_data=train_data, 
             val_tf_data=test_data, model=model, 
             CLASSES=CLASS_NAMES)
optimizers = [
        tfa.optimizers.AdamW(learning_rate=lr * .001 , weight_decay=wd),
        tfa.optimizers.AdamW(learning_rate=lr, weight_decay=wd)

           ]
optimizers_and_layers = [(optimizers[0], model.layers[0]), (optimizers[1], model.layers[1:])]
    
optimizer = tfa.optimizers.MultiOptimizer(optimizers_and_layers)


model.compile(
    optimizer= optimizer,
    loss = 'categorical_crossentropy',
    metrics=[m1, m2, m3],
    )

Using this in the checkpoint callback

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, 
                                                    monitor = 'val_f1_after_epoch',
                                                    save_best_only=True,
                                                    save_weights_only=True,
                                                    mode='max',
                                                    save_freq='epoch',
                                                    verbose=1)
                                                    
checkpoint_cb._supports_tf_logs = False

The problem I am facing is that it gives me a warning saying

WARNING:tensorflow:Can save best model only with val_f1_after_epoch available, skipping

After inspecting the history, I found that the metric is available there

print(list(history.history.keys()))
['loss',
'categorical_accuracy',
'recall',
'precision',
'val_loss',
'val_categorical_accuracy',
'val_recall',
'val_precision',
'train_f1_after_epoch',
'train_precision_after_epoch',
'train_recall_after_epoch',
'val_f1_after_epoch', #this is the metrics
'val_precision_after_epoch',
'val_recall_after_epoch']

Please let me know what I am missing here. I want to save the best model based on my custom metric.

【Answer 1】:

Make sure the metrics callback is listed before the ModelCheckpoint callback.

history = model.fit(train_data, validation_data=test_data, epochs=N_EPOCHS, callbacks=[m4, checkpoint_cb])

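When a list of callbacks is passed to the model, the callbacks are invoked at each stage of training in the order they appear in the list. If both Metrics and ModelCheckpoint do their work in on_epoch_end, you must make sure the order of the callbacks is [Metrics, ModelCheckpoint], so that the custom metric has already been written to logs when ModelCheckpoint looks for it.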

Note: Metrics inherits from Callback and computes its values in on_epoch_end, so they are only produced at the end of each epoch.
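The ordering argument above can be illustrated with a small pure-Python sketch. It is a simplified stand-in, not Keras source code: `MetricWriter`, `BestOnlyCheckpoint` and `run_epoch_end` are hypothetical names, and the only assumption is that callbacks run in list order against one shared `logs` dict and that the checkpoint skips saving when the monitored key is missing or `None`.

```python
# Simplified stand-in for a callback list sharing one `logs` dict.
# This is NOT Keras code; every name below is hypothetical.

class MetricWriter:
    """Plays the role of the custom Metrics callback: adds a key to logs."""

    def __init__(self, key, value):
        self.key = key
        self.value = value

    def on_epoch_end(self, epoch, logs):
        logs[self.key] = self.value


class BestOnlyCheckpoint:
    """Plays the role of ModelCheckpoint(save_best_only=True, mode='max')."""

    def __init__(self, monitor):
        self.monitor = monitor
        self.best = float("-inf")
        self.events = []

    def on_epoch_end(self, epoch, logs):
        current = logs.get(self.monitor)
        if current is None:
            # The situation behind "Can save best model only with ... available, skipping".
            self.events.append("skipped")
        elif current > self.best:
            self.best = current
            self.events.append("saved")
        else:
            self.events.append("not improved")


def run_epoch_end(callbacks, epoch, logs):
    """Call each callback in list order, passing the same logs dict."""
    for cb in callbacks:
        cb.on_epoch_end(epoch, logs)
    return logs


writer = MetricWriter("val_f1_after_epoch", 0.8)

# Metric writer first: the checkpoint sees the key.
good = BestOnlyCheckpoint("val_f1_after_epoch")
run_epoch_end([writer, good], epoch=0, logs={"val_loss": 0.5})

# Checkpoint first: the key is not there yet when the checkpoint runs.
bad = BestOnlyCheckpoint("val_f1_after_epoch")
final_logs = run_epoch_end([bad, writer], epoch=0, logs={"val_loss": 0.5})

print(good.events)
print(bad.events)
print("val_f1_after_epoch" in final_logs)
```

In the second run the key is still present in the final logs even though the checkpoint skipped. So a key showing up in `history.history` does not, on its own, show that it was visible at the moment the checkpoint ran.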

【Comments】:

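Yes, I have listed the metrics callback before the model checkpoint. I tried using logs['val_f1_after_epoch'] = 0 but it did not work for me, and the alternative did not work either.

Also, the alternative you suggested averages the metrics as they are continuously updated. I want to update these metrics, but I do not want to compute an average.

After implementing the code above (the one in the code block), I get ValueError: SyncOnReadVariable does not support assign_add in cross-replica context when aggregation is set to tf.VariableAggregation.SUM. I am using TensorFlow 2.5.0.

I tried upgrading to TensorFlow 2.7, and with the mirrored strategy turned off I was able to run the program, but I got the same warning. This is the problem: WARNING:tensorflow:Can save best model only with val_f1_after_epoch available, skipping

Could you provide more details about the code? I am unable to reproduce the error.

The only time I get this warning is when I put `logs['val_f1_after_epoch'] = None` in the callback.

I have opened a git issue here: github.com/keras-team/keras/issues/15684. Let me know if you need more information.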
