Error occurred when finalizing GeneratorDataset iterator

Posted: 2020-08-11 01:25:18

Question:

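I am training a model with TensorFlow. Everything works fine until the end of the epoch, and then I get the error below. What does it mean?

System information (Google Colab):

GPU 0: Tesla K80 (UUID: GPU-26ddd2bb-3c0f-4772-1bc9-077417190d42)

TensorFlow version: 2.1

CUDA: 10.1

Platform: Ubuntu 18

These are the steps I take: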

training_size = sum(1 for _ in tf.data.TFRecordDataset(self.train_tf_record))

history = self.training_model.fit(training_gen,
                                      epochs=epochs,
                                      callbacks=callbacks,
                                      validation_data=valid_gen,
                                      steps_per_epoch=training_size / batch_size)

train_gen and valid_gen are both generators obtained as follows:

dataset = dataset.prefetch(
    buffer_size=tf.data.experimental.AUTOTUNE)
image, label = next(iter(dataset.take(1)))
while True:
    yield image, label

I am using generators because if I pass both datasets directly to model.fit(), training is killed by the OOM killer as soon as memory runs out.

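Note that if I do not specify steps_per_epoch, the epoch never ends: it keeps stepping past 800 steps until I interrupt execution, even though the dataset is not that large (about 2200 images).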
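A generator built around `while True` never signals that it is exhausted, so a consumer cannot infer the length of an epoch and has to be given a step count. The sketch below uses only the standard library and is not the asker's pipeline: `steps_for` and `repeat_batches` are hypothetical helper names, the batch size of 8 is taken from the call shown in the traceback, and the string batches stand in for (image, label) pairs. It shows a whole-number step count and a generator that walks through every batch before restarting, rather than yielding one element forever. The same reasoning would presumably apply to validation_data through the validation_steps argument of model.fit().

```python
import math
from itertools import islice


def steps_for(num_examples, batch_size):
    """Whole number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


def repeat_batches(batches):
    """Yield each batch in order and restart after the last one.

    This generator never stops on its own, so whoever consumes it
    must be told how many steps make up one epoch.
    """
    while True:
        for batch in batches:
            yield batch


# Assumed figures: about 2200 images (from the question), batch size 8.
steps = steps_for(2200, 8)
print(steps)

# Stand-in batches; a real pipeline would yield (image, label) pairs.
toy_batches = ["batch_0", "batch_1", "batch_2"]
first_five = list(islice(repeat_batches(toy_batches), 5))
print(first_five)
```

Passing an integer such as `steps` avoids handing a float like training_size / batch_size to steps_per_epoch.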

Then this happens after the first epoch:

Traceback (most recent call last):
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2039, in execution_mode
    yield
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 668, in _next_internal
    output_shapes=self._flat_output_shapes)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2552, in iterator_get_next
    _ops.raise_from_not_ok_status(e, name)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6810, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[154] = [6, -2, 18, 0] does not index into shape [8,52,52,3,6]
     [[node PartitionedCall_2/TensorScatterUpdate]] [Op:IteratorGetNext]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "trainer.py", line 265, in <module>
    tr.train(100, 8, 1e-4, dataset_name='beverly_hills')
  File "../Helpers/utils.py", line 33, in wrapper
    result = func(*args, **kwargs)
  File "trainer.py", line 248, in train
    steps_per_epoch=training_size / batch_size)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 71, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 940, in fit
    return_dict=True)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 71, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1150, in evaluate
    steps_per_execution=self._steps_per_execution)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1138, in __init__
    model=model)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 787, in __init__
    peek, x = self._peek_and_restore(x)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 845, in _peek_and_restore
    peek = next(x)
  File "trainer.py", line 150, in initialize_dataset
    image, label = next(iter(dataset.take(1)))
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 644, in __next__
    return self.next()
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 683, in next
    return self._next_internal()
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 674, in _next_internal
    return structure.from_compatible_tensor_list(self._element_spec, ret)
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2042, in execution_mode
    executor_new.wait()
  File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
    pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[154] = [6, -2, 18, 0] does not index into shape [8,52,52,3,6]
     [[node PartitionedCall_2/TensorScatterUpdate]]
2020-04-27 13:07:24.909888: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
     [[node PyFunc]]
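The InvalidArgumentError in the traceback can be read coordinate by coordinate. The sketch below is a plain Python check, not TensorFlow code, and `out_of_range_axes` is a hypothetical helper name. It compares the reported index [6, -2, 18, 0] against the leading dimensions of the reported shape [8,52,52,3,6] and lists the axes where the index falls outside the valid range.

```python
def out_of_range_axes(index, shape):
    """Return the axes on which `index` falls outside `shape`.

    Only the leading len(index) dimensions of `shape` are compared,
    since the reported index has four coordinates and the shape five.
    """
    return [
        axis
        for axis, (i, dim) in enumerate(zip(index, shape))
        if not 0 <= i < dim
    ]


shape = [8, 52, 52, 3, 6]
bad = out_of_range_axes([6, -2, 18, 0], shape)
good = out_of_range_axes([6, 2, 18, 0], shape)
print(bad)   # [1]
print(good)  # []
```

Here the second coordinate, -2, is negative, which suggests that some value feeding the label encoding produced a grid position outside the 52 x 52 grid. Where that value comes from cannot be determined from the traceback alone.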

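Comments:

If training runs fine until the end of the epoch, then your validation data seems to be the problem. The error message is about a shape issue, so check your train_gen and validation_gen to see whether they always produce the same shape.

Both are outputs of the same method. I tested the whole code on a smaller dataset and it ran perfectly well; all sorts of problems started to appear with the large dataset.

Without your model it is hard to say what went wrong. Reading the error message, it looks like a shape problem in your validation data. If your model has a static input shape, check that your generator always produces the same shape as the model input.

@Augusto Maillo I just found out that I had overlooked some errors in the pandas DataFrame I feed to the trainer. I fixed them and will test again to see whether the problem goes away.

@prb_cm I had a problem with my dataset. Try experimenting with samples to see when it happens.

Answer 1: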

I ran into a similar error. I then installed tf-nightly with this command:

!pip install tf-nightly

I changed my loss to categorical cross-entropy and training started.

Comments:

I solved this a while ago and do not remember the exact details, but as far as I recall it was a dataset issue.
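The thread does not say what the dataset problem was. One way to follow the advice of experimenting with samples is to scan the records before training and report the first one that fails a sanity check. The sketch below is purely illustrative: `first_failure` and `check_box` are hypothetical helpers, and the column names and the rule that relative box coordinates lie in [0, 1] are assumptions, not details from the question.

```python
def first_failure(samples, check):
    """Return (position, error) for the first sample `check` rejects, else None."""
    for position, sample in enumerate(samples):
        try:
            check(sample)
        except Exception as error:
            return position, error
    return None


def check_box(row):
    """Hypothetical rule: relative box coordinates must lie in [0, 1]."""
    for key in ("x0", "y0", "x1", "y1"):
        if not 0.0 <= row[key] <= 1.0:
            raise ValueError(f"{key}={row[key]} is outside [0, 1]")


# Made-up rows standing in for records read from a DataFrame.
rows = [
    {"x0": 0.1, "y0": 0.2, "x1": 0.5, "y1": 0.6},
    {"x0": 0.3, "y0": -0.04, "x1": 0.7, "y1": 0.4},
]
result = first_failure(rows, check_box)
print(result)
```

Running a scan like this over the full dataset, rather than only a small subset, would match the observation that the code worked on a smaller dataset and failed on the large one.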
