Error occurred when finalizing GeneratorDataset iterator
【Title】: Error occurred when finalizing GeneratorDataset iterator 【Posted】: 2020-08-11 01:25:18 【Question】: I am training a model with TensorFlow, and everything works fine until the end of the epoch, when I get the error below. What does it mean?
System information (Google Colab):
GPU 0: Tesla K80 (UUID: GPU-26ddd2bb-3c0f-4772-1bc9-077417190d42)
TensorFlow version: 2.1
CUDA: 10.1
Platform: Ubuntu 18
These are the steps I take:
training_size = sum(1 for _ in tf.data.TFRecordDataset(self.train_tf_record))
history = self.training_model.fit(training_gen,
                                  epochs=epochs,
                                  callbacks=callbacks,
                                  validation_data=valid_gen,
                                  steps_per_epoch=training_size / batch_size)
train_gen and valid_gen are both generators obtained as follows:
dataset = dataset.prefetch(
    buffer_size=tf.data.experimental.AUTOTUNE)
image, label = next(iter(dataset.take(1)))
while True:
    yield image, label
I am using generators because if I pass both datasets directly to model.fit(), training is killed by the OOM killer as soon as memory runs out.
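Note that if I do not specify steps_per_epoch, the epoch never ends. It keeps stepping past 800 steps until I interrupt execution, even though the dataset is not that large (about 2200 images).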
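The never-ending epoch follows from the generator itself: a `while True` loop yields forever, so something outside it has to decide where an epoch stops. The plain-Python sketch below (no TensorFlow) illustrates this under stated assumptions: `endless_batches` and `steps_for` are hypothetical names, and the batch size of 8 is assumed from the second argument of `tr.train(100, 8, ...)` in the traceback. It also rounds the step count up to a whole number, since `training_size / batch_size` is a float in Python 3.

```python
import itertools
import math


def endless_batches(batch):
    """Stand-in for the question's generator: it yields forever."""
    while True:
        yield batch


def steps_for(num_samples, batch_size):
    """Whole number of steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


# About 2200 images (from the question); batch size 8 is an assumption.
steps = steps_for(2200, 8)

# Bounding the endless generator by hand, which is the role that
# steps_per_epoch plays for fit(). Without a bound, consuming the
# generator would never finish.
one_epoch = list(itertools.islice(endless_batches("batch"), steps))
print(steps, len(one_epoch))
```

Passing a whole number such as `steps_for(training_size, batch_size)` avoids relying on how `fit()` treats a fractional step count.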
Then this happens after the first epoch:
Traceback (most recent call last):
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2039, in execution_mode
yield
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 668, in _next_internal
output_shapes=self._flat_output_shapes)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 2552, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6810, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[154] = [6, -2, 18, 0] does not index into shape [8,52,52,3,6]
[[node PartitionedCall_2/TensorScatterUpdate]] [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "trainer.py", line 265, in <module>
tr.train(100, 8, 1e-4, dataset_name='beverly_hills')
File "../Helpers/utils.py", line 33, in wrapper
result = func(*args, **kwargs)
File "trainer.py", line 248, in train
steps_per_epoch=training_size / batch_size)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 71, in _method_wrapper
return method(self, *args, **kwargs)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 940, in fit
return_dict=True)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 71, in _method_wrapper
return method(self, *args, **kwargs)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1150, in evaluate
steps_per_execution=self._steps_per_execution)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 1138, in __init__
model=model)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 787, in __init__
peek, x = self._peek_and_restore(x)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/data_adapter.py", line 845, in _peek_and_restore
peek = next(x)
File "trainer.py", line 150, in initialize_dataset
image, label = next(iter(dataset.take(1)))
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 644, in __next__
return self.next()
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 683, in next
return self._next_internal()
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 674, in _next_internal
return structure.from_compatible_tensor_list(self._element_spec, ret)
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/context.py", line 2042, in execution_mode
executor_new.wait()
File "/root/.local/lib/python3.6/site-packages/tensorflow/python/eager/executor.py", line 67, in wait
pywrap_tfe.TFE_ExecutorWaitForAllPendingNodes(self._handle)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[154] = [6, -2, 18, 0] does not index into shape [8,52,52,3,6]
[[node PartitionedCall_2/TensorScatterUpdate]]
2020-04-27 13:07:24.909888: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[node PyFunc]]
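The first exception in the traceback reports that entry 154 of the indices, [6, -2, 18, 0], does not fit a tensor of shape [8,52,52,3,6]: the second coordinate is -2, while valid values for that dimension run from 0 to 51. A minimal sketch of a check that could run on the indices before they reach the TensorScatterUpdate node is shown below. It is plain Python, the helper name `find_out_of_range` is hypothetical, and the first index in the example is made up for contrast.

```python
def find_out_of_range(indices, shape):
    """Return (position, index) pairs whose coordinates fall outside shape.

    An index may be shorter than shape (as in the error message, where a
    4-element index addresses a 5-dimensional tensor); only the leading
    dimensions are checked in that case.
    """
    bad = []
    for position, index in enumerate(indices):
        if len(index) > len(shape):
            bad.append((position, index))
            continue
        for coord, size in zip(index, shape):
            if not 0 <= coord < size:
                bad.append((position, index))
                break
    return bad


# Shape and offending index copied from the error message.
shape = [8, 52, 52, 3, 6]
indices = [[0, 10, 18, 0], [6, -2, 18, 0]]
print(find_out_of_range(indices, shape))
```

A negative grid coordinate like this one usually points back at the input annotations rather than at the model, which is consistent with the shape being reported only when a particular batch is read.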
【Comments】:
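If your training runs fine until the end of the epoch, then your validation data seems to be at fault. The error message is about a shape problem, so check your train_gen and validation_gen and see whether they always produce the same shape.
Both are outputs of the same method. I tested the whole code on a smaller dataset and it ran perfectly well; all sorts of problems started to appear with the large dataset.
Without your model it is hard to say what is going wrong. Reading the error message, this looks like a shape problem in your validation data. If your model input has a static input shape, check that your generator always produces the same shape as the model input.
@Augusto Maillo I just found that I had overlooked some errors in the pandas dataframe fed to the trainer. I fixed them and will test again to see whether the problem goes away.
@prb_cm I had a problem with my dataset. Try experimenting with samples to see when it happens.
【Answer 1】: I ran into a similar error.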
I then installed tf-nightly with the command:
!pip install tf-nightly
After that I changed my loss to categorical cross-entropy and started training.
【Comments】:
I solved this a while ago and do not remember the exact details, but as far as I recall it was a dataset problem.
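Since the thread converges on the dataset as the cause, and one comment suggests trying individual samples to see when the failure occurs, a minimal sketch of that search is given below. It is plain Python under stated assumptions: `first_failure` and `encode_row` are hypothetical names, `encode_row` merely stands in for whatever turns an annotation row into a label, and the rows are made-up values.

```python
def first_failure(samples, process):
    """Run process on each sample; return (position, error) for the first
    failure, or None if every sample is processed without an exception."""
    for position, sample in enumerate(samples):
        try:
            process(sample)
        except Exception as error:  # broad on purpose: the goal is to locate it
            return position, error
    return None


def encode_row(row):
    """Hypothetical stand-in for label encoding: rejects negative coordinates."""
    if min(row) < 0:
        raise ValueError(f"negative coordinate in {row}")
    return row


# Made-up annotation rows; the last one carries a negative coordinate.
rows = [(10, 20, 30, 40), (5, 5, 50, 60), (-12, 8, 40, 44)]
result = first_failure(rows, encode_row)
print(result)
```

Processing the samples one at a time, instead of a whole batch inside `fit()`, ties the exception to a single row that can then be inspected in the source dataframe.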