Invalid argument: Nan in summary histogram by editing the number of labels

【Title】: Invalid argument: Nan in summary histogram by editing the number of labels 【Posted】: 2020-03-07 08:26:32 【Question】:

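I reduced the number of labels in the Cityscapes dataset from the default 19 to 10. My goal is to change the dataset so that the decoder has to relearn its weights, as a preparatory exercise for increasing the number of decoder output classes.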
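The question does not show how the labels were reduced, so the following is only a minimal sketch of one common way to do it: map the kept class IDs to a contiguous range and send everything else to the ignore label. The choice of kept IDs and the helper names `build_lookup`, `remap_labels` and `labels_are_valid` are hypothetical; 255 is the ignore label DeepLab uses for Cityscapes.

```python
IGNORE_LABEL = 255  # ignore value DeepLab uses for Cityscapes

# Hypothetical choice: keep the first 10 of the 19 train IDs.
KEPT_TRAIN_IDS = list(range(10))


def build_lookup(kept_ids):
    """Map each kept ID to a contiguous index 0..len(kept_ids)-1."""
    return {old: new for new, old in enumerate(kept_ids)}


def remap_labels(label_rows, lookup, ignore_label=IGNORE_LABEL):
    """Remap a 2D label map (list of rows); unknown IDs become ignore_label."""
    return [[lookup.get(value, ignore_label) for value in row]
            for row in label_rows]


def labels_are_valid(label_rows, num_classes, ignore_label=IGNORE_LABEL):
    """Check that every pixel is in [0, num_classes) or is the ignore label."""
    return all(0 <= value < num_classes or value == ignore_label
               for row in label_rows for value in row)


if __name__ == "__main__":
    lookup = build_lookup(KEPT_TRAIN_IDS)
    original = [[0, 9, 10], [18, 255, 3]]
    remapped = remap_labels(original, lookup)
    print(remapped)
    print(labels_are_valid(remapped, num_classes=len(KEPT_TRAIN_IDS)))
```

Running a check like `labels_are_valid` over the converted annotations before training is a cheap way to rule out label values that fall outside the new class range.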

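The network I am using is DeepLab, and at first the training went fine. It ran for about 500 steps before the error appeared.

(The log below does not start at the first line printed after training began.)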

I1111 16:19:23.461441 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.82067
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
I1111 16:19:28.894436 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84064
Total loss is :[6.23576546]
INFO:tensorflow:global_step/sec: 1.84368
I1111 16:19:34.318257 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84368
Total loss is :[6.09628582]
INFO:tensorflow:global_step/sec: 1.83645
I1111 16:19:39.763585 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.83645
Total loss is :[6.20008707]
INFO:tensorflow:global_step/sec: 1.84192
I1111 16:19:45.192930 140502638323520 basic_session_run_hooks.py:692] global_step/sec: 1.84192
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1]]
     [[Mean_225/_10177]]
  (1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/zwang/workspace//models-master/research/deeplab/train.py", line 515, in main
    sess.run([train_tensor])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 754, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1252, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1353, in run
    raise six.reraise(*original_exc_info)
  File "/home/zwang/.local/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1338, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1411, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/monitored_session.py", line 1169, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
     [[Mean_225/_10177]]
  (1) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1
     [[node image_pooling/BatchNorm/moving_variance_1 (defined at home/zwang/workspace//models-master/research/deeplab/train.py:328) ]]
0 successful operations.
0 derived errors ignored.

Errors may have originated from an input operation.
Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
 image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)

Input Source operations connected to node image_pooling/BatchNorm/moving_variance_1:
 image_pooling/BatchNorm/moving_variance/read (defined at home/zwang/workspace/models-master/research/deeplab/model.py:478)

Original stack trace for 'image_pooling/BatchNorm/moving_variance_1':
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 521, in <module>
    tf.app.run()
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "home/zwang/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 472, in main
    dataset.ignore_label)
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 379, in _train_deeplab_model
    reuse_variable=(i != 0))
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 275, in _tower_loss
    _build_deeplab(iterator, {common.OUTPUT_TYPE: num_of_classes}, ignore_label)
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 257, in _build_deeplab
    output_type_dict[model.MERGED_LOGITS_SCOPE])
  File "home/zwang/workspace//models-master/research/deeplab/train.py", line 328, in _log_summaries
    tf.summary.histogram(model_var.op.name, model_var)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/summary/summary.py", line 179, in histogram
    tag=tag, values=values, name=scope)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_logging_ops.py", line 329, in histogram_summary
    "HistogramSummary", tag=tag, values=values, name=name)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

I think the relevant error is:

  (0) Invalid argument: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_1

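It looks like a TensorBoard error. Is there any way to avoid it?

My training ran for 500 of the 30000 steps without any problem. I am hoping the training will run fine either with some part of the functionality turned off (such as the TensorBoard histograms), or after editing num_of_labels somewhere else (perhaps another parameter related to num_of_classes also needs to be edited).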
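The histogram summary raises this error when the tensor it is given contains NaN, so it is reporting the problem rather than causing it; turning it off would probably only hide the NaN in `moving_variance`. One way to see whether training was already diverging is to pull the loss values out of the log and look for the first non-finite one. This is a minimal sketch that assumes the `Total loss is :[...]` line format shown in the log above; the helper names are hypothetical.

```python
import math
import re

# Matches lines such as "Total loss is :[6.42209053]".
LOSS_PATTERN = re.compile(r"Total loss is\s*:\s*\[([^\]]+)\]")


def extract_losses(log_text):
    """Return the logged total loss values in order of appearance."""
    return [float(match.group(1)) for match in LOSS_PATTERN.finditer(log_text)]


def first_non_finite(values):
    """Return the index of the first NaN/inf value, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index
    return None


if __name__ == "__main__":
    sample_log = """
Total loss is :[6.42209053]
INFO:tensorflow:global_step/sec: 1.84064
Total loss is :[6.23576546]
Total loss is :[6.09628582]
Total loss is :[6.20008707]
"""
    losses = extract_losses(sample_log)
    print(losses)
    print(first_non_finite(losses))
```

In the excerpt above the four logged losses are all finite, which suggests the NaN showed up in the batch-norm statistics at or just before the step that failed.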

Could you offer some advice on this error or on my general approach? Thanks.

Best regards

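【Question comments】:

【Answer 1】:

The problem was solved by tuning the training hyperparameters, for example lowering the learning rate to stabilize the training process.

【Comments】:

Can you explain which adjustments you had to make to solve this problem?

As far as I remember, I lowered the initial learning rate. You can also try starting from a pretrained model to get a stable training process.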
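To illustrate what lowering the initial learning rate does, here is a minimal sketch of a polynomial ("poly") decay schedule of the kind DeepLab's training script offers. The answer does not give the actual values used, so the base learning rates below are purely illustrative assumptions; the 30000 total steps come from the question, and `poly_learning_rate` is a hypothetical helper.

```python
def poly_learning_rate(step, base_lr, total_steps, power=0.9, end_lr=0.0):
    """Polynomial decay from base_lr at step 0 to end_lr at total_steps."""
    step = min(step, total_steps)  # hold at end_lr after the schedule ends
    fraction_left = 1.0 - step / total_steps
    return (base_lr - end_lr) * fraction_left ** power + end_lr


if __name__ == "__main__":
    total_steps = 30000  # from the question
    # Illustrative base learning rates, not the values used by the answerer.
    for base_lr in (1e-3, 1e-4):
        rates = [poly_learning_rate(s, base_lr, total_steps)
                 for s in (0, 500, 15000, 30000)]
        print(base_lr, rates)
```

Because the schedule scales linearly with `base_lr`, reducing the base value shrinks every step of training by the same factor, including the early steps around which the NaN appeared.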
