TensorFlow Debugger ValueError:在设备的分区图中找不到节点名称“Add/x”

Posted

技术标签:

【中文标题】TensorFlow Debugger ValueError:在设备的分区图中找不到节点名称“Add/x”【英文标题】:TensorFlow Debugger ValueError: Node name 'Add/x' is not found in partition graphs of device 【发布时间】:2018-09-20 15:58:55 【问题描述】:

我正在开发 TensorFlow 1.6,并尝试在我的程序中设置 TensorFlow 调试器 tfdbg。当我在 tfdbg 终端中输入命令运行时,我收到以下错误:

Traceback (most recent call last):
  File "/Users/Documents/imputation/main.py", line 346, in <module>
    args_ = _Parser(description='Train/evaluate the network for incidents '
  File "/Users/Documents/imputation/main.py", line 312, in parse_args
    command(args, parser)
  File "/Users/Documents/imputation/main.py", line 222, in _call
    args_dict = _Train._call(namespace, parser)
  File "/Users/Documents/imputation/main.py", line 151, in _call
    train(**args_dict)
  File "/Users/Documents/imputation/tf_impute.py", line 185, in train
    mon_sess.run([train_op,
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/training/monitored_session.py", line 1178, in run
    run_metadata=run_metadata))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/wrappers/hooks.py", line 150, in after_run
    self._session_wrapper.on_run_end(on_run_end_request)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/wrappers/local_cli_wrapper.py", line 323, in on_run_end
    self._dump_root, partition_graphs=partition_graphs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/lib/debug_data.py", line 495, in __init__
    self._load_all_device_dumps(partition_graphs, validate)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/lib/debug_data.py", line 517, in _load_all_device_dumps
    self._load_partition_graphs(partition_graphs, validate)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/lib/debug_data.py", line 797, in _load_partition_graphs
    self._validate_dump_with_graphs(debug_graph.device_name)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/tensorflow/python/debug/lib/debug_data.py", line 842, in _validate_dump_with_graphs
    "device %s." % (datum.node_name, device_name))
ValueError: Node name 'Add/x' is not found in partition graphs of device /job:localhost/replica:0/task:0/device:CPU:0.

我也在https://github.com/tensorflow/tensorflow/issues/8753 中查看该问题,其中讨论了类似的问题,但提供的解决方案对我不起作用。我已经尝试将 tfdbg 实现为会话的包装器以及挂钩内。我实现 tfdbg 的部分代码如下所示:

class _LoggerHook(tf.train.SessionRunHook):
    cumulative_loss = 0

    def begin(self):
        self._step = -1
        self._start_time = time.time()

    def before_run(self, run_context):
        self._step += 1
        return tf.train.SessionRunArgs(loss)

    def after_run(self, run_context, run_values):
        loss_value = run_values.results
        self.cumulative_loss += loss_value

        if self._step == 0:
            print('Starting training at %s' % datetime.now())
        elif self._step % print_step == 0:
            current_time = time.time()
            duration = current_time - self._start_time
            self._start_time = current_time

            rms_error = math.sqrt(2 * self.cumulative_loss / print_step)
            self.cumulative_loss = 0
            examples_per_sec = print_step * batch_size / duration
            sec_per_batch = float(duration / print_step)

            format_str = (
                '%s: %d examples, rms_error = %.6f (%.1f examples/sec; '
                '%.3f sec/batch)')
            print(format_str % (
                datetime.now(), self._step * batch_size, rms_error,
                examples_per_sec, sec_per_batch))

max_steps = epochs * (examples // batch_size)
model_saver = tf.train.Saver(var_list=tf.model_variables())

class _CheckpointSaverHook(CheckpointSaverHook):
    def __init__(self, *args, **kwargs):
        super(_CheckpointSaverHook, self).__init__(*args, **kwargs)
        assert self._listeners == [], 'CheckpointSaverListener not ' \
                                      'allowed'

    def end(self, session):


class _FinalStepHook(FinalOpsHook):

    def end(self, session):
        super(_FinalStepHook, self).end(session)
        print('Saving last checkpoint at step %d' % session.run(
            global_step))
        model_saver.save(session,
                         os.path.join(train_dir, "model.ckpt"),
                         global_step)

final_hook = _FinalStepHook([train_op, preds_update_op])
scaffold = tf.train.Scaffold(saver=model_saver)
logger_hook = _LoggerHook()
hooks = [_CheckpointSaverHook(checkpoint_dir=train_dir, save_secs=1000,
                              scaffold=scaffold),
         tf.train.StopAtStepHook(last_step=max_steps - 1),
         tf.train.NanTensorHook(loss), logger_hook, final_hook,
         tf_debug.LocalCLIDebugHook()]
config = tf.ConfigProto(log_device_placement=log_device_placement)
config.gpu_options.allow_growth = True
start_train = time.time()
with tf.train.MonitoredTrainingSession(checkpoint_dir=train_dir,
    hooks=hooks, config=config, save_checkpoint_secs=0,
    scaffold=scaffold) as mon_sess:
    try:
      while not mon_sess.should_stop():
          mon_sess.run([train_op,
                        # globals_preds
                        ])
    except OutOfRangeError as e:
      print(e)
      print('global step %s' % logger_hook._step)
    except KeyboardInterrupt:
      print('Train interrupted at global step %s' % logger_hook._step)
print('Training %d examples in %d epochs took %s' % (
    examples, epochs, secs_to_time(time.time() - start_train)))
upload_timestamped_tar(s3_url, train_dir, keep_dir, keep_tar, wait)
return final_hook.final_ops_values[1]

您知道如何解决此问题吗?

【问题讨论】:

您从github.com/tensorflow/tensorflow/issues/8753 尝试了什么? 我尝试在LocalCLIDebugWrapperSesion 中使用关键字参数thread_name_filter,它仅在选定线程上激活调试器CLI。我什至不确定这是否导致了 ValueError。 【参考方案1】:

我现在解决了这个问题。问题是我在代码中的某处使用了加号+ 而不是tf.add。当我检查 Tensorboard 中的图表时,我意识到“add/x”节点已经存在,但带有一个小写字母,例如 here.

将代码中的部分更改为tf.add 后,Tensorboard 中的节点也更改为“Add/x”,大写字母如here。最后,TensorFlow Debugger 能够正确识别节点并且现在可以正常工作了。

【讨论】:

我遇到了类似的问题:调试器正在寻找节点"RNN/basic_rnn_cell/.../Adagrad",它实际上位于"rnn/basic_rnn_cell/.../Adagrad" 的另一个命名空间中。命名空间“RNN”是我创建的,“rnn”是 tensorflow 创建的。我将我的命名空间重命名为“rnn”并且它起作用了。所以大写字母的另一个问题。感谢您指出正确的方向!

以上是关于TensorFlow Debugger ValueError:在设备的分区图中找不到节点名称“Add/x”的主要内容,如果未能解决你的问题,请参考以下文章

Fiddler Web Debugger简单调试头部参数

tensorflow-clip_by_value

深度学习之tensorflow框架(中)

tensorflow.python.framework.errors_impl.InvalidArgumentError: You must feed a value for placeholder

TensorFlow 变量Variable

如何解决loss NAN的问题