Tensorflow: restoring a graph and model then running evaluation on a single image

Posted 2023-02-23

【Asked】: 2016-10-05 13:36:54

【Question】:

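I think it would be immensely helpful to the Tensorflow community if there were a well-documented solution to the crucial task of testing a single new image against the model created by the convnet in the CIFAR-10 tutorial.

I may be wrong, but one critical step seems to be missing that would make the trained model usable in practice. There is a "missing link" in the tutorial: a script that directly loads a single image (as an array or binary), compares it against the trained model, and returns a classification.

Prior answers give partial solutions that explain the overall approach, but I have not been able to implement any of them successfully. Other bits and pieces can be found here and there, but unfortunately they have not added up to a working solution. Kindly consider the research I have done before tagging this as a duplicate or already answered.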

Tensorflow: how to save/restore a model?

Restoring TensorFlow model

Unable to restore models in tensorflow v0.8

https://gist.github.com/nikitakit/6ef3b72be67b86cb7868

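The most popular answer is the first, in which @RyanSepassi and @YaroslavBulatov describe the problem and an approach: one needs to "manually construct a graph with identical node names, and use Saver to load the weights into it". Although both answers are helpful, it is not apparent how one would go about plugging this into the CIFAR-10 project.

A fully functional solution would be highly desirable so that we could port it to other single-image classification problems. Several questions on SO ask for this, but there is still no full answer (for example, Load checkpoint and evaluate single image with tensorflow DNN).

I hope we can converge on a working script that everyone can use.

The script below is not yet functional, and I would be happy to hear how it could be improved to provide single-image classification using the model trained in the CIFAR-10 TF tutorial.

Assume all variables, file names, etc. are untouched from the original tutorial.

New file: cifar10_eval_single.py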

import cv2
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS

tf.app.flags.DEFINE_string('eval_dir', './input/eval',
                           """Directory where to write event logs.""")
tf.app.flags.DEFINE_string('checkpoint_dir', './input/train',
                           """Directory where to read model checkpoints.""")

def get_single_img():
    file_path = './input/data/single/test_image.tif'
    pixels = cv2.imread(file_path, 0)
    return pixels

def eval_single_img():

    # below code adapted from @RyanSepassi, however not functional
    # among other errors, saver throws an error that there are no
    # variables to save
    with tf.Graph().as_default():

        # Get image.
        image = get_single_img()

        # Build a Graph.
        # TODO

        # Create dummy variables.
        x = tf.placeholder(tf.float32)
        w = tf.Variable(tf.zeros([1, 1], dtype=tf.float32))
        b = tf.Variable(tf.ones([1, 1], dtype=tf.float32))
        y_hat = tf.add(b, tf.matmul(x, w))

        saver = tf.train.Saver()

        with tf.Session() as sess:
            sess.run(tf.initialize_all_variables())
            ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)

            if ckpt and ckpt.model_checkpoint_path:
                saver.restore(sess, ckpt.model_checkpoint_path)
                print('Checkpoint found')
            else:
                print('No checkpoint found')

            # Run the model to get predictions
            predictions = sess.run(y_hat, feed_dict={x: image})
            print(predictions)

def main(argv=None):
    if tf.gfile.Exists(FLAGS.eval_dir):
        tf.gfile.DeleteRecursively(FLAGS.eval_dir)
    tf.gfile.MakeDirs(FLAGS.eval_dir)
    eval_single_img()

if __name__ == '__main__':
    tf.app.run()

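【Comments】:

【Answer 1】:

Here is how I ran a single image at a time. I'll admit that reusing the variable scope this way seems a bit hacky.

This is a helper function: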

import os

import tensorflow as tf

def restore_vars(saver, sess, chkpt_dir):
    """ Restore saved net, global score and step, and epsilons OR
    create checkpoint directory for later storage. """
    sess.run(tf.initialize_all_variables())

    checkpoint_dir = chkpt_dir

    if not os.path.exists(checkpoint_dir):
        try:
            os.makedirs(checkpoint_dir)
        except OSError:
            pass

    path = tf.train.get_checkpoint_state(checkpoint_dir)
    #print("path1 = ",path)
    #path = tf.train.latest_checkpoint(checkpoint_dir)
    print(checkpoint_dir,"path = ",path)
    if path is None:
        return False
    else:
        saver.restore(sess, path.model_checkpoint_path)
        return True

Here is the main part of the code, which runs a single image at a time inside a for loop.

to_restore = True
with tf.Session() as sess:

    for i in test_img_idx_set:

            # Gets the image
            images = get_image(i)
            images = np.asarray(images,dtype=np.float32)
            images = tf.convert_to_tensor(images/255.0)
            # resize image to whatever your model takes in
            images = tf.image.resize_images(images,256,256)
            images = tf.reshape(images,(1,256,256,3))
            images = tf.cast(images, tf.float32)

            saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)

            #print("infer")
            with tf.variable_scope(tf.get_variable_scope()) as scope:
                if to_restore:
                    logits = inference(images)
                else:
                    scope.reuse_variables()
                    logits = inference(images)


            if to_restore:
                restored = restore_vars(saver, sess,FLAGS.train_dir)
                print("restored ",restored)
                to_restore = False

            logit_val = sess.run(logits)
            print(logit_val)

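This is an alternative implementation of the above using placeholders; it is a bit cleaner in my opinion. I'll leave the example above for historical reasons.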

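imgs_place = tf.placeholder(tf.float32, shape=[my_img_shape_put_here])
images = tf.reshape(imgs_place,(1,256,256,3))

#print("infer")
logits = inference(images)

# Create the Saver after inference() so the model variables already exist.
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)

with tf.Session() as sess:
    # Restore inside the session; sess is not defined before this point.
    restored = restore_vars(saver, sess,FLAGS.train_dir)
    print("restored ",restored)

    for i in test_img_idx_set:
        # Feed the image data for index i (get_image is the helper used above).
        logit_val = sess.run(logits,feed_dict={imgs_place: get_image(i)})
        print(logit_val)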

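【Comments】:

【Answer 2】:

There are two ways to feed a single new image to the cifar10 model. The first is cleaner but requires modifying the main file, and therefore retraining. The second applies when the user does not want to modify the model files and instead wants to use the existing checkpoint/meta-graph files.

The code for the first approach is as follows: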

import tensorflow as tf
import numpy as np
import cv2

sess = tf.Session('', tf.Graph())
with sess.graph.as_default():
    # Read meta graph and checkpoint to restore tf session
    saver = tf.train.import_meta_graph("/tmp/cifar10_train/model.ckpt-200.meta")
    saver.restore(sess, "/tmp/cifar10_train/model.ckpt-200")

    # Read a single image from a file.
    img = cv2.imread('tmp.png')
    img = np.expand_dims(img, axis=0)

    # Start the queue runners. If they are not started the program will hang
    # see e.g. https://www.tensorflow.org/programmers_guide/reading_data
    coord = tf.train.Coordinator()
    threads = []
    for qr in sess.graph.get_collection(tf.GraphKeys.QUEUE_RUNNERS):
        threads.extend(qr.create_threads(sess, coord=coord, daemon=True,
                                         start=True))

    # In the graph created above, feed "is_training" and "imgs" placeholders.
    # Feeding them will disconnect the path from queue runners to the graph 
    # and enable a path from the placeholder instead. The "img" placeholder will be 
    # fed with the image that was read above.
    logits = sess.run('softmax_linear/softmax_linear:0',
                     feed_dict={'is_training:0': False, 'imgs:0': img})

    # Print classification results.
    print(logits)

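For the script to work, the user must create two placeholders and a conditional execution statement.

The placeholders and the conditional execution statement are added to cifar10_train.py as shown below: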

def train():
    """Train CIFAR-10 for a number of steps."""
    with tf.Graph().as_default():
        global_step = tf.contrib.framework.get_or_create_global_step()

        with tf.device('/cpu:0'):
            images, labels = cifar10.distorted_inputs()

        is_training = tf.placeholder(dtype=bool,shape=(),name='is_training')
        imgs = tf.placeholder(tf.float32, (1, 32, 32, 3), name='imgs')
        images = tf.cond(is_training, lambda:images, lambda:imgs)
        logits = cifar10.inference(images)

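The inputs in the cifar10 model are connected to a queue runner object, a multistage queue that can prefetch data from files in parallel. See a nice animation of queue runners here

While queue runners are efficient at prefetching large datasets for training, they are overkill for inference/testing, where only a single file needs to be classified, and they are also more involved to modify and maintain. For that reason, I have added a placeholder "is_training", which is set to True while training, as shown below: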

import numpy as np
tmp_img = np.ndarray(shape=(1,32,32,3), dtype=float)
with tf.train.MonitoredTrainingSession(
    checkpoint_dir=FLAGS.train_dir,
    hooks=[tf.train.StopAtStepHook(last_step=FLAGS.max_steps),
           tf.train.NanTensorHook(loss),
           _LoggerHook()],
    config=tf.ConfigProto(
        log_device_placement=FLAGS.log_device_placement)) as mon_sess:
  while not mon_sess.should_stop():
    # imgs must still be fed during training, so a dummy array is used.
    mon_sess.run(train_op, feed_dict={is_training: True, imgs: tmp_img})

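The other placeholder, "imgs", holds a tensor of shape (1,32,32,3) for the image that will be fed during inference. The first dimension is the batch size, which is one in this case. I have modified the cifar model to accept 32x32 images instead of 24x24, since the original cifar10 images are 32x32.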
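Changing the input size also changes the shape of the first fully connected layer. The sketch below works through that arithmetic in plain Python. It assumes the tutorial's layer layout (stride-1 'SAME' convolutions, two stride-2 'SAME' pooling layers, and 64 channels after the second convolution), which is not shown in this thread; the function name and parameters are made up for illustration.

```python
import math


def flattened_dim(image_size, num_pools=2, pool_stride=2, channels=64):
    """Size of the flattened feature map fed to the first dense layer.

    Assumes stride-1 'SAME' convolutions (spatial size unchanged) and
    'SAME' pooling, where each pool maps a side n to ceil(n / stride).
    """
    side = image_size
    for _ in range(num_pools):
        side = math.ceil(side / pool_stride)
    return side * side * channels


print(flattened_dim(24))  # 6 * 6 * 64 = 2304
print(flattened_dim(32))  # 8 * 8 * 64 = 4096
```

Under these assumptions, a checkpoint trained on 24x24 crops would hold first-dense-layer weights with 2304 input rows, while a graph built for 32x32 inputs expects 4096, so the two would not be interchangeable. That is consistent with the answer's remark that this method requires retraining.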

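Finally, the conditional statement feeds either the placeholder or the queue runner output to the graph. The "is_training" placeholder is set to False during inference and the "imgs" placeholder is fed a numpy array; the array is reshaped from 3 dimensions to 4 dimensions to match the input tensor of the model's inference function.

That is all there is to it. Any model can be run on a single, user-defined test input as shown in the script above. Essentially: read the graph, feed data to the graph nodes, and run the graph to get the final output.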
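The read-feed-run pattern can be tried end to end on a toy graph. The following is a minimal sketch, not the CIFAR-10 model: it saves a tiny graph and restores it with import_meta_graph, then feeds a placeholder by tensor name. It uses the tf.compat.v1 namespace on the assumption that a recent TensorFlow is installed; the tensor names "imgs" and "logits" and both helper functions are hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()  # graph mode, as in the TF 1.x code above


def save_toy_model(ckpt_prefix):
    """Build a toy graph standing in for a trained model and save it."""
    graph = tf.Graph()
    with graph.as_default():
        x = tf1.placeholder(tf.float32, shape=(1, 4), name="imgs")
        w = tf1.get_variable("w", initializer=tf.ones((4, 2)))
        tf.identity(tf.matmul(x, w), name="logits")
        saver = tf1.train.Saver()
        with tf1.Session(graph=graph) as sess:
            sess.run(tf1.global_variables_initializer())
            # Writes the checkpoint and a matching .meta graph file.
            return saver.save(sess, ckpt_prefix)


def restore_and_run(ckpt_path, batch):
    """Rebuild the graph from the .meta file and feed tensors by name."""
    graph = tf.Graph()
    with graph.as_default():
        saver = tf1.train.import_meta_graph(ckpt_path + ".meta")
        with tf1.Session(graph=graph) as sess:
            saver.restore(sess, ckpt_path)
            return sess.run("logits:0", feed_dict={"imgs:0": batch})


if __name__ == "__main__":
    path = save_toy_model(os.path.join(tempfile.mkdtemp(), "model.ckpt"))
    print(restore_and_run(path, np.ones((1, 4), dtype=np.float32)))
```

Because the restoring side never rebuilds the model in code, the only contract between the two halves is the tensor names, which is why the placeholders in the answer are given explicit names.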

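Now the second method. The other approach is to hack cifar10.py and cifar10_eval.py to change the batch size to one and to replace the data coming from the queue runner with data read from a file.

Set the batch size to 1: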

tf.app.flags.DEFINE_integer('batch_size', 1,
                             """Number of images to process in a batch.""")

Call inference with an image read from a file:

def evaluate():
  with tf.Graph().as_default() as g:
    # Get images and labels for CIFAR-10.
    eval_data = FLAGS.eval_data == 'test'
    images, labels = cifar10.inputs(eval_data=eval_data)
    import cv2
    img = cv2.imread('tmp.png')
    img = np.expand_dims(img, axis=0)
    img = tf.cast(img, tf.float32)

    logits = cifar10.inference(img)

Then pass logits to eval_once and modify eval_once to evaluate logits:

def eval_once(saver, summary_writer, top_k_op, logits, summary_op): 
    ...
    while step < num_iter and not coord.should_stop():
        predictions = sess.run([top_k_op])
        print(sess.run(logits))

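There is no separate script for this method of inference; just run cifar10_eval.py, which will now read a file from the user-defined location with a batch size of one.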
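One caveat when feeding a raw cv2 image to an unmodified model: cv2.imread returns pixels in BGR order, and the tutorial's own evaluation pipeline center-crops to 24x24 and standardizes each image before inference. The sketch below reproduces those steps in NumPy. It assumes the input is at least 24x24 and that the model was trained on such standardized crops; the helper name prepare_image is hypothetical.

```python
import math

import numpy as np


def prepare_image(bgr_image, crop=24):
    """Turn an HxWx3 uint8 BGR image into a (1, crop, crop, 3) float32 batch."""
    # Reverse the channel axis: BGR (OpenCV order) -> RGB.
    rgb = bgr_image[..., ::-1].astype(np.float32)

    # Center crop to crop x crop (assumes the image is at least that large).
    height, width = rgb.shape[:2]
    top, left = (height - crop) // 2, (width - crop) // 2
    patch = rgb[top:top + crop, left:left + crop, :]

    # Per-image standardization: zero mean and unit variance, with a floor
    # on the standard deviation so uniform images do not divide by zero.
    adjusted_std = max(float(patch.std()), 1.0 / math.sqrt(patch.size))
    patch = (patch - patch.mean()) / adjusted_std

    # Add the leading batch dimension expected by the inference function.
    return patch[np.newaxis, ...].astype(np.float32)


if __name__ == "__main__":
    fake = (np.arange(32 * 32 * 3) % 256).astype(np.uint8).reshape(32, 32, 3)
    print(prepare_image(fake).shape)
```

The result could be passed where the snippets above use the raw `img`. Whether this exact preprocessing matches a given checkpoint depends on how that model was trained, so treat it as a starting point.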

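【Comments】:

I added the first part of the code to the cifar10_train example, but there is a problem with the conditional statement, which produces the following error: ValueError: Shape of a new variable (local3/weights) must be fully defined, but instead was (?, 384), and it traces back to cifar10. The problem seems to arise in # local3 at weights = _variable_with_weight_decay('weights', shape=[dim, 384], stddev=0.04, wd=0.004). Without that code, cifar10_train runs correctly.

For method 2, I have a question. Why do we need the line images, labels = cifar10.inputs(eval_data=eval_data), since neither images nor labels is used?

You don't. That line was left over from the original evaluate function and was not removed in my modified version.

【Answer 3】: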

Got it working with this:

softmax = gn.inference(image)
saver = tf.train.Saver()
ckpt = tf.train.get_checkpoint_state(FLAGS.checkpoint_dir)
with tf.Session() as sess:
  saver.restore(sess, ckpt.model_checkpoint_path)
  softmaxval = sess.run(softmax)
  print(softmaxval)

Output:

[[  6.73550041e-03   4.44930716e-04   9.92570221e-01   1.00681427e-06
    3.05406687e-08   2.38927707e-04   1.89839399e-12   9.36238484e-06
    1.51646684e-09   3.38977535e-09]]
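To turn a probability row like this into a readable label, take the index of the largest entry and look it up in the class list. The sketch below assumes the model was trained on CIFAR-10 with the standard label order, which the answer does not state; the helper name top_class is hypothetical.

```python
# Standard CIFAR-10 label order (assumed to match the trained model).
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]


def top_class(probabilities):
    """Return (index, label, probability) for the first row of a batch."""
    row = probabilities[0]
    best = max(range(len(row)), key=lambda i: row[i])
    return best, CIFAR10_CLASSES[best], row[best]


# The softmax output printed above, copied as a nested list.
softmaxval = [[6.73550041e-03, 4.44930716e-04, 9.92570221e-01, 1.00681427e-06,
               3.05406687e-08, 2.38927707e-04, 1.89839399e-12, 9.36238484e-06,
               1.51646684e-09, 3.38977535e-09]]

index, label, prob = top_class(softmaxval)
print(index, label, prob)
```

For the output above, the largest entry is at index 2, which would be "bird" under the assumed label order.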

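【Comments】:

gn.inference is the inference function for the particular model being used. I think gn just means graph. It is the inference function that was used during training, when the checkpoint was written to disk.

【Answer 4】:

I'm afraid I don't have working code for you, but here is how we often tackle this problem in production:

Save the GraphDef to disk, using something like write_graph.

Use freeze_graph to load the GraphDef and checkpoints, and save a GraphDef with the variables converted into constants.

Load the GraphDef in something like label_image or classify_image.

For your example this is a bit of overkill, but I would at least suggest serializing the graph in the original example as a GraphDef and then loading it in your script (so you don't have to duplicate the code that generates the graph). With the same graph created, you should be able to populate it from a SaverDef, and the freeze_graph script may help as an example.

【Comments】: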
