Do not mix tf.data.Dataset with the random package

Posted 2021-10-29 MrCharles

Since the release of TensorFlow 1.4, Datasets have been the new way to build input pipelines for TensorFlow models.

Working with a Dataset follows this pattern:

Create a dataset from your data
Apply some preprocessing to the data
Iterate over the elements

Iteration is streamed, so the whole dataset never has to be loaded into memory at once.
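
The three steps can be sketched on a tiny in-memory dataset. This is only an illustration: the integer list stands in for real inputs, and the multiplication stands in for real preprocessing.

```python
import tensorflow as tf

# 1. Create a dataset from the data (a small list, purely for illustration).
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4])

# 2. Apply some preprocessing to every element.
ds = ds.map(lambda x: x * 10)

# 3. Iterate; elements are produced one at a time as the loop asks for them.
values = [int(x.numpy()) for x in ds]
print(values)
```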

A typical dataset construction looks like this:

import os
import tensorflow as tf

def ListFiles(basedir, ext):
    # One endlessly repeating dataset of file paths per class directory.
    list_ds = tf.data.Dataset.list_files(basedir + "/*." + ext).repeat()
    return list_ds

# Excerpt: dbdir, ext, buffer_size, samples_per_class, classes_per_batch,
# dataset_ext and is_ccrop are assumed to be defined by the surrounding script;
# pair_parser and load_and_preprocess_image are shown further down.
allsubdir = [os.path.join(dbdir, o) for o in os.listdir(dbdir)
             if os.path.isdir(os.path.join(dbdir, o))]
batch_size = classes_per_batch * samples_per_class
path_ds = tf.data.Dataset.from_tensor_slices(allsubdir).repeat()
ds = (path_ds.shuffle(buffer_size)
      # cycle_length class directories are read in turn; each gives block_length paths per turn
      .interleave(lambda x: ListFiles(x, ext), cycle_length=3000,
                  block_length=samples_per_class,
                  num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(batch_size, drop_remainder=True)
      .map(lambda x: pair_parser(x, batch_size, dataset=dataset_ext),
           num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .map(lambda path, labels: load_and_preprocess_image(path, labels, batch_size, is_ccrop=is_ccrop),
           num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .prefetch(tf.data.experimental.AUTOTUNE))

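The code above first collects all directory paths into allsubdir; each directory corresponds to one person. It then builds path_ds: path_ds = tf.data.Dataset.from_tensor_slices(allsubdir).repeat(). Calling repeat() with no argument repeats the dataset indefinitely.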

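Starting from path_ds, the pipeline first shuffles the directory paths, then calls ListFiles on each one to list the files in that directory. Note that ListFiles returns another Dataset object, which interleave flattens into a single stream of file paths. That stream is then split into batches.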
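
How interleave with block_length lines up with the batch size can be seen on toy data. In this sketch the integers 1, 2, 3 stand in for class directories, and the hypothetical helper list_samples stands in for ListFiles; shuffling and parallel calls are left out so the order is easy to follow.

```python
import tensorflow as tf

def list_samples(class_id):
    # Stand-in for ListFiles: every "class" yields four copies of its own id.
    return tf.data.Dataset.from_tensors(class_id).repeat(4)

classes = tf.data.Dataset.from_tensor_slices([1, 2, 3])

# Read from 3 classes in turn, taking 2 consecutive samples from each per turn.
ds = classes.interleave(list_samples, cycle_length=3, block_length=2)

# Batch size = classes per batch * samples per class = 3 * 2.
ds = ds.batch(6, drop_remainder=True)

batches = [b.numpy().tolist() for b in ds]
print(batches)
```

Each batch then holds block_length samples from each of the classes being cycled through, which is the layout that batch-wise pairing needs.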

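After batching, the samples need to be grouped. This is effectively what supplies the three samples that triplet loss needs: anchor, positive and negative. pair_parser is defined as follows: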

def pair_parser(imgs, totalsamples, dataset='VGG2'):
    # imgs is a batch of file paths; the parent directory name encodes the identity.
    images = imgs
    labels = []
    if dataset == 'VGG2':
        # Directory name: a one-character prefix followed by a 6-digit id.
        for i in range(totalsamples):
            labels.append(tf.strings.to_number(
                tf.strings.substr(tf.strings.split(imgs[None, i], os.path.sep)[0, -2], pos=1, len=6),
                out_type=tf.dtypes.int32))
    else:
        # Other datasets: the directory name itself is the integer label.
        for i in range(totalsamples):
            labels.append(
                tf.strings.to_number(tf.strings.split(imgs[None, i], os.path.sep)[0, -2],
                                     out_type=tf.dtypes.int32))

    return images, labels
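
The path-to-label step can be tried in isolation. The sketch below is not the author's code: label_from_path is a hypothetical helper, the paths are made up, and "/" is hard-coded as the separator instead of os.path.sep. It applies the same split, substr and to_number calls to one path at a time.

```python
import tensorflow as tf

def label_from_path(path):
    # ".../n000002/0001_01.jpg" -> "n000002" -> "000002" -> 2
    dir_name = tf.strings.split(path, "/")[-2]
    digits = tf.strings.substr(dir_name, pos=1, len=6)
    return tf.strings.to_number(digits, out_type=tf.int32)

# Made-up paths, for illustration only.
paths = [
    "db/n000002/0001_01.jpg",
    "db/n000002/0002_01.jpg",
    "db/n000045/0001_01.jpg",
]

ds = tf.data.Dataset.from_tensor_slices(paths).map(label_from_path)
labels = [int(v.numpy()) for v in ds]
print(labels)
```

Mapping a per-path function before batching is one way to avoid the Python loop over the batch; the author's version does the same work after batching.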

pair_parser reads the person identity that corresponds to each image and uses it as the label. It returns images, labels. The next step is load_and_preprocess_image:


def _transform_images(is_ccrop=False):
    # is_ccrop is accepted but not used by the augmentations below.
    def transform_images(x_train):
        x_train = tf.image.resize(x_train, (128, 128))
        # All random steps use TensorFlow ops, so they are re-sampled per image.
        x_train = tf.image.random_crop(x_train, (112, 112, 3))
        x_train = tf.image.random_flip_left_right(x_train)
        x_train = tf.image.random_saturation(x_train, 0.6, 1.4)
        x_train = tf.image.random_brightness(x_train, 0.4)
        x_train = x_train / 255
        return x_train
    return transform_images


def preprocess_image(image, totalsamples, is_ccrop=False):
    # Decode and augment each encoded JPEG in the batch.
    images = []
    for i in range(totalsamples):
        img = tf.image.decode_jpeg(image[i], channels=3)
        img = _transform_images(is_ccrop=is_ccrop)(img)
        images.append(img)
    return images


def load_and_preprocess_image(path, labels, totalsamples, is_ccrop=False):
    # Read the raw bytes of every file in the batch.
    image = []
    for i in range(totalsamples):
        image.append(tf.io.read_file(path[i]))
    # Pass is_ccrop through (the original always passed False here).
    return preprocess_image(image, totalsamples, is_ccrop=is_ccrop), labels

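load_and_preprocess_image is also simple: it reads each file from its path, decodes it, applies the augmentations in transform_images, and returns the resulting images together with the labels.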

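Finally, prefetch(tf.data.experimental.AUTOTUNE) lets the pipeline prepare the next batch while the current one is being consumed. The data is loaded into memory ahead of time, so the GPU does not sit idle waiting on I/O.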
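
A minimal sketch of where prefetch sits in a pipeline, using a toy range dataset in place of image loading. It only changes when elements are prepared, not which elements come out or in what order.

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

ds = (tf.data.Dataset.range(5)
      # Stand-in for the expensive decode/augment step.
      .map(lambda x: x * 2, num_parallel_calls=AUTOTUNE)
      # Prepare upcoming elements in the background while the consumer works.
      .prefetch(AUTOTUNE))

values = [int(v.numpy()) for v in ds]
print(values)
```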

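Note the random crop in transform_images. Random operations like this are best done with TensorFlow's own functions; using Python's random package here causes a serious problem.

For example: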

import random

# img_size_w and img_size_h are assumed to be defined elsewhere in the script.
def _transform_images1(is_ccrop=False):
    def transform_images(x_train):
        # Python's random: evaluated once, when tf.data traces this function.
        xi, yi = random.randint(50, 200), random.randint(50, 200)
        wi, hi = img_size_w, img_size_h
        x_train = x_train[:, xi:xi + wi, yi:yi + hi]
        x_train = x_train / 255
        return x_train
    return transform_images

def _transform_images2(is_ccrop=False):
    def transform_images(x_train):
        # TensorFlow ops: re-sampled every time an element is processed.
        xi = tf.random.uniform(shape=(), minval=50, maxval=200, dtype=tf.int32)
        yi = tf.random.uniform(shape=(), minval=50, maxval=200, dtype=tf.int32)
        wi, hi = img_size_w, img_size_h
        x_train = x_train[:, xi:xi + wi, yi:yi + hi]
        x_train = x_train / 255
        return x_train
    return transform_images

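Both _transform_images1 and _transform_images2 run without error. However, tf.data traces a mapped function into a graph once, when the pipeline is built. With _transform_images1, random.randint therefore runs only during that trace, and xi and yi are stored in the graph as constants (wi and hi are fixed either way). Every subsequent image is cropped at the same position, so the generated dataset is wrong and model training suffers. In _transform_images2, tf.random.uniform is a graph op and is re-sampled for every element.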
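
The effect is easy to reproduce without any images. In this sketch the hypothetical functions python_offset and tf_offset stand in for the two crop-offset computations; each is mapped over a toy dataset and the values it produces are collected.

```python
import random
import tensorflow as tf

def python_offset(x):
    # random.randint runs once, while tf.data traces this function;
    # the result is baked into the graph as a constant.
    return tf.constant(random.randint(0, 10**9), dtype=tf.int32)

def tf_offset(x):
    # tf.random.uniform is a graph op, so it is sampled again for every element.
    return tf.random.uniform(shape=(), minval=0, maxval=10**9, dtype=tf.int32)

ds = tf.data.Dataset.range(8)
py_values = [int(v.numpy()) for v in ds.map(python_offset)]
tf_values = [int(v.numpy()) for v in ds.map(tf_offset)]

print("distinct values with random.randint:   ", len(set(py_values)))
print("distinct values with tf.random.uniform:", len(set(tf_values)))
```

The random.randint version yields the same number for all eight elements, while the tf.random.uniform version yields a fresh draw each time.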

So remember to be careful with non-TF functions inside a dataset pipeline.
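
If Python-side randomness really is needed, one possible workaround (not something the original post uses) is to wrap the Python call in tf.py_function, so that it runs for every element instead of once at trace time. The names python_offset and per_element_offset below are hypothetical.

```python
import random
import tensorflow as tf

def python_offset(_):
    # Plain Python; executed eagerly each time the wrapping op runs.
    return random.randint(0, 10**9)

def per_element_offset(x):
    # tf.py_function turns the Python call into an op inside the traced graph.
    offset = tf.py_function(func=python_offset, inp=[x], Tout=tf.int64)
    offset.set_shape(())  # py_function does not infer a static shape
    return offset

ds = tf.data.Dataset.range(8).map(per_element_offset)
values = [int(v.numpy()) for v in ds]
print("distinct values:", len(set(values)))
```

The trade-off is that tf.py_function runs under the Python interpreter lock and cannot be serialized with the graph, so TensorFlow's own random ops remain the better default.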

Charles @ Shenzhen, Cold Dew (Hanlu)
