How to preprocess my ImageDataset using Keras (Augmentation, Split)

Posted 2023-02-16

Asked: 2021-10-21 13:48:17

Question:

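I have an object detection project. I have very little data and want to apply data augmentation with Keras, but I get an error when I try to split the data and save it into training and test sets.

How can I do this?

What I want to do:

First, I want to resize the images in my dataset, then randomly split the data into training and test sets. After saving them to "training" and "test" directories, I want to apply data augmentation to the training folder.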

import os

from tensorflow.keras.applications.xception import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

data_dir = "/..path/"  # path elided in the original post
ds_gen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    validation_split=0.2,
)

train_ds = ds_gen.flow_from_directory(
    data_dir,  # pass the variable, not the literal string "data_dir"
    seed=1,
    target_size=(150, 150),  # adjust to your needs
    batch_size=32,  # adjust to your needs
    # save_to_dir only dumps the generated batches to disk for inspection
    save_to_dir=os.path.join(data_dir, "training"),
    subset="training",
)

val_ds = ds_gen.flow_from_directory(
    data_dir,
    seed=1,
    target_size=(150, 150),
    batch_size=32,
    save_to_dir=os.path.join(data_dir, "validation"),
    subset="validation",
)

Answer 1:

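I recommend using ImageDataGenerator.flow_from_dataframe for what you want to do. Since you are using flow_from_directory, your data is already organized with one subdirectory per class, so the code below can read the image information and create three dataframes: train_df, test_df and valid_df: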

import os

import pandas as pd
from sklearn.model_selection import train_test_split


# random_seed defaults to 123 here as an assumption; the original call omitted it
def preprocess(sdir, trsplit, vsplit, random_seed=123):
    # collect every image path together with its class label (the subdirectory name)
    filepaths = []
    labels = []
    classlist = os.listdir(sdir)
    for klass in classlist:
        classpath = os.path.join(sdir, klass)
        flist = os.listdir(classpath)
        for f in flist:
            fpath = os.path.join(classpath, f)
            filepaths.append(fpath)
            labels.append(klass)
    Fseries = pd.Series(filepaths, name='filepaths')
    Lseries = pd.Series(labels, name='labels')
    df = pd.concat([Fseries, Lseries], axis=1)
    # split df into train_df and a remainder, then split the remainder into valid_df and test_df
    dsplit = vsplit / (1 - trsplit)  # validation fraction relative to the remainder
    strat = df['labels']
    train_df, dummy_df = train_test_split(df, train_size=trsplit, shuffle=True, random_state=random_seed, stratify=strat)
    strat = dummy_df['labels']
    valid_df, test_df = train_test_split(dummy_df, train_size=dsplit, shuffle=True, random_state=random_seed, stratify=strat)
    print('train_df length: ', len(train_df), '  test_df length: ', len(test_df), '  valid_df length: ', len(valid_df))
    print(train_df['labels'].value_counts())
    return train_df, test_df, valid_df


sdir = "/..path/"  # path elided in the original post
train_split = .8  # fraction of the data you want for the train set
valid_split = .1  # fraction of the data you want for the validation set
# the fraction used for the test set is 1 - train_split - valid_split
train_df, test_df, valid_df = preprocess(sdir, train_split, valid_split)
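
The dsplit computation in preprocess can look odd at first. The second train_test_split call only sees the rows left over after the training rows are taken out, so the validation fraction has to be rescaled relative to that remainder. The following is a minimal sketch of that arithmetic only. split_sizes is a hypothetical helper, not part of the answer's code, and the real train_test_split may round the counts slightly differently.

```python
def split_sizes(n_total, trsplit, vsplit):
    """Approximate row counts produced by the two-stage split (illustrative only)."""
    n_train = round(n_total * trsplit)   # first split: training rows
    n_rest = n_total - n_train           # rows handed to the second split
    dsplit = vsplit / (1 - trsplit)      # validation share of the remainder
    n_valid = round(n_rest * dsplit)
    n_test = n_rest - n_valid            # whatever is left becomes the test set
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(1000, 0.8, 0.1)
print(n_train, n_valid, n_test)
```

With 1000 images, train_split=.8 and valid_split=.1, the remainder is 200 rows and dsplit is about 0.5, so half of the remainder goes to validation and half to test.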

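The function prints the class balance by showing how many samples each class has in the training dataframe. Review these numbers and decide how many samples you want per class. For example, if class0 has 3000 samples, class1 has 1200 and class2 has 800, you might decide that every class in the training dataframe should have 1000 samples (max_samples=1000). That means you must create 200 augmented images for class2 and reduce the number of images for class0 and class1. The functions below do this for you. The trim function caps the number of samples in a class at the maximum. The balance function calls trim, then creates directories to store the augmented images, builds an aug_df dataframe and merges it with train_df. The result is a composite dataframe ndf that serves as the training set and is exactly balanced, with max_samples samples in each class.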

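import os
import shutil

import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def trim (df, max_size, min_size, column):
    # cap each class at max_size samples; classes with fewer than min_size samples are dropped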
    df=df.copy()
    sample_list=[] 
    groups=df.groupby(column)
    for label in df[column].unique():        
        group=groups.get_group(label)
        sample_count=len(group)         
        if sample_count> max_size :
            samples=group.sample(max_size, replace=False, weights=None, random_state=123, axis=0).reset_index(drop=True)
            sample_list.append(samples)
        elif sample_count>= min_size:
            sample_list.append(group)
    df=pd.concat(sample_list, axis=0).reset_index(drop=True)
    balance=list(df[column].value_counts())
    print (balance)
    return df


def balance(train_df,max_samples, min_samples, column, working_dir, image_size):
    train_df=train_df.copy()
    train_df=trim (train_df, max_samples, min_samples, column)    
    # make directories to store augmented images
    aug_dir=os.path.join(working_dir, 'aug')
    if os.path.isdir(aug_dir):
        shutil.rmtree(aug_dir)
    os.mkdir(aug_dir)
    for label in train_df['labels'].unique():    
        dir_path=os.path.join(aug_dir,label)    
        os.mkdir(dir_path)
    # create and store the augmented images  
    total=0
    gen=ImageDataGenerator(horizontal_flip=True,  rotation_range=20, width_shift_range=.2,
                                  height_shift_range=.2, zoom_range=.2)
    groups=train_df.groupby('labels') # group by class
    for label in train_df['labels'].unique():  # for every class               
        group=groups.get_group(label)  # a dataframe holding only rows with the specified label 
        sample_count=len(group)   # determine how many samples there are in this class  
        if sample_count< max_samples: # if the class has less than target number of images
            aug_img_count=0
            delta=max_samples-sample_count  # number of augmented images to create
            target_dir=os.path.join(aug_dir, label)  # define where to write the images    
            aug_gen=gen.flow_from_dataframe( group,  x_col='filepaths', y_col=None, target_size=image_size,
                                            class_mode=None, batch_size=1, shuffle=False, 
                                            save_to_dir=target_dir, save_prefix='aug-', color_mode='rgb',
                                            save_format='jpg')
            while aug_img_count<delta:
                images=next(aug_gen)            
                aug_img_count += len(images)
            total +=aug_img_count
    print('Total Augmented images created= ', total)
    # create aug_df and merge with train_df to create composite training set ndf
    if total>0:
        aug_fpaths=[]
        aug_labels=[]
        classlist=os.listdir(aug_dir)
        for klass in classlist:
            classpath=os.path.join(aug_dir, klass)     
            flist=os.listdir(classpath)    
            for f in flist:        
                fpath=os.path.join(classpath,f)         
                aug_fpaths.append(fpath)
                aug_labels.append(klass)
        Fseries=pd.Series(aug_fpaths, name='filepaths')
        Lseries=pd.Series(aug_labels, name='labels')
        aug_df=pd.concat([Fseries, Lseries], axis=1)
        ndf=pd.concat([train_df,aug_df], axis=0).reset_index(drop=True)
    else:
        ndf=train_df
    print (list(ndf['labels'].value_counts()) )
    return ndf 

    
max_samples= 1000 # set this to how many samples you want in each class
min_samples=0
column='labels'
working_dir = r'./' # this is the directory where the augmented images will be stored
img_size=(224,224) # set this to the image size you want for the images
ndf=balance(train_df,max_samples, min_samples, column, working_dir, img_size)
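
To make the bookkeeping in trim and balance easier to follow, here is a minimal sketch that works on class counts only, with no images and no pandas. plan_balance is a hypothetical helper introduced for illustration. It mirrors the branching in the functions above: classes larger than max_samples are cut down, classes below min_samples are dropped, and the rest are topped up with augmented images.

```python
def plan_balance(class_counts, max_samples, min_samples=0):
    """For each class, return (originals_kept, augmented_to_create)."""
    plan = {}
    for label, count in class_counts.items():
        if count > max_samples:
            # trim: keep a random sample of max_samples originals
            plan[label] = (max_samples, 0)
        elif count >= min_samples:
            # balance: keep every original and create the missing images
            plan[label] = (count, max_samples - count)
        # classes with fewer than min_samples images are left out entirely
    return plan


counts = {"class0": 3000, "class1": 1200, "class2": 800}
for label, (kept, augmented) in plan_balance(counts, max_samples=1000).items():
    print(label, kept, augmented)
```

For the example counts in the text this keeps 1000 originals each for class0 and class1, and keeps all 800 originals of class2 while asking for 200 augmented images.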

Now create the train, test and validation generators:

channels=3
batch_size=30
img_shape=(img_size[0], img_size[1], channels)
length=len(test_df)
# pick the largest batch size <= 80 that divides the test set evenly, so that
# test_batch_size * test_steps == length and every test sample is seen exactly once
test_batch_size=sorted([int(length/n) for n in range(1,length+1) if length % n ==0 and length/n<=80],reverse=True)[0]  
test_steps=int(length/test_batch_size)
print ( 'test batch size: ' ,test_batch_size, '  test steps: ', test_steps)
def scalar(img):    
    return img  # EfficientNet expects pixels in range 0 to 255 so no scaling is required
trgen=ImageDataGenerator(preprocessing_function=scalar, horizontal_flip=True)
tvgen=ImageDataGenerator(preprocessing_function=scalar)
train_gen=trgen.flow_from_dataframe( ndf, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                    color_mode='rgb', shuffle=True, batch_size=batch_size)
test_gen=tvgen.flow_from_dataframe( test_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                    color_mode='rgb', shuffle=False, batch_size=test_batch_size)

valid_gen=tvgen.flow_from_dataframe( valid_df, x_col='filepaths', y_col='labels', target_size=img_size, class_mode='categorical',
                                    color_mode='rgb', shuffle=True, batch_size=batch_size)
classes=list(train_gen.class_indices.keys())
class_count=len(classes)
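
The test_batch_size one-liner above is dense, so here is an equivalent sketch written as a plain loop. pick_test_batch_size is a hypothetical name, and the sketch assumes the test set holds at least one sample. It returns the largest divisor of the test set size that does not exceed the cap, so the batches tile the test set exactly.

```python
def pick_test_batch_size(length, max_batch=80):
    """Largest divisor of `length` that is <= max_batch (assumes length >= 1)."""
    for size in range(min(length, max_batch), 0, -1):
        if length % size == 0:
            return size  # first hit while counting down is the largest divisor


length = 150  # illustrative test set size
size = pick_test_batch_size(length)
steps = length // size
print(size, steps)
```

If the test set size is a prime number larger than the cap, the only divisor that fits is 1, so prediction runs one image per step. That is slow but still covers every sample exactly once.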

Now use train_gen and valid_gen in model.fit, and use test_gen in model.evaluate or model.predict.

Comments:

Thanks for your answer, it helped me a lot.
