Building an Image Caption Generator with CNN and LSTM


With thanks to the original article: http://bjbsair.com/2020-04-01/tech-info/18508.html

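When you look at an image, your brain can easily tell what it shows, but can a computer do the same? Computer vision researchers worked on this problem for a long time, and until recently it was considered out of reach. With advances in deep learning, the availability of large datasets, and greater computing power, we can now build models that generate captions for images.

That is what we will implement in this project, using two deep learning techniques together: a convolutional neural network (CNN) and a type of recurrent neural network, the LSTM.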

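What is an image caption generator?

Image caption generation is a task that combines computer vision and natural language processing: the model has to recognize the content of an image and describe it in natural language.

The goal of this project is to learn the concepts behind CNN and LSTM models and to build a working image caption generator by combining a CNN with an LSTM.

In this project we will implement the caption generator with a CNN (convolutional neural network) and an LSTM (long short-term memory) network. Image features are extracted with Xception, a CNN model trained on the ImageNet dataset, and then fed into the LSTM model, which generates the caption.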

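Preparing the dataset

For the image caption generator we will use the Flickr_8K dataset. Larger datasets exist, such as Flickr_30K and MSCOCO, but training the network on them can take weeks, so we use the small Flickr8k dataset. The advantage of a larger dataset is that it lets us build a better model.

Prerequisites

We will need the following libraries: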

  • tensorflow
  • keras
  • pillow
  • numpy
  • tqdm
  • jupyterlab

1. First, we import all the required libraries

import string
import os
from pickle import dump, load

import numpy as np
from PIL import Image
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import add
from keras.models import Model, load_model
from keras.layers import Input, Dense, LSTM, Embedding, Dropout
# small library for showing the progress of loops
from tqdm.notebook import tqdm


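2. Loading the data and performing data cleaning

The file contains images and their captions, one entry per line, separated by a newline ("\n").

Each image has 5 captions, and each caption is assigned a number from #0 to #4.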
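As a rough illustration, one line of the token file can be split into image name, caption number and caption text as shown below. The sample line is made up for this example, and `parse_token_line` is a hypothetical helper that is not part of the project code; it assumes the layout "image name, #, caption number, tab, caption".

```python
# Hypothetical helper: split one line of Flickr8k.token.txt into its parts.
# Assumed line layout: "<image name>#<caption number><TAB><caption text>".
def parse_token_line(line):
    image_id, caption = line.rstrip("\n").split("\t", 1)
    image_name, caption_number = image_id.rsplit("#", 1)
    return image_name, int(caption_number), caption

# Made-up sample line, for illustration only.
sample = "example_image.jpg#0\tA child is playing in the park .\n"
name, number, caption = parse_token_line(sample)
print(name, number, caption)
```

The project code takes a shortcut for the same step: it drops the last two characters of the image id (the "#" and the digit) to obtain the image name.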

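We will define 5 functions:

  • load_doc(filename) – loads the document file and reads its contents into a string.
  • all_img_captions(filename) – creates a descriptions dictionary that maps each image to a list of its 5 captions.
  • cleaning_text(descriptions) – takes all descriptions and cleans them. This is an important step when working with text data; which kind of cleaning to perform depends on the goal. In our case we remove punctuation, convert all text to lowercase, and remove words that contain numbers.
  • text_vocabulary(descriptions) – a simple function that collects all unique words and builds the vocabulary from all descriptions.
  • save_descriptions(descriptions, filename) – creates a list of all preprocessed descriptions and stores them in a file. We will create a descriptions.txt file to store all captions.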

# Loading a text file into memory
def load_doc(filename):
    # Open the file read-only and return its contents as a string
    with open(filename, 'r') as file:
        text = file.read()
    return text

# Get all images with their captions
def all_img_captions(filename):
    file = load_doc(filename)
    captions = file.split('\n')
    descriptions = {}
    for caption in captions[:-1]:
        img, caption = caption.split('\t')
        # img ends in "#<number>"; drop those two characters to get the image name
        if img[:-2] not in descriptions:
            descriptions[img[:-2]] = [caption]
        else:
            descriptions[img[:-2]].append(caption)
    return descriptions

# Data cleaning: lowercasing, removing punctuation and words containing numbers
def cleaning_text(captions):
    table = str.maketrans('', '', string.punctuation)
    for img, caps in captions.items():
        for i, img_caption in enumerate(caps):
            img_caption = img_caption.replace("-", " ")
            desc = img_caption.split()
            # convert to lowercase
            desc = [word.lower() for word in desc]
            # remove punctuation from each token
            desc = [word.translate(table) for word in desc]
            # remove hanging 's and a
            desc = [word for word in desc if len(word) > 1]
            # remove tokens with numbers in them
            desc = [word for word in desc if word.isalpha()]
            # convert back to string
            img_caption = ' '.join(desc)
            captions[img][i] = img_caption
    return captions

def text_vocabulary(descriptions):
    # build vocabulary of all unique words
    vocab = set()
    for key in descriptions.keys():
        for d in descriptions[key]:
            vocab.update(d.split())
    return vocab

# Write all descriptions to one file
def save_descriptions(descriptions, filename):
    lines = list()
    for key, desc_list in descriptions.items():
        for desc in desc_list:
            lines.append(key + '\t' + desc)
    data = "\n".join(lines)
    with open(filename, "w") as file:
        file.write(data)

# Set these paths according to the project folder on your system
dataset_text = r"D:\dataflair projects\Project - Image Caption Generator\Flickr_8k_text"
dataset_images = r"D:\dataflair projects\Project - Image Caption Generator\Flicker8k_Dataset"
# we prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
# load the file that contains all data and
# map it into the descriptions dictionary: image -> 5 captions
descriptions = all_img_captions(filename)
print("Length of descriptions =", len(descriptions))
# cleaning the descriptions
clean_descriptions = cleaning_text(descriptions)
# building vocabulary
vocabulary = text_vocabulary(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
# saving each description to file
save_descriptions(clean_descriptions, "descriptions.txt")
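To make the effect of the cleaning steps concrete, here is a minimal sketch that applies them to a single caption. The sample sentence is made up, and `clean_caption` is a hypothetical single-caption variant of the cleaning logic described above, not the project function itself.

```python
import string

# Translation table that deletes every ASCII punctuation character.
table = str.maketrans("", "", string.punctuation)

def clean_caption(caption):
    # Illustrative single-caption version of the cleaning steps.
    words = caption.replace("-", " ").split()
    words = [w.lower().translate(table) for w in words]
    # Drop one-letter leftovers and tokens that are not purely alphabetic.
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return " ".join(words)

print(clean_caption("A well-dressed man's 2 dogs run ."))
```

Note that single-letter words such as "a" are dropped along with the punctuation and the number, so cleaned captions are slightly less grammatical than the originals.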


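3. Extracting the feature vector from all images

This technique is also called transfer learning: we do not have to do everything ourselves. Instead, we take a pre-trained model that has already been trained on a large dataset, extract features from it, and use them for our task. We use the Xception model, which was trained on the ImageNet dataset with its 1000 classes. The model can be imported directly from keras.applications. Because Xception was originally built for ImageNet classification, we make a small change to integrate it with our model. Note that Xception takes images of size 299 × 299 × 3 as input. We remove the final classification layer and obtain a 2048-dimensional feature vector.

model = Xception(include_top=False, pooling='avg')

The function extract_features() extracts the features of all images and maps each image name to its feature array. We then dump the feature dictionary into the "features.p" pickle file.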

def extract_features(directory):
    model = Xception(include_top=False, pooling='avg')
    features = {}
    for img in tqdm(os.listdir(directory)):
        filename = directory + "/" + img
        image = Image.open(filename)
        image = image.resize((299, 299))
        image = np.expand_dims(image, axis=0)
        # image = preprocess_input(image)
        # scale pixel values from [0, 255] to [-1, 1]
        image = image / 127.5
        image = image - 1.0
        feature = model.predict(image)
        features[img] = feature
    return features

# 2048-dimensional feature vector for every image
features = extract_features(dataset_images)
dump(features, open("features.p", "wb"))
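The two arithmetic lines in extract_features() rescale 8-bit pixel values to the range [-1, 1]. The sketch below shows that mapping on a few plain numbers; `scale_pixel` is a hypothetical name used only for this illustration. To the best of my knowledge this is the same scaling that the commented-out preprocess_input call applies for Xception.

```python
# Scale an 8-bit pixel value (0..255) to the range [-1, 1], mirroring
# image / 127.5 followed by image - 1.0 in extract_features().
def scale_pixel(value):
    return value / 127.5 - 1.0

for v in (0, 127.5, 255):
    print(v, "->", scale_pixel(v))
```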


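Depending on your system, this process can take a long time. Once it has finished, the saved features can be loaded back from the pickle file: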

features = load(open("features.p","rb"))

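4. Loading the dataset for training the model

In the Flickr_8k_text folder we have the Flickr_8k.trainImages.txt file, which contains a list of 6000 image names used for training.

To load the training dataset we need a few more functions:

  • load_photos(filename) – loads the text file as a string and returns the list of image names.
  • load_clean_descriptions(filename, photos) – creates a dictionary containing the captions for each photo in the list of photos. We also wrap each caption in <start> and <end> markers, so that the LSTM model can identify where a caption starts and ends.
  • load_features(photos) – returns a dictionary of image names and their feature vectors previously extracted with the Xception model.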

# load the list of image names
def load_photos(filename):
    file = load_doc(filename)
    photos = file.split("\n")[:-1]
    return photos

def load_clean_descriptions(filename, photos):
    # loading clean_descriptions
    file = load_doc(filename)
    descriptions = {}
    for line in file.split("\n"):
        words = line.split()
        if len(words) < 1:
            continue
        image, image_caption = words[0], words[1:]
        if image in photos:
            if image not in descriptions:
                descriptions[image] = []
            desc = '<start> ' + " ".join(image_caption) + ' <end>'
            descriptions[image].append(desc)
    return descriptions

def load_features(photos):
    # loading all features
    all_features = load(open("features.p", "rb"))
    # selecting only needed features
    features = {k: all_features[k] for k in photos}
    return features

filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)
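One detail is easy to miss: the markers are written as <start> and <end>, yet the test script later works with the plain words start and end. The reason is that the default `filters` setting of the Keras Tokenizer removes characters such as < and >. The sketch below mimics that behaviour in plain Python. `DEFAULT_FILTERS` is copied from the Tokenizer's documented default, and `simple_tokenize` is a hypothetical stand-in, not the Keras implementation.

```python
# Default `filters` value of the Keras Tokenizer (note that it contains < and >).
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text):
    # Hypothetical stand-in: lowercase, replace filtered characters
    # with spaces, then split on whitespace.
    table = str.maketrans({ch: " " for ch in DEFAULT_FILTERS})
    return text.lower().translate(table).split()

caption = "<start> two girls are playing in the grass <end>"
print(simple_tokenize(caption))
```

After tokenization the markers are ordinary vocabulary words, which is why a generated caption begins with "start" and stops at "end".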


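5. Tokenizing the vocabulary

We map each word of the vocabulary to a unique index value. The Keras library provides the Tokenizer class, which we use to create tokens from our vocabulary and save them to a "tokenizer.p" pickle file.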

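# flatten the descriptions dictionary into a list of captions
def dict_to_list(descriptions):
    return [d for caps in descriptions.values() for d in caps]

# fit the Keras Tokenizer on the training captions and save it to tokenizer.p
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dict_to_list(train_descriptions))
dump(tokenizer, open("tokenizer.p", "wb"))
vocab_size = len(tokenizer.word_index) + 1

# calculate the maximum length (in words) of the descriptions
max_length = max(len(d.split()) for d in dict_to_list(descriptions))
max_length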


Our vocabulary contains 7577 words.
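To show what "mapping each word to a unique index" means, here is a minimal pure-Python sketch on two made-up captions. `build_word_index` is a hypothetical stand-in for what fitting a tokenizer produces: the most frequent word gets index 1, and index 0 is left free for padding. Under that convention the vocabulary size is assumed to be the number of distinct words plus one.

```python
from collections import Counter

def build_word_index(captions):
    # Hypothetical stand-in for fitting a tokenizer:
    # the most frequent word receives index 1, the next one index 2, and so on.
    counts = Counter(word for c in captions for word in c.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Made-up captions, for illustration only.
captions = ["start dog runs end", "start dog jumps end"]
word_index = build_word_index(captions)
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding
print(word_index, vocab_size)
```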

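We calculate the maximum length of the descriptions. This is important for deciding the model structure parameters. The maximum description length is 32.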


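6. Creating the data generator

Let us first look at what the model's input and output look like. To turn this into a supervised learning task, we have to give the model inputs and outputs to train on. Each training sample consists of two inputs, the image feature vector and the caption so far, and one output, the next word. We train on 6000 images, each with a 2048-long feature vector and captions that are also represented as numbers. The data for all 6000 images cannot be held in memory at once, so we use a generator that yields batches.

The generator yields the input and output sequences.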

#create input-output sequence pairs from the image description.  
#data generator, used by model.fit_generator()  
def data_generator(descriptions, features, tokenizer, max_length):  
    while 1:  
        for key, description_list in descriptions.items():  
            #retrieve photo features  
            feature = features[key][0]  
            input_image, input_sequence, output_word = create_sequences(tokenizer, max_length, description_list, feature)  
            yield [[input_image, input_sequence], output_word]  
def create_sequences(tokenizer, max_length, desc_list, feature):  
    X1, X2, y = list(), list(), list()  
    # walk through each description for the image  
    for desc in desc_list:  
        # encode the sequence  
        seq = tokenizer.texts_to_sequences([desc])[0]  
        # split one sequence into multiple X,y pairs  
        for i in range(1, len(seq)):  
            # split into input and output pair  
            in_seq, out_seq = seq[:i], seq[i]  
            # pad input sequence  
            in_seq = pad_sequences([in_seq], maxlen=max_length)[0]  
            # encode output sequence  
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]  
            # store  
            X1.append(feature)  
            X2.append(in_seq)  
            y.append(out_seq)  
    return np.array(X1), np.array(X2), np.array(y)  
#You can check the shape of the input and output for your model  
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))  
a.shape, b.shape, c.shape  
#((47, 2048), (47, 32), (47, 7577))
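The way one caption turns into several training samples can be sketched without Keras. In the example below the word ids are made up and the maximum length is shortened to 5 for readability. `pad_left` is a hypothetical stand-in for pad_sequences with its default left ("pre") padding, and `make_pairs` mirrors the inner loop of create_sequences() without the one-hot encoding of the output word.

```python
def pad_left(seq, max_length):
    # Hypothetical stand-in for pad_sequences with its default 'pre' padding.
    return [0] * (max_length - len(seq)) + seq

def make_pairs(seq, max_length):
    # Every prefix of the caption is an input; the word that follows is the target.
    return [(pad_left(seq[:i], max_length), seq[i]) for i in range(1, len(seq))]

# Made-up word ids for a four-token caption such as "start two girls end".
for x, y in make_pairs([1, 7, 9, 2], max_length=5):
    print(x, "->", y)
```

A caption of n tokens yields n - 1 samples, so the 47 in the shapes above is the total over all captions of the first image.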


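7. Defining the CNN-RNN model

To define the structure of the model, we use the Keras Model class from the Functional API. It consists of three main parts:

  • Feature Extractor – the feature extracted from the image has a size of 2048; a dense layer reduces it to 256 nodes.
  • Sequence Processor – an embedding layer handles the text input, followed by the LSTM layer.
  • Decoder – merges the outputs of the two parts above and processes them with a dense layer to make the final prediction. The final layer has as many nodes as our vocabulary size.

A visual representation of the final model is shown below:

[Figure: visual representation of the final model]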

from keras.utils import plot_model

# define the captioning model
def define_model(vocab_size, max_length):
    # features from the CNN model squeezed from 2048 to 256 nodes
    inputs1 = Input(shape=(2048,))
    fe1 = Dropout(0.5)(inputs1)
    fe2 = Dense(256, activation='relu')(fe1)
    # LSTM sequence model
    inputs2 = Input(shape=(max_length,))
    se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
    se2 = Dropout(0.5)(se1)
    se3 = LSTM(256)(se2)
    # merging both models
    decoder1 = add([fe2, se3])
    decoder2 = Dense(256, activation='relu')(decoder1)
    outputs = Dense(vocab_size, activation='softmax')(decoder2)
    # tie it together [image, seq] [word]
    model = Model(inputs=[inputs1, inputs2], outputs=outputs)
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # summarize model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model
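As a sanity check on the architecture, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes used in define_model() and a vocabulary size of 7577, as reported earlier; `dense_params` and `lstm_params` are hypothetical helper names. Dropout and the add layer contribute no parameters.

```python
vocab_size, embed, units, feat = 7577, 256, 256, 2048

def dense_params(n_in, n_out):
    # weight matrix plus one bias per output node
    return n_in * n_out + n_out

def lstm_params(n_in, n_units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * ((n_in + n_units) * n_units + n_units)

counts = {
    "dense (image features)": dense_params(feat, units),
    "embedding": vocab_size * embed,
    "lstm": lstm_params(embed, units),
    "dense (decoder)": dense_params(units, units),
    "dense (softmax output)": dense_params(units, vocab_size),
}
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,}")
```

Under these assumptions the total comes to 5,002,649 parameters, most of them in the embedding and the softmax output layer, both of which scale with the vocabulary size.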


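8. Training the model

To train the model, we use the 6000 training images, generating the input and output sequences in batches and fitting them to the model with model.fit_generator(). We also save the model to our models folder.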

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
# make a directory "models" to save our models
os.makedirs("models", exist_ok=True)
for i in range(epochs):
    generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
    model.fit_generator(generator, epochs=1, steps_per_epoch=steps, verbose=1)
    model.save("models/model_" + str(i) + ".h5")
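The choice of steps = len(train_descriptions) follows from how the generator works: it loops forever and yields one batch per image, so one epoch needs exactly as many steps as there are images. The toy sketch below illustrates this with a made-up three-image dictionary; `toy_generator` is a hypothetical simplification that yields only the image name and its caption count instead of real arrays.

```python
from itertools import islice

def toy_generator(descriptions):
    # Loops forever, yielding one "batch" per image
    # (here just the image name and its number of captions).
    while True:
        for key, caps in descriptions.items():
            yield key, len(caps)

# Made-up stand-in for train_descriptions.
toy = {"img_a.jpg": ["c1", "c2"], "img_b.jpg": ["c3"], "img_c.jpg": ["c4", "c5"]}
steps = len(toy)
one_epoch = list(islice(toy_generator(toy), steps))
print(one_epoch)
```

Drawing exactly `steps` batches visits every image once, which is what a single pass over the training set should do.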


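9. Testing the model

The model has been trained. Now we create a separate file, testing_caption_generator.py, which loads the model and generates predictions. The model predicts index values, one word at a time up to the maximum length, so we use the same tokenizer.p pickle file to map the index values back to words.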

import argparse
from pickle import load

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image
from keras.applications.xception import Xception
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']

def extract_features(filename, model):
    try:
        image = Image.open(filename)
    except OSError:
        raise SystemExit("ERROR: Couldn't open image! Make sure the image path and extension are correct")
    image = image.resize((299, 299))
    image = np.array(image)
    # for images that have 4 channels, we convert them into 3 channels
    if image.shape[2] == 4:
        image = image[..., :3]
    image = np.expand_dims(image, axis=0)
    # scale pixel values from [0, 255] to [-1, 1]
    image = image / 127.5
    image = image - 1.0
    feature = model.predict(image)
    return feature

def word_for_id(integer, tokenizer):
    # look up the word that belongs to a predicted index
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def generate_desc(model, tokenizer, photo, max_length):
    in_text = 'start'
    for i in range(max_length):
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        sequence = pad_sequences([sequence], maxlen=max_length)
        pred = model.predict([photo, sequence], verbose=0)
        pred = np.argmax(pred)
        word = word_for_id(pred, tokenizer)
        if word is None:
            break
        in_text += ' ' + word
        if word == 'end':
            break
    return in_text

# path = 'Flicker8k_Dataset/111537222_07e56d5a30.jpg'
max_length = 32
tokenizer = load(open("tokenizer.p", "rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img = Image.open(img_path)
description = generate_desc(model, tokenizer, photo, max_length)
print("\n\n")
print(description)
plt.imshow(img)
plt.show()


two girls are playing in the grass
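The caption is produced by greedy decoding: at every step the most probable next word is appended, until the end marker or the maximum length is reached. The loop can be sketched without Keras by replacing the trained model with a stub. Everything below is made up for illustration: the five-word vocabulary, and `stub_predict`, a hypothetical stand-in for model.predict that always favours the word whose id is one higher than the last word.

```python
# Toy vocabulary and a stub "model" so the loop can run without Keras.
word_index = {"start": 1, "two": 2, "girls": 3, "play": 4, "end": 5}
index_word = {i: w for w, i in word_index.items()}

def stub_predict(ids):
    # Hypothetical stand-in for model.predict: one score per word id
    # (position 0 is the unused padding id), favouring last id + 1.
    scores = [0.0] * (len(word_index) + 1)
    scores[min(ids[-1] + 1, len(word_index))] = 1.0
    return scores

def greedy_caption(max_length):
    words = ["start"]
    for _ in range(max_length):
        ids = [word_index[w] for w in words]
        scores = stub_predict(ids)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        word = index_word.get(best)
        if word is None:
            break
        words.append(word)
        if word == "end":
            break
    return " ".join(words)

print(greedy_caption(max_length=10))
```

The sketch uses a reverse dictionary (`index_word`) for the index-to-word lookup; this is one possible alternative to scanning the whole word index on every step.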

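Conclusion

In this project we implemented a CNN-RNN model by building an image caption generator. A key point to note is that the model depends on its data, so it cannot predict words outside its vocabulary. We used a small dataset of 8000 images. For a production-level model, we would need to train on a dataset of more than 100,000 images to obtain a more accurate model.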
