dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

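The dataset is a binary sentiment classification corpus of 50,000 movie reviews, each labeled as positive or negative.

info includes the text encoder (tfds.features.text.SubwordTextEncoder). This encoder can encode any string, falling back to byte-level encoding when necessary.

Check the vocabulary size of the bundled text encoder: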

encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))

Output: Vocabulary size: 8185

Using the text encoder:

# The raw string
sample_string = 'Hello TensorFlow.'

# Encode it with the text encoder and print the result
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

# Decode it with the text encoder to restore the original string, then print it
original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

Output:

Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]

The original string: "Hello TensorFlow."
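
To make the idea of subword encoding with a byte-level fallback more concrete, here is a minimal sketch. It is not the TFDS implementation: the tiny vocabulary TOY_SUBWORDS and the functions toy_encode and toy_decode are hypothetical names invented for this illustration. Ids 1 to 5 stand for subwords, higher ids stand for raw bytes, and id 0 is left free for padding.

```python
# Toy illustration only: NOT the real SubwordTextEncoder algorithm or vocabulary.
TOY_SUBWORDS = ["Hello", " ", "Tensor", "Flow", "ow"]   # hypothetical vocabulary
BYTE_OFFSET = len(TOY_SUBWORDS) + 1                     # ids from here on stand for raw bytes


def toy_encode(text):
    """Greedy longest-match subword encoding with a byte-level fallback."""
    ids = []
    pos = 0
    while pos < len(text):
        # Find the longest vocabulary entry that matches at the current position.
        match = max(
            (s for s in TOY_SUBWORDS if text.startswith(s, pos)),
            key=len,
            default=None,
        )
        if match is not None:
            ids.append(TOY_SUBWORDS.index(match) + 1)   # id 0 is reserved for padding
            pos += len(match)
        else:
            # Unknown character: emit one id per UTF-8 byte instead of failing.
            ids.extend(BYTE_OFFSET + b for b in text[pos].encode("utf-8"))
            pos += 1
    return ids


def toy_decode(ids):
    """Invert toy_encode by mapping ids back to subwords or raw bytes."""
    out = bytearray()
    for i in ids:
        if i < BYTE_OFFSET:
            out.extend(TOY_SUBWORDS[i - 1].encode("utf-8"))
        else:
            out.append(i - BYTE_OFFSET)
    return out.decode("utf-8")


encoded = toy_encode("Hello TensorFlow.")
print(encoded)
print(toy_decode(encoded))
```

The final period is not in the toy vocabulary, so it is emitted as a byte id, and decoding still restores the exact input string.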

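2. Data Preprocessing

Preprocessing consists of grouping the encoded strings into batches and using the padded_batch method to zero-pad every sequence to the length of the longest string in its batch.

Why pad the samples?

Reviews differ in length: some are long and some are short. To train the network, every example in a batch must have the same length so that the batch can be stacked into a single tensor. Padding gives the inputs a uniform shape and keeps differences in input length from getting in the way of training.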

# Create batches of the encoded strings
BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Use padded_batch to zero-pad the sequences to the length of the longest string in each batch
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)
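
What zero-padding a batch means can be sketched in plain Python. This is only an illustration of the idea, not how tf.data implements padded_batch; the function name pad_batch and the sample ids are made up for the example.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Hypothetical encoded reviews of different lengths.
batch = [[4025, 222, 6307], [12, 7], [5, 9, 31, 2, 88]]
for row in pad_batch(batch):
    print(row)
```

Because the target length is taken from the longest sequence in each batch, different batches can end up with different padded lengths.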

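3. Building the Model

We use a Sequential model because every layer in the model has exactly one input and produces exactly one output.

Input: the IMDB review data (train_dataset).

Layer 1: an embedding layer, which stores one 64-dimensional vector per token.

When called, it converts a sequence of token indices into a sequence of vectors. For example, 'Hello TensorFlow.' corresponds to the index sequence [4025, 222, 6307, 2327, 4043, 2120, 7975].

These vectors are trainable. After training on enough data, tokens with similar meanings tend to have similar vectors.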
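
The lookup performed by an embedding layer can be sketched without TensorFlow. The sizes below are deliberately tiny and hypothetical, and the names embedding_table and embed are invented for the illustration; in the real layer the table is a trainable weight matrix.

```python
import random

random.seed(0)
VOCAB_SIZE = 10      # hypothetical tiny vocabulary (the article's encoder has 8185 entries)
EMBEDDING_DIM = 4    # the article's model uses 64

# One row of (initially random) weights per token id.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Replace each integer id with its row in the embedding table."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([3, 1, 3])
print(len(vectors), len(vectors[0]))  # sequence length, embedding dimension
```

The same id always maps to the same row, so the first and third vectors in this example are identical.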

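Layer 2: a wrapper around an RNN layer. The wrapper is tf.keras.layers.Bidirectional, and the RNN is an LSTM.

The wrapper is applied to the RNN layer (an LSTM with 64 units). It runs the input through the RNN both forward and backward and then concatenates the two outputs, which helps the RNN learn long-range dependencies.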
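
The forward-and-backward idea can be sketched with a toy recurrence in place of an LSTM. The functions toy_rnn and bidirectional are hypothetical stand-ins; unlike this sketch, the Keras wrapper trains a separate copy of the RNN for each direction.

```python
def toy_rnn(inputs):
    """A stand-in recurrence: the state is a decayed running sum of the inputs."""
    state = 0.0
    for x in inputs:
        state = 0.5 * state + x
    return state  # final state only, like LSTM(..., return_sequences=False)


def bidirectional(inputs):
    """Run the recurrence in both directions and concatenate the two final states."""
    forward = toy_rnn(inputs)
    backward = toy_rnn(list(reversed(inputs)))
    return [forward, backward]


print(bidirectional([1.0, 2.0, 4.0]))  # → [5.25, 3.0]
```

The forward state is dominated by the last input and the backward state by the first one, so the concatenated output carries information from both ends of the sequence.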

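Layer 3: a fully connected Dense layer with 64 units and ReLU activation, which learns a mapping from the RNN features to the target.

Output layer: a fully connected Dense layer with 1 unit that outputs a single value. The layer has no activation, so this value is a logit: a prediction >= 0 (that is, a sigmoid probability >= 0.5) is positive, otherwise negative.

Model code: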

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

Inspect the model structure: tf.keras.utils.plot_model(model)

Or print a text summary: model.summary()
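
The parameter counts that model.summary() reports can be cross-checked by hand. The sketch below assumes the standard Keras parameterization (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper names are made up for the example.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    return input_dim * units + units


VOCAB_SIZE = 8185  # vocabulary size reported earlier in the article

layers = {
    "embedding": embedding_params(VOCAB_SIZE, 64),
    "bidirectional_lstm": 2 * lstm_params(64, 64),   # forward + backward copies
    "dense_64": dense_params(2 * 64, 64),            # input is the two concatenated states
    "dense_1": dense_params(64, 1),
}
for name, count in layers.items():
    print(f"{name}: {count}")
print("total:", sum(layers.values()))
```

Most of the parameters sit in the embedding table, which is typical for text models with a large vocabulary.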

Compiling the model

Compiling mainly means choosing a loss function (loss), an optimizer (optimizer), and evaluation metrics (metrics; accuracy is the usual choice).

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
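
The from_logits=True argument tells the loss that the model outputs raw scores rather than probabilities. The sketch below shows, for a single example, that computing binary cross-entropy directly from the logit gives the same value as applying a sigmoid first. The function names are hypothetical, and the Keras loss additionally averages over the batch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(logit, label):
    """Numerically stable binary cross-entropy for one example, label in {0, 1}."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def bce_from_probability(p, label):
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))


for logit, label in [(2.0, 1), (2.0, 0), (-1.0, 0)]:
    print(logit, label,
          bce_from_logit(logit, label),
          bce_from_probability(sigmoid(logit), label))
```

Working from the logit avoids taking the logarithm of a probability that has been rounded to 0 or 1, which is one reason to leave the sigmoid out of the last layer.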

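4. Training the Model

Here we pass in the prepared training set, use the test set as validation data, and train for 5 epochs. Because validation_steps=30, each epoch is validated on only 30 batches of the test set rather than on all of it.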

history = model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset, 
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
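
A quick calculation shows how many batches these settings imply. It assumes the standard IMDB split of 25,000 training and 25,000 test reviews, which is not stated explicitly in this article.

```python
import math

TRAIN_EXAMPLES = 25_000   # assumption: the standard IMDB split (25,000 train / 25,000 test)
TEST_EXAMPLES = 25_000
BATCH_SIZE = 64
VALIDATION_STEPS = 30

steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
validation_examples = VALIDATION_STEPS * BATCH_SIZE
full_test_batches = math.ceil(TEST_EXAMPLES / BATCH_SIZE)

print("training batches per epoch:", steps_per_epoch)
print("reviews used for validation each epoch:", validation_examples)
print("batches in model.evaluate:", full_test_batches)
```

Under that assumption, the validation accuracy shown during training is an estimate from a small part of the test set, while model.evaluate goes through all of it.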

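PS: Training this model is fairly slow, so use a GPU if you can.

Deciding whether an input is positive or negative: the model outputs a logit, so a prediction >= 0 (a probability >= 0.5 after a sigmoid) is positive, otherwise negative.

5. Evaluating the Model

Import matplotlib and create a helper function to plot the training curves: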

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()

View how the model's accuracy changes:

plot_graphs(history, 'accuracy')

View how the model's loss changes:

plot_graphs(history, 'loss')
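
Besides plotting, the same history.history dictionary can be read programmatically, for example to find the epoch where the validation loss was lowest. The numbers below are hypothetical and only mimic the shape of that dictionary; they are not results from this article, and best_epoch is a made-up helper.

```python
# Hypothetical numbers shaped like `history.history`; not results from this article.
toy_history = {
    "loss": [0.65, 0.42, 0.31, 0.24, 0.19],
    "val_loss": [0.55, 0.40, 0.36, 0.38, 0.43],
}


def best_epoch(history_dict, metric="val_loss"):
    """Return the 1-based epoch with the lowest value of the given metric."""
    values = history_dict[metric]
    return min(range(len(values)), key=values.__getitem__) + 1


epoch = best_epoch(toy_history)
print("lowest validation loss at epoch", epoch)
```

A training loss that keeps falling while the validation loss starts to rise, as in this toy data, is the usual sign of overfitting.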

6. Using the Model

Because every training sample was zero-padded with padded_batch, new data should also be padded before prediction. The code is as follows:

def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sample_pred_text, pad):
  encoded_sample_pred_text = encoder.encode(sample_pred_text)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return (predictions)
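
Note that pad_to_size extends the list it receives in place and leaves inputs longer than size unchanged. As an alternative sketch, the hypothetical helper below returns a new list of exactly the requested length. Truncating long inputs is an assumption added here, not something the original code does.

```python
def pad_or_truncate(ids, size, pad_value=0):
    """Return a new list of exactly `size` ids without modifying the input."""
    clipped = list(ids[:size])
    return clipped + [pad_value] * (size - len(clipped))


short = [4025, 222, 6307]
print(pad_or_truncate(short, 5))
print(short)                                        # the original list is unchanged
print(len(pad_or_truncate(list(range(1, 101)), 64)))
```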

Now use the model.

First zero-pad the sample and then predict. Start with a positive review and look at the model's prediction:

sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

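Output: [[0.53476834]]. The value is a logit and it is above 0, so the review is classified as positive, which is correct.

Now predict directly, without padding the sample:

# Predict directly, without padding the data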
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

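Output: [[0.44686183]]. This logit is still above 0, so the review is again classified as positive, but the score differs from the padded case. The model was trained on zero-padded batches and its Embedding layer does not mask the padding, so padding changes the output. Keeping the input format at prediction time consistent with training is therefore important.

Now look at a negative review and the model's prediction: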

sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

Output: [[-1.15085]]. The logit is below 0, so the review is classified as negative, and the model's prediction is correct.
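
Because the final Dense(1) layer has no activation and the loss is configured with from_logits=True, the printed values are logits, not probabilities. The sketch below converts the three outputs quoted in this section into probabilities and labels; the helper names are made up for the example.

```python
import math


def logit_to_probability(logit):
    return 1.0 / (1.0 + math.exp(-logit))


def classify(logit):
    # logit >= 0 is the same condition as probability >= 0.5
    return "positive" if logit >= 0.0 else "negative"


for logit in [0.53476834, 0.44686183, -1.15085]:   # raw outputs quoted in this article
    print(f"{logit:+.4f} -> p={logit_to_probability(logit):.3f} -> {classify(logit)}")
```

Comparing a raw logit against 0.5 would move the decision boundary to a probability of roughly 0.62, so the threshold should be applied either to the logit at 0 or to the sigmoid output at 0.5.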

Complete code

import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt

# Download the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

# Get the encoder from info (tfds.features.text.SubwordTextEncoder)
encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
sample_string = 'Hello TensorFlow.'

# The encoder can encode any string, falling back to byte encoding when necessary.
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

assert original_string == sample_string
for index in encoded_string:
  print('{} ----> {}'.format(index, encoder.decode([index])))

# Prepare the data for training
BUFFER_SIZE = 10000
BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)


# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

# Train the model
history = model.fit(train_dataset, epochs=5,
                    validation_data=test_dataset, 
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

# Helper function to plot the training and validation curves
def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])
  plt.show()

# View how the model's accuracy and loss change
plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

# Custom padding function
def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return vec

def sample_predict(sample_pred_text, pad):
  encoded_sample_pred_text = encoder.encode(sample_pred_text)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return (predictions)

# Predict directly, without padding the data
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print(predictions)

# Pad the data first, then predict
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

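7. Improving the Model

First, review the structure of the model above: tf.keras.utils.plot_model(model, show_shapes=True)

That model uses only a single RNN layer (an LSTM), but two or more LSTM layers can be stacked.

For example, the model below stacks two LSTM layers, which may work somewhat better. The first LSTM sets return_sequences=True so that it passes its full output sequence to the second one, and a Dropout layer is added before the output layer.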

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])
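
How the tensor shapes change through the two stacked layers can be traced with a small sketch. It assumes the default Bidirectional merge mode, which concatenates the forward and backward outputs; the function name is hypothetical and the strings "batch" and "time" stand for dimensions that vary at run time.

```python
def bidirectional_lstm_shape(input_shape, units, return_sequences):
    """Output shape of Bidirectional(LSTM(units)) with the default 'concat' merge."""
    batch, time, _ = input_shape
    if return_sequences:
        return (batch, time, 2 * units)   # one vector per time step
    return (batch, 2 * units)             # only the final state


shape = ("batch", "time", 64)             # output of Embedding(vocab_size, 64)
shape = bidirectional_lstm_shape(shape, 64, return_sequences=True)
print(shape)
shape = bidirectional_lstm_shape(shape, 32, return_sequences=False)
print(shape)
```

Without return_sequences=True the first layer would output a single vector per review, and the second LSTM would have no sequence to iterate over.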

Below is the code to compile, train, and test the model and to plot its accuracy and loss:

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))


plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')

# predict on a sample text with padding
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

Complete code:

import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt

# Download the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']

# Get the encoder from info (tfds.features.text.SubwordTextEncoder)
encoder = info.features['text'].encoder
print('Vocabulary size: {}'.format(encoder.vocab_size))
sample_string = 'Hello TensorFlow.'

# The encoder can encode any string, falling back to byte encoding when necessary.
encoded_string = encoder.encode(sample_string)
print('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print('The original string: "{}"'.format(original_string))

assert original_string == sample_string
for index in encoded_string:
    print('{} ----> {}'.format(index, encoder.decode([index])))

# Prepare the data for training
BUFFER_SIZE = 10000
BATCH_SIZE = 32  # 32 or 64
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE)
test_dataset = test_dataset.padded_batch(BATCH_SIZE)

# Build the model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64,  return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1)
])

# Compile the model
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

# Train the model
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset,
                    validation_steps=30)

test_loss, test_acc = model.evaluate(test_dataset)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))

# Helper function to plot the training and validation curves
def plot_graphs(history, metric):
    plt.plot(history.history[metric])
    plt.plot(history.history['val_' + metric], '')
    plt.xlabel("Epochs")
    plt.ylabel(metric)
    plt.legend([metric, 'val_' + metric])
    plt.show()

# View how the model's accuracy and loss change
plot_graphs(history, 'accuracy')
plot_graphs(history, 'loss')


# Custom padding function
def pad_to_size(vec, size):
    zeros = [0] * (size - len(vec))
    vec.extend(zeros)
    return vec

def sample_predict(sample_pred_text, pad):
    encoded_sample_pred_text = encoder.encode(sample_pred_text)

    if pad:
        encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
    encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
    predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

    return (predictions)

# demo1: pad the data first, then predict
sample_pred_text = ('The movie was not good. The animation and the graphics '
                    'were terrible. I would not recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

# demo2
sample_pred_text = ('I like this movie better, it is very humanistic.  ')
predictions = sample_predict(sample_pred_text, pad=True)
print(predictions)

Reference for this article: https://www.tensorflow.org/text/tutorials/text_classification_rnn

For details on using tf.keras.layers.Bidirectional, see the official documentation: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional?hl=zh_cn
