TensorFlow labels for classification aren't loaded properly in the model

Posted: 2021-12-15 19:45:59

Question:

There is a problem with the classes in my data: I can't set the final Dense softmax layer to 3 units for my 3 classes instead of 1 without getting an error.

I think the problem is in vectorize_text, but I'm not completely sure. It may also be that I'm not setting up the label tensor correctly.

# Start of data generation

import pandas as pd
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import TextVectorization, Embedding, LSTM, Dense

dummy_data = {'text': ['Love', 'Money', 'War'],
              'labels': [1, 2, 3]}

# Repeat the three samples 500 times each
dummy_data['text'] = dummy_data['text'] * 500
dummy_data['labels'] = dummy_data['labels'] * 500

df_train_bogus = pd.DataFrame(dummy_data)


def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  # Note: the shuffle argument is currently unused
  ds = tf.data.Dataset.from_tensor_slices(dict(dataframe)).batch(batch_size)
  return ds

batch_size = 32
train_ds = df_to_dataset(df_train_bogus, batch_size=batch_size)
val_ds = df_to_dataset(df_train_bogus, batch_size=batch_size)

# Model constants (can be lower but that doesn't matter for this example)
sequence_length = 128
max_features = 20000  # vocab size
embedding_dim = 128
# End of data generation
#  Start of vectorization
vectorize_layer = TextVectorization(
    standardize = 'lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

def vectorize_text(text, labels):
  # These prints run once while tf.data traces the function,
  # so they show symbolic tensors rather than concrete values
  print(text)
  print(labels)

  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), labels

vectorize_layer.adapt(df_train_bogus['text'])

train_ds_vectorized = train_ds.map(lambda x: (vectorize_text(x['text'], x['labels'])))
val_ds_vectorized = val_ds.map(lambda x: (vectorize_text(x['text'], x['labels'])))

"""
Output:
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)

"""
#  The model

model = Sequential()
model.add(Embedding(max_features, embedding_dim, input_length=sequence_length))
model.add(LSTM(embedding_dim, input_shape=(None, sequence_length)))

model.add(Dense(3, activation='softmax'))
#  Fails with this error:
#      ValueError: Shapes (None, 1) and (None, 3) are incompatible

model.summary()

model.compile(loss="categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])  # model 4

epochs = 10

# Fit the model using the train and test datasets.
history = model.fit(train_ds_vectorized, validation_data=val_ds_vectorized, epochs=epochs)


Answer 1:

The labels in your dummy data are causing the problem. Since they are not one-hot encoded, I suggest using the sparse_categorical_crossentropy loss function, which works with integer targets (which you already have). Check the docs for more information. A short illustration of the difference between the two losses comes first, followed by a complete working example.
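For intuition, here is a minimal sketch (the probabilities are made up, not taken from the question) showing that the two losses compute the same thing and differ only in the label format they expect:

import tensorflow as tf

# Hypothetical softmax outputs for a 3-class problem
y_pred = tf.constant([[0.8, 0.1, 0.1],
                      [0.2, 0.2, 0.6]])
y_int = tf.constant([0, 2])            # integer class indices
y_onehot = tf.one_hot(y_int, depth=3)  # the one-hot equivalent

# Both print the same per-sample losses: [-log(0.8), -log(0.6)]
print(tf.keras.losses.sparse_categorical_crossentropy(y_int, y_pred))
print(tf.keras.losses.categorical_crossentropy(y_onehot, y_pred))

And here is the complete working example: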

import tensorflow as tf
import pandas as pd

dummy_data = {'text': ['Love', 'Money', 'War'],
              'labels': [0, 1, 2]}
dummy_data['text'] = dummy_data['text']*500
dummy_data['labels'] = dummy_data['labels']*500

df_train_bogus = pd.DataFrame(dummy_data)  


def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  ds = tf.data.Dataset.from_tensor_slices(dict(dataframe)).batch(batch_size)
  return ds

batch_size = 32
train_ds = df_to_dataset(df_train_bogus, batch_size=batch_size)
val_ds = df_to_dataset(df_train_bogus, batch_size=batch_size)

# Model constants (can be lower but that doesn't matter for this example)
sequence_length = 128
max_features = 20000  # vocab size
embedding_dim = 128

#  Start of vectorization
vectorize_layer = tf.keras.layers.TextVectorization(
    standardize = 'lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

def vectorize_text(text, labels):
  print(text)
  print(labels)

  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), labels

vectorize_layer.adapt(df_train_bogus['text'])

train_ds_vectorized = train_ds.map(lambda x: (vectorize_text(x['text'], x['labels'])))
val_ds_vectorized = val_ds.map(lambda x: (vectorize_text(x['text'], x['labels'])))

"""
Output:
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)

"""

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dim, input_length=sequence_length))
model.add(tf.keras.layers.LSTM(embedding_dim, input_shape=(None, sequence_length)))

model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.summary()

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["sparse_categorical_accuracy"])  # model 4

epochs = 10

history = model.fit(train_ds_vectorized, validation_data=val_ds_vectorized, epochs=epochs)
"""
Output:
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)
Tensor("args_1:0", shape=(None,), dtype=string)
Tensor("args_0:0", shape=(None,), dtype=int64)

"""

model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(max_features, embedding_dim, input_length=sequence_length))
model.add(tf.keras.layers.LSTM(embedding_dim, input_shape=(None, sequence_length)))

model.add(tf.keras.layers.Dense(3, activation='softmax'))

model.summary()

model.compile(loss="sparse_categorical_crossentropy",
              optimizer="adam",
              metrics=["accuracy"])  # model 4

epochs = 10

history = model.fit(train_ds_vectorized, validation_data=val_ds_vectorized, epochs=epochs)

Note that your labels need to run from zero up to n-1, because sparse_categorical_crossentropy produces the index of the most likely class, and indices start at 0.
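If your real labels do not already start at 0 (the question's data uses [1, 2, 3]), here is a minimal remapping sketch, assuming integer labels in a pandas column as above:

# Map whatever label values exist onto 0..n-1
labels_sorted = sorted(df_train_bogus['labels'].unique())
label_to_index = {label: i for i, label in enumerate(labels_sorted)}
df_train_bogus['labels'] = df_train_bogus['labels'].map(label_to_index)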

Update: an accuracy of 0.333 is expected, because you have 3 classes with the same number of samples each. You need a larger and more varied dataset to see any meaningful results.
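As a sanity check, 1/3 is exactly the chance-level baseline for three equally frequent classes. A quick way to compute that baseline from a label column (a sketch, assuming a pandas Series of labels):

import pandas as pd

def majority_baseline(labels):
    # Accuracy of always predicting the most frequent class
    return labels.value_counts(normalize=True).max()

print(majority_baseline(pd.Series([0, 1, 2] * 500)))  # 0.3333...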

Comments:

I did try sparse_categorical_crossentropy, but I got strange results. Running your attached code as-is gives an accuracy of 0.333, as if it always predicts the same class. Thanks, I'll go collect my real dataset and see whether this starts to make sense :-)

Answer 2:

Your problem lies in your loss function. Categorical cross-entropy in Keras requires the classes not as indices but as one-hot targets matching the logits/activation outputs. So your training targets should have the following form:

from tensorflow.keras.utils import to_categorical

n_classes = 3
y = [0, 1, 2]  # IMPORTANT: index from 0
cat_y = to_categorical(y, n_classes)

# cat_y is now:
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]], dtype=float32)
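Going the other way at inference time, the model's softmax output is one probability vector per sample, and tf.argmax recovers the integer class index. A short sketch with made-up predictions:

import tensorflow as tf

preds = tf.constant([[0.7, 0.2, 0.1],
                     [0.1, 0.1, 0.8]])    # hypothetical softmax outputs
print(tf.argmax(preds, axis=-1).numpy())  # [0 2]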

To achieve this, you need a few changes to the way you process the data, as shown below:

# Start of data generation

import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.utils import to_categorical

dummy_data = {'text': ['Love', 'Money', 'War'],
              'labels': [1, 2, 0]}

dummy_data['text'] = dummy_data['text'] * 500
dummy_data['labels'] = dummy_data['labels'] * 500

# One-hot encode the integer labels: shape (1500, 3)
dummy_data['labels'] = to_categorical(dummy_data['labels'], 3)

def df_to_dataset(data, shuffle=True, batch_size=32):
    # Builds an unbatched dataset of (text, one-hot label) pairs;
    # shuffle and batch_size are currently unused
    ds = tf.data.Dataset.from_tensor_slices((data['text'], data['labels']))
    return ds

batch_size = 32
train_ds = df_to_dataset(dummy_data, batch_size=batch_size)
val_ds = df_to_dataset(dummy_data, batch_size=batch_size)

# Model constants (can be lower but that doesn't matter for this example)
sequence_length = 128
max_features = 20000  # vocab size
embedding_dim = 128
# End of data generation
#  Start of vectorization
vectorize_layer = TextVectorization(
    standardize = 'lower_and_strip_punctuation',
    max_tokens=max_features,
    output_mode="int",
    output_sequence_length=sequence_length,
)

def vectorize_text(text, labels):
  print(text)
  print(labels)

  text = tf.expand_dims(text, -1)
  # Each mapped element becomes a "batch" of one:
  # text (1, sequence_length), label (1, 3)
  return vectorize_layer(text), tf.expand_dims(labels, 0)

vectorize_layer.adapt(dummy_data['text'])

train_ds_vectorized = train_ds.map(lambda x,y: vectorize_text(x,y))
val_ds_vectorized = val_ds.map(lambda x,y: vectorize_text(x,y))

    
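With one-hot labels flowing through the pipeline, the model itself can stay as in the question; only the loss has to match the target format. A minimal sketch, assuming the same constants and datasets defined above:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_features, embedding_dim, input_length=sequence_length))
model.add(LSTM(embedding_dim))
model.add(Dense(3, activation='softmax'))

model.compile(loss="categorical_crossentropy",  # matches the one-hot targets
              optimizer="adam",
              metrics=["accuracy"])

history = model.fit(train_ds_vectorized, validation_data=val_ds_vectorized, epochs=10)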

