Why am I getting a tensors with different shapes error?

Posted: 2021-12-24 15:02:53

Question:

I'm trying to build an LSTM model for text generation, but I get an error when I try to fit the model.

Traceback:

> InvalidArgumentError: 2 root error(s) found. (0) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [21] and element 1 had shape [17]. [[node IteratorGetNext (defined at tmp/ipykernel_7804/4234150290.py:1) ]] (1) Invalid argument: Cannot batch tensors with different shapes in component 0. First element had shape [21] and element 1 had shape [17]. [[node IteratorGetNext (defined at tmp/ipykernel_7804/4234150290.py:1) ]] [[IteratorGetNext/_4]] 0 successful operations. 0 derived errors ignored. [Op:__inference_train_function_35783]

Code:

    batch_size = 64
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    buffer_size= train_ds.cardinality().numpy()
    
    train_ds = train_ds.shuffle(buffer_size=buffer_size)\
                       .batch(batch_size=batch_size,drop_remainder=True)\
                       .cache()\
                       .prefetch(AUTOTUNE)
    
    test_ds = test_ds.shuffle(buffer_size=buffer_size)\
                       .batch(batch_size=batch_size,drop_remainder=True)\
                       .cache()\
                       .prefetch(AUTOTUNE)

    def create_model():
        n_units = 256
        max_len = 64
        vocab_size = 10000
        
        inputs_tokens = Input(shape=(max_len,), dtype=tf.int32)
        # inputs_tokens = Input(shape = (None,), dtype=tf.int32)
        
        embedding_layer = Embedding(vocab_size, 256)
        x = embedding_layer(inputs_tokens)
        x = LSTM(n_units)(x)
        x = Dropout(0.2)(x)
        outputs = Dense(vocab_size, activation = 'softmax')(x)
        model = Model(inputs=inputs_tokens, outputs=outputs)
        
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
        metric_fn  = tf.keras.metrics.SparseCategoricalAccuracy()
        model.compile(optimizer="adam", loss=loss_fn, metrics=metric_fn)  
        
        return model

When I inspect the element spec with train_ds.element_spec, I get:

    (TensorSpec(shape=<unknown>, dtype=tf.int64, name=None),
     TensorSpec(shape=(64,), dtype=tf.int64, name=None))

Any idea what I'm doing wrong here? Should I use padded_batch? Should I reshape my dataset?

Edit:

How I created train_ds:

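I have an array of ~100k lyric lines stored as strings in a list, like this:

`['mic check, i can get smooth to any groove', 'relax the tongue, let my mic take a cruise', "around the planet, pack 'em in like janet",]`

I used train_test_split to create test and training sets for the features and labels, where the label is the second-to-last word in each line.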

    train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
                tf.cast(train_data.values, tf.string)
    ) 
    
    train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
                tf.cast(train_targets.values, tf.int64),
    
    ) 

Then I created this function:

    vectorize_layer = tf.keras.layers.TextVectorization(
        max_tokens=max_features,
        # standardize=lyrics_corpus,
        split="whitespace",
        ngrams=2,
        output_mode="int",
        # output_sequence_length=max_len,
        # vocabulary=words,
    )

    def convert_text_input(sample):
        text = sample
        text = tf.expand_dims(text, -1)  
        return tf.squeeze(vectorize_layer(text))

Applying the function:

    train_text_ds = train_text_ds_raw.map(convert_text_input, 
                                      num_parallel_calls=tf.data.experimental.AUTOTUNE)

Putting the labels and text back together:

    train_ds = tf.data.Dataset.zip(
        (
                train_text_ds,
                train_cat_ds_raw
         )
    )

Example table:

| | predictor | label | label_id |
|---|---|---|---|
| 0 | mic check, i can get smooth to any groov... | groove | 8167 |
| 1 | relax the tongue, let my mic take a cruis... | cruise | 4692 |
| 2 | around the planet, pack 'em in like jane... | janet | 9683 |
| 3 | jackson, she's asking if i can slam it,... | i— | 9191 |
| 4 | yo, yo, redman, man, what the fuck, man?... | man? | 11174 |

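Comments:

Can you show how you created the dataset train_ds?

Added as an edit.

Thanks. What exactly is train_targets.values? Integers?

The integer encoding of the labels (the label is the second-to-last word of each line).

I added an example table; the markdown renders correctly in the edit but looks wrong here. I have the lyrics, label, and label_id in a dataframe.

Answer 1: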

You probably forgot to fit the state of the vectorize_layer to your dataset with vectorize_layer.adapt. You may also need to pad the sequences. Maybe try something like this:

    import tensorflow as tf

    train_text = [' mic check, i can get smooth to any groove ', " around the planet, pack 'em in like janet ", ' relax the tongue, let my mic take a cruise ', " around the planet, pack 'em in like janet ",]
    train_text_ds_raw = tf.data.Dataset.from_tensor_slices(
                tf.cast(train_text, tf.string)
    )

    train_cat_ds_raw = tf.data.Dataset.from_tensor_slices(
                    tf.cast([200, 300, 400, 500], tf.int64))

    vectorize_layer = tf.keras.layers.TextVectorization(
      max_tokens=50,
      split="whitespace",
      ngrams=2,
      output_mode="int",
    )

    vectorize_layer.adapt(train_text)

    max_length = 20
    def convert_text_input(sample):
      text = sample
      text = tf.expand_dims(text, -1)
      vectorized_text = tf.squeeze(vectorize_layer(text))

      if tf.shape(vectorized_text)[0] < max_length:
        difference = max_length-tf.shape(vectorized_text)[0]
        return tf.pad(vectorized_text, [[0, difference]], "CONSTANT")

      return vectorized_text

    train_text_ds = train_text_ds_raw.map(convert_text_input,
                                          num_parallel_calls=tf.data.experimental.AUTOTUNE)

    train_ds = tf.data.Dataset.zip(
            ( train_text_ds,  train_cat_ds_raw)
        ).batch(2)

    for x, y in train_ds:
      print(x, y)

Output:

    tf.Tensor(
    [[ 8 42 36 44 39 26 21 46 37 32 41 35 43 38 25 20 45  0  0  0]
     [17  2  5  7 15 13 10 11 16  3  4  6 14 12  9  0  0  0  0  0]], shape=(2, 20), dtype=int64) tf.Tensor([200 300], shape=(2,), dtype=int64)
    tf.Tensor(
    [[28  2 19 34 30  8 24 48 40 27 22 18 33 29 31 23 47  0  0  0]
     [17  2  5  7 15 13 10 11 16  3  4  6 14 12  9  0  0  0  0  0]], shape=(2, 20), dtype=int64) tf.Tensor([400 500], shape=(2,), dtype=int64)

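Note that you cannot use the pad_to_max_tokens argument of the TextVectorization layer, because it only applies to the "multi_hot", "count", and "tf_idf" modes and you are using output_mode="int". You therefore have to apply the padding yourself.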
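As a possible alternative to a manual tf.pad step, the output_sequence_length argument (commented out in the question's code) is valid in "int" mode and pads or truncates every sequence to a fixed length. The sketch below is only an illustration under assumed values: the three lyric lines, max_tokens=50 and a length of 20 are placeholders, not the asker's real settings.

```python
import tensorflow as tf

# Placeholder corpus; the real data would be the ~100k lyric lines.
train_text = [
    " mic check, i can get smooth to any groove ",
    " relax the tongue, let my mic take a cruise ",
    " around the planet, pack 'em in like janet ",
]

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=50,              # assumed vocabulary cap for this toy corpus
    split="whitespace",
    ngrams=2,
    output_mode="int",
    output_sequence_length=20,  # pad with 0 or truncate to exactly 20 ids
)

# The vocabulary must be built before the layer is used.
vectorize_layer.adapt(train_text)

vectorized = vectorize_layer(tf.constant(train_text))
print(vectorized.shape)
```

Every row then has the same length, so a plain .batch() call no longer sees elements of different shapes. Unlike padding alone, this also truncates lines longer than the chosen length.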

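If you want to use padding, you at least have to make sure that each batch contains sequences of the same length and that your input shape is flexible => (None, )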
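To make the padded_batch route from the question concrete, here is a minimal sketch. It assumes already-vectorized sequences of different lengths; the token ids, labels, vocabulary size of 50 and layer sizes are made up for illustration and are not taken from the asker's data. padded_batch pads each batch to the length of its longest element, and an Input with shape (None,) lets the model accept whatever length each batch ends up with.

```python
import tensorflow as tf

# Hypothetical variable-length token id sequences and integer labels.
sequences = [[8, 42, 36], [17, 2, 5, 7, 15], [28, 2], [17, 2, 5, 7]]
labels = [200, 300, 400, 500]

def gen():
    for seq, label in zip(sequences, labels):
        yield seq, label

ds = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(None,), dtype=tf.int64),
        tf.TensorSpec(shape=(), dtype=tf.int64),
    ),
)

# Each batch is padded with zeros up to its own longest sequence.
padded_ds = ds.padded_batch(2)
batch_shapes = [tuple(x.shape.as_list()) for x, _ in padded_ds]
print(batch_shapes)

# A flexible time dimension (None,) accepts batches of any padded length.
inputs = tf.keras.Input(shape=(None,), dtype="int64")
x = tf.keras.layers.Embedding(50, 8, mask_zero=True)(inputs)  # id 0 is treated as padding
x = tf.keras.layers.LSTM(4)(x)
outputs = tf.keras.layers.Dense(50, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

output_shapes = [tuple(model(x).shape.as_list()) for x, _ in padded_ds]
print(output_shapes)
```

The two batches come out with different widths (5 and 4 here), which is why a fixed Input(shape=(max_len,)) would not fit this approach.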
