BERT - Extracting CLS embedding from multiple outputs vs single

Posted 2023-03-29

【Posted】: 2021-04-14 01:58:09 【Question】:

I am using the transformers TFBertModel to classify a bunch of input strings, but I would like to access the CLS embedding so that I can rebalance my data.

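When I pass a single element of my data to the predict method of the simplified BERT model (to get the CLS data), I take the first array of last_hidden_state, and voila. However, when I pass in more than one row of data, the shape of the output changes as expected, but the actual CLS embedding (of the first row, the one I passed in the first time) appears to change as well.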

My dataset contains the input ids and the masks, and the model is:

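import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model
from transformers import TFBertModel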

model = TFBertModel.from_pretrained('bert-base-multilingual-cased', trainable=False, num_labels=len(le.classes_))

input_ids_layer = Input(shape=(256,), dtype=np.int32)
input_mask_layer = Input(shape=(256,), dtype=np.int32)

bert_layer = model([input_ids_layer, input_mask_layer])

model = Model(inputs=[input_ids_layer, input_mask_layer], outputs=bert_layer)

Then, to get the CLS embedding, I simply call the predict method and dig into the result. So for the first row of data (data_x[0] holds the input ids, data_x[1] holds the masks):

output1 = model.predict([data_x[0][0], data_x[1][0]])

TFBaseModelOutputWithPooling([('last_hidden_state',
                               array([[[ 0.35013607, -0.5340336 ,  0.28577858, ..., -0.03405955,
                                        -0.0165604 , -0.36481357]],
                               
                                      [[ 0.34572566, -0.5361709 ,  0.281771  , ..., -0.03687727,
                                        -0.01690093, -0.35451806]],
                               
                                      [[ 0.34878412, -0.5399749 ,  0.28948805, ..., -0.03613809,
                                        -0.01503076, -0.35425758]],
                               
                                      ...,

My understanding is that the CLS representation of the sentence is the first array of last_hidden_state, i.e.:

lhs1 = output1[0]

lhs1.shape
>> (256, 1, 768)

cls1 = lhs1[0][0]

cls1
>> [0.35013607 ... -0.36481357]  (as above)

So far so good. My confusion starts when I want to get the first 2 CLS embeddings from my dataset:

output_both = model.predict([data_x[0][:2], data_x[1][:2]])
lhs_both = output_both[0] # last hidden states

lhs_both.shape
>> (2, 256, 768)

cls_both = lhs_both[0][0] # I thought this would give me two CLS arrays including the first one above

Inspecting cls_both:

array([[[ 0.11075249, -0.02257648, -0.40831113, ...,  0.18384863,
          0.17032738, -0.05989586],
        [-0.22926208, -0.5627498 ,  0.2617012 , ...,  0.20701236,
          0.3141808 , -0.8650396 ],
        [-0.22352833, -0.49676323, -0.5286081 , ...,  0.23819353,
          0.3742358 , -0.69018203],
        ...,
        [ 0.5120927 , -0.09863365,  0.7378716 , ..., -0.19551781,
          0.45915398,  0.22804889],
        [-0.13397002,  0.1617202 ,  0.15663634, ..., -0.511597  ,
          0.3959382 ,  0.30565232],
        [-0.14100523,  0.22792323, -0.15898004, ..., -0.2690729 ,
          0.4730471 ,  0.18431285]],

       [[-0.20033133, -0.08412935, -0.0411438 , ...,  0.34706163,
          0.1919156 , -0.08740871],
        [-0.12536147, -0.44519228,  1.2984221 , ...,  0.07149828,
          0.7915938 ,  0.08048639],
        [ 0.4596323 , -0.3316555 ,  1.2545322 , ..., -0.02128018,
          0.5344383 ,  0.32054782],
        ...,
        [-0.54777217,  0.23129587,  0.5007771 , ...,  0.70299244,
          0.27277255, -0.2848366 ],
        [-0.49410668,  0.37352908,  0.8732239 , ...,  0.6065303 ,
          0.152081  , -0.9312557 ],
        [-0.33172935, -0.35368383,  0.5942321 , ...,  0.7171531 ,
          0.24436645,  0.08909844]]], dtype=float32)

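I am not sure how to interpret this - my expectation was to see the first row's CLS, cls1, included in cls_both, but as you can see the first row in the first sub-array is different. Can anyone explain this?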

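Furthermore, if I run only the second row, I get exactly the same CLS token as for the first row, even though the two rows contain completely different input_ids/masks: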

output2 = model.predict([data_x[0][1], data_x[1][1]])
lhs2 = output2[0]
cls2 = lhs2[0][0]


cls2
>>
[ 0.35013607, -0.5340336 ,  0.28577858, ..., -0.03405955,
         -0.0165604 , -0.36481357]]

cls1 == cls2
>> True

Edit

BERT sentence embeddings: how to obtain sentence embeddings vector

The post above explains that output[0][:,0,:] is the correct way to get exactly the CLS tokens, which makes things easier.
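To illustrate that indexing, here is a minimal NumPy sketch on a made-up array with the same layout as last_hidden_state, i.e. (batch_size, sequence_length, hidden_size). The array `hidden` and its small sizes are assumptions for illustration only; this is not real model output.

```python
import numpy as np

# Hypothetical stand-in for last_hidden_state: 2 sentences, 4 tokens, 3 hidden units.
batch_size, seq_len, hidden_size = 2, 4, 3
hidden = np.arange(batch_size * seq_len * hidden_size, dtype=np.float32)
hidden = hidden.reshape(batch_size, seq_len, hidden_size)

# [:, 0, :] keeps every sentence and picks token 0 (the [CLS] position) of each.
cls_all = hidden[:, 0, :]
print(cls_all.shape)  # (2, 3)

# [0][0] picks sentence 0 first and then token 0, so only one vector comes back.
cls_first_only = hidden[0][0]
print(cls_first_only.shape)  # (3,)
```

With this layout, cls_all[i] is the CLS vector of sentence i, whereas [0][0] only ever reaches the first sentence.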

When I run three rows I get consistent results, but whenever I run a single row I get the result shown in cls1 - why is it not different each time?

【Answer 1】:

I think the problem lies in the shape of the data_x slices you are passing in.

Since you did not specify the shape of data_x, I first tried to replicate it below:

import numpy as np
from transformers import BertTokenizer

text = ['a sample text','another text', 'the third text']

bert_tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer_output = bert_tokenizer(text, return_tensors='np', max_length=256, padding='max_length')

data_x = np.array([
    tokenizer_output['input_ids'], 
    tokenizer_output['attention_mask']
])

print(data_x.shape)

The shape of data_x is (2, 3, 256).

data_x[0][0] is not the correct way to slice

For your first row of data, you prepared input_ids and attention_mask by slicing with data_x[0][0] and data_x[1][0], so the shape of both input_ids and attention_mask becomes (256,):

print(data_x[0][0].shape) # (256,)
print(data_x[1][0].shape) # (256,)

The TF model, however, expects an input of shape (batch_size, 256) for both input_ids_layer and input_mask_layer. Note that the shape argument given to Input does not include batch_size; quoting its documentation: "shape: A shape tuple (integers), not including the batch size."

In fact, when I tried passing the input [data_x[0][0], data_x[1][0]] (both with shape (256,)), I received the following warning from TensorFlow:

WARNING:tensorflow:Model was constructed with shape (None, 256) for input KerasTensor(type_spec=TensorSpec(shape=(None, 256), dtype=tf.int32, name='input_3'), name='input_3', description="created by layer 'input_3'"), but it was called on an input with incompatible shape (32, 1).
...
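The (32, 1) in that warning and the (256, 1, 768) output shape in the question appear consistent with Keras reading a 1-D array of 256 ids as 256 separate samples of length 1 and running them in batches of 32, the default batch_size of predict. If that reading is right, the first "sample" is just the first token id of the row, which is the [CLS] token whichever row is passed, and that would explain why cls1 and cls2 came out identical. The sketch below only imitates that reshaping in NumPy. No model is called, the token ids in `row_a` and `row_b` are made up, and 101 is used as the [CLS] id as an assumption for illustration.

```python
import numpy as np

CLS_ID = 101  # assumed [CLS] id, for illustration only
SEQ_LEN = 256

# Two made-up, zero-padded rows of input ids: they differ after the first token.
row_a = np.zeros(SEQ_LEN, dtype=np.int32)
row_a[:4] = [CLS_ID, 2001, 2002, 102]
row_b = np.zeros(SEQ_LEN, dtype=np.int32)
row_b[:5] = [CLS_ID, 3001, 3002, 3003, 102]

def as_samples(ids_1d):
    """Imitate a 1-D input being read as one sample per element: (256,) -> (256, 1)."""
    return ids_1d.reshape(-1, 1)

samples_a = as_samples(row_a)
samples_b = as_samples(row_b)
print(samples_a.shape)  # (256, 1)

# The first "sentence" in both cases is the single token [CLS].
print(samples_a[0], samples_b[0])  # [101] [101]

# Batches of 32 such samples have shape (32, 1), matching the warning.
first_batch = samples_a[:32]
print(first_batch.shape)  # (32, 1)
```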

The correct way to slice the input data

You should slice the tensors without changing their number of dimensions, so that your input_ids and attention_mask keep the form (batch_size, 256):

# For first sentence only
input_1 = [data_x[0][0:1], data_x[1][0:1]] # Shape : (1,256) (1,256)

# For second sentence only
input_2 = [data_x[0][1:2], data_x[1][1:2]] # Shape : (1,256) (1,256)

# For first and second sentence
input_12 = [data_x[0][:2], data_x[1][:2]]  # Shape : (2,256) (2,256)
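The difference between the two slicing styles can be checked without a model. This is a minimal NumPy sketch on a zero-filled array with the same (2, 3, 256) shape as data_x above; the names `dummy` and `add_batch_axis` are placeholders of mine and not part of the original code.

```python
import numpy as np

# Placeholder with the same layout as data_x: (ids/mask, sentences, tokens).
dummy = np.zeros((2, 3, 256), dtype=np.int32)

# An integer index drops the sentence axis.
print(dummy[0][0].shape)    # (256,)

# A length-1 slice keeps it, so the batch dimension survives.
print(dummy[0][0:1].shape)  # (1, 256)

def add_batch_axis(row):
    """Turn an already-extracted (256,) row back into a batch of one."""
    return row[np.newaxis, :]

restored = add_batch_axis(dummy[0][0])
print(restored.shape)       # (1, 256)
```

If a row has already been pulled out with an integer index, re-adding the axis as in add_batch_axis should give the same shape as the length-1 slice.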

After passing the data above to your model, you can get the CLS embeddings with output[0][:,0,:], as described in the link you shared.

Whether you pass in input_1, input_2 or input_12, you can confirm that the embeddings of the first and second sentence stay the same:

output1 = model.predict(input_1)
output2 = model.predict(input_2)
output12 = model.predict(input_12)

ihs1 = output1[0] # Shape : (1, 256, 768)
ihs2 = output2[0] # Shape : (1, 256, 768)
ihs12 = output12[0] # Shape : (2, 256, 768)

cls1 = ihs1[:,0,:] # Shape : (1, 768)
cls2 = ihs2[:,0,:] # Shape : (1, 768)
cls12 = ihs12[:,0,:] # Shape : (2, 768)

# Check that cls1 is exactly the same as cls12[0]
print((cls1 == cls12[0]).all()) # True

# Likewise, cls2 is exactly the same as cls12[1]
print((cls2 == cls12[1]).all()) # True
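One caveat on the comparison: exact == on float32 outputs can be brittle, since results for the same sentence may differ in the last few bits depending on batch size or hardware. A tolerance-based check is a common alternative. The sketch below uses made-up vectors rather than model output; `a`, `b` and the tolerance are assumptions.

```python
import numpy as np

a = np.array([0.35013607, -0.5340336, 0.28577858], dtype=np.float32)
# Same values with a tiny perturbation, standing in for numerical noise.
b = a + np.float32(1e-7)

# == compares element by element and returns an array, not a single bool.
elementwise = (a == b)
print(elementwise.shape)  # (3,)

# allclose reduces to one bool and tolerates small differences.
print(np.allclose(a, b, atol=1e-5))  # True
```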

Hope this clears things up for you. When in doubt, always check the input and output shapes of your model.
