BERT get sentence embedding

【Title】:BERT get sentence embedding 【Posted】:2021-11-29 16:30:54 【Question】:

I am copying the code from this page. I have downloaded the BERT model to my local system and am getting sentence embeddings with it.

I have about 500,000 sentences that need sentence embeddings, and it is taking a very long time.

    Is there a way to speed up the process?
    Would sending a batch of sentences instead of one sentence at a time help?


#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa=["i am a boy","i live in a city"]



storage=[]#list to store all embeddings

for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` has shape [13 x 1 x <sequence length> x 768]

    # `token_vecs` is a tensor with shape [<sequence length> x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all the token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    storage.append((text,sentence_embedding))

###### Update 1

I modified my code based on the answer provided. It still does not do full batching.

#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)


storage=[]#list to store all embeddings
for i,text in enumerate(encoded_inputs['input_ids']):
    
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print (tokens_tensor)
    print (segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers. 
    with torch.no_grad():

        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]


    # `hidden_states` has shape [13 x 1 x <sequence length> x 768]

    # `token_vecs` is a tensor with shape [<sequence length> x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all the token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print (sentence_embedding[:10])
    storage.append((text,sentence_embedding))

I could replace the first 2 lines inside the for loop with the two lines below, but they only work if all sentences have the same length after tokenization:

tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])

And in that case outputs = model(tokens_tensor, segments_tensors) fails.

How can I do full batching in this case?

【Comments】:

【Answer 1】:

One of the easiest ways to speed up your workflow is batched data processing. In the current implementation you feed only one sentence per iteration, but the data can be processed in batches!

Now, if you are willing to implement this part yourself, I strongly recommend preparing your data with the tokenizer like this:

batch_sentences = ["Hello I'm a single sentence",
                    "And another sentence",
                    "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)
{'input_ids': [[101, 8667, 146, 112, 182, 170, 1423, 5650, 102],
               [101, 1262, 1330, 5650, 102],
               [101, 1262, 1103, 1304, 1304, 1314, 1141, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1]]}
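
(Not part of the original answer, but worth noting: if you also pass padding=True, truncation=True and return_tensors='pt', the tokenizer pads the batch to a rectangle and hands back PyTorch tensors directly, which is exactly the shape a batched model call expects.)

encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')
# encoded_inputs['input_ids'] is now a [3 x longest_sequence_length] LongTensor,
# and encoded_inputs['attention_mask'] marks real tokens (1) versus padding (0)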

But there is an even simpler approach: use the FeatureExtractionPipeline with its comprehensive documentation! That looks like this:

from transformers import pipeline

feature_extraction = pipeline('feature-extraction', model="distilroberta-base", tokenizer="distilroberta-base")
features = feature_extraction(["Hello I'm a single sentence",
                               "And another sentence",
                               "And the very very last one"])

Update 1: Indeed, you changed your code a bit, but you are still passing one sample at a time rather than a batch. If we want to stick with your implementation, batching would look like this:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )
model.eval()
sentences = [ 
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
              "Hello I'm a single sentence",
              "And another sentence",
              "And the very very last one",
            ]
batch_size = 4  
for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch,max_length=50, padding='max_length', truncation=True)
  
    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
        
    
    print(outputs.last_hidden_state.size())

Output:

torch.Size([4, 50, 768]) # batch_size * max_length * hidden dim
torch.Size([4, 50, 768])
torch.Size([1, 50, 768]) 

Update 2

There are two concerns about padding the batched data up to a maximum length. First, could the irrelevant padding confuse the transformer model? NO, because during training the model was already presented with variable-length input sentences in batch form, and its designers introduced a dedicated parameter precisely to tell the model WHERE it should pay attention! Second, how do you get rid of this garbage data? By using the attention mask parameter, you can perform the mean operation over the relevant data only!

So the code becomes something like this:

for idx in range(0, len(sentences), batch_size):
    batch = sentences[idx : min(len(sentences), idx+batch_size)]
    
    # encoded = tokenizer(batch)
    encoded = tokenizer.batch_encode_plus(batch,max_length=50, padding='max_length', truncation=True)
  
    encoded = {key: torch.LongTensor(value) for key, value in encoded.items()}
    with torch.no_grad():
        
        outputs = model(**encoded)
    lhs = outputs.last_hidden_state
    attention = encoded['attention_mask'].reshape((lhs.size()[0], lhs.size()[1], -1)).expand(-1, -1, 768)
    embeddings = torch.mul(lhs, attention)
    denominator = torch.count_nonzero(embeddings, dim=1)
    summation = torch.sum(embeddings, dim=1)
    mean_embeddings = torch.div(summation, denominator)
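
(A sketch, not part of the original answer: to collect one vector per sentence the way the storage list in the question does, the batch loop could end like this, assuming storage = [] was initialized beforehand.)

    # at the end of the batch loop body, after mean_embeddings is computed
    for sent, emb in zip(batch, mean_embeddings):
        storage.append((sent, emb))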

【Comments】:

There is an important caveat to consider: when using the pipeline (or batched inputs in general), the output will have the length of the longest input sequence. Especially when averaging, this means you are also averaging over "irrelevant" tokens that ideally should be ignored! I found this is not well documented for the pipeline, so the padding is not obvious at all...

I didn't know any of this, thanks!

@dennlinger Sorry for yet another update, but I dug deeper (the code in the main repository gave me no clear indication of where it would pad), and it seems this behavior changed with version 4.11. In fact, the model now processes each sample individually, which preserves the original length (the behavior expected in your answer).

@meti: your update runs into exactly the problem @dennlinger mentioned.

@cronoik Have you tried the last one?

【Answer 2】:

Regarding your original question: there is not much you can do. BERT is a computationally demanding algorithm. The best you can do is use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference on large amounts of text.
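
A minimal sketch of that swap, assuming the same bert-base-uncased checkpoint as in the question:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# the call signature matches BertTokenizer, so the rest of the code stays unchanged
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors='pt')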

That said, I have to warn you that averaging BERT word embeddings does not produce good sentence embeddings. See this post. Based on your question, I assume you want to do some kind of semantic similarity search. Try one of the open-sourced models instead.
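
For example, a sketch with the sentence-transformers library (the checkpoint name is just one common choice, not something this answer prescribes):

from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('all-MiniLM-L6-v2')  # assumption: any of the open-sourced sentence models
# encode() batches internally, so 500,000 sentences are processed in manageable chunks
sentence_embeddings = st_model.encode(sentences, batch_size=64, show_progress_bar=True)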

【Comments】:

Is there a way to fine-tune a Sentence-BERT model on my own data?

Yes, you can fine-tune it; you just need pairs of related sentences. Have a look at the training utilities in the sentence-transformers library. The most efficient approach is to use multiple negatives ranking loss on NLI data.

I'll give it a try.
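
(A rough sketch of that fine-tuning setup with the sentence-transformers training utilities; the example pair, checkpoint and parameters are illustrative, not from the comment.)

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer('all-MiniLM-L6-v2')  # assumption: start from any pretrained sentence model
# each InputExample holds a pair of related sentences; with multiple negatives ranking loss,
# the other sentences in the batch act as negatives automatically
train_examples = [InputExample(texts=["How do I get sentence embeddings?",
                                      "What is the best way to embed a sentence?"])]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)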
