Adding new tokens to BERT/RoBERTa while retaining tokenization of adjacent tokens
Posted: 2022-01-12 05:23:55

I'm trying to add some new tokens to the BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences containing the new word, and then see what the models predict about the word in other, different contexts, in order to examine the state of the models' knowledge of certain linguistic properties.
To do this, I'd like to add the new tokens and essentially treat them like new ordinary words (that the model just hasn't happened to encounter yet). They should behave exactly like normal words once added, except that their embeddings will be randomly initialized and then learned during fine-tuning.
However, I'm running into some problems doing this. In particular, the tokens surrounding the newly added token do not behave as expected when the BERT tokenizer is initialized with do_basic_tokenize=False (in the case of RoBERTa, changing this setting does not seem to affect the output in the example here). The problem can be observed in the following example: in BERT's case, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of the expected ##.), and in RoBERTa's case, the word following the newly added token is treated as though it did not have a preceding space (i.e., it is tokenized as a instead of Ġa).
from transformers import BertTokenizer, RobertaTokenizer
new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']
bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']
roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']
roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']
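(In case it matters for the answer: the fine-tuning setup I have in mind is the usual one, sketched below for the BERT case. I'm assuming a masked-LM head here; this part works fine, it's only the tokenization of the surrounding tokens that's the issue.)
# context for the setup above (not where the problem is): after add_tokens, the
# model's embedding matrix has to be resized so the new token gets a randomly
# initialized row that is then learned during fine-tuning
from transformers import BertForMaskedLM
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.resize_token_embeddings(len(bert))  # len(bert) includes the added token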
Is there a way for me to add the new tokens while getting the behavior of the surrounding tokens to match the behavior they have when no token is added? This seems important because the model could end up learning that (for example) the new token can occur before ., while most other tokens can only occur before ##., which seems likely to affect how it generalizes. In addition, I could turn basic tokenization back on here to solve the BERT problem, but that would not really reflect the full state of the model's knowledge, since it collapses the distinction between different tokens, and it does not help with the RoBERTa problem at all, which persists regardless.
Also, ideally I would be able to add the RoBERTa token as Ġmynewword, but I'm assuming that as long as it never occurs as the first word in a sentence, this shouldn't matter.
Comments:
I have the same problem :) discuss.huggingface.co/t/…
Not directly an answer, but still relevant: the reason you don't see any difference with do_basic_tokenize=False in the RoBERTa tokenizer is that it doesn't support that option to begin with.
Also, to clarify: is there any particular reason you want to change the default value of do_basic_tokenize?
I don't know; I'm joining a project that was already underway. But now that you've raised the point, it makes sense to ask.
Answer 1:
After continuing to poke at this, I seem to have found something that might work. It isn't necessarily generalizable, but you can load a tokenizer from a vocabulary file (plus a merges file for RoBERTa). If you manually edit those files to add the new tokens in the right way, everything seems to work as expected. Here is an example for BERT:
from transformers import BertTokenizer
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert.tokenize('testing.') # ['testing', '##.']
bert.tokenize('mynewword') # ['my', '##ne', '##w', '##word']
bert_vocab = bert.get_vocab() # get the pretrained tokenizer's vocabulary
bert_vocab.update({'mynewword': len(bert_vocab)}) # add the new word at the end
with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    tmp_vocab_file.write('\n'.join(bert_vocab)) # one token per line, in insertion order
new_bert = BertTokenizer(name_or_path = 'bert-base-uncased', vocab_file = 'vocab.tmp', do_basic_tokenize=False)
new_bert.model_max_length = 512 # to match this setting on the pretrained tokenizer
new_bert.tokenize('mynewword') # ['mynewword']
new_bert.tokenize('mynewword.') # ['mynewword', '##.']
import os
os.remove('vocab.tmp') # cleanup
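As an extra sanity check (my addition, assuming the variables from the snippet above are still in scope), you can confirm that the new token was appended at the end of the vocabulary and that the pretrained ids were left untouched:
assert new_bert.convert_tokens_to_ids('mynewword') == len(bert_vocab) - 1  # appended as the last id
assert new_bert.convert_tokens_to_ids('testing') == bert.convert_tokens_to_ids('testing')  # old ids unchanged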
RoBERTa is much harder, since we also have to add the corresponding pairs to merges.txt. I have a way of doing it that works for the new tokens, but unfortunately it can affect the tokenization of words that are substrings of the new tokens, so it isn't perfect. If you're using this to add made-up words (as in my use case), you can just choose strings that are unlikely to cause problems (unlike the 'mynewword' example here), but in other cases it may well cause problems. (While it's not a perfect solution, hopefully it will help someone else find a better one.)
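To make the merges edits a bit more concrete before the code: each line of merges.txt is a space-separated pair of symbols that BPE is allowed to fuse, applied in priority order from the top of the file, so inserting a chain of pairs near the top forces the new token to be assembled into a single unit. Below is a hand-worked illustration of mine for a hypothetical 4-character token 'abcd'; it conveys the idea only, and is not the exact output of the helper further down, which builds a slightly different decomposition.
# hand-worked example for a hypothetical token 'abcd' (and its 'Ġabcd' variant),
# assuming these lines are inserted right after the header of merges.txt so they
# outrank all existing merges; 'abcd' and 'Ġabcd' also need vocab.json entries,
# which is what the json.dump step above handles for the real token
illustrative_merges = [
    'a b',      # a + b    -> ab
    'ab c',     # ab + c   -> abc
    'abc d',    # abc + d  -> abcd
    'Ġ abcd',   # Ġ + abcd -> Ġabcd (for occurrences preceded by a space)
]
# caveat (the same one noted above): these high-priority merges also fire inside
# other words, e.g. 'abcde' would now start by merging 'a b', which can change
# how substrings of the new token are tokenized elsewhere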
import re
import json
import requests
from transformers import RobertaTokenizer
roberta = RobertaTokenizer.from_pretrained('roberta-base')
roberta.tokenize('testing a') # ['testing', 'Ġa']
roberta.tokenize('mynewword') # ['my', 'new', 'word']
# update the vocabulary with the new token and the 'Ġ' version
roberta_vocab = roberta.get_vocab()
roberta_vocab.update({'mynewword': len(roberta_vocab)})
roberta_vocab.update({chr(288) + 'mynewword': len(roberta_vocab)}) # chr(288) = 'Ġ'
with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    json.dump(roberta_vocab, tmp_vocab_file, ensure_ascii=False)
# get and modify the merges file so that the new token will always be tokenized as a single word
url = 'https://huggingface.co/roberta-base/resolve/main/merges.txt'
roberta_merges = requests.get(url).content.decode().split('\n')
# this is a helper function to loop through a list of new tokens and get the byte-pair encodings
# such that the new token will be treated as a single unit always
def get_roberta_merges_for_new_tokens(new_tokens):
    merges = [gen_roberta_pairs(new_token) for new_token in new_tokens]
    merges = [pair for token in merges for pair in token]
    return merges

def gen_roberta_pairs(new_token, highest = True):
    # highest is used to determine whether we are dealing with the Ġ version or not.
    # we add those pairs at the end, which is only if highest = True

    # this is the hard part...
    chrs = [c for c in new_token] # list of characters in the new token, which we will recursively iterate through to find the BPEs

    # the simplest case: add one pair
    if len(chrs) == 2:
        if not highest:
            return tuple([chrs[0], chrs[1]])
        else:
            return [' '.join([chrs[0], chrs[1]])]

    # add the tokenization of the first letter plus the other two letters as an already merged pair
    if len(chrs) == 3:
        if not highest:
            return tuple([chrs[0], ''.join(chrs[1:])])
        else:
            return gen_roberta_pairs(chrs[1:]) + [' '.join([chrs[0], ''.join(chrs[1:])])]

    if len(chrs) % 2 == 0:
        pairs = gen_roberta_pairs(''.join(chrs[:-2]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += tuple([''.join(chrs[:-2]), ''.join(chrs[-2:])])
        if not highest:
            return pairs
    else:
        # for new tokens with odd numbers of characters, we need to add the final two tokens before the
        # third-to-last token
        pairs = gen_roberta_pairs(''.join(chrs[:-3]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-3:]), highest = False)
        pairs += tuple([''.join(chrs[:-3]), ''.join(chrs[-3:])])
        if not highest:
            return pairs

    pairs = tuple(zip(pairs[::2], pairs[1::2]))
    pairs = [' '.join(pair) for pair in pairs]

    # pairs with the preceding special token
    g_pairs = []
    for pair in pairs:
        if re.search(r'^' + ''.join(pair.split(' ')), new_token):
            g_pairs.append(chr(288) + pair)

    pairs = g_pairs + pairs
    pairs = [chr(288) + ' ' + new_token[0]] + pairs

    pairs = list(dict.fromkeys(pairs)) # remove any duplicates

    return pairs
# first line of this file is a comment; add the new pairs after it
roberta_merges = roberta_merges[:1] + get_roberta_merges_for_new_tokens(['mynewword']) + roberta_merges[1:]
roberta_merges = list(dict.fromkeys(roberta_merges))
with open('merges.tmp', 'w', encoding = 'utf-8') as tmp_merges_file:
    tmp_merges_file.write('\n'.join(roberta_merges))
new_roberta = RobertaTokenizer(name_or_path='roberta-base', vocab_file='vocab.tmp', merges_file='merges.tmp')
# for some reason, we have to re-add the <mask> token to roberta if we are using it, since
# loading the tokenizer from a file will cause it to be tokenized as separate parts
# the weight matrix is identical, and once re-added, a fill-mask pipeline still identifies
# the mask token correctly (not shown here)
new_roberta.add_tokens(new_roberta.mask_token, special_tokens=True)
new_roberta.model_max_length = 512
new_roberta.tokenize('mynewword') # ['mynewword']
new_roberta.tokenize('mynewword a') # ['mynewword', 'Ġa']
new_roberta.tokenize(' mynewword') # ['Ġmynewword']
# however, this does not guarantee that tokenization of other words will not be affected
roberta.tokenize('mynew') # ['my', 'new']
new_roberta.tokenize('mynew') # ['myne', 'w']
import os
os.remove('vocab.tmp')
os.remove('merges.tmp') # cleanup
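One last note from me (not strictly part of the recipe above, and I haven't checked every edge case, e.g. whether the re-added <mask> token survives the round trip): rather than rebuilding the tokenizer from the temporary files each time, you can persist the modified tokenizer with the standard save_pretrained / from_pretrained calls, since the vocabulary and merges are held in memory at this point. The directory name here is just a placeholder.
new_roberta.save_pretrained('roberta-base-with-mynewword')  # writes vocab.json, merges.txt, added_tokens, config
reloaded = RobertaTokenizer.from_pretrained('roberta-base-with-mynewword')
reloaded.tokenize('A mynewword a')  # should match new_roberta above, i.e. ['A', 'Ġmynewword', 'Ġa']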