Should we lowercase input data when (pre)training a BERT uncased model with Hugging Face?

Posted 2023-03-12

Asked: 2020-10-09 11:40:29

Question:

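Should we lowercase the input data when (pre)training a BERT uncased model with Hugging Face? I looked at Thomas Wolf's reply (https://github.com/huggingface/transformers/issues/92#issuecomment-444677920), but I'm not entirely sure that is what he meant.

What happens if we lowercase the text?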

Comments:

The tokenizer should do this for you.

Answer 1:

The tokenizer takes care of this.

A simple example:

import torch
from transformers import BertTokenizer

# The uncased tokenizer lowercases text itself, so no manual lowercasing is needed.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# padding='max_length' replaces the deprecated pad_to_max_length=True.
input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

Output:

tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])
tensor([[ 101, 2023, 2003, 1037, 4937,  102,    0,    0,    0,    0]])

With the cased model, however, the two inputs produce different IDs:

# The cased tokenizer preserves capitalization, so 'this'/'This' and 'cat'/'Cat' map to different IDs.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

input_ids = torch.tensor(tokenizer.encode('this is a cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

input_ids = torch.tensor(tokenizer.encode('This is a Cat', add_special_tokens=True, max_length=10, padding='max_length', truncation=True)).unsqueeze(0)
print(input_ids)

Output:

tensor([[ 101, 1142, 1110,  170, 5855,  102,    0,    0,    0,    0]])
tensor([[ 101, 1188, 1110,  170, 8572,  102,    0,    0,    0,    0]])

Comments:

Thanks, that's what I thought! In general, the tokenizer's case handling is set by the do_lower_case property in the model's tokenizer_config.json.

Answer 2:

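I think bert-base-uncased lowercases the text (in its tokenizer) regardless of what you pass to it. You can also try a toy dataset and print the tokens produced by the BERT tokenizer to confirm.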
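To see why lowercasing the text yourself should make no difference for an uncased model, here is a minimal standard-library sketch of the case handling an uncased BERT tokenizer applies before WordPiece: lowercase, then strip accents. The function name uncased_normalize is hypothetical and this is not the library's code; the real tokenizer also cleans whitespace, splits punctuation, and runs WordPiece.

```python
import unicodedata


def uncased_normalize(text: str, do_lower_case: bool = True) -> str:
    """Approximate the case handling of an uncased BERT tokenizer (sketch only)."""
    if not do_lower_case:
        # Cased behavior: the text is left untouched.
        return text
    text = text.lower()
    # Decompose accented characters, then drop the combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Hypothetical toy inputs: mixed case, already lowercased, and accented text.
samples = ["This is a Cat", "this is a cat", "Café BERT"]
for sample in samples:
    print(repr(sample), "->", repr(uncased_normalize(sample)))
```

Because the normalization is idempotent, text that is already lowercase passes through unchanged, so pre-lowercasing the input is redundant rather than harmful under these assumptions.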
