Overfitting when fine-tuning BERT for sentiment analysis
Posted: 2021-09-06 22:45:54

I'm new to machine learning in general. I'm currently following this tutorial on sentiment analysis with BERT and Transformers: https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/
However, when I train the model, it appears to overfit.

I don't know how to fix this. I have tried reducing the number of epochs, increasing the batch size, shuffling my data (which is ordered), and increasing the validation split. So far nothing has worked. I have even tried different learning rates, and the one I'm using now is the smallest.

Below is my code:
# imports implied by the code below
import torch
import torch.nn as nn
import numpy as np
from collections import defaultdict
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup

# (RANDOM_SEED, device and the dataframe df are defined earlier in the notebook and not shown in the post)

PRE_TRAINED_MODEL_NAME = 'TurkuNLP/bert-base-finnish-cased-v1'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)
MAX_LEN = 40

#Make a PyTorch dataset
class FIDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, item):
        text = str(self.texts[item])
        target = self.targets[item]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            pad_to_max_length=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'text': text,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long)
        }
#split test and train
df_train, df_test = train_test_split(
    df,
    test_size=0.1,
    random_state=RANDOM_SEED
)
df_val, df_test = train_test_split(
    df_test,
    test_size=0.5,
    random_state=RANDOM_SEED
)
#data loader function
def create_data_loader(df, tokenizer, max_len, batch_size):
    ds = FIDataset(
        texts=df.content.to_numpy(),
        targets=df.sentiment.to_numpy(),
        tokenizer=tokenizer,
        max_len=max_len
    )
    return DataLoader(
        ds,
        batch_size=batch_size,
        num_workers=4
    )
BATCH_SIZE = 32
#Load data into train, test, val
train_data_loader = create_data_loader(df_train, tokenizer, MAX_LEN, BATCH_SIZE)
val_data_loader = create_data_loader(df_val, tokenizer, MAX_LEN, BATCH_SIZE)
test_data_loader = create_data_loader(df_test, tokenizer, MAX_LEN, BATCH_SIZE)
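For reference, shuffling can also be requested at the DataLoader level for the training split only; a minimal sketch, assuming df_train has not already been shuffled upstream (the validation and test loaders are typically left unshuffled):

train_ds = FIDataset(
    texts=df_train.content.to_numpy(),
    targets=df_train.sentiment.to_numpy(),
    tokenizer=tokenizer,
    max_len=MAX_LEN
)
# shuffle=True reorders the training batches every epoch
train_data_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)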
#Bert model loading
bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

# Sentiment Classifier based on Bert model just loaded
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.1)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        returned = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        pooled_output = returned["pooler_output"]
        output = self.drop(pooled_output)
        return self.out(output)
#Create a Classifier instance and move to GPU
model = SentimentClassifier(3)
model = model.to(device)

#Optimize with AdamW
EPOCHS = 6
optimizer = AdamW(model.parameters(), lr=2e-5, correct_bias=False)
total_steps = len(train_data_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
loss_fn = nn.CrossEntropyLoss().to(device)
#Train each Epoch function
def train_epoch(
    model,
    data_loader,
    loss_fn,
    optimizer,
    device,
    scheduler,
    n_examples
):
    model = model.train()
    losses = []
    correct_predictions = 0
    for d in data_loader:
        input_ids = d["input_ids"].to(device)
        attention_mask = d["attention_mask"].to(device)
        targets = d["targets"].to(device)
        outputs = model(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        _, preds = torch.max(outputs, dim=1)
        loss = loss_fn(outputs, targets)
        correct_predictions += torch.sum(preds == targets)
        losses.append(loss.item())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
    return correct_predictions.double() / n_examples, np.mean(losses)
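The loop below also calls eval_model, which is not included in the post. A minimal sketch, assuming it mirrors train_epoch from the tutorial but runs in eval mode under torch.no_grad() and performs no optimizer/scheduler steps:

def eval_model(model, data_loader, loss_fn, device, n_examples):
    model = model.eval()
    losses = []
    correct_predictions = 0
    with torch.no_grad():
        for d in data_loader:
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())
    return correct_predictions.double() / n_examples, np.mean(losses)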
history = defaultdict(list)
best_accuracy = 0

if __name__ == '__main__':
    for epoch in range(EPOCHS):
        print(f'Epoch {epoch + 1}/{EPOCHS}')
        print('-' * 10)
        train_acc, train_loss = train_epoch(
            model,
            train_data_loader,
            loss_fn,
            optimizer,
            device,
            scheduler,
            len(df_train)
        )
        print(f'Train loss {train_loss} accuracy {train_acc}')
        val_acc, val_loss = eval_model(
            model,
            val_data_loader,
            loss_fn,
            device,
            len(df_val)
        )
        print(f'Val loss {val_loss} accuracy {val_acc}')
        print()
        history['train_acc'].append(train_acc)
        history['train_loss'].append(train_loss)
        history['val_acc'].append(val_acc)
        history['val_loss'].append(val_loss)
        if val_acc > best_accuracy:
            torch.save(model.state_dict(), 'best_model_state.bin')
            best_accuracy = val_acc
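To see the train/validation gap explicitly, the history dict collected above can be plotted after training; a minimal sketch, assuming matplotlib is installed (the accuracies are 0-dim tensors, so they are converted to floats first):

import matplotlib.pyplot as plt

train_acc = [float(a) for a in history['train_acc']]
val_acc = [float(a) for a in history['val_acc']]
plt.plot(train_acc, label='train accuracy')
plt.plot(val_acc, label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.ylim([0, 1])
plt.legend()
plt.show()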
Question comments:
Answer 1:

In general, to reduce overfitting you can:

1. Increase regularization
2. Reduce model complexity
3. Perform early stopping
4. Increase the training data
From what you've written, you have already tried 3 and 4. For a neural network, you can increase regularization by increasing dropout. You already have the code for it:
# NOTE: You don't need bert_model here since you're creating one inside
# of SentimentClassifier.
#bert_model = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)

# Sentiment Classifier based on Bert model just loaded
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=0.1) # <-- INCREASE THIS VALUE
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)
I suggest trying higher Dropout probabilities, as indicated in the code above ("INCREASE THIS VALUE"). Keep track of the Dropout probability and the overfitting you observe. Try probabilities of 0.1, 0.2, 0.3, 0.4, 0.5.

In general, I have found that dropout above 0.5 doesn't help much.
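One way to run that sweep is to expose the probability as a constructor argument; the dropout_p parameter below is a hypothetical addition, not part of the original code:

class SentimentClassifier(nn.Module):
    def __init__(self, n_classes, dropout_p=0.3):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
        self.drop = nn.Dropout(p=dropout_p)  # regularization strength to tune
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        returned = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        output = self.drop(returned["pooler_output"])
        return self.out(output)

# e.g. train one model per value and compare the validation curves:
# for p in [0.1, 0.2, 0.3, 0.4, 0.5]:
#     model = SentimentClassifier(3, dropout_p=p).to(device)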
Answer comments:
Hi, thank you for your comment. I tried replacing the old data with new data from our store's customer reviews, and now what is happening is that my training accuracy …

Are you still trying to do sentiment analysis? It's unlikely that BERT is at fault. Did you double-check your new dataset? Is the data in utf-8? Are you able to parse the sentences and the true scores from that data? Can you see the data in your Dataset and Dataloader, and is it what you expect?
Hi, I can't encode my texts as utf-8 because they are not in English but in Finnish; they are currently encoded as ISO-8859-1. I have printed part of the dataset with df.head() and it looks correct. I also tried giving the labelled classes exactly the same size (i.e. negative = positive = neutral = 2247 samples, 6741 samples in total).
OK. You should have said at the start that you have a Finnish dataset. BERT is pre-trained mostly on English text. I googled "BERT Finnish" and found several newer BERT models that have been pre-trained on Finnish; you should do the same.
Hi, as you can see in my code, I am already using a Finnish BERT model.