HuggingFace Saving-Loading Model (Colab) to Make Predictions

[Posted]: 2021-08-29 03:53:24 [Question]:

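I am using HuggingFace to train a Transformer model to predict a target variable (e.g., movie ratings). I'm new to Python and this is likely a simple question, but I can't figure out how to save a trained classifier model (via Colab) and then reload it to make target-variable predictions on new data. For example, I trained a model to predict IMDb ratings using the example from the HuggingFace resources, shown below. I've tried a number of approaches (save_model, save_pretrained) and either could not save the model at all or, once it was loaded, could not figure out what to call to get predictions. Any help with the steps for saving, loading, and then generating new predicted scores from the model on test data would be much appreciated.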

#example mainly from here: https://huggingface.co/transformers/training.html
!pip install transformers
!pip install datasets

from datasets import load_dataset
raw_datasets = load_dataset("imdb")

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], max_length = 128, padding="max_length", truncation=True) 

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

#choosing small datasets for example#
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

### TRAINING classification ###
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

from transformers import TrainingArguments
from transformers import Trainer

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch", num_train_epochs=2, weight_decay=.0001, learning_rate=0.00001, per_device_train_batch_size=32) 

trainer = Trainer(model=model, args=training_args, train_dataset=small_train_dataset, eval_dataset=small_eval_dataset)
trainer.train()

y_test_predicted_original = model_loaded.predict(small_eval_dataset)

#### Saving ###
from google.colab import drive
drive.mount('/content/gdrive')
%cd /content/gdrive/My\ Drive/FOLDER

trainer.save_pretrained ("Trained model") #assumed this would save but did not
model.save_pretrained ("Trained model") #did save

### Loading Model and Creating Predicted Scores ###

#perhaps this....#
from transformers import BertConfig, BertModel
conf = BertConfig.from_pretrained("Trained model", num_labels=2)
model_loaded = AutoModelForSequenceClassification.from_pretrained("Trained model", config=conf)

#or...#
model_loaded = AutoModelForSequenceClassification.from_pretrained("Trained model", local_files_only=True)
model_loaded 

#with ultimate goal of getting predicted scores (not sure what to call here)...
y_test_predicted_loaded = model_loaded.predict(small_eval_dataset)


[Answer 1]:

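Save the model

trainer.save_model("Trained model")
# The Trainer in the question was created without a tokenizer, so save_model()
# writes only the model files; save the tokenizer explicitly so it can be reloaded.
tokenizer.save_pretrained("Trained model")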
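Since the question mentions that some save attempts silently produced nothing, it can help to check the Drive folder before trying to reload. The sketch below is only a sanity check, not part of the transformers API: the helper name `check_saved_model_dir` is hypothetical, and the file names are assumptions about what `save_pretrained` typically writes (`config.json`, a weights file whose name depends on the installed transformers version, and `tokenizer_config.json` when the tokenizer was saved).

```python
import os

# Assumed file names: the weights file is model.safetensors or pytorch_model.bin
# depending on the transformers version in use.
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")

def check_saved_model_dir(path):
    """Return a list of problems found in a saved-model directory (empty if none)."""
    if not os.path.isdir(path):
        return [f"directory not found: {path}"]
    files = set(os.listdir(path))
    problems = []
    if "config.json" not in files:
        problems.append("missing config.json")
    if not any(name in files for name in WEIGHT_FILES):
        problems.append("missing model weights file")
    if "tokenizer_config.json" not in files:
        problems.append("missing tokenizer_config.json (tokenizer not saved)")
    return problems

# Run from the folder that holds "Trained model" (e.g. after the %cd in the question).
print(check_saved_model_dir("Trained model"))
```

An empty list suggests both the model and the tokenizer can be loaded from that folder; a tokenizer warning means `AutoTokenizer.from_pretrained("Trained model")` is likely to fail until the tokenizer is saved there as well.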

Load the model and tokenizer

model_loaded = AutoModelForSequenceClassification.from_pretrained("Trained model")
tokenizer = AutoTokenizer.from_pretrained("Trained model")

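Predict

# Wrap the reloaded model (not the original `model`) so predictions come from the saved weights.
trainer = Trainer(model=model_loaded)
# test_dataset is any tokenized dataset, e.g. small_eval_dataset from the question.
test_results = trainer.predict(test_dataset)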
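`trainer.predict` returns an object whose `predictions` field holds the raw logits, one row per example and one column per label, rather than class labels. A minimal sketch of turning those logits into predicted labels and probabilities follows; the helper name `logits_to_predictions` and the example array are illustrative stand-ins, not part of the original answer.

```python
import numpy as np

def logits_to_predictions(logits):
    """Convert logits of shape (n_examples, n_labels) into labels and probabilities."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)  # softmax over the label axis
    labels = probs.argmax(axis=1)                 # index of the most likely label
    return labels, probs

# Hypothetical stand-in for test_results.predictions
example_logits = np.array([[2.0, -1.0], [0.1, 0.3], [-3.0, 3.0]])
labels, probs = logits_to_predictions(example_logits)
print(labels)
print(probs)
```

With the real output, the call would be `logits_to_predictions(test_results.predictions)`; for the two-label IMDb setup in the question, the second column of `probs` can be read as the predicted probability of label 1.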
