How to re-download a tokenizer for Hugging Face?
Posted: 2022-01-04 05:24:18

Question: I am running into exactly the same problem as https://github.com/huggingface/transformers/issues/11243, except that it does not work in Jupyter Lab. It does work in Python in my shell. Edit: after I closed and reopened the shell, it no longer works in the shell either.
I downloaded the cardiffnlp/twitter-roberta-base-emotion model with:
model_name = "cardiffnlp/twitter-roberta-base-emotion"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
我用model.save_pretrained(model_name)
保存了模型,现在我无法加载标记器。如果我跑:
tokenizer = AutoTokenizer.from_pretrained(model_name)
it gives this error:
OSError: Can't load tokenizer for 'cardiffnlp/twitter-roberta-base-emotion'. Make sure that:
- 'cardiffnlp/twitter-roberta-base-emotion' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure 'cardiffnlp/twitter-roberta-base-emotion' is not a path to a local directory with something else, in that case)
- or 'cardiffnlp/twitter-roberta-base-emotion' is the correct path to a directory containing relevant tokenizer files
Because I saved the model yesterday but not the tokenizer, I can no longer load the tokenizer. What can I do to fix this? I don't understand how I am supposed to save the tokenizer if I can't load it.
Question comments:
The model and the tokenizer are two different things, but they share the same download location. You need to save both the tokenizer and the model.

I understand that. I'm asking how to do it, since I can no longer load the tokenizer locally.

You can delete it from wherever you saved it and re-download it; check ~/.cache/huggingface/ (a sketch of this suggestion follows below).
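To make that last comment concrete, here is a minimal sketch. It assumes that model.save_pretrained(model_name) created a folder with the same relative path as the Hub ID in the current working directory (which matches the code in the question); the replacement directory name is hypothetical.

import shutil
from transformers import AutoTokenizer

model_name = "cardiffnlp/twitter-roberta-base-emotion"
local_dir = "twitter-roberta-base-emotion-local"  # hypothetical new name

# model.save_pretrained(model_name) created a local folder whose name matches
# the Hub ID, so from_pretrained() resolves to that folder (which has no
# tokenizer files) instead of downloading from the Hub. Move it out of the way.
shutil.move(model_name, local_dir)

# Now the identifier resolves to the Hub again and the tokenizer downloads.
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Save it next to the model weights so both can be reloaded from disk later.
tokenizer.save_pretrained(local_dir)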
Answer 1:
The model and the tokenizer are two different things, but they share the same download location. You need to save the tokenizer as well as the model. I wrote a small utility to help:
import typing as t
from pathlib import Path

import torch
from loguru import logger
from transformers import PreTrainedModel, PreTrainedTokenizer


class ModelLoader:
    """ModelLoader

    Downloads and loads Hugging Face models.
    The download happens only when the model is not already in the local
    model directory; if it exists locally, it is loaded from disk.
    """

    def __init__(
        self,
        model_name: str,
        model_directory: str,
        tokenizer_loader: PreTrainedTokenizer,
        model_loader: PreTrainedModel,
    ):
        self.model_name = Path(model_name)
        self.model_directory = Path(model_directory)
        self.model_loader = model_loader
        self.tokenizer_loader = tokenizer_loader
        self.save_path = self.model_directory / self.model_name

        if not self.save_path.exists():
            logger.debug(f"[+] {self.save_path} does not exist!")
            self.save_path.mkdir(parents=True, exist_ok=True)
            self.__download_model()

        self.tokenizer, self.model = self.__load_model()

    def __repr__(self):
        return f"{self.__class__.__name__}(model={self.save_path})"

    # Download from Hugging Face and save both the tokenizer and the model
    def __download_model(self) -> None:
        logger.debug(f"[+] Downloading {self.model_name}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.model_name}")
        model = self.model_loader.from_pretrained(f"{self.model_name}")
        logger.debug(f"[+] Saving {self.model_name} to {self.save_path}")
        tokenizer.save_pretrained(f"{self.save_path}")
        model.save_pretrained(f"{self.save_path}")
        logger.debug("[+] Process completed")

    # Load the tokenizer and model from the local save path
    def __load_model(self) -> t.Tuple:
        logger.debug(f"[+] Loading model from {self.save_path}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.save_path}")
        # Check if a GPU is available
        device = "cuda" if torch.cuda.is_available() else "cpu"
        logger.info(f"[+] Loading model on device: {device}")
        model = self.model_loader.from_pretrained(f"{self.save_path}").to(device)
        logger.debug("[+] Loading completed")
        return tokenizer, model

    def retrieve(self) -> t.Tuple:
        """Retrieve the loaded pair.

        Returns:
            Tuple: tokenizer, model
        """
        return self.tokenizer, self.model
You can use it like this:
…
model_name = "cardiffnlp/twitter-roberta-base-emotion"
model_directory = "/tmp" # or where you want to store models
tokenizer_loader = AutoTokenizer
model_loader = AutoModelForSequenceClassification
get_model = ModelLoader(model_name=model_name, model_directory=model_directory, tokenizer_loader=tokenizer_loader, model_loader=model_loader)
tokenizer, model = get_model.retrieve()
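As a quick sanity check (the example sentence is mine, not from the original answer), the returned pair can be used directly for classification:

import torch

# Hypothetical smoke test: tokenize one sentence and run it through the
# loaded sequence-classification model on whatever device it was placed on.
inputs = tokenizer("I love this!", return_tensors="pt").to(model.device)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)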