How to re-download tokenizer for huggingface?

Posted: 2022-01-04 05:24:18

Question:

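I have exactly the same problem as https://github.com/huggingface/transformers/issues/11243, except that it only fails in Jupyter Lab. It does work in python in my shell. EDIT: after I closed and reopened the shell, it no longer works in the shell either.

I downloaded the cardiffnlp/twitter-roberta-base-emotion model using: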

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cardiffnlp/twitter-roberta-base-emotion"
model = AutoModelForSequenceClassification.from_pretrained(model_name)

I saved the model with model.save_pretrained(model_name), and now I can't load the tokenizer. If I run:

tokenizer = AutoTokenizer.from_pretrained(model_name)

it gives the error:

OSError: Can't load tokenizer for 'cardiffnlp/twitter-roberta-base-emotion'. Make sure that:

- 'cardiffnlp/twitter-roberta-base-emotion' is a correct model identifier listed on 'https://huggingface.co/models'
(make sure 'cardiffnlp/twitter-roberta-base-emotion' is not a path to a local directory with something else, in that case)

- or 'cardiffnlp/twitter-roberta-base-emotion' is the correct path to a directory containing relevant tokenizer files

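Because I saved the model yesterday but not the tokenizer, I can no longer load the tokenizer. What can I do to fix this? I don't understand how I can save the tokenizer if I can't load it.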
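
A likely explanation, given the steps above: `model.save_pretrained(model_name)` writes to a local directory whose relative path equals the Hub ID, and `from_pretrained` uses an existing local directory of that name instead of contacting the Hub. That would also fit the behaviour changing with the working directory (Jupyter Lab versus a shell). The sketch below only checks for such a directory and reports which tokenizer files it contains. The helper names and the `TOKENIZER_FILES` list are assumptions made for illustration, not part of transformers, and other tokenizer types may use different file names.

```python
from pathlib import Path
from typing import List, Optional

# Assumption: common tokenizer file names for a RoBERTa-style checkpoint.
# Other tokenizers may store their vocabulary under different names.
TOKENIZER_FILES = ("tokenizer_config.json", "tokenizer.json", "vocab.json", "merges.txt")


def find_shadowing_dir(model_name: str, base_dir: Optional[Path] = None) -> Optional[Path]:
    """Return base_dir/model_name if it is an existing directory, else None.

    base_dir defaults to the current working directory, which is where a
    relative path such as "cardiffnlp/twitter-roberta-base-emotion" resolves.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    candidate = base / model_name
    return candidate if candidate.is_dir() else None


def tokenizer_files_present(directory: Path) -> List[str]:
    """List which of the assumed tokenizer files exist in `directory`."""
    return [name for name in TOKENIZER_FILES if (directory / name).exists()]


if __name__ == "__main__":
    model_name = "cardiffnlp/twitter-roberta-base-emotion"
    shadow = find_shadowing_dir(model_name)
    if shadow is None:
        print("No local directory with this name under", Path.cwd())
    else:
        print("Local directory found:", shadow.resolve())
        print("Tokenizer files in it:", tokenizer_files_present(shadow))
```

An empty list for an existing directory suggests a model-only save, which matches the error message's hint about "a path to a local directory with something else".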

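Comments:

The model and the tokenizer are two different things, but they share the same download location. You need to save both the tokenizer and the model.

I understand that. I'm asking how to do it, since I can no longer load the tokenizer locally.

You can delete it from the location where you saved it and re-download. Check ~/.cache/huggingface/

Answer 1: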

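The model and the tokenizer are two different things, but they share the same download location. You need to save both the tokenizer and the model. I wrote a simple utility to help.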

import typing as t
from pathlib import Path

import torch
from loguru import logger
from transformers import PreTrainedModel
from transformers import PreTrainedTokenizer


class ModelLoader:
    """ModelLoader
    Downloads and loads Hugging Face models.
       The download happens only when the model is not in the local model directory.
       If the model already exists in the local directory, it is loaded from there.
    """

    def __init__(
        self,
        model_name: str,
        model_directory: str,
        tokenizer_loader: PreTrainedTokenizer,  # pass the loader class itself, e.g. AutoTokenizer
        model_loader: PreTrainedModel,  # e.g. AutoModelForSequenceClassification
    ):

        # Keep the Hub ID as a plain string so its "/" survives on every OS
        self.model_name = model_name
        self.model_directory = Path(model_directory)
        self.model_loader = model_loader
        self.tokenizer_loader = tokenizer_loader

        # Tokenizer and model are both stored under model_directory/model_name
        self.save_path = self.model_directory / self.model_name

        if not self.save_path.exists():
            logger.debug(f"[+] {self.save_path} does not exist!")
            self.save_path.mkdir(parents=True, exist_ok=True)
            self.__download_model()

        self.tokenizer, self.model = self.__load_model()

    def __repr__(self):
        return f"{self.__class__.__name__}(model={self.save_path})"

    # Download tokenizer and model from Hugging Face and save both locally
    def __download_model(self) -> None:

        logger.debug(f"[+] Downloading {self.model_name}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.model_name}")
        model = self.model_loader.from_pretrained(f"{self.model_name}")

        logger.debug(f"[+] Saving {self.model_name} to {self.save_path}")
        tokenizer.save_pretrained(f"{self.save_path}")
        model.save_pretrained(f"{self.save_path}")

        logger.debug("[+] Process completed")

    # Load tokenizer and model from the local directory
    def __load_model(self) -> t.Tuple:

        logger.debug(f"[+] Loading model from {self.save_path}")
        tokenizer = self.tokenizer_loader.from_pretrained(f"{self.save_path}")
        # Use the GPU when one is available, otherwise fall back to the CPU
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = self.model_loader.from_pretrained(f"{self.save_path}").to(device)
        logger.info(f"[+] Model loaded on {device}")

        logger.debug("[+] Loading completed")
        return tokenizer, model

    def retrieve(self) -> t.Tuple:

        """Retrieve
        Returns:
            Tuple: tokenizer, model
        """
        return self.tokenizer, self.model

You can use it like this:


…
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "cardiffnlp/twitter-roberta-base-emotion"
model_directory = "/tmp"  # or wherever you want to store models

# Pass the loader classes themselves, not instances
tokenizer_loader = AutoTokenizer
model_loader = AutoModelForSequenceClassification


get_model = ModelLoader(model_name=model_name, model_directory=model_directory, tokenizer_loader=tokenizer_loader, model_loader=model_loader)


# retrieve() returns the tokenizer first, then the model
tokenizer, model = get_model.retrieve()
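
If a model-only directory named after the Hub ID is what blocks the tokenizer download, the suggestion in the comments (delete the saved copy and re-download) can be tried in a less destructive way by renaming the directory. The following is a minimal sketch under that assumption; `set_aside_local_copy` and the `.model-only` suffix are hypothetical names chosen for illustration, and the function uses only the standard library.

```python
from pathlib import Path
from typing import Optional


def set_aside_local_copy(
    model_name: str, base_dir: Path, suffix: str = ".model-only"
) -> Optional[Path]:
    """Rename base_dir/model_name out of the way.

    Returns the new path, or None when no such directory exists.
    Nothing is deleted, so the saved model files stay available.
    """
    local_copy = Path(base_dir) / model_name
    if not local_copy.is_dir():
        return None
    backup = local_copy.with_name(local_copy.name + suffix)
    if backup.exists():
        # Refuse to overwrite an earlier backup
        raise FileExistsError(f"{backup} already exists; choose another suffix")
    local_copy.rename(backup)
    return backup


if __name__ == "__main__":
    # Run this from the working directory in which save_pretrained was called
    moved = set_aside_local_copy("cardiffnlp/twitter-roberta-base-emotion", Path.cwd())
    print("Renamed to:", moved)
```

After the rename, `AutoTokenizer.from_pretrained(model_name)` run from the same working directory should no longer find a local directory with that name and should fall back to the Hub. Saving the tokenizer and the model together under a directory that differs from the Hub ID, as the utility above does with `model_directory`, avoids the name clash in the first place.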
