Load a pre-trained model from disk with Huggingface Transformers

Posted 2023-03-29


【Title】: Load a pre-trained model from disk with Huggingface Transformers 【Posted】: 2021-01-08 01:42:41 【Question】:

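From the documentation for from_pretrained, I understand that I don't have to download the pretrained vectors every time; I can save them and load them from disk with this syntax: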

  - a path to a `directory` containing vocabulary files required by the tokenizer, for instance saved using the :func:`~transformers.PreTrainedTokenizer.save_pretrained` method, e.g.: ``./my_model_directory/``.
  - (not applicable to all derived classes, deprecated) a path or url to a single saved vocabulary file if and only if the tokenizer only requires a single vocabulary file (e.g. Bert, XLNet), e.g.: ``./my_model_directory/vocab.txt``.

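So, I went to the model hub:

https://huggingface.co/models

I found the model I wanted:

https://huggingface.co/bert-base-cased

I downloaded it from the link they provide to this repository:

Pretrained model on English language using a masked language modeling (MLM) objective. It was introduced in this paper and first released in this repository. This model is case-sensitive: it makes a difference between english and English.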

I stored it in:

  /my/local/models/cased_L-12_H-768_A-12/

which contains:

 ./
 ../
 bert_config.json
 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

So, now I have the following:

  PATH = '/my/local/models/cased_L-12_H-768_A-12/'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

I get this error:

>           raise EnvironmentError(msg)
E           OSError: Can't load config for '/my/local/models/cased_L-12_H-768_A-12/'. Make sure that:
E           
E           - '/my/local/models/cased_L-12_H-768_A-12/' is a correct model identifier listed on 'https://huggingface.co/models'
E           
E           - or '/my/local/models/cased_L-12_H-768_A-12/' is the correct path to a directory containing a config.json file

The same happens when I point directly at the config file:

  PATH = '/my/local/models/cased_L-12_H-768_A-12/bert_config.json'
  tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

        if state_dict is None and not from_tf:
            try:
                state_dict = torch.load(resolved_archive_file, map_location="cpu")
            except Exception:
                raise OSError(
>                   "Unable to load weights from pytorch checkpoint file. "
                    "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. "
                )
E               OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.

What should I do differently to get Huggingface to use my local pretrained model?

Update to address the comments

YOURPATH = '/somewhere/on/disk/'

name = 'transfo-xl-wt103'
tokenizer = TransfoXLTokenizerFast(name)
model = TransfoXLModel.from_pretrained(name)
tokenizer.save_pretrained(YOURPATH)
model.save_pretrained(YOURPATH)

>>> Please note you will not be able to load the save vocabulary in Rust-based TransfoXLTokenizerFast as they don't share the same structure.
('/somewhere/on/disk/vocab.bin', '/somewhere/on/disk/special_tokens_map.json', '/somewhere/on/disk/added_tokens.json')

So everything is saved, but then...

YOURPATH = '/somewhere/on/disk/'
TransfoXLTokenizerFast.from_pretrained('transfo-xl-wt103', cache_dir=YOURPATH, local_files_only=True)

    "Cannot find the requested files in the cached path and outgoing traffic has been"
ValueError: Cannot find the requested files in the cached path and outgoing traffic has been disabled. To enable model look-ups and downloads online, set 'local_files_only' to False.

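【Comments】:

Not sure where you got these files from. When I check the link, I can download the following files: config.json, flax_model.msgpack, modelcard.json, pytorch_model.bin, tf_model.h5, vocab.txt. Also, it is better to save the files via tokenizer.save_pretrained('YOURPATH') and model.save_pretrained('YOURPATH') than to download them directly.

Thanks. I have updated the question to reflect that I tried this, but it did not seem to work.

Please use TransfoXLTokenizerFast.from_pretrained(YOURPATH).

@Mittenchops did you ever solve this? I am having similar difficulties loading a model from disk.

I had the same problem when using a relative path (i.e. ./data/bert-large-uncased/), but when I used an absolute path (i.e. /opt/workspace/data/bert-large-uncased/) it miraculously worked.

【Answer 1】: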

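Where is the file located relative to your model folder? I believe it has to be a relative path rather than an absolute one. So if the file where you are writing your code is located in 'my/local/', then your code should look like this: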

PATH = 'models/cased_L-12_H-768_A-12/'
tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)

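You just need to specify the folder where all the files are, not the files directly. I think this is definitely a problem with the PATH. Try changing the style of the slashes: "/" vs "\", since these differ between operating systems. Also try using ".", as in ./models/cased_L-12_H-768_A-12/ etc.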
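
One way to take both the slash style and the relative-versus-absolute question out of the picture is to normalize the path before handing it to from_pretrained. The following is a minimal sketch using only the standard library; resolve_model_dir is a hypothetical helper name, not part of transformers.

```python
from pathlib import Path


def resolve_model_dir(path):
    """Return an absolute, OS-normalized path to an existing model directory.

    Hypothetical helper for illustration; not part of transformers.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_dir():
        raise NotADirectoryError(f"{resolved} is not a directory")
    return str(resolved)


# Usage (assumes the folder exists relative to the current working directory):
# PATH = resolve_model_dir("models/cased_L-12_H-768_A-12/")
# tokenizer = BertTokenizer.from_pretrained(PATH, local_files_only=True)
```

If the helper raises, the problem is the path itself rather than anything inside transformers.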

【Comments】:

Sorry, this is actually an absolute path; it just got mangled when I changed it for the example. I have updated the question.

【Answer 2】:

I had this same need and just got it working with Tensorflow on my Linux box, so I figured I would share.

The requirements.txt file for my code environment:

tensorflow==2.2.0
Keras==2.4.3
scikit-learn==0.23.1
scipy==1.4.1
numpy==1.18.1
opencv-python==4.5.1.48
seaborn==0.11.1
tensorflow-hub==0.12.0
nltk==3.6.2
tqdm==4.60.0
transformers==4.6.0
ipywidgets==7.6.3

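I am using Python 3.6.

I went to this site here, which shows the directory tree for the specific Huggingface model I wanted. I happened to want the uncased model, but these steps should be similar for your cased version. Also note that my link points to a very specific commit of this model, just for the sake of reproducibility - there will very likely be a more up-to-date version by the time someone reads this.

I manually downloaded (or had to copy/paste into Notepad++, because in some cases the download button took me to a raw version of the txt / json... odd...) the following files: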

config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt

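NOTE: Once again, I am using Tensorflow, so I did not download the Pytorch weights. If you are using Pytorch, you will likely want to download those weights instead of the tf_model.h5 file.

I then put those files in this directory on my Linux box: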

/opt/word_embeddings/bert-base-uncased/

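It is probably a good idea to make sure there are at least read permissions on all of these files with a quick ls -la (my permissions on each file are -rw-r--r--). I also have execute permissions on the parent directory (the one listed above) so that people can cd into it.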
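
The same permission check can be done from Python for the user that will actually run the loading code. This is a rough sketch with the standard library only; permission_problems is a hypothetical helper name.

```python
import os


def permission_problems(directory):
    """List paths that the current user cannot use.

    Hypothetical helper: the directory needs read + execute permission
    (to list and enter it), and each file inside needs read permission.
    """
    problems = []
    if not os.access(directory, os.R_OK | os.X_OK):
        problems.append(directory)
        return problems  # the contents cannot be listed, so stop here
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and not os.access(path, os.R_OK):
            problems.append(path)
    return problems


# Usage: an empty list means nothing obvious is wrong with permissions.
# print(permission_problems("/opt/word_embeddings/bert-base-uncased/"))
```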

From there, I am able to load the model like so:

tokenizer:

# python
from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
tokenizer = BertTokenizer.from_pretrained("/opt/word_embeddings/bert-base-uncased/")

layer/model weights:

# python
from transformers import TFAutoModel
# bert = TFAutoModel.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("/opt/word_embeddings/bert-base-uncased/")
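
The note about Pytorch weights can be turned into a small dispatch on whichever weights file is present in the directory. This is only a sketch: AutoModel and TFAutoModel are the transformers classes, while detect_framework and load_local_model are hypothetical helper names, and only the two weight file names mentioned in this thread (pytorch_model.bin and tf_model.h5) are handled.

```python
import os


def detect_framework(model_dir):
    """Return "pt", "tf", or None depending on which weights file is present."""
    if os.path.isfile(os.path.join(model_dir, "pytorch_model.bin")):
        return "pt"
    if os.path.isfile(os.path.join(model_dir, "tf_model.h5")):
        return "tf"
    return None


def load_local_model(model_dir):
    """Load a model from a local directory with the matching Auto class."""
    framework = detect_framework(model_dir)
    if framework == "pt":
        from transformers import AutoModel  # needs torch installed
        return AutoModel.from_pretrained(model_dir)
    if framework == "tf":
        from transformers import TFAutoModel  # needs tensorflow installed
        return TFAutoModel.from_pretrained(model_dir)
    raise FileNotFoundError(f"no pytorch_model.bin or tf_model.h5 in {model_dir}")


# Usage:
# bert = load_local_model("/opt/word_embeddings/bert-base-uncased/")
```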

【Comments】:

【Answer 3】:

You can use the simpletransformers library. Check out the link for a more detailed explanation.

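from simpletransformers.classification import ClassificationModel

# "dir/your_path" is the local directory that holds the saved model files
model = ClassificationModel("bert", "dir/your_path")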

Here I used the classification model as an example. You can use it for many other tasks as well, such as question answering.

【Comments】:

【Answer 4】:

Besides the config file and the vocab file, you also need to add the tf/torch model (with a .h5/.bin extension) to your directory.

In your case, the torch and tf models may be located at these URLs:

torch model: https://cdn.huggingface.co/bert-base-cased-pytorch_model.bin

tf model: https://cdn.huggingface.co/bert-base-cased-tf_model.h5

You can also find all the required files in the files and versions section of your model: https://huggingface.co/bert-base-cased/tree/main
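
A quick way to see what a local directory is still missing before calling from_pretrained is sketched below. The file names are the ones mentioned in this thread for a BERT model (other tokenizers use different vocabulary files), and missing_files is a hypothetical helper name.

```python
import os

# File names taken from this thread; at least one weights file is needed.
REQUIRED = ("config.json", "vocab.txt")
WEIGHTS = ("pytorch_model.bin", "tf_model.h5")


def missing_files(model_dir):
    """Return the expected file names that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    missing = [name for name in REQUIRED if name not in present]
    if not any(name in present for name in WEIGHTS):
        missing.append(" or ".join(WEIGHTS))
    return missing


# Usage:
# print(missing_files("/my/local/models/cased_L-12_H-768_A-12/"))
```

For the directory listed in the question (bert_config.json, the bert_model.ckpt.* files and vocab.txt), this would report both config.json and a weights file as missing, which matches the two errors shown there.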

【Comments】:

【Answer 5】:

The bert model folder contains these files:

config.json
tf_model.h5
tokenizer_config.json
tokenizer.json
vocab.txt

What if, instead of these, we have bert_config.json together with

 bert_model.ckpt.data-00000-of-00001
 bert_model.ckpt.index
 bert_model.ckpt.meta
 vocab.txt

What should we do then?
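
One possible route, sketched under assumptions: the bert_model.ckpt.* files look like an original TensorFlow BERT checkpoint, and transformers exposes a loader for that format, load_tf_weights_in_bert. The sketch assumes a transformers 4.x installation with both torch and tensorflow available; find_checkpoint_prefix and convert_tf_checkpoint are hypothetical helper names, and do_lower_case=False assumes a cased vocabulary.

```python
import os


def find_checkpoint_prefix(ckpt_dir):
    """Return the checkpoint prefix (e.g. .../bert_model.ckpt) from its .index file."""
    for name in sorted(os.listdir(ckpt_dir)):
        if name.endswith(".ckpt.index"):
            return os.path.join(ckpt_dir, name[: -len(".index")])
    raise FileNotFoundError(f"no *.ckpt.index file in {ckpt_dir}")


def convert_tf_checkpoint(ckpt_dir, out_dir):
    """Write a from_pretrained-compatible copy of the checkpoint to out_dir."""
    # Imported lazily: needs transformers 4.x, torch and tensorflow (assumption).
    from transformers import (BertConfig, BertForPreTraining, BertTokenizer,
                              load_tf_weights_in_bert)

    config = BertConfig.from_json_file(os.path.join(ckpt_dir, "bert_config.json"))
    model = BertForPreTraining(config)
    load_tf_weights_in_bert(model, config, find_checkpoint_prefix(ckpt_dir))
    model.save_pretrained(out_dir)  # writes config.json plus the PyTorch weights

    # do_lower_case=False assumes a cased vocabulary.
    tokenizer = BertTokenizer(os.path.join(ckpt_dir, "vocab.txt"), do_lower_case=False)
    tokenizer.save_pretrained(out_dir)


# Usage (paths are placeholders):
# convert_tf_checkpoint("/my/local/models/cased_L-12_H-768_A-12/", "/my/local/models/converted/")
```

After the conversion, BertTokenizer.from_pretrained and BertModel.from_pretrained should be able to load from the output directory like any other saved model.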

【Comments】:
