Huggingface Bert Tokenizer: build from source due to proxy issues



【Posted】: 2021-12-05 12:46:50 【Question】:

I have run into a situation similar to this one: BERT tokenizer & model download

The link above is about downloading the Bert model itself, but I only want to use the Bert Tokenizer.

Normally I would do this:

from transformers import BertTokenizer
bert_tokenizer_en = BertTokenizer.from_pretrained("bert-base-uncased")
bert_tokenizer_de = BertTokenizer.from_pretrained("bert-base-german-cased")

But I am running on a remote machine, so I cannot download anything with the method above. What I don't know is which files I need to get from https://huggingface.co/bert-base-uncased/tree/main so that I can build the tokenizer myself.

【Comments】:

【Answer 1】:

You need to 1) download the vocabulary and config files (vocab.txt, config.json), 2) put them into a folder, and 3) pass that folder's path to BertTokenizer.from_pretrained(<path>). A sketch of those three steps in Python follows.
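This is a minimal sketch, assuming the machine you run it on can reach huggingface.co directly or through a proxy (requests honors the standard HTTPS_PROXY environment variable); if the remote box has no access at all, run it locally and copy the folder over. The folder path is just an example:

import os
import requests  # picks up HTTPS_PROXY from the environment by default

# Example local folder; any path you later pass to from_pretrained works.
folder = os.path.expanduser("~/german-tokenizer")
os.makedirs(folder, exist_ok=True)

base = "https://huggingface.co/bert-base-german-cased/resolve/main"
for name in ("vocab.txt", "config.json"):
    resp = requests.get(base + "/" + name)
    resp.raise_for_status()
    with open(os.path.join(folder, name), "wb") as f:
        f.write(resp.content)

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(folder)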

Here is where vocab.txt can be downloaded for the different tokenizer models:

PRETRAINED_VOCAB_FILES_MAP = {
    "vocab_file": {
        "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt",
        "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/vocab.txt",
        "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/vocab.txt",
        "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/vocab.txt",
        "bert-base-multilingual-uncased": "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/vocab.txt",
        "bert-base-multilingual-cased": "https://huggingface.co/bert-base-multilingual-cased/resolve/main/vocab.txt",
        "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/vocab.txt",
        "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt",
        "bert-large-uncased-whole-word-masking": "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/vocab.txt",
        "bert-large-cased-whole-word-masking": "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/vocab.txt",
        "bert-large-uncased-whole-word-masking-finetuned-squad": "https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt",
        "bert-large-cased-whole-word-masking-finetuned-squad": "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/vocab.txt",
        "bert-base-cased-finetuned-mrpc": "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/vocab.txt",
        "bert-base-german-dbmdz-cased": "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/vocab.txt",
        "bert-base-german-dbmdz-uncased": "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/vocab.txt",
        "TurkuNLP/bert-base-finnish-cased-v1": "https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/vocab.txt",
        "TurkuNLP/bert-base-finnish-uncased-v1": "https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/vocab.txt",
        "wietsedv/bert-base-dutch-cased": "https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/vocab.txt",
    }
}

And the locations of config.json:

BERT_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "bert-base-uncased": "https://huggingface.co/bert-base-uncased/resolve/main/config.json",
    "bert-large-uncased": "https://huggingface.co/bert-large-uncased/resolve/main/config.json",
    "bert-base-cased": "https://huggingface.co/bert-base-cased/resolve/main/config.json",
    "bert-large-cased": "https://huggingface.co/bert-large-cased/resolve/main/config.json",
    "bert-base-multilingual-uncased": "https://huggingface.co/bert-base-multilingual-uncased/resolve/main/config.json",
    "bert-base-multilingual-cased": "https://huggingface.co/bert-base-multilingual-cased/resolve/main/config.json",
    "bert-base-chinese": "https://huggingface.co/bert-base-chinese/resolve/main/config.json",
    "bert-base-german-cased": "https://huggingface.co/bert-base-german-cased/resolve/main/config.json",
    "bert-large-uncased-whole-word-masking": "https://huggingface.co/bert-large-uncased-whole-word-masking/resolve/main/config.json",
    "bert-large-cased-whole-word-masking": "https://huggingface.co/bert-large-cased-whole-word-masking/resolve/main/config.json",
    "bert-large-uncased-whole-word-masking-finetuned-squad": "https://huggingface.co/bert-large-uncased-whole-word-masking-finetuned-squad/resolve/main/config.json",
    "bert-large-cased-whole-word-masking-finetuned-squad": "https://huggingface.co/bert-large-cased-whole-word-masking-finetuned-squad/resolve/main/config.json",
    "bert-base-cased-finetuned-mrpc": "https://huggingface.co/bert-base-cased-finetuned-mrpc/resolve/main/config.json",
    "bert-base-german-dbmdz-cased": "https://huggingface.co/bert-base-german-dbmdz-cased/resolve/main/config.json",
    "bert-base-german-dbmdz-uncased": "https://huggingface.co/bert-base-german-dbmdz-uncased/resolve/main/config.json",
    "cl-tohoku/bert-base-japanese": "https://huggingface.co/cl-tohoku/bert-base-japanese/resolve/main/config.json",
    "cl-tohoku/bert-base-japanese-whole-word-masking": "https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking/resolve/main/config.json",
    "cl-tohoku/bert-base-japanese-char": "https://huggingface.co/cl-tohoku/bert-base-japanese-char/resolve/main/config.json",
    "cl-tohoku/bert-base-japanese-char-whole-word-masking": "https://huggingface.co/cl-tohoku/bert-base-japanese-char-whole-word-masking/resolve/main/config.json",
    "TurkuNLP/bert-base-finnish-cased-v1": "https://huggingface.co/TurkuNLP/bert-base-finnish-cased-v1/resolve/main/config.json",
    "TurkuNLP/bert-base-finnish-uncased-v1": "https://huggingface.co/TurkuNLP/bert-base-finnish-uncased-v1/resolve/main/config.json",
    "wietsedv/bert-base-dutch-cased": "https://huggingface.co/wietsedv/bert-base-dutch-cased/resolve/main/config.json",
    # See all BERT models at https://huggingface.co/models?filter=bert
}
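If you already have transformers installed somewhere, the same URLs can be read straight from these maps instead of being copied by hand. A sketch assuming the 4.x module layout (roughly what shipped around the time of this post; these module-level maps may not exist in newer releases):

from transformers.models.bert.tokenization_bert import PRETRAINED_VOCAB_FILES_MAP
from transformers.models.bert.configuration_bert import BERT_PRETRAINED_CONFIG_ARCHIVE_MAP

# Look up the two URLs needed for the German tokenizer.
print(PRETRAINED_VOCAB_FILES_MAP["vocab_file"]["bert-base-german-cased"])
print(BERT_PRETRAINED_CONFIG_ARCHIVE_MAP["bert-base-german-cased"])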

Source: the Transformers codebase 1, 2

Steps:

mkdir ~/german-tokenizer
cd ~/german-tokenizer
wget https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt
wget https://huggingface.co/bert-base-german-cased/resolve/main/config.json
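If the remote machine reaches the hub only through a proxy, wget honors the standard proxy environment variables; the address below is a placeholder:

export https_proxy=http://your.proxy.host:8080   # placeholder; then rerun the wget commands above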

python

# Python runtime (note: from_pretrained does not expand ~, so expand it explicitly):
>>> import os
>>> from transformers import BertTokenizer
>>> tokenizer = BertTokenizer.from_pretrained(os.path.expanduser('~/german-tokenizer'))
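As a quick sanity check that the tokenizer was really built from the local files, tokenize a short string (the exact word pieces and ids depend on vocab.txt, so no output is shown here):

>>> tokenizer.tokenize("Guten Morgen")        # splits the text into word pieces from vocab.txt
>>> tokenizer("Guten Morgen")["input_ids"]    # token ids, with [CLS]/[SEP] added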

【Discussion】:
