Learning Notes: The Transformers Library
Posted by 囚生CY
Library API documentation: https://huggingface.co/transformers/
Version: 4.3.0
Preface
The Transformers library is a relatively new project. As of March 2, 2021 it already includes model code for quite a few papers posted to arXiv in 2020, and it makes it very easy to load state-of-the-art deep learning models, BERT included (mostly models for natural language processing), and to continue training or fine-tune them with PyTorch or TensorFlow 2.x.
Unlike TensorFlow Hub, which is hard to reach from within China without a proxy, downloading model files from huggingface is currently quite fast on a decent connection, and it has become the mainstream way to load BERT models in PyTorch. TensorFlow users can of course load BERT following the README of the official BERT repository, and I have written about that before, but after the upgrade to TensorFlow 2.x many of those older approaches no longer work, which makes the Transformers library all the more important.
As with my earlier notes on the DGL library, this is mainly a translation of the API documentation: I have excerpted most of the useful content, left out only the less important parts, and added some annotations of my own, so it can serve as an introduction and quick start (the library is fairly easy to use).
Table of Contents
Part 1: Getting Started
Quick Tour
Getting started with pipelines
Translated from https://huggingface.co/transformers/quicktour.html;
- Pipelines (`pipeline`):
- Supported task types:
- (1) Sentiment analysis: decide whether a text is positive or negative;
- (2) Text generation: generate a passage of text from a prompt;
- (3) Named entity recognition: label each token in a sentence with the type of entity it represents;
- (4) Question answering: produce an answer from a context passage and a question;
- (5) Filling masked text: restore a sentence from which some words have been masked out;
- (6) Summarization: produce a summary of a long text;
- (7) Translation: translate text from one language into another;
- (8) Feature extraction: produce a tensor representation of a text;
- Taking sentiment analysis as an example, a quick-start snippet:
```python
from transformers import pipeline

nlp = pipeline("sentiment-analysis")

result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
```
- This pipeline downloads the model distilbert-base-uncased-finetuned-sst-2-english. To use a specific model instead, set the `model` argument to the name of a model stored on the model hub; the model below, for example, handles French, Italian and Dutch in addition to English:
```python
from transformers import pipeline

classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
```
- The details of each model can be found in the README on its huggingface model page;
- A `tokenizer` argument can also be passed to the pipeline to specify the tokenizer explicitly; the library provides `AutoTokenizer` together with `AutoModelForSequenceClassification` (PyTorch) and `TFAutoModelForSequenceClassification` (TensorFlow) for this:

```python
# PyTorch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

# TensorFlow
from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
# This model only exists in PyTorch, so we use the `from_pt` flag to import it in TensorFlow.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
```
- To fine-tune these pretrained pipeline models on your own dataset, see the Examples;
- For how to call pipelines for the other tasks, see the task summary; a full sequence-classification example is given in the Task Summary section of Part 2 below, and a short sketch of one more pipeline task follows here.
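As an illustration of one of the other task types listed above, the following is a minimal, hedged sketch of the fill-mask pipeline (task (5)); the mask token and the exact result fields depend on the default model the library picks, so the output handling below is an assumption rather than a guarantee:

```python
# A minimal sketch of the fill-mask pipeline. The default model chosen by the
# library uses "<mask>" as its mask token; for other models, substitute
# tokenizer.mask_token instead.
from transformers import pipeline

unmasker = pipeline("fill-mask")
results = unmasker("HuggingFace is creating a <mask> that the community uses to solve NLP tasks.")
for r in results:
    # each candidate comes with a filled-in sequence and a probability score
    print(r["sequence"], round(r["score"], 4))
```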
What happens when you call a pipeline
- Using a tokenizer: under the hood, every model and tokenizer is created with the `from_pretrained` method, usually from the name of a pretrained checkpoint or from a local directory;
- Example: note that `AutoTokenizer` and `AutoModelForSequenceClassification`, used above to load the model and tokenizer, are high-level interface classes; you can also use the class that matches a given model, e.g. distilbert-base-uncased-finetuned-sst-2-english corresponds to `DistilBertTokenizer` and `DistilBertForSequenceClassification`:

```python
# PyTorch
# method 1: the generic Auto* classes
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

# method 2: the model-specific classes
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = DistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# TensorFlow
# method 1
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")

# method 2
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

print(inputs)
```
- Output: the token IDs plus some extra information that is useful to the model;
```
{'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
```
- Tokenizing a batch of sentences:
```python
batch = tokenizer(
    ["We are very happy to show you the 🤗 Transformers library.",
     "We hope you don't hate it."],
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # change to "tf" for TensorFlow
)
for key, value in batch.items():
    print(f"{key}: {value.numpy().tolist()}")
```
- Output:
```
input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102],
            [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
```
- For more on tokenization, see Preprocessing data;
- Using the pretrained model: data preprocessed by the tokenizer can be fed straight into the model; as noted above, the tokenizer output contains everything the model needs:
- Example: note that the PyTorch model must be called with the input dictionary unpacked via `**`, while the TensorFlow model takes the dictionary directly;
```python
# pt_batch / tf_batch denote the tokenizer output from the previous step,
# built with return_tensors="pt" and return_tensors="tf" respectively.

# PyTorch
# import torch
# pt_outputs = pt_model(**pt_batch, labels=torch.tensor([1, 0]))  # optionally pass labels
pt_outputs = pt_model(**pt_batch)
print(pt_outputs)

# TensorFlow
# import tensorflow as tf
# tf_outputs = tf_model(tf_batch, labels=tf.constant([1, 0]))  # optionally pass labels
tf_outputs = tf_model(tf_batch)
print(tf_outputs)
```
- Output:
```
(tensor([[-4.0833,  4.3364],
         [ 0.0818, -0.0418]], grad_fn=<AddmmBackward>),)

(<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[-4.0832963 ,  4.336414  ],
       [ 0.08181786, -0.04179301]], dtype=float32)>,)
```
- Important: what the model returns are the activations before its final activation function (e.g. `softmax`). This holds for every model in the `transformers` library, because the final activation is usually fused with the loss function.
- Applying the final activation manually:
```python
# PyTorch
import torch.nn.functional as F
pt_predictions = F.softmax(pt_outputs[0], dim=-1)

# TensorFlow
import tensorflow as tf
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
```
- The pretrained models themselves are `torch.nn.Module` or `tensorflow.keras.Model` instances, so they can be trained inside a normal PyTorch or TensorFlow workflow. The library also provides the `Trainer` and `TFTrainer` classes for training; see the training tutorial for details, and the hedged `Trainer` sketch after the next snippet.
- A fine-tuned tokenizer or model can be saved and loaded again later:
```python
from transformers import AutoModel, AutoTokenizer

tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

tokenizer = AutoTokenizer.from_pretrained(save_directory)
# from_tf=True loads a checkpoint that was saved on the TensorFlow side into PyTorch
model = AutoModel.from_pretrained(save_directory, from_tf=True)
```
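As a complement to the training tutorial mentioned above, here is a minimal, hedged sketch of fine-tuning with `Trainer` in PyTorch; `train_dataset` and `eval_dataset` are assumed to be already-tokenized datasets with a label column, and all hyperparameter values are illustrative only:

```python
# A minimal Trainer sketch (PyTorch). Assumes train_dataset / eval_dataset
# already exist as tokenized, labeled datasets; hyperparameters are illustrative.
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

training_args = TrainingArguments(
    output_dir="./results",               # where checkpoints and logs are written
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",          # evaluate at the end of each epoch
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
trainer.save_model("./fine-tuned-model")  # can be reloaded with from_pretrained
```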
- Returning the model's hidden states and all attention weights (a small shape-inspection follow-up comes after the snippet):
```python
# PyTorch
pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = pt_outputs[-2:]

# TensorFlow
tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
all_hidden_states, all_attentions = tf_outputs[-2:]
```
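A small follow-up using the variables defined above: both returns are tuples with one tensor per layer (the hidden states additionally include the embedding output), and the shapes shown in the comments are the general pattern rather than model-specific numbers:

```python
# Inspect what came back: hidden states = embedding output + one tensor per layer;
# attentions = one tensor per layer.
print(len(all_hidden_states), len(all_attentions))
print(all_hidden_states[-1].shape)  # (batch_size, sequence_length, hidden_size)
print(all_attentions[-1].shape)     # (batch_size, num_heads, sequence_length, sequence_length)
```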
- The model architecture can be customized through a `config` object, and a few simple settings can also be passed directly to `from_pretrained`:

```python
# PyTorch
from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification

# build a model with a custom architecture from a config (weights are randomly initialized)
config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification(config)

# or keep the pretrained body and only change a simple setting such as the number of labels
model_name = "distilbert-base-uncased"
model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)

# TensorFlow
from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification

config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = TFDistilBertForSequenceClassification(config)

model_name = "distilbert-base-uncased"
model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
```
Installing transformers
- See https://huggingface.co/transformers/installation.html for details; in general a plain `pip install transformers` is enough, and if TensorFlow or PyTorch is not installed yet, the page also shows how to install them together (a quick sanity check follows below);
- The documentation also mentions Transformer models that run on mobile devices, developed for iOS; the project lives at GitHub@swift-coreml-transformers;
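A quick, hedged way to check that the installation works (it assumes network access, since the default pipeline weights have to be downloaded on first use):

```python
# Minimal post-install sanity check; downloading the default sentiment model
# requires an internet connection.
import transformers
print(transformers.__version__)  # e.g. 4.3.0

from transformers import pipeline
classifier = pipeline("sentiment-analysis")
print(classifier("Installation seems to work."))
```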
Philosophy
- This chapter describes the design philosophy behind the Transformers library, which is built around three kinds of classes:
- (1) Model classes: e.g. `BertModel`; the library currently ships more than 30 models as PyTorch modules or Keras models;
- (2) Configuration classes: e.g. `BertConfig`, which store the hyperparameters needed to build a model;
- (3) Tokenizer classes: e.g. `BertTokenizer`, which store the vocabulary and the encoding scheme;
- Instances of all three are loaded and saved with the `from_pretrained()` and `save_pretrained()` methods, as in the sketch below;
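A minimal sketch of how the three classes fit together for BERT, assuming the bert-base-uncased checkpoint; `./my-bert` is a hypothetical local directory used only for illustration:

```python
# Model / Configuration / Tokenizer trio for BERT; "./my-bert" is hypothetical.
from transformers import BertConfig, BertModel, BertTokenizer

config = BertConfig.from_pretrained("bert-base-uncased")        # architecture hyperparameters
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # vocabulary + encoding rules
model = BertModel.from_pretrained("bert-base-uncased")          # pretrained weights

# all three share the same save/load interface
config.save_pretrained("./my-bert")
tokenizer.save_pretrained("./my-bert")
model.save_pretrained("./my-bert")

model = BertModel.from_pretrained("./my-bert")                  # reload from the local directory
```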
- The documentation makes an intriguing remark here:
- "The code is usually as close to the original code base as possible which means some PyTorch code may be not as pytorchic as it could be as a result of being converted TensorFlow code and vice versa."
- Even so, the documentation also states this goal: "Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another."
- My reading of the first sentence is that PyTorch code converted from TensorFlow may not work as cleanly as native PyTorch code, and the subtext seems to be that TensorFlow is still somewhat more mainstream than PyTorch; a hedged sketch of switching frameworks follows.
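A hedged sketch of what "train in one framework, infer in another" looks like in practice; `./pt-checkpoint` is a hypothetical directory and the model name is just an example:

```python
# Save a PyTorch model, then load it into TensorFlow with from_pt=True;
# the reverse direction uses from_tf=True. "./pt-checkpoint" is hypothetical.
from transformers import AutoModelForSequenceClassification, TFAutoModelForSequenceClassification

pt_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english")
pt_model.save_pretrained("./pt-checkpoint")   # writes pytorch_model.bin + config.json

tf_model = TFAutoModelForSequenceClassification.from_pretrained("./pt-checkpoint", from_pt=True)
```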
Glossary
- This section explains some of the terminology used around Transformer models, including positional encoding, encoder and decoder, with BERT as the running example; it is well worth reading;
- The code examples, plus one small follow-up on decoding, are recorded below:
```python
# Input IDs
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence = "A Titan RTX has 24GB of VRAM"

tokenized_sequence = tokenizer.tokenize(sequence)
print(tokenized_sequence)
# ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']

inputs = tokenizer(sequence)
encoded_sequence = inputs["input_ids"]
print(encoded_sequence)
# [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]

decoded_sequence = tokenizer.decode(encoded_sequence)
print(decoded_sequence)
# [CLS] A Titan RTX has 24GB of VRAM [SEP]

# Attention mask
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "This is a short sequence."
sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."

encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
print(len(encoded_sequence_a), len(encoded_sequence_b))
# 8 19

padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
print(padded_sequences["input_ids"])
# [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
print(padded_sequences["attention_mask"])
# [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]

# Token Type IDs
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
sequence_a = "HuggingFace is based in NYC"
sequence_b = "Where is HuggingFace based?"

encoded_dict = tokenizer(sequence_a, sequence_b)
decoded = tokenizer.decode(encoded_dict["input_ids"])
print(decoded)
# [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
print(encoded_dict['token_type_ids'])
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```
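One small addition to the glossary examples above: `decode()` keeps the special tokens by default, and passing `skip_special_tokens=True` (a standard tokenizer argument) drops them again:

```python
# decode() with and without the special tokens added by the tokenizer
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
ids = tokenizer("A Titan RTX has 24GB of VRAM")["input_ids"]

print(tokenizer.decode(ids))                            # [CLS] A Titan RTX has 24GB of VRAM [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))  # A Titan RTX has 24GB of VRAM
```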
Part 2: Basic Usage Guide
Task Summary
Sequence Classification
- Code example:
```python
# PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")

paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits

paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

# TensorFlow
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import tensorflow as tf

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")

classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"

paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")

paraphrase_classification_logits = model(paraphrase)[0]
not_paraphrase_classification_logits = model(not_paraphrase)[0]

paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]

# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")

# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")
```
Extractive Question Answering
- For fine-tuning models on the SQuAD task, see run_squad.py and run_tf_squad.py; the former PyTorch script seems to have gone stale, and only the latter TensorFlow script still appears to work;
- Simple example 1 (the snippet breaks off in the source; a hedged completion follows):
```python
from transformers import pipeline

nlp = pipeline("question-answering")
con  # the source text breaks off here
```
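Based on the usual question-answering pipeline pattern, the continuation most likely looks something like the sketch below; the context and question strings are illustrative, not the original ones:

```python
# Hedged reconstruction of the truncated example; the context/question text is
# illustrative only.
from transformers import pipeline

nlp = pipeline("question-answering")

context = "HuggingFace is a company based in New York City that maintains the Transformers library."
result = nlp(question="Where is HuggingFace based?", context=context)
print(result["answer"], round(result["score"], 4))
```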