Learning Notes: the Transformers Library

Posted by 囚生CY


Library API docs: https://huggingface.co/transformers/
Version: 4.3.0

Preface

The Transformers library is a fairly new project. As of March 2, 2021 it already includes model code for quite a few papers published on arXiv in 2020. Through this library you can very easily load state-of-the-art deep learning models, BERT included (mainly models for natural language processing), and continue training or fine-tune them with PyTorch or TensorFlow 2.x.
Unlike TensorFlow Hub, which requires getting past the firewall, downloading model mirrors from huggingface is currently quite fast on a decent connection, and this is now the mainstream way to load BERT models in PyTorch. TensorFlow can of course load BERT models via the method in the README of the official BERT project, which I have written about before, but after the update to TensorFlow 2.x many of those older methods no longer work, which makes this Transformers library all the more important.

As with my earlier notes on the DGL library, what I have mainly done here is translate the API documentation. Most of the useful content is excerpted, only the less important parts are omitted, and I have added some annotations of my own; the notes should serve as an introduction and quick-start guide (the library is quite easy to use).


Table of Contents


Part 1: Getting Started

Quick Tour

Getting Started with Pipelines

Translated from https://huggingface.co/transformers/quicktour.html.

  1. Pipeline (pipeline):
  • Task types:
    • (1) Sentiment analysis: decide whether a text is positive or negative;
    • (2) Text generation: generate a passage of related text from a prompt;
    • (3) Named entity recognition: decide what type each token in a sentence belongs to;
    • (4) Question answering: generate an answer given a context and a question;
    • (5) Filling masked text: restore a sentence from which some words have been masked out;
    • (6) Summarization: generate a summary of a long text;
    • (7) Translation: translate text from one language into another;
    • (8) Feature extraction: produce a tensor representation of a text;
  • Taking sentiment analysis as an example, here is a quick-start snippet:
    from transformers import pipeline
    
    nlp = pipeline("sentiment-analysis")
    result = nlp("I hate you")[0]
    print(f"label: result['label'], with score: round(result['score'], 4)")
    result = nlp("I love you")[0]
    print(f"label: result['label'], with score: round(result['score'], 4)")
    
    • This pipeline model is downloaded from distilbert-base-uncased-finetuned-sst-2-english. If you need to use a particular model instead, set the model parameter to fetch a model stored on the model hub; for example, the model below handles not only English but also French, Italian, and Dutch:
    from transformers import pipeline
    
    classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")
    
    • The parameters of these models can be looked up in the README files on their huggingface pages;
    • You can usually also pass a tokenizer argument to the pipeline, i.e. specify the tokenizer explicitly; the transformers library provides the corresponding classes AutoModelForSequenceClassification and TFAutoModelForSequenceClassification:
      # PyTorch
      from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
      
      model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
      model = AutoModelForSequenceClassification.from_pretrained(model_name)
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
      
      # TensorFlow
      from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification
      
      model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
      # This model only exists in PyTorch, so we use the `from_pt` flag to import that model in TensorFlow.
      model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      classifier = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)
      
    • If you need to fine-tune these pretrained pipeline models on a specific dataset, see the Examples;
  • For the detailed pipeline usage of the other tasks, see the task summary (a fill-mask pipeline sketch of my own is also appended at the end of this list); below is a sequence classification example;
    # PyTorch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
    classes = ["not paraphrase", "is paraphrase"]
    sequence_0 = "The company HuggingFace is based in New York City"
    sequence_1 = "Apples are especially bad for your health"
    sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
    paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
    not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
    paraphrase_classification_logits = model(**paraphrase).logits
    not_paraphrase_classification_logits = model(**not_paraphrase).logits
    paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
    not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
    # Should be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(paraphrase_results[i] * 100))%")
    # Should not be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(not_paraphrase_results[i] * 100))%")
    
    # TensorFlow
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
    import tensorflow as tf
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
    classes = ["not paraphrase", "is paraphrase"]
    sequence_0 = "The company HuggingFace is based in New York City"
    sequence_1 = "Apples are especially bad for your health"
    sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
    paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
    not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
    paraphrase_classification_logits = model(paraphrase)[0]
    not_paraphrase_classification_logits = model(not_paraphrase)[0]
    paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
    not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
    # Should be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(paraphrase_results[i] * 100))%")
    # Should not be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(not_paraphrase_results[i] * 100))%")
    
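  • Supplementary note (my own sketch, not from the docs): to round out the task list above, here is a minimal fill-mask pipeline example; the default checkpoint the pipeline downloads may differ across library versions:
    from transformers import pipeline

    # fill-mask restores a masked token; each prediction comes back as a
    # dict with "sequence", "score", "token" and "token_str" keys
    nlp = pipeline("fill-mask")
    masked = f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses."
    for prediction in nlp(masked):
        print(prediction["token_str"], round(prediction["score"], 4))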

What happens when you call a pipeline model

  1. Using the tokenizer: in fact, every model and tokenizer is created via the from_pretrained method;
  • Example: note that AutoTokenizer and AutoModelForSequenceClassification, used here to load the tokenizer and model, are high-level interface classes; you can also use the class specific to each model, e.g. distilbert-base-uncased-finetuned-sst-2-english corresponds to DistilBertTokenizer and DistilBertForSequenceClassification;
    # PyTorch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    # method 1
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
    
    # method 2
    from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = DistilBertForSequenceClassification.from_pretrained(model_name)
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    
    # TensorFlow
    # method 1
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    tf_model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.")
    
    # method 2
    from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
    model_name = "distilbert-base-uncased-finetuned-sst-2-english"
    model = TFDistilBertForSequenceClassification.from_pretrained(model_name)
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    
    print(inputs)
    
    • Output: the token IDs plus some additional information useful to the model;
    {'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
    
  • Tokenizing several sentences at once:
    pt_batch = tokenizer(
    	["We are very happy to show you the 🤗 Transformers library.", "We hope you don't hate it."],
    	padding=True,
    	truncation=True,
    	max_length=512,
    	return_tensors="pt" # change to "tf" for got TensorFlow
    )
    for key, value in pt_batch.items():
        print(f"{key}: {value.numpy().tolist()}")
    
    • Output:
    input_ids: [[101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 100, 19081, 3075, 1012, 102], [101, 2057, 3246, 2017, 2123, 1005, 1056, 5223, 2009, 1012, 102, 0, 0, 0]]
    attention_mask: [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]]
    
  • For more about tokenization, see Preprocessing data;
  2. Using the pretrained model: the data preprocessed by the tokenizer can be fed straight into the model; as noted above, the tokenizer output contains everything the model needs:
  • Example (pt_batch below is the batch tokenized above with return_tensors="pt"; tf_batch is its "tf" counterpart); note that the PyTorch version requires unpacking the dictionary with **:
    # PyTorch
    # import torch
    # pt_outputs = pt_model(**pt_batch, labels = torch.tensor([1, 0])) # add labels
    outputs = pt_model(**pt_batch)
    print(outputs)
    
    # TensorFlow
    # import tensorflow as tf
    # tf_outputs = tf_model(tf_batch, labels = tf.constant([1, 0])) # add labels
    outputs = tf_model(tf_batch)
    print(outputs)
    
    • Output:
    (tensor([[-4.0833,  4.3364],
    		[ 0.0818, -0.0418]], grad_fn=<AddmmBackward>),)
    		
    (<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
    array([[-4.0832963 ,  4.336414  ],
    	   [ 0.08181786, -0.04179301]], dtype=float32)>,)
    
    • Important: the outputs are the values before the model's final activation (e.g. a softmax), and this is true of every model in the transformers library; the reason is that the final activation function is usually fused with the loss (fused with loss);
  • Applying the activation manually:
    # PyTorch
    import torch.nn.functional as F
    predictions = F.softmax(outputs[0], dim=-1)
    
    # TensorFlow
    import tensorflow as tf
    predictions = tf.nn.softmax(outputs[0], axis=-1)
    
  • The pretrained models themselves are torch.nn.Module or tf.keras.Model instances, so they can be trained inside the PyTorch or TensorFlow frameworks; the transformers library also provides the Trainer and TFTrainer classes for this, and the detailed fine-tuning procedure is covered in the training tutorial (a minimal Trainer sketch of my own is also appended at the end of this list);
    • A fine-tuned tokenizer or model can be saved with save_pretrained and reloaded with from_pretrained (the from_tf=True flag below is for loading a checkpoint saved from TensorFlow back into PyTorch):
    tokenizer.save_pretrained(save_directory)
    model.save_pretrained(save_directory)
    
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    model = AutoModel.from_pretrained(save_directory, from_tf=True)
    
    • Returning the model's hidden states and all attention weights:
    # PyTorch
    pt_outputs = pt_model(**pt_batch, output_hidden_states=True, output_attentions=True)
    all_hidden_states, all_attentions = pt_outputs[-2:]
    
    # TensorFlow
    tf_outputs = tf_model(tf_batch, output_hidden_states=True, output_attentions=True)
    all_hidden_states, all_attentions = tf_outputs[-2:]
    
  • The model architecture can be adjusted via a config object, and a few simple configuration options can also be set directly in the from_pretrained method:
    # PyTorch
    from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
    config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    model = DistilBertForSequenceClassification(config)
    
    
    from transformers import DistilBertConfig, DistilBertTokenizer, DistilBertForSequenceClassification
    model_name = "distilbert-base-uncased"
    model = DistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    
    # TensorFlow
    from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
    config = DistilBertConfig(n_heads=8, dim=512, hidden_dim=4*512)
    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
    model = TFDistilBertForSequenceClassification(config)
    
    from transformers import DistilBertConfig, DistilBertTokenizer, TFDistilBertForSequenceClassification
    model_name = "distilbert-base-uncased"
    model = TFDistilBertForSequenceClassification.from_pretrained(model_name, num_labels=10)
    tokenizer = DistilBertTokenizer.from_pretrained(model_name)
    
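  • Supplementary note (my own toy example, not code from the docs): as promised above, a minimal fine-tuning sketch with the Trainer class (PyTorch); the texts, labels and hyperparameters are illustrative placeholders:
    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    model_name = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # placeholder training data
    texts = ["We are very happy.", "We are not happy at all."]
    labels = [1, 0]
    encodings = tokenizer(texts, padding=True, truncation=True)

    # wrap the tokenizer output so Trainer can iterate over it
    class ToyDataset(torch.utils.data.Dataset):
        def __init__(self, encodings, labels):
            self.encodings, self.labels = encodings, labels
        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
            item["labels"] = torch.tensor(self.labels[idx])
            return item
        def __len__(self):
            return len(self.labels)

    training_args = TrainingArguments(output_dir="./results",
                                      num_train_epochs=1,
                                      per_device_train_batch_size=2)
    trainer = Trainer(model=model, args=training_args,
                      train_dataset=ToyDataset(encodings, labels))
    trainer.train()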

Installing transformers
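
The original notes skip this chapter; for the record, the library installs via pip (pip install transformers), typically alongside PyTorch or TensorFlow 2.x.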

Philosophy

https://huggingface.co/transformers/philosophy.html

  1. This chapter really explains the design philosophy behind the Transformers library, which is built around three kinds of classes:
  • (1) Model classes: e.g. BertModel; more than 30 PyTorch or Keras models are currently included;
  • (2) Configuration classes: e.g. BertConfig, which store the parameters used to build a model;
  • (3) Tokenizer classes: e.g. BertTokenizer, which store the tokenizer vocabulary and the encoding scheme;
  • Instances of all three kinds of classes are loaded and saved with the from_pretrained() and save_pretrained() methods;
  2. The documentation makes an intriguing remark here:
  • The code is usually as close to the original code base as possible which means some PyTorch code may be not as pytorchic as it could be as a result of being converted TensorFlow code and vice versa.
  • Even so, the documentation also states this goal: Switch easily between PyTorch and TensorFlow 2.0, allowing training using one framework and inference using another.
  • My reading of the first remark is that models ported from TensorFlow to PyTorch may not work quite as well; the implication still seems to be that TensorFlow is somewhat more mainstream than PyTorch;
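  • Supplementary note (my own sketch, assuming the stock bert-base-uncased checkpoint): the three kinds of classes side by side:
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # a Tokenizer class
    model = BertModel.from_pretrained("bert-base-uncased")          # a Model class
    config = model.config                                           # its Configuration class
    print(type(config).__name__, config.num_hidden_layers, config.hidden_size)
    # expected: BertConfig 12 768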

Glossary

https://huggingface.co/transformers/glossary.html

  1. This section explains some of the terminology used in Transformer models, such as positional encoding, encoder and decoder, using calls to the BERT model as the running examples; it is well worth reading;
  • The code examples are recorded here:
    # Input IDs
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    sequence = "A Titan RTX has 24GB of VRAM"
    tokenized_sequence = tokenizer.tokenize(sequence)
    print(tokenized_sequence) # ['A', 'Titan', 'R', '##T', '##X', 'has', '24', '##GB', 'of', 'V', '##RA', '##M']
    inputs = tokenizer(sequence)
    encoded_sequence = inputs["input_ids"]
    print(encoded_sequence) # [101, 138, 18696, 155, 1942, 3190, 1144, 1572, 13745, 1104, 159, 9664, 2107, 102]
    decoded_sequence = tokenizer.decode(encoded_sequence)
    print(decoded_sequence) # [CLS] A Titan RTX has 24GB of VRAM [SEP]
    # Attention mask
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    sequence_a = "This is a short sequence."
    sequence_b = "This is a rather long sequence. It is at least longer than the sequence A."
    encoded_sequence_a = tokenizer(sequence_a)["input_ids"]
    encoded_sequence_b = tokenizer(sequence_b)["input_ids"]
    print(len(encoded_sequence_a), len(encoded_sequence_b)) # 8, 19
    padded_sequences = tokenizer([sequence_a, sequence_b], padding=True)
    print(padded_sequences["input_ids"]) # [[101, 1188, 1110, 170, 1603, 4954, 119, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1188, 1110, 170, 1897, 1263, 4954, 119, 1135, 1110, 1120, 1655, 2039, 1190, 1103, 4954, 138, 119, 102]]
    print(padded_sequences["attention_mask"]) # [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]
    # Token Type IDs
    from transformers import BertTokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
    sequence_a = "HuggingFace is based in NYC"
    sequence_b = "Where is HuggingFace based?"
    encoded_dict = tokenizer(sequence_a, sequence_b)
    decoded = tokenizer.decode(encoded_dict["input_ids"])
    print(decoded) # [CLS] HuggingFace is based in NYC [SEP] Where is HuggingFace based? [SEP]
    print(encoded_dict['token_type_ids']) # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
    

Part 2: Basic Usage Manual

Task Summary

Sequence Classification

  • Code example:
    # PyTorch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
    classes = ["not paraphrase", "is paraphrase"]
    sequence_0 = "The company HuggingFace is based in New York City"
    sequence_1 = "Apples are especially bad for your health"
    sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
    paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
    not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
    paraphrase_classification_logits = model(**paraphrase).logits
    not_paraphrase_classification_logits = model(**not_paraphrase).logits
    paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
    not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
    # Should be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(paraphrase_results[i] * 100))%")
    # Should not be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(not_paraphrase_results[i] * 100))%")
    
    # TensorFlow
    from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
    import tensorflow as tf
    
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
    model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
    classes = ["not paraphrase", "is paraphrase"]
    sequence_0 = "The company HuggingFace is based in New York City"
    sequence_1 = "Apples are especially bad for your health"
    sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
    paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="tf")
    not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="tf")
    paraphrase_classification_logits = model(paraphrase)[0]
    not_paraphrase_classification_logits = model(not_paraphrase)[0]
    paraphrase_results = tf.nn.softmax(paraphrase_classification_logits, axis=1).numpy()[0]
    not_paraphrase_results = tf.nn.softmax(not_paraphrase_classification_logits, axis=1).numpy()[0]
    # Should be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(paraphrase_results[i] * 100))%")
    # Should not be paraphrase
    for i in range(len(classes)):
    	print(f"classes[i]: int(round(not_paraphrase_results[i] * 100))%")
    

Question Answering

  1. For fine-tuning models on the SQuAD task, see run_squad.py and run_tf_squad.py; the link to the former (PyTorch) script seems to be dead by now, and only the latter TensorFlow script still works; a quick pipeline-level sketch follows below;
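  • Supplementary note (my own minimal example in the spirit of the task summary, not copied from it): a question-answering pipeline sketch; the example sentences are mine and the default QA checkpoint may vary by version:
    from transformers import pipeline

    nlp = pipeline("question-answering")
    context = "HuggingFace's Transformers library provides pretrained models for natural language processing."
    result = nlp(question="What does the Transformers library provide?", context=context)
    # the result dict carries the answer span plus a confidence score
    print(f"answer: '{result['answer']}', score: {round(result['score'], 4)}, "
          f"start: {result['start']}, end: {result['end']}")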