Data preprocessing with huggingface/transformers

Posted 2022-08-23 梆子井欢喜坨

1. Natural language

1.1 Tokenize

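The main tool for processing text data is the tokenizer.

The tokenizer first splits the text into tokens according to a set of rules.

A token-to-index mapping (usually called the vocab) then converts each token to its index in the vocabulary; these indices are used to build the tensors that are fed to the model.

Any additional inputs the model requires are also added by the tokenizer.

When using a pretrained model, make sure the text is split the same way as the pretraining corpus and that the same vocab as in pretraining is used.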
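The two steps above (splitting, then vocabulary lookup) can be sketched with a toy tokenizer. This is only an illustration: the whitespace splitting rule, the tiny `vocab`, and the `encode` helper are all made up here, whereas real tokenizers such as BERT's WordPiece split text into subwords using a vocabulary learned during pretraining.

```python
# Toy illustration only: a hypothetical vocab and a whitespace split,
# not the WordPiece algorithm that BERT checkpoints actually use.
vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "do": 4, "not": 5, "meddle": 6}


def encode(text, vocab):
    """Split text into tokens, then map each token to its vocab index."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Tokens missing from the vocab fall back to the [UNK] index.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


ids = encode("Do not meddle", vocab)
print(ids)  # [2, 4, 5, 6, 3]
print(encode("Do not panic", vocab))  # [2, 4, 5, 1, 3]
```

The fallback to [UNK] in the second call hints at why the vocab has to match the one used in pretraining: with a different vocab, the same text maps to different indices, or to no useful index at all.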

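The quickest way to get started is to load a pretrained tokenizer with the AutoTokenizer class. This downloads the vocab that was used when the model was pretrained.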

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

Pass a sentence to the tokenizer:

encoded_input = tokenizer("Do not meddle in the affairs of wizards, for they are subtle and quick to anger.")
print(encoded_input)

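Output. The ids shown here appear to come from an uncased BERT vocab (the comma is 1010 here, but 117 in the batch outputs further down), so running the code above with bert-base-cased will print different numbers:

{'input_ids': [101, 2079, 2025, 19960, 10362, 1999, 1996, 3821, 1997, 16657, 1010, 2005, 2027, 2024, 11259, 1998, 4248, 2000, 4963, 1012, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}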

The tokenizer returns a dictionary with three important items:

input_ids are the indices corresponding to each token in the sentence.
attention_mask indicates whether a token should be attended to or not.
token_type_ids identifies which sequence a token belongs to when there is more than one sequence.
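To see where token_type_ids becomes non-trivial, here is a minimal sketch of how a BERT-style sentence pair is laid out. The helper `build_pair_inputs` and the example ids are hypothetical; only the special-token ids 101 ([CLS]) and 102 ([SEP]) are taken from the output above.

```python
CLS_ID, SEP_ID = 101, 102  # special-token ids seen in the output above


def build_pair_inputs(ids_a, ids_b):
    """Lay out two already-tokenized sequences as [CLS] A [SEP] B [SEP]."""
    input_ids = [CLS_ID] + ids_a + [SEP_ID] + ids_b + [SEP_ID]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    # Nothing is padded yet, so every position should be attended to.
    attention_mask = [1] * len(input_ids)
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# The ids 7, 8, 9 and 10, 11 are placeholders, not real vocab entries.
pair = build_pair_inputs([7, 8, 9], [10, 11])
print(pair["input_ids"])       # [101, 7, 8, 9, 102, 10, 11, 102]
print(pair["token_type_ids"])  # [0, 0, 0, 0, 0, 1, 1, 1]
```

With the real tokenizer, passing two strings in one call, as in tokenizer(text_a, text_b), should produce this kind of layout; for a single sentence every token_type_ids entry is 0, as in the output above.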

The input_ids can be decoded to recover the original input:

tokenizer.decode(encoded_input["input_ids"])

Output:

'[CLS] Do not meddle in the affairs of wizards, for they are subtle and quick to anger. [SEP]'

The tokenizer has added two special tokens: [CLS] at the start and [SEP] at the end.

Pass a list of sentences to the tokenizer:

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_inputs = tokenizer(batch_sentences)
print(encoded_inputs)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1]]}

1.2 Pad

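A common problem: the sentences in a batch are not always the same length, but the tensors fed to the model must have a uniform shape.

Padding is a strategy that makes the tensors the same length by adding a special padding token to the shorter sentences.

Set the padding parameter to True to pad the shorter sequences in a batch to match the longest one: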

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True)
print(encoded_input)

Output:

{'input_ids': [[101, 1252, 1184, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0],
               [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
               [101, 1327, 1164, 5450, 23434, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]]}

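Notice that the first and third sentences have been padded with 0s up to the length of the second sentence, and that their attention_mask is 0 at the padded positions.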
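What padding does to a batch can be reproduced in a few lines of plain Python. This is a sketch of the idea, not the library's implementation: `pad_batch` is a hypothetical helper, and a pad id of 0 is assumed because that is the value appearing in the padded output above.

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(ids) for ids in batch_ids)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Input ids copied from the unpadded batch output earlier in this post.
batch = [
    [101, 1252, 1184, 1164, 1248, 6462, 136, 102],
    [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102],
    [101, 1327, 1164, 5450, 23434, 136, 102],
]
padded = pad_batch(batch)
print([len(ids) for ids in padded["input_ids"]])  # [15, 15, 15]
print(padded["attention_mask"][2])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```

The attention_mask is what lets the model tell real tokens from filler, which is why the tokenizer returns it together with the padded input_ids.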

See the official documentation for more detailed settings:

https://huggingface.co/docs/transformers/main/en/main_classes/tokenizer#tokenizer

max_length (int, optional): Controls the maximum length to use by one of the truncation/padding parameters. If left unset or set to None, this will use the predefined model maximum length if a maximum length is required by one of the truncation/padding parameters. If the model has no specific maximum input length (like XLNet), truncation/padding to a maximum length will be deactivated.

1.3 Truncation

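On the other hand, a sequence may sometimes be too long for the model to handle. In that case, you need to truncate the sequence to a shorter length.

Set the truncation parameter to True to truncate a sequence to the maximum length accepted by the model: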

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True)
print(encoded_input)
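The effect of truncation on a single sequence can be sketched as follows. The helper `truncate` is hypothetical, and it assumes the simple case of one [CLS] ... [SEP] sequence in which the closing [SEP] should survive the cut; the library supports several truncation strategies that this sketch does not cover.

```python
SEP_ID = 102  # [SEP] id seen in the outputs above


def truncate(ids, max_length):
    """Cut a [CLS] ... [SEP] sequence down to at most max_length ids."""
    if len(ids) <= max_length:
        return ids
    # Drop tokens from the end, but keep the closing [SEP] in place.
    return ids[: max_length - 1] + [SEP_ID]


long_ids = [101, 1790, 112, 189, 1341, 1119, 3520, 1164, 1248, 6462, 117, 21902, 1643, 119, 102]
print(truncate(long_ids, 8))  # [101, 1790, 112, 189, 1341, 1119, 3520, 102]
print(truncate(long_ids, 32) == long_ids)  # True
```

In the tokenizer call above, the longest of the three sentences is only 15 tokens, far below the limit of a BERT checkpoint, so truncation=True leaves this particular batch unchanged.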

1.4 Build tensors

Putting the steps above together, and choosing the type of tensors to return (PyTorch: return_tensors="pt" / TensorFlow: return_tensors="tf"):

batch_sentences = [
    "But what about second breakfast?",
    "Don't think he knows about second breakfast, Pip.",
    "What about elevensies?",
]
encoded_input = tokenizer(batch_sentences, padding=True, truncation=True, return_tensors="pt")
print(encoded_input)
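A tensor can only be built from a rectangular batch, which is why padding (and, for long inputs, truncation) comes before return_tensors. The sketch below makes that requirement explicit in plain Python; `batch_shape` is a hypothetical helper and the ids are placeholders.

```python
def batch_shape(batch_ids):
    """Return (batch_size, seq_len), or raise if the rows have different lengths."""
    lengths = {len(ids) for ids in batch_ids}
    if len(lengths) != 1:
        raise ValueError(f"ragged batch, row lengths: {sorted(lengths)}")
    return (len(batch_ids), lengths.pop())


ragged = [[101, 7, 102], [101, 7, 8, 9, 102]]        # placeholder ids, no padding
padded = [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]  # same batch, padded with 0

print(batch_shape(padded))  # (2, 5)
try:
    batch_shape(ragged)
except ValueError as err:
    print(err)  # ragged batch, row lengths: [3, 5]
```

For the three example sentences, whose padded length is 15, the input_ids tensor returned by the tokenizer call above would therefore be expected to have shape (3, 15).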

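2. Images

2.1 Feature extraction

For vision tasks, a feature extractor is used to process the images; the goal is to convert raw images into a batch of tensors that can be used as input.

This tutorial uses the food101 dataset.

Because the dataset is quite large, use the Datasets split parameter to load only a small sample from the training split: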

from datasets import load_dataset

dataset = load_dataset("food101", split="train[:100]")

# Look at one image
dataset[1]["image"]

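Load the image feature extractor with AutoFeatureExtractor.from_pretrained() (newer transformers releases provide AutoImageProcessor for the same purpose):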

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")

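2.2 Data augmentation

For vision tasks, it is common to add some type of data augmentation to the images as part of preprocessing.

You can add augmentations with any library you like; this tutorial uses torchvision's transforms module.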

Normalize the image and use Compose to chain some transforms - RandomResizedCrop and ColorJitter - together:

from torchvision.transforms import Compose, Normalize, RandomResizedCrop, ColorJitter, ToTensor

normalize = Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std)
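# Note: this assumes feature_extractor.size is an int (or an (h, w) pair).
# Newer transformers releases may return a dict here instead, in which
# case pass its height and width values to RandomResizedCrop explicitly.
_transforms = Compose(
    [RandomResizedCrop(feature_extractor.size),
     ColorJitter(brightness=0.5, hue=0.5),
     ToTensor(),
     normalize]
)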
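The arithmetic behind the last two steps of this pipeline is simple enough to sketch without torchvision. The functions below are hypothetical stand-ins that work on a single channel value, and the mean and std of 0.5 are an assumption about this checkpoint, chosen because they map 8-bit pixels into the [-1, 1] range seen in the pixel_values printed later in this post.

```python
def compose(*steps):
    """Chain callables left to right, like torchvision's Compose does for transforms."""
    def run(x):
        for step in steps:
            x = step(x)
        return x
    return run


def to_unit_range(pixel):
    """Scale an 8-bit value to [0, 1] (the per-pixel effect of ToTensor)."""
    return pixel / 255


def normalize(value, mean=0.5, std=0.5):
    """Shift and scale one channel value; mean and std of 0.5 are assumed here."""
    return (value - mean) / std


pipeline = compose(to_unit_range, normalize)
print(round(pipeline(218), 4))  # 0.7098
print(pipeline(0), pipeline(255))  # -1.0 1.0
```

Under these assumptions an 8-bit value of 218 ends up as 0.7098, which is the kind of number that shows up in the normalized tensor.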

The model accepts pixel_values, the pixel values produced by this preprocessing, as its input.

Create a function that generates pixel_values from the transforms:

def transforms(examples):
    examples["pixel_values"] = [_transforms(image.convert("RGB")) for image in examples["image"]]
    return examples

Use the Datasets set_transform method to apply the transforms on the fly:

dataset.set_transform(transforms)
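"On the fly" means that nothing is computed when set_transform is called; the function only runs when examples are read. The toy class below mimics that behaviour. `LazyDataset` and `add_doubled` are made up for illustration and, unlike the real Datasets library, the toy passes a single row rather than a batch to the transform.

```python
class LazyDataset:
    """Toy stand-in for a dataset whose transform runs only on access."""

    def __init__(self, rows):
        self.rows = rows
        self.transform = None
        self.calls = 0  # counts how often the transform has run

    def set_transform(self, transform):
        # Only stores the function; no row is processed at this point.
        self.transform = transform

    def __getitem__(self, index):
        row = dict(self.rows[index])
        if self.transform is not None:
            self.calls += 1
            row = self.transform(row)
        return row


def add_doubled(row):
    row["doubled"] = row["value"] * 2
    return row


toy = LazyDataset([{"value": 1}, {"value": 2}])
toy.set_transform(add_doubled)
print(toy.calls)  # 0
print(toy[1])     # {'value': 2, 'doubled': 4}
print(toy.calls)  # 1
```

Because the transform reruns on every access, random augmentations such as RandomResizedCrop and ColorJitter give a different result each time the same example is read.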

Look at the example again:

dataset[1]

Output:

{'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=512x512 at 0x7F971B093FD0>, 'label': 6, 'pixel_values': tensor([[[ 0.7098,  0.7490,  0.7804,  ..., -0.3333, -0.3412, -0.3255],
         [ 0.6941,  0.7333,  0.7804,  ..., -0.3176, -0.3255, -0.3255],
         [ 0.6941,  0.7098,  0.7647,  ..., -0.3647, -0.3176, -0.3176],
         ...,
         [-0.5686, -0.7176, -0.7725,  ..., -0.5686, -0.6000, -0.6235],
         [-0.5686, -0.6863, -0.7333,  ..., -0.5922, -0.6000, -0.6000],
         [-0.5373, -0.6235, -0.7020,  ..., -0.5922, -0.6078, -0.6000]],

        [[ 0.7098,  0.7490,  0.7725,  ..., -0.1843, -0.1608, -0.1451],
         [ 0.7098,  0.7490,  0.7804,  ..., -0.1686, -0.1608, -0.1608],
         [ 0.7098,  0.7333,  0.7725,  ..., -0.2000, -0.1529, -0.1294],
         ...,
         [-0.3176, -0.4588, -0.5216,  ..., -0.3412, -0.3412, -0.3647],
         [-0.3176, -0.4431, -0.4980,  ..., -0.3569, -0.3412, -0.3412],
         [-0.2941, -0.3882, -0.4588,  ..., -0.3647, -0.3490, -0.3412]],

        [[ 0.5529,  0.6000,  0.6392,  ...,  0.0980,  0.0902,  0.1059],
         [ 0.5451,  0.5843,  0.6235,  ...,  0.1137,  0.1216,  0.1137],
         [ 0.5373,  0.5608,  0.6000,  ...,  0.0980,  0.1294,  0.1451],
         ...,
         [-0.1137, -0.2706, -0.3176,  ..., -0.0039, -0.0275, -0.0431],
         [-0.1294, -0.2549, -0.3176,  ..., -0.0118, -0.0196, -0.0275],
         [-0.1137, -0.2000, -0.2941,  ..., -0.0196, -0.0275, -0.0196]]])}

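This is what the image looks like after preprocessing. As you would expect from the transforms applied, the image has been randomly cropped and its color properties are different.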

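import matplotlib.pyplot as plt

# pixel_values is channels-first (C, H, W); imshow expects (H, W, C)
img = dataset[0]["pixel_values"]
# Values are normalized and can fall outside [0, 1], which imshow clips
plt.imshow(img.permute(1, 2, 0))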

Not quite what I had pictured image feature extraction to be?
