Tokenizing a dataframe using Tensorflow and Transformers

Posted 2023-03-29

【Posted】: 2021-07-20 23:19:41 【Question】:

I have a labeled dataset in a pandas dataframe.

>>> df.dtypes
title          object
headline       object
byline         object
dateline       object
text           object
copyright    category
country      category
industry     category
topic        category
file           object
dtype: object

I am building a model to predict topic from text. text is one large string, while topic is a list of strings. For example:

>>> df['topic'].head(5)
0    ['ECONOMIC PERFORMANCE', 'ECONOMICS', 'EQUITY ...
1      ['CAPACITY/FACILITIES', 'CORPORATE/INDUSTRIAL']
2    ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
3    ['PERFORMANCE', 'ACCOUNTS/EARNINGS', 'CORPORAT...
4    ['STRATEGY/PLANS', 'NEW PRODUCTS/SERVICES', 'C...

Before I feed this into a model, I have to tokenize the whole dataframe, but when I run it through the Transformers AutoTokenizer I get an error.

import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer
import tensorflow_hub as hub
import tensorflow_text as text
from sklearn.model_selection import train_test_split

def preprocess_text(df):

    # Remove punctuations and numbers
    df['text'] = df['text'].str.replace('[^a-zA-Z]', ' ', regex=True)

    # Single character removal
    df['text'] = df['text'].str.replace(r"\s+[a-zA-Z]\s+", ' ', regex=True)

    # Removing multiple spaces
    df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)

    # Remove NaNs
    df['text'] = df['text'].fillna('')
    df['topic'] = df['topic'].cat.add_categories('').fillna('')

    return df

# Load tokenizer and logger
tf.get_logger().setLevel('ERROR')
tokenizer = AutoTokenizer.from_pretrained('roberta-large')

# Load dataframe with just text and topic columns
# Only loading first 100 rows for testing purposes
df = pd.DataFrame()
for chunk in pd.read_csv(r'C:\Users\pfortier\Documents\Reuters\test.csv', sep='|', chunksize=100,
                dtype={'topic': 'category', 'country': 'category', 'industry': 'category', 'copyright': 'category'}):
    df = chunk
    break
df = preprocess_text(df)

# Split dataset into train, val, test (roughly 72/13/15: the second split takes 15% of the remaining 85%)
train, test = train_test_split(df, test_size=0.15)
train, val = train_test_split(train, test_size=0.15)

# Tokenize datasets
train = tokenizer(train, return_tensors='tf', truncation=True, padding=True, max_length=128)
val = tokenizer(val, return_tensors='tf', truncation=True, padding=True, max_length=128)
test = tokenizer(test, return_tensors='tf', truncation=True, padding=True, max_length=128)

I get this error:

AssertionError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

on the line train = tokenizer(train, return_tensors='tf', truncation=True, padding=True, max_length=128).

Does this mean I have to turn my df into a list?

【Comments】:

【Answer 1】:

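In short, yes. You also don't want to tokenize the whole dataframe, only the text column, taken as a numpy array and converted to a list. The missing steps are shown below.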

# Create new index
train_idx = [i for i in range(len(train.index))]
test_idx = [i for i in range(len(test.index))]
val_idx = [i for i in range(len(val.index))]

# Convert to numpy
x_train = train['text'].values[train_idx]
x_test = test['text'].values[test_idx]
x_val = val['text'].values[val_idx]

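# 'topic_encoded' is not created in the question's code; it is assumed to be
# an encoded version of the 'topic' column built in a step not shown here
y_train = train['topic_encoded'].values[train_idx]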
y_test = test['topic_encoded'].values[test_idx]
y_val = val['topic_encoded'].values[val_idx]

# Tokenize datasets
tr_tok = tokenizer(list(x_train), return_tensors='tf', truncation=True, padding=True, max_length=128)
val_tok = tokenizer(list(x_val), return_tensors='tf', truncation=True, padding=True, max_length=128)
test_tok = tokenizer(list(x_test), return_tensors='tf', truncation=True, padding=True, max_length=128)
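
To make the type requirement concrete, here is a minimal sketch that checks an object against the input types listed in the error message. The helper is_valid_tokenizer_input is hypothetical (it is not part of the transformers API), and the toy dataframe values are made up for illustration. It only shows that a DataFrame or Series falls outside those types, while a plain list of strings taken from the text column fits.

```python
import pandas as pd


def is_valid_tokenizer_input(obj):
    """Return True for str, list of str, or list of lists of str.

    Hypothetical helper mirroring the types named in the error message;
    it is not a function from the transformers library.
    """
    if isinstance(obj, str):
        return True
    if isinstance(obj, list):
        # Batch of examples (or one pretokenized example)
        if all(isinstance(item, str) for item in obj):
            return True
        # Batch of pretokenized examples
        return all(
            isinstance(item, list) and all(isinstance(tok, str) for tok in item)
            for item in obj
        )
    return False


# Toy stand-in for the real data (values invented for illustration)
df = pd.DataFrame(
    {
        "text": ["Shares rose on strong earnings", "The plant will add capacity"],
        "topic": ["['PERFORMANCE']", "['CAPACITY/FACILITIES']"],
    }
)

# Convert the text column to a plain Python list of strings
texts = df["text"].tolist()

print(is_valid_tokenizer_input(df))          # a DataFrame is not one of the listed types
print(is_valid_tokenizer_input(df["text"]))  # neither is a Series
print(is_valid_tokenizer_input(texts))       # a list of str is
```

A list built this way can be passed to the tokenizer exactly as in the answer above, for example tokenizer(texts, return_tensors='tf', truncation=True, padding=True, max_length=128). For a column of strings, df['text'].tolist() should give the same result as list(df['text'].values).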

【Comments】:
