ValueError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] - Tokenizing BERT / Distilbert Error

Posted: 2020-12-10 12:16:42

Question:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.1, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

I get the above error when I try to tokenize text that I split out of a dataframe with the BERT tokenizer.

Comments:

Answer 1:

I had the same error. The problem was that there were None values in my list, for example:

import pandas as pd
from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')

# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
         'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
         'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
         None]

labels = [1, 2, 3, 1]

d = {'texts': texts, 'labels': labels}
test_df = pd.DataFrame(d)

So before converting the Dataframe column to a list, I dropped all rows containing None:

test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)

This worked for me.
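If you want to see which rows are responsible before dropping them, a quick check works too (a minimal sketch, assuming the same test_df as above; bad_rows is just an illustrative name):

# Show rows whose 'texts' entry is not a plain string (None, NaN, numbers, ...)
bad_rows = test_df[~test_df["texts"].apply(lambda x: isinstance(x, str))]
print(bad_rows)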

Comments:

Answer 2:

In my case, I had to set is_split_into_words=True:

https://huggingface.co/transformers/main_classes/tokenizer.html

The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (a pretokenized string). If the sequences are provided as lists of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences).
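For example, if each sample is already a list of words rather than a single string, the call looks like this (a minimal sketch with made-up sentences; it reuses the distilbert-base-uncased checkpoint from the question):

from transformers import DistilBertTokenizerFast

tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# Each item is a list of words (pre-tokenized), not a plain string
pretokenized = [
    ["Hello", "world", "!"],
    ["This", "sentence", "is", "already", "split", "into", "words"],
]

encodings = tokenizer(pretokenized, is_split_into_words=True, truncation=True, padding=True)
print(encodings["input_ids"])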

Comments:

Can confirm this solved my problem as well.

Answer 3:
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
  df = pd.read_csv(path)
  return train_test_split(df, test_size=0.2, random_state=100)

train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()

train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=100)

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)

Try changing the size of the splits; it should work. It means the split did not contain enough data for the tokenizer to tokenize.

Comments:

Doesn't train_texts just need to be a list of strings?

Answer 4:

Similar to MarkusOdenthal, I had a non-string type in my list. I fixed it by converting the column to string first, then converting it to a list, and only then splitting it into train and test segments. So you would do:

train_texts = train['text'].astype(str).tolist()
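Note that astype(str) turns missing values (NaN) into the literal string 'nan', so the tokenizer no longer crashes but those rows become meaningless text. Combining it with dropna first, as in the earlier answer, is usually safer (a minimal sketch, assuming the same train dataframe and tokenizer as in the question):

# Drop rows with a missing 'text' first, then force the rest to str before tokenizing
train = train.dropna(subset=['text'])
train_texts = train['text'].astype(str).tolist()
train_encodings = tokenizer(train_texts, truncation=True, padding=True)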

Comments:
