BERT model bug encountered during training
【Posted】: 2021-07-25 09:30:42
【Question】: I built a custom dataset of reviews collected from several e-learning websites. I want to build a model that identifies emotion from text, and I am training it on the dataset I scraped. While setting up BERT, I ran into this error:
normalize() argument 2 must be str, not float
Here is my code:
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)
import ktrain
from ktrain import text
from sklearn.model_selection import train_test_split
import pickle
#class_names = ["Frustration", "Not satisfied", "Satisfied", "Happy", "Excitement"]
data = pd.read_csv("Final_scraped_dataset.csv")
print(data.head())
X = data['Text']
y = data['Emotions']
class_names = np.unique(data['Emotions'])
print(class_names)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
print(X_train.head(10))
encoding = {
    'Frustration': 0,
    'Not satisfied': 1,
    'Satisfied': 2,
    'Happy': 3,
    'Excitement': 4
}
y_train = [encoding[x] for x in y_train]
y_test = [encoding[x] for x in y_test]
X_train = X_train.tolist()
X_test = X_test.tolist()
#print(X_train)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_array(
    x_train=X_train, y_train=y_train,
    x_test=X_test, y_test=y_test,
    class_names=class_names,
    preprocess_mode='bert',
    maxlen=200,
    max_features=15000)  # I've encountered the error here
'''model = text.text_classifier('bert', train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
val_data=(x_test, y_test),
batch_size=4)
learner.fit_onecycle(2e-5, 3)
learner.validate(val_data=(x_test, y_test))
predictor = ktrain.get_predictor(learner.model, preproc)
predictor.get_classes()
import time
message = 'I hate you a lot'
start_time = time.time()
prediction = predictor.predict(message)
print('predicted: {} ({:.2f})'.format(prediction, (time.time() - start_time)))
# let's save the predictor for later use
predictor.save("new_model/bert_model")
print("SAVED _______")'''
Here is the full error:
File "D:\Sentiment analysis\BERT_model_new_dataset.py", line 73, in <module>
max_features=15000)
File "D:\Anaconda3\envs\pythy37\lib\site-packages\ktrain\text\data.py", line 373, in texts_from_array
trn = preproc.preprocess_train(x_train, y_train, verbose=verbose)
File "D:\Anaconda3\envs\pythy37\lib\site-packages\ktrain\text\preprocessor.py", line 796, in preprocess_train
x = bert_tokenize(texts, self.tok, self.maxlen, verbose=verbose)
File "D:\Anaconda3\envs\pythy37\lib\site-packages\ktrain\text\preprocessor.py", line 166, in bert_tokenize
ids, segments = tokenizer.encode(doc, max_len=max_length)
File "D:\Anaconda3\envs\pythy37\lib\site-packages\keras_bert\tokenizer.py", line 73, in encode
first_tokens = self._tokenize(first)
File "D:\Anaconda3\envs\pythy37\lib\site-packages\keras_bert\tokenizer.py", line 103, in _tokenize
text = unicodedata.normalize('NFD', text)
TypeError: normalize() argument 2 must be str, not float
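【Comments】:
Could you share a sample of your data (Final_scraped_dataset.csv)? It would help in solving your problem. Thanks!
I found the answer: the values were not strings, so I applied a simple conversion. It worked, thanks :)
【Answer 1】:
It sounds like there may be a float value in your data['Text'] column.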
You can try something like this to find out more about what is going on:
for i, s in enumerate(data['Text']):
    if not isinstance(s, str):
        print('Text in row %s is not a string: %s' % (i, s))
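One likely source of such a float, assuming the CSV has empty cells in the Text column, is that pandas reads a missing cell as NaN, and NaN is a float. The sketch below uses a small made-up CSV (the rows are hypothetical, not taken from the asker's dataset) to show this and to list the offending rows without an explicit loop.

```python
import io

import pandas as pd

# Hypothetical stand-in for Final_scraped_dataset.csv.
# The second data row has an empty Text cell.
csv_text = (
    "Text,Emotions\n"
    "Great course,Happy\n"
    ",Satisfied\n"
    "Too slow,Frustration\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# True for every row whose Text value is not a str (e.g. NaN from an empty cell).
not_str = ~data['Text'].map(lambda s: isinstance(s, str))
bad_rows = data[not_str]

print(bad_rows)
print(type(data.loc[1, 'Text']))  # the empty cell was read as a float (NaN)
```

If `bad_rows` is empty for the real file, the float is coming from somewhere else and the row-by-row check above is still the place to start.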
【Comments】:
Hey, I found the answer. I just used a loop to convert all the values in the column to strings. It worked, thanks :)
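The conversion described in this comment can also be done without a loop. This is a minimal sketch, not the asker's actual fix: it assumes the non-string values are either missing cells (NaN), which are dropped rather than turned into the literal text 'nan', or numeric cells, which are converted. The sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing cell and one numeric cell in Text.
data = pd.DataFrame({
    'Text': ['Great course', np.nan, 4.5, 'Too slow'],
    'Emotions': ['Happy', 'Satisfied', 'Happy', 'Frustration'],
})

# Drop rows with missing text first; astype(str) would otherwise
# turn NaN into the string 'nan' and feed it to the tokenizer.
data = data.dropna(subset=['Text']).copy()

# Convert whatever remains (e.g. numeric cells) to str.
data['Text'] = data['Text'].astype(str)

X = data['Text'].tolist()
y = data['Emotions'].tolist()
print(X)
```

Cleaning the DataFrame before calling train_test_split keeps the texts and labels aligned, since dropped rows are removed from both columns at once.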