Text Classification RNN - LSTM - Error when checking target

Posted: 2019-01-11 01:34:13

I am building an LSTM-RNN text classifier with Keras. Here is my code:

import numpy as np
import csv
import keras
import sklearn
import gensim
import random
import scipy
from keras.preprocessing import text
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.models import Sequential
from keras.layers.core import Dense , Dropout , Activation  , Flatten
from keras.layers.convolutional import Convolution1D, MaxPooling1D
from keras.layers import Embedding , LSTM
from sklearn import preprocessing
from sklearn.base import BaseEstimator
from sklearn.svm import LinearSVC , SVC
from sklearn.naive_bayes import MultinomialNB
from gensim.models.word2vec import Word2Vec
from gensim.models.doc2vec import Doc2Vec , TaggedDocument

# size of the word embeddings
embeddings_dim = 300

# maximum number of words to consider in the representations
max_features = 30000

# maximum length of a sentence
max_sent_len = 50

# percentage of the data used for model training
percent = 0.75

# number of classes
num_classes = 2

print ("")
print ("Reading pre-trained word embeddings...")
embeddings = dict( )
embeddings = gensim.models.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz" , binary=True) 

print ("Reading text data for classification and building representations...")
data = [ ( row["sentence"] , row["label"]  ) for row in csv.DictReader(open("test-data.txt"), delimiter='\t', quoting=csv.QUOTE_NONE) ]
random.shuffle( data )
train_size = int(len(data) * percent)
train_texts = [ txt.lower() for ( txt, label ) in data[0:train_size] ]
test_texts = [ txt.lower() for ( txt, label ) in data[train_size:] ]
train_labels = [ label for ( txt , label ) in data[0:train_size] ]
test_labels = [ label for ( txt , label ) in data[train_size:] ]
num_classes = len( set( train_labels + test_labels ) )
tokenizer = Tokenizer(num_words=max_features, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`|~\t\n', lower=True, split=" ")
tokenizer.fit_on_texts(train_texts)
train_sequences = sequence.pad_sequences( tokenizer.texts_to_sequences( train_texts ) , maxlen=max_sent_len )
test_sequences = sequence.pad_sequences( tokenizer.texts_to_sequences( test_texts ) , maxlen=max_sent_len )
train_matrix = tokenizer.texts_to_matrix( train_texts )
test_matrix = tokenizer.texts_to_matrix( test_texts )
embedding_weights = np.zeros( ( max_features , embeddings_dim ) )
for word, index in tokenizer.word_index.items():
  if index < max_features:
    try:
      embedding_weights[index, :] = embeddings[word]
    except KeyError:  # word not in the pretrained vocabulary
      embedding_weights[index, :] = np.random.rand(1, embeddings_dim)
le = preprocessing.LabelEncoder( )
le.fit( train_labels + test_labels )
train_labels = le.transform( train_labels )
test_labels = le.transform( test_labels )
print("Classes that are considered in the problem : " + repr( le.classes_ ))


print("-----WEIGHTS-----")
print(embedding_weights.shape)

print ("Method = Stack of two LSTMs")
np.random.seed(0)
model = Sequential()

model.add(Embedding(max_features, embeddings_dim, input_length=max_sent_len, mask_zero=True, weights=[embedding_weights] ))
model.add(Dropout(0.25))
model.add(LSTM(units=embeddings_dim, activation='sigmoid', recurrent_activation='hard_sigmoid', return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(activation="sigmoid", units=embeddings_dim, recurrent_activation="hard_sigmoid", return_sequences=True))
model.add(Dropout(0.25))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()


model.fit(train_sequences, train_labels , epochs=30, batch_size=32)

My model summary looks like this:

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 50, 300)           9000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 300)           0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 300)           721200    
_________________________________________________________________
dropout_2 (Dropout)          (None, 50, 300)           0         
_________________________________________________________________
lstm_2 (LSTM)                (None, 50, 300)           721200    
_________________________________________________________________
dropout_3 (Dropout)          (None, 50, 300)           0         
_________________________________________________________________
dense_1 (Dense)              (None, 50, 1)             301       
_________________________________________________________________
activation_1 (Activation)    (None, 50, 1)             0         
=================================================================
Total params: 10,442,701
Trainable params: 10,442,701
Non-trainable params: 0
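As a sanity check, the Param # column of the summary above can be reproduced from the layer sizes (a 30000-word embedding of dimension 300, two LSTMs with 300 units over 300-dimensional inputs, and a 1-unit Dense layer):

```python
# Reproduce the parameter counts shown in the model summary.
embed_params = 30000 * 300                       # vocab size x embedding dim
lstm_params = 4 * (300 * 300 + 300 * 300 + 300)  # 4 gates x (input + recurrent + bias)
dense_params = 300 * 1 + 1                       # weights + bias

total = embed_params + 2 * lstm_params + dense_params
print(embed_params, lstm_params, dense_params, total)
# -> 9000000 721200 301 10442701
```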

My error is: Error when checking target: expected activation_1 to have 3 dimensions, but got array with shape (750, 1)

I tried reshaping all my arrays, but I couldn't find a solution. Can anyone help me? Thanks :D Sorry for my bad English.
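The mismatch comes from the network's output shape, not the data pipeline: because the second LSTM keeps return_sequences=True, the Dense(1) layer is applied to every one of the 50 timesteps, so the model expects a 3-D target while the labels form a 2-D array. A numpy-only sketch of the shape difference (sizes taken from the summary above, data is placeholder zeros):

```python
import numpy as np

batch, timesteps = 750, 50

# Shape the model produces when the last LSTM has return_sequences=True,
# matching the (None, 50, 1) rows in the summary:
model_output = np.zeros((batch, timesteps, 1))

# Shape of the label array actually passed to fit():
targets = np.zeros((batch, 1))

print(model_output.ndim, targets.ndim)  # 3 vs 2 -> "expected ... 3 dimensions"
# With return_sequences=False on the last LSTM, the output collapses to
# (750, 1) and matches the targets.
```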

Comments:

What are the shapes of train_sequences and train_labels? They are (750, 50) and (750,).

Answer 1:

You need to one-hot encode your labels. You can use the Keras to_categorical method to convert the label-encoded integers.
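A minimal sketch of that conversion, using toy labels and num_classes=2 as in the question (keras.utils.to_categorical produces the same array as this numpy one-hot):

```python
import numpy as np

labels = np.array([0, 1, 1, 0])   # toy label-encoded targets
num_classes = 2

# Equivalent to keras.utils.to_categorical(labels, num_classes):
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)  # (4, 2): one row per sample, one column per class
```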

Comments:

Thanks, but now I get another error: ValueError: Error when checking target: expected activation_1 to have 3 dimensions, but got array with shape (750, 2). I used train_labels = to_categorical(train_labels, num_classes) and test_labels = to_categorical(test_labels, num_classes). Can you help me?

Answer 2:

In the end, my model is:

model = Sequential()

model.add(Embedding(max_features, embeddings_dim, input_length=max_sent_len, mask_zero=True, weights=[embedding_weights] ))
model.add(Dropout(0.25))
model.add(LSTM(units=embeddings_dim, activation='sigmoid', recurrent_activation='hard_sigmoid', return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(activation='sigmoid', units=embeddings_dim, recurrent_activation='hard_sigmoid', return_sequences=False))
model.add(Dropout(0.25))
model.add(Dense(num_classes))
model.add(Activation('sigmoid'))

adam=keras.optimizers.Adam(lr=0.04)
model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])

But the accuracy is very poor!!! :(
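One likely culprit (an assumption, since the thread does not diagnose it): the head pairs a per-class sigmoid with categorical_crossentropy, which assumes the class scores form a probability distribution, and the Adam learning rate of 0.04 is about 40x the default of 0.001. A softmax activation is the usual pairing for that loss; the difference in a numpy sketch:

```python
import numpy as np

logits = np.array([2.0, -1.0])  # toy class scores for one sample

sigmoid = 1 / (1 + np.exp(-logits))               # independent per-class scores
softmax = np.exp(logits) / np.exp(logits).sum()   # normalized distribution

print(sigmoid.sum())  # ~1.15, not a probability distribution
print(softmax.sum())  # 1.0, what categorical_crossentropy assumes
```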

