How to add epochs to a Keras network in a scikit-learn pipeline

Posted: 2020-07-11 14:08:33

Question:

I am using the code from this website to help me analyze tweets. It uses a pipeline: https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/

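import string

import spacy
from spacy.lang.en import English
from sklearn import model_selection
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Create our list of punctuation marks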
punctuations = string.punctuation

# Create our list of stopwords
nlp = spacy.load('en')
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Load English tokenizer, tagger, parser, NER and word vectors
parser = English()

# Creating our tokenizer function
def spacy_tokenizer(sentence):
    # Creating our token object, which is used to create documents with linguistic annotations.
    mytokens = parser(sentence)

    # Lemmatizing each token and converting each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Removing stop words
    mytokens = [ word for word in mytokens if word not in stop_words and word not in punctuations ]

    # return preprocessed list of tokens
    return mytokens


# Custom transformer using spaCy
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        # Cleaning Text
        return [clean_text(text) for text in X]

    def fit(self, X, y=None, **fit_params):
        return self

    def get_params(self, deep=True):
        # This transformer has no hyperparameters, so return an empty dict
        return {}

# Basic function to clean the text
def clean_text(text):
    # Removing spaces and converting text into lowercase
    return text.strip().lower()


bow_vector = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range=(1,1))

x = tweets['text']
Y = tweets['target']
x_train, x_test, Y_train, Y_test = model_selection.train_test_split(x, Y, test_size = 0.2)

#This part I figured out on my own:

from keras import Sequential
from keras.layers import Dense
classifier = Sequential()
#First Hidden Layer
classifier.add(Dense(500, activation='relu', kernel_initializer='random_normal', input_dim=19080))
#Second  Hidden Layer
classifier.add(Dense(500, activation='relu', kernel_initializer='random_normal'))
#Output Layer
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

classifier.compile(optimizer ='adam',loss='binary_crossentropy', metrics =['accuracy'])
# Create pipeline using Bag of Words
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', bow_vector),
                 ('classifier', classifier)])

# model generation
pipe.fit(x_train, Y_train)
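
One detail in the code above is worth a hedged note: input_dim=19080 is hard-coded, but the first Dense layer has to match the number of columns the vectorizer produces, and that number depends on which rows end up in the training split. The sketch below shows one way to read the size from a fitted CountVectorizer. The sample_texts list is made-up illustration data, and the default tokenizer is used instead of spacy_tokenizer so the snippet runs without spaCy.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-in for x_train (hypothetical data, not from the question)
sample_texts = [
    "fire in the forest",
    "what a lovely day",
    "forest fire near town",
]

# Default tokenizer here; the question passes tokenizer=spacy_tokenizer instead
vectorizer = CountVectorizer(ngram_range=(1, 1))
features = vectorizer.fit_transform(sample_texts)

# The number of feature columns equals the learned vocabulary size.
# This is the value the first Dense layer would need as input_dim.
input_dim = features.shape[1]
print(input_dim, len(vectorizer.vocabulary_))
```

Under this assumption, the network would be built after the vectorizer has been fitted on the training text, rather than with a fixed number.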

My problem is that I want to do this:

classifier.fit(X_train,y_train, batch_size=5, epochs=200)

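But I can't seem to make it work with the pipeline. I can run the pipeline without these arguments, and it runs fine with only one epoch. But I'm pretty sure I would get better accuracy with more epochs.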

Answer 1:

You should use the scikit-learn wrapper:

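from keras import Sequential
from keras.layers import Dense
# Note: newer Keras/TensorFlow releases no longer ship this module;
# the separate SciKeras package provides a similar KerasClassifier.
from keras.wrappers.scikit_learn import KerasClassifier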

def create_network():
    network = Sequential()
    network.add(Dense(500, activation='relu', kernel_initializer='random_normal', input_dim=19080))
    network.add(Dense(500, activation='relu', kernel_initializer='random_normal'))
    network.add(Dense(1, activation='sigmoid', kernel_initializer='random_normal'))

    network.compile(loss='binary_crossentropy', 
                    optimizer='adam', 
                    metrics=['accuracy']) 

    return network

classifier = KerasClassifier(build_fn=create_network,
                             epochs=10,
                             batch_size=100,
                             verbose=0)

Then use the classifier shown above in your pipeline. The epochs and batch_size values are the ones you define when constructing it.
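
As a complement to the wrapper, scikit-learn's Pipeline.fit also forwards keyword arguments written as stepname__param to the fit method of the named step. The sketch below illustrates only that routing. RecordingClassifier is a hypothetical stand-in for the Keras model so that the snippet runs without TensorFlow, and the texts and labels are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for a Keras model: it only records fit arguments."""

    def fit(self, X, y, epochs=1, batch_size=32):
        # A real model would train here; this stub just stores what it received
        self.epochs_ = epochs
        self.batch_size_ = batch_size
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # Always predict the first known class; enough for the illustration
        return np.full(X.shape[0], self.classes_[0])


# Made-up data for illustration
texts = ["storm hits the coast", "nice weather today",
         "flood warning issued", "enjoying the sunshine"]
labels = [1, 0, 1, 0]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", RecordingClassifier()),
])

# "classifier__epochs" is routed to the fit method of the "classifier" step
pipe.fit(texts, labels, classifier__epochs=200, classifier__batch_size=5)

recorded = pipe.named_steps["classifier"]
print(recorded.epochs_, recorded.batch_size_)
```

The prefix must match the step name given in the pipeline definition. With the wrapper approach from the answer, the same values can instead be fixed once in the KerasClassifier constructor.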
