UserWarning: Label not :NUMBER: is present in all training examples

Posted by Data Science Insight


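Problem analysis:

# The root cause is that some labels appear in the test or validation set but never in the training set. scikit-learn phrases the warning as "Label not k is present in all training examples", meaning that every training example carries the negative class "not k", i.e. label k itself is absent from the training data.

# Some tags occur in only a few documents (see the linked article for details). When you split the dataset into train and test sets to validate the model, some tags may be missing from the training data. Let train_indices be an array holding the indices of the training samples. If a particular tag (with index k) does not occur in any training sample, every element in the k-th column of the indicator matrix y[train_indices] is zero.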
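To see how few documents each tag actually appears in, the tag lists from this post's example can be counted directly. This is only a minimal sketch: the names `tags_per_doc`, `min_docs`, and `rare_tags` are made up for illustration, and the threshold of 2 is an arbitrary assumption rather than a recommended value.

```python
from collections import Counter

# Tag lists copied from the example questions in this post
tags_per_doc = [['python'], ['oop'], ['python'], ['python', 'decorator'],
                ['matlab', 'naming-conventions'], ['matlab'],
                ['performance'], ['matlab-oop']]

# Count in how many documents each tag occurs
tag_counts = Counter(tag for tags in tags_per_doc for tag in tags)

# Hypothetical threshold: tags seen in fewer documents than this are "rare"
min_docs = 2
rare_tags = sorted(t for t, c in tag_counts.items() if c < min_docs)

print(tag_counts.most_common())
print(rare_tags)
```

Five of the seven tags occur in a single document, so almost any train/test split will leave some of them out of the training half.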

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict

Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
# Wrap the dict views in list() so pandas receives plain sequences
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

mlb = MultiLabelBinarizer()
X = dataframe['body'].values 
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df=0.8, 
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

predicted = cross_val_predict(classifier, X, y)
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 1 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

# Prediction output

import numpy as np
np.set_printoptions(precision=2, threshold=1000)
predicted
array([[0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0]])

# Run a single split by hand and capture the warnings instead of letting them print, to see which labels are missing from the training set.

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
# n_splits=1, so the generator yields exactly one (train, test) pair
train_indices, test_indices = next(rs.split(X))
print("TRAIN:", train_indices, "TEST:", test_indices)

with warnings.catch_warnings(record=True) as received_warnings:
    # Record every warning, including repeated ones
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print(w.message)
TRAIN: [3 0 5 4] TEST: [6 2 1 7]
Label not 2 is present in all training examples.
Label not 4 is present in all training examples.
Label not 5 is present in all training examples.
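Once the warnings are captured, the label indices can be pulled out of the messages rather than read off by eye. The sketch below assumes the message format shown in the output above; `fake_fit` is a hypothetical stand-in that only emits warnings in that format, so the snippet runs without scikit-learn.

```python
import re
import warnings

def fake_fit():
    # Hypothetical stand-in for classifier.fit(): emits warnings in the
    # same format as the messages printed above
    for k in (2, 4, 5):
        warnings.warn("Label not %d is present in all training examples." % k,
                      UserWarning)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fake_fit()

# Extract the label index from each captured message
pattern = re.compile(r"Label not (\S+) is present in all training examples")
matches = (pattern.search(str(w.message)) for w in caught)
missing_labels = sorted(int(m.group(1)) for m in matches if m)

print(missing_labels)  # [2, 4, 5]
```

To silence these warnings without inspecting them, `warnings.filterwarnings("ignore", message="Label .* is present in all training examples")` could be called before fitting instead, since the `message` argument is matched as a regular expression against the start of the warning text.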

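# This can also be verified from the actual training data: columns 2, 4 and 5 of y_train contain only zeros.

# Likewise, the predictions for those labels are all 0.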

y_train[:4]
array([[1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])
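The same check can be done programmatically by looking for all-zero columns in the training indicator matrix. This is a sketch only: `y_train_demo` is copied from the output above, and `classes_demo` assumes the tags are in sorted order, which is how `MultiLabelBinarizer` stores `classes_`.

```python
import numpy as np

# Training indicator matrix, copied from the y_train output above
y_train_demo = np.array([[1, 0, 0, 0, 0, 0, 1],
                         [0, 0, 0, 0, 0, 0, 1],
                         [0, 1, 0, 0, 0, 0, 0],
                         [0, 1, 0, 1, 0, 0, 0]])

# Assumed class order: the sorted tag names
classes_demo = np.array(['decorator', 'matlab', 'matlab-oop',
                         'naming-conventions', 'oop', 'performance', 'python'])

# A label is absent from training when its column has no nonzero entry
absent = np.flatnonzero(~y_train_demo.any(axis=0))

print(absent)                # [2 4 5]
print(classes_demo[absent])  # ['matlab-oop' 'oop' 'performance']
```

The indices agree with the three captured warnings, and the names show which tags the classifier never saw.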

# Write a custom function tailored to this problem.

# To work around the issue, you can implement your own prediction function:

def get_best_tags(clf, X, lb, n_tags=3):
    decfun = clf.decision_function(X)
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]

# This way, every document is always assigned the n_tags tags with the highest confidence scores:

mlb.inverse_transform(predicted_test)
get_best_tags(classifier, X_test, mlb)
array([['matlab', 'performance', 'oop'],
       ['python', 'performance', 'oop'],
       ['python', 'performance', 'oop'],
       ['matlab', 'performance', 'oop']], dtype=object)
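The slice `[:, :-(n_tags+1):-1]` in `get_best_tags` is compact enough to be easy to misread. The sketch below isolates it on a small made-up score matrix; `top_n_columns` and `scores` are illustrative names and values, not part of the original code.

```python
import numpy as np

def top_n_columns(scores, n):
    """Return, for each row, the column indices of the n largest scores, best first."""
    order = np.argsort(scores, axis=1)   # column indices in ascending score order
    # Walk each row backwards from the last element and stop after n steps
    return order[:, :-(n + 1):-1]

# Made-up decision scores: 2 documents x 4 labels
scores = np.array([[0.1, -0.5, 0.9, 0.3],
                   [-1.0, 2.0, 0.0, 1.5]])

top2 = top_n_columns(scores, 2)
print(top2)
# [[2 3]
#  [1 3]]
```

Because the ranking always returns n columns per row, a document gets n tags even when every score is negative, which is why this approach never yields an empty prediction.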

Full error:

D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 1 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))

Reference: sklearn

Reference: UserWarning: Label not :NUMBER: is present in all training examples
