UserWarning: Label not :NUMBER: is present in all training examples
Posted Data+Science+Insight
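Problem analysis:
# The core of the problem is that some labels occur in the test or validation set but never in the training set.
# This typically happens when some tags occur in only a few documents. When you split the dataset into train and test sets to validate the model, some tags may be missing from the training data. Let train_indexes be an array holding the indices of the training samples. If a particular tag (with index k) does not occur in any training sample, every element in column k of the indicator matrix y[train_indexes] is zero. The message is easiest to read as: the label "not k" is present in all training examples, that is, label k is present in none of them.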
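The all-zero column described above can be checked directly before fitting. The following is a minimal sketch using only NumPy; the function name labels_missing_from_train and the toy matrix y_toy are made up for illustration and are not part of the original post.

```python
import numpy as np

def labels_missing_from_train(y, train_indexes):
    """Return the column indices of y that are all zero on the training rows."""
    y_train = np.asarray(y)[train_indexes]
    # A label is missing when no training row has a 1 in its column.
    return np.flatnonzero(~y_train.any(axis=0))

# Hypothetical 4-sample, 3-label indicator matrix (not the post's data).
y_toy = np.array([[1, 0, 0],
                  [1, 1, 0],
                  [0, 0, 1],
                  [0, 1, 0]])

# Label 2 only occurs in row 2, which is left out of this training split.
missing = labels_missing_from_train(y_toy, train_indexes=[0, 1])
print(missing)
```

Any index this returns is a label for which the one-vs-rest classifier has no positive training example in that split.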
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_predict

# Toy dataset: question title -> list of tags
Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
     }
# Wrap the dict views in list() so pandas receives plain sequences
dataframe = pd.DataFrame({'body': list(Q.keys()), 'tag': list(Q.values())})

# Turn the tag lists into a multi-label indicator matrix
mlb = MultiLabelBinarizer()
X = dataframe['body'].values
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True,
                                   stop_words='english',
                                   max_df=0.8,
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

# Cross-validation on such a tiny dataset leaves some tags out of some training folds
predicted = cross_val_predict(classifier, X, y)
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 1 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
D:\anaconda\lib\site-packages\sklearn\multiclass.py:81: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))
# Prediction output
import numpy as np
np.set_printoptions(precision=2, threshold=1000)
predicted
array([[0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0]])
# Run one cross-validation split by hand and capture the warnings instead of letting them print, to see which labels are missing from the training set.
import warnings
from sklearn.model_selection import ShuffleSplit

# A single 50/50 shuffle split; random_state makes the indices reproducible
rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
# n_splits=1, so this loop body runs exactly once
for train_index, test_index in rs.split(X):
    train_indices, test_indices = train_index, test_index
    print("TRAIN:", train_index, "TEST:", test_index)

# Record every warning raised while fitting and predicting
with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)

for w in received_warnings:
    print(w.message)
TRAIN: [3 0 5 4] TEST: [6 2 1 7]
Label not 2 is present in all training examples.
Label not 4 is present in all training examples.
Label not 5 is present in all training examples.
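The numbers in these messages appear to be column indices of the indicator matrix, so they can be mapped back to tag names. The sketch below rests on that assumption; the helper tags_named_in_warnings and the hand-written CLASS_NAMES list (the sorted order in which MultiLabelBinarizer would store this post's tags) are illustrative and not part of the original post.

```python
import re

# Tag names in sorted order, as mlb.classes_ would hold them for the toy dataset.
# Written out by hand here so the sketch runs on its own.
CLASS_NAMES = ['decorator', 'matlab', 'matlab-oop', 'naming-conventions',
               'oop', 'performance', 'python']

PATTERN = re.compile(r"Label not (\d+) is present in all training examples")

def tags_named_in_warnings(messages, class_names):
    """Translate 'Label not K ...' warnings into the tag names they refer to."""
    indices = set()
    for message in messages:
        match = PATTERN.search(str(message))
        if match:
            indices.add(int(match.group(1)))
    return [class_names[k] for k in sorted(indices)]

messages = ["Label not 2 is present in all training examples.",
            "Label not 4 is present in all training examples.",
            "Label not 5 is present in all training examples."]
missing_tags = tags_named_in_warnings(messages, CLASS_NAMES)
print(missing_tags)  # ['matlab-oop', 'oop', 'performance']
```

In the session above, the same helper could be called with [w.message for w in received_warnings] and mlb.classes_ in place of the hand-written lists.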
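# The same can be verified from the actual training data: columns 2, 4 and 5 (matlab-oop, oop, performance) of y_train are all zero, matching the three warnings.
# For the same reason, some of the predicted rows are all zeros.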
y_train[:4]
array([[1, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])
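The remark about all-zero predictions can be checked the same way, row by row instead of column by column. As a small sketch, the matrix below is copied by hand from the cross-validated output printed earlier in the post; the variable name rows_without_tags is made up for illustration.

```python
import numpy as np

# Cross-validated predictions, copied from the output shown earlier in the post.
predicted = np.array([[0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 1],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 0, 0, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0, 0, 0],
                      [0, 1, 0, 0, 0, 0, 0]])

# Rows in which no label was predicted at all.
rows_without_tags = np.flatnonzero(~predicted.any(axis=1))
print(rows_without_tags)  # [2 3 4 5]
```

Half of the eight questions receive no tag at all, which is what motivates a prediction rule that always returns something.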
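# Write a custom function tailored to the problem.
# To work around the empty predictions, you can implement your own prediction function:
def get_best_tags(clf, X, lb, n_tags=3):
    # Per-label confidence scores, shape (n_samples, n_labels)
    decfun = clf.decision_function(X)
    # argsort is ascending, so the reversed slice takes the n_tags highest-scoring columns
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]
# This way, every document is always assigned the n_tags tags with the highest confidence scores: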
mlb.inverse_transform(predicted_test)
get_best_tags(classifier, X_test, mlb)
array([['matlab', 'performance', 'oop'],
       ['python', 'performance', 'oop'],
       ['python', 'performance', 'oop'],
       ['matlab', 'performance', 'oop']], dtype=object)
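The slice inside get_best_tags is compact, so a worked example on a small score matrix may help. This is a sketch with made-up scores (not values produced by the classifier above); top_n_indices is a hypothetical name for the same indexing expression. It also contrasts the ranking with a plain "score > 0" threshold, which is assumed here to be how the default prediction decides on each label.

```python
import numpy as np

def top_n_indices(scores, n_tags):
    # argsort sorts each row ascending; the reversed slice keeps the
    # last n_tags positions, highest score first.
    return np.argsort(scores)[:, :-(n_tags + 1):-1]

# Made-up decision scores for 2 documents and 4 labels.
scores = np.array([[0.1, -0.5, 0.9, 0.3],
                   [-1.0, -0.2, -0.7, -0.4]])

top = top_n_indices(scores, n_tags=2)
print(top)

# Thresholding at zero gives the second document no label at all,
# while the ranking above still returns its two least-negative labels.
thresholded = (scores > 0).astype(int)
print(thresholded)
```

For the second row every score is negative, so thresholding yields an empty prediction, whereas the ranking still picks labels 1 and 3.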
Reference: sklearn
Reference: UserWarning: Label not :NUMBER: is present in all training examples