Predicting with a trained model
【Posted】: 2020-07-04 01:50:31

【Question】:
I built a model with logistic regression and later saved it with joblib. Then I tried to load that model, feed it my test.csv, and predict the labels. When I do this I get an error saying "X has 1433445 features per sample; expecting 3797015". Here is my initial code:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
#reading data
train=pd.read_csv('train_yesindia.csv')
test=pd.read_csv('test_yesindia.csv')
train=train.iloc[:,1:]
test=test.iloc[:,1:]
test.info()
train.info()
test['label']='t'
test=test.fillna(' ')
train=train.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
train['total']=train['title']+' '+train['author']+train['text']
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#split in samples
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf, targets, random_state=0)
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
print('Accuracy of Logistic Regression classifier on training set: {:.2f}'
      .format(logreg.score(X_train, y_train)))
print('Accuracy of Logistic Regression classifier on test set: {:.2f}'
      .format(logreg.score(X_test, y_test)))
targets = train['label'].values
logreg = LogisticRegression()
logreg.fit(counts, targets)
example_counts = count_vectorizer.transform(test['total'].values)
predictions = logreg.predict(example_counts)
pred=pd.DataFrame(predictions,columns=['label'])
pred['id']=test['id']
pred.groupby('label').count()
#dumping models
from joblib import dump, load
dump(logreg,'mypredmodel1.joblib')
Later I loaded the model in a separate script:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from joblib import dump, load
test=pd.read_csv('test_yesindia.csv')
test=test.iloc[:,1:]
test['label']='t'
test=test.fillna(' ')
test['total']=test['title']+' '+test['author']+test['text']
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
#load_model
logreg = load('mypredmodel1.joblib')
example_counts = count_vectorizer.fit_transform(test['total'].values)
predictions = logreg.predict(example_counts)
When I run it, I get this error:
predictions = logreg.predict(example_counts)
Traceback (most recent call last):
File "<ipython-input-58-f28afd294d38>", line 1, in <module>
predictions = logreg.predict(example_counts)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 289, in predict
scores = self.decision_function(X)
File "C:\Users\adars\Anaconda3\lib\site-packages\sklearn\linear_model\base.py", line 270, in decision_function
% (X.shape[1], n_features))
ValueError: X has 1433445 features per sample; expecting 3797015
【Comments】:
Where exactly does the error pop up? Please update your post to include the full error trace. Also, it seems you are not applying the TF-IDF transformation in your second code block...
@desertnaut More helpful now, thank you very much.
Cool. The only thing to do better next time is to remove all the code that comes after the error (it is never used, hence irrelevant to the issue, and only creates clutter) - done for you this time.
@desertnaut Yes, thanks a lot. Any clue about the error?

【Answer 1】: This is most probably because you re-fit the transformers on the test set. This should never be done - you should keep them fitted on your training set, and use the test (or any other future) set only for transforming the data.
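As a minimal sketch of that idea (with hypothetical toy texts and an illustrative filename), you fit the vectorizer once on the training data, persist it, and only transform the test data with it:

from joblib import dump, load
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["some training document", "another training document"]   # toy data
test_texts = ["an unseen test document"]                                 # toy data

vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_counts = vectorizer.fit_transform(train_texts)    # fit ONLY on training data
dump(vectorizer, 'count_vectorizer.joblib')                # persist the fitted vocabulary

vectorizer = load('count_vectorizer.joblib')               # later, e.g. in another script
X_test_counts = vectorizer.transform(test_texts)           # transform only - same vocabulary
assert X_train_counts.shape[1] == X_test_counts.shape[1]   # feature dimensions now match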
This is easier done with pipelines.
So, remove the following code from your first block:
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = count_vectorizer.fit_transform(train['total'].values)
tfidf = transformer.fit_transform(counts)
targets = train['label'].values
test_counts = count_vectorizer.transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
and replace it with:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False))
])
pipeline.fit(train['total'].values)
tfidf = pipeline.transform(train['total'].values)
targets = train['label'].values
test_tfidf = pipeline.transform(test['total'].values)
dump(pipeline, 'transform_predict.joblib')
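As an optional further simplification (a sketch, not something your current code requires), the classifier itself could be appended to the same pipeline, so that a single object is dumped and loaded; the filename below is just illustrative:

from joblib import dump
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# a single object that vectorizes, applies TF-IDF weighting, and classifies
full_pipeline = Pipeline([
    ('counts', CountVectorizer(ngram_range=(1, 2))),
    ('tf-idf', TfidfTransformer(smooth_idf=False)),
    ('clf', LogisticRegression())
])
full_pipeline.fit(train['total'].values, train['label'].values)
dump(full_pipeline, 'full_pipeline.joblib')

# in the prediction script a single load + predict is then enough:
# predictions = load('full_pipeline.joblib').predict(test['total'].values)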
Now, in your second code block, remove this part:
#check
transformer = TfidfTransformer(smooth_idf=False)
count_vectorizer = CountVectorizer(ngram_range=(1, 2))
test_counts = count_vectorizer.fit_transform(test['total'].values)
test_tfidf = transformer.fit_transform(test_counts)
#check
and replace it with:
pipeline = load('transform_predict.joblib')
test_tfidf = pipeline.transform(test['total'].values)
And you should be fine, provided that you predict on the test_tfidf variable, and not on example_counts, which is not transformed by TF-IDF:
predictions = logreg.predict(test_tfidf)
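As a quick sanity check, you can also compare the feature dimension coming out of the loaded pipeline with the one the trained classifier expects:

# both numbers must be equal, otherwise predict() raises the
# "X has N features per sample; expecting M" ValueError shown above
print(test_tfidf.shape[1], logreg.coef_.shape[1])
assert test_tfidf.shape[1] == logreg.coef_.shape[1]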
【Comments】:
Thanks a lot, you saved me a lot of time. This worked like a charm. Also, I'll make sure to keep your advice in mind next time.