Ignore test features not present in training data

Posted: 2020-02-22 17:28:21

Question:

My task is to create three classifiers (two "out of the box", one "optimized") to perform sentiment-analysis prediction with sklearn.

The instructions are:

1. Ingest the training set and train the classifiers
2. Save the classifiers to disk
3. In a separate program, load the classifiers from disk
4. Make predictions using the test set
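For orientation, here is a minimal sketch of those four steps with joblib (the file names and the two-review toy corpus are placeholders, not the poster's setup). The detail that matters for the error below is that the fitted vectorizer is saved to disk alongside the models:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from joblib import dump, load

# 1. Ingest the training set and train a classifier
train_texts = ["great food and service", "terribly inconsistent food"]
train_labels = [4, 2]
vectorizer = TfidfVectorizer()
clf = RandomForestClassifier().fit(vectorizer.fit_transform(train_texts), train_labels)

# 2. Save the classifier and, crucially, the fitted vectorizer to disk
dump(clf, "rf.joblib")
dump(vectorizer, "vectorizer.joblib")

# 3. (in the separate test program) load both back
clf = load("rf.joblib")
vectorizer = load("vectorizer.joblib")

# 4. Predict: transform() reuses the vocabulary learned in step 1
print(clf.predict(vectorizer.transform(["the food was great"])))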

Steps 1-3 are no problem and, frankly, work well; the issue is with model.predict(). I am using sklearn's TfidfVectorizer, which creates a feature vector from text. My problem is that the feature vector I create for the training set is different from the one created for the test set, because the text supplied to each is different.
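A toy sketch of the mismatch (invented corpora): two independently fitted vectorizers learn two different vocabularies, so the matrices they produce have different widths, and a model trained on one cannot accept the other:

from sklearn.feature_extraction.text import TfidfVectorizer

train_corpus = ["good pizza here", "bad sushi rolls"]
test_corpus = ["great homemade tortillas", "bland tempura"]

X_train = TfidfVectorizer().fit_transform(train_corpus)  # vocabulary learned from train text
X_test = TfidfVectorizer().fit_transform(test_corpus)    # a second, unrelated vocabulary

print(X_train.shape, X_test.shape)  # (2, 6) vs (2, 5): column counts differ, so predict() fails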

Here are some sample rows from the train.tsv file...

4|z8DDztUxuIoHYHddDL9zQ|So let me set the scene first, My church social group took a trip here last saturday. We are not your mothers church. The churhc is Community Church of Hope, We are the valleys largest GLBT church so when we desended upon Organ stop Pizza, in LDS land you know we look a little out of place. We had about 50 people from our church come and boy did we have fun.  There was a baptist church a couple rows down from us who didn't see it coming. Now we aren't a bunch of flamers frolicking around or anything but we do tend to get a little loud and generally have a great time. I did recognized some of the music  so I was able to sing along with those.  This is a great place to take anyone over 50.  I do think they might be washing dirtymob money or something since the business is cash only.........which I think caught a lot of people off guard including me.  The show starts at 530  so dont be late !!!!!!
2|BIeDBg4MrEd1NwWRlFHLQQ|Decent but terribly inconsistent food. I've had some great dishes and some terrible ones, I love chaat and 3 out of 4 times it was great, but once it was just a fried greasy mess (in a bad way, not in the good way it usually is.) Once the matar paneer was great, once it was oversalted and the peas were just plain bad. I don't know how they do it, but it's a coinflip between good food and an oversalted overcooked bowl.  Either way, portions are generous.
4|NJHPiW30SKhItD5E2jqpHw|Looks aren't everything.......  This little divito looks a little scary looking, but like I've said before "you can't judge a book by it's cover".   Not necessarily the kind of place you will take your date (unless she's blind and hungry), but man oh man is the food ever good!   We have ordered breakfast, lunch, & dinner, and it is all fantastico. They make home-made corn tortillas and several salsas. The breakfast burritos are out of this world and cost about the same as a McDonald's meal.   We are a family that eats out frequently and we are frankly tired of pretty places with below average food. This place is sure to cure your hankerin for a tasty Mexican meal.
2|nnS89FMpIHz7NPjkvYHmug|Being a creature of habit anytime I want good sushi I go to Tokyo Lobby.  Well, my group wanted to branch out and try something new so we decided on Sakana. Not a fan.  And what's shocking to me is this place was packed!  The restaurant opens at 5:30 on Saturday and we arrived at around 5:45 and were lucky to get the last open table.  I don't get it...  Messy rolls that all tasted the same.  We ordered the tootsie roll and the crunch roll, both tasted similar, except of course for the crunchy captain crunch on top.  Just a mushy mess, that was hard to eat.  Bland tempura.  No bueno.  I did, however, have a very good tuna poke salad, but I would not go back just for that.   If you want good sushi on the west side, or the entire valley for that matter, say no to Sakana and yes to Tokyo Lobby.
2|FYxSugh9PGrX1PR0BHBIw|I recently told a friend that I cant figure out why there is no good Mexican restaurants in Tempe. His response was what about MacAyo's? I responded with "why are there no good Mexican food restaurants in Tempe?"  Seriously if anyone out there knows of any legit Mexican in Tempe let me know. And don't say restaurant Mexico!

Here is the train.py file:

import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import islice
import time
from joblib import dump, load

def ID_to_Num(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.fit_transform(arr)
    return new_arr

def Num_to_ID(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.inverse_transform(arr)
    return new_arr

def check_performance(preds, acts):
    preds = list(preds)
    acts = pd.Series.tolist(acts)
    right = 0
    total = 0
    for i in range(len(preds)):
        if preds[i] == acts[i]:
            right += 1
        total += 1

    return (right / total) * 100

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def minmaxscale(data):
    scaler = MinMaxScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled

# This function takes the first n items of a dictionary
def take(n, iterable):
    #https://***.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict
    #Return first n items of the iterable as a dict
    return dict(islice(iterable, n))

def main():

    tsv_file = "filepath"
    csv_table=pd.read_csv(tsv_file, sep='\t', header=None)
    csv_table.columns = ['class', 'ID', 'text']

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)

    s = pd.Series(csv_table['text'])
    corpus = s.apply(lambda s: ' '.join(get_words(s)))

    csv_table['dirty'] = csv_table['text'].str.split().apply(len)
    csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    df = pd.DataFrame(data=X.todense(), columns=vectorizer.get_feature_names())

    result = pd.concat([csv_table, df], axis=1, sort=False)

    Y = result['class']

    result = result.drop('text', axis=1)
    result = result.drop('ID', axis=1)
    result = result.drop('class', axis=1)

    X = result

    mlp = MLPClassifier()
    rf = RandomForestClassifier()    
    mlp_opt = MLPClassifier(
        activation = 'tanh',
        hidden_layer_sizes = (1000,),
        alpha = 0.009,
        learning_rate = 'adaptive',
        learning_rate_init = 0.01,
        max_iter = 250,
        momentum = 0.9,
        solver = 'lbfgs',
        warm_start = False
    )    

    print("Training Classifiers")
    mlp_opt.fit(X, Y)
    mlp.fit(X, Y)
    rf.fit(X, Y)

    dump(mlp_opt, "C:\\filepath\\Models\\mlp_opt.joblib")
    dump(mlp, "C:\\filepath\\Models\\mlp.joblib")
    dump(rf, "C:\\filepath\\Models\\rf.joblib")

    print("Trained Classifiers")

main()
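An editorial aside: this whole class of mismatch can be avoided by bundling the vectorizer and classifier into one sklearn Pipeline, so a single dump()/load() carries the vocabulary along with the model. A sketch under the assumption that only the TF-IDF features are used (without the extra 'dirty'/'clean' count columns); the corpus and labels here are toy stand-ins for the ones built in main():

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from joblib import dump

corpus = ["toy cleaned review one", "toy cleaned review two"]  # stands in for the corpus from main()
Y = [4, 2]                                                     # stands in for the class column

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),      # learns the vocabulary during fit
    ("rf", RandomForestClassifier()),  # trains on the resulting sparse matrix
])
pipe.fit(corpus, Y)
dump(pipe, "rf_pipeline.joblib")  # the test program just loads this and calls .predict(raw_test_text)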

Here is the Tester.py file:

from nltk.corpus import stopwords
import sklearn, string, nltk, re, pandas as pd, numpy, time
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import islice  # needed by take() below
from joblib import dump, load

def ID_to_Num(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.fit_transform(arr)
    return new_arr

def Num_to_ID(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.inverse_transform(arr)
    return new_arr

def check_performance(preds, acts):
    preds = list(preds)
    acts = pd.Series.tolist(acts)
    right = 0
    total = 0
    for i in range(len(preds)):
        if preds[i] == acts[i]:
            right += 1
        total += 1

    return (right / total) * 100

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def minmaxscale(data):
    scaler = MinMaxScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled

# This function takes the first n items of a dictionary
def take(n, iterable):
    #https://***.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict
    #Return first n items of the iterable as a dict
    return dict(islice(iterable, n))

def main():

    tsv_file = "filepath\\dev.tsv"
    csv_table=pd.read_csv(tsv_file, sep='\t', header=None)
    csv_table.columns = ['class', 'ID', 'text']

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)

    s = pd.Series(csv_table['text'])
    corpus = s.apply(lambda s: ' '.join(get_words(s)))

    csv_table['dirty'] = csv_table['text'].str.split().apply(len)
    csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    df = pd.DataFrame(data=X.todense(), columns=vectorizer.get_feature_names())

    result = pd.concat([csv_table, df], axis=1, sort=False)

    Y = result['class']

    result = result.drop('text', axis=1)
    result = result.drop('ID', axis=1)
    result = result.drop('class', axis=1)

    X = result

    mlp_opt = load("C:\\filepath\\Models\\mlp_opt.joblib")
    mlp = load("C:\\filepath\\Models\\mlp.joblib")
    rf = load("C:\\filepath\\Models\\rf.joblib")

    print("Testing Classifiers")
    mlp_opt_preds = mlp_opt.predict(X)
    mlp_preds = mlp.predict(X)
    rf_preds = rf.predict(X)

    mlp_opt_performance = check_performance(mlp_opt_preds, Y)
    mlp_performance = check_performance(mlp_preds, Y)
    rf_performance = check_performance(rf_preds, Y)

    print("MLP OPT PERF: ".format(mlp_opt_performance))
    print("MLP PERF: ".format(mlp_performance))
    print("RF PERF: ".format(rf_performance))

main()

What I end up with is an error:

Testing Classifiers
Traceback (most recent call last):
  File "Reader.py", line 121, in <module>
    main()
  File "Reader.py", line 109, in main
    mlp_opt_preds = mlp_opt.predict(X)
  File "C:\Users\Jerry\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 953, in predict
    y_pred = self._predict(X)
  File "C:\Users\Jerry\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 676, in _predict
    self._forward_pass(activations)
  File "C:\Users\Jerry\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 102, in _forward_pass
    self.coefs_[i])
  File "C:\Users\Jerry\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\utils\extmath.py", line 173, in safe_sparse_dot
    return np.dot(a, b)
**ValueError: shapes (2000,13231) and (12299,1000) not aligned: 13231 (dim 1) != 12299 (dim 0)**

I know the error has to do with the difference in feature-vector sizes, since the vectors are created from the text in each dataset. I don't know enough about NLP or machine learning to design a solution to this problem. How can I make the models predict using the feature set from the test data?

I tried an edit based on the answer below, saving the feature vector:

Train.py now looks like this:

import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from itertools import islice
import time
import pickle
from joblib import dump, load

def ID_to_Num(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.fit_transform(arr)
    return new_arr

def Num_to_ID(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.inverse_transform(arr)
    return new_arr

def check_performance(preds, acts):
    preds = list(preds)
    acts = pd.Series.tolist(acts)
    right = 0
    total = 0
    for i in range(len(preds)):
        if preds[i] == acts[i]:
            right += 1
        total += 1

    return (right / total) * 100

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def minmaxscale(data):
    scaler = MinMaxScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled

# This function takes the first n items of a dictionary
def take(n, iterable):
    #https://***.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict
    #Return first n items of the iterable as a dict
    return dict(islice(iterable, n))

def main():

    tsv_file = "filepath\\train.tsv"
    csv_table=pd.read_csv(tsv_file, sep='\t', header=None)
    csv_table.columns = ['class', 'ID', 'text']

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)

    s = pd.Series(csv_table['text'])
    corpus = s.apply(lambda s: ' '.join(get_words(s)))

    csv_table['dirty'] = csv_table['text'].str.split().apply(len)
    csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))

    vectorizer = TfidfVectorizer()
    test = vectorizer.fit_transform(corpus)

    df = pd.DataFrame(data=test.todense(), columns=vectorizer.get_feature_names())

    result = pd.concat([csv_table, df], axis=1, sort=False)

    Y = result['class']

    result = result.drop('text', axis=1)
    result = result.drop('ID', axis=1)
    result = result.drop('class', axis=1)

    X = result

    mlp = MLPClassifier()
    rf = RandomForestClassifier()    
    mlp_opt = MLPClassifier(
        activation = 'tanh',
        hidden_layer_sizes = (1000,),
        alpha = 0.009,
        learning_rate = 'adaptive',
        learning_rate_init = 0.01,
        max_iter = 250,
        momentum = 0.9,
        solver = 'lbfgs',
        warm_start = False
    )    

    print("Training Classifiers")
    mlp_opt.fit(X, Y)
    mlp.fit(X, Y)
    rf.fit(X, Y)

    dump(mlp_opt, "filepath\\Models\\mlp_opt.joblib")
    dump(mlp, "filepath\\Models\\mlp.joblib")
    dump(rf, "filepath\\Models\\rf.joblib")
    pickle.dump(test, open("filepath\\tfidf_vectorizer.pkl", 'wb'))

    print("Trained Classifiers")

main()

Test.py now looks like this:

from nltk.corpus import stopwords
import sklearn, string, nltk, re, pandas as pd, numpy, time
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from joblib import dump, load
from itertools import islice  # needed by take() below
import pickle

def ID_to_Num(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.fit_transform(arr)
    return new_arr

def Num_to_ID(arr):
    le = preprocessing.LabelEncoder()
    new_arr = le.inverse_transform(arr)
    return new_arr

def check_performance(preds, acts):
    preds = list(preds)
    acts = pd.Series.tolist(acts)
    right = 0
    total = 0
    for i in range(len(preds)):
        if preds[i] == acts[i]:
            right += 1
        total += 1

    return (right / total) * 100

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def minmaxscale(data):
    scaler = MinMaxScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
    return df_scaled

# This function takes the first n items of a dictionary
def take(n, iterable):
    #https://***.com/questions/7971618/python-return-first-n-keyvalue-pairs-from-dict
    #Return first n items of the iterable as a dict
    return dict(islice(iterable, n))

def main():

    tfidf_vectorizer = pickle.load(open("filepath\\tfidf_vectorizer.pkl", 'rb'))

    tsv_file = "filepath\\dev.tsv"
    csv_table=pd.read_csv(tsv_file, sep='\t', header=None)
    csv_table.columns = ['class', 'ID', 'text']

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    vocab = get_words(new)

    s = pd.Series(csv_table['text'])
    corpus = s.apply(lambda s: ' '.join(get_words(s)))

    csv_table['dirty'] = csv_table['text'].str.split().apply(len)
    csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))

    print(type(corpus))
    print(corpus.head())

    X = tfidf_vectorizer.transform(corpus)

    print(X)

    df = pd.DataFrame(data=X.todense(), columns=tfidf_vectorizer.get_feature_names())

    result = pd.concat([csv_table, df], axis=1, sort=False)

    Y = result['class']

    result = result.drop('text', axis=1)
    result = result.drop('ID', axis=1)
    result = result.drop('class', axis=1)

    X = result

    mlp_opt = load("filepath\\Models\\mlp_opt.joblib")
    mlp = load("filepath\\Models\\mlp.joblib")
    rf = load("filepath\\Models\\rf.joblib")

    print("Testing Classifiers")
    mlp_opt_preds = mlp_opt.predict(X)
    mlp_preds = mlp.predict(X)
    rf_preds = rf.predict(X)

    mlp_opt_performance = check_performance(mlp_opt_preds, Y)
    mlp_performance = check_performance(mlp_preds, Y)
    rf_performance = check_performance(rf_preds, Y)

    print("MLP OPT PERF: ".format(mlp_opt_performance))
    print("MLP PERF: ".format(mlp_performance))
    print("RF PERF: ".format(rf_performance))

main()

But this produces:

Traceback (most recent call last):
  File "Filepath\Reader.py", line 128, in <module>
    main()
  File "Filepath\Reader.py", line 95, in main
    X = tfidf_vectorizer.transform(corpus)
  File "C:\Users\Jerry\AppData\Local\Programs\Python\Python37\lib\site-packages\scipy\sparse\base.py", line 689, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: transform not found
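That AttributeError comes from scipy rather than sklearn: the object unpickled in Test.py is a sparse matrix, not a vectorizer, because the revised Train.py pickles test (the output of fit_transform) instead of the fitted TfidfVectorizer, and scipy sparse matrices have no transform method. A one-line sketch of the intended save in Train.py, keeping the poster's placeholder path:

# Pickle the fitted vectorizer object itself, not the matrix it produced
pickle.dump(vectorizer, open("filepath\\tfidf_vectorizer.pkl", 'wb'))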


Answer 1:

You should not call fit_transform() on the test dataset. You should only use the vocabulary learned from the training dataset.

Here is an example solution:

import pickle

tfidf_vectorizer = TfidfVectorizer()
train_data = tfidf_vectorizer.fit_transform(train_corpus) # fit on train

# You could just save the vectorizer with pickle
pickle.dump(tfidf_vectorizer, open('tfidf_vectorizer.pkl', 'wb'))

# then later load the vectorizer and transform on test-dataset.
tfidf_vectorizer = pickle.load(open('tfidf_vectorizer.pkl', 'rb'))
test_data = tfidf_vectorizer.transform(test_corpus)

When you use transform(), it only considers the vocabulary learned from the training corpus and ignores any new words found in the test set.
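A quick illustration of that behaviour with invented strings:

from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
vec.fit(["good pizza", "bad sushi"])

# "tortillas" never appeared during fit, so transform() silently drops it
row = vec.transform(["good tortillas"])
print(row.shape)                # (1, 4): width is fixed by the training vocabulary
print(sorted(vec.vocabulary_))  # ['bad', 'good', 'pizza', 'sushi']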

Discussion:

- Good idea, but: TypeError: cannot serialize '_io.BufferedWriter' object
- @JerryM I made a small mistake; try it now and let me know.
- Trying to read it back (with your edit) produces EOFError: Ran out of input
- I would appreciate a full traceback; it is hard to say what is wrong with your code otherwise.
- File "Reader.py", line 123, in <module> main() File "Reader.py", line 76, in main tfidf_vectorizer = pickle.load(open("filepath\\tfidf_vectorizer.pkl", 'rb')) EOFError: Ran out of input
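Both errors in that exchange are consistent with a broken pickle round-trip: trying to pickle the file handle itself (for example, by passing pickle.dump its arguments in the wrong order) raises the '_io.BufferedWriter' TypeError, and a dump that failed or was never flushed leaves an empty file that pickle.load then rejects with EOFError: Ran out of input. A defensive sketch, assuming a fitted vectorizer:

import pickle

# pickle.dump(obj, file): the object to serialize comes first, then an open binary handle;
# the with-block guarantees the file is flushed and closed before anything reads it back
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)

with open("tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)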
