Error using the scikit-learn linear regression model in Python


【Title】Error in using scikit-learn linear regression model in Python 【Posted】: 2015-12-14 06:17:38 【Question】:

I am using Python in an IPython notebook to build a Linear Regression model, following this scikit-learn page. My dataset looks like this:

KR,Alabama,97071129.11369997,186026.0,63.14000000000001,923.8600000000001
KR,Alabama,67445447.0459,187201.0,94.71,1385.79
KR,Alabama,66332319.626799986,186611.0,121.77000000000001,1781.73
KR,Alabama,75868163.65490001,188002.0,171.38,2507.62
KR,Alabama,104626353.3301,192055.0,62.300000000000004,924.2800000000001
KR,Alabama,82482715.69460002,193070.0,93.45,1386.4199999999998
KR,Alabama,81095032.9574,196819.0,120.15,1782.5400000000002
KR,Alabama,70076833.3433,196738.0,169.1,2508.76
KR,Alabama,111183092.64729999,195091.0,64.82000000000001,946.2600000000001
KR,Alabama,90909063.08510002,197789.0,97.22999999999999,1419.3899999999999
KR,Alabama,90934598.2206,201541.0,125.01,1824.93
KR,Alabama,107374172.93309999,203338.0,175.94,2568.42
KR,Arizona,1126677862.6940002,264600.0,63.14000000000001,923.8600000000001
KR,Arizona,838166771.0832,268153.0,94.71,1385.79
KR,Arizona,956037530.2797,268429.0,121.77000000000001,1781.73
KR,Arizona,984328946.5951,268792.0,171.38,2507.62
KR,Arizona,1257812174.3229997,270547.0,62.300000000000004,924.2800000000001
KR,Arizona,883093705.2885998,272764.0,93.45,1386.4199999999998
KR,Arizona,880652373.4425,276307.0,120.15,1782.5400000000002
KR,Arizona,910039260.961,279318.0,169.1,2508.76
KR,Arizona,1226385050.8268003,279983.0,64.82000000000001,946.2600000000001
KR,Arizona,1087126209.1170998,281409.0,97.22999999999999,1419.3899999999999
KR,Arizona,934971659.6374002,286590.0,125.01,1824.93
KR,Arizona,986475815.6928002,288644.0,175.94,2568.42
KR,California,7830776748.968867,2085424.0,63.14000000000001,923.8600000000001
KR,California,5999727784.478112,2103999.0,94.71,1385.79
KR,California,5804539962.436825,2138267.0,121.77000000000001,1781.73
KR,California,6547521069.504964,2172849.0,171.38,2507.62
KR,California,7945616026.08499,2157455.0,62.300000000000004,924.2800000000001
KR,California,6068949829.714768,2182688.0,93.45,1386.4199999999998
KR,California,5767177648.936179,2227205.0,120.15,1782.5400000000002
KR,California,6292965589.900258,2284617.0,169.1,2508.76
KR,California,8805205589.885035,2254347.0,64.82000000000001,946.2600000000001
KR,California,6855033176.090414,2292655.0,97.22999999999999,1419.3899999999999
KR,California,6930741761.859158,2341652.0,125.01,1824.93
KR,California,6916313224.326924,2357810.0,175.94,2568.42

In this dataset, every state within each company_id has 12 records. For each company_id, and for each state within that company_id, I want to form a training set of 10 records and a test set of 2 records.
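The per-group 10/2 split described above can be sketched without any third-party package. This is only a minimal illustration: it assumes the rows are already ordered by company_id and state, as in the sample, and the name `split_by_group`, the `test_size` parameter, and the generated rows are all hypothetical. Holding out the last two rows of each group is just one possible choice.

```python
from itertools import groupby


def split_by_group(rows, test_size=2):
    """Yield (key, train_rows, test_rows) for each (company_id, state) group.

    Assumes rows are already sorted by company_id and state, because
    itertools.groupby only merges adjacent rows that share a key.
    """
    for key, group in groupby(rows, key=lambda row: (row[0], row[1])):
        group = list(group)
        # Hold out the last `test_size` rows of the group for testing.
        yield key, group[:-test_size], group[-test_size:]


# Made-up rows in the order company_id, state, profit, attr1, attr2, attr3.
rows = [
    ["KR", state, str(100.0 + i), str(float(i)), str(2.0 * i), str(3.0 * i)]
    for state in ("Alabama", "Arizona")
    for i in range(12)
]

splits = list(split_by_group(rows))
for key, train_rows, test_rows in splits:
    print(key, len(train_rows), len(test_rows))
```

Each yielded group can then be converted to float arrays and passed to a separate model, one model per (company_id, state) pair.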

Here is my current, updated code:

from sklearn import linear_model
import csv
import numpy as np


def process_chunk(chuk):

    training_set_feature_list = []
    training_set_label_list = []
    test_set_feature_list = []
    test_set_label_list = []
    # to divide into training & test, I am putting the 10th and 11th lines in the test set
    count = 0
    for line in chuk:
        # Converting strings to numpy arrays
        if count == 9 or count == 10:
            test_set_feature_list.append(np.array(line[3:5], dtype=float))
            test_set_label_list.append(np.array(line[2], dtype=float))
        else:
            training_set_feature_list.append(np.array(line[3:5], dtype=float))
            training_set_label_list.append(np.array(line[2], dtype=float))

        count += 1
    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    print(regr.predict(test_set_feature_list))



# Load and parse the data
chunk, chunksize = [], 12

with open('file.csv', 'r') as file_read:
    reader = csv.reader(file_read)
    for i, line in enumerate(reader):
        # every 12 rows, fit a model on the collected chunk and start a new one
        if i % chunksize == 0 and i > 0:
            process_chunk(chunk)
            del chunk[:]
        chunk.append(line)

# process the remainder
process_chunk(chunk)

When I execute this code, I get the following error:

ValueError: Found arrays with inconsistent numbers of samples: [ 1 10] on the line regr.fit(training_set_feature_list, training_set_label_list)

What is the error here, and how can I fix it?
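An error of this kind means that the feature array and the label array disagree on the number of rows (samples). Comparing shapes before calling `fit` makes the mismatch visible. The sketch below uses made-up numbers and a hypothetical helper, `sample_counts`. Treating a flat list as a single row is an assumption made here to reproduce the `[ 1 10]` pattern; it is one way such a mismatch could arise, not necessarily what happened in this case.

```python
import numpy as np


def sample_counts(features, labels):
    """Return (number of feature rows, number of labels)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    if X.ndim == 1:
        # Assumption: a flat list of numbers is read as one row, i.e. one sample.
        X = X.reshape(1, -1)
    return X.shape[0], y.shape[0]


labels = [float(i) for i in range(10)]

# Ten values of a single feature stored as a flat list: 1 row vs. 10 labels.
flat_features = [float(i) for i in range(10)]
print(sample_counts(flat_features, labels))    # (1, 10)

# The same values as a column: 10 rows of 1 feature, matching the 10 labels.
column_features = np.asarray(flat_features).reshape(-1, 1)
print(sample_counts(column_features, labels))  # (10, 10)
```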

Update: after applying the suggestions, this is my current output, which contains some strange-looking numbers:

[  1.01999724e+08   1.03189615e+08]
[  1.08523268e+09   1.05427929e+09]
[  7.77478189e+09   7.56564733e+09]
[  8.87437438e+08   8.77578642e+08]
[  1.62710654e+08   1.51921308e+08]
[  4.19988737e+09   4.00902600e+09]
[  7.70222690e+08   7.31282229e+08]
[  1.60301569e+09   1.51976018e+09]
[  9.31799698e+08   9.28243073e+08]
[ 51831980.55257727  53136008.17725636]
[  1.92207016e+08   1.85232202e+08]
[  3.82247927e+08   3.33879176e+08]
[  1.35276200e+09   1.34525871e+09]
[  1.62557223e+09   1.53895636e+09]
[  2.12376099e+09   2.08585811e+09]
[ 61386995.4473462   58500866.29796618]
[  3.18458112e+08   3.09384959e+08]
[  4.90038249e+08   4.87984249e+08]
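Values such as 1.01999724e+08 are ordinary floats printed in scientific notation: 1.01999724e+08 is roughly 102 million, the same order of magnitude as the profit values in the sample rows for Alabama. A small sketch of printing such predictions in plain decimal form, using the first output row above as input:

```python
predictions = [1.01999724e+08, 1.03189615e+08]

# Format each value with thousands separators and two decimal places.
formatted = [format(value, ",.2f") for value in predictions]
print(formatted)  # ['101,999,724.00', '103,189,615.00']
```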

【Comments】:

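It looks like you have different numbers of samples and labels; for example, training_set_feature_list has a different size than training_set_label_list. Also, for this kind of task you can use the pandas package and group your dataframe by company_id and state.

@Olologin Can you tell me how I should do that? Also, training_set_feature_list and training_set_label_list have the same number of records, because they are built together.

Can you share your csv data? Or maybe just the first few rows of the csv? I would like to debug it myself.

@Olologin I have updated my post above with some csv data. Please take a look.

【Answer 1】: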

I think your data contains strings, which is why it complains. There are a few other issues as well, so I am posting a corrected version.

from sklearn import linear_model
import csv
import numpy as np
import matplotlib.pyplot as plt

def process_chunk(chuk):

    # to divide into training & test
    chuk = [row[2:] for row in chuk]  # Removing first 2 columns (company_id, state)
    chunk = np.array(chuk, dtype=float)  # Make floats array from strings
    # Remaining columns: 0 = profit, 1 = attr1, 2 = attr2, 3 = attr3
    ########## Testing dataset: data after the 30th row ##########
    test_set_feature_list = chunk[30:, 3:5]  # only column 3 (attr3) exists in this range
    test_set_label_list = chunk[30:, 2]  # column 2 (attr2)

    ########## Training dataset: all data before the 30th row ##########
    training_set_feature_list = chunk[:30, 3:5]
    training_set_label_list = chunk[:30, 2]

    # Create linear regression object
    regr = linear_model.LinearRegression()
    # Train the model using the training sets
    regr.fit(training_set_feature_list, training_set_label_list)

    predictedTestSet = regr.predict(test_set_feature_list)

    # The coefficients
    print('Coefficients: {}'.format(regr.coef_))
    # Labelled "Residual sum of squares", but this expression is the square of
    # the mean residual; the mean squared error would be
    # np.mean((predictedTestSet - test_set_label_list) ** 2)
    print('Residual sum of squares: %.2f' % (np.mean(predictedTestSet - test_set_label_list) ** 2))
    # Explained variance score: 1 is perfect prediction
    print('Variance score: %.2f' % regr.score(test_set_feature_list, test_set_label_list))
    # Sort the (label, prediction) pairs by label so both series can be plotted in order
    X = [x for (y, x) in sorted(zip(test_set_label_list, predictedTestSet))]
    Y = [y for (y, x) in sorted(zip(test_set_label_list, predictedTestSet))]
    plt.plot(range(len(X)), X, 'r.', label='predicted')
    plt.plot(range(len(Y)), Y, 'g-', label='test_set')
    plt.legend()
    plt.show()
    return predictedTestSet


# Load and parse the data
chunk = []

with open('file1.csv', 'r') as file_read:
    reader = csv.reader(file_read)
    for i, line in enumerate(reader):
        if i > 0:  # skip the header row
            chunk.append(line)

predictedSet = process_chunk(chunk)
print(predictedSet)

Result:

Coefficients: [ 0.06821406]
Residual sum of squares: 0.00
Variance score: 1.00
[ 121.39022086  170.9286349    64.34416748   96.61828528  124.28181483
  174.99828567]

Plot showing the fit (with an arbitrary x axis):
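Two details of the code above are easy to check on a small array. After the first two columns are dropped, four columns remain (profit, attr1, attr2, attr3, following the column order stated for this dataset), so the slice 3:5 selects a single column and index 2 is attr2, which is consistent with the single coefficient in the result. Also, squaring the mean of the residuals is not the same as averaging the squared residuals. The sketch below uses made-up numbers to illustrate both points.

```python
import numpy as np

# Made-up rows in the order profit, attr1, attr2, attr3.
data = np.array([
    [100.0, 10.0, 1.0, 15.0],
    [200.0, 20.0, 2.0, 30.0],
    [300.0, 30.0, 3.0, 45.0],
])

# Slicing past the last column does not raise; it returns what exists.
print(data[:, 3:5].shape)  # (3, 1): only attr3
print(data[:, 1:3].shape)  # (3, 2): attr1 and attr2

residuals = np.array([1.0, -1.0, 2.0, -2.0])
squared_mean = np.mean(residuals) ** 2  # (mean residual) squared
mean_squared = np.mean(residuals ** 2)  # mean of the squared residuals
print(squared_mean, mean_squared)       # 0.0 2.5
```

To predict profit from attr1 and attr2 with the array layout above, the label would be column 0 and the features columns 1:3.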

【Comments】:

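I updated my code based on your suggestion, and when I run it I see some strange results in the output. I have posted the updated code and the output in my post above.

Have you checked for missing values?

I have updated my post above with some of the csv data I have.

@pbu I updated my code in the post above and also provided some of my csv data. The dataset has no header, but the column order is company_id,state,profit,attr1,attr2,attr3.

Those are not random strange values; they come from your data. Check rows 11 and 12. Second, keep the header intact when running this code, since the code is written to drop it. Also, I had left an extra chunk in the print statements, which I have removed in the update.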
