Scikit-learn: Input contains NaN, infinity or a value too large for dtype('float64')
【Posted】: 2016-04-19 05:03:29
【Question】: I am using Python scikit-learn to run a simple linear regression on data read from a CSV file.
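import numpy as np
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in releases before 0.20

reader = pandas.io.parsers.read_csv("data/all-stocks-cleaned.csv")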
stock = np.array(reader)
openingPrice = stock[:, 1]
closingPrice = stock[:, 5]
print((np.min(openingPrice)))
print((np.min(closingPrice)))
print((np.max(openingPrice)))
print((np.max(closingPrice)))
openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)
openingPriceTrain = np.reshape(openingPriceTrain,(openingPriceTrain.size,1))
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
# openingPriceTrain = np.arange(openingPriceTrain, dtype=np.float64)
closingPriceTrain = np.reshape(closingPriceTrain,(closingPriceTrain.size,1))
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)
openingPriceTest = np.reshape(openingPriceTest,(openingPriceTest.size,1))
closingPriceTest = np.reshape(closingPriceTest,(closingPriceTest.size,1))
regression = linear_model.LinearRegression()
regression.fit(openingPriceTrain, closingPriceTrain)
predicted = regression.predict(openingPriceTest)
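The minimum and maximum values print as 0.0 0.6 41998.0 2593.9
But I get this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
How can I get rid of this error? Judging from the output above, the data does not seem to contain infinity or NaN values.
What is the solution?
Edit: all-stocks-cleaned.csv is available at http://www.sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/all-stocks-cleaned.csv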
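One possible reason a min/max check can look clean: calling np.array on a frame that mixes a text column (such as a date) with numeric columns typically yields an object-dtype array, and reductions on object arrays fall back to Python comparisons, which are always False against NaN, so a NaN may go unnoticed. The sketch below uses a small made-up frame (not the real CSV) and asks about missing values directly instead of inferring from min/max.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV: a text column next to a numeric column with a gap.
frame = pd.DataFrame({
    "Date": ["2016-04-15", "2016-04-18", "2016-04-19"],
    "Open": [10.5, np.nan, 12.0],
})

stock = np.array(frame)   # mixed column types -> dtype=object
opening = stock[:, 1]

print(stock.dtype)

# Ask explicitly about missing values rather than relying on min/max.
has_nan = bool(pd.isna(opening).any())
nan_count = int(pd.isna(opening).sum())
print(has_nan, nan_count)
```

pd.isna is used here because it accepts object arrays, whereas np.isnan requires a numeric dtype.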
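【Comments】:
Please try to provide a reproducible example.
@iled all-stocks-cleaned.csv is available at sharecsv.com/s/cb31790afc9b9e33c5919cdc562630f3/…
【Answer 1】: The problem with your regression is that NaN values have somehow crept into your data. This can easily be checked with the following code snippet: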
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
reader = pd.io.parsers.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)
openingPrice = stock[:, 1]
closingPrice = stock[:, 5]
openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)
openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)
closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)
openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)
np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()
(True, True, True)
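Since the error message also mentions infinity, it may be worth counting infinite entries alongside NaN. The following is a minimal sketch on a made-up array; summarize_non_finite is a hypothetical helper name, not part of NumPy or scikit-learn.

```python
import numpy as np

def summarize_non_finite(values):
    """Count NaN, +/-inf and finite entries in an array-like castable to float64."""
    arr = np.asarray(values, dtype=np.float64)
    return {
        "nan": int(np.isnan(arr).sum()),
        "inf": int(np.isinf(arr).sum()),
        "finite": int(np.isfinite(arr).sum()),
    }

# Made-up sample containing one NaN and two infinities.
sample = np.array([1.0, np.nan, np.inf, -np.inf, 2593.9])
report = summarize_non_finite(sample)
print(report)
```

A column is safe to pass to an estimator only when the "nan" and "inf" counts are both zero.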
If you impute the missing values, for example with the median of the non-missing entries:
openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])
your regression will run without any problems:
regression = linear_model.LinearRegression()
regression.fit(openingPriceTrain, closingPriceTrain)
predicted = regression.predict(openingPriceTest)
predicted[:5]
array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])
In short: as the error message says, your data contains missing values.
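As a variation on the manual median imputation, the sketch below puts the imputation inside a scikit-learn pipeline, so the median is learned from the training features only and reused on the test features. It runs on synthetic prices rather than the real CSV, and it drops rows whose target is missing instead of imputing the target; both choices are assumptions made for illustration, not the approach of the answer itself.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic prices, not the real CSV: closing is roughly 1.02 * opening.
rng = np.random.default_rng(42)
opening = rng.uniform(10.0, 100.0, size=200)
closing = 1.02 * opening + rng.normal(0.0, 0.5, size=200)
opening[::25] = np.nan    # knock out some feature values
closing[5::40] = np.nan   # knock out some target values

# Rows without a target cannot be used for supervised fitting, so drop them.
keep = ~np.isnan(closing)
X = opening[keep].reshape(-1, 1)
y = closing[keep]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# The imputer learns the median from the training features only,
# then reuses that value when transforming the test features.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)
predicted = model.predict(X_test)

print(predicted[:5])
print(np.isnan(predicted).any())
```

Keeping the imputer inside the pipeline means the same fitted median is applied wherever the model is used, which avoids computing a separate statistic from the test set.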
Edit:
Perhaps a simpler and more direct approach is to check for missing data right after reading the file with pandas:
data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True
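Then impute the data with either of the following two lines:
data = data.fillna(data.median(numeric_only=True))  # fill each numeric column with its own median
or
data = data.ffill()  # forward fill; fillna(method='ffill') is deprecated in recent pandas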
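A minimal sketch of the two imputation options on a made-up frame (the values are invented; only the column names mirror the CSV header). fillna expects values such as a scalar, dict or Series, so the per-column medians are computed first with data.median(numeric_only=True).

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the real data.
data = pd.DataFrame({
    "Date": ["2016-04-15", "2016-04-18", "2016-04-19", "2016-04-20"],
    "Open": [10.0, np.nan, 14.0, 12.0],
    "Close": [11.0, 13.0, np.nan, 12.5],
})

# Option 1: fill each numeric column with its own median.
by_median = data.fillna(data.median(numeric_only=True))

# Option 2: carry the last valid observation forward.
by_ffill = data.ffill()

print(by_median)
print(by_ffill)
```

Forward fill leaves a leading gap untouched when the first row of a column is missing, so it can be worth re-running data.isnull().any() afterwards.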
【Comments】:
np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any() (True, True, True): this part helped me pinpoint the problem, thank you very much.