sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs"

Posted 2023-02-23

【Title】: sklearn's PLSRegression: "ValueError: array must not contain infs or NaNs" 【Posted】: 2016-01-31 13:50:29 【Question】:

When using sklearn.cross_decomposition.PLSRegression:

import numpy as np
import sklearn.cross_decomposition

pls2 = sklearn.cross_decomposition.PLSRegression()
xx = np.random.random((5,5))
yy = np.zeros((5,5) ) 

yy[0,:] = [0,1,0,0,0]
yy[1,:] = [0,0,0,1,0]
yy[2,:] = [0,0,0,0,1]
#yy[3,:] = [1,0,0,0,0] # Uncommenting this line solves the issue

pls2.fit(xx, yy)

I get:

C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:44: RuntimeWarning: invalid value encountered in divide
  x_weights = np.dot(X.T, y_score) / np.dot(y_score.T, y_score)
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:64: RuntimeWarning: invalid value encountered in less
  if np.dot(x_weights_diff.T, x_weights_diff) < tol or Y.shape[1] == 1:
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:67: UserWarning: Maximum number of iterations reached
  warnings.warn('Maximum number of iterations reached')
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:297: RuntimeWarning: invalid value encountered in less
  if np.dot(x_scores.T, x_scores) < np.finfo(np.double).eps:
C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py:275: RuntimeWarning: invalid value encountered in less
  if np.all(np.dot(Yk.T, Yk) < np.finfo(np.double).eps):
Traceback (most recent call last):
  File "C:\svn\hw4\code\test_plsr2.py", line 8, in <module>
    pls2.fit(xx, yy)
  File "C:\Anaconda\lib\site-packages\sklearn\cross_decomposition\pls_.py", line 335, in fit
    linalg.pinv(np.dot(self.x_loadings_.T, self.x_weights_)))
  File "C:\Anaconda\lib\site-packages\scipy\linalg\basic.py", line 889, in pinv
    a = _asarray_validated(a, check_finite=check_finite)
  File "C:\Anaconda\lib\site-packages\scipy\_lib\_util.py", line 135, in _asarray_validated
    a = np.asarray_chkfinite(a)
  File "C:\Anaconda\lib\site-packages\numpy\lib\function_base.py", line 613, in asarray_chkfinite
    "array must not contain infs or NaNs")
ValueError: array must not contain infs or NaNs

What could be the issue?

I am aware of scikit-learn GitHub issue #2089, but since I use scikit-learn 0.16.1 (with Python 2.7.10 x64), that issue should already be fixed (the code snippets mentioned in the GitHub issue work fine).

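【Comments】:

Is this an example you are trying to replicate, or did you define this data yourself? You may have an ill-conditioned problem, so that some weights become NaN or inf and the computation fails at a later point.

@Chris My original data is larger (dropbox.com/s/zjrz6upfeln07ga/SE-sklearn-PLSR.zip?dl=0); I was trying to narrow the problem down.

【Answer 1】:

Check whether any of the values you pass in are NaN or inf: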

np.isnan(xx).any()
np.isnan(yy).any()

np.isinf(xx).any()
np.isinf(yy).any()

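If any of these evaluates to True, remove the NaN or inf entries. For example, np.nan_to_num replaces NaN with 0 and infinities with very large finite values: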

xx = np.nan_to_num(xx)
yy = np.nan_to_num(yy)
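
As one of the comments on this answer points out, the NaN and inf checks can be combined with np.isfinite. The following is a minimal sketch of that idea, assuming plain numeric NumPy arrays. The helper name `report_non_finite` is made up for illustration and is not part of NumPy or scikit-learn, and the NaN in `yy` is planted on purpose.

```python
import numpy as np

def report_non_finite(name, arr):
    """Return True if arr holds only finite values; otherwise print where it does not."""
    arr = np.asarray(arr, dtype=float)
    finite = np.isfinite(arr)  # False for NaN, +inf and -inf
    if finite.all():
        return True
    bad = np.argwhere(~finite)
    print("%s has %d non-finite entries, first at index %s"
          % (name, len(bad), bad[0].tolist()))
    return False

xx = np.random.random((5, 5))
yy = np.zeros((5, 5))
yy[1, 2] = np.nan  # planted on purpose so the check has something to find

xx_ok = report_non_finite("xx", xx)
yy_ok = report_non_finite("yy", yy)
```

For the arrays in the question this kind of check passes, so a clean result here only rules out bad input; it does not rule out NaNs created inside the library.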

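It is also possible that numpy is being fed a mix of such large positive and negative values and zeros that equations deep inside the library produce zero, NaN, or inf. Oddly enough, one workaround is to pass in smaller numbers (for example, representative numbers between -1 and 1). One way to do this is standardization; see: https://***.com/a/36390482/445131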
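
The linked answer is not reproduced here, so the following is only one common form of standardization, sketched under the assumption that each column is a feature: subtract the column mean and divide by the column standard deviation. `standardize_columns` is a hypothetical helper, and the sample values are invented.

```python
import numpy as np

def standardize_columns(arr):
    """Scale each column to zero mean and unit variance (hypothetical helper)."""
    arr = np.asarray(arr, dtype=float)
    mean = arr.mean(axis=0)
    std = arr.std(axis=0)
    std[std == 0.0] = 1.0  # constant columns become all zeros instead of 0/0
    return (arr - mean) / std

# Invented data: one very large column and one constant column
big = np.array([[1e12, 2.0],
                [3e12, 2.0],
                [5e12, 2.0]])
scaled = standardize_columns(big)
```

The guard on zero standard deviation matters: without it, a constant column would itself turn into NaN. scikit-learn's sklearn.preprocessing.StandardScaler offers the same kind of scaling as a reusable transformer.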

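If none of this solves the problem, you may be dealing with a low-level bug in the library you are using, or with some kind of singularity in your data. Create an SSCCE and post it to ***, or file a new bug report with the maintainers of the library.

【Comments】:

Thanks. Both evaluate to False: np.isnan(xx).any(): False; np.isnan(yy).any(): False. I have narrowed the problem down: github.com/scikit-learn/scikit-learn/issues/…

Interesting. Your GitHub issue points to some kind of singularity in the data. scikit-learn should handle these gracefully, so the way it behaves now is definitely a bug.

The NaN and inf checks can be combined with np.isfinite().

【Answer 2】:

The problem is caused by a bug in scikit-learn. I reported it on GitHub: https://github.com/scikit-learn/scikit-learn/issues/2089#issuecomment-152753095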
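
The first RuntimeWarning in the question points at a division by np.dot(y_score.T, y_score). The sketch below assumes, without confirmation from the issue thread, that y_score can be an all-zero column of Y at that point. Under that assumption it evaluates the same expression with such a column and lists the constant columns of the question's yy. `constant_columns` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def constant_columns(arr):
    """Return the indices of columns whose values are all identical."""
    arr = np.asarray(arr, dtype=float)
    return np.flatnonzero(arr.max(axis=0) == arr.min(axis=0)).tolist()

# The yy from the question, with the yy[3, :] line still commented out
yy = np.zeros((5, 5))
yy[0, :] = [0, 1, 0, 0, 0]
yy[1, :] = [0, 0, 0, 1, 0]
yy[2, :] = [0, 0, 0, 0, 1]

const_cols = constant_columns(yy)

# Same expression as in the first warning, evaluated with an all-zero y_score
X = np.random.random((5, 5))
y_score = yy[:, [0]]
with np.errstate(invalid="ignore", divide="ignore"):
    x_weights = np.dot(X.T, y_score) / np.dot(y_score.T, y_score)
```

Here every entry of `x_weights` is NaN, because both the numerator and the denominator are zero. Uncommenting the yy[3, :] line in the question makes column 0 non-constant, which is consistent with this reading, although column 2 stays all zeros either way.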

【Comments】:

【Answer 3】:

I found a tricky little workaround that worked for me.

I was using the following code to featurize a time series with cesium:

timeInput = np.array(timeData)
valueInput = np.array(data)

#Featurizing Data
featurizedData = featurize.featurize_time_series(times=timeInput,
                                                 values=valueInput,
                                                 errors=None,
                                                 features_to_use=featuresToUse)

It raised this error:

ValueError: array must not contain infs or NaNs

Just for laughs, I checked the length and type of the data:

data:
70
<class 'numpy.int32'>

timeData: 
70
<class 'numpy.float64'>

I decided to try converting the data type with this line:

valueInput = valueInput.astype(float)

It worked, resulting in this code:

import numpy as np
from cesium import featurize

timeInput = np.array(timeData)
valueInput = np.array(data)
# Cast the int32 values to float so both arrays are floating point
valueInput = valueInput.astype(float)

# Featurizing data
featurizedData = featurize.featurize_time_series(times=timeInput,
                                                 values=valueInput,
                                                 errors=None,
                                                 features_to_use=featuresToUse)

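If you run into an error like this, try making the data types match.

【Comments】:

【Answer 4】:

I could reproduce the same error, and I got rid of it by filtering out all the 0s: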
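
The dtype fix above can be wrapped in a small helper so that every input is converted the same way. This is only a sketch: `as_float_arrays` is a hypothetical name, and the `timeData` and `data` values are invented stand-ins with the length and dtypes reported in this answer.

```python
import numpy as np

def as_float_arrays(*arrays):
    """Convert every input to a float64 NumPy array (hypothetical helper)."""
    return [np.asarray(a, dtype=np.float64) for a in arrays]

# Invented stand-ins: 70 float64 time stamps and 70 int32 values
timeData = np.linspace(0.0, 6.9, 70)
data = np.arange(70, dtype=np.int32)

timeInput, valueInput = as_float_arrays(timeData, data)
```

After the call both arrays share the float64 dtype, so they can be passed on without a separate astype step.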

threshold_for_bug = 0.00000001 # could be any value, ex numpy.min
xx[xx < threshold_for_bug] = threshold_for_bug

This makes the error go away (I never checked how it affects precision).

My system info:

numpy-1.11.2
python-3.5
macOS Sierra
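
The thresholding in this answer modifies xx in place. A minimal sketch of an equivalent out-of-place form, using np.maximum and a small invented array, is shown below; it also counts how many entries get changed, which is one way to judge how much the workaround alters the data.

```python
import numpy as np

threshold_for_bug = 1e-8  # same value as in the answer

# Invented sample data containing exact zeros
xx = np.array([[0.0, 0.5],
               [0.25, 0.0]])

# In-place form used in the answer, applied to a copy
masked = xx.copy()
masked[masked < threshold_for_bug] = threshold_for_bug

# Equivalent out-of-place form; xx itself is left untouched
floored = np.maximum(xx, threshold_for_bug)
n_changed = int((xx < threshold_for_bug).sum())
```

Both forms also raise any negative value to the threshold, so this is only appropriate for data that is non-negative to begin with, such as the np.random.random values in the question.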

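【Comments】:

【Answer 5】:

You may want to check whether any of the weights are negative, since negative weights can also trigger this error.

【Comments】: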
