MemoryError from sklearn.metrics.silhouette_samples
Posted: 2018-05-22 00:20:33

Question: I am getting a MemoryError when attempting to call sklearn.metrics.silhouette_samples. My use case is identical to the tutorial. I am using scikit-learn 0.18.1 in Python 3.5.
For the related function silhouette_score, this post suggests using the sample_size parameter, which reduces the sample size before calling silhouette_samples. I am not sure that downsampling would still produce reliable results, so I hesitate to do that.
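For reference, the downsampled route is a one-liner; sample_size and random_state are documented parameters of silhouette_score, and the 10000 here is an arbitrary choice (X and labels as in the question):

from sklearn.metrics import silhouette_score

# Scores a random subset of 10000 points instead of all ~107545 rows,
# so only a 10000 x 10000 distance matrix is ever materialized.
score = silhouette_score(X, labels, metric='euclidean',
                         sample_size=10000, random_state=42)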
My input X is a [107545 rows x 12 columns] dataframe, and while I only have 8 GB of RAM, I would not call that huge:
sklearn.metrics.silhouette_samples(X, labels, metric='euclidean')
---------------------------------------------------------------------------
MemoryError Traceback (most recent call last)
<ipython-input-39-7285690e9ce8> in <module>()
----> 1 silhouette_samples(df_scaled, df['Cluster_Label'])
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\cluster\unsupervised.py in silhouette_samples(X, labels, metric, **kwds)
167 check_number_of_labels(len(le.classes_), X.shape[0])
168
--> 169 distances = pairwise_distances(X, metric=metric, **kwds)
170 unique_labels = le.classes_
171 n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1245 func = partial(distance.cdist, metric=metric, **kwds)
1246
-> 1247 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1248
1249
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1088 if n_jobs == 1:
1089 # Special case to avoid picklability checks in delayed
-> 1090 return func(X, Y, **kwds)
1091
1092 # TODO: in some cases, backend='threading' may be appropriate
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\metrics\pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
244 YY = row_norms(Y, squared=True)[np.newaxis, :]
245
--> 246 distances = safe_sparse_dot(X, Y.T, dense_output=True)
247 distances *= -2
248 distances += XX
C:\Users\KE56166\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
MemoryError:
The computation seems to rely on euclidean_distances, which crashed when calling np.dot. I am not dealing with sparsity here, so maybe there is no solution. When computing distances I normally use numpy.linalg.norm(A-B). Does that have better memory handling?
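It can, if applied one reference point at a time, since that only ever materializes a single row of the distance matrix. A sketch of that idea (not the asker's actual code):

import numpy as np

def distances_to_point(X, i):
    # One row of the full n x n distance matrix at a time:
    # O(n_samples) memory instead of O(n_samples**2).
    return np.linalg.norm(X - X[i], axis=1)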
Comments:
Answer 1: Update: PR 11135 should resolve this within scikit-learn, making the rest of this post obsolete.
You have some 100000 = 1e5 samples, which are points in 12-dimensional space. The pairwise_distances method tries to compute all pairwise distances between them. That is (1e5)**2 = 1e10 distances. Each one is a floating-point number; the float64 format takes 8 bytes of memory. So the distance matrix is 8e10 bytes in size, which is 74.5 GB.
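The arithmetic is easy to check, and it is slightly worse still for the asker's exact row count:

n = 107545                  # rows in the asker's dataframe
print(1e5**2 * 8 / 2**30)   # ~74.5 GiB, the round figure above
print(n * n * 8 / 2**30)    # ~86.2 GiB -- far beyond 8 GB of RAM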
This has occasionally been reported on GitHub: #4701, #4197, with the answer roughly being: it is a NumPy problem that it cannot handle np.dot with matrices of that size. Although there was one comment saying

"it could be broken up into sub-matrices to make the computation more memory efficient."

Indeed, if instead of forming one giant distance matrix at the beginning, the method computed the relevant chunks of it in the loop over labels, that would require less memory.
It is not hard to modify the method, using its source, so that it masks first instead of computing the distances first and applying a binary mask afterwards. That is what I did below. Instead of N**2 memory, where N is the number of samples, it requires n**2, where n is the maximal cluster size.

If this looks practical, I imagine it could be added to scikit-learn behind some flag... though note that this version does not support metric='precomputed'.
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])
    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find inter_clust_dist for all samples belonging to the same
        # label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            # Only an n x n block of the distance matrix, where n is
            # the size of the current cluster.
            intra_distances = pairwise_distances(X[mask, :], metric=metric, **kwds)
            intra_clust_dists[mask] = np.sum(intra_distances, axis=1) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                inter_distances = pairwise_distances(X[mask, :], X[other_mask, :],
                                                     metric=metric, **kwds)
                other_distances = np.mean(inter_distances, axis=1)
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
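A minimal usage sketch, reusing the variable names from the asker's traceback (df_scaled is the scaled feature matrix, df['Cluster_Label'] the cluster assignments):

# Same call shape as the failing one in the question, but with the
# memory-saving replacement.
scores = silhouette_samples_memory_saving(df_scaled, df['Cluster_Label'].values)
print(scores.mean())  # the mean over samples equals the overall silhouette_score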
Comments:
This is not for the clustering itself but for analyzing the resulting clusters. In any case, I never need all the distances in memory at once. For each point, I need the mean distance to all the other points within a single cluster. That means at most 8e5 bytes = 0.000745 GB at a time, which then gets averaged. Then just loop over all points. Am I missing something, or is the code simply not handling memory in a way that prevents this?

Mathematically, yes (though you also need the mean distance to the points of another cluster). But the computation is vectorized, which is typical of NumPy-based code. Vectorization is good for many things, but its memory usage can come at a prohibitive cost. I updated the answer. If you end up writing a memory-conservative form of this method, please post it; this does not seem to be an active concern for scikit-learn, where all the related issues are closed.

I just posted a memory-saving version, see whether it works for you. If you write a PR, a parameter is probably best, since everyone's CPU/memory balance is different and no single threshold suits everyone. Also, the current method supports metric='precomputed' while the memory-saving version does not (so a "saving" parameter would have to be ignored when the metric is precomputed).

Using intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype) here may be a bad idea, since X could be of an integer dtype. Perhaps compute pairwise_distances(X[:100, :]) and use its dtype. Or just force np.float64...
Escalated to a GitHub issue: github.com/scikit-learn/scikit-learn/issues/10279

Answer 2:
I developed a memory-efficient and relatively fast solution for the euclidean distance case using numba. It runs in roughly constant memory relative to the input data size, and uses numba's automatic parallelization. With it I was able to process a 300000-row dataset with 24 dimensions, which would have required roughly 720 GB of RAM for the full pairwise-distance matrix. It can be modified as needed to implement other distance metrics.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels
from numba import jit

@jit(nogil=True, parallel=True)
def euclidean_distances_numba(X, Y=None, Y_norm_squared=None):
    # disable checks
    XX_ = (X * X).sum(axis=1)
    XX = XX_.reshape((1, -1))

    if X is Y:  # shortcut in the common case euclidean_distances(X, X)
        YY = XX.T
    elif Y_norm_squared is not None:
        YY = Y_norm_squared
    else:
        YY_ = np.sum(Y * Y, axis=1)
        YY = YY_.reshape((1, -1))

    # Squared distances via the identity |x - y|^2 = |x|^2 - 2 x.y + |y|^2,
    # clipped at 0 before the square root to guard against rounding errors.
    distances = np.dot(X, Y.T)
    distances *= -2
    distances += XX
    distances += YY
    distances = np.maximum(distances, 0)
    return np.sqrt(distances)

@jit(parallel=True)
def euclidean_distances_sum(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    sums = np.zeros((len(X)))
    for i in range(len(X)):
        # One row of the distance matrix at a time, so memory stays flat.
        base_row = X[i, :]
        sums[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y,
                                            Y_norm_squared=Y_norm_squared).sum()
    return sums

@jit(parallel=True)
def euclidean_distances_mean(X, Y=None):
    if Y is None:
        Y = X
    Y_norm_squared = (Y ** 2).sum(axis=1)
    means = np.zeros((len(X)))
    for i in range(len(X)):
        base_row = X[i, :]
        means[i] = euclidean_distances_numba(base_row.reshape(1, -1), Y,
                                             Y_norm_squared=Y_norm_squared).mean()
    return means

def silhouette_samples_memory_saving(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    check_number_of_labels(len(le.classes_), X.shape[0])
    unique_labels = le.classes_
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # For sample i, store the mean distance of the cluster to which
    # it belongs in intra_clust_dists[i]
    intra_clust_dists = np.zeros(X.shape[0], dtype=X.dtype)

    # For sample i, store the mean distance of the second closest
    # cluster in inter_clust_dists[i]
    inter_clust_dists = np.inf + intra_clust_dists

    for curr_label in range(len(unique_labels)):
        # Find inter_clust_dist for all samples belonging to the same label.
        mask = labels == curr_label

        # Leave out current sample.
        n_samples_curr_lab = n_samples_per_label[curr_label] - 1
        if n_samples_curr_lab != 0:
            intra_clust_dists[mask] = euclidean_distances_sum(X[mask, :]) / n_samples_curr_lab

        # Now iterate over all other labels, finding the mean
        # cluster distance that is closest to every sample.
        for other_label in range(len(unique_labels)):
            if other_label != curr_label:
                other_mask = labels == other_label
                other_distances = euclidean_distances_mean(X[mask, :], X[other_mask, :])
                inter_clust_dists[mask] = np.minimum(inter_clust_dists[mask], other_distances)

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
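A usage sketch with synthetic data (shapes made up for illustration); note that the first call also pays numba's JIT compilation cost:

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(10000, 24)
labels = rng.randint(0, 5, size=10000)

scores = silhouette_samples_memory_saving(X, labels)  # first call JIT-compiles
print(scores.mean())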
Comments:
Answer 3: The accepted answer is much better on memory than the official function, going from len(data)^2 down to len(cluster)^2. If your clusters are big enough, that can still be a problem. I wrote the following, which is ~len(data) in memory, but it is dreadfully slow.
import numpy as np
from sklearn.utils import check_X_y
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics.cluster.unsupervised import check_number_of_labels

def silhouette_samples_newest(X, labels, metric='euclidean', **kwds):
    X, labels = check_X_y(X, labels, accept_sparse=['csc', 'csr'])
    le = LabelEncoder()
    labels = le.fit_transform(labels)
    unique_labels = le.classes_
    check_number_of_labels(len(unique_labels), X.shape[0])
    n_samples_per_label = np.bincount(labels, minlength=len(unique_labels))

    # Mean distance from each point to the *other* points of its own
    # cluster: the zero self-distance adds nothing to the sum, so divide
    # by the cluster size minus one, matching the official implementation
    # (a plain .mean() would divide by the full cluster size).
    intra_clust_dists = np.array([
        np.linalg.norm(X[labels == labels[i], :] - point, axis=1).sum()
        / max(n_samples_per_label[labels[i]] - 1, 1)
        for i, point in enumerate(X)
    ])
    # Smallest mean distance from each point to the points of any other
    # cluster. Iterate over the encoded label values, since `labels` has
    # already been transformed by the LabelEncoder.
    inter_clust_dists = np.array([
        min(np.linalg.norm(X[labels == label, :] - point, axis=1).mean()
            for label in range(len(unique_labels)) if label != labels[i])
        for i, point in enumerate(X)
    ])

    sil_samples = inter_clust_dists - intra_clust_dists
    sil_samples /= np.maximum(intra_clust_dists, inter_clust_dists)
    # score 0 for clusters of size 1, according to the paper
    sil_samples[n_samples_per_label.take(labels) == 1] = 0
    return sil_samples
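One way to sanity-check any of these replacements is to compare them against the stock implementation on a subsample small enough for the full distance matrix to fit in memory (a sketch with synthetic data):

import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.RandomState(0)
X_small = rng.rand(500, 12)
labels_small = rng.randint(0, 4, size=500)

assert np.allclose(silhouette_samples(X_small, labels_small),
                   silhouette_samples_newest(X_small, labels_small))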
Comments: