Spectral clustering on sparse dataset

Posted 2023-03-12


【Asked】2016-04-24 16:55:58 【Question】:

I am applying spectral clustering (sklearn.cluster.SpectralClustering) to a dataset with fairly sparse features. When I run the clustering in Python, I get the following warning:

UserWarning: Graph is not fully connected, spectral embedding may not work as expected. warnings.warn("Graph is not fully connected, spectral embedding"

This is usually followed by an error like this:

`
File "****.py", line 120, in perform_clustering_spectral_clustering
  predicted_clusters = cluster.SpectralClustering(n_clusters=n).fit_predict(features)
File "****\sklearn\base.py", line 349, in fit_predict
  self.fit(X)
File "****\sklearn\cluster\spectral.py", line 450, in fit
  assign_labels=self.assign_labels)
File "****\sklearn\cluster\spectral.py", line 256, in spectral_clustering
  eigen_tol=eigen_tol, drop_first=False)
File "****\sklearn\manifold\spectral_embedding_.py", line 297, in spectral_embedding
  largest=False, maxiter=2000)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 462, in lobpcg
  activeBlockVectorBP, retInvR=True)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 112, in _b_orthonormalize
  gramVBV = cholesky(gramVBV)
File "****\scipy\linalg\decomp_cholesky.py", line 81, in cholesky
  check_finite=check_finite)
File "****\scipy\linalg\decomp_cholesky.py", line 30, in _cholesky
  raise LinAlgError("%d-th leading minor not positive definite" % info)
numpy.linalg.linalg.LinAlgError: 9-th leading minor not positive definite
numpy.linalg.linalg.LinAlgError: 9-th leading minor not positive definite
numpy.linalg.linalg.LinAlgError: the leading minor of order 12 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.`

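However, this warning/error does not always occur with the same settings (that is, the behavior is not very consistent, which makes it hard to test). It occurs for different values of n_clusters, but more often for n=2 and n > 7 (at least in my brief experience; as I mentioned, the behavior is not very consistent).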

How should I deal with this warning and the related error? Does it depend on the number of features? What if I add more?
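One way to investigate the "Graph is not fully connected" warning is to count the connected components of an affinity graph before clustering. The sketch below is an illustration only, not the asker's code: the helper name `count_graph_components`, the choice of a symmetrized k-nearest-neighbor graph, the `n_neighbors` value, and the synthetic data are all assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph


def count_graph_components(features, n_neighbors=10):
    """Count connected components of a symmetrized k-NN graph (hypothetical helper)."""
    # Directed k-NN connectivity graph: row i marks the neighbors of sample i.
    knn = kneighbors_graph(features, n_neighbors=n_neighbors, include_self=False)
    # Symmetrize so that an edge in either direction connects both samples.
    affinity = 0.5 * (knn + knn.T)
    n_components, labels = connected_components(affinity, directed=False)
    return n_components, labels


# Synthetic data (assumption): two tight groups placed far apart,
# so no k-NN edge can link one group to the other.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(30, 5))
group_b = rng.normal(loc=100.0, scale=0.1, size=(30, 5))
features = np.vstack([group_a, group_b])

n_components, labels = count_graph_components(features, n_neighbors=5)
print("connected components:", n_components)
```

If this reports more than one component, the graph is disconnected, which is the condition the warning describes. Raising `n_neighbors` or clustering each component separately are possible things to try, though whether either helps depends on the data.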

【Comments】:

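I assume you are using sklearn.cluster.SpectralClustering? You really need to mention that in the question. Also, please show the full traceback of the error and the warning, not just the last line.

Is your sparse similarity matrix positive definite?

I edited the post with the requested information. The matrix is probably not positive definite (since that is what the error says). The question is how to deal with it.

【Answer 1】: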

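I ran into this problem with n_clusters as well. Since this is unsupervised ML, there is no single correct value for n_clusters. In your case, n_clusters seems to lie between 3 and 7. Assuming you have some ground truth for the clustering, the best approach is to try several values of n_clusters and see whether any pattern emerges for the given dataset, while making sure to avoid overfitting. You can also use the silhouette coefficient (sklearn.metrics.silhouette_score).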
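The suggestion to try several values of n_clusters and compare silhouette coefficients could look roughly like the sketch below. It is a minimal illustration under stated assumptions: the helper name `silhouette_by_n_clusters`, the candidate range, and the `make_blobs` data are made up for the example, and the default affinity of SpectralClustering is used rather than anything specific to the asker's data.

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_n_clusters(features, candidates, random_state=0):
    """Fit SpectralClustering per candidate n_clusters (hypothetical helper)."""
    scores = {}
    for n in candidates:
        model = SpectralClustering(n_clusters=n, random_state=random_state)
        labels = model.fit_predict(features)
        # silhouette_score needs at least two distinct labels; skip otherwise.
        if len(set(labels)) < 2:
            continue
        scores[n] = silhouette_score(features, labels)
    return scores


# Synthetic data (assumption): four overlapping Gaussian blobs in 2-D.
features, _ = make_blobs(
    n_samples=150, centers=4, cluster_std=1.0, random_state=0
)

scores = silhouette_by_n_clusters(features, range(2, 7))
for n, score in sorted(scores.items()):
    print(f"n_clusters={n}: silhouette={score:.3f}")

# Candidate with the highest silhouette coefficient.
best_n = max(scores, key=scores.get)
print("best n_clusters by silhouette:", best_n)
```

The silhouette coefficient does not need ground-truth labels, so it can be used even when none are available. A higher value only indicates tighter, better-separated clusters under the chosen distance, so it is best treated as one signal among several.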
