K-Means GridSearchCV hyperparameter tuning

【Title】: K-Means GridSearchCV hyperparameter tuning 【Posted】: 2020-09-12 02:04:56 【Question】:

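I am trying to perform hyperparameter tuning for spatio-temporal K-Means clustering by using it in a pipeline with a Decision Tree classifier. The idea is to use the K-Means clustering algorithm to generate a cluster-distance space matrix and cluster labels, which are then passed to the Decision Tree classifier. For the hyperparameter tuning, only the parameters of the K-Means algorithm are tuned.

I am using Python 3.8 and sklearn 0.22.

The data I am interested in has 3 columns/attributes: 'time', 'x' and 'y' (x and y are spatial coordinates).

The code is: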

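import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array


class ST_KMeans(BaseEstimator, TransformerMixin):
# class ST_KMeans():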
    """
    Note that K-means clustering algorithm is designed for Euclidean distances.
    It may stop converging with other distances, when the mean is no longer a
    best estimation for the cluster 'center'.

    The 'mean' minimizes squared differences (or, squared Euclidean distance).
    If you want a different distance function, you need to replace the mean with
    an appropriate center estimation.


    Parameters:

    k:  number of clusters

    eps1 : float, default=0.5
        The spatial density threshold (maximum spatial distance) between 
        two points to be considered related.

    eps2 : float, default=10
        The temporal threshold (maximum temporal distance) between two 
        points to be considered related.

    metric : string default='euclidean'
        The used distance metric - more options are
        ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’,
        ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’,
        ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘rogerstanimoto’, ‘sqeuclidean’,
        ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘yule’.

    n_jobs : int or None, default=1
        The number of processes to start; -1 means use all processors (BE AWARE)


    Attributes:

    labels : array, shape = [n_samples]
        Cluster labels for the data
    """

    def __init__(self, k, eps1 = 0.5, eps2 = 10, metric = 'euclidean', n_jobs = 1):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        # self.min_samples = min_samples
        self.metric = metric
        self.n_jobs = n_jobs


    def fit(self, X, Y = None):
        """
        Apply the ST K-Means algorithm 

        X : 2D numpy array. The first attribute of the array should be time attribute
            as float. The following positions in the array are treated as spatial
            coordinates.
            The structure should look like this [[time_step1, x, y], [time_step2, x, y]..]

            For example 2D dataset:
            array([[0,0.45,0.43],
            [0,0.54,0.34],...])


        Returns:

        self
        """

        # check if input is correct
        X = check_array(X)

        # type(X)
        # numpy.ndarray

        # Validate the eps thresholds-
        if not self.eps1 > 0.0 or not self.eps2 > 0.0:
            raise ValueError('eps1 and eps2 must be positive')

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        '''
        Filter the Euclidean distance matrix using the time distance matrix: wherever the time
        distance is at most 'eps2', the spatial distance is kept. Every other entry is set to
        2 * 'eps1', which is larger than 'eps1', so those pairs are not treated as related.
        '''
        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)


        # Initialize K-Means clustering model-
        self.kmeans_clust_model = KMeans(
            n_clusters = self.k, init = 'k-means++',
            n_init = 10, max_iter = 300,
            precompute_distances = 'auto', algorithm = 'auto')

        # Train model-
        self.kmeans_clust_model.fit(dist)


        self.labels = self.kmeans_clust_model.labels_
        self.X_transformed = self.kmeans_clust_model.fit_transform(X)

        return self


    def transform(self, X):
        if not isinstance(X, np.ndarray):
            # Convert to numpy array-
            X = X.values

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

        # return self.kmeans_clust_model.transform(X)
        return self.kmeans_clust_model.transform(dist)


# Initialize ST-K-Means object-
st_kmeans_algo = ST_KMeans(
    k = 5, eps1=0.6,
    eps2=9, metric='euclidean',
    n_jobs=1
    )

Y = np.zeros(shape = (501,))

# Train on a chunk of dataset-
st_kmeans_algo.fit(data.loc[:500, ['time', 'x', 'y']], Y)

# Get clustered data points labels-
kmeans_labels = st_kmeans_algo.labels

kmeans_labels.shape
# (501,)


# Get labels for points clustered using trained model-
# kmeans_transformed = st_kmeans_algo.X_transformed
kmeans_transformed = st_kmeans_algo.transform(data.loc[:500, ['time', 'x', 'y']])

kmeans_transformed.shape
# (501, 5)

dtc = DecisionTreeClassifier()

dtc.fit(kmeans_transformed, kmeans_labels)

y_pred = dtc.predict(kmeans_transformed)

# Get model performance metrics-
accuracy = accuracy_score(kmeans_labels, y_pred)
precision = precision_score(kmeans_labels, y_pred, average='macro')
recall = recall_score(kmeans_labels, y_pred, average='macro')

print("\nDT model metrics are:")
print("accuracy = {0:.4f}, precision = {1:.4f} & recall = {2:.4f}\n".format(
    accuracy, precision, recall
    ))

# DT model metrics are:
# accuracy = 1.0000, precision = 1.0000 & recall = 1.0000




# Hyper-parameter Tuning:

# Define steps of pipeline-
pipeline_steps = [
    ('st_kmeans_algo' ,ST_KMeans(k = 5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier())
    ]

# Instantiate a pipeline-
pipeline = Pipeline(pipeline_steps)

kmeans_transformed.shape, kmeans_labels.shape
# ((501, 5), (501,))

# Train pipeline-
pipeline.fit(kmeans_transformed, kmeans_labels)




# Specify parameters to be hyper-parameter tuned-
params = [
    {
        'st_kmeans_algo__k': [3, 5, 7]
    }
    ]

# Initialize GridSearchCV object-
grid_cv = GridSearchCV(estimator=pipeline, param_grid=params, cv = 2)

# Train GridSearch on computed data from above-
grid_cv.fit(kmeans_transformed, kmeans_labels)

The 'grid_cv.fit()' call gives the following error:

ValueError                                Traceback (most recent call last)
      5
      6 # Train GridSearch on computed data from above-
----> 7 grid_cv.fit(kmeans_transformed, kmeans_labels)

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    708                 return results
    709
--> 710             self._run_search(evaluate_candidates)
    711
    712         # For multi-metric evaluation, store the best_index_, best_params_ and

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1149     def _run_search(self, evaluate_candidates):
   1150         """Search all candidates in param_grid"""
-> 1151         evaluate_candidates(ParameterGrid(self.param_grid))
   1152
   1153

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params)
    680                               n_splits, n_candidates, n_candidates * n_splits))
    681
--> 682                 out = parallel(delayed(_fit_and_score)(clone(base_estimator),
    684                                                        train=train, test=test,

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1002             # remaining jobs.
-> 1004             if self.dispatch_one_batch(iterator):
   1005                 self._iterating = self._original_iterator is not None
   1006

~/.local/lib/python3.8/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    833                 return False
    834             else:
--> 835                 self._dispatch(tasks)

~/.local/lib/python3.8/site-packages/joblib/parallel.py in _dispatch(self, batch)
    752         with self._lock:
--> 754             job = self._backend.apply_async(batch, callback=cb)
    756             # called before we get here, causing self._jobs

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    207     def apply_async(self, func, callback=None):
    208         """Schedule a func to be run"""
--> 209         result = ImmediateResult(func)
    210         if callback:
    211             callback(result)

~/.local/lib/python3.8/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    588         # Don't delay the application, to avoid keeping the input
--> 590         self.results = batch()
    591

~/.local/lib/python3.8/site-packages/joblib/parallel.py in __call__(self)
    253         # change the default number of processes to -1
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255             return [func(*args, **kwargs)
    256                     for func, args, kwargs in self.items]
    257

~/.local/lib/python3.8/site-packages/joblib/parallel.py in <listcomp>(.0)
    253         # change the default number of processes to -1
    254         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 255             return [func(*args, **kwargs)
    256                     for func, args, kwargs in self.items]
    257

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
    542     else:
--> 544         test_scores = _score(estimator, X_test, y_test, scorer)
    545         score_time = time.time() - start_time - fit_time
    546         if return_train_score:

~/.local/lib/python3.8/site-packages/sklearn/model_selection/_validation.py in _score(estimator, X_test, y_test, scorer)
    589         scores = scorer(estimator, X_test)
    590     else:
--> 591         scores = scorer(estimator, X_test, y_test)
    592
    593     error_msg = ("scoring must return a number, got %s (%s) "

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in __call__(self, estimator, *args, **kwargs)
     87                                       *args, **kwargs)
     88             else:
---> 89                 score = scorer(estimator, *args, **kwargs)
     90             scores[name] = score
     91         return scores

~/.local/lib/python3.8/site-packages/sklearn/metrics/_scorer.py in _passthrough_scorer(estimator, *args, **kwargs)
--> 371     return estimator.score(*args, **kwargs)
    372

~/.local/lib/python3.8/site-packages/sklearn/utils/metaestimators.py in <lambda>(*args, **kwargs)
    114
    115         # lambda, but not partial, allows help() to work with update_wrapper
--> 116         out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
    117         # update the docstring of the returned function
    118         update_wrapper(out, self.fn)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in score(self, X, y, sample_weight)
    617         if sample_weight is not None:
--> 619         return self.steps[-1][-1].score(Xt, y, **score_params)
    620

~/.local/lib/python3.8/site-packages/sklearn/base.py in score(self, X, y, sample_weight)
    368         from .metrics import accuracy_score
--> 369         return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
    370
    371

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in accuracy_score(y_true, y_pred, normalize, sample_weight)
    183
    184     # Compute accuracy for each possible representation
--> 185     y_type, y_true, y_pred = _check_targets(y_true, y_pred)
    187     if y_type.startswith('multilabel'):

~/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py in _check_targets(y_true, y_pred)
     78     y_pred : array or indicator matrix
     79     """
---> 80     check_consistent_length(y_true, y_pred)
     81     type_true = type_of_target(y_true)
     82     type_pred = type_of_target(y_pred)

~/.local/lib/python3.8/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    209     uniques = np.unique(lengths)
    210     if len(uniques) > 1:
--> 211         raise ValueError("Found input variables with inconsistent numbers of"
    212                          " samples: %r" % [int(l) for l in lengths])
    213

ValueError: Found input variables with inconsistent numbers of samples: [251, 250]

The different dimensions/shapes are:

kmeans_transformed.shape, kmeans_labels.shape, data.loc[:500, ['time', 'x', 'y']].shape                                       
# ((501, 5), (501,), (501, 3))

I don't understand how the error arrives at "samples: [251, 250]".

What is going wrong?

Thanks!

【Answer 1】:

250 and 251 are the sizes of the training and validation splits, respectively, inside GridSearchCV.
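The split sizes can be checked with a short sketch. It assumes a plain `KFold` with two folds and a placeholder array of the same shape as `kmeans_transformed`. For a classifier, GridSearchCV actually uses `StratifiedKFold`, so the exact sizes it produces depend on the label distribution, but the two parts still add up to 501.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder with the same shape as kmeans_transformed (assumption: only the row count matters here)
X = np.zeros((501, 5))

# Collect (train size, validation size) for each of the two folds
fold_sizes = [
    (len(train_idx), len(test_idx))
    for train_idx, test_idx in KFold(n_splits=2).split(X)
]
print(fold_sizes)  # [(250, 251), (251, 250)]
```

In the first fold the pipeline is fitted on 250 rows and scored on 251, which matches the two lengths in the error message.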

Look at your custom estimator...

def transform(self, X):

    return self.X_transformed

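The original transform method does not apply any kind of operation; it simply returns the training data. We need an estimator that can transform new data (in this case, the validation split in the grid search) in a flexible way. Change the transform method like this: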

def transform(self, X):

    return self.kmeans_clust_model.transform(X)
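As a rough illustration of why this helps, the sketch below fits a plain `KMeans` on raw features and then transforms a second array with a different number of rows. The random arrays and their sizes are assumptions chosen to mirror the 250/251 split; they are not the asker's data, and the model here is fitted on the raw columns rather than on a distance matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.random((250, 3))  # stand-in for the training split: time, x, y
X_valid = rng.random((251, 3))  # stand-in for the validation split

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_train)

# transform() returns the distance of every row to each cluster centre,
# so the row count follows the input while the column count equals n_clusters
dist_train = kmeans.transform(X_train)
dist_valid = kmeans.transform(X_valid)
print(dist_train.shape, dist_valid.shape)
```

The only requirement is that the new data has the same number of columns as the data used in `fit`.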

【Comments】:

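Changing 'transform()' to your suggestion gives the following error when using the code:

I have edited the 'transform()' method to include the pre-processing, since 'X' is spatio-temporal data. Also, if you just pass 'X', it gives the error: ValueError: Incorrect number of features. Got 3 features, expected 501.

Implementing your suggestion still gives the error: "ValueError: Incorrect number of features. Got 251 features, expected 250"

You are feeding KMeans a distance matrix of dimension (train_samples, train_samples). How can you obtain a prediction on new data? You can only pass a matrix of dimension (train_samples, train_samples).

What do you suggest?

Use the first 500 (not 501) rows of your data and cv=2... this is a trial, let me know.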
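One possible way around the shape mismatch raised in the comments, sketched here under stated assumptions and not taken from the thread, is to store the training rows in `fit` and build the filtered distance matrix between the incoming rows and those stored rows with `scipy.spatial.distance.cdist`. The matrix then always has `train_samples` columns, whatever the number of incoming rows. The class name `STKMeansSketch`, the synthetic data and the random placeholder target are all hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier


class STKMeansSketch(BaseEstimator, TransformerMixin):
    """Hypothetical variant of ST_KMeans that can transform unseen rows."""

    def __init__(self, k=5, eps1=0.5, eps2=10, metric='euclidean'):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        self.metric = metric

    def _st_dist(self, X):
        # Rows: samples in X; columns: samples stored during fit
        time_dist = cdist(X[:, :1], self.X_fit_[:, :1], metric=self.metric)
        euc_dist = cdist(X[:, 1:], self.X_fit_[:, 1:], metric=self.metric)
        # Same filter as in the question: keep spatial distance when close in time
        return np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

    def fit(self, X, y=None):
        self.X_fit_ = np.asarray(X, dtype=float)
        self.kmeans_ = KMeans(n_clusters=self.k, n_init=10, random_state=0)
        self.kmeans_.fit(self._st_dist(self.X_fit_))
        return self

    def transform(self, X):
        return self.kmeans_.transform(self._st_dist(np.asarray(X, dtype=float)))


# Synthetic stand-in for the (time, x, y) data and a placeholder target
rng = np.random.default_rng(0)
X = np.column_stack([rng.integers(0, 20, 501), rng.random(501), rng.random(501)]).astype(float)
y = rng.integers(0, 3, 501)

pipe = Pipeline([
    ('st_kmeans_algo', STKMeansSketch()),
    ('dtc', DecisionTreeClassifier(random_state=0)),
])
grid_cv = GridSearchCV(pipe, param_grid={'st_kmeans_algo__k': [3, 5, 7]}, cv=2)
grid_cv.fit(X, y)
print(grid_cv.best_params_)
```

Note that the pipeline is fitted on the raw (time, x, y) rows, not on an already transformed matrix, so the transformer step sees the columns it expects in every fold.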
