K-Means Clustering Hyperparameter Tuning

Posted: 2020-09-11 20:03:41

Question:

I am trying to perform hyperparameter tuning for spatio-temporal K-Means clustering by using it in a pipeline with a Decision Tree classifier. The idea is to use the K-Means clustering algorithm to generate the cluster-distance space matrix and the cluster labels, which are then passed to the Decision Tree classifier. For hyperparameter tuning, only the parameters of the K-Means algorithm are used.

I am using Python 3.8 and sklearn 0.22.

The data I am interested in has 3 columns/attributes: 'time', 'x' and 'y' (x and y are spatial coordinates).
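The actual dataset is not shown here; a minimal synthetic data frame with the same layout (column names as above, values made up purely for illustration) could be built like this:

import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'data' DataFrame used further below:
# 600 rows with a float 'time' column and spatial coordinates 'x', 'y'.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    'time': np.repeat(np.arange(100.0), 6),
    'x': rng.uniform(0.0, 1.0, 600),
    'y': rng.uniform(0.0, 1.0, 600),
})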

The code is:

# Imports used throughout the snippets below
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import check_array


class ST_KMeans(BaseEstimator, TransformerMixin):
    """
    Note that K-means clustering algorithm is designed for Euclidean distances.
    It may stop converging with other distances, when the mean is no longer a
    best estimation for the cluster 'center'.

    The 'mean' minimizes squared differences (or, squared Euclidean distance).
    If you want a different distance function, you need to replace the mean with
    an appropriate center estimation.


    Parameters:

    k:  number of clusters

    eps1 : float, default=0.5
        The spatial density threshold (maximum spatial distance) between 
        two points to be considered related.

    eps2 : float, default=10
        The temporal threshold (maximum temporal distance) between two 
        points to be considered related.

    metric : string default='euclidean'
        The used distance metric - more options are
        ‘braycurtis’, ‘canberra’, ‘chebyshev’, ‘cityblock’, ‘correlation’,
        ‘cosine’, ‘dice’, ‘euclidean’, ‘hamming’, ‘jaccard’, ‘jensenshannon’,
        ‘kulsinski’, ‘mahalanobis’, ‘matching’, ‘rogerstanimoto’, ‘sqeuclidean’,
        ‘russellrao’, ‘seuclidean’, ‘sokalmichener’, ‘sokalsneath’, ‘yule’.

    n_jobs : int or None, default=-1
        The number of processes to start; -1 means use all processors (BE AWARE)


    Attributes:

    labels : array, shape = [n_samples]
        Cluster labels for the data - noise is defined as -1
    """

    def __init__(self, k, eps1 = 0.5, eps2 = 10, metric = 'euclidean', n_jobs = 1):
        self.k = k
        self.eps1 = eps1
        self.eps2 = eps2
        # self.min_samples = min_samples
        self.metric = metric
        self.n_jobs = n_jobs


    def fit(self, X):
        """
        Apply the ST K-Means algorithm 

        X : 2D numpy array. The first attribute of the array should be time attribute
            as float. The following positions in the array are treated as spatial
            coordinates.
            The structure should look like this [[time_step1, x, y], [time_step2, x, y]..]

            For example 2D dataset:
            array([[0,0.45,0.43],
            [0,0.54,0.34],...])


        Returns:

        self
        """

        # check if input is correct
        X = check_array(X)

        # type(X)
        # numpy.ndarray

        # Check arguments for DBSCAN algo-
        if not self.eps1 > 0.0 or not self.eps2 > 0.0:
            raise ValueError('eps1 and eps2 must be positive')

        # Get dimensions of 'X'-
        # n - number of rows
        # m - number of attributes/columns-
        n, m = X.shape


        # Compute square-form Euclidean distance matrices for the 'time' and spatial attributes-
        time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric = self.metric))
        euc_dist = squareform(pdist(X[:, 1:], metric = self.metric))

        '''
        Filter the euclidean distance matrix using time distance matrix. The code snippet gets all the
        indices of the 'time_dist' matrix in which the time distance is smaller than 'eps2'.
        Afterward, for the same indices in the euclidean distance matrix the 'eps1' is doubled which results
        in the fact that the indices are not considered during clustering - as they are bigger than 'eps1'.
        '''
        # filter 'euc_dist' matrix using 'time_dist' matrix-
        dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)


        # Initialize K-Means clustering model-
        kmeans_clust_model = KMeans(
            n_clusters = self.k, init = 'k-means++',
            n_init = 10, max_iter = 300,
            precompute_distances = 'auto', algorithm = 'auto')

        # Train model-
        kmeans_clust_model.fit(dist)


        self.labels = kmeans_clust_model.labels_
        self.X_transformed = kmeans_clust_model.fit_transform(X)

        return self

    def transform(self, X):
        pass


# Initialize ST-K-Means object-
st_kmeans_algo = ST_KMeans(
    k = 5, eps1=0.6,
    eps2=9, metric='euclidean',
    n_jobs=1
    )

# Train on a chunk of dataset-
st_kmeans_algo.fit(data.loc[:500, ['time', 'x', 'y']])

# Get clustered data points labels-
kmeans_labels = st_kmeans_algo.labels

kmeans_labels.shape
# (501,)


# Get labels for points clustered using trained model-
kmeans_transformed = st_kmeans_algo.X_transformed

kmeans_transformed.shape
# (501, 5)


dtc = DecisionTreeClassifier()

dtc.fit(kmeans_transformed, kmeans_labels)

y_pred = dtc.predict(kmeans_transformed)

# Get model performance metrics-
accuracy = accuracy_score(kmeans_labels, y_pred)
precision = precision_score(kmeans_labels, y_pred, average='macro')
recall = recall_score(kmeans_labels, y_pred, average='macro')

print("\nDT model metrics are:")
print("accuracy = 0:.4f, precision = 1:.4f & recall = 2:.4f\n".format(
    accuracy, precision, recall
    ))

# DT model metrics are:
# accuracy = 1.0000, precision = 1.0000 & recall = 1.0000

However, when I try to perform hyperparameter tuning using sklearn's pipeline:

# Hyper-parameter Tuning:
# Define steps of pipeline-
pipeline_steps = [
    ('st_kmeans_algo' ,ST_KMeans(k = 5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier())
    ]

# Instantiate a pipeline-
pipeline = Pipeline(pipeline_steps)

# Train pipeline-
pipeline.fit(kmeans_transformed, kmeans_labels)

it gives me the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-…> in <module>
      8 
      9 # Train pipeline-
---> 10 pipeline.fit(kmeans_transformed, kmeans_labels)

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
    348             This estimator
    349         """
--> 350         Xt, fit_params = self._fit(X, y, **fit_params)
    351         with _print_elapsed_time('Pipeline',
    352                                  self._log_message(len(self.steps) - 1)):

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params)
    309                 cloned_transformer = clone(transformer)
    310             # Fit or load from cache the current transformer
--> 311             X, fitted_transformer = fit_transform_one_cached(
    312                 cloned_transformer, X, y, None,
    313                 message_clsname='Pipeline',

~/.local/lib/python3.8/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
    353 
    354     def __call__(self, *args, **kwargs):
--> 355         return self.func(*args, **kwargs)
    356 
    357     def call_and_shelve(self, *args, **kwargs):

~/.local/lib/python3.8/site-packages/sklearn/pipeline.py in _fit_transform_one(transformer, X, y, weight, message_clsname, message, **fit_params)
    726     with _print_elapsed_time(message_clsname, message):
    727         if hasattr(transformer, 'fit_transform'):
--> 728             res = transformer.fit_transform(X, y, **fit_params)
    729         else:
    730             res = transformer.fit(X, y, **fit_params).transform(X)

~/.local/lib/python3.8/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params)
    572         else:
    573             # fit method of arity 2 (supervised transformation)
--> 574             return self.fit(X, y, **fit_params).transform(X)
    575 
    576 

TypeError: fit() takes 2 positional arguments but 3 were given

Answer 1:

The fit method in ST_KMeans accepts only X as input, but in this line:

pipeline.fit(kmeans_transformed, kmeans_labels)

you are passing X and Y as inputs to your pipeline, which then tries to call the fit method of the first stage of the pipeline, i.e. ST_KMeans, with these two arguments, and that causes this error. To get around it, simply add a dummy argument y to the fit method of your ST_KMeans object, like this:

def fit(self, X, Y):

The additional argument Y is not used anywhere inside the method; it is only there to keep the signature consistent with what Pipeline expects.
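A minimal sketch of what that looks like, assuming the rest of the question's fit body stays unchanged (the y=None default is a common scikit-learn convention, not something the answer strictly requires):

    def fit(self, X, y=None):
        # 'y' is ignored; it only exists so that Pipeline can call fit(X, y).
        # ... same body as in the question ...
        return self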

Hope this helps!

Comments:

I changed the fit(self, X, Y) method, and when I run the "pipeline.fit(kmeans_transformed, kmeans_labels)" call it gives me the error: "ValueError: Expected 2D array, got scalar array instead: array=nan. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."

Can you print the shapes of kmeans_transformed and kmeans_labels?

kmeans_transformed.shape, kmeans_labels.shape # ((501, 5), (501,))

Can you give us an example that reproduces the error?
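For the tuning goal in the question title, a hedged sketch of how a grid search over the K-Means-side parameters could look is shown below. It assumes the dummy y argument from the answer is in place and that transform() returns a real 2D array rather than the bare pass from the question (that bare pass is a plausible cause of the "Expected 2D array, got scalar array" error above); the grid values and cv setting are illustrative only.

from sklearn.model_selection import GridSearchCV

# Pipeline, ST_KMeans and DecisionTreeClassifier are the ones defined above.
pipeline = Pipeline([
    ('st_kmeans_algo', ST_KMeans(k=5, eps1=0.6, eps2=9, metric='euclidean', n_jobs=1)),
    ('dtc', DecisionTreeClassifier()),
])

# Tune only the K-Means-side parameters, addressed as '<step name>__<param>'.
param_grid = {
    'st_kmeans_algo__k': [3, 5, 8],
    'st_kmeans_algo__eps1': [0.4, 0.6],
    'st_kmeans_algo__eps2': [6, 9, 12],
}

grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(kmeans_transformed, kmeans_labels)

print(grid.best_params_)
print(grid.best_score_)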
