Clustering on Python and Bokeh; select widget which allows user to change clustering algorithm

Posted: 2021-10-13 23:23:51

Question:

I am trying to build a feature in a Bokeh dashboard that lets the user cluster data. I am using the following example as a template (link: Clustering in Bokeh example).

Here is the code for that example:

import numpy as np
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

from bokeh.layouts import column, row
from bokeh.plotting import figure, output_file, show

print("\n\n*** This example may take several seconds to run before displaying. ***\n\n")

N = 50000
PLOT_SIZE = 400

# generate datasets.
np.random.seed(0)
noisy_circles = datasets.make_circles(n_samples=N, factor=.5, noise=.04)
noisy_moons = datasets.make_moons(n_samples=N, noise=.05)
centers = [(-2, 3), (2, 3), (-2, -3), (2, -3)]
blobs1 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.4, random_state=8)
blobs2 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.7, random_state=8)

colors = np.array([x for x in ('#00f', '#0f0', '#f00', '#0ff', '#f0f', '#ff0')])
colors = np.hstack([colors] * 20)

# create clustering algorithms
dbscan   = cluster.DBSCAN(eps=.2)
birch    = cluster.Birch(n_clusters=2)
means    = cluster.MiniBatchKMeans(n_clusters=2)
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
affinity = cluster.AffinityPropagation(damping=.9, preference=-200)

# change here, to select clustering algorithm (note: spectral is slow)
algorithm = dbscan  # <- SELECT ALG

plots =[]
for dataset in (noisy_circles, noisy_moons, blobs1, blobs2):
    X, y = dataset
    X = StandardScaler().fit_transform(X)

    # predict cluster memberships
    algorithm.fit(X)
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)

    p = figure(output_backend="webgl", title=algorithm.__class__.__name__,
               width=PLOT_SIZE, height=PLOT_SIZE)

    p.circle(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), alpha=0.1,)

    plots.append(p)

# generate layout for the plots
layout = column(row(plots[:2]), row(plots[2:]))

output_file("clustering.html", title="clustering with sklearn")

show(layout)

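This example lets the user cluster data. The algorithm is chosen in the code itself; in the code pasted above it is dbscan. I tried to modify the code to add a widget that lets the user choose which algorithm to use: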


from bokeh.models.annotations import Label
import numpy as np
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

from bokeh.layouts import column, row
from bokeh.plotting import figure, output_file, show
from bokeh.models import CustomJS, Select
print("\n\n*** This example may take several seconds to run before displaying. ***\n\n")

N = 50000
PLOT_SIZE = 400

# generate datasets.
np.random.seed(0)
noisy_circles = datasets.make_circles(n_samples=N, factor=.5, noise=.04)
noisy_moons = datasets.make_moons(n_samples=N, noise=.05)
centers = [(-2, 3), (2, 3), (-2, -3), (2, -3)]
blobs1 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.4, random_state=8)
blobs2 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.7, random_state=8)

colors = np.array([x for x in ('#00f', '#0f0', '#f00', '#0ff', '#f0f', '#ff0')])
colors = np.hstack([colors] * 20)

# create clustering algorithms
dbscan   = cluster.DBSCAN(eps=.2)
birch    = cluster.Birch(n_clusters=2)
means    = cluster.MiniBatchKMeans(n_clusters=2)
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
affinity = cluster.AffinityPropagation(damping=.9, preference=-200)
kmeans   = cluster.KMeans(n_clusters=2)

############################select widget for different clustering algorithms############


menu     =[('DBSCAN','dbscan'),('Birch','birch'),('MiniBatchKmeans','means'),('Spectral','spectral'),('Affinity','affinity'),('K-means','kmeans')]
select = Select(title="Option:", value="DBSCAN", options=menu)
select.js_on_change("value", CustomJS(code="""
    console.log('select: value=' + this.value, this.toString())
"""))

# change here, to select clustering algorithm (note: spectral is slow)
algorithm = select.value  

############################################################
plots =[]
for dataset in (noisy_circles, noisy_moons, blobs1, blobs2):
    X, y = dataset
    X = StandardScaler().fit_transform(X)

    # predict cluster memberships
    algorithm.fit(X)
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)

    p = figure(output_backend="webgl", title=algorithm.__class__.__name__,
               width=PLOT_SIZE, height=PLOT_SIZE)

    p.circle(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), alpha=0.1,)

    plots.append(p)

# generate layout for the plots
layout = column(select,row(plots[:2]), row(plots[2:]))

output_file("clustering.html", title="clustering with sklearn")

show(layout)

However, when I try to run it I get this error:

AttributeError: 'str' object has no attribute 'fit'

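Can anyone tell me what I am missing to fix this?

Also, if it is not too difficult, I would like to add a numeric input widget that lets the user choose the number of clusters each algorithm should look for. Any suggestions?

Many thanks :)

EDIT

Here is the current state of the code using @Tony's solution.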

''' Example inspired by an example from the scikit-learn project:
http://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html
'''
#https://github.com/bokeh/bokeh/blob/branch-2.4/examples/webgl/clustering.py
from bokeh.models.annotations import Label
import numpy as np
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

from bokeh.layouts import column, row
from bokeh.plotting import figure, output_file, show
from bokeh.models import CustomJS, Select
print("\n\n*** This example may take several seconds to run before displaying. ***\n\n")

N = 50000
PLOT_SIZE = 400

# generate datasets.
np.random.seed(0)
noisy_circles = datasets.make_circles(n_samples=N, factor=.5, noise=.04)
noisy_moons = datasets.make_moons(n_samples=N, noise=.05)
centers = [(-2, 3), (2, 3), (-2, -3), (2, -3)]
blobs1 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.4, random_state=8)
blobs2 = datasets.make_blobs(centers=centers, n_samples=N, cluster_std=0.7, random_state=8)

colors = np.array([x for x in ('#00f', '#0f0', '#f00', '#0ff', '#f0f', '#ff0')])
colors = np.hstack([colors] * 20)

# create clustering algorithms
dbscan   = cluster.DBSCAN(eps=.2)
birch    = cluster.Birch(n_clusters=2)
means    = cluster.MiniBatchKMeans(n_clusters=2)
spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
affinity = cluster.AffinityPropagation(damping=.9, preference=-200)
kmeans   = cluster.KMeans(n_clusters=2)

menu     =[('DBSCAN','dbscan'),('Birch','birch'),('MiniBatchKmeans','means'),('Spectral','spectral'),('Affinity','affinity'),('K-means','kmeans')]
select = Select(title="Option:", value="DBSCAN", options=menu)
select.js_on_change("value", CustomJS(code="""
    console.log('select: value=' + this.value, this.toString())
"""))

# change here, to select clustering algorithm (note: spectral is slow)
#algorithm = select.value  

algorithm = None

if select.value == 'dbscan':
    algorithm = dbscan # use dbscan algorithm function
elif select.value == 'birch':
      algorithm = birch  # use birch algorithm function
elif select.value == 'means':
      algorithm = means  # use means algorithm function
elif select.value == 'spectral':
      algorithm = spectral
elif select.value == 'affinity':
      algorithm = affinity
elif select.value == 'kmeans':
      algorithm = 'kmeans'


if algorithm is not None:
    plots =[]
for dataset in (noisy_circles, noisy_moons, blobs1, blobs2):
    X, y = dataset
    X = StandardScaler().fit_transform(X)

    # predict cluster memberships
    algorithm.fit(X)           ######################This is what appears to be the problem######################
    if hasattr(algorithm, 'labels_'):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)

    p = figure(output_backend="webgl", title=algorithm.__class__.__name__,
               width=PLOT_SIZE, height=PLOT_SIZE)

    p.circle(X[:, 0], X[:, 1], color=colors[y_pred].tolist(), alpha=0.1,)

    plots.append(p)
else:
   print('Please select an algorithm first')
    


# generate layout for the plots
layout = column(select,row(plots[:2]), row(plots[2:]))

output_file("clustering.html", title="clustering with sklearn")

show(layout)

The error occurs at algorithm.fit(X). Error message:

AttributeError: 'NoneType' object has no attribute 'fit'
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
m:\bokehdash\clusteringbokeh.py in 
     67 
     68     # predict cluster memberships
---> 69     algorithm.fit(X)
     70     if hasattr(algorithm, 'labels_'):
     71         y_pred = algorithm.labels_.astype(int)

AttributeError: 'NoneType' object has no attribute 'fit'

Answer 1:

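I don't know sklearn, but comparing your two examples I can see the following:

    Select is a Bokeh model whose value property is of type string, so select.value is a string
    dbscan is an algorithm function

So when you do algorithm = dbscan you assign an algorithm function to your algorithm variable, whereas when you do algorithm = select.value in your second example you assign only a string to it. That does not work because a string has no fit() function. You should do something like this: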

algorithm = None

if select.value == 'DBSCAN':
    algorithm = dbscan  # use dbscan algorithm function
elif select.value == 'Birch':
    algorithm = birch  # use birch algorithm function
elif select.value == 'MiniBatchKmeans':
    algorithm = means  # use means algorithm function
# etc...

if algorithm is not None:
    plots = []
    for dataset in (noisy_circles, noisy_moons, blobs1, blobs2):
        ...
else:
    print('Please select an algorithm first')
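
The if/elif chain above could also be written as a dictionary lookup from the widget's string value to the estimator object. This is only a minimal sketch under a few assumptions: the names ALGORITHMS and pick_algorithm are hypothetical, the keys are assumed to match whatever strings the Select widget reports as its value, and a small make_moons dataset stands in for the four datasets of the question.

```python
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

# Hypothetical registry: widget value (a string) -> estimator object.
# The keys are an assumption; they must match the values used in the Select menu.
ALGORITHMS = {
    "dbscan": cluster.DBSCAN(eps=.2),
    "birch": cluster.Birch(n_clusters=2),
    "means": cluster.MiniBatchKMeans(n_clusters=2),
    "kmeans": cluster.KMeans(n_clusters=2),
}


def pick_algorithm(value):
    """Return the estimator registered under `value`, or None if it is unknown."""
    return ALGORITHMS.get(value)


# Small stand-in dataset so the sketch runs quickly.
X, _ = datasets.make_moons(n_samples=200, noise=.05, random_state=0)
X = StandardScaler().fit_transform(X)

algorithm = pick_algorithm("dbscan")
if algorithm is not None:
    # Same fit/predict logic as in the question, now on an estimator, not a string.
    algorithm.fit(X)
    if hasattr(algorithm, "labels_"):
        y_pred = algorithm.labels_.astype(int)
    else:
        y_pred = algorithm.predict(X)
    print(algorithm.__class__.__name__, len(y_pred))
else:
    print("Please select an algorithm first")
```

With a lookup like this, an unknown or misspelled value yields None instead of silently falling through a chain of comparisons, so the "Please select an algorithm first" branch is the only place that has to handle it.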

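Comments:

Thanks for the reply. After implementing your solution, the error I now get is NameError: name 'algorithm' is not defined. I assume I need to first create some kind of object called algorithm? Having come over from R, I am still fairly new to Python, so I am still trying to learn the basics :)

See the updated code above. That part is about Python. There are many online Python courses for learning the basics. I hope it helps.

It doesn't quite work, but I feel it is nearly there. I get this error: AttributeError: 'NoneType' object has no attribute 'fit'. It seems to be caused by algorithm.fit(X) around line 69. I will update my original post with my current solution in case that helps spot the problem.

1) In the Select menu tuples, the first item is the widget value and the second item is the display name. 2) The for loop should be inside the if statement. See the updated code. You should also replace algorithm = 'kmeans' with algorithm = kmeans (remove the single quotes).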
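
The point about the (value, label) order of the menu tuples, together with the cluster-count input asked for in the question, could be sketched roughly as follows. This is not the answerer's code, only an illustration under stated assumptions: the helper make_algorithm, the update callback, the PALETTE list and the three-entry menu are all hypothetical, and a single small make_moons dataset replaces the four plots. Note also that Python callbacks registered with on_change only run when the script is served (for example with bokeh serve --show app.py, where app.py is a placeholder file name); a static file written by output_file/show cannot call back into Python.

```python
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, Select, Spinner
from bokeh.plotting import figure
from sklearn import cluster, datasets
from sklearn.preprocessing import StandardScaler

PALETTE = ['#00f', '#0f0', '#f00', '#0ff', '#f0f', '#ff0']

# Each option is (value, label): the first item is what select.value returns,
# the second is the text shown in the dropdown.
menu = [('dbscan', 'DBSCAN'), ('birch', 'Birch'), ('kmeans', 'K-means')]
select = Select(title="Algorithm:", value='dbscan', options=menu)
spinner = Spinner(title="Number of clusters:", low=2, high=6, step=1, value=2)

# Small stand-in dataset (assumption: one dataset instead of the question's four).
X, _ = datasets.make_moons(n_samples=500, noise=.05, random_state=0)
X = StandardScaler().fit_transform(X)
source = ColumnDataSource(data=dict(x=X[:, 0], y=X[:, 1], color=['#00f'] * len(X)))


def make_algorithm(name, k):
    """Hypothetical helper: build a fresh estimator for the chosen widget value."""
    if name == 'dbscan':
        return cluster.DBSCAN(eps=.2)  # DBSCAN takes no cluster count
    if name == 'birch':
        return cluster.Birch(n_clusters=k)
    return cluster.KMeans(n_clusters=k, n_init=10)


def update(attr, old, new):
    """Re-cluster and recolor the points whenever a widget value changes."""
    labels = make_algorithm(select.value, int(spinner.value)).fit_predict(X)
    # Labels can be -1 (DBSCAN noise); the modulo keeps the palette index in range.
    source.data = dict(x=X[:, 0], y=X[:, 1],
                       color=[PALETTE[i % len(PALETTE)] for i in labels])


select.on_change('value', update)
spinner.on_change('value', update)
update(None, None, None)  # initial coloring

p = figure(width=400, height=400, title="clustering with sklearn")
p.scatter('x', 'y', color='color', alpha=0.4, source=source)
curdoc().add_root(column(select, spinner, p))
```

Updating a ColumnDataSource in place, rather than rebuilding the figures, is a design choice of this sketch: the plot stays the same object and only the color column changes when the user picks another algorithm or cluster count.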
