Is it possible to use a numeric attribute as the class for K-means clustering?

Posted 2023-03-13

【Title】: Is it possible to use a numeric attribute as the class for K-means clustering? 【Posted】: 2017-05-29 12:29:36 【Question】:

@attribute CustomerID       NUMERIC
@attribute Age              A,B,C,D,E,F,G,H,I,J,K
@attribute Region           A,B,C,D,E,F,G,H
@attribute ProductSubClass  NUMERIC
@attribute ProductID        NUMERIC 
@attribute Quantity         NUMERIC
@attribute Cost             NUMERIC
@attribute sales            NUMERIC

@data
00141833,F,F,130207,4710105011011,2,44,52
01376753,E,E,110217,4710265849066,1,150,129
01603071,E,G,100201,4712019100607,1,35,39
01738667,E,F,530105,4710168702901,1,94,119

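Above are the header and part of the training dataset, training.arff. I want to use K-means clustering and the J48 classifier on it, and I can do that without any problem. The following is my test dataset, test.arff: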

@attribute CustomerID       NUMERIC
@attribute Age              A,B,C,D,E,F,G,H,I,J,K
@attribute Region           A,B,C,D,E,F,G,H
@attribute ProductSubClass  NUMERIC
@attribute ProductID        INTEGER
@attribute Quantity         NUMERIC
@attribute Cost             NUMERIC
@attribute sales            NUMERIC

@data
1754698,H,A,560402,?,1,676,849
1027365,F,C,530404,?,1,170,219
956710,E,E,500303,?,1,36,59

In both cases I made sure that ProductID was selected as the class.

These are the steps I followed:

Step 1: Apply the "AddCluster" filter to assign a K-means cluster to each instance in the dataset.
Step 2: Use the J48 classification algorithm, with the 10-fold cross-validation option, to evaluate the performance of the clustering.
Step 3: Save the finalized model and close Weka (I close it to test whether I can reload the model and use it again).
Step 4: Load the model in Weka (using "Load model").
Step 5: This time, select "Supplied test set" and choose the test file to predict on (it has the same format as the test file shown above).
Step 6: Try "Re-evaluate model on current test set".

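But at this point I get a notification: "Data used to train model and test set are not compatible. Would you like to automatically wrap the classifier in an "InputMappedClassifier" before proceeding?" It then gives the following output in plain text: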

=== Predictions on user test set ===

    inst#     actual  predicted error prediction
        1          ?      0              ? 
        2          ?      0              ? 
        3          ?      0              ? 
        4          ?      0              ? 
        5          ?      0              ? 
        6          ?      0              ? 
        7          ?      0              ? 
        8          ?      0              ? 
        9          ?      0              ? 
       10          ?      0              ? 
       11          ?      0              ? 
       12          ?      0              ? 
       13          ?      0              ? 
       14          ?      0              ? 
       15          ?      0              ? 
       16          ?      1              ? 
       17          ?      0              ? 
       18          ?      0              ? 
       19          ?      0              ? 
       20          ?      0              ? 
       21          ?      0              ?

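Now: 1. Is it possible to use the numeric field ProductID as the class? I have to predict a customer's product choice, identified by ProductID, taking the other attributes into account.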

The training and test sets are reported as not compatible.

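Note: I am using the Weka 3.8.1 GUI.

【Comments】:

K-means cannot use class information. It is a clustering algorithm, not a classification algorithm. It will also ignore all non-numeric columns, because no mean can be computed for them. Also make sure not to include ID columns. Means make no sense for those either; such columns must not be treated as numeric! Judging by your attribute names, I would say k-means is completely useless here.

【Answer 1】: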

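Your test dataset is probably missing the cluster-id that the K-Means clustering operation may have added to the training set (did you tell Weka to do that?) but did not add to the test dataset.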
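
One way to check this outside the Weka GUI is to compare the attribute declarations of the two ARFF headers. The following is a minimal sketch in plain Python, not Weka's own compatibility check. The helper names are hypothetical, and the extra attribute `cluster {cluster1,cluster2}` is an assumption about what the AddCluster step appended to the training data; the real name and values depend on how the filter was configured.

```python
# ARFF treats "integer" and "real" as aliases of "numeric".
NUMERIC_ALIASES = {"numeric", "integer", "real"}


def normalize_type(type_text):
    """Map numeric aliases to 'numeric'; leave other type strings unchanged."""
    cleaned = type_text.strip()
    if cleaned.lower() in NUMERIC_ALIASES:
        return "numeric"
    return cleaned


def parse_arff_attributes(header_text):
    """Return (name, type) pairs for every @attribute line in an ARFF header."""
    attributes = []
    for line in header_text.splitlines():
        parts = line.strip().split(None, 2)
        if len(parts) == 3 and parts[0].lower() == "@attribute":
            attributes.append((parts[1], normalize_type(parts[2])))
    return attributes


def header_differences(train_attrs, test_attrs):
    """List readable differences between two attribute lists."""
    differences = []
    train_types = dict(train_attrs)
    test_types = dict(test_attrs)
    for name, attr_type in train_attrs:
        if name not in test_types:
            differences.append(f"missing in test set: {name}")
        elif test_types[name] != attr_type:
            differences.append(
                f"type differs for {name}: {attr_type} vs {test_types[name]}"
            )
    for name, _ in test_attrs:
        if name not in train_types:
            differences.append(f"missing in training set: {name}")
    return differences


# Training header from the question, plus the ASSUMED attribute added by AddCluster.
TRAIN_HEADER = """
@attribute CustomerID       NUMERIC
@attribute Age              A,B,C,D,E,F,G,H,I,J,K
@attribute Region           A,B,C,D,E,F,G,H
@attribute ProductSubClass  NUMERIC
@attribute ProductID        NUMERIC
@attribute Quantity         NUMERIC
@attribute Cost             NUMERIC
@attribute sales            NUMERIC
@attribute cluster          {cluster1,cluster2}
"""

# Test header exactly as posted in the question.
TEST_HEADER = """
@attribute CustomerID       NUMERIC
@attribute Age              A,B,C,D,E,F,G,H,I,J,K
@attribute Region           A,B,C,D,E,F,G,H
@attribute ProductSubClass  NUMERIC
@attribute ProductID        INTEGER
@attribute Quantity         NUMERIC
@attribute Cost             NUMERIC
@attribute sales            NUMERIC
"""

differences = header_differences(
    parse_arff_attributes(TRAIN_HEADER), parse_arff_attributes(TEST_HEADER)
)
for difference in differences:
    print(difference)
```

Under these assumptions the only reported difference is the cluster attribute that the test file lacks. The `NUMERIC` versus `INTEGER` declaration of ProductID is not flagged, because both are read as numeric.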

Beyond that, the whole point of K-Means is to use it for clustering, not classification.
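
The comment above about means of ID columns can be illustrated with a few lines of Python. A K-means centroid is the per-column mean of the instances in a cluster, so this sketch simply averages the ProductID values of the four training rows shown in the question. It is only an illustration of the arithmetic, not of what Weka computes for the full dataset.

```python
from statistics import mean

# ProductID values copied from the four training rows in the question.
product_ids = [4710105011011, 4710265849066, 4712019100607, 4710168702901]

# A k-means centroid coordinate is the mean of the member values in that column.
centroid = mean(product_ids)

print("centroid:", centroid)
print("is a real ProductID:", centroid in product_ids)
```

The averaged value does not correspond to any product, which is why an identifier such as ProductID is better treated as a label than as a numeric quantity.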

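Frankly, your application of these tools is incorrect, you have not given us readers enough information (J48?), and you are asking (at least) two questions here.

【Comments】:

Thanks for your attention. I am new to data mining and Weka. Please check: I have edited my question and added further information. Please let me know if I am doing something wrong.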
