DBSCAN clustering algorithm not working properly. What am I doing wrong?

【Posted】: 2013-03-26 00:07:49
【Question】:

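I am trying to write the DBSCAN algorithm to cluster a set of points, but the results I get are very bad. The data may be partly to blame, but that is not all of it: I am getting clusters of sizes that should not occur.

What am I doing wrong? I have gone through the code many times and cannot figure out what the problem is.

I followed the algorithm given on the DBSCAN Wikipedia page.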

import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;

private static int[] dbScan(String[] points, int epsilon, int minPts) {
    int cluster = 0;
    // visited stores if point has been visited
    boolean[] visited = new boolean[points.length];
    // pointsCluster stores which cluster a point has been assigned to
    // (0 = not assigned yet, -1 = noise, > 0 = cluster id)
    int[] pointsCluster = new int[points.length];
    for (int iii = 0; iii < points.length; iii++) {
        // if point iii is already visited, do nothing
        if (visited[iii]) continue;
        visited[iii] = true;    // mark point iii as visited
        // get points in neighborhood of point iii
        HashSet<Integer> neighbors = epsilonNeighbors(points, iii, epsilon);
        if (neighbors.size() < minPts) {
            // if number of neighbors < minPts, mark point iii as noise
            pointsCluster[iii] = -1;
        } else {
            ++cluster;                      // else, start new cluster
            expandCluster(points, iii, neighbors, pointsCluster, visited, cluster, epsilon, minPts);
        }
    }
    return pointsCluster;
}

/*
 * Expands a cluster if a point is not a noise point
 * and has >= minPts in its epsilon neighborhood
 */
private static void expandCluster(String[] points, int seedPoint, HashSet<Integer> neighbors,
        int[] pointsCluster, boolean[] visited, int cluster, int epsilon, int minPts) {

    pointsCluster[seedPoint] = cluster;     // assign cluster to seed point
    // create queue to process neighbors
    Queue<Integer> seeds = new LinkedList<Integer>();
    seeds.addAll(neighbors);
    while (!seeds.isEmpty()) {
        int currentPoint = seeds.poll();
        if (!visited[currentPoint]) {
            visited[currentPoint] = true;       // mark neighbor as visited
            // get neighbors of this currentPoint
            HashSet<Integer> currentNeighbors = epsilonNeighbors(points, currentPoint, epsilon);
            // if currentPoint has >= minPts in neighborhood, add those points to the queue
            if (currentNeighbors.size() >= minPts) {
                seeds.addAll(currentNeighbors);
            }
        }
        // if currentPoint has not been assigned a cluster, assign it to the current cluster
        if (pointsCluster[currentPoint] == 0) pointsCluster[currentPoint] = cluster;
    }
}

/*
 * Returns a HashSet containing the indexes of points which are
 * in the epsilon neighborhood of the point at index == currentPoint
 */
private static HashSet<Integer> epsilonNeighbors(String[] points, int currentPoint, int epsilon) {
    HashSet<Integer> neighbors = new HashSet<Integer>();
    String protein = points[currentPoint];
    for (int iii = 0; iii < points.length; iii++) {
        // As posted, this line read points[jjj] with jjj undefined; the author
        // confirmed in the discussion that jjj should be currentPoint.
        // similarity(...) is the author's own scoring function (not shown in the post).
        int score = similarity(points[iii], protein);
        if (score >= epsilon) neighbors.add(iii);
    }
    return neighbors;
}
【Comments on the question】:

Also consider looking at the original publication rather than Wikipedia!

【Answer 1】:

When your results are bad, it may be because your data is bad (for density-based clustering), or because your parameters are bad.

In fact, DBSCAN can produce clusters with fewer than minPts points when clusters touch each other: they can then "steal" border points from one another.
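The border-point effect can be seen on a tiny data set. The sketch below is only an illustration: it assumes 1-D integer points and a distance threshold (`|a - b| <= eps`) instead of the similarity score used in the question, and the class and method names are made up for the example. With `eps = 10` and `minPts = 4`, only the points 0 and 20 are core points, each has exactly four points in its neighborhood, and both neighborhoods contain the point 10. Whichever cluster is expanded first keeps 10, so the other one ends up with three members.

```java
import java.util.ArrayList;
import java.util.List;

class BorderPointDemo {
    // Indexes of all points within eps of pts[i] (the point itself included).
    static List<Integer> neighbors(int[] pts, int i, int eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            if (Math.abs(pts[i] - pts[j]) <= eps) out.add(j);
        }
        return out;
    }

    // A point is a core point when its neighborhood holds at least minPts points.
    static boolean isCore(int[] pts, int i, int eps, int minPts) {
        return neighbors(pts, i, eps).size() >= minPts;
    }

    public static void main(String[] args) {
        int[] pts = {-9, -8, 0, 10, 20, 28, 29};
        int eps = 10;
        int minPts = 4;
        for (int i = 0; i < pts.length; i++) {
            String kind = isCore(pts, i, eps, minPts) ? "core" : "not core";
            System.out.println(pts[i] + ": neighbor indexes "
                    + neighbors(pts, i, eps) + " (" + kind + ")");
        }
    }
}
```

The point 10 itself has only three neighbors, so it is a border point and cannot connect the two clusters; it simply goes to the first cluster that reaches it.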

How about validating your algorithm's output against an existing implementation, for example ELKI?

【Discussion】:

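Wow, you are right. I had not considered clusters "stealing" border points. Thanks a lot. So, from the looks of it, the algorithm looks fine, right?

I did not check carefully. And your epsilonNeighbors references the undefined variable jjj. Also note that Java collections perform very poorly with primitive types. You really might want to try ELKI, because it is really very fast.

Yes, jjj should be currentPoint. Will look into ELKI. Thanks for the help.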
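One more detail in expandCluster may be worth checking, although nobody in the thread raised it. The Wikipedia pseudocode adds a neighbor to the cluster when it is "not yet a member of any cluster", which includes points previously marked as noise. The posted code only assigns points whose label is still 0, so a point that was visited earlier and labelled -1 stays noise even when a core point later reaches it. The sketch below compares the two checks. It is an illustration under stated assumptions, not the author's code: it uses 1-D integer points with a distance threshold, and `relabelNoise` is a hypothetical switch added for the comparison.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class NoiseRelabelDemo {
    static final int NOISE = -1;
    static final int UNASSIGNED = 0;

    // Indexes of all points within eps of pts[i] (the point itself included).
    static List<Integer> neighbors(int[] pts, int i, int eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < pts.length; j++) {
            if (Math.abs(pts[i] - pts[j]) <= eps) out.add(j);
        }
        return out;
    }

    // Queue-based DBSCAN with the same structure as the posted code.
    // relabelNoise = false mimics the posted "== 0" check;
    // relabelNoise = true also lets noise points join a cluster as border points.
    static int[] dbscan(int[] pts, int eps, int minPts, boolean relabelNoise) {
        int[] label = new int[pts.length];
        boolean[] visited = new boolean[pts.length];
        int cluster = 0;
        for (int i = 0; i < pts.length; i++) {
            if (visited[i]) continue;
            visited[i] = true;
            List<Integer> seedNeighbors = neighbors(pts, i, eps);
            if (seedNeighbors.size() < minPts) {
                label[i] = NOISE;   // may still become a border point later
                continue;
            }
            cluster++;
            label[i] = cluster;
            Deque<Integer> queue = new ArrayDeque<>(seedNeighbors);
            while (!queue.isEmpty()) {
                int p = queue.poll();
                if (!visited[p]) {
                    visited[p] = true;
                    List<Integer> more = neighbors(pts, p, eps);
                    if (more.size() >= minPts) queue.addAll(more);
                }
                boolean free = label[p] == UNASSIGNED
                        || (relabelNoise && label[p] == NOISE);
                if (free) label[p] = cluster;
            }
        }
        return label;
    }

    public static void main(String[] args) {
        int[] pts = {-9, -8, 0, 10, 20, 28, 29};
        // Posted check: -9 and -8 are visited first, marked noise, and never recovered.
        System.out.println(Arrays.toString(dbscan(pts, 10, 4, false))); // [-1, -1, 1, 1, 2, 2, 2]
        // With relabelling they join cluster 1 as border points.
        System.out.println(Arrays.toString(dbscan(pts, 10, 4, true)));  // [1, 1, 1, 1, 2, 2, 2]
    }
}
```

On this input the strict check shrinks the first cluster to two points. Even with relabelling, the second cluster has three points (fewer than minPts = 4), because the shared border point 10 was already taken by the first cluster, as the answer describes.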
