Machine Learning: KNN Iris Classification

Posted by esc_ai

Introduction to KNN

The nearest-neighbor algorithm, or k-nearest neighbors (kNN, k-NearestNeighbor), is one of the simplest classification methods in data mining. "k nearest neighbors" means exactly that: each sample can be represented by the k samples closest to it.
The core idea of kNN is that if the majority of a sample's k nearest neighbors in feature space belong to a particular class, then the sample is assigned to that class and takes on the characteristics of that class. The classification decision therefore depends only on the classes of the one or few nearest samples.
Because kNN relies on a small number of nearby samples rather than on discriminating whole class regions, it tends to work better than many other methods on data sets whose class regions intersect or overlap heavily.
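
For comparison, here is a minimal sketch of the same task using scikit-learn's KNeighborsClassifier. This is not part of the original walkthrough; it assumes scikit-learn is installed and uses the library's bundled copy of the Iris data rather than iris.data.txt.

# Minimal kNN sketch with scikit-learn (assumed installed); the hand-written
# implementation below does the same thing without the library.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()  # 150 samples, 4 features, 3 classes
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3)  # k = 3, as in the code below
clf.fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))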

Python Implementation

Python 2

#!/usr/bin/python
# -*- coding: utf8 -*-

import math
import operator
import random
import csv


def distance(l1, l2):
    # Euclidean distance over the four numeric features (the label is ignored).
    d = 0
    for x in range(4):
        d += pow((l1[x] - l2[x]), 2)
    return math.sqrt(d)


def getNeighbors(trainingSet, testInstance, k):
    # Return the k training samples closest to testInstance.
    distances = []
    for i in range(len(trainingSet)):
        dis = distance(testInstance, trainingSet[i])
        distances.append((trainingSet[i], dis))
    distances.sort(key=operator.itemgetter(1))  # sort by distance, ascending
    # print "distances:", distances
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors


def getResult(neighbors):
    # Majority vote over the neighbors' class labels (the last column).
    votes = {}
    for i in range(len(neighbors)):
        result = neighbors[i][-1]
        if result in votes:
            votes[result] += 1
        else:
            votes[result] = 1
    sortedVotes = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


if __name__ == '__main__':
    trainingset = []
    testSet = []
    dataSet = []
    splitRatio = 0.75
    filename = 'iris.data.txt'
    with open(filename, 'r') as datafile:
        lines = csv.reader(datafile)
        dataSet = [row for row in lines if row]  # skip any blank lines in the file
        print("dataset len", len(dataSet))  # under Python 2, print(...) with commas prints a tuple
        for x in range(len(dataSet)):
            for y in range(4):
                dataSet[x][y] = float(dataSet[x][y])
            if random.random() < splitRatio:
                trainingset.append(dataSet[x])
            else:
                testSet.append(dataSet[x])
    # print "trainingset ", trainingset
    # print "testset ", testSet

    print "trainingset len", len(trainingset)
    print "testset len", len(testSet)

    results = []
    for i in range(len(testSet)):
        neighbors = getNeighbors(trainingset, testSet[i], 3)
        result = getResult(neighbors)
        results.append(result)
        print "期望值:", testSet[i][-1], "实际值:", result
    correct = 0
    for i in range(len(results)):
        if results[i] == testSet[i][-1]:
            correct += 1
    print "准确率:", correct / float(len(results))

Python 3

#!/usr/bin/python
# -*- coding: utf8 -*-

import math
import operator
import random
import csv


def distance(l1, l2):
    # Euclidean distance over the four numeric features (the label is ignored).
    d = 0
    for x in range(4):
        d += pow((l1[x] - l2[x]), 2)
    return math.sqrt(d)


def getNeighbors(trainingSet, testInstance, k):
    # Return the k training samples closest to testInstance.
    distances = []
    for i in range(len(trainingSet)):
        dis = distance(testInstance, trainingSet[i])
        distances.append((trainingSet[i], dis))
    distances.sort(key=operator.itemgetter(1))  # sort by distance, ascending
    # print("distances:", distances)
    neighbors = []
    for i in range(k):
        neighbors.append(distances[i][0])
    return neighbors


def getResult(neighbors):
    # Majority vote over the neighbors' class labels (the last column).
    votes = {}
    for i in range(len(neighbors)):
        result = neighbors[i][-1]
        if result in votes:
            votes[result] += 1
        else:
            votes[result] = 1
    sortedVotes = sorted(votes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]


if __name__ == '__main__':
    trainingset = []
    testSet = []
    dataSet = []
    splitRatio = 0.75
    filename = 'iris.data.txt'
    with open(filename, 'r') as datafile:
        lines = csv.reader(datafile)
        dataSet = [row for row in lines if row]  # skip any blank lines in the file
        print("dataset len", len(dataSet))
        for x in range(len(dataSet)):
            for y in range(4):
                dataSet[x][y] = float(dataSet[x][y])
            if random.random() < splitRatio:
                trainingset.append(dataSet[x])
            else:
                testSet.append(dataSet[x])
    # print "trainingset ", trainingset
    # print "testset ", testSet

    print("trainingset len", len(trainingset))
    print("testset len", len(testSet))

    results = []
    for i in range(len(testSet)):
        neighbors = getNeighbors(trainingset, testSet[i], 3)
        result = getResult(neighbors)
        results.append(result)
        print("期望值:", testSet[i][-1], "实际值:", result)
    correct = 0
    for i in range(len(results)):
        if results[i] == testSet[i][-1]:
            correct += 1
    print("准确率:", correct / float(len(results)))

Results

('dataset len', 150)
trainingset len 109
testset len 41
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-setosa predicted: Iris-setosa
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-virginica
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-virginica
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-versicolor predicted: Iris-versicolor
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-versicolor
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
expected: Iris-virginica predicted: Iris-virginica
accuracy: 0.926829268293
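
With k = 3 and a random 75/25 split, this run classifies 38 of the 41 test samples correctly (about 92.7%). Iris-setosa is separated cleanly, while the three errors are all confusions between Iris-versicolor and Iris-virginica, whose measurements overlap in feature space.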
