Machine Learning 01: Binary Classification (Stochastic Gradient Descent)
Posted by 信步数园
Classification - SGD Method
What is a classification problem?
Imagine an elementary school student is given a batch of learning materials: lots of pictures of handwritten digits. His parents teach him: "This is 1, o-n-e, one." "That is 5, f-i-v-e, five." The student gradually learns to identify the digits 0~9 by their shapes. Now, whenever anyone writes a digit, the little boy can name it correctly.
The above process is what classification problems look like:
- First, a dataset with labels, called the training set, is given to the machine.
- Next, using some kind of algorithm, the machine learns to distinguish data with different labels based on the training set.
- Then, we use a test set to check the effectiveness of the learning process. If the machine's performance is not satisfactory, we can improve the algorithm or provide better training sets and retrain it, until the results are satisfactory.
- Finally, the machine can be put into practical use.
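A minimal sketch of this workflow with scikit-learn is shown below; the small built-in digits dataset and the classifier here are placeholders for illustration, not the setup used later in this post.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Step 1: a labeled dataset, split into a training set and a test set
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
# Step 2: the machine learns from the training set
clf = SGDClassifier(random_state=42)
clf.fit(X_train, y_train)
# Step 3: check the effectiveness on the test set
print(clf.score(X_test, y_test))
# Step 4: put the trained model into practical use
print(clf.predict(X_test[:5]))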
SGD as a binary classifier
A binary classification problem is the simplest of all: it is like a "yes-or-no" question, with only two choices. The Stochastic Gradient Descent (SGD) method is a popular and relatively elementary way to solve it.
The SGD (Stochastic Gradient Descent) classifier performs well on large datasets. Scikit-learn provides an SGDClassifier class for this algorithm.
First, prepare the dataset.
Here, we use the MNIST dataset (handwritten digits 0~9) as an example.
import struct
import matplotlib.pyplot as plt
import numpy as np
# train_images
train_images_idx3_ubyte_file = "dataset/MINST/train-images.idx3-ubyte"
# train_labels
train_labels_idx1_ubyte_file = "dataset/MINST/train-labels.idx1-ubyte"
# test_images
test_images_idx3_ubyte_file = "dataset/MINST/t10k-images.idx3-ubyte"
# test_labels
test_labels_idx1_ubyte_file = "dataset/MINST/t10k-labels.idx1-ubyte"
def decode_idx3_ubyte(idx3_ubyte_file):
"""
A universal function for decoding idx3 files
:param idx3_ubyte_file: idx3 file path
:return: dataset
"""
    # Read the binary data
    bin_data = open(idx3_ubyte_file, "rb").read()
    # Parse the file header: magic number, number of images, image height, image width
    offset = 0
    fmt_header = ">iiii"  # The first four header fields are all big-endian 32-bit integers, hence the 'i' format; four of them read all four fields. The label files below need only two, i.e. ">ii".
magic_number, num_images, num_rows, num_cols = struct.unpack_from(
fmt_header, bin_data, offset
)
    print(
        "Magic number: %d, number of images: %d, image size: %d*%d"
        % (magic_number, num_images, num_rows, num_cols)
    )
    # Parse the dataset
    image_size = num_rows * num_cols
    offset += struct.calcsize(
        fmt_header
    )  # Advance the offset past the header; after the four header fields it points to byte 0016.
    print(offset)
    fmt_image = (
        ">" + str(image_size) + "B"
    )  # Pixel values are unsigned chars, format 'B'. Prepending the image size (784) reads 784 'B' values at once; without it, only a single pixel value would be read.
print(fmt_image, offset, struct.calcsize(fmt_image))
images = np.empty((num_images, num_rows, num_cols))
# plt.figure()
for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print("Parsed %d images" % (i + 1))
print(offset)
images[i] = np.array(struct.unpack_from(fmt_image, bin_data, offset)).reshape(
(num_rows, num_cols)
)
# print(images[i])
offset += struct.calcsize(fmt_image)
# plt.imshow(images[i],'gray')
# plt.pause(0.00001)
# plt.show()
return images
def decode_idx1_ubyte(idx1_ubyte_file):
"""
A universal function for decoding idx1 files
:param idx1_ubyte_file: idx1 file path
:return: dataset
"""
    # Read the binary data
    bin_data = open(idx1_ubyte_file, "rb").read()
    # Parse the file header: magic number and number of labels
offset = 0
fmt_header = ">ii"
magic_number, num_images = struct.unpack_from(fmt_header, bin_data, offset)
print("魔数:%d, 图片数量: %d张" % (magic_number, num_images))
# 解析数据集
offset += struct.calcsize(fmt_header)
fmt_image = ">B"
labels = np.empty(num_images)
for i in range(num_images):
        if (i + 1) % 10000 == 0:
            print("Parsed %d labels" % (i + 1))
labels[i] = struct.unpack_from(fmt_image, bin_data, offset)[0]
offset += struct.calcsize(fmt_image)
return labels
def load_train_images(idx_ubyte_file=train_images_idx3_ubyte_file):
"""
TRAINING SET IMAGE FILE (train-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 60000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
:param idx_ubyte_file: idx file path
    :return: np.array of shape n*row*col, where n is the number of images
"""
return decode_idx3_ubyte(idx_ubyte_file)
def load_train_labels(idx_ubyte_file=train_labels_idx1_ubyte_file):
"""
TRAINING SET LABEL FILE (train-labels-idx1-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000801(2049) magic number (MSB first)
0004 32 bit integer 60000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
........
xxxx unsigned byte ?? label
    The label values are 0 to 9.
    :param idx_ubyte_file: idx file path
    :return: np.array of shape n*1, where n is the number of labels
"""
return decode_idx1_ubyte(idx_ubyte_file)
def load_test_images(idx_ubyte_file=test_images_idx3_ubyte_file):
"""
TEST SET IMAGE FILE (t10k-images-idx3-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000803(2051) magic number
0004 32 bit integer 10000 number of images
0008 32 bit integer 28 number of rows
0012 32 bit integer 28 number of columns
0016 unsigned byte ?? pixel
0017 unsigned byte ?? pixel
........
xxxx unsigned byte ?? pixel
Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
    :param idx_ubyte_file: idx file path
    :return: np.array of shape n*row*col, where n is the number of images
"""
return decode_idx3_ubyte(idx_ubyte_file)
def load_test_labels(idx_ubyte_file=test_labels_idx1_ubyte_file):
"""
TEST SET LABEL FILE (t10k-labels-idx1-ubyte):
[offset] [type] [value] [description]
0000 32 bit integer 0x00000801(2049) magic number (MSB first)
0004 32 bit integer 10000 number of items
0008 unsigned byte ?? label
0009 unsigned byte ?? label
........
xxxx unsigned byte ?? label
    The label values are 0 to 9.
    :param idx_ubyte_file: idx file path
    :return: np.array of shape n*1, where n is the number of labels
"""
return decode_idx1_ubyte(idx_ubyte_file)
if __name__ == "__main__":
train_images = load_train_images()
train_labels = load_train_labels()
test_images = load_test_images()
test_labels = load_test_labels()
    # View the first ten samples and their labels to verify the data was read correctly
# for i in range(10):
# print(train_labels[i])
# plt.imshow(train_images[i], cmap='gray')
# plt.pause(0.000001)
# plt.show()
# print('done')
Magic number: 2051, number of images: 60000, image size: 28*28
16
>784B 16 784
Parsed 10000 images
7839232
Parsed 20000 images
15679232
Parsed 30000 images
23519232
Parsed 40000 images
31359232
Parsed 50000 images
39199232
Parsed 60000 images
47039232
Magic number: 2049, number of labels: 60000
Parsed 10000 labels
Parsed 20000 labels
Parsed 30000 labels
Parsed 40000 labels
Parsed 50000 labels
Parsed 60000 labels
Magic number: 2051, number of images: 10000, image size: 28*28
16
>784B 16 784
Parsed 10000 images
7839232
Magic number: 2049, number of labels: 10000
Parsed 10000 labels
The above code is adapted from https://blog.csdn.net/panrenlong/article/details/81736754.
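As an aside, if the raw IDX files are not at hand, scikit-learn can download an equivalent copy of MNIST from OpenML. A minimal sketch ("mnist_784" is OpenML's dataset name; the first call downloads the data, which may take a while):
from sklearn.datasets import fetch_openml

# Fetch MNIST as a 70000*784 pixel array plus string labels
mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target.astype(int)
# The conventional split: first 60000 for training, last 10000 for testing
X_train, X_test = X[:60000], X[60000:]
y_train, y_test = y[:60000], y[60000:]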
We then solve the following classification problem: 3 or not-3.
To improve training, we shuffle the training set with a random permutation of its indices.
shuffle_index = np.random.permutation(60000)
train_images, train_labels = train_images[shuffle_index], train_labels[shuffle_index]
X_train = np.empty((60000, 28 * 28))
for i in range(0, 60000):
    X_train[i] = train_images[i].flatten()  # flatten each 28*28 image into a 784-vector
X_test = np.empty((10000, 28 * 28))
for i in range(0, 10000):
    X_test[i] = test_images[i].flatten()
y_train, y_test = train_labels, test_labels
y_train_3 = y_train == 3
y_test_3 = y_test == 3
print(X_train)
print(y_train_3)
print(y_test_3)
[[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
...
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]
[0. 0. 0. ... 0. 0. 0.]]
[False False False ... False False False]
[False False False ... False False False]
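As a side note, the per-image flattening loops above can be replaced by a single NumPy reshape, which produces the same matrices:
# Equivalent vectorized flattening: (n, 28, 28) -> (n, 784)
X_train = train_images.reshape(train_images.shape[0], -1)
X_test = test_images.reshape(test_images.shape[0], -1)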
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_3)
SGDClassifier(random_state=42)
We assess the accuracy of the model in a simple way:
success = 0.0
total = 0.0
rate = 0.0
for i in range(0, 10000):
res = sgd_clf.predict([X_test[i]])
if res[0] == y_test_3[i]:
success += 1
total += 1
rate = success / total
print("The accuracy on the test set is {0}".format(rate))
The accuracy on the test set is 0.899
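The same number can be obtained more quickly by predicting the whole test set in one call, or with the classifier's built-in score method:
# Batch prediction plus scikit-learn's built-in mean accuracy
y_pred = sgd_clf.predict(X_test)
print(sgd_clf.score(X_test, y_test_3))  # prints the same accuracy, about 0.899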
Mathematical explanation of SGD method
A mathematical explanation of the perceptron and the SGD method follows.
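In brief, this is the standard perceptron formulation, which is exactly what the code below implements:
Model: f(x) = sign(ω·x + b), with labels encoded as y ∈ {+1, -1}
A sample (x_i, y_i) is misclassified exactly when y_i(ω·x_i + b) <= 0
Loss: L(ω, b) = -Σ y_i(ω·x_i + b), summed over the misclassified samples
Gradients for one misclassified sample: ∂L/∂ω = -y_i x_i, ∂L/∂b = -y_i
SGD update with learning rate η: ω ← ω + η y_i x_i, b ← b + η y_i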
Based on this explanation, we now implement the SGD algorithm ourselves.
class SGD:
    def __init__(self, omega, b):
        self.omega = omega
        self.b = b

    def predict(self, x_predict):
        x_predict = np.array(x_predict)
        dim = x_predict[0].size
        all_size = x_predict.size
        size = int(all_size / dim)
        y_predict = np.empty(size, bool)
        for i in range(0, size):
            # Classify as positive when the sample lies on or above the hyperplane
            y_predict[i] = (np.dot(self.omega.T, x_predict[i]) + self.b) >= 0
        return y_predict
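As a design note, the prediction loop can be collapsed into a single matrix-vector product; a sketch of an equivalent vectorized version (predict_vectorized is a hypothetical helper, not part of the class above):
def predict_vectorized(omega, b, X):
    # One dot product classifies the whole batch at once
    return (np.asarray(X) @ omega + b) >= 0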
def SGD_train(x_train, y_train_ori, learning_rate=0.5, max_iter=50000):
    """
    :param x_train: 2-D ndarray
    :param y_train_ori: 1-D ndarray of boolean labels
    :param learning_rate: float number in (0,1]
    :param max_iter: maximum number of updates before giving up
    :return: SGD object
    """
    # Fetch the size of the training data
    size = y_train_ori.size
    dim = x_train[0].size
    # Convert boolean labels {False, True} to {-1, +1}
    y_train = y_train_ori * 2 - 1
    print(y_train)
    # Choose the initial values of omega and b
    omega = np.zeros(dim)
    b = 0.0
    # This overrides the learning_rate argument. Since omega and b start at zero,
    # the learning rate only rescales (omega, b) and does not change the classifier.
    learning_rate = 1
    correct_iter = 0   # length of the current streak of correct classifications
    counter = 0        # number of updates performed so far
    best_omega = np.zeros(dim)
    best_b = 0
    largest_iter = 0   # longest streak seen so far
    while True:
        # Randomly choose a sample from the data set
        i = np.random.randint(0, size)
        if (y_train[i] * (np.dot(omega.T, x_train[i]) + b)) <= 0:
            # The current weights' streak ends here; keep them if it was the best streak
            if correct_iter > largest_iter:
                best_omega, best_b, largest_iter = omega, b, correct_iter
            # Misclassified sample: apply the perceptron SGD update
            omega = omega + learning_rate * y_train[i] * x_train[i]
            b = b + learning_rate * y_train[i]
            # print('iteration {0}, current loop {1}'.format(counter, correct_iter))
            counter += 1
            correct_iter = 0
        correct_iter += 1
        # Exit once 0.5% of the dataset size is classified correctly in a row
        if correct_iter > 0.005 * size:
            best_omega, best_b = omega, b
            print('iteration {0}, current loop {1}'.format(counter, correct_iter))
            print('omega: {0}'.format(best_omega))
            print('b: {0}'.format(best_b))
            break
        if counter > max_iter:
            print('Reached maximum iteration {0} steps.'.format(max_iter))
            print('omega: {0}'.format(best_omega))
            print('b: {0}'.format(best_b))
            break
    return SGD(best_omega, best_b)
Testing our model on the test set:
sgd = SGD_train(X_train, y_train_3, 0.4, 50000)
y_testres = sgd.predict(X_test)
success = 0.0
total = 0.0
rate = 0.0
for i in range(0, 10000):
if y_testres[i] == y_test_3[i]:
success += 1
total += 1
rate = success / total
print("Success: {0}. Total: {1}. The accuracy on the test set is {2}".format(success, total, rate))
[-1 -1 -1 ... -1 -1 -1]
iteration 42632, current loop 301
omega: [ 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
0.0000e+00 0.0000e+00 0.0000e+00 .......... 0.0000e+00 0.0000e+00
0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
b: -1123.0
Success: 8990.0. Total: 10000.0. The accuracy on the test set is 0.899
From the results, the overall accuracy on the test set reached 89.9%, the same as the SGDClassifier from the Scikit-learn module.
We may feel that 89.9% is not a satisfactory result, and it is in fact weaker than it looks: roughly 10% of the test digits are 3s, so a classifier that always answers "not 3" would score about 90% as well. One underlying reason is that we used a linear model while the dataset is not linearly separable. To improve the result, try the KNN algorithm, which will be covered in later posts.
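To see how the classifier actually behaves beyond plain accuracy, a confusion matrix is more informative. A quick sketch, assuming the sgd_clf model and the test arrays defined above:
from sklearn.metrics import confusion_matrix

# Rows are true classes (not-3, 3); columns are predicted classes
y_pred = sgd_clf.predict(X_test)
print(confusion_matrix(y_test_3, y_pred))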