[PyTorch Series - 45]: Convolutional Neural Networks - Training AlexNet on the CIFAR10 Dataset with a GPU

Posted by 文火冰糖的硅基工坊


Author's homepage (文火冰糖的硅基工坊): 文火冰糖(王文兵)的博客_文火冰糖的硅基工坊_CSDN博客

Article URL: https://blog.csdn.net/HiWangWenBing/article/details/121279002


Table of Contents

Chapter 1: Overview of torchvision + GPU Training

1.1 Why use torchvision

1.2 torchvision in detail

1.3 GPU training

Chapter 2: Hand-Built LeNet and CIFAR10 Classification

2.1 LeNet reference

2.2 Problems with LeNet on the CIFAR10 dataset

Chapter 3: Defining the Forward Pass: Building the CIFAR10 Dataset with torchvision

3.1 Prerequisites

3.2 Defining data preprocessing (data augmentation)

3.3 Downloading and loading the dataset

3.4 Defining the batch loader

3.5 Visualizing a few samples

Chapter 4: Defining the Forward Pass: Building the Network with torchvision

4.1 Defining the AlexNet network

4.2 Defining a VGG network (for reference)

4.3 Testing AlexNet's output

4.4 Testing VGG's output

Chapter 5: Defining the Backward Pass: Loss Function and Optimizer

5.1 Defining the loss function

5.2 Defining the optimizer

Chapter 6: Defining the Backward Pass: Model Training

6.1 Preparation before training

6.2 Start training

Chapter 7: Model Evaluation - Training Process

7.1 Visualizing the loss over iterations

7.2 Visualizing the accuracy over iterations

Chapter 8: Model Evaluation - Training Results

8.1 Manual verification

8.2 Validation on the training set

8.3 Validation on the test set

Chapter 9: Saving and Storing the Model

Chapter 10: Restoring the Model

Chapter 11: Author's Reflections

11.1 Training on a CPU

11.2 Training on a GPU, by comparison

References:




Chapter 1: Overview of torchvision + GPU Training

1.1 Why use torchvision

In the previous articles in this series, we explored building convolutional neural networks and training models entirely by hand.

For well-known models, hand-building is clearly inefficient and error-prone.

Using the ready-made, well-known models that PyTorch provides is therefore a good choice.

In addition, torchvision provides libraries for datasets and data augmentation.

1.2 torchvision in detail

[PyTorch Series - 37]: Toolkit - the torchvision library in detail (datasets, data preprocessing, models)_文火冰糖(王文兵)的博客-CSDN博客

1.3 GPU training

Compared with CPU training, GPU training requires some extra work; the core differences are described in:

[PyTorch Series - 44]: How to enable GPU training and improve training efficiency_文火冰糖(王文兵)的博客-CSDN博客 (https://blog.csdn.net/HiWangWenBing/article/details/121277305)

The goals of this article are:

  • to demonstrate the method for training on a GPU (a minimal sketch of the required changes follows this list);
  • to experience the significant performance gap between GPU and CPU training of AlexNet; all subsequent training will be GPU-based.
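
As a minimal sketch of those changes (the article's full code follows in Chapters 3 through 6; `net`, `loss_fn`, and `train_loader` are assumed to be defined as they are later in this article):

import torch

# Choose the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = net.to(device)          # move the model parameters to the device (once)
loss_fn = loss_fn.to(device)  # move the loss module to the device (once)

for x, y in train_loader:
    x, y = x.to(device), y.to(device)  # move every batch to the same device
    out = net(x)
    loss = loss_fn(out, y)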

Chapter 2: Hand-Built LeNet and CIFAR10 Classification

2.1 LeNet reference

[PyTorch Series - 35]: Convolutional Neural Networks - Building LeNet-5 with the CIFAR10 classification dataset_文火冰糖(王文兵)的博客-CSDN博客

2.2 Problems with LeNet on the CIFAR10 dataset

Advantages:

  • small network size, fast to train
  • simple to build

Disadvantages:

  • accuracy is only about 60%

Chapter 3: Defining the Forward Pass: Building the CIFAR10 Dataset with torchvision

3.1 Prerequisites

# Environment setup
import numpy as np              # NumPy array library
import math                     # math library
import matplotlib.pyplot as plt # plotting library

import torch             # torch core library
import torch.nn as nn    # torch neural network library
import torch.nn.functional as F
import torchvision.datasets as dataset  # download and manage public datasets
import torchvision.transforms as transforms  # preprocessing / format conversion for public datasets
import torchvision.utils as utils 
import torch.utils.data as data_utils  # tools for loading datasets in batches
from PIL import Image # image display
from collections import OrderedDict
import torchvision.models as models

print("Hello World")
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
print(torch.backends.cudnn.version())
Hello World
1.10.0
True
10.2
7605

3.2 Defining data preprocessing (data augmentation)

# 2-1 Prepare the dataset
transform_train = transforms.Compose(
    [transforms.Resize(256),
     transforms.CenterCrop(224), 
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

transform_test = transforms.Compose(
    [transforms.Resize(256),
     transforms.CenterCrop(224), 
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
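
Note that despite this section's title, Resize, CenterCrop, and Normalize are all deterministic, so no actual augmentation happens here. If augmentation is wanted on the training set, a sketch using standard torchvision transforms (not part of this article's runs) would be:

# A training-time augmentation variant (a sketch; the runs below use the
# deterministic pipeline above)
transform_train_aug = transforms.Compose(
    [transforms.Resize(256),
     transforms.RandomCrop(224),         # random crop instead of center crop
     transforms.RandomHorizontalFlip(),  # flip left-right with probability 0.5
     transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])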

3.3 Downloading and loading the dataset

train_data = dataset.CIFAR10 (root = "../datasets/cifar10",
                           train = True,
                           transform = transform_train,
                           download = True)

# Prepare the test dataset
test_data = dataset.CIFAR10 (root = "../datasets/cifar10",
                           train = False,
                           transform = transform_test,
                           download = True)

print(train_data)
print("size=", len(train_data))
print("")
print(test_data)
print("size=", len(test_data))
Files already downloaded and verified
Files already downloaded and verified
Dataset CIFAR10
    Number of datapoints: 50000
    Root location: cifar10
    Split: Train
    StandardTransform
Transform: Compose(
               Resize(size=256, interpolation=PIL.Image.BILINEAR)
               CenterCrop(size=(224, 224))
               ToTensor()
               Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
           )
size= 50000

Dataset CIFAR10
    Number of datapoints: 10000
    Root location: cifar10
    Split: Test
    StandardTransform
Transform: Compose(
               Resize(size=256, interpolation=PIL.Image.BILINEAR)
               CenterCrop(size=(224, 224))
               ToTensor()
               Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))
           )
size= 10000

3.4 Defining the batch loader

# Batched data loading
batch_size = 64

train_loader = data_utils.DataLoader(dataset = train_data,  # training data
                                  batch_size = batch_size,  # number of images read per batch
                                  shuffle = True)           # whether to shuffle the order of the data

test_loader = data_utils.DataLoader(dataset = test_data,   # test dataset
                                  batch_size = batch_size,
                                  shuffle = True)

print(train_loader)
print(test_loader)
print(len(train_data), len(train_data)/batch_size)
print(len(test_data),  len(test_data)/batch_size)
<torch.utils.data.dataloader.DataLoader object at 0x000001D0D07A8EE0>
<torch.utils.data.dataloader.DataLoader object at 0x000001D0D07A8FA0>
50000 781.25
10000 156.25

Notes:

  • each batch reads 64 images
  • batches per epoch on the training set: 50000/64 = 781.25 (so 782 batches, the last one partial)
  • batches per epoch on the test set: 10000/64 = 156.25 (157 batches); a DataLoader options sketch follows
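
Since 50000 is not a multiple of 64, the final batch of each epoch contains only 16 images. If uniform batch sizes matter, DataLoader can drop that partial batch; a sketch (illustrative options, not used in this article's runs):

# drop_last=True discards the final partial batch, leaving 781 full batches;
# num_workers loads data in background worker processes
train_loader = data_utils.DataLoader(dataset = train_data,
                                     batch_size = batch_size,
                                     shuffle = True,
                                     drop_last = True,
                                     num_workers = 2)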

3.5 Visualizing a few samples

# Display one batch of images
print("Fetch one batch of images")
imgs, labels = next(iter(train_loader))
print(imgs.shape)
print(labels.shape)
print(labels.size()[0])

print("\nMerge into a single three-channel grid image")
images = utils.make_grid(imgs)
print(images.shape)
print(labels.shape)

print("\nConvert to imshow format")
images = images.numpy().transpose(1,2,0) 
print(images.shape)
print(labels.shape)


print("\nDisplay the sample labels")
# Print the image labels
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

for i in range(batch_size):
    print(labels[i], end=" ")
    # line break after every 8 labels
    if (i + 1) % 8 == 0:
        print()

print(' '.join('%5s' % classes[labels[j]] for j in range(batch_size)))
        
print("\nDisplay the images")
plt.imshow(images)
plt.show()
Fetch one batch of images
torch.Size([64, 3, 224, 224])
torch.Size([64])
64

Merge into a single three-channel grid image
torch.Size([3, 1810, 1810])
torch.Size([64])

Convert to imshow format
(1810, 1810, 3)
torch.Size([64])

Display the sample labels
tensor(1) tensor(6) tensor(5) tensor(4) tensor(8) tensor(1) tensor(1) tensor(1) 
tensor(7) tensor(4) tensor(0) tensor(1) tensor(3) tensor(5) tensor(1) tensor(5) 
tensor(6) tensor(2) tensor(5) tensor(7) tensor(2) tensor(9) tensor(2) tensor(4) 
tensor(4) tensor(8) tensor(0) tensor(7) tensor(7) tensor(3) tensor(8) tensor(5) 
tensor(4) tensor(9) tensor(0) tensor(5) tensor(6) tensor(0) tensor(2) tensor(1) 
tensor(4) tensor(2) tensor(5) tensor(8) tensor(2) tensor(5) tensor(6) tensor(0) 
tensor(4) tensor(2) tensor(5) tensor(0) tensor(6) tensor(1) tensor(2) tensor(6) 
tensor(5) tensor(3) tensor(2) tensor(1) tensor(6) tensor(3) tensor(5) tensor(0) 
  car  frog   dog  deer  ship   car   car   car horse  deer plane   car   cat   dog   car   dog  frog  bird   dog horse  bird truck  bird  deer  deer  ship plane horse horse   cat  ship   dog  deer truck plane   dog  frog plane  bird   car  deer  bird   dog  ship  bird   dog  frog plane  deer  bird   dog plane  frog   car  bird  frog   dog   cat  bird   car  frog   cat   dog plane

Display the images
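
Because the images were normalized to roughly [-1, 1], imshow clips the negative values and the displayed colors look washed out. A small fix is to undo the normalization before transposing (a sketch; mean = std = 0.5 as in the transforms above):

# Undo Normalize((0.5,)*3, (0.5,)*3): x_norm = (x - 0.5) / 0.5,
# so x = x_norm * 0.5 + 0.5 maps the grid back into [0, 1] for imshow
images = utils.make_grid(imgs)
images = (images * 0.5 + 0.5).numpy().transpose(1, 2, 0)
plt.imshow(images)
plt.show()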

Chapter 4: Defining the Forward Pass: Building the Network with torchvision

4.1 Defining the AlexNet network

# 2-3 Define the neural network using torchvision.models
net_a = models.alexnet(num_classes = 10)
print(net_a)
AlexNet(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
    (6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (7): ReLU(inplace=True)
    (8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (9): ReLU(inplace=True)
    (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
  (classifier): Sequential(
    (0): Dropout(p=0.5, inplace=False)
    (1): Linear(in_features=9216, out_features=4096, bias=True)
    (2): ReLU(inplace=True)
    (3): Dropout(p=0.5, inplace=False)
    (4): Linear(in_features=4096, out_features=4096, bias=True)
    (5): ReLU(inplace=True)
    (6): Linear(in_features=4096, out_features=10, bias=True)
  )
)
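
Here AlexNet is created with randomly initialized weights, which is why full training is needed below. torchvision can also supply ImageNet-pretrained weights; the pretrained head has 1000 outputs, so the final classifier layer must then be replaced for the 10 CIFAR10 classes. A sketch (using the API of torchvision versions contemporary with PyTorch 1.10; fine-tuning is treated in later articles of this series):

# Load ImageNet-pretrained AlexNet, then swap the 1000-way head for 10 classes
net_pre = models.alexnet(pretrained=True)
net_pre.classifier[6] = nn.Linear(4096, 10)  # new, randomly initialized head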

4.2 Defining a VGG network (for reference)

net_b = models.vgg16(num_classes = 10)
print(net_b)
VGG(
  (features): Sequential(
    (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (3): ReLU(inplace=True)
    (4): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (5): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (6): ReLU(inplace=True)
    (7): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (8): ReLU(inplace=True)
    (9): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (10): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (11): ReLU(inplace=True)
    (12): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (13): ReLU(inplace=True)
    (14): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (15): ReLU(inplace=True)
    (16): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (17): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (18): ReLU(inplace=True)
    (19): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (20): ReLU(inplace=True)
    (21): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (22): ReLU(inplace=True)
    (23): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (24): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (25): ReLU(inplace=True)
    (26): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (27): ReLU(inplace=True)
    (28): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (29): ReLU(inplace=True)
    (30): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (avgpool): AdaptiveAvgPool2d(output_size=(7, 7))
  (classifier): Sequential(
    (0): Linear(in_features=25088, out_features=4096, bias=True)
    (1): ReLU(inplace=True)
    (2): Dropout(p=0.5, inplace=False)
    (3): Linear(in_features=4096, out_features=4096, bias=True)
    (4): ReLU(inplace=True)
    (5): Dropout(p=0.5, inplace=False)
    (6): Linear(in_features=4096, out_features=10, bias=True)
  )
)

4.3 Testing AlexNet's output

(1) train mode

# 2-4 Define the network's predicted output
# Check that the network runs
print("Define test data")
input = torch.randn(1, 3, 224, 224)
print(input.shape)

print("")
net_a.train()
print("net_a output, method 1:")
out = net_a(input)
print(out.shape)
print(out)

print("")
print("net_a output, method 2:")
out = net_a.forward(input)
print(out)
Define test data
torch.Size([1, 3, 224, 224])

net_a output, method 1:
torch.Size([1, 10])
tensor([[ 0.0005,  0.0109,  0.0048, -0.0172, -0.0359,  0.0163, -0.0020,  0.0060,
          0.0102, -0.0046]], grad_fn=<AddmmBackward>)

net_a output, method 2:
tensor([[ 0.0059,  0.0139,  0.0014, -0.0043, -0.0188, -0.0014, -0.0116, -0.0012,
          0.0205, -0.0094]], grad_fn=<AddmmBackward>)

(2) eval mode

# 2-4 Define the network's predicted output
# Check that the network runs
print("Define test data")
input = torch.randn(1, 3, 224, 224)
print(input.shape)

print("")
net_a.eval()
print("net_a output, method 1:")
out = net_a(input)
print(out.shape)
print(out)

print("")
print("net_a output, method 2:")
out = net_a.forward(input)
print(out)
Define test data
torch.Size([1, 3, 224, 224])

net_a output, method 1:
torch.Size([1, 10])
tensor([[-0.0033,  0.0161,  0.0018, -0.0048, -0.0265,  0.0092, -0.0085,  0.0073,
          0.0143, -0.0105]], grad_fn=<AddmmBackward>)

net_a output, method 2:
tensor([[-0.0033,  0.0161,  0.0018, -0.0048, -0.0265,  0.0092, -0.0085,  0.0073,
          0.0143, -0.0105]], grad_fn=<AddmmBackward>)

Note: in train mode the two calls give different outputs because Dropout randomly zeroes activations on every forward pass; in eval mode Dropout is disabled, so repeated calls on the same input give identical results.

4.4 Testing VGG's output

(1) eval mode

print("")
net_b.eval()
print("net_b的输出方法1:")
out = net_b(input)
print(out)
print("")
print("net_b的输出方法2:")
out = net_b.forward(input)
print(out)
net_b output, method 1:
tensor([[-0.1271, -0.0033, -0.0870,  0.0262,  0.0463,  0.0320, -0.0020, -0.0226,
          0.0046, -0.0824]], grad_fn=<AddmmBackward>)

net_b output, method 2:
tensor([[-0.1271, -0.0033, -0.0870,  0.0262,  0.0463,  0.0320, -0.0020, -0.0226,
          0.0046, -0.0824]], grad_fn=<AddmmBackward>)

Chapter 5: Defining the Backward Pass: Loss Function and Optimizer

5.1 Defining the loss function

# 3-1 Define the loss function:
loss_fn = nn.CrossEntropyLoss()
print(loss_fn)
CrossEntropyLoss()
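
CrossEntropyLoss combines LogSoftmax and NLLLoss, so the network must output raw logits (no softmax layer at the end, as in the AlexNet above) and the targets are integer class indices. A quick standalone check (illustrative values):

# CrossEntropyLoss expects raw logits of shape (N, C) and integer targets of shape (N,)
logits  = torch.randn(4, 10)           # a batch of 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])   # class indices in [0, 10)
print(loss_fn(logits, targets))        # scalar; about ln(10) ≈ 2.30 for random logits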

5.2 Defining the optimizer

# 3-2 Define the optimizer
net = net_a

Learning_rate = 0.01     # learning rate

# optimizer = SGD: gradient descent (here with momentum)
# parameters: the parameters to be optimized
# lr: the learning rate
#optimizer = torch.optim.Adam(net.parameters(), lr = Learning_rate)
optimizer = torch.optim.SGD(net.parameters(), lr = Learning_rate, momentum=0.9)
print(optimizer)
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.01
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)
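
The learning rate stays fixed at 0.01 for all 30 epochs here. A decaying schedule often improves the final accuracy; a minimal sketch with StepLR (illustrative values, not used in this article's runs):

from torch.optim.lr_scheduler import StepLR

# Multiply the learning rate by 0.1 every 10 epochs
scheduler = StepLR(optimizer, step_size = 10, gamma = 0.1)

# Inside the training loop, step the scheduler once per epoch:
# for epoch in range(epochs):
#     ...train one epoch...
#     scheduler.step()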

Chapter 6: Defining the Backward Pass: Model Training

6.1 Preparation before training

# 3-3 Prepare for training
# Assume that we are on a CUDA machine, then this should print a CUDA device:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

# Number of epochs
epochs = 30

loss_history = [] # loss values recorded during training
accuracy_history =[] # batch accuracies recorded during training

accuracy_batch = 0.0

net.train() # put the network in training mode

#net.to(device) # move the network to the GPU (equivalent to the line below)
net.cuda()   
 
#loss_fn = loss_fn.to(device) # move the loss computation to the GPU (equivalent)
loss_fn.cuda() 

6.2 Start training

# 3-3 Model training
# The network has already been put in training mode above
for i in range(0, epochs):
    for j, (x_train, y_train) in enumerate(train_loader):
        
        # Set the model to train mode (done once before the loop)
        # net.train()
        # Move the data to the chosen device
        #x_train = x_train.to(device)
        x_train = x_train.cuda()
        #y_train = y_train.to(device)
        y_train = y_train.cuda()
        
        #(0) Reset the optimizer's gradients
        optimizer.zero_grad()    
        
        #(1) Forward pass
        y_pred = net(x_train)
    
        #(2) Compute the loss
        loss = loss_fn(y_pred, y_train)
    
        #(3) Backward pass (compute gradients)
        loss.backward()
    
        #(4) Update the parameters
        optimizer.step()
    
        # Record the loss during training
        loss_history.append(loss.item())  # loss for a batch
        
        # Record the accuracy during training
        number_batch = y_train.size()[0] # number of images in the batch
        _, predicted = torch.max(y_pred.data, dim = 1)
        correct_batch = (predicted == y_train).sum().item() # number of correct predictions
        accuracy_batch = 100 * correct_batch/number_batch
        accuracy_history.append(accuracy_batch)
    
        if(j % 10 == 0):
            print('epoch {} batch {} In {} loss = {:.4f} accuracy = {:.4f}%'.format(i, j , len(train_data)/batch_size, loss.item(), accuracy_batch)) 

print("\nTraining complete")
print("final loss =", loss.item())
print("final accu =", accuracy_batch)
epoch 0 batch 0 In 781.25 loss = 2.3010 accuracy = 17.1875%
epoch 0 batch 10 In 781.25 loss = 2.3029 accuracy = 12.5000%
epoch 0 batch 20 In 781.25 loss = 2.3098 accuracy = 3.1250%
epoch 0 batch 30 In 781.25 loss = 2.3070 accuracy = 4.6875%
epoch 0 batch 40 In 781.25 loss = 2.3046 accuracy = 12.5000%
epoch 0 batch 50 In 781.25 loss = 2.3016 accuracy = 12.5000%
epoch 0 batch 60 In 781.25 loss = 2.3078 accuracy = 9.3750%
epoch 0 batch 70 In 781.25 loss = 2.3036 accuracy = 6.2500%
epoch 0 batch 80 In 781.25 loss = 2.3003 accuracy = 14.0625%
epoch 0 batch 90 In 781.25 loss = 2.2997 accuracy = 10.9375%
epoch 0 batch 100 In 781.25 loss = 2.3024 accuracy = 7.8125%
...........................................................
epoch 7 batch 780 In 781.25 loss = 0.3938 accuracy = 82.8125%
epoch 8 batch 0 In 781.25 loss = 0.3017 accuracy = 92.1875%
epoch 8 batch 10 In 781.25 loss = 0.4603 accuracy = 82.8125%
epoch 8 batch 20 In 781.25 loss = 0.4147 accuracy = 87.5000%
epoch 8 batch 30 In 781.25 loss = 0.5234 accuracy = 76.5625%
epoch 8 batch 40 In 781.25 loss = 0.4334 accuracy = 82.8125%
epoch 8 batch 50 In 781.25 loss = 0.2955 accuracy = 90.6250%
epoch 8 batch 60 In 781.25 loss = 0.3096 accuracy = 85.9375%
epoch 8 batch 70 In 781.25 loss = 0.4529 accuracy = 82.8125%
epoch 8 batch 80 In 781.25 loss = 0.4333 accuracy = 82.8125%
...........................................................
epoch 29 batch 670 In 781.25 loss = 0.1166 accuracy = 95.3125%
epoch 29 batch 680 In 781.25 loss = 0.0578 accuracy = 96.8750%
epoch 29 batch 690 In 781.25 loss = 0.0197 accuracy = 100.0000%
epoch 29 batch 700 In 781.25 loss = 0.0356 accuracy = 98.4375%
epoch 29 batch 710 In 781.25 loss = 0.0484 accuracy = 98.4375%
epoch 29 batch 720 In 781.25 loss = 0.0757 accuracy = 98.4375%
epoch 29 batch 730 In 781.25 loss = 0.0605 accuracy = 98.4375%
epoch 29 batch 740 In 781.25 loss = 0.0941 accuracy = 95.3125%
epoch 29 batch 750 In 781.25 loss = 0.0542 accuracy = 98.4375%
epoch 29 batch 760 In 781.25 loss = 0.0248 accuracy = 98.4375%
epoch 29 batch 770 In 781.25 loss = 0.0334 accuracy = 98.4375%
epoch 29 batch 780 In 781.25 loss = 0.0423 accuracy = 98.4375%

Training complete
final loss = 0.005929590202867985
final accu = 100.0

Chapter 7: Model Evaluation - Training Process

7.1 Visualizing the loss over iterations

# Plot the loss history
plt.grid()
plt.xlabel("iters")
plt.ylabel("")
plt.title("loss", fontsize = 12)
plt.plot(loss_history, "r")
plt.show()

7.2 Visualizing the accuracy over iterations

# Plot the accuracy history
plt.grid()
plt.xlabel("iters")
plt.ylabel("%")
plt.title("accuracy", fontsize = 12)
plt.plot(accuracy_history, "b+")
plt.show()
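
The per-batch curves are quite noisy. A moving average makes the trend easier to read; a small sketch (the window size is illustrative):

# Smooth the loss curve with a 50-batch moving average
window = 50
smoothed = np.convolve(loss_history, np.ones(window) / window, mode = 'valid')

plt.grid()
plt.xlabel("iters")
plt.title("smoothed loss", fontsize = 12)
plt.plot(smoothed, "r")
plt.show()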

Chapter 8: Model Evaluation - Training Results

8.1 Manual verification

# Manual check
net.eval()  # put the trained network (net_a) into evaluation mode
index = 0
print("Fetch one batch of samples")
images, labels = next(iter(test_loader))
images = images.to(device)
labels = labels.to(device)
print(images.shape)
print(labels.shape)
print(labels)


print("\nPredict all samples in the batch")
outputs = net(images)
print(outputs.data.shape)

print("\nFor each sample, pick the most likely class")
_, predicted = torch.max(outputs, 1)
print(predicted.data.shape)
print(predicted)


print("\nCompare all predictions in the batch with the labels")
bool_results = (predicted == labels)
print(bool_results.shape)
print(bool_results)

print("\nCount the correct predictions and the accuracy")
corrects = bool_results.sum().item()
accuracy = corrects/(len(bool_results))
print("corrects=", corrects)
print("accuracy=", accuracy)

print("\nSample index =", index)
print("label        :", labels[index].item())
print("class scores :", outputs.data[index].cpu().numpy())
print("predicted    :", predicted.data[index].item())
print("correct      :", bool_results.data[index].item())
Fetch one batch of samples
torch.Size([64, 3, 224, 224])
torch.Size([64])
tensor([7, 9, 4, 8, 8, 7, 8, 1, 8, 1, 8, 1, 7, 5, 0, 7, 1, 7, 8, 4, 2, 5, 6, 7,
        7, 2, 9, 4, 6, 6, 3, 4, 2, 5, 2, 9, 0, 7, 1, 1, 9, 3, 1, 0, 5, 1, 9, 7,
        8, 9, 0, 0, 5, 1, 7, 3, 8, 5, 9, 4, 1, 1, 1, 7], device='cuda:0')

Predict all samples in the batch
torch.Size([64, 10])

For each sample, pick the most likely class
torch.Size([64])
tensor([7, 9, 4, 8, 8, 7, 8, 8, 8, 1, 8, 1, 7, 5, 7, 7, 1, 7, 8, 7, 2, 3, 6, 7,
        7, 0, 9, 4, 6, 6, 4, 4, 2, 5, 2, 9, 0, 7, 1, 1, 9, 3, 1, 0, 5, 1, 9, 7,
        8, 9, 0, 0, 4, 1, 5, 3, 8, 5, 7, 0, 1, 1, 1, 7], device='cuda:0')

Compare all predictions in the batch with the labels
torch.Size([64])
tensor([ True,  True,  True,  True,  True,  True,  True, False,  True,  True,
         True,  True,  True,  True, False,  True,  True,  True,  True, False,
         True, False,  True,  True,  True, False,  True,  True,  True,  True,
        False,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True, False,  True, False,  True,  True,  True, False, False,
         True,  True,  True,  True], device='cuda:0')

Count the correct predictions and the accuracy
corrects= 54
accuracy= 0.84375

Sample index = 0
label        : 7
class scores : [ -3.029822  -11.951658   -5.916941   -0.8202269  13.770017   12.778431
 -18.068062   39.197517  -15.120514   -9.246134 ]
predicted    : 7
correct      : True

8.2 Validation on the training set

# Evaluate the trained model: measure its overall accuracy on the training set
correct_dataset  = 0
total_dataset    = 0
accuracy_dataset = 0.0

# Gradients are not needed during evaluation
with torch.no_grad():
    for i, data in enumerate(train_loader):
        # Fetch one batch of samples
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)
        
        # Predict all samples in the batch
        outputs = net(images)
        
        # For each sample, pick the most likely class
        _, predicted = torch.max(outputs.data, 1)
        
        # Accumulate the number of samples in the batch
        total_dataset += labels.size()[0] 
        
        # Compare all predictions in the batch with the labels
        bool_results = (predicted == labels)
        
        # Accumulate the number of correct predictions
        correct_dataset += bool_results.sum().item()
        
        # Running accuracy so far
        accuracy_dataset = 100 * correct_dataset/total_dataset
        
        if(i % 100 == 0):
            print('batch {} In {} accuracy = {:.4f}'.format(i, len(train_data)/64, accuracy_dataset))
            
print('Final result with the model on the dataset, accuracy =', accuracy_dataset)
batch 0 In 781.25 accuracy = 95.3125
batch 100 In 781.25 accuracy = 98.4220
batch 200 In 781.25 accuracy = 98.3909
batch 300 In 781.25 accuracy = 98.2247
batch 400 In 781.25 accuracy = 98.1998
batch 500 In 781.25 accuracy = 98.2285
batch 600 In 781.25 accuracy = 98.2425
batch 700 In 781.25 accuracy = 98.2592
Final result with the model on the dataset, accuracy = 98.3

8.3 Validation on the test set

# Evaluate the trained model: measure its overall accuracy on the test set
correct_dataset  = 0
total_dataset    = 0
accuracy_dataset = 0.0

# Gradients are not needed during evaluation
with torch.no_grad():
    for i, data in enumerate(test_loader):
        # Fetch one batch of samples
        images, labels = data
        images = images.to(device)
        labels = labels.to(device)
        
        # Predict all samples in the batch
        outputs = net(images)
        
        # For each sample, pick the most likely class
        _, predicted = torch.max(outputs.data, 1)
        
        # Accumulate the number of samples in the batch
        total_dataset += labels.size()[0] 
        
        # Compare all predictions in the batch with the labels
        bool_results = (predicted == labels)
        
        # Accumulate the number of correct predictions
        correct_dataset += bool_results.sum().item()
        
        # Running accuracy so far
        accuracy_dataset = 100 * correct_dataset/total_dataset
        
        if(i % 100 == 0):
            print('batch {} In {} accuracy = {:.4f}'.format(i, len(test_data)/64, accuracy_dataset))
            
print('Final result with the model on the dataset, accuracy =', accuracy_dataset)
batch 0 In 156.25 accuracy = 84.3750
batch 100 In 156.25 accuracy = 81.3583
Final result with the model on the dataset, accuracy = 81.32

Notes:

Accuracy on the test set is lower than on the training set:

  • this suggests overfitting
  • a model that generalizes better is needed (one common mitigation is sketched below)
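
Two common mitigations are stronger augmentation of the training data (see the sketch in section 3.2) and explicit regularization. A minimal weight-decay sketch (the value is illustrative, not part of this article's runs):

# L2 regularization via the optimizer's weight_decay term penalizes large
# weights and usually narrows the train/test gap somewhat
optimizer = torch.optim.SGD(net.parameters(), lr = 0.01,
                            momentum = 0.9, weight_decay = 5e-4)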

Chapter 9: Saving and Storing the Model

Training a model to convergence takes real effort, so the trained model should be saved.

# Save the entire model
torch.save(net, "models/alexnet_model.pkl")

# Save only the parameters (state_dict)
torch.save(net.state_dict(), "models/alexnet_params.pkl")
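
For completeness, the matching load calls are sketched below (model restoration is covered in detail in later articles; the state_dict route is the generally recommended one):

# Restore the whole pickled model object
net_restored = torch.load("models/alexnet_model.pkl")

# Or rebuild the architecture and load only the parameters
net_restored = models.alexnet(num_classes = 10)
net_restored.load_state_dict(torch.load("models/alexnet_params.pkl"))
net_restored.eval()  # switch to evaluation mode before inference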

Chapter 10: Restoring the Model

See the follow-up articles in this series.

Chapter 11: Author's Reflections

11.1 Training on a CPU

My PyTorch journey has now reached a new bottleneck:

  • The workflow above applies not only to AlexNet but also to other networks such as VGG, ResNet, and Inception.
  • A general-purpose CPU can only cope with the tiny networks that predate AlexNet.
  • A general-purpose CPU can no longer handle even the small AlexNet, let alone the convolutional networks that followed it.
  • The networks from here on require a GPU training environment.

11.2 Training on a GPU, by comparison

Compared with the CPU, training on a GPU is noticeably easier and much faster: the speedup is at least 10x. The 30 epochs took roughly 30 minutes; on a CPU, a whole day would not have been enough.

It feels great!
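
For readers who want to quantify the speedup on their own hardware, a rough per-epoch timing sketch (not from the original runs):

import time

start = time.time()
for x_train, y_train in train_loader:
    x_train, y_train = x_train.cuda(), y_train.cuda()
    optimizer.zero_grad()
    loss = loss_fn(net(x_train), y_train)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
print("seconds per epoch:", time.time() - start)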

References:

[Deep Learning] Training CIFAR10 with classic torchvision models (AlexNet, VGG, Inception, ResNet, etc.)_sinat_33487968的博客-CSDN博客

