Faster R-CNN, Part 1: Preparation
This post is summarized from the Faster R-CNN paper and its PyTorch implementation.
Code structure of simple-faster-rcnn-pytorch:
- data
  - __init__.py
  - dataset.py
  - util.py
  - voc_dataset.py
- misc
  - convert_caffe_pretrain.py
  - train_fast.py
- model
  - utils
    - nms
      - __init__.py
      - _nms_gpu_post.py
      - build.py
      - non_maximum_suppression.py
    - __init__.py
    - bbox_tools.py
    - creator_tool.py
    - roi_cupy.py
  - __init__.py
  - faster_rcnn.py
  - faster_rcnn_vgg16.py
  - region_proposal_network.py
  - roi_module.py
- utils
  - __init__.py
  - array_tool.py
  - config.py
  - eval_tool.py
  - vis_tool.py
- demo.ipynb
- train.py
- trainer.py
The code is organized into four packages: data, misc, model, and utils. The core lives in model, which contains the NMS (non-maximum suppression) code, the RPN implementation, and the model definitions; train.py and trainer.py are the training scripts.
This post covers the first part of the code: the data package and the utils package.
Part One: The data package
First, download the VOC2007 dataset:
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCdevkit_08-Jun-2007.tar
and extract the three tarballs into one directory (named VOCdevkit):
tar xvf VOCtrainval_06-Nov-2007.tar
tar xvf VOCtest_06-Nov-2007.tar
tar xvf VOCdevkit_08-Jun-2007.tar
1. util.py
import numpy as np
from PIL import Image
import random


def read_image(path, dtype=np.float32, color=True):
    """Read an image from a file.

    This function reads an image from given file. The image is CHW format and
    the range of its value is :math:`[0, 255]`. If :obj:`color = True`, the
    order of the channels is RGB.

    Args:
        path (str): A path of image file.
        dtype: The type of array. The default value is :obj:`~numpy.float32`.
        color (bool): This option determines the number of channels.
            If :obj:`True`, the number of channels is three. In this case,
            the order of the channels is RGB. This is the default behaviour.
            If :obj:`False`, this function returns a grayscale image.

    Returns:
        ~numpy.ndarray: An image.
    """
    f = Image.open(path)
    try:
        if color:
            img = f.convert('RGB')
        else:
            img = f.convert('P')
        img = np.asarray(img, dtype=dtype)
    finally:
        if hasattr(f, 'close'):
            f.close()

    if img.ndim == 2:
        # reshape (H, W) -> (1, H, W)
        return img[np.newaxis]
    else:
        # transpose (H, W, C) -> (C, H, W)
        return img.transpose((2, 0, 1))


def resize_bbox(bbox, in_size, out_size):
    """Resize bounding boxes according to image resize.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        in_size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        out_size (tuple): A tuple of length 2. The height and the width
            of the image after resized.

    Returns:
        ~numpy.ndarray: Bounding boxes rescaled according to the given
        image shapes.
    """
    bbox = bbox.copy()
    y_scale = float(out_size[0]) / in_size[0]
    x_scale = float(out_size[1]) / in_size[1]
    bbox[:, 0] = y_scale * bbox[:, 0]
    bbox[:, 2] = y_scale * bbox[:, 2]
    bbox[:, 1] = x_scale * bbox[:, 1]
    bbox[:, 3] = x_scale * bbox[:, 3]
    return bbox


def flip_bbox(bbox, size, y_flip=False, x_flip=False):
    """Flip bounding boxes accordingly.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): An array whose shape is :math:`(R, 4)`.
            :math:`R` is the number of bounding boxes.
        size (tuple): A tuple of length 2. The height and the width
            of the image before resized.
        y_flip (bool): Flip bounding box according to a vertical flip of
            an image.
        x_flip (bool): Flip bounding box according to a horizontal flip of
            an image.

    Returns:
        ~numpy.ndarray: Bounding boxes flipped according to the given flips.
    """
    H, W = size
    bbox = bbox.copy()
    if y_flip:
        y_max = H - bbox[:, 0]
        y_min = H - bbox[:, 2]
        bbox[:, 0] = y_min
        bbox[:, 2] = y_max
    if x_flip:
        x_max = W - bbox[:, 1]
        x_min = W - bbox[:, 3]
        bbox[:, 1] = x_min
        bbox[:, 3] = x_max
    return bbox


def crop_bbox(
        bbox, y_slice=None, x_slice=None,
        allow_outside_center=True, return_param=False):
    """Translate bounding boxes to fit within the cropped area of an image.

    This method is mainly used together with image cropping.
    This method translates the coordinates of bounding boxes like
    :func:`data.util.translate_bbox`. In addition, this function truncates
    the bounding boxes to fit within the cropped area. If a bounding box
    does not overlap with the cropped area, this bounding box will be
    removed.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_slice (slice): The slice of y axis.
        x_slice (slice): The slice of x axis.
        allow_outside_center (bool): If this argument is :obj:`False`,
            bounding boxes whose centers are outside of the cropped area
            are removed. The default value is :obj:`True`.
        return_param (bool): If :obj:`True`, this function returns
            indices of kept bounding boxes.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`bbox`.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`bbox, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **index** (*numpy.ndarray*): An array holding indices of used
          bounding boxes.
    """
    t, b = _slice_to_bounds(y_slice)
    l, r = _slice_to_bounds(x_slice)
    crop_bb = np.array((t, l, b, r))

    if allow_outside_center:
        mask = np.ones(bbox.shape[0], dtype=bool)
    else:
        center = (bbox[:, :2] + bbox[:, 2:]) / 2
        mask = np.logical_and(crop_bb[:2] <= center, center < crop_bb[2:]).all(axis=1)

    bbox = bbox.copy()
    bbox[:, :2] = np.maximum(bbox[:, :2], crop_bb[:2])
    bbox[:, 2:] = np.minimum(bbox[:, 2:], crop_bb[2:])
    bbox[:, :2] -= crop_bb[:2]
    bbox[:, 2:] -= crop_bb[:2]

    mask = np.logical_and(mask, (bbox[:, :2] < bbox[:, 2:]).all(axis=1))
    bbox = bbox[mask]

    if return_param:
        return bbox, {'index': np.flatnonzero(mask)}
    else:
        return bbox


def _slice_to_bounds(slice_):
    if slice_ is None:
        return 0, np.inf

    if slice_.start is None:
        l = 0
    else:
        l = slice_.start

    if slice_.stop is None:
        u = np.inf
    else:
        u = slice_.stop

    return l, u


def translate_bbox(bbox, y_offset=0, x_offset=0):
    """Translate bounding boxes.

    This method is mainly used together with image transforms, such as
    padding and cropping, which translates the left top point of the image
    from coordinate :math:`(0, 0)` to coordinate
    :math:`(y, x) = (y_{offset}, x_{offset})`.

    The bounding boxes are expected to be packed into a two dimensional
    tensor of shape :math:`(R, 4)`, where :math:`R` is the number of
    bounding boxes in the image. The second axis represents attributes of
    the bounding box. They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`,
    where the four attributes are coordinates of the top left and the
    bottom right vertices.

    Args:
        bbox (~numpy.ndarray): Bounding boxes to be transformed. The shape is
            :math:`(R, 4)`. :math:`R` is the number of bounding boxes.
        y_offset (int or float): The offset along y axis.
        x_offset (int or float): The offset along x axis.

    Returns:
        ~numpy.ndarray: Bounding boxes translated according to the given
        offsets.
    """
    out_bbox = bbox.copy()
    out_bbox[:, :2] += (y_offset, x_offset)
    out_bbox[:, 2:] += (y_offset, x_offset)
    return out_bbox


def random_flip(img, y_random=False, x_random=False,
                return_param=False, copy=False):
    """Randomly flip an image in vertical or horizontal direction.

    Args:
        img (~numpy.ndarray): An array that gets flipped. This is in CHW
            format.
        y_random (bool): Randomly flip in vertical direction.
        x_random (bool): Randomly flip in horizontal direction.
        return_param (bool): Returns information of flip.
        copy (bool): If False, a view of :obj:`img` will be returned.

    Returns:
        ~numpy.ndarray or (~numpy.ndarray, dict):

        If :obj:`return_param = False`, returns an array :obj:`out_img`
        that is the result of flipping.

        If :obj:`return_param = True`, returns a tuple whose elements are
        :obj:`out_img, param`. :obj:`param` is a dictionary of intermediate
        parameters whose contents are listed below with key, value-type and
        the description of the value.

        * **y_flip** (*bool*): Whether the image was flipped in the
          vertical direction or not.
        * **x_flip** (*bool*): Whether the image was flipped in the
          horizontal direction or not.
    """
    y_flip, x_flip = False, False
    if y_random:
        y_flip = random.choice([True, False])
    if x_random:
        x_flip = random.choice([True, False])

    if y_flip:
        img = img[:, ::-1, :]
    if x_flip:
        img = img[:, :, ::-1]

    if copy:
        img = img.copy()

    if return_param:
        return img, {'y_flip': y_flip, 'x_flip': x_flip}
    else:
        return img
Utility functions:
read_image loads an image with PIL as RGB (or as a single-channel 'P'-mode image), then converts it to a C×H×W (or 1×H×W) ndarray with values in [0, 255].
resize_bbox rescales bboxes of shape (R, 4) according to the heights and widths of the image before and after resizing.
flip_bbox flips the input bboxes vertically and/or horizontally to match an image flip.
crop_bbox translates and truncates bboxes so they fit within a cropped region of the image; boxes that do not overlap the crop are removed.
translate_bbox shifts bboxes horizontally or vertically by the given offsets.
random_flip randomly flips a CHW image vertically and/or horizontally (see the sketch after this list):
- img = img[:, ::-1, :] — vertical flip
- img = img[:, :, ::-1] — horizontal flip
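A minimal sketch of how random_flip and flip_bbox are meant to be used together; the image and box values here are dummies made up for illustration:

import numpy as np
from data.util import random_flip, flip_bbox

# A dummy 3x100x200 image and one box (y_min, x_min, y_max, x_max).
img = np.zeros((3, 100, 200), dtype=np.float32)
bbox = np.array([[10., 20., 50., 80.]], dtype=np.float32)

# Flip the image horizontally at random and record what was done ...
img, params = random_flip(img, x_random=True, return_param=True)
# ... then apply the same flip to the boxes so they stay aligned.
bbox = flip_bbox(bbox, size=img.shape[1:], x_flip=params['x_flip'])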
2. voc_dataset.py
import os
import xml.etree.ElementTree as ET

import numpy as np

from .util import read_image


class VOCBboxDataset:
    """Bounding box dataset for PASCAL `VOC`_.

    .. _`VOC`: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

    The index corresponds to each image.

    When queried by an index, if :obj:`return_difficult == False`,
    this dataset returns a corresponding
    :obj:`img, bbox, label`, a tuple of an image, bounding boxes and labels.
    This is the default behaviour.
    If :obj:`return_difficult == True`, this dataset returns corresponding
    :obj:`img, bbox, label, difficult`. :obj:`difficult` is a boolean array
    that indicates whether bounding boxes are labeled as difficult or not.

    The bounding boxes are packed into a two dimensional tensor of shape
    :math:`(R, 4)`, where :math:`R` is the number of bounding boxes in
    the image. The second axis represents attributes of the bounding box.
    They are :math:`(y_{min}, x_{min}, y_{max}, x_{max})`, where the
    four attributes are coordinates of the top left and the bottom right
    vertices.

    The labels are packed into a one dimensional tensor of shape :math:`(R,)`.
    :math:`R` is the number of bounding boxes in the image.
    The class name of the label :math:`l` is :math:`l` th element of
    :obj:`VOC_BBOX_LABEL_NAMES`.

    The array :obj:`difficult` is a one dimensional boolean array of shape
    :math:`(R,)`. :math:`R` is the number of bounding boxes in the image.
    If :obj:`use_difficult` is :obj:`False`, this array is
    a boolean array with all :obj:`False`.

    The type of the image, the bounding boxes and the labels are as follows.

    * :obj:`img.dtype == numpy.float32`
    * :obj:`bbox.dtype == numpy.float32`
    * :obj:`label.dtype == numpy.int32`
    * :obj:`difficult.dtype == numpy.bool`

    Args:
        data_dir (string): Path to the root of the training data,
            i.e. "/data/image/voc/VOCdevkit/VOC2007/"
        split ({'train', 'val', 'trainval', 'test'}): Select a split of the
            dataset. The :obj:`test` split is only available for the
            2007 dataset.
        year ({'2007', '2012'}): Use a dataset prepared for a challenge
            held in :obj:`year`.
        use_difficult (bool): If :obj:`True`, use images that are labeled as
            difficult in the original annotation.
        return_difficult (bool): If :obj:`True`, this dataset returns
            a boolean array that indicates whether bounding boxes are
            labeled as difficult or not. The default value is :obj:`False`.
    """

    def __init__(self, data_dir, split='trainval',
                 use_difficult=False, return_difficult=False):
        # if split not in ['train', 'trainval', 'val']:
        #     if not (split == 'test' and year == '2007'):
        #         warnings.warn(
        #             'please pick split from \'train\', \'trainval\', \'val\''
        #             'for 2012 dataset. For 2007 dataset, you can pick \'test\''
        #             ' in addition to the above mentioned splits.'
        #         )
        id_list_file = os.path.join(
            data_dir, 'ImageSets/Main/{0}.txt'.format(split))

        self.ids = [id_.strip() for id_ in open(id_list_file)]
        self.data_dir = data_dir
        self.use_difficult = use_difficult
        self.return_difficult = return_difficult
        self.label_names = VOC_BBOX_LABEL_NAMES

    def __len__(self):
        return len(self.ids)

    def get_example(self, i):
        """Returns the i-th example.

        Returns a color image and bounding boxes. The image is in CHW format.
        The returned image is RGB.

        Args:
            i (int): The index of the example.

        Returns:
            tuple of an image and bounding boxes
        """
        id_ = self.ids[i]
        anno = ET.parse(
            os.path.join(self.data_dir, 'Annotations', id_ + '.xml'))
        bbox = list()
        label = list()
        difficult = list()
        for obj in anno.findall('object'):
            # when not using the difficult split, skip objects that are
            # labeled as difficult
            if not self.use_difficult and int(obj.find('difficult').text) == 1:
                continue

            difficult.append(int(obj.find('difficult').text))
            bndbox_anno = obj.find('bndbox')
            # subtract 1 to make pixel indexes 0-based
            bbox.append([
                int(bndbox_anno.find(tag).text) - 1
                for tag in ('ymin', 'xmin', 'ymax', 'xmax')])
            name = obj.find('name').text.lower().strip()
            label.append(VOC_BBOX_LABEL_NAMES.index(name))
        bbox = np.stack(bbox).astype(np.float32)
        label = np.stack(label).astype(np.int32)
        # When `use_difficult==False`, all elements in `difficult` are False.
        difficult = np.array(difficult, dtype=np.bool).astype(
            np.uint8)  # PyTorch doesn't support np.bool

        # Load an image
        img_file = os.path.join(self.data_dir, 'JPEGImages', id_ + '.jpg')
        img = read_image(img_file, color=True)

        # if self.return_difficult:
        #     return img, bbox, label, difficult
        return img, bbox, label, difficult

    __getitem__ = get_example


VOC_BBOX_LABEL_NAMES = (
    'aeroplane', 'bicycle', 'bird', 'boat', 'bottle',
    'bus', 'car', 'cat', 'chair', 'cow',
    'diningtable', 'dog', 'horse', 'motorbike', 'person',
    'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
This implements the VOC2007 dataset class. VOC2007 contains 9,963 annotated images in 20 object classes (21 with background). It provides the splits train, val, trainval, and test, with 2,501 / 2,510 / 5,011 / 4,952 images respectively (trainval = train + val); VOC2012 has no test split.
Training uses the trainval split; testing uses the test split.
Each image's annotations are stored in an XML file:
<annotation>
  <folder>VOC2007</folder>
  <filename>000001.jpg</filename>
  <source>
    <database>The VOC2007 Database</database>
    <annotation>PASCAL VOC2007</annotation>
    <image>flickr</image>
    <flickrid>341012865</flickrid>
  </source>
  <owner>
    <flickrid>Fried Camels</flickrid>
    <name>Jinky the Fruit Bat</name>
  </owner>
  <size>
    <width>353</width>
    <height>500</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>dog</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>48</xmin>
      <ymin>240</ymin>
      <xmax>195</xmax>
      <ymax>371</ymax>
    </bndbox>
  </object>
  <object>
    <name>person</name>
    <pose>Left</pose>
    <truncated>1</truncated>
    <difficult>0</difficult>
    <bndbox>
      <xmin>8</xmin>
      <ymin>12</ymin>
      <xmax>352</xmax>
      <ymax>498</ymax>
    </bndbox>
  </object>
</annotation>
Each XML file gives the image size and, for every object, its bbox coordinates, class label, and a difficult flag.
VOCBboxDataset is a plain Python class; instantiating it only requires the path to the VOC dataset.
Its one substantive method, get_example (aliased to __getitem__), returns the i-th image's data: image, bbox, label, and difficult (see the usage sketch below).
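A quick usage sketch; the dataset path is a placeholder:

from data.voc_dataset import VOCBboxDataset, VOC_BBOX_LABEL_NAMES

dataset = VOCBboxDataset('/path/to/VOCdevkit/VOC2007/', split='trainval')
img, bbox, label, difficult = dataset[0]  # __getitem__ is an alias of get_example
print(img.shape)                          # (3, H, W), float32 RGB in [0, 255]
print(bbox.shape, label.shape)            # (R, 4) and (R,)
print(VOC_BBOX_LABEL_NAMES[label[0]])     # class name of the first box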
3. dataset.py
import torch as t
from .voc_dataset import VOCBboxDataset
from skimage import transform as sktsf
from torchvision import transforms as tvtsf
from . import util
import numpy as np
from utils.config import opt


def inverse_normalize(img):
    if opt.caffe_pretrain:
        img = img + (np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1))
        return img[::-1, :, :]
    # approximate un-normalize for visualize
    return (img * 0.225 + 0.45).clip(min=0, max=1) * 255


def pytorch_normalze(img):
    """
    https://github.com/pytorch/vision/issues/223
    return appr -1~1 RGB
    """
    normalize = tvtsf.Normalize(mean=[0.485, 0.456, 0.406],
                                std=[0.229, 0.224, 0.225])
    img = normalize(t.from_numpy(img))
    return img.numpy()


def caffe_normalize(img):
    """
    return appr -125~125 BGR
    """
    img = img[[2, 1, 0], :, :]  # RGB -> BGR
    img = img * 255
    mean = np.array([122.7717, 115.9465, 102.9801]).reshape(3, 1, 1)
    img = (img - mean).astype(np.float32, copy=True)
    return img


def preprocess(img, min_size=600, max_size=1000):
    """Preprocess an image for feature extraction.

    The length of the shorter edge is scaled to :obj:`min_size`.
    After the scaling, if the length of the longer edge is longer than
    :obj:`max_size`, the image is scaled to fit the longer edge
    to :obj:`max_size`.

    After resizing the image, the image is subtracted by a mean image value
    :obj:`self.mean`.

    Args:
        img (~numpy.ndarray): An image. This is in CHW and RGB format.
            The range of its value is :math:`[0, 255]`.

    Returns:
        ~numpy.ndarray: A preprocessed image.
    """
    C, H, W = img.shape
    scale1 = min_size / min(H, W)
    scale2 = max_size / max(H, W)
    scale = min(scale1, scale2)
    img = img / 255.
    img = sktsf.resize(img, (C, H * scale, W * scale), mode='reflect')
    # both the longer and shorter edges should be less than
    # max_size and min_size
    if opt.caffe_pretrain:
        normalize = caffe_normalize
    else:
        normalize = pytorch_normalze
    return normalize(img)


class Transform(object):

    def __init__(self, min_size=600, max_size=1000):
        self.min_size = min_size
        self.max_size = max_size

    def __call__(self, in_data):
        img, bbox, label = in_data
        _, H, W = img.shape
        img = preprocess(img, self.min_size, self.max_size)
        _, o_H, o_W = img.shape
        scale = o_H / H
        bbox = util.resize_bbox(bbox, (H, W), (o_H, o_W))

        # horizontally flip
        img, params = util.random_flip(
            img, x_random=True, return_param=True)
        bbox = util.flip_bbox(
            bbox, (o_H, o_W), x_flip=params['x_flip'])

        return img, bbox, label, scale


class Dataset:
    def __init__(self, opt):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir)
        self.tsf = Transform(opt.min_size, opt.max_size)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)
        img, bbox, label, scale = self.tsf((ori_img, bbox, label))
        # TODO: check whose stride is negative to fix this instead copy all
        # some of the strides of a given numpy array are negative.
        return img.copy(), bbox.copy(), label.copy(), scale

    def __len__(self):
        return len(self.db)


class TestDataset:
    def __init__(self, opt, split='test', use_difficult=True):
        self.opt = opt
        self.db = VOCBboxDataset(opt.voc_data_dir, split=split,
                                 use_difficult=use_difficult)

    def __getitem__(self, idx):
        ori_img, bbox, label, difficult = self.db.get_example(idx)
        img = preprocess(ori_img)
        return img, ori_img.shape[1:], bbox, label, difficult

    def __len__(self):
        return len(self.db)
Building the data pipeline:
inverse_normalize undoes either the caffe-style or the torchvision-style normalization, for visualization. Either the caffe VGG16 pretrained weights or the torchvision ones can be used; the latter give slightly worse results.
pytorch_normalze standardizes an input image for the torchvision weights: [0, 255] RGB is scaled to [0, 1] RGB and then normalized channel-wise to roughly [-1, 1] RGB.
caffe_normalize standardizes an input image for the caffe weights: [0, 255] RGB is converted to BGR and mean-subtracted, giving roughly [-125, 125] BGR.
preprocess takes the CHW [0, 255] image produced by read_image, divides it by 255, rescales it so that, as in the paper, the short edge is at most 600 pixels and the long edge at most 1000 pixels, and finally applies pytorch_normalze or caffe_normalize (a worked example of the scaling follows).
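To make the scaling rule concrete, here is the arithmetic for a 500×353 image (the size from the XML example above):

H, W = 500, 353
scale1 = 600 / min(H, W)     # 600 / 353 ≈ 1.70: would bring the short edge to 600
scale2 = 1000 / max(H, W)    # 1000 / 500 = 2.0: would bring the long edge to 1000
scale = min(scale1, scale2)  # ≈ 1.70, so the short edge becomes 600
                             # and the long edge 500 * 1.70 = 850 stays under 1000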
The Transform class wraps training-time preprocessing in its __call__ method: it runs preprocess on the image, rescales the bboxes by the same factor, and then randomly flips the image and bboxes horizontally together.
The Dataset class generates training samples (the trainval split). Its __getitem__ fetches one example from VOCBboxDataset and runs it through Transform, returning the processed image, bbox, label, and scale.
The TestDataset class generates test samples (the test split). Its __getitem__ fetches one example and only calls preprocess: the bboxes are not resized, and the original image size is returned alongside. A loader sketch follows.
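A sketch of how these two classes are typically consumed; this mirrors what train.py does, with batch size 1 as used throughout this implementation:

from torch.utils.data import DataLoader

from data.dataset import Dataset, TestDataset
from utils.config import opt

train_loader = DataLoader(Dataset(opt), batch_size=1, shuffle=True,
                          num_workers=opt.num_workers)
test_loader = DataLoader(TestDataset(opt), batch_size=1, shuffle=False,
                         num_workers=opt.test_num_workers)

for img, bbox, label, scale in train_loader:
    # each batch holds exactly one image
    break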
Part Two: The utils package
1. array_tool.py
""" tools to convert specified type """ import torch as t import numpy as np def tonumpy(data): if isinstance(data, np.ndarray): return data if isinstance(data, t._TensorBase): return data.cpu().numpy() if isinstance(data, t.autograd.Variable): return tonumpy(data.data) def totensor(data, cuda=True): if isinstance(data, np.ndarray): tensor = t.from_numpy(data) if isinstance(data, t._TensorBase): tensor = data if isinstance(data, t.autograd.Variable): tensor = data.data if cuda: tensor = tensor.cuda() return tensor def tovariable(data): if isinstance(data, np.ndarray): return tovariable(totensor(data)) if isinstance(data, t._TensorBase): return t.autograd.Variable(data) if isinstance(data, t.autograd.Variable): return data else: raise ValueError("UnKnow data type: %s, input should be {np.ndarray,Tensor,Variable}" %type(data)) def scalar(data): if isinstance(data, np.ndarray): return data.reshape(1)[0] if isinstance(data, t._TensorBase): return data.view(1)[0] if isinstance(data, t.autograd.Variable): return data.data.view(1)[0]
Type-conversion helpers for moving data between numpy ndarrays, torch tensors, and Variables.
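A small usage sketch; note that these helpers target the old PyTorch 0.3-era API (t._TensorBase and Variable no longer exist in modern PyTorch):

import numpy as np

from utils import array_tool as at

a = np.arange(6, dtype=np.float32).reshape(2, 3)
tensor = at.totensor(a, cuda=False)  # ndarray -> torch tensor (kept on CPU here)
back = at.tonumpy(tensor)            # tensor -> ndarray
value = at.scalar(np.array([3.14]))  # one-element array -> Python scalar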
2. config.py
from pprint import pprint


# Default Configs for training
# NOTE that config items can be overwritten by passing arguments through the
# command line, e.g. --voc-data-dir='./data/'


class Config:
    # data
    voc_data_dir = '/home/cy/.chainer/dataset/pfnet/chainercv/voc/VOCdevkit/VOC2007/'
    min_size = 600   # image resize
    max_size = 1000  # image resize
    num_workers = 8
    test_num_workers = 8

    # sigma for l1_smooth_loss
    rpn_sigma = 3.
    roi_sigma = 1.

    # param for optimizer
    # 0.0005 in origin paper but 0.0001 in tf-faster-rcnn
    weight_decay = 0.0005
    lr_decay = 0.1  # 1e-3 -> 1e-4
    lr = 1e-3

    # visualization
    env = 'faster-rcnn'  # visdom env
    port = 8097
    plot_every = 40  # vis every N iter

    # preset
    data = 'voc'
    pretrained_model = 'vgg16'

    # training
    epoch = 14

    use_adam = False     # Use Adam optimizer
    use_chainer = False  # try match everything as chainer
    use_drop = False     # use dropout in RoIHead

    # debug
    debug_file = '/tmp/debugf'

    test_num = 10000

    # model
    load_path = None

    caffe_pretrain = False  # use caffe pretrained model instead of torchvision
    caffe_pretrain_path = 'checkpoints/vgg16-caffe.pth'

    def _parse(self, kwargs):
        state_dict = self._state_dict()
        for k, v in kwargs.items():
            if k not in state_dict:
                raise ValueError('Unknown Option: "--%s"' % k)
            setattr(self, k, v)

        print('======user config========')
        pprint(self._state_dict())
        print('==========end============')

    def _state_dict(self):
        return {k: getattr(self, k) for k, _ in Config.__dict__.items()
                if not k.startswith('_')}


opt = Config()
The configuration file: dataset path, visdom environment, image-size limits, pretrained-weight type, learning rate, and the other hyperparameters.
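Options are overridden at runtime through _parse; a minimal sketch, where the dataset path is a placeholder:

from utils.config import opt

opt._parse({'voc_data_dir': '/path/to/VOCdevkit/VOC2007/',
            'caffe_pretrain': True})  # unknown keys raise ValueError
print(opt.voc_data_dir, opt.lr, opt.epoch)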
3. vis_tool.py
import time

import numpy as np
import matplotlib
import torch as t
import visdom

matplotlib.use('Agg')
from matplotlib import pyplot as plot

# from data.voc_dataset import VOC_BBOX_LABEL_NAMES

VOC_BBOX_LABEL_NAMES = (
    'fly', 'bike', 'bird', 'boat', 'pin',
    'bus', 'c', 'cat', 'chair', 'cow',
    'table', 'dog', 'horse', 'moto', 'p',
    'plant', 'shep', 'sofa', 'train', 'tv',
)


def vis_image(img, ax=None):
    """Visualize a color image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    if ax is None:
        fig = plot.figure()
        ax = fig.add_subplot(1, 1, 1)
    # CHW -> HWC
    img = img.transpose((1, 2, 0))
    ax.imshow(img.astype(np.uint8))
    return ax


def vis_bbox(img, bbox, label=None, score=None, ax=None):
    """Visualize bounding boxes inside image.

    Args:
        img (~numpy.ndarray): An array of shape :math:`(3, height, width)`.
            This is in RGB format and the range of its value is
            :math:`[0, 255]`.
        bbox (~numpy.ndarray): An array of shape :math:`(R, 4)`, where
            :math:`R` is the number of bounding boxes in the image.
            Each element is organized
            by :math:`(y_{min}, x_{min}, y_{max}, x_{max})` in the second axis.
        label (~numpy.ndarray): An integer array of shape :math:`(R,)`.
            The values correspond to id for label names stored in
            :obj:`label_names`. This is optional.
        score (~numpy.ndarray): A float array of shape :math:`(R,)`.
            Each value indicates how confident the prediction is.
            This is optional.
        label_names (iterable of strings): Name of labels ordered according
            to label ids. If this is :obj:`None`, labels will be skipped.
        ax (matplotlib.axes.Axis): The visualization is displayed on this
            axis. If this is :obj:`None` (default), a new axis is created.

    Returns:
        ~matploblib.axes.Axes:
        Returns the Axes object with the plot for further tweaking.
    """
    label_names = list(VOC_BBOX_LABEL_NAMES) + ['bg']  # add for index `-1`
    if label is not None and not len(bbox) == len(label):
        raise ValueError('The length of label must be same as that of bbox')
    if score is not None and not len(bbox) == len(score):
        raise ValueError('The length of score must be same as that of bbox')

    # Returns newly instantiated matplotlib.axes.Axes object if ax is None
    ax = vis_image(img, ax=ax)

    # If there is no bounding box to display, visualize the image and exit.
    if len(bbox) == 0:
        return ax

    for i, bb in enumerate(bbox):
        xy = (bb[1], bb[0])
        height = bb[2] - bb[0]
        width = bb[3] - bb[1]
        ax.add_patch(plot.Rectangle(
            xy, width, height, fill=False, edgecolor='red', linewidth=2))

        caption = list()

        if label is not None and label_names is not None:
            lb = label[i]
            if not (-1 <= lb < len(label_names)):  # modified here to add background
                raise ValueError('No corresponding name is given')
            caption.append(label_names[lb])
        if score is not None:
            sc = score[i]
            caption.append('{:.2f}'.format(sc))

        if len(caption) > 0:
            ax.text(bb[1], bb[0],
                    ': '.join(caption),
                    style='italic',
                    bbox={'facecolor': 'white', 'alpha': 0.5, 'pad': 0})
    return ax


def fig2data(fig):
    """
    brief Convert a Matplotlib figure to a 4D numpy array with RGBA channels
    and return it

    @param fig: a matplotlib figure
    @return a numpy 3D array of RGBA values
    """
    # draw the renderer
    fig.canvas.draw()

    # Get the RGBA buffer from the figure
    w, h = fig.canvas.get_width_height()
    buf = np.fromstring(fig.canvas.tostring_argb(), dtype=np.uint8)
    buf.shape = (w, h, 4)

    # canvas.tostring_argb gives pixmap in ARGB mode.
    # Roll the ALPHA channel to have it in RGBA mode
    buf = np.roll(buf, 3, axis=2)
    return buf.reshape(h, w, 4)


def fig4vis(fig):
    """
    convert figure to ndarray
    """
    ax = fig.get_figure()
    img_data = fig2data(ax).astype(np.int32)
    plot.close()
    # HWC -> CHW
    return img_data[:, :, :3].transpose((2, 0, 1)) / 255.


def visdom_bbox(*args, **kwargs):
    fig = vis_bbox(*args, **kwargs)
    data = fig4vis(fig)
    return data


class Visualizer(object):
    """
    wrapper for visdom
    you can still access naive visdom functions via self.line, self.scatter,
    self._send, etc., due to the implementation of `__getattr__`
    """

    def __init__(self, env='default', **kwargs):
        self.vis = visdom.Visdom(env=env, **kwargs)
        self._vis_kw = kwargs

        # e.g. ('loss', 23) means the 23rd logged value of 'loss'
        self.index = {}
        self.log_text = ''

    def reinit(self, env='default', **kwargs):
        """
        change the config of visdom
        """
        self.vis = visdom.Visdom(env=env, **kwargs)
        return self

    def plot_many(self, d):
        """
        plot multiple values
        @params d: dict (name, value) i.e. ('loss', 0.11)
        """
        for k, v in d.items():
            if v is not None:
                self.plot(k, v)

    def img_many(self, d):
        for k, v in d.items():
            self.img(k, v)

    def plot(self, name, y, **kwargs):
        """
        self.plot('loss', 1.00)
        """
        x = self.index.get(name, 0)
        self.vis.line(Y=np.array([y]), X=np.array([x]),
                      win=name,
                      opts=dict(title=name),
                      update=None if x == 0 else 'append',
                      **kwargs)
        self.index[name] = x + 1

    def img(self, name, img_, **kwargs):
        """
        self.img('input_img', t.Tensor(64, 64))
        self.img('input_imgs', t.Tensor(3, 64, 64))
        self.img('input_imgs', t.Tensor(100, 1, 64, 64))
        self.img('input_imgs', t.Tensor(100, 3, 64, 64), nrows=10)
        !!! don't ~~self.img('input_imgs', t.Tensor(100, 64, 64), nrows=10)~~ !!!
        """
        self.vis.images(t.Tensor(img_).cpu().numpy(),
                        win=name,
                        opts=dict(title=name),
                        **kwargs)

    def log(self, info, win='log_text'):
        """
        self.log({'loss': 1, 'lr': 0.0001})
        """
        self.log_text += ('[{time}] {info} <br>'.format(
            time=time.strftime('%m%d_%H%M%S'),
            info=info))
        self.vis.text(self.log_text, win)

    def __getattr__(self, name):
        return getattr(self.vis, name)

    def state_dict(self):
        return {
            'index': self.index,
            'vis_kw': self._vis_kw,
            'log_text': self.log_text,
            'env': self.vis.env
        }

    def load_state_dict(self, d):
        self.vis = visdom.Visdom(env=d.get('env', self.vis.env),
                                 **(d.get('vis_kw')))
        self.log_text = d.get('log_text', '')
        self.index = d.get('index', dict())
        return self
vis_image displays a (3, H, W) RGB image on a matplotlib axis.
vis_bbox draws the image together with its bboxes and each bbox's label and score.
visdom_bbox renders vis_bbox to a figure and, via fig2data and fig4vis, converts the result back into an ndarray that can be sent to visdom.
The Visualizer class wraps everything that gets displayed in visdom (a usage sketch follows).
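A minimal usage sketch, assuming a visdom server is already running (e.g. python -m visdom.server):

import numpy as np

from utils.vis_tool import Visualizer

vis = Visualizer(env='faster-rcnn')
vis.plot('total_loss', 0.87)                    # one scalar per call; x auto-increments
vis.img('rand_img', np.random.rand(3, 64, 64))  # any CHW image array
vis.log({'lr': 1e-3, 'epoch': 1})               # appends a timestamped log line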
4. eval_tool.py
Evaluates detection results.
calc_detection_voc_prec_rec computes per-class precision and recall.
calc_detection_voc_ap builds on it to compute per-class average precision (AP).
eval_detection_voc calls both and returns the per-class AP and the mAP.
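A hedged sketch of calling eval_detection_voc, assuming the ChainerCV-style signature this repository adopts (per-image lists of ndarrays); the box values are dummies:

import numpy as np

from utils.eval_tool import eval_detection_voc

# One image whose single prediction exactly matches the ground truth.
pred_bboxes = [np.array([[10., 20., 50., 80.]])]
pred_labels = [np.array([11])]   # class index 11 = 'dog'
pred_scores = [np.array([0.9])]
gt_bboxes = [np.array([[10., 20., 50., 80.]])]
gt_labels = [np.array([11])]
gt_difficults = [np.array([False])]

result = eval_detection_voc(
    pred_bboxes, pred_labels, pred_scores,
    gt_bboxes, gt_labels, gt_difficults,
    use_07_metric=True)          # VOC2007 11-point AP
print(result['ap'], result['map'])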
Note: bbox coordinates always appear with shape (R, 4). During bounding-box regression they are converted to a center point plus height and width; everywhere else they are the top-left and bottom-right corner coordinates, i.e. (y_min, x_min, y_max, x_max).
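A short sketch of that corner-to-center conversion; it mirrors the arithmetic inside model/utils/bbox_tools.py, written out here for illustration:

import numpy as np

def corner_to_center(bbox):
    # (y_min, x_min, y_max, x_max) -> (ctr_y, ctr_x, height, width)
    h = bbox[:, 2] - bbox[:, 0]
    w = bbox[:, 3] - bbox[:, 1]
    ctr_y = bbox[:, 0] + 0.5 * h
    ctr_x = bbox[:, 1] + 0.5 * w
    return np.stack((ctr_y, ctr_x, h, w), axis=1)

print(corner_to_center(np.array([[10., 20., 50., 80.]])))  # [[30. 50. 40. 60.]]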
Reference:
从编程实现角度学习Faster R-CNN(附极简实现)