Installing Caffe on Ubuntu 16.04 with CUDA 9.0, cuDNN v7.1, OpenCV 3.4.0, Anaconda3, and MATLAB 2017a

Posted 2022-11-02 jlqzzz


For installing and configuring Ubuntu 16.04, CUDA 9.0, cuDNN v7.1, OpenCV 3.4.0, Anaconda3, and MATLAB 2017a, see my earlier posts.

This post goes straight to installing and configuring Caffe.

General dependencies

sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libhdf5-serial-dev protobuf-compiler
sudo apt-get install --no-install-recommends libboost-all-dev

Next, install:

sudo apt-get install libgflags-dev libgoogle-glog-dev liblmdb-dev  # Ubuntu-specific packages
sudo apt-get install libatlas-dev
sudo apt-get install liblapack-dev
sudo apt-get install libatlas-base-dev

Then download Caffe by cloning the source from GitHub:

git clone https://github.com/BVLC/caffe.git

cd caffe

Create the build configuration:

cp Makefile.config.example Makefile.config # copy the example build configuration

Then edit Makefile.config. From the caffe directory, open the file:

sudo gedit Makefile.config

Make the following changes to Makefile.config:

1. Enable cuDNN

Change
#USE_CUDNN := 1
to
USE_CUDNN := 1

2. Set the OpenCV version

Change
#OPENCV_VERSION := 3
to
OPENCV_VERSION := 3

3. Set the CUDA path

Change
CUDA_DIR := /usr/local/cuda
to
CUDA_DIR := /usr/local/cuda-9.0

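4. Edit CUDA_ARCH

Comment out or delete the following two entries, because CUDA 9.0 no longer supports the compute_20 (Fermi) architectures:
-gencode arch=compute_20,code=sm_20 \
		-gencode arch=compute_20,code=sm_21 \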

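5. Set the BLAS library

Change the BLAS setting to the following (this assumes Intel MKL is installed; the default value is atlas):
BLAS := mkl

6. Set the MATLAB path

Change it to
MATLAB_DIR := /usr/local/MATLAB/R2017a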

7. Configure Python

Comment out the Python 2 settings:
#PYTHON_INCLUDE := /usr/include/python2.7 \
#		/usr/lib/python2.7/dist-packages/numpy/core/include
Then configure the Anaconda and Python 3 settings as follows:
# Anaconda Python distribution is quite popular. Include path:
# Verify anaconda location, sometimes it's in root.
 ANACONDA_HOME := $(HOME)/anaconda3
 PYTHON_INCLUDE := $(ANACONDA_HOME)/include \
		 $(ANACONDA_HOME)/include/python3.6m \
		 $(ANACONDA_HOME)/lib/python3.6/site-packages/numpy/core/include

# Uncomment to use Python 3 (default is Python 2)
 PYTHON_LIBRARIES := boost_python3 python3.6m
 PYTHON_INCLUDE := /usr/include/python3.6m \
                 /usr/lib/python3.6/dist-packages/numpy/core/include \
                /home/zya/anaconda3/include/python3.6m

# We need to be able to find libpythonX.X.so or .dylib.
#PYTHON_LIB := /usr/lib
 PYTHON_LIB := $(ANACONDA_HOME)/lib \
               $(ANACONDA_HOME)/pkgs/python-3.6.5-hc3d631a_2/lib

Note that PYTHON_INCLUDE is assigned twice here; because := is used, the second assignment replaces the first.

8. Enable the Python layer

Change
#WITH_PYTHON_LAYER := 1
to
WITH_PYTHON_LAYER := 1

9. An important one

Below the line # Whatever else you find you need goes here., change
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib
to
INCLUDE_DIRS := $(PYTHON_INCLUDE) /usr/local/include /usr/include/hdf5/serial
LIBRARY_DIRS := $(PYTHON_LIB) /usr/local/lib /usr/lib /usr/lib/x86_64-linux-gnu /usr/lib/x86_64-linux-gnu/hdf5/serial
This is needed because Ubuntu 16.04 moved some headers and libraries, in particular those of HDF5, so these paths have to be added.

Next, edit the Makefile in the caffe directory.

Replace
NVCCFLAGS += -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)
with
NVCCFLAGS += -D_FORCE_INLINES -ccbin=$(CXX) -Xcompiler -fPIC $(COMMON_FLAGS)

Replace
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_hl hdf5
with
LIBRARIES += glog gflags protobuf boost_system boost_filesystem m hdf5_serial_hl hdf5_serial opencv_core opencv_imgproc opencv_imgcodecs opencv_highgui

Then edit /usr/local/cuda-9.0/include/crt/host_config.h.

Change
#error -- unsupported GNU version! gcc versions later than 6 are not supported!
to
//#error -- unsupported GNU version! gcc versions later than 6 are not supported!

Errors you may run into during installation

error while loading shared libraries: libpython3.6m.so.1.0 not found

Run locate libpython3.6m.so.1.0 to find where the library lives.

On my machine it is in anaconda3/lib.

Open /etc/ld.so.conf:

sudo gedit /etc/ld.so.conf

Add the following line, then run sudo ldconfig so the linker cache picks it up:

/home/zya/anaconda3/lib/

After make runtest, two tests may fail:

[ FAILED ] 2 tests, listed below:
[ FAILED ] BatchReindexLayerTest/2.TestGradient, where TypeParam = N5caffe9GPUDeviceIfEE
[ FAILED ] BatchReindexLayerTest/3.TestGradient, where TypeParam = N5caffe9GPUDeviceIdEE

The fix is described in https://github.com/BVLC/caffe/issues/6164; I summarize it here.
vim Makefile
Then search for NVCCFLAGS with / until you reach this section:

...
# Debugging
ifeq ($(DEBUG), 1)
COMMON_FLAGS += -DDEBUG -g -O0
NVCCFLAGS += -G
else
COMMON_FLAGS += -DNDEBUG -O2
endif
...

Change it to

...
# Debugging
ifeq ($(DEBUG), 1)
COMMON_FLAGS += -DDEBUG -g -O0
NVCCFLAGS += -G
else
COMMON_FLAGS += -DNDEBUG -O2
NVCCFLAGS += -G
endif
...

That is, add one NVCCFLAGS += -G line to the else branch.
Then rebuild, and all tests pass.

You may also see: /usr/lib/x86_64-linux-gnu/libunwind.so.8: undefined reference to `lzma_index_size@XZ_5.0'. To fix it, add the library directories to the search path. From your home directory, open ~/.bashrc:

$ gedit ~/.bashrc

Add to the file:

export LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/lib/x86_64-linux-gnu:$LD_LIBRARY_PATH

Then, on the command line, run:
sudo ldconfig # apply the change immediately

You may also hit "ld cannot find -lboost_python3". In that case, create a symbolic link to libboost_python-py35.so.

The steps, taken from the post “cannot find -lboost_python3” when using Python3 Ubuntu16.04, are:

cd /usr/lib/x86_64-linux-gnu
sudo ln -s libboost_python-py35.so libboost_python3.so

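Now you can start the build. In the caffe directory, run:

sudo make all -j12

sudo make test -j12

If any earlier configuration or installation step went wrong, the build will fail in all sorts of ways, so take care with the steps above.

Once the build succeeds, run the tests:

sudo make runtest -j12

Finally, build the Python and MATLAB interfaces:

sudo make pycaffe -j12
sudo make matcaffe -j12

Last, add Caffe's Python interface to the Python path:

sudo gedit /etc/profile

Then add: export PYTHONPATH=/home/zya/caffe/python${PYTHONPATH:+:${PYTHONPATH}}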
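The ${PYTHONPATH:+:${PYTHONPATH}} expansion appends the previous value, preceded by a colon, only when PYTHONPATH is already set and non-empty, which avoids a stray trailing colon. A minimal Python sketch of the same logic follows; the function name prepend_path is made up for illustration and is not part of Caffe.

```python
def prepend_path(new_dir, existing):
    """Mimic NEW${VAR:+:${VAR}}: add ':' + existing only if existing is non-empty."""
    if existing:
        return new_dir + ":" + existing
    return new_dir


# With no previous PYTHONPATH the result is just the Caffe directory.
print(prepend_path("/home/zya/caffe/python", ""))
# With a previous value (hypothetical path), it is kept after a colon.
print(prepend_path("/home/zya/caffe/python", "/opt/tools"))
```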

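Now start python in a terminal and run

import caffe

If this fails, the cause is usually protobuf. The protobuf installed at the start, under General dependencies, with sudo apt-get install libprotobuf-dev protobuf-compiler is a system-wide install, version 2.6.1. The default Python environment here, however, is Anaconda3, which has no protobuf installed, so the import fails.

To fix this, run conda install -c https://conda.anaconda.org/anaconda protobuf to install protobuf into Anaconda3.
If the install fails with a permission error (the whole anaconda3 directory shows a small lock icon), unlock it with sudo chmod -R 777 /home/zya/anaconda3.
After that the install went through; the protobuf version installed was 3.6.0.

With that, import caffe succeeded!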
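Before debugging further, it can help to check which of the relevant modules the active interpreter can actually locate. The following is a rough sketch using only the standard library; the helper name module_available and the list of modules checked are my own choices, not something Caffe provides.

```python
import importlib.util


def module_available(name):
    """Return True if `name` can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package (e.g. "google" for "google.protobuf") ends up here.
        return False


for name in ("caffe", "google.protobuf", "numpy"):
    print(name, "found" if module_available(name) else "MISSING")
```

If caffe is reported missing, the PYTHONPATH setting is the first thing to check; if only google.protobuf is missing, the protobuf install described above is the likely fix.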

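This leaves one hidden problem: the next time you build Caffe, the two protobuf versions conflict, so the build fails with protobuf errors. For now the only option is to uninstall protobuf from Anaconda3 again.

The uninstall commands are:

conda uninstall libprotobuf
conda uninstall protobuf

Note that both libprotobuf and protobuf must be uninstalled.
After the build succeeds, Python may report at run time that it cannot find the protobuf module. In that case, run
conda install -c https://conda.anaconda.org/anaconda protobuf again to put the module back.

If the conflict comes up on a later rebuild, uninstall and reinstall again. Tedious, I know.
In short, the fix for this problem is to uninstall the conflicting protobuf and libprotobuf from the Python environment.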

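To uninstall the system-wide protobuf, use the commands below. Builds are generally more likely to succeed with the system version, so it is usually best to leave it installed; if you do remove it, you will generally need to reinstall it with sudo apt-get install libprotobuf-dev protobuf-compiler.

sudo apt-get remove libprotobuf-dev

sudo apt-get remove protobuf-compiler

Here are a few commands for inspecting protobuf versions.

List the paths where protoc is installed:

whereis protoc

Show which protoc is called by default:

which protoc

Show the version of the default protoc:

protoc --version

Show information about the protobuf package installed through pip (on my machine this shows the version under Anaconda):

pip show protobuf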
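The same check can be scripted, which is handy when comparing the compiler version against the Python package. This is a small sketch that only wraps the which protoc and protoc --version commands shown above; the function name protoc_info is hypothetical.

```python
import shutil
import subprocess


def protoc_info():
    """Return (path, version text) for the first protoc on PATH, or None if absent."""
    path = shutil.which("protoc")
    if path is None:
        return None
    result = subprocess.run([path, "--version"], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return path, result.stdout.strip()


print(protoc_info())
```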

Corrections and discussion are welcome!

Addendum: I have heard that if you hit the protobuf version problem when rebuilding, reinstalling the system protobuf can fix it.
First uninstall protobuf:

sudo apt-get remove libprotobuf-dev

sudo apt-get remove protobuf-compiler

Then reinstall it with sudo apt-get install libprotobuf-dev protobuf-compiler

Give it a try if you are interested.

Next, Caffe ships with a tool for drawing network structure diagrams, ./python/draw_net.py. For example:

~/caffe# python/draw_net.py  examples/mnist/lenet_train_test.prototxt  examples/mnist/lenet_train_test.png

At first you will usually get ModuleNotFoundError: No module named 'pydotplus' or ModuleNotFoundError: No module named 'pydot'.

Fix: conda install pydotplus

Running it again then fails with

pydotplus.graphviz.InvocationException: GraphViz's executables not found

Fix:

conda install graphviz

After that the diagram is drawn successfully!

Finally, install pydot as well while you are at it:

conda install pydot

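Running the MNIST example with Caffe

Caffe officially provides a set of examples to learn from; see caffe/examples.

The MNIST LeNet walkthrough here follows the official tutorial.

In the provided examples, Caffe keeps raw data under ./data, and the processed data, model files, and so on under ./examples. The MNIST dataset used here lives in ./data/mnist, and the corresponding model and configuration files are in ./examples/mnist.

Preparing the dataset

First change to Caffe's root directory ($CAFFE_ROOT):

cd ~/caffe

Download the MNIST dataset:

# run the get_mnist.sh script
./data/mnist/get_mnist.sh

Let's look at what this script does (gedit get_mnist.sh):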

#!/usr/bin/env sh
# This scripts downloads the mnist data and unzips it.

DIR="$( cd "$(dirname "$0")" ; pwd -P )"
cd "$DIR"

echo "Downloading..."

for fname in train-images-idx3-ubyte train-labels-idx1-ubyte t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
    if [ ! -e $fname ]; then
        wget --no-check-certificate http://yann.lecun.com/exdb/mnist/$fname.gz
        gunzip $fname.gz
    fi
done

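As you can see, the shell script downloads four files in turn from http://yann.lecun.com/exdb/mnist/$fname.gz: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte, and t10k-labels-idx1-ubyte.

Wait for the downloads to finish; the script unzips each file as it goes.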
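The unzipped files are in the IDX format: a 4-byte magic number whose third byte is the element type (0x08 for unsigned bytes) and whose fourth byte is the number of dimensions, followed by one big-endian 32-bit size per dimension. A minimal sketch for checking a download is below; the function name read_idx_header is my own, and the sample bytes are synthetic rather than read from disk.

```python
import struct


def read_idx_header(raw):
    """Parse an IDX header from its leading bytes; return (dtype_code, dims)."""
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0:
        raise ValueError("not an IDX file")
    dims = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    return dtype_code, dims


# Synthetic header shaped like train-images-idx3-ubyte (magic 2051, 60000 x 28 x 28).
sample = struct.pack(">IIII", 2051, 60000, 28, 28)
print(read_idx_header(sample))  # (8, (60000, 28, 28))
```

To check a real file, pass the first 16 bytes, for example the result of open("data/mnist/train-images-idx3-ubyte", "rb").read(16).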

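Caffe does not read this raw format directly; the data has to be converted to LMDB.

Use the create_mnist.sh script to convert the data:

./examples/mnist/create_mnist.sh

Again, let's look at what this script does: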

#!/usr/bin/env sh
# This script converts the mnist data into lmdb/leveldb format,
# depending on the value assigned to $BACKEND.
set -e

EXAMPLE=examples/mnist
DATA=data/mnist
BUILD=build/examples/mnist

BACKEND="lmdb"

echo "Creating $BACKEND..."

rm -rf $EXAMPLE/mnist_train_$BACKEND
rm -rf $EXAMPLE/mnist_test_$BACKEND

$BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
  $DATA/train-labels-idx1-ubyte $EXAMPLE/mnist_train_$BACKEND --backend=$BACKEND
$BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
  $DATA/t10k-labels-idx1-ubyte $EXAMPLE/mnist_test_$BACKEND --backend=$BACKEND

echo "Done."

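As you can see, the conversion is done by the ./build/examples/mnist/convert_mnist_data.bin tool; I won't go into its details here.

The dataset is now ready, stored under ./examples/mnist/ as mnist_train_lmdb and mnist_test_lmdb.

The LeNet model

Caffe model files end in .prototxt. The LeNet definition that ships with Caffe is ./examples/mnist/lenet_train_test.prototxt. Let's open it and take a look: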

Data input layers:

name: "LeNet"
layer {
  name: "mnist"     # this layer is named mnist
  type: "Data"      # layer type
  top: "data"       # top names an output blob; this layer outputs two blobs
  top: "label"
  include {
    phase: TRAIN    # used only in the training phase
  }
  transform_param {
    scale: 0.00390625   # pixel scaling (1/256 = 0.00390625)
  }
  data_param {
    source: "examples/mnist/mnist_train_lmdb"  # data source path
    batch_size: 64  # batch size
    backend: LMDB   # database type
  }
}
layer {
  name: "mnist"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST     # loaded only when testing
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "examples/mnist/mnist_test_lmdb"
    batch_size: 100
    backend: LMDB
  }
}

The data layers are straightforward: in both the TRAIN and TEST phases they read the data and output data and label.
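The scale value in transform_param is exactly 1/256, so 8-bit pixel values end up in the range [0, 1). A quick arithmetic check, using three sample pixel values I picked for illustration:

```python
SCALE = 0.00390625  # transform_param.scale from the data layers

assert SCALE == 1 / 256  # exact in binary floating point

pixels = [0, 128, 255]
scaled = [p * SCALE for p in pixels]
print(scaled)  # [0.0, 0.5, 0.99609375]
```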

Next come the model's convolution and pooling layers:

layer {
  name: "conv1"
  type: "Convolution"   # convolution layer
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1  # learning-rate multiplier for the weights
  }
  param {
    lr_mult: 2  # learning-rate multiplier for the bias; 2 tends to help convergence
  }
  convolution_param {
    num_output: 20      # number of output feature maps, i.e. the number of filters
    kernel_size: 5      # filter size
    stride: 1           # stride
    weight_filler {
      type: "xavier"    # weight initialization
    }
    bias_filler {
      type: "constant"  # bias initialization; constant fills with 0 by default
    }
  }
}
layer {
  name: "pool1"
  type: "Pooling"   # pooling layer
  bottom: "conv1"
  top: "pool1"
  pooling_param {
    pool: MAX       # max pooling
    kernel_size: 2
    stride: 2
  }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 50
    kernel_size: 5
    stride: 1
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "conv2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 2
    stride: 2
  }
}
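
To see what these four layers do to the spatial size of a 28x28 MNIST image, the output sizes can be worked out by hand. The sketch below uses the usual output-size formulas (floor for convolution, ceiling for pooling, which is how I understand Caffe to compute them); the helper names are my own, and padding is assumed to be 0 because the prototxt does not set it.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution (rounds down)."""
    return (size + 2 * pad - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Spatial output size of pooling (rounds up)."""
    return -(-(size - kernel) // stride) + 1


size = 28                                  # MNIST images are 28x28
size = conv_out(size, kernel=5)            # conv1 -> 24
size = pool_out(size, kernel=2, stride=2)  # pool1 -> 12
size = conv_out(size, kernel=5)            # conv2 -> 8
size = pool_out(size, kernel=2, stride=2)  # pool2 -> 4
print(size, 50 * size * size)              # 4 800
```

So pool2 produces 50 feature maps of 4x4, that is, 800 values per image.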

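With the feature-extracting convolution layers covered, let's look at the fully connected (FC) layers used for classification: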

layer {
  name: "ip1"
  type: "InnerProduct"      # fully connected layer
  bottom: "pool2"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 500
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"      # activation function
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
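
As a sanity check on the layer sizes, the number of learnable parameters can be counted from the num_output and kernel_size values. This is a back-of-the-envelope sketch; it assumes a 1-channel 28x28 input, so that pool2 outputs 50 maps of 4x4, and the helper names are invented for illustration.

```python
def conv_params(out_ch, in_ch, kernel):
    """Weights plus one bias per output channel."""
    return out_ch * in_ch * kernel * kernel + out_ch


def fc_params(out_dim, in_dim):
    """Weights plus one bias per output unit."""
    return out_dim * in_dim + out_dim


counts = {
    "conv1": conv_params(20, 1, 5),
    "conv2": conv_params(50, 20, 5),
    "ip1": fc_params(500, 50 * 4 * 4),
    "ip2": fc_params(10, 500),
}
print(counts)
print(sum(counts.values()))  # 431080
```

Almost all of the parameters sit in ip1, the first fully connected layer.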

The FC layers output the classification scores; what remains is computing accuracy and loss:

layer {
  name: "accuracy"
  type: "Accuracy"      # reports accuracy
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"   # softmax followed by the multinomial logistic loss
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}
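
For a single sample, what these two layers compute can be sketched in a few lines of plain Python. This is only an illustration of the math (softmax followed by negative log-likelihood, and top-1 accuracy), not Caffe's implementation, which also averages over the batch; the function names are my own.

```python
import math


def softmax_loss(scores, label):
    """Negative log-probability of `label` under softmax(scores)."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def is_correct(scores, label):
    """Top-1 accuracy for one sample: is the highest score at the true label?"""
    return scores.index(max(scores)) == label


uniform = [0.0] * 10             # ten equal scores, as from an untrained ip2
print(softmax_loss(uniform, 3))  # ln(10), about 2.3026
```

A loss near ln(10) at the start of training is therefore what an uninformed 10-class classifier gives.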

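Caffe ships with a drawing tool, ./python/draw_net.py, that renders a diagram of the model (using it requires running make pycaffe in the caffe directory first).

Draw the diagram of this model with:

~/caffe# python/draw_net.py  examples/mnist/lenet_train_test.prototxt  examples/mnist/lenet_train_test.png

Additional note: layer rules

A layer definition can include rules that control whether and when the layer is part of the network. The template is:

layer {
  # ... layer definition ...
  include: { phase: TRAIN }
}

This is a layer rule; it controls the layer's inclusion in the network depending on the network's state. See ./src/caffe/proto/caffe.proto for more information.

In the example above, most layers have no rule, and by default such a layer is always part of the network. Note that the accuracy layer is used only in the TEST phase; testing runs every 500 training iterations (test_interval), each time over 100 test batches (test_iter), as set in lenet_solver.prototxt.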
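
The effect of a phase rule can be pictured as a simple filter over the layer list. The sketch below is a deliberately simplified model: real Caffe rules can also match on other parts of the network state, the dictionaries are a hypothetical stand-in for the prototxt, and the two data layers are renamed mnist_train and mnist_test here only to tell them apart.

```python
def layers_for_phase(layers, phase):
    """Keep layers with no include rule, or whose rule names `phase`."""
    return [layer["name"] for layer in layers
            if "include" not in layer or layer["include"] == phase]


# Hypothetical, simplified stand-in for lenet_train_test.prototxt.
layers = [
    {"name": "mnist_train", "include": "TRAIN"},
    {"name": "mnist_test", "include": "TEST"},
    {"name": "conv1"},
    {"name": "accuracy", "include": "TEST"},
    {"name": "loss"},
]
print(layers_for_phase(layers, "TRAIN"))  # ['mnist_train', 'conv1', 'loss']
print(layers_for_phase(layers, "TEST"))   # ['mnist_test', 'conv1', 'accuracy', 'loss']
```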

The solver

With the model structure defined above, the next step is to set the training parameters.

See ./examples/mnist/lenet_solver.prototxt:

# The train/test net protocol buffer definition
# (the model structure used for both training and testing)
net: "examples/mnist/lenet_train_test.prototxt"

# test_iter specifies how many forward passes the test should carry out.
# In the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10,000 testing images.
test_iter: 100

# Carry out testing every 500 training iterations.
# (accuracy is computed once every 500 iterations)
test_interval: 500

# The base learning rate, momentum and the weight decay of the network.
# (learning-rate settings)
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# The learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75

# Display every 100 iterations
# (print status every 100 iterations)
display: 100

# The maximum number of iterations
max_iter: 10000

# snapshot intermediate results
# (save a snapshot every 5000 iterations)
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"

# solver mode: CPU or GPU
solver_mode: GPU
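
With lr_policy set to "inv", the learning rate decays as base_lr * (1 + gamma * iter) ^ (-power). A small sketch that evaluates this formula with the values from the solver file; the function name inv_lr is my own.

```python
def inv_lr(iteration, base_lr=0.01, gamma=0.0001, power=0.75):
    """Learning rate under the "inv" policy at a given iteration."""
    return base_lr * (1 + gamma * iteration) ** (-power)


for it in (0, 5000, 10000):
    print(it, round(inv_lr(it), 6))

# test_iter (100) times the test batch_size (100) covers the 10,000 test images.
print(100 * 100)
```

The rate starts at 0.01 and has dropped to roughly 0.0059 by the final iteration.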

Training the model

Caffe provides a training script at ./examples/mnist/train_lenet.sh. Here is what it contains:

#!/usr/bin/env sh
set -e

./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt $@

As you can see, it calls ./build/tools/caffe train and passes the solver file with --solver=examples/mnist/lenet_solver.prototxt.

Running it prints training information:

I1213 17:37:21.999351 30925 layer_factory.hpp:77] Creating layer mnist
I1213 17:37:21.999413 30925 db_lmdb.cpp:35] Opened lmdb examples/mnist/mnist_train_lmdb
I1213 17:37:21.999428 30925 net.cpp:84] Creating Layer mnist
I1213 17:37:21.999433 30925 net.cpp:380] mnist -> data
I1213 17:37:21.999445 30925 net.cpp:380] mnist -> label
I1213 17:37:22.000012 30925 data_layer.cpp:45] output data size: 64,1,28,28
I1213 17:37:22.000969 30925 net.cpp:122] Setting up mnist
I1213 17:37:22.000979 30925 net.cpp:129] Top shape: 64 1 28 28 (50176)
I1213 17:37:22.000982 30925 net.cpp:129] Top shape: 64 (64)
...
I1213 17:37:29.454346 30925 solver.cpp:447] Snapshotting to binary proto file examples/mnist/lenet_iter_5000.caffemodel
I1213 17:37:29.459178 30925 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_5000.solverstate
I1213 17:37:29.460712 30925 solver.cpp:330] Iteration 5000, Testing net (#0)
I1213 17:37:29.512395 30934 data_layer.cpp:73] Restarting data prefetching from start.
I1213 17:37:29.513818 30925 solver.cpp:397]     Test net output #0: accuracy = 0.9882
...
I1213 17:37:36.706809 30925 solver.cpp:447] Snapshotting to binary proto file examples/mnist/lenet_iter_10000.caffemodel
I1213 17:37:36.710286 30925 sgd_solver.cpp:273] Snapshotting solver state to binary proto file examples/mnist/lenet_iter_10000.solverstate
I1213 17:37:36.712179 30925 solver.cpp:310] Iteration 10000, loss = 0.00240246
I1213 17:37:36.712193 30925 solver.cpp:330] Iteration 10000, Testing net (#0)
I1213 17:37:36.765053 30934 data_layer.cpp:73] Restarting data prefetching from start.
I1213 17:37:36.766742 30925 solver.cpp:397]     Test net output #0: accuracy = 0.9913
I1213 17:37:36.766758 30925 solver.cpp:397]     Test net output #1: loss = 0.0275297 (* 1 = 0.0275297 loss)
I1213 17:37:36.766762 30925 solver.cpp:315] Optimization Done.
I1213 17:37:36.766764 30925 caffe.cpp:259] Optimization Done.

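At each display and test interval, the log reports training information such as the learning rate, loss, and accuracy. Because snapshot is set to 5000, a snapshot of the model and solver state is also saved every 5000 iterations.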
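
If you save this output to a file, the test accuracy per iteration can be pulled out with a couple of regular expressions. The sketch below assumes the log lines look like the ones shown above; the function name accuracies and the regex patterns are my own, and the sample lines are copied from the log in this post.

```python
import re

ITERATION = re.compile(r"Iteration (\d+), Testing net")
TEST_ACC = re.compile(r"Test net output #\d+: accuracy = ([0-9.]+)")


def accuracies(lines):
    """Pair each 'Iteration N, Testing net' line with the accuracy reported after it."""
    results, current = [], None
    for line in lines:
        m = ITERATION.search(line)
        if m:
            current = int(m.group(1))
            continue
        m = TEST_ACC.search(line)
        if m and current is not None:
            results.append((current, float(m.group(1))))
    return results


log = [
    "I1213 17:37:29.460712 30925 solver.cpp:330] Iteration 5000, Testing net (#0)",
    "I1213 17:37:29.513818 30925 solver.cpp:397]     Test net output #0: accuracy = 0.9882",
    "I1213 17:37:36.712193 30925 solver.cpp:330] Iteration 10000, Testing net (#0)",
    "I1213 17:37:36.766742 30925 solver.cpp:397]     Test net output #0: accuracy = 0.9913",
]
print(accuracies(log))  # [(5000, 0.9882), (10000, 0.9913)]
```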
