Implementing GPU Decoding for Video Streams: Basic Concepts
Posted by 风翼科技
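I have recently been implementing GPU-based decoding of video streams and ran into many problems along the way.
I received guidance from Cai Ding (蔡鼎), a video processing expert at Alibaba, and Ji Guang (季光), a developer at NVIDIA. My thanks to both.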
Basic commands (on Linux)
1. List the physical graphics cards
    lspci | grep -i vga

    root@g1060server:/home/user# lspci | grep -i vga
    09:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
    81:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
    82:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
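2. List the NVIDIA cards directly
Sometimes, because of incompatibilities involving the server model or the GPU model, the motherboard fails to recognize a card that has been plugged in.
The following command shows whether the motherboard has detected the cards:
    root@g1060server:/home/user# lspci | grep -i nvidia
    81:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
    81:00.1 Audio device: NVIDIA Corporation Device 10f1 (rev a1)
    82:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
    82:00.1 Audio device: NVIDIA Corporation Device 10f1 (rev a1)
Output like the above means the motherboard has detected the cards.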
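If this check needs to be automated, for example in a startup script for the decoding service, the lspci output can be scanned programmatically. The following is a minimal sketch in standard C++; the function name countNvidiaVgaControllers is hypothetical, and the caller is assumed to have captured the text of `lspci` already.

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>

// Lower-case a copy of the string so matching is case-insensitive, like grep -i.
static std::string toLower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Count the lines of lspci output that describe an NVIDIA VGA controller.
// Hypothetical helper for illustration; it is not part of any NVIDIA or pciutils API.
int countNvidiaVgaControllers(const std::string& lspciOutput) {
    std::istringstream in(lspciOutput);
    std::string line;
    int count = 0;
    while (std::getline(in, line)) {
        const std::string lower = toLower(line);
        if (lower.find("vga compatible controller") != std::string::npos &&
            lower.find("nvidia") != std::string::npos) {
            ++count;
        }
    }
    return count;
}
```

For the output shown above, the two "Audio device" lines are skipped and only the two VGA controller lines are counted. A result of zero on a machine with cards installed points to the detection problem described in this step.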
CUDA version and driver information
nvcc -V reports the version of the CUDA compiler found on the PATH:
    root@g1060server:/home/user# nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2013 NVIDIA Corporation
    Built on Wed_Jul_17_18:36:13_PDT_2013
    Cuda compilation tools, release 5.5, V5.5.0
NVIDIA GPU runtime status
    root@g1060server:/home/user# nvidia-smi
    modprobe: ERROR: could not insert 'nvidia_340': No such device
    NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
The query failed, which usually means the driver is not installed.
    user@g1060server:~$ nvidia-smi
    Fri Jan  5 21:50:34 2018
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 106...  Off  | 00000000:81:00.0  On |                  N/A |
    | 32%   35C    P8    10W / 120W |   3083MiB /  6071MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 106...  Off  | 00000000:82:00.0 Off |                  N/A |
    | 32%   37C    P8    10W / 120W |   2542MiB /  6072MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
The query succeeded.
Check whether the CUDA driver is installed correctly
    root@g1060server:/home/user# cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
    root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# ls
    deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt
    root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# make
    make: Nothing to be done for `all'.
    root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# ./deviceQuery
    ./deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    cudaGetDeviceCount returned 35
    -> CUDA driver version is insufficient for CUDA runtime version
    Result = FAIL
This confirms once more that the CUDA driver installation failed.
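Error 35 is cudaErrorInsufficientDriver: the installed driver supports an older CUDA version than the runtime the sample was built against. The CUDA runtime calls cudaDriverGetVersion and cudaRuntimeGetVersion report both versions as integers encoded as 1000 * major + 10 * minor. The sketch below shows only the comparison logic in plain C++ so that it compiles without the CUDA toolkit; the helper names are hypothetical, and the version numbers would come from those two calls in a real program.

```cpp
#include <string>

// CUDA encodes versions as 1000 * major + 10 * minor, e.g. 8.0 -> 8000.
constexpr int encodeCudaVersion(int major, int minor) {
    return 1000 * major + 10 * minor;
}

// Turn an encoded version back into "major.minor" for log messages.
std::string formatCudaVersion(int encoded) {
    const int major = encoded / 1000;
    const int minor = (encoded % 1000) / 10;
    return std::to_string(major) + "." + std::to_string(minor);
}

// The runtime can only work if the driver supports at least the same CUDA version.
// Hypothetical helper: in a real program the two arguments would be filled in by
// cudaDriverGetVersion() and cudaRuntimeGetVersion().
bool driverSupportsRuntime(int driverVersion, int runtimeVersion) {
    return driverVersion >= runtimeVersion;
}
```

With a driver that reports 9.0 and a runtime of 8.0 the check passes; with a driver older than the runtime it fails, which is the situation deviceQuery reports as error 35.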
Check whether CUDA is installed correctly
    /usr/local/cuda/extras/demo_suite/deviceQuery

    root@g1060server:/home/user/mjl/test# /usr/local/cuda/extras/demo_suite/deviceQuery
    /usr/local/cuda/extras/demo_suite/deviceQuery Starting...

     CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 2 CUDA Capable device(s)

    Device 0: "GeForce GTX 1060 6GB"
      CUDA Driver Version / Runtime Version          9.0 / 8.0
      CUDA Capability Major/Minor version number:    6.1
      Total amount of global memory:                 6071 MBytes (6366363648 bytes)
      (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
      GPU Max Clock rate:                            1709 MHz (1.71 GHz)
      Memory Clock rate:                             4004 Mhz
      Memory Bus Width:                              192-bit
      L2 Cache Size:                                 1572864 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 1: "GeForce GTX 1060 6GB"
      CUDA Driver Version / Runtime Version          9.0 / 8.0
      CUDA Capability Major/Minor version number:    6.1
      Total amount of global memory:                 6073 MBytes (6367739904 bytes)
      (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
      GPU Max Clock rate:                            1709 MHz (1.71 GHz)
      Memory Clock rate:                             4004 Mhz
      Memory Bus Width:                              192-bit
      L2 Cache Size:                                 1572864 bytes
      Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
      Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
      Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
      Total amount of constant memory:               65536 bytes
      Total amount of shared memory per block:       49152 bytes
      Total number of registers available per block: 65536
      Warp size:                                     32
      Maximum number of threads per multiprocessor:  2048
      Maximum number of threads per block:           1024
      Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
      Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
      Maximum memory pitch:                          2147483647 bytes
      Texture alignment:                             512 bytes
      Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
      Run time limit on kernels:                     No
      Integrated GPU sharing Host Memory:            No
      Support host page-locked memory mapping:       Yes
      Alignment requirement for Surfaces:            Yes
      Device has ECC support:                        Disabled
      Device supports Unified Addressing (UVA):      Yes
      Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
      Compute Mode:
         < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from GeForce GTX 1060 6GB (GPU0) -> GeForce GTX 1060 6GB (GPU1) : Yes
    > Peer access from GeForce GTX 1060 6GB (GPU1) -> GeForce GTX 1060 6GB (GPU0) : Yes

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = GeForce GTX 1060 6GB, Device1 = GeForce GTX 1060 6GB
    Result = PASS
The query succeeded.
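Main workflow
To move FFmpeg decoding onto the GPU, you first need a basic understanding of FFmpeg's decoding flow, because GPU decoding follows the same flow; only part of the work in the middle is done on the GPU.
I previously posted the CPU decoding flow at http://www.cnblogs.com/baldermurphy/p/7828337.html
The main steps are as follows:
    avformat_network_init();
    av_register_all();  // 1. Register the codec modules; from version 3.3 on, these include the GPU decoding modules
    std::string tempfile = "xxxx";  // address of the video stream
    // Open the stream; this call is implied by the outline and must precede step 2
    avformat_open_input(&format_context_, tempfile.c_str(), nullptr, nullptr);
    avformat_find_stream_info(format_context_, nullptr);  // 2. Pull a short piece of the stream and analyze it to learn the basic format
    if (AVMEDIA_TYPE_VIDEO == enc->codec_type && video_stream_index_ < 0)  // 3. Pick out the video stream
    codec_ = avcodec_find_decoder(enc->codec_id);  // 4. Find the matching decoder
    codec_context_ = avcodec_alloc_context3(codec_);  // 5. Allocate the context structure for the decoder
    av_read_frame(format_context_, &packet_);  // 6. Read a data packet
    avcodec_send_packet(codec_context_, &packet_);  // 7. Send the packet to the decoder
    avcodec_receive_frame(codec_context_, yuv_frame_);  // 8. Receive the decoded frame
    sws_scale(y2r_sws_context_, yuv_frame_->data, yuv_frame_->linesize, 0, codec_context_->height, rgb_data_, rgb_line_size_);  // 9. Convert the pixel format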
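GPU decoding requires changes to steps 4, 7, 8 and 9, namely:
step 4: find the GPU decoder,
step 7: feed the packets to the GPU decoder,
step 8: retrieve the decoded data,
step 9: convert the pixel format on the GPU (if needed, for example NV12 to BGRA).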
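For step 4, one common route in FFmpeg builds configured with NVIDIA support is to look the decoder up by name with avcodec_find_decoder_by_name instead of by codec ID. The sketch below assumes the cuvid decoder family (h264_cuvid, hevc_cuvid and so on); whether this is the decoder used in this project is an assumption, and whether a given FFmpeg build contains these decoders depends on how it was configured. The helper cuvidDecoderName is hypothetical and only maps names, so it compiles without FFmpeg.

```cpp
#include <map>
#include <string>

// Map an FFmpeg codec short name (as returned by avcodec_get_name) to the
// name of the corresponding cuvid decoder. Returns an empty string when the
// codec is not in this small table.
// Hypothetical helper for illustration; the table is deliberately incomplete.
std::string cuvidDecoderName(const std::string& codecName) {
    static const std::map<std::string, std::string> table = {
        {"h264", "h264_cuvid"},
        {"hevc", "hevc_cuvid"},
        {"vp8", "vp8_cuvid"},
        {"vp9", "vp9_cuvid"},
    };
    const auto it = table.find(codecName);
    return it == table.end() ? std::string() : it->second;
}

// In a real program the result would be used roughly like this:
//   codec_ = avcodec_find_decoder_by_name(name.c_str());
//   if (!codec_) codec_ = avcodec_find_decoder(enc->codec_id);  // CPU fallback
```

Falling back to the CPU decoder when the lookup returns nothing keeps the program working on machines without a usable GPU decoder.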
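The final format depends on the specific requirement. For OpenGL interop, for example, frames are converted to the required format (BGRA) and share the same memory, so the data is refreshed immediately without even a copy.
Converting frames to image files is a different requirement again (BGR).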
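To make the NV12 to BGRA conversion concrete, here is a CPU reference version of the per-pixel arithmetic. It is a sketch under stated assumptions: BT.601 limited-range coefficients, even width and height, and a tightly packed buffer with no row padding. A GPU version would run the same arithmetic with one thread per pixel; the function name nv12ToBgra is hypothetical and this is not the converter used in the original project.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

static std::uint8_t clampToByte(int v) {
    return static_cast<std::uint8_t>(std::min(255, std::max(0, v)));
}

// Convert a tightly packed NV12 frame (Y plane followed by an interleaved
// half-resolution UV plane) to BGRA with alpha fixed at 255.
// Assumes BT.601 limited range and even width/height.
std::vector<std::uint8_t> nv12ToBgra(const std::vector<std::uint8_t>& nv12,
                                     int width, int height) {
    if (width <= 0 || height <= 0 || width % 2 != 0 || height % 2 != 0 ||
        nv12.size() != static_cast<std::size_t>(width) * height * 3 / 2) {
        throw std::invalid_argument("unexpected NV12 frame dimensions");
    }
    const std::uint8_t* yPlane = nv12.data();
    const std::uint8_t* uvPlane = yPlane + static_cast<std::size_t>(width) * height;
    std::vector<std::uint8_t> bgra(static_cast<std::size_t>(width) * height * 4);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Each 2x2 block of pixels shares one (U, V) pair.
            const std::size_t uvIndex =
                static_cast<std::size_t>(y / 2) * width + (x / 2) * 2;
            const int c = yPlane[static_cast<std::size_t>(y) * width + x] - 16;
            const int d = uvPlane[uvIndex] - 128;      // U
            const int e = uvPlane[uvIndex + 1] - 128;  // V

            std::uint8_t* out = &bgra[(static_cast<std::size_t>(y) * width + x) * 4];
            out[0] = clampToByte((298 * c + 516 * d + 128) / 256);            // B
            out[1] = clampToByte((298 * c - 100 * d - 208 * e + 128) / 256);  // G
            out[2] = clampToByte((298 * c + 409 * e + 128) / 256);            // R
            out[3] = 255;                                                     // A
        }
    }
    return bgra;
}
```

Producing BGR for image output would use the same arithmetic with three bytes per pixel instead of four.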
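Matching the use case
It has to be said that GPU computing is a valuable capability, but it must fit the requirement and the scenario. Ignore that and the cost may outweigh the benefit.
To take an extreme example, OpenCV also implements image decoding on the GPU, but it is not used when efficiency is the goal,
because a single image has to be uploaded to the GPU (not parallel, slow), decoded (very fast), and then copied from GPU memory back to host memory (not parallel, slow).
The upload and the copy back alone take several hundred milliseconds, and image operations are very frequent, so CPU usage is not relieved much and sometimes even goes up instead of down. Decoding itself is fast, but what the user experiences is slow, and both the CPU and the GPU are kept busy.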
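The trade-off above comes down to simple arithmetic: the GPU path only pays off when upload, decode and copy-back together take less time than decoding on the CPU. A minimal sketch of that comparison follows; the struct and function names are hypothetical, and any timings fed into it should be measured on the target machine.

```cpp
// Per-item timings for the GPU path, in milliseconds.
struct GpuPathCost {
    double uploadMs;    // host memory -> GPU memory
    double decodeMs;    // decode on the GPU
    double downloadMs;  // GPU memory -> host memory (0 if the result stays on the GPU)
};

// Total wall-clock time of the GPU path when the three stages run back to back.
double totalGpuMs(const GpuPathCost& cost) {
    return cost.uploadMs + cost.decodeMs + cost.downloadMs;
}

// The GPU path is worthwhile only if its total time beats CPU decoding.
bool gpuPathIsWorthwhile(const GpuPathCost& gpu, double cpuDecodeMs) {
    return totalGpuMs(gpu) < cpuDecodeMs;
}
```

As an illustration with made-up numbers, a path with 150 ms upload, 2 ms decode and 150 ms copy-back loses to a 40 ms CPU decode even though the decode stage itself is 20 times faster. When the decoded frame stays on the GPU, as in the OpenGL interop case, the copy-back term drops to zero and the balance shifts.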
Key websites
The SDK NVIDIA recommends for GPU decoding with FFmpeg
https://developer.nvidia.com/nvidia-video-codec-sdk
Tool for detecting GPU memory leaks
http://docs.nvidia.com/cuda/cuda-memcheck/index.html#device-side-allocation-checking