[GPU] Nvidia CUDA Intermediate Programming Tutorial: Overlapping Data Copies and Compute on Multiple GPUs
Posted by 从善若水
The author has not authorized any person or organization to repost any of my original articles. Thank you all for supporting original content!
Blogger's homepage link
I work at an internationally known device manufacturer, where I am responsible for modem chip R&D.
In the early days of 5G I worked on the device-side data service layer and core-network related development; I currently lead research on 6G computing-power network technical standards.
This blog mainly covers:
5G/6G protocol explanations
Computing-power networks (cloud computing, edge computing, on-device computing)
Advanced C programming
The Rust language
Overlapping Data Copies and Compute on Multiple GPUs
Multiple GPUs
CUDA can manage multiple GPU devices concurrently on a single host.
Getting information about multiple GPUs
To programmatically determine the number of available GPUs, use cudaGetDeviceCount:
int num_gpus; // cudaGetDeviceCount expects a pointer to int.
cudaGetDeviceCount(&num_gpus);
To programmatically determine which GPU is currently active, use cudaGetDevice:
int device; // cudaGetDevice also expects a pointer to int.
cudaGetDevice(&device); // `device` is now a 0-based index of the current GPU.
Setting the current GPU
For each host thread, only one GPU device is active at a time. To make a specific GPU active, use cudaSetDevice with the desired GPU's 0-based index:
cudaSetDevice(0);
Looping over the available GPUs
A common pattern is to loop over the available GPUs and perform some operation on each of them:
int num_gpus;
cudaGetDeviceCount(&num_gpus);

for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    cudaSetDevice(gpu);
    // Perform operations for this GPU.
}
Chunking data for multiple GPUs
Just as with multiple non-default streams, each of the multiple GPUs can work on its own chunk of the data. We create and use an array of data pointers, allocating device memory on each available GPU:
int num_gpus;
cudaGetDeviceCount(&num_gpus);

const uint64_t num_entries = 1UL << 26;
const uint64_t chunk_size = sdiv(num_entries, num_gpus); // Round-up division.

uint64_t *data_gpu[num_gpus]; // One pointer for each GPU.

for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    cudaSetDevice(gpu);

    const uint64_t lower = chunk_size*gpu;
    const uint64_t upper = min(lower+chunk_size, num_entries);
    const uint64_t width = upper-lower;

    cudaMalloc(&data_gpu[gpu], sizeof(uint64_t)*width); // Allocate chunk of data for current GPU.
}
Copying data to and from multiple GPUs
Using the same loop-and-chunk technique, we can copy data to and from each of the multiple GPUs:
// ...Assume data has been allocated on host and for each GPU.

for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    cudaSetDevice(gpu);

    const uint64_t lower = chunk_size*gpu;
    const uint64_t upper = min(lower+chunk_size, num_entries);
    const uint64_t width = upper-lower;

    // Note use of `cudaMemcpy` and not `cudaMemcpyAsync` since we are not
    // presently using non-default streams.
    cudaMemcpy(data_gpu[gpu], data_cpu+lower,
               sizeof(uint64_t)*width, cudaMemcpyHostToDevice); // ...or cudaMemcpyDeviceToHost.
}
Launching kernels on multiple GPUs
Using the same loop-and-chunk technique, we can launch a kernel on each of the multiple GPUs to process its chunk of the data:
// ...Assume data has been allocated on host and for each GPU.

for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    cudaSetDevice(gpu);

    const uint64_t lower = chunk_size*gpu;
    const uint64_t upper = min(lower+chunk_size, num_entries);
    const uint64_t width = upper-lower;

    kernel<<<grid, block>>>(data_gpu[gpu], width); // Pass chunk of data for current GPU to work on.
}
Exercise: using multiple GPUs
The original code is as follows:
#include <cstdint>
#include <iostream>
#include "helpers.cuh"
#include "encryption.cuh"

void encrypt_cpu(uint64_t * data, uint64_t num_entries,
                 uint64_t num_iters, bool parallel=true) {

    #pragma omp parallel for if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        data[entry] = permute64(entry, num_iters);
}

__global__
void decrypt_gpu(uint64_t * data, uint64_t num_entries,
                 uint64_t num_iters) {

    const uint64_t thrdID = blockIdx.x*blockDim.x+threadIdx.x;
    const uint64_t stride = blockDim.x*gridDim.x;

    for (uint64_t entry = thrdID; entry < num_entries; entry += stride)
        data[entry] = unpermute64(data[entry], num_iters);
}

bool check_result_cpu(uint64_t * data, uint64_t num_entries,
                      bool parallel=true) {

    uint64_t counter = 0;

    #pragma omp parallel for reduction(+: counter) if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        counter += data[entry] == entry;

    return counter == num_entries;
}

int main (int argc, char * argv[]) {

    Timer timer;
    Timer overall;

    const uint64_t num_entries = 1UL << 26;
    const uint64_t num_iters = 1UL << 10;
    const bool openmp = true;

    timer.start();
    uint64_t * data_cpu, * data_gpu;
    cudaMallocHost(&data_cpu, sizeof(uint64_t)*num_entries);
    cudaMalloc    (&data_gpu, sizeof(uint64_t)*num_entries);
    timer.stop("allocate memory");
    check_last_error();

    timer.start();
    encrypt_cpu(data_cpu, num_entries, num_iters, openmp);
    timer.stop("encrypt data on CPU");

    overall.start();
    timer.start();
    cudaMemcpy(data_gpu, data_cpu,
               sizeof(uint64_t)*num_entries, cudaMemcpyHostToDevice);
    timer.stop("copy data from CPU to GPU");
    check_last_error();

    timer.start();
    decrypt_gpu<<<80*32, 64>>>(data_gpu, num_entries, num_iters);
    timer.stop("decrypt data on GPU");
    check_last_error();

    timer.start();
    cudaMemcpy(data_cpu, data_gpu,
               sizeof(uint64_t)*num_entries, cudaMemcpyDeviceToHost);
    timer.stop("copy data from GPU to CPU");
    overall.stop("total time on GPU");
    check_last_error();

    timer.start();
    const bool success = check_result_cpu(data_cpu, num_entries, openmp);
    std::cout << "STATUS: test "
              << ( success ? "passed" : "failed")
              << std::endl;
    timer.stop("checking result on CPU");

    timer.start();
    cudaFreeHost(data_cpu);
    cudaFree    (data_gpu);
    timer.stop("free memory");
    check_last_error();
}
The optimized code is as follows:
#include <cstdint>
#include <iostream>
#include "helpers.cuh"
#include "encryption.cuh"

void encrypt_cpu(uint64_t * data, uint64_t num_entries,
                 uint64_t num_iters, bool parallel=true) {

    #pragma omp parallel for if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        data[entry] = permute64(entry, num_iters);
}

__global__
void decrypt_gpu(uint64_t * data, uint64_t num_entries,
                 uint64_t num_iters) {

    const uint64_t thrdID = blockIdx.x*blockDim.x+threadIdx.x;
    const uint64_t stride = blockDim.x*gridDim.x;

    for (uint64_t entry = thrdID; entry < num_entries; entry += stride)
        data[entry] = unpermute64(data[entry], num_iters);
}

bool check_result_cpu(uint64_t * data, uint64_t num_entries,
                      bool parallel=true) {

    uint64_t counter = 0;

    #pragma omp parallel for reduction(+: counter) if (parallel)
    for (uint64_t entry = 0; entry < num_entries; entry++)
        counter += data[entry] == entry;

    return counter == num_entries;
}

int main (int argc, char * argv[]) {

    Timer timer;
    Timer overall;

    const uint64_t num_entries = 1UL << 26;
    const uint64_t num_iters = 1UL << 10;
    const bool openmp = true;

    // Set number of available GPUs.
    const uint64_t num_gpus = 4;
    // Get chunk size using round up division.
    const uint64_t chunk_size = sdiv(num_entries, num_gpus);

    timer.start();
    // Use array of pointers for multiple GPU memory.
    uint64_t * data_cpu, * data_gpu[num_gpus];
    cudaMallocHost(&data_cpu, sizeof(uint64_t)*num_entries);
    // For each GPU...
    for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
        // ...set GPU as active...
        cudaSetDevice(gpu);
        // ...get width of this GPU's data chunk...
        const uint64_t lower = chunk_size*gpu;
        const uint64_t upper = min(lower+chunk_size, num_entries);
        const uint64_t width = upper-lower;
        // ...allocate data for this GPU.
        cudaMalloc(&data_gpu[gpu], sizeof(uint64_t)*width);
    }
    timer.stop("allocate memory");
    check_last_error();

    timer.start();
    encrypt_cpu(data_cpu, num_entries, num_iters, openmp);
    timer.stop("encrypt data on CPU");

    overall.start();
    timer.start();
    // For each GPU...
    for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
        cudaSetDevice(gpu);
        const uint64_t lower = chunk_size*gpu;
        const uint64_t upper = min(lower+chunk_size, num_entries);
        const uint64_t width = upper-lower;
        // ...copy correct chunk of data to active GPU.
        cudaMemcpy(data_gpu[gpu], data_cpu+lower,
                   sizeof(uint64_t)*width, cudaMemcpyHostToDevice);
    }
    timer.stop("copy data from CPU to GPU");
    check_last_error();

    timer.start();
    // For each GPU...
    for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
        cudaSetDevice(gpu);
        const uint64_t lower = chunk_size*gpu;
        const uint64_t upper = min(lower+chunk_size, num_entries);
        const uint64_t width = upper-lower;
        // ...decrypt its chunk of data.
        decrypt_gpu<<<80*32, 64>>>(data_gpu[gpu], width, num_iters);
    }
    timer.stop("decrypt data on the GPU");
    check_last_error();

    timer.start();
    // For each GPU...
    for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
        cudaSetDevice(gpu);
        const uint64_t lower = chunk_size*gpu;
        const uint64_t upper = min(lower+chunk_size, num_entries);
        const uint64_t width = upper-lower;
        // ...copy its chunk of data back to the host.
        cudaMemcpy(data_cpu+lower, data_gpu[gpu],
                   sizeof(uint64_t)*width, cudaMemcpyDeviceToHost);
    }
    timer.stop("copy data from GPU to CPU");
    overall.stop("total time on GPU");
    check_last_error();

    timer.start();
    const bool success = check_result_cpu(data_cpu, num_entries, openmp);
    std::cout << "STATUS: test "
              << ( success ? "passed" : "failed")
              << std::endl;
    timer.stop("checking result on CPU");

    timer.start();
    cudaFreeHost(data_cpu);
    for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
        cudaSetDevice(gpu);
        cudaFree(data_gpu[gpu]);
    }
    timer.stop("free memory");
    check_last_error();
}
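For reference, an exercise like this is normally built with nvcc with OpenMP enabled for the host-side loops. The architecture flag and file name below are placeholders rather than part of the original exercise, so adjust them for your environment, and make sure helpers.cuh and encryption.cuh are on the include path:
nvcc -arch=sm_70 -O3 -Xcompiler -fopenmp multi_gpu_exercise.cu -o multi_gpu_exercise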
Overlapping data copies and compute on multiple GPUs
Streams and multiple GPUs
Each GPU has its own default stream. We can create, use, and destroy non-default streams for whichever GPU device is currently active. Take care never to launch a kernel into a stream that is not associated with the currently active GPU.
Creating multiple streams for multiple GPUs
When using multiple non-default streams on multiple GPUs, we no longer simply store the streams in a one-dimensional array as before; instead we store them in a two-dimensional array in which each row holds the streams of a single GPU:
cudaStream_t streams[num_gpus][num_streams]; // 2D array: one row of streams for each GPU.

// For each available GPU...
for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    // ...set it as the active device...
    cudaSetDevice(gpu);

    for (uint64_t stream = 0; stream < num_streams; stream++)
        // ...then create and store its streams.
        cudaStreamCreate(&streams[gpu][stream]);
}
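When the streams are no longer needed, they can be released with the same per-GPU loop. This cleanup sketch is not part of the original excerpt; it simply follows the convention above of making the owning GPU active before touching its streams:
// For each available GPU...
for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    // ...set it as the active device...
    cudaSetDevice(gpu);
    for (uint64_t stream = 0; stream < num_streams; stream++)
        // ...and destroy each of its streams.
        cudaStreamDestroy(streams[gpu][stream]);
}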
Chunk sizes for multiple streams on multiple GPUs
When using multiple non-default streams on multiple GPUs, global data indexing becomes especially tricky. To make the indexing easier, we define one chunk size for a single stream and another for an entire GPU, continuing with the robust indexing strategy discussed in 《【GPU】Nvidia CUDA 编程中级教程——数据复制与计算的重叠》:
// Each stream needs num_entries/num_gpus/num_streams entries of data. We use
// round-up division for the reasons previously discussed.
const uint64_t stream_chunk_size = sdiv(sdiv(num_entries, num_gpus), num_streams);

// It will also be helpful to have the chunk size for an entire GPU handy.
const uint64_t gpu_chunk_size = stream_chunk_size*num_streams;
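For a concrete sense of scale: with the exercise's num_entries = 1UL << 26 (67,108,864 entries) and, hypothetically, num_gpus = 4 and num_streams = 32, stream_chunk_size works out to 524,288 entries and gpu_chunk_size to 16,777,216 entries, so the 4 × 32 = 128 stream chunks exactly cover the whole array.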
Allocating memory for multiple streams on multiple GPUs
Device memory is not allocated per stream, so the allocation here looks much like the earlier multi-GPU version; we just have to make sure the chunk size we allocate is the one for an entire GPU rather than for one of its streams:
// For each GPU...
for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    // ...set device as active...
    cudaSetDevice(gpu);

    // ...use a GPU chunk's worth of data to calculate indices and width...
    const uint64_t lower = gpu_chunk_size*gpu;
    const uint64_t upper = min(lower+gpu_chunk_size, num_entries);
    const uint64_t width = upper-lower;

    // ...allocate data.
    cudaMalloc(&data_gpu[gpu], sizeof(uint64_t)*width);
}
Overlapping copies and compute on multiple streams of multiple GPUs
// For each GPU...
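The excerpt ends here with only the loop comment above. As a rough guide, a minimal sketch of what the combined loop could look like, reusing the decrypt_gpu kernel, the chunk sizes, and the streams array defined earlier (this is an illustration of the pattern, not the original solution), is:

for (uint64_t gpu = 0; gpu < num_gpus; gpu++) {
    // ...set it as the active device...
    cudaSetDevice(gpu);
    for (uint64_t stream = 0; stream < num_streams; stream++) {

        // Index math: where this stream's chunk sits inside this GPU's chunk
        // and inside the whole host array.
        const uint64_t stream_offset = stream_chunk_size*stream;
        const uint64_t lower = gpu_chunk_size*gpu + stream_offset;
        const uint64_t upper = min(lower+stream_chunk_size, num_entries);
        const uint64_t width = upper-lower;

        // The host-to-device copy, kernel launch, and device-to-host copy are
        // all issued asynchronously into this GPU's stream, so chunks belonging
        // to different streams and GPUs can overlap.
        cudaMemcpyAsync(data_gpu[gpu]+stream_offset, data_cpu+lower,
                        sizeof(uint64_t)*width, cudaMemcpyHostToDevice,
                        streams[gpu][stream]);

        decrypt_gpu<<<80*32, 64, 0, streams[gpu][stream]>>>
            (data_gpu[gpu]+stream_offset, width, num_iters);

        cudaMemcpyAsync(data_cpu+lower, data_gpu[gpu]+stream_offset,
                        sizeof(uint64_t)*width, cudaMemcpyDeviceToHost,
                        streams[gpu][stream]);
    }
}

Because these calls only enqueue work, the host thread immediately moves on to the next stream and GPU; before the results are checked on the host, each stream (or each device) still has to be synchronized, for example with cudaStreamSynchronize or cudaDeviceSynchronize.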