映射内存和托管内存有啥区别？

Posted 2023-03-23

技术标签:

【中文标题】映射内存和托管内存有啥区别？【英文标题】：What is the difference between mapped memory and managed memory?映射内存和托管内存有什么区别？ 【发布时间】：2021-11-10 08:26:36 【问题描述】：

我一直在阅读 CUDA 提供的各种内存管理方法，但我正在努力理解映射内存之间的区别：

int *foo;
std::size_t size = 32;
cudaHostAlloc(&foo, size, cudaHostAllocMapped);

...和托管内存：

int *foo;
std::size_t size = 32;
cudaMallocManaged(&foo, size);

它们似乎都在主机和设备之间隐式传输内存。 cudaMallocManaged 似乎是较新的 API，它使用所谓的“统一内存”系统。也就是说，cudaHostAlloc 似乎在 64 位系统上共享了许多这些属性，这要归功于统一的虚拟地址空间。

文档中似乎还有一些其他差异，但我不相信没有明确的功能文档会导致我正确理解这两个函数之间的差异（例如，我不相信它是明确的表示cudaMallocManaged的主机内存是页面锁定的，但我怀疑是）。

它们还对应于驱动程序 API 中的不同函数（cuMemHostAlloc 和 cuMemAllocManaged），我认为这很好地表明它们的行为在某些有意义的方面有所不同。

【问题讨论】：

【参考方案1】：

我认为主要区别在于分页/页面错误机制。

固定内存的作用与普通设备内存相同。如果请求一个字节的固定内存，则一个字节将通过 PCIe 总线透明地传输到 GPU。（也许驱动程序合并了连续内存位置的请求，我不知道。）

另一方面，托管内存具有内存页面的访问粒度。如果设备上不存在所请求字节的页面，则不仅将单个字节而且整个页面（在许多系统上为 4096 字节）从其当前位置迁移到 GPU，该位置可以是主机内存，也可以是设备内存另一个 GPU。

以下程序尝试显示不同的行为。分配了 256 MB，相当于 64 * 1024 个大小为 4096 字节的页面。然后，在内核中，每个线程访问每个页面的第一个字节，即每个第 4096 个字节。时间是针对固定内存、托管内存和普通设备内存测量的。

#include <iostream>
#include <cassert>

__global__
void kernel(char* __restrict__ data, int pagesize, int numpages)
    const int tid = threadIdx.x + blockIdx.x * blockDim.x;

    if(tid < numpages)
        data[tid * pagesize] += 1;
    


int main()
    const int pagesize = 4096;
    const int numpages = 1024 * 64;
    const int bytes = pagesize * numpages;
    cudaError_t status = cudaSuccess;
    float elapsed = 0.0f;
    const int iterations = 5;

    char* devicedata; 
    status = cudaMalloc(&devicedata, bytes);
    assert(status == cudaSuccess);

    char* pinneddata; 
    status = cudaMallocHost(&pinneddata, bytes);
    assert(status == cudaSuccess);

    char* manageddata;
    status = cudaMallocManaged(&manageddata, bytes);
    assert(status == cudaSuccess);

    status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
    //status = cudaMemPrefetchAsync(manageddata, bytes, 0);
    assert(status == cudaSuccess);

    cudaEvent_t event1, event2;
    cudaEventCreate(&event1);
    cudaEventCreate(&event2);

    for(int iteration = 0; iteration < iterations; iteration++)
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(pinneddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "pinned: " << elapsed << ", throughput " << bandwith << " GB/s" << "\n";
    

    for(int iteration = 0; iteration < iterations; iteration++)
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(manageddata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);

        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "managed: " << elapsed << ", throughput " << bandwith << " MB/s" << "\n";

        status = cudaMemPrefetchAsync(manageddata, bytes, cudaCpuDeviceId);
        assert(status == cudaSuccess);     
    

    for(int iteration = 0; iteration < iterations; iteration++)
        cudaEventRecord(event1);
        kernel<<<numpages / 256, 256>>>(devicedata, pagesize, numpages);
        cudaEventRecord(event2);
        status = cudaEventSynchronize(event2);
        assert(status == cudaSuccess);
        cudaEventElapsedTime(&elapsed, event1, event2);
        
        float bandwith = (numpages / elapsed) * 1000.0f / 1024.f / 1024.f;
        std::cerr << "device: " << elapsed << ", throughput " << bandwith << " MB/s" << "\n";
    

    cudaFreeHost(pinneddata);
    cudaFree(manageddata);
    cudaFree(devicedata);
    cudaEventDestroy(event1);
    cudaEventDestroy(event2);

当托管内存预取到主机时，会观察以下时间

pinned: 1.4577 ms, throughput 42.8759 MB/s
pinned: 1.4927 ms, throughput 41.8703 MB/s
pinned: 1.44947 ms, throughput 43.1192 MB/s
pinned: 1.44371 ms, throughput 43.2912 MB/s
pinned: 1.4496 ms, throughput 43.1153 MB/s
managed: 40.3646 ms, throughput 1.54839 MB/s
managed: 35.8052 ms, throughput 1.74555 MB/s
managed: 36.7788 ms, throughput 1.69935 MB/s
managed: 37.3166 ms, throughput 1.67486 MB/s
managed: 35.3378 ms, throughput 1.76864 MB/s
device: 0.052256 ms, throughput 1196.03 MB/s
device: 0.061312 ms, throughput 1019.38 MB/s
device: 0.060736 ms, throughput 1029.04 MB/s
device: 0.060096 ms, throughput 1040 MB/s
device: 0.060352 ms, throughput 1035.59 MB/s

nvprof 确认在托管内存的情况下，所有 256 MB 都传输到设备。

==27443== Unified Memory profiling result:
Device "TITAN Xp (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    6734  38.928KB  4.0000KB  776.00KB  256.0000MB  29.95677ms  Host To Device

当我们移除循环内的预取时，迁移的页面仍保留在 GPU 上，这将访问时间提高到正常设备内存的水平。

pinned: 1.46848 ms, throughput 42.561 MB/s
pinned: 1.50842 ms, throughput 41.4342 MB/s
pinned: 1.44285 ms, throughput 43.3171 MB/s
pinned: 1.45802 ms, throughput 42.8665 MB/s
pinned: 1.4431 ms, throughput 43.3094 MB/s
managed: 41.9972 ms, throughput 1.4882 MB/s  <--- need to migrate pages
managed: 0.047584 ms, throughput 1313.47 MB/s <--- pages already present on GPU
managed: 0.059552 ms, throughput 1049.5 MB/s
managed: 0.057248 ms, throughput 1091.74 MB/s
managed: 0.062336 ms, throughput 1002.63 MB/s
device: 0.06176 ms, throughput 1011.98 MB/s
device: 0.062592 ms, throughput 998.53 MB/s
device: 0.062176 ms, throughput 1005.21 MB/s
device: 0.06128 ms, throughput 1019.91 MB/s
device: 0.063008 ms, throughput 991.937 MB/s

【讨论】：

绝妙的答案！

以上是关于映射内存和托管内存有啥区别？的主要内容，如果未能解决你的问题，请参考以下文章