Heterogeneous Memory Management (HMM)

Posted rtoax


https://www.kernel.org/doc/html/latest/vm/hmm.html

Contents

Heterogeneous Memory Management (HMM)

Problems of using a device specific memory allocator

I/O bus, device memory characteristics

Shared address space and migration

Address space mirroring implementation and API

Leverage default_flags and pfn_flags_mask

Represent and manage device memory from core kernel point of view

Migration to and from device memory

Memory cgroup (memcg) and rss accounting



Heterogeneous Memory Management (HMM)

Provide infrastructure and helpers to integrate non-conventional memory (device memory like GPU on-board memory) into regular kernel paths, with the cornerstone of this being specialized struct page for such memory (see sections 5 to 7 of this document).

HMM also provides optional helpers for SVM (Shared Virtual Memory), i.e., allowing a device to transparently access program addresses coherently with the CPU, meaning that any valid pointer on the CPU is also a valid pointer for the device. This is becoming mandatory to simplify the use of advanced heterogeneous computing, where GPUs, DSPs, or FPGAs are used to perform various computations on behalf of a process.

This document is divided as follows: in the first section I expose the problems related to using device specific memory allocators. In the second section, I expose the hardware limitations that are inherent to many platforms. The third section gives an overview of the HMM design. The fourth section explains how CPU page-table mirroring works and the purpose of HMM in this context. The fifth section deals with how device memory is represented inside the kernel. Finally, the last section presents a new migration helper that allows leveraging the device DMA engine.

Problems of using a device specific memory allocator

Devices with a large amount of on-board memory (several gigabytes) like GPUs have historically managed their memory through dedicated driver specific APIs. This creates a disconnect between memory allocated and managed by a device driver and regular application memory (private anonymous, shared memory, or regular file backed memory). From here on I will refer to this aspect as split address space. I use shared address space to refer to the opposite situation: i.e., one in which any application memory region can be used by a device transparently.

Split address space happens because devices can only access memory allocated through a device specific API. This implies that all memory objects in a program are not equal from the device point of view, which complicates large programs that rely on a wide set of libraries.

Concretely, this means that code that wants to leverage devices like GPUs needs to copy objects between generically allocated memory (malloc, mmap private, mmap share) and memory allocated through the device driver API (this still ends up with an mmap, but of the device file).

For flat data sets (array, grid, image, …) this isn't too hard to achieve, but for complex data sets (list, tree, …) it's hard to get right. Duplicating a complex data set requires re-mapping all the pointer relations between each of its elements. This is error prone and programs get harder to debug because of the duplicated data set and addresses.

Split address space also means that libraries cannot transparently use data they get from the core program or another library, and thus each library might have to duplicate its input data set using the device specific memory allocator. Large projects suffer from this and waste resources because of the various memory copies.

Duplicating each library API to accept as input or output memory allocated by each device specific allocator is not a viable option. It would lead to a combinatorial explosion in the library entry points.

Finally, with the advance of high level language constructs (in C++ but in other languages too) it is now possible for the compiler to leverage GPUs and other devices without programmer knowledge. Some compiler identified patterns are only doable with a shared address space. It is also more reasonable to use a shared address space for all other patterns.

I/O bus, device memory characteristics

I/O buses cripple shared address spaces due to a few limitations. Most I/O buses only allow basic memory access from device to main memory; even cache coherency is often optional. Access to device memory from the CPU is even more limited. More often than not, it is not cache coherent.

If we only consider the PCIE bus, then a device can access main memory (often through an IOMMU) and be cache coherent with the CPUs. However, it only allows a limited set of atomic operations from the device on main memory. This is worse in the other direction: the CPU can only access a limited range of the device memory and cannot perform atomic operations on it. Thus device memory cannot be considered the same as regular memory from the kernel point of view.

Another crippling factor is the limited bandwidth (~32 GBytes/s with PCIE 4.0 and 16 lanes). This is 33 times less than the fastest GPU memory (1 TBytes/s). The final limitation is latency. Access to main memory from the device has an order of magnitude higher latency than when the device accesses its own memory.

Some platforms are developing new I/O buses or additions/modifications to PCIE to address some of these limitations (OpenCAPI, CCIX). They mainly allow two-way cache coherency between CPU and device and allow all atomic operations the architecture supports. Sadly, not all platforms are following this trend and some major architectures are left without hardware solutions to these problems.

So for shared address space to make sense, not only must we allow devices to access any memory, but we must also permit any memory to be migrated to device memory while the device is using it (blocking CPU access while it happens).

Shared address space and migration

HMM intends to provide two main features. The first one is to share the address space by duplicating the CPU page table in the device page table, so the same address points to the same physical memory for any valid main memory address in the process address space.

To achieve this, HMM offers a set of helpers to populate the device page table while keeping track of CPU page table updates. Device page table updates are not as easy as CPU page table updates. To update the device page table, you must allocate a buffer (or use a pool of pre-allocated buffers) and write GPU specific commands into it to perform the update (unmap, cache invalidations, flush, …). This cannot be done through common code for all devices. Hence why HMM provides helpers to factor out everything that can be, while leaving the hardware specific details to the device driver.

The second mechanism HMM provides is a new kind of ZONE_DEVICE memory that allows allocating a struct page for each page of device memory. Those pages are special because the CPU cannot map them. However, they allow migrating main memory to device memory using existing migration mechanisms, and from the CPU point of view everything looks like a page that is swapped out to disk. Using a struct page gives the easiest and cleanest integration with existing mm mechanisms. Here again, HMM only provides helpers, first to hotplug new ZONE_DEVICE memory for the device memory and second to perform migration. Policy decisions of what and when to migrate are left to the device driver.

Note that any CPU access to a device page triggers a page fault and a migration back to main memory. For example, when a page backing a given CPU address A is migrated from a main memory page to a device page, then any CPU access to address A triggers a page fault and initiates a migration back to main memory.

With these two features, HMM not only allows a device to mirror a process address space and keep both CPU and device page tables synchronized, but also leverages device memory by migrating the part of the data set that is actively being used by the device.

Address space mirroring implementation and API

Address space mirroring's main objective is to allow duplication of a range of CPU page table into a device page table; HMM helps keep both synchronized. A device driver that wants to mirror a process address space must start with the registration of a mmu_interval_notifier:

int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
                                 struct mm_struct *mm, unsigned long start,
                                 unsigned long length,
                                 const struct mmu_interval_notifier_ops *ops);

During the ops->invalidate() callback the device driver must perform the update action on the range (mark the range read only, or fully unmap it, etc.). The device must complete the update before the driver callback returns.
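
A minimal sketch of what such a notifier and its invalidate() callback might look like, assuming a hypothetical drv_mirror structure whose update mutex plays the role of the driver->update lock used in the pattern further below:

struct drv_mirror {
    struct mmu_interval_notifier notifier;
    struct mutex update;     /* plays the role of driver->update below */
};

static bool drv_invalidate(struct mmu_interval_notifier *mni,
                           const struct mmu_notifier_range *range,
                           unsigned long cur_seq)
{
    struct drv_mirror *mirror = container_of(mni, struct drv_mirror, notifier);

    if (!mmu_notifier_range_blockable(range))
        return false;

    mutex_lock(&mirror->update);
    mmu_interval_set_seq(mni, cur_seq);
    /* Issue device specific commands to unmap/invalidate the device page
     * table entries covering [range->start, range->end) and wait for the
     * device to complete them before returning. */
    mutex_unlock(&mirror->update);
    return true;
}

static const struct mmu_interval_notifier_ops drv_mni_ops = {
    .invalidate = drv_invalidate,
};

/* Registered once per mirrored interval, e.g.:
 * mmu_interval_notifier_insert(&mirror->notifier, current->mm,
 *                              start, length, &drv_mni_ops);
 */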

When the device driver wants to populate a range of virtual addresses, it can use:

int hmm_range_fault(struct hmm_range *range);

It will trigger a page fault on missing or read-only entries if write access is requested (see below). Page faults use the generic mm page fault code path, just like a CPU page fault.

Both functions copy CPU page table entries into their pfns array argument. Each entry in that array corresponds to an address in the virtual range. HMM provides a set of flags to help the driver identify special CPU page table entries.

Locking within the sync_cpu_device_pagetables() callback is the most important aspect the driver must respect in order to keep things properly synchronized. The usage pattern is:

int driver_populate_range(...)
{
     struct hmm_range range;
     ...

     range.notifier = &interval_sub;
     range.start = ...;
     range.end = ...;
     range.hmm_pfns = ...;

     if (!mmget_not_zero(interval_sub->notifier.mm))
         return -EFAULT;

again:
     range.notifier_seq = mmu_interval_read_begin(&interval_sub);
     mmap_read_lock(mm);
     ret = hmm_range_fault(&range);
     if (ret) {
         mmap_read_unlock(mm);
         if (ret == -EBUSY)
                goto again;
         return ret;
     }
     mmap_read_unlock(mm);

     take_lock(driver->update);
     if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
         release_lock(driver->update);
         goto again;
     }

     /* Use pfns array content to update device page table,
      * under the update lock */

     release_lock(driver->update);
     return 0;
}

The driver->update lock is the same lock that the driver takes inside its invalidate() callback. That lock must be held before calling mmu_interval_read_retry() to avoid any race with a concurrent CPU page table update.

Leverage default_flags and pfn_flags_mask

The hmm_range struct has 2 fields, default_flags and pfn_flags_mask, that specify fault or snapshot policy for the whole range instead of having to set them for each entry in the pfns array.

For instance, if the device driver wants pages for a range with at least read permission, it sets:

range->default_flags = HMM_PFN_REQ_FAULT;
range->pfn_flags_mask = 0;

and calls hmm_range_fault() as described above. This will fault in all pages in the range with at least read permission.

Now let's say the driver wants to do the same, except for one page in the range for which it wants write permission. Now the driver sets:

range->default_flags = HMM_PFN_REQ_FAULT;
range->pfn_flags_mask = HMM_PFN_REQ_WRITE;
range->pfns[index_of_write] = HMM_PFN_REQ_WRITE;

With this, HMM will fault in all pages with at least read permission (i.e., valid), and for the address == range->start + (index_of_write << PAGE_SHIFT) it will fault with write permission, i.e., if the CPU pte does not have write permission set then HMM will call handle_mm_fault().

After hmm_range_fault completes, the flag bits are set to the current state of the page tables, i.e., HMM_PFN_VALID | HMM_PFN_WRITE will be set if the page is writable.
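
As a sketch of how a driver might consume that snapshot, assuming a hypothetical helper dev_pte_make() and a driver-owned dev_ptes[] array (the HMM_PFN_* flags and hmm_pfn_to_page() are the real kernel symbols, everything else is illustrative):

static void drv_build_device_ptes(struct hmm_range *range, u64 *dev_ptes)
{
    unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
    unsigned long i;

    for (i = 0; i < npages; i++) {
        unsigned long entry = range->hmm_pfns[i];

        if (!(entry & HMM_PFN_VALID)) {
            dev_ptes[i] = 0;   /* leave this address unmapped on the device */
            continue;
        }
        /* dev_pte_make() is a hypothetical helper building a device PTE
         * from a host PFN and a writable flag. */
        dev_ptes[i] = dev_pte_make(page_to_pfn(hmm_pfn_to_page(entry)),
                                   !!(entry & HMM_PFN_WRITE));
    }
}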

Represent and manage device memory from core kernel point of view

Several different designs were tried to support device memory. The first one used a device specific data structure to keep information about migrated memory, and HMM hooked itself in various places of the mm code to handle any access to addresses backed by device memory. It turned out that this ended up replicating most of the fields of struct page and also needed many kernel code paths to be updated to understand this new kind of memory.

Most kernel code paths never try to access the memory behind a page but only care about struct page contents. Because of this, HMM switched to directly using struct page for device memory, which left most kernel code paths unaware of the difference. We only need to make sure that no one ever tries to map those pages from the CPU side.

Migration to and from device memory

Because the CPU cannot access device memory directly, the device driver must use hardware DMA or device specific load/store instructions to migrate data. The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() functions are designed to make drivers easier to write and to centralize common code across drivers.

Before migrating pages to device private memory, special device private struct page need to be created. These will be used as special "swap" page table entries so that a CPU process will fault if it tries to access a page that has been migrated to device private memory.

These can be allocated and freed with:

struct resource *res;
struct dev_pagemap pagemap;

res = request_free_mem_region(&iomem_resource, /* number of bytes */,
                              "name of driver resource");
pagemap.type = MEMORY_DEVICE_PRIVATE;
pagemap.range.start = res->start;
pagemap.range.end = res->end;
pagemap.nr_range = 1;
pagemap.ops = &device_devmem_ops;
memremap_pages(&pagemap, numa_node_id());

memunmap_pages(&pagemap);
release_mem_region(pagemap.range.start, range_len(&pagemap.range));

There are also devm_request_free_mem_region(), devm_memremap_pages(), devm_memunmap_pages(), and devm_release_mem_region() for when the resources can be tied to a struct device.
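
The snippet above leaves device_devmem_ops undefined; a minimal sketch of what such a dev_pagemap_ops could look like, where drv_evict_to_ram() is a hypothetical helper built on the migrate_vma_*() functions described below:

/* Sketch of the pagemap.ops used above; the .migrate_to_ram callback is
 * what the core mm invokes when a CPU touches a device private page. */
static void drv_devmem_page_free(struct page *page)
{
    /* Return the device memory backing this struct page to the driver's
     * own allocator. */
}

static vm_fault_t drv_devmem_migrate_to_ram(struct vm_fault *vmf)
{
    /* drv_evict_to_ram() is a hypothetical helper that uses the
     * migrate_vma_*() functions described below to move the faulting
     * page back to system memory. */
    if (drv_evict_to_ram(vmf))
        return VM_FAULT_SIGBUS;
    return 0;
}

static const struct dev_pagemap_ops device_devmem_ops = {
    .page_free      = drv_devmem_page_free,
    .migrate_to_ram = drv_devmem_migrate_to_ram,
};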

The overall migration steps are similar to migrating NUMA pages within system memory (see Page migration), but the steps are split between device driver specific code and shared common code (a condensed driver-side sketch follows the list below):

  1. mmap_read_lock()

    The device driver has to pass a struct vm_area_struct to migrate_vma_setup(), so the mmap_read_lock() or mmap_write_lock() needs to be held for the duration of the migration.

  2. migrate_vma_setup(struct migrate_vma *args)

    The device driver initializes the struct migrate_vma fields and passes the pointer to migrate_vma_setup(). The args->flags field is used to filter which source pages should be migrated. For example, setting MIGRATE_VMA_SELECT_SYSTEM will only migrate system memory and MIGRATE_VMA_SELECT_DEVICE_PRIVATE will only migrate pages residing in device private memory. If the latter flag is set, the args->pgmap_owner field is used to identify device private pages owned by the driver. This avoids trying to migrate device private pages residing in other devices. Currently only anonymous private VMA ranges can be migrated to or from system memory and device private memory.

    One of the first steps migrate_vma_setup() does is to invalidate other devices' MMUs with the mmu_notifier_invalidate_range_start() and mmu_notifier_invalidate_range_end() calls around the page table walks that fill in the args->src array with the PFNs to be migrated. The invalidate_range_start() callback is passed a struct mmu_notifier_range with the event field set to MMU_NOTIFY_MIGRATE and the migrate_pgmap_owner field set to the args->pgmap_owner field passed to migrate_vma_setup(). This allows the device driver to skip the invalidation callback and only invalidate device private MMU mappings that are actually migrating. This is explained more in the next section.

    While walking the page tables, a pte_none() or is_zero_pfn() entry results in a valid "zero" PFN stored in the args->src array. This lets the driver allocate device private memory and clear it instead of copying a page of zeros. Valid PTE entries to system memory or device private struct pages will be locked with lock_page(), isolated from the LRU (if system memory, since device private pages are not on the LRU), unmapped from the process, and a special migration PTE is inserted in place of the original PTE. migrate_vma_setup() also clears the args->dst array.

  3. The device driver allocates destination pages and copies source pages to destination pages.

    The driver checks each src entry to see if the MIGRATE_PFN_MIGRATE bit is set and skips entries that are not migrating. The device driver can also choose to skip migrating a page by not filling in the dst array for that page.

    The driver then allocates either a device private struct page or a system memory page, locks the page with lock_page(), and fills in the dst array entry with:

    dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;

    Now that the driver knows this page is being migrated, it can invalidate device private MMU mappings and copy device private memory to system memory or another device private page. The core Linux kernel handles CPU page table invalidations, so the device driver only has to invalidate its own MMU mappings.

    The driver can use migrate_pfn_to_page(src[i]) to get the struct page of the source and either copy the source page to the destination or clear the destination device private memory if the pointer is NULL, meaning the source page was not populated in system memory.

  4. migrate_vma_pages()

    This step is where the migration is actually "committed".

    If the source page was a pte_none() or is_zero_pfn() page, this is where the newly allocated page is inserted into the CPU's page table. This can fail if a CPU thread faults on the same page. However, the page table is locked and only one of the new pages will be inserted. The device driver will see that the MIGRATE_PFN_MIGRATE bit is cleared if it loses the race.

    If the source page was locked, isolated, etc., the source struct page information is now copied to the destination struct page, finalizing the migration on the CPU side.

  5. The device driver updates the device MMU page tables for pages still migrating, rolling back pages not migrating.

    If the src entry still has the MIGRATE_PFN_MIGRATE bit set, the device driver can update the device MMU and set the write enable bit if the MIGRATE_PFN_WRITE bit is set.

  6. migrate_vma_finalize()

    This step replaces the special migration page table entry with the new page's page table entry and releases the reference to the source and destination struct page.

  7. mmap_read_unlock()

    The lock can now be released.
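
For orientation, a condensed sketch of steps 1-7 for migrating one small anonymous range into device private memory; drv_alloc_device_page() and drv_copy_to_device() are hypothetical driver helpers and error handling is kept minimal:

#define DRV_NPAGES 16        /* sketch: handle at most 16 pages per call */

static int drv_migrate_to_device(struct vm_area_struct *vma,
                                 unsigned long start, unsigned long end,
                                 struct dev_pagemap *pgmap)
{
    unsigned long src[DRV_NPAGES] = { 0 }, dst[DRV_NPAGES] = { 0 };
    struct migrate_vma args = {
        .vma         = vma,
        .src         = src,
        .dst         = dst,
        .start       = start,
        .end         = end,
        .pgmap_owner = pgmap->owner,
        .flags       = MIGRATE_VMA_SELECT_SYSTEM,
    };
    unsigned long i, npages = (end - start) >> PAGE_SHIFT;
    int ret;

    if (npages > DRV_NPAGES)
        return -EINVAL;

    mmap_read_lock(vma->vm_mm);                        /* step 1 */
    ret = migrate_vma_setup(&args);                    /* step 2 */
    if (ret)
        goto out_unlock;

    for (i = 0; i < npages; i++) {                     /* step 3 */
        struct page *spage, *dpage;

        if (!(src[i] & MIGRATE_PFN_MIGRATE))
            continue;
        dpage = drv_alloc_device_page(pgmap);          /* hypothetical */
        if (!dpage)
            continue;          /* leaving dst[i] == 0 skips this page */
        lock_page(dpage);
        spage = migrate_pfn_to_page(src[i]);
        if (spage)
            drv_copy_to_device(dpage, spage);          /* hypothetical DMA copy */
        dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
    }

    migrate_vma_pages(&args);                          /* step 4 */
    /* Step 5 would update the device MMU here for entries that still
     * have MIGRATE_PFN_MIGRATE set, honouring MIGRATE_PFN_WRITE. */
    migrate_vma_finalize(&args);                       /* step 6 */
out_unlock:
    mmap_read_unlock(vma->vm_mm);                      /* step 7 */
    return ret;
}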

Memory cgroup (memcg) and rss accounting

For now, device memory is accounted as any regular page in rss counters (anonymous if the device page is used for anonymous memory, file if the device page is used for a file backed page, or shmem if the device page is used for shared memory). This is a deliberate choice to keep existing applications, which might start using device memory without knowing about it, running unimpacted.

A drawback is that the OOM killer might kill an application using a lot of device memory and not a lot of regular system memory, and thus not free much system memory. We want to gather more real world experience on how applications and the system react under memory pressure in the presence of device memory before deciding to account device memory differently.

The same decision was made for memory cgroups. Device memory pages are accounted against the same memory cgroup a regular page would be accounted to. This does simplify migration to and from device memory. It also means that migration back from device memory to regular memory cannot fail because it would go above the memory cgroup limit. We might revisit this choice later on once we get more experience with how device memory is used and its impact on memory resource control.

Note that device memory can never be pinned by a device driver nor through GUP, and thus such memory is always freed upon process exit, or when the last reference is dropped in the case of shared memory or file backed memory.

 

