Linux overcommit heuristic

Posted 2023-02-21


[Title]: Linux over commit heuristic [Asked]: 2016-12-05 22:36:50 [Question]:

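The overcommit article in the kernel documentation only mentions that overcommit mode 0 is based on heuristic overcommit handling. It does not outline the heuristic involved.

Could someone shed light on what the actual heuristic is? Any relevant link to the kernel sources would also do!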

[Comments]:

What is a heuristic?

[Answer 1]:

Actually, the kernel documentation on overcommit accounting has some details: https://www.kernel.org/doc/Documentation/vm/overcommit-accounting

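The Linux kernel supports the following overcommit handling modes

0 - Heuristic overcommit handling.

Obvious overcommits of address space are refused. Used for a typical system. It ensures a seriously wild allocation fails while allowing overcommit to reduce swap usage. root is allowed to allocate slightly more memory in this mode. This is the default.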

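There is also Documentation/sysctl/vm.txt

overcommit_memory: This value contains a flag that enables memory overcommitment. When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory...

See Documentation/vm/overcommit-accounting and mm/mmap.c::__vm_enough_memory() for more information.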

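Also, man 5 proc:

/proc/sys/vm/overcommit_memory This file contains the kernel virtual memory accounting mode. Values are:
                0: heuristic overcommit (this is the default)
                1: always overcommit, never check
                2: always check, never overcommit
In mode 0, calls of mmap(2) with MAP_NORESERVE are not checked, and the default check is very weak, leading to the risk of getting a process "OOM-killed".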
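
To see which of the three modes a machine is running in, the file named in the man page excerpt can be read directly. The following is a minimal sketch, not part of the original answer: the helper names read_overcommit_mode and overcommit_mode_name are hypothetical, and the path is passed in by the caller rather than hard-coded.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: returns the accounting mode (0, 1 or 2) stored in
 * the file at `path`, or -1 if the file cannot be opened or parsed.
 * On Linux the path would be /proc/sys/vm/overcommit_memory. */
int read_overcommit_mode(const char *path)
{
    assert(path != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int mode = -1;
    int fields = fscanf(f, "%d", &mode);
    fclose(f);

    if (fields != 1 || mode < 0 || mode > 2)
        return -1;
    return mode;
}

/* Hypothetical helper: maps a mode number to the wording used in proc(5). */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit";
    case 1:  return "always overcommit, never check";
    case 2:  return "always check, never overcommit";
    default: return "unknown";
    }
}
```

A caller could print overcommit_mode_name(read_overcommit_mode("/proc/sys/vm/overcommit_memory")); on a non-Linux system the read simply fails and the name comes back as "unknown".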

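So the heuristic refuses very large allocations, but an application can sometimes still allocate more virtual memory than the physical memory in the system, as long as it does not use all of it. With MAP_NORESERVE the amount of mappable memory may be higher.

The setting is "overcommit policy is set via the sysctl `vm.overcommit_memory'", so we can find how it is implemented in the source code: http://lxr.free-electrons.com/ident?v=4.4;i=sysctl_overcommit_memory, defined at line 112 of mm/mmap.c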

  112 int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;  /* heuristic overcommit */

The constant OVERCOMMIT_GUESS (defined in linux/mman.h) is actually used only at line 170 of mm/mmap.c, which is the implementation of the heuristic:

138 /*
139  * Check that a process has enough memory to allocate a new virtual
140  * mapping. 0 means there is enough memory for the allocation to
141  * succeed and -ENOMEM implies there is not.
142  *
143  * We currently support three overcommit policies, which are set via the
144  * vm.overcommit_memory sysctl.  See Documentation/vm/overcommit-accounting
145  *
146  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
147  * Additional code 2002 Jul 20 by Robert Love.
148  *
149  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
150  *
151  * Note this is a helper function intended to be used by LSMs which
152  * wish to use this logic.
153  */
154 int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
...
170         if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
171                 free = global_page_state(NR_FREE_PAGES);
172                 free += global_page_state(NR_FILE_PAGES);
173 
174                 /*
175                  * shmem pages shouldn't be counted as free in this
176                  * case, they can't be purged, only swapped out, and
177                  * that won't affect the overall amount of available
178                  * memory in the system.
179                  */
180                 free -= global_page_state(NR_SHMEM);
181 
182                 free += get_nr_swap_pages();
183 
184                 /*
185                  * Any slabs which are created with the
186                  * SLAB_RECLAIM_ACCOUNT flag claim to have contents
187                  * which are reclaimable, under pressure.  The dentry
188                  * cache and most inode caches should fall into this
189                  */
190                 free += global_page_state(NR_SLAB_RECLAIMABLE);
191 
192                 /*
193                  * Leave reserved pages. The pages are not for anonymous pages.
194                  */
195                 if (free <= totalreserve_pages)
196                         goto error;
197                 else
198                         free -= totalreserve_pages;
199 
200                 /*
201                  * Reserve some for root
202                  */
203                 if (!cap_sys_admin)
204                         free -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);
205 
206                 if (free > pages)
207                         return 0;
208 
209                 goto error;
210         }

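So the heuristic estimates how many physical memory pages are currently free or easily reclaimable (free) at the moment a request for more memory is processed (the application asks for pages pages), and it refuses the request only when the request is not smaller than that estimate.

With overcommit always enabled ("1"), this function always returns 0 ("there is enough memory for this request"):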
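
The same arithmetic can be replayed in user space. The sketch below follows the order of operations in the 4.4 listing quoted above, but it is only an illustration: struct mem_snapshot, its field names and heuristic_allows are hypothetical, every value is assumed to be already expressed in pages, and a real kernel reads these counters atomically from its own state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the counters the mode 0 check reads.
 * All fields are page counts; the names are illustrative only. */
struct mem_snapshot {
    long free_pages;        /* NR_FREE_PAGES */
    long file_pages;        /* NR_FILE_PAGES (page cache) */
    long shmem_pages;       /* NR_SHMEM */
    long swap_free_pages;   /* get_nr_swap_pages() */
    long slab_reclaimable;  /* NR_SLAB_RECLAIMABLE */
    long total_reserve;     /* totalreserve_pages */
    long admin_reserve;     /* sysctl_admin_reserve_kbytes, in pages */
};

/* Returns true if a request for `pages` pages would pass the check. */
bool heuristic_allows(const struct mem_snapshot *s, long pages,
                      bool cap_sys_admin)
{
    assert(s != NULL && pages >= 0);

    long free = s->free_pages + s->file_pages;
    free -= s->shmem_pages;        /* shmem can be swapped, not dropped */
    free += s->swap_free_pages;
    free += s->slab_reclaimable;

    if (free <= s->total_reserve)  /* reserved pages are off limits */
        return false;
    free -= s->total_reserve;

    if (!cap_sys_admin)            /* keep a little back for root */
        free -= s->admin_reserve;

    return free > pages;
}
```

With a snapshot of 1000 free, 500 file, 100 shmem, 2000 swap and 50 reclaimable slab pages, a reserve of 200 and an admin reserve of 10, the estimate is 3240 pages for an ordinary user and 3250 for root, so a 3245-page request is refused for the former and granted for the latter.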

164         /*
165          * Sometimes we want to use more memory than we have
166          */
167         if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)
168                 return 0;

Without this default heuristic, in mode "2", the kernel accounts for the requested pages pages to get the new Committed_AS (from /proc/meminfo):

162         vm_acct_memory(pages);
...

This is actually just an increment of vm_committed_as: __percpu_counter_add(&vm_committed_as, pages, vm_committed_as_batch);

212         allowed = vm_commit_limit();

Some magic is here:

401 /*
402  * Committed memory limit enforced when OVERCOMMIT_NEVER policy is used
403  */
404 unsigned long vm_commit_limit(void)
405 {
406         unsigned long allowed;
407 
408         if (sysctl_overcommit_kbytes)
409                 allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
410         else
411                 allowed = ((totalram_pages - hugetlb_total_pages())
412                            * sysctl_overcommit_ratio / 100);
413         allowed += total_swap_pages;
414 
415         return allowed;
416 }

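So allowed is either taken from the vm.overcommit_kbytes sysctl (given in kilobytes) or computed as vm.overcommit_ratio percent of physical RAM (excluding huge pages); in both cases the swap size is then added.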
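
As a worked example of that formula, here is a small user-space sketch. The function name commit_limit_pages and its parameter list are hypothetical; the kernel reads these values from globals, and PAGE_SHIFT is a compile-time constant there rather than an argument.

```c
#include <assert.h>

/* Hypothetical re-statement of the commit limit calculation.
 * All *_pages arguments are page counts; overcommit_kbytes is in KiB
 * and, when non-zero, takes precedence over overcommit_ratio. */
unsigned long commit_limit_pages(unsigned long totalram_pages,
                                 unsigned long hugetlb_pages,
                                 unsigned long total_swap_pages,
                                 unsigned long overcommit_kbytes,
                                 unsigned long overcommit_ratio,
                                 unsigned int page_shift)
{
    assert(page_shift >= 10 && page_shift < 32);
    assert(hugetlb_pages <= totalram_pages);

    unsigned long allowed;
    if (overcommit_kbytes)
        allowed = overcommit_kbytes >> (page_shift - 10);
    else
        allowed = (totalram_pages - hugetlb_pages) * overcommit_ratio / 100;

    /* Swap is added in both branches. */
    return allowed + total_swap_pages;
}
```

For a machine with 8 GiB of RAM (2097152 pages of 4 KiB), 2 GiB of swap (524288 pages) and a ratio of 50, the limit comes out at 1048576 + 524288 = 1572864 pages, that is 6 GiB. Setting overcommit_kbytes to 4194304 instead gives the same result.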

213         /*
214          * Reserve some for root
215          */
216         if (!cap_sys_admin)
217                 allowed -= sysctl_admin_reserve_kbytes >> (PAGE_SHIFT - 10);

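Some amount of memory is kept for root only (PAGE_SHIFT is 12 on typical systems with 4 KiB pages, and the shift by PAGE_SHIFT - 10 is just a conversion from kilobytes to a page count).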
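
The kilobyte-to-page conversion is easy to check by hand. A minimal sketch follows, with a hypothetical helper kbytes_to_pages that takes the page shift as a parameter so that other page sizes can be tried:

```c
#include <assert.h>

/* Hypothetical helper mirroring `kbytes >> (PAGE_SHIFT - 10)`.
 * A page is 2^page_shift bytes and a kilobyte is 2^10 bytes, so one
 * page holds 2^(page_shift - 10) kilobytes. The result rounds down. */
unsigned long kbytes_to_pages(unsigned long kbytes, unsigned int page_shift)
{
    assert(page_shift >= 10 && page_shift < 32);
    return kbytes >> (page_shift - 10);
}
```

With a page shift of 12, kbytes_to_pages(8192, 12) yields 2048 pages, since each 4 KiB page holds 4 kilobytes.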

218 
219         /*
220          * Don't let a single process grow so big a user can't recover
221          */
222         if (mm) {
223                 reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
224                 allowed -= min_t(long, mm->total_vm / 32, reserve);
225         }
226 
227         if (percpu_counter_read_positive(&vm_committed_as) < allowed)
228                 return 0;

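If, after accounting for the request, the memory committed by all of user space is still less than allowed, the allocation is granted. Otherwise the request is refused (and its accounting is undone).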

229 error:
230         vm_unacct_memory(pages);
231 
232         return -ENOMEM;
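
Putting the mode "2" steps together, the check can be sketched in user space as follows. This is an illustration under stated assumptions, not kernel code: strict_allows and its parameters are hypothetical, all quantities are page counts, the commit limit is supplied by the caller, and the per-process reserve is always applied (the kernel skips it when mm is NULL).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical re-statement of the strict (mode 2) check.
 * committed_as : pages committed before this request
 * pages        : size of the request
 * commit_limit : result of the commit limit calculation
 * admin_reserve, user_reserve : the two reserves, already in pages
 * total_vm     : current size of the requesting process, in pages */
bool strict_allows(long committed_as, long pages, long commit_limit,
                   long admin_reserve, long user_reserve,
                   long total_vm, bool cap_sys_admin)
{
    assert(pages >= 0 && total_vm >= 0);

    /* Account for the request first, as vm_acct_memory() does. */
    long committed = committed_as + pages;

    long allowed = commit_limit;
    if (!cap_sys_admin)
        allowed -= admin_reserve;      /* reserve some for root */

    /* Don't let one process grow so big that a user can't recover. */
    long per_process = total_vm / 32;
    allowed -= (per_process < user_reserve) ? per_process : user_reserve;

    return committed < allowed;
}
```

With 1000 pages already committed, a limit of 2000, an admin reserve of 50, a user reserve of 200 and a process of 3200 pages, an ordinary user may commit up to 1850 pages in total, so a request for 850 pages is refused while root, with a threshold of 1900, is still granted it.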

In other words, as Andries Brouwer summarized in "The Linux kernel. Some remarks on the Linux Kernel" (2003-02-01), 9. Memory, 9.6 Overcommit and OOM - https://www.win.tue.nl/~aeb/linux/lk/lk-9.html:

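Going in the right direction

Since 2.5.30 the values are:
0 (default): as before: guess about how much overcommitment is reasonable,
1: never refuse any malloc(),
2: be precise about the overcommit - never commit a virtual address space larger than swap space plus a fraction overcommit_ratio of the physical memory.

So "2" is a precise calculation of the amount of memory committed after the request, and "0" is a heuristic estimate.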
