Page compaction code analysis, part 1
Posted Loopers
Important data structures
/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogically to reclaim priority.
 */
enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};
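This enum defines the compaction priority:
- COMPACT_PRIO_SYNC_FULL: fully synchronous mode. Blocking is allowed, and dirty pages may be written back to storage, waiting until the writeback completes.
- COMPACT_PRIO_SYNC_LIGHT: light synchronous mode. Most blocking operations are allowed, but writing dirty pages back to storage is not, because the wait would be too long.
- COMPACT_PRIO_ASYNC: asynchronous mode. Blocking is not allowed.
- Priority order: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC (the lower the enum value, the higher the priority).
- Cost of compaction: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC.
- Fully synchronous mode has the highest success rate.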
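The "lower value means higher priority" convention can be sketched in user space as follows. The enum is the one quoted above; next_priority() is a hypothetical helper, loosely modelled on how the allocator's retry path is understood to step to a stronger mode after a failed attempt, and PAGE_ALLOC_COSTLY_ORDER is assumed to be 3 as in mainline kernels. It is an illustration of the ordering, not kernel code.

```c
/* User-space sketch, not kernel code. */
#define PAGE_ALLOC_COSTLY_ORDER 3	/* assumed mainline value */

enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

/*
 * Hypothetical helper: pick the priority for the next attempt.
 * Decrementing the value raises the priority. Costly orders stop at
 * MIN_COMPACT_COSTLY_PRIORITY and never reach the full-sync mode.
 */
static enum compact_priority next_priority(enum compact_priority prio, int order)
{
	enum compact_priority min_prio = (order > PAGE_ALLOC_COSTLY_ORDER) ?
		MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;

	return (prio > min_prio) ? (enum compact_priority)(prio - 1) : prio;
}
```

Starting from INIT_COMPACT_PRIORITY, an order-2 request would step through SYNC_LIGHT to SYNC_FULL, while an order-5 request would stay at SYNC_LIGHT.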
Next, let's look at the states that report whether compaction succeeded:
/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NOT_SUITABLE_ZONE,
	/*
	 * compaction didn't start as it was not possible or direct reclaim
	 * was more suitable
	 */
	COMPACT_SKIPPED,
	/* compaction didn't start as it was deferred due to past failures */
	COMPACT_DEFERRED,
	/* compaction not active last round */
	COMPACT_INACTIVE = COMPACT_DEFERRED,
	/* For more detailed tracepoint output - internal to compaction */
	COMPACT_NO_SUITABLE_PAGE,
	/* compaction should continue to another pageblock */
	COMPACT_CONTINUE,
	/*
	 * The full zone was compacted scanned but wasn't successfull to compact
	 * suitable pages.
	 */
	COMPACT_COMPLETE,
	/*
	 * direct compaction has scanned part of the zone but wasn't successfull
	 * to compact suitable pages.
	 */
	COMPACT_PARTIAL_SKIPPED,
	/* compaction terminated prematurely due to lock contentions */
	COMPACT_CONTENDED,
	/*
	 * direct compaction terminated after concluding that the allocation
	 * should now succeed
	 */
	COMPACT_SUCCESS,
};
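- COMPACT_SKIPPED: the zone is skipped, typically because it is not suitable for compaction
- COMPACT_DEFERRED: compaction did not start on this zone because it failed there recently
- COMPACT_CONTINUE: keep going with page compaction
- COMPACT_COMPLETE: the whole zone was scanned, but no suitable pages could be compacted
- COMPACT_PARTIAL_SKIPPED: part of the zone was scanned, but no suitable pages were found
- COMPACT_SUCCESS: compaction succeeded and produced a free block large enough for the allocation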
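Fragmentation index
A memory allocation can fail for two reasons:
- there is not enough memory
- memory is too fragmented
The fragmentation index exists to tell these two causes apart. For an allocation that would fail, it takes a value in [0, 1000]:
- an index close to 0 means the failure is due to lack of memory
- an index close to 1000 means the failure is due to fragmentation
The kernel also provides a threshold that the index is compared against: int sysctl_extfrag_threshold = 500; the default value is 500.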
root:/ # cat /proc/sys/vm/extfrag_threshold
500
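The default is 500. If the threshold is set too high, nearly every allocation failure is attributed to lack of memory and compaction is skipped. If it is set too low, page compaction runs too often and system load goes up.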
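To make the index concrete, here is a minimal user-space sketch of how it can be computed from a summary of a zone's free lists. The formula follows my understanding of the kernel's __fragmentation_index(); struct free_info, frag_index() and worth_compacting() are hypothetical names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical summary of a zone's free lists (names are illustrative). */
struct free_info {
	uint64_t free_pages;		/* total free pages in the zone */
	uint64_t free_blocks_total;	/* free blocks of any order */
	uint64_t free_blocks_suitable;	/* free blocks of at least the requested order */
};

/* Sketch of the index: towards 0 = lack of memory, towards 1000 = fragmentation. */
static int frag_index(unsigned int order, const struct free_info *info)
{
	uint64_t requested = 1ULL << order;
	int64_t scaled;

	if (info->free_blocks_total == 0)
		return 0;	/* nothing is free at all */
	if (info->free_blocks_suitable > 0)
		return -1000;	/* the request would not fail */

	scaled = (int64_t)((1000 + info->free_pages * 1000 / requested) /
			   info->free_blocks_total);
	return (int)(1000 - scaled);
}

/* Compaction is skipped only when the index lies in [0, threshold]. */
static int worth_compacting(int index, int threshold)
{
	return !(index >= 0 && index <= threshold);
}
```

For example, 64 free pages that are all isolated order-0 blocks cannot serve an order-4 request (16 pages) even though enough memory is free. The sketch yields 1000 - (1000 + 64000 / 16) / 64 = 922, well above the default threshold of 500, so compaction is worth trying.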
Deciding whether a zone is suitable for page compaction
enum compact_result compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int classzone_idx)
{
	enum compact_result ret;
	int fragindex;

	ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
				    zone_page_state(zone, NR_FREE_PAGES));
	/*
	 * fragmentation index determines if allocation failures are due to
	 * low memory or external fragmentation
	 *
	 * index of -1000 would imply allocations might succeed depending on
	 * watermarks, but we already failed the high-order watermark check
	 * index towards 0 implies failure is due to lack of memory
	 * index towards 1000 implies failure is due to fragmentation
	 *
	 * Only compact if a failure would be due to fragmentation. Also
	 * ignore fragindex for non-costly orders where the alternative to
	 * a successful reclaim/compaction is OOM. Fragindex and the
	 * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
	 * excessive compaction for costly orders, but it should not be at the
	 * expense of system stability.
	 */
	if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
		fragindex = fragmentation_index(zone, order);
		if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
			ret = COMPACT_NOT_SUITABLE_ZONE;
	}

	trace_mm_compaction_suitable(zone, order, ret);
	if (ret == COMPACT_NOT_SUITABLE_ZONE)
		ret = COMPACT_SKIPPED;

	return ret;
}
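- __compaction_suitable() does the main work of deciding whether the zone is suitable for page compaction.
- If it returns COMPACT_CONTINUE and the order is costly (order > PAGE_ALLOC_COSTLY_ORDER), the fragmentation index is computed. If the index lies in [0, sysctl_extfrag_threshold], which is [0, 500] by default, the zone is not suitable for page compaction.
- An unsuitable zone is reported to the caller as COMPACT_SKIPPED; COMPACT_NOT_SUITABLE_ZONE is only used for the tracepoint.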
static enum compact_result __compaction_suitable(struct zone *zone, int order,
					unsigned int alloc_flags,
					int classzone_idx,
					unsigned long wmark_target)
{
	unsigned long watermark;

	if (is_via_compact_memory(order))
		return COMPACT_CONTINUE;

	watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
	/*
	 * If watermarks for high-order allocation are already met, there
	 * should be no need for compaction at all.
	 */
	if (zone_watermark_ok(zone, order, watermark, classzone_idx,
			      alloc_flags))
		return COMPACT_SUCCESS;

	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	watermark += compact_gap(order);
	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
				 ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;

	return COMPACT_CONTINUE;
}
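- If compaction was triggered by writing to /proc/sys/vm/compact_memory, order is -1 and compaction simply continues.
- Otherwise, the zone's watermark selected by alloc_flags is read first.
- If zone_watermark_ok() shows that the zone already has enough free memory for the request (roughly, free pages minus the requested pages is still at or above the watermark), COMPACT_SUCCESS is returned, meaning the zone does not need compaction.
- Next, a base watermark is chosen by order: the low watermark for a costly order, the min watermark otherwise.
- compact_gap(order), which is twice the request size (2 << order pages), is added to that watermark. If the zone's order-0 free pages do not exceed watermark + compact_gap(order), there is not enough free memory to serve as migration targets and COMPACT_SKIPPED is returned. Otherwise the zone can be compacted and COMPACT_CONTINUE is returned.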
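The arithmetic of the last check can be sketched as below. compact_gap() is assumed to match the kernel helper (twice the request size in pages); enough_free_for_compaction() is a hypothetical simplification, since the real check goes through __zone_watermark_ok(), which also accounts for lowmem reserves and CMA pages.

```c
#include <stdbool.h>

/* Assumed to mirror the kernel helper: twice the size of the request, in pages. */
static unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

/*
 * Hypothetical simplification of the final check in __compaction_suitable():
 * an order-0 watermark test against base watermark + compact_gap(order).
 */
static bool enough_free_for_compaction(unsigned long free_pages,
				       unsigned long base_watermark,
				       unsigned int order)
{
	return free_pages > base_watermark + compact_gap(order);
}
```

With a base watermark of 1000 pages and an order-4 request, the gap is 32 pages, so in this sketch the zone needs more than 1032 free pages before compaction is attempted.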
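Deferring compaction
Why does compaction need to be deferred?
If the last compaction failed and the next attempt comes soon afterwards, it is likely to fail again and only add load to the system. So the next attempts are deferred. struct zone defines a few fields for this purpose: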
struct zone {
	/* ... other fields omitted ... */
#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 */
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
#endif
	/* ... other fields omitted ... */
};
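- compact_considered: the number of compaction attempts considered (and skipped) since the last failure
- compact_defer_shift: the base-2 logarithm of the number of attempts to skip, i.e. 1 << compact_defer_shift attempts are skipped
- compact_order_failed: the lowest allocation order at which compaction has failed
When compaction fails, the following function is called: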
/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6

void defer_compaction(struct zone *zone, int order)
{
	zone->compact_considered = 0;
	zone->compact_defer_shift++;

	if (order < zone->compact_order_failed)
		zone->compact_order_failed = order;

	if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
		zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}
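- compact_considered is reset to 0 and compact_defer_shift is incremented by 1; if compact_defer_shift becomes greater than 6, it is capped at 6.
- This means at most 1 << 6 = 64 attempts are skipped. Once that limit is reached, further failures do not lengthen the deferral.
- If the requested order is lower than compact_order_failed, compact_order_failed is set to order.
When compaction succeeds, compaction_defer_reset() is called: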
void compaction_defer_reset(struct zone *zone, int order, bool alloc_success)
{
	if (alloc_success) {
		zone->compact_considered = 0;
		zone->compact_defer_shift = 0;
	}
	if (order >= zone->compact_order_failed)
		zone->compact_order_failed = order + 1;
}
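- After a successful allocation, compact_considered and compact_defer_shift are reset to 0.
- In addition, if order is greater than or equal to compact_order_failed, compact_order_failed is set to order + 1.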
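The deferral counters are easier to follow with a small user-space model. model_defer_compaction() mirrors defer_compaction() quoted above. model_compaction_deferred() is an assumption: it models the check that consumes compact_considered, which this article does not show, following my reading of the kernel's compaction_deferred(). struct zone_model and model_count_skipped() are hypothetical names.

```c
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Only the three deferral fields of struct zone are modelled here. */
struct zone_model {
	unsigned int compact_considered;
	unsigned int compact_defer_shift;
	int compact_order_failed;
};

/* Mirrors defer_compaction(): called after a failed compaction. */
static void model_defer_compaction(struct zone_model *z, int order)
{
	z->compact_considered = 0;
	z->compact_defer_shift++;
	if (order < z->compact_order_failed)
		z->compact_order_failed = order;
	if (z->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
		z->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Assumed counterpart: returns true while the attempt should be skipped. */
static bool model_compaction_deferred(struct zone_model *z, int order)
{
	unsigned int defer_limit = 1U << z->compact_defer_shift;

	if (order < z->compact_order_failed)
		return false;	/* orders below the failed one are never deferred */
	if (++z->compact_considered >= defer_limit) {
		z->compact_considered = defer_limit;	/* avoid overflow */
		return false;
	}
	return true;
}

/* Count how many attempts are skipped before compaction is tried again. */
static unsigned int model_count_skipped(struct zone_model *z, int order)
{
	unsigned int skipped = 0;

	while (model_compaction_deferred(z, order))
		skipped++;
	return skipped;
}
```

In this model, one failure leads to a single skipped attempt, and once compact_defer_shift has reached its cap of 6 the next 63 attempts are skipped before compaction is tried again.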
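How to decide whether the current compaction run has finished
- When the migration scanner and the free scanner meet, the current run has finished.
- If the two scanners have not met, but a large enough free page can already be taken from the zone's free list for the requested migrate type, or from a fallback migrate type, the current run is also considered finished.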
static enum compact_result __compact_finished(struct compact_control *cc)
{
	/* ... declarations of order and migratetype are not shown in this excerpt ... */

	if (compact_scanners_met(cc)) {		/* the two scanners have met */
		/* Let the next compaction start anew. */
		reset_cached_positions(cc->zone);

		/*
		 * Mark that the PG_migrate_skip information should be cleared
		 * by kswapd when it goes to sleep. kcompactd does not set the
		 * flag itself as the decision to be clear should be directly
		 * based on an allocation request.
		 */
		if (cc->direct_compaction)
			cc->zone->compact_blockskip_flush = true;

		if (cc->whole_zone)
			return COMPACT_COMPLETE;
		else
			return COMPACT_PARTIAL_SKIPPED;
	}

	for (order = cc->order; order < MAX_ORDER; order++) {
		struct free_area *area = &cc->zone->free_area[order];
		bool can_steal;

		/* Job done if page is free of the right migratetype */
		/* A free block is available: return COMPACT_SUCCESS */
		if (!list_empty(&area->free_list[migratetype]))
			return COMPACT_SUCCESS;

#ifdef CONFIG_CMA
		/* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
		if (migratetype == MIGRATE_MOVABLE &&
			!list_empty(&area->free_list[MIGRATE_CMA]))
			return COMPACT_SUCCESS;
#endif
		/* A block can be taken from a fallback migrate type */
		if (find_suitable_fallback(area, order, migratetype,
					   true, &can_steal, cc->order) != -1) {
			/* movable pages are OK in any pageblock */
			if (migratetype == MIGRATE_MOVABLE)
				return COMPACT_SUCCESS;
			/* ... rest of this branch is not shown in this excerpt ... */
		}
	}
	/* ... rest of the function is not shown in this excerpt ... */
}
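The two termination conditions can be modelled in user space roughly as follows. This is a sketch under stated assumptions, not the kernel function: all model_* and MODEL_* names are hypothetical, migrate types and fallback stealing are ignored, the scanners are assumed to be compared at pageblock granularity with a pageblock order of 9, and returning MODEL_CONTINUE when neither condition holds is my assumption about the part of the function that is not shown above.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11		/* assumed, a common default */
#define MODEL_PAGEBLOCK_ORDER 9		/* assumed, e.g. 2 MiB blocks of 4 KiB pages */

enum model_result {
	MODEL_CONTINUE,
	MODEL_COMPLETE,
	MODEL_PARTIAL_SKIPPED,
	MODEL_SUCCESS
};

struct model_cc {
	unsigned long migrate_pfn;	/* migration scanner position */
	unsigned long free_pfn;		/* free scanner position */
	int order;			/* order of the allocation being served */
	bool whole_zone;		/* did this run cover the whole zone? */
};

/* The scanners have met once they are in the same pageblock or have crossed. */
static bool model_scanners_met(const struct model_cc *cc)
{
	return (cc->free_pfn >> MODEL_PAGEBLOCK_ORDER) <=
	       (cc->migrate_pfn >> MODEL_PAGEBLOCK_ORDER);
}

/* nr_free[o] is the number of free blocks of order o in the zone. */
static enum model_result model_finished(const struct model_cc *cc,
					const unsigned long nr_free[MODEL_MAX_ORDER])
{
	if (model_scanners_met(cc))
		return cc->whole_zone ? MODEL_COMPLETE : MODEL_PARTIAL_SKIPPED;

	/* Any free block of the requested order or larger ends the run. */
	for (int order = cc->order; order < MODEL_MAX_ORDER; order++)
		if (nr_free[order] > 0)
			return MODEL_SUCCESS;

	return MODEL_CONTINUE;
}
```

For an order-4 request, a single free order-5 block is enough for the model to report success, while a free order-3 block is not.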