Page compaction code analysis, part 1

Posted Loopers


Important data structures

/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogically to reclaim priority.
 */
enum compact_priority {
    COMPACT_PRIO_SYNC_FULL,
    MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
    COMPACT_PRIO_SYNC_LIGHT,
    MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    COMPACT_PRIO_ASYNC,
    INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

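This enum defines the priority of compaction.

  • COMPACT_PRIO_SYNC_FULL: fully synchronous mode. Blocking is allowed, and dirty pages may be written back to storage, waiting until the write completes.
  • COMPACT_PRIO_SYNC_LIGHT: light synchronous mode. Most blocking is allowed, but dirty pages are not written back to storage, because that wait can be long.
  • COMPACT_PRIO_ASYNC: asynchronous mode. Blocking is not allowed.
  • Priority order (a lower enum value means a higher priority): COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC
  • Cost of compaction: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC
  • Fully synchronous compaction has the highest success rate.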
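
To make the ordering concrete, here is a minimal user-space sketch of how a caller could raise the priority after a failed attempt. It is loosely modeled on the retry logic in the page allocator as far as I understand it. `escalate_priority` is a hypothetical helper, not a kernel function, and the value 3 for `PAGE_ALLOC_COSTLY_ORDER` is an assumption (the usual upstream default).

```c
#include <stdbool.h>

/* Same enumerators as the kernel enum discussed in this section. */
enum compact_priority {
    COMPACT_PRIO_SYNC_FULL,
    MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
    COMPACT_PRIO_SYNC_LIGHT,
    MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    COMPACT_PRIO_ASYNC,
    INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

/* Assumed value: orders above this count as "costly". */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: move one step toward a higher priority
 * (a numerically lower value). Returns false once the floor for
 * this order has been reached, so the caller should stop retrying.
 */
static bool escalate_priority(enum compact_priority *prio, unsigned int order)
{
    enum compact_priority min_prio = (order > PAGE_ALLOC_COSTLY_ORDER) ?
        MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;

    if (*prio > min_prio) {
        *prio = (enum compact_priority)(*prio - 1);
        return true;
    }
    return false;
}
```

Starting from INIT_COMPACT_PRIORITY, a non-costly request can escalate twice (down to COMPACT_PRIO_SYNC_FULL), while a costly request stops at COMPACT_PRIO_SYNC_LIGHT in this sketch.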

 

Next, look at the states that report whether compaction succeeded.

/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
    /* For more detailed tracepoint output - internal to compaction */
    COMPACT_NOT_SUITABLE_ZONE,
    /*
     * compaction didn't start as it was not possible or direct reclaim
     * was more suitable
     */
    COMPACT_SKIPPED,
    /* compaction didn't start as it was deferred due to past failures */
    COMPACT_DEFERRED,

    /* compaction not active last round */
    COMPACT_INACTIVE = COMPACT_DEFERRED,

    /* For more detailed tracepoint output - internal to compaction */
    COMPACT_NO_SUITABLE_PAGE,
    /* compaction should continue to another pageblock */
    COMPACT_CONTINUE,

    /*
     * The full zone was compacted scanned but wasn't successful to compact
     * suitable pages.
     */
    COMPACT_COMPLETE,
    /*
     * direct compaction has scanned part of the zone but wasn't successful
     * to compact suitable pages.
     */
    COMPACT_PARTIAL_SKIPPED,

    /* compaction terminated prematurely due to lock contentions */
    COMPACT_CONTENDED,

    /*
     * direct compaction terminated after concluding that the allocation
     * should now succeed
     */
    COMPACT_SUCCESS,
};
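  • COMPACT_SKIPPED: this zone is skipped, most likely because it is not suitable for compaction
  • COMPACT_DEFERRED: compaction did not start on this zone because it failed there recently
  • COMPACT_CONTINUE: keep going with page compaction
  • COMPACT_COMPLETE: the whole zone has been scanned, but no suitable page was produced
  • COMPACT_PARTIAL_SKIPPED: part of the zone has been scanned, but no suitable page was found
  • COMPACT_SUCCESS: compaction succeeded and produced a free page large enough for the allocation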
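
As a rough illustration of how a caller could group these states, the sketch below defines three small predicates. All three helper names are hypothetical and are not taken from the kernel; the enum itself repeats the values listed in this section.

```c
#include <stdbool.h>

/* Same enumerators, in the same order, as the kernel enum in this section. */
enum compact_result {
    COMPACT_NOT_SUITABLE_ZONE,
    COMPACT_SKIPPED,
    COMPACT_DEFERRED,
    COMPACT_INACTIVE = COMPACT_DEFERRED,
    COMPACT_NO_SUITABLE_PAGE,
    COMPACT_CONTINUE,
    COMPACT_COMPLETE,
    COMPACT_PARTIAL_SKIPPED,
    COMPACT_CONTENDED,
    COMPACT_SUCCESS,
};

/* Hypothetical: compaction never ran on the zone at all. */
static bool result_not_started(enum compact_result r)
{
    return r == COMPACT_SKIPPED || r == COMPACT_DEFERRED;
}

/* Hypothetical: the zone was scanned (fully or partly) without success. */
static bool result_scanned_in_vain(enum compact_result r)
{
    return r == COMPACT_COMPLETE || r == COMPACT_PARTIAL_SKIPPED;
}

/* Hypothetical: the allocation is now expected to succeed. */
static bool result_succeeded(enum compact_result r)
{
    return r == COMPACT_SUCCESS;
}
```

Note that COMPACT_INACTIVE is only an alias for COMPACT_DEFERRED, so it falls into the "not started" group as well.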

 

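Fragmentation index

When a memory allocation fails, there are two possible causes:

  • there is not enough memory
  • memory is too fragmented

To tell which of the two caused the failure, the kernel uses the fragmentation index, whose value lies in the range [0, 1000]:

  • an index close to 0 means the allocation failed because memory is short
  • an index close to 1000 means the allocation failed because memory is too fragmented

The kernel also provides a tunable threshold that the fragmentation index is compared against: int sysctl_extfrag_threshold = 500; the default value is 500.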

root:/ # cat /proc/sys/vm/extfrag_threshold
500

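The default is 500. If the threshold is set too high, almost every allocation failure is attributed to a lack of memory and compaction is skipped. If it is set too low, page compaction runs too often and the system load goes up.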
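
For a feel of how the index behaves, the following user-space sketch computes it from three numbers. It is modeled on my recollection of `__fragmentation_index()` in mm/vmstat.c, so treat the exact formula as an assumption that may differ between kernel versions. `struct free_info` and `frag_index` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical summary of a zone's free lists for one request order. */
struct free_info {
    uint64_t free_pages;           /* total number of free pages */
    uint64_t free_blocks_total;    /* number of free blocks of any order */
    uint64_t free_blocks_suitable; /* free blocks of at least the requested order */
};

static int frag_index(unsigned int order, const struct free_info *info)
{
    uint64_t requested = (uint64_t)1 << order;

    /* No free memory at all: report "lack of memory". */
    if (info->free_blocks_total == 0)
        return 0;

    /* A suitable block exists, so the request would not fail. */
    if (info->free_blocks_suitable != 0)
        return -1000;

    /* Scaled to three decimal places: toward 0 = low memory, toward 1000 = fragmentation. */
    return 1000 - (int)((1000 + info->free_pages * 1000 / requested)
                        / info->free_blocks_total);
}
```

For example, 1024 free pages that exist only as 1024 separate order-0 blocks give an index of 937 for an order-4 request, well above the default threshold of 500, which points to fragmentation rather than a shortage of memory.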

 

Deciding whether a zone is suitable for page compaction

enum compact_result compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx)
{
    enum compact_result ret;
    int fragindex;

    ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
                    zone_page_state(zone, NR_FREE_PAGES));
    /*
     * fragmentation index determines if allocation failures are due to
     * low memory or external fragmentation
     *
     * index of -1000 would imply allocations might succeed depending on
     * watermarks, but we already failed the high-order watermark check
     * index towards 0 implies failure is due to lack of memory
     * index towards 1000 implies failure is due to fragmentation
     *
     * Only compact if a failure would be due to fragmentation. Also
     * ignore fragindex for non-costly orders where the alternative to
     * a successful reclaim/compaction is OOM. Fragindex and the
     * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
     * excessive compaction for costly orders, but it should not be at the
     * expense of system stability.
     */
    if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            ret = COMPACT_NOT_SUITABLE_ZONE;
    }

    trace_mm_compaction_suitable(zone, order, ret);
    if (ret == COMPACT_NOT_SUITABLE_ZONE)
        ret = COMPACT_SKIPPED;

    return ret;
}
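  • __compaction_suitable does the main check of whether this zone is suitable for page compaction.
  • If it returns COMPACT_CONTINUE and the order is costly (order > PAGE_ALLOC_COSTLY_ORDER), the fragmentation index is computed. If the index lies in [0, sysctl_extfrag_threshold], which is [0, 500] by default, the zone is not suitable for page compaction.
  • In that case COMPACT_NOT_SUITABLE_ZONE is converted to COMPACT_SKIPPED before returning, so the zone is skipped.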
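
The decision described in these points can be modeled as a pure function, which makes the edge cases easy to try out. This is only a sketch: `suitable_verdict` is a hypothetical name, the enum is reduced to the three values the model needs, and `PAGE_ALLOC_COSTLY_ORDER` is assumed to be 3.

```c
/* Reduced, illustrative enum: only the values this model needs. */
enum compact_result {
    COMPACT_SKIPPED,
    COMPACT_CONTINUE,
    COMPACT_SUCCESS,
};

/* Assumed value: orders above this count as "costly". */
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical model of the final step of compaction_suitable():
 * 'ret' is what __compaction_suitable() returned, 'fragindex' the
 * fragmentation index, 'threshold' the extfrag threshold.
 */
static enum compact_result suitable_verdict(enum compact_result ret, int order,
                                            int fragindex, int threshold)
{
    if (ret == COMPACT_CONTINUE && order > PAGE_ALLOC_COSTLY_ORDER) {
        /* Failure would be due to low memory, not fragmentation. */
        if (fragindex >= 0 && fragindex <= threshold)
            return COMPACT_SKIPPED;
    }
    return ret;
}
```

With the default threshold of 500, an order-4 request with index 300 is skipped, the same request with index 937 continues, and an order-2 request continues regardless of the index because the check is ignored for non-costly orders.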
static enum compact_result __compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx,
                    unsigned long wmark_target)
{
    unsigned long watermark;

    if (is_via_compact_memory(order))
        return COMPACT_CONTINUE;

    watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
    /*
     * If watermarks for high-order allocation are already met, there
     * should be no need for compaction at all.
     */
    if (zone_watermark_ok(zone, order, watermark, classzone_idx,
                                alloc_flags))
        return COMPACT_SUCCESS;

    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);
    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                        ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;

    return COMPACT_CONTINUE;
}
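  • If compaction was triggered by writing to /proc/sys/vm/compact_memory, order is -1 and the function returns COMPACT_CONTINUE right away.
  • Otherwise, it first reads the zone watermark selected by alloc_flags.
  • If zone_watermark_ok reports that the zone already has enough free memory for this order (roughly: free pages - requested pages >= watermark), it returns COMPACT_SUCCESS, meaning the zone needs no compaction.
  • Depending on whether the order is costly, the low watermark (costly order) or the min watermark (otherwise) is taken, and compact_gap(order) is added to it. compact_gap(order) is twice the size of the request, that is 2 << order pages.
  • If the order-0 check of wmark_target against watermark + compact_gap(order) fails, there are too few free pages to work with and the result is COMPACT_SKIPPED. Otherwise the zone can be compacted and the result is COMPACT_CONTINUE.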
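
A small worked example of the last two steps, as a user-space sketch. `compact_gap` follows the kernel definition as I remember it (`2UL << order`). `enough_free_for_compaction` is a hypothetical simplification of the order-0 watermark check that ignores lowmem reserves and CMA accounting, and `PAGE_ALLOC_COSTLY_ORDER` is assumed to be 3.

```c
#include <stdbool.h>

/* Assumed value: orders above this count as "costly". */
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Extra headroom compaction wants: twice the size of the request, in pages. */
static unsigned long compact_gap(unsigned int order)
{
    return 2UL << order;
}

/*
 * Hypothetical, simplified version of the second check in
 * __compaction_suitable(): are there enough order-0 free pages
 * above the chosen watermark plus the compaction gap?
 */
static bool enough_free_for_compaction(unsigned long free_pages,
                                       unsigned int order,
                                       unsigned long min_wmark,
                                       unsigned long low_wmark)
{
    unsigned long watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
        low_wmark : min_wmark;

    watermark += compact_gap(order);
    return free_pages > watermark;
}
```

With min = 128 and low = 256 pages, an order-4 request needs more than 256 + 32 = 288 free pages, while an order-2 request only needs more than 128 + 8 = 136.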

 

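Deferring compaction

Why does compaction need to be deferred?

If the previous compaction attempt failed and the next attempt follows shortly after it, that attempt will most likely fail as well and only add load to the system. So the following attempts are deferred. struct zone defines several fields for this purpose: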

struct zone {
    /* ... other fields omitted ... */
#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int        compact_considered;
    unsigned int        compact_defer_shift;
    int         compact_order_failed;
#endif
    /* ... other fields omitted ... */
};
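  • compact_considered: the number of compaction attempts considered (and skipped) since the last failure
  • compact_defer_shift: the base-2 logarithm of the number of attempts to skip, that is, 1 << compact_defer_shift attempts are skipped before trying again
  • compact_order_failed: the allocation order at which compaction failed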

 

When compaction fails, the following function is called:

/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
 
 
void defer_compaction(struct zone *zone, int order)
{
    zone->compact_considered = 0;
    zone->compact_defer_shift++;

    if (order < zone->compact_order_failed)
        zone->compact_order_failed = order;

    if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}
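  • compact_considered is reset to 0 and compact_defer_shift is incremented by 1; if it becomes larger than 6, it is capped at 6.
  • So at most 1 << 6 = 64 attempts are skipped; the deferral window does not grow beyond that.
  • If the requested order is lower than compact_order_failed, compact_order_failed is set to order.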
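
To see the back-off in action, here is a user-space model of the deferral state. `defer_compaction` mirrors the function in this section. `compaction_deferred` is written from my recollection of the kernel function of the same name, so its details are an assumption. `struct zone_model` is a hypothetical stand-in for the three fields of struct zone.

```c
#include <stdbool.h>

/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6

/* Hypothetical stand-in for the deferral fields of struct zone. */
struct zone_model {
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
};

/* Called after a failed compaction: widen the deferral window. */
static void defer_compaction(struct zone_model *zone, int order)
{
    zone->compact_considered = 0;
    zone->compact_defer_shift++;

    if (order < zone->compact_order_failed)
        zone->compact_order_failed = order;

    if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Returns true if this compaction attempt should be skipped. */
static bool compaction_deferred(struct zone_model *zone, int order)
{
    unsigned int defer_limit = 1U << zone->compact_defer_shift;

    /* Orders below the failed order are never deferred. */
    if (order < zone->compact_order_failed)
        return false;

    /* Count this attempt; once the limit is reached, try again. */
    if (++zone->compact_considered >= defer_limit) {
        zone->compact_considered = defer_limit;
        return false;
    }
    return true;
}
```

After one failure the window is 1 << 1 = 2, so in this model the first following attempt at the failed order is skipped and the second one is allowed to run. Repeated failures double the window until it reaches 64.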

When compaction succeeds, the compaction_defer_reset function is called:

void compaction_defer_reset(struct zone *zone, int order, bool alloc_success)
{
    if (alloc_success) {
        zone->compact_considered = 0;
        zone->compact_defer_shift = 0;
    }
    if (order >= zone->compact_order_failed)
        zone->compact_order_failed = order + 1;
}
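  • When the allocation succeeded (alloc_success is true), compact_considered and compact_defer_shift are both reset to 0.
  • In addition, if order is greater than or equal to compact_order_failed, compact_order_failed is set to order + 1.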
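
The effect of the reset is easiest to see on concrete values. The sketch below repeats the function on a hypothetical `struct zone_model` (a stand-in for the three fields of struct zone). `order_may_be_deferred` is a hypothetical helper that assumes deferral only applies to orders at or above compact_order_failed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the deferral fields of struct zone. */
struct zone_model {
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
};

/* Same logic as compaction_defer_reset() in this section. */
static void compaction_defer_reset(struct zone_model *zone, int order,
                                   bool alloc_success)
{
    if (alloc_success) {
        zone->compact_considered = 0;
        zone->compact_defer_shift = 0;
    }
    if (order >= zone->compact_order_failed)
        zone->compact_order_failed = order + 1;
}

/* Hypothetical: only orders >= compact_order_failed can be deferred. */
static bool order_may_be_deferred(const struct zone_model *zone, int order)
{
    return order >= zone->compact_order_failed;
}
```

For instance, after a successful order-3 allocation on a zone whose compact_order_failed was 2, the field becomes 4, so in this model requests up to order 3 are no longer candidates for deferral while order 4 and above still are.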

 

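How to decide whether a compaction run has finished

  • When the migration scanner and the free scanner meet, the run is considered finished.
  • When the two scanners have not met, but a large enough free page can be taken from the zone's free list for the requested migrate type, or from a fallback migrate type, the run is also considered finished.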
static enum compact_result __compact_finished(struct compact_control *cc)
{
    /* ... local declarations (order, migratetype) omitted in this excerpt ... */

    if (compact_scanners_met(cc)) {    /* the two scanners have met */
        /* Let the next compaction start anew. */
        reset_cached_positions(cc->zone);

        /*
         * Mark that the PG_migrate_skip information should be cleared
         * by kswapd when it goes to sleep. kcompactd does not set the
         * flag itself as the decision to be clear should be directly
         * based on an allocation request.
         */
        if (cc->direct_compaction)
            cc->zone->compact_blockskip_flush = true;

        if (cc->whole_zone)
            return COMPACT_COMPLETE;
        else
            return COMPACT_PARTIAL_SKIPPED;
    }

    for (order = cc->order; order < MAX_ORDER; order++) {
        struct free_area *area = &cc->zone->free_area[order];
        bool can_steal;

        /* Job done if page is free of the right migratetype */
        if (!list_empty(&area->free_list[migratetype]))
            return COMPACT_SUCCESS;    /* a free page is available */

#ifdef CONFIG_CMA
        /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
        if (migratetype == MIGRATE_MOVABLE &&
            !list_empty(&area->free_list[MIGRATE_CMA]))
            return COMPACT_SUCCESS;
#endif

        /* a fallback migrate type can satisfy the request */
        if (find_suitable_fallback(area, order, migratetype,
                        true, &can_steal, cc->order) != -1) {

            /* movable pages are OK in any pageblock */
            if (migratetype == MIGRATE_MOVABLE)
                return COMPACT_SUCCESS;

            /* ... the rest of the function is not shown in this excerpt ... */
        }
    }
}
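
The "scanners met" test itself is small. The sketch below models it in user space, following my recollection of `compact_scanners_met()`: the migration scanner walks up from the start of the zone, the free scanner walks down from the end, and they count as met once they are in the same pageblock or have crossed. `struct scan_positions` is a hypothetical stand-in for the two pfn fields of struct compact_control, and a pageblock order of 9 (512 pages) is an assumption that depends on the architecture and configuration.

```c
#include <stdbool.h>

/* Assumed pageblock order: 512 pages per pageblock. */
#define PAGEBLOCK_ORDER 9

/* Hypothetical stand-in for cc->migrate_pfn and cc->free_pfn. */
struct scan_positions {
    unsigned long migrate_pfn; /* moves upward from the zone start */
    unsigned long free_pfn;    /* moves downward from the zone end */
};

/* True once both scanners are in the same pageblock, or have crossed. */
static bool scanners_met(const struct scan_positions *pos)
{
    return (pos->free_pfn >> PAGEBLOCK_ORDER)
        <= (pos->migrate_pfn >> PAGEBLOCK_ORDER);
}
```

Comparing pageblock numbers rather than raw pfns means the run ends as soon as the two scanners share a pageblock, without requiring the pfns to be exactly equal.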

 
