Page compaction code analysis, part 1

Posted 2022-12-02 Loopers


Important data structures

/*
 * Determines how hard direct compaction should try to succeed.
 * Lower value means higher priority, analogically to reclaim priority.
 */
enum compact_priority {
    COMPACT_PRIO_SYNC_FULL,
    MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
    COMPACT_PRIO_SYNC_LIGHT,
    MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    COMPACT_PRIO_ASYNC,
    INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

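This enum defines the priority of a compaction attempt.

COMPACT_PRIO_SYNC_FULL: fully synchronous mode. Blocking is allowed, and dirty pages may be written back to storage, waiting until the writeback completes.
COMPACT_PRIO_SYNC_LIGHT: light synchronous mode. Most blocking is allowed, but writing dirty pages back to storage is not, because that wait is long.
COMPACT_PRIO_ASYNC: asynchronous mode. Blocking is not allowed.
Priority order: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC (a lower enum value means a higher priority)
Cost of compaction: COMPACT_PRIO_SYNC_FULL > COMPACT_PRIO_SYNC_LIGHT > COMPACT_PRIO_ASYNC
Fully synchronous compaction has the highest success rate.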
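
The ordering is easy to get backwards, because the numerically smallest value is the highest priority. The following is a small userspace sketch, not kernel code: the enum is copied from the listing above, while prio_higher and prio_escalate are hypothetical helpers introduced only to illustrate how a caller could compare priorities or step from the initial asynchronous priority towards a stricter one.

```c
#include <stdbool.h>

/* Userspace copy of the enum shown above, for illustration only. */
enum compact_priority {
    COMPACT_PRIO_SYNC_FULL,
    MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
    COMPACT_PRIO_SYNC_LIGHT,
    MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
    COMPACT_PRIO_ASYNC,
    INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};

/* Hypothetical helper: true if a is a higher priority than b.
 * A lower numeric value means a higher priority. */
static bool prio_higher(enum compact_priority a, enum compact_priority b)
{
    return a < b;
}

/* Hypothetical helper: move *prio one step towards higher priority,
 * but never past floor. Returns false once the floor is reached. */
static bool prio_escalate(enum compact_priority *prio,
                          enum compact_priority floor)
{
    if (*prio <= floor)
        return false;
    *prio = (enum compact_priority)(*prio - 1);
    return true;
}
```

Starting from INIT_COMPACT_PRIORITY, two calls to prio_escalate with MIN_COMPACT_PRIORITY as the floor reach COMPACT_PRIO_SYNC_FULL; with MIN_COMPACT_COSTLY_PRIORITY as the floor, escalation stops at COMPACT_PRIO_SYNC_LIGHT.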

Next, the result codes that report whether compaction succeeded:

/* Return values for compact_zone() and try_to_compact_pages() */
/* When adding new states, please adjust include/trace/events/compaction.h */
enum compact_result {
    /* For more detailed tracepoint output - internal to compaction */
    COMPACT_NOT_SUITABLE_ZONE,
    /*
     * compaction didn't start as it was not possible or direct reclaim
     * was more suitable
     */
    COMPACT_SKIPPED,
    /* compaction didn't start as it was deferred due to past failures */
    COMPACT_DEFERRED,

    /* compaction not active last round */
    COMPACT_INACTIVE = COMPACT_DEFERRED,

    /* For more detailed tracepoint output - internal to compaction */
    COMPACT_NO_SUITABLE_PAGE,
    /* compaction should continue to another pageblock */
    COMPACT_CONTINUE,

    /*
     * The full zone was compacted scanned but wasn't successful to compact
     * suitable pages.
     */
    COMPACT_COMPLETE,
    /*
     * direct compaction has scanned part of the zone but wasn't successful
     * to compact suitable pages.
     */
    COMPACT_PARTIAL_SKIPPED,

    /* compaction terminated prematurely due to lock contentions */
    COMPACT_CONTENDED,

    /*
     * direct compaction terminated after concluding that the allocation
     * should now succeed
     */
    COMPACT_SUCCESS,
};

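COMPACT_SKIPPED: compaction of this zone was skipped, because it was not possible or the zone is not suitable
COMPACT_DEFERRED: compaction of this zone did not start, because it failed recently
COMPACT_CONTINUE: keep trying to compact
COMPACT_COMPLETE: the whole zone was scanned, but no suitable page was produced
COMPACT_PARTIAL_SKIPPED: part of the zone was scanned, but no suitable page was produced
COMPACT_SUCCESS: compaction succeeded and a free page of the required size is available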

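Fragmentation index

A memory allocation can fail for two reasons:

Not enough free memory
Too much memory fragmentation

The fragmentation index exists to tell these two causes apart. Its value lies in the range [0, 1000].

An index close to 0 means the allocation failed because of a lack of memory
An index close to 1000 means the allocation failed because of fragmentation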
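
To make the two ends of the range concrete, here is a userspace sketch of how such an index can be computed. It assumes the formula used by the kernel's __fragmentation_index() in mm/vmstat.c as the author of this sketch understands it; the struct free_info_sketch type and the function name are hypothetical and exist only for this example. The index compares how many free blocks exist with how many would be needed if the free pages were perfectly packed into blocks of the requested size.

```c
#include <assert.h>

/* Hypothetical summary of one zone's free lists. */
struct free_info_sketch {
    unsigned long free_pages;           /* total free pages in the zone */
    unsigned long free_blocks_total;    /* free blocks of any order */
    unsigned long free_blocks_suitable; /* free blocks >= requested order */
};

/*
 * Sketch of the fragmentation index for a request of 2^order pages.
 * Returns -1000 when a suitable block exists (the request would not
 * fail because of fragmentation), and 0 when nothing is free at all.
 */
static int sketch_fragmentation_index(unsigned int order,
                                      const struct free_info_sketch *info)
{
    unsigned long long requested;

    assert(order < 32);
    requested = 1ULL << order;

    if (!info->free_blocks_total)
        return 0;
    if (info->free_blocks_suitable)
        return -1000;

    /* Scaled by 1000 so the result fits an integer in [0, 1000]. */
    return 1000 - (int)((1000 + info->free_pages * 1000ULL / requested)
                        / info->free_blocks_total);
}
```

For an order-4 request (16 pages), 16 free pages scattered as 16 separate order-0 blocks give an index of 875 (fragmentation), while 12 free pages held in only 2 blocks give 125 (lack of memory).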

The kernel also provides a tunable threshold that the fragmentation index is compared against: int sysctl_extfrag_threshold = 500; the default value is 500.

root:/ # cat /proc/sys/vm/extfrag_threshold
500

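The default is 500. If the threshold is set too high, nearly every allocation failure is attributed to a lack of memory and compaction is skipped. If it is set too low, page compaction runs too often and system load increases.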

Deciding whether a zone is suitable for page compaction

enum compact_result compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx)
{
    enum compact_result ret;
    int fragindex;

    ret = __compaction_suitable(zone, order, alloc_flags, classzone_idx,
                    zone_page_state(zone, NR_FREE_PAGES));
    /*
     * fragmentation index determines if allocation failures are due to
     * low memory or external fragmentation
     *
     * index of -1000 would imply allocations might succeed depending on
     * watermarks, but we already failed the high-order watermark check
     * index towards 0 implies failure is due to lack of memory
     * index towards 1000 implies failure is due to fragmentation
     *
     * Only compact if a failure would be due to fragmentation. Also
     * ignore fragindex for non-costly orders where the alternative to
     * a successful reclaim/compaction is OOM. Fragindex and the
     * vm.extfrag_threshold sysctl is meant as a heuristic to prevent
     * excessive compaction for costly orders, but it should not be at the
     * expense of system stability.
     */
    if (ret == COMPACT_CONTINUE && (order > PAGE_ALLOC_COSTLY_ORDER)) {
        fragindex = fragmentation_index(zone, order);
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            ret = COMPACT_NOT_SUITABLE_ZONE;
    }

    trace_mm_compaction_suitable(zone, order, ret);
    if (ret == COMPACT_NOT_SUITABLE_ZONE)
        ret = COMPACT_SKIPPED;

    return ret;
}

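__compaction_suitable decides whether the zone is suitable for page compaction.
If it returns COMPACT_CONTINUE and the order is costly (order > PAGE_ALLOC_COSTLY_ORDER), the fragmentation index is computed. If the index lies in [0, sysctl_extfrag_threshold], which is [0, 500] by default, the zone is not suitable for page compaction.
In that case the final result is COMPACT_SKIPPED, meaning the zone is skipped.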
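
The threshold test can be isolated into a few lines. This is a sketch only: skip_for_low_fragmentation is a hypothetical helper that mirrors the fragindex branch of compaction_suitable shown above, with PAGE_ALLOC_COSTLY_ORDER defined locally as 3, the value the kernel uses.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Hypothetical helper: returns true when the zone should be skipped
 * because the failure looks like a lack of memory rather than
 * fragmentation. Non-costly orders are never skipped on this basis.
 */
static bool skip_for_low_fragmentation(int order, int fragindex,
                                       int threshold)
{
    if (order <= PAGE_ALLOC_COSTLY_ORDER)
        return false;

    /* A negative index (-1000) means the request might succeed anyway. */
    return fragindex >= 0 && fragindex <= threshold;
}
```

With the default threshold of 500, an order-4 request with an index of 125 is skipped, while the same request with an index of 875 proceeds to compaction.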

static enum compact_result __compaction_suitable(struct zone *zone, int order,
                    unsigned int alloc_flags,
                    int classzone_idx,
                    unsigned long wmark_target)
{
    unsigned long watermark;

    if (is_via_compact_memory(order))
        return COMPACT_CONTINUE;

    watermark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
    /*
     * If watermarks for high-order allocation are already met, there
     * should be no need for compaction at all.
     */
    if (zone_watermark_ok(zone, order, watermark, classzone_idx,
                                alloc_flags))
        return COMPACT_SUCCESS;

    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                low_wmark_pages(zone) : min_wmark_pages(zone);
    watermark += compact_gap(order);
    if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
                        ALLOC_CMA, wmark_target))
        return COMPACT_SKIPPED;

    return COMPACT_CONTINUE;
}

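If compaction was triggered by writing to /proc/sys/vm/compact_memory, order is -1 and compaction continues unconditionally
Otherwise, the watermark selected by alloc_flags is read for the current zone
If zone_watermark_ok shows that the zone already meets that watermark for the requested order, COMPACT_SUCCESS is returned: the zone does not need compaction
Depending on whether the order is costly, the low watermark (costly order) or the min watermark (otherwise) is chosen
compact_gap(order), which is twice the size of the request in pages (2 << order), is added to that watermark. If the zone's free pages fail this order-0 check, COMPACT_SKIPPED is returned, because there are too few free pages to serve as migration targets; otherwise COMPACT_CONTINUE is returned and the zone can be compacted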
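
The arithmetic of the last two steps can be sketched in userspace. compact_gap below follows the kernel definition (2UL << order); enough_free_for_compaction is a hypothetical, deliberately simplified stand-in for the __zone_watermark_ok call: it ignores lowmem reserves, CMA pages and allocation flags, which the real check takes into account.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

/* Twice the request size in pages: room for the pages being migrated
 * plus the pages that end up forming the new free block. */
static unsigned long compact_gap(unsigned int order)
{
    return 2UL << order;
}

/*
 * Hypothetical simplification of the final order-0 watermark check.
 * Returns true when compaction may continue, false when the zone
 * would be skipped for lack of free pages to migrate into.
 */
static bool enough_free_for_compaction(unsigned long free_pages,
                                       unsigned long min_wmark,
                                       unsigned long low_wmark,
                                       unsigned int order)
{
    unsigned long watermark;

    watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ? low_wmark : min_wmark;
    watermark += compact_gap(order);

    return free_pages > watermark;
}
```

For an order-4 request with a low watermark of 200 pages, the zone needs more than 200 + 32 = 232 free pages before compaction is attempted.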

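Deferring compaction

Why does compaction need to be deferred?

If the last compaction attempt failed and the next one comes shortly afterwards, it is likely to fail again and only adds load to the system. The next attempts are therefore deferred. struct zone defines several fields for this purpose: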

struct zone {
    /* ... other fields omitted ... */
#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int        compact_considered;
    unsigned int        compact_defer_shift;
    int                 compact_order_failed;
#endif
    /* ... other fields omitted ... */
};

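compact_considered: the number of compaction attempts considered (and skipped) since the last failure
compact_defer_shift: the base-2 logarithm of the number of attempts to skip, so 1 << compact_defer_shift attempts are skipped
compact_order_failed: the lowest order at which compaction has failed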

When compaction fails, the following function is called:

/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
 
 
void defer_compaction(struct zone *zone, int order)
{
    zone->compact_considered = 0;
    zone->compact_defer_shift++;

    if (order < zone->compact_order_failed)
        zone->compact_order_failed = order;

    if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

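compact_considered is reset to 0 and compact_defer_shift is incremented by 1; if compact_defer_shift exceeds 6, it is clamped to 6
So at most 64 (1 << 6) attempts are skipped; repeated failures do not push the skip count beyond that.
If the requested order is lower than compact_order_failed, compact_order_failed is set to order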

When compaction succeeds, compaction_defer_reset is called:

void compaction_defer_reset(struct zone *zone, int order, bool alloc_success)
{
    if (alloc_success) {
        zone->compact_considered = 0;
        zone->compact_defer_shift = 0;
    }
    if (order >= zone->compact_order_failed)
        zone->compact_order_failed = order + 1;
}

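When the allocation succeeds, compact_considered and compact_defer_shift are reset to 0
In addition, if order is greater than or equal to compact_order_failed, compact_order_failed is set to order + 1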
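
The interplay of the three fields is easier to follow in a small simulation. The sketch below is userspace code under stated assumptions: struct zone_sketch keeps only the three deferral fields, the defer and reset functions copy the logic shown above, and sketch_compaction_deferred is modeled loosely on the kernel's compaction_deferred() check, which this article does not show, so its exact form here is an assumption. sketch_count_skips is a hypothetical helper for the demonstration.

```c
#include <stdbool.h>

#define COMPACT_MAX_DEFER_SHIFT 6

/* Only the three deferral fields of struct zone. */
struct zone_sketch {
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
};

/* Same logic as defer_compaction() above. */
static void sketch_defer_compaction(struct zone_sketch *zone, int order)
{
    zone->compact_considered = 0;
    zone->compact_defer_shift++;

    if (order < zone->compact_order_failed)
        zone->compact_order_failed = order;

    if (zone->compact_defer_shift > COMPACT_MAX_DEFER_SHIFT)
        zone->compact_defer_shift = COMPACT_MAX_DEFER_SHIFT;
}

/* Same logic as compaction_defer_reset() above. */
static void sketch_defer_reset(struct zone_sketch *zone, int order,
                               bool alloc_success)
{
    if (alloc_success) {
        zone->compact_considered = 0;
        zone->compact_defer_shift = 0;
    }
    if (order >= zone->compact_order_failed)
        zone->compact_order_failed = order + 1;
}

/* Assumption: modeled on compaction_deferred(). Returns true when this
 * attempt should be skipped. Orders below the failed order are never
 * deferred. */
static bool sketch_compaction_deferred(struct zone_sketch *zone, int order)
{
    unsigned int defer_limit = 1U << zone->compact_defer_shift;

    if (order < zone->compact_order_failed)
        return false;

    if (++zone->compact_considered >= defer_limit) {
        zone->compact_considered = defer_limit;
        return false;
    }
    return true;
}

/* Hypothetical helper: counts skipped attempts until one is allowed. */
static unsigned int sketch_count_skips(struct zone_sketch *zone, int order)
{
    unsigned int skipped = 0;

    while (sketch_compaction_deferred(zone, order))
        skipped++;
    return skipped;
}
```

In this model, one failure leads to a single skipped attempt, and once compact_defer_shift has reached its cap of 6, 63 attempts are skipped before the next real try. A successful allocation resets the counters, so the next attempt runs immediately.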

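How to decide whether the current compaction run has finished

When the migration scanner and the free scanner meet, the current run is considered finished
When the two scanners have not met, but a sufficiently large free page can be taken from the zone's free list for the requested migrate type, or from a fallback migrate type, the current run is also considered finished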
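
The first condition can be sketched as follows. The migration scanner walks upwards from the start of the zone and the free scanner walks downwards from the end, so they have met once the free scanner's pageblock is at or below the migration scanner's pageblock. This is a userspace sketch: struct scan_sketch is a hypothetical stand-in for the two scanner positions kept in struct compact_control, and PAGEBLOCK_ORDER is assumed to be 9 (512 pages per pageblock), which depends on the architecture and configuration.

```c
#include <stdbool.h>

/* Assumption: 512 pages per pageblock; the real value is
 * configuration dependent. */
#define PAGEBLOCK_ORDER 9

/* Hypothetical stand-in for the scanner positions. */
struct scan_sketch {
    unsigned long migrate_pfn; /* migration scanner, moves upwards */
    unsigned long free_pfn;    /* free scanner, moves downwards */
};

/* The scanners have met when they are in the same pageblock or have
 * crossed each other. */
static bool sketch_scanners_met(const struct scan_sketch *cc)
{
    return (cc->free_pfn >> PAGEBLOCK_ORDER)
        <= (cc->migrate_pfn >> PAGEBLOCK_ORDER);
}
```

Comparing pageblock numbers rather than raw page frame numbers means the run ends as soon as both scanners are working inside the same pageblock.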

static enum compact_result __compact_finished(struct compact_control *cc)
{
    /* declarations were missing from the original excerpt */
    unsigned int order;
    const int migratetype = cc->migratetype;

    if (compact_scanners_met(cc)) {    /* the two scanners have met */
        /* Let the next compaction start anew. */
        reset_cached_positions(cc->zone);

        /*
         * Mark that the PG_migrate_skip information should be cleared
         * by kswapd when it goes to sleep. kcompactd does not set the
         * flag itself as the decision to be clear should be directly
         * based on an allocation request.
         */
        if (cc->direct_compaction)
            cc->zone->compact_blockskip_flush = true;

        if (cc->whole_zone)
            return COMPACT_COMPLETE;
        else
            return COMPACT_PARTIAL_SKIPPED;
    }

    for (order = cc->order; order < MAX_ORDER; order++) {
        struct free_area *area = &cc->zone->free_area[order];
        bool can_steal;

        /* Job done if page is free of the right migratetype */
        /* a free page is available: return COMPACT_SUCCESS */
        if (!list_empty(&area->free_list[migratetype]))
            return COMPACT_SUCCESS;

#ifdef CONFIG_CMA
        /* MIGRATE_MOVABLE can fallback on MIGRATE_CMA */
        if (migratetype == MIGRATE_MOVABLE &&
            !list_empty(&area->free_list[MIGRATE_CMA]))
            return COMPACT_SUCCESS;
#endif

        /* a fallback migrate type can satisfy the request */
        if (find_suitable_fallback(area, order, migratetype,
                        true, &can_steal, cc->order) != -1) {

            /* movable pages are OK in any pageblock */
            if (migratetype == MIGRATE_MOVABLE)
                return COMPACT_SUCCESS;

            /* ... rest of this branch is not included in the excerpt ... */
        }
    }

    /* ... rest of the function is not included in the excerpt ... */
}
