How the Kernel Obtains Memory

Posted 2021-05-17 PP小能手


1. Stage one: obtaining the data from the BIOS

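Everything starts with the BIOS, which probes the hardware and hands the results up to the kernel. The BIOS defines a set of interrupt services for the software above it. For memory on x86, INT 0x15 with EAX = 0xE820 returns the complete memory map, INT 0x15 with AX = 0xE801 returns memory sizes, and INT 0x15 with AH = 0x88 also returns a memory size.

The kernel learns the state of physical memory by calling INT 0x15 with EAX = 0xE820.

Concretely, the kernel issues the interrupt call from the function detect_memory_e820 (arch/x86/boot/memory.c). The function calls the BIOS interrupt service in a loop until the value returned in the ebx register is 0. Two things have to be tracked along the way:

The address of the e820 record. The INT 15 handler copies one e820 record to the memory that es:di points to, so es:di must point at a buffer before the first call. A caller that collects the records in place then advances es:di by the size of one e820 record before each later call; the kernel code below instead leaves es:di on a single temporary buffer and copies each record out of it.
The index of the e820 record. The index is passed in the ebx register. While records remain, the handler returns a non-zero continuation value in ebx (often the previous value plus 1), which must be passed back unchanged on the next call. When nothing is left to read, the handler sets ebx to 0, so the kernel tests ebx against 0 to decide whether all records have been read.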

static int detect_memory_e820(void)
{
    int count = 0;
    struct biosregs ireg, oreg;
    struct boot_e820_entry *desc = boot_params.e820_table;
    static struct boot_e820_entry buf; /* static so it is zeroed */

    initregs(&ireg);
    ireg.ax  = 0xe820;
    ireg.cx  = sizeof buf;
    ireg.edx = SMAP;
    ireg.di  = (size_t)&buf;

    /*
     * Note: at least one BIOS is known which assumes that the
     * buffer pointed to by one e820 call is the same one as
     * the previous call, and only changes modified fields.  Therefore,
     * we use a temporary buffer and copy the results entry by entry.
     *
     * This routine deliberately does not try to account for
     * ACPI 3+ extended attributes.  This is because there are
     * BIOSes in the field which report zero for the valid bit for
     * all ranges, and we don't currently make any use of the
     * other attribute bits.  Revisit this if we see the extended
     * attribute bits deployed in a meaningful way in the future.
     */

    do {
        intcall(0x15, &ireg, &oreg);  /* issue the BIOS INT 0x15 call */
        ireg.ebx = oreg.ebx; /* for next iteration... */

        /* BIOSes which terminate the chain with CF = 1 as opposed
           to %ebx = 0 don't always report the SMAP signature on
           the final, failing, probe. */
        if (oreg.eflags & X86_EFLAGS_CF)
            break;

        /* Some BIOSes stop returning SMAP in the middle of
           the search loop.  We don't know exactly how the BIOS
           screwed up the map at that point, we might have a
           partial map, the full map, or complete garbage, so
           just return failure. */
        if (oreg.eax != SMAP) {
            count = 0;
            break;
        }

        *desc++ = buf; /* copy the record just read into desc */
        count++;
    } while (ireg.ebx && count < ARRAY_SIZE(boot_params.e820_table));

    return boot_params.e820_entries = count; /* number of e820 entries read */
}
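
The loop can be tried out in user space with the BIOS replaced by a stub. The following is a minimal sketch under that assumption: fake_int15_e820, fake_bios_map and read_e820 are hypothetical names invented for illustration, and the stub's "previous value plus 1" continuation is only one possible firmware behaviour. It shows the two points the kernel code relies on: a single bounce buffer that is copied out after every call, and termination when ebx comes back as 0 or the call fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One E820 record: base address, length, type (1 = usable, 2 = reserved). */
struct e820_rec {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

/* Hypothetical firmware-side map standing in for the real BIOS data. */
static const struct e820_rec fake_bios_map[] = {
    { 0x0000000000000000ULL, 0x000000000009FC00ULL, 1 },
    { 0x000000000009FC00ULL, 0x0000000000000400ULL, 2 },
    { 0x0000000000100000ULL, 0x0000000001F00000ULL, 1 },
};
#define FAKE_MAP_LEN (sizeof(fake_bios_map) / sizeof(fake_bios_map[0]))

/*
 * Hypothetical stand-in for "INT 0x15, EAX = 0xE820". Copies the record
 * selected by *ebx into buf and stores the continuation value back into
 * *ebx (0 means this was the last record). Returns 0 on success and -1
 * on failure, which models the carry flag being set.
 */
static int fake_int15_e820(uint32_t *ebx, struct e820_rec *buf)
{
    if (*ebx >= FAKE_MAP_LEN)
        return -1;
    *buf = fake_bios_map[*ebx];
    *ebx = (*ebx + 1 < FAKE_MAP_LEN) ? *ebx + 1 : 0;
    return 0;
}

/* Read records into out[] until ebx returns to 0 or out[] is full. */
size_t read_e820(struct e820_rec *out, size_t max)
{
    struct e820_rec buf;   /* single bounce buffer, the role es:di plays */
    uint32_t ebx = 0;      /* 0 asks for the first record */
    size_t count = 0;

    assert(out != NULL);
    if (max == 0)
        return 0;

    do {
        if (fake_int15_e820(&ebx, &buf) != 0)
            break;             /* "carry flag set": stop reading */
        out[count++] = buf;    /* copy the record out of the bounce buffer */
    } while (ebx != 0 && count < max);

    return count;
}
```

Calling read_e820(out, 8) on an eight-element array fills the first three elements from the stub map; passing a smaller max truncates the result, in the same way the kernel loop stops at ARRAY_SIZE(boot_params.e820_table).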

A typical output of INT 15h, EAX = E820 looks like this [1]:

Base Address       | Length             | Type
0x0000000000000000 | 0x000000000009FC00 | Free Memory (1)
0x000000000009FC00 | 0x0000000000000400 | Reserved Memory (2)
0x00000000000E8000 | 0x0000000000018000 | Reserved Memory (2)
0x0000000000100000 | 0x0000000001F00000 | Free Memory (1)
0x00000000FFFC0000 | 0x0000000000040000 | Reserved Memory (2)
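
To make the table concrete, the sketch below adds up the length of every type 1 (free) entry. It is illustrative only: struct e820_row, sample_map and usable_bytes are hypothetical names, and the rows are copied from the table above rather than read from real firmware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Type codes as printed in the table: 1 = free, 2 = reserved. */
enum { SK_E820_FREE = 1, SK_E820_RESERVED = 2 };

struct e820_row {
    uint64_t base;
    uint64_t length;
    uint32_t type;
};

/* The five rows of the sample output. */
const struct e820_row sample_map[] = {
    { 0x0000000000000000ULL, 0x000000000009FC00ULL, SK_E820_FREE },
    { 0x000000000009FC00ULL, 0x0000000000000400ULL, SK_E820_RESERVED },
    { 0x00000000000E8000ULL, 0x0000000000018000ULL, SK_E820_RESERVED },
    { 0x0000000000100000ULL, 0x0000000001F00000ULL, SK_E820_FREE },
    { 0x00000000FFFC0000ULL, 0x0000000000040000ULL, SK_E820_RESERVED },
};
const size_t sample_map_len = sizeof(sample_map) / sizeof(sample_map[0]);

/* Sum the lengths of all free (type 1) entries. */
uint64_t usable_bytes(const struct e820_row *map, size_t n)
{
    uint64_t total = 0;

    assert(map != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (map[i].type == SK_E820_FREE)
            total += map[i].length;
    }
    return total;
}
```

For the sample rows the sum is 0x9FC00 + 0x1F00000 = 0x1F9FC00 bytes. The second free range ends at 0x100000 + 0x1F00000 = 0x2000000, that is, at the 32 MiB mark.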

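The final result is stored in boot_params.e820_table.

In this first stage, during early boot, the kernel obtains the raw memory data from the BIOS. The kernel then transforms that data step by step, mainly through three data structures:

e820_table_firmware: the original firmware version, passed to the kernel by the bootloader.

e820_table_kexec: a version lightly modified by the kernel. The kernel marks the setup_data list as reserved so that kexec can reuse the setup_data information. kexec can also modify this table to fake an mptable.

e820_table: the main table, managed by the low-level x86 code, which is eventually passed to the upper MM layer. Once the information has been passed to the memory management layer, the e820 map data is no longer relevant; its main purpose is to serve as temporary storage for the firmware-specific memory layout during early boot.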

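2. Stage two: copying the data into e820_table

The next stage therefore converts the physical memory information in boot_params.e820_table into e820_table.

This step is fairly simple. During platform initialization the kernel calls e820__memory_setup_default, which eventually calls __e820__range_add to fill the entries of the global e820_table with the values of the boot_params.e820_table entries.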

/*
 * Add a memory region to the kernel E820 map.
 */
static void __init __e820__range_add(struct e820_table *table, u64 start, u64 size, enum e820_type type)
{
    int x = table->nr_entries;

    if (x >= ARRAY_SIZE(table->entries)) {
        pr_err("too many entries; ignoring [mem %#010llx-%#010llx]\n",
               start, start + size - 1);
        return;
    }

    table->entries[x].addr = start;
    table->entries[x].size = size;
    table->entries[x].type = type;
    table->nr_entries++;
}
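
The copy from the boot-time entries into the table can be imitated in user space. This is a simplified sketch, not the kernel's code path: sk_range_add mirrors __e820__range_add above, while sk_copy_boot_entries, the sk_ types and the deliberately tiny SK_MAX_ENTRIES are hypothetical, and whatever validation the real path performs on the way is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_ENTRIES 4   /* hypothetical, kept tiny so overflow is easy to see */

struct sk_entry {
    uint64_t addr;
    uint64_t size;
    uint32_t type;
};

struct sk_table {
    uint32_t nr_entries;
    struct sk_entry entries[SK_MAX_ENTRIES];
};

/* Append one range; returns 0 on success, -1 when the table is full. */
int sk_range_add(struct sk_table *t, uint64_t start, uint64_t size, uint32_t type)
{
    uint32_t x;

    assert(t != NULL);
    x = t->nr_entries;
    if (x >= SK_MAX_ENTRIES)
        return -1;         /* same "too many entries" case as above */

    t->entries[x].addr = start;
    t->entries[x].size = size;
    t->entries[x].type = type;
    t->nr_entries++;
    return 0;
}

/* Append n boot-time records; returns how many were actually stored. */
size_t sk_copy_boot_entries(struct sk_table *t, const struct sk_entry *boot, size_t n)
{
    size_t stored = 0;

    for (size_t i = 0; i < n; i++) {
        if (sk_range_add(t, boot[i].addr, boot[i].size, boot[i].type) != 0)
            break;
        stored++;
    }
    return stored;
}
```

With five input records and room for four, the sketch stores the first four and drops the rest, which is the behaviour the pr_err branch of __e820__range_add reports.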

3. Stage three: passing e820_table to memblock

Finally, e820_table is handed to the upper MM layer. This is done by the function e820__memblock_setup, which is called from setup_arch.

void __init e820__memblock_setup(void)
{
    int i;
    u64 end;

    /*
     * The bootstrap memblock region count maximum is 128 entries
     * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries
     * than that - so allow memblock resizing.
     *
     * This is safe, because this call happens pretty late during x86 setup,
     * so we know about reserved memory regions already. (This is important
     * so that memblock resizing does no stomp over reserved areas.)
     */
    memblock_allow_resize();

    for (i = 0; i < e820_table->nr_entries; i++) {
        struct e820_entry *entry = &e820_table->entries[i];

        end = entry->addr + entry->size;
        if (end != (resource_size_t)end)
            continue;

        if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
            continue;

        memblock_add(entry->addr, entry->size);
    }

    /* Throw away partial pages: */
    memblock_trim_memory(PAGE_SIZE);

    memblock_dump_all();
}
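
The "throw away partial pages" step can be illustrated on a single range: round the start up and the end down to a page boundary, and drop the range if no whole page is left. This is a sketch of the idea only, not the kernel's memblock_trim_memory; struct sk_range, sk_trim_to_pages and the fixed 4096-byte SK_PAGE_SIZE are assumptions made for illustration, and the sketch ignores address overflow near the top of the 64-bit range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_PAGE_SIZE 4096ULL   /* assumed page size for the sketch */

struct sk_range {
    uint64_t base;
    uint64_t size;
};

/*
 * Shrink *r to whole pages. Returns 1 and rewrites *r if at least one
 * full page remains, or 0 (leaving *r untouched) if nothing is left.
 */
int sk_trim_to_pages(struct sk_range *r)
{
    uint64_t start, end;

    assert(r != NULL);
    /* Round the start up and the end down to a page boundary. */
    start = (r->base + SK_PAGE_SIZE - 1) & ~(SK_PAGE_SIZE - 1);
    end = (r->base + r->size) & ~(SK_PAGE_SIZE - 1);

    if (start >= end)
        return 0;

    r->base = start;
    r->size = end - start;
    return 1;
}
```

Applied to the sample map, the range starting at 0 with length 0x9FC00 shrinks to length 0x9F000, because its last 0xC00 bytes do not fill a page, while the page-aligned range at 0x100000 is left unchanged.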

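The main work is the call to memblock_add, which adds a new memblock region. It calls memblock_add_range to add the memory block to the global memblock.memory, and memblock_add_range in turn mainly calls memblock_insert_region to insert the new memblock region.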

/**
 * memblock_insert_region - insert new memblock region
 * @type:   memblock type to insert into
 * @idx:    index for the insertion point
 * @base:   base address of the new region
 * @size:   size of the new region
 * @nid:    node id of the new region
 * @flags:  flags of the new region
 *
 * Insert new memblock region [@base, @base + @size) into @type at @idx.
 * @type must already have extra room to accommodate the new region.
 */
static void __init_memblock memblock_insert_region(struct memblock_type *type,
                           int idx, phys_addr_t base,
                           phys_addr_t size,
                           int nid,
                           enum memblock_flags flags)
{
    struct memblock_region *rgn = &type->regions[idx];

    BUG_ON(type->cnt >= type->max);
    memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
    rgn->base = base;
    rgn->size = size;
    rgn->flags = flags;
    memblock_set_region_node(rgn, nid);
    type->cnt++;
    type->total_size += size;
}
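
The memmove-based insertion is easy to try in user space. The sketch below keeps an array of regions sorted by base address: sk_insert_region mirrors the function above without the node id and flags, while sk_add_sorted, the sk_ types and the fixed SK_MAX_REGIONS capacity are hypothetical. Unlike a real memblock add, it assumes the new region never overlaps an existing one and does not merge neighbours.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_MAX_REGIONS 8   /* hypothetical fixed capacity */

struct sk_region {
    uint64_t base;
    uint64_t size;
};

struct sk_region_list {
    size_t cnt;             /* number of regions in use */
    uint64_t total_size;    /* sum of all region sizes */
    struct sk_region regions[SK_MAX_REGIONS];
};

/* Open a slot at idx by shifting the tail up one place, then fill it. */
static void sk_insert_region(struct sk_region_list *l, size_t idx,
                             uint64_t base, uint64_t size)
{
    struct sk_region *rgn = &l->regions[idx];

    assert(l->cnt < SK_MAX_REGIONS);
    assert(idx <= l->cnt);
    memmove(rgn + 1, rgn, (l->cnt - idx) * sizeof(*rgn));
    rgn->base = base;
    rgn->size = size;
    l->cnt++;
    l->total_size += size;
}

/* Insert a region at its sorted position; returns -1 if the list is full. */
int sk_add_sorted(struct sk_region_list *l, uint64_t base, uint64_t size)
{
    size_t idx = 0;

    assert(l != NULL);
    if (l->cnt >= SK_MAX_REGIONS)
        return -1;

    /* Find the first region whose base is not below the new base. */
    while (idx < l->cnt && l->regions[idx].base < base)
        idx++;

    sk_insert_region(l, idx, base, size);
    return 0;
}
```

Adding the bases 0x100000, 0x0 and 0x9FC00 in that order leaves the array ordered as 0x0, 0x9FC00, 0x100000, with cnt equal to 3 and total_size equal to the sum of the three sizes.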

Two data structures are involved here, struct memblock_type and struct memblock_region, defined as follows:

/**
 * struct memblock_region - represents a memory region
 * @base: physical address of the region
 * @size: size of the region
 * @flags: memory region attributes
 * @nid: NUMA node id
 */
struct memblock_region {
    phys_addr_t base;
    phys_addr_t size;
    enum memblock_flags flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;
#endif
};

/**
 * struct memblock_type - collection of memory regions of certain type
 * @cnt: number of regions
 * @max: size of the allocated array
 * @total_size: size of all regions
 * @regions: array of regions
 * @name: the memory type symbolic name
 */
struct memblock_type {
    unsigned long cnt;
    unsigned long max;
    phys_addr_t total_size;
    struct memblock_region *regions;
    char *name;
};

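memblock is the memory manager for the boot stage, when the usual kernel memory allocators are not yet up and running. memblock views system memory as collections of contiguous regions, and there are three such collections: memory, reserved, and physmem.

memory: describes the physical memory used by the kernel.

reserved: describes the regions that have been allocated.

physmem: describes the physical memory actually available during boot. physmem is only available on some architectures.

Each region is represented by a struct memblock_region. Each memory type is represented by a struct memblock_type, which holds an array of memory regions.

During boot, the mem_init function releases all the memory to the page allocator. Unless the architecture enables CONFIG_ARCH_KEEP_MEMBLOCK, all memblock data structures except physmem are discarded once system initialization completes.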

References:

[1] https://wiki.osdev.org/Detecting_Memory_(x86)#Getting_an_E820_Memory_Map
[2] https://www.kernel.org/doc/html/latest/core-api/boot-time-mm.html?highlight=memblock#memblock-overview
