go源码阅读——malloc.go

Posted 2023-02-16 Wang-Junchao

tags:

篇首语：本文由小常识网(cha138.com)小编为大家整理，主要介绍了go源码阅读——malloc.go相关的知识，希望对你有一定的参考价值。

【博文目录>>>】【项目地址>>>】

内存分配器

golang内存分配最初是基于tcmalloc的，但是有很大的不同。tcmalloc文章：
参见：http://goog-perftools.sourceforge.net/doc/tcmalloc.html
翻译：https://blog.csdn.net/DERRANTCM/article/details/105342996

主分配器在大量页（runs of pages）中工作。将较小的分配大小（最大为32 kB，包括32 kB）舍入为大约70个大小类别之一，每个类别都有其自己的大小完全相同的空闲对象集。任何空闲的内存页都可以拆分为一个大小类别的对象集，然后使用空闲位图（free bitmap）进行管理。

分配器的数据结构为：

fixalloc：用于固定大小的堆外对象的空闲列表分配器，用于管理分配器使用的存储。
mheap：malloc堆，以页（8192字节）粒度进行管理。
mspan：由mheap管理的一系列使用中的页面。
mcentral：收集给定大小类的所有跨度。
mcache：具有可用空间的mspans的每P个缓存。
mstats：分配统计信息。

分配一个小对象沿用了高速缓存的层次结构：

1、将大小四舍五入为一个较小的类别，然后在此P的mcache中查看相应的mspan。扫描mspan的空闲位图以找到空闲位置（slot）。如果有空闲位置，分配它。这都可以在不获取锁的情况下完成。
2、如果mspan没有可用位置，则从mcentral的具有可用空间的所需size类的mspan列表中获取一个新的mspan。获得整个跨度（span）会摊销锁定mcentral的成本。
3、如果mcentral的mspan列表为空，从mheap获取一系列页以用于mspan。
4、如果mheap为空或没有足够大的页，则从操作系统中分配一组新的页（至少1MB）。分配大量页面将分摊与操作系统进行对话的成本。

清除mspan并释放对象沿用了类似的层次结构：

1、如果响应分配而清除了mspan，则将mspan返还到mcache以满足分配。
2、否则，如果mspan仍有已分配的对象，则将其放在mspan的size类别的mcentral空闲列表上。
3、否则，如果mspan中的所有对象都是空闲的，则mspan的页面将返回到mheap，并且mspan现在已失效。

分配和释放大对象直接使用mheap，而绕过mcache和mcentral。如果mspan.needzero为false，则mspan中的可用对象位置已被清零。否则，如果needzero为true，则在分配对象时将其清零。通过这种方式延迟归零有很多好处：

1、堆栈帧分配可以完全避免置零。
2、它具有更好的时间局部性，因为该程序可能即将写入内存。
3、我们不会将永远不会被重用的页面归零。

虚拟内存布局

堆由一组arena组成，这些arena在64位上为64MB，在32位（heapArenaBytes）上为4MB。每个arena的起始地址也与arena大小对齐。
每个arena都有一个关联的heapArena对象，该对象存储该arena的元数据：arena中所有字（word）的堆位图和arena中所有页的跨度（span）图。它们本身是堆外分配的。
由于arena是对齐的，因此可以将地址空间视为一系列arena帧（frame）。arena映射（mheap_.arenas）从arena帧号映射到*heapArena，对于不由Go堆支持的部分地址空间，映射为nil。arena映射的结构为两层数组，由“L1”arena映射和许多“ L2”arena映射组成；但是，由于arena很大，因此在许多体系结构上，arena映射都由一个大型L2映射组成。
arena地图覆盖了整个可用的地址空间，从而允许Go堆使用地址空间的任何部分。分配器尝试使arena保持连续，以便大跨度（以及大对象）可以跨越arena。

OS内存管理抽象层

在任何给定时间，运行时管理的地址空间区域可能处于四种状态之一：

1）无（None）——未保留和未映射，这是任何区域的默认状态。
2）保留（Reserved）——运行时拥有，但是访问它会导致故障。不计入进程的内存占用。
3）已准备（Prepared）——保留，意在不由物理内存支持（尽管OS可能会延迟实现）。可以有效过渡到就绪。在这样的区域中访问内存是不确定的（可能会出错，可能会返回意外的零等）。
4）就绪（Ready）——可以安全地访问。

这组状态对于支持所有当前受支持的平台而言绝对不是必需的。只需一个“无”，“保留”和“就绪”就可以解决问题。但是，“已准备”状态为我们提供了用于性能目的的灵活性。例如，在POSIX-y操作系统上，“保留”通常是设置了PROT_NONE的私有匿名mmap’d区域，要转换到“就绪”状态，需要设置PROT_READ | PROT_WRITE。但是，Prepared的规格不足使我们仅使用MADV_FREE从Ready过渡到Prepared。因此，在“准备好”状态下，我们可以提早设置一次权限位，我们可以有效地告诉操作系统，当我们严格不需要它们时，可以自由地将页面从我们手中夺走。

对于每个操作系统，都有一组通用的帮助程序，这些帮助程序在这些状态之间转换内存区域。帮助程序如下：

sysAlloc

sysAlloc将OS选择的内存区域从“无”转换为“就绪”。更具体地说，它从操作系统中获取大量的零位内存，通常大约为一百千字节或兆字节。该内存始终可以立即使用。

sysFree

sysFree将内存区域从任何状态转换为“无（Ready）”。因此，它无条件返回内存。如果在分配过程中检测到内存不足错误，或用于划分出地址空间的对齐部分，则使用此方法。仅当sysReserve始终返回与堆分配器的对齐限制对齐的内存区域时，如果sysFree是无操作的，这是可以的。

sysReserve

sysReserve将内存区域从“无（None）”转换为“保留（Reserved）”。它以这样一种方式保留地址空间，即在访问时（通过权限或未提交内存）会导致致命错误。因此，这种保留永远不会受到物理内存的支持。如果传递给它的指针为非nil，则调用者希望在那里保留，但是sysReserve仍然可以选择另一个位置（如果该位置不可用）。

注意：sysReserve返回OS对齐的内存，但是堆分配器可能使用更大的对齐方式，因此调用者必须小心地重新对齐sysReserve获得的内存。

sysMap

sysMap将内存区域从“保留（Reserved）”状态转换为“已准备（Prepared）”状态。它确保可以将存储区域有效地转换为“就绪（Ready）”。

sysUsed

sysUsed将内存区域从“已准备（Prepared）”过渡到“就绪（Ready）”。它通知操作系统需要内存区域，并确保可以安全地访问该区域。在没有明确的提交步骤和严格的过量提交限制的系统上，这通常是不操作的，例如，在Windows上至关重要。

sysUnused

sysUnused将内存区域从“就绪（Ready）”转换为“已准备（Prepared）”。它通知操作系统，不再需要支持该内存区域的物理页，并且可以将其重新用于其他目的。 sysUnused内存区域的内容被认为是没用的，在调用sysUsed之前，不得再次访问该区域。

sysFault

sysFault将内存区域从“就绪（Ready）”或“已准备（Prepared）”转换为“保留（Reserved）”。它标记了一个区域，以便在访问时总是会发生故障。仅用于调试运行时。

状态转换图

hint的选择

64位机器

在64位计算机上，我们选择以下hit因为：

1、从地址空间的中间开始，可以轻松扩展到连续范围，而无需运行其他映射。
2、这使Go堆地址在调试时更容易识别。
3、gccgo中的堆栈扫描仍然很保守，因此将地址与其他数据区分开很重要。

从0x00c0开始意味着有效的内存地址将从0x00c0、0x00c1 … n 小端开始，即c0 00，c1 00，…这些都不是有效的UTF-8序列，否则它们是尽可能远离ff（可能是一个公共字节）。如果失败，我们尝试其他0xXXc0地址。较早的尝试使用0x11f8导致线程分配期间OS X上的内存不足错误。 0x00c0导致与AddressSanitizer发生冲突，后者保留了最多0x0100的所有内存。这些选择减少了保守的垃圾收集器不收集内存的可能性，因为某些非指针内存块具有与内存地址匹配的位模式。

但是，在arm64上，我们忽略了上面的所有建议，并在0x40 << 32处分配，因为当使用具有3级转换缓冲区的4k页面时，
用户地址空间在darwin/arm64上被限制为39位，地址空间更小。

在AIX上，对于64位，mmaps从0x0A00000000000000开始。

32位机器

在32位计算机上，我们更加关注保持可用堆是连续的。
因此：

1、我们为所有的heapArena保留空间，这样它们就不会与heap交错。它们约为258MB，因此还算不错。（如果出现问题，我们可以在前面预留较小的空间。）
2、我们建议堆从二进制文件的末尾开始，因此我们有最大的机会保持其连续性。
3、我们尝试放出一个相当大的初始堆保留。

一些常量

var zerobase uintptr：所有0字节分配的基地址

源码

// Memory allocator.
//
// This was originally based on tcmalloc, but has diverged quite a bit.
// http://goog-perftools.sourceforge.net/doc/tcmalloc.html

// The main allocator works in runs of pages.
// Small allocation sizes (up to and including 32 kB) are
// rounded to one of about 70 size classes, each of which
// has its own free set of objects of exactly that size.
// Any free page of memory can be split into a set of objects
// of one size class, which are then managed using a free bitmap.
//
// The allocator's data structures are:
//
//    fixalloc: a free-list allocator for fixed-size off-heap objects,
//        used to manage storage used by the allocator.
//    mheap: the malloc heap, managed at page (8192-byte) granularity.
//    mspan: a run of in-use pages managed by the mheap.
//    mcentral: collects all spans of a given size class.
//    mcache: a per-P cache of mspans with free space.
//    mstats: allocation statistics.
//
// Allocating a small object proceeds up a hierarchy of caches:
//
//    1. Round the size up to one of the small size classes
//       and look in the corresponding mspan in this P's mcache.
//       Scan the mspan's free bitmap to find a free slot.
//       If there is a free slot, allocate it.
//       This can all be done without acquiring a lock.
//
//    2. If the mspan has no free slots, obtain a new mspan
//       from the mcentral's list of mspans of the required size
//       class that have free space.
//       Obtaining a whole span amortizes the cost of locking
//       the mcentral.
//
//    3. If the mcentral's mspan list is empty, obtain a run
//       of pages from the mheap to use for the mspan.
//
//    4. If the mheap is empty or has no page runs large enough,
//       allocate a new group of pages (at least 1MB) from the
//       operating system. Allocating a large run of pages
//       amortizes the cost of talking to the operating system.
//
// Sweeping an mspan and freeing objects on it proceeds up a similar
// hierarchy:
//
//    1. If the mspan is being swept in response to allocation, it
//       is returned to the mcache to satisfy the allocation.
//
//    2. Otherwise, if the mspan still has allocated objects in it,
//       it is placed on the mcentral free list for the mspan's size
//       class.
//
//    3. Otherwise, if all objects in the mspan are free, the mspan's
//       pages are returned to the mheap and the mspan is now dead.
//
// Allocating and freeing a large object uses the mheap
// directly, bypassing the mcache and mcentral.
//
// If mspan.needzero is false, then free object slots in the mspan are
// already zeroed. Otherwise if needzero is true, objects are zeroed as
// they are allocated. There are various benefits to delaying zeroing
// this way:
//
//    1. Stack frame allocation can avoid zeroing altogether.
//
//    2. It exhibits better temporal locality, since the program is
//       probably about to write to the memory.
//
//    3. We don't zero pages that never get reused.

// Virtual memory layout
//
// The heap consists of a set of arenas, which are 64MB on 64-bit and
// 4MB on 32-bit (heapArenaBytes). Each arena's start address is also
// aligned to the arena size.
//
// Each arena has an associated heapArena object that stores the
// metadata for that arena: the heap bitmap for all words in the arena
// and the span map for all pages in the arena. heapArena objects are
// themselves allocated off-heap.
//
// Since arenas are aligned, the address space can be viewed as a
// series of arena frames. The arena map (mheap_.arenas) maps from
// arena frame number to *heapArena, or nil for parts of the address
// space not backed by the Go heap. The arena map is structured as a
// two-level array consisting of a "L1" arena map and many "L2" arena
// maps; however, since arenas are large, on many architectures, the
// arena map consists of a single, large L2 map.
//
// The arena map covers the entire possible address space, allowing
// the Go heap to use any part of the address space. The allocator
// attempts to keep arenas contiguous so that large spans (and hence
// large objects) can cross arenas.
/**
 * 内存分配器
 *
 * 这最初是基于tcmalloc的，但是有很大的不同。
 * 参见：http://goog-perftools.sourceforge.net/doc/tcmalloc.html
 * 翻译：https://blog.csdn.net/DERRANTCM/article/details/105342996
 *
 * 主分配器在大量页（runs of pages）中工作。
 * 将较小的分配大小（最大为32 kB，包括32 kB）舍入为大约70个大小类别之一，每个类别都有其自己的大小完全相同的空闲对象集。
 * 任何空闲的内存页都可以拆分为一个大小类别的对象集，然后使用空闲位图（free bitmap）进行管理。
 *
 * 分配器的数据结构为：
 *
 * fixalloc：用于固定大小的堆外对象的空闲列表分配器，用于管理分配器使用的存储。
 * mheap：malloc堆，以页（8192字节）粒度进行管理。
 * mspan：由mheap管理的一系列使用中的页面。
 * mcentral：收集给定大小类的所有跨度。
 * mcache：具有可用空间的mspans的每P个缓存。
 * mstats：分配统计信息。
 *
 * 分配一个小对象沿用了高速缓存的层次结构：
 *
 * 1.将大小四舍五入为一个较小的类别，然后在此P的mcache中查看相应的mspan。
 * 扫描mspan的空闲位图以找到空闲位置（slot）。如果有空闲位置，分配它。这都可以在不获取锁的情况下完成。
 *
 * 2.如果mspan没有可用位置，则从mcentral的具有可用空间的所需size类的mspan列表中获取一个新的mspan。
 * 获得整个跨度（span）会摊销锁定mcentral的成本。
 *
 * 3.如果mcentral的mspan列表为空，从mheap获取一系列页以用于mspan。
 *
 * 4.如果mheap为空或没有足够大的页，则从操作系统中分配一组新的页（至少1MB）。
 * 分配大量页面将分摊与操作系统进行对话的成本。
 *
 * 清除mspan并释放对象沿用了类似的层次结构：
 *
 * 1.如果响应分配而清除了mspan，则将mspan返还到mcache以满足分配。
 *
 * 2.否则，如果mspan仍有已分配的对象，则将其放在mspan的size类别的mcentral空闲列表上。
 *
 * 3.否则，如果mspan中的所有对象都是空闲的，则mspan的页面将返回到mheap，并且mspan现在已失效。
 *
 * 分配和释放大对象直接使用mheap，而绕过mcache和mcentral。
 *
 * 如果mspan.needzero为false，则mspan中的可用对象位置已被清零。否则，如果needzero为true，
 * 则在分配对象时将其清零。通过这种方式延迟归零有很多好处：
 *
 * 1.堆栈帧分配可以完全避免置零。
 *
 * 2.它具有更好的时间局部性，因为该程序可能即将写入内存。
 *
 * 3.我们不会将永远不会被重用的页面归零。
 *
 * 虚拟内存布局
 *
 * 堆由一组arena组成，这些arena在64位上为64MB，在32位（heapArenaBytes）上为4MB。
 * 每个arena的起始地址也与arena大小对齐。
 *
 * 每个arena都有一个关联的heapArena对象，该对象存储该arena的元数据：arena中所有字（word）的堆位图
 * 和arena中所有页的跨度（span）图。它们本身是堆外分配的。
 *
 * 由于arena是对齐的，因此可以将地址空间视为一系列arena帧（frame）。arena映射（mheap_.arenas）
 * 从arena帧号映射到*heapArena，对于不由Go堆支持的部分地址空间，映射为nil。arena映射的结构为两层数组，
 * 由“L1”arena映射和许多“ L2”arena映射组成；但是，由于arena很大，因此在许多体系结构上，
 * arena映射都由一个大型L2映射组成。
 *
 * arena地图覆盖了整个可用的地址空间，从而允许Go堆使用地址空间的任何部分。分配器尝试使arena保持连续，
 * 以便大跨度（以及大对象）可以跨越arena。
 **/

package runtime

import (
    "runtime/internal/atomic"
    "runtime/internal/math"
    "runtime/internal/sys"
    "unsafe"
)

const (
    debugMalloc = false

    maxTinySize   = _TinySize
    tinySizeClass = _TinySizeClass
    maxSmallSize  = _MaxSmallSize

    pageShift = _PageShift
    pageSize  = _PageSize
    pageMask  = _PageMask
    // By construction, single page spans of the smallest object class
    // have the most objects per span.
    // 通过构造，对象类别最小的单个页面跨度在每个跨度中具有最多的对象。
    // 每个跨度的最大多对象数
    maxObjsPerSpan = pageSize / 8

    concurrentSweep = _ConcurrentSweep

    _PageSize = 1 << _PageShift
    _PageMask = _PageSize - 1

    // _64bit = 1 on 64-bit systems, 0 on 32-bit systems
    // _64bit = 1在64位系统上，0在32位系统上
    _64bit = 1 << (^uintptr(0) >> 63) / 2

    // Tiny allocator parameters, see "Tiny allocator" comment in malloc.go.
    // Tiny分配器参数，请参阅malloc.go中的“Tiny分配器”注释。在mallocgc方法中
    _TinySize      = 16
    _TinySizeClass = int8(2)
    
    // FixAlloc的块大小
    _FixAllocChunk = 16 << 10 // Chunk size for FixAlloc

    // Per-P, per order stack segment cache size.
    // 每个P，每个order堆栈段的缓存大小。
    _StackCacheSize = 32 * 1024

    // Number of orders that get caching. Order 0 is FixedStack
    // and each successive order is twice as large.
    // We want to cache 2KB, 4KB, 8KB, and 16KB stacks. Larger stacks
    // will be allocated directly.
    // Since FixedStack is different on different systems, we
    // must vary NumStackOrders to keep the same maximum cached size.
    // 获得缓存的order数。Order 0是FixedStack，每个连续的order是其两倍。
    // 我们要缓存2KB，4KB，8KB和16KB堆栈。较大的堆栈将直接分配。
    // 由于FixedStack在不同的系统上是不同的，因此我们必须改变NumStackOrders以保持相同的最大缓存大小。
    // 下面是不同的操作系统对应的FixedStack和NumStackOrders的对应表
    //   OS               | FixedStack | NumStackOrders
    //   -----------------+------------+---------------
    //   linux/darwin/bsd | 2KB        | 4
    //   windows/32       | 4KB        | 3
    //   windows/64       | 8KB        | 2
    //   plan9            | 4KB        | 3
    _NumStackOrders = 4 - sys.PtrSize/4*sys.GoosWindows - 1*sys.GoosPlan9

    // heapAddrBits is the number of bits in a heap address. On
    // amd64, addresses are sign-extended beyond heapAddrBits. On
    // other arches, they are zero-extended.
    //
    // On most 64-bit platforms, we limit this to 48 bits based on a
    // combination of hardware and OS limitations.
    //
    // amd64 hardware limits addresses to 48 bits, sign-extended
    // to 64 bits. Addresses where the top 16 bits are not either
    // all 0 or all 1 are "non-canonical" and invalid. Because of
    // these "negative" addresses, we offset addresses by 1<<47
    // (arenaBaseOffset) on amd64 before computing indexes into
    // the heap arenas index. In 2017, amd64 hardware added
    // support for 57 bit addresses; however, currently only Linux
    // supports this extension and the kernel will never choose an
    // address above 1<<47 unless mmap is called with a hint
    // address above 1<<47 (which we never do).
    //
    // arm64 hardware (as of ARMv8) limits user addresses to 48
    // bits, in the range [0, 1<<48).
    //
    // ppc64, mips64, and s390x support arbitrary 64 bit addresses
    // in hardware. On Linux, Go leans on stricter OS limits. Based
    // on Linux's processor.h, the user address space is limited as
    // follows on 64-bit architectures:
    // heapAddrBits是堆地址中的位数。在amd64上，地址被符号扩展到heapAddrBits之外。在其他架构上，它们是零扩展的。
    //
    // 在大多数64位平台上，基于硬件和操作系统限制的组合，我们将其限制为48位。
    //
    // amd64硬件将地址限制为48位，符号扩展为64位。前16位不全为0或全为1的地址是“非规范的”且无效。
    // 由于存在这些“负”地址，因此在计算进入堆竞技场索引的索引之前，在amd64上将地址偏移1 << 47（arenaBaseOffset）。
    // 2017年，amd64硬件增加了对57位地址的支持;但是，当前只有Linux支持此扩展，内核将永远不会选择大于1 << 47的地址，
    // 除非调用mmap的提示地址大于1 << 47（我们从未这样做）。
    //
    // arm64硬件（自ARMv8起）将用户地址限制为48位，范围为[0，1 << 48）。
    //
    // ppc64，mips64和s390x在硬件中支持任意64位地址。在Linux上，Go依靠更严格的OS限制。
    // 基于Linux的processor.h，在64位体系结构上，用户地址空间受到如下限制：
    //
    // Architecture  Name              Maximum Value (exclusive)
    // ---------------------------------------------------------------------
    // amd64         TASK_SIZE_MAX     0x007ffffffff000 (47 bit addresses)
    // arm64         TASK_SIZE_64      0x01000000000000 (48 bit addresses)
    // ppc64,le    TASK_SIZE_USER64  0x00400000000000 (46 bit addresses)
    // mips64,le   TASK_SIZE64       0x00010000000000 (40 bit addresses)
    // s390x         TASK_SIZE         1<<64 (64 bit addresses)
    //
    // These limits may increase over time, but are currently at
    // most 48 bits except on s390x. On all architectures, Linux
    // starts placing mmap'd regions at addresses that are
    // significantly below 48 bits, so even if it's possible to
    // exceed Go's 48 bit limit, it's extremely unlikely in
    // practice.
    //
    // On 32-bit platforms, we accept the full 32-bit address
    // space because doing so is cheap.
    // mips32 only has access to the low 2GB of virtual memory, so
    // we further limit it to 31 bits.
    //
    // On darwin/arm64, although 64-bit pointers are presumably
    // available, pointers are truncated to 33 bits. Furthermore,
    // only the top 4 GiB of the address space are actually available
    // to the application, but we allow the whole 33 bits anyway for
    // simplicity.
    // TODO(mknyszek): Consider limiting it to 32 bits and using
    // arenaBaseOffset to offset into the top 4 GiB.
    //
    // WebAssembly currently has a limit of 4GB linear memory.
    // 这些限制可能会随时间增加，但目前最多为48位，但s390x除外。在所有体系结构上，
    // Linux都开始将mmap'd区域放置在明显低于48位的地址上，因此，即使有可能超过Go的48位限制，在实践中也极不可能。
    //
    // 在32位平台上，我们接受完整的32位地址空间，因为这样做很便宜。 mips32仅可以访问2GB的低虚拟内存，
    // 因此我们进一步将其限制为31位。
    //
    // 在darwin / arm64上，尽管大概可以使用64位指针，但指针会被截断为33位。此外，
    // 只有地址空间的前4个GiB实际上可供应用程序使用，但是为了简单起见，我们还是允许全部33位。
    // TODO（mknyszek）：考虑将其限制为32位，并使用arenaBaseOffset偏移到前4个GiB中。
    //
    // WebAssembly当前限制为4GB线性内存。
    // heapAddrBits：堆空间地址位数，间接表示了他可以支持的最大的内存空间
    heapAddrBits = (_64bit*(1-sys.GoarchWasm)*(1-sys.GoosDarwin*sys.GoarchArm64))*48 + (1-_64bit+sys.GoarchWasm)*(32-(sys.GoarchMips+sys.GoarchMipsle)) + 33*sys.GoosDarwin*sys.GoarchArm64

    // maxAlloc is the maximum size of an allocation. On 64-bit,
    // it's theoretically possible to allocate 1<<heapAddrBits bytes. On
    // 32-bit, however, this is one less than 1<<32 because the
    // number of bytes in the address space doesn't actually fit
    // in a uintptr.
    // maxAlloc是分配的最大大小。在64位上，理论上可以分配1 << heapAddrBits字节。
    // 但是，在32位上，这比1<<32小1，因为地址空间中的字节数实际上不适合uintptr。
    maxAlloc = (1 << heapAddrBits) - (1-_64bit)*1

    // The number of bits in a heap address, the size of heap
    // arenas, and the L1 and L2 arena map sizes are related by
    //
    //   (1 << addr bits) = arena size * L1 entries * L2 entries
    //
    // Currently, we balance these as follows:
    // 堆地址中的位数，堆arena的大小以及L1和L2 arena映射的大小与
    //
    // 1 << addr位）=arena大小* L1条目* L2条目
    //
    // 目前，我们将这些平衡如下：
    //
    //       Platform  Addr bits  Arena size  L1 entries   L2 entries
    // --------------  ---------  ----------  ----------  -----------
    //       */64-bit         48        64MB           1    4M (32MB)
    // windows/64-bit         48         4MB          64    1M  (8MB)
    //       */32-bit         32         4MB           1  1024  (4KB)
    //     */mips(le)         31         4MB           1   512  (2KB)

    // heapArenaBytes is the size of a heap arena. The heap
    // consists of mappings of size heapArenaBytes, aligned to
    // heapArenaBytes. The initial heap mapping is one arena.
    //
    // This is currently 64MB on 64-bit non-Windows and 4MB on
    // 32-bit and on Windows. We use smaller arenas on Windows
    // because all committed memory is charged to the process,
    // even if it's not touched. Hence, for processes with small
    // heaps, the mapped arena space needs to be commensurate.
    // This is particularly important with the race detector,
    // since it significantly amplifies the cost of committed
    // memory.
    // heapArenaBytes是堆arenas的大小。堆由大小为heapArenaBytes的映射组成，
    // 并与heapArenaBytes对齐。最初的堆映射是一个arenas。
    //
    // 当前在64位非Windows上为64MB，在32位和Windows上为4MB。我们在Windows上使用较小的arenas，
    // 因为所有已提交的内存都由进程负责，即使未涉及也是如此。因此，对于具有小堆的进程，映射的arenas空间需要相对应。
    // 这对于竞争检测器尤其重要，因为它会大大增加已提交内存的成本。
    heapArenaBytes = 1 << logHeapArenaBytes

    // logHeapArenaBytes is log_2 of heapArenaBytes. For clarity,
    // prefer using heapArenaBytes where possible (we need the
    // constant to compute some other constants).
    // logHeapArenaBytes是heapArenaBytes的log_2。为了清楚起见，
    // 最好在可能的地方使用heapArenaBytes（我们需要使用常量来计算其他常量）。
    logHeapArenaBytes = (6+20)*(_64bit*(1-sys.GoosWindows)*(1-sys.GoarchWasm)) + (2+20)*(_64bit*sys.GoosWindows) + (2+20)*(1-_64bit) + (2+20)*sys.GoarchWasm

    // heapArenaBitmapBytes is the size of each heap arena's bitmap.
    // heapArenaBitmapBytes是每个堆arena的位图大小。
    heapArenaBitmapBytes = heapArenaBytes / (sys.PtrSize * 8 / 2)

    // 每个arena所的页数
    pagesPerArena = heapArenaBytes / pageSize

    // arenaL1Bits is the number of bits of the arena number
    // covered by the first level arena map.
    //
    // This number should be small, since the first level arena
    // map requires PtrSize*(1<<arenaL1Bits) of space in the
    // binary's BSS. It can be zero, in which case the first level
    // index is effectively unused. There is a performance benefit
    // to this, since the generated code can be more efficient,
    // but comes at the cost of having a large L2 mapping.
    //
    // We use the L1 map on 64-bit Windows because the arena size
    // is small, but the address space is still 48 bits, and
    // there's a high cost to having a large L2.
    // arenaL1Bits是第一级arena映射覆盖的arena编号的位数。
    //
    // 这个数字应该很小，因为第一级arena映射在二进制文件的BSS中需要PtrSize*(1<<arenaL1Bits)空间。
    // 它可以为零，在这种情况下，第一级索引实际上未被使用。这会带来性能上的好处，
    // 因为生成的代码可以更高效，但是以拥有较大的L2映射为代价。
    //
    // 我们在64位Windows上使用L1映射，因为arena大小很小，但是地址空间仍然是48位，并且拥有大型L2的成本很高。
    arenaL1Bits = 6 * (_64bit * sys.GoosWindows)

    // arenaL2Bits is the number of bits of the arena number
    // covered by the second level arena index.
    //
    // The size of each arena map allocation is proportional to
    // 1<<arenaL2Bits, so it's important that this not be too
    // large. 48 bits leads to 32MB arena index allocations, which
    // is about the practical threshold.
    // arenaL2Bits是第二级arena索引覆盖的arena编号的位数。
    //
    // 每个arena映射分配的大小与1<<arenaL2Bits成正比，因此，不要太大也很重要。
    // 48位导致32MB arena索引分配，这大约是实际的阈值。
    arenaL2Bits = heapAddrBits - logHeapArenaBytes - arenaL1Bits

    // arenaL1Shift is the number of bits to shift an arena frame
    // number by to compute an index into the first level arena map.
    // arenaL1Shift是将arena帧号移位以计算进入第一级arena映射的索引的位数。
    arenaL1Shift = arenaL2Bits

    // arenaBits is the total bits in a combined arena map index.
    // This is split between the index into the L1 arena map and
    // the L2 arena map.
    // arenaBits是组合的arena映射索引中的总位。这在进入L1 arena映射和L2 arena映射的索引之间进行划分。
    arenaBits = arenaL1Bits + arenaL2Bits

    // arenaBaseOffset is the pointer value that corresponds to
    // index 0 in the heap arena map.
    //
    // On amd64, the address space is 48 bits, sign extended to 64
    // bits. This offset lets us handle "negative" addresses (or
    // high addresses if viewed as unsigned).
    //
    // On aix/ppc64, this offset allows to keep the heapAddrBits to
    // 48. Otherwize, it would be 60 in order to handle mmap addresses
    // (in range 0x0a00000000000000 - 0x0afffffffffffff). But in this
    // case, the memory reserved in (s *pageAlloc).init for chunks
    // is causing important slowdowns.
    //
    // On other platforms, the user address space is contiguous
    // and starts at 0, so no offset is necessary.
    // arenaBaseOffset是与堆arena映射中的索引0对应的指针值。
    //
    // 在amd64上，地址空间为48位，符号扩展为64位。此偏移量使我们可以处理“负”地址（如果视为无符号，则为高地址）。
    //
    // 在aix/ppc64上，此偏移量允许将heapAddrBits保持为48。否则，为了处理mmap地址
    //（范围为0x0a00000000000000-0x0afffffffffffffff），它将为60。但是在这种情况下，
    // (s*pageAlloc).init中为块保留的内存会导致严重的速度下降。
    //
    // 在其他平台上，用户地址空间是连续的，并且从0开始，因此不需要偏移量。
    arenaBaseOffset = sys.GoarchAmd64*(1<<47) + (^0x0a00000000000000+1)&uintptrMask*sys.GoosAix

    // Max number of threads to run garbage collection.
    // 2, 3, and 4 are all plausible maximums depending
    // on the hardware details of the machine. The garbage
    // collector scales well to 32 cpus.
    // 运行垃圾回收的最大线程数。 2、3和4都是合理的最大值，具体取决于机器的硬件细节。 垃圾收集器可以很好地扩展到32 cpus。
    _MaxGcproc = 32

    // minLegalPointer is the smallest possible legal pointer.
    // This is the smallest possible architectural page size,
    // since we assume that the first page is never mapped.
    //
    // This should agree with minZeroPage in the compiler.
    //
    // minLegalPointer是最小的合法指针。 这是可能的最小体系架构页大小，因为我们假设第一页从未映射过。
    //这应该与编译器中的minZeroPage一致。
    minLegalPointer uintptr = 4096
)

// physPageSize is the size in bytes of the OS's physical pages.
// Mapping and unmapping operations must be done at multiples of
// physPageSize.
//
// This must be set by the OS init code (typically in osinit) before
// mallocinit.
//
// physPageSize是操作系统物理页面的大小（以字节为单位）。
// 映射和取消映射操作必须以physPageSize的倍数完成。
//
// 必须在mallocinit之前通过OS初始化代码（通常在osinit中）进行设置。
var physPageSize uintptr

// physHugePageSize is the size in bytes of the OS's default physical huge
// page size whose allocation is opaque to the application. It is assumed
// and verified to be a power of two.
//
// If set, this must be set by the OS init code (typically in osinit) before
// mallocinit. However, setting it at all is optional, and leaving the default
// value is always safe (though potentially less efficient).
//
// Since physHugePageSize is always assumed to be a power of two,
// physHugePageShift is defined as physHugePageSize == 1 << physHugePageShift.
// The purpose of physHugePageShift is to avoid doing divisions in
// performance critical functions.
//
// physHugePageSize是操作系统默认物理大页面大小的大小（以字节为单位），
// 该大小对于应用程序是不透明的。 假定并验证为2的幂。
//
// 如果已设置，则必须在mallocinit之前通过OS初始化代码（通常在osinit中）进行设置。
// 但是，完全设置它是可选的，并且保留默认值始终是安全的（尽管可能会降低效率）。
//
// 由于physHugePageSize始终假定为2的幂，因此physHugePageShift定义为physHugePageSize == 1 << physHugePageShift。
// physHugePageShift的目的是避免对性能至关重要的功能进行划分。
var (
    physHugePageSize  uintptr
    physHugePageShift uint
)

// OS memory management abstraction layer
//
// Regions of the address space managed by the runtime may be in one of four
// states at any given time:
// 1) None - Unreserved and unmapped, the default state of any region.
// 2) Reserved - Owned by the runtime, but accessing it would cause a fault.
//               Does not count against the process' memory footprint.
// 3) Prepared - Reserved, intended not to be backed by physical memory (though
//               an OS may implement this lazily). Can transition efficiently to
//               Ready. Accessing memory in such a region is undefined (may
//               fault, may give back unexpected zeroes, etc.).
// 4) Ready - may be accessed safely.
//
// This set of states is more than is strictly necessary to support all the
// currently supported platforms. One could get by with just None, Reserved, and
// Ready. However, the Prepared state gives us flexibility for performance
// purposes. For example, on POSIX-y operating systems, Reserved is usually a
// private anonymous mmap'd region with PROT_NONE set, and to transition
// to Ready would require setting PROT_READ|PROT_WRITE. However the
// underspecification of Prepared lets us use just MADV_FREE to transition from
// Ready to Prepared. Thus with the Prepared state we can set the permission
// bits just once early on, we can efficiently tell the OS that it's free to
// take pages away from us when we don't strictly need them.
//
// For each OS there is a common set of helpers defined that transition
// memory regions between these states. The helpers are as follows:
//
// sysAlloc transitions an OS-chosen region of memory from None to Ready.
// More specifically, it obtains a large chunk of zeroed memory from the
// operating system, typically on the order of a hundred kilobytes
// or a megabyte. This memory is always immediately available for use.
//
// sysFree transitions a memory region from any state to None. Therefore, it
// returns memory unconditionally. It is used if an out-of-memory error has been
// detected midway through an allocation or to carve out an aligned section of
// the address space. It is okay if sysFree is a no-op only if sysReserve always
// returns a memory region aligned to the heap allocator's alignment
// restrictions.
//
// sysReserve transitions a memory region from None to Reserved. It reserves
// address space in such a way that it would cause a fatal fault upon access
// (either via permissions or not committing the memory). Such a reservation is
// thus never backed by physical memory.
// If the pointer passed to it is non-nil, the caller wants the
// reservation there, but sysReserve can still choose another
// location if that one is unavailable.
// NOTE: sysReserve returns OS-aligned memory, but the heap allocator
// may use larger alignment, so the caller must be careful to realign the
// memory obtained by sysReserve.
//
// sysMap transitions a memory region from Reserved to Prepared. It ensures the
// memory region can be efficiently transitioned to Ready.
//
// sysUsed transitions a memory region from Prepared to Ready. It notifies the
// operating system that the memory region is needed and ensures that the region
// may be safely accessed. This is typically a no-op on systems that don't h以上是关于go源码阅读——malloc.go的主要内容，如果未能解决你的问题，请参考以下文章