Spark Memory Manager: A Walkthrough of the MemoryManager Source Code

Posted 2022-01-06 zhuge134


The MemoryManager

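The memory manager is one of the most important foundational modules in the Spark core. Sorting during shuffle, RDD caching, unroll memory, broadcast variables, storage of task results: anything that needs memory has to request a quota from the memory manager. In my view its main job is to minimize out-of-memory errors while keeping memory utilization high. Older versions of Spark used the static memory manager, StaticMemoryManager; since Spark 1.6 the default is the unified memory manager, UnifiedMemoryManager. The biggest difference between the two is that under unified management there is no fixed boundary between execution memory and storage memory, and each can borrow from the other. Execution memory has the higher priority: if execution runs short it can reclaim space from storage, evicting cached blocks (writing them to disk when the storage level allows it) until enough room has been freed. Execution memory that is in use, on the other hand, is never evicted by storage. This is understandable: execution memory backs work such as shuffle sorting that a task needs in order to make progress, whereas the requirements on an RDD cache are less strict.
A few parameters control the proportion of memory given to each part: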

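spark.memory.fraction, default 0.6. This controls what share of the heap is handed to Spark's memory manager (more precisely, a share of heap minus 300 MB, where the 300 MB is a fixed reservation for Spark's own internal use, not for the permanent generation). In other words, execution memory and storage memory together amount to only 0.6 of (heap - 300 MB). The remaining 0.4 is left for memory used by user code, for example when your code loads a large file into memory or does its own sorting. Memory used by user code is not tracked by the memory manager, so a share has to be set aside for it.
spark.memory.storageFraction, default 0.5. As the name suggests, this sets the share for storage memory. Note that it is a share of the memory managed by the memory manager; the rest is execution memory. With the defaults, for example, storage memory is 0.6 * 0.5 = 0.3 of the heap (again, strictly speaking of heap minus 300 MB).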
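
The arithmetic behind these two parameters can be sketched in a few lines. This is only an illustration of the formulas described above, not Spark code: the object MemoryRegions, the case class Regions and the method compute are hypothetical names introduced here.

```scala
// Hypothetical helper (not part of Spark): reproduces the sizing arithmetic described above.
object MemoryRegions {
  // The fixed 300 MB reservation subtracted from the heap before any fraction is applied.
  val ReservedBytes: Long = 300L * 1024 * 1024

  final case class Regions(unified: Long, storage: Long, execution: Long, user: Long)

  def compute(
      heapBytes: Long,
      memoryFraction: Double = 0.6,    // spark.memory.fraction
      storageFraction: Double = 0.5    // spark.memory.storageFraction
  ): Regions = {
    require(heapBytes > ReservedBytes, "heap must be larger than the reserved 300 MB")
    val usable = heapBytes - ReservedBytes
    // Memory managed by the memory manager (execution + storage).
    val unified = (usable * memoryFraction).toLong
    // Initial storage region; the boundary is soft and can move at runtime.
    val storage = (unified * storageFraction).toLong
    Regions(unified, storage, unified - storage, usable - unified)
  }
}
```

For a heap of 1324 MB, for instance, MemoryRegions.compute(1324L * 1024 * 1024) leaves 1024 MB after the reservation, of which roughly 614 MB is managed memory split evenly between storage and execution.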

MemoryManager overview

Let us first take a look at the MemoryManager class as a whole:

    maxOnHeapStorageMemory
    maxOffHeapStorageMemory
    setMemoryStore
    acquireStorageMemory
    acquireUnrollMemory
    acquireExecutionMemory
    releaseExecutionMemory
    releaseAllExecutionMemoryForTask
    releaseStorageMemory
    releaseAllStorageMemory
    releaseUnrollMemory
    executionMemoryUsed
    storageMemoryUsed
    getExecutionMemoryUsageForTask

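As you can see, MemoryManager has relatively few methods and they follow a pattern. It divides memory by function into three kinds: StorageMemory, UnrollMemory and ExecutionMemory.
Each kind has a method for acquiring memory and a method for releasing it, and the three acquire methods are all abstract, implemented by subclasses.
Next, let us look at the member variables inside MemoryManager: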

    protected val onHeapStorageMemoryPool = new StorageMemoryPool(this, MemoryMode.ON_HEAP)
    protected val offHeapStorageMemoryPool = new StorageMemoryPool(this, MemoryMode.OFF_HEAP)
    protected val onHeapExecutionMemoryPool = new ExecutionMemoryPool(this, MemoryMode.ON_HEAP)
    protected val offHeapExecutionMemoryPool = new ExecutionMemoryPool(this, MemoryMode.OFF_HEAP)

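These four member variables represent four memory pools. Note that the MemoryPool constructor takes a parameter of type Object that serves as the synchronization lock; some MemoryPool methods acquire this object's lock for synchronization. Here the MemoryManager passes itself (this), so all four pools share one lock.
Let us look at how they are initialized: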

    onHeapStorageMemoryPool.incrementPoolSize(onHeapStorageMemory)
    onHeapExecutionMemoryPool.incrementPoolSize(onHeapExecutionMemory)
    offHeapExecutionMemoryPool.incrementPoolSize(maxOffHeapMemory - offHeapStorageMemory)
    offHeapStorageMemoryPool.incrementPoolSize(offHeapStorageMemory)
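
To make the shared-lock idea concrete, here is a minimal sketch under simplifying assumptions. SimplePool and SimpleManager are hypothetical stand-ins written for this article, not Spark's classes; only the method names incrementPoolSize and decrementPoolSize are borrowed from the code above.

```scala
// Hypothetical stand-in for a memory pool: it only tracks numbers, guarded by an external lock.
class SimplePool(lock: Object) {
  private var _poolSize: Long = 0L
  private var _memoryUsed: Long = 0L

  def poolSize: Long = lock.synchronized { _poolSize }
  def memoryUsed: Long = lock.synchronized { _memoryUsed }
  def memoryFree: Long = lock.synchronized { _poolSize - _memoryUsed }

  def incrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0)
    _poolSize += delta
  }

  def decrementPoolSize(delta: Long): Unit = lock.synchronized {
    require(delta >= 0 && delta <= _poolSize - _memoryUsed, "cannot shrink below used memory")
    _poolSize -= delta
  }
}

// Hypothetical manager: both pools lock on the manager itself, as described above.
class SimpleManager(executionBytes: Long, storageBytes: Long) {
  val executionPool = new SimplePool(this)
  val storagePool = new SimplePool(this)
  executionPool.incrementPoolSize(executionBytes)
  storagePool.incrementPoolSize(storageBytes)

  // Moving capacity between pools is atomic because both pools use `this` as their lock.
  def lend(bytes: Long): Unit = synchronized {
    executionPool.decrementPoolSize(bytes)
    storagePool.incrementPoolSize(bytes)
  }
}
```

Because JVM monitors are reentrant, the manager can hold its own lock in lend while the pools take the same lock again inside their methods.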

MemoryManager.releaseExecutionMemory

This simply delegates to the corresponding method of ExecutionMemoryPool:

  private[memory]
  def releaseExecutionMemory(
      numBytes: Long,
      taskAttemptId: Long,
      memoryMode: MemoryMode): Unit = synchronized {
    memoryMode match {
      case MemoryMode.ON_HEAP => onHeapExecutionMemoryPool.releaseMemory(numBytes, taskAttemptId)
      case MemoryMode.OFF_HEAP => offHeapExecutionMemoryPool.releaseMemory(numBytes, taskAttemptId)
    }
  }

ExecutionMemoryPool.releaseMemory

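The logic is simple, so I will not say much about it.
This method does give a good sense of what memory management means in Spark: at bottom it is the recording and management of memory usage figures, not actual allocation and reclamation of memory the way an operating system or the JVM does it.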

def releaseMemory(numBytes: Long, taskAttemptId: Long): Unit = lock.synchronized {
  // Look up how much memory this task holds according to the internal bookkeeping
  val curMem = memoryForTask.getOrElse(taskAttemptId, 0L)
  // If asked to release more than the task actually holds, log a warning and cap the amount
  var memoryToFree = if (curMem < numBytes) {
    logWarning(
      s"Internal error: release called on $numBytes bytes but task only has $curMem bytes " +
        s"of memory from the $poolName pool")
    curMem
  } else {
    numBytes
  }
  if (memoryForTask.contains(taskAttemptId)) {
    // Update the bookkeeping
    memoryForTask(taskAttemptId) -= memoryToFree
    // If the task's usage has dropped to zero or below, remove the task from the bookkeeping
    if (memoryForTask(taskAttemptId) <= 0) {
      memoryForTask.remove(taskAttemptId)
    }
  }
  // Finally, notify waiting threads,
  // since other tasks may be waiting to acquire execution memory
  lock.notifyAll() // Notify waiters in acquireMemory() that memory has been freed
}
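
The point that releasing memory is pure bookkeeping can be tried out with a small standalone version. TaskLedger below is a hypothetical class written for illustration; it mirrors the capping and cleanup logic of releaseMemory but is not the Spark implementation.

```scala
import scala.collection.mutable

// Hypothetical miniature of the per-task ledger: it tracks numbers only and never touches real memory.
class TaskLedger {
  private val memoryForTask = mutable.HashMap.empty[Long, Long]

  // Record that a task now holds numBytes more.
  def record(taskAttemptId: Long, numBytes: Long): Unit = synchronized {
    memoryForTask(taskAttemptId) = memoryForTask.getOrElse(taskAttemptId, 0L) + numBytes
  }

  // Returns the number of bytes actually released, capped at what the task holds.
  def release(taskAttemptId: Long, numBytes: Long): Long = synchronized {
    val held = memoryForTask.getOrElse(taskAttemptId, 0L)
    val toFree = math.min(held, numBytes)
    if (memoryForTask.contains(taskAttemptId)) {
      memoryForTask(taskAttemptId) = held - toFree
      // Forget tasks that no longer hold anything.
      if (memoryForTask(taskAttemptId) <= 0) {
        memoryForTask.remove(taskAttemptId)
      }
    }
    notifyAll() // wake up any thread waiting on this ledger for memory
    toFree
  }

  def used(taskAttemptId: Long): Long = synchronized { memoryForTask.getOrElse(taskAttemptId, 0L) }
  def trackedTasks: Int = synchronized { memoryForTask.size }
}
```

Recording 100 bytes for a task and then releasing 40 leaves 60 on the ledger; a later release of 500 is capped at those 60 and removes the task's entry.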

MemoryManager.releaseAllExecutionMemoryForTask

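This releases all memory the task holds in both the on-heap and the off-heap (direct memory) execution pools.
onHeapExecutionMemoryPool and offHeapExecutionMemoryPool are instances of the same class; one tracks execution memory's use of heap memory and the other its use of direct memory.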

private[memory] def releaseAllExecutionMemoryForTask(taskAttemptId: Long): Long = synchronized {
  onHeapExecutionMemoryPool.releaseAllMemoryForTask(taskAttemptId) +
    offHeapExecutionMemoryPool.releaseAllMemoryForTask(taskAttemptId)
}

MemoryManager.releaseStorageMemory

Storage memory is not tracked at as fine a granularity as execution memory: there is no record of how much memory each RDD uses.

def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit = synchronized {
  memoryMode match {
    case MemoryMode.ON_HEAP => onHeapStorageMemoryPool.releaseMemory(numBytes)
    case MemoryMode.OFF_HEAP => offHeapStorageMemoryPool.releaseMemory(numBytes)
  }
}

MemoryManager.releaseUnrollMemory

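Looking at the method that releases unroll memory, we find that unroll memory is simply storage memory. Recall from the BlockManager part: unroll memory is requested mainly when data is being stored as a block through MemoryStore and has to be held in memory temporarily.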

final def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit = synchronized {
  releaseStorageMemory(numBytes, memoryMode)
}

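Summary

From the release methods analyzed above it is easy to see that releasing memory only changes some bookkeeping figures inside the memory manager. This requires external callers to make sure they really have freed that much memory; otherwise the memory manager's view will diverge badly from actual usage. Fortunately the memory manager is an internal Spark module and is not exposed to users, so user code never calls into it.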

UnifiedMemoryManager

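As mentioned at the start, Spark has two kinds of memory manager. Newer versions use the unified memory manager, UnifiedMemoryManager, by default, and the static memory manager is being phased out, so we focus on unified memory management here.
Above we analyzed the release methods in the parent class MemoryManager. The acquire methods are all abstract; their implementations live in the subclass, UnifiedMemoryManager.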

UnifiedMemoryManager.acquireExecutionMemory

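This method requests execution memory. It defines a couple of local functions: maybeGrowExecutionPool reclaims storage memory to expand the execution pool;
computeMaxExecutionPoolSize computes the maximum size the execution pool can reach.
Finally it calls executionPool.acquireMemory to actually request the execution memory.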

override private[memory] def acquireExecutionMemory(
    numBytes: Long,
    taskAttemptId: Long,
    memoryMode: MemoryMode): Long = synchronized {
  // Check that the pool sizes are consistent
  assertInvariants()
  assert(numBytes >= 0)
  // Pick the pools and sizes according to on-heap or off-heap (direct) memory
  val (executionPool, storagePool, storageRegionSize, maxMemory) = memoryMode match {
    case MemoryMode.ON_HEAP => (
      onHeapExecutionMemoryPool,
      onHeapStorageMemoryPool,
      onHeapStorageRegionSize,
      maxHeapMemory)
    case MemoryMode.OFF_HEAP => (
      offHeapExecutionMemoryPool,
      offHeapStorageMemoryPool,
      offHeapStorageMemory,
      maxOffHeapMemory)
  }

  /**
   * Grow the execution pool by evicting cached blocks, thereby shrinking the storage pool.
   *
   * When acquiring memory for a task, the execution pool may need to make multiple
   * attempts. Each attempt must be able to evict storage in case another task jumps in
   * and caches a large block between the attempts. This is called once per attempt.
   */
  // Grow execution memory by reclaiming storage memory:
  // cached blocks are evicted (to disk where allowed) to make room for execution
  def maybeGrowExecutionPool(extraMemoryNeeded: Long): Unit = {
    if (extraMemoryNeeded > 0) {
      // There is not enough free memory in the execution pool, so try to reclaim memory from
      // storage. We can reclaim any free memory from the storage pool. If the storage pool
      // has grown to become larger than `storageRegionSize`, we can evict blocks and reclaim
      // the memory that storage has borrowed from execution.
      // All free storage memory can be taken over as execution memory.
      // In addition, if storage has borrowed from execution, i.e. the storage pool is
      // currently larger than its configured size, everything it borrowed can be reclaimed.
      val memoryReclaimableFromStorage = math.max(
        storagePool.memoryFree,
        storagePool.poolSize - storageRegionSize)
      if (memoryReclaimableFromStorage > 0) {
        // Only reclaim as much space as is necessary and available:
        // this call evicts in-memory blocks (to disk where allowed)
        val spaceToReclaim = storagePool.freeSpaceToShrinkPool(
          math.min(extraMemoryNeeded, memoryReclaimableFromStorage))
        // Update the bookkeeping: storage shrinks by this much, execution grows by the same
        storagePool.decrementPoolSize(spaceToReclaim)
        executionPool.incrementPoolSize(spaceToReclaim)
      }
    }
  }

  /**
   * The size the execution pool would have after evicting storage memory.
   *
   * The execution memory pool divides this quantity among the active tasks evenly to cap
   * the execution memory allocation for each task. It is important to keep this greater
   * than the execution pool size, which doesn't take into account potential memory that
   * could be freed by evicting storage. Otherwise we may hit SPARK-12155.
   *
   * Additionally, this quantity should be kept below `maxMemory` to arbitrate fairness
   * in execution memory allocation across tasks, Otherwise, a task may occupy more than
   * its fair share of execution memory, mistakenly thinking that other tasks can acquire
   * the portion of storage memory that cannot be evicted.
   */
  def computeMaxExecutionPoolSize(): Long = {
    maxMemory - math.min(storagePool.memoryUsed, storageRegionSize)
  }

  executionPool.acquireMemory(
    numBytes, taskAttemptId, maybeGrowExecutionPool, () => computeMaxExecutionPoolSize)
}
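
The two local functions reduce to a little arithmetic, which can be isolated as pure functions for experimentation. The object ExecutionGrowth and its method names are hypothetical, and the sketch assumes that eviction always frees the amount asked for, whereas in the real code freeSpaceToShrinkPool may return less.

```scala
// Hypothetical pure-function version of the arithmetic in acquireExecutionMemory.
object ExecutionGrowth {
  // Bytes execution may take from storage: storage's free space, or whatever
  // storage holds beyond its configured region, whichever is larger.
  def reclaimableFromStorage(
      storageFree: Long,
      storagePoolSize: Long,
      storageRegionSize: Long): Long =
    math.max(storageFree, storagePoolSize - storageRegionSize)

  // Bytes actually moved from the storage pool to the execution pool in one attempt,
  // assuming eviction succeeds in full.
  def spaceToReclaim(
      extraMemoryNeeded: Long,
      storageFree: Long,
      storagePoolSize: Long,
      storageRegionSize: Long): Long = {
    val reclaimable = reclaimableFromStorage(storageFree, storagePoolSize, storageRegionSize)
    if (extraMemoryNeeded <= 0 || reclaimable <= 0) 0L
    else math.min(extraMemoryNeeded, reclaimable)
  }

  // Size the execution pool could reach after evicting storage.
  def maxExecutionPoolSize(maxMemory: Long, storageUsed: Long, storageRegionSize: Long): Long =
    maxMemory - math.min(storageUsed, storageRegionSize)
}
```

Take maxMemory = 1000 and a storage region of 500. If the storage pool has grown to 700 with 650 used, 50 bytes are free but 200 were borrowed, so up to 200 can be reclaimed; a request for 300 extra bytes therefore moves only 200. The execution pool can reach at most 1000 - min(650, 500) = 500.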

ExecutionMemoryPool.acquireMemory

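I will not paste this method's code. It mostly consists of fairly involved rules for computing how much memory to grant, plus maintenance of the internal bookkeeping. In addition, if too little memory is currently available, it waits (on the object lock) until other tasks release some memory.
Beyond that, the most important part is its call to the maybeGrowExecutionPool function mentioned above, so that is what we focus on.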
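
For readers who want a feel for those rules, the sketch below shows one decision step. It is based on my recollection of the Spark 2.x source, where each of N active tasks is capped at 1/N of the maximum pool size and may have to wait until it holds at least 1/(2N) of the pool; since that code is not reproduced in this article, treat these bounds as an assumption. The object FairShare and the case class Decision are hypothetical names.

```scala
// Hypothetical single-step version of the grant decision (assumed 1/N cap and 1/(2N) floor).
object FairShare {
  // toGrant: bytes granted in this attempt; shouldWait: whether the caller would block instead.
  final case class Decision(toGrant: Long, shouldWait: Boolean)

  def decide(
      numBytes: Long,       // bytes requested
      curMem: Long,         // bytes the task already holds
      poolSize: Long,       // current execution pool size
      maxPoolSize: Long,    // size the pool could reach after evicting storage
      memoryFree: Long,     // free bytes in the execution pool right now
      numActiveTasks: Int): Decision = {
    require(numActiveTasks > 0)
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    // Never let the task exceed its cap, and never hand out more than is free.
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemoryPerTask - curMem))
    val toGrant = math.min(maxToGrant, memoryFree)
    // Block only if the request is not fully met and the task is still below its floor.
    val shouldWait = toGrant < numBytes && curMem + toGrant < minMemoryPerTask
    Decision(if (shouldWait) 0L else toGrant, shouldWait)
  }
}
```

With a pool of 1000 bytes and two active tasks, a task asking for 800 is granted 500 (its cap). If only 100 bytes were free and the task asked for 400, it would be below its floor of 250 and would wait.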

maybeGrowExecutionPool

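Since this function was already shown above with detailed comments, I will skip the code logic. The key call in it is storagePool.freeSpaceToShrinkPool, which implements the logic for evicting blocks from memory.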

storagePool.freeSpaceToShrinkPool

We find that it calls memoryStore.evictBlocksToFreeSpace:

def freeSpaceToShrinkPool(spaceToFree: Long): Long = lock.synchronized {
  val spaceFreedByReleasingUnusedMemory = math.min(spaceToFree, memoryFree)
  val remainingSpaceToFree = spaceToFree - spaceFreedByReleasingUnusedMemory
  if (remainingSpaceToFree > 0) {
    // If reclaiming free memory did not adequately shrink the pool, begin evicting blocks:
    val spaceFreedByEviction =
      memoryStore.evictBlocksToFreeSpace(None, remainingSpaceToFree, memoryMode)
    // When a block is released, BlockManager.dropFromMemory() calls releaseMemory(), so we do
    // not need to decrement _memoryUsed here. However, we do need to decrement the pool size.
    spaceFreedByReleasingUnusedMemory + spaceFreedByEviction
  } else {
    spaceFreedByReleasingUnusedMemory
  }
}

memoryStore.evictBlocksToFreeSpace

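This method looks long, but it boils down to one idea.
Because MemoryStore holds the actual data of every block in memory, it knows the actual size of each block and can work out which blocks need to be evicted. There are some details along the way, such as acquiring and releasing the blocks' write locks.
The code that actually releases a block from memory (in essence, dropping the reference to the block's MemoryEntry so that GC can reclaim it) is implemented in blockEvictionHandler.dropFromMemory, that is,
BlockManager.dropFromMemory.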

private[spark] def evictBlocksToFreeSpace(
    blockId: Option[BlockId],
    space: Long,
    memoryMode: MemoryMode): Long = {
  assert(space > 0)
  memoryManager.synchronized {
    var freedMemory = 0L
    val rddToAdd = blockId.flatMap(getRddId)
    val selectedBlocks = new ArrayBuffer[BlockId]
    def blockIsEvictable(blockId: BlockId, entry: MemoryEntry[_]): Boolean = {
      entry.memoryMode == memoryMode && (rddToAdd.isEmpty || rddToAdd != getRddId(blockId))
    }
    // This is synchronized to ensure that the set of entries is not changed
    // (because of getValue or getBytes) while traversing the iterator, as that
    // can lead to exceptions.
    entries.synchronized {
      val iterator = entries.entrySet().iterator()
      while (freedMemory < space && iterator.hasNext) {
        val pair = iterator.next()
        val blockId = pair.getKey
        val entry = pair.getValue
        if (blockIsEvictable(blockId, entry)) {
          // We don't want to evict blocks which are currently being read, so we need to obtain
          // an exclusive write lock on blocks which are candidates for eviction. We perform a
          // non-blocking "tryLock" here in order to ignore blocks which are locked for reading:
          // The write lock prevents evicting a block while it is being read or written
          if (blockInfoManager.lockForWriting(blockId, blocking = false).isDefined) {
            selectedBlocks += blockId
            freedMemory += pair.getValue.size
          }
        }
      }
    }

    def dropBlock[T](blockId: BlockId, entry: MemoryEntry[T]): Unit = {
      val data = entry match {
        case DeserializedMemoryEntry(values, _, _) => Left(values)
        case SerializedMemoryEntry(buffer, _, _) => Right(buffer)
      }
      // This call evicts the block from memory, spilling it to disk if that is allowed.
      // Note that the implementation of blockEvictionHandler is BlockManager
      val newEffectiveStorageLevel =
        blockEvictionHandler.dropFromMemory(blockId, () => data)(entry.classTag)
      if (newEffectiveStorageLevel.isValid) {
        // The block is still present in at least one store, so release the lock
        // but don't delete the block info
        // The write locks acquired earlier have not been released yet,
        // so release them here
        blockInfoManager.unlock(blockId)
      } else {
        // The block isn't present in any store, so delete the block info so that the
        // block can be stored again
        // The block was removed from memory and not written to disk,
        // so remove its info from the internal bookkeeping
        blockInfoManager.removeBlock(blockId)
      }
    }

    // Blocks are only dropped if the selected ones free at least the requested amount
    if (freedMemory >= space) {
      var lastSuccessfulBlock = -1
      try {
        logInfo(s"${selectedBlocks.size} blocks selected for dropping " +
          s"(${Utils.bytesToString(freedMemory)} bytes)")
        (0 until selectedBlocks.size).foreach { idx =>
          val blockId = selectedBlocks(idx)
          val entry = entries.synchronized {
            entries.get(blockId)
          }
          // This should never be null as only one task should be dropping
          // blocks and removing entries. However the check is still here for
          // future safety.
          if (entry != null) {
            dropBlock(blockId, entry)
            // This is a hook left for tests
            afterDropAction(blockId)
          }
          lastSuccessfulBlock = idx
        }
        logInfo(s"After dropping ${selectedBlocks.size} blocks, " +
          s"free memory is ${Utils.bytesToString(maxMemory - blocksMemoryUsed)}")
        freedMemory
      } finally {
        // like BlockManager.doPut, we use a finally rather than a catch to avoid having to deal
        // with InterruptedException
        // If not every block was dropped successfully, some write locks may still be held,
        // so release the write locks of the blocks that were not processed
        if (lastSuccessfulBlock != selectedBlocks.size - 1) {
          // the blocks we didn't process successfully are still locked, so we have to unlock them
          (lastSuccessfulBlock + 1 until selectedBlocks.size).foreach { idx =>
            val blockId = selectedBlocks(idx)
            blockInfoManager.unlock(blockId)
          }
        }
      }
    } else { // Not enough memory can be freed: abort and release all write locks held so far
      blockId.foreach { id =>
        logInfo(s"Will not store $id")
      }
      selectedBlocks.foreach { id =>
        blockInfoManager.unlock(id)
      }
      0L
    }
  }
}
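
Stripped of locking and logging, the selection phase is an all-or-nothing scan: walk the entries in order, accumulate sizes until the target is reached, and give up entirely if it cannot be reached. The sketch below shows just that idea; EvictionSelection and select are hypothetical names, blocks are reduced to (id, size) pairs, and every candidate is assumed to be evictable and lockable.

```scala
// Hypothetical reduction of the selection loop in evictBlocksToFreeSpace.
object EvictionSelection {
  // candidates: (blockId, sizeInBytes) in iteration order.
  // Returns the ids to evict, or an empty sequence if the target cannot be met.
  def select(candidates: Seq[(String, Long)], space: Long): Seq[String] = {
    require(space > 0)
    var freed = 0L
    val selected = Seq.newBuilder[String]
    val it = candidates.iterator
    // Stop as soon as enough space has been accumulated.
    while (freed < space && it.hasNext) {
      val (id, size) = it.next()
      selected += id
      freed += size
    }
    // All or nothing: evicting some blocks without reaching the target would be wasted work.
    if (freed >= space) selected.result() else Seq.empty
  }
}
```

Given three 40-byte blocks a, b and c, a request for 70 bytes selects a and b, while a request for 500 bytes selects nothing.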

BlockManager.dropFromMemory

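To summarize the main logic of this method:

If the storage level allows disk storage, spill the block to disk first
Remove the block from the map inside MemoryStore
Report the block update to BlockManagerMaster on the driver
Report block update statistics to the task metrics system

So, after going round and round in this big circle, the so-called memory eviction turns out to be just dropping a reference ^_^ Of course it is not really that simple. Throughout the analysis we can see that most of the work in memory management is maintaining bookkeeping figures for the memory used by tasks, and some of that logic is fairly complex, for example the calculation of how much memory to grant each task.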

private[storage] override def dropFromMemory[T: ClassTag](
    blockId: BlockId,
    data: () => Either[Array[T], ChunkedByteBuffer]): StorageLevel = {
  logInfo(s"Dropping block $blockId from memory")
  val info = blockInfoManager.assertBlockIsLockedForWriting(blockId)
  var blockIsUpdated = false
  val level = info.level

  // Drop to disk, if storage level requires
  // If the storage level allows disk storage, spill to disk first
  if (level.useDisk && !diskStore.contains(blockId)) {
    logInfo(s"Writing block $blockId to disk")
    data() match {
      case Left(elements) =>
        diskStore.put(blockId) { channel =>
          val out = Channels.newOutputStream(channel)
          serializerManager.dataSerializeStream(
            blockId,
            out,
            elements.toIterator)(info.classTag.asInstanceOf[ClassTag[T]])
        }
      case Right(bytes) =>
        diskStore.putBytes(blockId, bytes)
    }
    blockIsUpdated = true
  }

  // Actually drop from memory store
  val droppedMemorySize =
    if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
  val blockIsRemoved = memoryStore.remove(blockId)
  if (blockIsRemoved) {
    blockIsUpdated = true
  } else {
    logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
  }

  val status = getCurrentBlockStatus(blockId, info)
  if (info.tellMaster) {
    reportBlockStatus(blockId, status, droppedMemorySize)
  }
  // Report block update statistics to the task metrics system
  if (blockIsUpdated) {
    addUpdatedBlockStatusToTaskMetrics(blockId, status)
  }
  status.storageLevel
}

UnifiedMemoryManager.acquireStorageMemory

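Now let us look at requests for storage memory.
The logic for storage borrowing from execution is comparatively simple: it just adjusts the sizes of the two pools, shrinking the execution pool by some amount and growing the storage pool by the same amount.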

override def acquireStorageMemory(
    blockId: BlockId,
    numBytes: Long,
    memoryMode: MemoryMode): Boolean = synchronized {
  assertInvariants()
  assert(numBytes >= 0)
  val (executionPool, storagePool, maxMemory) = memoryMode match {
    case MemoryMode.ON_HEAP => (
      onHeapExecutionMemoryPool,
      onHeapStorageMemoryPool,
      maxOnHeapStorageMemory)
    case MemoryMode.OFF_HEAP => (
      offHeapExecutionMemoryPool,
      offHeapStorageMemoryPool,
      maxOffHeapStorageMemory)
  }
  // Execution memory in use cannot be evicted, so a request larger than the
  // maximum memory storage could ever obtain can never be satisfied
  if (numBytes > maxMemory) {
    // Fail fast if the block simply won't fit
    logInfo(s"Will not store $blockId as the required space ($numBytes bytes) exceeds our " +
      s"memory limit ($maxMemory bytes)")
    return false
  }
  // If the request exceeds the storage pool's free memory, borrow from the execution pool
  if (numBytes > storagePool.memoryFree) {
    // There is not enough free memory in the storage pool, so try to borrow free memory from
    // the execution pool.
    val memoryBorrowedFromExecution = Math.min(executionPool.memoryFree,
      numBytes - storagePool.memoryFree)
    // Borrowing from execution is simple:
    // only the sizes of the two pools change,
    // the execution pool shrinks and the storage pool grows by the same amount
    executionPool.decrementPoolSize(memoryBorrowedFromExecution)
    storagePool.incrementPoolSize(memoryBorrowedFromExecution)
  }
  // Request the memory from storagePool
  storagePool.acquireMemory(blockId, numBytes)
}
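
The borrowing step can be modelled on immutable values to see how the pool boundary moves. StorageBorrow and Pools are hypothetical names used only for this illustration; the formula for the borrowed amount is the one in the code above.

```scala
// Hypothetical model of the borrowing step in acquireStorageMemory.
object StorageBorrow {
  final case class Pools(
      executionSize: Long,
      executionUsed: Long,
      storageSize: Long,
      storageUsed: Long) {
    def executionFree: Long = executionSize - executionUsed
    def storageFree: Long = storageSize - storageUsed
  }

  // Moves capacity from execution to storage when storage's free space cannot cover numBytes.
  // Only free execution memory is lent; execution memory in use is never touched.
  def borrow(p: Pools, numBytes: Long): Pools = {
    if (numBytes > p.storageFree) {
      val borrowed = math.min(p.executionFree, numBytes - p.storageFree)
      p.copy(
        executionSize = p.executionSize - borrowed,
        storageSize = p.storageSize + borrowed)
    } else {
      p
    }
  }
}
```

Starting from an execution pool of 500 (200 used) and a storage pool of 500 (450 used), a 200-byte request borrows 150 bytes and fits exactly. A 500-byte request can borrow at most the 300 free execution bytes, leaving 350 free in storage, so the remaining 150 bytes would have to come from evicting cached blocks.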

StorageMemoryPool.acquireMemory

def acquireMemory(
    blockId: BlockId,
    numBytesToAcquire: Long,
    numBytesToFree: Long): Boolean = lock.synchronized {
  assert(numBytesToAcquire >= 0)
  assert(numBytesToFree >= 0)
  assert(memoryUsed <= poolSize)
  // First ask MemoryStore to evict some blocks to free memory
  if (numBytesToFree > 0) {
    memoryStore.evictBlocksToFreeSpace(Some(blockId), numBytesToFree, memoryMode)
  }
  // NOTE: If the memory store evicts blocks, then those evictions will synchronously call
  // back into this StorageMemoryPool in order to free memory. Therefore, these variables
  // should have been updated.
  // When the evicted blocks release their memory, BlockManager updates the bookkeeping
  // through MemoryManager, so memoryFree will have grown by this point
  val enoughMemory = numBytesToAcquire <= memoryFree
  if (enoughMemory) {
    _memoryUsed += numBytesToAcquire
  }
  enoughMemory
}

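As you can see, this also calls memoryStore.evictBlocksToFreeSpace to evict some blocks from memory and make room for the new block. The two-argument acquireMemory(blockId, numBytes) called from acquireStorageMemory is an overload that computes numBytesToFree as max(0, numBytes - memoryFree) and then delegates to this three-argument version.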

UnifiedMemoryManager.acquireUnrollMemory

There is also the request for unroll memory, which is really just a request for storage memory.

override def acquireUnrollMemory(
    blockId: BlockId,
    numBytes: Long,
    memoryMode: MemoryMode): Boolean = synchronized {
  acquireStorageMemory(blockId, numBytes, memoryMode)
}

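Conclusion

Memory management is, in essence, bookkeeping for the memory used by shuffle sorting and by RDD caching. By keeping detailed and precise records of memory usage, it avoids OOM as far as possible while keeping memory utilization as high as possible.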
