Code analysis of GPU support in Mesos, and how Capos implements GPU support

Posted 2020-12-25 yanghuahui


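This article covers how Mesos supports GPUs in its native Mesos containerizer and Docker containerizer, and how Capos, a framework built on top of Mesos, implements GPU scheduling. (Capos is Hulu's internal resource scheduling platform; see https://www.cnblogs.com/yanghuahui/p/9304302.html.)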

When the Mesos slave starts, it initializes the containerizer's resources, including cpu, mem, and gpu. This step is shared by the Mesos containerizer and the Docker containerizer.

void Slave::initialize() {
   ...
   Try<Resources> resources = Containerizer::resources(flags);
   ...
}

The call then lands in src/slave/containerizer/containerizer.cpp, which invokes the GPU allocator logic based on the mesos-slave/agent startup flags.

Try<Resources> Containerizer::resources(const Flags& flags)
{
  ...
  // GPU resource.
  Try<Resources> gpus = NvidiaGpuAllocator::resources(flags);
  if (gpus.isError()) {
    return Error("Failed to obtain GPU resources: " + gpus.error());
  }

  // When adding in the GPU resources, make sure that we filter out
  // the existing GPU resources (if any) so that we do not double
  // allocate GPUs.
  resources = gpus.get() + resources.filter(
      [](const Resource& resource) {
        return resource.name() != "gpus";
      });  
  ...
}
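The filter-then-add step above can be illustrated in isolation. The following is a minimal sketch, not the Mesos implementation: `SimpleResource` and `mergeGpuResources` are hypothetical names that stand in for the real `Resources` type, and resources are reduced to a name and a scalar value.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a Mesos scalar resource.
struct SimpleResource {
  std::string name;
  double value;
};

// Starts from the discovered GPU resources, then appends every
// configured resource whose name is not "gpus", so that GPUs are
// never counted twice.
std::vector<SimpleResource> mergeGpuResources(
    const std::vector<SimpleResource>& discoveredGpus,
    const std::vector<SimpleResource>& configured)
{
  std::vector<SimpleResource> merged = discoveredGpus;
  for (const SimpleResource& resource : configured) {
    if (resource.name != "gpus") {
      merged.push_back(resource);
    }
  }
  return merged;
}
```

With discovered GPUs `gpus:2` and configured resources `cpus:8`, `gpus:4`, `mem:1024`, the merged result keeps only the discovered `gpus:2` next to `cpus` and `mem`.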

src/slave/containerizer/mesos/isolators/gpu/allocator.cpp uses NVML, NVIDIA's GPU management library, together with the startup flags to return the GPU resources on this machine for later scheduling.

// To determine the proper number of GPU resources to return, we
// need to check both --resources and --nvidia_gpu_devices.
// There are two cases to consider:
//
//   (1) --resources includes "gpus" and --nvidia_gpu_devices is set.
//       The number of GPUs in --resources must equal the number of
//       GPUs within --nvidia_gpu_devices.
//
//   (2) --resources does not include "gpus" and --nvidia_gpu_devices
//       is not specified. Here we auto-discover GPUs using the
//       NVIDIA management Library (NVML). We special case specifying
//       `gpus:0` explicitly to not perform auto-discovery.
//
static Try<Resources> enumerateGpuResources(const Flags& flags)
{
 ...
}
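Since the body of `enumerateGpuResources` is elided above, the decision described in its comment can be sketched as follows. This is only an illustration under stated assumptions: `GpuFlags`, `GpuChoice`, and `chooseGpus` are hypothetical names, the `discovered` argument stands in for what NVML auto-discovery would report, and rejecting every flag combination outside the two listed cases is an assumption of this sketch rather than something the excerpt confirms.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of the two agent flags involved.
struct GpuFlags {
  bool hasGpusResource = false;      // does --resources mention "gpus"?
  std::size_t gpusResource = 0;      // its value, if present
  bool hasDeviceList = false;        // is --nvidia_gpu_devices set?
  std::vector<unsigned> deviceList;  // its value, if present
};

struct GpuChoice {
  bool ok;
  std::vector<unsigned> gpus;
  std::string error;
};

GpuChoice chooseGpus(const GpuFlags& flags,
                     const std::vector<unsigned>& discovered)
{
  // Special case: an explicit gpus:0 disables auto-discovery.
  if (flags.hasGpusResource && flags.gpusResource == 0 &&
      !flags.hasDeviceList) {
    return {true, {}, ""};
  }

  // Case (1): both flags are set, and the counts must agree.
  if (flags.hasGpusResource && flags.hasDeviceList) {
    if (flags.gpusResource != flags.deviceList.size()) {
      return {false, {}, "'gpus' does not match --nvidia_gpu_devices"};
    }
    return {true, flags.deviceList, ""};
  }

  // Case (2): neither flag is set, so use the discovered GPUs.
  if (!flags.hasGpusResource && !flags.hasDeviceList) {
    return {true, discovered, ""};
  }

  // Assumption of this sketch: any other combination is rejected.
  return {false, {}, "'gpus' and --nvidia_gpu_devices must be set together"};
}
```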

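Because a GPU resource has to be bound to a specific GPU card number, GPU resources are represented in the scheduling data structures as a set<Gpu>. allocator.cpp implements the allocate and deallocate interfaces: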

  Future<Nothing> allocate(const set<Gpu>& gpus)
  {
    set<Gpu> allocation = available & gpus;

    if (allocation.size() < gpus.size()) {
      return Failure(stringify(gpus - allocation) + " are not available");
    }

    available = available - allocation;
    allocated = allocated | allocation;

    return Nothing();
  }

  Future<Nothing> deallocate(const set<Gpu>& gpus)
  {
    set<Gpu> deallocation = allocated & gpus;

    if (deallocation.size() < gpus.size()) {
      return Failure(stringify(gpus - deallocation) + " are not allocated");
    }

    allocated = allocated - deallocation;
    available = available | deallocation;

    return Nothing();
  }
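The same bookkeeping can be tried outside Mesos with plain standard-library sets. This is a minimal sketch, assuming GPUs are identified by an unsigned index; `GpuPool` is a hypothetical class, and it returns `bool` where the Mesos code returns a `Future<Nothing>` or a `Failure`.

```cpp
#include <algorithm>
#include <set>

// Hypothetical pool that tracks which GPU indices are free or in use.
class GpuPool {
public:
  explicit GpuPool(const std::set<unsigned>& gpus) : available_(gpus) {}

  // Moves the requested GPUs from available to allocated. Returns false
  // and changes nothing if any requested GPU is not available.
  bool allocate(const std::set<unsigned>& gpus) {
    if (!std::includes(available_.begin(), available_.end(),
                       gpus.begin(), gpus.end())) {
      return false;
    }
    for (unsigned gpu : gpus) {
      available_.erase(gpu);
      allocated_.insert(gpu);
    }
    return true;
  }

  // Moves the given GPUs back from allocated to available. Returns false
  // and changes nothing if any of them is not currently allocated.
  bool deallocate(const std::set<unsigned>& gpus) {
    if (!std::includes(allocated_.begin(), allocated_.end(),
                       gpus.begin(), gpus.end())) {
      return false;
    }
    for (unsigned gpu : gpus) {
      allocated_.erase(gpu);
      available_.insert(gpu);
    }
    return true;
  }

  const std::set<unsigned>& available() const { return available_; }
  const std::set<unsigned>& allocated() const { return allocated_; }

private:
  std::set<unsigned> available_;
  std::set<unsigned> allocated_;
};
```

The all-or-nothing check mirrors the size comparison in the Mesos code: a request that names even one GPU outside the pool fails without moving any of the others.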

However, the upper-level wrapper that the containerizer calls only needs to specify the number of GPUs to allocate:

Future<set<Gpu>> NvidiaGpuAllocator::allocate(size_t count)
{
  // Need to disambiguate for the compiler.
  Future<set<Gpu>> (NvidiaGpuAllocatorProcess::*allocate)(size_t) =
    &NvidiaGpuAllocatorProcess::allocate;

  return process::dispatch(data->process, allocate, count);
}
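The excerpt does not show how a count is turned into concrete GPUs. One possible way is sketched below, assuming GPUs are unsigned indices and that picking the lowest free indices is acceptable; `takeGpus` is a hypothetical helper and the selection policy is an assumption of this sketch, not necessarily what Mesos does.

```cpp
#include <cstddef>
#include <set>

// Picks `count` GPUs from `available` (lowest indices first, a policy
// chosen only for this sketch), removes them from `available`, and
// stores them in `chosen`. Returns false and changes nothing if fewer
// than `count` GPUs are free.
bool takeGpus(std::set<unsigned>& available,
              std::size_t count,
              std::set<unsigned>& chosen)
{
  if (count > available.size()) {
    return false;
  }

  chosen.clear();
  auto it = available.begin();
  for (std::size_t i = 0; i < count; ++i) {
    chosen.insert(*it);
    it = available.erase(it);  // erase returns the next iterator
  }
  return true;
}
```

The caller has to remember the returned set, because releasing GPUs later requires naming them explicitly.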

deallocate, however, still has to specify explicitly which GPUs to release:

Future<Nothing> NvidiaGpuAllocator::deallocate(const set<Gpu>& gpus)
{
  return process::dispatch(
      data->process,
      &NvidiaGpuAllocatorProcess::deallocate,
      gpus);
}

If the job uses the Docker containerizer, the GPU allocation logic is called from src/slave/containerizer/docker.cpp:

Future<Nothing> DockerContainerizerProcess::allocateNvidiaGpus(
    const ContainerID& containerId,
    const size_t count)
{
  if (!nvidia.isSome()) {
    return Failure("Attempted to allocate GPUs"
                   " without Nvidia libraries available");
  }

  if (!containers_.contains(containerId)) {
    return Failure("Container is already destroyed");
  }

  return nvidia->allocator.allocate(count)
    .then(defer(
        self(),
        &Self::_allocateNvidiaGpus,
        containerId,
        lambda::_1));
}

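To summarize: the slave loads and confirms GPU resources at startup, and when a containerizer launches a container it takes GPUs from the GPU set pool maintained by the slave before starting the job.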

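So how does Capos do it? Capos is Hulu's internal resource scheduling platform (see https://www.cnblogs.com/yanghuahui/p/9304302.html). Because we implemented our own Capos containerizer for Mesos, our approach is to discover GPU resources when the Mesos slave registers, either explicitly through parameters or through an auto-detection mechanism, and then start the Mesos agent with the GPUs declared as a range in --resources. The GPUs in an offer then appear to Capos as a range, so it can schedule them much like port resources. In the Capos containerizer, one or more GPUs are bound to the Docker NVIDIA runtime according to the GPU range chosen by the scheduler, which completes GPU scheduling.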
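How a scheduled GPU range might be handed to the NVIDIA runtime can be sketched as follows. The Capos containerizer code is not shown in this article, so everything here is an assumption: `visibleDevices` is a hypothetical helper, and passing the result through the NVIDIA container runtime's `NVIDIA_VISIBLE_DEVICES` environment variable is one plausible binding mechanism, not a confirmed detail of Capos.

```cpp
#include <string>
#include <utility>
#include <vector>

// Expands inclusive GPU index ranges, for example (0,1) and (3,3), into
// a comma-separated device list such as "0,1,3". The result could be
// used as the value of NVIDIA_VISIBLE_DEVICES (an assumption of this
// sketch, see the lead-in).
std::string visibleDevices(
    const std::vector<std::pair<unsigned, unsigned>>& ranges)
{
  std::string devices;
  for (const auto& range : ranges) {
    for (unsigned gpu = range.first; gpu <= range.second; ++gpu) {
      if (!devices.empty()) {
        devices += ",";
      }
      devices += std::to_string(gpu);
    }
  }
  return devices;
}
```

For the ranges (0,1) and (3,3) the function returns "0,1,3". Treating GPUs as ranges keeps the scheduler side identical to port scheduling; only the containerizer needs to know that the numbers are device indices.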
