[k8s source code analysis][kubelet] devicemanager: using a device plugin (simulated GPU)

Posted 2021-04-24 Golang Cloud


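This article looks at how a device plugin is used in practice; the analysis of how a device plugin and kubelet work together follows afterwards.

The example is a GPU device plugin. Because the test machines have no real GPUs, the plugin fabricates a few virtual ones; as far as registration and allocation are concerned, it behaves the same way.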

2. Example

2.1 Current cluster state

[root@master kubectl]# ./kubectl get nodes
NAME          STATUS     ROLES    AGE   VERSION
172.21.0.12   NotReady   <none>   15d   v0.0.0-master+$Format:%h$
172.21.0.16   Ready      <none>   15d   v0.0.0-master+$Format:%h$
[root@master kubectl]# 
[root@master kubectl]# ./kubectl describe node 172.21.0.12
Name:               172.21.0.12
...
Capacity:
 cpu:                2
 ephemeral-storage:  51473888Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3880944Ki
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  47438335103
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3778544Ki
 pods:               110
...
[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name:               172.21.0.16
...
Capacity:
 cpu:                2
 ephemeral-storage:  51473888Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             8009720Ki
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  47438335103
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             7907320Ki
 pods:               110
...

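The focus here is on resources (Capacity and Allocatable), so unrelated output is omitted.
Capacity: the total amount of each resource on the node.
Allocatable: the amount of each resource that can be allocated to pods.
If this is not yet clear, the device manager analysis will give a clearer picture.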
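The gap between Capacity and Allocatable can be read straight off the kubectl describe output above. The sketch below only subtracts the two memory figures for each node; the helper name reservedKi is made up for illustration. Both nodes come out 100Mi short, which would be consistent with kubelet's default hard eviction threshold for memory (memory.available<100Mi), assuming no kube-reserved or system-reserved values are configured on these nodes.

```go
package main

import "fmt"

// reservedKi returns how much of a resource is withheld from pods:
// capacity minus allocatable, both expressed in Ki.
// (Hypothetical helper, for illustration only.)
func reservedKi(capacityKi, allocatableKi int64) int64 {
	return capacityKi - allocatableKi
}

func main() {
	// Memory figures copied from the kubectl describe output above.
	nodes := []struct {
		name       string
		capacityKi int64
		allocKi    int64
	}{
		{"172.21.0.12", 3880944, 3778544},
		{"172.21.0.16", 8009720, 7907320},
	}
	for _, n := range nodes {
		r := reservedKi(n.capacityKi, n.allocKi)
		fmt.Printf("%s: %dKi (%dMi) of memory is not allocatable\n", n.name, r, r/1024)
	}
}
```

Extended resources such as the simulated GPUs below are not subject to this kind of reservation, which is why their Capacity and Allocatable values are equal in the later output.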

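The output above shows that neither node in the cluster currently has any extended resources. Also note the directory /var/lib/kubelet/device-plugins, which is important:

kubelet_internal_checkpoint: stores the device manager's state; the device manager loads its data from this file when it restarts.
kubelet.sock: the device manager's server socket; each device plugin sends its registration request to this server.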

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION  kubelet_internal_checkpoint  kubelet.sock
[root@master device-plugins]# cat kubelet_internal_checkpoint 
{"Data":{"PodDeviceEntries":null,"RegisteredDevices":{}},"Checksum":3467439661}
[root@master device-plugins]#

2.2 Running the device plugin

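Since there is no real GPU, the NVIDIA code that discovers and monitors GPUs was modified. In essence that code collects the UUIDs of all GPUs on the machine and registers them with the device manager, so this article simply fabricates a few GPU UUIDs; the effect is the same.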

// k8s-device-plugin/nvidia.go
// getDevices fabricates 10 healthy devices instead of querying NVML.
func getDevices() []*pluginapi.Device {
    n := uint(10)
    var devs []*pluginapi.Device
    for i := uint(0); i < n; i++ {
        devs = append(devs, &pluginapi.Device{
            // Device IDs look like "<resourceName>-<i>", e.g. nicktming.com/gpu-0.
            ID:     fmt.Sprintf("%v-%v", resourceName, i),
            Health: pluginapi.Healthy,
        })
    }
    return devs
}

// k8s-device-plugin/main.go
// The resource name can be overridden through the "resourcename" environment variable.
newResourceName := os.Getenv("resourcename")
if newResourceName != "" {
    resourceName = newResourceName
}
serverSock = fmt.Sprintf("%v%v.sock", pluginapi.DevicePluginPath, resourceName)

// k8s-device-plugin/server.go
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    devs := m.devs
    // The env var name embeds the resource name, so several resources can coexist in one container.
    name := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES/%v", resourceName)
    ...
    for _, req := range reqs.ContainerRequests {
        response := pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                // Expose the allocated device IDs as a comma-separated list.
                name: strings.Join(req.DevicesIDs, ","),
            },
        }
        ...
    }
    ...
}
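To see what these excerpts produce without a cluster, the following is a minimal standalone sketch. It assumes nothing beyond the excerpts above: the Device type is a stand-in for pluginapi.Device, and fakeDevices and allocateEnv are hypothetical names that mirror getDevices and the Envs map built in Allocate.

```go
package main

import (
	"fmt"
	"strings"
)

// Device is a stand-in for pluginapi.Device (illustration only).
type Device struct {
	ID     string
	Health string
}

// fakeDevices mirrors getDevices: n devices named "<resourceName>-<i>".
func fakeDevices(resourceName string, n int) []Device {
	devs := make([]Device, 0, n)
	for i := 0; i < n; i++ {
		devs = append(devs, Device{
			ID:     fmt.Sprintf("%v-%v", resourceName, i),
			Health: "Healthy",
		})
	}
	return devs
}

// allocateEnv mirrors the Envs map that Allocate returns for one container:
// a single variable whose name embeds the resource name and whose value is
// the comma-separated list of allocated device IDs.
func allocateEnv(resourceName string, deviceIDs []string) map[string]string {
	name := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES/%v", resourceName)
	return map[string]string{name: strings.Join(deviceIDs, ",")}
}

func main() {
	devs := fakeDevices("nicktming.com/gpu", 10)
	// Pretend the device manager picked devices 0 and 4 for a container.
	ids := []string{devs[0].ID, devs[4].ID}
	for k, v := range allocateEnv("nicktming.com/gpu", ids) {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

Because the variable name carries the resource name, a container that requests two different simulated resources receives two distinct variables rather than one overwriting the other.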

Run:

[root@master NVIDIA]# pwd
/root/go/src/github.com/NVIDIA
[root@master NVIDIA]# git clone https://github.com/nicktming/k8s-device-plugin.git
[root@master k8s-device-plugin]# go build .
[root@master k8s-device-plugin]# export resourcename=nicktming.com/gpu
[root@master k8s-device-plugin]# ./k8s-device-plugin 
2019/10/31 16:33:43 Loading NVML
2019/10/31 16:33:43 Fetching devices.
2019/10/31 16:33:43 Starting FS watcher.
2019/10/31 16:33:43 Starting OS watcher.
2019/10/31 16:33:43 Starting to serve on /var/lib/kubelet/device-plugins/gpu.sock
2019/10/31 16:33:43 Registered device plugin with Kubelet

2.3 Checking node status

First check the node's resource information in the cluster.

[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name:               172.21.0.16
...
Capacity:
 cpu:                2
 ephemeral-storage:  51473888Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             8009720Ki
 nicktming.com/gpu: 10
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  47438335103
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             7907320Ki
 nicktming.com/gpu: 10
 pods:               110
...

The node that is running the device plugin (172.21.0.16) has registered the resource nicktming.com/gpu with kubelet's device manager, and 10 units of it are allocatable.

2.4 Requesting the resource

First request 8 GPUs.

[root@master kubectl]# ./kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
172.21.0.12   Ready    <none>   15d   v0.0.0-master+$Format:%h$
172.21.0.16   Ready    <none>   15d   v0.0.0-master+$Format:%h$
[root@master kubectl]# ./kubectl get pods --all-namespaces
No resources found.
[root@master kubectl]# cat deviceplugin/pod-gpu-8.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-8
spec:
  containers:
  - name: podtest-8
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 8
      requests:
        nicktming.com/gpu : 8
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu-8.yaml 
pod/test-gpu-8 created

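Check the status: the pod was granted 8 GPUs. It can only run on node 172.21.0.16, because that is currently the only node that has the resource nicktming.com/gpu.

In a real deployment, docker (nvidia-docker) sees the environment variable NVIDIA_VISIBLE_DEVICES set to the actual GPU UUIDs and maps the corresponding GPUs into the container.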

[root@master kubectl]# ./kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
test-gpu-8   1/1     Running   0          50s
[root@master kubectl]# ./kubectl exec -it test-gpu-8 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-2,nicktming.com/gpu-3,nicktming.com/gpu-7,nicktming.com/gpu-6,nicktming.com/gpu-1,nicktming.com/gpu-5,nicktming.com/gpu-8,nicktming.com/gpu-9
[root@master kubectl]# 
[root@master kubectl]# ./kubectl describe pods test-gpu-8 | grep -i node
Node:               172.21.0.16/172.21.0.16
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s

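Looking at /var/lib/kubelet/device-plugins, there is now an additional gpu.sock; the device manager uses it to send requests to the corresponding device plugin. (The source-code analysis later covers this in detail.)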

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION  gpu.sock  kubelet_internal_checkpoint  kubelet.sock
[root@master device-plugins]# cat kubelet_internal_checkpoint | jq .
{
  "Data": {
    "PodDeviceEntries": [
      {
        "PodUID": "94c13838-fbba-11e9-ba9e-525400d54f7e",
        "ContainerName": "podtest-8",
        "ResourceName": "nicktming.com/gpu",
        "DeviceIDs": [
          "nicktming.com/gpu-9",
          "nicktming.com/gpu-2",
          "nicktming.com/gpu-3",
          "nicktming.com/gpu-7",
          "nicktming.com/gpu-6",
          "nicktming.com/gpu-1",
          "nicktming.com/gpu-5",
          "nicktming.com/gpu-8"
        ],
        "AllocResp": "CroBChZOVklESUFfVklTSUJMRV9ERVZJQ0VTEp8Bbmlja3RtaW5nLmNvbS9ncHUtMixuaWNrdG1pbmcuY29tL2dwdS0zLG5pY2t0bWluZy5jb20vZ3B1LTcsbmlja3RtaW5nLmNvbS9ncHUtNixuaWNrdG1pbmcuY29tL2dwdS0xLG5pY2t0bWluZy5jb20vZ3B1LTUsbmlja3RtaW5nLmNvbS9ncHUtOCxuaWNrdG1pbmcuY29tL2dwdS05"
      }
    ],
    "RegisteredDevices": {
      "nicktming.com/gpu": [
        "nicktming.com/gpu-6",
        "nicktming.com/gpu-7",
        "nicktming.com/gpu-8",
        "nicktming.com/gpu-0",
        "nicktming.com/gpu-1",
        "nicktming.com/gpu-2",
        "nicktming.com/gpu-3",
        "nicktming.com/gpu-4",
        "nicktming.com/gpu-9",
        "nicktming.com/gpu-5"
      ]
    }
  },
  "Checksum": 3602853121
}

Next, create a pod that requests 3 GPUs. As expected, this pod cannot be scheduled, because only 2 GPUs are left on the node: nicktming.com/gpu-4 and nicktming.com/gpu-0.

[root@master kubectl]# cat deviceplugin/pod-gpu-3.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu-3
spec:
  containers:
  - name: podtest-3
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 3
      requests:
        nicktming.com/gpu : 3
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu-3.yaml 
pod/test-gpu-3 created
[root@master kubectl]# ./kubectl get pods 
NAME         READY   STATUS    RESTARTS   AGE
test-gpu-3   0/1     Pending   0          6s
test-gpu-8   1/1     Running   0          8m20s
[root@master kubectl]# ./kubectl describe pod test-gpu-3
Name:               test-gpu-3
...
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  24s (x2 over 24s)  default-scheduler  0/2 nodes are available: 2 Insufficient nicktming.com/gpu.

The pod stays in the Pending state and cannot be scheduled, because neither node in the cluster can satisfy its request.
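The arithmetic behind the FailedScheduling event can be sketched as follows. This is a simplified model, not kube-scheduler code: the node type and the fits and insufficient helpers are hypothetical, and only a single extended resource is considered.

```go
package main

import "fmt"

// node tracks one extended resource: what the node advertises as
// allocatable and what running pods already hold.
// (Hypothetical type, not the scheduler's own.)
type node struct {
	name        string
	allocatable int64
	requested   int64
}

// fits reports whether a pod asking for want units can be placed on n.
func fits(n node, want int64) bool {
	return n.allocatable-n.requested >= want
}

// insufficient counts the nodes that cannot satisfy the request.
func insufficient(nodes []node, want int64) int {
	count := 0
	for _, n := range nodes {
		if !fits(n, want) {
			count++
		}
	}
	return count
}

func main() {
	nodes := []node{
		{name: "172.21.0.12", allocatable: 0, requested: 0},  // no device plugin running here yet
		{name: "172.21.0.16", allocatable: 10, requested: 8}, // test-gpu-8 holds 8 devices
	}
	bad := insufficient(nodes, 3) // test-gpu-3 asks for 3
	fmt.Printf("%d/%d nodes are available: %d Insufficient nicktming.com/gpu.\n",
		len(nodes)-bad, len(nodes), bad)
}
```

With these numbers neither node has 3 free devices, so the request fails on both; a node that advertises 10 devices with none in use would pass the same check.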

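2.5 Adding the resource to the other node

Since resources are insufficient, add the resource on the other node, 172.21.0.12, by running a device plugin for the same resource there.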

[root@worker device-plugin]# ifconfig 
...
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.21.0.12  netmask 255.255.240.0  broadcast 172.21.15.255
       ...
[root@worker device-plugin]# pwd
/root/worker/device-plugin
[root@worker device-plugin]# export resourcename=nicktming.com/gpu
[root@worker device-plugin]# ls
k8s-device-plugin
[root@worker device-plugin]# ./k8s-device-plugin 
2019/10/31 17:00:42 Loading NVML
2019/10/31 17:00:42 Fetching devices.
2019/10/31 17:00:42 Starting FS watcher.
2019/10/31 17:00:42 Starting OS watcher.
2019/10/31 17:00:42 Starting to serve on /var/lib/kubelet/device-plugins/gpu.sock
2019/10/31 17:00:42 Registered device plugin with Kubelet

Check the status of node 172.21.0.12: it now has the resource nicktming.com/gpu.

[root@master kubectl]# ./kubectl describe node 172.21.0.12
Name:               172.21.0.12
...
Capacity:
 cpu:                2
 ephemeral-storage:  51473888Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3880944Ki
 nicktming.com/gpu:  10
 pods:               110
Allocatable:
 cpu:                2
 ephemeral-storage:  47438335103
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             3778544Ki
 nicktming.com/gpu:  10
 pods:               110
...

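Check the pods: test-gpu-3 is now running on 172.21.0.12. For the scheduling side, see kube-scheduler: an unschedulable pod is retried periodically, and once a node with enough free resources is found, the pod is scheduled onto it.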

[root@master kubectl]# ./kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
test-gpu-3   1/1     Running   0          10m
test-gpu-8   1/1     Running   0          18m
[root@master kubectl]# ./kubectl exec -it test-gpu-3 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-0,nicktming.com/gpu-2,nicktming.com/gpu-3
[root@master kubectl]# ./kubectl describe pod test-gpu-3 | grep -i node
Node:               172.21.0.12/172.21.0.12
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
  Warning  FailedScheduling   3m57s (x16 over 11m)  default-scheduler     0/2 nodes are available: 2 Insufficient nicktming.com/gpu.
[root@master kubectl]#

2.6 Creating another resource: rdma

[root@master k8s-device-plugin]# export resourcename=nicktming.com/rdma
[root@master k8s-device-plugin]# ./k8s-device-plugin 
2019/10/31 18:02:34 Loading NVML
2019/10/31 18:02:34 Fetching devices.
2019/10/31 18:02:34 Starting FS watcher.
2019/10/31 18:02:34 Starting OS watcher.
2019/10/31 18:02:34 Starting to serve on /var/lib/kubelet/device-plugins/rdma.sock
2019/10/31 18:02:34 Registered device plugin with Kubelet

Check the status:

[root@master kubectl]# ./kubectl describe node 172.21.0.16
Name:               172.21.0.16
...
Capacity:
 cpu:                 2
 ephemeral-storage:   51473888Ki
 hugepages-1Gi:       0
 hugepages-2Mi:       0
 memory:              8009720Ki
 nicktming.com/gpu:   10
 nicktming.com/rdma:  10
 pods:                110
Allocatable:
 cpu:                 2
 ephemeral-storage:   47438335103
 hugepages-1Gi:       0
 hugepages-2Mi:       0
 memory:              7907320Ki
 nicktming.com/gpu:   10
 nicktming.com/rdma:  10
 pods:                110
...

Now request 2 GPUs and 10 rdma devices.

[root@master kubectl]# ./kubectl get pods
NAME         READY   STATUS    RESTARTS   AGE
test-gpu-3   1/1     Running   0          82m
test-gpu-8   1/1     Running   0          90m
[root@master kubectl]# cat deviceplugin/pod-gpu2-rdma10.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu2-rdma10
spec:
  containers:
  - name: testpod-gpu2-rdma10
    image: nginx
    resources:
      limits:
        nicktming.com/gpu : 2
        nicktming.com/rdma : 10
      requests:
        nicktming.com/gpu : 2
        nicktming.com/rdma : 10
    ports:
    - containerPort: 80

[root@master kubectl]# ./kubectl apply -f deviceplugin/pod-gpu2-rdma10.yaml 
pod/test-gpu2-rdma10 created
[root@master kubectl]# ./kubectl get pods
NAME               READY   STATUS    RESTARTS   AGE
test-gpu-3         1/1     Running   0          82m
test-gpu-8         1/1     Running   0          91m
test-gpu2-rdma10   1/1     Running   0          6s
[root@master kubectl]# ./kubectl exec -it test-gpu2-rdma10 env | grep NVIDIA_VISIBLE_DEVICES
NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu=nicktming.com/gpu-0,nicktming.com/gpu-4
NVIDIA_VISIBLE_DEVICES/nicktming.com/rdma=nicktming.com/rdma-4,nicktming.com/rdma-2,nicktming.com/rdma-7,nicktming.com/rdma-3,nicktming.com/rdma-9,nicktming.com/rdma-5,nicktming.com/rdma-0,nicktming.com/rdma-1,nicktming.com/rdma-6,nicktming.com/rdma-8
[root@master kubectl]#

Check kubelet_internal_checkpoint:

[root@master device-plugins]# pwd
/var/lib/kubelet/device-plugins
[root@master device-plugins]# ls
DEPRECATION  gpu.sock  kubelet_internal_checkpoint  kubelet.sock  rdma.sock
[root@master device-plugins]# 
[root@master device-plugins]# cat kubelet_internal_checkpoint | jq .
{
  "Data": {
    "PodDeviceEntries": [
      {
        "PodUID": "94c13838-fbba-11e9-ba9e-525400d54f7e",
        "ContainerName": "podtest-8",
        "ResourceName": "nicktming.com/gpu",
        "DeviceIDs": [
          "nicktming.com/gpu-8",
          "nicktming.com/gpu-9",
          "nicktming.com/gpu-2",
          "nicktming.com/gpu-3",
          "nicktming.com/gpu-7",
          "nicktming.com/gpu-6",
          "nicktming.com/gpu-1",
          "nicktming.com/gpu-5"
        ],
        "AllocResp": "CroBChZOVklESUFfVklTSUJMRV9ERVZJQ0VTEp8Bbmlja3RtaW5nLmNvbS9ncHUtMixuaWNrdG1pbmcuY29tL2dwdS0zLG5pY2t0bWluZy5jb20vZ3B1LTcsbmlja3RtaW5nLmNvbS9ncHUtNixuaWNrdG1pbmcuY29tL2dwdS0xLG5pY2t0bWluZy5jb20vZ3B1LTUsbmlja3RtaW5nLmNvbS9ncHUtOCxuaWNrdG1pbmcuY29tL2dwdS05"
      },
      {
        "PodUID": "4d589c87-fbc7-11e9-ba9e-525400d54f7e",
        "ContainerName": "testpod-gpu2-rdma10",
        "ResourceName": "nicktming.com/rdma",
        "DeviceIDs": [
          "nicktming.com/rdma-9",
          "nicktming.com/rdma-5",
          "nicktming.com/rdma-0",
          "nicktming.com/rdma-1",
          "nicktming.com/rdma-6",
          "nicktming.com/rdma-8",
          "nicktming.com/rdma-3",
          "nicktming.com/rdma-2",
          "nicktming.com/rdma-7",
          "nicktming.com/rdma-4"
        ],
        "AllocResp": "Cv8BCilOVklESUFfVklTSUJMRV9ERVZJQ0VTL25pY2t0bWluZy5jb20vcmRtYRLRAW5pY2t0bWluZy5jb20vcmRtYS00LG5pY2t0bWluZy5jb20vcmRtYS0yLG5pY2t0bWluZy5jb20vcmRtYS03LG5pY2t0bWluZy5jb20vcmRtYS0zLG5pY2t0bWluZy5jb20vcmRtYS05LG5pY2t0bWluZy5jb20vcmRtYS01LG5pY2t0bWluZy5jb20vcmRtYS0wLG5pY2t0bWluZy5jb20vcmRtYS0xLG5pY2t0bWluZy5jb20vcmRtYS02LG5pY2t0bWluZy5jb20vcmRtYS04"
      },
      {
        "PodUID": "4d589c87-fbc7-11e9-ba9e-525400d54f7e",
        "ContainerName": "testpod-gpu2-rdma10",
        "ResourceName": "nicktming.com/gpu",
        "DeviceIDs": [
          "nicktming.com/gpu-0",
          "nicktming.com/gpu-4"
        ],
        "AllocResp": "ClMKKE5WSURJQV9WSVNJQkxFX0RFVklDRVMvbmlja3RtaW5nLmNvbS9ncHUSJ25pY2t0bWluZy5jb20vZ3B1LTAsbmlja3RtaW5nLmNvbS9ncHUtNA=="
      }
    ],
    "RegisteredDevices": {
      "nicktming.com/gpu": [
        "nicktming.com/gpu-0",
        "nicktming.com/gpu-4",
        "nicktming.com/gpu-9",
        "nicktming.com/gpu-7",
        "nicktming.com/gpu-8",
        "nicktming.com/gpu-1",
        "nicktming.com/gpu-2",
        "nicktming.com/gpu-3",
        "nicktming.com/gpu-5",
        "nicktming.com/gpu-6"
      ],
      "nicktming.com/rdma": [
        "nicktming.com/rdma-0",
        "nicktming.com/rdma-1",
        "nicktming.com/rdma-2",
        "nicktming.com/rdma-7",
        "nicktming.com/rdma-8",
        "nicktming.com/rdma-9",
        "nicktming.com/rdma-3",
        "nicktming.com/rdma-4",
        "nicktming.com/rdma-5",
        "nicktming.com/rdma-6"
      ]
    }
  },
  "Checksum": 3285376913
}
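The AllocResp strings in the checkpoint above are opaque at first sight. As far as I can tell, in this kubelet version they are the plugin's ContainerAllocateResponse serialized as protobuf and then base64-encoded by the JSON encoder. Under that assumption, the sketch below decodes the envs map (field 1 of the message) by hand with the standard library only; readBytesField and decodeEnvs are hypothetical helpers, and a real tool would use the generated pluginapi types instead.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// readBytesField reads one length-delimited protobuf field (wire type 2)
// from buf and returns its field number, its payload, and the remaining bytes.
func readBytesField(buf []byte) (uint64, []byte, []byte, error) {
	tag, n := binary.Uvarint(buf)
	if n <= 0 || tag&7 != 2 {
		return 0, nil, nil, errors.New("expected a length-delimited field")
	}
	buf = buf[n:]
	size, n := binary.Uvarint(buf)
	if n <= 0 || uint64(len(buf)-n) < size {
		return 0, nil, nil, errors.New("truncated field")
	}
	end := n + int(size)
	return tag >> 3, buf[n:end], buf[end:], nil
}

// decodeEnvs extracts the envs map (field 1) from a base64-encoded,
// protobuf-serialized ContainerAllocateResponse.
func decodeEnvs(allocResp string) (map[string]string, error) {
	raw, err := base64.StdEncoding.DecodeString(allocResp)
	if err != nil {
		return nil, err
	}
	envs := map[string]string{}
	for len(raw) > 0 {
		num, entry, rest, err := readBytesField(raw)
		if err != nil {
			return nil, err
		}
		raw = rest
		if num != 1 {
			continue // other fields (mounts, devices, ...) are not needed here
		}
		// A map entry is itself a small message: key = field 1, value = field 2.
		var key, value string
		for len(entry) > 0 {
			f, payload, remaining, err := readBytesField(entry)
			if err != nil {
				return nil, err
			}
			entry = remaining
			switch f {
			case 1:
				key = string(payload)
			case 2:
				value = string(payload)
			}
		}
		envs[key] = value
	}
	return envs, nil
}

func main() {
	// AllocResp of the 2-GPU entry for testpod-gpu2-rdma10 in the checkpoint above.
	const allocResp = "ClMKKE5WSURJQV9WSVNJQkxFX0RFVklDRVMvbmlja3RtaW5nLmNvbS9ncHUSJ25pY2t0bWluZy5jb20vZ3B1LTAsbmlja3RtaW5nLmNvbS9ncHUtNA=="
	envs, err := decodeEnvs(allocResp)
	if err != nil {
		panic(err)
	}
	for k, v := range envs {
		fmt.Printf("%s=%s\n", k, v)
	}
}
```

The decoded pair matches the NVIDIA_VISIBLE_DEVICES/nicktming.com/gpu line printed by kubectl exec for test-gpu2-rdma10, which suggests the checkpoint keeps the plugin's response so that it can be replayed after a kubelet restart.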

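3. Summary

By now it should be fairly clear how a device plugin is used. What actually happens under the hood is analyzed in the follow-up source-code articles; the examples here are preparation for that analysis.
The working mechanism between a device plugin and the device manager will be analyzed in two parts:
1. How a device plugin registers its resources with the device manager.
2. How a pod requests resources.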
