Monitoring GPU performance with nvidia_gpu_exporter, Prometheus, and Grafana
Posted 2022-05-16 boxrice
Project: GitHub - utkuozdemir/nvidia_gpu_exporter: Nvidia GPU exporter for prometheus using nvidia-smi binary
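The nvidia_gpu_exporter project on GitHub makes it possible to monitor GPUs in Grafana. However, the image the project provides, utkuozdemir/nvidia_gpu_exporter:0.3.0, only runs on Ubuntu. On CentOS its log reports that GPU information cannot be obtained, so it cannot be connected to the Prometheus instance in k8s.
The approach used here is to download the nvidia_gpu_exporter executable to the CentOS host and run it as a system service, which exposes the GPU metrics. Then, in k8s, create an Endpoints object, a Service, and a ServiceMonitor so that Prometheus collects the gpu-metrics, and finally visualize them in Grafana. The steps are as follows: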
1 Create the nvidia_gpu_exporter service on CentOS
Install nvidia_gpu_exporter
At this point the gpu-metrics of this GPU server can be viewed in a web page, as shown below
The GPU-related information is visible
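The same check can be done without a browser. The following is a minimal sketch, assuming the exporter serves the Prometheus text format at the path /metrics on port 9835 (the port comes from the service log in this article; the path and the sample metric names are assumptions and may differ between exporter versions). `parse_metrics` and `fetch_metrics` are hypothetical helper names.

```python
import re
import urllib.request

# Matches key="value" pairs inside the {...} label block; values may contain spaces.
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            open_idx = line.index("{")
            close_idx = line.rindex("}")
            name = line[:open_idx]
            labels = dict(_LABEL_RE.findall(line[open_idx + 1:close_idx]))
            value = line[close_idx + 1:].split()[0]
        else:
            name, value = line.split()[:2]
            labels = {}
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(host, port=9835, path="/metrics", timeout=5):
    """Download the raw metrics text (path is an assumed default)."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Illustrative sample only; real metric names depend on the exporter version.
    sample = (
        "# TYPE nvidia_smi_temperature_gpu gauge\n"
        'nvidia_smi_temperature_gpu{uuid="GPU-example"} 45\n'
    )
    for name, labels, value in parse_metrics(sample):
        print(name, labels, value)
```

To run it against a real host, call `parse_metrics(fetch_metrics("10.1.12.17"))` from a machine that can reach the GPU server.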
Create the nvidia_gpu_exporter service

[Unit]
Description=Nvidia GPU Exporter
After=network-online.target

[Service]
Type=simple
User=nvidia_gpu_exporter
Group=nvidia_gpu_exporter
ExecStart=/usr/local/bin/nvidia_gpu_exporter
SyslogIdentifier=nvidia_gpu_exporter
Restart=always
RestartSec=1
NoNewPrivileges=yes
ProtectHome=yes
ProtectSystem=strict
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=yes
ProtectHostname=yes
ProtectKernelLogs=yes
ProtectProc=yes

[Install]
WantedBy=multi-user.target

[root@k8s-gpu4 ~]
[root@k8s-gpu4 ~]
[root@k8s-gpu4 ~]
● nvidia_gpu_exporter.service - Nvidia GPU Exporter
   Loaded: loaded (/etc/systemd/system/nvidia_gpu_exporter.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2022-05-13 17:36:03 CST; 5s ago
 Main PID: 80178 (nvidia_gpu_expo)
    Tasks: 6
   Memory: 5.6M
   CGroup: /system.slice/nvidia_gpu_exporter.service
           └─80178 /usr/local/bin/nvidia_gpu_exporter

May 13 17:36:03 k8s-gpu4 systemd[1]: Started Nvidia GPU Exporter.
May 13 17:36:04 k8s-gpu4 nvidia_gpu_exporter[80178]: ts=2022-05-13T09:36:04.005Z caller=main.go:68 level=info msg="Listening on add...=:9835
May 13 17:36:04 k8s-gpu4 nvidia_gpu_exporter[80178]: ts=2022-05-13T09:36:04.006Z caller=tls_config.go:195 level=info msg="TLS is di...=false
Hint: Some lines were ellipsized, use -l to show in full.

The service started successfully; check it in the web page
2 Create the Endpoints, Service, and ServiceMonitor in k8s
Create the Endpoints

apiVersion: v1
kind: Endpoints
metadata:
  name: nvidia-gpu-exporter
  namespace: monitoring
subsets:
- addresses:
  - ip: 10.1.12.17
  ports:
  - name: http
    port: 9835
    protocol: TCP

The ip above is the address of the GPU server. If there are several GPU servers, add them underneath, for example:

  - ip: *.*.*.*
  - ip: *.*.*.*
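When the list of GPU servers grows, the manifest can be generated instead of edited by hand. The following is a minimal sketch, assuming every server exposes the exporter on the same port 9835. `build_endpoints` is a hypothetical helper, and the second IP in the example is made up for illustration. The output is JSON, which kubectl accepts in place of YAML.

```python
import ipaddress
import json


def build_endpoints(ips, name="nvidia-gpu-exporter", namespace="monitoring", port=9835):
    """Build an Endpoints manifest (as a dict) listing every GPU server IP."""
    if not ips:
        raise ValueError("at least one GPU server IP is required")
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return {
        "apiVersion": "v1",
        "kind": "Endpoints",
        "metadata": {"name": name, "namespace": namespace},
        "subsets": [
            {
                "addresses": [{"ip": ip} for ip in ips],
                "ports": [{"name": "http", "port": port, "protocol": "TCP"}],
            }
        ],
    }


if __name__ == "__main__":
    # 10.1.12.17 is the server from this article; 10.1.12.18 is a hypothetical second host.
    manifest = build_endpoints(["10.1.12.17", "10.1.12.18"])
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and passed to `kubectl create -f`.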
endpoints/nvidia-gpu-exporter created

NAME                  ENDPOINTS         AGE
nvidia-gpu-exporter   10.1.12.17:9835   39s

Name:         nvidia-gpu-exporter
Namespace:    monitoring
Labels:       <none>
Annotations:  <none>
Subsets:
  Addresses:          10.1.12.17
  NotReadyAddresses:  <none>
  Ports:
    Name  Port  Protocol
    ----  ----  --------
    http  9835  TCP

Events:  <none>

Create the Service

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nvidia-gpu-exporter
  name: nvidia-gpu-exporter
  namespace: monitoring
spec:
  ports:
  - name: http
    protocol: TCP
    port: 9835
    targetPort: http
  type: ClusterIP

service "nvidia-gpu-exporter" deleted
kubectl create -f gpu-exporter-svc.yaml
service/nvidia-gpu-exporter created

NAME                  TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
nvidia-gpu-exporter   ClusterIP   10.10.75.226   <none>        9835/TCP   12s

Name:              nvidia-gpu-exporter
Namespace:         monitoring
Labels:            app=nvidia-gpu-exporter
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.10.235.70
Port:              http  9835/TCP
TargetPort:        http/TCP
Endpoints:         10.1.12.17:9835
Session Affinity:  None
Events:            <none>

The Endpoints value shown above must be the IP and port from the Endpoints object created earlier.
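The Service has no selector, so Kubernetes pairs it with the manually created Endpoints object by name and namespace. The following is a minimal sketch of that consistency check on the two manifests loaded as dicts. `check_service_endpoints` is a hypothetical helper, and the port-name rule it applies is my understanding of how selector-less Services resolve ports, not something stated by the project.

```python
def check_service_endpoints(service, endpoints):
    """Return a list of problems that would stop the Service from using the Endpoints."""
    problems = []
    # A selector-less Service is matched to an Endpoints object with the same
    # name in the same namespace.
    for key in ("name", "namespace"):
        if service["metadata"].get(key) != endpoints["metadata"].get(key):
            problems.append(f"metadata.{key} differs between Service and Endpoints")
    # With a selector set, Kubernetes manages the Endpoints itself.
    if service.get("spec", {}).get("selector"):
        problems.append("Service defines a selector; manual Endpoints would be overwritten")
    endpoint_ports = {
        port.get("name")
        for subset in endpoints.get("subsets", [])
        for port in subset.get("ports", [])
    }
    for port in service.get("spec", {}).get("ports", []):
        if port.get("name") not in endpoint_ports:
            problems.append(f"Service port {port.get('name')!r} has no matching Endpoints port name")
    return problems


if __name__ == "__main__":
    meta = {"name": "nvidia-gpu-exporter", "namespace": "monitoring"}
    svc = {"metadata": dict(meta), "spec": {"ports": [{"name": "http", "port": 9835}]}}
    eps = {
        "metadata": dict(meta),
        "subsets": [{"addresses": [{"ip": "10.1.12.17"}], "ports": [{"name": "http", "port": 9835}]}],
    }
    print(check_service_endpoints(svc, eps) or "Service and Endpoints line up")
```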
Create the ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app: nvidia-gpu-exporter
  name: nvidia-gpu-exporter
  namespace: monitoring
spec:
  endpoints:
  - interval: 30s
    port: http
  jobLabel: app
  selector:
    matchLabels:
      app: nvidia-gpu-exporter

kubectl create -f gpu-exporter-serviceMonitor.yaml
servicemonitor.monitoring.coreos.com/nvidia-gpu-exporter created

[root@k8s-master dongtai]
NAME                  AGE
nvidia-gpu-exporter   12s

Name:         nvidia-gpu-exporter
Namespace:    monitoring
Labels:       app=nvidia-gpu-exporter
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         ServiceMonitor
Metadata:
  Creation Timestamp:  2022-05-13T09:50:35Z
  Generation:          1
  Managed Fields:
    API Version:  monitoring.coreos.com/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:app:
      f:spec:
        .:
        f:endpoints:
        f:jobLabel:
        f:selector:
          .:
          f:matchLabels:
            .:
            f:app:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2022-05-13T09:50:35Z
  Resource Version:  14080381
  Self Link:         /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/nvidia-gpu-exporter
  UID:               7fdb365b-8bcd-4fc2-9772-9ad7de6155bf
Spec:
  Endpoints:
    Interval:  30s
    Port:      http
  Job Label:   app
  Selector:
    Match Labels:
      App:  nvidia-gpu-exporter
Events:         <none>

Verify in the Prometheus page
On the Targets page of Prometheus, look for nvidia_gpu_exporter
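If the target does not appear, one thing worth checking is whether the ServiceMonitor actually selects the Service: every label in `selector.matchLabels` has to be present on the Service, and each ServiceMonitor endpoint `port` has to name a port the Service defines. The following is a minimal sketch of that check on manifests loaded as dicts. `check_servicemonitor` is a hypothetical helper and covers only these two rules.

```python
def check_servicemonitor(service, monitor):
    """Return a list of reasons the ServiceMonitor would not scrape the Service."""
    problems = []
    service_labels = service["metadata"].get("labels", {})
    wanted = monitor["spec"].get("selector", {}).get("matchLabels", {})
    for key, value in wanted.items():
        if service_labels.get(key) != value:
            problems.append(f"Service is missing label {key}={value}")
    service_ports = {port.get("name") for port in service["spec"].get("ports", [])}
    for endpoint in monitor["spec"].get("endpoints", []):
        if endpoint.get("port") not in service_ports:
            problems.append(f"ServiceMonitor port {endpoint.get('port')!r} is not a Service port name")
    return problems


if __name__ == "__main__":
    svc = {
        "metadata": {"labels": {"app": "nvidia-gpu-exporter"}},
        "spec": {"ports": [{"name": "http", "port": 9835}]},
    }
    mon = {
        "spec": {
            "endpoints": [{"interval": "30s", "port": "http"}],
            "selector": {"matchLabels": {"app": "nvidia-gpu-exporter"}},
        }
    }
    print(check_servicemonitor(svc, mon) or "ServiceMonitor selects the Service")
```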
On the Graph page, search for nvidia
The search results show that this GPU server has two 3090 GPUs
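The GPU count can also be read from the JSON that the Prometheus HTTP API returns for an instant query (`/api/v1/query`). The following is a minimal sketch that counts distinct GPUs per scrape target in such a response. It assumes each series carries a `uuid` label identifying the GPU, which is an assumption about this exporter's labels. `count_gpus` is a hypothetical helper and the sample response is fabricated for illustration, not captured from the cluster in this article.

```python
def count_gpus(response, label="uuid"):
    """Count distinct GPUs per instance in a Prometheus instant-query response."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    gpus_by_instance = {}
    for series in response["data"]["result"]:
        metric = series["metric"]
        instance = metric.get("instance", "unknown")
        gpus_by_instance.setdefault(instance, set()).add(metric.get(label))
    return {instance: len(ids) for instance, ids in gpus_by_instance.items()}


if __name__ == "__main__":
    # Illustrative response shape only; metric name and label values are made up.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [
                {"metric": {"__name__": "example_gpu_info", "instance": "10.1.12.17:9835", "uuid": "GPU-a"},
                 "value": [1652435000, "1"]},
                {"metric": {"__name__": "example_gpu_info", "instance": "10.1.12.17:9835", "uuid": "GPU-b"},
                 "value": [1652435000, "1"]},
            ],
        },
    }
    print(count_gpus(sample))
```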
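3 Create the GPU monitoring dashboard in Grafana
Import the official JSON file into Grafana
Importing the official JSON file produces an error because the file's configuration has a problem, so it needs to be modified.
Click the top-right corner to edit
Click Variables, then click gpu
Change the Query as follows. After the change, the IP of the GPU server is returned. Finally, click Update
After returning to the dashboard, the result looks like this:
In the end, the GPU performance metrics are displayed clearly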