Why does nvprof not have metrics for floating-point division operations?
【Posted】: 2019-08-30 02:18:29
【Question】: When I use nvprof to measure the floating-point operations of my sample kernels, there appears to be no flop_count_dp_div metric. Instead, the double-precision division shows up as double-precision add/mul/fma operations, and even as some single-precision fma operations.
I would like to know why that is, and how I can infer a kernel's dynamic division count from an nvprof report when I do not have the source code.
My simple test kernels:
#include <iostream>

// y[i] = a * x[i]
__global__ void mul(double a, double* x, double* y)
{
    y[threadIdx.x] = a * x[threadIdx.x];
}

// y[i] = a / x[i]
__global__ void div(double a, double* x, double* y)
{
    y[threadIdx.x] = a / x[threadIdx.x];
}

int main(int argc, char* argv[])
{
    const int kDataLen = 4;

    double a = 2.0;
    double host_x[kDataLen] = {1.0, 2.0, 3.0, 4.0};
    double host_y[kDataLen];

    // Copy input data to device.
    double* device_x;
    double* device_y;
    cudaMalloc(&device_x, kDataLen * sizeof(double));
    cudaMalloc(&device_y, kDataLen * sizeof(double));
    cudaMemcpy(device_x, host_x, kDataLen * sizeof(double),
               cudaMemcpyHostToDevice);

    // Launch the kernels. Both write to device_y, so the printed values come from div.
    mul<<<1, kDataLen>>>(a, device_x, device_y);
    div<<<1, kDataLen>>>(a, device_x, device_y);

    // Copy output data to host.
    cudaDeviceSynchronize();
    cudaMemcpy(host_y, device_y, kDataLen * sizeof(double),
               cudaMemcpyDeviceToHost);

    // Print the results.
    for (int i = 0; i < kDataLen; ++i)
    {
        std::cout << "y[" << i << "] = " << host_y[i] << "\n";
    }

    cudaDeviceReset();
    return 0;
}
And the nvprof output for the two kernels:
nvprof --metrics flop_count_sp \
--metrics flop_count_sp_add \
--metrics flop_count_sp_mul \
--metrics flop_count_sp_fma \
--metrics flop_count_sp_special \
--metrics flop_count_dp \
--metrics flop_count_dp_add \
--metrics flop_count_dp_mul \
--metrics flop_count_dp_fma \
./a.out
==14380== NVPROF is profiling process 14380, command: ./a.out
==14380== Some kernel(s) will be replayed on device 0 in order to collect all events/metrics.
Replaying kernel "mul(double, double*, double*)" (done)
Replaying kernel "div(double, double*, double*)" (done)
y[0] = 2
y[1] = 1
y[2] = 0.666667
y[3] = 0.5
==14380== Profiling application: ./a.out
==14380== Profiling result:
==14380== Metric result:
Invocations  Metric Name            Metric Description                                   Min  Max  Avg
Device "GeForce GTX 1080 Ti (0)"
    Kernel: mul(double, double*, double*)
          1  flop_count_sp          Floating Point Operations(Single Precision)            0    0    0
          1  flop_count_sp_add      Floating Point Operations(Single Precision Add)        0    0    0
          1  flop_count_sp_mul      Floating Point Operation(Single Precision Mul)         0    0    0
          1  flop_count_sp_fma      Floating Point Operations(Single Precision FMA)        0    0    0
          1  flop_count_sp_special  Floating Point Operations(Single Precision Special)    0    0    0
          1  flop_count_dp          Floating Point Operations(Double Precision)            4    4    4
          1  flop_count_dp_add      Floating Point Operations(Double Precision Add)        0    0    0
          1  flop_count_dp_mul      Floating Point Operations(Double Precision Mul)        4    4    4
          1  flop_count_dp_fma      Floating Point Operations(Double Precision FMA)        0    0    0
    Kernel: div(double, double*, double*)
          1  flop_count_sp          Floating Point Operations(Single Precision)            8    8    8
          1  flop_count_sp_add      Floating Point Operations(Single Precision Add)        0    0    0
          1  flop_count_sp_mul      Floating Point Operation(Single Precision Mul)         0    0    0
          1  flop_count_sp_fma      Floating Point Operations(Single Precision FMA)        4    4    4
          1  flop_count_sp_special  Floating Point Operations(Single Precision Special)    4    4    4
          1  flop_count_dp          Floating Point Operations(Double Precision)           44   44   44
          1  flop_count_dp_add      Floating Point Operations(Double Precision Add)        0    0    0
          1  flop_count_dp_mul      Floating Point Operations(Double Precision Mul)        4    4    4
          1  flop_count_dp_fma      Floating Point Operations(Double Precision FMA)       20   20   20
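【Answer 1】:
"It seems there is no metric for flop_count_dp_div"
Because CUDA hardware has no floating-point division instruction.
"Instead, the double-precision division shows up as double-precision add/mul/fma operations, and even as some single-precision fma operations"
Because floating-point division is implemented with Newton-Raphson iteration, which uses multiply-add and multiply operations. The implementation may even be mixed precision (hence the single-precision operations).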
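To make the Newton-Raphson point concrete, here is a minimal host-side C++ sketch of a division built only from multiplies and FMAs. It is not the routine the CUDA compiler actually emits. The helper name nr_divide, the single-precision seed (standing in for a hardware reciprocal approximation), and the fixed iteration count are all assumptions for illustration, and special cases (zero, infinities, NaN, subnormals, values outside the float range) are ignored.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: approximates a / b using only multiply and FMA
// operations after an initial low-precision reciprocal estimate.
double nr_divide(double a, double b) {
    // This sketch does not handle zero, infinite, or NaN divisors.
    assert(b != 0.0 && std::isfinite(b));

    // Seed: a single-precision reciprocal, standing in for a hardware
    // reciprocal approximation (an assumption, not the real GPU sequence).
    double r = static_cast<double>(1.0f / static_cast<float>(b));

    // Newton-Raphson refinement of r ~ 1/b: r <- r + r * (1 - b * r).
    // Each step roughly doubles the number of correct bits.
    for (int i = 0; i < 3; ++i) {
        double e = std::fma(-b, r, 1.0);  // residual 1 - b*r
        r = std::fma(r, e, r);            // r + r*e
    }

    double q = a * r;                     // initial quotient estimate
    double rem = std::fma(-b, q, a);      // remainder a - b*q
    return std::fma(rem, r, q);           // quotient corrected by rem * (1/b)
}
```

Calling nr_divide(2.0, 3.0) should agree with 2.0 / 3.0 to within about one unit in the last place. The whole division is expressed as one multiply plus a handful of FMAs, which is the kind of instruction mix the profiler reports for the div kernel.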
"If I do not have the source code, how can I infer a kernel's dynamic division count from an nvprof report?"
You really can't.
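The numbers in the report illustrate why. The aggregate metrics are consistent with add + mul + 2 * fma (each FMA counted as two operations, special operations excluded). The div kernel's 44 double-precision operations are therefore just 4 multiplies and 20 FMAs spread over 4 threads, and nothing in those counters marks them as belonging to a division. In a kernel that also performs ordinary multiplies and FMAs, the two sources would be indistinguishable. The C++ sketch below reproduces that accounting; the FlopCounts struct and the constant names are hypothetical, and the values are copied from the report in the question.

```cpp
#include <cstdint>

// Hypothetical container for the per-kernel counters reported by nvprof.
struct FlopCounts {
    std::uint64_t add;
    std::uint64_t mul;
    std::uint64_t fma;
};

// Aggregate count as it appears in the report: each FMA counts as two operations.
constexpr std::uint64_t total_flops(const FlopCounts& c) {
    return c.add + c.mul + 2 * c.fma;
}

// Values copied from the nvprof output in the question (4 threads per kernel).
constexpr FlopCounts kMulDp{0, 4, 0};   // mul kernel, double precision
constexpr FlopCounts kDivDp{0, 4, 20};  // div kernel, double precision
constexpr FlopCounts kDivSp{0, 0, 4};   // div kernel, single precision

static_assert(total_flops(kMulDp) == 4,  "matches flop_count_dp for mul");
static_assert(total_flops(kDivDp) == 44, "matches flop_count_dp for div");
static_assert(total_flops(kDivSp) == 8,  "matches flop_count_sp for div");
```

Per thread, this particular build spends 1 double-precision multiply, 5 double-precision FMAs, 1 single-precision FMA, and 1 single-precision special operation on the division. That footprint is an observation from this one report, not a documented constant, so it cannot be used to back out division counts in general.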