为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出?
Posted
技术标签:
【中文标题】为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出?【英文标题】:Why does the same OpenCL code have different outputs from Intel Xeon CPU and NVIDIA GTX 1080 Ti GPU?为什么相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出? 【发布时间】:2019-12-31 02:41:15 【问题描述】:我正在尝试使用 OpenCL 并行化 Monte Carlo 模拟。我使用 MWC64X 作为统一随机数生成器。该代码在不同的 Intel CPU 上运行良好,因为并行计算的输出非常接近顺序计算的输出。
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
但是,当我在安装了 NVIDIA-SMI 430.40 和 CUDA 10.1 和 OpenCL 1.2 CUDA 的 GeForce GTX 1080 Ti 上运行代码时,输出如下:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
“影响”值等于sum*1.0/10000
,因此并行影响仅由 1 和 0 组成,这是不正确的(在 GPU 运行中)并且在 Intel CPU 上并行化时不会发生。
当我检查随机数生成器if(flag==0) printf("randint=%u",randint);
的输出时,GPU 上的输出似乎全为零。下面是 clinfo
和 .cl
代码:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct uint x; uint c; mwc64x_state_t;
enum MWC64X_A = 4294883355U ;
enum MWC64X_M = 18446383549859758079UL ;
void MWC64X_Step(mwc64x_state_t *s)
uint X=s->x, C=s->c;
uint Xn=MWC64X_A*X+C;
uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
uint Cn=mad_hi(MWC64X_A,X,carry);
s->x=Xn;
s->c=Cn;
//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
uint res=s->x ^ s->c;
MWC64X_Step(s);
return res;
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence)
int flag=get_global_id(0);
int sum=0;
int count=10000;
int assignment[N];
//or try to get newlambda like original version does
if(flag < literals)
mwc64x_state_t rng;
for(int i=0; i<count; i++)
for(int j=0; j<size; j++)
uint randint=MWC64X_NextUint(&rng);
float rand=randint*1.0/BASE;
//if(flag==0)
// printf("randint=%u",randint);
if(lambdap[j]<rand)
assignment[lambdas[j]]=0;
else
assignment[lambdas[j]]=1;
//the true case
assignment[flag]=1;
int valuet=0;
int index=0;
for(int m=0; m<dim1_size; m++)
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++)
if(assignment[lambdas[index+n]]==0)
valueMono=0;
index+=dim2_size[m];
break;
if(valueMono==1)
valuet=1;
break;
//the false case
assignment[flag]=0;
int valuef=0;
index=0;
for(int m=0; m<dim1_size; m++)
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++)
if(assignment[lambdas[index+n]]==0)
valueMono=0;
index+=dim2_size[m];
break;
if(valueMono==1)
valuef=1;
break;
sum += valuet-valuef;
influence[flag] = 1.0*sum/count;
printf("sum%d:%d\t", flag, sum);
在 GPU 上运行代码时可能会出现什么问题?是MWC64X吗?据其作者称,它可以在 NVIDIA GPU 上运行良好。如果是这样,我该如何解决它;如果不是,可能是什么问题?
【问题讨论】:
你没有初始化你的mwc64x_state_t rng
变量,所以任何结果都是未定义的。请注意,您可能希望为每个工作项设置不同的 RNG 种子,否则您的结果中会出现令人讨厌的相关伪影。
@pmdj 谢谢。初始化后,CPU和GPU上的结果都是正确的。
【参考方案1】:
(这开始是一个评论,事实证明这是问题的根源,所以我把它变成了一个答案。)
您没有在读取 mwc64x_state_t rng;
变量之前对其进行初始化,因此任何结果都将是未定义的:
mwc64x_state_t rng;
for(int i=0; i<count; i++)
for(int j=0; j<size; j++)
uint randint=MWC64X_NextUint(&rng);
MWC64X_NextUint()
在更新之前立即从 rng 状态读取:
uint MWC64X_NextUint(mwc64x_state_t *s)
uint res=s->x ^ s->c;
请注意,您可能希望为每个工作项设置不同的 RNG 种子,否则您的结果中会出现令人讨厌的相关伪影。
【讨论】:
【参考方案2】:伪随机数的所有用例都是真正的-[PARALLEL]
计算平台(不是语言、平台)中的下一级挑战。
或者,存在一些 随机源,一旦大规模并行请求在真正的[PARALLEL]
时尚(在这里,硬件资源可能会有所帮助,但代价是无法在同一平台“外部”重现相同的行为(如果这样的来源不是软件的话,也是即时的)使用一些种子注入功能进行操作,这可能会设置“just”-pseudo-random 算法,该算法创建一个纯-[SERIAL]
sequence-of-produced “just”-pseudo-random numbers))
或者,有一些伪随机数的“共享”生成器,它享有更高层次的全系统水平-熵(这有利于产生伪随机性的“质量”),但代价是纯串行依赖(不可能并行执行,串行序列以顺序方式一个接一个地得到服务)并且几乎为零的机会可重复运行(可重复科学的必需品)提供可重复的相同序列,这是测试和方法验证案例所需的。
恢复:
代码可以使用工作项-“私有”伪随机生成函数(为了并行代码执行和生成这些伪随机数的相互独立性(非干预过程)),但每个实例都必须a)独立初始化,以提供可实现的预期随机性水平并行代码运行和 b) 任何 此类初始化都应该以可重复的方式执行,以便在不同时间运行测试,通常使用不同的 OpenCL 目标计算-平台。
对于__kernel
-s,不依赖于特定于硬件的随机源,满足条件 a && b 将足以接收可重复再现(相同)的测试结果“体外”,从而为在通用生产级用例代码“体内”运行期间生成结果提供了一种合理的随机方法。
net-run-times 的比较(上面的基准)似乎表明Amdahl's law add-on overhead costs 加上工作原子性的尾部效应最终决定了 net-run-time 是@987654326 @ 在 XEON 上比 GPU 更快:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
|....
index1 = 17 |....
size = 51 |....
dim1_size = 6 |....
sum0: 10000 |....
sum1: 10000 |....
sum2: 10000 |....
sum3: 10000 |....
sum4: 10000 |....
sum5: 0 |....
sum6: 0 |....
sum7: 0 |....
sum8: 10000 |....
sum9: 0 |....
sum10: 0 |....
sum11: 0 |....
sum12: 0 |....
sum13: 0 |....
sum14: 0 |....
sum15: 0 |....
sum16: 0 |....
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL
【讨论】:
以上是关于为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出?的主要内容,如果未能解决你的问题,请参考以下文章
一台机器上的 OpenCL Nvidia 和 Intel 平台
带有 OpenCL 的 Intel HD 6000 本地内存带宽 [关闭]
OpenCL设计优化(基于Intel FPGA SDK for OpenCL)