为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出？

Posted 2023-04-15

技术标签:

【中文标题】为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出？【英文标题】：Why does the same OpenCL code have different outputs from Intel Xeon CPU and NVIDIA GTX 1080 Ti GPU?为什么相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出？ 【发布时间】：2019-12-31 02:41:15 【问题描述】：

我正在尝试使用 OpenCL 并行化 Monte Carlo 模拟。我使用 MWC64X 作为统一随机数生成器。该代码在不同的 Intel CPU 上运行良好，因为并行计算的输出非常接近顺序计算的输出。

Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds         trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds         trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds        trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds        trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:4781   influence0:0.478100 sum2:4781   influence2:0.478100 sum6:0  influence6:0.000000 sum10:0 sum12:0 influence12:0.000000    sum7:0  influence7:0.000000 influence10:0.000000    sum4:5962   influence4:0.596200 sum8:7971   influence8:0.797100 sum1:4781   influence1:0.478100 sum3:4781   influence3:0.478100 sum13:0 influence13:0.000000    sum11:1261  influence11:0.126100    sum9:0  influence9:0.000000 sum14:0 influence14:0.000000    sum5:0  influence5:0.000000 sum15:0 influence15:0.000000    sum16:0 influence16:0.000000    
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971

但是，当我在安装了 NVIDIA-SMI 430.40 和 CUDA 10.1 和 OpenCL 1.2 CUDA 的 GeForce GTX 1080 Ti 上运行代码时，输出如下：

Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds        r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds        r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds        r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds         ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds        trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds        trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds        trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds        trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds        trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds        trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds        trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds         trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds         trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds        trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds        trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds        trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds        trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026

index1= 17 size= 51 dim1_size= 6
sum0:10000  sum1:10000  sum2:10000  sum3:10000  sum4:10000  sum5:0  sum6:0  sum7:0  sum8:10000  sum9:0  sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0 
Parallel influence running time: 0.193581 seconds

“影响”值等于sum*1.0/10000，因此并行影响仅由 1 和 0 组成，这是不正确的（在 GPU 运行中）并且在 Intel CPU 上并行化时不会发生。

当我检查随机数生成器if(flag==0) printf("randint=%u",randint); 的输出时，GPU 上的输出似乎全为零。下面是 clinfo 和 .cl 代码：

 Device Name                                     GeForce GTX 1080 Ti
  Device Vendor                                   NVIDIA Corporation
  Device Vendor ID                                0x10de
  Device Version                                  OpenCL 1.2 CUDA
  Driver Version                                  430.40
  Device OpenCL C Version                         OpenCL C 1.2 
  Device Type                                     GPU
  Device Topology (NV)                            PCI-E, 68:00.0
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               28
  Max clock frequency                             1721MHz
  Compute Capability (NV)                         6.1
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             1024x1024x64
  Max work group size                             1024
  Preferred work group size multiple              32
  Warp size (NV)                                  32
  Preferred / native vector sizes                 
    char                                                 1 / 1       
    short                                                1 / 1       
    int                                                  1 / 1       
    long                                                 1 / 1       
    half                                                 0 / 0        (n/a)
    float                                                1 / 1       
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              11720130560 (10.92GiB)
  Error Correction support                        No
  Max memory allocation                           2930032640 (2.729GiB)
  Unified memory for Host and Device              No
  Integrated memory (NV)                          No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       4096 bits (512 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        458752 (448KiB)
  Global Memory cache line size                   128 bytes
  Image support                                   Yes
    Max number of samplers per kernel             32
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             16384x32768 pixels
    Max 3D image size                             16384x16384x16384 pixels
    Max number of read image args                 256
    Max number of write image args                16
  Local memory type                               Local
  Local memory size                               49152 (48KiB)
  Registers per block (NV)                        65536
  Max number of constant args                     9
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     4352 (4.25KiB)
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
    Kernel execution timeout (NV)                 Yes
  Concurrent copy and kernel execution (NV)       Yes
    Number of async copy engines                  2
  printf() buffer size                            1048576 (1024KiB)

#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL

//! Represents the state of a particular generator
typedef struct uint x; uint c;  mwc64x_state_t;
enum MWC64X_A = 4294883355U ;
enum MWC64X_M = 18446383549859758079UL ;

void MWC64X_Step(mwc64x_state_t *s)

    uint X=s->x, C=s->c;

    uint Xn=MWC64X_A*X+C;
    uint carry=(uint)(Xn<C);                // The (Xn<C) will be zero or one for scalar
    uint Cn=mad_hi(MWC64X_A,X,carry);  

    s->x=Xn;
    s->c=Cn;


//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)

    uint res=s->x ^ s->c;
    MWC64X_Step(s);
    return res;



__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence)   
    int flag=get_global_id(0);
    int sum=0;
    int count=10000;
    int assignment[N];
    //or try to get newlambda like original version does
    if(flag < literals)
        mwc64x_state_t rng;
        for(int i=0; i<count; i++)
            for(int j=0; j<size; j++)
                uint randint=MWC64X_NextUint(&rng);
                float rand=randint*1.0/BASE;
                //if(flag==0)
                //  printf("randint=%u",randint);
                if(lambdap[j]<rand)
                    assignment[lambdas[j]]=0;
                else
                    assignment[lambdas[j]]=1;               
            
            //the true case
            assignment[flag]=1;
            int valuet=0;
            int index=0;
            for(int m=0; m<dim1_size; m++)
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++)
                    if(assignment[lambdas[index+n]]==0)
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    
                
                if(valueMono==1)
                    valuet=1;
                    break;
                
                    
            //the false case
            assignment[flag]=0;
            int valuef=0;
            index=0;
            for(int m=0; m<dim1_size; m++)
                int valueMono=1;
                for(int n=0; n<dim2_size[m]; n++)
                    if(assignment[lambdas[index+n]]==0)
                        valueMono=0;
                        index+=dim2_size[m];
                        break;
                    
                
                if(valueMono==1)
                    valuef=1;
                    break;
                
            
            sum += valuet-valuef;            
        
        influence[flag] = 1.0*sum/count;
        printf("sum%d:%d\t", flag, sum);

在 GPU 上运行代码时可能会出现什么问题？是MWC64X吗？据其作者称，它可以在 NVIDIA GPU 上运行良好。如果是这样，我该如何解决它；如果不是，可能是什么问题？

【问题讨论】：

你没有初始化你的 mwc64x_state_t rng 变量，所以任何结果都是未定义的。请注意，您可能希望为每个工作项设置不同的 RNG 种子，否则您的结果中会出现令人讨厌的相关伪影。 @pmdj 谢谢。初始化后，CPU和GPU上的结果都是正确的。 【参考方案1】：

（这开始是一个评论，事实证明这是问题的根源，所以我把它变成了一个答案。）

您没有在读取 mwc64x_state_t rng; 变量之前对其进行初始化，因此任何结果都将是未定义的：

    mwc64x_state_t rng;
    for(int i=0; i<count; i++)
        for(int j=0; j<size; j++)
            uint randint=MWC64X_NextUint(&rng);

MWC64X_NextUint() 在更新之前立即从 rng 状态读取：

uint MWC64X_NextUint(mwc64x_state_t *s)

    uint res=s->x ^ s->c;

请注意，您可能希望为每个工作项设置不同的 RNG 种子，否则您的结果中会出现令人讨厌的相关伪影。

【讨论】：

【参考方案2】：

伪随机数的所有用例都是真正的-[PARALLEL] 计算平台（不是语言、平台）中的下一级挑战。

或者，存在一些 随机源，一旦大规模并行请求在真正的[PARALLEL]时尚（在这里，硬件资源可能会有所帮助，但代价是无法在同一平台“外部”重现相同的行为（如果这样的来源不是软件的话，也是即时的）使用一些种子注入功能进行操作，这可能会设置“just”-pseudo-random 算法，该算法创建一个纯-[SERIAL] sequence-of-produced “just”-pseudo-random numbers））

或者，有一些伪随机数的“共享”生成器，它享有更高层次的全系统水平-熵（这有利于产生伪随机性的“质量”），但代价是纯串行依赖（不可能并行执行，串行序列以顺序方式一个接一个地得到服务）并且几乎为零的机会可重复运行（可重复科学的必需品）提供可重复的相同序列，这是测试和方法验证案例所需的。

恢复：

代码可以使用工作项-“私有”伪随机生成函数_{（为了并行代码执行和生成这些伪随机数的相互独立性（非干预过程））}，但每个实例都必须a）独立初始化，以提供可实现的预期随机性水平并行代码运行和 b) 任何 此类初始化都应该以可重复的方式执行，以便在不同时间运行测试，通常使用不同的 OpenCL 目标计算-平台。

对于__kernel-s，不依赖于特定于硬件的随机源，满足条件 a && b 将足以接收可重复再现（相同）的测试结果“体外”，从而为在通用生产级用例代码“体内”运行期间生成结果提供了一种合理的随机方法。

net-run-times 的比较（上面的基准）似乎表明Amdahl's law add-on overhead costs 加上工作原子性的尾部效应最终决定了 net-run-time 是@987654326 @ 在 XEON 上比 GPU 更快：

index1    = 17
size      = 51
dim1_size =  6
sum0:  4781   influence0:  0.478100
sum2:  4781   influence2:  0.478100
sum6:     0   influence6:  0.000000
sum10:    0   influence10: 0.000000
sum12:    0   influence12: 0.000000
sum7:     0   influence7:  0.000000
sum4:  5962   influence4:  0.596200
sum8:  7971   influence8:  0.797100
sum1:  4781   influence1:  0.478100
sum3:  4781   influence3:  0.478100
sum13:    0   influence13: 0.000000
sum11: 1261   influence11: 0.126100
sum9:     0   influence9:  0.000000
sum14:    0   influence14: 0.000000
sum5:     0   influence5:  0.000000
sum15:    0   influence15: 0.000000
sum16:    0   influence16: 0.000000
     Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
                                         |....
index1    = 17                           |....
size      = 51                           |....
dim1_size =  6                           |....
sum0: 10000                              |....
sum1: 10000                              |....
sum2: 10000                              |....
sum3: 10000                              |....
sum4: 10000                              |....
sum5:     0                              |....
sum6:     0                              |....
sum7:     0                              |....
sum8: 10000                              |....
sum9:     0                              |....
sum10:    0                              |....
sum11:    0                              |....
sum12:    0                              |....
sum13:    0                              |....
sum14:    0                              |....
sum15:    0                              |....
sum16:    0                              |....
     Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL

【讨论】：

以上是关于为啥相同的 OpenCL 代码从 Intel Xeon CPU 和 NVIDIA GTX 1080 Ti GPU 具有不同的输出？的主要内容，如果未能解决你的问题，请参考以下文章