How to parallelize nested for loops with CUDA to perform computations on a 2D array

Question

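I am doing some research that is a very good fit for CUDA. I am working in C and C++, the base languages that Nvidia's CUDA is compatible with. For the past week I have been trying to get any speedup at all by integrating CUDA into my C++ code.

As far as I can tell, I am handling the basics correctly when it comes to memory allocation and deallocation. But when it comes to actually accelerating the computation, I currently get different results from the non-CUDA implementation.

In addition, the CUDA implementation is also slower than the plain non-CUDA version.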

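Below is the function from which I call the kernel. Essentially, I moved the computation that was originally in this function into the kernel so that it could be parallelized.

//Calculate the distance between inputs
void computeInput(int vectorNumber, double *dist, double **weight) {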

double *d_dist, **d_weight;


//cout << "Dist[0] Before: " << dist[0] << endl;

cudaMalloc(&d_dist, maxClusters * sizeof(double));
cudaMalloc(&d_weight, maxClusters * vector_length * sizeof(double));

//  cout << "Memory Allocated" << endl;

//copy variables from host machine running on CPU to Kernel running on GPU
cudaMemcpy(d_dist, dist, maxClusters * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_weight, weight, maxClusters * vector_length * sizeof(double), cudaMemcpyHostToDevice);

//  cout << "Variables copied to GPU Device." << endl;

//kernel currently being run with 1 blocks with 4 threads for each block.
//right now only a single loop is parallelized, I need to parallelize each loop individually or 2d arrays individually.
dim3 blocks(8,8);
dim3 grid(1, 1);
threadedInput<<<grid,blocks>>>(vectorNumber, d_dist, d_weight);

//  cout << "Kernel Run." << endl;  

//Waits for the GPU to finish computations
cudaDeviceSynchronize();

//cout << "Weight[0][0] : " << weight[0][0];

//copy back variable from kernel space on GPU to host on CPU into variable weight
cudaMemcpy(weight, d_weight, maxClusters * vector_length * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(dist, d_dist, maxClusters * sizeof(double), cudaMemcpyDeviceToHost);
//  cout << "GPU Memory Copied back to Host" << endl;

cout << "Dist[0] After: " << dist[0] << endl;

cudaFree(d_dist);
cudaFree(d_weight);

//cout << " Cuda Memory Freed" << endl;
}
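A note on the copies above, with a hedged sketch. cudaMemcpy copies one contiguous range of bytes starting at the source pointer. The allocation of weight is not shown, so this is an assumption: if weight is an array of row pointers, copying from it transfers the pointer table rather than the values, and the host addresses in that table cannot be dereferenced in device code. One common workaround is to pack the rows into a single row-major buffer before the copy. The helpers flattenRows and unflattenRows below are hypothetical names, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: pack a rows x cols array of row pointers into one
// contiguous row-major buffer, so element (i, j) lives at flat[i * cols + j].
std::vector<double> flattenRows(double* const* rows2d, std::size_t rows, std::size_t cols)
{
    std::vector<double> flat(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            flat[i * cols + j] = rows2d[i][j];
        }
    }
    return flat;
}

// Hypothetical helper: write a contiguous row-major buffer back into the
// row-pointer array, for example after copying results back from the device.
void unflattenRows(const std::vector<double>& flat, double* const* rows2d,
                   std::size_t rows, std::size_t cols)
{
    assert(flat.size() == rows * cols);  // buffer must match the 2D shape
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            rows2d[i][j] = flat[i * cols + j];
        }
    }
}
```

With this layout, flat.data() and rows * cols * sizeof(double) would be the arguments passed to cudaMemcpy, and the kernel would take a plain double *weight and read weight[i * vector_length + j] instead of weight[i][j].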

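Below is the kernel function. It computes the distances using the weights on the nodes.

What I want to do is execute each iteration of the loops on a different thread.

My worry is that it is messing up the order and performing the wrong computations. I have searched Stack Overflow and elsewhere for help with parallelizing nested for loops, but none of it has shed any light on what I am doing wrong. Any suggestions?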

__global__ void threadedInput(int vecNum, double *dist, double **weight)
{
int tests[vectors][vector_length] = {{0, 1, 1, 0},
                                     {1, 0, 0, 1},
                                     {0, 1, 0, 1},
                                     {1, 0, 1, 0}};
dist[0] = 0.0;
dist[1] = 0.0;
int indexX,indexY, incrX, incrY;
indexX = blockIdx.x * blockDim.x + threadIdx.x;
indexY = blockIdx.y * blockDim.y + threadIdx.y;
incrX = blockDim.x * gridDim.x; 
incrY = blockDim.y * gridDim.y; 

for(int i = indexY; i <= (maxClusters - 1); i+=incrY)
{
    for(int j = indexX; j <= (vectors - 1); j+= incrX)
    {       
        dist[i] += pow((weight[i][j] - tests[vecNum][j]), 2);
    }// end inner for
}// end outer for

}// end CUDA-kernel
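A hedged observation on the kernel above, with a CPU sketch for comparison. Threads that share the same i but have different j all execute dist[i] += ... on the same element with no synchronization, and every thread also resets dist[0] and dist[1] to zero. That is a data race, and it is one plausible contributor to results that differ from the CPU version. A common way to avoid it is to give each output element a single owner that accumulates into a private variable and writes once. The function squaredDistances below is a hypothetical host-side reference written in that shape; it assumes the weights are stored row-major in one contiguous buffer and that the input vector has the same length as each weight row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: squared Euclidean distance between one input
// vector and every cluster's weight row. Row i of the weights occupies
// weights[i * len] .. weights[i * len + len - 1] (row-major, assumed layout).
std::vector<double> squaredDistances(const std::vector<double>& weights,
                                     const std::vector<int>& input,
                                     std::size_t numClusters)
{
    const std::size_t len = input.size();
    assert(weights.size() == numClusters * len);

    std::vector<double> dist(numClusters, 0.0);
    // Each value of i is the work a single GPU thread would own in a
    // one-thread-per-cluster kernel.
    for (std::size_t i = 0; i < numClusters; ++i) {
        double sum = 0.0;  // private accumulator, so no other owner touches it
        for (std::size_t j = 0; j < len; ++j) {
            const double diff = weights[i * len + j] - input[j];
            sum += diff * diff;
        }
        dist[i] = sum;  // exactly one write per output element
    }
    return dist;
}
```

The output of a function like this can be compared element by element against the values copied back from the device. With only two clusters and four components per vector there is very little work to distribute, so the cost of allocation, transfers, and kernel launch may well exceed the time of the computation itself.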

My current output:

Clusters for training input:

Vector (1, 0, 1, 0, ) Place in Bin 0

Vector (1, 1, 1, 0, ) Place in Bin 0

Vector (0, 1, 1, 1, ) Place in Bin 0

Vector (1, 1, 0, 0, ) Place in Bin 0

Weights for Node 0 connections:
0.74753098, 0.75753881, 0.74233157, 0.25246902, 

Weights for Node 1 connections:
0.00000000, 0.00000000, 0.00000000, 0.00000000, 

Categorized test input:

Vector (0, 1, 1, 0, ) Place in Bin 0

Vector (1, 0, 0, 1, ) Place in Bin 0

Vector (0, 1, 0, 1, ) Place in Bin 0

Vector (1, 0, 1, 0, ) Place in Bin 0
Time Ran: 0.96623900

Expected output (except that the expected time should be at least 50% faster):

Clusters for training input:

Vector (1, 0, 1, 0, ) Place in Bin 0

Vector (1, 1, 1, 0, ) Place in Bin 1

Vector (0, 1, 1, 1, ) Place in Bin 0

Vector (1, 1, 0, 0, ) Place in Bin 1

Weights for Node 0 connections:
0.74620975, 0.75889148, 0.74351981, 0.25379025, 

Weights for Node 1 connections:
0.75368531, 0.75637331, 0.74105526, 0.24631469, 

Categorized test input:

Vector (0, 1, 1, 0, ) Place in Bin 0

Vector (1, 0, 0, 1, ) Place in Bin 1

Vector (0, 1, 0, 1, ) Place in Bin 0

Vector (1, 0, 1, 0, ) Place in Bin 1
Time Ran: 0.00033100