使用 CUDA 进行大整数加法

Posted 2023-02-19

技术标签:

【中文标题】使用 CUDA 进行大整数加法【英文标题】：large integer addition with CUDA 【发布时间】：2012-10-09 01:21:39 【问题描述】：

我一直在 GPU 上开发一种加密算法，目前坚持使用一种算法来执行大整数加法。大整数以通常的方式表示为一堆 32 位字。

例如，我们可以使用一个线程来添加两个 32 位字。为简单起见，假设要添加的数字具有相同的长度和每个块的线程数 == 字数。那么：

__global__ void add_kernel(int *C, const int *A, const int *B) 
     int x = A[threadIdx.x];
     int y = B[threadIdx.x];
     int z = x + y;
     int carry = (z < x);
     /** do carry propagation in parallel somehow ? */
     ............

     z = z + newcarry; // update the resulting words after carry propagation
     C[threadIdx.x] = z;

我很确定有一种方法可以通过一些棘手的缩减程序来进行进位传播，但无法弄清楚..

我查看了CUDA thrust extensions，但似乎还没有实现大整数包。也许有人可以给我一个提示如何在 CUDA 上做到这一点？

【问题讨论】：

GPU 最多可以直接处理 64 位（long long）。 this SO question/answer 中概述了一种用于 128 位的方法。我认为你想从 CUDA 中得到的东西可以通过 C 技术来实现。因此，我也在C 中重新提出了这个问题。希望得到 C 专家的好评。是的，您还可以仅使用高级 C 构造（与 CUDA 中的 PXT linline 汇编相反）对长整数加法进行编程，但正如我在此指出的那样，它需要更多的指令答案：***.com/questions/12448549/… 感谢您的建议。我知道 CUDA 支持特殊的内在函数在添加后使用进位标志。关键是整数可能非常大（大约 2048 个 32 位字）所以我真的在寻找一个并行解决方案，也许以某种方式使用并行缩减？加法在算术上不够密集，无法有意义地将其拆分为线程（至少在我的脑海中）。对于乘法，您可以让每个线程对一列部分 32x32->64 位乘积求和，然后在最后传播进位。您还可以通过将加法的结果作为单独的和和进位向量来研究延迟进位传播。很大程度上取决于确切的算法上下文。 【参考方案1】：

你是对的，进位传播可以通过前缀和计算来完成，但是为这个操作定义二进制函数并证明它是关联的（并行前缀和需要）有点棘手。事实上，这个算法（理论上）在Carry-lookahead adder中使用。

假设我们有两个大整数 a[0..n-1] 和 b[0..n-1]。然后我们计算 (i = 0..n-1)：

s[i] = a[i] + b[i]l;
carryin[i] = (s[i] < a[i]);

我们定义了两个函数：

generate[i] = carryin[i];
propagate[i] = (s[i] == 0xffffffff);

很直观的意思：generate[i] == 1 表示进位产生于 position i whilepropagate[i] == 1 表示进位将从位置传播 (i - 1) 到 (i + 1)。我们的目标是计算用于更新结果和 s[0..n-1] 的函数 carryout[0..n-1]。进位可以递归计算如下：

carryout[i] = generate[i] OR (propagate[i] AND carryout[i-1])
carryout[0] = 0

这里进位[i] == 1 如果进位是在位置 i 生成的，或者它有时更早生成并传播到位置 i。最后，我们更新结果和：

s[i] = s[i] + carryout[i-1];  for i = 1..n-1
carry = carryout[n-1];

现在很容易证明进位函数确实是二进制关联的，因此并行前缀和计算适用。为了在 CUDA 上实现这一点，我们可以将两个标志 'generate' 和 'propagate' 合并到一个变量中，因为它们是互斥的，即：

cy[i] = (s[i] == -1u ? -1u : 0) | carryin[i];

换句话说，

cy[i] = 0xffffffff  if propagate[i]
cy[i] = 1           if generate[i]
cy[u] = 0           otherwise

然后，可以验证以下公式计算进位函数的前缀和：

cy[i] = max((int)cy[i], (int)cy[k]) & cy[i];

对于所有 k

// add & output carry flag
#define UADDO(c, a, b) \ 
     asm volatile("add.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));
// add with carry & output carry flag
#define UADDC(c, a, b) \ 
     asm volatile("addc.cc.u32 %0, %1, %2;" : "=r"(c) : "r"(a) , "r"(b));

#define WS 32

__global__ void bignum_add(unsigned *g_R, const unsigned *g_A,const unsigned *g_B) 

extern __shared__ unsigned shared[];
unsigned *r = shared; 

const unsigned N_THIDS = 512;
unsigned thid = threadIdx.x, thid_in_warp = thid & WS-1;
unsigned ofs, cf;

uint4 a = ((const uint4 *)g_A)[thid],
      b = ((const uint4 *)g_B)[thid];

UADDO(a.x, a.x, b.x) // adding 128-bit chunks with carry flag
UADDC(a.y, a.y, b.y)
UADDC(a.z, a.z, b.z)
UADDC(a.w, a.w, b.w)
UADDC(cf, 0, 0) // save carry-out

// memory consumption: 49 * N_THIDS / 64
// use "alternating" data layout for each pair of warps
volatile short *scan = (volatile short *)(r + 16 + thid_in_warp +
        49 * (thid / 64)) + ((thid / 32) & 1);

scan[-32] = -1; // put identity element
if(a.x == -1u && a.x == a.y && a.x == a.z && a.x == a.w)
    // this indicates that carry will propagate through the number
    cf = -1u;

// "Hillis-and-Steele-style" reduction 
scan[0] = cf;
cf = max((int)cf, (int)scan[-2]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-4]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-8]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-16]) & cf;
scan[0] = cf;
cf = max((int)cf, (int)scan[-32]) & cf;
scan[0] = cf;

int *postscan = (int *)r + 16 + 49 * (N_THIDS / 64);
if(thid_in_warp == WS - 1) // scan leading carry-outs once again
    postscan[thid >> 5] = cf;

__syncthreads();

if(thid < N_THIDS / 32) 
    volatile int *t = (volatile int *)postscan + thid;
    t[-8] = -1; // load identity symbol
    cf = t[0];
    cf = max((int)cf, (int)t[-1]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-2]) & cf;
    t[0] = cf;
    cf = max((int)cf, (int)t[-4]) & cf;
    t[0] = cf;

__syncthreads();

cf = scan[0];
int ps = postscan[(int)((thid >> 5) - 1)]; // postscan[-1] equals to -1
scan[0] = max((int)cf, ps) & cf; // update carry flags within warps
cf = scan[-2];

if(thid_in_warp == 0)
    cf = ps;
if((int)cf < 0)
    cf = 0;

UADDO(a.x, a.x, cf) // propagate carry flag if needed
UADDC(a.y, a.y, 0)
UADDC(a.z, a.z, 0)
UADDC(a.w, a.w, 0)
((uint4 *)g_R)[thid] = a;

请注意，宏 UADDO / UADDC 可能不再需要，因为 CUDA 4.0 具有相应的内在函数（但我不完全确定）。

还要注意，虽然并行归约速度很快，但如果需要连续添加几个大整数，最好使用一些冗余表示（上面的 cmets 建议），即先将结果累加64位字的加法，然后在“一次扫描”的最后进行一次进位传播。

【讨论】：

我尝试编译它，但在这一行出现错误： volatile short *scan = (volatile short *)(r + 16 + thid_in_warp + (49 * (thid / 64)) + ( (thid / 32) & 1); 它似乎缺少一个右括号。我尝试在分号前的末尾添加一个。你能检查一下吗？（修复后我尝试使用它添加两个启动失败2048x32bit 无符号整数。我自己的代码可能有错误。）哎呀，你是对的，应该是 (r + 16 + thid_in_warp + 49 * (thid / 64)) + ((thid / 32) & 1)。我修好了它。还请确保您为算法正确运行分配了足够的共享内存，大约为 (49 * 512 / 64) + 32 个字。我分配了 4096 个字节，看起来应该绰绰有余。我用您修改的行更新了我的测试。我仍然遇到未指定的启动失败。也许这是我正在做的事情。读取输入数据并将结果保存在全局内存中也有问题。我也修好了。当您从较大的代码中提取代码时会发生这种情况;) 另一件事：你是为 64 位编译的吗？我刚刚意识到消极的节日可能真的很讨厌，因为默认情况下它们没有符号扩展，即。这里：postscan[((thid >> 5) - 1)]【参考方案2】：

除了@asm，我想我也会发布我的答案，所以这个 SO 问题可以成为一种想法的存储库。与@asm 类似，我检测并存储进位条件以及“结转”条件，即。当中间字结果全为 1 (0xF...FFF) 时，如果进位传播到该字中，它将“携带”到下一个字。

我的代码中没有使用任何 PTX 或 asm，因此我选择使用 64 位无符号整数而不是 32 位，以实现 2048x32 位的能力，使用 1024 个线程。

与@asm 的代码更大的不同在于我的并行进位传播方案。我构造了一个位压缩数组（“进位”），其中每个位表示从 1024 个线程中的每一个线程的独立中间 64 位相加生成的进位条件。我还构造了一个位压缩数组（“carry_through”），其中每个位表示单个 64 位中间结果的进位条件。对于 1024 个线程，这相当于每个位压缩数组的 1024/64 = 16x64 位共享内存字，因此共享内存的总使用量为 64+3 个 32 位数量。使用这些位压缩数组，我执行以下操作以生成组合的传播进位指示符：

carry = carry | (carry_through ^ ((carry & carry_through) + carry_through);

（注意进位左移一位：进位[i]表示a[i-1] + b[i-1]的结果产生了进位）解释如下：

进位和进位通的按位与生成将进位的候选与一系列一个或多个进位条件进行交互将步骤 1 的结果添加到 carry_through 会生成一个结果，该结果已更改位，这些位表示将受到影响的所有单词进位传播到进位序列中采用异或进位加第 2 步的结果显示受影响的结果，用 1 位表示对第 3 步的结果进行按位或进位指标给出一个组合进位条件，然后用于更新所有中间结果。

请注意，第 2 步中的加法需要另一个多字加法（对于由超过 64 个字组成的大整数）。我相信这个算法是有效的，它已经通过了我给它的测试用例。

这是我实现此功能的示例代码：

// parallel add of large integers
// requires CC 2.0 or higher
// compile with:
// nvcc -O3 -arch=sm_20 -o paradd2 paradd2.cu
#include <stdio.h>
#include <stdlib.h>

#define MAXSIZE 1024 // the number of 64 bit quantities that can be added
#define LLBITS 64  // the number of bits in a long long
#define BSIZE ((MAXSIZE + LLBITS -1)/LLBITS) // MAXSIZE when packed into bits
#define nTPB MAXSIZE

// define either GPU or GPUCOPY, not both -- for timing
#define GPU
//#define GPUCOPY

#define LOOPCNT 1000

#define cudaCheckErrors(msg) \
    do  \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess)  \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
         \
     while (0)

// perform c = a + b, for unsigned integers of psize*64 bits.
// all work done in a single threadblock.
// multiple threadblocks are handling multiple separate addition problems
// least significant word is at a[0], etc.

__global__ void paradd(const unsigned size, const unsigned psize, unsigned long long *c, const unsigned long long *a, const unsigned long long *b)

  __shared__ unsigned long long carry_through[BSIZE];
  __shared__ unsigned long long carry[BSIZE+1];
  __shared__ volatile unsigned mcarry;
  __shared__ volatile unsigned mcarry_through;

  unsigned idx = threadIdx.x + (psize * blockIdx.x);
  if ((threadIdx.x < psize) && (idx < size))
    // handle 64 bit unsigned add first
    unsigned long long cr1 = a[idx];
    unsigned long long lc = cr1 + b[idx];
    // handle carry
    if (threadIdx.x < BSIZE)
      carry[threadIdx.x] = 0;
      carry_through[threadIdx.x] = 0;
      
    if (threadIdx.x == 0)
      mcarry = 0;
      mcarry_through = 0;
      
    __syncthreads();
    if (lc < cr1)
      if ((threadIdx.x%LLBITS) != (LLBITS-1))  
        atomicAdd(&(carry[threadIdx.x/LLBITS]), (2ull<<(threadIdx.x%LLBITS)));
      else atomicAdd(&(carry[(threadIdx.x/LLBITS)+1]), 1);
      
    // handle carry-through
    if (lc == 0xFFFFFFFFFFFFFFFFull) 
      atomicAdd(&(carry_through[threadIdx.x/LLBITS]), (1ull<<(threadIdx.x%LLBITS))); 
    __syncthreads();
    if (threadIdx.x < ((psize + LLBITS-1)/LLBITS))
      // only 1 warp executing within this if statement
      unsigned long long cr3 = carry_through[threadIdx.x];
      cr1 = carry[threadIdx.x] & cr3;
      // start of sub-add
      unsigned long long cr2 = cr3 + cr1;
      if (cr2 < cr1) atomicAdd((unsigned *)&mcarry, (2u<<(threadIdx.x)));
      if (cr2 == 0xFFFFFFFFFFFFFFFFull) atomicAdd((unsigned *)&mcarry_through, (1u<<threadIdx.x));
      if (threadIdx.x == 0) 
        unsigned cr4 = mcarry & mcarry_through;
        cr4 += mcarry_through;
        mcarry |= (mcarry_through ^ cr4); 
        
      if (mcarry & (1u<<threadIdx.x)) cr2++;
      // end of sub-add
      carry[threadIdx.x] |= (cr2 ^ cr3);
      
    __syncthreads();
    if (carry[threadIdx.x/LLBITS] & (1ull<<(threadIdx.x%LLBITS))) lc++;
    c[idx] = lc;
  


int main() 

  unsigned long long *h_a, *h_b, *h_c, *d_a, *d_b, *d_c, *c;
  unsigned at_once = 256;   // valid range = 1 .. 65535
  unsigned prob_size = MAXSIZE ; // valid range = 1 .. MAXSIZE
  unsigned dsize = at_once * prob_size;
  cudaEvent_t t_start_gpu, t_start_cpu, t_end_gpu, t_end_cpu;
  float et_gpu, et_cpu, tot_gpu, tot_cpu;
  tot_gpu = 0;
  tot_cpu = 0;


  if (sizeof(unsigned long long) != (LLBITS/8)) printf("Word Size Error\n"); return 1;
  if ((c = (unsigned long long *)malloc(dsize * sizeof(unsigned long long)))  == 0) printf("Malloc Fail\n"); return 1;

  cudaHostAlloc((void **)&h_a, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
  cudaCheckErrors("cudaHostAlloc1 fail");
  cudaHostAlloc((void **)&h_b, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
  cudaCheckErrors("cudaHostAlloc2 fail");
  cudaHostAlloc((void **)&h_c, dsize * sizeof(unsigned long long), cudaHostAllocDefault);
  cudaCheckErrors("cudaHostAlloc3 fail");

  cudaMalloc((void **)&d_a, dsize * sizeof(unsigned long long));
  cudaCheckErrors("cudaMalloc1 fail");
  cudaMalloc((void **)&d_b, dsize * sizeof(unsigned long long));
  cudaCheckErrors("cudaMalloc2 fail");
  cudaMalloc((void **)&d_c, dsize * sizeof(unsigned long long));
  cudaCheckErrors("cudaMalloc3 fail");
  cudaMemset(d_c, 0, dsize*sizeof(unsigned long long));

  cudaEventCreate(&t_start_gpu);
  cudaEventCreate(&t_end_gpu);
  cudaEventCreate(&t_start_cpu);
  cudaEventCreate(&t_end_cpu);

  for (unsigned loops = 0; loops <LOOPCNT; loops++)
  //create some test cases
  if (loops == 0)
  for (int j=0; j<at_once; j++)
  for (int k=0; k<prob_size; k++)
    int i= (j*prob_size) + k;
    h_a[i] = 0xFFFFFFFFFFFFFFFFull;
    h_b[i] = 0;
    
    h_a[prob_size-1] = 0;
    h_b[prob_size-1] = 1;
    h_b[0] = 1;
  
  else if (loops == 1)
  for (int i=0; i<dsize; i++)
    h_a[i] = 0xFFFFFFFFFFFFFFFFull;
    h_b[i] = 0;
    
    h_b[0] = 1;
  
  else if (loops == 2)
  for (int i=0; i<dsize; i++)
    h_a[i] = 0xFFFFFFFFFFFFFFFEull;
    h_b[i] = 2;
    
    h_b[0] = 1;
  
  else 
  for (int i = 0; i<dsize; i++)
    h_a[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
    h_b[i] = (((unsigned long long)lrand48())<<33) + (unsigned long long)lrand48();
    
  
#ifdef GPUCOPY
  cudaEventRecord(t_start_gpu, 0);
#endif
  cudaMemcpy(d_a, h_a, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy1 fail");
  cudaMemcpy(d_b, h_b, dsize*sizeof(unsigned long long), cudaMemcpyHostToDevice);
  cudaCheckErrors("cudaMemcpy2 fail");
#ifdef GPU
  cudaEventRecord(t_start_gpu, 0);
#endif
  paradd<<<at_once, nTPB>>>(dsize, prob_size, d_c, d_a, d_b);
  cudaCheckErrors("Kernel Fail");
#ifdef GPU
  cudaEventRecord(t_end_gpu, 0);
#endif
  cudaMemcpy(h_c, d_c, dsize*sizeof(unsigned long long), cudaMemcpyDeviceToHost);
  cudaCheckErrors("cudaMemcpy3 fail");
#ifdef GPUCOPY
  cudaEventRecord(t_end_gpu, 0);
#endif
  cudaEventSynchronize(t_end_gpu);
  cudaEventElapsedTime(&et_gpu, t_start_gpu, t_end_gpu);
  tot_gpu += et_gpu;
  cudaEventRecord(t_start_cpu, 0);
  //also compute result on CPU for comparison
  for (int j=0; j<at_once; j++) 
    unsigned rc=0;
    for (int n=0; n<prob_size; n++)
      unsigned i = (j*prob_size) + n;
      c[i] = h_a[i] + h_b[i];
      if (c[i] < h_a[i]) 
        c[i] += rc;
        rc=1;
      else 
        if ((c[i] += rc) != 0) rc=0;
        
      if (c[i] != h_c[i]) printf("Results mismatch at offset %d, GPU = 0x%lX, CPU = 0x%lX\n", i, h_c[i], c[i]); return 1;
      
    
  cudaEventRecord(t_end_cpu, 0);
  cudaEventSynchronize(t_end_cpu);
  cudaEventElapsedTime(&et_cpu, t_start_cpu, t_end_cpu);
  tot_cpu += et_cpu;
  if ((loops%(LOOPCNT/10)) == 0) printf("*\n");
  
  printf("\nResults Match!\n");
  printf("Average GPU time = %fms\n", (tot_gpu/LOOPCNT));
  printf("Average CPU time = %fms\n", (tot_cpu/LOOPCNT));

  return 0;

【讨论】：

其实我相信我的进位传播可以进一步简化为：进位 = 进位 | (carry_through ^ (carry + carry_through)); 这非常有用。您能否给出您的机器在 CPU 与 GPU 上的平均时间的数据（说明 CPU、GPU、操作系统等）？

以上是关于使用 CUDA 进行大整数加法的主要内容，如果未能解决你的问题，请参考以下文章

ZZNUOJ_用C语言编写程序实现1153：大整数加法(附完整源码)