Implementing SSE 4.2's CRC32C in software

Posted 2023-02-19

【Title】Implementing SSE 4.2's CRC32C in software 【Posted】2013-07-12 18:46:00 【Question】:

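I have a design that incorporates CRC32C checksums to ensure data has not been damaged. I chose CRC32C because I can have both a software version and a hardware-accelerated version if the computer the software runs on supports SSE 4.2.

I am going by Intel's developer manual (Volume 2A), which seems to give the algorithm behind the crc32 instruction. However, I am having little luck. Intel's developer guide says the following: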

BIT_REFLECT32: DEST[31-0] = SRC[0-31]
MOD2: Remainder from Polynomial division modulus 2

TEMP1[31-0] <- BIT_REFLECT(SRC[31-0])
TEMP2[31-0] <- BIT_REFLECT(DEST[31-0])
TEMP3[63-0] <- TEMP1[31-0] << 32
TEMP4[63-0] <- TEMP2[31-0] << 32
TEMP5[63-0] <- TEMP3[63-0] XOR TEMP4[63-0]
TEMP6[31-0] <- TEMP5[63-0] MOD2 0x11EDC6F41
DEST[31-0]  <- BIT_REFLECT(TEMP6[31-0])

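As far as I can tell, I have done everything up to the line starting with TEMP6 correctly, but I think I may be either misunderstanding the polynomial division or implementing it incorrectly. If my understanding is correct, 1 / 1 mod 2 = 1, 0 / 1 mod 2 = 0, and both divisions by zero are undefined.

What I don't understand is how binary division with 64-bit and 33-bit operands would work. If SRC is 0x00000000 and DEST is 0xFFFFFFFF, TEMP5[63-32] will be all set bits, while TEMP5[31-0] will be all unset bits.

If I use the bits from TEMP5 as the numerator, there will be 30 divisions by zero, because the polynomial 11EDC6F41 is only 33 bits long (so converting it to a 64-bit unsigned integer leaves the top 30 bits unset), and the denominator is therefore unset for 30 bits.

However, if I use the polynomial as the numerator, the bottom 32 bits of TEMP5 are unset, resulting in divisions by zero there, and the top 30 bits of the result would be zero, since the top 30 bits of the numerator would be zero and 0 / 1 mod 2 = 0.

Am I misunderstanding how this works? Am I just missing something? Or has Intel left out some crucial step in their documentation?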
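
For reference, here is a minimal sketch of what the pseudocode computes, assuming MOD2 denotes the remainder of polynomial long division over GF(2). Under that reading there is no bit-by-bit division and so no division by zero: the divisor is aligned under each set bit of the dividend and XORed away. The names `bit_reflect`, `mod2` and `crc32_model` are made up for this illustration and do not come from Intel's manual; the 8- and 16-bit forms are assumed to follow the same pattern, with DEST shifted by the source width.

```c
#include <stdint.h>

#define CRC32C_POLY33 0x11EDC6F41ULL  /* 33-bit polynomial from the manual */

/* Reverse the order of the low `bits` bits of v. */
static uint64_t bit_reflect(uint64_t v, int bits) {
    uint64_t r = 0;
    for (int i = 0; i < bits; i++) {
        r = (r << 1) | (v & 1);
        v >>= 1;
    }
    return r;
}

/* Remainder of num (a polynomial with `width` coefficients) divided by the
   33-bit polynomial, using XOR in place of subtraction. */
static uint32_t mod2(uint64_t num, int width) {
    for (int bit = width - 1; bit >= 32; bit--)
        if ((num >> bit) & 1)
            num ^= CRC32C_POLY33 << (bit - 32);
    return (uint32_t)num;
}

/* Model of crc32 with an 8-, 16- or 32-bit source operand. */
static uint32_t crc32_model(uint32_t dest, uint32_t src, int bits) {
    uint64_t temp3 = bit_reflect(src, bits) << 32;
    uint64_t temp4 = bit_reflect(dest, 32) << bits;
    uint64_t temp5 = temp3 ^ temp4;
    uint32_t temp6 = mod2(temp5, bits + 32);
    return (uint32_t)bit_reflect(temp6, 32);
}
```

Calling `crc32_model(0xFFFFFFFF, 0, 32)` corresponds to the SRC = 0x00000000, DEST = 0xFFFFFFFF case described above.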

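The reason I went to Intel's developer guide for the algorithm they appear to use is that they use a 33-bit polynomial, and I wanted my output to be identical, which did not happen when I used the 32-bit polynomial 1EDC6F41 (shown below).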

uint32_t poly = 0x1EDC6F41, sres, crcTable[256], data = 0x00000000;

for (n = 0; n < 256; n++) {
    sres = n;
    for (k = 0; k < 8; k++)
        sres = (sres & 1) == 1 ? poly ^ (sres >> 1) : (sres >> 1);
    crcTable[n] = sres;
}

sres = 0xFFFFFFFF;

for (n = 0; n < 4; n++) {
    sres = crcTable[(sres ^ data) & 0xFF] ^ (sres >> 8);
}

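The code above produces 4138093821 as output, while the crc32 opcode produces 2346497208 for the input 0x00000000.

Sorry if this is badly written or hard to follow in places; it is rather late for me.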
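
One plausible reason for the mismatch, offered as an assumption and not something confirmed in the question: the table loop shifts right, so it processes bits in reflected order and needs the bit-reversed constant 0x82F63B78 (the POLY value used by the table generator in the first answer) instead of 0x1EDC6F41. The sketch below is a plain byte-at-a-time version built on that assumption. The names `crc32c_init_table` and `crc32c_bytewise` are illustrative, and the function inverts the CRC on entry and exit, which the bare crc32 instruction does not do.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc_table[256];

/* Build the byte-wise table with the reflected CRC-32C polynomial. */
static void crc32c_init_table(void) {
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t r = n;
        for (int k = 0; k < 8; k++)
            r = (r & 1) ? (r >> 1) ^ 0x82F63B78u : r >> 1;
        crc_table[n] = r;
    }
}

/* Update a CRC-32C with len bytes; pass 0 as crc for the first call. */
static uint32_t crc32c_bytewise(uint32_t crc, const void *buf, size_t len) {
    const unsigned char *p = buf;
    crc = ~crc;                       /* pre-inversion */
    while (len--)
        crc = crc_table[(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return ~crc;                      /* post-inversion */
}
```

To compare against the raw instruction result, undo the final inversion with `~crc32c_bytewise(0, buf, len)`, which corresponds to starting the hardware CRC from 0xFFFFFFFF.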

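【Comments】:

For those who use Delphi, I have written some Open Source code that uses the new crc32 hardware instruction if available, and fast x86 asm or pure Pascal code (using pre-computed tables) if SSE 4.2 is not available. The naive rolled version runs at 330 MB/s, the optimized unrolled x86 asm runs at 1.7 GB/s, and the SSE 4.2 hardware gives an amazing 3.7 GB/s (on both Win32 and Win64 platforms).

If it is legal for you to read LGPL code, see code.woboq.org/qt5/qtbase/src/corelib/tools/qhash.cpp.html#95

【Answer 1】: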

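Here are both software and hardware versions of CRC-32C. The software version is optimized to process eight bytes at a time. The hardware version is optimized to run three crc32q instructions effectively in parallel on a single core, since the throughput of that instruction is one cycle but its latency is three cycles.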

crc32c.c:

/* crc32c.c -- compute CRC-32C using the Intel crc32 instruction
 * Copyright (C) 2013, 2021 Mark Adler
 * Version 1.2  5 Jun 2021  Mark Adler
 */

/*
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the author be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
 */

/* Version History:
 1.0  10 Feb 2013  First version
 1.1  31 May 2021  Correct register constraints on assembly instructions
                   Include pre-computed tables to avoid use of pthreads
                   Return zero for the CRC when buf is NULL, as initial value
 1.2   5 Jun 2021  Make tables constant
 */

// Use hardware CRC instruction on Intel SSE 4.2 processors.  This computes a
// CRC-32C, *not* the CRC-32 used by Ethernet and zip, gzip, etc.  A software
// version is provided as a fall-back, as well as for speed comparisons.

#include <stddef.h>
#include <stdint.h>

// Tables for CRC word-wise calculation, definitions of LONG and SHORT, and CRC
// shifts by LONG and SHORT bytes.
#include "crc32c.h"

// Table-driven software version as a fall-back.  This is about 15 times slower
// than using the hardware instructions.  This assumes little-endian integers,
// as is the case on Intel processors that the assembler code here is for.
static uint32_t crc32c_sw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;
    unsigned char const *data = buf;
    while (len && ((uintptr_t)data & 7) != 0) {
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
        len--;
    }
    size_t n = len >> 3;
    for (size_t i = 0; i < n; i++) {
        uint64_t word = crc ^ ((uint64_t const *)data)[i];
        crc = crc32c_table[7][word & 0xff] ^
              crc32c_table[6][(word >> 8) & 0xff] ^
              crc32c_table[5][(word >> 16) & 0xff] ^
              crc32c_table[4][(word >> 24) & 0xff] ^
              crc32c_table[3][(word >> 32) & 0xff] ^
              crc32c_table[2][(word >> 40) & 0xff] ^
              crc32c_table[1][(word >> 48) & 0xff] ^
              crc32c_table[0][word >> 56];
    }
    data += n << 3;
    len &= 7;
    while (len) {
        len--;
        crc = (crc >> 8) ^ crc32c_table[0][(crc ^ *data++) & 0xff];
    }
    return crc;
}

// Apply the zeros operator table to crc.
static uint32_t crc32c_shift(uint32_t const zeros[][256], uint32_t crc) {
    return zeros[0][crc & 0xff] ^ zeros[1][(crc >> 8) & 0xff] ^
           zeros[2][(crc >> 16) & 0xff] ^ zeros[3][crc >> 24];
}

// Compute CRC-32C using the Intel hardware instruction. Three crc32q
// instructions are run in parallel on a single core. This gives a
// factor-of-three speedup over a single crc32q instruction, since the
// throughput of that instruction is one cycle, but the latency is three
// cycles.
static uint32_t crc32c_hw(uint32_t crc, void const *buf, size_t len) {
    if (buf == NULL)
        return 0;

    // Pre-process the crc.
    uint64_t crc0 = crc ^ 0xffffffff;

    // Compute the crc for up to seven leading bytes, bringing the data pointer
    // to an eight-byte boundary.
    unsigned char const *next = buf;
    while (len && ((uintptr_t)next & 7) != 0) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Compute the crc on sets of LONG*3 bytes, making use of three ALUs in
    // parallel on a single core.
    while (len >= LONG*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + LONG;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" LONGx1 "(%3), %1\n\t"
                    "crc32q\t" LONGx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_long, crc0) ^ crc2;
        next += LONG*2;
        len -= LONG*3;
    }

    // Do the same thing, but now on SHORT*3 blocks for the remaining data less
    // than a LONG*3 block.
    while (len >= SHORT*3) {
        uint64_t crc1 = 0;
        uint64_t crc2 = 0;
        unsigned char const *end = next + SHORT;
        do {
            __asm__("crc32q\t" "(%3), %0\n\t"
                    "crc32q\t" SHORTx1 "(%3), %1\n\t"
                    "crc32q\t" SHORTx2 "(%3), %2"
                    : "+r"(crc0), "+r"(crc1), "+r"(crc2)
                    : "r"(next), "m"(*next));
            next += 8;
        } while (next < end);
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc1;
        crc0 = crc32c_shift(crc32c_short, crc0) ^ crc2;
        next += SHORT*2;
        len -= SHORT*3;
    }

    // Compute the crc on the remaining eight-byte units less than a SHORT*3
    // block.
    unsigned char const *end = next + (len - (len & 7));
    while (next < end) {
        __asm__("crc32q\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next += 8;
    }
    len &= 7;

    // Compute the crc for up to seven trailing bytes.
    while (len) {
        __asm__("crc32b\t" "(%1), %0"
                : "+r"(crc0)
                : "r"(next), "m"(*next));
        next++;
        len--;
    }

    // Return the crc, post-processed.
    return ~(uint32_t)crc0;
}

// Check for SSE 4.2.  SSE 4.2 was first supported in Nehalem processors
// introduced in November, 2008.  This does not check for the existence of the
// cpuid instruction itself, which was introduced on the 486SL in 1992, so this
// will fail on earlier x86 processors.  cpuid works on all Pentium and later
// processors.
#define SSE42(have) \
    do { \
        uint32_t eax, ecx; \
        eax = 1; \
        __asm__("cpuid" \
                : "=c"(ecx) \
                : "a"(eax) \
                : "%ebx", "%edx"); \
        (have) = (ecx >> 20) & 1; \
    } while (0)

// Compute a CRC-32C.  If the crc32 instruction is available, use the hardware
// version.  Otherwise, use the software version.
uint32_t crc32c(uint32_t crc, void const *buf, size_t len) {
    int sse42;
    SSE42(sse42);
    return sse42 ? crc32c_hw(crc, buf, len) : crc32c_sw(crc, buf, len);
}

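The three-way split in crc32c_hw() works because the raw CRC register update is linear over GF(2): the CRC of A followed by B equals the CRC of A advanced over len(B) zero bytes, XORed with the CRC of B started from zero. The sketch below illustrates that identity in slow, table-free software. The names `crc_raw`, `shift_zeros` and `crc_three_way` are made up for this illustration, and `shift_zeros` stands in for what crc32c_shift() does with precomputed tables.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x82f63b78u  /* reflected CRC-32C polynomial */

/* Raw CRC register update: no pre- or post-inversion. */
static uint32_t crc_raw(uint32_t crc, const unsigned char *p, size_t len) {
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ POLY : crc >> 1;
    }
    return crc;
}

/* Advance the register as if len zero bytes followed. */
static uint32_t shift_zeros(uint32_t crc, size_t len) {
    while (len--)
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ POLY : crc >> 1;
    return crc;
}

/* CRC of three equal-length blocks computed as three independent streams
   and then merged, mirroring crc0/crc1/crc2 in crc32c_hw(). */
static uint32_t crc_three_way(uint32_t crc, const unsigned char *buf,
                              size_t block) {
    uint32_t crc0 = crc_raw(crc, buf, block);
    uint32_t crc1 = crc_raw(0, buf + block, block);
    uint32_t crc2 = crc_raw(0, buf + 2 * block, block);
    crc0 = shift_zeros(crc0, block) ^ crc1;
    crc0 = shift_zeros(crc0, block) ^ crc2;
    return crc0;
}
```

For any buffer of 3 * block bytes, `crc_three_way(c, buf, block)` should equal `crc_raw(c, buf, 3 * block)`. Only the first stream carries the incoming CRC, which is why crc1 and crc2 start at zero in the hardware routine.
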
Code to generate crc32c.h (the site does not allow me to post the tables themselves, because of the 30,000-character limit on answers):

// Generate crc32c.h for crc32c.c.

#include <stdio.h>
#include <stdint.h>

#define LONG 8192
#define SHORT 256

// Print a 2-D table of four-byte constants in hex.
static void print_table(uint32_t *tab, size_t rows, size_t cols, char *name) {
    printf("static uint32_t const %s[][%zu] = {\n", name, cols);
    size_t end = rows * cols;
    size_t k = 0;
    for (;;) {
        fputs("   {", stdout);
        size_t n = 0, j = 0;
        for (;;) {
            printf("0x%08x", tab[k + n]);
            if (++n == cols)
                break;
            putchar(',');
            if (++j == 6) {
                fputs("\n   ", stdout);
                j = 0;
            }
            putchar(' ');
        }
        k += cols;
        if (k == end)
            break;
        puts("},");
    }
    puts("}\n};");
}

/* CRC-32C (iSCSI) polynomial in reversed bit order. */
#define POLY 0x82f63b78

static void crc32c_word_table(void) {
    uint32_t table[8][256];

    // Generate byte-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~n;
        for (unsigned k = 0; k < 8; k++)
            crc = crc & 1 ? (crc >> 1) ^ POLY : crc >> 1;
        table[0][n] = ~crc;
    }

    // Use byte-wise table to generate word-wise table.
    for (unsigned n = 0; n < 256; n++) {
        uint32_t crc = ~table[0][n];
        for (unsigned k = 1; k < 8; k++) {
            crc = table[0][crc & 0xff] ^ (crc >> 8);
            table[k][n] = ~crc;
        }
    }

    // Print table.
    print_table(table[0], 8, 256, "crc32c_table");
}

// Return a(x) multiplied by b(x) modulo p(x), where p(x) is the CRC
// polynomial. For speed, this requires that a not be zero.
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t prod = 0;
    for (;;) {
        if (a & 0x80000000) {
            prod ^= b;
            if ((a & 0x7fffffff) == 0)
                break;
        }
        a <<= 1;
        b = b & 1 ? (b >> 1) ^ POLY : b >> 1;
    }
    return prod;
}

/* Take a length and build four lookup tables for applying the zeros operator
   for that length, byte-by-byte, on the operand. */
static void crc32c_zero_table(size_t len, char *name) {
    // Generate operator for len zeros.
    uint32_t op = 0x80000000;               // 1 (x^0)
    uint32_t sq = op >> 4;                  // x^4
    while (len) {
        sq = multmodp(sq, sq);              // x^2^(k+3), k == len bit position
        if (len & 1)
            op = multmodp(sq, op);
        len >>= 1;
    }

    // Generate table to update each byte of a CRC using op.
    uint32_t table[4][256];
    for (unsigned n = 0; n < 256; n++) {
        table[0][n] = multmodp(op, n);
        table[1][n] = multmodp(op, n << 8);
        table[2][n] = multmodp(op, n << 16);
        table[3][n] = multmodp(op, n << 24);
    }

    // Print the table to stdout.
    print_table(table[0], 4, 256, name);
}

int main(void) {
    puts(
"// crc32c.h\n"
"// Tables and constants for crc32c.c software and hardware calculations.\n"
"\n"
"// Table for a 64-bits-at-a-time software CRC-32C calculation. This table\n"
"// has built into it the pre and post bit inversion of the CRC."
    );
    crc32c_word_table();
    puts(
"\n// Block sizes for three-way parallel crc computation.  LONG and SHORT\n"
"// must both be powers of two.  The associated string constants must be set\n"
"// accordingly, for use in constructing the assembler instructions."
        );
    printf("#define LONG %d\n", LONG);
    printf("#define LONGx1 \"%d\"\n", LONG);
    printf("#define LONGx2 \"%d\"\n", 2 * LONG);
    printf("#define SHORT %d\n", SHORT);
    printf("#define SHORTx1 \"%d\"\n", SHORT);
    printf("#define SHORTx2 \"%d\"\n", 2 * SHORT);
    puts(
"\n// Table to shift a CRC-32C by LONG bytes."
    );
    crc32c_zero_table(8192, "crc32c_long");
    puts(
"\n// Table to shift a CRC-32C by SHORT bytes."
    );
    crc32c_zero_table(256, "crc32c_short");
    return 0;
}

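The "zeros operator" built in crc32c_zero_table() can be read as the polynomial x^(8*len) mod p(x), obtained by repeated squaring, so that advancing a CRC over len zero bytes becomes one modular multiplication. The sketch below checks that reading against the slow approach of feeding zero bits one at a time. `multmodp` is taken from the generator above; `zeros_operator` and `feed_zeros` are names introduced here for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

#define POLY 0x82f63b78u  /* CRC-32C polynomial in reversed bit order */

/* a(x) * b(x) mod p(x); bit 31 holds the x^0 coefficient. a must be nonzero. */
static uint32_t multmodp(uint32_t a, uint32_t b) {
    uint32_t prod = 0;
    for (;;) {
        if (a & 0x80000000u) {
            prod ^= b;
            if ((a & 0x7fffffffu) == 0)
                break;
        }
        a <<= 1;
        b = (b & 1) ? (b >> 1) ^ POLY : b >> 1;
    }
    return prod;
}

/* x^(8*len) mod p(x), by squaring x^4 once per bit of len. */
static uint32_t zeros_operator(size_t len) {
    uint32_t op = 0x80000000u;          /* 1 (x^0) */
    uint32_t sq = op >> 4;              /* x^4 */
    while (len) {
        sq = multmodp(sq, sq);          /* x^8, x^16, x^32, ... */
        if (len & 1)
            op = multmodp(sq, op);
        len >>= 1;
    }
    return op;
}

/* Reference: advance the register over len zero bytes, one bit at a time. */
static uint32_t feed_zeros(uint32_t crc, size_t len) {
    while (len--)
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ POLY : crc >> 1;
    return crc;
}
```

For a nonzero register value `crc`, `multmodp(zeros_operator(len), crc)` should match `feed_zeros(crc, len)`. The generator goes one step further and tabulates that product per byte of the CRC, which is what crc32c_shift() looks up.
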
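【Comments】:

This was written for the GNU compiler (gcc), which uses AT&T syntax for assembler instructions, as opposed to Intel syntax. AT&T syntax is much clearer about which instruction is being generated, since it does not depend on argument typing for that (e.g. dword ptr, etc.). Your assembler likely uses Intel syntax, where the crc32 "instruction" can actually generate one of six different instructions. Which one has to be determined by the assembler, and by a human trying to read the code, based on the nature of the arguments.

The reason for processing 3 buffers in parallel is that the CRC32C instruction is pipelined and has a latency of 3 cycles and a throughput of 1 cycle: you can get a CRC32C instruction retired every clock cycle if the result is not used as input to another CRC32C instruction for 3 cycles... There is only one ALU capable of performing CRC32C. Instructions are issued to it via port 1, and this ALU performs the "complex/slow" integer instructions. The other ALUs cannot handle CRC32C. intel.com/content/dam/www/public/us/en/documents/manuals/…

Thanks! I misunderstood why doing four CRC instructions in parallel didn't help. I will fix the comments.

I have wrapped the code in a library for Windows and added a .NET wrapper and a NuGet package. I have also sped up the software fallback by 50%.

Nice answer, but note that C++ constexpr initialization of the lookup table could be faster than this C version, since you may pay a small cost on every call because of pthread_once_t.

【Answer 2】: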

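Mark Adler's answer is correct and complete, but those looking for a quick and easy way to integrate CRC-32C into their application may find the code a little hard to adapt, especially if they are using Windows and .NET.

I have created a library that implements CRC-32C using either the hardware or the software method, depending on the available hardware. It is available as a NuGet package for C++ and .NET. It is open source, of course.

Besides packaging Mark Adler's code above, I have found a simple way to improve the throughput of the software fallback by 50%. On my computer, the library now reaches 2 GB/s in software and over 20 GB/s in hardware. For those curious, here is the optimized software implementation: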

// Note: `buffer` (presumably a byte-pointer typedef) and `table` (the
// multi-row lookup table) are defined elsewhere in the library.
static uint32_t append_table(uint32_t crci, buffer input, size_t length)
{
    buffer next = input;
#ifdef _M_X64
    uint64_t crc;
#else
    uint32_t crc;
#endif

    crc = crci ^ 0xffffffff;
#ifdef _M_X64
    while (length && ((uintptr_t)next & 7) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 16)
    {
        crc ^= *(uint64_t *)next;
        uint64_t high = *(uint64_t *)(next + 8);
        crc = table[15][crc & 0xff]
            ^ table[14][(crc >> 8) & 0xff]
            ^ table[13][(crc >> 16) & 0xff]
            ^ table[12][(crc >> 24) & 0xff]
            ^ table[11][(crc >> 32) & 0xff]
            ^ table[10][(crc >> 40) & 0xff]
            ^ table[9][(crc >> 48) & 0xff]
            ^ table[8][crc >> 56]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][(high >> 24) & 0xff]
            ^ table[3][(high >> 32) & 0xff]
            ^ table[2][(high >> 40) & 0xff]
            ^ table[1][(high >> 48) & 0xff]
            ^ table[0][high >> 56];
        next += 16;
        length -= 16;
    }
#else
    while (length && ((uintptr_t)next & 3) != 0)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    while (length >= 12)
    {
        crc ^= *(uint32_t *)next;
        uint32_t high = *(uint32_t *)(next + 4);
        uint32_t high2 = *(uint32_t *)(next + 8);
        crc = table[11][crc & 0xff]
            ^ table[10][(crc >> 8) & 0xff]
            ^ table[9][(crc >> 16) & 0xff]
            ^ table[8][crc >> 24]
            ^ table[7][high & 0xff]
            ^ table[6][(high >> 8) & 0xff]
            ^ table[5][(high >> 16) & 0xff]
            ^ table[4][high >> 24]
            ^ table[3][high2 & 0xff]
            ^ table[2][(high2 >> 8) & 0xff]
            ^ table[1][(high2 >> 16) & 0xff]
            ^ table[0][high2 >> 24];
        next += 12;
        length -= 12;
    }
#endif
    while (length)
    {
        crc = table[0][(crc ^ *next++) & 0xff] ^ (crc >> 8);
        --length;
    }
    return (uint32_t)crc ^ 0xffffffff;
}

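As you can see, it merely crunches a larger block at a time. It needs a larger lookup table, but it is still cache-friendly. The table is generated the same way, only with more rows.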
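
As a rough sketch of what "the same way, only with more rows" could look like, assuming a plain table without built-in inversion (the function above inverts explicitly): row k holds the CRC of a byte followed by k zero bytes, so each row is derived from the previous one by one more byte-wise step. `ROWS`, `build_tables`, `crc_slice16` and `crc_bytes` are names chosen here and are not taken from the library. The 16-byte step is written byte by byte for clarity and endian independence, not for speed.

```c
#include <stddef.h>
#include <stdint.h>

#define ROWS 16
static uint32_t table[ROWS][256];

static void build_tables(void) {
    /* Row 0: the ordinary byte-wise table. */
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t crc = n;
        for (int k = 0; k < 8; k++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82f63b78u : crc >> 1;
        table[0][n] = crc;
    }
    /* Row k: row k-1 advanced over one more zero byte. */
    for (uint32_t n = 0; n < 256; n++)
        for (int k = 1; k < ROWS; k++) {
            uint32_t prev = table[k - 1][n];
            table[k][n] = table[0][prev & 0xff] ^ (prev >> 8);
        }
}

/* One 16-byte step: the first byte uses the last row, as in append_table. */
static uint32_t crc_slice16(uint32_t crc, const unsigned char *p) {
    uint32_t r = 0;
    for (int i = 0; i < 16; i++) {
        unsigned char b = p[i];
        if (i < 4)
            b ^= (crc >> (8 * i)) & 0xff;  /* fold the CRC into bytes 0..3 */
        r ^= table[15 - i][b];
    }
    return r;
}

/* Reference byte-at-a-time update on the raw register. */
static uint32_t crc_bytes(uint32_t crc, const unsigned char *p, size_t len) {
    while (len--)
        crc = table[0][(crc ^ *p++) & 0xff] ^ (crc >> 8);
    return crc;
}
```

After `build_tables()`, `crc_slice16(crc, p)` should agree with `crc_bytes(crc, p, 16)` for any 16-byte block. With 16 rows of 256 four-byte entries the table occupies 16 KiB.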

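Another thing I explored was using the PCLMULQDQ instruction to get hardware acceleration on AMD processors. I have managed to port Intel's CRC patch for zlib (also available on GitHub) to the CRC-32C polynomial ~~except for the magic constant 0x9db42487. If anyone is able to decipher that one, please let me know~~. After supersaw7's excellent explanation on reddit, I have also ported the elusive 0x9db42487 constant; I just need to find some time to polish and test it.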

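【Comments】:

+1 Thanks for sharing your code. It helped me a lot when porting it to Delphi.

I fixed the link to the patch and added some additional links. Have you made any progress on this, Robert?

It seems that cloudflare's zlib with PCLMULQDQ support does not use the constant... Maybe that is useful to you?

PCLMULQDQ is no longer a mystery. See the updated answer.

@RobertVažan - Probably too late, but I have working versions using pclmulqdq, converted to work with the Visual Studio assembler (ML64.EXE), for both left- and right-shifting CRCs and two polynomials. On my system, an Intel 3770K at 3.5 GHz, the speed is about 3.3 GB/s.

【Answer 3】: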

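First of all, Intel's CRC32 instruction computes CRC-32C (that is, it uses a different polynomial than regular CRC32; see the Wikipedia CRC32 entry).

To use Intel's hardware-accelerated CRC32C with gcc, you can:

1. Use inline assembly in C code via the asm statement.
2. Use the intrinsics _mm_crc32_u8, _mm_crc32_u16, _mm_crc32_u32 or _mm_crc32_u64. They are documented for Intel's compiler icc, but gcc also implements them.

This is how you would do it with _mm_crc32_u8, which takes one byte at a time. Using _mm_crc32_u64 would improve performance further, since it takes 8 bytes at a time.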

#include <nmmintrin.h>  /* SSE 4.2 intrinsics */
#include <stddef.h>
#include <stdint.h>

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  uint32_t hash = 0;
  size_t i = 0;
  for (i=0;i<len;i++) {
    hash = _mm_crc32_u8(hash, bytes[i]);
  }

  return hash;
}

To compile this, you need to pass -msse4.2 in CFLAGS, as in gcc -g -msse4.2 test.c; otherwise it will complain about an undefined reference to _mm_crc32_u8.
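
As a sketch of the 8-bytes-at-a-time variant mentioned above, here is one way it might look with _mm_crc32_u64, taking the running CRC as a parameter so that data can be processed in chunks. It assumes an x86-64 target and GCC or Clang. The function name `sse42_crc32c_update` is made up for this example, and the `target` attribute is used here only so that the single function can be built without passing -msse4.2 for the whole file.

```c
#include <nmmintrin.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Update a raw CRC-32C register with len bytes (no pre/post inversion). */
__attribute__((target("sse4.2")))
static uint32_t sse42_crc32c_update(uint32_t crc, const void *buf, size_t len) {
    const unsigned char *p = buf;
    uint64_t c = crc;
    while (len >= 8) {
        uint64_t word;
        memcpy(&word, p, 8);            /* load that is safe when unaligned */
        c = _mm_crc32_u64(c, word);
        p += 8;
        len -= 8;
    }
    uint32_t c32 = (uint32_t)c;
    while (len--)                       /* up to seven trailing bytes */
        c32 = _mm_crc32_u8(c32, *p++);
    return c32;
}
```

To get the conventional CRC-32C, start from 0xFFFFFFFF, feed each chunk's result into the next call, and invert the final value. The caller is still responsible for checking that the CPU supports SSE 4.2 before calling it.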

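If you want to fall back to a plain C implementation when the instruction is not available on the platform where the executable runs, you can use GCC's ifunc attribute, like this: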

uint32_t sse42_crc32(const uint8_t *bytes, size_t len)
{
  /* use _mm_crc32_u* here */
}

uint32_t default_crc32(const uint8_t *bytes, size_t len)
{
  /* pure C implementation */
}

/* this will be called at load time to decide which function to really use */
/* sse42_crc32 if SSE 4.2 is supported */
/* default_crc32 if not */
static void * resolve_crc32(void) {
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse4.2")) return sse42_crc32;

  return default_crc32;
}

/* crc32() implementation will be resolved at load time to either */
/* sse42_crc32() or default_crc32() */
uint32_t crc32(const uint8_t *bytes, size_t len) __attribute__ ((ifunc ("resolve_crc32")));

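【Comments】:

Is there a way to get the checksum if I am processing, say, a 1 MB block in chunks using the above method?

You can create a version of this function where the initial hash value is passed as a parameter. That would allow you to process the data chunk by chunk.

【Answer 4】: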

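I compare various algorithms here: https://github.com/htot/crc32c

The fastest algorithm is taken from Intel's crc_iscsi_v_pcl.asm assembly code (which is available in modified form in the Linux kernel) and uses a C wrapper (crcintelasm.cc) included in this project.

To be able to run this code on 32-bit platforms as well, it was first ported to C (crc32intelc) where possible; a small amount of inline assembly is still required. Some parts of the code depend on the bitness: crc32q is not available on 32-bit and neither is movq, so these are placed in macros (crc32intel.h) with alternative code for 32-bit platforms.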
