Vectorized C# code with SIMD using Vector<T> running slower than classic loop

Posted 2023-02-16

【Posted】: 2018-01-11 01:51:21 【Question】:

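I've seen a few articles describing how Vector<T> is SIMD-enabled and implemented using JIT intrinsics, so the compiler will correctly emit AVX/SSE/... instructions when it is used, allowing much faster code than classic, linear loops (example here).

I decided to try rewriting a method I have to see whether I could get some speedup, but so far I have failed: the vectorized code runs 3 times slower than the original, and I'm not exactly sure why. Below are two versions of a method that checks whether two Span<float> instances agree element by element, that is, whether each pair of items at the same index lies on the same side of a threshold value.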

// Classic implementation
public static unsafe bool MatchElementwiseThreshold(this Span<float> x1, Span<float> x2, float threshold)
{
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
        for (int i = 0; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    return true;
}

// Vectorized
public static unsafe bool MatchElementwiseThresholdSIMD(this Span<float> x1, Span<float> x2, float threshold)
{
    // Setup the test vector
    int l = Vector<float>.Count;
    float* arr = stackalloc float[l];
    for (int i = 0; i < l; i++)
        arr[i] = threshold;
    Vector<float> cmp = Unsafe.Read<Vector<float>>(arr);
    fixed (float* px1 = &x1.DangerousGetPinnableReference(), px2 = &x2.DangerousGetPinnableReference())
    {
        // Iterate in chunks
        int
            div = x1.Length / l,
            mod = x1.Length % l,
            i = 0,
            offset = 0;
        for (; i < div; i += 1, offset += l)
        {
            Vector<float>
                v1 = Unsafe.Read<Vector<float>>(px1 + offset),
                v1cmp = Vector.GreaterThan<float>(v1, cmp),
                v2 = Unsafe.Read<Vector<float>>(px2 + offset),
                v2cmp = Vector.GreaterThan<float>(v2, cmp);
            float*
                pcmp1 = (float*)Unsafe.AsPointer(ref v1cmp),
                pcmp2 = (float*)Unsafe.AsPointer(ref v2cmp);
            for (int j = 0; j < l; j++)
                if (pcmp1[j] == 0 != (pcmp2[j] == 0))
                    return false;
        }

        // Test the remaining items, if any
        if (mod == 0) return true;
        for (i = x1.Length - mod; i < x1.Length; i++)
            if (px1[i] > threshold != px2[i] > threshold)
                return false;
    }
    return true;
}

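As I said, I tested both versions with BenchmarkDotNet, and the one using Vector<T> runs around 3 times slower than the other. I tried running the tests with spans of different lengths (from around 100 to over 2000), but the vectorized method is consistently much slower than the other one.

Am I missing something obvious here?

Thanks!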

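EDIT: the reason I'm using unsafe code and trying to optimize this code as much as possible without parallelizing it is that this method is already being called from within a Parallel.For iteration.

Besides, being able to parallelize code over multiple threads is generally not a good reason to leave the individual parallel tasks unoptimized.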

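【Comments】:

Just from my personal experience, I'd go the other direction and use Parallel.For for multithreading rather than going into unsafe code to speed up my code.

@Gordon I'm already using Parallel.For; this method is actually called in each parallel iteration.

If you are truly interested in performance, you might want to consider moving your code to C++, where you can use features that .NET doesn't support (at least from a coder's perspective), such as SSE2 and beyond. Bridge it with C++/CLI or straight P/Invoke. Using excessive unsafe code and pointers in C# is like fighting the language.

@MickyD I'm already using GPU acceleration in my library, but there is still a CPU-only section for when a CUDA GPU is not available, and I'd like to optimize it as much as possible. I'm sorry if my previous comment came out the wrong way. I didn't mean to sound rude or angry, I just wanted to explain why I'm interested in optimizing this code this way. Your observation is of course 100% valid.

It's all good, sir. Your project is a very exciting one. Good luck. :)

【Answer 1】: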

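I had the same problem. The solution was to uncheck the Prefer 32-bit option in the project properties.

SIMD is only enabled for 64-bit processes. So make sure your app either targets x64 directly or is compiled as Any CPU and not marked as 32-bit preferred. [Source]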
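One way to check whether a given build actually runs as a 64-bit, SIMD-accelerated process is to print the relevant flags at startup. The following is only a minimal sketch: the class name SimdCheck and the method Describe are hypothetical, while Environment.Is64BitProcess, Vector.IsHardwareAccelerated and Vector<float>.Count are standard .NET members.

```csharp
using System;
using System.Numerics;

public static class SimdCheck
{
    // Summarizes the settings that decide whether Vector<T> maps to SIMD instructions.
    public static string Describe() =>
        $"64-bit process: {Environment.Is64BitProcess}, " +
        $"SIMD accelerated: {Vector.IsHardwareAccelerated}, " +
        $"floats per Vector<float>: {Vector<float>.Count}";

    public static void Main() => Console.WriteLine(Describe());
}
```

If "SIMD accelerated" is reported as False, Vector<T> operations run through a software fallback, which would be consistent with the vectorized method losing to the plain loop. It is probably worth running this check in the same configuration (Release, same platform target) that the benchmark uses.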

【Comments】:

【Answer 2】:

** EDIT ** After reading a blog post by Marc Gravell, I see that this can be achieved simply...

public static bool MatchElementwiseThresholdSIMD(ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
{
    if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");

    if (Vector.IsHardwareAccelerated)
    {
        // Reinterpret the float spans as spans of whole vectors
        var vx1 = x1.NonPortableCast<float, Vector<float>>();
        var vx2 = x2.NonPortableCast<float, Vector<float>>();

        var vthreshold = new Vector<float>(threshold);
        for (int i = 0; i < vx1.Length; ++i)
        {
            var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
            var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
            if (Vector.Xor(v1cmp, v2cmp) != Vector<int>.Zero)
                return false;
        }

        // Slice off the part already handled by the vector loop
        x1 = x1.Slice(Vector<float>.Count * vx1.Length);
        x2 = x2.Slice(Vector<float>.Count * vx2.Length);
    }

    // Scalar loop for the remaining elements (or everything, without SIMD)
    for (var i = 0; i < x1.Length; i++)
        if (x1[i] > threshold != x2[i] > threshold)
            return false;

    return true;
}

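Now, this is not quite as fast as using arrays directly (if that's what you have), but it is still significantly faster than the non-SIMD version...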
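NonPortableCast comes from a preview build of the Span APIs. In released versions the same reinterpretation is, as far as I know, done with MemoryMarshal.Cast from System.Runtime.InteropServices. Here is a sketch of the method under that assumption; the class name ThresholdMatch is hypothetical and not part of the original answer.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

public static class ThresholdMatch
{
    public static bool MatchElementwiseThreshold(
        ReadOnlySpan<float> x1, ReadOnlySpan<float> x2, float threshold)
    {
        if (x1.Length != x2.Length) throw new ArgumentException("x1.Length != x2.Length");

        int i = 0;
        if (Vector.IsHardwareAccelerated)
        {
            // Assumption: MemoryMarshal.Cast stands in for the preview-only NonPortableCast.
            // It yields only whole vectors; leftover elements are handled below.
            ReadOnlySpan<Vector<float>> vx1 = MemoryMarshal.Cast<float, Vector<float>>(x1);
            ReadOnlySpan<Vector<float>> vx2 = MemoryMarshal.Cast<float, Vector<float>>(x2);

            var vthreshold = new Vector<float>(threshold);
            for (int k = 0; k < vx1.Length; k++)
            {
                // Each lane of the mask is all ones where the element exceeds the threshold.
                Vector<int> m1 = Vector.GreaterThan(vx1[k], vthreshold);
                Vector<int> m2 = Vector.GreaterThan(vx2[k], vthreshold);
                if (m1 != m2) return false; // some lane disagrees
            }
            i = vx1.Length * Vector<float>.Count;
        }

        // Scalar tail (or the whole input when SIMD is unavailable).
        for (; i < x1.Length; i++)
            if ((x1[i] > threshold) != (x2[i] > threshold))
                return false;

        return true;
    }
}
```

Comparing the two masks directly is equivalent to the Xor-against-zero test above, and the whole comparison stays in vector registers rather than being inspected lane by lane.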

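(Another edit...)

...and just for fun, I thought I would see how well this holds up when made fully generic, and the answer is: very well. So you can write code like the following, and it is just as efficient as the type-specific version (except in the non-hardware-accelerated case, where it is a bit less than twice as slow, but not completely terrible...)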

    public static bool MatchElementwiseThreshold<T>(ReadOnlySpan<T> x1, ReadOnlySpan<T> x2, T threshold)
        where T : struct
    {
        if (x1.Length != x2.Length)
            throw new ArgumentException("x1.Length != x2.Length");

        if (Vector.IsHardwareAccelerated)
        {
            var vx1 = x1.NonPortableCast<T, Vector<T>>();
            var vx2 = x2.NonPortableCast<T, Vector<T>>();

            var vthreshold = new Vector<T>(threshold);
            for (int i = 0; i < vx1.Length; ++i)
            {
                var v1cmp = Vector.GreaterThan(vx1[i], vthreshold);
                var v2cmp = Vector.GreaterThan(vx2[i], vthreshold);
                if (Vector.AsVectorInt32(Vector.Xor(v1cmp, v2cmp)) != Vector<int>.Zero)
                    return false;
            }

            // Slice them to handle the remaining elements
            x1 = x1.Slice(Vector<T>.Count * vx1.Length);
            x2 = x2.Slice(Vector<T>.Count * vx2.Length);
        }

        var comparer = System.Collections.Generic.Comparer<T>.Default;
        for (int i = 0; i < x1.Length; i++)
            if ((comparer.Compare(x1[i], threshold) > 0) != (comparer.Compare(x2[i], threshold) > 0))
                return false;

        return true;
    }

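【Comments】:

"Not as fast as using arrays directly": are you saying that vectorization still doesn't improve the speed?

【Answer 3】:

A vector is just a vector. It doesn't claim or guarantee that SIMD extensions are used. Use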

System.Numerics.Vector2

https://docs.microsoft.com/en-us/dotnet/standard/numerics#simd-enabled-vector-types

【Comments】:
