rescue-prime:基于Goldilocks域的Rescue-Prime 哈希函数加速
Posted mutourend
tags:
篇首语:本文由小常识网(cha138.com)小编为大家整理,主要介绍了rescue-prime:基于Goldilocks域的Rescue-Prime 哈希函数加速相关的知识,希望对你有一定的参考价值。
1. 引言
前序博客:
所谓计算友好的哈希函数,是指:
- 基于素数域元素,而不是 通常的如SHA3-256/SHA256/BLAKE3中的raw bits/bytes/N-bit words。原因是,在STARK证明系统中,基于素数域的计算电路更易于证明。
Rescue-Prime哈希函数为计算友好的哈希函数,目前已用于Winterfell STARK证明系统中。
在Winterfell STARK证明系统中,基于的素数域为Goldilocks域,即 q = 2 64 − 2 32 + 1 q=2^64-2^32+1 q=264−232+1,相应的哈希函数API接口为:
- 输入: N > 0 N>0 N>0个元素,每个元素均 ∈ Z q , q = 2 64 − 2 32 + 1 \\in Z_q,q=2^64-2^32+1 ∈Zq,q=264−232+1。
- 输出:4个元素,每个元素均 ∈ Z q , q = 2 64 − 2 32 + 1 \\in Z_q,q=2^64-2^32+1 ∈Zq,q=264−232+1。
开源代码见:
在https://github.com/itzmeanjan/rescue-prime中,做了scalar和vectorized Rescue两种实现。若目标CPU支持AVX2或NEON指令,则运行Rescue permutation将更快。
make benchmark # benchmarks scalar implementation
AVX2=1 make benchmark # benchmarks AVX2 implementation
NEON=1 make benchmark # benchmarks NEON implementation
CPU加速后,在各平台benchmark的性能为:
- On Intel® Xeon® Platinum 8375C CPU @ 2.90GHz ( Scalar implementation compiled with GCC )
2022-12-22T15:39:15+00:00
Running ./bench/a.out
Run on (128 X 2072.24 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x64)
L1 Instruction 32 KiB (x64)
L2 Unified 1280 KiB (x64)
L3 Unified 55296 KiB (x2)
Load Average: 0.11, 0.04, 0.01
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 31499 ns 43123 ns 22224 31.7467k/s 33.988k 31.489k 31.363k
bench_rphash::hash/4/manual_time 32136 ns 36090 ns 21782 31.118k/s 38.102k 32.124k 31.998k
bench_rphash::hash/8/manual_time 31821 ns 39618 ns 22000 31.426k/s 42.13k 31.806k 31.681k
bench_rphash::hash/16/manual_time 63608 ns 79070 ns 11005 15.7213k/s 68.194k 63.588k 63.417k
bench_rphash::hash/32/manual_time 127188 ns 157987 ns 5504 7.8624k/s 129.934k 127.15k 126.867k
bench_rphash::hash/64/manual_time 254360 ns 315933 ns 2752 3.93144k/s 288.681k 254.272k 253.928k
bench_rphash::hash/128/manual_time 508638 ns 632906 ns 1376 1.96603k/s 522.51k 508.456k 507.97k
- On Intel® Xeon® Platinum 8375C CPU @ 2.90GHz ( AVX2 implementation compiled with GCC )
2022-12-22T15:40:04+00:00
Running ./bench/a.out
Run on (128 X 1292.25 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x64)
L1 Instruction 32 KiB (x64)
L2 Unified 1280 KiB (x64)
L3 Unified 55296 KiB (x2)
Load Average: 0.13, 0.06, 0.01
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 13742 ns 25366 ns 50938 72.7718k/s 16.197k 13.738k 13.646k
bench_rphash::hash/4/manual_time 13756 ns 17694 ns 50886 72.6964k/s 21.138k 13.752k 13.677k
bench_rphash::hash/8/manual_time 13751 ns 21534 ns 50901 72.7217k/s 16.273k 13.747k 13.669k
bench_rphash::hash/16/manual_time 27480 ns 42949 ns 25473 36.3899k/s 32.871k 27.473k 27.356k
bench_rphash::hash/32/manual_time 54944 ns 85784 ns 12740 18.2003k/s 59.909k 54.927k 54.784k
bench_rphash::hash/64/manual_time 109873 ns 171524 ns 6371 9.10138k/s 185.71k 109.828k 109.609k
bench_rphash::hash/128/manual_time 219703 ns 344066 ns 3186 4.55159k/s 232.839k 219.643k 219.373k
- On Intel® Xeon® Platinum 8375C CPU @ 2.90GHz ( Scalar implementation compiled with Clang )
2022-12-22T15:40:48+00:00
Running ./bench/a.out
Run on (128 X 1295.09 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x64)
L1 Instruction 32 KiB (x64)
L2 Unified 1280 KiB (x64)
L3 Unified 55296 KiB (x2)
Load Average: 0.14, 0.08, 0.02
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 9735 ns 22576 ns 71907 102.723k/s 12.152k 9.732k 9.673k
bench_rphash::hash/4/manual_time 9746 ns 14100 ns 71820 102.606k/s 18.543k 9.742k 9.681k
bench_rphash::hash/8/manual_time 9749 ns 18356 ns 71795 102.578k/s 11.831k 9.746k 9.681k
bench_rphash::hash/16/manual_time 19480 ns 36583 ns 35937 51.3342k/s 23.983k 19.474k 19.382k
bench_rphash::hash/32/manual_time 38931 ns 73003 ns 17979 25.6863k/s 42.868k 38.921k 38.769k
bench_rphash::hash/64/manual_time 77837 ns 145841 ns 8994 12.8474k/s 81.722k 77.814k 77.656k
bench_rphash::hash/128/manual_time 155657 ns 291537 ns 4497 6.4244k/s 158.186k 155.614k 155.404k
- On Intel® Xeon® Platinum 8375C CPU @ 2.90GHz ( AVX2 implementation compiled with Clang )
2022-12-22T15:41:42+00:00
Running ./bench/a.out
Run on (128 X 2035.27 MHz CPU s)
CPU Caches:
L1 Data 48 KiB (x64)
L1 Instruction 32 KiB (x64)
L2 Unified 1280 KiB (x64)
L3 Unified 55296 KiB (x2)
Load Average: 0.21, 0.11, 0.03
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 9603 ns 22470 ns 72890 104.13k/s 17.661k 9.599k 9.527k
bench_rphash::hash/4/manual_time 9612 ns 13965 ns 72826 104.039k/s 18.814k 9.609k 9.535k
bench_rphash::hash/8/manual_time 9613 ns 18212 ns 72818 104.024k/s 16.387k 9.61k 9.532k
bench_rphash::hash/16/manual_time 19203 ns 36295 ns 36448 52.0739k/s 23.494k 19.198k 19.104k
bench_rphash::hash/32/manual_time 38372 ns 72446 ns 18242 26.0607k/s 41.034k 38.361k 38.22k
bench_rphash::hash/64/manual_time 76717 ns 144751 ns 9124 13.0349k/s 80.947k 76.692k 76.503k
bench_rphash::hash/128/manual_time 153405 ns 289330 ns 4563 6.51869k/s 178.467k 153.358k 153.099k
- On Intel® Core™ i5-8279U CPU @ 2.40GHz ( Scalar implementation compiled with Clang )
2022-12-22T19:43:28+04:00
Running ./bench/a.out
Run on (8 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB
Load Average: 1.19, 2.15, 2.89
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 11417 ns 137884 ns 61405 87.5911k/s 119.946k 11.218k 11.065k
bench_rphash::hash/4/manual_time 11567 ns 54524 ns 61425 86.4507k/s 120.79k 11.236k 11.071k
bench_rphash::hash/8/manual_time 11464 ns 95950 ns 61527 87.228k/s 145.688k 11.236k 11.081k
bench_rphash::hash/16/manual_time 22696 ns 189947 ns 30847 44.0598k/s 138.261k 22.402k 22.13k
bench_rphash::hash/32/manual_time 45439 ns 381059 ns 15416 22.0077k/s 191.966k 44.724k 44.247k
bench_rphash::hash/64/manual_time 91015 ns 763956 ns 7709 10.9872k/s 1.46905M 89.331k 88.494k
bench_rphash::hash/128/manual_time 188629 ns 1596825 ns 3871 5.30141k/s 435.33k 178.67k 176.944k
- On Intel® Core™ i5-8279U CPU @ 2.40GHz ( AVX2 implementation compiled with Clang )
2022-12-22T19:45:04+04:00
Running ./bench/a.out
Run on (8 X 2400 MHz CPU s)
CPU Caches:
L1 Data 32 KiB
L1 Instruction 32 KiB
L2 Unified 256 KiB (x4)
L3 Unified 6144 KiB
Load Average: 1.97, 2.11, 2.79
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 8858 ns 134181 ns 79128 112.898k/s 126.803k 8.722k 8.545k
bench_rphash::hash/4/manual_time 8874 ns 50963 ns 79380 112.692k/s 123.209k 8.724k 8.55k
bench_rphash::hash/8/manual_time 8880 ns 92534 ns 79036 112.609k/s 97.469k 8.728k 8.549k
bench_rphash::hash/16/manual_time 17671 ns 185488 ns 39656 56.59k/s 98.486k 17.376k 17.082k
bench_rphash::hash/32/manual_time 35122 ns 368258 ns 19880 28.4719k/s 183.345k 34.633k 34.112k
bench_rphash::hash/64/manual_time 70485 ns 737101 ns 10012 14.1873k/s 4.48529M 69.141k 68.175k
bench_rphash::hash/128/manual_time 140581 ns 1482180 ns 4985 7.11335k/s 330.147k 138.158k 136.303k
- On ARM Neoverse-V1 aka AWS Graviton3 ( Scalar implementation compiled with GCC )
2022-12-22T15:48:54+00:00
Running ./bench/a.out
Run on (64 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 0.08, 0.02, 0.01
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 17051 ns 31612 ns 41058 58.6473k/s 23.325k 17.024k 16.957k
bench_rphash::hash/4/manual_time 17107 ns 22032 ns 40920 58.4551k/s 27.503k 17.08k 17.013k
bench_rphash::hash/8/manual_time 17060 ns 26937 ns 41039 58.617k/s 38.753k 17.032k 16.963k
bench_rphash::hash/16/manual_time 34090 ns 53631 ns 20533 29.3341k/s 45.077k 34.035k 33.944k
bench_rphash::hash/32/manual_time 68140 ns 107193 ns 10274 14.6757k/s 82.014k 68.029k 67.896k
bench_rphash::hash/64/manual_time 136230 ns 214227 ns 5138 7.34055k/s 146.935k 136.014k 135.83k
bench_rphash::hash/128/manual_time 272428 ns 428993 ns 2570 3.6707k/s 283.816k 271.989k 271.714k
- On ARM Neoverse-V1 aka AWS Graviton3 ( NEON implementation compiled with GCC )
2022-12-22T15:50:11+00:00
Running ./bench/a.out
Run on (64 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 0.15, 0.05, 0.02
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 17119 ns 31696 ns 40891 58.4156k/s 50.317k 17.092k 17.019k
bench_rphash::hash/4/manual_time 17236 ns 22195 ns 40606 58.0188k/s 26.107k 17.21k 17.151k
bench_rphash::hash/8/manual_time 17152 ns 27013 ns 40812 58.3027k/s 40.492k 17.126k 17.056k
bench_rphash::hash/16/manual_time 34278 ns 53945 ns 20420 29.1735k/s 57.256k 34.225k 34.102k
bench_rphash::hash/32/manual_time 68511 ns 107882 ns 10215 14.5962k/s 92.947k 68.413k 68.261k
bench_rphash::hash/64/manual_time 136978 ns 215582 ns 5110 7.30046k/s 143.904k 136.786k 136.576k
bench_rphash::hash/128/manual_time 273909 ns 431178 ns 2555 3.65085k/s 286.895k 273.525k 273.209k
- On ARM Neoverse-V1 aka AWS Graviton3 ( Scalar implementation compiled with Clang )
2022-12-22T15:51:07+00:00
Running ./bench/a.out
Run on (64 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 0.13, 0.07, 0.02
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 17550 ns 32973 ns 39886 56.9786k/s 30.101k 17.521k 17.427k
bench_rphash::hash/4/manual_time 17560 ns 22696 ns 39864 56.948k/s 40.568k 17.53k 17.43k
bench_rphash::hash/8/manual_time 17555 ns 27777 ns 39855 56.9635k/s 26.61k 17.527k 17.433k
bench_rphash::hash/16/manual_time 35088 ns 55486 ns 19950 28.4996k/s 54.669k 35.032k 34.889k
bench_rphash::hash/32/manual_time 70136 ns 110887 ns 9978 14.258k/s 78.483k 70.023k 69.828k
bench_rphash::hash/64/manual_time 140230 ns 221700 ns 4992 7.13113k/s 149.666k 140.014k 139.697k
bench_rphash::hash/128/manual_time 280417 ns 443432 ns 2496 3.56611k/s 286.263k 280.025k 279.561k
- On ARM Neoverse-V1 aka AWS Graviton3 ( NEON implementation compiled with Clang )
2022-12-22T15:51:50+00:00
Running ./bench/a.out
Run on (64 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x64)
L1 Instruction 64 KiB (x64)
L2 Unified 1024 KiB (x64)
L3 Unified 32768 KiB (x1)
Load Average: 0.15, 0.09, 0.03
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations items_per_second max_exec_time (ns) median_exec_time (ns) min_exec_time (ns)
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bench_rphash::permutation/manual_time 18277 ns 33684 ns 38300 54.7146k/s 30.028k 18.25k 18.154k
bench_rphash::hash/4/manual_time 18287 ns 23426 ns 38278 54.6826k/s 25.803k 18.26k 18.169k
bench_rphash::hash/8/manual_time 18290 ns 28515 ns 38273 54.6752k/s 42.531k 18.262k 18.158k
bench_rphash::hash/16/manual_time 36553 ns 56968 ns 19150 27.3573k/s 45.235k 36.5k 36.317k
bench_rphash::hash/32/manual_time 73058 ns 113815 ns 9581 13.6877k/s 95.43k 72.949k 72.773k
bench_rphash::hash/64/manual_time 146111 ns 227554 ns 4791 6.84411k/s 156.203k 145.895k 145.631k
bench_rphash::hash/128/manual_time 292153 ns 455092 ns 2396 3.42286k/s 302.318k 291.729k 291.347k
相关背景知识可参看:
- STARK入门知识
- Rescue-Prime: a Standard Specification (SoK)
- Winterfell中的Rescue Prime
- https://github.com/novifinancial/winterfell(Rust)
- Rescue Permutation test vectors
以上是关于rescue-prime:基于Goldilocks域的Rescue-Prime 哈希函数加速的主要内容,如果未能解决你的问题,请参考以下文章