./cpufp --thread_pool=[0]
Number Threads: 1
Thread Pool Binding: 0
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 694.48 GOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 347.23 GOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 347.24 GFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 173.62 GFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 86.802 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 173.62 GFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 86.81 GFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 167.59 GFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 83.98 GFLOPS     |
| SSE             | ADD(MUL(f32,f32),f32) | 86.195 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 42.975 GFLOPS    |
--------------------------------------------------------------

./cpufp --thread_pool=[0,1,2,3,4,5,6,7]
Number Threads: 8
Thread Pool Binding: 0 1 2 3 4 5 6 7
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 5.4384 TOPS      |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 2.7151 TOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 2.7021 TFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 1.3512 TFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 673.64 GFLOPS    |
| FMA             | FMA(f32,f32,f32)      | 1.3384 TFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 666.42 GFLOPS    |
| AVX             | ADD(MUL(f32,f32),f32) | 1.2684 TFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 637.01 GFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 656.07 GFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 327.59 GFLOPS    |
--------------------------------------------------------------

./cpufp --thread_pool=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
Number Threads: 16
Thread Pool Binding: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
--------------------------------------------------------------
| Instruction Set | Core Computation      | Peak Performance |
| AVX512_VNNI     | DP4A(s32,u8,s8)       | 10.35 TOPS       |
| AVX512_VNNI     | DP2A(s32,s16,s16)     | 5.1765 TOPS      |
| AVX512_BF16     | DP2A(f32,bf16,bf16)   | 5.1508 TFLOPS    |
| AVX512F         | FMA(f32,f32,f32)      | 2.5749 TFLOPS    |
| AVX512F         | FMA(f64,f64,f64)      | 1.281 TFLOPS     |
| FMA             | FMA(f32,f32,f32)      | 2.5363 TFLOPS    |
| FMA             | FMA(f64,f64,f64)      | 1.262 TFLOPS     |
| AVX             | ADD(MUL(f32,f32),f32) | 2.3755 TFLOPS    |
| AVX             | ADD(MUL(f64,f64),f64) | 1.1893 TFLOPS    |
| SSE             | ADD(MUL(f32,f32),f32) | 1.2362 TFLOPS    |
| SSE2            | ADD(MUL(f64,f64),f64) | 614.57 GFLOPS    |
--------------------------------------------------------------