Platform: NVIDIA CUDA
  Device: NVIDIA GeForce RTX 4090
    Driver version  : 570.133.20 (Linux x64)
    Compute units   : 128
    Clock frequency : 2610 MHz

    Global memory bandwidth (GBPS)
      float   : 895.18
      float2  : 923.80
      float4  : 941.00
      float8  : 952.13
      float16 : 962.52

    Single-precision compute (GFLOPS)
      float   : 84098.24
      float2  : 81316.00
      float4  : 81409.93
      float8  : 80840.90
      float16 : 80439.52

    No half precision support! Skipped

    Double-precision compute (GFLOPS)
      double   : 1423.48
      double2  : 1421.67
      double4  : 1418.26
      double8  : 1411.35
      double16 : 1397.99

    Integer compute (GIOPS)
      int   : 44874.12
      int2  : 44940.15
      int4  : 44661.71
      int8  : 44780.54
      int16 : 44792.71

    Integer compute Fast 24bit (GIOPS)
      int   : 44748.47
      int2  : 44771.30
      int4  : 44763.04
      int8  : 44584.92
      int16 : 43890.11

    Integer char (8bit) compute (GIOPS)
      char   : 38131.23
      char2  : 37187.54
      char4  : 36365.29
      char8  : 30460.54
      char16 : 28798.10

    Integer short (16bit) compute (GIOPS)
      short   : 37390.55
      short2  : 36008.64
      short4  : 37428.57
      short8  : 32507.62
      short16 : 27648.33

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 16.58
      enqueueReadBuffer               : 17.27
      enqueueWriteBuffer non-blocking : 16.27
      enqueueReadBuffer non-blocking  : 16.69
      enqueueMapBuffer(for read)      : 18.00
        memcpy from mapped ptr        : 5.30
      enqueueUnmap(after write)       : 26.84
        memcpy to mapped ptr          : 17.77

    Kernel launch latency : 3.92 us

  Device: NVIDIA GeForce RTX 4090
    Driver version  : PoCL 7.0 (Linux x64)
    Compute units   : 128
    Clock frequency : 2610 MHz

    Global memory bandwidth (GBPS)
      float   : 893.98
      float2  : 922.62
      float4  : 940.90
      float8  : 951.27
      float16 : 960.86

    Single-precision compute (GFLOPS)
      float   : 74350.94
      float2  : 77464.59
      float4  : 80243.05
      float8  : 80875.78
      float16 : 80187.65

    Half-precision compute (GFLOPS)
      half   : 44700.20
      half2  : 89401.85
      half4  : 89370.85
      half8  : 88731.35
      half16 : 87873.76

    Double-precision compute (GFLOPS)
      double   : 1422.87
      double2  : 1421.34
      double4  : 1418.00
      double8  : 1411.15
      double16 : 1397.38

    Integer compute (GIOPS)
      int   : 30744.91
      int2  : 30581.63
      int4  : 30607.29
      int8  : 30897.89
      int16 : 30301.60

    Integer compute Fast 24bit (GIOPS)
      int   : 30686.79
      int2  : 30517.12
      int4  : 30543.35
      int8  : 30831.12
      int16 : 30575.51

    Integer char (8bit) compute (GIOPS)
      char   : 14530.12
      char2  : 21110.57
      char4  : 20030.94
      char8  : 20232.04
      char16 : 21248.74

    Integer short (16bit) compute (GIOPS)
      short   : 14339.66
      short2  : 21305.61
      short4  : 22127.96
      short8  : 20693.65
      short16 : 21423.40

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer              : 21.80
      enqueueReadBuffer               : 21.29
      enqueueWriteBuffer non-blocking : 21.79
      enqueueReadBuffer non-blocking  : 21.29
      enqueueMapBuffer(for read)      : 363980.28
        memcpy from mapped ptr        : 5.18
      enqueueUnmap(after write)       : 26.00
        memcpy to mapped ptr          : 18.52

    Kernel launch latency : 3634.32 us