
Conversation

@bghira (Contributor) commented Nov 11, 2025

| Shape | Metal (GFLOPS) | CPU (GFLOPS) |
| --- | --- | --- |
| 19x19, Cin=256, Cout=4096, fp32 | 191.6 | 17.1 |
| 224x224, Cin=3, Cout=8, fp32 | 103.0 | 1.5 |
| 58x58, Cin=32, Cout=64, fp32 | 159.3 | 7.0 |
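For reference, the GFLOP/run figures in the perf log below can be reproduced from the tensor shapes alone. This is a minimal sketch, not ggml's actual code: it assumes the common accounting of `2*Kw*Kh*Cin - 1` FLOPs per output element (one multiply per MAC plus one fewer add in the reduction), which matches the logged numbers; the function name is hypothetical.

```python
# Sketch: reproducing the GFLOP/run figures from the shapes in the perf
# log. Assumes 2*Kw*Kh*Cin - 1 FLOPs per output element.

def conv2d_flops(w, h, cin, n, kw, kh, cout, stride=1, pad=0, dil=1):
    """FLOPs for one CONV_2D with ggml-style [W, H, C, N] shapes."""
    wout = (w + 2 * pad - dil * (kw - 1) - 1) // stride + 1
    hout = (h + 2 * pad - dil * (kh - 1) - 1) // stride + 1
    return (2 * kw * kh * cin - 1) * wout * hout * cout * n

# First row: 19x19 input, Cin=256, 4x4 kernel, Cout=4096, batch 16
print(conv2d_flops(19, 19, 256, 16, 4, 4, 4096) / 1e9)  # ≈ 137.42 GFLOP
# 224x224 input, Cin=3, 3x3 kernel, Cout=8, batch 1
print(conv2d_flops(224, 224, 3, 1, 3, 3, 8) / 1e6)      # ≈ 20.90 MFLOP
```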

@bghira (Contributor, Author) commented Nov 11, 2025

$ ./build/bin/test-backend-ops perf -b Metal -o CONV_2D
ggml_metal_device_init: tensor API disabled for pre-M5 device
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_device_init: GPU name:   Apple M3 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 103079.22 MB
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Max)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Max)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_conv_2d_f32_f32', name = 'kernel_conv_2d_f32_f32'
ggml_metal_library_compile_pipeline: loaded kernel_conv_2d_f32_f32                        0x142e19030 | th_max = 1024 | th_width =   32
Testing 3 devices

Backend 1/3: Metal
  Device description: Apple M3 Max
  Device memory: 98304 MB (98303 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 717399.00 us/run - 137.42 GFLOP/run - 191.56 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1496 runs -   873.27 us/run - 133.69 MFLOP/run - 153.10 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -   887.64 us/run - 135.78 MFLOP/run - 152.97 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.32 us/run - 642.82 kFLOP/run -  33.27 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   202.87 us/run -  20.90 MFLOP/run - 103.01 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   106.52 us/run -   2.78 MFLOP/run -  26.14 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1068.67 us/run -  22.28 MFLOP/run -  20.85 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1734 runs -   724.21 us/run - 115.40 MFLOP/run - 159.35 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 218 runs -  5673.44 us/run - 923.24 MFLOP/run - 162.73 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_conv_2d_f16_f32', name = 'kernel_conv_2d_f16_f32'
ggml_metal_library_compile_pipeline: loaded kernel_conv_2d_f16_f32                        0x10ea082e0 | th_max = 1024 | th_width =   32
                     110 runs - 10713.83 us/run -   1.85 GFLOP/run - 172.57 GFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 712804.00 us/run - 137.42 GFLOP/run - 192.79 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1496 runs -   873.38 us/run - 133.69 MFLOP/run - 153.08 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -   888.16 us/run - 135.78 MFLOP/run - 152.88 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.24 us/run - 642.82 kFLOP/run -  33.42 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   203.09 us/run -  20.90 MFLOP/run - 102.89 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   106.55 us/run -   2.78 MFLOP/run -  26.14 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1069.74 us/run -  22.28 MFLOP/run -  20.83 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1734 runs -   723.38 us/run - 115.40 MFLOP/run - 159.54 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 218 runs -  5668.27 us/run - 923.24 MFLOP/run - 162.88 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): ggml_metal_free: deallocating
                     110 runs - 10682.93 us/run -   1.85 GFLOP/run - 173.07 GFLOPS
  Backend Metal: OK
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
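As a sanity check on the log above, the GFLOPS column is simply the FLOP/run figure divided by the wall time per run. A quick sketch (the helper name is hypothetical), using the first Metal CONV_2D line as a worked example:

```python
# Sketch: recomputing a reported GFLOPS value from us/run and FLOP/run.

def gflops(flop_per_run, us_per_run):
    """Throughput in GFLOPS: work per run over seconds per run."""
    return flop_per_run / (us_per_run * 1e-6) / 1e9

# First Metal CONV_2D line: 717399.00 us/run at 137.42 GFLOP/run
print(round(gflops(137.42e9, 717399.00), 2))
# ≈ 191.55 GFLOPS (the log's 191.56 uses the unrounded FLOP count)
```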
