
Conversation

@bghira (Contributor) commented Nov 11, 2025

| Shape | Metal (GFLOPS) | CPU (GFLOPS) |
| --- | --- | --- |
| 19x19, Cin=256, Cout=4096, fp32 | 191.6 | 17.1 |
| 224x224, Cin=3, Cout=8, fp32 | 103.0 | 1.5 |
| 58x58, Cin=32, Cout=64, fp32 | 159.3 | 7.0 |
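For reference, the GFLOP/run figures in the perf log below can be reproduced from the tensor shapes alone. This is a minimal sketch, not ggml's actual code: it assumes the common accounting of `2*Kw*Kh*Cin - 1` FLOPs per output element (one multiply per MAC plus one fewer add in the reduction), which matches the logged numbers; the function name is hypothetical.

```python
# Sketch: reproducing the GFLOP/run figures from the shapes in the perf
# log. Assumes 2*Kw*Kh*Cin - 1 FLOPs per output element.

def conv2d_flops(w, h, cin, n, kw, kh, cout, stride=1, pad=0, dil=1):
    """FLOPs for one CONV_2D with ggml-style [W, H, C, N] shapes."""
    wout = (w + 2 * pad - dil * (kw - 1) - 1) // stride + 1
    hout = (h + 2 * pad - dil * (kh - 1) - 1) // stride + 1
    return (2 * kw * kh * cin - 1) * wout * hout * cout * n

# First row: 19x19 input, Cin=256, 4x4 kernel, Cout=4096, batch 16
print(conv2d_flops(19, 19, 256, 16, 4, 4, 4096) / 1e9)  # ≈ 137.42 GFLOP
# 224x224 input, Cin=3, 3x3 kernel, Cout=8, batch 1
print(conv2d_flops(224, 224, 3, 1, 3, 3, 8) / 1e6)      # ≈ 20.90 MFLOP
```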

@bghira (Contributor, Author) commented Nov 11, 2025

$ ./build/bin/test-backend-ops perf -b Metal -o CONV_2D
ggml_metal_device_init: tensor API disabled for pre-M5 device
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.007 sec
ggml_metal_device_init: GPU name:   Apple M3 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 103079.22 MB
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Max)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Max)
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_conv_2d_f32_f32', name = 'kernel_conv_2d_f32_f32'
ggml_metal_library_compile_pipeline: loaded kernel_conv_2d_f32_f32                        0x142e19030 | th_max = 1024 | th_width =   32
Testing 3 devices

Backend 1/3: Metal
  Device description: Apple M3 Max
  Device memory: 98304 MB (98303 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 717399.00 us/run - 137.42 GFLOP/run - 191.56 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1496 runs -   873.27 us/run - 133.69 MFLOP/run - 153.10 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -   887.64 us/run - 135.78 MFLOP/run - 152.97 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.32 us/run - 642.82 kFLOP/run -  33.27 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   202.87 us/run -  20.90 MFLOP/run - 103.01 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   106.52 us/run -   2.78 MFLOP/run -  26.14 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1068.67 us/run -  22.28 MFLOP/run -  20.85 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1734 runs -   724.21 us/run - 115.40 MFLOP/run - 159.35 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 218 runs -  5673.44 us/run - 923.24 MFLOP/run - 162.73 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_conv_2d_f16_f32', name = 'kernel_conv_2d_f16_f32'
ggml_metal_library_compile_pipeline: loaded kernel_conv_2d_f16_f32                        0x10ea082e0 | th_max = 1024 | th_width =   32
                     110 runs - 10713.83 us/run -   1.85 GFLOP/run - 172.57 GFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 712804.00 us/run - 137.42 GFLOP/run - 192.79 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1496 runs -   873.38 us/run - 133.69 MFLOP/run - 153.08 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -   888.16 us/run - 135.78 MFLOP/run - 152.88 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.24 us/run - 642.82 kFLOP/run -  33.42 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   203.09 us/run -  20.90 MFLOP/run - 102.89 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   106.55 us/run -   2.78 MFLOP/run -  26.14 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1069.74 us/run -  22.28 MFLOP/run -  20.83 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1734 runs -   723.38 us/run - 115.40 MFLOP/run - 159.54 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 218 runs -  5668.27 us/run - 923.24 MFLOP/run - 162.88 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0): ggml_metal_free: deallocating
                     110 runs - 10682.93 us/run -   1.85 GFLOP/run - 173.07 GFLOPS
  Backend Metal: OK
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
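As a sanity check on the log above, the GFLOPS column is simply the FLOP/run figure divided by the wall time per run. A quick sketch (the helper name is hypothetical), using the first Metal CONV_2D line as a worked example:

```python
# Sketch: recomputing a reported GFLOPS value from us/run and FLOP/run.

def gflops(flop_per_run, us_per_run):
    """Throughput in GFLOPS: work per run over seconds per run."""
    return flop_per_run / (us_per_run * 1e-6) / 1e9

# First Metal CONV_2D line: 717399.00 us/run at 137.42 GFLOP/run
print(round(gflops(137.42e9, 717399.00), 2))
# ≈ 191.55 GFLOPS (the log's 191.56 uses the unrounded FLOP count)
```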
