- My understanding is that on NVIDIA GPUs, thread blocks are scheduled onto SMs (CUs) using a round-robin policy, so a kernel's blocks should be interleaved across all CUs rather than confined to a few CUs, as the figure shows.
- Regarding the “dispatch delay” described in the first case in the figure: why can't blocks simply wait for idle CUs?
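
To make the first point concrete, here is a minimal sketch of the placement I have in mind. It assumes a strict round-robin policy, which is my own simplification and not documented hardware behavior; the function name `round_robin_placement` and the block and SM counts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def round_robin_placement(num_blocks, num_sms):
    """Map SM index -> list of block indices under an assumed strict round-robin policy."""
    placement = defaultdict(list)
    for block in range(num_blocks):
        # Assumption: block i goes to SM (i mod num_sms); real schedulers may differ.
        placement[block % num_sms].append(block)
    return dict(placement)


if __name__ == "__main__":
    # Hypothetical sizes: 8 blocks spread over 4 SMs.
    for sm, blocks in sorted(round_robin_placement(8, 4).items()):
        print(f"SM {sm}: blocks {blocks}")
```

Under this assumed model every SM receives blocks from the kernel, which is why the concentration on only a few CUs in the figure surprised me.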

I would be grateful if you could reply!