Replacement for THCudaBlas_SgemmBatched

## ❓ Questions and Help

What is the correct way to perform a batched add and batched matrix multiply using CUDA in the C++ API?

I found `at::baddbmm`, but I could not find its source to verify whether it uses CUDA when available or runs only on the CPU.
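
For reference, the operation in question computes, for each batch element, `beta * input + alpha * (batch1 @ batch2)`. The sketch below spells out that arithmetic in plain C++ on the CPU. `baddbmm_reference` is a hypothetical helper written for illustration, not ATen's implementation, and it assumes contiguous row-major buffers; it says nothing about how `at::baddbmm` dispatches to CUDA.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference helper (NOT part of ATen), assuming contiguous
// row-major storage:
//   input, result: [batch, n, p]
//   a:             [batch, n, m]
//   b:             [batch, m, p]
// Computes result = beta * input + alpha * (a @ b) for each batch element.
std::vector<float> baddbmm_reference(const std::vector<float>& input,
                                     const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     std::size_t batch, std::size_t n,
                                     std::size_t m, std::size_t p,
                                     float beta = 1.0f, float alpha = 1.0f) {
    // Check that each buffer matches its declared shape.
    assert(input.size() == batch * n * p);
    assert(a.size() == batch * n * m);
    assert(b.size() == batch * m * p);

    std::vector<float> out(batch * n * p);
    for (std::size_t t = 0; t < batch; ++t) {
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < p; ++j) {
                // Dot product of row i of a[t] with column j of b[t].
                float acc = 0.0f;
                for (std::size_t k = 0; k < m; ++k) {
                    acc += a[(t * n + i) * m + k] * b[(t * m + k) * p + j];
                }
                const std::size_t idx = (t * n + i) * p + j;
                out[idx] = beta * input[idx] + alpha * acc;
            }
        }
    }
    return out;
}
```

A helper like this could serve as a CPU baseline to compare against whatever the batched CUDA path returns.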

Thank you for any guidance.