Description
Currently, our implementation of the `matmul` ufunc is able to pass the appropriate transpose flags to BLAS to handle transposed contiguous arrays.
For `A`, `B`, and `C` as contiguous 2D arrays, the inner loop is intelligent enough to map `np.matmul(B.T, A.T, out=C.T)` to `np.matmul(A, B, out=C)`:
numpy/numpy/core/src/umath/matmul.c.src
Lines 476 to 491 in 59a9752
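The equivalence the inner loop relies on can be checked from Python. This is only a sketch of the identity `(A @ B).T == B.T @ A.T` using arbitrary example shapes; it does not show which BLAS call or transpose flags are actually chosen internally.

```python
import numpy as np

# Arbitrary example shapes; A, B, and C are C-contiguous 2D arrays.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = np.empty((3, 5))

# C.T is an F-ordered view onto C's memory, so writing B.T @ A.T into it
# fills C with A @ B.
np.matmul(B.T, A.T, out=C.T)

expected = np.empty((3, 5))
np.matmul(A, B, out=expected)

print(np.allclose(C, expected))
```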
However, when the `out` argument is omitted, the ufunc machinery pre-allocates `out` with "C" memory ordering, which is not the "F" ordering that `C.T` has. Ideally, we'd be able to allocate our array such that we can make `o_c_blasable` or `o_f_blasable` true as necessary.
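A rough way to observe this from Python is to inspect the flags of the returned array, and to compare against passing an explicitly F-ordered buffer as a manual workaround. The shapes and variable names below are illustrative assumptions, and the flags printed for the automatically allocated result may depend on the NumPy version.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# No out= given: the ufunc machinery allocates the (5, 3) result itself.
auto = np.matmul(B.T, A.T)
print("auto c_contiguous:", auto.flags.c_contiguous)
print("auto f_contiguous:", auto.flags.f_contiguous)

# Manual workaround: allocate an F-ordered output, matching the layout
# that C.T would have, and pass it explicitly.
out_f = np.empty((5, 3), order="F")
np.matmul(B.T, A.T, out=out_f)
print("out_f f_contiguous:", out_f.flags.f_contiguous)

# Both paths compute the same values; only the memory layout differs.
print(np.allclose(auto, out_f))
```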
As part of @seberg's ufunc work, it would be great if ufuncs could be involved in the output allocation machinery.