It has come to my attention that the current release replaces the original Level 1 BLAS call in ComputeUpdate of SGD with a new kernel function.
I don't think this is optimal in terms of performance. It adds another layer of software complexity and, as far as I can tell, brings no performance benefit at all.
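To make the comparison concrete, here is a minimal CPU-side sketch of the two variants. It assumes the update in question is the usual momentum step (history = lr * diff + momentum * history, then diff = history). All names (`axpby`, `sgd_update_blas`, `sgd_update_fused`) are hypothetical stand-ins written for illustration, not the actual library code, and plain loops take the place of the real BLAS routine and GPU kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a Level 1 BLAS-style axpby: y = alpha * x + beta * y.
void axpby(float alpha, const std::vector<float>& x,
           float beta, std::vector<float>& y) {
  assert(x.size() == y.size());
  for (std::size_t i = 0; i < y.size(); ++i) {
    y[i] = alpha * x[i] + beta * y[i];
  }
}

// Variant A (assumed original form): one BLAS-style call, then a copy.
void sgd_update_blas(std::vector<float>& diff, std::vector<float>& history,
                     float lr, float momentum) {
  axpby(lr, diff, momentum, history);  // history = lr * diff + momentum * history
  diff = history;                      // second pass over the data
}

// Variant B (assumed new form): a dedicated loop doing both steps in one pass.
void sgd_update_fused(std::vector<float>& diff, std::vector<float>& history,
                      float lr, float momentum) {
  assert(diff.size() == history.size());
  for (std::size_t i = 0; i < diff.size(); ++i) {
    history[i] = lr * diff[i] + momentum * history[i];
    diff[i] = history[i];
  }
}
```

Both variants compute the same values, so the only open question is speed. The sketch does not settle that; it would need a benchmark of the real BLAS call against the real kernel on the target hardware.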