In the MNIST training example, I've noticed that training with the AVX backend is much slower than in @beru's original PR. In my environment (Core i7-6700K, without TBB or OpenCL):

- v1.0.0alpha: 31.2 sec/epoch (very close to the tiny-cnn backend)
- original (71f8648): 9.2 sec/epoch