There are various ways of doing so; a straightforward one would be to parallelize some of the computations with OpenMP.
There are also ways of optimizing the execution of some gates.
Qiskit Aer does this: it has a tensor class oriented towards qubit computations (internally it holds matrices, and the execution of many operations is optimized). This implementation is more generic and uses Eigen tensors, with suboptimal copying of data back and forth between matrices and tensors.
I probably won't go down that path unless really necessary.
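To give an idea of what such an optimization could look like, here is a minimal sketch that combines both ideas: a single-qubit gate is applied directly to pairs of amplitudes in a flat state vector (no full-size operator matrix, no tensor copies), and the loop over pairs is parallelized with OpenMP. This is only an illustration under my own assumptions, not how this project or Qiskit Aer actually does it: the function name `applySingleQubitGate`, the flat `std::vector` layout and the convention that qubit 0 is the least significant bit are all hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using Complex = std::complex<double>;

// Hypothetical helper (not part of this project or of Qiskit Aer).
// Applies a 2x2 gate, given row-major as g[0] g[1] / g[2] g[3], to qubit
// `target` of a state vector with 2^n amplitudes, in place.
// Assumed convention: qubit 0 is the least significant bit of the index.
void applySingleQubitGate(std::vector<Complex>& state,
                          const Complex (&g)[4],
                          unsigned target)
{
    const std::int64_t dim = static_cast<std::int64_t>(state.size());
    const std::int64_t stride = std::int64_t{1} << target;

    assert(dim >= 2 && (dim & (dim - 1)) == 0); // size must be a power of two
    assert(stride < dim);                       // target must be a valid qubit

    // Each iteration updates its own pair of amplitudes, so iterations are
    // independent. Without OpenMP enabled the pragma is simply ignored.
    #pragma omp parallel for
    for (std::int64_t k = 0; k < dim / 2; ++k) {
        // Insert a 0 bit at position `target` into k: this gives the index
        // of the amplitude where the target qubit is 0.
        const std::int64_t low = k & (stride - 1);
        const std::int64_t i0 = ((k - low) << 1) | low;
        const std::int64_t i1 = i0 | stride; // same index, target qubit is 1

        const Complex a0 = state[i0];
        const Complex a1 = state[i1];
        state[i0] = g[0] * a0 + g[1] * a1;
        state[i1] = g[2] * a0 + g[3] * a1;
    }
}
```

For example, applying an X gate to qubit 1 of a two-qubit register in state |00> moves the whole amplitude from index 0 to index 2. For small registers the threading overhead likely outweighs the gain, so something like this would only pay off for larger numbers of qubits.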