What's the optimal parallel strategy using TensorRT-LLM?

Thanks for your great efforts first. I read the PR you opened in the TensorRT-LLM repo and noticed that EP +TP, PP + TP, and TP are supported during inference. May I ask which one is optimal? Specifically, as for the MoE layer,  does EP or TP yield better performance?