ld_triton

triton ops
- supported ops
  - attention
  - flash_attention (v1, v2)
  - conv2d
  - convolution
  - embedding
  - flip
  - linear
  - matmul
  - max
  - mse
  - rmsnorm
  - rope
  - sigmoid
  - softmax
  - sparsecon2d
  - sparsecon3d
  - submcon2d
  - submcon3d
  - rkl_divergence
  - group_rkl_divergence

distributed
- data_parallel
- pipeline_parallel
- PyTorch Symmetric Memory

models
- qwen