Abstract: The massive parameter scale of sparsely activated Mixture-of-Experts (MoE) models necessitates distributed training with hybrid parallelism. Placing such training tasks, i.e., mapping the ...