ExpertFlow: Enabling Low-Latency Asynchronous Inference for Mixture of Expert Models


The exponential growth in the size of deep learning models has led to a burgeoning interest in the sparsely activated Mixture of Expert (MoE) model architecture. MoE uses conditional computation to scale the model size with sub-linear growth in the corresponding number of computations (FLOPs) needed to train it. As AI computation enters the exascale computing era, the MoE layer is becoming a key component of deep neural networks (DNNs), and several research teams have dedicated significant resources to building efficient MoE training systems. Recently, many MoE models have been released, including Switch-C, the first open-source trillion-parameter model in the world. Although much attention has been given to MoE training, there has been less research on inference. In this thesis, we present ExpertFlow, a low-latency system for efficient inference of MoE models. The framework supports multi-GPU and multi-node distributed inference and is implemented with fully asynchronous task scheduling. We compare ExpertFlow to NVIDIA's FasterTransformer framework in our experiments and find that it achieves up to 1.95x higher throughput and up to 13% lower latency on average. Our goal is to make the MoE architecture more accessible, and we have open-sourced all of our code at
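As a rough illustration of the conditional computation the abstract describes, the sketch below shows a single MoE layer routing one token to its top-k experts. All sizes and weights here are made-up placeholders, not values from ExpertFlow; the point is only that just `top_k` of the `num_experts` feed-forward blocks execute per token, which is why parameter count can grow without a proportional growth in FLOPs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen purely for illustration.
num_experts, d_model, d_ff, top_k = 8, 16, 32, 2

# Each expert is a small 2-layer MLP; a gating network picks which ones run.
W_in = rng.standard_normal((num_experts, d_model, d_ff)) * 0.02
W_out = rng.standard_normal((num_experts, d_ff, d_model)) * 0.02
W_gate = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x):
    """Route one token x of shape (d_model,) to its top_k experts."""
    logits = x @ W_gate                        # gating scores, (num_experts,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    chosen = np.argsort(probs)[-top_k:]        # indices of the top_k experts
    weights = probs[chosen] / probs[chosen].sum()
    out = np.zeros_like(x)
    for w, e in zip(weights, chosen):
        h = np.maximum(x @ W_in[e], 0.0)       # only chosen experts compute
        out += w * (h @ W_out[e])              # gate-weighted mixture
    return out

y = moe_layer(rng.standard_normal(d_model))
```

Because the loop touches only `top_k` experts, adding more experts grows the parameter count while per-token compute stays fixed, the sub-linear scaling property the abstract refers to.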

Tsinghua University Master Thesis
Gabriele Oliaro