Mixture of Experts (MoE)
Layer specification

class deepspeed.moe.layer.MoE(hidden_size, expert, num_experts=1, k=1, capacity_factor=1.0, eval_capacity_factor=1.0, min_capacity=4, noisy_gate_policy: Optional[str] = None)

Initialize an MoE layer.
Parameters:
- hidden_size (int) – the hidden dimension of the model; importantly, this is also the input and output dimension.
- expert (torch.nn.Module) – the torch module that defines the expert (e.g., an MLP or torch.nn.Linear).
- num_experts (int, optional) – default=1, the total number of experts per layer.
- k (int, optional) – default=1, top-k gating value; only k=1 and k=2 are supported.
- capacity_factor (float, optional) – default=1.0, the capacity of each expert at training time.
- eval_capacity_factor (float, optional) – default=1.0, the capacity of each expert at eval time.
- min_capacity (int, optional) – default=4, the minimum capacity per expert regardless of capacity_factor.
- noisy_gate_policy (str, optional) – default=None, noisy gate policy; valid options are 'Jitter', 'RSample', or 'None'.
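To make the interplay of capacity_factor and min_capacity concrete, the sketch below computes an expert's token capacity. The exact formula is an assumption modeled on common MoE implementations (a GShard-style ceil of the per-expert token share, floored by min_capacity), not lifted verbatim from the DeepSpeed source:

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, min_capacity=4):
    """Illustrative capacity calculation: each expert accepts at most
    `capacity` tokens per batch, and tokens routed beyond that are dropped.
    Hypothetical helper; the formula is an assumption, not DeepSpeed's code."""
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)

# With 1024 tokens routed across 8 experts at capacity_factor=1.0,
# each expert accepts up to 128 tokens:
print(expert_capacity(1024, 8))                 # 128
# min_capacity acts as a floor when the per-expert share is very small:
print(expert_capacity(16, 8, min_capacity=4))   # 4
```

Raising capacity_factor above 1.0 trades memory for fewer dropped tokens; eval_capacity_factor plays the same role at evaluation time.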

forward(hidden_states, used_token=None)

MoE forward.

Parameters:
- hidden_states (Tensor) – input to the layer.
- used_token (Tensor, optional) – default=None, mask to select only the used tokens.

Returns: a tuple of (output, l_aux, exp_counts).
- output (Tensor): output of the model.
- l_aux (Tensor): gate loss value.
- exp_counts (int): expert count.
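A minimal sketch of wiring this layer into a model is below. The MoE constructor and the (output, l_aux, exp_counts) return tuple come from the specification above; everything else (the MLP expert, dimensions, how l_aux is used) is illustrative. Because the layer requires deepspeed and an initialized distributed environment, the sketch is wrapped in a function rather than run at import time:

```python
def build_and_run_moe(hidden_size=512, num_experts=8, batch_tokens=16):
    """Hedged usage sketch; requires deepspeed and an initialized
    distributed/process-group environment to actually execute."""
    import torch
    import deepspeed

    # Any torch.nn.Module can serve as the expert; a small MLP is typical.
    expert = torch.nn.Sequential(
        torch.nn.Linear(hidden_size, 4 * hidden_size),
        torch.nn.ReLU(),
        torch.nn.Linear(4 * hidden_size, hidden_size),
    )
    moe = deepspeed.moe.layer.MoE(
        hidden_size=hidden_size,
        expert=expert,
        num_experts=num_experts,
        k=1,
        capacity_factor=1.0,
        min_capacity=4,
    )
    hidden_states = torch.randn(batch_tokens, hidden_size)
    # forward returns (output, l_aux, exp_counts); l_aux is the gate
    # load-balancing loss, typically added (scaled) to the training loss.
    output, l_aux, exp_counts = moe(hidden_states)
    return output, l_aux, exp_counts
```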
Groups initialization

deepspeed.utils.groups.initialize(ep_size=1, mpu=None)

Process group initialization supporting expert (E), data (D), and model (M) parallelism. DeepSpeed considers the following scenarios with respect to process group creation.
S1: There is no expert parallelism or model parallelism, only data parallelism (D):

    model = my_model(args)
    engine = deepspeed.initialize(model)  # initialize groups without mpu

S2: There is expert parallelism but no model parallelism (E+D):

    deepspeed.utils.groups.initialize(ep_size)  # groups will be initialized here
    model = my_model(args)
    engine = deepspeed.initialize(model)

S3: There is model parallelism but no expert parallelism (M):

    mpu.init()  # client initializes its model parallel unit
    model = my_model(args)
    engine = deepspeed.initialize(model, mpu=mpu)  # init with mpu, but ep_size = dp_world_size

S4: There is model, data, and expert parallelism (E+D+M):

    mpu.init()  # client initializes its model parallel unit
    deepspeed.utils.groups.initialize(ep_size, mpu)  # initialize expert groups wrt mpu
    model = my_model(args)
    engine = deepspeed.initialize(model, mpu=mpu)  # passing mpu is optional in this case
Parameters:
- ep_size (int, optional) – default=1, expert parallel size.
- mpu (module, optional) – default=None, model parallel unit (e.g., from Megatron) that describes model/data parallel ranks.
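To visualize what ep_size does in the E+D case (S2), the sketch below partitions data-parallel ranks into expert-parallel groups of ep_size consecutive ranks. The contiguous-rank layout is an assumption for illustration only; the actual grouping logic lives in deepspeed.utils.groups:

```python
def expert_parallel_groups(world_size, ep_size):
    """Illustrative partition of `world_size` data-parallel ranks into
    expert-parallel groups of `ep_size` consecutive ranks each.
    Hypothetical helper; assumes ep_size evenly divides world_size."""
    assert world_size % ep_size == 0, "ep_size must divide world_size"
    return [list(range(start, start + ep_size))
            for start in range(0, world_size, ep_size)]

# 8 data-parallel ranks with ep_size=4 form two expert-parallel groups;
# experts are sharded within a group and replicated across groups:
print(expert_parallel_groups(8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```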