Architecture

Mixture of Experts (MoE)

Scaling model capacity without exploding compute costs: the architecture used by Mixtral and reportedly by GPT-4.

The Scaling Dilemma

Traditionally, making a model "smarter" meant making it bigger (a dense model). But in a dense model every parameter participates in every forward pass, so training and inference compute grow roughly in proportion to parameter count. Mixture of Experts (MoE) sidesteps this by decoupling parameter count from per-token compute cost.

How MoE Works

An MoE model replaces the standard feed-forward (FFN) layers in each Transformer block with a "Sparse MoE" layer. This layer contains:

  • Experts: A set of separate feed-forward networks (e.g., 8 different FFNs). During training, each expert can come to specialize in different kinds of tokens or patterns.
  • Gate (Router): A small learned network that decides, for each token, which experts to activate.

For example, in Mixtral 8x7B there are 8 experts, but for every token the router selects only the top 2. As a result, the model stores ~47B parameters while activating only ~13B per token at inference.
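To make the routing concrete, here is a minimal, self-contained PyTorch sketch of a sparse MoE layer with a top-k gate. The dimensions, class name, and the naive per-expert loop are illustrative, not the Mixtral implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: a linear router picks the top-k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)   # the gate
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                       # x: (batch, seq, d_model)
        logits = self.router(x)                 # (batch, seq, num_experts)
        weights, indices = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts only
        out = torch.zeros_like(x)
        # Naive dispatch: loop over experts, process only the tokens routed to each.
        for e, expert in enumerate(self.experts):
            mask = (indices == e)               # (batch, seq, top_k)
            if mask.any():
                token_mask = mask.any(dim=-1)                       # tokens using expert e
                gate = (weights * mask).sum(dim=-1)[token_mask].unsqueeze(-1)
                out[token_mask] += gate * expert(x[token_mask])
        return out

moe = SparseMoE()
y = moe(torch.randn(2, 16, 512))                # only 2 of 8 experts run per token
print(y.shape)                                  # torch.Size([2, 16, 512])

In Mixtral the experts are SwiGLU FFNs and production code batches tokens per expert instead of looping, but the routing logic is the same top-k selection shown here.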

Sparse vs. Dense

Feature         | Dense Model (e.g., Llama 3)             | Sparse MoE (e.g., Mixtral)
Activation      | All neurons activated for every token   | Only a subset of experts activated
Training Cost   | High                                    | Lower for the same total capacity
Inference Speed | Slower (every weight used per token)    | Faster (only active weights used)
VRAM Usage      | High                                    | High (all experts must be loaded)
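The 47B-total / 13B-active split can be reproduced from Mixtral's published configuration with a few lines of arithmetic. The script below is an approximation (norms and biases ignored); the config values are taken from the public model configuration.

# Approximate parameter accounting for Mixtral 8x7B
# (hidden_size=4096, intermediate_size=14336, 32 layers, 8 experts, top-2 routing,
#  grouped-query attention with 8 KV heads of dim 128, vocab 32000).
d, d_ff, layers, experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000

ffn_per_expert = 3 * d * d_ff                       # gate, up, down projections (SwiGLU)
attn_per_layer = d * d + 2 * d * (8 * 128) + d * d  # Q, K, V (GQA), O projections
shared = layers * attn_per_layer + 2 * vocab * d    # attention + embeddings/LM head

total  = shared + layers * experts * ffn_per_expert
active = shared + layers * top_k  * ffn_per_expert

print(f"total  ~ {total / 1e9:.1f}B parameters")    # ~46.7B (must sit in VRAM)
print(f"active ~ {active / 1e9:.1f}B per token")    # ~12.9B (used per forward pass)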

Implementation: Loading Mixtral

Loading Mixtral in half precision (float16) with the transformers library:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Mixtral 8x7B Instruct
model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load in float16 with Flash Attention 2 (requires the flash-attn package)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Explain quantum entanglement briefly."}]

# Apply the Mixtral chat template and move the prompt to the GPU
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")

output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))

Challenges

Memory Bandwidth

Although per-token compute is low, all experts must be resident in VRAM, so memory requirements stay high: roughly 90 GB in float16, and still around 24 GB+ even with 4-bit quantization. Streaming all those expert weights also makes memory bandwidth a bottleneck.
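One common way to fit Mixtral onto a 24 GB-class GPU is 4-bit quantization via bitsandbytes. A minimal sketch, assuming the bitsandbytes package is installed; the exact footprint depends on context length and runtime overhead.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization: weights are stored in 4 bits, compute runs in float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)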

Training Instability

Balancing load across experts is hard: if the router keeps favoring a few experts, the others receive little gradient signal and never learn ("expert collapse").
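The standard mitigation is an auxiliary load-balancing loss added to the training objective, which rewards uniform expert usage. Below is a minimal sketch in the spirit of the Switch Transformer formulation; the function name and top-2 setup are illustrative, not the exact Mixtral loss.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, indices, num_experts=8):
    """Auxiliary loss that is minimized when expert usage is uniform.
    router_logits: (tokens, num_experts); indices: (tokens, top_k) chosen experts."""
    probs = F.softmax(router_logits, dim=-1)                        # router probabilities
    # f_i: fraction of tokens dispatched to expert i (hard assignments)
    dispatch = F.one_hot(indices, num_experts).float().amax(dim=1)  # (tokens, experts)
    f = dispatch.mean(dim=0)
    # P_i: mean router probability assigned to expert i
    P = probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

logits = torch.randn(1024, 8)                     # router logits for 1024 tokens
topk = torch.topk(logits, k=2, dim=-1).indices
aux = load_balancing_loss(logits, topk)           # scale by a small alpha and add to the LM loss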