SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

SonicMoE is a simple but blazing-fast Mixture-of-Experts (MoE) implementation optimized for NVIDIA Hopper and Blackwell architecture GPUs. It mainly leverages CuTeDSL and Triton to deliver state-of-the-art performance through IO-aware optimizations. The two figures below give an overview of activation memory usage and training throughput on Hopper (H100) and Blackwell (B300) GPUs. The current version of SonicMoE builds on the Grouped GEMM kernels from the QuACK library, which is itself built on CUTLASS.

[Figures: Activation Memory and Training Throughput]

News

  • 04/19/2026: We released SonicMoE with Blackwell (SM100) support, built on QuACK's Grouped GEMM kernels.

📦 Installation

Prerequisites

  • NVIDIA Hopper GPUs (H100, H200, etc.) or Blackwell GPUs (GB200, B200, B300, etc.)
  • CUDA 12.9+ (13.0+ for B300 GPUs)
  • Python 3.12+ recommended
  • PyTorch 2.7+ (2.9.1 recommended)

B300 users: please manually upgrade Triton to 3.6.0 after installing PyTorch.
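
To sanity-check your environment before installing, a short script along these lines confirms the GPU architecture and toolchain versions (a minimal sketch using standard PyTorch APIs; it is not part of SonicMoE):

import torch

# Hopper is sm_90; Blackwell is sm_100 and above
major, minor = torch.cuda.get_device_capability()
print(f"PyTorch {torch.__version__}, CUDA {torch.version.cuda}")
print(f"GPU: {torch.cuda.get_device_name()} (sm_{major}{minor})")
assert (major, minor) >= (9, 0), "SonicMoE targets Hopper (sm_90) or Blackwell (sm_100+) GPUs"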

Install from pip

pip install sonic-moe

Install from Source

# Clone the repository
git clone https://github.com/Dao-AILab/sonic-moe.git
cd sonic-moe

# Install dependencies
pip install -r requirements.txt

# Install SonicMoE
pip install -e .
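
After either installation method, a quick smoke test confirms the package imports correctly (the module name is sonicmoe, as used in the Quick Start below):

# Verify the install
python -c "import sonicmoe; print('SonicMoE import OK')"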

🎯 Quick Start

Basic Usage

import torch
from sonicmoe import MoE, KernelBackendMoE
from sonicmoe.enums import ActivationType

# Create MoE layer
moe = MoE(
    num_experts=128,                           # Number of experts
    num_experts_per_tok=8,                     # Top-k experts per token
    hidden_size=4096,                          # Hidden dimension
    intermediate_size=1536,                    # Expert intermediate size
    activation_function=ActivationType.SWIGLU, # SwiGLU activation
    add_bias=False,                            # Add bias to linear layers
    std=0.02,                                  # Weight initialization std
).to(device="cuda", dtype=torch.bfloat16)

# Forward pass
x = torch.randn(32768, 4096, device="cuda", dtype=torch.bfloat16)
output, aux_loss = moe(x, kernel_backend_moe=KernelBackendMoE.sonicmoe)
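
In a training loop, the auxiliary load-balancing loss returned by the layer is typically scaled by a small coefficient and added to the task loss. A minimal sketch, assuming aux_loss comes back as a scalar tensor; the 0.01 coefficient and the random regression target are illustrative assumptions, not SonicMoE defaults:

# Illustrative training step; the aux-loss coefficient is an assumption
target = torch.randn_like(output)
loss = torch.nn.functional.mse_loss(output, target)
if aux_loss is not None:  # guard in case no auxiliary loss is returned
    loss = loss + 0.01 * aux_loss
loss.backward()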

🧪 Testing

Run the test suite to verify correctness:

make test

Example usage

  • SonicMoE with TC top-K routing (softmax-over-topk, i.e. softmax(topk(logits))) and the interleaved weight layout for the up-projection weights; a routing sketch follows this list

    python benchmarks/moe-cute.py --thiek 32768,4096,1024,128,8 --activation swiglu
    
  • SonicMoE with Qwen3-style routing (topk-over-softmax, i.e. topk(softmax(logits))), top-k probability renormalization, and the interleaved weight layout for the up-projection weights

    python benchmarks/moe-cute.py --thiek 32768,4096,1024,128,8 --topk_over_softmax --norm_topk_probs
    
  • SonicMoE with token-rounding routing (SwiGLU activation) and the interleaved weight layout for the up-projection weights

    python benchmarks/moe-token-rounding.py --routing nr --thiekq 16384,4096,1024,256,8,128
    
  • SonicMoE with the concatenated weight layout for the up-projection weights

    By default, SonicMoE expects w1 (the gated up-projection weights) in interleaved format: [gate_0, up_0, gate_1, up_1, ...]. HuggingFace models (Qwen3, Mixtral, DeepSeek, etc.) store gate_up_proj in concatenated format: [gate_0, gate_1, ..., gate_{I-1}, up_0, up_1, ..., up_{I-1}] (a conversion sketch follows this list).

    # Concatenated weight layout format with TC top-K routing
    python benchmarks/moe-cute.py --thiek 32768,4096,1024,128,8 --concat_layout
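
For loading HuggingFace checkpoints without --concat_layout, the concatenated weights can be reordered into the interleaved layout once at load time. A hedged sketch, assuming w1 is stored as [num_experts, 2 * intermediate_size, hidden_size]; adjust if your checkpoint uses a different convention:

import torch

def concat_to_interleaved(w1: torch.Tensor, intermediate_size: int) -> torch.Tensor:
    # Reorder rows [gate_0, ..., gate_{I-1}, up_0, ..., up_{I-1}]
    # into [gate_0, up_0, gate_1, up_1, ...] for each expert.
    E, two_I, H = w1.shape
    gate, up = w1[:, :intermediate_size], w1[:, intermediate_size:]
    return torch.stack((gate, up), dim=2).reshape(E, two_I, H)

For reference, the two routing variants above differ only in where the top-k selection happens relative to the softmax. A plain-PyTorch sketch of the math (illustrative only; SonicMoE's fused kernels compute this far more efficiently):

import torch

logits = torch.randn(32768, 128)  # [num_tokens, num_experts] router logits
k = 8

# TC top-K routing: softmax(topk(logits))
topk_logits, topk_idx = logits.topk(k, dim=-1)
tc_probs = topk_logits.softmax(dim=-1)

# Qwen3-style routing: topk(softmax(logits)), optionally renormalized
# so the selected probabilities sum to 1 (--norm_topk_probs)
all_probs = logits.softmax(dim=-1)
q3_probs, q3_idx = all_probs.topk(k, dim=-1)
q3_probs = q3_probs / q3_probs.sum(dim=-1, keepdim=True)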
    

🤝 Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

📄 License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

📚 Citation

If you use SonicMoE in your research, please cite:

@misc{guo2025sonicmoeacceleratingmoeio,
      title={SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations}, 
      author={Wentao Guo and Mayank Mishra and Xinle Cheng and Ion Stoica and Tri Dao},
      year={2025},
      eprint={2512.14080},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2512.14080}, 
}
