
Expert-Aware Multi-Batch Pipeline for MoE + Speculative Decoding inference optimization (CPU-PCIe-GPU).

Project description

Torrent — Expert-Aware Multi-Batch MoE Pipeline

Torrent is a standalone Python package implementing MoE inference acceleration for CPU-PCIe-GPU offloading scenarios. The core algorithm comes from the paper "Klotski: An Expert-Aware Multi-Batch Pipeline for MoE LLM Inference" (arXiv:2502.06888); the integration target is Qwen3-30B-A3B + EAGLE-3 speculative decoding.


Core Innovation

Expert-Aware Multi-Batch Pipeline (n-batch Pipeline)

Expert I/O is merged across n concurrent sequences: each expert's weights are transferred from CPU to GPU only once per layer, serving every token routed to that expert across all n sequences.

Expected number of unique experts (uniform, independent routing; $E=128$, $k=8$): a given expert is missed by one sequence's token with probability $1 - k/E$, hence missed by all $n$ sequences with probability $(1 - k/E)^n$, giving

$$E_{unique}(n) = E \cdot \left(1 - \left(1 - \frac{k}{E}\right)^n\right)$$

Theoretical speedup of per-token expert I/O time relative to $n=1$:

| n   | Unique experts | Per-layer I/O time | AggTPS | Speedup |
|-----|----------------|--------------------|--------|---------|
| 1   | 8.0            | 3.10 ms            | 2.6    | 1.00x   |
| 4   | 29.3           | 11.3 ms            | 2.8    | 1.08x   |
| 8   | 51.0           | 19.7 ms            | 3.2    | 1.26x   |
| 16  | 81.5           | 31.5 ms            | 4.0    | 1.54x   |
| 32  | 108            | 41.8 ms            | 6.0    | 2.33x   |
| 64  | 121            | 46.8 ms            | 10.7   | 4.15x   |
| 128 | 127            | 49.2 ms            | 20.4   | 7.88x   |

(Parameters: $t_{io}=0.387\text{ms/expert}$, $t_{attn}=0.024\text{ms/seq}$, $L=48$)
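The unique-experts and per-layer I/O columns above follow directly from the closed-form expectation. A minimal plain-Python sketch (not part of the package; the table's values may be rounded or simulated, so small deviations are expected):

```python
E, K = 128, 8            # experts per layer, experts activated per token
T_IO = 0.387             # ms to transfer one expert's weights over PCIe

def expected_unique_experts(n: int, E: int = E, k: int = K) -> float:
    """Each expert is missed by one sequence with prob (1 - k/E), so it is
    hit by at least one of n independent sequences with prob 1 - (1 - k/E)**n."""
    return E * (1.0 - (1.0 - k / E) ** n)

for n in (1, 4, 8, 16, 32, 64, 128):
    e_u = expected_unique_experts(n)
    io_ms = e_u * T_IO                    # per-layer expert I/O time
    speedup = (K * T_IO) / (io_ms / n)    # per-token expert I/O vs. n = 1
    print(f"n={n:3d}  unique={e_u:6.1f}  io/layer={io_ms:5.1f}ms  speedup={speedup:.2f}x")
```

Because `expected_unique_experts(n)` grows sublinearly in n, the per-token share of expert I/O shrinks as more sequences are batched, which is the source of the speedup column.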

Other Key Techniques

  • CorrelationTable: learns expert routing patterns online and prefetches the next layer's hot experts
  • AsyncLoader: dual CUDA streams (io_stream + compute_stream) to overlap I/O with computation
  • ExpertExecutor: hot-expert-first + ready-first ordering to minimize intra-layer bubbles
  • HardwareProfiler + ConstraintSolver: automatically measures the 6 hardware latency parameters and solves for the optimal n
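The CorrelationTable idea can be illustrated with a hypothetical minimal sketch (plain Python; the class name matches the package but this API and data layout are assumptions, not the actual implementation): count which next-layer experts tend to follow the experts activated at the current layer, then rank prefetch candidates by those counts.

```python
from collections import Counter, defaultdict

class CorrelationTable:
    """Hypothetical sketch: online co-occurrence counts between a layer's
    active experts and the experts activated at the next layer."""

    def __init__(self):
        # (layer, expert) -> Counter of next-layer experts observed after it
        self.table = defaultdict(Counter)

    def observe(self, layer, experts, next_experts):
        """Record one routing step: `experts` fired at `layer`,
        `next_experts` fired at `layer + 1`."""
        for e in experts:
            self.table[(layer, e)].update(next_experts)

    def prefetch_candidates(self, layer, experts, top_m=4):
        """Merge the counters of the currently active experts and return
        the top_m hottest next-layer experts to prefetch."""
        merged = Counter()
        for e in experts:
            merged.update(self.table[(layer, e)])
        return [expert for expert, _ in merged.most_common(top_m)]

table = CorrelationTable()
table.observe(0, [3, 7], [10, 10, 42])   # expert 10 followed twice, 42 once
table.observe(0, [3], [10])
print(table.prefetch_candidates(0, [3]))  # → [10, 42]
```

Candidates returned this way can be handed to the async loader so that hot experts are already on the GPU when the next layer starts.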

Installation

Quick Install

# 1. Install speculators first (vllm-project ecosystem)
pip install speculators

# 2. Install srep-moe
git clone https://github.com/yourorg/srep-moe
cd srep-moe
pip install -e .

Dependencies

| Package      | Version            | Notes                                |
|--------------|--------------------|--------------------------------------|
| speculators  | ≥0.1.0             | Upstream SD training/inference library |
| torch        | ≥2.1.0             | CUDA kernels                         |
| transformers | ≥4.40.0            | Qwen3/Mixtral                        |
| safetensors  | ≥0.4.0             | FP8 shard loading                    |
| vllm         | ≥0.6.0 (optional)  | Baseline comparison                  |

Quick Start

1. End-to-end validation (no real model weights required)

python demo.py               # Toy-model validation on pure CPU
python demo.py --gpu         # GPU validation
python demo.py --large       # Larger configuration

2. Real Qwen3-30B inference

# Hardware measurement
python run_qwen3_moe.py --mode profile

# Theoretical performance table (no model loading)
python run_qwen3_moe.py --mode theory

# Torrent single-sequence inference
python run_qwen3_moe.py --mode torrent --n 4 --max_new_tokens 64

# N-batch concurrent inference
python run_qwen3_moe.py --mode nbatch --n 8 --max_new_tokens 32

# N-batch sweep
python run_qwen3_moe.py --mode sweep --max_new_tokens 16

3. Toy-model benchmark

# Compare against the baseline (toy MoE)
python benchmark/run_torrent.py --compare --n 4

# Sweep over n
python benchmark/run_torrent.py --sweep

# Run the baseline alone
python benchmark/run_baseline.py --steps 20

Project Structure

srep-moe/
├── pyproject.toml              # Package config (pip install srep-moe)
├── src/torrent/                # Core package
│   ├── __init__.py             # Unified exports
│   ├── nbatch_engine.py        # TorrentNBatchEngine (n-batch concurrent inference)
│   ├── planner/
│   │   ├── hardware_profiler.py  # Measures the 6 hardware latency parameters
│   │   └── constraint_solver.py  # Solves for the optimal n (paper Algorithm 2)
│   ├── prefetch/
│   │   ├── correlation_table.py  # Online learning of routing patterns
│   │   └── async_loader.py       # Dual-CUDA-stream async loading
│   ├── runtime/
│   │   ├── memory_manager.py     # Three-tier storage (GPU/CPU/Disk)
│   │   ├── expert_executor.py    # Hot-first + ready-first execution
│   │   └── pipeline.py           # TorrentPipeline / TorrentConfig
│   ├── metrics/
│   │   └── collector.py          # Throughput/latency/bubble metrics
│   └── models/
│       └── __init__.py           # Qwen3MoE integration (FP8 dequantization + patch)
├── models/
│   └── moe_model.py            # MoEModelWrapper (Mixtral offloading)
├── benchmark/
│   ├── run_baseline.py         # Serial whole-layer-prefetch baseline
│   └── run_torrent.py          # Torrent multi-batch benchmark
├── demo.py                     # End-to-end validation script
├── run_qwen3_moe.py            # Entry point for real Qwen3-30B inference
└── experiments/
    └── qwen3_moe_eagle3/       # Experiment logs and reproduction scripts
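The planner's job (HardwareProfiler + ConstraintSolver) can be sketched as a brute-force search: given the measured latency parameters, pick the n that maximizes aggregate decode throughput. The objective below is a deliberate simplification for illustration, not the paper's Algorithm 2, and it ignores memory limits, pipeline bubbles, and I/O-compute overlap; the latency values are the ones quoted earlier.

```python
E, K, L = 128, 8, 48                  # experts, top-k, layers
T_IO, T_ATTN = 0.387, 0.024           # ms/expert transfer, ms/seq attention

def e_unique(n: int) -> float:
    """Expected unique experts hit per layer by n independent sequences."""
    return E * (1.0 - (1.0 - K / E) ** n)

def agg_tps(n: int) -> float:
    """Simplified aggregate tokens/s: per decode step, every layer streams
    its unique experts once and runs attention for each of n sequences."""
    step_ms = L * (T_IO * e_unique(n) + T_ATTN * n)
    return n / (step_ms / 1000.0)

def solve_best_n(max_n: int = 128) -> int:
    """Brute-force the n in [1, max_n] with the highest modeled throughput."""
    return max(range(1, max_n + 1), key=agg_tps)

print(solve_best_n(), round(agg_tps(solve_best_n()), 1))
```

Under this simplified model throughput grows monotonically with n, so the search just returns the cap; in the real solver the binding constraints (GPU memory, bubble terms) are what make an interior n optimal.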

Relationship to speculators

| Feature                            | speculators | srep-moe  |
|------------------------------------|-------------|-----------|
| SD (speculative decoding) training | ✓           |           |
| EAGLE-3 draft model                | ✓           |           |
| vLLM integration                   | ✓           | reference |
| MoE CPU offloading                 |             | ✓         |
| Expert-aware n-batch               |             | ✓         |
| PCIe I/O optimization              |             | ✓         |
| Qwen3-30B pipeline                 |             | ✓         |

Typical deployment: first use speculators to train/export the EAGLE-3 draft head, then run joint inference with srep-moe in a CPU-PCIe-GPU environment.


Paper Reproduction Checks

| Check                        | Paper section | Script                               |
|------------------------------|---------------|--------------------------------------|
| Hardware latency measurement | §9.1 Stage A  | demo.py --skip_compare --skip_sweep  |
| Constraint-solved optimal n  | §7            | demo.py --skip_compare --skip_sweep  |
| Throughput-vs-n curve        | §12 Fig. 1    | benchmark/run_torrent.py --sweep     |
| Baseline comparison          | §12 Fig. 2    | benchmark/run_torrent.py --compare   |
| Real Qwen3 inference         | §9 (measured) | run_qwen3_moe.py --mode torrent      |
