Expert-Aware Multi-Batch Pipeline for MoE + Speculative Decoding inference optimization (CPU-PCIe-GPU).

Torrent — Expert-Aware Multi-Batch MoE Pipeline

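Torrent is a standalone Python package that accelerates MoE inference in CPU-PCIe-GPU offloading setups. Its core algorithm comes from the paper "Klotski: An Expert-Aware Multi-Batch Pipeline for MoE LLM Inference" (arXiv:2502.06888), and the target integration model is Qwen3-30B-A3B with EAGLE-3 speculative decoding.

Core innovations

Expert-aware multi-batch pipeline (n-batch pipeline)

The expert I/O of n concurrent sequences is merged: each expert's weights are transferred from CPU to GPU only once and serve every token, across all n sequences, that is routed to that expert.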
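
The merging idea can be illustrated with a small sketch. It assumes the router output of each sequence is available as a list of expert ids per token; the `routes` layout and the `group_tokens_by_expert` helper are hypothetical names chosen for illustration, not part of the package's API. Tokens from all sequences are bucketed by expert, so one transfer per unique expert is enough for the layer.

```python
from collections import defaultdict

def group_tokens_by_expert(routes):
    """Bucket (sequence, token) pairs by the expert they are routed to.

    routes[s][t] is the list of expert ids chosen for token t of sequence s.
    This input layout is an assumption made for the example.
    """
    buckets = defaultdict(list)
    for seq_id, seq_routes in enumerate(routes):
        for tok_id, experts in enumerate(seq_routes):
            for expert_id in experts:
                buckets[expert_id].append((seq_id, tok_id))
    return dict(buckets)

# Two sequences, one decode token each, top-2 routing.
routes = [
    [[3, 7]],   # sequence 0, token 0 -> experts 3 and 7
    [[7, 12]],  # sequence 1, token 0 -> experts 7 and 12
]
buckets = group_tokens_by_expert(routes)

# Merged: one transfer per unique expert.
transfers_merged = len(buckets)
# Unmerged: one transfer per (token, expert) pair.
transfers_separate = sum(len(experts) for seq in routes for experts in seq)
print(transfers_merged, transfers_separate)
```

In this toy case expert 7 is shared by both sequences, so the merged schedule needs three transfers where handling the sequences separately would need four.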

Expected number of unique experts (uniform, independent routing, $E=128, k=8$):

$$E_{unique}(n) = E \cdot \left(1 - \left(1 - \frac{k}{E}\right)^n\right)$$
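
The closed form is easy to evaluate directly. The helper below is a minimal sketch (the function name is an assumption, not a package API) that implements the formula exactly as written, with $E=128$ and $k=8$ as defaults. For $n=4$ it gives about 29.1, close to but not identical to the 29.3 listed in the table, so the tabulated values may come from a slightly different estimate.

```python
def expected_unique_experts(n, num_experts=128, top_k=8):
    """E_unique(n) = E * (1 - (1 - k/E)^n) under uniform, independent routing."""
    # Probability that a given expert is missed by all n sequences.
    miss = (1.0 - top_k / num_experts) ** n
    return num_experts * (1.0 - miss)

for n in (1, 4, 8, 16, 32, 64, 128):
    print(n, round(expected_unique_experts(n), 1))
```

The value grows quickly for small n and saturates towards E, which is why the I/O cost per sequence keeps falling as more sequences share a layer.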

Theoretical speedup of per-token expert I/O time relative to $n=1$:

n	Unique experts	I/O time per layer	AggTPS	Speedup
1	8.0	3.10ms	2.6	1.00x
4	29.3	11.3ms	2.8	1.08x
8	51.0	19.7ms	3.2	1.26x
16	81.5	31.5ms	4.0	1.54x
32	108	41.8ms	6.0	2.33x
64	121	46.8ms	10.7	4.15x
128	127	49.2ms	20.4	7.88x

(Parameters: $t_{io}=0.387\text{ ms/expert}$, $t_{attn}=0.024\text{ ms/seq}$, $L=48$)
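
As a rough check of the I/O column, the sketch below assumes that the per-layer I/O time is the unique-expert count multiplied by $t_{io}$. This relation is inferred from the numbers, not stated in the source, and it matches the tabulated column to within rounding. The helper names are hypothetical. The AggTPS and speedup columns are not recomputed because the throughput model behind them is not given here.

```python
T_IO_MS = 0.387  # transfer time per expert (ms), from the parameters above

# Unique-expert counts copied from the table above.
UNIQUE_EXPERTS = {1: 8.0, 4: 29.3, 8: 51.0, 16: 81.5, 32: 108, 64: 121, 128: 127}

def layer_io_ms(unique_experts, t_io_ms=T_IO_MS):
    """I/O time for one layer: each unique expert is transferred once."""
    return unique_experts * t_io_ms

def io_per_sequence_ms(n, unique_experts, t_io_ms=T_IO_MS):
    """Layer I/O time amortized over the n sequences that share it."""
    return layer_io_ms(unique_experts, t_io_ms) / n

for n, u in UNIQUE_EXPERTS.items():
    print(f"n={n:3d}  layer I/O={layer_io_ms(u):5.1f} ms  "
          f"per sequence={io_per_sequence_ms(n, u):.2f} ms")
```

The per-layer time rises with n, but the amortized time per sequence falls, which is the effect the n-batch pipeline relies on.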

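Other key techniques

CorrelationTable: learns expert routing patterns online and prefetches the hot experts of the next layer
AsyncLoader: two CUDA streams (io_stream + compute_stream) overlap I/O with computation
ExpertExecutor: hot-expert-first plus ready-first ordering to minimize intra-layer bubbles
HardwareProfiler + ConstraintSolver: automatically measures 6 hardware latency parameters and solves for the optimal n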
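
Two of the components above can be sketched in a few lines. The code is an illustration under stated assumptions, not the package's implementation: `CorrelationTableSketch` and `execution_order` are hypothetical names, the table simply counts which next-layer experts follow which current-layer experts, and the ordering assumes that "ready" (already on the GPU) takes priority over "hot" (more tokens). The source does not say which of the two keys comes first.

```python
from collections import defaultdict

class CorrelationTableSketch:
    """Counts how often expert j at layer l+1 follows expert i at layer l."""

    def __init__(self):
        # counts[layer][prev_expert][next_expert] -> number of observations
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, layer, experts_here, experts_next):
        """Record one token's routing at `layer` and `layer + 1`."""
        for i in experts_here:
            for j in experts_next:
                self.counts[layer][i][j] += 1

    def predict_hot(self, layer, experts_here, top_n):
        """Return up to top_n next-layer experts with the highest counts."""
        scores = defaultdict(int)
        for i in experts_here:
            for j, c in self.counts[layer][i].items():
                scores[j] += c
        # Highest score first; ties broken by expert id for determinism.
        ranked = sorted(scores, key=lambda j: (-scores[j], j))
        return ranked[:top_n]

def execution_order(token_counts, ready):
    """Order experts: already-loaded ones first, then by descending token count."""
    return sorted(token_counts, key=lambda e: (e not in ready, -token_counts[e], e))

table = CorrelationTableSketch()
table.update(0, [1, 2], [5, 6])
table.update(0, [1, 3], [5, 9])
hot = table.predict_hot(0, [1], top_n=2)             # candidates to prefetch
order = execution_order({5: 3, 6: 1, 9: 4}, ready={6})
print(hot, order)
```

Running ready experts first lets computation start while the remaining transfers are still in flight, which is the kind of intra-layer bubble the executor is meant to reduce.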

Installation

Quick install

# 1. Install speculators first (vllm-project ecosystem)
pip install speculators

# 2. Install srep-moe
git clone https://github.com/yourorg/srep-moe
cd srep-moe
pip install -e .

Dependencies

Package	Version	Notes
`speculators`	≥0.1.0	Upstream SD training/inference library
`torch`	≥2.1.0	CUDA kernels
`transformers`	≥4.40.0	Qwen3/Mixtral
`safetensors`	≥0.4.0	FP8 shard reading
`vllm`	≥0.6.0	(optional) baseline comparison
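
To confirm that an environment meets these minimums before running anything heavy, a small check along the following lines may help. It is a sketch using only the standard library; `REQUIRED`, `parse_version` and `check_requirements` are hypothetical names, the version parsing is deliberately simplistic (it looks only at the leading numeric part of the first three components), and the optional `vllm` entry is left out.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the dependency table above.
REQUIRED = {
    "speculators": "0.1.0",
    "torch": "2.1.0",
    "transformers": "4.40.0",
    "safetensors": "0.4.0",
}

def parse_version(v):
    """Turn '2.1.0+cu121' into (2, 1, 0); suffixes such as 'rc1' are ignored."""
    nums = []
    for piece in v.split("+")[0].split(".")[:3]:
        m = re.match(r"\d+", piece)
        nums.append(int(m.group()) if m else 0)
    while len(nums) < 3:
        nums.append(0)
    return tuple(nums)

def check_requirements(required=REQUIRED):
    """Return {package: status}, where status is 'ok', 'too old (x)' or 'missing'."""
    report = {}
    for name, minimum in required.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = parse_version(installed) >= parse_version(minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report

if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```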

Quick start

1. End-to-end validation (no real model weights required)

python demo.py               # toy-model validation on CPU only
python demo.py --gpu         # GPU validation
python demo.py --large       # larger configuration

2. Real Qwen3-30B inference

# Hardware profiling
python run_qwen3_moe.py --mode profile

# Theoretical performance table (no model is loaded)
python run_qwen3_moe.py --mode theory

# Torrent single-sequence inference
python run_qwen3_moe.py --mode torrent --n 4 --max_new_tokens 64

# N-batch concurrent inference
python run_qwen3_moe.py --mode nbatch --n 8 --max_new_tokens 32

# N-batch sweep
python run_qwen3_moe.py --mode sweep --max_new_tokens 16

3. Toy-model benchmark

# Compare against the baseline (toy MoE)
python benchmark/run_torrent.py --compare --n 4

# Sweep over n
python benchmark/run_torrent.py --sweep

# Run the baseline alone
python benchmark/run_baseline.py --steps 20

Project structure

srep-moe/
├── pyproject.toml              # package config (pip install srep-moe)
├── src/torrent/                # core package
│   ├── __init__.py             # unified exports
│   ├── nbatch_engine.py        # TorrentNBatchEngine (n-batch concurrent inference)
│   ├── planner/
│   │   ├── hardware_profiler.py  # measures 6 hardware latency parameters
│   │   └── constraint_solver.py  # solves for the optimal n (paper Algorithm 2)
│   ├── prefetch/
│   │   ├── correlation_table.py  # online learning of routing patterns
│   │   └── async_loader.py       # asynchronous loading with two CUDA streams
│   ├── runtime/
│   │   ├── memory_manager.py     # three-tier storage (GPU/CPU/Disk)
│   │   ├── expert_executor.py    # hot-first + ready-first execution
│   │   └── pipeline.py           # TorrentPipeline / TorrentConfig
│   ├── metrics/
│   │   └── collector.py          # throughput/latency/bubble metrics
│   └── models/
│       └── __init__.py           # Qwen3MoE integration (FP8 dequantization + patch)
├── models/
│   └── moe_model.py            # MoEModelWrapper (Mixtral offloading)
├── benchmark/
│   ├── run_baseline.py         # baseline: serial whole-layer prefetch
│   └── run_torrent.py          # Torrent multi-batch benchmark
├── demo.py                     # end-to-end validation script
├── run_qwen3_moe.py            # entry point for real Qwen3-30B inference
└── experiments/
    └── qwen3_moe_eagle3/       # experiment logs and reproduction scripts

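Relationship to speculators

Feature	speculators	srep-moe
SD (speculative decoding) training	✅	❌
EAGLE-3 draft model	✅	❌
vLLM integration	✅	For reference
MoE CPU offloading	❌	✅
Expert-aware n-batch	❌	✅
PCIe I/O optimization	❌	✅
Qwen3-30B pipeline	❌	✅

Typical deployment: first train and export the EAGLE-3 draft head with speculators, then run joint inference with srep-moe in a CPU-PCIe-GPU environment.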

Paper reproduction checks

Check	Section	Script
Hardware latency measurement	§9.1 Phase A	`demo.py --skip_compare --skip_sweep`
Constraint solving for the optimal n	§7	`demo.py --skip_compare --skip_sweep`
Throughput vs. n curve	§12 Fig. 1	`benchmark/run_torrent.py --sweep`
Comparison with the baseline	§12 Fig. 2	`benchmark/run_torrent.py --compare`
Real Qwen3 inference	§9 measurements	`run_qwen3_moe.py --mode torrent`

References

Paper: arXiv:2502.06888 "Klotski: An Expert-Aware Multi-Batch Pipeline..."
speculators: github.com/vllm-project/speculators
Qwen3-30B-A3B: huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507-FP8

srep-moe 0.1.0, released Mar 9, 2026
