DeepSeek-V4-Flash inference on single RTX 4090

Home-Seek

DeepSeek-V4-Flash (284B MoE) inference on a single RTX 4090


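Overview

Home-Seek runs DeepSeek-V4-Flash on a single NVIDIA RTX 4090 (24 GB VRAM). The model is an MoE with 284B total parameters (13B active) and native support for million-token contexts. FP4 quantization, a four-level cache hierarchy, custom Triton kernels, and MTP speculative decoding together make a usable inference service possible.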

| Metric | Value |
|:---|---:|
| Decode (warm, single prompt) | 1.60 t/s |
| Decode (warm, multiple prompts) | 1.54 t/s |
| Prefill (warm, 5-8 tokens) | ~2.0 t/s |
| TTFT (warm) | ~3.4 s |
| Peak VRAM | 19.9 GB |
| CPU cache | ~45 GB FP4, pinned |
| MTP eager (upper bound) | 5.96 t/s |
| MTP verified | 1.38 t/s |

Architecture overview

graph TB
    subgraph Client["Client"]
        CLI["CLI<br/>home-seek cli"]
        API["curl / OpenAI SDK"]
    end

    subgraph Server["HTTPServer (stdlib, daemon thread)"]
        Chat["/v1/chat/completions<br/>SSE streaming"]
        Stats["Per-turn stats<br/>prefill/decode t/s, TTFT"]
    end

    subgraph Engine["HomeSeekInferenceEngine"]
        direction TB
        Gen["generate()<br/>MTP speculative decoding"]
        Fwd["_forward_layer()<br/>shared forward for all 43 layers"]

        subgraph Attn["Hybrid attention"]
            SWA["Sliding window<br/>128 tokens"]
            CSA["Compressed sparse<br/>4× compression, indexer top-512"]
            HCA["Heavily compressed<br/>128× compression"]
            MHC["Manifold hyper-connections<br/>4× residual, Sinkhorn mixing"]
        end

        subgraph MoE["DeepSeekMoE (43 layers)"]
            Router["Router<br/>softplus+sqrt + top-6"]
            Shared["Shared expert<br/>FP8 → BF16 lazy dequant"]
            Routed["6 routed experts<br/>FP4 → BF16 Triton dequant"]
            FusedFFN["FusedMoEFFN<br/>cuBLAS M≤8 / Triton M>8"]
        end

        subgraph Cache["Four-level cache"]
            GPUHot["GPU Hot BF16<br/>max 64, FIFO"]
            GPUBF16["GPU BF16 LRU<br/>max 100"]
            CPUFP4["ExpertWeightCache<br/>FP4 compressed, ~3300 entries"]
            Disk["safetensors mmap<br/>RAID 1.5 GB/s"]
        end
    end

    Client --> Server
    Server --> Chat --> Gen
    Gen --> Fwd
    Fwd --> Attn
    Fwd --> MoE
    Fwd --> Cache
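
The router label in the diagram ("softplus+sqrt + top-6") can be read as: score each expert with sqrt(softplus(logit)), then keep the six highest-scoring experts. Below is a minimal pure-Python sketch of that reading. It is not the project's `router.py`; the function name `route_top_k` and the renormalization of the selected weights are assumptions made for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_top_k(logits: list[float], k: int = 6) -> list[tuple[int, float]]:
    """Return (expert_id, weight) pairs for the k highest-scoring experts.

    Scores are sqrt(softplus(logit)). The weights are renormalized to sum
    to 1, which is an assumption and not taken from the project.
    """
    scores = [math.sqrt(softplus(x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    return [(i, scores[i] / total) for i in top]


# 16 hypothetical experts with increasing logits.
for expert_id, weight in route_top_k([0.1 * i for i in range(16)]):
    print(expert_id, round(weight, 4))
```

Because sqrt(softplus(x)) is monotonic in x, the selected experts are simply those with the largest logits; the transform only changes the mixing weights.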

Inference sequence

sequenceDiagram
    participant C as Client
    participant S as Server
    participant E as Engine
    participant CA as Cache
    participant GPU as GPU

    C->>S: POST /v1/chat/completions (messages)
    S->>E: generate(input_ids)

    Note over E,GPU: Prefill phase
    E->>E: encode_messages() → input_ids
    loop 43 layers
        E->>E: _forward_layer() prefill
        E->>GPU: QKV, attention, FFN
    end

    Note over E,GPU: Decode phase (loop)
    loop until stop token or max_tokens
        E->>CA: _load_expert_weights(layer, eid)
        alt GPU hot cache hit
            CA-->>E: return BF16 weights
        else GPU LRU hit
            CA-->>E: return BF16 weights
        else CPU FP4 cache hit
            CA->>GPU: DMA pinned → GPU (non_blocking)
            GPU->>GPU: Triton dequant FP4 → BF16
        else miss (cold start)
            CA->>Disk: safetensors mmap read
            Disk-->>CA: FP4 data → CPU pinned
            CA->>GPU: DMA pinned → GPU
        end
        GPU->>GPU: FusedMoEFFN (cuBLAS/Triton)
        GPU->>GPU: hybrid attention (SWA+CSA+HCA)
        GPU->>GPU: MHC Sinkhorn mixing
        GPU-->>E: logits
        E->>E: sample next token (argmax if t=0)
    end

    Note over E,GPU: MTP speculative decoding (optional)
    E->>E: _mtp_generate_draft (M=2)
    E->>E: _mtp_verify_batched (fused, T=3)
    E->>E: accept verified tokens + bonus

    E-->>S: generated tokens + stats
    S-->>C: SSE stream + stats JSON
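
The expert lookup order in the decode loop above (GPU hot, then GPU LRU, then CPU FP4, then disk) can be sketched with plain Python containers. This is an illustrative model only: `TieredExpertCache`, `load_from_disk`, and `dequant` are hypothetical names, byte strings stand in for tensors, and the promotion policy (a freshly dequantized expert goes into the LRU level) is an assumption.

```python
from collections import OrderedDict
from typing import Callable

Key = tuple[int, int]  # (layer, expert id)


class TieredExpertCache:
    """Toy model of the four lookup levels; bytes stand in for tensors."""

    def __init__(self, load_from_disk: Callable[[Key], bytes],
                 dequant: Callable[[bytes], bytes],
                 hot_max: int = 64, lru_max: int = 100) -> None:
        self.gpu_hot: OrderedDict[Key, bytes] = OrderedDict()  # FIFO level
        self.gpu_lru: OrderedDict[Key, bytes] = OrderedDict()  # LRU level
        self.cpu_fp4: dict[Key, bytes] = {}                    # compressed
        self.load_from_disk = load_from_disk
        self.dequant = dequant
        self.hot_max, self.lru_max = hot_max, lru_max
        self.hits = {"hot": 0, "lru": 0, "cpu": 0, "disk": 0}

    def _fp4(self, key: Key) -> bytes:
        """Fetch the compressed entry from CPU, falling back to disk."""
        if key in self.cpu_fp4:
            self.hits["cpu"] += 1
        else:
            self.hits["disk"] += 1
            self.cpu_fp4[key] = self.load_from_disk(key)
        return self.cpu_fp4[key]

    def preload_hot(self, key: Key) -> None:
        """Insert into the hot level, evicting the oldest entry when full."""
        if len(self.gpu_hot) >= self.hot_max:
            self.gpu_hot.popitem(last=False)
        self.gpu_hot[key] = self.dequant(self._fp4(key))

    def get(self, key: Key) -> bytes:
        if key in self.gpu_hot:
            self.hits["hot"] += 1
            return self.gpu_hot[key]
        if key in self.gpu_lru:
            self.hits["lru"] += 1
            self.gpu_lru.move_to_end(key)  # mark as most recently used
            return self.gpu_lru[key]
        weights = self.dequant(self._fp4(key))
        self.gpu_lru[key] = weights
        if len(self.gpu_lru) > self.lru_max:
            self.gpu_lru.popitem(last=False)  # drop least recently used
        return weights
```

Counting hits per level, as `hits` does here, is what makes it possible to tell whether a given level earns its memory.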

Cache hierarchy

flowchart LR
    subgraph CPU["CPU (90 GiB)"]
        direction TB
        CPUFP4["ExpertWeightCache<br/>FP4 compressed<br/>~3292 entries<br/>2103 pinned<br/>~12.75 MB each"]
        PAGECACHE["Page Cache (OS)<br/>~45 GB"]
    end

    subgraph GPU["GPU (24 GB VRAM)"]
        direction TB
        GPUHOT["_gpu_hot<br/>BF16, 64×48MB<br/>FIFO eviction<br/>preloaded from hot_experts"]
        GPUBF16["_gpu_bf16_cache<br/>BF16, 100×48MB<br/>LRU eviction"]
        PARAMS["Non-expert parameters<br/>FP8/BF16<br/>~5 GB"]
        KV["KV Cache<br/>SWA+CSA+HCA<br/>~3 GB"]
    end

    subgraph DISK["Disk (RAID 1.5 GB/s)"]
        SAFE["safetensors<br/>46 files<br/>~150 GB"]
    end

    Request["Request expert (layer, eid)"] --> GPUHOT
    GPUHOT -- miss --> GPUBF16
    GPUBF16 -- miss --> CPUFP4
    CPUFP4 -- miss --> SAFE
    CPUFP4 --> GPUHOT & GPUBF16
    PAGECACHE --> SAFE

Key: (layer, eid) | Per expert: I×D×51/32 bytes ≈ 12.75 MB (FP4 data + f8 scale)
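
The per-expert figure can be reproduced with a few lines of arithmetic. The sketch below assumes D = 4096 (the hidden size that appears in the roofline shapes), I = 2048, three projection matrices per expert, 4-bit weights, and one 8-bit scale per block of 32 weights. I = 2048, the matrix count, and the block size are assumptions chosen because they reproduce the stated 12.75 MB and 48 MB; they are not confirmed by the source.

```python
D = 4096          # hidden size (appears in the roofline shapes)
I = 2048          # expert intermediate size (assumption)
MATRICES = 3      # gate, up, and down projections (assumption)
BLOCK = 32        # weights sharing one 8-bit scale (assumption)
MIB = 1024 ** 2

weights_per_expert = MATRICES * I * D

# FP4: half a byte per weight, plus one scale byte per BLOCK weights.
fp4_bytes = weights_per_expert // 2 + weights_per_expert // BLOCK
bf16_bytes = weights_per_expert * 2

print(f"FP4 + f8 scale: {fp4_bytes / MIB:.2f} MiB")   # 12.75 MiB
print(f"BF16:           {bf16_bytes / MIB:.2f} MiB")  # 48.00 MiB
print(f"ratio:          {bf16_bytes / fp4_bytes:.2f}x")
```

Under these assumptions the factor 51/32 is 3 × (1/2 + 1/32): three matrices, each costing half a byte per weight plus 1/32 byte of scale.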


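Performance

Baseline (temperature=0, max-tokens=20, warm average over rounds R2-R5)

| Mode | Prefill t/s | Decode t/s | vs 1.54 | Conditions |
|:---|---:|---:|---:|:---|
| No MTP, single prompt | 1.0 | 1.60 | | "Hello" × 5 rounds |
| No MTP, multiple prompts | ~2.0 | 1.54 | baseline | 5 different prompts |
| GQA fusion | ~2.0 | 1.57 | +2% | --use-gqa-fusion, within noise |
| MTP eager (verification skipped) | ~2.0 | 5.96 | +287% | upper bound, not for production |
| MTP verified | ~2.0 | 1.38 | −10% | M=2, KV cache + fused verification |
| MTP verified (old) | ~2.0 | 0.96 | −38% | M=4, no KV cache, two-stage verification |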
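
The "vs 1.54" column is the relative change in decode throughput against the multi-prompt baseline. A short check of that arithmetic, using only numbers from the table (the helper name `relative_change_pct` is illustrative):

```python
BASELINE_TPS = 1.54  # no MTP, multiple prompts


def relative_change_pct(tps: float, baseline: float = BASELINE_TPS) -> int:
    """Percentage change of `tps` relative to `baseline`, rounded."""
    return round((tps / baseline - 1.0) * 100)


runs = {
    "GQA fusion": 1.57,
    "MTP eager": 5.96,
    "MTP verified": 1.38,
    "MTP verified (old)": 0.96,
}
for name, tps in runs.items():
    print(f"{name:<20}{relative_change_pct(tps):+d}%")
```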

Bottleneck ranking (warm decode, R2-R5 average)

| # | Bottleneck | Time per token | Share | Status |
|:---|:---|---:|---:|:---|
| 1 | FFN layers (DMA + dequant + matmul) | ~430 ms | 69% | ⚠ includes file I/O |
| 2 | Attention layers (QKV proj + attn + compress) | ~190 ms | 30% | ⚠ dominated by large GEMMs |
| 3 | Of which: file I/O (page cache) | ~90 ms | 14% | ⬇ greatly reduced (cold 4.5 ms → warm 1.0 ms per load) |
| 4 | MTP acceptance rate | n/a | n/a | ⚠ ~38%; ~60% needed to break even |
| 5 | Python dispatch | ~40 ms | 6% | ⚠ next target (partial CUDA graph) |
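
As a sanity check, the FFN and attention times add up to roughly the measured decode rate. The snippet only restates the table's approximate numbers; treating Python dispatch as a portion of that total rather than additional time is an interpretation.

```python
ffn_ms = 430.0       # row 1: FFN layers
attn_ms = 190.0      # row 2: attention layers
file_io_ms = 90.0    # row 3: part of the FFN time
dispatch_ms = 40.0   # row 5: Python dispatch

total_ms = ffn_ms + attn_ms
decode_tps = 1000.0 / total_ms
print(f"per-token time: {total_ms:.0f} ms -> {decode_tps:.2f} t/s")

for name, ms in [("FFN", ffn_ms), ("attention", attn_ms),
                 ("file I/O", file_io_ms), ("dispatch", dispatch_ms)]:
    print(f"{name:<10}{ms / total_ms:6.1%}")
```

The result, about 1.61 t/s, is close to the measured 1.60 t/s for a single warm prompt.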


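Optimizations that worked ✅

| Optimization | Gain | Notes |
|:---|:---|:---|
| CPU pinned memory | +26% | Calls .pin_memory() on the CPU FP4 entries at engine init, so DMA no longer degrades to synchronous copies. Largest single gain |
| FP4 quantization | 4× memory saving | Routed experts stored as FP4 (E2M1): 12.75 MB per expert vs 48 MB in BF16 |
| FusedMoEFFN cuBLAS for M≤8 | +1.9% | With M≤8, Triton leaves 15 of 16 SMs idle and cuBLAS is 8× faster (fused_moe.py:297) |
| CPU cache ↔ page cache balance | Eliminates cold starts | min(RAM/2, total_exp) ≈ 45 GB cache + 45 GB page cache, without starving the OS |
| f8 scale kept as float8_e8m0fnu | 4× memory saving | _make_raw_entry does not convert to fp32: 12.75 MB per expert (vs 30 MB if fp32) |
| Per-layer hot-expert detection | Higher hit rate | _all_routed_are_hot replaces the global set, giving more accurate hot coverage |
| Hot-expert preloading | Fewer cold misses | At startup, preloads each layer's hot experts from hot_experts.json into the GPU FIFO |
| MTP argmax (t=0) | Stable acceptance rate | At temperature=0 the draft also uses argmax, removing random noise |
| MTP KV cache + fused verification | +44% (0.96→1.38) | Cross-step attention KV cache + a single forward over torch.cat inputs |
| Thread-safe ExpertWeightCache | Stable under multithreading | put() wraps KeyError handling for concurrent eviction |
| MTP expert scale in float32 | Fixes 0% acceptance rate | Triton dequantize does not support float8_e8m0fnu |
| Stop-token detection | Prevents garbage output | Reads the real `<\|end▁of▁sentence\|>` token (token 1) from tokenizer.json |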
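
To make the FP4 rows concrete, here is a reference-style sketch of E2M1 decoding with an E8M0 block scale, written in plain Python rather than Triton. The value table and the 2^(e-127) scale follow the OCP microscaling formats. The nibble order (low nibble first) and the function name `dequant_fp4_block` are assumptions and may differ from the project's `_fp4.py`.

```python
# Magnitudes representable by FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit code: bit 3 is the sign, bits 0-2 pick the magnitude."""
    magnitude = E2M1_MAGNITUDES[nibble & 0x7]
    return -magnitude if nibble & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode a power-of-two scale, 2 ** (e - 127).

    The reserved NaN encoding 0xFF is not handled in this sketch.
    """
    return 2.0 ** (scale_byte - 127)


def dequant_fp4_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one block: two codes per byte, low nibble first (assumed)."""
    scale = decode_e8m0(scale_byte)
    out: list[float] = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out


block = bytes([0x21, 0xF7])           # codes 1, 2, 7, 15
print(dequant_fp4_block(block, 128))  # scale = 2.0 -> [1.0, 2.0, 12.0, -12.0]
```

Keeping the scale as a single exponent byte is what makes the dequant a cheap multiply by a power of two.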

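Optimizations that did not work ❌

| Optimization | Why it was tried | Why it failed | Outcome |
|:---|:---|:---|:---|
| GQA attention fusion (--use-gqa-fusion) | Eliminate the 64× KV expand and cut memory traffic | The attention matmul is <5% of _forward_attn; the hot spots are the large GEMMs in the QKV/Wo projections | No throughput gain (1.60→1.57, within noise). Off by default |
| MTP verified (M=2) | Speed up decoding speculatively | Acceptance rate is ~38%, while ~60% is needed to offset the 43-layer verification pass. The gap is between the 1-layer MTP module and the 43-layer main model | Slower than no-MTP (1.38 vs 1.54). The weight-reuse principle works, but draft quality is insufficient |
| Full CPU preload | Eliminate all file I/O | 11008×12.75 MB ≈ 140 GB crowds out the page cache; inference time +8% | Fell back to the balanced strategy, ~45 GB |
| Async DMA prefetch | Overlap DMA with compute | CUDA stream management overhead exceeds the gain; FP4 dequant takes 0.04 ms vs 0.8 ms for DMA, so there is nothing to overlap | Disabled, code moved to scripts/ |
| GPU FP4 store (old architecture) | Cache FP4 experts on the GPU | Same keys, capacity, and LRU policy as the CPU ExpertWeightCache, so the hit rate was ~0% | Removed |
| Shared-expert cuBLAS fusion | Consistent FP32 accumulation order | cuBLAS and Triton accumulate in different orders → routing noise of ±20% | Not usable for A/B comparison |
| MHC_post Triton kernel | Replace the PyTorch fallback | Always raises AssertionError | Uses the PyTorch fallback |
| ExpertCacheManager (expert_cache.py) | Unified four-level cache abstraction | engine.py builds its own duplicate caches and never wires it in | Half-finished; its stats point at an empty cache |
| mypy type checking | Type safety | mypy is not installed in the project | make typecheck is deprecated |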

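Key design decisions

Why doesn't MTP verified speed things up?

MTP Eager (verification skipped): 1 main fwd → generate 2 drafts → accept all = 3 tok/step → 5.96 t/s
MTP Verified:                     1 main fwd → generate 2 drafts → verify (43-layer fwd) → accept ~0.76 tok → 1.38 t/s

Verification requires one full 43-layer forward pass. An acceptance rate of ~38% is not high enough, so the verification cost exceeds the gain from drafting. The fundamental limit is the capability gap between the 1-layer MTP module and the 43-layer main model.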
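
The accept step at temperature 0 can be sketched as follows: drafts are accepted while they match the main model's argmax at the same position, and the main model's token at the first mismatch (or after the last draft) is emitted as the bonus token. This is the generic greedy speculative-decoding rule, not a copy of `_mtp_verify_batched`; `accept_drafts` and its arguments are hypothetical.

```python
def accept_drafts(drafts: list[int], target_argmax: list[int]) -> list[int]:
    """Return the tokens emitted by one verify step.

    drafts:        M draft tokens proposed by the MTP module.
    target_argmax: M + 1 greedy tokens from the main model. Entry i is its
                   prediction for the position of draft i; the last entry
                   is its prediction after all drafts.
    """
    if len(target_argmax) != len(drafts) + 1:
        raise ValueError("expected one more target token than drafts")
    emitted: list[int] = []
    for draft, target in zip(drafts, target_argmax):
        if draft != target:
            emitted.append(target)  # correction replaces the rejected draft
            return emitted
        emitted.append(draft)
    emitted.append(target_argmax[-1])  # all drafts accepted: bonus token
    return emitted


print(accept_drafts([5, 7], [5, 7, 9]))  # both drafts match -> [5, 7, 9]
print(accept_drafts([5, 7], [5, 8, 9]))  # second draft rejected -> [5, 8]
print(accept_drafts([5, 7], [6, 7, 9]))  # first draft rejected -> [6]
```

With M = 2, a step emits between one and three tokens, and every step pays for the full verification pass regardless of how many drafts survive.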

Why doesn't GQA fusion speed things up?

| Operation | Share | Shapes |
|:---|---:|:---|
| QKV projection (3× GEMM) | ~50% | wq_a [1024,4096], wq_b [32768,1024], wkv [512,4096] |
| Wo projection (2× GEMM) | ~25% | wo_a [1024,4096], wo_b [4096,8192] |
| KV compress + other | ~23% | compressor, RoPE, indexer |
| Attention matmul (optimization target) | ~2% | SDPA [64,1,512] @ [512,T_kv], negligible |

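Roofline analysis

M=1 decode matmul: [1,4096] × [16384,4096]
  arithmetic intensity = FLOPs / bytes = 2×M×K×N / (K×N×2 B) = M  (FLOP/byte, BF16 weights only)

M=1:  arithmetic intensity = 1.0  → bandwidth ceiling = 847 GB/s × 1.0 = 0.85 TFLOPS ✓
M=64: arithmetic intensity = 64   → bandwidth ceiling = 847 GB/s × 64 ≈ 54 TFLOPS

The M=1 decode matmul is limited by memory bandwidth, not by GPU compute. Hardware upgrades (for example a 48 GB VRAM card) bring very little marginal benefit.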
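
The roofline estimate can be reproduced numerically. The sketch below counts only the BF16 weight bytes, as in the expression 2×M×K×N / (K×N×2 B), and uses the 847 GB/s figure from the text; the function names are illustrative.

```python
BANDWIDTH_GBPS = 847.0   # memory bandwidth used in the text, in GB/s
K, N = 4096, 16384       # weight shape from the example matmul
BYTES_PER_WEIGHT = 2     # BF16


def arithmetic_intensity(m: int) -> float:
    """FLOPs per byte of weight traffic for an [m, K] x [K, N] matmul."""
    flops = 2 * m * K * N
    weight_bytes = K * N * BYTES_PER_WEIGHT
    return flops / weight_bytes


def bandwidth_ceiling_tflops(m: int) -> float:
    """Highest throughput the memory system can feed, in TFLOPS."""
    return BANDWIDTH_GBPS * arithmetic_intensity(m) / 1000.0


for m in (1, 8, 64):
    print(m, arithmetic_intensity(m), round(bandwidth_ceiling_tflops(m), 2))
```

Because the weight bytes do not depend on the batch dimension, the ceiling grows linearly with M, which is why batching or a second concurrent stream helps where a faster GPU does not.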

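Hardware upgrade cost-effectiveness

| Option | Cost | Gain | Notes |
|:---|---:|:---|:---|
| Second 4090 (pipeline) | ~$1,800 | +100% concurrency | Two cards = 2× throughput |
| Dedicated local NVMe | $0 | +5% | File I/O is not the main bottleneck |
| 48 GB VRAM GPU | ~$5,000 | +5% | M=1 utilization unchanged |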

Installation

# Download the weights (~150 GB)
python -m home_seek download
# Or: huggingface-cli download QingGo/Home-Seek --local-dir weights

# Install from source
git clone https://github.com/QingGo/home-seek.git
cd home-seek
make install

Quick start

# Start the server
make server

# Interactive CLI (in another terminal)
make cli

# OpenAI-compatible API
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":50,"temperature":0}'

# Profiling
make profile

# Multi-round profiling
uv run python -m home_seek.profiling_runner --rounds 5 \
  --prompts "Hello" "What is AI?" "Write a poem" "How are you?" "Hi" \
  --max-tokens 20 --temperature 0
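
The same request can be sent from Python with only the standard library. The sketch assumes the server started by `make server` is listening on localhost:8000, as in the curl example, and that a non-streaming reply follows the OpenAI chat-completions schema (`choices[0].message.content`). The second point is an assumption, since only the endpoint and the SSE streaming mode are documented here. `build_request` and `chat` are illustrative names.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"


def build_request(prompt: str, max_tokens: int = 50,
                  temperature: float = 0.0) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the reply text (assumed OpenAI schema)."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is running, `print(chat("Hello"))` issues the request; at roughly 1.5 tokens per second a 50-token reply takes on the order of half a minute, so no client-side timeout is set here.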

CLI commands

  • /think: toggle thinking mode
  • /stats: show the stats for the last turn
  • /help: help
  • /quit: quit

API response stats

{
  "stats": {
    "encoding_tokens": 5, "encoding_time_ms": 7952, "encoding_speed_tps": 0.6,
    "ttft_ms": 7952,
    "generated_tokens": 9, "decode_time_ms": 6255, "decode_speed_tps": 0.8
  }
}

System requirements

  • GPU: NVIDIA RTX 4090 (24 GB VRAM), CUDA 12+
  • RAM: 90+ GB (container)
  • Disk: 150 GB (model weights)
  • OS: Linux

Project structure

src/home_seek/
├── inference_engine/          # Inference engine package
│   ├── engine.py              # HomeSeekInferenceEngine
│   ├── weight_loader.py       # WeightLoader + FP4/FP8 loading
│   ├── layer_state.py         # LayerState (per-layer KV state)
│   └── expert_cache.py        # ExpertWeightCache + ExpertCacheManager
├── fused_moe.py               # FusedMoEFFN + SharedExpertFFN (Triton + cuBLAS)
├── gqa_attention.py           # [experimental] GQA fused attention kernel
├── router.py                  # MoE routing
├── compressor.py              # KV compression
├── hybrid_kv_cache.py         # Hybrid KV Cache (SWA+CSA+HCA)
├── mhc.py                     # MHC Sinkhorn split
├── model_config.py            # @dataclass config
├── _fp4.py                    # FP4 quantization/dequantization utilities
├── profiling_runner.py        # Profiling entry point
└── expert_predictor.py        # Expert prediction

tests/
├── test_fixes.py, test_fp4_experts.py, test_mtp.py, ...
└── integration/
    └── test_inference_e2e.py  # End-to-end regression test

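Design discipline

  1. Write an L1 test before fixing a bug: an L1 test must run in under 1 second and pinpoint and reproduce the bug
  2. Verify performance changes with make profile: use --temperature 0 to remove routing noise, and average 3 or more runs
  3. Commit only after make lint test-unit passes: ruff static checks + unit tests
  4. Clear the cache after changing a Triton kernel: rm -rf ~/.triton/cache/
  5. _forward_layer is the shared method: every layer's forward goes through it, with no copy-paste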

License

MIT
