sageLLM: Modular LLM inference engine with prefill/decode (PD) separation for domestic computing power

sageLLM

Protocol Compliance (Mandatory)

🚀 Modular LLM Inference Engine for Domestic Computing Power

Ollama-like experience across hardware ecosystems, including Chinese accelerators (Huawei Ascend) and NVIDIA GPUs


✨ Features

  • 🎯 One-Click Install - pip install isagellm gets you started immediately
  • 🧠 CPU-First - Default CPU engine, no GPU required
  • 🇨🇳 Domestic Hardware - First-class support for Huawei Ascend NPU
  • 📊 Observable - Built-in metrics (TTFT, TBT, throughput, KV usage)
  • 🧩 Plugin System - Extend with custom backends and engines
  • 🔄 Mixed Inference - Unified LLM + Embedding client (MixedInferenceClient)
  • 🦙 Ollama Backend - Use a local Ollama server as an inference backend
  • 📈 Performance Profiling - Load profiling data and interpolate TTFT/throughput

📦 Quick Install

# Install sageLLM (CPU-first, no GPU required)
pip install isagellm

# With Control Plane (request routing & scheduling)
pip install 'isagellm[control-plane]'

# With API Gateway (OpenAI-compatible REST API)
pip install 'isagellm[gateway]'

# Full server (Control Plane + Gateway)
pip install 'isagellm[server]'

# With KV Cache
pip install 'isagellm[kv-cache]'

# With communication layer
pip install 'isagellm[comm]'

# With compression (quantization, sparse)
pip install 'isagellm[compression]'

# All optional features
pip install 'isagellm[all]'

# Reproduce exactly-tested sub-package versions (recommended for production)
pip install isagellm -c https://raw.githubusercontent.com/intellistream/sagellm/main-dev/constraints.txt

🚀 Accelerated PyTorch Install for Mainland China (Recommended)

Downloading the CUDA build of PyTorch from the official index is slow (~800 MB), so pre-downloaded wheels are provided on GitHub Releases:

# Option 1: use the sagellm CLI (recommended, simplest)
pip install isagellm
sage-llm install cuda --github     # download from GitHub (fast)
sage-llm install cuda              # download from the official index (default)

# Option 2: use pip --find-links directly
pip install torch==2.5.1+cu121 torchvision torchaudio \
  --find-links https://github.com/intellistream/sagellm-pytorch-wheels/releases/download/v2.5.1-cu121/ \
  --trusted-host github.com

Other supported backends

  • sage-llm install ascend - Huawei Ascend NPU
  • sage-llm install kunlun - Baidu Kunlun XPU
  • sage-llm install haiguang - Hygon DCU
  • sage-llm install cpu - CPU-only (smallest download)

💡 Why use GitHub acceleration?

  • ✅ Fast access from mainland China (GitHub CDN)
  • ✅ No mirror configuration needed
  • ✅ Official wheels, 100% trusted

📦 Wheels repository: https://github.com/intellistream/sagellm-pytorch-wheels

🚀 Quick Start

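Unified CLI Command

  • Primary command: sagellm
  • Compatibility alias: sage-llm (kept for backward compatibility; migrating to sagellm is recommended)

CLI (as simple as vLLM/Ollama)

# One-command start (full stack: Gateway + Engine)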
pip install 'isagellm[gateway]'
sage-llm serve --model Qwen2-7B

# ✅ The OpenAI-compatible API is available automatically
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2-7B",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

# Show system information
sage-llm info

# Single inference (without starting a server)
sage-llm run -p "What is LLM inference?"

# Recommended production start (via Gateway + Control Plane)
sage-llm serve --backend cpu --model sshleifer/tiny-gpt2 --port 8888

# One-command start: Gateway + LLM + Embedding
sage-llm serve \
  --backend cpu \
  --model sshleifer/tiny-gpt2 \
  --port 8888 \
  --with-embedding \
  --embedding-model sentence-transformers/all-MiniLM-L6-v2
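
The curl call above can also be made from Python using only the standard library. This is a minimal sketch, assuming the gateway follows the usual OpenAI chat-completions request and response shapes; the helper names build_chat_request and chat are illustrative and are not part of sageLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body (needs a running server)."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against a running server (assumes the standard OpenAI response shape):
#   reply = chat("http://localhost:8000", "Qwen2-7B", "Hello!")
#   print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect without a server running.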

Python API (Control Plane - Recommended)

import asyncio

from sagellm import ControlPlaneManager, BackendConfig, EngineConfig

# Install with: pip install 'isagellm[control-plane]'
async def main() -> None:
    manager = ControlPlaneManager(
        backend_config=BackendConfig(kind="cpu", device="cpu"),
        engine_configs=[
            EngineConfig(
                kind="cpu",
                model="sshleifer/tiny-gpt2",
                model_path="sshleifer/tiny-gpt2"
            )
        ]
    )

    await manager.start()
    try:
        # Requests are automatically routed to available engines
        response = await manager.execute_request(
            prompt="Hello, world!",
            max_tokens=128
        )
        print(response.output_text)
        print(f"TTFT: {response.metrics.ttft_ms:.2f} ms")
        print(f"Throughput: {response.metrics.throughput_tps:.2f} tokens/s")
    finally:
        await manager.stop()


asyncio.run(main())

⚠️ Important: Direct engine creation (create_engine()) is not exported from the umbrella package. All production code must use ControlPlaneManager for proper request routing, scheduling, and lifecycle management.

Mixed Inference (LLM + Embedding)

from sagellm import MixedInferenceClient, MixedRequest, RequestKind

# Unified client for both LLM and embedding
client = MixedInferenceClient(
    llm_url="http://localhost:8000",
    embedding_url="http://localhost:8001",
)

# LLM completion
resp = client.complete("What is 2+2?")
print(resp["text"])

# Embedding
vecs = client.embed("Hello world")

# Mixed batch dispatch
results = client.dispatch([
    MixedRequest(kind=RequestKind.LLM, content="Tell me a joke"),
    MixedRequest(kind=RequestKind.EMBEDDING, content="The quick brown fox"),
])

Ollama Backend

# Use a local Ollama server as inference backend
sage-llm ollama status                       # check health
sage-llm ollama list                         # list models
sage-llm ollama run -m llama3 -p "Hello!"    # single completion
sage-llm ollama chat -p "Explain Python"     # chat

from sagellm import OllamaClient

client = OllamaClient(model="llama3")
resp = client.complete("What is 2+2?")
print(resp["text"])
models = client.list_models()

Performance Profiling & Interpolation

from sagellm.profiling import PerformanceInterpolator

# Load CSV: columns isl, ttft, itl, throughput
interp = PerformanceInterpolator.from_csv("profiles/qwen2_7b_a100.csv")

# Predict metrics for a given input sequence length
ttft = interp.predict_ttft(512)          # → seconds
itl  = interp.predict_itl(512)           # → seconds/token
tput = interp.predict_throughput(512)    # → tokens/second

# Reverse: find max ISL that satisfies a TTFT budget
max_isl = interp.reverse_ttft(target_ttft=0.3)
print(f"Max ISL for 300ms TTFT: {max_isl} tokens")
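
To make the idea behind predict_ttft and reverse_ttft concrete, here is a minimal sketch of piecewise-linear interpolation over a profile table. It is not the PerformanceInterpolator implementation, which may use a different method; the function names and sample numbers are hypothetical, and the reverse lookup assumes TTFT is strictly increasing in ISL.

```python
from bisect import bisect_left
from typing import List, Optional


def interpolate(xs: List[float], ys: List[float], x: float) -> float:
    """Piecewise-linear interpolation; clamps to the end points outside the range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # now xs[i - 1] < x <= xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def max_isl_for_ttft(isl: List[float], ttft: List[float], target_ttft: float) -> Optional[float]:
    """Largest ISL whose interpolated TTFT stays within the budget.

    Assumes ttft is strictly increasing. Returns None when even the
    smallest profiled ISL exceeds the budget.
    """
    if target_ttft < ttft[0]:
        return None
    # Swap the axes: interpolate ISL as a function of TTFT.
    return interpolate(ttft, isl, target_ttft)


# Hypothetical profile: input sequence length vs. TTFT in seconds.
profile_isl = [128.0, 256.0, 512.0, 1024.0]
profile_ttft = [0.05, 0.10, 0.20, 0.40]

print(interpolate(profile_isl, profile_ttft, 384.0))
print(max_isl_for_ttft(profile_isl, profile_ttft, 0.3))
```

Clamping at the table edges is a design choice for this sketch; extrapolating beyond the profiled range would need a model of how latency scales.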

Configuration

# ~/.sagellm/config.yaml
backend:
  kind: cpu  # Options: cpu, pytorch-cuda, pytorch-ascend
  device: cpu

engine:
  kind: cpu
  model: sshleifer/tiny-gpt2

control_plane:
  endpoint: "localhost:8080"

📊 Metrics & Validation

sageLLM provides comprehensive performance metrics:

{
  "ttft_ms": 45.2,
  "tbt_ms": 12.5,
  "throughput_tps": 80.0,
  "peak_mem_mb": 24576,
  "kv_used_tokens": 4096,
  "prefix_hit_rate": 0.85
}

Run benchmarks:

sage-llm demo --workload year1 --output metrics.json
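
For readers new to these fields, the sketch below shows one common way to derive ttft_ms, tbt_ms, and throughput_tps from per-token arrival timestamps. It assumes the usual meanings (time to first token, mean time between tokens, generated tokens per second of wall-clock time); sageLLM's exact definitions may differ, and StreamMetrics and compute_stream_metrics are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamMetrics:
    ttft_ms: float
    tbt_ms: float
    throughput_tps: float


def compute_stream_metrics(request_start_s: float, token_times_s: List[float]) -> StreamMetrics:
    """Derive latency metrics from the arrival time of each generated token."""
    if not token_times_s:
        raise ValueError("at least one token timestamp is required")

    # TTFT: delay between sending the request and receiving the first token.
    ttft_ms = (token_times_s[0] - request_start_s) * 1000.0

    # TBT: mean gap between consecutive tokens (0 when only one token arrived).
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    tbt_ms = (sum(gaps) / len(gaps)) * 1000.0 if gaps else 0.0

    # Throughput: generated tokens divided by total wall-clock time.
    total_s = token_times_s[-1] - request_start_s
    throughput_tps = len(token_times_s) / total_s if total_s > 0 else 0.0
    return StreamMetrics(ttft_ms, tbt_ms, throughput_tps)


# Hypothetical timestamps in seconds, relative to the request start.
metrics = compute_stream_metrics(0.0, [0.5, 0.75, 1.0])
print(metrics)
```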

🏗️ Architecture

isagellm (umbrella package)
├── isagellm-protocol       # Protocol v0.1 types
│   └── Request, Response, Metrics, Error, StreamEvent
├── isagellm-backend        # Hardware abstraction (L1 - Foundation)
│   └── BackendProvider, CPUBackend, (CUDABackend, AscendBackend)
├── isagellm-comm           # Communication primitives (L2 - Infrastructure)
│   └── Topology, CollectiveOps (all_reduce/gather), P2P (send/recv), Overlap
├── isagellm-kv-cache       # KV cache management (L2 - Optional)
│   └── PrefixCache, MemoryPool, EvictionPolicies, Predictor, KV Transfer
├── isagellm-compression    # Inference acceleration (quantization, sparsity, etc.) (L2 - Optional)
│   └── Quantization, Sparsity, SpeculativeDecoding, Fusion
├── isagellm-core           # Engine core & runtime (L3)
│   └── Config, Engine, Factory, DemoRunner, Adapters (vLLM/LMDeploy)
├── isagellm-control-plane  # Request routing & scheduling (L4 - Optional)
│   └── ControlPlaneManager, Router, Policies, Lifecycle
└── isagellm-gateway        # OpenAI-compatible REST API (L5 - Optional)
    └── FastAPI server, /v1/chat/completions, Session management
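
The layer labels in the tree imply that a package should only depend on packages at the same or a lower layer. The sketch below checks that rule for a dependency map. It is not the project's scripts/verify_dependencies.py, whose contents are not shown here; placing isagellm-protocol at layer 0, allowing same-layer dependencies, and the sample dependency map are all assumptions.

```python
from typing import Dict, List, Tuple

# Layer numbers as annotated in the tree above. The protocol package has no
# layer label there, so placing it at 0 is an assumption.
LAYERS: Dict[str, int] = {
    "isagellm-protocol": 0,
    "isagellm-backend": 1,
    "isagellm-comm": 2,
    "isagellm-kv-cache": 2,
    "isagellm-compression": 2,
    "isagellm-core": 3,
    "isagellm-control-plane": 4,
    "isagellm-gateway": 5,
}


def find_layer_violations(dependencies: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (package, dependency) pairs where a package depends on a higher layer."""
    violations = []
    for package, deps in dependencies.items():
        for dep in deps:
            if LAYERS[dep] > LAYERS[package]:
                violations.append((package, dep))
    return violations


# Hypothetical dependency map; the second entry deliberately breaks the rule.
sample = {
    "isagellm-core": ["isagellm-backend", "isagellm-protocol"],
    "isagellm-backend": ["isagellm-core"],
}
print(find_layer_violations(sample))
```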

🔧 Development

Quick Setup (Development Mode)

# Clone all repositories
./scripts/clone-all-repos.sh

# Install all packages in editable mode
./quickstart.sh

# Open all repos in VS Code Multi-root Workspace
code sagellm.code-workspace

📖 See WORKSPACE_GUIDE.md for Multi-root Workspace usage.

Testing

# Clone and setup
git clone https://github.com/IntelliStream/sagellm.git
cd sagellm
pip install -e ".[dev]"

# Run tests
pytest -v

# Format & lint
ruff format .
ruff check . --fix

# Type check
mypy src/sagellm/

# Verify dependency hierarchy
python scripts/verify_dependencies.py

📖 Development Resources


📚 Documentation Index

User Documentation

Developer Documentation

API Documentation

Sub-package Documentation

Contributing Guide

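Workflow (Required)

Before submitting code, follow these steps strictly:

1️⃣ Create an Issue

Describe the problem you are solving, the feature you are implementing, or the improvement you are making:

gh issue create \
  --title "[Category] Short description" \
  --label "bug,sagellm-core" \
  --body "Detailed description..."

Issue types

  • [Bug] - Bug fix
  • [Feature] - New feature
  • [Performance] - Performance optimization
  • [Integration] - Integration with other modules
  • [Docs] - Documentation improvement

2️⃣ Develop on a Local Branch

Create a development branch and resolve the issue:

# Branch from main-dev (not main!)
git fetch origin main-dev
git checkout -b fix/#123-short-description origin/main-dev

# Do the development work
# ...

# Make sure all checks pass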
ruff format .
ruff check . --fix
pytest -v

Branch naming conventions

  • Bug fix: bugfix/#123-xxx
  • New feature: feature/#456-xxx
  • Documentation: docs/#789-xxx
  • Performance: perf/#101-xxx
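
The naming convention can be checked mechanically, for example in a pre-push hook. This is a minimal sketch, not a tool shipped with the project: the pattern accepts the four listed prefixes plus fix, which the command examples in this workflow use, and the allowed characters in the description part are an assumption.

```python
import re
from typing import Optional, Tuple

# Prefixes from the naming convention, plus "fix" as used in the command examples.
# The lowercase-alphanumeric-and-hyphen slug is an assumption.
BRANCH_PATTERN = re.compile(r"^(bugfix|fix|feature|docs|perf)/#(\d+)-([a-z0-9][a-z0-9-]*)$")


def parse_branch_name(name: str) -> Optional[Tuple[str, int, str]]:
    """Return (prefix, issue_number, slug), or None if the name does not match."""
    match = BRANCH_PATTERN.match(name)
    if match is None:
        return None
    prefix, issue, slug = match.groups()
    return prefix, int(issue), slug


print(parse_branch_name("bugfix/#123-short-description"))
print(parse_branch_name("main"))
```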

3️⃣ Open a Pull Request

Submit the code for review:

git push origin fix/#123-short-description
gh pr create \
  --base main-dev \
  --head fix/#123-short-description \
  --title "Fix: [short description]" \
  --body "Resolves #123

## Changes
- Change 1
- Change 2

## Testing
- Added unit tests
- All tests pass ✓"

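A PR must include

  • A clear title (Fix/Feature/Docs/Perf)
  • The linked issue number: Closes #123
  • A list of changes and a description of the testing
  • Passing results for all CI checks

4️⃣ Code Review and Merge

After approval, merge into main-dev:

# Click the "Merge" button in the GitHub UI
# Merge into main-dev (not main!)

Merge requirements

  • ✅ Approval from at least one maintainer
  • ✅ All CI checks pass (pytest, ruff)
  • ✅ The merge target is the main-dev branch

Quick checklist

Before opening a PR, check that:

  • The development branch was created from main-dev
  • CHANGELOG.md is updated
  • ruff format . has formatted the code
  • ruff check . --fix passes lint
  • pytest -v passes all tests
  • The related issue is linked: Closes #123

Anti-patterns ❌

  • ❌ Committing directly to the main branch
  • ❌ Opening a PR with no linked issue
  • ❌ Changing code without updating the CHANGELOG
  • ❌ Submitting code that fails lint checks
  • ❌ Skipping the tests before submitting

Related resources

  • Issue labels: bug, enhancement, documentation, sagellm-core, sagellm-backend
  • GitHub CLI: gh issue create, gh pr create
  • More information: see .github/copilot-instructions.md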

📚 Package Details

Package           PyPI Name          Import Name       Description
sagellm           isagellm           sagellm           Umbrella package (install this)
sagellm-protocol  isagellm-protocol  sagellm_protocol  Protocol v0.1 types
sagellm-core      isagellm-core      sagellm_core      Runtime & config
sagellm-backend   isagellm-backend   sagellm_backend   Hardware abstraction

📄 License

Proprietary - IntelliStream. Internal use only.


Built with ❤️ by IntelliStream Team for domestic AI infrastructure
