Skip to main content

sageLLM: Modular LLM inference engine for domestic computing power (Huawei Ascend, NVIDIA)

Project description

sageLLM

๐Ÿš€ Modular LLM Inference Engine for Domestic Computing Power

Ollama-like experience for Chinese hardware ecosystems (Huawei Ascend, NVIDIA)


โœจ Features

  • ๐ŸŽฏ One-Click Install - pip install isagellm gets you started immediately
  • ๐Ÿ”Œ Mock-First - Test without GPU, perfect for CI/CD
  • ๐Ÿ‡จ๐Ÿ‡ณ Domestic Hardware - First-class support for Huawei Ascend NPU
  • ๐Ÿ“Š Observable - Built-in metrics (TTFT, TBT, throughput, KV usage)
  • ๐Ÿงฉ Plugin System - Extend with custom backends and engines

๐Ÿ“ฆ Quick Install

# Install sageLLM (includes mock backend, no GPU required)
pip install isagellm

# With Control Plane (request routing & scheduling)
pip install 'isagellm[control-plane]'

# With API Gateway (OpenAI-compatible REST API)
pip install 'isagellm[gateway]'

# Full server (Control Plane + Gateway)
pip install 'isagellm[server]'

# With CUDA support
pip install 'isagellm[cuda]'

# All features
pip install 'isagellm[all]'

๐Ÿš€ Quick Start

CLI (like ollama)

# Show system info
sage-llm info

# Start mock server (no GPU required)
sage-llm serve --mock

# Single inference
sage-llm run -p "What is LLM inference?" --mock

# Run Year1 demo validation
sage-llm demo --workload year1 --mock

# Start OpenAI-compatible API gateway
sage-llm gateway --mock --port 8080

Python API

from sagellm import Request, MockEngine

# Create mock engine (no GPU needed)
engine = MockEngine()

# Run inference
request = Request(
    request_id="demo-001",
    prompt="Hello, world!",
    max_tokens=128,
)
response = engine.generate(request)

print(f"Response: {response.text}")
print(f"TTFT: {response.metrics.ttft_ms:.2f} ms")
print(f"Throughput: {response.metrics.throughput_tps:.2f} tokens/s")

Configuration

# ~/.sage-llm/config.yaml
backend:
  kind: mock  # or: cuda, ascend

engine:
  kind: mock
  model: Qwen/Qwen2-7B

workload:
  segments:
    - short   # 128 in โ†’ 128 out
    - long    # 2048 in โ†’ 512 out
    - stress  # concurrent requests

๐Ÿ“Š Year 1 Demo Contract

sageLLM must produce these metrics for validation:

{
  "ttft_ms": 45.2,
  "tbt_ms": 12.5,
  "throughput_tps": 80.0,
  "peak_mem_mb": 24576,
  "kv_used_tokens": 4096,
  "prefix_hit_rate": 0.85,
  "evict_count": 3
}

Run validation:

sage-llm demo --workload year1 --output metrics.json

๐Ÿ—๏ธ Architecture

isagellm (umbrella package)
โ”œโ”€โ”€ isagellm-protocol       # Protocol v0.1 types
โ”‚   โ””โ”€โ”€ Request, Response, Metrics, Error, StreamEvent
โ”œโ”€โ”€ isagellm-core           # Runtime & Demo Runner
โ”‚   โ””โ”€โ”€ Config, Engine, Factory, DemoRunner
โ”œโ”€โ”€ isagellm-backend        # Hardware abstraction
โ”‚   โ””โ”€โ”€ BackendProvider, MockBackend, (CUDABackend, AscendBackend)
โ”œโ”€โ”€ isagellm-control-plane  # Request routing & scheduling (optional)
โ”‚   โ””โ”€โ”€ ControlPlaneManager, Router, Policies, Lifecycle
โ””โ”€โ”€ isagellm-gateway        # OpenAI-compatible REST API (optional)
    โ””โ”€โ”€ FastAPI server, /v1/chat/completions, Session management

๐Ÿ”ง Development

Quick Setup (Development Mode)

# Clone all repositories
./scripts/clone-all-repos.sh

# Install all packages in editable mode
./quickstart.sh

# Open all repos in VS Code Multi-root Workspace
code sagellm.code-workspace

๐Ÿ“– See WORKSPACE_GUIDE.md for Multi-root Workspace usage.

Testing

# Clone and setup
git clone https://github.com/IntelliStream/sagellm.git
cd sagellm
pip install -e ".[dev]"

# Run tests
pytest -v

# Format & lint
ruff format .
ruff check . --fix

# Type check
mypy src/sagellm/

# Verify dependency hierarchy
python scripts/verify_dependencies.py

๐Ÿ“– Development Resources

๐Ÿ“š Package Details

Package PyPI Name Import Name Description
sagellm isagellm sagellm Umbrella package (install this)
sagellm-protocol isagellm-protocol sagellm_protocol Protocol v0.1 types
sagellm-core isagellm-core sagellm_core Runtime & config
sagellm-backend isagellm-backend sagellm_backend Hardware abstraction

๐ŸŽฏ Roadmap

  • Year 1: Core inference with KV cache, prefix sharing, basic eviction
  • Year 2: Multi-node inference, advanced scheduling
  • Year 3: Full production-ready deployment

๐Ÿ“„ License

Proprietary - IntelliStream. Internal use only.


Built with โค๏ธ by IntelliStream Team for domestic AI infrastructure

Project details


Release history Release notifications | RSS feed

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

isagellm-0.1.0.3-cp311-none-any.whl (51.7 kB view details)

Uploaded CPython 3.11

File details

Details for the file isagellm-0.1.0.3-cp311-none-any.whl.

File metadata

  • Download URL: isagellm-0.1.0.3-cp311-none-any.whl
  • Upload date:
  • Size: 51.7 kB
  • Tags: CPython 3.11
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.10.12

File hashes

Hashes for isagellm-0.1.0.3-cp311-none-any.whl
Algorithm Hash digest
SHA256 ade4e649cf0031dd73645ccbd241b6ac9a817c1cb0ee443f679050fb5b2c2f18
MD5 bcca36bb231019d9c60b598138a161c0
BLAKE2b-256 592ee7b833ed4f7f2b44c761b3b4539f7bef8ef8ab86eb46ff17e055215fd838

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page