Ultra-fast LLM inference engine with a Vulkan compute backend

Sovereign Engine

Ultra-fast, modular LLM inference engine with a Vulkan compute backend
Its design goal is to surpass llama.cpp in throughput and VRAM efficiency; an automated benchmark comparison is still on the roadmap.

┌──────────────────────────────────────────────────────────────────────────┐
│  Sovereign Engine  v0.2.6                                                │
│  C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant      │
└──────────────────────────────────────────────────────────────────────────┘


Overview

Sovereign Engine is a from-scratch, GPU-first LLM inference runtime written in C++20.
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using Vulkan compute as the sole GPU backend, which means:

  • No CUDA dependency: runs on any Vulkan 1.2+ GPU whose driver exposes the required extensions.
  • Tight control over VRAM: paged KV cache, async layer streaming, dynamic CPU offload.
  • Mixed-precision quantisation inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
  • A clean Python API (via pybind11) and a stable C ABI for FFI from any language.

Key Features

| Feature | Details |
| --- | --- |
| Vulkan backend | Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali. |
| Mixed-precision quantisation | FP16 → INT8 → Q4_K → Q3_K → Q2_K per tensor, HQQ solver, EXL2-style importance scoring. |
| Async layer pipeline | Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1. |
| PagedAttention KV cache | Block-based VRAM pool, copy-on-write forking, O(1) alloc/free. |
| Dynamic CPU offload | Falls back to AVX-512 / NEON when VRAM pressure exceeds threshold. |
| Streaming generation | Token-by-token callback; GIL-safe Python generator. |
| Rich sampling | Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema. |
| Proprietary .sovereign format | Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload. |
| GQA / MHA / MQA | All attention variants supported via a single fused GLSL shader. |
| RoPE + sliding window | Inline rotary embeddings, optional Mistral/Gemma sliding-window mask. |

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           Python / C++ / C                               │
│                    (sovereign_inference.Engine)                          │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │       engine.cpp         │  prefill / decode_step /
                    │    (inference loop)      │  generate / forward
                    └──┬──────┬───────┬───────┘
                       │      │       │
         ┌─────────────▼─┐ ┌──▼────┐ ┌▼──────────────────┐
         │  VulkanContext │ │Quant  │ │ AsyncMemoryManager │
         │  (device,      │ │izer   │ │ (layer streaming,  │
         │   pipelines,   │ │       │ │  CPU offload)      │
         │   cmd bufs)    │ └───────┘ └────────────────────┘
         └───────┬────────┘                    │
                 │                   ┌─────────▼──────────┐
     ┌───────────▼────────────┐      │  PagedKVCache       │
     │  SPIR-V Compute Shaders│      │  (block pool,       │
     │  ┌─────────────────┐   │      │   CoW fork,         │
     │  │ rmsnorm.comp    │   │      │   descriptor sets)  │
     │  │ matmul_int4.comp│   │      └────────────────────┘
     │  │ attention_gqa   │   │
     │  │ silu_gate.comp  │   │
     │  │ sampler.comp    │   │
     │  └─────────────────┘   │
     └────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                  sovereign-convert CLI                      │
│  SafeTensors → profile → budget allocate → HQQ quant       │
│               → pack INT4/3/2 → write .sovereign           │
└────────────────────────────────────────────────────────────┘

Project Structure

sovereign-engine/
├── CMakeLists.txt              # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/          # Public C++ headers
│   ├── engine.hpp              # Top-level inference API
│   ├── format.hpp              # .sovereign binary format spec
│   ├── vulkan_context.hpp      # Vulkan device + pipeline management
│   ├── memory_manager.hpp      # Async pipeline memory manager
│   ├── kv_cache.hpp            # PagedAttention KV cache
│   └── quantizer.hpp           # Mixed-precision quantiser
│
├── src/
│   ├── vulkan/
│   │   └── vulkan_context.cpp
│   ├── format/
│   │   └── format.cpp
│   ├── compute/
│   │   └── kv_cache.cpp
│   ├── inference/
│   │   └── engine.cpp
│   ├── quantizer/
│   │   └── quantizer.cpp
│   └── memory/
│       └── memory_manager.cpp
│
├── shaders/                    # GLSL compute shaders (compiled to SPIR-V)
│   ├── rmsnorm.comp
│   ├── matmul_int4.comp
│   ├── attention_gqa.comp
│   ├── silu_gate.comp
│   └── sampler.comp
│
├── bindings/
│   └── python/
│       └── sovereign_py.cpp    # pybind11 Python bindings
│
├── tools/
│   └── converter/
│       └── main.cpp            # sovereign-convert CLI
│
├── tests/
│   ├── CMakeLists.txt
│   ├── test_format.cpp
│   ├── test_quantizer.cpp
│   ├── test_kv_cache.cpp
│   └── test_engine.cpp
├── package.json                # Shader compiler package metadata
├── package-lock.json           # Shader compiler lock file
│
├── examples/
│   └── basic_generate.py       # Python streaming example
│
├── scripts/
│   ├── build.sh                # Build helper script
│   └── compile_shaders.js      # Shader compiler tool using WebGPU glslang
│
└── third_party/
    ├── volk/                   # Meta-loader for dynamic Vulkan loading (tracked)
    │   ├── volk.h
    │   └── volk.c
    └── vk_mem_alloc.h          # Fetched automatically via CMake (not tracked)

Requirements

Runtime

  • Vulkan 1.2+ Compatible GPU: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
  • GPU Driver: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!

Build (Zero-Dependency & SDK-Free)

Thanks to the volk dynamic meta-loader and automatic CMake dependency management, the Vulkan SDK is optional when building Sovereign Engine.

| Dependency | Version | Mandatory? | Notes |
| --- | --- | --- | --- |
| CMake | ≥ 3.25 | Yes | Handles the build orchestration |
| C++ Compiler | C++20 | Yes | MSVC 2022 / GCC 12+ / Clang 15+ |
| Vulkan SDK | ≥ 1.3 | No (Optional) | If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders |
| Python | ≥ 3.9 | No (Optional) | Only required to compile Python/pybind11 bindings |

Required Vulkan Extensions

Your GPU driver must support:

VK_KHR_timeline_semaphore        (core in 1.2)
VK_KHR_synchronization2          (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address
VK_KHR_shader_float16_int8
VK_EXT_scalar_block_layout
VK_KHR_8bit_storage
VK_KHR_16bit_storage
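
One way to check a driver against this list is to scan the text output of the `vulkaninfo` tool for the extension names. The sketch below is not part of Sovereign Engine; `REQUIRED` and `missing_extensions` are hypothetical names, and the function only does a plain text search, so a missing name is a hint to investigate rather than proof that the GPU is unsupported.

```python
import re

# Extension names copied from the list above.
REQUIRED = [
    "VK_KHR_timeline_semaphore",
    "VK_KHR_synchronization2",
    "VK_EXT_memory_budget",
    "VK_KHR_buffer_device_address",
    "VK_KHR_shader_float16_int8",
    "VK_EXT_scalar_block_layout",
    "VK_KHR_8bit_storage",
    "VK_KHR_16bit_storage",
]


def missing_extensions(vulkaninfo_text: str) -> list[str]:
    """Return the required extension names that do not appear in the text."""
    available = set(re.findall(r"VK_[A-Za-z0-9_]+", vulkaninfo_text))
    return [name for name in REQUIRED if name not in available]


if __name__ == "__main__":
    # Illustrative input only; real input would be the captured output of vulkaninfo.
    sample = "VK_KHR_timeline_semaphore : extension revision 2\n"
    print(missing_extensions(sample))
```

In practice the input text would come from running `vulkaninfo` and capturing its standard output.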

Building

Quick start

# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine

# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh

# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512

Manual CMake

mkdir build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DSOVEREIGN_BUILD_PYTHON=ON \
    -DSOVEREIGN_BUILD_TESTS=ON \
    -DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)

Debug build with AddressSanitizer

./scripts/build.sh --debug

Usage

1. Convert a Model

Download a HuggingFace model (e.g. Gemma 4B) in SafeTensors format, then convert:

# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b.sovereign \
    --arch   gemma \
    --quant  mixed \
    --bpw    4.5

# With calibration corpus for better importance scoring
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b-calibrated.sovereign \
    --quant  mixed \
    --bpw    4.5 \
    --calib  calibration_corpus.txt \
    --verbose

Quantisation modes:

| Mode | Approx bpw | Description |
| --- | --- | --- |
| fp16 | 16 | No quantisation, maximum quality |
| int8 | 8 | Symmetric INT8 throughout |
| q4k | 4.5 | Q4_K block quantisation |
| q3k | 3.5 | Q3_K block quantisation |
| q2k | 2.6 | Q2_K aggressive compression |
| mixed | target | Adaptive per-tensor (recommended) |
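
The bpw column translates directly into an approximate weight size: parameters × bits per weight ÷ 8. The sketch below works through that arithmetic. The function name is hypothetical, the 4-billion parameter count is a round-number assumption, and the estimate ignores the file header, tokenizer blob, tensor index, and alignment padding.

```python
GIB = 1024 ** 3


def estimated_weight_bytes(num_params: int, bpw: float) -> float:
    """Approximate size of the quantised weights in bytes."""
    return num_params * bpw / 8


if __name__ == "__main__":
    # Assumed parameter count for illustration; real models differ.
    params = 4_000_000_000
    for mode, bpw in [("fp16", 16), ("int8", 8), ("q4k", 4.5), ("q3k", 3.5), ("q2k", 2.6)]:
        size = estimated_weight_bytes(params, bpw)
        print(f"{mode:5s} {size / GIB:6.2f} GiB")
```

Under these assumptions a 4.5 bpw target comes to 2.25 × 10⁹ bytes, roughly 2.1 GiB of weight data.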

Python-based Conversion

You can also convert models directly inside Python without having to compile the C++ CLI tool:

import sovereign_inference

result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True
)

if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")

2. Python API

import sovereign_inference

# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count       = 2**31 - 1   # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80

with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model  : {engine.model_name}")
    print(f"Device : {engine.device_name}  ({engine.vram_gib:.1f} GiB)")

    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens        = 512
    params.sampling.temperature  = 0.7
    params.sampling.top_p        = 0.9
    params.sampling.min_p        = 0.05
    params.sampling.repetition_penalty = 1.1

    stats = engine.generate(
        prompt   = "Explain quantum entanglement briefly:",
        params   = params,
        callback = lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")

    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)

    # --- Raw logits for custom sampling ---
    ids    = engine.tokenize("The sky is")
    logits = engine.forward(ids)   # numpy float32 array [vocab_size]

3. C++ API

#include <cstdio>
#include <iostream>
#include <string_view>

#include "sovereign/engine.hpp"

int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;

    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;

    sovereign::GenerateParams params;
    params.max_new_tokens       = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p       = 0.9f;

    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;   // return false to stop early
        });

    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}

4. C API (FFI)

#include "sovereign/engine.hpp"   // exposes extern "C" block

SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,       // vram_budget (0 = auto)
    ~0u,     // gpu_layers  (all)
    true     // use_mmap
);

sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);

sovereign_engine_free(engine);

The .sovereign Format

The .sovereign binary format is designed for zero-copy, memory-mapped inference:

┌──────────────┬──────────────────────────────────────────────────────┐
│ Offset       │ Section                                              │
├──────────────┼──────────────────────────────────────────────────────┤
│ 0x0000       │ FileHeader        (256 bytes, fixed)                 │
│ 0x0100       │ ModelConfig       (256 bytes, padded to 64B)         │
│ aligned      │ TokenizerBlob     (UTF-8 JSON)                       │
│ aligned      │ TensorIndex[]     (N × 192 bytes each)               │
│ PAGE-ALIGNED │ TensorDataBlock   (mmap-ready, 4K page aligned) ◀──┐ │
└──────────────┴──────────────────────────────────────────────────────┘
                                                                       │
Vulkan can mmap this block directly into a VkBuffer via               │
VK_EXT_external_memory_host — zero CPU copy during weight loading. ───┘

Key properties:

  • Magic bytes: SVRN (0x53, 0x56, 0x52, 0x4E)
  • All multi-byte fields: little-endian
  • Per-tensor CRC32C checksums (hardware-accelerated via SSE4.2)
  • Per-tensor DType field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4_K, Q3_K, Q2_K
  • Feature flags bitmask: MMAP_READY, HAS_TOKENIZER, GROUPED_QUERY, RoPE_SCALED, …
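
Two of these properties can be checked without knowing the rest of the header layout: the magic bytes and the CRC32C algorithm. The sketch below is a pure-Python reference, not the engine's implementation. `has_sovereign_magic` and `crc32c` are hypothetical helper names, and no field offsets beyond the first four bytes are assumed.

```python
MAGIC = bytes([0x53, 0x56, 0x52, 0x4E])  # b"SVRN"


def has_sovereign_magic(header: bytes) -> bool:
    """True if the buffer starts with the SVRN magic bytes."""
    return header[:4] == MAGIC


def crc32c(data: bytes) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.

    Slow but dependency-free; SSE4.2 hardware computes the same checksum.
    """
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(has_sovereign_magic(b"SVRN" + bytes(252)))
    print(hex(crc32c(b"123456789")))  # standard CRC32C check value: 0xe3069283
```

Which byte range each per-tensor checksum covers is defined by the format spec in `format.hpp`, so this helper only illustrates the checksum itself.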

Quantiser

The quantiser runs a 3-phase pipeline:

Phase 1 – Calibration Profiling

Computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):

  • Hessian proxy (mean squared activation magnitude)
  • Outlier ratio (fraction with |w| > 3σ)
  • Kurtosis (distribution peakedness)
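
The three statistics above can be sketched in a few lines of plain Python. This is an illustration of the definitions only, assuming population (not sample) variance and non-excess kurtosis; `profile_values` is a hypothetical name and the real profiler may normalise differently.

```python
import math


def profile_values(values: list[float]) -> dict[str, float]:
    """Compute the three profiling statistics for one flat list of values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    sigma = math.sqrt(variance)
    # Hessian proxy: mean squared magnitude.
    mean_square = sum(v * v for v in values) / n
    # Outlier ratio: fraction of values with |v| > 3 sigma.
    outlier_ratio = sum(1 for v in values if abs(v) > 3 * sigma) / n
    # Kurtosis: fourth central moment over squared variance (3.0 for a Gaussian).
    fourth = sum((v - mean) ** 4 for v in values) / n
    kurtosis = fourth / variance ** 2 if variance > 0 else 0.0
    return {
        "mean_square": mean_square,
        "outlier_ratio": outlier_ratio,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # 99 zeros and one large value: a single outlier and a heavy tail.
    print(profile_values([0.0] * 99 + [10.0]))
```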

Phase 2 – Budget Allocation

Assigns a DType to each tensor to hit a target average bpw:

importance ≥ 0.75  →  FP16 / INT8   (embeddings, first/last layers, norms)
importance ≥ 0.50  →  INT4 / Q4_K  (Q/K/V projections)
importance ≥ 0.25  →  Q3_K
importance <  0.25  →  Q2_K

Iteratively rebalances until |achieved_bpw - target_bpw| < 5%.
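
As a rough illustration of the tier mapping and the stopping test, the sketch below maps importance scores to a DType and computes the resulting average bpw. Several details are assumptions: each tier is represented by a single DType (INT8 for the top tier), the per-DType bpw values are taken from the quantisation modes table, the 5% tolerance is read as relative to the target, and the rebalancing loop itself is not shown.

```python
# Approximate bits per weight per DType (from the quantisation modes table).
BPW = {"INT8": 8.0, "Q4_K": 4.5, "Q3_K": 3.5, "Q2_K": 2.6}


def tier(importance: float) -> str:
    """Map an importance score in [0, 1] to one DType per tier."""
    if importance >= 0.75:
        return "INT8"
    if importance >= 0.50:
        return "Q4_K"
    if importance >= 0.25:
        return "Q3_K"
    return "Q2_K"


def achieved_bpw(tensors: list[tuple[int, float]]) -> float:
    """Average bpw over (num_params, importance) pairs, weighted by size."""
    total_bits = sum(n * BPW[tier(imp)] for n, imp in tensors)
    return total_bits / sum(n for n, _ in tensors)


def within_tolerance(achieved: float, target: float, rel_tol: float = 0.05) -> bool:
    """Stopping test, assuming the 5% bound is relative to the target."""
    return abs(achieved - target) < rel_tol * target


if __name__ == "__main__":
    tensors = [(100, 0.9), (100, 0.6), (100, 0.3), (100, 0.1)]
    bpw = achieved_bpw(tensors)
    print(bpw, within_tolerance(bpw, 4.5))
```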

Phase 3 – HQQ Quantisation

Per-block iterative solver minimising the Hessian-weighted MSE:

min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H

Default: 20 iterations, block size 128 elements, FP16 scale storage.
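
To make the objective concrete, the sketch below fits a scale and offset for one block by alternating between re-solving a weighted least-squares problem and re-rounding the codes. This is a plain alternating least-squares stand-in, not the HQQ half-quadratic solver. The function names are hypothetical, the Hessian is assumed to be diagonal (one weight per element), and FP16 scale storage is not modelled.

```python
def quantise(w, scale, offset, levels=16):
    """Round each value to the nearest code in [0, levels - 1]."""
    return [min(levels - 1, max(0, round((x - offset) / scale))) for x in w]


def weighted_error(w, codes, scale, offset, h):
    """Sum of h_i * (w_i - dequant(q_i))^2 for a diagonal weighting h."""
    return sum(hw * (x - (scale * q + offset)) ** 2 for x, q, hw in zip(w, codes, h))


def fit_block(w, h=None, iterations=20, levels=16):
    """Return (codes, scale, offset) for one block of values."""
    h = h or [1.0] * len(w)
    w_min, w_max = min(w), max(w)
    scale = (w_max - w_min) / (levels - 1) or 1.0  # guard for constant blocks
    offset = w_min
    codes = quantise(w, scale, offset, levels)
    total = sum(h)
    for _ in range(iterations):
        q_mean = sum(hw * q for hw, q in zip(h, codes)) / total
        w_mean = sum(hw * x for hw, x in zip(h, w)) / total
        var_q = sum(hw * (q - q_mean) ** 2 for hw, q in zip(h, codes))
        cov = sum(hw * (q - q_mean) * (x - w_mean) for hw, q, x in zip(h, codes, w))
        if var_q == 0 or cov <= 0:
            break  # degenerate block: keep the current fit
        # Closed-form weighted least squares for fixed codes.
        scale = cov / var_q
        offset = w_mean - scale * q_mean
        codes = quantise(w, scale, offset, levels)
    return codes, scale, offset


if __name__ == "__main__":
    block = [0.03, -1.2, 0.8, 2.5, -0.4, 0.0, 1.1, -2.2]
    codes, scale, offset = fit_block(block)
    print(codes, scale, offset)
```

Each half-step can only lower or keep the weighted error, which is why the loop is safe to stop at a fixed iteration count.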


Memory Manager

The AsyncMemoryManager implements a double-buffered layer-streaming pipeline:

CPU Thread             GPU Compute Queue       DMA Transfer Queue
──────────             ─────────────────       ─────────────────

[Layer N-1 ready] ──▶  Compute(Layer N-1)
                                │
[Stream Layer N+1] ─────────────┼──────────▶ DMA(Layer N+1)
  (from mmap/RAM)               │                  │
                                ▼                  ▼
                        Compute(Layer N) ◀── Layer N ready
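
The overlap in the diagram can be modelled on the CPU with a loader thread and two buffer slots. The sketch below is a toy analogue using Python threads rather than Vulkan queues. `run_pipeline`, `load`, and `compute` are hypothetical names, and the semaphore stands in for the two staging buffers.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the layer stream


def run_pipeline(layers, load, compute):
    """Run compute(load(layer)) for each layer, loading one layer ahead."""
    slots = threading.Semaphore(2)  # two buffers: one computing, one staged
    staged = queue.Queue()

    def loader():
        for layer in layers:
            slots.acquire()          # wait until a buffer is free
            staged.put(load(layer))  # stands in for the DMA transfer
        staged.put(_DONE)

    worker = threading.Thread(target=loader, daemon=True)
    worker.start()

    outputs = []
    while True:
        item = staged.get()
        if item is _DONE:
            break
        outputs.append(compute(item))
        slots.release()              # buffer can be reused for a later layer
    worker.join()
    return outputs


if __name__ == "__main__":
    print(run_pipeline(range(4), load=lambda n: f"layer{n}", compute=str.upper))
```

Layers are still consumed strictly in order; only the loading of the next layer overlaps with the current compute step.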

VRAM pressure response:

  • 88% → start evicting least-recently-used layers (tracked in an LRU free-list)

  • 95% → force CPU offload via AVX-512 / NEON kernels
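
The two thresholds amount to a small policy function plus an LRU eviction loop. The sketch below is illustrative only: the names are hypothetical, the comparison is assumed to be inclusive (≥), and layers are modelled as name-to-size entries in an `OrderedDict` ordered from least to most recently used.

```python
from collections import OrderedDict

EVICT_AT = 0.88    # start evicting LRU layers
OFFLOAD_AT = 0.95  # force CPU offload


def pressure_action(used_bytes: int, budget_bytes: int) -> str:
    """Classify the current VRAM usage against the two thresholds."""
    ratio = used_bytes / budget_bytes
    if ratio >= OFFLOAD_AT:
        return "cpu_offload"
    if ratio >= EVICT_AT:
        return "evict_lru"
    return "ok"


def evict_until_below(resident: OrderedDict, budget_bytes: int) -> list[str]:
    """Pop least-recently-used layers until usage drops below EVICT_AT."""
    evicted = []
    while resident and sum(resident.values()) / budget_bytes >= EVICT_AT:
        name, _size = resident.popitem(last=False)  # oldest entry first
        evicted.append(name)
    return evicted


if __name__ == "__main__":
    layers = OrderedDict([("layer0", 30), ("layer1", 30), ("layer2", 30)])
    print(pressure_action(sum(layers.values()), 100))
    print(evict_until_below(layers, 100))
```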


KV Cache (PagedAttention)

Inspired by vLLM's PagedAttention:

  • One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
  • Block size: 16 tokens per block (configurable, must be power-of-2).
  • Copy-on-write forking: beam search / speculative decoding shares blocks until a write occurs.
  • Descriptor sets pre-allocated per (block_id × layer) pair to avoid per-inference allocation.
  • Optional ConstantContextCache for RWKV / Mamba models (O(1) memory regardless of sequence length).
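
The block pool and copy-on-write behaviour can be modelled with a free list and reference counts. The sketch below is a CPU-side toy under stated assumptions: `BlockPool` and `Sequence` are hypothetical names, blocks are plain integers rather than VRAM regions, and the actual copy of block contents on a shared write is left as a comment.

```python
class BlockPool:
    """Fixed pool of block ids with O(1) alloc/free via a free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block = self.free.pop()  # raises IndexError when the pool is exhausted
        self.refcount[block] = 1
        return block

    def retain(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """Block table for one sequence; forks share blocks until written."""

    def __init__(self, pool: BlockPool, block_size: int = 16):
        self.pool = pool
        self.block_size = block_size
        self.blocks: list[int] = []
        self.length = 0

    def fork(self) -> "Sequence":
        child = Sequence(self.pool, self.block_size)
        child.blocks = list(self.blocks)
        child.length = self.length
        for block in child.blocks:
            self.pool.retain(block)
        return child

    def append_token(self) -> None:
        if self.length % self.block_size == 0:
            self.blocks.append(self.pool.alloc())  # last block is full
        elif self.pool.refcount[self.blocks[-1]] > 1:
            # Shared block: take a private copy before writing.
            self.pool.release(self.blocks[-1])
            self.blocks[-1] = self.pool.alloc()    # a real cache would copy the data here
        self.length += 1


if __name__ == "__main__":
    pool = BlockPool(8)
    parent = Sequence(pool)
    for _ in range(20):
        parent.append_token()
    child = parent.fork()
    child.append_token()
    print(parent.blocks, child.blocks, len(pool.free))
```

After the fork, the full first block stays shared and only the partially filled last block is duplicated on the child's first write.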

Vulkan Compute Shaders

All shaders are compiled from GLSL (.comp) to SPIR-V at CMake configure time:

| Shader | Purpose |
| --- | --- |
| rmsnorm.comp | Fused RMSNorm with subgroup reduction; supports Gemma variant |
| matmul_int4.comp | Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles |
| attention_gqa.comp | GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax |
| silu_gate.comp | Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN |
| sampler.comp | GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial |

All shaders use GL_EXT_scalar_block_layout and GL_KHR_shader_subgroup_arithmetic for efficient subgroup reductions.
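
The stage order listed for sampler.comp can be followed on the CPU as a reference. The sketch below applies the stages in that order using only the standard library. It is not a port of the shader; `sample` is a hypothetical name, the defaults are borrowed from the usage examples, and `temperature` is assumed to be greater than zero.

```python
import math
import random


def sample(logits, temperature=0.7, top_k=0, top_p=0.9, min_p=0.05, rng=random):
    """Return a token id: temperature, top-K, softmax, top-P, min-P, multinomial."""
    # Temperature scaling, then sort candidates from most to least likely.
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)
    # Top-K: keep the K highest logits (0 disables the filter).
    if top_k > 0:
        scaled = scaled[:top_k]
    # Softmax, shifted by the maximum for numerical stability.
    peak = scaled[0][0]
    exps = [math.exp(l - peak) for l, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, (_, token_id) in zip(probs, scaled):
        kept.append((p, token_id))
        cumulative += p
        if cumulative >= top_p:
            break
    # Min-P: drop tokens below min_p times the top probability.
    floor = min_p * kept[0][0]
    kept = [(p, token_id) for p, token_id in kept if p >= floor]
    # Multinomial draw; random.choices renormalises the weights.
    ids = [token_id for _, token_id in kept]
    weights = [p for p, _ in kept]
    return rng.choices(ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(sample([1.0, 5.0, 2.0], top_k=1))
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is handy when comparing against GPU output.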


Running Tests

# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure

# Run a specific suite
./build/test_quantizer --success
./build/test_format    --success
./build/test_kv_cache  --success
./build/test_engine    --success

# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration

Roadmap

  • Continuous batching — interleave multiple requests in a single GPU pass
  • Speculative decoding — draft model integration for 2-4× decode speedup
  • Cooperative matrix — VK_KHR_cooperative_matrix path for tensor-core acceleration
  • io_uring Direct Storage — bypass staging buffers for PCIe 4.0+ NVMe
  • Rust bindings — PyO3 alternative to pybind11
  • Windows support — MinGW + Vulkan SDK on Windows
  • Web UI — minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
  • LoRA / adapter merging — runtime LoRA weight injection without repack
  • RWKV / Mamba — constant-memory inference via ConstantContextCache
  • Benchmark suite — automated comparison vs llama.cpp on standard prompts

Contributing

Contributions are welcome. Please open an issue before submitting large pull requests.

# Fork, clone, then create a feature branch
git checkout -b feat/my-feature

# Build with tests + debug symbols
./scripts/build.sh --debug --tests

# Make sure all tests pass before submitting
cd build && ctest --output-on-failure

Code style: follow the existing C++20 conventions (no exceptions in hot paths, [[nodiscard]] everywhere, PIMPL for public headers, RAII for all Vulkan handles).


License

MIT License — see LICENSE for details.


Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.
