Sovereign Engine

Ultra-fast, modular LLM inference engine with a Vulkan compute backend.
Its design goal is to surpass llama.cpp in throughput and VRAM efficiency; a benchmark suite to measure this is on the roadmap.

┌──────────────────────────────────────────────────────────────────────────┐
│  Sovereign Engine  v0.2.5                                                │
│  C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant      │
└──────────────────────────────────────────────────────────────────────────┘

Overview
Key Features
Architecture
Project Structure
Requirements
Building
Usage
The .sovereign Format
Quantiser
Memory Manager
KV Cache (PagedAttention)
Vulkan Compute Shaders
Running Tests
Roadmap
Contributing
License

Overview

Sovereign Engine is a from-scratch, GPU-first LLM inference runtime written in C++20.
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using Vulkan compute as the sole GPU backend, which means:

No CUDA dependency: runs on any Vulkan 1.2+ GPU whose driver exposes the required extensions listed under Requirements.
Tight control over VRAM: paged KV cache, async layer streaming, dynamic CPU offload.
Mixed-precision quantisation inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
A clean Python API (via pybind11) and a stable C ABI for FFI from any language.

Key Features

Feature	Details
Vulkan backend	Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali.
Mixed-precision quantisation	FP16 → INT8 → Q4_K → Q3_K → Q2_K per tensor, HQQ solver, EXL2-style importance scoring.
Async layer pipeline	Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1.
PagedAttention KV cache	Block-based VRAM pool, copy-on-write forking, O(1) alloc/free.
Dynamic CPU offload	Falls back to AVX-512 / NEON CPU kernels when VRAM usage reaches 95%.
Streaming generation	Token-by-token callback; GIL-safe Python generator.
Rich sampling	Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema.
Custom `.sovereign` format	Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload.
GQA / MHA / MQA	All attention variants supported via a single fused GLSL shader.
RoPE + sliding window	Inline rotary embeddings, optional Mistral/Gemma sliding-window mask.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           Python / C++ / C                               │
│                    (sovereign_inference.Engine)                          │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │       engine.cpp         │  prefill / decode_step /
                    │    (inference loop)      │  generate / forward
                    └──┬──────┬───────┬───────┘
                       │      │       │
         ┌─────────────▼─┐ ┌──▼────┐ ┌▼──────────────────┐
         │  VulkanContext │ │Quant  │ │ AsyncMemoryManager │
         │  (device,      │ │izer   │ │ (layer streaming,  │
         │   pipelines,   │ │       │ │  CPU offload)      │
         │   cmd bufs)    │ └───────┘ └────────────────────┘
         └───────┬────────┘                    │
                 │                   ┌─────────▼──────────┐
     ┌───────────▼────────────┐      │  PagedKVCache       │
     │  SPIR-V Compute Shaders│      │  (block pool,       │
     │  ┌─────────────────┐   │      │   CoW fork,         │
     │  │ rmsnorm.comp    │   │      │   descriptor sets)  │
     │  │ matmul_int4.comp│   │      └────────────────────┘
     │  │ attention_gqa   │   │
     │  │ silu_gate.comp  │   │
     │  │ sampler.comp    │   │
     │  └─────────────────┘   │
     └────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                  sovereign-convert CLI                      │
│  SafeTensors → profile → budget allocate → HQQ quant       │
│               → pack INT4/3/2 → write .sovereign           │
└────────────────────────────────────────────────────────────┘

Project Structure

sovereign-engine/
├── CMakeLists.txt              # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/          # Public C++ headers
│   ├── engine.hpp              # Top-level inference API
│   ├── format.hpp              # .sovereign binary format spec
│   ├── vulkan_context.hpp      # Vulkan device + pipeline management
│   ├── memory_manager.hpp      # Async pipeline memory manager
│   ├── kv_cache.hpp            # PagedAttention KV cache
│   └── quantizer.hpp           # Mixed-precision quantiser
│
├── src/
│   ├── vulkan/
│   │   └── vulkan_context.cpp
│   ├── format/
│   │   └── format.cpp
│   ├── compute/
│   │   └── kv_cache.cpp
│   ├── inference/
│   │   └── engine.cpp
│   ├── quantizer/
│   │   └── quantizer.cpp
│   └── memory/
│       └── memory_manager.cpp
│
├── shaders/                    # GLSL compute shaders (compiled to SPIR-V)
│   ├── rmsnorm.comp
│   ├── matmul_int4.comp
│   ├── attention_gqa.comp
│   ├── silu_gate.comp
│   └── sampler.comp
│
├── bindings/
│   └── python/
│       └── sovereign_py.cpp    # pybind11 Python bindings
│
├── tools/
│   └── converter/
│       └── main.cpp            # sovereign-convert CLI
│
├── tests/
│   ├── CMakeLists.txt
│   ├── test_format.cpp
│   ├── test_quantizer.cpp
│   ├── test_kv_cache.cpp
│   └── test_engine.cpp
├── package.json                # Shader compiler package metadata
├── package-lock.json           # Shader compiler lock file
│
├── examples/
│   └── basic_generate.py       # Python streaming example
│
├── scripts/
│   ├── build.sh                # Build helper script
│   └── compile_shaders.js      # Shader compiler tool using WebGPU glslang
│
└── third_party/
    ├── volk/                   # Meta-loader for dynamic Vulkan loading (tracked)
    │   ├── volk.h
    │   └── volk.c
    └── vk_mem_alloc.h          # Fetched automatically via CMake (not tracked)

Requirements

Runtime

Vulkan 1.2+ Compatible GPU: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
GPU Driver: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!

Build (Zero-Dependency & SDK-Free)

Because Vulkan is loaded dynamically through the volk meta-loader and CMake fetches the remaining dependencies automatically, the Vulkan SDK is optional when building Sovereign Engine.

Dependency	Version	Mandatory?	Notes
CMake	≥ 3.25	Yes	Handles the build orchestration
C++ Compiler	C++20	Yes	MSVC 2022 / GCC 12+ / Clang 15+
Vulkan SDK	≥ 1.3	No (Optional)	If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders
Python	≥ 3.9	No (Optional)	Only required to compile Python/pybind11 bindings

Required Vulkan Extensions

Your GPU driver must support:

VK_KHR_timeline_semaphore        (core in 1.2)
VK_KHR_synchronization2          (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address     (core in 1.2)
VK_KHR_shader_float16_int8       (core in 1.2)
VK_EXT_scalar_block_layout       (core in 1.2)
VK_KHR_8bit_storage              (core in 1.2)
VK_KHR_16bit_storage             (core in 1.1)

Building

Quick start

# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine

# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh

# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512

Manual CMake

mkdir build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DSOVEREIGN_BUILD_PYTHON=ON \
    -DSOVEREIGN_BUILD_TESTS=ON \
    -DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)

Debug build with AddressSanitizer

./scripts/build.sh --debug

Usage

1. Convert a Model

Download a Hugging Face model (e.g. Gemma 4B) in SafeTensors format, then convert it:

# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b.sovereign \
    --arch   gemma \
    --quant  mixed \
    --bpw    4.5

# With calibration corpus for better importance scoring
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b-calibrated.sovereign \
    --quant  mixed \
    --bpw    4.5 \
    --calib  calibration_corpus.txt \
    --verbose

Quantisation modes:

Mode	Approx bpw	Description
`fp16`	16	No quantisation, maximum quality
`int8`	8	Symmetric INT8 throughout
`q4k`	4.5	Q4_K block quantisation
`q3k`	3.5	Q3_K block quantisation
`q2k`	2.6	Q2_K aggressive compression
`mixed`	target	Adaptive per-tensor (recommended)
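
As a rough guide to what these bit-widths mean for file size, weight storage scales as parameters × bpw / 8 bytes. The sketch below is illustrative arithmetic only: the helper name and the 4-billion parameter count are assumptions, and real files also carry header, tokenizer and index overhead.

```python
# Approximate bits-per-weight for each fixed mode, taken from the table above.
APPROX_BPW = {"fp16": 16.0, "int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}


def estimate_weight_gib(param_count: int, bpw: float) -> float:
    """Estimate tensor-data size in GiB for a parameter count and average bpw."""
    return param_count * bpw / 8 / 2**30


if __name__ == "__main__":
    params = 4_000_000_000  # assumed round figure, not a measured model size
    for mode, bpw in APPROX_BPW.items():
        print(f"{mode:>5}: ~{estimate_weight_gib(params, bpw):.2f} GiB")
```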

Python-based Conversion

You can also convert models directly from Python, without building the C++ CLI tool:

import sovereign_inference

result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True
)

if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")

2. Python API

import sovereign_inference

# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count       = 2**31 - 1   # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80

with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model  : {engine.model_name}")
    print(f"Device : {engine.device_name}  ({engine.vram_gib:.1f} GiB)")

    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens        = 512
    params.sampling.temperature  = 0.7
    params.sampling.top_p        = 0.9
    params.sampling.min_p        = 0.05
    params.sampling.repetition_penalty = 1.1

    stats = engine.generate(
        prompt   = "Explain quantum entanglement briefly:",
        params   = params,
        callback = lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")

    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)

    # --- Raw logits for custom sampling ---
    ids    = engine.tokenize("The sky is")
    logits = engine.forward(ids)   # numpy float32 array [vocab_size]

3. C++ API

#include <cstdio>
#include <iostream>
#include <string_view>

#include "sovereign/engine.hpp"

int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;

    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;

    sovereign::GenerateParams params;
    params.max_new_tokens       = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p       = 0.9f;

    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;   // return false to stop early
        });

    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}

4. C API (FFI)

#include "sovereign/engine.hpp"   // exposes extern "C" block

SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,       // vram_budget (0 = auto)
    ~0u,     // gpu_layers  (all)
    true     // use_mmap
);

sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);

sovereign_engine_free(engine);

The .sovereign Format

The .sovereign binary format is designed for zero-copy, memory-mapped inference:

┌──────────────┬──────────────────────────────────────────────────────┐
│ Offset       │ Section                                              │
├──────────────┼──────────────────────────────────────────────────────┤
│ 0x0000       │ FileHeader        (256 bytes, fixed)                 │
│ 0x0100       │ ModelConfig       (256 bytes, padded to 64B)         │
│ aligned      │ TokenizerBlob     (UTF-8 JSON)                       │
│ aligned      │ TensorIndex[]     (N × 192 bytes each)               │
│ PAGE-ALIGNED │ TensorDataBlock   (mmap-ready, 4K page aligned) ◀──┐ │
└──────────────┴──────────────────────────────────────────────────────┘
                                                                       │
Vulkan can mmap this block directly into a VkBuffer via               │
VK_EXT_external_memory_host — zero CPU copy during weight loading. ───┘
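
The layout above implies some simple offset arithmetic, which the following sketch works through. The fixed sizes come from the table; the 64-byte alignment used for the "aligned" sections is an assumption (the table does not state it), and the function names are hypothetical.

```python
FILE_HEADER_SIZE = 256     # from the layout table
MODEL_CONFIG_SIZE = 256    # from the layout table
TENSOR_INDEX_ENTRY = 192   # bytes per TensorIndex entry, from the table
PAGE_SIZE = 4096           # tensor data is 4K page aligned
SECTION_ALIGN = 64         # ASSUMPTION: alignment of the "aligned" sections


def align_up(offset: int, alignment: int) -> int:
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def section_offsets(tokenizer_bytes: int, tensor_count: int) -> dict:
    """Compute the start offset of each section for a hypothetical file."""
    config = FILE_HEADER_SIZE  # 0x0100
    tokenizer = align_up(config + MODEL_CONFIG_SIZE, SECTION_ALIGN)
    index = align_up(tokenizer + tokenizer_bytes, SECTION_ALIGN)
    data = align_up(index + tensor_count * TENSOR_INDEX_ENTRY, PAGE_SIZE)
    return {"config": config, "tokenizer": tokenizer, "index": index, "data": data}
```

Page-aligning the tensor data block is what lets a loader map it without first copying it into a separately aligned buffer.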

Key properties:

Magic bytes: SVRN (0x53, 0x56, 0x52, 0x4E)
All multi-byte fields: little-endian
Per-tensor CRC32C checksums (hardware-accelerated via SSE4.2)
Per-tensor DType field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4_K, Q3_K, Q2_K
Feature flags bitmask: MMAP_READY, HAS_TOKENIZER, GROUPED_QUERY, RoPE_SCALED, …
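
To make two of these properties concrete, here is a minimal sketch of a magic-byte check and a bitwise CRC32C. It is a slow reference implementation for illustration, not the engine's hardware-accelerated path, and the function names are hypothetical. Only the magic value is taken from this document; the rest of the header layout is not assumed.

```python
MAGIC = b"SVRN"  # 0x53 0x56 0x52 0x4E


def has_sovereign_magic(header: bytes) -> bool:
    """Return True if the buffer starts with the .sovereign magic bytes."""
    return header[:4] == MAGIC


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift right one bit; apply the polynomial if the dropped bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF
```

The `crc` parameter allows a checksum to be continued across chunks, so a large tensor can be verified piece by piece.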

Quantiser

The quantiser runs a 3-phase pipeline:

Phase 1 – Calibration Profiling

Computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):

Hessian proxy (mean squared activation magnitude)
Outlier ratio (fraction with |w| > 3σ)
Kurtosis (tail heaviness of the distribution)

Phase 2 – Budget Allocation

Assigns a DType to each tensor to hit a target average bpw:

importance ≥ 0.75  →  FP16 / INT8   (embeddings, first/last layers, norms)
importance ≥ 0.50  →  INT4 / Q4_K  (Q/K/V projections)
importance ≥ 0.25  →  Q3_K
importance <  0.25  →  Q2_K

Iteratively rebalances until the achieved bpw is within 5% of the target bpw.
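
The allocation loop can be sketched as follows. This is a simplified illustration, not the engine's allocator: it picks one dtype per importance band (INT8, Q4_K, Q3_K, Q2_K), uses the approximate bpw values from the quantisation-modes table, and rebalances by shifting all thresholds in small steps, which is an assumed strategy. All names are hypothetical.

```python
# Approximate bpw per dtype, from the quantisation-modes table.
BPW = {"int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}


def pick_dtype(importance: float, shift: float = 0.0) -> str:
    """Map an importance score to a dtype; shift moves all thresholds together."""
    if importance >= 0.75 + shift:
        return "int8"
    if importance >= 0.50 + shift:
        return "q4k"
    if importance >= 0.25 + shift:
        return "q3k"
    return "q2k"


def achieved_bpw(tensors, dtypes) -> float:
    """Average bits per weight, weighted by each tensor's element count."""
    total = sum(count for _, count in tensors)
    return sum(count * BPW[d] for (_, count), d in zip(tensors, dtypes)) / total


def allocate(tensors, target_bpw, tolerance=0.05, max_steps=200, step=0.01):
    """tensors: list of (importance, element_count). Returns (dtypes, bpw)."""
    shift = 0.0
    for _ in range(max_steps):
        dtypes = [pick_dtype(imp, shift) for imp, _ in tensors]
        bpw = achieved_bpw(tensors, dtypes)
        if abs(bpw - target_bpw) <= tolerance * target_bpw:
            break
        # Over budget: raise thresholds so fewer tensors get wide types.
        # Under budget: lower them so more tensors are promoted.
        shift += step if bpw > target_bpw else -step
    return dtypes, bpw
```

For example, `allocate([(0.9, 100), (0.555, 400), (0.335, 300), (0.105, 200)], 4.5)` starts under budget and lowers the thresholds until the least important tensor is promoted from Q2_K to Q3_K.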

Phase 3 – HQQ Quantisation

Per-block iterative solver minimising the Hessian-weighted MSE:

min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H

Default: 20 iterations, block size 128 elements, FP16 scale storage.
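
The idea of iteratively refining a block's scale and zero point can be illustrated with a much simpler stand-in. The sketch below is not the HQQ solver and omits the Hessian weighting: it alternates between rounding to the nearest 4-bit code and a plain least-squares refit of the affine parameters, which never increases the unweighted MSE. Function names are hypothetical.

```python
def quantise_block(weights, bits=4, iters=20):
    """Return (codes, scale, offset) such that weights ~= scale * code + offset."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0  # fall back to 1.0 for constant blocks
    offset = lo
    n = len(weights)
    codes = []
    for _ in range(iters):
        # Step 1: nearest valid code for each weight under the current parameters.
        codes = [min(qmax, max(0, round((w - offset) / scale))) for w in weights]
        # Step 2: least-squares refit of scale and offset for the fixed codes.
        mean_q = sum(codes) / n
        mean_w = sum(weights) / n
        var_q = sum((q - mean_q) ** 2 for q in codes)
        if var_q == 0:
            offset = mean_w - scale * mean_q
            break
        cov = sum((q - mean_q) * (w - mean_w) for q, w in zip(codes, weights))
        if cov <= 0:
            break
        scale = cov / var_q
        offset = mean_w - scale * mean_q
    return codes, scale, offset


def mse(weights, codes, scale, offset):
    """Mean squared reconstruction error of a quantised block."""
    return sum((w - (scale * q + offset)) ** 2
               for w, q in zip(weights, codes)) / len(weights)
```

A weighted variant would multiply each squared error by a per-element importance term, which is where a Hessian proxy from Phase 1 would enter.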

Memory Manager

The AsyncMemoryManager implements a double-buffered layer-streaming pipeline:

CPU Thread             GPU Compute Queue       DMA Transfer Queue
──────────             ─────────────────       ─────────────────

[Layer N-1 ready] ──▶  Compute(Layer N-1)
                                │
[Stream Layer N+1] ─────────────┼──────────▶ DMA(Layer N+1)
  (from mmap/RAM)               │                  │
                                ▼                  ▼
                        Compute(Layer N) ◀── Layer N ready
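
The overlap shown in the diagram can be sketched in a few lines. In this illustration a single worker thread stands in for the DMA transfer queue; `load_layer` and `compute_layer` are hypothetical callables, and the real engine uses Vulkan queues and semaphores rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(layer_ids, load_layer, compute_layer):
    """Compute layers in order while prefetching the next one in the background."""
    results = []
    if not layer_ids:
        return results
    with ThreadPoolExecutor(max_workers=1) as transfer_queue:
        pending = transfer_queue.submit(load_layer, layer_ids[0])
        for i in range(len(layer_ids)):
            weights = pending.result()  # wait until layer i has arrived
            if i + 1 < len(layer_ids):
                # Kick off the transfer of layer i+1 before computing layer i.
                pending = transfer_queue.submit(load_layer, layer_ids[i + 1])
            results.append(compute_layer(weights))  # overlaps with the transfer
    return results
```

With two buffers, at most one layer is in flight while another is being computed, so the extra memory cost is bounded by one layer's weights.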

VRAM pressure response:

88% → start evicting least-recently-used layers (LRU free-list)
95% → force CPU offload via AVX-512 / NEON kernels
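
A minimal sketch of this two-threshold policy, together with a simple LRU tracker, is shown below. The names are hypothetical, and treating the thresholds as inclusive is an assumption.

```python
from collections import OrderedDict

EVICT_THRESHOLD = 0.88    # start evicting least-recently-used layers
OFFLOAD_THRESHOLD = 0.95  # force CPU offload


def pressure_action(used_bytes: int, budget_bytes: int) -> str:
    """Pick a response for the current VRAM usage ratio."""
    usage = used_bytes / budget_bytes
    if usage >= OFFLOAD_THRESHOLD:
        return "cpu_offload"
    if usage >= EVICT_THRESHOLD:
        return "evict_lru"
    return "none"


class LayerLru:
    """Tracks layer usage order so the coldest layer can be evicted first."""

    def __init__(self):
        self._order = OrderedDict()

    def touch(self, layer_id: int) -> None:
        # Move the layer to the most-recently-used end.
        self._order.pop(layer_id, None)
        self._order[layer_id] = True

    def evict(self) -> int:
        # Remove and return the least-recently-used layer.
        layer_id, _ = self._order.popitem(last=False)
        return layer_id
```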

KV Cache (PagedAttention)

Inspired by vLLM's PagedAttention:

One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
Block size: 16 tokens per block (configurable, must be power-of-2).
Copy-on-write forking: beam search / speculative decoding shares blocks until a write occurs.
Descriptor sets pre-allocated per (block_id × layer) pair to avoid per-inference allocation.
Optional ConstantContextCache for RWKV / Mamba models (O(1) memory regardless of sequence length).
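
The block pool and copy-on-write behaviour described above can be sketched as follows. This is a CPU-side illustration with hypothetical class names: it tracks block ids and reference counts only, where the real cache would also hold the KV data in VRAM and copy it when a shared block is written.

```python
class BlockPool:
    """Fixed pool of KV blocks with a free list and per-block reference counts."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        if block_size <= 0 or block_size & (block_size - 1):
            raise ValueError("block size must be a power of two")
        self.block_size = block_size
        self.free = list(range(num_blocks))  # stack: O(1) alloc and free
        self.refcount = [0] * num_blocks

    def alloc(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """A token sequence described by its block table into the shared pool."""

    def __init__(self, pool, blocks=None, length=0):
        self.pool = pool
        self.blocks = list(blocks or [])
        self.length = length

    def fork(self):
        """Share every block with a new sequence (beam or speculative branch)."""
        for block in self.blocks:
            self.pool.refcount[block] += 1
        return Sequence(self.pool, self.blocks, self.length)

    def append_token(self):
        """Reserve a slot for one token and return (block_id, slot)."""
        slot = self.length % self.pool.block_size
        if slot == 0:
            self.blocks.append(self.pool.alloc())
        elif self.pool.refcount[self.blocks[-1]] > 1:
            # Shared block: give this sequence a private replacement before writing.
            # A real cache would also copy the existing KV entries here.
            self.pool.release(self.blocks[-1])
            self.blocks[-1] = self.pool.alloc()
        self.length += 1
        return self.blocks[-1], slot
```

Forking costs only reference-count updates; memory is duplicated lazily, one block at a time, when a branch first writes into a shared block.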

Vulkan Compute Shaders

All shaders are compiled from GLSL (.comp) to SPIR-V at CMake configure time; when the Vulkan SDK is absent, the build uses precompiled SPIR-V instead:

Shader	Purpose
`rmsnorm.comp`	Fused RMSNorm with subgroup reduction; supports Gemma variant
`matmul_int4.comp`	Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles
`attention_gqa.comp`	GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax
`silu_gate.comp`	Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN
`sampler.comp`	GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial

All shaders use GL_EXT_scalar_block_layout and GL_KHR_shader_subgroup_arithmetic for efficient subgroup reductions.
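
As a CPU-side reference for the stage order listed for `sampler.comp`, the following sketch applies the same sequence of filters in plain Python. It is an illustration under stated assumptions, not a port of the shader: the function name and default values are hypothetical, and the temperature must be greater than zero.

```python
import math
import random


def sample(logits, temperature=0.7, top_k=40, top_p=0.9, min_p=0.05, rng=random):
    """Sample one token id: temperature, top-K, softmax, top-P, min-P, multinomial."""
    # 1. Temperature scaling, keeping each logit paired with its token id.
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)
    # 2. Top-K: keep the K largest logits (0 disables the filter).
    if top_k > 0:
        scaled = scaled[:top_k]
    # 3. Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[0][0]
    weights = [math.exp(l - peak) for l, _ in scaled]
    total = sum(weights)
    probs = [(w / total, i) for w, (_, i) in zip(weights, scaled)]
    # 4. Top-P: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # 5. Min-P: drop tokens below min_p times the most likely token's probability.
    floor = min_p * kept[0][0]
    kept = [(p, i) for p, i in kept if p >= floor]
    # 6. Multinomial draw; random.choices normalises the weights itself.
    return rng.choices([i for _, i in kept], weights=[p for p, _ in kept], k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when comparing against GPU output.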

Running Tests

# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure

# Run a specific suite
./build/test_quantizer --success
./build/test_format    --success
./build/test_kv_cache  --success
./build/test_engine    --success

# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration

Roadmap

Continuous batching — interleave multiple requests in a single GPU pass
Speculative decoding — draft model integration for 2-4× decode speedup
Cooperative matrix — VK_KHR_cooperative_matrix path for tensor-core acceleration
io_uring Direct Storage — bypass staging buffers for PCIe 4.0+ NVMe
Rust bindings — PyO3 alternative to pybind11
Windows support — MinGW + Vulkan SDK on Windows
Web UI — minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
LoRA / adapter merging — runtime LoRA weight injection without repack
RWKV / Mamba — constant-memory inference via ConstantContextCache
Benchmark suite — automated comparison vs llama.cpp on standard prompts

Contributing

Contributions are welcome. Please open an issue before submitting large pull requests.

# Fork, clone, then create a feature branch
git checkout -b feat/my-feature

# Build with tests + debug symbols
./scripts/build.sh --debug --tests

# Make sure all tests pass before submitting
cd build && ctest --output-on-failure

Code style: follow the existing C++20 conventions (no exceptions in hot paths, [[nodiscard]] everywhere, PIMPL for public headers, RAII for all Vulkan handles).

License

MIT License — see LICENSE for details.

Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.

Project details

Releases 0.1.0 through 0.2.10 were published on May 27 and May 28, 2026; this page describes 0.2.5. Version 0.2.5 was uploaded on May 27, 2026 as a source distribution (sovereign_inference-0.2.5.tar.gz, 206.2 kB) and a CPython 3.11 wheel for Windows x86-64 (sovereign_inference-0.2.5-cp311-cp311-win_amd64.whl, 276.0 kB).
