Ultra-fast LLM inference engine with a Vulkan compute backend
Sovereign Engine
Ultra-fast, modular LLM inference engine with a Vulkan compute backend
Designed to surpass llama.cpp in throughput and VRAM efficiency.
┌──────────────────────────────────────────────────────────────────────────┐
│ Sovereign Engine v0.2.5 │
│ C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant │
└──────────────────────────────────────────────────────────────────────────┘
Table of Contents
- Overview
- Key Features
- Architecture
- Project Structure
- Requirements
- Building
- Usage
- The .sovereign Format
- Quantiser
- Memory Manager
- KV Cache (PagedAttention)
- Vulkan Compute Shaders
- Running Tests
- Roadmap
- Contributing
- License
Overview
Sovereign Engine is a from-scratch, GPU-first LLM inference runtime written in C++20.
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using Vulkan compute as the sole GPU backend, which means:
- No CUDA dependency — runs on any Vulkan 1.2+ GPU.
- Tight control over VRAM: paged KV cache, async layer streaming, dynamic CPU offload.
- Mixed-precision quantisation inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
- A clean Python API (via pybind11) and a stable C ABI for FFI from any language.
Key Features
| Feature | Details |
|---|---|
| Vulkan backend | Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali. |
| Mixed-precision quantisation | FP16 → INT8 → Q4_K → Q3_K → Q2_K per tensor, HQQ solver, EXL2-style importance scoring. |
| Async layer pipeline | Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1. |
| PagedAttention KV cache | Block-based VRAM pool, copy-on-write forking, O(1) alloc/free. |
| Dynamic CPU offload | Falls back to AVX-512 / NEON when VRAM pressure exceeds threshold. |
| Streaming generation | Token-by-token callback; GIL-safe Python generator. |
| Rich sampling | Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema. |
| Proprietary .sovereign format | Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload. |
| GQA / MHA / MQA | All attention variants supported via a single fused GLSL shader. |
| RoPE + sliding window | Inline rotary embeddings, optional Mistral/Gemma sliding-window mask. |
Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ Python / C++ / C │
│ (sovereign_inference.Engine) │
└────────────────────────────────┬────────────────────────────────────────┘
│
┌────────────▼────────────┐
│ engine.cpp │ prefill / decode_step /
│ (inference loop) │ generate / forward
└──┬──────┬───────┬───────┘
│ │ │
┌─────────────▼─┐ ┌──▼────┐ ┌▼──────────────────┐
│ VulkanContext │ │Quant │ │ AsyncMemoryManager │
│ (device, │ │izer │ │ (layer streaming, │
│ pipelines, │ │ │ │ CPU offload) │
│ cmd bufs) │ └───────┘ └────────────────────┘
└───────┬────────┘ │
│ ┌─────────▼──────────┐
┌───────────▼────────────┐ │ PagedKVCache │
│ SPIR-V Compute Shaders│ │ (block pool, │
│ ┌─────────────────┐ │ │ CoW fork, │
│ │ rmsnorm.comp │ │ │ descriptor sets) │
│ │ matmul_int4.comp│ │ └────────────────────┘
│ │ attention_gqa │ │
│ │ silu_gate.comp │ │
│ │ sampler.comp │ │
│ └─────────────────┘ │
└────────────────────────┘
┌────────────────────────────────────────────────────────────┐
│ sovereign-convert CLI │
│ SafeTensors → profile → budget allocate → HQQ quant │
│ → pack INT4/3/2 → write .sovereign │
└────────────────────────────────────────────────────────────┘
Project Structure
sovereign-engine/
├── CMakeLists.txt # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/ # Public C++ headers
│ ├── engine.hpp # Top-level inference API
│ ├── format.hpp # .sovereign binary format spec
│ ├── vulkan_context.hpp # Vulkan device + pipeline management
│ ├── memory_manager.hpp # Async pipeline memory manager
│ ├── kv_cache.hpp # PagedAttention KV cache
│ └── quantizer.hpp # Mixed-precision quantiser
│
├── src/
│ ├── vulkan/
│ │ └── vulkan_context.cpp
│ ├── format/
│ │ └── format.cpp
│ ├── compute/
│ │ └── kv_cache.cpp
│ ├── inference/
│ │ └── engine.cpp
│ ├── quantizer/
│ │ └── quantizer.cpp
│ └── memory/
│ └── memory_manager.cpp
│
├── shaders/ # GLSL compute shaders (compiled to SPIR-V)
│ ├── rmsnorm.comp
│ ├── matmul_int4.comp
│ ├── attention_gqa.comp
│ ├── silu_gate.comp
│ └── sampler.comp
│
├── bindings/
│ └── python/
│ └── sovereign_py.cpp # pybind11 Python bindings
│
├── tools/
│ └── converter/
│ └── main.cpp # sovereign-convert CLI
│
├── tests/
│ ├── CMakeLists.txt
│ ├── test_format.cpp
│ ├── test_quantizer.cpp
│ ├── test_kv_cache.cpp
│ └── test_engine.cpp
├── package.json # Shader compiler package metadata
├── package-lock.json # Shader compiler lock file
│
├── examples/
│ └── basic_generate.py # Python streaming example
│
├── scripts/
│ ├── build.sh # Build helper script
│ └── compile_shaders.js # Shader compiler tool using WebGPU glslang
│
└── third_party/
├── volk/ # Meta-loader for dynamic Vulkan loading (tracked)
│ ├── volk.h
│ └── volk.c
└── vk_mem_alloc.h # Fetched automatically via CMake (not tracked)
Requirements
Runtime
- Vulkan 1.2+ Compatible GPU: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
- GPU Driver: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!
Build (Zero-Dependency & SDK-Free)
Because the engine loads Vulkan dynamically through the volk meta-loader and CMake fetches the remaining dependencies automatically, the Vulkan SDK is optional when building Sovereign Engine.
| Dependency | Version | Mandatory? | Notes |
|---|---|---|---|
| CMake | ≥ 3.25 | Yes | Handles the build orchestration |
| C++ Compiler | C++20 | Yes | MSVC 2022 / GCC 12+ / Clang 15+ |
| Vulkan SDK | ≥ 1.3 | No (Optional) | If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders |
| Python | ≥ 3.9 | No (Optional) | Only required to compile Python/pybind11 bindings |
Required Vulkan Extensions
Your GPU driver must support:
VK_KHR_timeline_semaphore (core in 1.2)
VK_KHR_synchronization2 (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address
VK_KHR_shader_float16_int8
VK_EXT_scalar_block_layout
VK_KHR_8bit_storage
VK_KHR_16bit_storage
Building
Quick start
# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine
# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh
# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512
Manual CMake
mkdir build && cd build
cmake .. \
-DCMAKE_BUILD_TYPE=Release \
-DSOVEREIGN_BUILD_PYTHON=ON \
-DSOVEREIGN_BUILD_TESTS=ON \
-DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)
Debug build with AddressSanitizer
./scripts/build.sh --debug
Usage
1. Convert a Model
Download a HuggingFace model (e.g. Gemma 4B) in SafeTensors format, then convert:
# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
--input /path/to/gemma-4b/ \
--output gemma-4b.sovereign \
--arch gemma \
--quant mixed \
--bpw 4.5
# With calibration corpus for better importance scoring
./build/sovereign-convert \
--input /path/to/gemma-4b/ \
--output gemma-4b-calibrated.sovereign \
--quant mixed \
--bpw 4.5 \
--calib calibration_corpus.txt \
--verbose
Quantisation modes:
| Mode | Approx bpw | Description |
|---|---|---|
| fp16 | 16 | No quantisation, maximum quality |
| int8 | 8 | Symmetric INT8 throughout |
| q4k | 4.5 | Q4_K block quantisation |
| q3k | 3.5 | Q3_K block quantisation |
| q2k | 2.6 | Q2_K aggressive compression |
| mixed | target | Adaptive per-tensor (recommended) |
Python-based Conversion
You can also convert models directly inside Python without having to compile the C++ CLI tool:
import sovereign_inference

result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True,
)
if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")
2. Python API
import sovereign_inference

# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count = 2**31 - 1  # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80

with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model : {engine.model_name}")
    print(f"Device : {engine.device_name} ({engine.vram_gib:.1f} GiB)")

    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens = 512
    params.sampling.temperature = 0.7
    params.sampling.top_p = 0.9
    params.sampling.min_p = 0.05
    params.sampling.repetition_penalty = 1.1
    stats = engine.generate(
        prompt = "Explain quantum entanglement briefly:",
        params = params,
        # print() returns None, so "or True" keeps generation going
        callback = lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")

    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)

    # --- Raw logits for custom sampling ---
    ids = engine.tokenize("The sky is")
    logits = engine.forward(ids)  # numpy float32 array [vocab_size]
3. C++ API
#include <cstdio>
#include <iostream>
#include <string_view>

#include "sovereign/engine.hpp"

int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;

    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;

    sovereign::GenerateParams params;
    params.max_new_tokens = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p = 0.9f;

    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;  // return false to stop early
        });
    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}
4. C API (FFI)
#include "sovereign/engine.hpp"  // exposes extern "C" block

SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,     // vram_budget (0 = auto)
    ~0u,   // gpu_layers (all)
    true   // use_mmap
);
sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);
sovereign_engine_free(engine);
The .sovereign Format
The .sovereign binary format is designed for zero-copy, memory-mapped inference:
┌──────────────┬──────────────────────────────────────────────────────┐
│ Offset │ Section │
├──────────────┼──────────────────────────────────────────────────────┤
│ 0x0000 │ FileHeader (256 bytes, fixed) │
│ 0x0100 │ ModelConfig (256 bytes, padded to 64B) │
│ aligned │ TokenizerBlob (UTF-8 JSON) │
│ aligned │ TensorIndex[] (N × 192 bytes each) │
│ PAGE-ALIGNED │ TensorDataBlock (mmap-ready, 4K page aligned) ◀──┐ │
└──────────────┴──────────────────────────────────────────────────────┘
│
Vulkan can mmap this block directly into a VkBuffer via │
VK_EXT_external_memory_host — zero CPU copy during weight loading. ───┘
Key properties:
- Magic bytes: SVRN (0x53, 0x56, 0x52, 0x4E)
- All multi-byte fields: little-endian
- Per-tensor CRC32C checksums (hardware-accelerated via SSE4.2)
- Per-tensor DType field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4_K, Q3_K, Q2_K
- Feature flags bitmask: MMAP_READY, HAS_TOKENIZER, GROUPED_QUERY, RoPE_SCALED, …
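To make the properties above concrete, here is a minimal Python sketch of three checks a reader of the format might perform: matching the magic bytes, rounding an offset up to a 4K page boundary, and computing a CRC32C in software. The function names are hypothetical, and nothing here reflects the actual header layout beyond the four magic bytes; it is an illustration, not the engine's loader.

```python
MAGIC = bytes([0x53, 0x56, 0x52, 0x4E])  # "SVRN", as listed above
PAGE_SIZE = 4096                          # 4K page alignment of the tensor data block


def has_sovereign_magic(header: bytes) -> bool:
    """Return True if the buffer starts with the SVRN magic bytes."""
    return header[:4] == MAGIC


def align_up(offset: int, alignment: int = PAGE_SIZE) -> int:
    """Round offset up to the next multiple of a power-of-two alignment."""
    return (offset + alignment - 1) & ~(alignment - 1)


def _make_crc32c_table():
    poly = 0x82F63B78  # reflected Castagnoli polynomial used by CRC32C
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _make_crc32c_table()


def crc32c(data: bytes) -> int:
    """Table-driven software CRC32C (the SSE4.2 instruction computes the same value)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The software table is only a reference; a production reader would normally use the hardware instruction or a native library for throughput.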
Quantiser
The quantiser runs a 3-phase pipeline:
Phase 1 – Calibration Profiling
Computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):
- Hessian proxy (mean squared activation magnitude)
- Outlier ratio (fraction with |w| > 3σ)
- Kurtosis (distribution peakedness)
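The three statistics listed above can be sketched in plain Python as follows. This is an illustration under assumptions: the function name is hypothetical, kurtosis is taken as the non-excess fourth standardised moment, and σ is the population standard deviation of the sample. The engine's own profiler may define these differently.

```python
import math


def profile_tensor(values):
    """Compute the three calibration statistics for one flat list of floats."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    sigma = math.sqrt(variance)

    # Hessian proxy: mean squared magnitude.
    mean_square = sum(v * v for v in values) / n

    # Outlier ratio: fraction of entries with |v| > 3 sigma.
    outlier_ratio = sum(1 for v in values if abs(v) > 3.0 * sigma) / n

    # Kurtosis: fourth central moment divided by the squared variance.
    if variance > 0.0:
        kurtosis = (sum((v - mean) ** 4 for v in values) / n) / (variance ** 2)
    else:
        kurtosis = 0.0  # constant input: define as 0 to avoid dividing by zero

    return {
        "mean_square": mean_square,
        "outlier_ratio": outlier_ratio,
        "kurtosis": kurtosis,
    }
```

A heavy-tailed tensor shows up as a high kurtosis and a non-zero outlier ratio, which is the kind of signal that would justify a wider data type.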
Phase 2 – Budget Allocation
Assigns a DType to each tensor to hit a target average bpw:
importance ≥ 0.75 → FP16 / INT8 (embeddings, first/last layers, norms)
importance ≥ 0.50 → INT4 / Q4_K (Q/K/V projections)
importance ≥ 0.25 → Q3_K
importance < 0.25 → Q2_K
Iteratively rebalances until the achieved bpw is within 5% of the target bpw.
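The allocation step can be illustrated with a small greedy sketch. The importance thresholds and bpw figures come from this README; everything else is an assumption made for illustration: the function names, the choice of INT8 for the top tier, and the rebalancing rule (demote the least important tensor when over budget, promote the most important when under). The real allocator may work differently.

```python
LADDER = ["fp16", "int8", "q4k", "q3k", "q2k"]  # widest to narrowest
BPW = {"fp16": 16.0, "int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}


def initial_dtype(importance):
    """Map an importance score to a starting data type."""
    if importance >= 0.75:
        return "int8"  # assumption: pick INT8 from the FP16 / INT8 tier
    if importance >= 0.50:
        return "q4k"
    if importance >= 0.25:
        return "q3k"
    return "q2k"


def average_bpw(plan, sizes):
    """Average bits per weight, weighted by each tensor's element count."""
    total = sum(sizes.values())
    return sum(BPW[plan[name]] * sizes[name] for name in plan) / total


def allocate(importance, sizes, target_bpw, tolerance=0.05):
    """Greedy rebalance until the average is within tolerance of the target."""
    plan = {name: initial_dtype(score) for name, score in importance.items()}
    by_importance = sorted(importance, key=importance.get)  # least important first
    for _ in range(len(LADDER) * len(plan)):  # hard cap on iterations
        achieved = average_bpw(plan, sizes)
        if abs(achieved - target_bpw) / target_bpw < tolerance:
            break
        step = 1 if achieved > target_bpw else -1  # +1 narrows, -1 widens
        order = by_importance if step == 1 else list(reversed(by_importance))
        for name in order:
            idx = LADDER.index(plan[name]) + step
            if 0 <= idx < len(LADDER):
                plan[name] = LADDER[idx]
                break
        else:
            break  # nothing left to move in that direction
    return plan
```

With four equally sized tensors scored 0.9, 0.6, 0.3 and 0.1, the starting plan averages 4.65 bpw, which is already within 5% of a 4.5 bpw target, so no rebalancing happens.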
Phase 3 – HQQ Quantisation
Per-block iterative solver minimising the Hessian-weighted MSE:
min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H
Default: 20 iterations, block size 128 elements, FP16 scale storage.
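For orientation, the sketch below shows what a single (scale, zero) pair does for one block. It is deliberately not the HQQ solver: it uses plain min-max affine quantisation as a baseline, which is the kind of starting point an iterative solver would then refine. The function names are hypothetical, and the metadata cost passed to `effective_bpw` is an assumption (the README only states that scales are stored in FP16).

```python
def quantize_block(weights, bits=4):
    """Min-max affine quantisation of one block; returns (codes, scale, zero)."""
    qmax = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax if hi > lo else 1.0  # avoid a zero scale
    zero = lo
    codes = [min(qmax, max(0, round((w - zero) / scale))) for w in weights]
    return codes, scale, zero


def dequantize_block(codes, scale, zero):
    """Invert quantize_block up to rounding error."""
    return [c * scale + zero for c in codes]


def block_mse(weights, bits=4):
    """Unweighted reconstruction error; the HQQ objective weights this by H."""
    codes, scale, zero = quantize_block(weights, bits)
    restored = dequantize_block(codes, scale, zero)
    return sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)


def effective_bpw(bits, block_size, metadata_bits):
    """Bits per weight once per-block metadata is amortised over the block."""
    return bits + metadata_bits / block_size
```

With the defaults quoted above (128-element blocks, one FP16 scale), a 4-bit block costs `effective_bpw(4, 128, 16)`, that is 4.125 bits per weight before any zero-point storage is counted.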
Memory Manager
The AsyncMemoryManager implements a double-buffered layer-streaming pipeline:
CPU Thread GPU Compute Queue DMA Transfer Queue
────────── ───────────────── ─────────────────
[Layer N-1 ready] ──▶ Compute(Layer N-1)
│
[Stream Layer N+1] ─────────────┼──────────▶ DMA(Layer N+1)
(from mmap/RAM) │ │
▼ ▼
Compute(Layer N) ◀── Layer N ready
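The overlap shown in the diagram can be sketched with a single background worker: while layer N is being computed, the load of layer N+1 is already in flight. This is a CPU-only illustration with hypothetical names (`run_layers`, `load_layer`, `compute_layer`); the engine itself uses Vulkan queues and timeline semaphores rather than Python threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(num_layers, load_layer, compute_layer):
    """Run layers in order while prefetching the next layer's weights."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as transfer:
        pending = transfer.submit(load_layer, 0)  # prime the first buffer
        for n in range(num_layers):
            weights = pending.result()            # wait for layer n to arrive
            if n + 1 < num_layers:
                # Start the next transfer before computing the current layer.
                pending = transfer.submit(load_layer, n + 1)
            outputs.append(compute_layer(n, weights))
    return outputs
```

Only two sets of weights are alive at any time (the one being computed and the one in flight), which is the point of double buffering.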
VRAM pressure response:
- 88% → start evicting LRU layers (LRU free-list)
- 95% → force CPU offload via AVX-512 / NEON kernels
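The two thresholds can be read as a small policy function plus an LRU structure. The sketch below assumes the thresholds are inclusive and that pressure is measured as used bytes over the VRAM budget; both are assumptions, as are all names in the code.

```python
from collections import OrderedDict

EVICT_THRESHOLD = 0.88    # start evicting least-recently-used layers
OFFLOAD_THRESHOLD = 0.95  # fall back to CPU kernels


def pressure_action(used_bytes, budget_bytes):
    """Classify the current VRAM pressure into one of three responses."""
    pressure = used_bytes / budget_bytes
    if pressure >= OFFLOAD_THRESHOLD:
        return "cpu_offload"
    if pressure >= EVICT_THRESHOLD:
        return "evict_lru"
    return "none"


class ResidentLayers:
    """Tracks layers resident in VRAM, ordered from least to most recently used."""

    def __init__(self):
        self._layers = OrderedDict()

    def touch(self, layer_id, size_bytes):
        """Record a use of a layer, making it the most recently used."""
        self._layers[layer_id] = size_bytes
        self._layers.move_to_end(layer_id)

    def used_bytes(self):
        return sum(self._layers.values())

    def evict_until_calm(self, budget_bytes):
        """Evict LRU layers until pressure drops below the eviction threshold."""
        evicted = []
        while self._layers and pressure_action(self.used_bytes(), budget_bytes) != "none":
            layer_id, _ = self._layers.popitem(last=False)
            evicted.append(layer_id)
        return evicted
```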
KV Cache (PagedAttention)
Inspired by vLLM's PagedAttention:
- One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
- Block size: 16 tokens per block (configurable, must be power-of-2).
- Copy-on-write forking: beam search / speculative decoding shares blocks until a write occurs.
- Descriptor sets pre-allocated per (block_id × layer) pair to avoid per-inference allocation.
- Optional ConstantContextCache for RWKV / Mamba models (O(1) memory regardless of sequence length).
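The block pool and copy-on-write behaviour described above can be modelled with a free list and per-block reference counts. This is a bookkeeping-only sketch with hypothetical class names; it tracks block ids, not actual K/V data or Vulkan buffers.

```python
BLOCK_TOKENS = 16  # tokens per block, as stated above


class BlockPool:
    """Fixed pool of block ids with O(1) alloc/free via a free list."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def alloc(self):
        if not self.free:
            raise MemoryError("KV block pool exhausted")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def retain(self, block):
        self.refcount[block] += 1

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


class Sequence:
    """One sequence's block table; forks share blocks until a write occurs."""

    def __init__(self, pool):
        self.pool = pool
        self.blocks = []
        self.length = 0

    def append_token(self):
        if self.length % BLOCK_TOKENS == 0:
            self.blocks.append(self.pool.alloc())  # last block full: new block
        else:
            last = self.blocks[-1]
            if self.pool.refcount[last] > 1:       # shared: copy before writing
                fresh = self.pool.alloc()
                # A real cache would copy the block's K/V contents here.
                self.pool.release(last)
                self.blocks[-1] = fresh
        self.length += 1

    def fork(self):
        child = Sequence(self.pool)
        child.blocks = list(self.blocks)
        child.length = self.length
        for block in self.blocks:
            self.pool.retain(block)
        return child
```

Forking a 20-token sequence and appending one token to the child copies only the partially filled last block; the first, full block stays shared between parent and child.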
Vulkan Compute Shaders
All shaders are compiled from GLSL (.comp) to SPIR-V at CMake configure time:
| Shader | Purpose |
|---|---|
| rmsnorm.comp | Fused RMSNorm with subgroup reduction; supports Gemma variant |
| matmul_int4.comp | Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles |
| attention_gqa.comp | GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax |
| silu_gate.comp | Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN |
| sampler.comp | GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial |
All shaders use GL_EXT_scalar_block_layout and GL_KHR_shader_subgroup_arithmetic for efficient subgroup reductions.
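As a CPU reference for the stage order listed for sampler.comp (temperature → top-K → softmax → top-P → min-P → multinomial), here is a plain-Python sketch. The function name and the exact filter semantics are assumptions: top-P keeps the smallest prefix whose cumulative probability reaches the threshold, and min-P drops tokens below `min_p` times the top probability. The shader may differ in detail.

```python
import math
import random


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=None):
    """Pick a token id from raw logits; temperature must be > 0."""
    rng = rng or random.Random()

    # 1. Temperature, then sort candidates from most to least likely.
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)

    # 2. Top-K: keep only the k best candidates (0 disables the filter).
    if top_k > 0:
        scaled = scaled[:top_k]

    # 3. Softmax over the survivors, shifted by the max for stability.
    peak = scaled[0][0]
    exps = [math.exp(s - peak) for s, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 4. Top-P: smallest prefix whose cumulative probability reaches top_p.
    keep, cumulative = len(probs), 0.0
    for n, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            keep = n + 1
            break

    # 5. Min-P: drop tokens below min_p times the best probability.
    floor = min_p * probs[0]
    kept = [(p, scaled[n][1]) for n, p in enumerate(probs[:keep]) if p >= floor]

    # 6. Multinomial draw over the remaining (unnormalised) mass.
    threshold = rng.random() * sum(p for p, _ in kept)
    running = 0.0
    for p, token_id in kept:
        running += p
        if threshold < running:
            return token_id
    return kept[-1][1]  # guard against floating-point round-off
```

Setting `top_k=1` reduces the function to greedy decoding, which is a convenient sanity check when comparing against GPU output.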
Running Tests
# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure
# Run a specific suite
./build/test_quantizer --success
./build/test_format --success
./build/test_kv_cache --success
./build/test_engine --success
# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration
Roadmap
- Continuous batching — interleave multiple requests in a single GPU pass
- Speculative decoding — draft model integration for 2-4× decode speedup
- Cooperative matrix — VK_KHR_cooperative_matrix path for tensor-core acceleration
- io_uring Direct Storage — bypass staging buffers for PCIe 4.0+ NVMe
- Rust bindings — PyO3 alternative to pybind11
- Windows support — MinGW + Vulkan SDK on Windows
- Web UI — minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
- LoRA / adapter merging — runtime LoRA weight injection without repack
- RWKV / Mamba — constant-memory inference via ConstantContextCache
- Benchmark suite — automated comparison vs llama.cpp on standard prompts
Contributing
Contributions are welcome. Please open an issue before submitting large pull requests.
# Fork, clone, then create a feature branch
git checkout -b feat/my-feature
# Build with tests + debug symbols
./scripts/build.sh --debug --tests
# Make sure all tests pass before submitting
cd build && ctest --output-on-failure
Code style: follow the existing C++20 conventions (no exceptions in hot paths, [[nodiscard]] everywhere, PIMPL for public headers, RAII for all Vulkan handles).
License
MIT License — see LICENSE for details.
Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.