Hardware-aware CLI that selects the best runtime and quantization for efficient LLM inference.
VectorPrime
Compiler-style, hardware-aware LLM inference optimizer
VectorPrime takes a model file and your hardware, then finds the fastest way to run it. It profiles your CPU, GPU, and RAM; parses the model's intermediate representation to extract architecture metadata; generates every valid combination of runtime, quantization, thread count, and GPU offload layers; benchmarks candidates in parallel; and hands you back the configuration that maximizes tokens per second within your memory budget. The result is a ready-to-use Ollama bundle — no guesswork required.
VectorPrime is built for developers and researchers who run inference locally and want reproducible, hardware-specific performance without manually tuning runtime flags or hunting for the right quantization format. The Rust backend handles parallel benchmarking and hardware detection; a PyO3 native extension exposes everything through a clean Python API and a single pip install vectorprime.
Features
| Feature | Description | Status |
|---|---|---|
| Hardware profiling | Detects CPU core count, SIMD level (AVX/AVX2/AVX512), GPU VRAM and compute capability, and available RAM | Stable |
| Model IR analysis | Reads GGUF and ONNX model files to extract parameter count, architecture, context length, layer count, hidden size, attention heads, KV cache size, memory footprint, and FLOPs without running inference | Stable |
| Multi-runtime support | Benchmarks Ollama (primary), TensorRT (primary), ONNX Runtime (secondary), and llama.cpp (deprioritized) against each other on your hardware | Stable |
| Automatic quantization selection | Evaluates F16, Q8_0, Q4_K_M, Q4_0, Int8, and Int4 and picks the fastest that fits in memory | Stable |
| Parallel benchmarking | Tokio-based async executor runs up to 3 configurations concurrently | Stable |
| Optimization result caching | Caches results to ~/.llmforge/cache/ keyed by model identity and hardware profile; skips benchmarking entirely on a cache hit | Stable |
| Ollama export | Generates a Modelfile with tuned num_thread and num_gpu values, ready for ollama create | Stable |
| Format conversion | Bidirectional GGUF-to-ONNX and ONNX-to-GGUF conversion with full metadata round-trip | Stable |
| Python API | PyO3 native extension — import and call from any Python script or notebook | Stable |
| CLI interface | profile, optimize, convert-to-onnx, and convert-to-gguf subcommands | Stable |
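The "fastest quantization that fits" rule can be sketched as a simple memory-budget filter. The bits-per-weight figures below are rough approximations, and the helper names are illustrative only, not VectorPrime's actual API; the real selection is driven by benchmarks, while this sketch assumes lower precision tends to run faster when it fits.

```python
# Approximate effective bits per weight for each supported format
# (K-quants carry extra scale/min metadata, hence ~4.5 rather than 4).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Int8": 8.0,
    "Q4_K_M": 4.5,
    "Q4_0": 4.5,
    "Int4": 4.0,
}

# Assumed faster-first preference order: most aggressive quantization first.
PREFERENCE = ["Int4", "Q4_0", "Q4_K_M", "Int8", "Q8_0", "F16"]

def estimated_size_mb(param_count, quant):
    """Rough weight-only footprint in MB; ignores KV cache and runtime overhead."""
    return param_count * BITS_PER_WEIGHT[quant] / 8 / (1024 ** 2)

def pick_quant(param_count, budget_mb):
    """Return the most aggressive quantization whose weights fit the budget."""
    for quant in PREFERENCE:
        if estimated_size_mb(param_count, quant) <= budget_mb:
            return quant
    return None  # nothing fits

# An 8B-parameter model against a 24 GB VRAM budget:
print(pick_quant(8_000_000_000, 24 * 1024))  # Int4
```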
Quick Start
pip install vectorprime
# See what hardware VectorPrime detected
vectorprime profile
# Find the best inference configuration for a model
vectorprime optimize model.gguf
# Export the result as an Ollama bundle (Python API)
# See the Python API section below
Installation
For Users
pip install vectorprime
No Rust toolchain required! Pre-built wheels are available for:
- Python 3.9, 3.10, 3.11, 3.12
- Linux (x86-64, Arm64), macOS (x86-64, Arm64), Windows (x86-64)
Requirements:
- Python 3.9 or later
- At least one supported inference runtime installed and on PATH
Optional runtime prerequisites:
# Ollama — recommended for most users
# https://ollama.com/download
# ONNX Runtime
pip install onnxruntime # CPU
pip install onnxruntime-gpu # CUDA GPU
# TensorRT (NVIDIA only, compute capability >= 7.0)
# https://developer.nvidia.com/tensorrt
# llama.cpp (provides llama-cli and llama-quantize)
# https://github.com/ggml-org/llama.cpp
VectorPrime detects which runtimes are available at startup and silently skips any whose binary is not found. vectorprime profile works with no runtimes installed.
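Runtime detection of this kind boils down to a PATH lookup per backend binary. A minimal sketch, using the binary names from the Supported Runtimes table below; the function name and structure are illustrative, not VectorPrime's real implementation:

```python
import shutil

# Backend binaries as listed in the Supported Runtimes table.
RUNTIME_BINARIES = {
    "Ollama": "ollama",
    "TensorRT": "trtexec",
    "llama.cpp": "llama-cli",
}

def detect_runtimes():
    """Return the subset of runtimes whose binary is found on PATH."""
    return {
        name: path
        for name, binary in RUNTIME_BINARIES.items()
        if (path := shutil.which(binary)) is not None
    }

print(sorted(detect_runtimes()))  # e.g. ['Ollama'] if only Ollama is installed
```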
Usage
Profile Hardware
vectorprime profile
Prints a JSON hardware profile to stdout:
{
"cpu": { "core_count": 16, "simd_level": "AVX2" },
"gpu": { "name": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, "compute_capability": [8, 9] },
"ram": { "total_mb": 65536, "available_mb": 48000 }
}
Optimize a Model
vectorprime optimize model.gguf
─────────────────────────────────────
VectorPrime Optimization Result
─────────────────────────────────────
Runtime: Ollama
Quantization: Q4_K_M
Threads: 16
GPU Layers: 20
Throughput: 110.3 tokens/sec
Latency: 91.2 ms
Memory: 8.2 GB peak
─────────────────────────────────────
Optimized model written to: model-optimized.gguf
Options:
vectorprime optimize <model_path> [OPTIONS]
Arguments:
model_path Path to the model file (.gguf or .onnx).
Options:
--format {gguf,onnx} Model format. Auto-detected from extension when omitted.
--max-memory MB Warn if peak memory exceeds this limit (MB).
--gpu MODEL Target GPU model (e.g. 4090, a100, h100, or 'cpu' for
CPU-only). Overrides auto-detected hardware.
--latency MS Maximum tolerated latency (ms). Configurations above
this threshold are excluded.
--output PATH Destination path for the re-quantized output model.
--no-cache Bypass the result cache and run benchmarking even if
a cached result exists. The new result is stored after
completion.
Ollama Export (Python API)
Ollama export is available via the Python API. Call vectorprime.export_ollama(result, output_dir) to produce a Modelfile, model.gguf, and metadata.json bundle ready for ollama create. See the Python API section for a full example.
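As a rough illustration, a Modelfile carrying the tuned values could be rendered as below. FROM and PARAMETER are documented Ollama Modelfile directives, and num_thread and num_gpu are real Ollama parameters; the exact file that vectorprime.export_ollama emits is not shown in this README, so this rendering is an assumption.

```python
def render_modelfile(gguf_path, num_thread, num_gpu):
    """Render a minimal Ollama Modelfile with tuned runtime parameters."""
    return (
        f"FROM {gguf_path}\n"
        f"PARAMETER num_thread {num_thread}\n"   # CPU threads for inference
        f"PARAMETER num_gpu {num_gpu}\n"         # layers to offload to GPU
    )

print(render_modelfile("./model.gguf", 16, 20))
```

The resulting file is what `ollama create mymodel -f Modelfile` consumes.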
Convert Between Formats
# GGUF → ONNX
vectorprime convert-to-onnx model.gguf --output model.onnx
# ONNX → GGUF (metadata is round-tripped from the original GGUF when available)
vectorprime convert-to-gguf model.onnx --output model.gguf
Supported Runtimes
| Runtime | Priority | Backend Binary | Model Format | Notes |
|---|---|---|---|---|
| Ollama | Primary | ollama | GGUF | Recommended for most users |
| TensorRT | Primary | trtexec | ONNX | NVIDIA GPU, compute capability >= 7.0 |
| ONNX Runtime | Secondary | python3 + onnxruntime | ONNX | CPU and CUDA execution providers |
| llama.cpp | Deprioritized | llama-cli | GGUF | CPU + GPU offload via --n-gpu-layers |
Missing binaries return a structured NotInstalled error and are skipped — VectorPrime benchmarks whatever runtimes are present.
Caching
VectorPrime caches optimization results so repeated runs on the same model and hardware return instantly without re-running benchmarks.
Cache location: ~/.llmforge/cache/
Cache key: SHA-256 of {model_mtime}_{model_size}_{hardware_profile_json}. The key encodes both the model's identity (modification time and file size) and the full hardware profile. A result cached on one machine is not reused on a different machine, and a result cached for one model version is invalidated when the model file changes.
On cache hit: All benchmarking is skipped; the stored OptimizationResult is returned immediately.
On cache miss or read error: VectorPrime runs normally and writes the result to the cache after benchmarking completes.
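The key derivation described above can be reconstructed in a few lines. This is an illustrative sketch based on the `{model_mtime}_{model_size}_{hardware_profile_json}` layout stated in this README; the function name and JSON serialization details are assumptions:

```python
import hashlib
import json

def cache_key(model_mtime, model_size, hardware_profile):
    """SHA-256 over model identity plus the full hardware profile."""
    hw_json = json.dumps(hardware_profile, sort_keys=True)
    raw = f"{model_mtime}_{model_size}_{hw_json}"
    return hashlib.sha256(raw.encode()).hexdigest()

hw = {"cpu": {"core_count": 16, "simd_level": "AVX2"}}
k1 = cache_key(1700000000, 4_200_000_000, hw)
k2 = cache_key(1700000001, 4_200_000_000, hw)  # model touched -> new key
print(k1 != k2)  # True: changing the model file invalidates the cached result
```

Because the hardware profile is part of the preimage, a result cached on one machine never matches on another.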
Disabling the cache:
# CLI
vectorprime optimize model.gguf --no-cache
# Python API
result = vectorprime.optimize("model.gguf", use_cache=False)
How It Works
VectorPrime runs a four-stage optimization pipeline whose final stage is Bayesian search. Before Stage 1, a cache lookup is performed; if a result for the same model and hardware already exists, all benchmarking is skipped entirely.
[Cache] SHA-256 lookup in ~/.llmforge/cache/ — returns immediately on hit
[1] Hardware Profiling (0 benchmarks)
CPU cores, SIMD extensions (via raw-cpuid), GPU VRAM and compute
capability (via nvidia-smi), available RAM (via sysinfo).
[2] Model Graph Analysis (0 benchmarks)
Parses the model file — GGUF via a custom byte reader, ONNX via
protobuf — to extract parameter count, architecture, hidden size,
attention heads, KV cache size, and FLOPs per token without running
inference. Classifies workload as Memory-bound, Compute-bound, or
Balanced to guide quantization selection.
[3] Runtime Preselection (0 benchmarks)
Selects viable runtimes based on model format (GGUF or ONNX) and
available hardware. Prunes quantization options by VRAM/RAM budget.
Computes the search space: runtimes × quantizations × gpu_layers ×
threads × batch_size.
[4] Bayesian Optimization (≤ 12 benchmarks)
Runs 5 quasi-random Halton samples across the search space, then 7
Tree-structured Parzen Estimator (TPE) refinement iterations.
Each benchmark shells out to the runtime adapter (Ollama, TensorRT,
ONNX Runtime, or llama.cpp) and collects tokens/sec, latency, and
peak memory. The best configuration is returned and cached.
Falls back to full cartesian search if all 12 evaluations fail.
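The Stage 3 search-space computation is a plain cartesian product over the tuning dimensions. The candidate values below are invented for illustration; VectorPrime derives the real ones from the hardware and model analysis:

```python
from itertools import product

# Hypothetical pruned candidate sets after Stage 3.
runtimes = ["Ollama", "llama.cpp"]
quants = ["Q4_K_M", "Q8_0"]
gpu_layers = [0, 10, 20]
threads = [8, 16]
batch_sizes = [1]

search_space = list(product(runtimes, quants, gpu_layers, threads, batch_sizes))
print(len(search_space))  # 2 * 2 * 3 * 2 * 1 = 24
```

Even this toy space (24 points) shows why Stage 4 samples rather than exhaustively benchmarking every configuration.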
The result is cached to ~/.llmforge/cache/ after benchmarking, keyed by model identity and hardware profile.
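The quasi-random seeding in Stage 4 uses Halton points, which spread samples evenly across the search space before TPE refinement takes over. Below is a textbook radical-inverse Halton implementation, not VectorPrime's code:

```python
def halton(index, base):
    """Radical-inverse of `index` in `base`; the value lies in [0, 1)."""
    result, f = 0.0, 1.0
    while index > 0:
        f /= base
        result += f * (index % base)
        index //= base
    return result

def halton_point(index, bases=(2, 3)):
    """One d-dimensional Halton sample (one coprime base per dimension)."""
    return tuple(halton(index, b) for b in bases)

# Five seed points over a 2-D unit square, as the 5-sample seeding stage
# might draw them; each coordinate is later scaled to a tuning dimension.
seeds = [halton_point(i) for i in range(1, 6)]
print(seeds[0][0])  # 0.5
```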
Python API
import vectorprime
# Profile hardware
hw = vectorprime.profile_hardware()
print(hw.cpu_cores, hw.gpu_model, hw.ram_total_mb)
# Inspect a model's architecture without running inference
# Returns a dict with: format, param_count, architecture, context_length,
# layer_count, hidden_size, attention_head_count, attention_head_count_kv,
# feed_forward_length, kv_cache_size_mb, memory_footprint_mb, flops_per_token
model_info = vectorprime.analyze_model("model.gguf")
print(model_info["param_count"], model_info["architecture"], model_info["context_length"])
# Run optimization (results are cached by default in ~/.llmforge/cache/)
result = vectorprime.optimize("model.gguf", use_cache=True)
print(result.runtime, result.tokens_per_sec, result.latency_ms)
# Ollama 110.3 91.2
# Bypass the cache to force a fresh benchmark run
result = vectorprime.optimize("model.gguf", use_cache=False)
# Export an Ollama-ready bundle
manifest_json = vectorprime.export_ollama(result, "./optimized_model")
# Convert formats
vectorprime.convert_gguf_to_onnx("model.gguf", "model.onnx")
vectorprime.convert_onnx_to_gguf("model.onnx", "model-roundtrip.gguf")
Performance Example
Results from vectorprime optimize on a system with Intel Core i9-13900K (16 cores, AVX2), NVIDIA RTX 4090 (24 GB VRAM), 64 GB DDR5 RAM. Your results will vary.
| Model | Runtime | Quantization | Threads | GPU Layers | Throughput (tok/s) | Latency (ms) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | LlamaCpp | Q4_K_M | 16 | 20 | 110.3 | 91.2 | 8.2 |
| Llama 3.1 8B | LlamaCpp | Q8_0 | 16 | 10 | 74.1 | 135.4 | 12.8 |
| Mistral 7B | LlamaCpp | Q4_K_M | 16 | 20 | 118.7 | 84.2 | 7.4 |
| Mistral 7B | OnnxRuntime | Int8 | 8 | 0 | 42.3 | 236.8 | 9.1 |
| Phi-3 Mini 3.8B | TensorRT | Int8 | 8 | 33 | 198.4 | 50.4 | 5.6 |
Architecture
VectorPrime is a Rust workspace. The Python layer (CLI + helpers) sits on top of a cdylib native extension compiled via PyO3 and maturin.
python/vectorprime/cli.py (argparse CLI — 4 subcommands)
|
v
vectorprime-bindings (PyO3 cdylib — _vectorprime.so)
|
+---> vectorprime-export (Ollama bundle generation)
| |
+---> vectorprime-optimizer (search + parallel benchmark loop)
| |
| +-----+-----+
| | |
+---> vectorprime-hardware vectorprime-runtime (adapter dispatch)
| | |
+---> vectorprime-model-ir |
| |
+---> vectorprime-core <--+
(shared types/traits/errors)
| Crate | Responsibility |
|---|---|
| vectorprime-core | HardwareProfile, OptimizationResult, RuntimeAdapter trait, GpuProbe trait, RuntimeError |
| vectorprime-hardware | CPU detection (raw-cpuid), NVIDIA GPU detection (nvidia-smi), RAM (sysinfo) |
| vectorprime-model-ir | GGUF byte reader and ONNX protobuf parser; extracts architecture metadata without inference |
| vectorprime-runtime | LlamaCppAdapter, OnnxAdapter, TensorRtAdapter; adapter registry and dispatch |
| vectorprime-optimizer | 4-stage Bayesian/TPE optimization pipeline (hardware context, model context, runtime preselection, TPE search); result caching via ~/.llmforge/cache/ |
| vectorprime-export | Modelfile writer, GGUF copy, metadata.json serialization |
| vectorprime-bindings | PyO3 #[pymodule] wiring every crate into the _vectorprime extension module |
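To make the vectorprime-model-ir byte reader concrete, here is a toy parse of a GGUF header in Python. The header layout (magic "GGUF", little-endian u32 version, u64 tensor count, u64 metadata KV count) follows the public GGUF specification; the helper name is illustrative and everything else about the crate is out of scope:

```python
import struct

def read_gguf_header(buf):
    """Parse the fixed-size GGUF header from the start of a buffer."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", buf, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Build a synthetic GGUF v3 header and parse it back:
header = struct.pack("<4sIQQ", b"GGUF", 3, 291, 24)
print(read_gguf_header(header))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 24}
```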
Build from Source
For end-users: use pip install vectorprime instead. For developers and contributors who want to modify the codebase, follow the setup below. Building from source requires the Rust toolchain.
Prerequisites
| Tool | Version | Install |
|---|---|---|
| Rust toolchain | 1.75+ | curl https://sh.rustup.rs -sSf | sh |
| Python | 3.9+ | python.org |
| maturin | 1.0+ | pip install maturin |
| Python dev headers | — | sudo apt install python3-dev (Debian/Ubuntu) |
Build
git clone https://github.com/TheRadDani/llm-forge
cd llm-forge
python -m venv .venv && source .venv/bin/activate
pip install maturin pytest numpy onnxruntime
# Compile the Rust extension and install into the active venv
maturin develop
# Verify
vectorprime profile
Run Tests
# All Rust unit tests
cargo test --workspace
# Code style and lint
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
# Python integration tests (no fixtures or GPU required)
pytest tests/ -v
Contributing
Contributions are welcome — bug reports, feature requests, documentation improvements, and new runtime adapters.
- Fork the repository and create a branch from main
- Make your changes with tests
- Confirm cargo test --workspace and pytest tests/ both pass
- Open a pull request with a clear description
Adding a new runtime: Implement RuntimeAdapter in crates/vectorprime-runtime/src/ and register the adapter in the AdapterRegistry. The optimizer and Python binding layers require no changes.
See open issues for contribution ideas.
License
MIT. See LICENSE for the full text.
Acknowledgments
VectorPrime builds on:
- llama.cpp — GGUF format specification and the llama-cli/llama-quantize binaries
- ONNX Runtime — inference engine behind the ONNX adapter
- TensorRT — NVIDIA's high-performance inference library
- Ollama — local model runner that VectorPrime exports to
- PyO3 and maturin — Rust/Python interop and packaging
- Tokio — async runtime powering parallel benchmarking
- anyhow and thiserror — structured error handling