
Hardware-aware CLI that selects the best runtime and quantization for efficient LLM inference.


 ██╗   ██╗███████╗ ██████╗████████╗ ██████╗ ██████╗ ██████╗ ██████╗ ██╗███╗   ███╗███████╗
 ██║   ██║██╔════╝██╔════╝╚══██╔══╝██╔═══██╗██╔══██╗██╔══██╗██╔══██╗██║████╗ ████║██╔════╝
 ██║   ██║█████╗  ██║        ██║   ██║   ██║██████╔╝██████╔╝██████╔╝██║██╔████╔██║█████╗
 ╚██╗ ██╔╝██╔══╝  ██║        ██║   ██║   ██║██╔══██╗██╔═══╝ ██╔══██╗██║██║╚██╔╝██║██╔══╝
  ╚████╔╝ ███████╗╚██████╗   ██║   ╚██████╔╝██║  ██║██║     ██║  ██║██║██║ ╚═╝ ██║███████╗
   ╚═══╝  ╚══════╝ ╚═════╝   ╚═╝    ╚═════╝ ╚═╝  ╚═╝╚═╝     ╚═╝  ╚═╝╚═╝╚═╝     ╚═╝╚══════╝
  

Compiler-style, hardware-aware LLM inference optimizer




VectorPrime takes a model file and your hardware, then finds the fastest way to run it. It profiles your CPU, GPU, and RAM; parses the model's intermediate representation to extract architecture metadata; generates every valid combination of runtime, quantization, thread count, and GPU offload layers; benchmarks candidates in parallel; and hands you back the configuration that maximizes tokens per second within your memory budget. The result is a ready-to-use Ollama bundle — no guesswork required.

VectorPrime is built for developers and researchers who run inference locally and want reproducible, hardware-specific performance without manually tuning runtime flags or hunting for the right quantization format. The Rust backend handles parallel benchmarking and hardware detection; a PyO3 native extension exposes everything through a clean Python API and a single pip install vectorprime.


Features

| Feature | Description | Status |
|---|---|---|
| Hardware profiling | Detects CPU core count, SIMD level (AVX/AVX2/AVX512), GPU VRAM and compute capability, and available RAM | Stable |
| Model IR analysis | Reads GGUF and ONNX model files to extract parameter count, architecture, context length, layer count, hidden size, attention heads, KV cache size, memory footprint, and FLOPs without running inference | Stable |
| Multi-runtime support | Benchmarks Ollama (primary), TensorRT (primary), ONNX Runtime (secondary), and llama.cpp (deprioritized) against each other on your hardware | Stable |
| Automatic quantization selection | Evaluates F16, Q8_0, Q4_K_M, Q4_0, Int8, and Int4 and picks the fastest that fits in memory | Stable |
| Parallel benchmarking | Tokio-based async executor runs up to 3 configurations concurrently | Stable |
| Optimization result caching | Caches results to ~/.llmforge/cache/, keyed by model identity and hardware profile; skips benchmarking entirely on a cache hit | Stable |
| Ollama export | Generates a Modelfile with tuned num_thread and num_gpu values, ready for ollama create | Stable |
| Format conversion | Bidirectional GGUF-to-ONNX and ONNX-to-GGUF conversion with full metadata round-trip | Stable |
| Python API | PyO3 native extension; import and call from any Python script or notebook | Stable |
| CLI interface | profile, optimize, convert-to-onnx, and convert-to-gguf subcommands | Stable |

Quick Start

pip install vectorprime

# See what hardware VectorPrime detected
vectorprime profile

# Find the best inference configuration for a model
vectorprime optimize model.gguf

# Export the result as an Ollama bundle (Python API)
# See the Python API section below

Installation

For Users

pip install vectorprime

No Rust toolchain required! Pre-built wheels are available for:

  • Python 3.9, 3.10, 3.11, 3.12
  • Linux (x86-64, Arm64), macOS (x86-64, Arm64), Windows (x86-64)

Requirements:

  • Python 3.9 or later
  • At least one supported inference runtime installed and on PATH

Optional runtime prerequisites:

# Ollama — recommended for most users
# https://ollama.com/download

# ONNX Runtime
pip install onnxruntime          # CPU
pip install onnxruntime-gpu      # CUDA GPU

# TensorRT (NVIDIA only, compute capability >= 7.0)
# https://developer.nvidia.com/tensorrt

# llama.cpp (provides llama-cli and llama-quantize)
# https://github.com/ggml-org/llama.cpp

VectorPrime detects which runtimes are available at startup and silently skips any whose binary is not found. vectorprime profile works with no runtimes installed.
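Startup detection of this kind amounts to checking PATH for each runtime's binary. The helper below is an illustrative sketch using only the standard library, not VectorPrime's actual detection code:

```python
import shutil

# Binaries each runtime is expected to expose on PATH,
# per the runtime table in this README.
RUNTIME_BINARIES = {
    "ollama": "ollama",
    "tensorrt": "trtexec",
    "onnxruntime": "python3",   # plus an importable onnxruntime package
    "llama.cpp": "llama-cli",
}

def available_runtimes(binaries=RUNTIME_BINARIES):
    """Return the subset of runtimes whose binary is found on PATH."""
    return {name for name, exe in binaries.items() if shutil.which(exe)}
```

Any runtime missing from the returned set would simply be excluded from benchmarking.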


Usage

Profile Hardware

vectorprime profile

Prints a JSON hardware profile to stdout:

{
  "cpu": { "core_count": 16, "simd_level": "AVX2" },
  "gpu": { "name": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, "compute_capability": [8, 9] },
  "ram": { "total_mb": 65536, "available_mb": 48000 }
}
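Because the profile is plain JSON on stdout, it is easy to consume programmatically. A minimal sketch (the `fits_on_gpu` helper and its headroom factor are hypothetical, not part of VectorPrime):

```python
import json

# In practice: subprocess.run(["vectorprime", "profile"], capture_output=True).stdout
profile_json = '''{
  "cpu": { "core_count": 16, "simd_level": "AVX2" },
  "gpu": { "name": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, "compute_capability": [8, 9] },
  "ram": { "total_mb": 65536, "available_mb": 48000 }
}'''

profile = json.loads(profile_json)

def fits_on_gpu(model_mb, profile, headroom=0.9):
    """Hypothetical check: does a model of model_mb fit in VRAM with headroom?"""
    return model_mb <= profile["gpu"]["vram_mb"] * headroom

print(fits_on_gpu(8200, profile))
```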

Optimize a Model

vectorprime optimize model.gguf
─────────────────────────────────────
VectorPrime Optimization Result
─────────────────────────────────────
Runtime:       Ollama
Quantization:  Q4_K_M
Threads:       16
GPU Layers:    20
Throughput:    110.3 tokens/sec
Latency:       91.2 ms
Memory:        8.2 GB peak
─────────────────────────────────────
Optimized model written to: model-optimized.gguf

Options:

vectorprime optimize <model_path> [OPTIONS]

Arguments:
  model_path              Path to the model file (.gguf or .onnx).

Options:
  --format {gguf,onnx}    Model format. Auto-detected from extension when omitted.
  --max-memory MB         Warn if peak memory exceeds this limit (MB).
  --gpu MODEL             Target GPU model (e.g. 4090, a100, h100, or 'cpu' for
                          CPU-only). Overrides auto-detected hardware.
  --latency MS            Maximum tolerated latency (ms). Configurations above
                          this threshold are excluded.
  --output PATH           Destination path for the re-quantized output model.
  --no-cache              Bypass the result cache and run benchmarking even if
                          a cached result exists. The new result is stored after
                          completion.
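When driving the CLI from scripts, it can help to assemble the command line programmatically from the documented flags. This helper is illustrative; it only builds the argument list and does not execute anything:

```python
def build_optimize_cmd(model_path, gpu=None, max_memory=None, latency=None,
                       output=None, no_cache=False):
    """Assemble a `vectorprime optimize` invocation from the documented options."""
    cmd = ["vectorprime", "optimize", model_path]
    if gpu:
        cmd += ["--gpu", gpu]
    if max_memory:
        cmd += ["--max-memory", str(max_memory)]
    if latency:
        cmd += ["--latency", str(latency)]
    if output:
        cmd += ["--output", output]
    if no_cache:
        cmd.append("--no-cache")
    return cmd

# e.g. subprocess.run(build_optimize_cmd("model.gguf", gpu="4090", no_cache=True))
```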

Ollama Export (Python API)

Ollama export is available via the Python API. Call vectorprime.export_ollama(result, output_dir) to produce a Modelfile, model.gguf, and metadata.json bundle ready for ollama create. See the Python API section for a full example.
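The bundle's Modelfile carries the tuned parameters described above. The rendering below is an illustrative sketch of what such a Modelfile might contain, not a byte-for-byte reproduction of VectorPrime's output:

```python
def render_modelfile(threads, gpu_layers, model_file="model.gguf"):
    """Sketch of an Ollama Modelfile with tuned num_thread / num_gpu values.
    Illustrative only; VectorPrime's actual output may differ in detail."""
    return (
        f"FROM ./{model_file}\n"
        f"PARAMETER num_thread {threads}\n"
        f"PARAMETER num_gpu {gpu_layers}\n"
    )

print(render_modelfile(16, 20))
# Register the bundle with: ollama create my-optimized-model -f Modelfile
```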

Convert Between Formats

# GGUF → ONNX
vectorprime convert-to-onnx model.gguf --output model.onnx

# ONNX → GGUF (metadata is round-tripped from the original GGUF when available)
vectorprime convert-to-gguf model.onnx --output model.gguf

Supported Runtimes

| Runtime | Priority | Backend Binary | Model Format | Notes |
|---|---|---|---|---|
| Ollama | Primary | ollama | GGUF | Recommended for most users |
| TensorRT | Primary | trtexec | ONNX | NVIDIA GPU, compute capability >= 7.0 |
| ONNX Runtime | Secondary | python3 + onnxruntime | ONNX | CPU and CUDA execution providers |
| llama.cpp | Deprioritized | llama-cli | GGUF | CPU + GPU offload via --n-gpu-layers |

Missing binaries return a structured NotInstalled error and are skipped — VectorPrime benchmarks whatever runtimes are present.


Caching

VectorPrime caches optimization results so repeated runs on the same model and hardware return instantly without re-running benchmarks.

Cache location: ~/.llmforge/cache/

Cache key: SHA-256 of {model_mtime}_{model_size}_{hardware_profile_json}. The key encodes both the model's identity (modification time and file size) and the full hardware profile. A result cached on one machine is not reused on a different machine, and a result cached for one model version is invalidated when the model file changes.
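The key derivation is simple enough to reproduce. This sketch mirrors the scheme described above; the exact field formatting and JSON serialization are assumptions, not VectorPrime's actual code:

```python
import hashlib
import json
import os

def cache_key(model_path, hardware_profile):
    """SHA-256 over model mtime, model size, and the hardware profile JSON.
    Field formatting here is an assumption, not VectorPrime's implementation."""
    st = os.stat(model_path)
    raw = f"{st.st_mtime}_{st.st_size}_{json.dumps(hardware_profile, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Changing either the model file or any profile field yields a different key, which is what invalidates the cache across model updates and machines.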

On cache hit: All benchmarking is skipped; the stored OptimizationResult is returned immediately.

On cache miss or read error: VectorPrime runs normally and writes the result to the cache after benchmarking completes.

Disabling the cache:

# CLI
vectorprime optimize model.gguf --no-cache
# Python API
result = vectorprime.optimize("model.gguf", use_cache=False)

How It Works

VectorPrime runs a 4-stage Bayesian optimization pipeline. Before Stage 1, a cache lookup is performed — if a result for the same model and hardware already exists, all benchmarking is skipped entirely.

[Cache] SHA-256 lookup in ~/.llmforge/cache/ — returns immediately on hit

[1] Hardware Profiling (0 benchmarks)
      CPU cores, SIMD extensions (via raw-cpuid), GPU VRAM and compute
      capability (via nvidia-smi), available RAM (via sysinfo).

[2] Model Graph Analysis (0 benchmarks)
      Parses the model file — GGUF via a custom byte reader, ONNX via
      protobuf — to extract parameter count, architecture, hidden size,
      attention heads, KV cache size, and FLOPs per token without running
      inference. Classifies workload as Memory-bound, Compute-bound, or
      Balanced to guide quantization selection.

[3] Runtime Preselection (0 benchmarks)
      Selects viable runtimes based on model format (GGUF or ONNX) and
      available hardware. Prunes quantization options by VRAM/RAM budget.
      Computes the search space: runtimes × quantizations × gpu_layers ×
      threads × batch_size.

[4] Bayesian Optimization (≤ 12 benchmarks)
      Runs 5 quasi-random Halton samples across the search space, then 7
      Tree-structured Parzen Estimator (TPE) refinement iterations.
      Each benchmark shells out to the runtime adapter (Ollama, TensorRT,
      ONNX Runtime, or llama.cpp) and collects tokens/sec, latency, and
      peak memory. The best configuration is returned and cached.
      Falls back to full cartesian search if all 12 evaluations fail.

The result is cached to ~/.llmforge/cache/ after benchmarking, keyed by model identity and hardware profile.
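Stage 4's quasi-random warm-up can be illustrated with a standard Halton low-discrepancy generator. This is a generic sketch of Halton sampling over a two-dimensional search space, not VectorPrime's optimizer; the dimension choices and bounds are assumptions:

```python
def halton(index, base):
    """Return the index-th element of the base-b Halton sequence in [0, 1)."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def warmup_samples(n, max_threads=16, max_gpu_layers=33):
    """Map n quasi-random points onto (thread count, GPU layers),
    as a Halton warm-up stage might before TPE refinement."""
    return [
        (1 + round(halton(i, 2) * (max_threads - 1)),
         round(halton(i, 3) * max_gpu_layers))
        for i in range(1, n + 1)
    ]

print(warmup_samples(5))
```

Unlike uniform random draws, consecutive Halton points spread evenly across the space, so even five warm-up benchmarks cover the search space reasonably well.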


Python API

import vectorprime

# Profile hardware
hw = vectorprime.profile_hardware()
print(hw.cpu_cores, hw.gpu_model, hw.ram_total_mb)

# Inspect a model's architecture without running inference
# Returns a dict with: format, param_count, architecture, context_length,
# layer_count, hidden_size, attention_head_count, attention_head_count_kv,
# feed_forward_length, kv_cache_size_mb, memory_footprint_mb, flops_per_token
model_info = vectorprime.analyze_model("model.gguf")
print(model_info["param_count"], model_info["architecture"], model_info["context_length"])

# Run optimization (results are cached by default in ~/.llmforge/cache/)
result = vectorprime.optimize("model.gguf", use_cache=True)
print(result.runtime, result.tokens_per_sec, result.latency_ms)
# Ollama  110.3  91.2

# Bypass the cache to force a fresh benchmark run
result = vectorprime.optimize("model.gguf", use_cache=False)

# Export an Ollama-ready bundle
manifest_json = vectorprime.export_ollama(result, "./optimized_model")

# Convert formats
vectorprime.convert_gguf_to_onnx("model.gguf", "model.onnx")
vectorprime.convert_onnx_to_gguf("model.onnx", "model-roundtrip.gguf")
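The Memory-bound/Compute-bound classification mentioned in How It Works can be approximated from analyze_model's fields via arithmetic intensity (FLOPs per byte moved per token). The thresholds below are illustrative assumptions, not VectorPrime's actual cutoffs:

```python
def classify_workload(flops_per_token, memory_footprint_mb, threshold=50.0):
    """Rough classification by arithmetic intensity: FLOPs per byte of
    weights streamed per token. Thresholds are illustrative assumptions."""
    bytes_per_token = memory_footprint_mb * 1024 * 1024
    intensity = flops_per_token / bytes_per_token
    if intensity < threshold * 0.5:
        return "Memory-bound"
    if intensity > threshold * 2:
        return "Compute-bound"
    return "Balanced"

# A quantized ~8B model (~16 GFLOPs/token, ~4.5 GB of weights) is
# typically memory-bound at batch size 1:
print(classify_workload(16e9, 4500))
```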

Performance Example

Results from vectorprime optimize on a system with Intel Core i9-13900K (16 cores, AVX-512), NVIDIA RTX 4090 (24 GB VRAM), 64 GB DDR5 RAM. Your results will vary.

| Model | Runtime | Quantization | Threads | GPU Layers | Throughput (tok/s) | Latency (ms) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | LlamaCpp | Q4_K_M | 16 | 20 | 110.3 | 91.2 | 8.2 |
| Llama 3.1 8B | LlamaCpp | Q8_0 | 16 | 10 | 74.1 | 135.4 | 12.8 |
| Mistral 7B | LlamaCpp | Q4_K_M | 16 | 20 | 118.7 | 84.2 | 7.4 |
| Mistral 7B | OnnxRuntime | Int8 | 8 | 0 | 42.3 | 236.8 | 9.1 |
| Phi-3 Mini 3.8B | TensorRT | Int8 | 8 | 33 | 198.4 | 50.4 | 5.6 |

Architecture

VectorPrime is a Rust workspace. The Python layer (CLI + helpers) sits on top of a cdylib native extension compiled via PyO3 and maturin.

python/vectorprime/cli.py         (argparse CLI — 4 subcommands)
          |
          v
vectorprime-bindings              (PyO3 cdylib — _vectorprime.so)
          |
          +---> vectorprime-export      (Ollama bundle generation)
          |           |
          +---> vectorprime-optimizer   (search + parallel benchmark loop)
          |           |
          |     +-----+-----+
          |     |           |
          +---> vectorprime-hardware    vectorprime-runtime  (adapter dispatch)
          |     |                             |
          +---> vectorprime-model-ir          |
                          |                  |
                          +---> vectorprime-core <--+
                               (shared types/traits/errors)
| Crate | Responsibility |
|---|---|
| vectorprime-core | HardwareProfile, OptimizationResult, RuntimeAdapter trait, GpuProbe trait, RuntimeError |
| vectorprime-hardware | CPU detection (raw-cpuid), NVIDIA GPU detection (nvidia-smi), RAM (sysinfo) |
| vectorprime-model-ir | GGUF byte reader and ONNX protobuf parser; extracts architecture metadata without inference |
| vectorprime-runtime | LlamaCppAdapter, OnnxAdapter, TensorRtAdapter; adapter registry and dispatch |
| vectorprime-optimizer | 4-stage Bayesian/TPE optimization pipeline (hardware context, model context, runtime preselection, TPE search); result caching via ~/.llmforge/cache/ |
| vectorprime-export | Modelfile writer, GGUF copy, metadata.json serialization |
| vectorprime-bindings | PyO3 #[pymodule] wiring every crate into the _vectorprime extension module |

Build from Source

For end users: use pip install vectorprime instead. The steps below are for developers and contributors who want to modify the codebase; building from source requires the Rust toolchain.

Prerequisites

| Tool | Version | Install |
|---|---|---|
| Rust toolchain | 1.75+ | curl https://sh.rustup.rs -sSf \| sh |
| Python | 3.9+ | python.org |
| maturin | 1.0+ | pip install maturin |
| Python dev headers | | sudo apt install python3-dev (Debian/Ubuntu) |

Build

git clone https://github.com/TheRadDani/llm-forge
cd llm-forge

python -m venv .venv && source .venv/bin/activate
pip install maturin pytest numpy onnxruntime

# Compile the Rust extension and install into the active venv
maturin develop

# Verify
vectorprime profile

Run Tests

# All Rust unit tests
cargo test --workspace

# Code style and lint
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

# Python integration tests (no fixtures or GPU required)
pytest tests/ -v

Contributing

Contributions are welcome — bug reports, feature requests, documentation improvements, and new runtime adapters.

  1. Fork the repository and create a branch from main
  2. Make your changes with tests
  3. Confirm cargo test --workspace and pytest tests/ both pass
  4. Open a pull request with a clear description

Adding a new runtime: Implement RuntimeAdapter in crates/vectorprime-runtime/src/ and register the adapter in the AdapterRegistry. The optimizer and Python binding layers require no changes.

See open issues for contribution ideas.


License

MIT. See LICENSE for the full text.


Acknowledgments

VectorPrime builds on:

  • llama.cpp — GGUF format specification and the llama-cli / llama-quantize binaries
  • ONNX Runtime — inference engine behind the ONNX adapter
  • TensorRT — NVIDIA's high-performance inference library
  • Ollama — local model runner that VectorPrime exports to
  • PyO3 and maturin — Rust/Python interop and packaging
  • Tokio — async runtime powering parallel benchmarking
  • anyhow and thiserror — structured error handling



Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • vectorprime-0.6.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.13, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.12, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.11, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.10, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.9, manylinux glibc 2.17+, x86-64

File hashes

vectorprime-0.6.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       7abf04fc7a9573d6678c38cea50653a3bf192721d3020e3e922f0c4cfbb08489
  MD5          b5d114c0c6c2923f40ae86c31750431b
  BLAKE2b-256  fb22ae8b846ebde3a1515a1c2ce1e68a53db149c3021aced23d566542cfaaf92

vectorprime-0.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       ed51cc64ceec299e10d57e745e4d7e909e30cc6733c6755243a91def8defe914
  MD5          3ba6fdcece434ed2d76e28eeed71cb52
  BLAKE2b-256  8f9462093b588a03b4680bea727bbb7109c37fe0a8c81607871b26e995204890

vectorprime-0.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       aa8bd42d2fdef1a33aa0f8ea37771f0e8cec2e193baa75f8b03daf60f988825f
  MD5          71a8ac78eea75cc80046e4e729ccabf1
  BLAKE2b-256  d02fdb024f484b1c3892dabb250b6b36977051077b5c73fc42ce83e60b4cb4aa

vectorprime-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       2e9bec0734cb18a43a6000fc1474b4737db28f23207195934d77b5a7b3983339
  MD5          d9e36066cea980b750eeb7b2d4529d9d
  BLAKE2b-256  27f1a98d8fb26dd847291c2a02d29275194401f8dcb63fa2b52f0f2123e51003

vectorprime-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       5e397749f0914acf6201f9f284bc452b19a10c07022204777f012413bd8b664a
  MD5          922f6c1eb94596b6d6b4e27b093cd78a
  BLAKE2b-256  17227aa6cf00b9ab96c0df794a3fec9f8f529c8d1ce3d9e2d7b3aff755f86691
