
Hardware-aware CLI that selects the best runtime and quantization for efficient LLM inference.


 ██╗   ██╗███████╗ ██████╗████████╗ ██████╗ ██████╗ ██████╗ ██████╗ ██╗███╗   ███╗███████╗
 ██║   ██║██╔════╝██╔════╝╚══██╔══╝██╔═══██╗██╔══██╗██╔══██╗██╔══██╗██║████╗ ████║██╔════╝
 ██║   ██║█████╗  ██║        ██║   ██║   ██║██████╔╝██████╔╝██████╔╝██║██╔████╔██║█████╗
 ╚██╗ ██╔╝██╔══╝  ██║        ██║   ██║   ██║██╔══██╗██╔═══╝ ██╔══██╗██║██║╚██╔╝██║██╔══╝
  ╚████╔╝ ███████╗╚██████╗   ██║   ╚██████╔╝██║  ██║██║     ██║  ██║██║██║ ╚═╝ ██║███████╗
   ╚═══╝  ╚══════╝ ╚═════╝   ╚═╝    ╚═════╝ ╚═╝  ╚═╝╚═╝     ╚═╝  ╚═╝╚═╝╚═╝     ╚═╝╚══════╝
  

Compiler-style, hardware-aware LLM inference optimizer




VectorPrime takes a model file and your hardware, then finds the fastest way to run it. It profiles your CPU, GPU, and RAM; parses the model's intermediate representation to extract architecture metadata; generates every valid combination of runtime, quantization, thread count, and GPU offload layers; benchmarks candidates in parallel; and hands you back the configuration that maximizes tokens per second within your memory budget. The result is a ready-to-use Ollama bundle — no guesswork required.

VectorPrime is built for developers and researchers who run inference locally and want reproducible, hardware-specific performance without manually tuning runtime flags or hunting for the right quantization format. The Rust backend handles parallel benchmarking and hardware detection; a PyO3 native extension exposes everything through a clean Python API and a single pip install vectorprime.


Features

| Feature | Description | Status |
|---|---|---|
| Hardware profiling | Detects CPU core count, SIMD level (AVX/AVX2/AVX512), GPU VRAM and compute capability, and available RAM | Stable |
| Model IR analysis | Reads GGUF and ONNX model files to extract parameter count, architecture, context length, layer count, hidden size, attention heads, KV cache size, memory footprint, and FLOPs without running inference | Stable |
| Multi-runtime support | Benchmarks Ollama (primary), TensorRT (primary), ONNX Runtime (secondary), and llama.cpp (deprioritized) against each other on your hardware | Stable |
| Automatic quantization selection | Evaluates F16, Q8_0, Q4_K_M, Q4_0, Int8, and Int4 and picks the fastest that fits in memory | Stable |
| Parallel benchmarking | Tokio-based async executor runs up to 3 configurations concurrently | Stable |
| Optimization result caching | Caches results to ~/.llmforge/cache/, keyed by model identity and hardware profile; skips benchmarking entirely on a cache hit | Stable |
| Ollama export | Generates a Modelfile with tuned num_thread and num_gpu values, ready for ollama create | Stable |
| Format conversion | Bidirectional GGUF-to-ONNX and ONNX-to-GGUF conversion with full metadata round-trip | Stable |
| Python API | PyO3 native extension; import and call from any Python script or notebook | Stable |
| CLI interface | profile, optimize, convert-to-onnx, and convert-to-gguf subcommands | Stable |

Quick Start

pip install vectorprime

# See what hardware VectorPrime detected
vectorprime profile

# Find the best inference configuration for a model
vectorprime optimize model.gguf

# Export the result as an Ollama bundle (Python API)
# See the Python API section below

Installation

For Users

pip install vectorprime

No Rust toolchain required! Pre-built wheels are available for:

  • Python 3.9, 3.10, 3.11, 3.12
  • Linux (x86-64, Arm64), macOS (x86-64, Arm64), Windows (x86-64)

Requirements:

  • Python 3.9 or later
  • At least one supported inference runtime installed and on PATH

Optional runtime prerequisites:

# Ollama — recommended for most users
# https://ollama.com/download

# ONNX Runtime
pip install onnxruntime          # CPU
pip install onnxruntime-gpu      # CUDA GPU

# TensorRT (NVIDIA only, compute capability >= 7.0)
# https://developer.nvidia.com/tensorrt

# llama.cpp (provides llama-cli and llama-quantize)
# https://github.com/ggml-org/llama.cpp

VectorPrime detects which runtimes are available at startup and silently skips any whose binary is not found. vectorprime profile works with no runtimes installed.
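Startup detection of this kind amounts to checking PATH for each runtime's binary. The helper below is an illustrative sketch using only the standard library, not VectorPrime's actual detection code:

```python
import shutil

# Binaries each runtime is expected to expose on PATH,
# per the runtime table in this README.
RUNTIME_BINARIES = {
    "ollama": "ollama",
    "tensorrt": "trtexec",
    "onnxruntime": "python3",   # plus an importable onnxruntime package
    "llama.cpp": "llama-cli",
}

def available_runtimes(binaries=RUNTIME_BINARIES):
    """Return the subset of runtimes whose binary is found on PATH."""
    return {name for name, exe in binaries.items() if shutil.which(exe)}
```

Any runtime missing from the returned set would simply be excluded from benchmarking.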


Usage

Profile Hardware

vectorprime profile

Prints a JSON hardware profile to stdout:

{
  "cpu": { "core_count": 16, "simd_level": "AVX2" },
  "gpu": { "name": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, "compute_capability": [8, 9] },
  "ram": { "total_mb": 65536, "available_mb": 48000 }
}
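Because the profile is plain JSON on stdout, it is easy to consume programmatically. A minimal sketch (the `fits_on_gpu` helper and its headroom factor are hypothetical, not part of VectorPrime):

```python
import json

# In practice: subprocess.run(["vectorprime", "profile"], capture_output=True).stdout
profile_json = '''{
  "cpu": { "core_count": 16, "simd_level": "AVX2" },
  "gpu": { "name": "NVIDIA GeForce RTX 4090", "vram_mb": 24564, "compute_capability": [8, 9] },
  "ram": { "total_mb": 65536, "available_mb": 48000 }
}'''

profile = json.loads(profile_json)

def fits_on_gpu(model_mb, profile, headroom=0.9):
    """Hypothetical check: does a model of model_mb fit in VRAM with headroom?"""
    return model_mb <= profile["gpu"]["vram_mb"] * headroom

print(fits_on_gpu(8200, profile))
```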

Optimize a Model

vectorprime optimize model.gguf
─────────────────────────────────────
VectorPrime Optimization Result
─────────────────────────────────────
Runtime:       Ollama
Quantization:  Q4_K_M
Threads:       16
GPU Layers:    20
Throughput:    110.3 tokens/sec
Latency:       91.2 ms
Memory:        8.2 GB peak
─────────────────────────────────────
Optimized model written to: model-optimized.gguf

Options:

vectorprime optimize <model_path> [OPTIONS]

Arguments:
  model_path              Path to the model file (.gguf or .onnx).

Options:
  --format {gguf,onnx}    Model format. Auto-detected from extension when omitted.
  --max-memory MB         Warn if peak memory exceeds this limit (MB).
  --gpu MODEL             Target GPU model (e.g. 4090, a100, h100, or 'cpu' for
                          CPU-only). Overrides auto-detected hardware.
  --latency MS            Maximum tolerated latency (ms). Configurations above
                          this threshold are excluded.
  --output PATH           Destination path for the re-quantized output model.
  --no-cache              Bypass the result cache and run benchmarking even if
                          a cached result exists. The new result is stored after
                          completion.
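When driving the CLI from scripts, it can help to assemble the command line programmatically from the documented flags. This helper is illustrative; it only builds the argument list and does not execute anything:

```python
def build_optimize_cmd(model_path, gpu=None, max_memory=None, latency=None,
                       output=None, no_cache=False):
    """Assemble a `vectorprime optimize` invocation from the documented options."""
    cmd = ["vectorprime", "optimize", model_path]
    if gpu:
        cmd += ["--gpu", gpu]
    if max_memory:
        cmd += ["--max-memory", str(max_memory)]
    if latency:
        cmd += ["--latency", str(latency)]
    if output:
        cmd += ["--output", output]
    if no_cache:
        cmd.append("--no-cache")
    return cmd

# e.g. subprocess.run(build_optimize_cmd("model.gguf", gpu="4090", no_cache=True))
```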

Ollama Export (Python API)

Ollama export is available via the Python API. Call vectorprime.export_ollama(result, output_dir) to produce a Modelfile, model.gguf, and metadata.json bundle ready for ollama create. See the Python API section for a full example.
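The bundle's Modelfile carries the tuned parameters described above. The rendering below is an illustrative sketch of what such a Modelfile might contain, not a byte-for-byte reproduction of VectorPrime's output:

```python
def render_modelfile(threads, gpu_layers, model_file="model.gguf"):
    """Sketch of an Ollama Modelfile with tuned num_thread / num_gpu values.
    Illustrative only; VectorPrime's actual output may differ in detail."""
    return (
        f"FROM ./{model_file}\n"
        f"PARAMETER num_thread {threads}\n"
        f"PARAMETER num_gpu {gpu_layers}\n"
    )

print(render_modelfile(16, 20))
# Register the bundle with: ollama create my-optimized-model -f Modelfile
```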

Convert Between Formats

# GGUF → ONNX
vectorprime convert-to-onnx model.gguf --output model.onnx

# ONNX → GGUF (metadata is round-tripped from the original GGUF when available)
vectorprime convert-to-gguf model.onnx --output model.gguf

Supported Runtimes

| Runtime | Priority | Backend Binary | Model Format | Notes |
|---|---|---|---|---|
| Ollama | Primary | ollama | GGUF | Recommended for most users |
| TensorRT | Primary | trtexec | ONNX | NVIDIA GPU, compute capability >= 7.0 |
| ONNX Runtime | Secondary | python3 + onnxruntime | ONNX | CPU and CUDA execution providers |
| llama.cpp | Deprioritized | llama-cli | GGUF | CPU + GPU offload via --n-gpu-layers |

Missing binaries return a structured NotInstalled error and are skipped — VectorPrime benchmarks whatever runtimes are present.


Caching

VectorPrime caches optimization results so repeated runs on the same model and hardware return instantly without re-running benchmarks.

Cache location: ~/.llmforge/cache/

Cache key: SHA-256 of {model_mtime}_{model_size}_{hardware_profile_json}. The key encodes both the model's identity (modification time and file size) and the full hardware profile. A result cached on one machine is not reused on a different machine, and a result cached for one model version is invalidated when the model file changes.
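The key derivation is simple enough to reproduce. This sketch mirrors the scheme described above; the exact field formatting and JSON serialization are assumptions, not VectorPrime's actual code:

```python
import hashlib
import json
import os

def cache_key(model_path, hardware_profile):
    """SHA-256 over model mtime, model size, and the hardware profile JSON.
    Field formatting here is an assumption, not VectorPrime's implementation."""
    st = os.stat(model_path)
    raw = f"{st.st_mtime}_{st.st_size}_{json.dumps(hardware_profile, sort_keys=True)}"
    return hashlib.sha256(raw.encode()).hexdigest()
```

Changing either the model file or any profile field yields a different key, which is what invalidates the cache across model updates and machines.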

On cache hit: All benchmarking is skipped; the stored OptimizationResult is returned immediately.

On cache miss or read error: VectorPrime runs normally and writes the result to the cache after benchmarking completes.

Disabling the cache:

# CLI
vectorprime optimize model.gguf --no-cache
# Python API
result = vectorprime.optimize("model.gguf", use_cache=False)

How It Works

VectorPrime runs a 4-stage Bayesian optimization pipeline. Before Stage 1, a cache lookup is performed — if a result for the same model and hardware already exists, all benchmarking is skipped entirely.

[Cache] SHA-256 lookup in ~/.llmforge/cache/ — returns immediately on hit

[1] Hardware Profiling (0 benchmarks)
      CPU cores, SIMD extensions (via raw-cpuid), GPU VRAM and compute
      capability (via nvidia-smi), available RAM (via sysinfo).

[2] Model Graph Analysis (0 benchmarks)
      Parses the model file — GGUF via a custom byte reader, ONNX via
      protobuf — to extract parameter count, architecture, hidden size,
      attention heads, KV cache size, and FLOPs per token without running
      inference. Classifies workload as Memory-bound, Compute-bound, or
      Balanced to guide quantization selection.

[3] Runtime Preselection (0 benchmarks)
      Selects viable runtimes based on model format (GGUF or ONNX) and
      available hardware. Prunes quantization options by VRAM/RAM budget.
      Computes the search space: runtimes × quantizations × gpu_layers ×
      threads × batch_size.

[4] Bayesian Optimization (≤ 12 benchmarks)
      Runs 5 quasi-random Halton samples across the search space, then 7
      Tree-structured Parzen Estimator (TPE) refinement iterations.
      Each benchmark shells out to the runtime adapter (Ollama, TensorRT,
      ONNX Runtime, or llama.cpp) and collects tokens/sec, latency, and
      peak memory. The best configuration is returned and cached.
      Falls back to full cartesian search if all 12 evaluations fail.

The result is cached to ~/.llmforge/cache/ after benchmarking, keyed by model identity and hardware profile.
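Stage 4's quasi-random warm-up can be illustrated with a standard Halton low-discrepancy generator. This is a generic sketch of Halton sampling over a two-dimensional search space, not VectorPrime's optimizer; the dimension choices and bounds are assumptions:

```python
def halton(index, base):
    """Return the index-th element of the base-b Halton sequence in [0, 1)."""
    f, r = 1.0, 0.0
    while index > 0:
        f /= base
        r += f * (index % base)
        index //= base
    return r

def warmup_samples(n, max_threads=16, max_gpu_layers=33):
    """Map n quasi-random points onto (thread count, GPU layers),
    as a Halton warm-up stage might before TPE refinement."""
    return [
        (1 + round(halton(i, 2) * (max_threads - 1)),
         round(halton(i, 3) * max_gpu_layers))
        for i in range(1, n + 1)
    ]

print(warmup_samples(5))
```

Unlike uniform random draws, consecutive Halton points spread evenly across the space, so even five warm-up benchmarks cover the search space reasonably well.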


Python API

import vectorprime

# Profile hardware
hw = vectorprime.profile_hardware()
print(hw.cpu_cores, hw.gpu_model, hw.ram_total_mb)

# Inspect a model's architecture without running inference
# Returns a dict with: format, param_count, architecture, context_length,
# layer_count, hidden_size, attention_head_count, attention_head_count_kv,
# feed_forward_length, kv_cache_size_mb, memory_footprint_mb, flops_per_token
model_info = vectorprime.analyze_model("model.gguf")
print(model_info["param_count"], model_info["architecture"], model_info["context_length"])

# Run optimization (results are cached by default in ~/.llmforge/cache/)
result = vectorprime.optimize("model.gguf", use_cache=True)
print(result.runtime, result.tokens_per_sec, result.latency_ms)
# Ollama  110.3  91.2

# Bypass the cache to force a fresh benchmark run
result = vectorprime.optimize("model.gguf", use_cache=False)

# Export an Ollama-ready bundle
manifest_json = vectorprime.export_ollama(result, "./optimized_model")

# Convert formats
vectorprime.convert_gguf_to_onnx("model.gguf", "model.onnx")
vectorprime.convert_onnx_to_gguf("model.onnx", "model-roundtrip.gguf")
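The Memory-bound/Compute-bound classification mentioned in How It Works can be approximated from analyze_model's fields via arithmetic intensity (FLOPs per byte moved per token). The thresholds below are illustrative assumptions, not VectorPrime's actual cutoffs:

```python
def classify_workload(flops_per_token, memory_footprint_mb, threshold=50.0):
    """Rough classification by arithmetic intensity: FLOPs per byte of
    weights streamed per token. Thresholds are illustrative assumptions."""
    bytes_per_token = memory_footprint_mb * 1024 * 1024
    intensity = flops_per_token / bytes_per_token
    if intensity < threshold * 0.5:
        return "Memory-bound"
    if intensity > threshold * 2:
        return "Compute-bound"
    return "Balanced"

# A quantized ~8B model (~16 GFLOPs/token, ~4.5 GB of weights) is
# typically memory-bound at batch size 1:
print(classify_workload(16e9, 4500))
```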

Performance Example

Results from vectorprime optimize on a system with Intel Core i9-13900K (16 cores, AVX-512), NVIDIA RTX 4090 (24 GB VRAM), 64 GB DDR5 RAM. Your results will vary.

| Model | Runtime | Quantization | Threads | GPU Layers | Throughput (tok/s) | Latency (ms) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | LlamaCpp | Q4_K_M | 16 | 20 | 110.3 | 91.2 | 8.2 |
| Llama 3.1 8B | LlamaCpp | Q8_0 | 16 | 10 | 74.1 | 135.4 | 12.8 |
| Mistral 7B | LlamaCpp | Q4_K_M | 16 | 20 | 118.7 | 84.2 | 7.4 |
| Mistral 7B | OnnxRuntime | Int8 | 8 | 0 | 42.3 | 236.8 | 9.1 |
| Phi-3 Mini 3.8B | TensorRT | Int8 | 8 | 33 | 198.4 | 50.4 | 5.6 |

Architecture

VectorPrime is a Rust workspace. The Python layer (CLI + helpers) sits on top of a cdylib native extension compiled via PyO3 and maturin.

python/vectorprime/cli.py         (argparse CLI — 4 subcommands)
          |
          v
vectorprime-bindings              (PyO3 cdylib — _vectorprime.so)
          |
          +---> vectorprime-export      (Ollama bundle generation)
          |           |
          +---> vectorprime-optimizer   (search + parallel benchmark loop)
          |           |
          |     +-----+-----+
          |     |           |
          +---> vectorprime-hardware    vectorprime-runtime  (adapter dispatch)
          |     |                             |
          +---> vectorprime-model-ir          |
                          |                  |
                          +---> vectorprime-core <--+
                               (shared types/traits/errors)
| Crate | Responsibility |
|---|---|
| vectorprime-core | HardwareProfile, OptimizationResult, RuntimeAdapter trait, GpuProbe trait, RuntimeError |
| vectorprime-hardware | CPU detection (raw-cpuid), NVIDIA GPU detection (nvidia-smi), RAM (sysinfo) |
| vectorprime-model-ir | GGUF byte reader and ONNX protobuf parser; extracts architecture metadata without inference |
| vectorprime-runtime | LlamaCppAdapter, OnnxAdapter, TensorRtAdapter; adapter registry and dispatch |
| vectorprime-optimizer | 4-stage Bayesian/TPE optimization pipeline (hardware context, model context, runtime preselection, TPE search); result caching via ~/.llmforge/cache/ |
| vectorprime-export | Modelfile writer, GGUF copy, metadata.json serialization |
| vectorprime-bindings | PyO3 #[pymodule] wiring every crate into the _vectorprime extension module |

Build from Source

For end users: use pip install vectorprime instead. The steps below are for developers and contributors who want to modify the codebase; building from source requires the Rust toolchain.

Prerequisites

| Tool | Version | Install |
|---|---|---|
| Rust toolchain | 1.75+ | curl https://sh.rustup.rs -sSf \| sh |
| Python | 3.9+ | python.org |
| maturin | 1.0+ | pip install maturin |
| Python dev headers | | sudo apt install python3-dev (Debian/Ubuntu) |

Build

git clone https://github.com/TheRadDani/llm-forge
cd llm-forge

python -m venv .venv && source .venv/bin/activate
pip install maturin pytest numpy onnxruntime

# Compile the Rust extension and install into the active venv
maturin develop

# Verify
vectorprime profile

Run Tests

# All Rust unit tests
cargo test --workspace

# Code style and lint
cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings

# Python integration tests (no fixtures or GPU required)
pytest tests/ -v

Contributing

Contributions are welcome — bug reports, feature requests, documentation improvements, and new runtime adapters.

  1. Fork the repository and create a branch from main
  2. Make your changes with tests
  3. Confirm cargo test --workspace and pytest tests/ both pass
  4. Open a pull request with a clear description

Adding a new runtime: Implement RuntimeAdapter in crates/vectorprime-runtime/src/ and register the adapter in the AdapterRegistry. The optimizer and Python binding layers require no changes.

See open issues for contribution ideas.


License

MIT. See LICENSE for the full text.


Acknowledgments

VectorPrime builds on:

  • llama.cpp — GGUF format specification and the llama-cli / llama-quantize binaries
  • ONNX Runtime — inference engine behind the ONNX adapter
  • TensorRT — NVIDIA's high-performance inference library
  • Ollama — local model runner that VectorPrime exports to
  • PyO3 and maturin — Rust/Python interop and packaging
  • Tokio — async runtime powering parallel benchmarking
  • anyhow and thiserror — structured error handling



Download files

Download the file for your platform.

Source Distributions

No source distribution files are available for this release.

Built Distributions

  • vectorprime-0.6.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.13, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.12, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.11, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.10, manylinux glibc 2.17+, x86-64
  • vectorprime-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB): CPython 3.9, manylinux glibc 2.17+, x86-64

File hashes

vectorprime-0.6.0-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       7abf04fc7a9573d6678c38cea50653a3bf192721d3020e3e922f0c4cfbb08489
  MD5          b5d114c0c6c2923f40ae86c31750431b
  BLAKE2b-256  fb22ae8b846ebde3a1515a1c2ce1e68a53db149c3021aced23d566542cfaaf92

vectorprime-0.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       ed51cc64ceec299e10d57e745e4d7e909e30cc6733c6755243a91def8defe914
  MD5          3ba6fdcece434ed2d76e28eeed71cb52
  BLAKE2b-256  8f9462093b588a03b4680bea727bbb7109c37fe0a8c81607871b26e995204890

vectorprime-0.6.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       aa8bd42d2fdef1a33aa0f8ea37771f0e8cec2e193baa75f8b03daf60f988825f
  MD5          71a8ac78eea75cc80046e4e729ccabf1
  BLAKE2b-256  d02fdb024f484b1c3892dabb250b6b36977051077b5c73fc42ce83e60b4cb4aa

vectorprime-0.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       2e9bec0734cb18a43a6000fc1474b4737db28f23207195934d77b5a7b3983339
  MD5          d9e36066cea980b750eeb7b2d4529d9d
  BLAKE2b-256  27f1a98d8fb26dd847291c2a02d29275194401f8dcb63fa2b52f0f2123e51003

vectorprime-0.6.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
  SHA256       5e397749f0914acf6201f9f284bc452b19a10c07022204777f012413bd8b664a
  MD5          922f6c1eb94596b6d6b4e27b093cd78a
  BLAKE2b-256  17227aa6cf00b9ab96c0df794a3fec9f8f529c8d1ce3d9e2d7b3aff755f86691
