Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning

Aurestral

Local GGUF inference for Python, powered by llama-cpp-python. Aurestral discovers models in your project’s models/ folder, auto-tunes thread counts, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI.

Requirements

Installation

pip install aurestral

For NVIDIA GPU acceleration, install llama-cpp-python with CUDA support first, then Aurestral:

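# Windows (cmd.exe, CUDA)
set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --no-cache-dir

# Linux (bash, CUDA): "set" does not export variables here, so pass them inline
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir

# Then, on either platform
pip install aurestral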

On macOS, the default llama-cpp-python wheel typically includes Metal acceleration.
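
After installing, it can be useful to confirm that the llama-cpp-python build actually reports GPU offload support. The following is a minimal sketch, not part of Aurestral: the helper name gpu_offload_available is hypothetical, and it relies on the llama_supports_gpu_offload function exposed by llama-cpp-python's low-level bindings. It returns None when the library is missing or too old to expose that probe.

```python
def gpu_offload_available():
    """Return True/False as reported by llama-cpp-python, or None if unknown."""
    try:
        import llama_cpp  # fails if llama-cpp-python is absent or its native library cannot load
    except (ImportError, OSError, RuntimeError):
        return None
    # Probe provided by the low-level bindings; looked up defensively
    # because older builds may not expose it.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if probe is None:
        return None
    return bool(probe())


status = gpu_offload_available()
print("GPU offload supported:", status)
```

A result of False after a CUDA install usually means pip reused a CPU-only wheel, which is what the --force-reinstall --no-cache-dir flags above are meant to prevent.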

Project layout

Place GGUF files in a models/ directory at your project root (or set AURESTRAL_MODELS_DIR):

my-project/
├── models/
│   └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
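
The lookup described above can be pictured roughly as follows. This is a sketch under stated assumptions, not Aurestral's actual discovery code: the function name find_gguf_models is hypothetical, and it assumes that AURESTRAL_MODELS_DIR, when set, replaces the default models/ folder entirely and that only files ending in .gguf count as models.

```python
import os
from pathlib import Path


def find_gguf_models(project_root="."):
    """Return the *.gguf files in the models folder, sorted by path."""
    # Assumption: the environment variable, if set, wins over <project_root>/models.
    override = os.environ.get("AURESTRAL_MODELS_DIR")
    models_dir = Path(override) if override else Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


for path in find_gguf_models():
    print(path.name)
```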

Quick start

Interactive chatbot

cd my-project
aurestral
# or explicitly:
aurestral chat -m llama-3.2-3b-instruct.Q4_K_M.gguf

Chat commands: /help, /clear, /exit

Python API

from aurestral import load_model, ChatSession, generate

# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)

# Reusable model handle
model = load_model()  # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
    {"role": "user", "content": "Hello!"},
])
print(reply)

# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)
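
To make the difference between a one-off chat call and a multi-turn session concrete, here is a self-contained sketch of the bookkeeping a session typically does: it keeps the message list between turns and resends it each time. SketchSession and echo_backend are hypothetical stand-ins written for illustration; they are not Aurestral classes, and the assumption that clearing keeps the system prompt is mine.

```python
from typing import Callable, Dict, List, Optional

Message = Dict[str, str]


class SketchSession:
    """Hypothetical stand-in that accumulates chat history across turns."""

    def __init__(self, backend: Callable[[List[Message]], str],
                 system_prompt: Optional[str] = None):
        self.backend = backend
        self.messages: List[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def send(self, text: str) -> str:
        # Each turn sends the full history so the model sees earlier context.
        self.messages.append({"role": "user", "content": text})
        reply = self.backend(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply

    def clear(self) -> None:
        # Assumption: a reset drops the conversation but keeps the system prompt.
        self.messages = [m for m in self.messages if m["role"] == "system"]


def echo_backend(messages: List[Message]) -> str:
    """Fake model that only reports how many messages it received."""
    return f"seen {len(messages)} message(s)"


session = SketchSession(echo_backend, system_prompt="You are concise.")
print(session.send("Hello!"))       # history: system + user
print(session.send("And again?"))   # history now includes the first exchange
```

In a real setup the backend would be a call into a loaded model, such as the model.chat(...) call shown earlier.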

List models, show hardware info, and run a prompt

aurestral list
aurestral info
aurestral run "The capital of France is" --stream

Hardware auto-tuning

On load, Aurestral inspects CPU cores, RAM, and whether llama-cpp-python was built with GPU offload support. It sets:

Setting        Behavior
n_threads      Physical cores minus one
n_ctx          1k–8k based on available RAM
n_gpu_layers   -1 (all layers) when GPU offload is available
use_mlock      Enabled on high-RAM CPU-only setups
flash_attn     Enabled when GPU offload is available
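
The rules in this table can be read as a small pure function. The sketch below is illustrative only and is not Aurestral's tuning code: the function name sketch_auto_tune is hypothetical, and the RAM thresholds, the 32 GB cutoff for "high-RAM", the floor of one thread, and n_gpu_layers=0 without GPU offload are all assumptions, since the table gives only the general behavior.

```python
def sketch_auto_tune(physical_cores, available_ram_gb, gpu_offload):
    """Map hardware facts to load-time settings, following the table above."""
    # Physical cores minus one; the floor of 1 is an assumption for single-core hosts.
    n_threads = max(1, physical_cores - 1)

    # 1k-8k context by available RAM; these thresholds are assumed, not documented.
    if available_ram_gb < 4:
        n_ctx = 1024
    elif available_ram_gb < 8:
        n_ctx = 2048
    elif available_ram_gb < 16:
        n_ctx = 4096
    else:
        n_ctx = 8192

    return {
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        # -1 offloads all layers; 0 (CPU only) is assumed when offload is unavailable.
        "n_gpu_layers": -1 if gpu_offload else 0,
        # "High-RAM CPU-only" is taken here to mean at least 32 GB and no GPU offload.
        "use_mlock": (not gpu_offload) and available_ram_gb >= 32,
        "flash_attn": bool(gpu_offload),
    }


print(sketch_auto_tune(physical_cores=8, available_ram_gb=64, gpu_offload=False))
```

Keeping the heuristic free of hardware probing makes it easy to see what a given machine would get, and to compare that against the values you would set by hand.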

Override defaults with InferenceConfig or auto_tune=False:

from aurestral import InferenceConfig, load_model

cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)

Configuration reference

Environment

  • AURESTRAL_MODELS_DIR — path to models folder (instead of ./models)

InferenceConfig — load-time: n_ctx, n_batch, n_threads, n_gpu_layers, use_mmap, use_mlock, flash_attn

GenerateConfig — generation-time: max_tokens, temperature, top_p, top_k, repeat_penalty, stop, stream
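
The split between the two groups matters in practice: load-time fields are presumably fixed once a model is loaded, while generation-time fields can differ from call to call. As a small illustration, the hypothetical helper below sorts a flat settings dictionary into the two groups using the field names listed above. It is not part of Aurestral and does not show how either config class is constructed.

```python
LOAD_TIME_KEYS = {
    "n_ctx", "n_batch", "n_threads", "n_gpu_layers",
    "use_mmap", "use_mlock", "flash_attn",
}
GENERATION_KEYS = {
    "max_tokens", "temperature", "top_p", "top_k",
    "repeat_penalty", "stop", "stream",
}


def split_settings(settings):
    """Split a flat dict into (load_time, generation_time) dicts."""
    load, gen = {}, {}
    for key, value in settings.items():
        if key in LOAD_TIME_KEYS:
            load[key] = value
        elif key in GENERATION_KEYS:
            gen[key] = value
        else:
            # Fail loudly on typos rather than silently dropping a setting.
            raise KeyError(f"unknown setting: {key}")
    return load, gen


load_kwargs, gen_kwargs = split_settings(
    {"n_ctx": 4096, "n_gpu_layers": -1, "temperature": 0.2, "max_tokens": 256}
)
print(load_kwargs)
print(gen_kwargs)
```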

Publishing to PyPI

pip install build twine
python -m build
twine upload dist/*

License

MIT License — see LICENSE.

