Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning
Aurestral
Local GGUF inference for Python, powered by llama-cpp-python. Aurestral discovers models in your project’s models/ folder, auto-tunes thread counts, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI.
Requirements
- Python 3.9+
- A GGUF model file (e.g. from Hugging Face)
Installation
pip install aurestral
For NVIDIA GPU acceleration, install llama-cpp-python with CUDA support first, then Aurestral:
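# Windows (cmd.exe)
set CMAKE_ARGS=-DGGML_CUDA=on
set FORCE_CMAKE=1
pip install llama-cpp-python --force-reinstall --no-cache-dir
pip install aurestral
# Linux (bash): `set` does not export variables here, so pass them inline
CMAKE_ARGS="-DGGML_CUDA=on" FORCE_CMAKE=1 pip install llama-cpp-python --force-reinstall --no-cache-dir
pip install aurestral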
On macOS, the default llama-cpp-python wheel typically includes Metal acceleration.
Project layout
Place GGUF files in a models/ directory at your project root (or set AURESTRAL_MODELS_DIR):
my-project/
├── models/
│ └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
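The lookup described above can be pictured with a short sketch. This is not Aurestral's actual implementation: `find_gguf_models` and `pick_sole_model` are hypothetical helpers, and the sketch assumes that AURESTRAL_MODELS_DIR, when set, fully replaces the default models/ folder.

```python
import os
from pathlib import Path


def find_gguf_models(project_root="."):
    """Return sorted .gguf paths from the models directory (hypothetical helper)."""
    # Assumption: the environment variable, when set, replaces <root>/models.
    override = os.environ.get("AURESTRAL_MODELS_DIR")
    models_dir = Path(override) if override else Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


def pick_sole_model(paths):
    """Return the only model when exactly one exists, else None."""
    # With zero or several candidates the caller has to name a model explicitly.
    return paths[0] if len(paths) == 1 else None
```

With the layout shown above, `pick_sole_model(find_gguf_models("my-project"))` would return the path to the single GGUF file. Once a second model is added it returns None, which is the case where a model name has to be passed explicitly.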
Quick start
Interactive chatbot
cd my-project
aurestral
# or explicitly:
aurestral chat -m llama-3.2-3b-instruct.Q4_K_M.gguf
Chat commands: /help, /clear, /exit
Python API
from aurestral import load_model, ChatSession, generate
# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)
# Reusable model handle
model = load_model() # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
{"role": "user", "content": "Hello!"},
])
print(reply)
# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)
List models and hardware info
aurestral list
aurestral info
aurestral run "The capital of France is" --stream
Hardware auto-tuning
On load, Aurestral inspects CPU cores, RAM, and whether llama-cpp-python was built with GPU offload support. It sets:
| Setting | Behavior |
|---|---|
| n_threads | Physical cores minus one |
| n_ctx | 1k–8k based on available RAM |
| n_gpu_layers | -1 (all layers) when GPU offload is available |
| use_mlock | Enabled on high-RAM CPU-only setups |
| flash_attn | Enabled when GPU offload is available |
Override defaults with InferenceConfig or auto_tune=False:
from aurestral import InferenceConfig, load_model
cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)
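The auto-tuning rules in this section can be read as a small decision function. The sketch below is illustrative only and is not Aurestral's code: `suggest_settings` is a hypothetical name, the RAM cut-offs for n_ctx and use_mlock are assumptions (the table gives only the 1k–8k range and "high-RAM"), and falling back to n_gpu_layers=0 without GPU offload is also assumed.

```python
def suggest_settings(physical_cores, available_ram_gb, gpu_offload):
    """Map hardware facts to load-time settings (hypothetical sketch)."""
    # Physical cores minus one; the floor of one thread is an assumption.
    n_threads = max(1, physical_cores - 1)

    # Assumed RAM cut-offs: only the 1k-8k range comes from the table.
    if available_ram_gb >= 16:
        n_ctx = 8192
    elif available_ram_gb >= 8:
        n_ctx = 4096
    elif available_ram_gb >= 4:
        n_ctx = 2048
    else:
        n_ctx = 1024

    return {
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        # -1 offloads all layers; 0 (CPU only) is the assumed fallback.
        "n_gpu_layers": -1 if gpu_offload else 0,
        # "High-RAM" is taken to mean 32 GB or more (assumed threshold).
        "use_mlock": (not gpu_offload) and available_ram_gb >= 32,
        "flash_attn": gpu_offload,
    }
```

For example, `suggest_settings(8, 16, gpu_offload=True)` yields 7 threads, an 8192-token context, and full GPU offload with flash attention enabled. Keeping the heuristic as a pure function of its inputs makes it easy to test without probing real hardware.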
Configuration reference
- Environment: AURESTRAL_MODELS_DIR, path to the models folder (instead of ./models)
- InferenceConfig (load-time): n_ctx, n_batch, n_threads, n_gpu_layers, use_mmap, use_mlock, flash_attn
- GenerateConfig (generation-time): max_tokens, temperature, top_p, top_k, repeat_penalty, stop, stream
Publishing to PyPI
pip install build twine
python -m build
twine upload dist/*
License
MIT License — see LICENSE.