Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning
Aurestral
Local GGUF inference for Python, powered by llama-cpp-python. Aurestral discovers models in your project’s models/ folder, auto-tunes thread counts, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI. Still evolving; stay tuned for Aurestral Console.
Requirements
- Python 3.9–3.12 (3.13 may work; 3.14 is not supported yet because there are no llama-cpp-python wheels for it)
- A GGUF model file (e.g. from Hugging Face)
Installation
Recommended (Windows)
PyPI only ships a source tarball for llama-cpp-python, and building it from source often fails on Windows because of long-path limits. Install a prebuilt wheel first, then Aurestral:
# Use Python 3.11 or 3.12 (not 3.14)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
# Prebuilt CPU wheel (fast, no compile)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
pip install --upgrade aurestral
NVIDIA GPU (e.g. RTX 4060) — use the CUDA wheel index instead of CPU:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install --upgrade aurestral
Simple install (Linux / macOS)
pip install aurestral
On macOS, prebuilt wheels usually include Metal acceleration.
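To confirm that the installed wheel actually has GPU offload compiled in, a small check like the one below may help. It is a sketch, not part of Aurestral: `gpu_offload_status` is a hypothetical helper name, and it assumes your llama-cpp-python version exposes the low-level `llama_supports_gpu_offload()` binding. If the package or the binding is missing, it returns `None` instead of failing.

```python
import importlib


def gpu_offload_status():
    """Return True/False if llama-cpp-python reports GPU offload support.

    Returns None when the answer cannot be determined (package missing,
    native library failed to load, or the binding is not exposed).
    Hypothetical helper, not part of the Aurestral API.
    """
    try:
        llama_cpp = importlib.import_module("llama_cpp")
    except Exception:
        # Not installed, or the native library could not be loaded.
        return None
    # Assumption: recent llama-cpp-python builds expose this function.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    if probe is None:
        return None
    return bool(probe())


if __name__ == "__main__":
    print("GPU offload:", gpu_offload_status())
```

A result of `False` after installing from the CUDA index usually means pip picked up a CPU build, so reinstalling with the intended wheel index is the first thing to try.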
If install still fails on Windows
- Enable long path support in Windows settings.
- Use a short temp folder before installing:
New-Item -ItemType Directory -Force C:\tmp | Out-Null
$env:TEMP = "C:\tmp"
$env:TMP = "C:\tmp"
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
- Confirm Python version: python --version should show 3.12.x, not 3.14.
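The version check can also be done from inside the virtual environment with a few lines of Python. This is a minimal sketch: `is_supported` is a hypothetical helper, and the 3.9 to 3.12 bounds are copied from the Requirements section rather than read from the package metadata.

```python
import sys

# Bounds taken from the Requirements section of this README.
MIN_SUPPORTED = (3, 9)
MAX_SUPPORTED = (3, 12)


def is_supported(version=None):
    """Return True if the (major, minor) version is within 3.9 to 3.12.

    Defaults to the running interpreter. Note that 3.13 is reported as
    unsupported here even though the README says it may work.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_SUPPORTED <= major_minor <= MAX_SUPPORTED


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    verdict = "supported" if is_supported() else "not in the supported range"
    print(f"Python {current}: {verdict}")
```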
Project layout
Place GGUF files in a models/ directory at your project root (or set AURESTRAL_MODELS_DIR):
my-project/
├── models/
│ └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
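The lookup described above can be pictured roughly as follows. This is an illustrative sketch of the convention only, not Aurestral's actual discovery code: `find_models` is a hypothetical name, and matching on a `.gguf` suffix is an assumption.

```python
import os
from pathlib import Path


def find_models(project_root="."):
    """List GGUF files in the models directory, sorted by path.

    AURESTRAL_MODELS_DIR takes precedence; otherwise <project_root>/models
    is used. Returns an empty list if the directory does not exist.
    Hypothetical helper, not the library's implementation.
    """
    env_dir = os.environ.get("AURESTRAL_MODELS_DIR")
    models_dir = Path(env_dir) if env_dir else Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(
        path
        for path in models_dir.iterdir()
        if path.is_file() and path.suffix.lower() == ".gguf"
    )


if __name__ == "__main__":
    for model_path in find_models():
        print(model_path.name)
```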
Quick start
Interactive chatbot
cd my-project
aurestral
# or explicitly:
aurestral chat -m llama-3.2-3b-instruct.Q4_K_M.gguf
Chat commands: /help, /clear, /exit
Python API
from aurestral import load_model, ChatSession, generate
# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)
# Reusable model handle
model = load_model() # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
{"role": "user", "content": "Hello!"},
])
print(reply)
# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)
List models and hardware info
aurestral list
aurestral info
aurestral run "The capital of France is" --stream
Hardware auto-tuning
On load, Aurestral inspects CPU cores, RAM, and whether llama-cpp-python was built with GPU offload support. It sets:
| Setting | Behavior |
|---|---|
| n_threads | Physical cores minus one |
| n_ctx | 1k–8k based on available RAM |
| n_gpu_layers | -1 (all layers) when GPU offload is available |
| use_mlock | Enabled on high-RAM CPU-only setups |
| flash_attn | Enabled when GPU offload is available |
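The rules in the table can be sketched as a small pure function. Only the rules themselves come from the table; the RAM thresholds, the 32 GB "high RAM" cutoff, the `n_gpu_layers=0` CPU fallback, and the name `suggest_settings` are assumptions made for illustration and may differ from what Aurestral does internally.

```python
# Hypothetical cutoff for "high-RAM" setups; the README gives no number.
HIGH_RAM_GB = 32


def suggest_settings(physical_cores, available_ram_gb, gpu_offload):
    """Map hardware facts to load-time settings, following the table above."""
    # Physical cores minus one, but never fewer than one thread.
    n_threads = max(1, physical_cores - 1)

    # 1k to 8k context based on available RAM (thresholds are assumed).
    if available_ram_gb < 8:
        n_ctx = 1024
    elif available_ram_gb < 16:
        n_ctx = 2048
    elif available_ram_gb < 32:
        n_ctx = 4096
    else:
        n_ctx = 8192

    return {
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        # -1 offloads all layers; 0 (CPU only) is an assumed fallback.
        "n_gpu_layers": -1 if gpu_offload else 0,
        # Lock weights in RAM only on CPU-only machines with plenty of memory.
        "use_mlock": (not gpu_offload) and available_ram_gb >= HIGH_RAM_GB,
        "flash_attn": bool(gpu_offload),
    }


if __name__ == "__main__":
    print(suggest_settings(physical_cores=8, available_ram_gb=16, gpu_offload=True))
```

Keeping the heuristic as a pure function of three inputs makes it easy to compare against the values reported by aurestral info on your own machine.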
Override defaults with InferenceConfig or auto_tune=False:
from aurestral import InferenceConfig, load_model
cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)
Configuration reference
Environment
AURESTRAL_MODELS_DIR: path to the models folder (instead of ./models)
InferenceConfig — load-time: n_ctx, n_batch, n_threads, n_gpu_layers, use_mmap, use_mlock, flash_attn
GenerateConfig — generation-time: max_tokens, temperature, top_p, top_k, repeat_penalty, stop, stream
License
MIT License — see LICENSE.