Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning
Aurestral
Local GGUF inference for Python, powered by llama-cpp-python. Aurestral discovers models in your project's models/ folder, auto-tunes thread count, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI. The project is still evolving; stay tuned for Aurestral Console.
Requirements
- Python 3.9–3.12 (3.13 may work; 3.14 is not supported yet because there are no llama-cpp-python wheels for it)
- A GGUF model file (e.g. from Hugging Face)
Installation
Recommended (Windows)
PyPI only ships a source tarball for llama-cpp-python, and building it from source often fails on Windows because of path-length limits. Install a prebuilt wheel first, then Aurestral:
# Use Python 3.11 or 3.12 (not 3.14)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1
# Prebuilt CPU wheel (fast, no compile)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
pip install --upgrade aurestral
NVIDIA GPU (e.g. RTX 4060) — use the CUDA wheel index instead of CPU:
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install --upgrade aurestral
Simple install (Linux / macOS)
pip install aurestral
On macOS, prebuilt wheels usually include Metal acceleration.
If install still fails on Windows
- Enable long path support in Windows settings.
- Use a short temp folder before installing:
New-Item -ItemType Directory -Force C:\tmp | Out-Null
$env:TEMP = "C:\tmp"
$env:TMP = "C:\tmp"
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
- Confirm the Python version: python --version should show 3.12.x, not 3.14.
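The version check above can also be done from inside the interpreter that will run Aurestral, which helps when several Python installs are on the PATH. The following is a minimal sketch; `classify_python` is a hypothetical helper, not part of Aurestral, and treating 3.13 as "untested" is this sketch's reading of the Requirements section.

```python
import sys

# Bounds taken from the Requirements section of this README.
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 12)


def classify_python(version=None):
    """Return 'supported', 'untested', or 'unsupported' for a Python version.

    `version` is any sequence starting with (major, minor); it defaults to
    the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    if (major, minor) < SUPPORTED_MIN or (major, minor) >= (3, 14):
        return "unsupported"
    if (major, minor) > SUPPORTED_MAX:
        return "untested"  # 3.13: may work, per the Requirements section
    return "supported"


if __name__ == "__main__":
    print(sys.version.split()[0], "->", classify_python())
```

Run it with the same interpreter you used to create the virtual environment, for example `py -3.12 check_python.py` (the script name is arbitrary).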
0xc000001d / illegal instruction when loading a model
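This error means the installed llama-cpp-python binary uses CPU instructions or a runtime your machine does not have (common with the wrong CUDA wheel or with Python 3.14). Reinstall the CPU wheel and test in CPU-only mode: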
pip uninstall llama-cpp-python -y
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
python examples/chatbot.py --cpu-only
For GPU (RTX 4060), use the CUDA 12.4 wheel (requires CUDA 12.4 runtime):
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
Project layout
Place GGUF files in a models/ directory at your project root (or set AURESTRAL_MODELS_DIR):
my-project/
├── models/
│ └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
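To make the lookup rule concrete, here is a rough sketch of how discovery along these lines could work: prefer AURESTRAL_MODELS_DIR when it is set, otherwise fall back to models/ under the project root. The functions `find_models` and `pick_model` are hypothetical names for illustration and are not Aurestral's actual implementation.

```python
import os
from pathlib import Path


def find_models(project_root=".", env=None):
    """List *.gguf files in the models directory, sorted by path."""
    env = os.environ if env is None else env
    override = env.get("AURESTRAL_MODELS_DIR")
    models_dir = Path(override) if override else Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


def pick_model(models, name=None):
    """Return the sole model, or the one whose file name or stem equals `name`."""
    if name is None:
        if len(models) == 1:
            return models[0]
        raise ValueError(f"expected exactly one model, found {len(models)}")
    for path in models:
        if name in (path.name, path.stem):
            return path
    raise FileNotFoundError(name)
```

With the layout above, `pick_model(find_models("my-project"))` would return the path to the single .gguf file; with several models present, a name has to be given.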
Quick start
Interactive chatbot
cd my-project
aurestral
aurestral -m llama-3.2-3b-instruct.Q4_K_M.gguf
Chat commands: /clear, /exit
Example chatbot script
This repo includes a full terminal chatbot in examples/chatbot.py:
# From repo root (with models/*.gguf present)
pip install -e .
python examples/chatbot.py
# Options
python examples/chatbot.py -m your-model.gguf --system "You are a coding tutor."
python examples/chatbot.py --max-tokens 256 --temperature 0.5
Python API
from aurestral import load_model, ChatSession, generate
# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)
# Reusable model handle
model = load_model() # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
{"role": "user", "content": "Hello!"},
])
print(reply)
# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)
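For readers wondering what a multi-turn session has to keep track of, the sketch below shows the general pattern: the session holds the running message list and sends the whole history to the model on every turn. `SketchSession` and `echo_backend` are hypothetical stand-ins written for this example; they are not Aurestral's ChatSession, and the `clear` behavior is only a guess at what a /clear command might do.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class SketchSession:
    """Keeps the running message list and hands it to a backend on every turn."""

    def __init__(self, backend: Callable[[List[Message]], str], system_prompt: str = ""):
        self._backend = backend
        self._system = system_prompt
        self.messages: List[Message] = []
        self.clear()

    def clear(self) -> None:
        """Drop the conversation but keep the system prompt."""
        self.messages = (
            [{"role": "system", "content": self._system}] if self._system else []
        )

    def send(self, text: str) -> str:
        """Append the user turn, call the backend with the full history, store the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self._backend(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def echo_backend(messages: List[Message]) -> str:
    # Stand-in for a real model call; reports how much history it received.
    return f"({len(messages)} messages seen) {messages[-1]['content']}"
```

Because the full history is resent each turn, long conversations eventually exceed n_ctx, which is one reason a clear command is useful.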
Hardware auto-tuning
On load, Aurestral inspects CPU cores, RAM, and whether llama-cpp-python was built with GPU offload support. It sets:
| Setting | Behavior |
|---|---|
| n_threads | Physical cores minus one |
| n_ctx | 1k–8k based on available RAM |
| n_gpu_layers | -1 (all layers) when GPU offload is available |
| use_mlock | Enabled on high-RAM CPU-only setups |
| flash_attn | Enabled when GPU offload is available |
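The rules in the table can be read as a small pure function. The sketch below is an illustration only: `auto_tune` and `TunedSettings` are hypothetical names, the RAM thresholds for n_ctx and the 32 GB "high RAM" cutoff are assumptions (the table gives only the 1k–8k range), and falling back to 0 GPU layers without offload is also assumed.

```python
from dataclasses import dataclass


@dataclass
class TunedSettings:
    n_threads: int
    n_ctx: int
    n_gpu_layers: int
    use_mlock: bool
    flash_attn: bool


def auto_tune(physical_cores: int, available_ram_gb: float, gpu_offload: bool) -> TunedSettings:
    # Leave one physical core free, but never go below one thread.
    n_threads = max(1, physical_cores - 1)

    # Assumed thresholds; only the 1k-8k range comes from the table.
    if available_ram_gb >= 32:
        n_ctx = 8192
    elif available_ram_gb >= 16:
        n_ctx = 4096
    elif available_ram_gb >= 8:
        n_ctx = 2048
    else:
        n_ctx = 1024

    return TunedSettings(
        n_threads=n_threads,
        n_ctx=n_ctx,
        n_gpu_layers=-1 if gpu_offload else 0,  # -1 means "all layers"
        use_mlock=(not gpu_offload) and available_ram_gb >= 32,  # assumed cutoff
        flash_attn=gpu_offload,
    )
```

Keeping the heuristic free of hardware probing makes it easy to test; the caller would supply the core count, free RAM, and GPU capability from whatever detection it uses.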
Override defaults with InferenceConfig or auto_tune=False:
from aurestral import InferenceConfig, load_model
cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)
Configuration reference
Environment
AURESTRAL_MODELS_DIR: path to the models folder (instead of ./models)
InferenceConfig (load-time): n_ctx, n_batch, n_threads, n_gpu_layers, use_mmap, use_mlock, flash_attn
GenerateConfig (generation-time): max_tokens, temperature, top_p, top_k, repeat_penalty, stop, stream
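The split between load-time and generation-time settings mirrors how llama-cpp-python itself is used: the first group matches keyword arguments of the `Llama(...)` constructor, the second matches keyword arguments of `create_chat_completion(...)`. The sketch below illustrates that mapping with two stand-in dataclasses. `LoadSettings` and `SamplingSettings` are hypothetical names, not Aurestral's InferenceConfig and GenerateConfig, and every default value here is an assumption.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LoadSettings:
    """Fixed when the model is loaded; changing them means reloading."""
    n_ctx: int = 4096
    n_batch: int = 512
    n_threads: Optional[int] = None
    n_gpu_layers: int = 0
    use_mmap: bool = True
    use_mlock: bool = False
    flash_attn: bool = False

    def to_kwargs(self) -> dict:
        # Drop unset values so the backend's own defaults apply.
        return {k: v for k, v in asdict(self).items() if v is not None}


@dataclass
class SamplingSettings:
    """Can differ on every request against the same loaded model."""
    max_tokens: int = 256
    temperature: float = 0.7
    top_p: float = 0.95
    top_k: int = 40
    repeat_penalty: float = 1.1
    stop: List[str] = field(default_factory=list)
    stream: bool = False

    def to_kwargs(self) -> dict:
        kwargs = asdict(self)
        if not kwargs["stop"]:
            del kwargs["stop"]  # omit an empty stop list
        return kwargs
```

With llama-cpp-python installed, the two dictionaries could be passed as `Llama(model_path=..., **load.to_kwargs())` and `llm.create_chat_completion(messages=..., **sampling.to_kwargs())`.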
License
MIT License — see LICENSE.