Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning

Project description

Aurestral

Local GGUF inference for Python, powered by llama-cpp-python. Aurestral discovers models in your project’s models/ folder, auto-tunes thread count, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI. The project is still evolving; stay tuned for Aurestral Console.

Requirements

  • Python 3.9–3.12 (3.13 may work; 3.14 is not supported yet because llama-cpp-python has no wheels for it)
  • A GGUF model file (e.g. from Hugging Face)
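A quick way to check the interpreter against the version range above before installing. This is a minimal sketch; the helper name is_supported_python is hypothetical and not part of Aurestral.

```python
import sys

# Range taken from the Requirements list: Python 3.9 through 3.12.
MIN_VERSION = (3, 9)
MAX_VERSION = (3, 12)


def is_supported_python(version=None):
    """Return True if (major, minor) falls inside the documented range.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_VERSION <= (major, minor) <= MAX_VERSION


print("Supported Python:", is_supported_python())
```

Running this inside the virtual environment you plan to install into is the useful case, since that is the interpreter pip will build or fetch wheels for.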

Installation

Recommended (Windows)

PyPI ships only a source tarball for llama-cpp-python, and building it from source often fails on Windows because of long paths. Install a prebuilt wheel first, then Aurestral:

# Use Python 3.11 or 3.12 (not 3.14)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1

# Prebuilt CPU wheel (fast, no compile)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

pip install --upgrade aurestral

NVIDIA GPU (e.g. RTX 4060) — use the CUDA wheel index instead of CPU:

pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install --upgrade aurestral

Simple install (Linux / macOS)

pip install aurestral

On macOS, prebuilt wheels usually include Metal acceleration.

If install still fails on Windows

  1. Enable long path support in Windows settings.
  2. Use a short temp folder before installing:
    New-Item -ItemType Directory -Force C:\tmp | Out-Null
    $env:TEMP = "C:\tmp"
    $env:TMP = "C:\tmp"
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
    
  3. Confirm Python version: python --version should show 3.12.x, not 3.14.
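After working through the steps above, it can help to confirm that llama-cpp-python is actually importable in the active environment. The sketch below uses only the standard library to look for the llama_cpp module. The llama_supports_gpu_offload probe is an assumption about what recent llama-cpp-python builds expose, so it is looked up defensively and reported as "unknown" when absent.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


def describe_llama_cpp():
    """Summarize the llama-cpp-python install without assuming it exists."""
    if not module_available("llama_cpp"):
        return "llama-cpp-python is not installed in this environment"
    import llama_cpp

    version = getattr(llama_cpp, "__version__", "unknown")
    # Assumption: the build exposes llama_supports_gpu_offload(); fall back if not.
    probe = getattr(llama_cpp, "llama_supports_gpu_offload", None)
    gpu = probe() if callable(probe) else "unknown"
    return f"llama-cpp-python {version}, GPU offload: {gpu}"


print(describe_llama_cpp())
```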

Project layout

Place GGUF files in a models/ directory at your project root (or set AURESTRAL_MODELS_DIR):

my-project/
├── models/
│   └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
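The lookup described above can be sketched as follows. This is an illustration of the documented behavior (AURESTRAL_MODELS_DIR takes precedence over ./models), not Aurestral's actual implementation; find_gguf_models is a hypothetical name.

```python
import os
from pathlib import Path


def find_gguf_models(project_root="."):
    """List GGUF files in AURESTRAL_MODELS_DIR if set, else <project_root>/models.

    Hypothetical helper: returns a sorted list of paths, or [] if the folder is missing.
    """
    override = os.environ.get("AURESTRAL_MODELS_DIR")
    models_dir = Path(override) if override else Path(project_root) / "models"
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


for path in find_gguf_models():
    print(path.name)
```

With the layout shown above and no environment override, this would list the single .gguf file in models/.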

Quick start

Interactive chatbot

cd my-project
aurestral
# or explicitly:
aurestral chat -m llama-3.2-3b-instruct.Q4_K_M.gguf

Chat commands: /help, /clear, /exit

Python API

from aurestral import load_model, ChatSession, generate

# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)

# Reusable model handle
model = load_model()  # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
    {"role": "user", "content": "Hello!"},
])
print(reply)

# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)

List models and hardware info

aurestral list
aurestral info
aurestral run "The capital of France is" --stream

Hardware auto-tuning

On load, Aurestral inspects CPU cores, RAM, and whether llama-cpp-python was built with GPU offload support. It sets:

Setting        Behavior
n_threads      Physical cores minus one
n_ctx          1k–8k based on available RAM
n_gpu_layers   -1 (all layers) when GPU offload is available
use_mlock      Enabled on high-RAM CPU-only setups
flash_attn     Enabled when GPU offload is available
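To make the table concrete, here is a rough sketch of how such a heuristic could look. It is not Aurestral's code: the function name suggest_settings, the RAM thresholds for each context size, the 32 GB cutoff for use_mlock, the floor of one thread, and the value 0 for n_gpu_layers without GPU offload are all assumptions chosen for illustration.

```python
def suggest_settings(physical_cores, available_ram_gb, gpu_offload):
    """Map hardware facts to load-time settings, following the table's rules.

    Hypothetical sketch; all numeric thresholds below are illustrative guesses.
    """
    # Physical cores minus one, never below a single thread (floor is assumed).
    n_threads = max(1, physical_cores - 1)

    # Context size between 1k and 8k, scaled by available RAM (cutoffs assumed).
    if available_ram_gb >= 16:
        n_ctx = 8192
    elif available_ram_gb >= 8:
        n_ctx = 4096
    elif available_ram_gb >= 4:
        n_ctx = 2048
    else:
        n_ctx = 1024

    return {
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        # -1 offloads all layers; 0 (CPU only) is an assumed fallback.
        "n_gpu_layers": -1 if gpu_offload else 0,
        # "High-RAM" is taken to mean 32 GB or more here (assumed).
        "use_mlock": (not gpu_offload) and available_ram_gb >= 32,
        "flash_attn": bool(gpu_offload),
    }


print(suggest_settings(physical_cores=8, available_ram_gb=16, gpu_offload=True))
```

Keeping the heuristic a pure function of three inputs makes it easy to see which setting each hardware fact influences, and to compare the result with what aurestral info reports on a given machine.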

Override defaults with InferenceConfig or auto_tune=False:

from aurestral import InferenceConfig, load_model

cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)

Configuration reference

Environment

  • AURESTRAL_MODELS_DIR — path to models folder (instead of ./models)

InferenceConfig — load-time: n_ctx, n_batch, n_threads, n_gpu_layers, use_mmap, use_mlock, flash_attn

GenerateConfig — generation-time: max_tokens, temperature, top_p, top_k, repeat_penalty, stop, stream

License

MIT License — see LICENSE.

Download files

Source Distribution

aurestral-1.0.1.tar.gz (13.7 kB view details)

Uploaded Source

Built Distribution

aurestral-1.0.1-py3-none-any.whl (14.0 kB view details)

Uploaded Python 3

File details

Details for the file aurestral-1.0.1.tar.gz.

File metadata

  • Download URL: aurestral-1.0.1.tar.gz
  • Upload date:
  • Size: 13.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for aurestral-1.0.1.tar.gz
Algorithm Hash digest
SHA256 9a5c8c7be6ebe36544582dc2279a7c3ee7b7dc0ff5f00ffa63e1aa1f73bf5565
MD5 b97cb60852da70d2bcc198bb214cb05b
BLAKE2b-256 aa295a6a8f594c5c176061fd925db0f75dcedad80e0b01ae3b9543e87eaa683c

File details

Details for the file aurestral-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: aurestral-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 14.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for aurestral-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 7df9bbb056775e64799f4588b9b7f53e988bcdb4575637e47e270c4eb175f238
MD5 b26406ac6ca0c31da9ff864701f4aedb
BLAKE2b-256 7a597838c66b58b5e462496ff71e31196ee259200e4d307210b8494a18097a86
