cyllama is a comprehensive zero-dependencies Python library for local AI inference using the state-of-the-art llama, whisper, and stable-diffusion .cpp ecosystem.

cyllama - Fast, Pythonic AI Inference

cyllama is a comprehensive, zero-dependency Python library for local AI inference built on the state-of-the-art .cpp ecosystem: llama.cpp, whisper.cpp, and stable-diffusion.cpp.

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

Documentation | PyPI | Changelog

Features

  • High-level API -- complete(), chat(), LLM class for quick prototyping / text generation.
  • Streaming -- token-by-token output with callbacks
  • Batch processing -- process multiple prompts 3-10x faster
  • GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
  • Speculative decoding -- 2-3x speedup with draft models
  • Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
  • RAG -- retrieval-augmented generation with local embeddings and sqlite-vector
  • Speech recognition -- whisper.cpp transcription and translation
  • Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.
  • OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints
  • Framework integrations -- OpenAI API client, LangChain LLM interface

Installation

From PyPI

pip install cyllama

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only for now):

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

cyllama info

You can also query the backend configuration at runtime:

from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal

Build from source with a specific backend

GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama

Command-Line Interface

cyllama provides a unified CLI for all major functionality:

# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response
cyllama chat -m models/llama.gguf --stats              # show session stats on exit

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation

Run cyllama --help or cyllama <command> --help for full usage. See CLI Cheatsheet for the complete reference.
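
Once `cyllama server` from the commands above is running, any OpenAI-style HTTP client should be able to talk to it. The sketch below only builds such a request with the standard library. The `/v1/chat/completions` route is an assumption based on the OpenAI convention (this README does not state the exact path), and `build_chat_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, temperature=0.7):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {"messages": messages, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",  # assumed route
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080",  # port taken from the server example above
    [{"role": "user", "content": "What is Python?"}],
)
# To send it once the server is up:
#   with urllib.request.urlopen(request) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect before a server is available.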

Quick Start

from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)

Key Features

Simple by Default, Powerful When Needed

High-Level API - Get started in seconds:

from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!

Streaming Support - Real-time token-by-token output:

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
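
If you also need the full text after streaming, you can accumulate the chunks as they arrive. This is a minimal sketch that assumes each chunk is a plain string, as the loop above suggests; `fake_stream` is a hypothetical stand-in for `complete(..., stream=True)` so the example runs without a model.

```python
import sys


def echo_and_collect(chunks, out=sys.stdout):
    """Write each chunk as it arrives and return the full text at the end."""
    parts = []
    for chunk in chunks:
        out.write(chunk)
        out.flush()
        parts.append(chunk)
    return "".join(parts)


def fake_stream():
    # Hypothetical stand-in for complete(..., stream=True)
    yield from ["Once ", "upon ", "a ", "time."]


text = echo_and_collect(fake_stream())
```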

Performance Optimized

Batch Processing - Process multiple prompts 3-10x faster:

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")

Speculative Decoding - 2-3x speedup with draft models:

from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

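# ctx_target is an already-created context for the target model;
# prompt_tokens and last_token come from tokenizing your prompt.
params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)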

Memory Optimization - Smart GPU layer allocation:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")

N-gram Cache - 2-10x speedup for repetitive text:

from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
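
The idea behind n-gram drafting can be shown with a toy lookup table: remember which token followed each n-gram, then propose the most frequent continuation. This is an illustrative pure-Python sketch, not cyllama's `NgramCache` implementation; `TinyNgramCache` and its methods are hypothetical names.

```python
from collections import Counter, defaultdict


class TinyNgramCache:
    """Toy n-gram lookup table; illustrative only, not cyllama's NgramCache."""

    def __init__(self, n=2):
        self.n = n
        self.table = defaultdict(Counter)  # n-gram -> counts of the next token

    def update(self, tokens):
        # Record the token that followed every n-gram in the sequence.
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            self.table[key][tokens[i + self.n]] += 1

    def draft(self, tokens, n_draft=4):
        out = list(tokens)
        drafted = []
        for _ in range(n_draft):
            key = tuple(out[-self.n:])
            if key not in self.table:
                break  # unseen context: stop drafting
            nxt = self.table[key].most_common(1)[0][0]
            drafted.append(nxt)
            out.append(nxt)
        return drafted


cache = TinyNgramCache(n=2)
cache.update([1, 2, 3, 4, 1, 2, 3, 5, 1, 2, 3, 4])
draft = cache.draft([9, 1, 2], n_draft=3)
```

Drafted tokens are only guesses; in speculative schemes the target model still verifies them, which is why repetitive text benefits most.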

Response Caching - Cache LLM responses for repeated prompts:

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()

Note: Caching requires a fixed seed (not the default random sentinel) since random seeds produce non-deterministic output. Streaming responses are not cached.
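
To make the `cache_size` and `cache_ttl` semantics concrete, here is a toy LRU cache with per-entry expiry. It is a sketch of the general technique, not cyllama's cache; the `TTLCache` class and the choice of `(prompt, seed)` as the key are assumptions for illustration.

```python
import time
from collections import OrderedDict


class TTLCache:
    """Toy LRU cache with per-entry expiry; illustrative only."""

    def __init__(self, maxsize=100, ttl=3600.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self.hits = self.misses = 0
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return entry[1]
        self.misses += 1
        value = compute()
        self._data[key] = (now + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used
        return value


cache = TTLCache(maxsize=100, ttl=3600)
first = cache.get_or_compute(("What is Python?", 42), lambda: "generated text")
second = cache.get_or_compute(("What is Python?", 42), lambda: "never called")
```

Including the seed in the key mirrors the note above: a cached answer is only meaningful when the same prompt and seed would reproduce it.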

Framework Integrations

OpenAI-Compatible API - Drop-in replacement:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)

LangChain Integration - Seamless ecosystem access:

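from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Example prompt with a {topic} placeholder (illustrative wording)
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")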

Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:

from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")

See Agents Overview for detailed agent documentation.
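
The pre/post-condition idea used by ContractAgent can be sketched with plain decorators. This is not cyllama's implementation: `require`, `ensure`, and `ContractViolation` are hypothetical names, and the sketch only supports keyword-argument calls to keep it short.

```python
import functools


class ContractViolation(Exception):
    """Raised when a pre- or postcondition fails (hypothetical name)."""


def require(check, message):
    """Precondition: check receives the call's keyword arguments as a dict."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            if not check(kwargs):
                raise ContractViolation(f"precondition failed: {message}")
            return fn(**kwargs)
        return wrapper
    return decorate


def ensure(check, message):
    """Postcondition: check receives the function's return value."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            result = fn(**kwargs)
            if not check(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result
        return wrapper
    return decorate


@require(lambda args: args["x"] != 0, "cannot divide by zero")
@ensure(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    return a / x


ok = divide(a=100, x=4)
try:
    divide(a=1, x=0)
    violation = None
except ContractViolation as exc:
    violation = str(exc)
```

Checking the precondition before the call means the tool body never runs with invalid input, which is the property an enforcing policy relies on.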

Speech Recognition

Whisper Transcription - Transcribe audio files with timestamps:

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")

See Whisper docs for full documentation.
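
The transcription example above leaves audio loading to you. One possible loader for files that are already 16-bit mono PCM at 16 kHz is sketched below using only the standard library; `load_wav_16khz_mono` is a hypothetical helper, and it returns a plain list of floats. If `ctx.full` expects a NumPy array (an assumption, suggested by the `numpy` import above), convert with `np.asarray(samples, dtype=np.float32)`.

```python
import array
import sys
import wave


def load_wav_16khz_mono(path):
    """Read a 16-bit PCM, mono, 16 kHz WAV file as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if fmt != (1, 2, 16000):
            raise ValueError("expected 16-bit mono PCM at 16 kHz; resample first")
        pcm = array.array("h")
        pcm.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    return [s / 32768.0 for s in pcm]


def centiseconds_to_seconds(t):
    """The example above divides segment times by 100, i.e. 10 ms units."""
    return t / 100.0
```

Files in other formats or sample rates need resampling first, for example with an external tool, before this loader will accept them.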

Stable Diffusion

Image Generation - Generate images from text using stable-diffusion.cpp:

from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")

Advanced Generation - Full control with SDContext:

from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)

CLI Tool - Command-line interface:

# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.

RAG (Retrieval-Augmented Generation)

CLI - Query your documents from the command line:

# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding

Simple RAG - Query your documents with LLMs:

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)

Load Documents - Support for multiple file formats:

from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")

Hybrid Search - Combine vector and keyword search:

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
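
One common way to combine vector and keyword results is a weighted sum of per-document scores. The sketch below illustrates that idea with the same 0.7 / 0.3 weights. It assumes both score sets are already normalized to the 0..1 range, and `fuse_scores` is a hypothetical helper; how `HybridStore` actually fuses results is not described here and may differ.

```python
def fuse_scores(vector_scores, fts_scores, vector_weight=0.7, fts_weight=0.3, k=5):
    """Rank document ids by a weighted sum of two score dictionaries."""
    ids = set(vector_scores) | set(fts_scores)
    combined = {
        doc_id: vector_weight * vector_scores.get(doc_id, 0.0)
        + fts_weight * fts_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    # Highest combined score first, truncated to the top k
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


ranked = fuse_scores(
    vector_scores={"a": 0.9, "b": 0.4, "c": 0.1},
    fts_scores={"b": 1.0, "c": 0.2},
)
```

A document missing from one result set simply contributes zero for that component, so strong keyword matches can still surface with a weak vector score.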

Embedding Cache - Speed up repeated queries with LRU caching:

from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")

Agent Integration - Use RAG as an agent tool:

from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See RAG Overview for full documentation.
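
Text chunking usually means splitting documents into overlapping windows so that context is not lost at chunk boundaries. The following is a minimal character-based sketch of that technique; `chunk_text` and its parameters are hypothetical and do not reflect cyllama's own chunker, which may split on tokens or sentences instead.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows that overlap by `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
```

Larger overlaps improve recall at chunk boundaries at the cost of more embeddings to compute and store.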

Common Utilities

GGUF File Manipulation - Inspect and modify model files:

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

Hugging Face Model Downloads - Fetch and cache GGUF models:

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()

What's Inside

Text Generation (llama.cpp)

  • Full llama.cpp API - Complete Cython wrapper with strong typing
  • High-Level API - Simple, Pythonic interface (LLM, complete, chat)
  • Streaming Support - Token-by-token generation with callbacks
  • Batch Processing - Efficient parallel inference
  • Multimodal - LLAVA and vision-language models
  • Speculative Decoding - 2-3x inference speedup with draft models

Speech Recognition (whisper.cpp)

  • Full whisper.cpp API - Complete Cython wrapper
  • High-Level API - Simple transcribe() function
  • Multiple Formats - WAV, MP3, FLAC, and more
  • Language Detection - Automatic or specified language
  • Timestamps - Word and segment-level timing

Image & Video Generation (stable-diffusion.cpp)

  • Full stable-diffusion.cpp API - Complete Cython wrapper
  • Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
  • Image-to-Image - Transform existing images
  • Inpainting - Mask-based editing
  • ControlNet - Guided generation with edge/pose/depth
  • Video Generation - Wan, CogVideoX models
  • Upscaling - ESRGAN 4x upscaling

Cross-Cutting Features

  • GPU Acceleration - Metal, CUDA, ROCm, SYCL, Vulkan backends
  • Memory Optimization - Smart GPU layer allocation
  • Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent
  • Framework Integration - OpenAI API, LangChain, FastAPI

Why Cyllama?

Performance: Compiled Cython wrappers with minimal overhead

  • Strong type checking at compile time
  • Zero-copy data passing where possible
  • Efficient memory management
  • Native integration with llama.cpp optimizations

Simplicity: From 50 lines to 1 line for basic generation

  • Intuitive, Pythonic API design
  • Automatic resource management
  • Sensible defaults, full control when needed

Production-Ready: Battle-tested and comprehensive

  • 1460+ passing tests with extensive coverage
  • Comprehensive documentation and examples
  • Proper error handling and logging
  • Framework integration for real applications

Up-to-Date: Tracks bleeding-edge llama.cpp

  • Regular updates with latest features
  • All high-priority APIs wrapped
  • Performance optimizations included

Status

  • Current Version: 0.2.8 (Apr 2026)
  • llama.cpp Version: b8757
  • Build System: scikit-build-core + CMake
  • Test Coverage: 1460+ tests passing

Platform & GPU Availability

Pre-built wheels on PyPI:

Package         Backend                     Platform  Arch                   Linking
cyllama         CPU                         Linux     x86_64                 static
cyllama         CPU                         Windows   x86_64                 static
cyllama         Metal                       macOS     arm64 (Apple Silicon)  static
cyllama         Metal                       macOS     x86_64 (Intel)         static
cyllama-cuda12  CUDA 12.4                   Linux     x86_64                 dynamic
cyllama-rocm    ROCm 6.3                    Linux     x86_64                 dynamic
cyllama-sycl    Intel SYCL (oneAPI 2025.3)  Linux     x86_64                 dynamic
cyllama-vulkan  Vulkan                      Linux     x86_64                 dynamic

We plan to add wheels for more platforms in the future, starting with Vulkan and CUDA 12 support for Windows.

Build from source (any platform with a C++ toolchain):

Backend     macOS                       Linux              Windows
CPU         make build-cpu              make build-cpu     make build-cpu
Metal       make build-metal (default)  --                 --
CUDA        --                          make build-cuda    make build-cuda
ROCm (HIP)  --                          make build-hip     --
Vulkan      make build-vulkan           make build-vulkan  make build-vulkan
SYCL        --                          make build-sycl    --
OpenCL      make build-opencl           make build-opencl  make build-opencl

All source builds support both static (make build-<backend>) and dynamic (make build-<backend>-dynamic) linking.

Recent Releases

  • v0.2.9 (Apr 2026) - Fixed CUDA image generation crash (SD now statically links its own vendored ggml by default), --stats works in streaming mode, exposed LlamaContext.get_perf_data() / LlamaSampler.get_perf_data(), MtmdContextParams.warmup property, replaced deprecated mtmd_image_tokens_get_nx/ny with mtmd_decoder_pos API, llama.cpp b8802, stable-diffusion.cpp master-567-ee5bf95
  • v0.2.8 (Apr 2026) - Expanded Cython bindings for LlamaContextParams (flash_attn_type, embeddings, op_offload, swa_full, kv_unified), ~30 new WhisperFullParams properties, SDSampleParams/SDImageGenParams additions (skip-layer guidance, custom sigmas, LoRA, IP-Adapter, Photo Maker, step-cache surface), whisper_cpp.disable_logging(), cyllama transcribe -v flag, centralized defaults in cyllama._defaults aligned with llama.cpp C library, Gemma 4 interactive chat fix, Qwen3 reasoning-block truncation fix, CUDA wheel double-free fix
  • v0.2.7 (Apr 2026) - SD defaults aligned with C library: wtype auto-detect (fixes blank images on CUDA), sample_method/scheduler auto-resolve, eta changed from 0.0 to infinity sentinel
  • v0.2.6 (Apr 2026) - Removed accidental pytest-review runtime dependency from 0.2.5
  • v0.2.5 (Apr 2026) - Typed loader exceptions, concurrent-use guard on LLM/Embedder/WhisperContext/SDContext, persistent RAG vector store (cyllama rag --db), corpus deduplication, vendored jinja2 chat templates (fixes Gemma 4 and other non-substring-detectable templates), Qwen3 <think>-block stripping + n-gram repetition guard, readline history for REPLs, memory-leak regression tests, llama.cpp b8757
  • v0.2.4 (Apr 2026) - Unified CLI (cyllama gen, chat, embed, rag, ...), cyllama rag command-line RAG, Ctrl+C during inference, embeddings endpoint, Embedder logging fix, interactive chat token limit fix
  • v0.2.3 (Apr 2026) - SD flow_shift black-image fix, GPU OOM validation, dynamic Linux install fixes, wheel backend discovery after auditwheel/delvewheel rename, CLI entry point, wheel smoke tests, OpenCL targets, CUDA tuning flags
  • v0.2.2 (Apr 2026) - CUDA wheel size stability (PTX-only sm_75), portability flags moved from manage.py to CI workflows
  • v0.2.1 (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests
  • v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored
  • v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled
  • v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp
  • v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp
  • v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped
  • v0.1.16 (Dec 2025) - Response class, Async API, Chat templates
  • v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp
  • v0.1.11 (Nov 2025) - ACP support, build improvements
  • v0.1.10 (Nov 2025) - Agent Framework, bug fixes
  • v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
  • v0.1.8 (Nov 2025) - Speculative decoding API
  • v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
  • v0.1.6 (Nov 2025) - Multimodal test fixes
  • v0.1.5 (Oct 2025) - Mongoose server, embedded server
  • v0.1.4 (Oct 2025) - Memory estimation, performance optimizations

See CHANGELOG.md for complete release history.

Building from Source

To build cyllama from source:

  1. Install a recent version of Python 3 (currently tested with Python 3.13).

  2. Git clone the latest version of cyllama:

    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    
  3. Install the dependencies with uv, which we use for package management.

    If you don't have uv, install it first; otherwise run:

    uv sync
    
  4. Type make in the terminal.

    This will:

    1. Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
    2. Install them into the thirdparty folder
    3. Build cyllama using scikit-build-core + CMake

Build Commands

# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI

GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

See Build Backends for comprehensive backend build instructions.

Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=-1)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=-1)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=-1
)
llm = LLM("model.gguf", config=config)

Split Modes:

  • 0 (NONE): Single GPU only, uses main_gpu
  • 1 (LAYER): Split layers and KV cache across GPUs (default)
  • 2 (ROW): Tensor parallelism - split layers with row-wise distribution
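
The `tensor_split` values are relative weights, so `[1, 2]` and `[0.3, 0.7]` both describe proportions rather than absolute sizes. The sketch below shows that arithmetic only. `split_fractions` and `layers_per_gpu` are hypothetical helpers, and the rounding rule is an assumption; the actual layer placement is decided by llama.cpp and may differ.

```python
def split_fractions(tensor_split):
    """Normalize relative weights such as [1, 2] into per-GPU fractions."""
    if not tensor_split or any(w < 0 for w in tensor_split):
        raise ValueError("tensor_split needs non-negative weights")
    total = sum(tensor_split)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in tensor_split]


def layers_per_gpu(n_layers, tensor_split):
    """Rough layer counts per GPU; the last GPU absorbs rounding remainders."""
    fractions = split_fractions(tensor_split)
    counts = [int(n_layers * f) for f in fractions[:-1]]
    counts.append(n_layers - sum(counts))
    return counts


fractions = split_fractions([1, 2])       # one third on GPU 0, two thirds on GPU 1
counts = layers_per_gpu(32, [0.3, 0.7])   # rough split of a 32-layer model
```

Expressing the split as weights lets you keep the same configuration when moving between models with different layer counts.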

Testing

The tests directory in this repo provides extensive examples of using cyllama.

As a first step, download a small LLM in .gguf format from Hugging Face. A good small model to start with, and the one the tests assume, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder inside the cloned cyllama directory. To create the models directory (if it doesn't exist) and download this model, run:

make download

This is roughly equivalent to:

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

With 1460+ passing tests, the library is ready for both quick prototyping and production use:

make test  # Run full test suite

You can also explore interactively:

python3 -i scripts/start.py

>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

Documentation

Full documentation is available at https://shakfu.github.io/cyllama/ (built with MkDocs).

To serve docs locally: make docs-serve

  • User Guide - Comprehensive guide covering all features
  • CLI Cheatsheet - Complete CLI reference for all commands
  • API Reference - Complete API documentation
  • RAG Overview - Retrieval-augmented generation guide
  • Cookbook - Practical recipes and patterns
  • Changelog - Complete release history
  • Examples - See tests/examples/ for working code samples

Roadmap

Completed

  • Full llama.cpp API wrapper with Cython
  • High-level API (LLM, complete, chat)
  • Async API support (AsyncLLM, complete_async, chat_async)
  • Response class with stats and serialization
  • Built-in chat template system (llama.cpp templates)
  • Batch processing utilities
  • OpenAI-compatible API client
  • LangChain integration
  • Speculative decoding
  • GGUF file manipulation
  • JSON schema to grammar conversion
  • Model download helper
  • N-gram cache
  • OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings
  • Whisper.cpp integration
  • Multimodal support (LLAVA)
  • Memory estimation utilities
  • Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)
  • Stable Diffusion (stable-diffusion.cpp) - image/video generation
  • RAG utilities (text chunking, document processing)

Future

  • Web UI for testing

Contributing

Contributions are welcome! Please see the User Guide for development guidelines.

License

This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp which all follow the MIT licensing terms, as does cyllama.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cyllama-0.2.10-cp314-cp314-win_amd64.whl (12.5 MB view details)

Uploaded CPython 3.14Windows x86-64

cyllama-0.2.10-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.2 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.17+ x86-64

cyllama-0.2.10-cp314-cp314-macosx_11_0_x86_64.whl (13.4 MB view details)

Uploaded CPython 3.14macOS 11.0+ x86-64

cyllama-0.2.10-cp314-cp314-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.14macOS 11.0+ ARM64

cyllama-0.2.10-cp313-cp313-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.13Windows x86-64

cyllama-0.2.10-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.2 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

cyllama-0.2.10-cp313-cp313-macosx_11_0_x86_64.whl (13.4 MB view details)

Uploaded CPython 3.13macOS 11.0+ x86-64

cyllama-0.2.10-cp313-cp313-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

cyllama-0.2.10-cp312-cp312-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.12Windows x86-64

cyllama-0.2.10-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.2 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

cyllama-0.2.10-cp312-cp312-macosx_11_0_x86_64.whl (13.4 MB view details)

Uploaded CPython 3.12macOS 11.0+ x86-64

cyllama-0.2.10-cp312-cp312-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

cyllama-0.2.10-cp311-cp311-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.11Windows x86-64

cyllama-0.2.10-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

cyllama-0.2.10-cp311-cp311-macosx_11_0_x86_64.whl (13.4 MB view details)

Uploaded CPython 3.11macOS 11.0+ x86-64

cyllama-0.2.10-cp311-cp311-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

cyllama-0.2.10-cp310-cp310-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.10Windows x86-64

cyllama-0.2.10-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

cyllama-0.2.10-cp310-cp310-macosx_11_0_x86_64.whl (13.4 MB view details)

Uploaded CPython 3.10macOS 11.0+ x86-64

cyllama-0.2.10-cp310-cp310-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file cyllama-0.2.10-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.10-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 12.5 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.10-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 5dd068b0e65ca655182552fe45980d4fcb9216e9fa9b3547fdc056ebec5b3046
MD5 e56c8930ed894ea3ebbaebb6a39943f0
BLAKE2b-256 9a849206eeaa1e385ec74e240ba2aab626a229f72c2a3b7ab80cf048e5f6cf64

See more details on using hashes here.
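
One way to use the digests in these tables is to recompute the SHA256 of a downloaded wheel and compare it with the published value. The sketch below does this with Python's standard hashlib module; the helper names sha256_of_file and verify_wheel and the local file path are assumptions for illustration, while the expected digest is copied from the table above.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the "File hashes" table for the
# CPython 3.14 Windows wheel.
EXPECTED_SHA256 = "5dd068b0e65ca655182552fe45980d4fcb9216e9fa9b3547fdc056ebec5b3046"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_wheel(path, expected_sha256):
    """Return True when the file's SHA256 digest matches the expected one."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


if __name__ == "__main__":
    # Assumed location of a previously downloaded wheel.
    wheel = Path("cyllama-0.2.10-cp314-cp314-win_amd64.whl")
    if wheel.exists():
        print("OK" if verify_wheel(wheel, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```

pip can also enforce digests at install time through its hash-checking mode, for example by running pip install --require-hashes -r requirements.txt against a requirements file that pins each package with a --hash=sha256:... option.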

File details

Details for the file cyllama-0.2.10-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 2dea935e403dfba8c2ddb0422871b2393961a4537461cc0135f2b9b53c84d25a
MD5 57d358f7273c4e77653e63d110b27515
BLAKE2b-256 6c786bb45369a9b8e4b007e418b7a78f18b51fd42218de203b07c35954c2f7e4

File details

Details for the file cyllama-0.2.10-cp314-cp314-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp314-cp314-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 f9eaa4f3726395ed7d370372ef05082b2ee896a502650f340899b0702c6f8a4b
MD5 f8985ffa9fc8009a37faaa9a95e20ab0
BLAKE2b-256 6c89f68aa6a4fcab45dadf412514c79d1d8d5a628bd13bdc1fdf6779abc3a7b6

File details

Details for the file cyllama-0.2.10-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fb6e5f6b3c865fd32f9a68762f43906472d5451301fca46aca7a866cd04cc19a
MD5 5a0c247dc1d9982d5d249e6285d4db31
BLAKE2b-256 a2b77013f3d420ff981f5c4dbc9d70379b26f10c4f5ff23563abc1fd5dd7ae5e

File details

Details for the file cyllama-0.2.10-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.10-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.10-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 578972a90619a53a246c6ee8635f6d9261dda9774b045da9a41db37b6f7d149a
MD5 4deaf87d88c665d2f11b8c0fb3ec2e48
BLAKE2b-256 b0464c891102dc2bf9166c0272ed582954442df838302dc9c75fb0f4cebfad5b

File details

Details for the file cyllama-0.2.10-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 1328c9e1e0e221271e415cb9d64d848d98b40679be9e1f6070be6106c5996aff
MD5 ff6510baf9c95e632b75abef1ab14e69
BLAKE2b-256 65a4dfe4ed452011301488dd450662cddc3eab0a80a052065254ed563e12c10e

File details

Details for the file cyllama-0.2.10-cp313-cp313-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp313-cp313-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 51c54617f324522494957732c40fc9376c3c0019b6985e7da96991aba32cd446
MD5 7bd0251e07d0afe861eab04648837d25
BLAKE2b-256 1f30168b80b7931611163e0f08cc85013b4ba6936be6114e17fd7e4f845c34d3

File details

Details for the file cyllama-0.2.10-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 fa9e57a4457966fee45fa4dfd3257913a1ad8e59e84c982d52336963c1bd339f
MD5 40cf6efa31ff675e77c6dd106bf6932d
BLAKE2b-256 8b1d8ada11ed29c78beb206497e51cdaf0ad5fc3c334bc84c9078e7ca9a825df

File details

Details for the file cyllama-0.2.10-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.10-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.10-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 fc8b9ae8f349b1dcf5a00bd158f10e07b387ca799eb4c5253a5eee460b9c0fdf
MD5 2a98825259a33526a4655be2d41225af
BLAKE2b-256 e1526ee45d396136625be540ded045b37cd7d8ad0413eeddc3175797e3ee4f7e

File details

Details for the file cyllama-0.2.10-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 3ee5342f3e62638b03bf640ad3e65b18d64d70d74f0e2bc00c5d8ce924c04bb8
MD5 4bac5a709be3f1b5b31047d4901edfd6
BLAKE2b-256 6848892b79ae7e3ab6857d329fb66b92c0545ddadbbed9aed412bb80cb275cdd

File details

Details for the file cyllama-0.2.10-cp312-cp312-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp312-cp312-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 38102378a3447b8e0a9e1584bf00a3791b0fe9bdc77646899530f6bb7f5788a0
MD5 95f8032cef45fd57c4ba29158c4594d4
BLAKE2b-256 fb18259a2d6b3173ebeae0ca980c14084190486501b57699a3a53549f0037d58

File details

Details for the file cyllama-0.2.10-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 3826ea9f29ea009bba92c40241d64cb290f673f4ce341c23de182ca5993af5df
MD5 5c624a2f7293851e2b199d017e87cd79
BLAKE2b-256 644e6aa1867fe8612a4b92ee93a6af540f0c868c6c14585ae8894be10ab65e98

File details

Details for the file cyllama-0.2.10-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.10-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.10-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8de55c7a669e27f8b7d5be9ed35646d8f7f40e9c7601f3e24196babec45bc182
MD5 1fb9993297ea17ca9aef0651187af310
BLAKE2b-256 1f638533e988ed285d0be6ad3a4658d69c079f24cfcfbd1cc6bde5d93c6d7c2f

File details

Details for the file cyllama-0.2.10-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 1c5cd8baf8b8885daca890d015ee2a2163e32f8a4d927564aafc865b403f7107
MD5 7268c5a2134dc91e7b6487cc81e96e90
BLAKE2b-256 0506d6f1a8baf2f45fea6a2b698f56e5fca440776a156cb5b721fbbc0cb4b861

File details

Details for the file cyllama-0.2.10-cp311-cp311-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 951adf200e4a96250ac94c1fc39a85ce82f51a588b37046785811f2b4ef0b09a
MD5 0f14ef1578d44cbb8e12699b2461c6a4
BLAKE2b-256 debba273ffdd15add6d5a4515536580d8266a3f9b5097447e78a0e9d2fae782b

File details

Details for the file cyllama-0.2.10-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 18c18ef4b154da9e4e000256a2702e316bbf335debd10909aba56803a9326c15
MD5 791f9a780461559fddffc293bada0d21
BLAKE2b-256 1ea2cd21f40d9a73cb29eaf982223637cee2430aa8f49b940f2ec50d2fff8a76

File details

Details for the file cyllama-0.2.10-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.10-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.10-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 6e4ba8c27ad2b9727daa64be38c4f2a7c3c7ddb5ecaa54eaf7e116d24bedb9db
MD5 30771c89cee8a0930657b9c157d3abe6
BLAKE2b-256 34841aaa7619a39dddeeee0ae22d53c12bb203e5e14fc58af07f24c51aab3d4f

File details

Details for the file cyllama-0.2.10-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 e4b8d0fabf3d6fd423ff4b3a672b7b680ae5e3084fd24198283a7140ad96b5b2
MD5 e6249185053cd3c2fee65e48a14de2c3
BLAKE2b-256 09a268e7bac952d4c4f238ca2f7550c2189e49906c6036905a7c32e621731da7

File details

Details for the file cyllama-0.2.10-cp310-cp310-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 c24ef0316421dd46d3bb1175954994f1e3914f822d0e2385377ad0949684a1ba
MD5 7aaf33764770e317184cce906019d7eb
BLAKE2b-256 ffdb428b998b553ae175a64dcef9d07933e5a87f70785f6d3d2a5f243d3d0c0f

File details

Details for the file cyllama-0.2.10-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.10-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4e5262d6fb64b265b6b1f492a601da594afc8028369d7400ed88ef8a7f6dadcd
MD5 b14da3539a3679cde4455b422ff03667
BLAKE2b-256 e26f8695659c9b59a02b8e029e3c89ab0d7444dc3ac12f43f5efb49dd6c06787
