cyllama is a comprehensive zero-dependencies Python library for local AI inference using the state-of-the-art llama, whisper, and stable-diffusion .cpp ecosystem.

cyllama - Fast, Pythonic AI Inference

cyllama is a comprehensive no-dependencies Python library for local AI inference built on the state-of-the-art .cpp ecosystem: llama.cpp, whisper.cpp, and stable-diffusion.cpp.

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

Documentation | PyPI | Changelog

Features

  • High-level API -- complete(), chat(), LLM class for quick prototyping / text generation.

  • Streaming -- token-by-token output with callbacks

  • Batch processing -- process multiple prompts 3-10x faster

  • GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform), SYCL (Intel)

  • Speculative decoding -- 2-3x speedup with draft models

  • Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling

  • RAG -- retrieval-augmented generation with local embeddings and sqlite-vector

  • Speech recognition -- whisper.cpp transcription and translation

  • Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.

  • OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints

  • Framework integrations -- OpenAI API client, LangChain LLM interface

Installation

From PyPI

pip install cyllama

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only for now):

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

cyllama info

You can also query the backend configuration at runtime:

from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal

Build from source with a specific backend

GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama

Command-Line Interface

cyllama provides a unified CLI for all major functionality:

# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response
cyllama chat -m models/llama.gguf --stats              # show session stats on exit

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation

Run cyllama --help or cyllama <command> --help for full usage. See CLI Cheatsheet for the complete reference.

Quick Start

from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)

Key Features

Simple by Default, Powerful When Needed

High-Level API - Get started in seconds:

from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!

Streaming Support - Real-time token-by-token output:

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
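
A common pattern is to display chunks as they arrive while also keeping the full text. The sketch below assumes only that the stream yields strings, as in the loop above. `fake_stream` and `consume` are hypothetical names, and `fake_stream` stands in for `complete(..., stream=True)` so the snippet runs without a model.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for complete(..., stream=True)."""
    yield from ["Once ", "upon ", "a ", "time."]


def consume(chunks: Iterable[str], echo: bool = True) -> str:
    """Print each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        if echo:
            print(chunk, end="", flush=True)
        parts.append(chunk)
    return "".join(parts)


story = consume(fake_stream(), echo=False)
```

Because `consume` accepts any iterable of strings, the same helper should work unchanged with a real streaming call.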

Performance Optimized

Batch Processing - Process multiple prompts 3-10x faster:

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")

Speculative Decoding - 2-3x speedup with draft models:

from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

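# ctx_target, prompt_tokens and last_token are assumed to exist already:
# the target model's context and the token ids produced so far.
params = SpeculativeParams(n_max=16, p_min=0.75)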
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)
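
To make the idea behind speculative decoding concrete, here is a minimal pure-Python sketch of the greedy draft-then-verify loop. It is not cyllama's implementation: `speculative_greedy`, `target_model` and `draft_model` are hypothetical names, and the "models" are toy functions that map a token list to the next token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, n_draft: int = 4) -> List[int]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to n_draft tokens.
        proposal: List[int] = []
        for _ in range(min(n_draft, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each proposed token in order.
        for tok in proposal:
            expected = target(seq)
            seq.append(expected)   # the target's choice is always kept
            if expected != tok:    # first mismatch: discard the rest of the draft
                break
    return seq


def target_model(seq: List[int]) -> int:
    """Toy target: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    """Toy draft: agrees with the target except right after token 4."""
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


out = speculative_greedy(target_model, draft_model, prompt=[0], n_new=8)
```

The output always equals what the target alone would generate greedily. In a real engine the speedup comes from the target scoring the whole draft in one batched forward pass rather than one call per token, as this sketch does for clarity.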

Memory Optimization - Smart GPU layer allocation:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
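
The arithmetic behind such an estimate can be sketched as follows. This is an illustrative back-of-the-envelope calculation, not the heuristic `estimate_gpu_layers` actually uses: `layers_that_fit` and its `reserve_mb` parameter are hypothetical, and it assumes all layers are the same size.

```python
def layers_that_fit(model_size_mb: float, n_layers: int,
                    available_vram_mb: float, reserve_mb: float = 1024.0) -> int:
    """Estimate how many equally sized layers fit in VRAM.

    reserve_mb is headroom kept free for the KV cache and scratch buffers
    (an assumed figure, not a measured one).
    """
    per_layer_mb = model_size_mb / n_layers
    usable_mb = max(0.0, available_vram_mb - reserve_mb)
    return min(n_layers, int(usable_mb // per_layer_mb))


# A 13 GB model with 40 layers on a card with 8 GB free
partial = layers_that_fit(13000, 40, 8000)
# A small model fits entirely
full = layers_that_fit(1300, 16, 8000)
```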

N-gram Cache - 2-10x speedup for repetitive text:

from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
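
The technique can be illustrated with a toy lookup table: remember which token followed each n-gram, then draft by matching the longest known suffix of the context. `ToyNgramCache` is a hypothetical pure-Python sketch and does not reflect how cyllama's `NgramCache` is implemented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class ToyNgramCache:
    """Hypothetical illustration of n-gram drafting; not cyllama's NgramCache."""

    def __init__(self, ngram_min: int = 2, ngram_max: int = 4) -> None:
        self.ngram_min = ngram_min
        self.ngram_max = ngram_max
        self.table: Dict[Tuple[int, ...], Counter] = defaultdict(Counter)

    def update(self, tokens: List[int]) -> None:
        """Record which token followed each n-gram in the history."""
        for n in range(self.ngram_min, self.ngram_max + 1):
            for i in range(len(tokens) - n):
                self.table[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, context: List[int], n_draft: int) -> List[int]:
        """Guess up to n_draft tokens by matching the longest known suffix."""
        seq, out = list(context), []
        for _ in range(n_draft):
            for n in range(self.ngram_max, self.ngram_min - 1, -1):
                followers = self.table.get(tuple(seq[-n:])) if len(seq) >= n else None
                if followers:
                    tok = followers.most_common(1)[0][0]
                    break
            else:
                return out  # no n-gram matched, so stop drafting
            seq.append(tok)
            out.append(tok)
        return out


toy = ToyNgramCache()
toy.update([1, 2, 3, 4, 1, 2, 3, 4, 1, 2])
guess = toy.draft([9, 1, 2], n_draft=3)
```

Drafted tokens are only guesses. As with speculative decoding, the model still has to verify them, which is why the gain is largest on repetitive text.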

Response Caching - Cache LLM responses for repeated prompts:

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()

Note: Caching requires a fixed seed (not the default random sentinel) since random seeds produce non-deterministic output. Streaming responses are not cached.
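
The interaction between entry count and TTL can be sketched with a small LRU cache whose entries expire. This is a minimal illustration under assumed semantics, not cyllama's cache: `TTLCache` and its injectable `clock` are hypothetical.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class TTLCache:
    """Hypothetical LRU cache with per-entry expiry; not cyllama's implementation."""

    def __init__(self, maxsize: int, ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self.data: "OrderedDict[str, tuple]" = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key: str) -> Optional[str]:
        entry = self.data.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            self.data.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return entry[0]
        self.data.pop(key, None)         # drop the expired entry, if any
        self.misses += 1
        return None

    def put(self, key: str, value: str) -> None:
        self.data[key] = (value, self.clock())
        self.data.move_to_end(key)
        if len(self.data) > self.maxsize:
            self.data.popitem(last=False)  # evict the least recently used entry


now = [0.0]  # fake clock so the example is deterministic
cache = TTLCache(maxsize=2, ttl=3600, clock=lambda: now[0])
cache.put("What is Python?", "A programming language.")
first = cache.get("What is Python?")    # within the TTL: a hit
now[0] = 4000.0                         # advance past the TTL
second = cache.get("What is Python?")   # expired: a miss
```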

Framework Integrations

OpenAI-Compatible API - Drop-in replacement:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)

LangChain Integration - Seamless ecosystem access:

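from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Example prompt; {topic} is filled in by chain.run(topic=...)
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")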

Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval  # third-party, used only by this example: pip install simpleeval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:

from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
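
For readers unfamiliar with contracts, the pre/post pattern can be sketched with plain decorators. This is an illustration of the concept only, not cyllama's `pre`/`post`: `require`, `ensure` and `ContractViolation` are hypothetical names, and the wrapped function is assumed to be called with keyword arguments, as a tool call with named arguments would be.

```python
import functools
from typing import Any, Callable, Dict


class ContractViolation(Exception):
    """Raised when a pre- or postcondition fails (hypothetical name)."""


def require(check: Callable[[Dict[str, Any]], bool], message: str):
    """Precondition: `check` receives the keyword arguments as a dict."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            if not check(kwargs):
                raise ContractViolation(f"precondition failed: {message}")
            return func(**kwargs)
        return wrapper
    return decorator


def ensure(check: Callable[[Any], bool], message: str):
    """Postcondition: `check` receives the return value."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            if not check(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result
        return wrapper
    return decorator


@require(lambda args: args["x"] != 0, "cannot divide by zero")
@ensure(lambda r: r is not None, "result must not be None")
def safe_divide(a: float, x: float) -> float:
    return a / x


ok = safe_divide(a=100, x=4)
try:
    safe_divide(a=1, x=0)
    violation = None
except ContractViolation as exc:
    violation = str(exc)
```

The precondition runs before the function body, so the division by zero is never attempted.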

See Agents Overview for detailed agent documentation.

Speech Recognition

Whisper Transcription - Transcribe audio files with timestamps:

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
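
The division by 100 above reflects that segment times are expressed in units of 10 ms. If subtitle-style timestamps are needed, a small helper along these lines may be useful. `format_timestamp` is a hypothetical name and not part of cyllama.

```python
def format_timestamp(t: int) -> str:
    """Convert a timestamp in units of 10 ms to HH:MM:SS.mmm."""
    total_ms = t * 10
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


stamp = format_timestamp(12345)  # 123.45 seconds
```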

See Whisper docs for full documentation.

Stable Diffusion

Image Generation - Generate images from text using stable-diffusion.cpp:

from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")

Advanced Generation - Full control with SDContext:

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
# sample_method / scheduler / eta / wtype default to auto-resolve
# sentinels (SD C-library defaults) -- pass explicitly only to override.
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
)

CLI Tool - Command-line interface:

# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.

RAG (Retrieval-Augmented Generation)

CLI - Query your documents from the command line:

# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding

Simple RAG - Query your documents with LLMs:

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)

Load Documents - Support for multiple file formats:

from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")

Hybrid Search - Combine vector and keyword search:

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
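
One plausible way to read the two weights is as a weighted sum of a vector-similarity score and a keyword (full-text) score. The sketch below illustrates that reading. The fusion formula HybridStore actually uses is not stated here, so `fuse_scores` is a hypothetical helper and both score sets are assumed to be normalized to [0, 1].

```python
from typing import Dict, List, Tuple


def fuse_scores(vector_scores: Dict[str, float], fts_scores: Dict[str, float],
                vector_weight: float = 0.7, fts_weight: float = 0.3,
                k: int = 5) -> List[Tuple[str, float]]:
    """Rank documents by a weighted sum of two score sets."""
    doc_ids = set(vector_scores) | set(fts_scores)
    combined = {
        doc: vector_weight * vector_scores.get(doc, 0.0)
        + fts_weight * fts_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    # Highest combined score first, truncated to the top k
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


ranked = fuse_scores(
    vector_scores={"a": 0.9, "b": 0.4},
    fts_scores={"b": 1.0, "c": 0.5},
)
```

A document found by only one of the two searches still gets a score. It simply contributes zero for the side that missed it.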

Embedding Cache - Speed up repeated queries with LRU caching:

from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")
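
The hit/miss counters behave much like those of Python's standard `functools.lru_cache`, which can be used to see the pattern in isolation. The `embed` function below is a hypothetical stand-in for an expensive embedding call, not cyllama's `Embedder`.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1000)
def embed(text: str) -> Tuple[float, ...]:
    """Hypothetical stand-in for an expensive embedding call."""
    return tuple(float(ord(ch)) for ch in text)


embed("hello")  # cache miss: computed
embed("hello")  # cache hit: returned from the cache
lru_info = embed.cache_info()
```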

Agent Integration - Use RAG as an agent tool:

from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See RAG Overview for full documentation.

Common Utilities

GGUF File Manipulation - Inspect and modify model files:

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

Hugging Face Model Downloads:

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()

What's Inside

Text Generation (llama.cpp)

  • Full llama.cpp API - Complete Cython wrapper with strong typing

  • High-Level API - Simple, Pythonic interface (LLM, complete, chat)

  • Streaming Support - Token-by-token generation with callbacks

  • Batch Processing - Efficient parallel inference

  • Multimodal - LLAVA and vision-language models

  • Speculative Decoding - 2-3x inference speedup with draft models

Speech Recognition (whisper.cpp)

  • Full whisper.cpp API - Complete Cython wrapper

  • High-Level API - Simple transcribe() function

  • Multiple Formats - WAV, MP3, FLAC, and more

  • Language Detection - Automatic or specified language

  • Timestamps - Word and segment-level timing

Image & Video Generation (stable-diffusion.cpp)

  • Full stable-diffusion.cpp API - Complete Cython wrapper

  • Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, Z-Image

  • Image-to-Image - Transform existing images

  • Inpainting - Mask-based editing

  • ControlNet - Guided generation with edge/pose/depth

  • Video Generation - Wan, CogVideoX models

  • Upscaling - ESRGAN 4x upscaling

Cross-Cutting Features

  • GPU Acceleration - Metal, CUDA, ROCm, Vulkan, SYCL backends

  • Memory Optimization - Smart GPU layer allocation

  • Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent

  • Framework Integration - OpenAI API, LangChain, FastAPI

Why Cyllama?

Performance: Compiled Cython wrappers with minimal overhead

  • Strong type checking at compile time

  • Zero-copy data passing where possible

  • Efficient memory management

  • Native integration with llama.cpp optimizations

Simplicity: From 50 lines to 1 line for basic generation

  • Intuitive, Pythonic API design

  • Automatic resource management

  • Sensible defaults, full control when needed

Production-Ready: Battle-tested and comprehensive

  • 1489+ passing tests with extensive coverage

  • Comprehensive documentation and examples

  • Proper error handling and logging

  • Framework integration for real applications

Up-to-Date: Tracks bleeding-edge llama.cpp

  • Regular updates with latest features

  • All high-priority APIs wrapped

  • Performance optimizations included

Status

  • Current Version: 0.2.13 (Apr 2026)

  • llama.cpp Version: b8833

  • Build System: scikit-build-core + CMake

  • Test Coverage: 1489+ tests passing

Platform & GPU Availability

Pre-built wheels on PyPI:

Package        | Backend                    | Platform | Arch                  | Linking
cyllama        | CPU                        | Linux    | x86_64                | static
cyllama        | CPU                        | Windows  | x86_64                | static
cyllama        | Metal                      | macOS    | arm64 (Apple Silicon) | static
cyllama        | Metal                      | macOS    | x86_64 (Intel)        | static
cyllama-cuda12 | CUDA 12.4                  | Linux    | x86_64                | dynamic
cyllama-rocm   | ROCm 6.3                   | Linux    | x86_64                | dynamic
cyllama-sycl   | Intel SYCL (oneAPI 2025.3) | Linux    | x86_64                | dynamic
cyllama-vulkan | Vulkan                     | Linux    | x86_64                | dynamic

We will add wheels for more platforms in the future, starting with Vulkan and CUDA 12 support on Windows.

Build from source (any platform with a C++ toolchain):

Backend    | macOS                      | Linux             | Windows
CPU        | make build-cpu             | make build-cpu    | make build-cpu
Metal      | make build-metal (default) | --                | --
CUDA       | --                         | make build-cuda   | make build-cuda
ROCm (HIP) | --                         | make build-hip    | --
Vulkan     | make build-vulkan          | make build-vulkan | make build-vulkan
SYCL       | --                         | make build-sycl   | --
OpenCL     | make build-opencl          | make build-opencl | make build-opencl

All source builds support both static (make build-<backend>) and dynamic (make build-<backend>-dynamic) linking.

Recent Releases

See CHANGELOG.md for full release notes.

  • v0.2.13 (Apr 2026) - QdrantVectorStore reference adapter for VectorStoreProtocol; pipeline-integrated reranking (RAGConfig.rerank) with RerankerProtocol; ccache + concurrency groups on CPU cibw workflows; Windows GPU-wheel LoadLibraryW PATH fix

  • v0.2.12 - Windows-CUDA, Windows-Vulkan, and macOS-Intel Vulkan GPU wheels; canonical delocate/auditwheel/delvewheel packaging. Experimental abi3 wheels (cp312+)

  • v0.2.11 (Apr 2026) - Pluggable RAG backends (VectorStoreProtocol / EmbedderProtocol) and MCP client API on LLM

  • v0.2.10 (Apr 2026) - GPU wheel size halved; packaging fixes (build_config.json, auditwheel SONAME, Vulkan ABI)

  • v0.2.9 (Apr 2026) - CUDA + SD stability fixes; get_perf_data() telemetry APIs

  • v0.2.8 (Apr 2026) - Expanded Cython bindings across llama / whisper / SD; interactive-chat streaming & sampling

  • v0.2.7 (Apr 2026) - SD defaults aligned with C library (fixes blank CUDA images)

  • v0.2.6 (Apr 2026) - Hotfix: remove accidental test-only runtime dependency

  • v0.2.5 (Apr 2026) - RAG hardening: persistent store, corpus dedup, vendored jinja2 chat templates

  • v0.2.4 (Apr 2026) - Unified cyllama CLI (gen, chat, embed, rag, …)

  • v0.2.3 (Apr 2026) - Wheel packaging and GPU portability fixes

  • v0.2.2 (Apr 2026) - CUDA wheel size stability

  • v0.2.1 (Mar 2026) - Code-quality hardening, GIL release, async fixes

  • v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels on PyPI (CUDA, ROCm, SYCL, Vulkan)

  • v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled

  • v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp

  • v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp

  • v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped

  • v0.1.16 (Dec 2025) - Response class, Async API, Chat templates

  • v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp

  • v0.1.11 (Nov 2025) - ACP support, build improvements

  • v0.1.10 (Nov 2025) - Agent Framework, bug fixes

  • v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation

  • v0.1.8 (Nov 2025) - Speculative decoding API

  • v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache

  • v0.1.6 (Nov 2025) - Multimodal test fixes

  • v0.1.5 (Oct 2025) - Mongoose server, embedded server

  • v0.1.4 (Oct 2025) - Memory estimation, performance optimizations

See CHANGELOG.md for complete release history.

Building from Source

To build cyllama from source:

  1. Install a recent version of Python 3 (currently tested on Python 3.13).

  2. Git clone the latest version of cyllama:

    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    
  3. We use uv for package management:

    If you don't have uv, install it first (see the uv documentation); otherwise run:

    uv sync
    
  4. Type make in the terminal.

    This will:

    1. Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
    2. Install them into the thirdparty folder
    3. Build cyllama using scikit-build-core + CMake

Build Commands

# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI

GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

See Build Backends for comprehensive backend build instructions.

Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=-1)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=-1)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=-1
)
llm = LLM("model.gguf", config=config)

Split Modes:

  • 0 (NONE): Single GPU only, uses main_gpu

  • 1 (LAYER): Split layers and KV cache across GPUs (default)

  • 2 (ROW): Tensor parallelism - split layers with row-wise distribution
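
The `tensor_split` values are relative weights, so `[1, 2]` and `[0.3, 0.7]` both describe proportions. The sketch below shows the normalization and one illustrative way to turn proportions into whole-layer counts. `normalize_split` and `layers_per_gpu` are hypothetical helpers, and the rounding llama.cpp applies internally may differ.

```python
from typing import List, Sequence


def normalize_split(tensor_split: Sequence[float]) -> List[float]:
    """Turn relative weights such as [1, 2] into fractions that sum to 1."""
    total = sum(tensor_split)
    if total <= 0:
        raise ValueError("tensor_split must contain a positive weight")
    return [w / total for w in tensor_split]


def layers_per_gpu(n_layers: int, tensor_split: Sequence[float]) -> List[int]:
    """Assign whole layers to GPUs in proportion to the split (illustrative rounding)."""
    fractions = normalize_split(tensor_split)
    counts = [int(n_layers * f) for f in fractions]
    counts[-1] += n_layers - sum(counts)  # give any remainder to the last GPU
    return counts


fractions = normalize_split([1, 2])   # about one third and two thirds
counts = layers_per_gpu(32, [1, 3])   # a quarter of the layers on GPU 0
```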

Testing

The tests directory in this repo provides extensive examples of using cyllama.

However, as a first step, you should download a small LLM in .gguf format from Hugging Face. A good small model to start with, and the one assumed by the tests, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder in the cloned cyllama directory. To create the models directory (if it doesn't exist) and download this model, type:

make download

This basically just does:

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

With 1489+ passing tests, the library is ready for both quick prototyping and production use:

make test  # Run full test suite

You can also explore interactively:

python3 -i scripts/start.py

>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

Documentation

Full documentation is available at https://shakfu.github.io/cyllama/ (built with MkDocs).

To serve docs locally: make docs-serve

  • User Guide - Comprehensive guide covering all features

  • CLI Cheatsheet - Complete CLI reference for all commands

  • API Reference - Complete API documentation

  • RAG Overview - Retrieval-augmented generation guide

  • Cookbook - Practical recipes and patterns

  • Changelog - Complete release history

  • Examples - See tests/examples/ for working code samples

Roadmap

Completed

  • Full llama.cpp API wrapper with Cython

  • High-level API (LLM, complete, chat)

  • Async API support (AsyncLLM, complete_async, chat_async)

  • Response class with stats and serialization

  • Built-in chat template system (llama.cpp templates)

  • Batch processing utilities

  • OpenAI-compatible API client

  • LangChain integration

  • Speculative decoding

  • GGUF file manipulation

  • JSON schema to grammar conversion

  • Model download helper

  • N-gram cache

  • OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings

  • Whisper.cpp integration

  • Multimodal support (LLAVA)

  • Memory estimation utilities

  • Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)

  • Stable Diffusion (stable-diffusion.cpp) - image/video generation

  • RAG utilities (text chunking, document processing)

Future

  • Web UI for testing

Contributing

Contributions are welcome! Please see the User Guide for development guidelines.

License

This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp which all follow the MIT licensing terms, as does cyllama.

File details

Details for the file cyllama-0.2.13-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.13-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 12.6 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.13-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 0fe8e9e642a684727700d47ce8d743ee39d0ab24db67bbd5f77ea87a9920f37b
MD5 51f4bca7963f1ee8f984fc80ea0d6fa0
BLAKE2b-256 2a6321c506e9757998e6ec1bfef3ebbf300335ae84d6ef0b8af7b1914a0a1f03

See more details on using hashes here.

File details

Details for the file cyllama-0.2.13-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 9c4afbe496e9d25fa92b14a78985708f42c328b6f4dc8b572a5563ee118ad57f
MD5 52033371eb107180f9b229ff170e6206
BLAKE2b-256 7cd564025e9fdeb938cc18658ee109a432340cd5e315bec13ca80a1658772166

See more details on using hashes here.

File details

Details for the file cyllama-0.2.13-cp314-cp314-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp314-cp314-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 0af34851485725158477d791bbfb9af47e798c4b5dc479886d99a395be2d1127
MD5 3c9c1f5303ff7d4daa9c4f1fc6924ec4
BLAKE2b-256 01b08344cf85df2a4f42d73ea46c0cfd745fcd53c59caab568c3aaea12d01bf6


File details

Details for the file cyllama-0.2.13-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 a926844a7e94336ccca33e3a42dcbfa4e95644f2c707bdf879100cfd3fa8b2ab
MD5 ed33c852cb91c7f2492afdefe63bb06f
BLAKE2b-256 cb4caa1094d82b9505b2cd09aa1b9322ef600a8ac54082a8828a39fbfd8ca0a7


File details

Details for the file cyllama-0.2.13-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.13-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.13-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 a77f766e9255ee7f48cdb760582d430815d831ed20f8728f930c9f75b144c374
MD5 df77ac8fe3d7288018c173449a48852e
BLAKE2b-256 8c8ce4d3e498a1b9363a7950bcd6ecddc3fc4b779d267d8e9d725bd9faa5a49e


File details

Details for the file cyllama-0.2.13-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 99f37e933c363e09cff8a9e9d0b2b3ed8f312c05256faf54a133d94402f6a554
MD5 8b2ae1e32a0d0b50883395f500a64048
BLAKE2b-256 9455215cf78a7c586e40a97cb327b4c44a3b347f0a4f47980313f7daed6c59c7


File details

Details for the file cyllama-0.2.13-cp313-cp313-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp313-cp313-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 995ce618d2de4d5b4e14d2fb47e88c1e6026fb04902fc159f3cc79bc06e16f20
MD5 d7d20c724b94957887295c9aa2890cbf
BLAKE2b-256 0e64f3d7367f441737b89e1f46cd2aebc32fb0e8f1c2026f4d849de8755650ca


File details

Details for the file cyllama-0.2.13-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 53d9fe2592be9f54bf5355af34b1d20be1b4c95de9b570d6c134016e413f5a37
MD5 f48b89ed6e3cbb6b5522de364df1d553
BLAKE2b-256 124ae580e1dd8450eb09ea71171f7a580e1ecd0cbdcb6ee4edd361ad2a15edb6


File details

Details for the file cyllama-0.2.13-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.13-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.13-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 12ded8bc0f2f739b289e1d6c0a3f39ab321e48ab243b72f0818f9ecbf5862518
MD5 9e1686168f9177723e57a9befbc67d02
BLAKE2b-256 3efdf669e8de5d381a2ae0bf85cfbece9bf885115878989549c544864fb0f2d9


File details

Details for the file cyllama-0.2.13-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 884d688acdaf65e7d0016136e03b7f6ab46a3c06021b12e8fe929481811bf814
MD5 721f4fb13cc682328c02dabfd8a79455
BLAKE2b-256 7a31500e1d52f71946347c7c9a18af664b1a23fe1cb0ec8d794a380c7ea63909


File details

Details for the file cyllama-0.2.13-cp312-cp312-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp312-cp312-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 489d92d0dd7ea48d1d65ea8909d22e46d496356a9e45a2b292b24decfd7d0d70
MD5 beaccadd7bf9146282ffccb885391619
BLAKE2b-256 012e71251124c6fe4f0c9146687620c379a58aee36ecde284b736ea4acf61b1a


File details

Details for the file cyllama-0.2.13-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2cdf315790dd11d3068639c95fe6aa5cf04ea3020b2c1bbee775c868c5d6b6cb
MD5 431f6c348658342a001abb23429594c6
BLAKE2b-256 f9626b2bec659d68d6eb50b012df6cb48f2047df933cfc09457363b09416b6b1


File details

Details for the file cyllama-0.2.13-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.13-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.13-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 4fb0b1855eabcdbae5ac9f5a2dba1f810b8402866bfabe207f20515eae4e97de
MD5 49dd55d4af91fae805b7548c5ce6ae3c
BLAKE2b-256 69f598b95a859b4c2bc278c440ea97dae826ccc224327b6607a89a2a5831ee9c


File details

Details for the file cyllama-0.2.13-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 64846b8d39771b6dede455985586ea9c9105001159ca84a8389ebaec1cff2111
MD5 cae8001c9851399b2951e18eb60d3bd2
BLAKE2b-256 9d3e80159ae6ece41f31fcdbc40125fa88facbeebfe7197476cd3a9be015b9a2


File details

Details for the file cyllama-0.2.13-cp311-cp311-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 ec62e3ae82b1f4ffc5249b23c2e9647a99efe1f5f78f461ead4a75df56411084
MD5 41b1feb643627e6fdbcdaf30bf683e1a
BLAKE2b-256 9ef4d7b3c027c765c9b54c2e6cc7620a833769166fa3b7286e7bd3f89af88d4a


File details

Details for the file cyllama-0.2.13-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 aecc60df8b4a4c2f03972a2caff05a378a53754c59cd2ce74ac7c869b20af217
MD5 cef6a9bd203ea171b3da93bc9ec24bc7
BLAKE2b-256 0619c0b81f7975fec9306a9a0e745a1ddc58fb647d26052f762eb39ea061c560


File details

Details for the file cyllama-0.2.13-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.13-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.13-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 9cb596f8e30992d6462240befc44f470ac4692e40bcf56f029739aa141ef56e2
MD5 872913fe5d84e1ca289ebce4b1ed8639
BLAKE2b-256 0d38e770a466e094cf72ed2663631b774500a62358f0f38354cd487e9c7e6ba5


File details

Details for the file cyllama-0.2.13-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 c7cb9f8d452da2d051f080f1d03f47882c9181e6f080366261be3b12e4e4ecf3
MD5 697269a9b4d3301a81f76271116ac3cf
BLAKE2b-256 57ca26f24eb43f6021b8948b3d880533ba531827c2f685fa004d54a14ff76d4a


File details

Details for the file cyllama-0.2.13-cp310-cp310-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 73363e4614e00368969cb5dc83d5e98542f9cde6b462ee4c45ab057d5adc9cd1
MD5 837989b6e3012f1d0b0c399d6b0b7498
BLAKE2b-256 3a9f51243dc25ca2325c943c3bb58bafcc5b94e594559b443de09bbe356595dc


File details

Details for the file cyllama-0.2.13-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.13-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 16a994e5a64a93283d7cbd544e0c51fae54d5d3f97779ae88e5e1e1a0b596240
MD5 db70a2a40571cfe8af4d1260b5e86288
BLAKE2b-256 806b74375b38d52e42ee903559659be09a961a3ba82e503e3aa0cae9c5e32c48

