cyllama is a comprehensive zero-dependencies Python library for local AI inference using the state-of-the-art llama, whisper, and stable-diffusion .cpp ecosystem.

cyllama - Fast, Pythonic AI Inference

cyllama is a no-dependencies Python library for local AI inference built on the .cpp inference stack: llama.cpp, whisper.cpp, and stable-diffusion.cpp.

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

Documentation | PyPI | Changelog

Features

  • High-level API -- complete(), chat(), LLM class for quick prototyping / text generation.

  • Streaming -- token-by-token output with callbacks

  • Batch processing -- process multiple prompts in parallel

  • GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform), SYCL (Intel)

  • Speculative decoding -- 2-3x speedup with draft models

  • Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling; multi-agent composition (agent_as_tool, TieredAgentTeam); JSON-Schema constraints on tool args via Annotated[] markers; per-tool timeouts and coercion

  • RAG -- retrieval-augmented generation with local embeddings and sqlite-vector

  • Speech recognition -- whisper.cpp transcription and translation

  • Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.

  • OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints

  • Framework integrations -- OpenAI API client, LangChain LLM interface

Installation

From PyPI

pip install cyllama

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only for now):

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

cyllama info

You can also query the backend configuration at runtime:

from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal

Optional integrations

cyllama has zero hard dependencies beyond its compiled core. Features built on third-party libraries discover them lazily at runtime, so you install only what you actually use.

PDF parsing (cyllama.rag.PDFLoader) supports four pluggable backends. Install whichever fits your needs:

Backend  | Install                  | Strengths                                      | Capabilities
---------|--------------------------|------------------------------------------------|-------------
pypdf    | pip install pypdf        | Pure-Python, lightweight, per-page text        | per_page
pymupdf  | pip install pymupdf      | Fast, per-page text, table/image awareness     | per_page, tables, images
pdfminer | pip install pdfminer.six | Pure-Python, layout-aware extraction           | layout
docling  | pip install docling      | Highest quality; OCR, tables, layout, markdown | ocr, tables, images, layout, markdown (heavy; pulls in torch + CV stack)

PDFLoader(backend="auto") (the default) picks the first installed backend in the order above (lightest-first). Select explicitly with PDFLoader(backend="docling"), or filter by capability with PDFLoader(require={"ocr"}). See cyllama.rag.available_pdf_backends() and pdf_backend_info(name) for runtime introspection.
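
The "first installed backend, lightest-first" rule can be illustrated with a small standalone sketch. This is not cyllama's implementation: the `PDF_BACKENDS` list, the importable module names, and the `pick_pdf_backend` helper are assumptions made for illustration. Only the backend order and capability sets come from the table above.

```python
from importlib.util import find_spec

# (backend name, importable module, capabilities), lightest-first.
# The module names are assumptions for illustration.
PDF_BACKENDS = [
    ("pypdf", "pypdf", {"per_page"}),
    ("pymupdf", "pymupdf", {"per_page", "tables", "images"}),
    ("pdfminer", "pdfminer", {"layout"}),
    ("docling", "docling", {"ocr", "tables", "images", "layout", "markdown"}),
]


def is_installed(module_name):
    """Return True if the module can be found, without importing it."""
    return find_spec(module_name) is not None


def pick_pdf_backend(require=frozenset(), installed=is_installed):
    """Return the first installed backend offering every required capability."""
    for name, module, capabilities in PDF_BACKENDS:
        if set(require) <= capabilities and installed(module):
            return name
    raise RuntimeError(f"no installed PDF backend provides {sorted(require)}")
```

Checking with `find_spec` rather than importing keeps heavy packages such as docling from being loaded until they are actually selected, which is the point of lazy discovery.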

Other optional integrations -- install directly when needed:

Feature             | Install
--------------------|--------------------------
Qdrant vector store | pip install qdrant-client

Build from source with a specific backend

GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama

Command-Line Interface

cyllama provides a unified CLI for all major functionality:

# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response
cyllama chat -m models/llama.gguf --stats              # show session stats on exit

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation

Run cyllama --help or cyllama <command> --help for full usage. See CLI Cheatsheet for the complete reference.
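
Once `cyllama server -m models/llama.gguf --port 8080` is running, any HTTP client should be able to talk to it. The sketch below uses only the standard library. It assumes the server exposes the conventional OpenAI path `/v1/chat/completions` on localhost and returns the usual `choices[0].message.content` shape; neither detail is confirmed by this page, so check the server documentation. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
from urllib import request


def build_chat_request(prompt, base_url="http://localhost:8080", system=None):
    """Build (but do not send) an OpenAI-style chat completion request."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt})
    body = json.dumps({"messages": messages, "temperature": 0.7}).encode("utf-8")
    return request.Request(
        f"{base_url}/v1/chat/completions",  # assumed OpenAI-style path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With a server running, `print(ask("Explain gravity", system="You are a physicist"))` would mirror the `cyllama chat` example above.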

Quick Start

from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)

Key Features

Simple by Default, Configurable When Needed

High-Level API - Get started in seconds:

from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!

Streaming Support - Real-time token-by-token output:

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)

Performance Optimized

Batch Processing - Process multiple prompts 3-10x faster:

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")

Speculative Decoding - 2-3x speedup with draft models:

from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

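# ctx_target is the already-created context of the target (large) model
params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
# prompt_tokens is the tokenized prompt; last_token is the most recently sampled token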
draft_tokens = spec.draft(prompt_tokens, last_token)
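
To show the idea behind draft-and-verify, here is a toy greedy version written against plain Python functions rather than cyllama objects. `speculative_step`, `target_next`, and `draft_next` are hypothetical names, and real engines verify all draft positions in one batched forward pass instead of one call per token.

```python
def speculative_step(target_next, draft_next, tokens, n_max=4):
    """One greedy speculative-decoding step; returns the tokens accepted."""
    # 1. The cheap draft model proposes up to n_max tokens.
    proposed = []
    for _ in range(n_max):
        proposed.append(draft_next(tokens + proposed))

    # 2. The target model checks each proposal in order.
    accepted = []
    for token in proposed:
        expected = target_next(tokens + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first wrong guess and stop
            return accepted
        accepted.append(token)

    # 3. Every draft token matched, so the target adds one extra token.
    accepted.append(target_next(tokens + accepted))
    return accepted


def target_next(tokens):
    """Toy target model: always counts upward."""
    return tokens[-1] + 1


def draft_next(tokens):
    """Toy draft model: counts upward but guesses 0 at multiples of 5."""
    nxt = tokens[-1] + 1
    return 0 if nxt % 5 == 0 else nxt
```

The output always matches what the target model alone would have produced greedily. The speedup comes from accepting several tokens per target pass whenever the draft guesses well.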

Memory Optimization - Smart GPU layer allocation:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
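
For intuition only, a back-of-the-envelope version of this estimate is sketched below. It is not how `estimate_gpu_layers` works internally. It assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and scratch buffers; `rough_gpu_layers` and `reserve_mb` are hypothetical.

```python
import math


def rough_gpu_layers(model_size_mb, n_layers, available_vram_mb, reserve_mb=512):
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    per_layer_mb = model_size_mb / n_layers
    usable_mb = max(0, available_vram_mb - reserve_mb)
    return min(n_layers, math.floor(usable_mb / per_layer_mb))
```

For example, a 16000 MB model with 32 layers on an 8000 MB card gives 500 MB per layer and 7488 MB usable, so 14 layers fit under these assumptions.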

N-gram Cache - 2-10x speedup for repetitive text:

from cyllama.llama.llama_cpp import NgramCache

# tokens and input_tokens are lists of token ids from your tokenizer (defined elsewhere)
cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
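
The mechanism behind n-gram drafting can be sketched in pure Python. `ToyNgramCache` is a hypothetical stand-in that only mirrors the `update` / `draft` call shape shown above. It records which token followed each n-gram and then proposes continuations, preferring the longest matching n-gram.

```python
from collections import Counter, defaultdict


class ToyNgramCache:
    """Maps each seen n-gram to a count of the tokens that followed it."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def update(self, tokens, ngram_min=2, ngram_max=4):
        for n in range(ngram_min, ngram_max + 1):
            for i in range(len(tokens) - n):
                self.table[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, tokens, n_draft=16, ngram_min=2, ngram_max=4):
        context = list(tokens)
        drafted = []
        while len(drafted) < n_draft:
            for n in range(ngram_max, ngram_min - 1, -1):  # longest match first
                followers = self.table.get(tuple(context[-n:]))
                if len(context) >= n and followers:
                    nxt = followers.most_common(1)[0][0]
                    break
            else:
                break  # no n-gram matched, so stop drafting
            drafted.append(nxt)
            context.append(nxt)
        return drafted
```

Drafts like these cost almost nothing to produce, which is why repetitive text (code, templates, quoted passages) benefits most: the model only has to verify the guesses.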

Response Caching - Cache LLM responses for repeated prompts:

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()

Note: Caching requires a fixed seed (not the default random sentinel) since random seeds produce non-deterministic output. Streaming responses are not cached.
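
A size-bounded cache with a time-to-live can be sketched with `OrderedDict`. This is an illustration of the general technique, not cyllama's cache: `ToyResponseCache` and `get_or_compute` are hypothetical names, and a real cache key would presumably combine the prompt with the generation settings (seed, temperature, and so on).

```python
import time
from collections import OrderedDict


class ToyResponseCache:
    """LRU cache whose entries also expire after ttl seconds."""

    def __init__(self, maxsize=100, ttl=3600.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self.entries = OrderedDict()  # key -> (expires_at, value)
        self.hits = self.misses = 0

    def get_or_compute(self, key, compute):
        now = self.clock()
        entry = self.entries.get(key)
        if entry is not None and entry[0] > now:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = compute()
        self.entries[key] = (now + self.ttl, value)
        self.entries.move_to_end(key)
        while len(self.entries) > self.maxsize:
            self.entries.popitem(last=False)  # evict the least recently used
        return value
```

The sketch also shows why a fixed seed matters: a cached value is only a faithful substitute for a fresh generation if the same key would always produce the same output.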

Framework Integrations

OpenAI-Compatible API - Drop-in replacement:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)

LangChain Integration:

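from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

# Example template; {topic} is filled in by chain.run() below
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)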
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")

Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:

from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_preconditions=[lambda task: len(task) > 10],
    answer_postconditions=[lambda ans: len(ans) > 0],
)
result = agent.run("What is 100 divided by 4?")
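
The pre/post-condition idea can be sketched with ordinary decorators. The names `require`, `ensure`, and `ContractViolation` below are local stand-ins and not cyllama's `pre` / `post`. The sketch assumes, as in the example above, that a precondition receives a dict of the call's arguments and a postcondition receives the return value.

```python
import functools
import inspect


class ContractViolation(Exception):
    """Raised when a precondition or postcondition does not hold."""


def require(predicate, message):
    """Check predicate(arguments_dict) before the function runs."""
    def decorate(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            if not predicate(dict(bound.arguments)):
                raise ContractViolation(f"precondition failed: {message}")
            return func(*args, **kwargs)
        return wrapper
    return decorate


def ensure(predicate, message):
    """Check predicate(result) after the function returns."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            if not predicate(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result
        return wrapper
    return decorate


@require(lambda args: args["x"] != 0, "cannot divide by zero")
@ensure(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x
```

Here `divide(100, 4)` returns `25.0`, while `divide(1, 0)` raises `ContractViolation` before the division is attempted.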

Schema constraints via Annotated[] -- attach JSON-Schema bounds directly to type hints (Ge, Le, MultipleOf, MinLen, MaxLen, Pattern); the dispatch layer enforces them before the tool runs:

from typing import Annotated, Literal
from cyllama.agents import tool, Ge, Le, Pattern

@tool
def fetch(
    table: Annotated[str, Pattern(r"^[a-z_]+$")],
    limit: Annotated[int, Ge(1), Le(1000)],
    mode: Literal["preview", "full"] = "preview",
) -> list[dict]: ...
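
How bounds attached via `Annotated[]` can be read back and enforced before a call is shown in the sketch below. The `Ge` and `Le` classes here are local stand-ins that only mirror the marker names above, and `validate_call` is a hypothetical dispatcher, not cyllama's dispatch layer.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints


@dataclass(frozen=True)
class Ge:
    bound: float

    def check(self, value):
        return value >= self.bound


@dataclass(frozen=True)
class Le:
    bound: float

    def check(self, value):
        return value <= self.bound


def validate_call(func, **kwargs):
    """Enforce Annotated[] markers on keyword arguments, then call func."""
    hints = get_type_hints(func, include_extras=True)
    for name, value in kwargs.items():
        # Annotated[T, m1, m2] exposes its markers as __metadata__
        for marker in getattr(hints.get(name), "__metadata__", ()):
            if not marker.check(value):
                raise ValueError(f"{name}={value!r} violates {marker}")
    return func(**kwargs)


def fetch_rows(limit: Annotated[int, Ge(1), Le(1000)]) -> int:
    """Toy tool: just echoes the validated limit."""
    return limit
```

`validate_call(fetch_rows, limit=10)` returns `10`, whereas `limit=0` or `limit=5000` raises `ValueError` before the tool body runs.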

Multi-agent composition -- wrap any agent as a tool for supervisor / worker setups; pair smaller worker LLMs with a larger planner via TieredAgentTeam:

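from cyllama import LLM
from cyllama.agents import ReActAgent, agent_as_tool, AgentRole, TieredAgentTeam

# researcher and coder are assumed to be defined earlier
# (for example, worker agents backed by smaller models)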
team = TieredAgentTeam(
    supervisor=ReActAgent(llm=LLM("models/strong.gguf"), tools=[]),
    workers=[
        AgentRole("researcher", researcher, "Find facts."),
        AgentRole("coder", coder, "Modify code."),
    ],
)
result = team.run("Refactor X using technique Y.")

See Agents Overview for detailed agent documentation, plus Contract Recipes for nine worked patterns of when to use schema vs contracts.

Speech Recognition

Whisper Transcription - Transcribe audio files with timestamps:

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
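
The division by 100 above reflects that whisper.cpp reports segment times in units of 10 ms. As a small illustration, the helpers below turn such values into SRT-style subtitles. `format_timestamp` and `to_srt` are hypothetical helpers, and they expect integer timestamps.

```python
def format_timestamp(t_centiseconds):
    """Convert an integer count of 10 ms units to HH:MM:SS,mmm."""
    total_ms = t_centiseconds * 10
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def to_srt(segments):
    """Render an iterable of (t0, t1, text) tuples as SRT subtitle text."""
    blocks = []
    for index, (t0, t1, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{format_timestamp(t0)} --> {format_timestamp(t1)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)
```

The `(t0, t1, text)` tuples could be collected from the loop above using the raw `full_get_segment_t0(i)` and `full_get_segment_t1(i)` values, before dividing by 100.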

See Whisper docs for full documentation.

Stable Diffusion

Image Generation - Generate images from text using stable-diffusion.cpp:

from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")

Advanced Generation - Full control with SDContext:

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
# sample_method / scheduler / eta / wtype default to auto-resolve
# sentinels (SD C-library defaults) -- pass explicitly only to override.
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
)

CLI Tool - Command-line interface:

# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.

RAG (Retrieval-Augmented Generation)

CLI - Query your documents from the command line:

# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding

Simple RAG - Query your documents with LLMs:

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)

Load Documents - Support for multiple file formats:

from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")

Hybrid Search - Combine vector and keyword search:

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
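
One common way to combine vector and keyword results is a weighted sum of normalized scores, sketched below. This illustrates what `vector_weight` and `fts_weight` could mean; `HybridStore` may well fuse results differently. `hybrid_rank` and the min-max normalization are assumptions.

```python
def hybrid_rank(vector_scores, fts_scores, k=5, vector_weight=0.7, fts_weight=0.3):
    """Rank documents by a weighted sum of min-max normalized scores."""

    def normalize(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {doc: (s - lo) / span if span else 1.0 for doc, s in scores.items()}

    v, f = normalize(vector_scores), normalize(fts_scores)
    combined = {
        doc: vector_weight * v.get(doc, 0.0) + fts_weight * f.get(doc, 0.0)
        for doc in v.keys() | f.keys()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]
```

Normalizing first matters because cosine similarities and full-text scores live on different scales. Without it, the weights would not reflect the intended 70/30 balance.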

Embedding Cache - Speed up repeated queries with LRU caching:

from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")
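
The same hit/miss bookkeeping is available from the standard library's `functools.lru_cache`, which makes for a compact illustration of LRU caching around any embedding function. `make_cached_embedder` and `fake_embed` are hypothetical. The sketch does not describe how `Embedder` implements its cache.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, cache_size=1000):
    """Wrap embed_fn(text) -> sequence of floats in an LRU cache."""

    @lru_cache(maxsize=cache_size)
    def cached(text):
        # Tuples are immutable, so a cached result can be shared safely
        return tuple(embed_fn(text))

    return cached


def fake_embed(text):
    """Stand-in for a real embedding model."""
    return [float(len(text)), 1.0]


embed = make_cached_embedder(fake_embed, cache_size=2)
embed("hello")  # miss: computed
embed("hello")  # hit: served from the cache
```

After these two calls, `embed.cache_info()` reports one hit and one miss, analogous to the `info.hits` / `info.misses` fields shown above.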

Agent Integration - Use RAG as an agent tool:

from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See RAG Overview for full documentation.

Common Utilities

GGUF File Manipulation - Inspect and modify model files:

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

Hugging Face Model Downloads:

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()

What's Inside

Text Generation (llama.cpp)

  • Full llama.cpp API - Cython wrapper with strong typing

  • High-Level API - Simple, Pythonic interface (LLM, complete, chat)

  • Streaming Support - Token-by-token generation with callbacks

  • Batch Processing - Efficient parallel inference

  • Multimodal - LLAVA and vision-language models

  • Speculative Decoding - 2-3x inference speedup with draft models

Speech Recognition (whisper.cpp)

  • Full whisper.cpp API - Cython wrapper

  • High-Level API - Simple transcribe() function

  • Multiple Formats - WAV, MP3, FLAC, and more

  • Language Detection - Automatic or specified language

  • Timestamps - Word and segment-level timing

Image & Video Generation (stable-diffusion.cpp)

  • Full stable-diffusion.cpp API - Cython wrapper

  • Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, Z-Image

  • Image-to-Image - Transform existing images

  • Inpainting - Mask-based editing

  • ControlNet - Guided generation with edge/pose/depth

  • Video Generation - Wan, CogVideoX models

  • Upscaling - ESRGAN 4x upscaling

Cross-Cutting Features

  • GPU Acceleration - Metal, CUDA, ROCm, Vulkan, SYCL backends

  • Memory Optimization - Smart GPU layer allocation

  • Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent

  • Framework Integration - OpenAI API, LangChain, FastAPI

Why Cyllama?

Performance: Compiled Cython wrappers with minimal overhead

  • Strong type checking at compile time

  • Zero-copy data passing where possible

  • Efficient memory management

  • Native integration with llama.cpp optimizations

Simplicity: From 50 lines to 1 line for basic generation

  • Pythonic API

  • Automatic resource management

  • Sensible defaults, full control when needed

Well-tested with broad API coverage

  • Extensive test coverage across the API surface

  • Documentation and examples for each module

  • Proper error handling and logging

  • Framework integration for real applications

Up-to-Date: Tracks bleeding-edge llama.cpp

  • Regular updates with latest features

  • All high-priority APIs wrapped

  • Performance optimizations included

Status

Build System: scikit-build-core + CMake

See pyproject.toml for the current cyllama version and CHANGELOG.md for the pinned llama.cpp / whisper.cpp / stable-diffusion.cpp revisions.

Platform & GPU Availability

Pre-built wheels on PyPI:

Package        | Backend                    | Platform | Arch                  | Linking
---------------|----------------------------|----------|-----------------------|--------
cyllama        | CPU                        | Linux    | x86_64                | static
cyllama        | CPU                        | Windows  | x86_64                | static
cyllama        | Metal                      | macOS    | arm64 (Apple Silicon) | static
cyllama        | Metal                      | macOS    | x86_64 (Intel)        | static
cyllama-cuda12 | CUDA 12.4                  | Linux    | x86_64                | dynamic
cyllama-rocm   | ROCm 6.3                   | Linux    | x86_64                | dynamic
cyllama-sycl   | Intel SYCL (oneAPI 2025.3) | Linux    | x86_64                | dynamic
cyllama-vulkan | Vulkan                     | Linux    | x86_64                | dynamic

We plan to add wheels for more platforms in the future, starting with Vulkan and CUDA 12 support on Windows.

Build from source (any platform with a C++ toolchain):

Backend    | macOS                      | Linux             | Windows
-----------|----------------------------|-------------------|------------------
CPU        | make build-cpu             | make build-cpu    | make build-cpu
Metal      | make build-metal (default) | --                | --
CUDA       | --                         | make build-cuda   | make build-cuda
ROCm (HIP) | --                         | make build-hip    | --
Vulkan     | make build-vulkan          | make build-vulkan | make build-vulkan
SYCL       | --                         | make build-sycl   | --
OpenCL     | make build-opencl          | make build-opencl | make build-opencl

All source builds support both static (make build-<backend>) and dynamic (make build-<backend>-dynamic) linking.

Recent Releases

See CHANGELOG.md for full release notes.

  • v0.2.15 (May 2026) - Context-manager protocol across llama/whisper/SD resource classes; whisper streaming callbacks; GIL released on per-token hot paths and across long native calls; new model_save_to_file / model_quantize top-level wrappers; LlamaContext state serialization switched to bytes; broad correctness sweep

  • v0.2.14 (Apr 2026) - stable-diffusion.cpp hires-fix two-pass generation; two-layer generation cancellation on LLM; llama.cpp upgraded b8833 -> b8931

  • v0.2.13 (Apr 2026) - QdrantVectorStore reference adapter for VectorStoreProtocol; pipeline-integrated reranking (RAGConfig.rerank) with RerankerProtocol; ccache + concurrency groups on CPU cibw workflows

  • v0.2.12 - Windows-CUDA, Windows-Vulkan, and macOS-Intel Vulkan GPU wheels; canonical delocate/auditwheel/delvewheel packaging. Experimental abi3 wheels (cp312+)

  • v0.2.11 (Apr 2026) - Pluggable RAG backends (VectorStoreProtocol / EmbedderProtocol) and MCP client API on LLM

  • v0.2.10 (Apr 2026) - GPU wheel size halved; packaging fixes (build_config.json, auditwheel SONAME, Vulkan ABI)

  • v0.2.9 (Apr 2026) - CUDA + SD stability fixes; get_perf_data() telemetry APIs

  • v0.2.8 (Apr 2026) - Expanded Cython bindings across llama / whisper / SD; interactive-chat streaming & sampling

  • v0.2.7 (Apr 2026) - SD defaults aligned with C library (fixes blank CUDA images)

  • v0.2.6 (Apr 2026) - Hotfix: remove accidental test-only runtime dependency

  • v0.2.5 (Apr 2026) - RAG hardening: persistent store, corpus dedup, vendored jinja2 chat templates

  • v0.2.4 (Apr 2026) - Unified cyllama CLI (gen, chat, embed, rag, …)

  • v0.2.3 (Apr 2026) - Wheel packaging and GPU portability fixes

  • v0.2.2 (Apr 2026) - CUDA wheel size stability

  • v0.2.1 (Mar 2026) - Code-quality hardening, GIL release, async fixes

  • v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels on PyPI (CUDA, ROCm, SYCL, Vulkan)

  • v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled

  • v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp

  • v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp

  • v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped

  • v0.1.16 (Dec 2025) - Response class, Async API, Chat templates

  • v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp

  • v0.1.11 (Nov 2025) - ACP support, build improvements

  • v0.1.10 (Nov 2025) - Agent Framework, bug fixes

  • v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation

  • v0.1.8 (Nov 2025) - Speculative decoding API

  • v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache

  • v0.1.6 (Nov 2025) - Multimodal test fixes

  • v0.1.5 (Oct 2025) - Mongoose server, embedded server

  • v0.1.4 (Oct 2025) - Memory estimation, performance optimizations

Building from Source

To build cyllama from source:

  1. Install a recent version of Python 3 (currently tested on Python 3.13).

  2. Git clone the latest version of cyllama:

    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    
  3. Install the dependencies with uv, which cyllama uses for package management.

    If you don't have uv, install it first; then run:

    uv sync
    
  4. Type make in the terminal.

    This will:

    1. Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
    2. Install them into the thirdparty folder
    3. Build cyllama using scikit-build-core + CMake

Build Commands

# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI

GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

See Build Backends for comprehensive backend build instructions.

Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=-1)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=-1)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=-1
)
llm = LLM("model.gguf", config=config)

Split Modes:

  • 0 (NONE): Single GPU only, uses main_gpu

  • 1 (LAYER): Split layers and KV cache across GPUs (default)

  • 2 (ROW): Tensor parallelism - split layers with row-wise distribution
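
The `tensor_split` values are proportions, so `[1, 2]` and `[0.33, 0.67]` describe nearly the same split. The sketch below normalizes the weights and shows one plausible way to turn them into whole-layer counts. `split_fractions` and `layers_per_gpu` are hypothetical helpers, and llama.cpp's actual layer assignment may round differently.

```python
def split_fractions(tensor_split):
    """Normalize relative weights so they sum to 1."""
    total = sum(tensor_split)
    if total <= 0:
        raise ValueError("tensor_split must contain a positive weight")
    return [weight / total for weight in tensor_split]


def layers_per_gpu(n_layers, tensor_split):
    """Assign whole layers by proportion, largest remainders first."""
    exact = [n_layers * fraction for fraction in split_fractions(tensor_split)]
    counts = [int(x) for x in exact]
    # Hand leftover layers to the GPUs with the largest fractional parts
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[: n_layers - sum(counts)]:
        counts[i] += 1
    return counts
```

Under this scheme, a 32-layer model with `tensor_split=[0.3, 0.7]` places 10 layers on GPU 0 and 22 on GPU 1.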

Testing

The tests directory in this repo provides extensive examples of using cyllama.

First, though, download a smallish LLM in .gguf format from Hugging Face. A good small model to start with, and the one the tests assume, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder inside the cloned cyllama directory. To create the models directory if it doesn't exist and download this model, run:

make download

This basically just does:

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

The library covers both quick prototyping and longer-running deployments. To run the full test suite:

make test  # Run full test suite

You can also explore interactively:

python3 -i scripts/start.py

>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

Documentation

Full documentation is available at https://shakfu.github.io/cyllama/ (built with MkDocs).

To serve docs locally: make docs-serve

Roadmap

Completed

  • Full llama.cpp API wrapper with Cython

  • High-level API (LLM, complete, chat)

  • Async API support (AsyncLLM, complete_async, chat_async)

  • Response class with stats and serialization

  • Built-in chat template system (llama.cpp templates)

  • Batch processing utilities

  • OpenAI-compatible API client

  • LangChain integration

  • Speculative decoding

  • GGUF file manipulation

  • JSON schema to grammar conversion

  • Model download helper

  • N-gram cache

  • OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings

  • Whisper.cpp integration

  • Multimodal support (LLAVA)

  • Memory estimation utilities

  • Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)

  • Stable Diffusion (stable-diffusion.cpp) - image/video generation

  • RAG utilities (text chunking, document processing)

Future

  • Web UI for testing

Contributing

Contributions are welcome! Please see the User Guide for development guidelines.

License

This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp which all follow the MIT licensing terms, as does cyllama.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release. See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

  • cyllama-0.2.18-cp314-cp314-win_amd64.whl (12.9 MB) -- CPython 3.14, Windows x86-64
  • cyllama-0.2.18-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.7 MB) -- CPython 3.14, manylinux: glibc 2.17+ x86-64
  • cyllama-0.2.18-cp314-cp314-macosx_11_0_x86_64.whl (13.8 MB) -- CPython 3.14, macOS 11.0+ x86-64
  • cyllama-0.2.18-cp314-cp314-macosx_11_0_arm64.whl (13.2 MB) -- CPython 3.14, macOS 11.0+ ARM64
  • cyllama-0.2.18-cp313-cp313-win_amd64.whl (12.7 MB) -- CPython 3.13, Windows x86-64
  • cyllama-0.2.18-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.7 MB) -- CPython 3.13, manylinux: glibc 2.17+ x86-64
  • cyllama-0.2.18-cp313-cp313-macosx_11_0_x86_64.whl (13.8 MB) -- CPython 3.13, macOS 11.0+ x86-64
  • cyllama-0.2.18-cp313-cp313-macosx_11_0_arm64.whl (13.2 MB) -- CPython 3.13, macOS 11.0+ ARM64
  • cyllama-0.2.18-cp312-cp312-win_amd64.whl (12.7 MB) -- CPython 3.12, Windows x86-64
  • cyllama-0.2.18-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.7 MB) -- CPython 3.12, manylinux: glibc 2.17+ x86-64
  • cyllama-0.2.18-cp312-cp312-macosx_11_0_x86_64.whl (13.8 MB) -- CPython 3.12, macOS 11.0+ x86-64
  • cyllama-0.2.18-cp312-cp312-macosx_11_0_arm64.whl (13.2 MB) -- CPython 3.12, macOS 11.0+ ARM64
  • cyllama-0.2.18-cp311-cp311-win_amd64.whl (12.7 MB) -- CPython 3.11, Windows x86-64
  • cyllama-0.2.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.7 MB) -- CPython 3.11, manylinux: glibc 2.17+ x86-64
  • cyllama-0.2.18-cp311-cp311-macosx_11_0_x86_64.whl (13.8 MB) -- CPython 3.11, macOS 11.0+ x86-64
  • cyllama-0.2.18-cp311-cp311-macosx_11_0_arm64.whl (13.2 MB) -- CPython 3.11, macOS 11.0+ ARM64
  • cyllama-0.2.18-cp310-cp310-win_amd64.whl (12.7 MB) -- CPython 3.10, Windows x86-64
  • cyllama-0.2.18-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.8 MB) -- CPython 3.10, manylinux: glibc 2.17+ x86-64
  • cyllama-0.2.18-cp310-cp310-macosx_11_0_x86_64.whl (13.8 MB) -- CPython 3.10, macOS 11.0+ x86-64
  • cyllama-0.2.18-cp310-cp310-macosx_11_0_arm64.whl (13.2 MB) -- CPython 3.10, macOS 11.0+ ARM64
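
The wheel file names listed above follow the standard wheel naming scheme (distribution, version, optional build tag, Python tag, ABI tag, platform tag). The sketch below splits a file name into those fields so a script can pick the wheel matching its interpreter and platform. The helper name `parse_wheel_filename` is hypothetical and not part of cyllama; this is a minimal illustration using only string operations, not a replacement for a full tag-matching library.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tags: list  # compressed tag sets are split on "."


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel file name into its dash-separated components.

    Hypothetical helper for illustration only.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, py, abi, plat = parts
        build = None
    elif len(parts) == 6:
        dist, version, build, py, abi, plat = parts
    else:
        raise ValueError(f"unexpected number of fields in {filename!r}")
    return WheelTags(dist, version, build, py, abi, plat.split("."))


if __name__ == "__main__":
    name = "cyllama-0.2.18-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl"
    tags = parse_wheel_filename(name)
    print(tags.python_tag, tags.platform_tags)
```

In practice pip performs this matching automatically; parsing by hand is mainly useful when mirroring or auditing release files.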

File details

Details for the file cyllama-0.2.18-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.18-cp314-cp314-win_amd64.whl
  • Size: 12.9 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.18-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 41f5fc62146193a5c1a168e0fea7c540d707bf709a8c88a0e58ca7537353f00b
MD5 ac4243c688fb1cfb8aa1f4745a291734
BLAKE2b-256 1b67410aabedacaf2dcf13dadeec6f41dba2f803cb97eff3b7c4d896595613dd

See more details on using hashes here.
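
A downloaded wheel can be checked against its published digest before installation. The following is a minimal sketch using only Python's standard `hashlib`; the function names and the local file path are assumptions for illustration, and the expected digest is the SHA256 value listed above for the cp314 Windows wheel.

```python
import hashlib
from pathlib import Path

# SHA256 digest copied from the hash listing above for
# cyllama-0.2.18-cp314-cp314-win_amd64.whl.
EXPECTED_SHA256 = "41f5fc62146193a5c1a168e0fea7c540d707bf709a8c88a0e58ca7537353f00b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the wheel was downloaded.
    wheel = Path("cyllama-0.2.18-cp314-cp314-win_amd64.whl")
    if wheel.exists():
        print("OK" if matches_expected(wheel, EXPECTED_SHA256) else "MISMATCH")
    else:
        print(f"{wheel} not found; download it first")
```

pip can also enforce digests itself: pin the package in a requirements file with a `--hash=sha256:...` option and install with `--require-hashes`.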

File details

Details for the file cyllama-0.2.18-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 b7663cce6ccd7293cafc797ffc2df10dbc2414f7ae0f7984c64d7901cddf8309
MD5 048eccfc419ecf213343cdc488520878
BLAKE2b-256 4a80f37ce9f52ef22ba59e3ec9855bad85cb90732a2a3a273118d8e8521e4dcc

File details

Details for the file cyllama-0.2.18-cp314-cp314-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp314-cp314-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 9491400a98d1b3c7eba9aa83891de348729600d8c1495468c3dfbe803ac307ef
MD5 ed6b0c4faa03a2469e19cfb4c1540f77
BLAKE2b-256 1cf4aea17c70d8fa444ea1fe8340a4659edc9b85b3ad4219e4ec93aca3825cad

File details

Details for the file cyllama-0.2.18-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.18-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 72428ee66c2dae6700127d83fac90e33db4cde911702129535248632eef3c015
MD5 a876de0eb077d2c4e4b8fab26b6eec4d
BLAKE2b-256 28d77931d10b7a70d2f45f049d6374854a1a147e648f4ebeef7a6b6445feff25

File details

Details for the file cyllama-0.2.18-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.18-cp313-cp313-win_amd64.whl
  • Size: 12.7 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.18-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 097576f5d11c7dc4928d1c0b2a06b37b0895a66d93bcb6241fc9593d595348a3
MD5 aad0997794e012f59ccfb6c85c91a89d
BLAKE2b-256 360a17355b6150fdf242ba90ad22476ce4469cc5d142c6079b83b22782efe01b

File details

Details for the file cyllama-0.2.18-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 ad70fe633c6e95d26c292f0e540a60239620507bde3071f438839bc03bb15a4a
MD5 136b53bfb9feabd850c26ad564b2a088
BLAKE2b-256 cc378ad0936902a066a676fcd8e50f5aca17bee555db22587d73212c27d21581

File details

Details for the file cyllama-0.2.18-cp313-cp313-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp313-cp313-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 0dbf4cec196565299a7387fd3e3b68eaab50503e92ac0d12b0ccea080d6d7779
MD5 dc0a69712e4ac1928afc5aee9d304064
BLAKE2b-256 52597018a1b0b6884597743dacae27d175dc1bd5f60478c6707056f19884060f

File details

Details for the file cyllama-0.2.18-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.18-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d9cda34d38eb69b0a1d812cb02d4c65cdd733a88f941f595a7ce1182a3a1d805
MD5 8d6e605eda890368e67ecfc56c290cbf
BLAKE2b-256 93c75aac2c454396a23f485a6d1ffa01d3ff22d232fdb4dc03b4f0d03e763fb1

File details

Details for the file cyllama-0.2.18-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.18-cp312-cp312-win_amd64.whl
  • Size: 12.7 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.18-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 dbd2d78c6ab8d5eca13aed3321d796e7284910aab5fd95be467488e94ca79593
MD5 5b41f8bd135c76d9d495d281fe653908
BLAKE2b-256 b5ce3963729bbc256debd0ec516a3668d3ae26e897dc68c585b8c9e714653799

File details

Details for the file cyllama-0.2.18-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 ce99811c906d255062c75f7230881e7f37e6f24cb11c662df6aa356ce25cb59c
MD5 c4396296e79cd26bfbc159b05dd97437
BLAKE2b-256 8ece9a5304b7ed83f4c3c8d2e3364bfb608b3907c27115d35813046e2909c893

File details

Details for the file cyllama-0.2.18-cp312-cp312-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp312-cp312-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 b801d0709f4faa8cfcdfbd60c42b7f8d8757ce57967c652c9151c5d96e3cd57b
MD5 d8b27c92050bf50ea8d5d04c80401405
BLAKE2b-256 34453ca14155b34fb6de98d693b28fc171d7ef096dd39e85fa2e50aa632d93d2

File details

Details for the file cyllama-0.2.18-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.18-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 596380cde507ce144ce10587d7581baa8c3ac9aa3343f12338cfeafca22a8671
MD5 e02ecdf86295dcdfba4e26e339098933
BLAKE2b-256 e2c277943656181808de6646e30541020877fd42107c04c57256e800235cbba3

File details

Details for the file cyllama-0.2.18-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.18-cp311-cp311-win_amd64.whl
  • Size: 12.7 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.18-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 e81194b1b3b7f5ab7445715af0f3b38ae43e70a8d142ae20e5e23e3e619a1c3f
MD5 3f6ff6c9738feb2f8027ff79b82537a0
BLAKE2b-256 b86e444a2dfde7f1679907de5fb88d15ea9b611b1247ba612fb1d35beb26d896

File details

Details for the file cyllama-0.2.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 64bdb8d43f07d2cc3bf08babcc0d86f3650f6631dbc96719755dc3563652a562
MD5 f7c7b119c37027fa1a89a56f70199c89
BLAKE2b-256 ee438963572ff892b5233ab7d9ad45db2ee1a2fd3f20d9c5cd81c5c41473006c

File details

Details for the file cyllama-0.2.18-cp311-cp311-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 6c7c9a28f5b43908938443a913d27d261a84967a0105c0b09a0fb33b37406939
MD5 5bf65e9400ff876f6bcd4ee2797dfd4d
BLAKE2b-256 ad8049f6c9b39d361cb4cefbad8a9fd6b710668401a15ef4473d33eeebac2215

File details

Details for the file cyllama-0.2.18-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.18-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8fffff1e5bb3cf4739f8bb50826bbe29a943f35aad094841af686356781b5163
MD5 3ec3fe1255347d68f80e0f7fc163e16e
BLAKE2b-256 b40463f7b0b682a01885e5a6e60fd0a1ca044179f09f41044ff9e471a4a5e64f

File details

Details for the file cyllama-0.2.18-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.18-cp310-cp310-win_amd64.whl
  • Size: 12.7 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.18-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 ded2c3cc67860b75b0b863138ff6b75c398027a1c3545c55a6ed8ceafd03759f
MD5 b863896038003372479a56c30a54f7b6
BLAKE2b-256 ab8cd3af9b15e6c54a2bff34e7d6d68fb785a8c58315ad0d2955bb1788269499

File details

Details for the file cyllama-0.2.18-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 fe44c3ed88f20a85f1514265a47650c5e0808827f4013ee59bae177e44f35236
MD5 cdbc40fd99eafae76b23ac291dc5441c
BLAKE2b-256 6832acbfc9e529f84d3cdef2293135aa065d691f6cc22bae8ac7731c88db917d

File details

Details for the file cyllama-0.2.18-cp310-cp310-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.18-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 c24510b94a3a14db90fd31ba5a5f31e250a03eea84061fc980e6437e9623a141
MD5 08317eb6a52a48e57657f527ac0dca7e
BLAKE2b-256 e6bcf2eda34e98ae8fc6514c54bf77a018c728c68c7aa0640c81ef33afa49e37

File details

Details for the file cyllama-0.2.18-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.18-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2293fd505e61e4d33d16ef33fae5c8e1c3ab925461a62ddedd19ecc6ad230f8d
MD5 2dc42cebd13d584d074f95e5056319b0
BLAKE2b-256 d6e8875c86411f849b2e5944d69907b4b77c6a595423f7bc5b3f8d1f31dd464e
