cyllama is a comprehensive, zero-dependency Python library for local AI inference built on the llama.cpp, whisper.cpp, and stable-diffusion.cpp ecosystem.

cyllama - Fast, Pythonic AI Inference

cyllama is a zero-dependency Python library for local AI inference built on the .cpp inference stack: llama.cpp, whisper.cpp, and stable-diffusion.cpp.

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

Documentation | PyPI | Changelog

Features

  • High-level API -- complete(), chat(), LLM class for quick prototyping / text generation.

  • Streaming -- token-by-token output with callbacks

  • Batch processing -- process multiple prompts in parallel

  • GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform), SYCL (Intel)

  • Speculative decoding -- 2-3x speedup with draft models

  • Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling

  • RAG -- retrieval-augmented generation with local embeddings and sqlite-vector

  • Speech recognition -- whisper.cpp transcription and translation

  • Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.

  • OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints

  • Framework integrations -- OpenAI API client, LangChain LLM interface

Installation

From PyPI

pip install cyllama

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only for now):

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

cyllama info

You can also query the backend configuration at runtime:

from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal
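
To list every backend compiled into the installed build, you could loop over candidate flag names. This is a minimal sketch that assumes `_backend` exposes one boolean attribute per backend. Only `cuda` and `metal` are confirmed above; the other flag names and the `enabled_backends` helper are hypothetical, and a stand-in object is used so the sketch runs without cyllama installed.

```python
from types import SimpleNamespace

# Only "cuda" and "metal" appear in the text above; the remaining names are
# guesses and are silently skipped when the attribute does not exist.
CANDIDATE_FLAGS = ("cuda", "metal", "vulkan", "hip", "sycl", "opencl")


def enabled_backends(backend, names=CANDIDATE_FLAGS):
    """Return the flag names that exist on `backend` and are truthy."""
    return [name for name in names if getattr(backend, name, False)]


# Stand-in for `cyllama._backend` (hypothetical values).
fake_backend = SimpleNamespace(cuda=False, metal=True)
print(enabled_backends(fake_backend))
```

With cyllama installed, passing the real `_backend` module instead of `fake_backend` should work the same way, provided the flags are plain attributes.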

Build from source with a specific backend

GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama

Command-Line Interface

cyllama provides a unified CLI for all major functionality:

# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response
cyllama chat -m models/llama.gguf --stats              # show session stats on exit

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation

Run cyllama --help or cyllama <command> --help for full usage. See CLI Cheatsheet for the complete reference.
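
Once `cyllama server` is running, any HTTP client can talk to it. The sketch below only builds the request, so it runs without a server. It assumes the server follows the usual OpenAI path `/v1/chat/completions` and accepts `max_tokens`; neither is confirmed in this README, and `build_chat_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=128):
    """Build (but do not send) a POST request for a chat completion."""
    payload = {"messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",  # assumed path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "http://localhost:8080",  # matches --port 8080 in the CLI example
    [{"role": "user", "content": "Hello!"}],
)
print(req.full_url)

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```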

Quick Start

from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)

Key Features

Simple by Default, Configurable When Needed

High-Level API - Get started in seconds:

from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!

Streaming Support - Real-time token-by-token output:

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
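
If you want both live output and the full text afterwards, you can wrap the stream in a small collector. This sketch assumes each chunk is a plain string, as the `print(chunk, ...)` call above suggests. `collect_stream` and `fake_stream` are hypothetical names, and the fake generator stands in for `complete(..., stream=True)`.

```python
def collect_stream(chunks, on_chunk=None):
    """Consume an iterable of text chunks, optionally calling `on_chunk`
    for each one, and return the concatenated text."""
    parts = []
    for chunk in chunks:
        if on_chunk is not None:
            on_chunk(chunk)
        parts.append(chunk)
    return "".join(parts)


def fake_stream():
    """Stand-in for a streaming completion."""
    yield from ["Once ", "upon ", "a ", "time"]


text = collect_stream(fake_stream(), on_chunk=lambda c: print(c, end="", flush=True))
print()
```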

Performance Optimized

Batch Processing - Process multiple prompts 3-10x faster:

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")

Speculative Decoding - 2-3x speedup with draft models:

from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)
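
The idea behind speculative decoding is that a cheap draft proposes several tokens and the target model keeps only the prefix it agrees with. The sketch below shows the greedy acceptance rule with a toy "model". It is an illustration of the technique, not cyllama's implementation: real implementations verify all drafted tokens in one batched forward pass, and `accept_draft` and `counting_target` are hypothetical names.

```python
def accept_draft(draft_tokens, target_next_token, context):
    """Greedy verification: keep drafted tokens while the target model agrees.

    `target_next_token(seq)` returns the token the target model would emit
    after `seq`. The result is the accepted prefix plus one token chosen by
    the target itself (a correction on mismatch, a bonus token otherwise).
    """
    accepted = []
    seq = list(context)
    for tok in draft_tokens:
        expected = target_next_token(seq)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the miss
            return accepted
        accepted.append(tok)
        seq.append(tok)
    accepted.append(target_next_token(seq))  # every draft token was accepted
    return accepted


def counting_target(seq):
    """Toy target model: always continues counting upward."""
    return seq[-1] + 1


print(accept_draft([4, 5, 9, 10], counting_target, [1, 2, 3]))  # [4, 5, 6]
```

The speedup comes from the accepted prefix: each verification step can yield several tokens for roughly the cost of one target-model pass.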

Memory Optimization - Smart GPU layer allocation:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")
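
As a rough mental model of what such an estimate involves, the back-of-the-envelope calculation below treats every layer as the same size and reserves a fixed overhead. It is not the algorithm `estimate_gpu_layers` uses; `layers_that_fit`, the 512 MB overhead, and the example sizes are all assumptions for illustration.

```python
def layers_that_fit(model_size_mb, n_layers, available_vram_mb, overhead_mb=512):
    """Estimate how many equally sized layers fit in VRAM after reserving
    `overhead_mb` for the KV cache and scratch buffers."""
    per_layer_mb = model_size_mb / n_layers
    usable_mb = max(available_vram_mb - overhead_mb, 0)
    return min(n_layers, int(usable_mb // per_layer_mb))


# A hypothetical 12 GB model with 40 layers on a card with 8000 MB free:
# 300 MB per layer, 7488 MB usable.
print(layers_that_fit(model_size_mb=12000, n_layers=40, available_vram_mb=8000))  # 24
```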

N-gram Cache - 2-10x speedup for repetitive text:

from cyllama.llama.llama_cpp import NgramCache

cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
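
To see why an n-gram cache helps on repetitive text, here is a pure-Python sketch of the lookup idea: remember which token followed each n-gram, then draft by repeatedly matching the longest known suffix. It illustrates the technique only. The function names are hypothetical and the real `NgramCache` may behave differently.

```python
from collections import Counter, defaultdict


def build_ngram_table(tokens, ngram_min=2, ngram_max=4):
    """Map each n-gram (as a tuple) to a Counter of the tokens that followed it."""
    table = defaultdict(Counter)
    for n in range(ngram_min, ngram_max + 1):
        for i in range(len(tokens) - n):
            table[tuple(tokens[i:i + n])][tokens[i + n]] += 1
    return table


def draft_from_table(table, input_tokens, n_draft=16, ngram_min=2, ngram_max=4):
    """Propose up to `n_draft` tokens, preferring the longest matching suffix."""
    seq = list(input_tokens)
    draft = []
    while len(draft) < n_draft:
        for n in range(ngram_max, ngram_min - 1, -1):
            followers = table.get(tuple(seq[-n:])) if len(seq) >= n else None
            if followers:
                nxt = followers.most_common(1)[0][0]
                break
        else:
            break  # no known n-gram matches the current suffix
        draft.append(nxt)
        seq.append(nxt)
    return draft


history = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]
table = build_ngram_table(history)
print(draft_from_table(table, [3, 4, 1, 2], n_draft=4))  # [3, 4, 1, 2]
```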

Response Caching - Cache LLM responses for repeated prompts:

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()

Note: Caching requires a fixed seed (not the default random sentinel) since random seeds produce non-deterministic output. Streaming responses are not cached.
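
The combination of a size limit and a TTL can be pictured as an LRU map whose entries also carry an expiry time. The sketch below is an illustration under that assumption, not the cache cyllama ships. `TTLCache` is a hypothetical class, and keying on prompt plus seed is one possible design that reflects the fixed-seed requirement.

```python
import time
from collections import OrderedDict


class TTLCache:
    """LRU cache whose entries also expire `ttl` seconds after insertion."""

    def __init__(self, maxsize=100, ttl=3600.0, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self.hits = self.misses = 0
        self._data = OrderedDict()  # key -> (expiry_time, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self.clock():
            self._data.pop(key, None)  # drop the expired entry, if any
            self.misses += 1
            return None
        self._data.move_to_end(key)  # mark as most recently used
        self.hits += 1
        return entry[1]

    def put(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used


cache = TTLCache(maxsize=2, ttl=3600)
key = ("What is Python?", 42)  # prompt plus seed
if cache.get(key) is None:  # first lookup: miss
    cache.put(key, "Python is a programming language.")
print(cache.get(key), cache.hits, cache.misses)  # second lookup: hit
```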

Framework Integrations

OpenAI-Compatible API - Drop-in replacement:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)

LangChain Integration:

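from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Prompt with a single {topic} placeholder, filled in by chain.run()
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")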

Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval  # third-party, used only by this example: pip install simpleeval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)

ConstrainedAgent - Grammar-enforced tool calling, so every tool call is guaranteed to be well-formed:

from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
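
The pre/post idea itself is easy to picture without an agent. The sketch below reimplements simplified stand-ins for `pre` and `post` as ordinary decorators. It is not cyllama's implementation: `ContractViolation` is a hypothetical exception name, and the wrappers accept keyword arguments only to keep the example short.

```python
import functools


class ContractViolation(Exception):
    """Raised when a pre- or postcondition fails (hypothetical name)."""


def pre(check, message):
    """Validate the keyword arguments before the function runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            if not check(kwargs):
                raise ContractViolation(f"precondition failed: {message}")
            return fn(**kwargs)
        return wrapper
    return decorator


def post(check, message):
    """Validate the return value after the function runs."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            result = fn(**kwargs)
            if not check(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result
        return wrapper
    return decorator


@pre(lambda args: args["x"] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x


print(divide(a=100, x=4))  # 25.0
```

Calling `divide(a=1, x=0)` raises `ContractViolation` before the division is attempted, which is the behaviour an enforcing policy is meant to provide.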

See Agents Overview for detailed agent documentation.

Speech Recognition

Whisper Transcription - Transcribe audio files with timestamps:

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")
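
The `load_audio_as_16khz_float32` helper above is left to the reader. One possible version, using only the standard library, is sketched below. It assumes the input is already a 16 kHz, mono, 16-bit PCM WAV file and does no resampling. It returns an `array('f')`; if the Whisper API needs a NumPy array, `np.asarray(samples, dtype=np.float32)` should convert it, though that requirement is an assumption here.

```python
import sys
import wave
from array import array


def load_audio_as_16khz_float32(path):
    """Read a 16 kHz, mono, 16-bit PCM WAV file as float32 samples in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getframerate(), wav.getnchannels(), wav.getsampwidth())
        if fmt != (16000, 1, 2):
            raise ValueError("expected 16 kHz mono 16-bit PCM; convert the file first")
        pcm = array("h")  # signed 16-bit integers
        pcm.frombytes(wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    # Scale integers in [-32768, 32767] to floats in [-1.0, 1.0).
    return array("f", (sample / 32768.0 for sample in pcm))
```

Other formats (MP3, FLAC, different sample rates) would need to be converted first, for example with an external tool.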

See Whisper docs for full documentation.

Stable Diffusion

Image Generation - Generate images from text using stable-diffusion.cpp:

from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")

Advanced Generation - Full control with SDContext:

from cyllama.sd import SDContext, SDContextParams

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
# sample_method / scheduler / eta / wtype default to auto-resolve
# sentinels (SD C-library defaults) -- pass explicitly only to override.
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
)

CLI Tool - Command-line interface:

# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.

RAG (Retrieval-Augmented Generation)

CLI - Query your documents from the command line:

# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding

Simple RAG - Query your documents with LLMs:

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)
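
Under the hood, the retrieval half of RAG comes down to comparing a query embedding with document embeddings. The sketch below shows that step with cosine similarity on toy 2-D vectors. It is illustrative only: real embeddings come from the embedding model, and `cosine` and `top_k` are hypothetical helpers, not part of `cyllama.rag`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]  # toy embeddings
print(top_k([1.0, 0.1], doc_vecs))  # [0, 2]
```

The retrieved documents are then placed in the prompt for the generation model.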

Load Documents - Support for multiple file formats:

from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")

Hybrid Search - Combine vector and keyword search:

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
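
One common way to combine the two signals is a weighted sum of normalized scores. The sketch below shows that idea with the same 0.7/0.3 weights. Whether `HybridStore` uses this exact fusion rule is not stated here, so treat `normalize` and `hybrid_rank` as hypothetical illustrations.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 1.0 for doc, s in scores.items()}


def hybrid_rank(vector_scores, fts_scores, k=5, vector_weight=0.7, fts_weight=0.3):
    """Rank documents by a weighted sum of normalized vector and keyword scores."""
    vec, fts = normalize(vector_scores), normalize(fts_scores)
    combined = {
        doc: vector_weight * vec.get(doc, 0.0) + fts_weight * fts.get(doc, 0.0)
        for doc in set(vec) | set(fts)
    }
    # Highest combined score first; ties are broken by document id.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))[:k]


vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}  # e.g. cosine similarities
fts_scores = {"b": 12.0, "c": 3.0}              # e.g. keyword-match scores
print(hybrid_rank(vector_scores, fts_scores, k=2))  # ['a', 'b']
```

Raising `fts_weight` favours exact keyword matches, while raising `vector_weight` favours semantic similarity.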

Embedding Cache - Speed up repeated queries with LRU caching:

from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")

Agent Integration - Use RAG as an agent tool:

from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See RAG Overview for full documentation.
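
Text chunking, mentioned above, usually means splitting documents into overlapping windows so that a sentence cut at one boundary still appears whole in a neighbouring chunk. The character-based sketch below shows the idea. `chunk_text` and its default sizes are hypothetical, and cyllama's own chunker may split differently (for example on sentence or token boundaries).

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split `text` into windows of `chunk_size` characters, each sharing
    `overlap` characters with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```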

Common Utilities

GGUF File Manipulation - Inspect and modify model files:

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

Hugging Face Model Downloads:

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()

What's Inside

Text Generation (llama.cpp)

  • Full llama.cpp API - Cython wrapper with strong typing

  • High-Level API - Simple, Pythonic interface (LLM, complete, chat)

  • Streaming Support - Token-by-token generation with callbacks

  • Batch Processing - Efficient parallel inference

  • Multimodal - LLAVA and vision-language models

  • Speculative Decoding - 2-3x inference speedup with draft models

Speech Recognition (whisper.cpp)

  • Full whisper.cpp API - Cython wrapper

  • High-Level API - Simple transcribe() function

  • Multiple Formats - WAV, MP3, FLAC, and more

  • Language Detection - Automatic or specified language

  • Timestamps - Word and segment-level timing

Image & Video Generation (stable-diffusion.cpp)

  • Full stable-diffusion.cpp API - Cython wrapper

  • Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, Z-Image

  • Image-to-Image - Transform existing images

  • Inpainting - Mask-based editing

  • ControlNet - Guided generation with edge/pose/depth

  • Video Generation - Wan, CogVideoX models

  • Upscaling - ESRGAN 4x upscaling

Cross-Cutting Features

  • GPU Acceleration - Metal, CUDA, ROCm, Vulkan, SYCL backends

  • Memory Optimization - Smart GPU layer allocation

  • Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent

  • Framework Integration - OpenAI API, LangChain, FastAPI

Why Cyllama?

Performance: Compiled Cython wrappers with minimal overhead

  • Strong type checking at compile time

  • Zero-copy data passing where possible

  • Efficient memory management

  • Native integration with llama.cpp optimizations

Simplicity: From 50 lines to 1 line for basic generation

  • Pythonic API

  • Automatic resource management

  • Sensible defaults, full control when needed

Well-Tested: Broad API coverage

  • Extensive test coverage across the API surface

  • Documentation and examples for each module

  • Proper error handling and logging

  • Framework integration for real applications

Up-to-Date: Tracks bleeding-edge llama.cpp

  • Regular updates with latest features

  • All high-priority APIs wrapped

  • Performance optimizations included

Status

Build System: scikit-build-core + CMake

See pyproject.toml for the current cyllama version and CHANGELOG.md for the pinned llama.cpp / whisper.cpp / stable-diffusion.cpp revisions.

Platform & GPU Availability

Pre-built wheels on PyPI:

Package | Backend | Platform | Arch | Linking
cyllama | CPU | Linux | x86_64 | static
cyllama | CPU | Windows | x86_64 | static
cyllama | Metal | macOS | arm64 (Apple Silicon) | static
cyllama | Metal | macOS | x86_64 (Intel) | static
cyllama-cuda12 | CUDA 12.4 | Linux | x86_64 | dynamic
cyllama-rocm | ROCm 6.3 | Linux | x86_64 | dynamic
cyllama-sycl | Intel SYCL (oneAPI 2025.3) | Linux | x86_64 | dynamic
cyllama-vulkan | Vulkan | Linux | x86_64 | dynamic

We plan to add wheels for more platforms in the future, starting with Vulkan and CUDA 12 support on Windows.

Build from source (any platform with a C++ toolchain):

Backend | macOS | Linux | Windows
CPU | make build-cpu | make build-cpu | make build-cpu
Metal | make build-metal (default) | -- | --
CUDA | -- | make build-cuda | make build-cuda
ROCm (HIP) | -- | make build-hip | --
Vulkan | make build-vulkan | make build-vulkan | make build-vulkan
SYCL | -- | make build-sycl | --
OpenCL | make build-opencl | make build-opencl | make build-opencl

All source builds support both static (make build-<backend>) and dynamic (make build-<backend>-dynamic) linking.

Recent Releases

See CHANGELOG.md for full release notes.

  • v0.2.13 (Apr 2026) - QdrantVectorStore reference adapter for VectorStoreProtocol; pipeline-integrated reranking (RAGConfig.rerank) with RerankerProtocol; ccache + concurrency groups on CPU cibw workflows; Windows GPU-wheel LoadLibraryW PATH fix

  • v0.2.12 - Windows-CUDA, Windows-Vulkan, and macOS-Intel Vulkan GPU wheels; canonical delocate/auditwheel/delvewheel packaging. Experimental abi3 wheels (cp312+)

  • v0.2.11 (Apr 2026) - Pluggable RAG backends (VectorStoreProtocol / EmbedderProtocol) and MCP client API on LLM

  • v0.2.10 (Apr 2026) - GPU wheel size halved; packaging fixes (build_config.json, auditwheel SONAME, Vulkan ABI)

  • v0.2.9 (Apr 2026) - CUDA + SD stability fixes; get_perf_data() telemetry APIs

  • v0.2.8 (Apr 2026) - Expanded Cython bindings across llama / whisper / SD; interactive-chat streaming & sampling

  • v0.2.7 (Apr 2026) - SD defaults aligned with C library (fixes blank CUDA images)

  • v0.2.6 (Apr 2026) - Hotfix: remove accidental test-only runtime dependency

  • v0.2.5 (Apr 2026) - RAG hardening: persistent store, corpus dedup, vendored jinja2 chat templates

  • v0.2.4 (Apr 2026) - Unified cyllama CLI (gen, chat, embed, rag, …)

  • v0.2.3 (Apr 2026) - Wheel packaging and GPU portability fixes

  • v0.2.2 (Apr 2026) - CUDA wheel size stability

  • v0.2.1 (Mar 2026) - Code-quality hardening, GIL release, async fixes

  • v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels on PyPI (CUDA, ROCm, SYCL, Vulkan)

  • v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled

  • v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp

  • v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp

  • v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped

  • v0.1.16 (Dec 2025) - Response class, Async API, Chat templates

  • v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp

  • v0.1.11 (Nov 2025) - ACP support, build improvements

  • v0.1.10 (Nov 2025) - Agent Framework, bug fixes

  • v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation

  • v0.1.8 (Nov 2025) - Speculative decoding API

  • v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache

  • v0.1.6 (Nov 2025) - Multimodal test fixes

  • v0.1.5 (Oct 2025) - Mongoose server, embedded server

  • v0.1.4 (Oct 2025) - Memory estimation, performance optimizations

Building from Source

To build cyllama from source:

  1. Install a recent version of Python 3 (currently tested with Python 3.13).

  2. Git clone the latest version of cyllama:

    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    
  3. We use uv for package management:

    If you don't have uv, install it first; otherwise run:

    uv sync
    
  4. Type make in the terminal.

    This will:

    1. Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
    2. Install them into the thirdparty folder
    3. Build cyllama using scikit-build-core + CMake

Build Commands

# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI

GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

See Build Backends for comprehensive backend build instructions.

Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=-1)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=-1)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=-1
)
llm = LLM("model.gguf", config=config)

Split Modes:

  • 0 (NONE): Single GPU only, uses main_gpu

  • 1 (LAYER): Split layers and KV cache across GPUs (default)

  • 2 (ROW): Tensor parallelism - split layers with row-wise distribution
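
The `tensor_split` values are relative weights, which is why `[1, 2]` and `[0.3, 0.7]` are both valid. The sketch below normalizes the weights and shows how a given number of layers might be divided accordingly. How llama.cpp actually assigns layers may differ, and `split_layers` and the 32-layer figure are hypothetical.

```python
def split_layers(n_layers, tensor_split):
    """Divide `n_layers` across GPUs in proportion to the `tensor_split` weights."""
    total = sum(tensor_split)
    fractions = [weight / total for weight in tensor_split]
    counts = [int(n_layers * f) for f in fractions]
    # Hand the layers lost to rounding to the GPUs with the largest remainders.
    by_remainder = sorted(
        range(len(counts)),
        key=lambda i: n_layers * fractions[i] - counts[i],
        reverse=True,
    )
    for i in by_remainder[: n_layers - sum(counts)]:
        counts[i] += 1
    return counts


print(split_layers(32, [1, 2]))  # [11, 21]: about 1/3 on GPU 0, 2/3 on GPU 1
```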

Testing

The tests directory in this repo provides extensive examples of using cyllama.

As a first step, download a small LLM in .gguf format from Hugging Face. A good starting model, and the one the tests assume, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder inside the cloned cyllama directory. To create that folder if it doesn't exist and download the model, run:

make download

This is roughly equivalent to:

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

The library covers both quick prototyping and longer-running deployments. To run the full test suite:

make test  # Run full test suite

You can also explore interactively:

python3 -i scripts/start.py

>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

Documentation

Full documentation is available at https://shakfu.github.io/cyllama/ (built with MkDocs).

To serve docs locally: make docs-serve

Roadmap

Completed

  • Full llama.cpp API wrapper with Cython

  • High-level API (LLM, complete, chat)

  • Async API support (AsyncLLM, complete_async, chat_async)

  • Response class with stats and serialization

  • Built-in chat template system (llama.cpp templates)

  • Batch processing utilities

  • OpenAI-compatible API client

  • LangChain integration

  • Speculative decoding

  • GGUF file manipulation

  • JSON schema to grammar conversion

  • Model download helper

  • N-gram cache

  • OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings

  • Whisper.cpp integration

  • Multimodal support (LLAVA)

  • Memory estimation utilities

  • Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)

  • Stable Diffusion (stable-diffusion.cpp) - image/video generation

  • RAG utilities (text chunking, document processing)

Future

  • Web UI for testing

Contributing

Contributions are welcome! Please see the User Guide for development guidelines.

License

This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp, all of which are MIT-licensed, as is cyllama.

File details

Details for the file cyllama-0.2.14-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.14-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 12.6 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.14-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 a3e976e12b31fd04f58a5bdd1c0891c9f1f1f4c05da00b5585fd9189aaf4bf17
MD5 cfe3fe8b2ac1546c27fde37fd0bfe2b5
BLAKE2b-256 4d1f22a59e557fa9c3d6e4e907da68cadb06fd138e5bd81fb07b8b7ab90abc93

See more details on using hashes here.

File details

Details for the file cyllama-0.2.14-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.14-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 1e720e321299168a2a5106c5d725faffcd37ac84445ab7b930735e21a88e0bc6
MD5 da96fbf7f5884fac6e59fd367f8822a9
BLAKE2b-256 5133a3f9cafaece82dfef751a62f3605f7caaed85d73ceab7c2e607c7701f6b1

See more details on using hashes here.

File details

Details for the file cyllama-0.2.14-cp314-cp314-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.14-cp314-cp314-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 615c8fc97c3bb2bb3bf5d583c7ec11debabe60bac9d5f11090d8609378aea199
MD5 4fe8686306f32ce51ba5b40b6f21fa6a
BLAKE2b-256 435a35d763ad3e1ecb66eeca09e4c5c67b301b077b2824e25e48c30c50d8fcac

See more details on using hashes here.

File details

Details for the file cyllama-0.2.14-cp314-cp314-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.14-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0ce0e32d7fd98384ec4c160e81f76c8352e4ea3071dd9a58786fc9e02154f23c
MD5 1d0a3c7346f0e97eac2f0f6c1e812bae
BLAKE2b-256 30ce2c9252996b81d674b05e638839a52516f46832e4f44a7a03bf7d95a38d54

File details

Details for the file cyllama-0.2.14-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.14-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 12.4 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.14-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 7fe50bcc40bf405f84712a1e55416c1110bfc5f9e82e625a07e7b887d7604a08
MD5 ce94563cf849798f0410a8b9f0cd167e
BLAKE2b-256 2d6f6c717220e3e86025aa0c91c2f16d265c94ad29f07cf64838a48a53173b3a

File details

Details for the file cyllama-0.2.14-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 6f827da7a4eb7d288545c4c2fd5925544aeeb5de569eec255480ef09500bb201
MD5 c032a8108dbded856138f3776580da45
BLAKE2b-256 f38085457862dee1158c84c5147feeed1b7609a6b9cb739d161842a3a56a4d59

File details

Details for the file cyllama-0.2.14-cp313-cp313-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp313-cp313-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 4cbc78390b2ef3d3f261b1652e4665596cc71aefcca0fa249d5d0a8219a23003
MD5 6e93fbf43153074da77e74a93f20488f
BLAKE2b-256 94481f457471ee9026fc4f7b8017a32ce87166ee3897f3398f159cc07b46b083

File details

Details for the file cyllama-0.2.14-cp313-cp313-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.14-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 35d3523678309bbaf0ce1e2e3769be05d78e1c7768b60a3a9b4528b351d50a02
MD5 3f22c3b2d8a27523fa88caa501e97d91
BLAKE2b-256 0a7222ee051045b9fdb1881f17df04a852ae4d7492d5d1806d1fb576948f1368

File details

Details for the file cyllama-0.2.14-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.14-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 12.4 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.14-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 a4122fe0d9bfd9e815dd325fe2b1b0c10bccf17f2def4556212f3c75ec8f0366
MD5 71a28df3d184fa5f40835b197180a3f6
BLAKE2b-256 a8a1dd9abc9fcf87d82b1ee3945df378b74d73558bda5072f3609b89d52f4b52

File details

Details for the file cyllama-0.2.14-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 8c03739e5e9297cb634a0a648ec70a110abc387d10d54cdbca77697753b35bc1
MD5 0a5dae5756b38d85763dd5901a48ffe1
BLAKE2b-256 479617caa74adff8022705ef2a5d76856281242abfae03335ad81acab420b7bb

File details

Details for the file cyllama-0.2.14-cp312-cp312-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp312-cp312-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 44a0e560510f190b93b8852ba7f5a477c6a58088c76a908e255bc99b6376312d
MD5 3fd3abc77b6ffb69c0d80ffb6a6a438f
BLAKE2b-256 385e0ea73f05f9df0be9979c3d3ca834573db8c254c711b3b98a9ec00372e43e

File details

Details for the file cyllama-0.2.14-cp312-cp312-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.14-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 94e347fe3f5333054a995ad9632264c708cd938423a529d2d1344f121f51d83c
MD5 154cb0737f4753dba8b2c562e4430bb9
BLAKE2b-256 cd90009118a188fa4c101f4fc43d295154c48318b4f8e18bfe04b736ba967232

File details

Details for the file cyllama-0.2.14-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.14-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 12.4 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.14-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 43b41d1a9d7e7282a7ea7cfc717780d3ed4648584395f237711003b3e93f362d
MD5 a388cf72488bc1e6472f2aa8d5d2c019
BLAKE2b-256 b395304afa6fa2d33ca15c5a1ba670754a53c9239b2a2922c771e6b2b8a7c07f

File details

Details for the file cyllama-0.2.14-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 89910c16ac14d88d13cd81378c2c62c2acd55002851d596fc31dadcbe9e7db7f
MD5 13e3543055f2170f0d5765e8c76a3451
BLAKE2b-256 5604e6875fc021adaba6cb0a468a17a2c6c48b4de7b34ba56ce27b03975726fd

File details

Details for the file cyllama-0.2.14-cp311-cp311-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 5b353c2398cb07560b171ba52b2e8f0dd982b91befb3e073dce032d7a425ec8e
MD5 358ff9af6b6655c1b5ce129f5c9fc0a9
BLAKE2b-256 3fc05290086acb62bffd1e3426595cfddb4866d9a492e9d329131fc557ee3139

File details

Details for the file cyllama-0.2.14-cp311-cp311-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.14-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8bbf587bf95710d2bbb42299be02ed8e154dc9d37e32996204bde7e8a1c76e37
MD5 97c7c17a5c67ce2845bb47c92aecf5b2
BLAKE2b-256 0326bc5885748458b9d2184f3a38c334572e82a7658fb21005d142ca988b6acb

File details

Details for the file cyllama-0.2.14-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.14-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 12.4 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.14-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 87d449149569a351f0649f2b9d8ddd436e034aa7f8fea227fed37b3cf4dc95df
MD5 2dc6d758d423a3f1c917f37e3b075f92
BLAKE2b-256 00bd2e5628bd55562dd99670d6834fafe8e8c19d0a4d6840ef2f81398179a26f

File details

Details for the file cyllama-0.2.14-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 a7e753c8872f2a0506c1dc3f32149f94671a5e328141ed58c347e6a2b0a09396
MD5 1b144e76e8893000f90eb2656cc26b45
BLAKE2b-256 4f97f051272978ad0a08bb22556766e644bfb9e03d7d7e2d0d2d3a219d4cc8ff

File details

Details for the file cyllama-0.2.14-cp310-cp310-macosx_11_0_x86_64.whl.

File hashes

Hashes for cyllama-0.2.14-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 f3434d9613ac809eb6b0ba975cebe6447011d0326ea783804f4f353da420c53a
MD5 b1d8d047a1c119c3d24c27dcce717746
BLAKE2b-256 e20af5c346c0427711156e5844487fd5bc315e228c552b369e91070254a9b4a2

File details

Details for the file cyllama-0.2.14-cp310-cp310-macosx_11_0_arm64.whl.

File hashes

Hashes for cyllama-0.2.14-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 df84e4a210c977b42607a8b745793794d29da216b6b9f1a429bb8d195bc460ed
MD5 f63d377005982a2be9e8e320327895be
BLAKE2b-256 92e6b0efd5450139682ad67d3bc9c0c954c2c7dc7bc96e744d072edc036598fd
