A Cython wrapper of llama.cpp

cyllama - Fast, Pythonic AI Inference

cyllama is a comprehensive, dependency-free Python library for local AI inference built on the state-of-the-art .cpp ecosystem: llama.cpp, whisper.cpp, and stable-diffusion.cpp.

It combines the performance of compiled Cython wrappers with a simple, high-level Python API for cross-modal AI inference.

Documentation | PyPI | Changelog

Features

  • High-level API -- complete(), chat(), LLM class for quick prototyping / text generation.
  • Streaming -- token-by-token output with callbacks
  • Batch processing -- process multiple prompts 3-10x faster
  • GPU acceleration -- Metal (macOS), CUDA (NVIDIA), ROCm (AMD), Vulkan (cross-platform)
  • Speculative decoding -- 2-3x speedup with draft models
  • Agent framework -- ReActAgent, ConstrainedAgent, ContractAgent with tool calling
  • RAG -- retrieval-augmented generation with local embeddings and sqlite-vector
  • Speech recognition -- whisper.cpp transcription and translation
  • Image/Video generation -- stable-diffusion.cpp handles image, image-edit and video models.
  • OpenAI-compatible servers -- EmbeddedServer (C/Mongoose) and PythonServer with chat completions and embeddings endpoints
  • Framework integrations -- OpenAI API client, LangChain LLM interface

Installation

From PyPI

pip install cyllama

This installs the CPU backend on Linux and Windows. On macOS, the Metal backend is installed by default to take advantage of Apple Silicon.

GPU-Accelerated Variants

GPU variants are available on PyPI as separate packages (dynamically linked, Linux x86_64 only):

pip install cyllama-cuda12   # NVIDIA GPU (CUDA 12.4)
pip install cyllama-rocm     # AMD GPU (ROCm 6.3, requires glibc >= 2.35)
pip install cyllama-sycl     # Intel GPU (oneAPI SYCL 2025.3)
pip install cyllama-vulkan   # Cross-platform GPU (Vulkan)

All variants install the same cyllama Python package -- only the compiled backend differs. Install one at a time (they replace each other). GPU variants require the corresponding driver/runtime installed on your system.

You can verify which backend is active after installation:

cyllama info

You can also query the backend configuration at runtime:

from cyllama import _backend
print(_backend.cuda)   # True if built with CUDA
print(_backend.metal)  # True if built with Metal

Build from source with a specific backend

GGML_CUDA=1 pip install cyllama --no-binary cyllama
GGML_VULKAN=1 pip install cyllama --no-binary cyllama

Command-Line Interface

cyllama provides a unified CLI for all major functionality:

# Text generation
cyllama gen -m models/llama.gguf -p "What is Python?" --stream
cyllama gen -m models/llama.gguf -p "Write a haiku" --temperature 0.9 --json

# Chat (single-turn or interactive)
cyllama chat -m models/llama.gguf -p "Explain gravity" -s "You are a physicist"
cyllama chat -m models/llama.gguf                      # interactive mode
cyllama chat -m models/llama.gguf -n 1024              # interactive, up to 1024 tokens per response

# Embeddings
cyllama embed -m models/bge-small.gguf -t "hello world" -t "another text"
cyllama embed -m models/bge-small.gguf --dim                        # print dimensions
cyllama embed -m models/bge-small.gguf --similarity "cats" -f corpus.txt --threshold 0.5

# Other commands
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ -p "How do I configure X?"
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -f file.md   # interactive mode
cyllama rag -m models/llama.gguf -e models/bge-small.gguf -d docs/ --db docs.sqlite -p "..."  # index to persistent DB
cyllama rag -m models/llama.gguf -e models/bge-small.gguf --db docs.sqlite -p "..."           # reuse existing DB, no re-indexing
cyllama server -m models/llama.gguf --port 8080
cyllama transcribe -m models/ggml-base.en.bin audio.wav
cyllama tts -m models/tts.gguf -p "Hello world"
cyllama sd txt2img --model models/sd.gguf --prompt "a sunset"
cyllama info       # build and backend information
cyllama memory -m models/llama.gguf  # GPU memory estimation

Run cyllama --help or cyllama <command> --help for full usage. See CLI Cheatsheet for the complete reference.
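
The --similarity and --threshold flags of cyllama embed suggest ranking texts by how close their embeddings are to a query. The following is a minimal sketch of that idea, assuming cosine similarity is the metric (the text above does not say which one is used). The function names are hypothetical and the two-dimensional vectors are toy stand-ins for real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def filter_by_similarity(query_vec, corpus, threshold=0.5):
    """Return (score, text) pairs scoring at least `threshold`, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in corpus]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)

# Toy corpus: (text, embedding) pairs. Real embeddings have hundreds of dimensions.
corpus = [
    ("cats purr", [1.0, 0.0]),
    ("dogs bark", [0.0, 1.0]),
    ("kittens meow", [0.8, 0.6]),
]
matches = filter_by_similarity([1.0, 0.0], corpus, threshold=0.5)
print([text for _, text in matches])
```

Raising the threshold keeps fewer, closer matches; lowering it trades precision for recall.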

Quick Start

from cyllama import complete

# One line is all you need
response = complete(
    "Explain quantum computing in simple terms",
    model_path="models/llama.gguf",
    temperature=0.7,
    max_tokens=200
)
print(response)

Key Features

Simple by Default, Powerful When Needed

High-Level API - Get started in seconds:

from cyllama import complete, chat, LLM

# One-shot completion
response = complete("What is Python?", model_path="model.gguf")

# Multi-turn chat
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is machine learning?"}
]
response = chat(messages, model_path="model.gguf")

# Reusable LLM instance (faster for multiple prompts)
llm = LLM("model.gguf")
response1 = llm("Question 1")
response2 = llm("Question 2")  # Model stays loaded!

Streaming Support - Real-time token-by-token output:

for chunk in complete("Tell me a story", model_path="model.gguf", stream=True):
    print(chunk, end="", flush=True)
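
When streaming, it is often useful to show tokens as they arrive and still keep the full text afterwards. A small sketch of that pattern follows, assuming each streamed chunk is a plain string (as the print call above implies). collect_stream and fake_stream are hypothetical helpers; fake_stream stands in for complete(..., stream=True) so the example runs without a model.

```python
import sys

def collect_stream(chunks, out=sys.stdout):
    """Echo each chunk as it arrives and return the concatenated text."""
    parts = []
    for chunk in chunks:
        out.write(chunk)
        out.flush()  # make the chunk visible immediately
        parts.append(chunk)
    return "".join(parts)

def fake_stream():
    # Stand-in for complete(..., stream=True); yields text chunks.
    yield from ["Once ", "upon ", "a ", "time"]

text = collect_stream(fake_stream())
```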

Performance Optimized

Batch Processing - Process multiple prompts 3-10x faster:

from cyllama import batch_generate

prompts = ["What is 2+2?", "What is 3+3?", "What is 4+4?"]
responses = batch_generate(prompts, model_path="model.gguf")

Speculative Decoding - 2-3x speedup with draft models:

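from cyllama.llama.llama_cpp import Speculative, SpeculativeParams

# Assumes ctx_target (the target model's context), prompt_tokens, and
# last_token already exist from your own setup code.
params = SpeculativeParams(n_max=16, p_min=0.75)
spec = Speculative(params, ctx_target)
draft_tokens = spec.draft(prompt_tokens, last_token)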
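
To show why drafting can speed things up, here is a toy sketch of one greedy speculative step. It is not cyllama's or llama.cpp's implementation: the two "models" are plain functions over integer tokens, verification is done token by token for clarity (real implementations check all drafts in one batched pass and may accept probabilistically), and every name is hypothetical.

```python
def speculative_step(target_next, draft_next, tokens, n_draft=4):
    """One greedy speculative step: draft n tokens, keep the verified prefix."""
    # 1. The cheap draft model proposes n_draft tokens autoregressively.
    draft = []
    for _ in range(n_draft):
        draft.append(draft_next(tokens + draft))

    # 2. The target model checks each proposal and stops at the first miss.
    accepted = []
    for tok in draft:
        expected = target_next(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # the target's token replaces the miss
            break
        accepted.append(tok)
    else:
        # Every draft was accepted, so the target adds one more token.
        accepted.append(target_next(tokens + accepted))
    return accepted

def target_next(seq):
    # Toy target model: always counts upward.
    return seq[-1] + 1

def draft_next(seq):
    # Toy draft model: agrees with the target except after a token ending a block of four.
    return 0 if seq[-1] % 4 == 3 else seq[-1] + 1

print(speculative_step(target_next, draft_next, [1]))
```

Each step yields at least one token chosen by the target, so the output matches plain greedy decoding with the target; the gain comes from accepting several draft tokens per target check.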

Memory Optimization - Smart GPU layer allocation:

from cyllama import estimate_gpu_layers

estimate = estimate_gpu_layers(
    model_path="model.gguf",
    available_vram_mb=8000
)
print(f"Recommended GPU layers: {estimate.n_gpu_layers}")

N-gram Cache - 2-10x speedup for repetitive text:

from cyllama.llama.llama_cpp import NgramCache

# Assumes tokens and input_tokens are lists of token ids you already have.
cache = NgramCache()
cache.update(tokens, ngram_min=2, ngram_max=4)
draft = cache.draft(input_tokens, n_draft=16)
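
The idea behind n-gram drafting can be sketched in pure Python: remember which token followed each n-gram, then extend the current context by repeated lookup. ToyNgramCache below is a hypothetical illustration that mirrors the update/draft call shape above; it makes no claim about how NgramCache is implemented internally.

```python
from collections import Counter, defaultdict

class ToyNgramCache:
    """Maps each n-gram to counts of the tokens that followed it."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def update(self, tokens, ngram_min=2, ngram_max=4):
        for n in range(ngram_min, ngram_max + 1):
            for i in range(len(tokens) - n):
                self.table[tuple(tokens[i:i + n])][tokens[i + n]] += 1

    def draft(self, tokens, n_draft=16, ngram_min=2, ngram_max=4):
        out = []
        context = list(tokens)
        while len(out) < n_draft:
            for n in range(ngram_max, ngram_min - 1, -1):  # prefer the longest match
                key = tuple(context[-n:])
                if len(key) == n and key in self.table:
                    nxt = self.table[key].most_common(1)[0][0]
                    break
            else:
                break  # no known n-gram matches the context
            out.append(nxt)
            context.append(nxt)
        return out

cache = ToyNgramCache()
cache.update([1, 2, 3, 4, 1, 2, 3, 4, 1, 2])
print(cache.draft([1, 2], n_draft=4))
```

Repetitive text benefits most because the same n-grams recur, so drafts are long and usually accepted.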

Response Caching - Cache LLM responses for repeated prompts:

from cyllama import LLM

# Enable caching with 100 entries and 1 hour TTL
llm = LLM("model.gguf", cache_size=100, cache_ttl=3600, seed=42)

response1 = llm("What is Python?")  # Cache miss - generates response
response2 = llm("What is Python?")  # Cache hit - returns cached response instantly

# Check cache statistics
info = llm.cache_info()  # ResponseCacheInfo(hits=1, misses=1, maxsize=100, currsize=1, ttl=3600)

# Clear cache when needed
llm.cache_clear()

Note: Caching requires a fixed seed (seed != -1) since random seeds produce non-deterministic output. Streaming responses are not cached.
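
The cache_size and cache_ttl parameters describe a bounded cache whose entries expire. A minimal sketch of such a cache follows, assuming least-recently-used eviction and a key built from the prompt and seed; both are assumptions, and TTLCache is a hypothetical class, not cyllama's internal one.

```python
import time
from collections import OrderedDict

class TTLCache:
    """LRU cache whose entries expire `ttl` seconds after insertion."""

    def __init__(self, maxsize=100, ttl=3600, clock=time.monotonic):
        self.maxsize, self.ttl, self.clock = maxsize, ttl, clock
        self.hits = self.misses = 0
        self._data = OrderedDict()  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        entry = self._data.get(key)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            self._data.move_to_end(key)  # mark as most recently used
            return entry[1]
        self.misses += 1
        value = compute()
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry
        return value

cache = TTLCache(maxsize=100, ttl=3600)
first = cache.get_or_compute(("What is Python?", 42), lambda: "generated text")
second = cache.get_or_compute(("What is Python?", 42), lambda: "never called")
print(cache.hits, cache.misses)  # 1 1
```

Including the seed in the key reflects the note above: a cached answer is only a faithful replay when generation is deterministic.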

Framework Integrations

OpenAI-Compatible API - Drop-in replacement:

from cyllama.integrations import OpenAIClient

client = OpenAIClient(model_path="model.gguf")

response = client.chat.completions.create(
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7
)
print(response.choices[0].message.content)
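
The same request shape can also be sent over HTTP to a server started with cyllama server. The sketch below only builds the request using the standard library. It assumes the server listens on port 8080 (as in the CLI example) and exposes the usual OpenAI-style /v1/chat/completions path; the path and the build_chat_request helper are assumptions, so check the server documentation before relying on them.

```python
import json
import urllib.request

def build_chat_request(base_url, messages, temperature=0.7):
    """Build (but do not send) a POST request for a chat completion."""
    payload = {"messages": messages, "temperature": temperature}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed OpenAI-style path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request(
    "http://localhost:8080",
    [{"role": "user", "content": "Hello!"}],
)

# To send it against a running server:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```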

LangChain Integration - Seamless ecosystem access:

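from cyllama.integrations import CyllamaLLM
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Template with one input variable, filled in by chain.run(topic=...)
prompt_template = PromptTemplate.from_template("Write a short paragraph about {topic}.")

llm = CyllamaLLM(model_path="model.gguf", temperature=0.7)
chain = LLMChain(llm=llm, prompt=prompt_template)
result = chain.run(topic="AI")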

Agent Framework

Cyllama includes a zero-dependency agent framework with three agent architectures:

ReActAgent - Reasoning + Acting agent with tool calling:

from cyllama import LLM
from cyllama.agents import ReActAgent, tool
from simpleeval import simple_eval

@tool
def calculate(expression: str) -> str:
    """Evaluate a math expression safely."""
    return str(simple_eval(expression))

llm = LLM("model.gguf")
agent = ReActAgent(llm=llm, tools=[calculate])
result = agent.run("What is 25 * 4?")
print(result.answer)
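
The calculate tool above relies on the third-party simpleeval package. If you prefer to keep the tool free of extra dependencies, one option is a small evaluator built on the standard library's ast module. This is a sketch under that assumption: safe_arithmetic is a hypothetical helper that supports only +, -, *, / and numeric literals.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_arithmetic(expression):
    """Evaluate +, -, *, / over numeric literals and reject everything else."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval"))

print(safe_arithmetic("25 * 4"))  # 100
```

Because only whitelisted node types are evaluated, names, attribute access, and function calls raise ValueError instead of running.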

ConstrainedAgent - Grammar-enforced tool calling for 100% reliability:

from cyllama.agents import ConstrainedAgent

agent = ConstrainedAgent(llm=llm, tools=[calculate])
result = agent.run("Calculate 100 / 4")  # Guaranteed valid tool calls

ContractAgent - Contract-based agent with C++26-inspired pre/post conditions:

from cyllama.agents import ContractAgent, tool, pre, post, ContractPolicy

@tool
@pre(lambda args: args['x'] != 0, "cannot divide by zero")
@post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

agent = ContractAgent(
    llm=llm,
    tools=[divide],
    policy=ContractPolicy.ENFORCE,
    task_precondition=lambda task: len(task) > 10,
    answer_postcondition=lambda ans: len(ans) > 0,
)
result = agent.run("What is 100 divided by 4?")
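
To make the pre/post idea concrete, here is a plain-Python sketch of how such decorators could work. It is not cyllama's implementation: toy_pre, toy_post, and ContractViolation are hypothetical names, the wrapped function must be called with keyword arguments, and a violation always raises, whereas ContractPolicy may offer other behaviors.

```python
import functools

class ContractViolation(Exception):
    """Raised when a pre- or postcondition fails."""

def toy_pre(check, message):
    """Run `check` on the keyword arguments before calling the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            if not check(kwargs):
                raise ContractViolation(f"precondition failed: {message}")
            return func(**kwargs)
        return wrapper
    return decorator

def toy_post(check, message):
    """Run `check` on the return value after calling the function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**kwargs):
            result = func(**kwargs)
            if not check(result):
                raise ContractViolation(f"postcondition failed: {message}")
            return result
        return wrapper
    return decorator

@toy_pre(lambda args: args["x"] != 0, "cannot divide by zero")
@toy_post(lambda r: r is not None, "result must not be None")
def divide(a: float, x: float) -> float:
    """Divide a by x."""
    return a / x

print(divide(a=100, x=4))  # 25.0
```

Checking the precondition first means the division never runs with x=0, so the caller sees a contract error rather than a ZeroDivisionError.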

See Agents Overview for detailed agent documentation.

Speech Recognition

Whisper Transcription - Transcribe audio files with timestamps:

from cyllama.whisper import WhisperContext, WhisperFullParams
import numpy as np

# Load model and audio
ctx = WhisperContext("models/ggml-base.en.bin")
samples = load_audio_as_16khz_float32("audio.wav")  # Your audio loading function

# Transcribe
params = WhisperFullParams()
ctx.full(samples, params)

# Get results
for i in range(ctx.full_n_segments()):
    start = ctx.full_get_segment_t0(i) / 100.0
    end = ctx.full_get_segment_t1(i) / 100.0
    text = ctx.full_get_segment_text(i)
    print(f"[{start:.2f}s - {end:.2f}s] {text}")

See Whisper docs for full documentation.
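
The transcription example leaves load_audio_as_16khz_float32 to the reader. One possible sketch for the simplest case is shown below: a WAV file that is already 16-bit PCM, mono, at 16 kHz, read with the standard library only. load_wav_mono_16k is a hypothetical name, and no resampling or channel mixing is attempted.

```python
import sys
import wave
from array import array

def load_wav_mono_16k(path):
    """Read a 16-bit PCM, mono, 16 kHz WAV file as floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav:
        fmt = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        if fmt != (1, 2, 16000):
            raise ValueError("expected 16-bit mono audio at 16 kHz")
        pcm = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    # Scale signed 16-bit integers to 32-bit floats.
    return array("f", (sample / 32768.0 for sample in pcm))
```

The result supports the buffer protocol, so np.frombuffer(samples, dtype=np.float32) gives a NumPy view without copying, assuming ctx.full expects a float32 NumPy array. Other formats or sample rates need conversion first, for example with an external tool.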

Stable Diffusion

Image Generation - Generate images from text using stable-diffusion.cpp:

from cyllama.sd import text_to_image

# Simple text-to-image
image = text_to_image(
    model_path="models/sd_xl_turbo_1.0.q8_0.gguf",
    prompt="a photo of a cute cat",
    width=512,
    height=512,
    sample_steps=4,
    cfg_scale=1.0
)
image.save("output.png")

Advanced Generation - Full control with SDContext:

from cyllama.sd import SDContext, SDContextParams, SampleMethod, Scheduler

params = SDContextParams()
params.model_path = "models/sd_xl_turbo_1.0.q8_0.gguf"
params.n_threads = 4

ctx = SDContext(params)
images = ctx.generate(
    prompt="a beautiful mountain landscape",
    negative_prompt="blurry, ugly",
    width=512,
    height=512,
    sample_method=SampleMethod.EULER,
    scheduler=Scheduler.DISCRETE
)

CLI Tool - Command-line interface:

# Text to image
cyllama sd txt2img \
    --model models/sd_xl_turbo_1.0.q8_0.gguf \
    --prompt "a beautiful sunset" \
    --output sunset.png

# Image to image
cyllama sd img2img \
    --model models/sd-v1-5.gguf \
    --init-img input.png \
    --prompt "oil painting style" \
    --strength 0.7

# Show system info
cyllama sd info

Supports SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2, z-image-turbo, video generation (Wan/CogVideoX), LoRA, ControlNet, inpainting, and ESRGAN upscaling. See Stable Diffusion docs for full documentation.

RAG (Retrieval-Augmented Generation)

CLI - Query your documents from the command line:

# Single query against a directory of docs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ -p "How do I configure X?" --stream

# Interactive mode with source display
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -f guide.md -f faq.md --sources

# Persistent vector store: index once, reuse across runs
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    -d docs/ --db docs.sqlite -p "How do I configure X?"   # first run: indexes to docs.sqlite
cyllama rag -m models/llama.gguf -e models/bge-small.gguf \
    --db docs.sqlite -p "Another question?"                # later runs: reuse index, no re-embedding

Simple RAG - Query your documents with LLMs:

from cyllama.rag import RAG

# Create RAG instance with embedding and generation models
rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Add documents
rag.add_texts([
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Neural networks are inspired by biological neurons."
])

# Query
response = rag.query("What is Python?")
print(response.text)

Load Documents - Support for multiple file formats:

from cyllama.rag import RAG, load_directory

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)

# Load all documents from a directory
documents = load_directory("docs/", glob="**/*.md")
rag.add_documents(documents)

response = rag.query("How do I configure the system?")

Hybrid Search - Combine vector and keyword search:

from cyllama.rag import RAG, HybridStore, Embedder

embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf")
store = HybridStore("knowledge.db", embedder)

store.add_texts(["Document content..."])

# Hybrid search with configurable weights
results = store.search("query", k=5, vector_weight=0.7, fts_weight=0.3)
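
The vector_weight and fts_weight parameters suggest that two rankings are blended into one. A minimal sketch of one way to do that is a weighted sum of normalized scores. This is an assumption for illustration: HybridStore may combine results differently (for example with a rank-based method), and fuse_scores is a hypothetical helper.

```python
def fuse_scores(vector_scores, fts_scores, vector_weight=0.7, fts_weight=0.3, k=5):
    """Blend two score maps (doc id -> score in [0, 1]) and return the top k."""
    doc_ids = set(vector_scores) | set(fts_scores)
    combined = {
        doc: vector_weight * vector_scores.get(doc, 0.0)
        + fts_weight * fts_scores.get(doc, 0.0)
        for doc in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]

ranked = fuse_scores(
    {"a": 0.9, "b": 0.4},  # semantic (vector) similarity
    {"b": 1.0, "c": 0.5},  # keyword (full-text) relevance
)
print([doc for doc, _ in ranked])
```

A document found by only one search still gets a score, just with a zero contribution from the other side.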

Embedding Cache - Speed up repeated queries with LRU caching:

from cyllama.rag import Embedder

# Enable cache with 1000 entries
embedder = Embedder("models/bge-small-en-v1.5-q8_0.gguf", cache_size=1000)

embedder.embed("hello")  # Cache miss
embedder.embed("hello")  # Cache hit - instant return

info = embedder.cache_info()
print(f"Hits: {info.hits}, Misses: {info.misses}")

Agent Integration - Use RAG as an agent tool:

from cyllama import LLM
from cyllama.agents import ReActAgent
from cyllama.rag import RAG, create_rag_tool

rag = RAG(
    embedding_model="models/bge-small-en-v1.5-q8_0.gguf",
    generation_model="models/llama.gguf"
)
rag.add_texts(["Your knowledge base..."])

# Create a tool from the RAG instance
search_tool = create_rag_tool(rag)

llm = LLM("models/llama.gguf")
agent = ReActAgent(llm=llm, tools=[search_tool])
result = agent.run("Find information about X in the knowledge base")

Supports text chunking, multiple embedding pooling strategies, LRU caching for repeated queries, async operations, reranking, and SQLite-vector for persistent storage. See RAG Overview for full documentation.
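
Text chunking matters because embedding models work on passages of limited length. As a rough sketch, assuming a simple fixed-size character window with overlap (cyllama's own chunker may split on sentences or tokens instead), chunking could look like this; chunk_text is a hypothetical helper.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into windows of `chunk_size` characters that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk.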

Common Utilities

GGUF File Manipulation - Inspect and modify model files:

from cyllama.llama.llama_cpp import GGUFContext

ctx = GGUFContext.from_file("model.gguf")
metadata = ctx.get_all_metadata()
print(f"Model: {metadata['general.name']}")

Structured Output - JSON schema to grammar conversion (pure Python, no C++ dependency):

from cyllama.llama.llama_cpp import json_schema_to_grammar

schema = {"type": "object", "properties": {"name": {"type": "string"}}}
grammar = json_schema_to_grammar(schema)

Hugging Face Model Downloads:

from cyllama.llama.llama_cpp import download_model, list_cached_models, get_hf_file

# Download from HuggingFace (saves to ~/.cache/llama.cpp/)
download_model("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Or with explicit parameters
download_model(hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF:latest")

# Download specific file to custom path
download_model(
    hf_repo="bartowski/Llama-3.2-1B-Instruct-GGUF",
    hf_file="Llama-3.2-1B-Instruct-Q8_0.gguf",
    model_path="./models/my_model.gguf"
)

# Get file info without downloading
info = get_hf_file("bartowski/Llama-3.2-1B-Instruct-GGUF:latest")
print(info)  # {'repo': '...', 'gguf_file': '...', 'mmproj_file': '...'}

# List cached models
models = list_cached_models()

What's Inside

Text Generation (llama.cpp)

  • Full llama.cpp API - Complete Cython wrapper with strong typing
  • High-Level API - Simple, Pythonic interface (LLM, complete, chat)
  • Streaming Support - Token-by-token generation with callbacks
  • Batch Processing - Efficient parallel inference
  • Multimodal - LLAVA and vision-language models
  • Speculative Decoding - 2-3x inference speedup with draft models

Speech Recognition (whisper.cpp)

  • Full whisper.cpp API - Complete Cython wrapper
  • High-Level API - Simple transcribe() function
  • Multiple Formats - WAV, MP3, FLAC, and more
  • Language Detection - Automatic or specified language
  • Timestamps - Word and segment-level timing

Image & Video Generation (stable-diffusion.cpp)

  • Full stable-diffusion.cpp API - Complete Cython wrapper
  • Text-to-Image - SD 1.x/2.x, SDXL, SD3, FLUX, FLUX2
  • Image-to-Image - Transform existing images
  • Inpainting - Mask-based editing
  • ControlNet - Guided generation with edge/pose/depth
  • Video Generation - Wan, CogVideoX models
  • Upscaling - ESRGAN 4x upscaling

Cross-Cutting Features

  • GPU Acceleration - Metal, CUDA, Vulkan backends
  • Memory Optimization - Smart GPU layer allocation
  • Agent Framework - ReActAgent, ConstrainedAgent, ContractAgent
  • Framework Integration - OpenAI API, LangChain, FastAPI

Why Cyllama?

Performance: Compiled Cython wrappers with minimal overhead

  • Strong type checking at compile time
  • Zero-copy data passing where possible
  • Efficient memory management
  • Native integration with llama.cpp optimizations

Simplicity: From 50 lines to 1 line for basic generation

  • Intuitive, Pythonic API design
  • Automatic resource management
  • Sensible defaults, full control when needed

Production-Ready: Battle-tested and comprehensive

  • 1450+ passing tests with extensive coverage
  • Comprehensive documentation and examples
  • Proper error handling and logging
  • Framework integration for real applications

Up-to-Date: Tracks bleeding-edge llama.cpp

  • Regular updates with latest features
  • All high-priority APIs wrapped
  • Performance optimizations included

Status

  • Current Version: 0.2.5 (Apr 2026)
  • llama.cpp Version: b8757
  • Build System: scikit-build-core + CMake
  • Test Coverage: 1450+ tests passing
  • Platform: macOS (tested), Linux (tested), Windows (tested)

Recent Releases

  • v0.2.5 (Apr 2026) - Typed loader exceptions, concurrent-use guard on LLM/Embedder/WhisperContext/SDContext, persistent RAG vector store (cyllama rag --db), corpus deduplication, vendored jinja2 chat templates (fixes Gemma 4 and other non-substring-detectable templates), Qwen3 <think>-block stripping + n-gram repetition guard, readline history for REPLs, memory-leak regression tests, llama.cpp b8757
  • v0.2.4 (Apr 2026) - Unified CLI (cyllama gen, chat, embed, rag, ...), cyllama rag command-line RAG, Ctrl+C during inference, embeddings endpoint, Embedder logging fix, interactive chat token limit fix
  • v0.2.3 (Apr 2026) - SD flow_shift black-image fix, GPU OOM validation, dynamic Linux install fixes, wheel backend discovery after auditwheel/delvewheel rename, CLI entry point, wheel smoke tests, OpenCL targets, CUDA tuning flags
  • v0.2.2 (Apr 2026) - CUDA wheel size stability (PTX-only sm_75), portability flags moved from manage.py to CI workflows
  • v0.2.1 (Mar 2026) - Code quality hardening: GIL release for whisper/encode, async stream fixes, memory-aware embedding cache, CI robustness, 30+ bug fixes, 1150+ tests
  • v0.2.0 (Mar 2026) - Dynamic-linked GPU wheels (CUDA, ROCm, SYCL, Vulkan) on PyPI, unified ggml, sqlite-vector vendored
  • v0.1.21 (Mar 2026) - GPU wheel builds: CUDA + ROCm, sqlite-vector bundled
  • v0.1.20 (Feb 2026) - Update llama.cpp + stable-diffusion.cpp
  • v0.1.19 (Dec 2025) - Metal fix for stable-diffusion.cpp
  • v0.1.18 (Dec 2025) - Remaining stable-diffusion.cpp wrapped
  • v0.1.16 (Dec 2025) - Response class, Async API, Chat templates
  • v0.1.12 (Nov 2025) - Initial wrapper of stable-diffusion.cpp
  • v0.1.11 (Nov 2025) - ACP support, build improvements
  • v0.1.10 (Nov 2025) - Agent Framework, bug fixes
  • v0.1.9 (Nov 2025) - High-level APIs, integrations, batch processing, comprehensive documentation
  • v0.1.8 (Nov 2025) - Speculative decoding API
  • v0.1.7 (Nov 2025) - GGUF, JSON Schema, Downloads, N-gram Cache
  • v0.1.6 (Nov 2025) - Multimodal test fixes
  • v0.1.5 (Oct 2025) - Mongoose server, embedded server
  • v0.1.4 (Oct 2025) - Memory estimation, performance optimizations

See CHANGELOG.md for complete release history.

Building from Source

To build cyllama from source:

  1. Install a recent version of Python 3 (currently tested on Python 3.13).

  2. Git clone the latest version of cyllama:

    git clone https://github.com/shakfu/cyllama.git
    cd cyllama
    
  3. Install the dependencies with uv, which cyllama uses for package management.

    If you don't have uv, install it first; otherwise run:

    uv sync
    
  4. Type make in the terminal.

    This will:

    1. Download and build llama.cpp, whisper.cpp and stable-diffusion.cpp
    2. Install them into the thirdparty folder
    3. Build cyllama using scikit-build-core + CMake

Build Commands

# Full build (default: static linking, builds llama.cpp from source)
make              # Build dependencies + editable install

# Dynamic linking (downloads pre-built llama.cpp release)
make build-dynamic  # No source compilation needed for llama.cpp

# Build wheel for distribution
make wheel        # Creates wheel in dist/
make dist         # Creates sdist + wheel in dist/

# Backend-specific builds (static)
make build-cpu    # CPU only
make build-metal  # macOS Metal (default on macOS)
make build-cuda   # NVIDIA CUDA
make build-vulkan # Vulkan (cross-platform)
make build-hip    # AMD ROCm
make build-sycl   # Intel SYCL
make build-opencl # OpenCL

# Backend-specific builds (dynamic -- shared libs)
make build-cpu-dynamic
make build-cuda-dynamic
make build-vulkan-dynamic
make build-metal-dynamic
make build-hip-dynamic
make build-sycl-dynamic
make build-opencl-dynamic

# Backend-specific wheels (static and dynamic)
make wheel-cuda           # Static wheel
make wheel-cuda-dynamic   # Dynamic wheel with shared libs

# Clean and rebuild
make clean        # Remove build artifacts + dynamic libs
make reset        # Full reset including thirdparty and .venv
make remake       # Clean rebuild with tests

# Code quality
make lint         # Lint with ruff (auto-fix)
make format       # Format with ruff
make typecheck    # Type check with mypy
make qa           # Run all: lint, typecheck, format

# Memory leak detection
make leaks        # RSS-growth leak check (10 cycles, 20% threshold)

# Publishing
make check        # Validate wheels with twine
make publish      # Upload to PyPI
make publish-test # Upload to TestPyPI

GPU Acceleration

By default, cyllama builds with Metal support on macOS and CPU-only on Linux. To enable other GPU backends (CUDA, Vulkan, etc.):

# Static builds (all libs compiled in)
make build-cuda
make build-vulkan

# Dynamic builds (shared libs installed alongside extension)
make build-cuda-dynamic
make build-vulkan-dynamic

# Multiple backends
export GGML_CUDA=1 GGML_VULKAN=1
make build

See Build Backends for comprehensive backend build instructions.

Multi-GPU Configuration

For systems with multiple GPUs, cyllama provides full control over GPU selection and model splitting:

from cyllama import LLM, GenerationConfig

# Use a specific GPU (GPU index 1)
llm = LLM("model.gguf", main_gpu=1)

# Multi-GPU with layer splitting (default mode)
llm = LLM("model.gguf", split_mode=1, n_gpu_layers=99)

# Multi-GPU with tensor parallelism (row splitting)
llm = LLM("model.gguf", split_mode=2, n_gpu_layers=99)

# Custom tensor split: 30% GPU 0, 70% GPU 1
llm = LLM("model.gguf", tensor_split=[0.3, 0.7])

# Full configuration via GenerationConfig
config = GenerationConfig(
    main_gpu=0,
    split_mode=1,          # 0=NONE, 1=LAYER, 2=ROW
    tensor_split=[1, 2],   # 1/3 GPU0, 2/3 GPU1
    n_gpu_layers=99
)
llm = LLM("model.gguf", config=config)

Split Modes:

  • 0 (NONE): Single GPU only, uses main_gpu
  • 1 (LAYER): Split layers and KV cache across GPUs (default)
  • 2 (ROW): Tensor parallelism - split layers with row-wise distribution
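
The tensor_split values are relative weights, so [1, 2] means one third on GPU 0 and two thirds on GPU 1. The sketch below works through that arithmetic and shows how a layer count might be divided in those proportions. It is only an illustration: layers_per_gpu is a hypothetical helper, and the actual assignment made by llama.cpp may differ.

```python
def layers_per_gpu(n_layers, tensor_split):
    """Distribute n_layers across GPUs in proportion to tensor_split."""
    total = sum(tensor_split)
    if total <= 0:
        raise ValueError("tensor_split must contain a positive weight")
    fractions = [weight / total for weight in tensor_split]
    counts = [int(n_layers * fraction) for fraction in fractions]
    # Hand out layers lost to rounding down, largest share first.
    by_share = sorted(range(len(counts)), key=lambda i: fractions[i], reverse=True)
    for i in by_share:
        if sum(counts) == n_layers:
            break
        counts[i] += 1
    return fractions, counts

fractions, counts = layers_per_gpu(32, [1, 2])
print(counts)  # [10, 22]
```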

Testing

The tests directory in this repo provides extensive examples of using cyllama.

As a first step, download a small LLM in .gguf format from Hugging Face. A good starting model, and the one the tests assume, is Llama-3.2-1B-Instruct-Q8_0.gguf. cyllama expects models to be stored in a models folder in the cloned cyllama directory. To create the models directory if it doesn't exist and download this model, run:

make download

This is roughly equivalent to:

cd cyllama
mkdir -p models && cd models
wget https://huggingface.co/unsloth/Llama-3.2-1B-Instruct-GGUF/resolve/main/Llama-3.2-1B-Instruct-Q8_0.gguf

Now you can test it using llama-cli or llama-simple:

bin/llama-cli -c 512 -n 32 -m models/Llama-3.2-1B-Instruct-Q8_0.gguf \
 -p "Is mathematics discovered or invented?"

With 1450+ passing tests, the library is ready for both quick prototyping and production use:

make test  # Run full test suite

You can also explore interactively:

python3 -i scripts/start.py

>>> from cyllama import complete
>>> response = complete("What is 2+2?", model_path="models/Llama-3.2-1B-Instruct-Q8_0.gguf")
>>> print(response)

Documentation

Full documentation is available at https://shakfu.github.io/cyllama/ (built with MkDocs).

To serve docs locally: make docs-serve

  • User Guide - Comprehensive guide covering all features
  • CLI Cheatsheet - Complete CLI reference for all commands
  • API Reference - Complete API documentation
  • RAG Overview - Retrieval-augmented generation guide
  • Cookbook - Practical recipes and patterns
  • Changelog - Complete release history
  • Examples - See tests/examples/ for working code samples

Roadmap

Completed

  • Full llama.cpp API wrapper with Cython
  • High-level API (LLM, complete, chat)
  • Async API support (AsyncLLM, complete_async, chat_async)
  • Response class with stats and serialization
  • Built-in chat template system (llama.cpp templates)
  • Batch processing utilities
  • OpenAI-compatible API client
  • LangChain integration
  • Speculative decoding
  • GGUF file manipulation
  • JSON schema to grammar conversion
  • Model download helper
  • N-gram cache
  • OpenAI-compatible servers (PythonServer, EmbeddedServer, LlamaServer) with chat and embeddings
  • Whisper.cpp integration
  • Multimodal support (LLAVA)
  • Memory estimation utilities
  • Agent Framework (ReActAgent, ConstrainedAgent, ContractAgent)
  • Stable Diffusion (stable-diffusion.cpp) - image/video generation
  • RAG utilities (text chunking, document processing)

Future

  • Web UI for testing

Contributing

Contributions are welcome! Please see the User Guide for development guidelines.

License

This project wraps llama.cpp, whisper.cpp, and stable-diffusion.cpp which all follow the MIT licensing terms, as does cyllama.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distributions

No source distribution files available for this release.See tutorial on generating distribution archives.

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cyllama-0.2.7-cp314-cp314-win_amd64.whl (12.6 MB view details)

Uploaded CPython 3.14Windows x86-64

cyllama-0.2.7-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.14manylinux: glibc 2.17+ x86-64

cyllama-0.2.7-cp314-cp314-macosx_11_0_x86_64.whl (13.5 MB view details)

Uploaded CPython 3.14macOS 11.0+ x86-64

cyllama-0.2.7-cp314-cp314-macosx_11_0_arm64.whl (13.0 MB view details)

Uploaded CPython 3.14macOS 11.0+ ARM64

cyllama-0.2.7-cp313-cp313-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.13Windows x86-64

cyllama-0.2.7-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.13manylinux: glibc 2.17+ x86-64

cyllama-0.2.7-cp313-cp313-macosx_11_0_x86_64.whl (13.5 MB view details)

Uploaded CPython 3.13macOS 11.0+ x86-64

cyllama-0.2.7-cp313-cp313-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.13macOS 11.0+ ARM64

cyllama-0.2.7-cp312-cp312-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.12Windows x86-64

cyllama-0.2.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.12manylinux: glibc 2.17+ x86-64

cyllama-0.2.7-cp312-cp312-macosx_11_0_x86_64.whl (13.5 MB view details)

Uploaded CPython 3.12macOS 11.0+ x86-64

cyllama-0.2.7-cp312-cp312-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.12macOS 11.0+ ARM64

cyllama-0.2.7-cp311-cp311-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.11Windows x86-64

cyllama-0.2.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.11manylinux: glibc 2.17+ x86-64

cyllama-0.2.7-cp311-cp311-macosx_11_0_x86_64.whl (13.5 MB view details)

Uploaded CPython 3.11macOS 11.0+ x86-64

cyllama-0.2.7-cp311-cp311-macosx_11_0_arm64.whl (12.9 MB view details)

Uploaded CPython 3.11macOS 11.0+ ARM64

cyllama-0.2.7-cp310-cp310-win_amd64.whl (12.3 MB view details)

Uploaded CPython 3.10Windows x86-64

cyllama-0.2.7-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (15.3 MB view details)

Uploaded CPython 3.10manylinux: glibc 2.17+ x86-64

cyllama-0.2.7-cp310-cp310-macosx_11_0_x86_64.whl (13.5 MB view details)

Uploaded CPython 3.10macOS 11.0+ x86-64

cyllama-0.2.7-cp310-cp310-macosx_11_0_arm64.whl (13.0 MB view details)

Uploaded CPython 3.10macOS 11.0+ ARM64

File details

Details for the file cyllama-0.2.7-cp314-cp314-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.7-cp314-cp314-win_amd64.whl
  • Upload date:
  • Size: 12.6 MB
  • Tags: CPython 3.14, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.7-cp314-cp314-win_amd64.whl
Algorithm Hash digest
SHA256 c6c7177069f2a5e4800f3f4415e52b4903a9c7c0ec84fd03933d6e48a99d6490
MD5 c886a51e9b8971da0ca456cb8773f2af
BLAKE2b-256 c667e804e98dad15158ddd441a828add6d09b93c165f4866c3d303978fbf8de3

See more details on using hashes here.

File details

Details for the file cyllama-0.2.7-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp314-cp314-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 602e235e4844bcdc71625d40191163d9b3c36199064533b7515b083dc86fda20
MD5 ce1e6912c05614234d823e3d6a935f78
BLAKE2b-256 0f19b93f4023147b25dab8d8d5e196ecc4f20f038e04dedf04de63e87c486e6e

See more details on using hashes here.

File details

Details for the file cyllama-0.2.7-cp314-cp314-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp314-cp314-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 7dd69a77fc6627ca5befe5ba4d2259fdeb77b41d0237dcf8665396dd3b5884b4
MD5 4cd70c8bb325969b7ab5d54e34e3a17d
BLAKE2b-256 18bef77ed7748b0a5993c7762d1aa052ed6b239964c39dca9fd6c08375938711

See more details on using hashes here.

File details

Details for the file cyllama-0.2.7-cp314-cp314-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp314-cp314-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 4daada2ee1783cfb8995b5764fbc33b38da88a8431f60dd6f1b056572aeaa04d
MD5 5c3bdb5e7ec3ec374bd5e916fa4c2dc3
BLAKE2b-256 1896a36c802014a7fbd37cd1053b93848c8249757b0fbe0575cae1daab544f02

See more details on using hashes here.

File details

Details for the file cyllama-0.2.7-cp313-cp313-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.7-cp313-cp313-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.13, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.7-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 d71817d9f746cfa3f4b05e32958fbc64d628edcfe1b22cf4d2f4aef0438b9935
MD5 63106e7e5f9ff98c612094116d597f5c
BLAKE2b-256 659b04306b2b81394210730058a1430f776607b5da41bc9744c39e0eb4a974a9

See more details on using hashes here.

File details

Details for the file cyllama-0.2.7-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp313-cp313-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 f72ea05bd7208d67d76ff06071bef5c5f4e4689a8620fb8c494472a963ebc238
MD5 fd7d50d644f1e52190666bc156847f2d
BLAKE2b-256 e51731049157569dbc5421a2e4ebe05cf15fa526db027aa20672796b7f23b2e0

File details

Details for the file cyllama-0.2.7-cp313-cp313-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp313-cp313-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 bee3ea4bba10404f39a230ed054bc670e632845864fcb2df897e3f7e26d1fdf7
MD5 027614c7f216e2fb89f5e9fc8fdc8027
BLAKE2b-256 5532d2ff5ae91db5e0ecbeeaeb29628270bb30508b375228a6cd95fe8989a3fa

File details

Details for the file cyllama-0.2.7-cp313-cp313-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 1e76e950fe450fd3f8124a5301e943ae36856d688ccda0411bc6803be73b36ae
MD5 b72b408c9d97bafcbeb8535484d15ed8
BLAKE2b-256 6fdcc471595388aa1a34b7649a214de1272708bd7c02ef243ee8a708c47657da

File details

Details for the file cyllama-0.2.7-cp312-cp312-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.7-cp312-cp312-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.12, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.7-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 ea7392530b86b332474b7d0243e0cce061110000b0bfdf70fffdeaccf26637a9
MD5 f83d24ace81140eeff3ac58d950c339e
BLAKE2b-256 da3e203a2dca3408019b8c30434273f9b6ae5c599490cae4f68be0005d17a401

File details

Details for the file cyllama-0.2.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp312-cp312-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 adfa2b04e8ae5e8bd9ad34bc7c5457a9b62c34aaa509f0f1a8d570ced8416493
MD5 99e9d7d970439070a20003956ff11cb1
BLAKE2b-256 f3f40295ffd4c2697cd05415dca9839beea89b47721d42f43221cf6b77b85473

File details

Details for the file cyllama-0.2.7-cp312-cp312-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp312-cp312-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 5b1c9502ef762c996b27bceaac517e0189524c762734055ceb2426ac1d798c32
MD5 bbdee39b33046a07611ebbfe0d297f8b
BLAKE2b-256 ac1821c4462080529fddef639bbdb8b8f8c08a1533e936b9c6e015c9bf096b29

File details

Details for the file cyllama-0.2.7-cp312-cp312-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8915b47c4142244c249c8a724abbfe377499c84dd9c5cd42003c9ce6d95eb5d1
MD5 758b3af2ae6d4151d13e9f64d78d57f9
BLAKE2b-256 2bce51e117d29fe7f8aa4bdd7a882afaec1adac863dda255df588d540ab52b5d

File details

Details for the file cyllama-0.2.7-cp311-cp311-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.7-cp311-cp311-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.11, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.7-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 ba0dc2647ec7f300c55d28c129f6aa83a7d1e3d6eb00e81de4929e396befa83d
MD5 b5f83108bda567bd31616615083b8709
BLAKE2b-256 6c470be668f201a6a0d2bdce038cadb7248f08c051a79dfba0972bedce079fb3

File details

Details for the file cyllama-0.2.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 c897df94036d29909879d0bafea13fe493f8f18a7b46cd4b4e16d733d3685b7e
MD5 9268d76e1dbb8df32c75689cc5661988
BLAKE2b-256 8ac7f06f083b362377b424ff8d3866acb2f6df36252325ac9925066256db1f7c

File details

Details for the file cyllama-0.2.7-cp311-cp311-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp311-cp311-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 bb978abc1400f3ee4e8e68ccae1ead6a660922506d0b3ef4c3fccfad28fc3b1b
MD5 aeb02b209c4ff621e79e14fbfd071537
BLAKE2b-256 6c06c680c6075ba97d1290f71901f66e61c27c5947c7c4b3383df020a66b4041

File details

Details for the file cyllama-0.2.7-cp311-cp311-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ff86a5078a34062bd483a813bdfff4325f7f21fb413e7fbe2fc95d4a08b872f2
MD5 55f0625d260effad4417f6627db3b4e1
BLAKE2b-256 e5eede4b247caefdaf7b66fdf50d6c0244a5b46e6073c8e14b56751449a584f4

File details

Details for the file cyllama-0.2.7-cp310-cp310-win_amd64.whl.

File metadata

  • Download URL: cyllama-0.2.7-cp310-cp310-win_amd64.whl
  • Upload date:
  • Size: 12.3 MB
  • Tags: CPython 3.10, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.2

File hashes

Hashes for cyllama-0.2.7-cp310-cp310-win_amd64.whl
Algorithm Hash digest
SHA256 87a5ff46adc685d66a5454e87a683650a477899afe190ed535eddf1cdb79aa41
MD5 1a4dd3b0b27f8bbb3289ac19b375def4
BLAKE2b-256 55297c1ef1ed9ec4ba82ca5b9dc36170842ef899cb6846ce3258d04c9b507ced

File details

Details for the file cyllama-0.2.7-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl
Algorithm Hash digest
SHA256 1d09e6f1bdd419fdd63dd1187568e98f1e556148d47e1d7e0851aef55b640251
MD5 34075404d99c529e0946d760dc85a826
BLAKE2b-256 55e7b010de86aa981c6dfcb4bbeee4f166698b527c7fe86e42991bf4fc4f2394

File details

Details for the file cyllama-0.2.7-cp310-cp310-macosx_11_0_x86_64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp310-cp310-macosx_11_0_x86_64.whl
Algorithm Hash digest
SHA256 bc02fb5bc88d52a077fb1b40df75ce36726dae86aa16d81c0c0569eda0615180
MD5 2d2c5fa8f7c18f257857aefa8f1d8cc2
BLAKE2b-256 b9193432044b0da344365ee95693bcb9fefefd3c2d9d9e2afecf56cfc08f0d83

File details

Details for the file cyllama-0.2.7-cp310-cp310-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for cyllama-0.2.7-cp310-cp310-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 9ec5870420991add7368e7c4a092a68a78b0b9a29c4d35f7b3a6847d69a31b57
MD5 8575fe7db460a7d0ac5ff94cf41642ff
BLAKE2b-256 7de0795b9a4f86a16a8c21166bb2ce40f1be74983d9a665addac42f3447a979e
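
The SHA256 digests listed for each file can be checked against a locally downloaded wheel. The following is a minimal sketch, not part of cyllama or PyPI tooling: the helper names `sha256_of_file` and `matches_published_digest` are hypothetical, and only the standard-library `hashlib` module is assumed.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA256 hex digest of a file.

    The file is read in fixed-size chunks so that large wheels
    do not have to be loaded into memory at once.
    """
    hasher = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a local file's SHA256 digest with a published one.

    Surrounding whitespace and letter case in the published value
    are ignored (hypothetical helper, for illustration only).
    """
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, assuming the arm64 macOS wheel for CPython 3.10 was downloaded intact into the current directory, `matches_published_digest("cyllama-0.2.7-cp310-cp310-macosx_11_0_arm64.whl", "9ec5870420991add7368e7c4a092a68a78b0b9a29c4d35f7b3a6847d69a31b57")` would be expected to return `True`. A `False` result suggests a corrupted or altered download.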
