
koda

Run any LLM locally. One command.

Python 3.12+ · MIT License

Koda downloads and runs quantized LLMs on your machine. No cloud, no API keys, no Docker. It speaks the Ollama and OpenAI protocols, so any compatible client works out of the box.

koda pull llama3.2
koda run llama3.2

Inspired by Ollama. Built with llama.cpp + FastAPI.


Requirements

  • Python 3.12+
  • RAM: 4 GB minimum (8 GB+ recommended for 7B models)
  • Disk: varies by model — 2–5 GB per model
  • GPU: optional but recommended — CUDA (Linux/Windows) or Metal (Apple Silicon)

Install

macOS / Linux

curl -fsSL https://raw.githubusercontent.com/rjcuff/koda/main/install.sh | bash

The installer detects your platform and automatically builds llama-cpp-python with GPU support (CUDA or Metal) if available. After install, run:

source ~/.bashrc   # or ~/.zshrc on zsh
koda version

Windows

irm https://raw.githubusercontent.com/rjcuff/koda/main/install.ps1 | iex

Manual

git clone https://github.com/rjcuff/koda
cd koda
python3 -m venv .venv && source .venv/bin/activate
pip install -e .

GPU support (optional but faster)

CUDA (Linux / Windows):

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
pip install -e .

Apple Silicon (Metal):

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir
pip install -e .

CPU-only works fine if you skip the above — inference is just slower.
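
To confirm that the rebuilt llama-cpp-python can actually offload to a GPU, you can ask the bindings directly. A minimal check, assuming the llama_supports_gpu_offload helper exposed by the llama_cpp bindings:

import llama_cpp

# True means the build can offload layers to CUDA/Metal; False means CPU-only
print(llama_cpp.llama_supports_gpu_offload())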


Quick Start

# 1. Download a model
koda pull llama3.2

# 2. Chat interactively
koda run llama3.2

# 3. Or start the API server
koda serve

Type /bye to exit an interactive session.


Commands

Command                                   Description
koda pull <model>                         Download a model from HuggingFace
koda list                                 Show downloaded models
koda list --available                     Show all pullable models
koda run <model>                          Start an interactive chat session
koda run <model> --system "..."           Chat with a custom system prompt
koda run <model> --ctx 8192               Set the context window size
koda run --kodafile Kodafile              Run with a Kodafile config
koda serve                                Start the API server on :11434
koda serve --host 0.0.0.0 --port 8080     Custom host and port
koda create                               Generate a Kodafile template
koda version                              Show Koda version

Available Models

Name                      Description                    Size
llama3.2 / llama3.2:3b    Meta Llama 3.2 3B Instruct     2.0 GB
llama3.1 / llama3.1:8b    Meta Llama 3.1 8B Instruct     4.9 GB
mistral                   Mistral 7B Instruct v0.3       4.4 GB
phi3                      Microsoft Phi-3 Mini 4K        2.2 GB
gemma2                    Google Gemma 2 2B Instruct     1.6 GB
qwen2.5                   Qwen 2.5 7B Instruct           4.7 GB
deepseek-r1               DeepSeek R1 Distill Qwen 7B    4.7 GB

All models use Q4_K_M quantization — a good balance of quality and size. Run koda list --available to see the current list.


API Server

koda serve
# Listening on http://127.0.0.1:11434

The server implements both the Ollama and OpenAI API protocols, so you can point any compatible client at it without any code changes.
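
For example, the official ollama Python client (pip install ollama) can be pointed at Koda's default address; a minimal sketch, assuming the server from koda serve is running:

import ollama

# Point the Ollama client at Koda instead of a local Ollama daemon
client = ollama.Client(host="http://localhost:11434")
response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response["message"]["content"])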

Ollama-compatible endpoints

# List downloaded models
curl http://localhost:11434/api/tags

# Text generation
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'

# Chat completion
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

# List running models
curl http://localhost:11434/api/ps

# Pull a model
curl http://localhost:11434/api/pull \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'

# Delete a model
curl -X DELETE http://localhost:11434/api/delete \
  -H "Content-Type: application/json" \
  -d '{"name": "llama3.2"}'

OpenAI-compatible endpoints

Drop-in replacement for any client that supports a custom base URL:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")

response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

# List models
curl http://localhost:11434/v1/models

# Chat completion
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'

Streaming works on both protocols — set "stream": true in the request body.
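
With the OpenAI Python SDK, passing stream=True to chat.completions.create has the same effect; a minimal sketch against the server above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")

stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about local LLMs"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)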


Python Library

Use Koda directly in your Python code. No server, no daemon, no subprocess.

from koda import Koda

k = Koda()

# Download a model (no-op if already present)
k.pull("llama3.2")

# Chat — returns a string
reply = k.chat("llama3.2", [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
])
print(reply)

# Streaming chat — returns a token iterator
for token in k.chat("llama3.2", [{"role": "user", "content": "Tell me a story"}], stream=True):
    print(token, end="", flush=True)

# Raw text completion (no chat template applied)
text = k.generate("llama3.2", "The capital of France is")
print(text)

# Streaming completion
for token in k.generate("llama3.2", "Once upon a time", stream=True):
    print(token, end="", flush=True)

# Manage loaded models
print(k.models())    # all models in the registry
print(k.loaded())    # models currently in memory
k.unload("llama3.2") # free memory

Custom context window:

k = Koda(n_ctx=8192)

Kodafile

A Kodafile is a YAML config file that defines a model's behavior — useful for project-specific assistants or repeatable setups.

koda create            # writes a Kodafile template in the current directory
koda run --kodafile Kodafile

Kodafile format:

base: llama3.2
system: You are a concise coding assistant. Respond in plain text, no markdown.
parameters:
  n_ctx: 8192
  temperature: 0.7
  top_p: 0.9
  repeat_penalty: 1.1

Field                        Description                                Default
base                         Model name (required)
system                       System prompt                              "You are a helpful assistant."
parameters.n_ctx             Context window size (tokens)               4096
parameters.temperature       Sampling temperature                       0.8
parameters.top_p             Nucleus sampling
parameters.top_k             Top-K sampling
parameters.repeat_penalty    Repetition penalty
parameters.max_tokens        Max tokens to generate (-1 = unlimited)    -1
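
The CLI applies a Kodafile for you, but the basics can be reproduced through the Python library. The sketch below is hypothetical: it parses the YAML with PyYAML and uses only the Koda(n_ctx=...) and k.chat(...) calls documented in the Python Library section; sampling parameters such as temperature are omitted because a library-level way to pass them isn't shown above.

import yaml  # PyYAML
from koda import Koda

# Load the Kodafile and pull out the documented fields
with open("Kodafile") as f:
    cfg = yaml.safe_load(f)

params = cfg.get("parameters", {})
k = Koda(n_ctx=params.get("n_ctx", 4096))

reply = k.chat(cfg["base"], [
    {"role": "system", "content": cfg.get("system", "You are a helpful assistant.")},
    {"role": "user", "content": "Explain this project's layout."},
])
print(reply)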

Project Structure

koda/
├── koda/
│   ├── api.py          # Python library API (Koda class)
│   ├── cli.py          # CLI commands: pull, list, run, serve, create
│   ├── config.py       # Paths and defaults (~/.koda/)
│   ├── inference.py    # Model loading + thread-safe in-memory cache
│   ├── kodafile.py     # Kodafile YAML config format
│   ├── pull.py         # HuggingFace model downloads
│   ├── registry.py     # Model name → HuggingFace repo mapping
│   └── server.py       # FastAPI server — Ollama + OpenAI APIs
├── install.sh          # macOS / Linux one-line installer
├── install.ps1         # Windows one-line installer
└── pyproject.toml

Models are stored in ~/.koda/models/.


Stack

Component          Library
Inference          llama-cpp-python
API server         FastAPI + uvicorn
CLI                Typer + Rich
Model downloads    huggingface_hub
GPU backends       CUDA (Linux/Windows) · Metal (Apple Silicon) · CPU fallback

Contributing

See CONTRIBUTING.md for how to add models, endpoints, and commands.


License

MIT


