koda
Run any LLM locally. One command.
Koda downloads and runs quantized LLMs on your machine. No cloud, no API keys, no Docker. It speaks the Ollama and OpenAI protocols, so any compatible client works out of the box.
koda pull llama3.2
koda run llama3.2
Inspired by Ollama. Built with llama.cpp + FastAPI.
Requirements
- Python 3.12+
- RAM: 4 GB minimum (8 GB+ recommended for 7B models)
- Disk: roughly 2–5 GB per model
- GPU: optional but recommended — CUDA (Linux/Windows) or Metal (Apple Silicon)
Install
macOS / Linux
curl -fsSL https://raw.githubusercontent.com/rjcuff/koda/main/install.sh | bash
The installer detects your platform and automatically builds llama-cpp-python with GPU support (CUDA or Metal) if available. After install, run:
source ~/.bashrc # or ~/.zshrc on zsh
koda version
Windows
irm https://raw.githubusercontent.com/rjcuff/koda/main/install.ps1 | iex
Manual
git clone https://github.com/rjcuff/koda
cd koda
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
GPU support (optional but faster)
CUDA (Linux / Windows):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
Apple Silicon (Metal):
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
CPU-only works fine if you skip the above — inference is just slower.
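To verify which build you ended up with, recent llama-cpp-python versions expose a low-level helper. This is a minimal sketch; if your installed version lacks llama_supports_gpu_offload, treat the call as an assumption and check the pip build logs instead:

# quick check that the installed llama-cpp-python build can offload to the GPU
import llama_cpp

# True for CUDA/Metal builds, False for CPU-only builds
print(llama_cpp.llama_supports_gpu_offload())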
Quick Start
# 1. Download a model
koda pull llama3.2
# 2. Chat interactively
koda run llama3.2
# 3. Or start the API server
koda serve
Type /bye to exit an interactive session.
Commands
| Command | Description |
|---|---|
| koda pull <model> | Download a model from HuggingFace |
| koda list | Show downloaded models |
| koda list --available | Show all pullable models |
| koda run <model> | Start an interactive chat session |
| koda run <model> --system "..." | Chat with a custom system prompt |
| koda run <model> --ctx 8192 | Set the context window size |
| koda run --kodafile Kodafile | Run with a Kodafile config |
| koda serve | Start the API server on :11434 |
| koda serve --host 0.0.0.0 --port 8080 | Custom host and port |
| koda create | Generate a Kodafile template |
| koda version | Show Koda version |
Available Models
| Name | Description | Size |
|---|---|---|
| llama3.2 / llama3.2:3b | Meta Llama 3.2 3B Instruct | 2.0 GB |
| llama3.1 / llama3.1:8b | Meta Llama 3.1 8B Instruct | 4.9 GB |
| mistral | Mistral 7B Instruct v0.3 | 4.4 GB |
| phi3 | Microsoft Phi-3 Mini 4K | 2.2 GB |
| gemma2 | Google Gemma 2 2B Instruct | 1.6 GB |
| qwen2.5 | Qwen 2.5 7B Instruct | 4.7 GB |
| deepseek-r1 | DeepSeek R1 Distill Qwen 7B | 4.7 GB |
All models use Q4_K_M quantization — a good balance of quality and size. Run koda list --available to see the current list.
API Server
koda serve
# Listening on http://127.0.0.1:11434
The server implements both the Ollama and OpenAI API protocols, so you can point any compatible client at it without any code changes.
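For example, the official ollama Python client (pip install ollama, a separate package) can be pointed at Koda's default port. This is a sketch assuming a recent client version where the response supports dict-style access:

from ollama import Client

# Koda listens on Ollama's default port, so only the host needs to be set
client = Client(host="http://localhost:11434")
resp = client.chat(model="llama3.2", messages=[{"role": "user", "content": "Hello"}])
print(resp["message"]["content"])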
Ollama-compatible endpoints
# List downloaded models
curl http://localhost:11434/api/tags
# Text generation
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'
# Chat completion
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
# List running models
curl http://localhost:11434/api/ps
# Pull a model
curl http://localhost:11434/api/pull \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2"}'
# Delete a model
curl -X DELETE http://localhost:11434/api/delete \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2"}'
OpenAI-compatible endpoints
Drop-in replacement for any client that supports a custom base URL:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
# List models
curl http://localhost:11434/v1/models
# Chat completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
Streaming works on both protocols — set "stream": true in the request body.
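With the OpenAI client shown above, streaming looks like this (a minimal sketch; the equivalent raw-HTTP call just adds "stream": true to the JSON body):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a small delta of the reply; the final delta may be None
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()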
Python Library
Use Koda directly in your Python code. No server, no daemon, no subprocess.
from koda import Koda
k = Koda()
# Download a model (no-op if already present)
k.pull("llama3.2")
# Chat — returns a string
reply = k.chat("llama3.2", [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
])
print(reply)
# Streaming chat — returns a token iterator
for token in k.chat("llama3.2", [{"role": "user", "content": "Tell me a story"}], stream=True):
print(token, end="", flush=True)
# Raw text completion (no chat template applied)
text = k.generate("llama3.2", "The capital of France is")
print(text)
# Streaming completion
for token in k.generate("llama3.2", "Once upon a time", stream=True):
print(token, end="", flush=True)
# Manage loaded models
print(k.models()) # all models in the registry
print(k.loaded()) # models currently in memory
k.unload("llama3.2") # free memory
Custom context window:
k = Koda(n_ctx=8192)
Kodafile
A Kodafile is a YAML config file that defines a model's behavior — useful for project-specific assistants or repeatable setups.
koda create # writes a Kodafile template in the current directory
koda run --kodafile Kodafile
Kodafile format:
base: llama3.2
system: You are a concise coding assistant. Respond in plain text, no markdown.
parameters:
n_ctx: 8192
temperature: 0.7
top_p: 0.9
repeat_penalty: 1.1
| Field | Description | Default |
|---|---|---|
| base | Model name (required) | — |
| system | System prompt | "You are a helpful assistant." |
| parameters.n_ctx | Context window size (tokens) | 4096 |
| parameters.temperature | Sampling temperature | 0.8 |
| parameters.top_p | Nucleus sampling | — |
| parameters.top_k | Top-K sampling | — |
| parameters.repeat_penalty | Repetition penalty | — |
| parameters.max_tokens | Max tokens to generate (-1 = unlimited) | -1 |
Project Structure
koda/
├── koda/
│ ├── api.py # Python library API (Koda class)
│ ├── cli.py # CLI commands: pull, list, run, serve, create
│ ├── config.py # Paths and defaults (~/.koda/)
│ ├── inference.py # Model loading + thread-safe in-memory cache
│ ├── kodafile.py # Kodafile YAML config format
│ ├── pull.py # HuggingFace model downloads
│ ├── registry.py # Model name → HuggingFace repo mapping
│ └── server.py # FastAPI server — Ollama + OpenAI APIs
├── install.sh # macOS / Linux one-line installer
├── install.ps1 # Windows one-line installer
└── pyproject.toml
Models are stored in ~/.koda/models/.
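Since inference runs on llama.cpp, the downloaded weights are GGUF files. A quick way to inspect that directory from Python (a sketch, assuming the default ~/.koda/models/ location and .gguf filenames; koda list is the supported way to do this):

from pathlib import Path

models_dir = Path.home() / ".koda" / "models"
for f in sorted(models_dir.glob("**/*.gguf")):
    # print each model file and its approximate size in GB
    print(f"{f.name}  {f.stat().st_size / 1e9:.1f} GB")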
Stack
| Component | Library |
|---|---|
| Inference | llama-cpp-python |
| API server | FastAPI + uvicorn |
| CLI | Typer + Rich |
| Model downloads | huggingface_hub |
| GPU backends | CUDA (Linux/Windows) · Metal (Apple Silicon) · CPU fallback |
Contributing
See CONTRIBUTING.md for how to add models, endpoints, and commands.
License