koda
Run any LLM locally. One command.
Koda downloads and runs quantized LLMs on your machine. No cloud, no API keys, no Docker. It speaks the Ollama and OpenAI protocols, so any compatible client works out of the box.
koda pull llama3.2
koda run llama3.2
Inspired by Ollama. Built with llama.cpp + FastAPI.
Requirements
- Python 3.12+
- RAM: 4 GB minimum (8 GB+ recommended for 7B models)
- Disk: roughly 2–5 GB per model
- GPU: optional but recommended — CUDA (Linux/Windows) or Metal (Apple Silicon)
Install
macOS / Linux
curl -fsSL https://raw.githubusercontent.com/rjcuff/koda/main/install.sh | bash
The installer detects your platform and automatically builds llama-cpp-python with GPU support (CUDA or Metal) if available. After install, run:
source ~/.bashrc # or ~/.zshrc on zsh
koda version
Windows
irm https://raw.githubusercontent.com/rjcuff/koda/main/install.ps1 | iex
Manual
git clone https://github.com/rjcuff/koda
cd koda
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
GPU support (optional but faster)
CUDA (Linux / Windows):
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
Apple Silicon (Metal):
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --no-cache-dir
pip install -e .
CPU-only works fine if you skip the above — inference is just slower.
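To verify which build you ended up with, recent llama-cpp-python versions expose a low-level helper. This is a minimal sketch; if your installed version lacks llama_supports_gpu_offload, treat the call as an assumption and check the pip build logs instead:

# quick check that the installed llama-cpp-python build can offload to the GPU
import llama_cpp

# True for CUDA/Metal builds, False for CPU-only builds
print(llama_cpp.llama_supports_gpu_offload())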
Quick Start
# 1. Download a model
koda pull llama3.2
# 2. Chat interactively
koda run llama3.2
# 3. Or start the API server
koda serve
Type /bye to exit an interactive session.
Commands
| Command | Description |
|---|---|
| koda pull <model> | Download a model from HuggingFace |
| koda list | Show downloaded models |
| koda list --available | Show all pullable models |
| koda run <model> | Start an interactive chat session |
| koda run <model> --system "..." | Chat with a custom system prompt |
| koda run <model> --ctx 8192 | Set the context window size |
| koda run --kodafile Kodafile | Run with a Kodafile config |
| koda serve | Start the API server on :11434 |
| koda serve --host 0.0.0.0 --port 8080 | Custom host and port |
| koda create | Generate a Kodafile template |
| koda version | Show Koda version |
Available Models
| Name | Description | Size |
|---|---|---|
| llama3.2 / llama3.2:3b | Meta Llama 3.2 3B Instruct | 2.0 GB |
| llama3.1 / llama3.1:8b | Meta Llama 3.1 8B Instruct | 4.9 GB |
| mistral | Mistral 7B Instruct v0.3 | 4.4 GB |
| phi3 | Microsoft Phi-3 Mini 4K | 2.2 GB |
| gemma2 | Google Gemma 2 2B Instruct | 1.6 GB |
| qwen2.5 | Qwen 2.5 7B Instruct | 4.7 GB |
| deepseek-r1 | DeepSeek R1 Distill Qwen 7B | 4.7 GB |
All models use Q4_K_M quantization — a good balance of quality and size. Run koda list --available to see the current list.
API Server
koda serve
# Listening on http://127.0.0.1:11434
The server implements both the Ollama and OpenAI API protocols, so you can point any compatible client at it without any code changes.
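For example, the official ollama Python client (pip install ollama, a separate package) can be pointed at Koda's default port. This is a sketch assuming a recent client version where the response supports dict-style access:

from ollama import Client

# Koda listens on Ollama's default port, so only the host needs to be set
client = Client(host="http://localhost:11434")
resp = client.chat(model="llama3.2", messages=[{"role": "user", "content": "Hello"}])
print(resp["message"]["content"])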
Ollama-compatible endpoints
# List downloaded models
curl http://localhost:11434/api/tags
# Text generation
curl http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "prompt": "Why is the sky blue?"}'
# Chat completion
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
# List running models
curl http://localhost:11434/api/ps
# Pull a model
curl http://localhost:11434/api/pull \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2"}'
# Delete a model
curl -X DELETE http://localhost:11434/api/delete \
-H "Content-Type: application/json" \
-d '{"name": "llama3.2"}'
OpenAI-compatible endpoints
Drop-in replacement for any client that supports a custom base URL:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")
response = client.chat.completions.create(
model="llama3.2",
messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
# List models
curl http://localhost:11434/v1/models
# Chat completion
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2", "messages": [{"role": "user", "content": "Hello"}]}'
Streaming works on both protocols — set "stream": true in the request body.
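With the OpenAI client shown above, streaming looks like this (a minimal sketch; the equivalent raw-HTTP call just adds "stream": true to the JSON body):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="koda")
stream = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream=True,
)
for chunk in stream:
    # each chunk carries a small delta of the reply; the final delta may be None
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()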
Python Library
Use Koda directly in your Python code. No server, no daemon, no subprocess.
from koda import Koda
k = Koda()
# Download a model (no-op if already present)
k.pull("llama3.2")
# Chat — returns a string
reply = k.chat("llama3.2", [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2 + 2?"},
])
print(reply)
# Streaming chat — returns a token iterator
for token in k.chat("llama3.2", [{"role": "user", "content": "Tell me a story"}], stream=True):
print(token, end="", flush=True)
# Raw text completion (no chat template applied)
text = k.generate("llama3.2", "The capital of France is")
print(text)
# Streaming completion
for token in k.generate("llama3.2", "Once upon a time", stream=True):
print(token, end="", flush=True)
# Manage loaded models
print(k.models()) # all models in the registry
print(k.loaded()) # models currently in memory
k.unload("llama3.2") # free memory
Custom context window:
k = Koda(n_ctx=8192)
Kodafile
A Kodafile is a YAML config file that defines a model's behavior — useful for project-specific assistants or repeatable setups.
koda create # writes a Kodafile template in the current directory
koda run --kodafile Kodafile
Kodafile format:
base: llama3.2
system: You are a concise coding assistant. Respond in plain text, no markdown.
parameters:
n_ctx: 8192
temperature: 0.7
top_p: 0.9
repeat_penalty: 1.1
| Field | Description | Default |
|---|---|---|
| base | Model name (required) | — |
| system | System prompt | "You are a helpful assistant." |
| parameters.n_ctx | Context window size (tokens) | 4096 |
| parameters.temperature | Sampling temperature | 0.8 |
| parameters.top_p | Nucleus sampling | — |
| parameters.top_k | Top-K sampling | — |
| parameters.repeat_penalty | Repetition penalty | — |
| parameters.max_tokens | Max tokens to generate (-1 = unlimited) | -1 |
Project Structure
koda/
├── koda/
│ ├── api.py # Python library API (Koda class)
│ ├── cli.py # CLI commands: pull, list, run, serve, create
│ ├── config.py # Paths and defaults (~/.koda/)
│ ├── inference.py # Model loading + thread-safe in-memory cache
│ ├── kodafile.py # Kodafile YAML config format
│ ├── pull.py # HuggingFace model downloads
│ ├── registry.py # Model name → HuggingFace repo mapping
│ └── server.py # FastAPI server — Ollama + OpenAI APIs
├── install.sh # macOS / Linux one-line installer
├── install.ps1 # Windows one-line installer
└── pyproject.toml
Models are stored in ~/.koda/models/.
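Since inference runs on llama.cpp, the downloaded weights are GGUF files. A quick way to inspect that directory from Python (a sketch, assuming the default ~/.koda/models/ location and .gguf filenames; koda list is the supported way to do this):

from pathlib import Path

models_dir = Path.home() / ".koda" / "models"
for f in sorted(models_dir.glob("**/*.gguf")):
    # print each model file and its approximate size in GB
    print(f"{f.name}  {f.stat().st_size / 1e9:.1f} GB")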
Stack
| Component | Library |
|---|---|
| Inference | llama-cpp-python |
| API server | FastAPI + uvicorn |
| CLI | Typer + Rich |
| Model downloads | huggingface_hub |
| GPU backends | CUDA (Linux/Windows) · Metal (Apple Silicon) · CPU fallback |
Contributing
See CONTRIBUTING.md for how to add models, endpoints, and commands.
License