Octomil — serve, deploy, and observe ML models on edge devices

Octomil

Run LLMs on your laptop, phone, or edge device. One command. OpenAI-compatible API.

What is this?

Octomil is a CLI and Python SDK that serves open-weight LLMs locally behind an OpenAI-compatible API. It auto-detects your hardware, picks the fastest inference engine, and gives you a drop-in replacement for cloud API calls. It runs on Mac (MLX) and Linux/Windows (llama.cpp), and can deploy models to phones.

Quick start

curl -fsSL https://get.octomil.com | sh
octomil serve gemma-1b

That's it. You now have an OpenAI-compatible server on localhost:8080:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-1b", "messages": [{"role": "user", "content": "Hello!"}]}'

Or use any OpenAI client library:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
r = client.chat.completions.create(
    model="gemma-1b",
    messages=[{"role": "user", "content": "Explain quantum computing in 2 sentences."}],
)
print(r.choices[0].message.content)
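
If you would rather not install a client library, the same request can be sent with Python's standard library. This is a minimal sketch, assuming the quick-start server is listening on localhost:8080 and answers in the standard OpenAI chat-completion shape. The helper names (`build_chat_request`, `extract_reply`, `chat`) are illustrative and not part of Octomil.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # address from the quick start above


def build_chat_request(model, prompt, base_url=BASE_URL):
    """Build a POST request for /chat/completions (nothing is sent yet)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return payload["choices"][0]["message"]["content"]


def chat(model, prompt):
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `chat("gemma-1b", "Hello!")` should return the reply text. Building and parsing are kept separate from sending so they can be checked offline.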

Features

Auto engine selection -- benchmarks all available engines and picks the fastest:

octomil serve llama-3b
# => Detected: mlx-lm (38 tok/s), llama.cpp (29 tok/s), ollama (25 tok/s)
# => Using mlx-lm
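
The selection rule in that output is "highest tokens per second wins". A rough sketch of the rule, using the sample line above as input. The parsing regex and function names are assumptions made for illustration, not Octomil internals.

```python
import re

SAMPLE = "# => Detected: mlx-lm (38 tok/s), llama.cpp (29 tok/s), ollama (25 tok/s)"


def parse_detected(line):
    """Turn a 'Detected:' line into {engine_name: tokens_per_second}."""
    pairs = re.findall(r"([\w.\-]+) \((\d+(?:\.\d+)?) tok/s\)", line)
    return {name: float(speed) for name, speed in pairs}


def pick_fastest(speeds):
    """Return the engine with the highest measured throughput."""
    if not speeds:
        raise ValueError("no engines were benchmarked")
    return max(speeds, key=speeds.get)


speeds = parse_detected(SAMPLE)
print(pick_fastest(speeds))  # mlx-lm
```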

60+ models -- Gemma, Llama, Phi, Qwen, DeepSeek, Mistral, Mixtral, and more:

octomil serve phi-mini          # Microsoft Phi-4 Mini (3.8B)
octomil serve deepseek-r1-7b    # DeepSeek R1 reasoning
octomil serve qwen3-4b          # Alibaba Qwen 3
octomil serve whisper-small     # Speech-to-text

Deploy to phones -- push models to iOS/Android devices:

octomil deploy gemma-1b --phone --rollout 10   # canary to 10% of devices
octomil status gemma-1b                        # monitor rollout
octomil rollback gemma-1b                      # instant rollback
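
To make "canary to 10% of devices" concrete: a common way to implement percentage rollouts is to hash each device ID into a stable bucket and enable the rollout for buckets below the threshold. The sketch below shows that general technique. It is not taken from Octomil's implementation, and the device IDs are made up.

```python
import hashlib


def rollout_bucket(device_id, model):
    """Map a device to a stable bucket in [0, 100) for a given model."""
    digest = hashlib.sha256(f"{model}:{device_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def in_rollout(device_id, model, percentage):
    """True if this device falls inside the first `percentage` buckets."""
    return rollout_bucket(device_id, model) < percentage


devices = [f"device-{i}" for i in range(1000)]  # hypothetical fleet
canary = [d for d in devices if in_rollout(d, "gemma-1b", 10)]
print(len(canary))  # roughly 100; the exact count depends on the hash
```

Because the bucket depends only on the model and device ID, raising the percentage from 10 to 50 keeps every canary device in the rollout and only adds new ones.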

Benchmark your hardware:

octomil benchmark gemma-1b
# Model: gemma-1b (4bit)
# Engine: mlx-lm
# Tokens/sec: 42.3
# Memory: 1.2 GB
# Time to first token: 89ms
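
If you want to compare runs in a script, a report in this shape is easy to parse. The sketch below assumes the output is a list of `Key: value` lines as shown above (with or without the leading `#`); the real output format may differ.

```python
SAMPLE = """\
# Model: gemma-1b (4bit)
# Engine: mlx-lm
# Tokens/sec: 42.3
# Memory: 1.2 GB
# Time to first token: 89ms
"""


def parse_benchmark(text):
    """Parse 'Key: value' lines (optionally prefixed with '#') into a dict."""
    report = {}
    for line in text.splitlines():
        line = line.lstrip("# ").strip()
        if ":" in line:
            key, value = line.split(":", 1)
            report[key.strip()] = value.strip()
    return report


report = parse_benchmark(SAMPLE)
tokens_per_sec = float(report["Tokens/sec"])
# At a steady 42.3 tok/s, generating 200 tokens takes about 200 / 42.3 seconds,
# on top of the time to first token.
print(round(200 / tokens_per_sec, 1))  # 4.7
```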

Model conversion -- convert to CoreML (iOS) or TFLite (Android):

octomil convert model.pt --target ios,android

Multi-model serving -- load multiple models, route by request:

octomil serve --models smollm-360m,phi-mini,llama-3b
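
On the client side, routing by request most likely means setting the `model` field of each call. That is an assumption based on the OpenAI-compatible API, not something this README spells out. A small sketch that maps tasks to the three models from the command above; the task labels are hypothetical.

```python
import json

# Models named in the --models example above; the task labels are made up.
MODEL_FOR_TASK = {
    "autocomplete": "smollm-360m",
    "chat": "phi-mini",
    "summarize": "llama-3b",
}


def request_body(task, prompt):
    """Build a chat-completions payload that targets the model for `task`."""
    try:
        model = MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"no model configured for task {task!r}") from None
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


print(json.dumps(request_body("chat", "Hello!")))
```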

Supported engines

Engine	Platform	Install
MLX	Apple Silicon Mac	`pip install 'octomil-sdk[mlx]'`
llama.cpp	Mac, Linux, Windows	`pip install 'octomil-sdk[llama]'`
ONNX Runtime	All platforms	`pip install 'octomil-sdk[onnx]'`
MLC-LLM	Mac, Linux, Android	auto-detected
MNN	All platforms	auto-detected
ExecuTorch	Mobile	auto-detected
Whisper.cpp	All platforms	`pip install 'octomil-sdk[whisper]'`
Ollama	Mac, Linux	auto-detected if running

No engine installed? octomil serve tells you exactly what to install.

Supported models

Full model list (60+ models)

Model	Sizes	Engines
Gemma 3	1B, 4B, 12B, 27B	MLX, llama.cpp, MNN, ONNX, MLC
Gemma 2	2B, 9B, 27B	MLX, llama.cpp
Llama 3.2	1B, 3B	MLX, llama.cpp, MNN, ONNX, MLC
Llama 3.1/3.3	8B, 70B	MLX, llama.cpp
Phi-4 / Phi Mini	3.8B, 14B	MLX, llama.cpp, MNN, ONNX
Qwen 2.5	1.5B, 3B, 7B	MLX, llama.cpp, MNN, ONNX
Qwen 3	0.6B - 32B	MLX, llama.cpp
DeepSeek R1	1.5B - 70B	MLX, llama.cpp
DeepSeek V3	671B (MoE)	MLX, llama.cpp
Mistral / Nemo / Small	7B, 12B, 24B	MLX, llama.cpp
Mixtral	8x7B, 8x22B (MoE)	MLX, llama.cpp
Qwen 2.5 Coder	1.5B, 7B	MLX, llama.cpp
CodeLlama	7B, 13B, 34B	MLX, llama.cpp
StarCoder2	3B, 7B, 15B	MLX, llama.cpp
Falcon 3	1B, 7B, 10B	MLX, llama.cpp
SmolLM	360M, 1.7B	MLX, llama.cpp, MNN, ONNX
Whisper	tiny - large-v3	Whisper.cpp
+ many more

Aliases are supported: octomil serve deepseek-r1 resolves to deepseek-r1-7b. Each model is available in 4bit and 8bit quantized variants and in fp16.
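
Alias resolution can be pictured as a dictionary lookup followed by a variant check. In this sketch the alias table holds only the one alias documented above, and the 4bit default is an assumption inferred from the benchmark sample, not a documented default.

```python
ALIASES = {"deepseek-r1": "deepseek-r1-7b"}  # the alias documented above
QUANT_VARIANTS = ("4bit", "8bit", "fp16")    # variants documented above
DEFAULT_QUANT = "4bit"  # assumption: inferred from the benchmark sample


def resolve(name, quant=DEFAULT_QUANT):
    """Return (canonical_model_name, variant) for a user-supplied name."""
    if quant not in QUANT_VARIANTS:
        raise ValueError(f"unknown variant {quant!r}, expected one of {QUANT_VARIANTS}")
    # Names that are not aliases pass through unchanged.
    return ALIASES.get(name, name), quant


print(resolve("deepseek-r1"))  # ('deepseek-r1-7b', '4bit')
```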

How it works

octomil serve gemma-1b
    │
    ├── 1. Resolve model name → catalog lookup (aliases, quant variants)
    ├── 2. Detect engines     → MLX? llama.cpp? ONNX? Ollama running?
    ├── 3. Benchmark engines  → Run each, measure tok/s, pick fastest
    ├── 4. Download model     → HuggingFace Hub (cached after first pull)
    └── 5. Start server       → FastAPI on :8080, OpenAI-compatible API
                                 ├── POST /v1/chat/completions
                                 ├── POST /v1/completions
                                 └── GET  /v1/models
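
Once the server is up, GET /v1/models is a quick way to confirm what is loaded. The sketch below assumes the endpoint returns the standard OpenAI list format (`{"object": "list", "data": [...]}`). The function names are illustrative.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8080/v1"):
    """Fetch and parse GET /models (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))


# Offline check with a hand-written payload in the OpenAI list format.
sample = {"object": "list", "data": [{"id": "gemma-1b", "object": "model"}]}
print(model_ids(sample))  # ['gemma-1b']
```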

CLI reference

Command	Description
`octomil serve <model>`	Start an OpenAI-compatible inference server
`octomil benchmark <model>`	Benchmark inference speed on your hardware
`octomil deploy <model>`	Deploy a model to edge devices
`octomil rollback <model>`	Roll back a deployment
`octomil convert <file>`	Convert model to CoreML / TFLite
`octomil pull <model>`	Download a model
`octomil push <file>`	Upload a model to registry
`octomil status <model>`	Check deployment status
`octomil scan <path>`	Security scan a model or app bundle
`octomil pair`	Pair with a phone for deployment
`octomil dashboard`	Open the web dashboard
`octomil login`	Authenticate with Octomil
`octomil init`	Initialize an organization
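
To drive the CLI from a script, it is safer to build an argument list than a shell string. A minimal sketch: the helper `octomil_argv` is hypothetical, and the example uses only the `--phone` and `--rollout` flags shown earlier in this README.

```python
import shlex


def octomil_argv(command, target=None, **flags):
    """Build an argv list such as ['octomil', 'deploy', 'gemma-1b', '--phone']."""
    argv = ["octomil", command]
    if target is not None:
        argv.append(target)
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)                # boolean switch
        elif value is not False and value is not None:
            argv.extend([flag, str(value)])  # flag with a value
    return argv


argv = octomil_argv("deploy", "gemma-1b", phone=True, rollout=10)
print(shlex.join(argv))  # octomil deploy gemma-1b --phone --rollout 10
```

The resulting list can be passed to `subprocess.run(argv, check=True)` without shell quoting concerns.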

vs. alternatives

Feature	Octomil	Ollama	llama.cpp (raw)	Cloud APIs
One-command serve	yes	yes	no (build from source)	n/a
OpenAI-compatible API	yes	yes	partial	native
Auto engine selection	yes (benchmarks all)	no (single engine)	n/a	n/a
Deploy to phones	yes	no	manual	no
Fleet rollouts + rollback	yes	no	no	n/a
Model conversion (CoreML/TFLite)	yes	no	no	n/a
A/B testing	yes	no	no	no
Offline / on-device	yes	yes	yes	no
Cost per inference	$0 (your hardware)	$0	$0	$0.01-0.10
60+ models in catalog	yes	yes (different catalog)	yes (manual download)	varies
Python SDK	yes	yes	community	yes

Python SDK

For fleet management, model registry, and A/B testing:

from octomil import Octomil

client = Octomil(api_key="oct_...", org_id="org_123")

# Register and deploy a model
model = client.registry.ensure_model(name="sentiment", framework="pytorch")
client.rollouts.create(model_id=model["id"], version="1.0.0", rollout_percentage=10)

# Run an A/B test
client.experiments.create(
    name="v1-vs-v2",
    model_id=model["id"],
    control_version="1.0.0",
    treatment_version="1.1.0",
)

Requirements

Python 3.9+
At least one inference engine (see Supported engines)
macOS, Linux, or Windows

Contributing

See CONTRIBUTING.md.

License

MIT

Release history

Version	Date
4.1.2	Mar 14, 2026
4.0.0	Mar 12, 2026
2.14.1	Mar 12, 2026
2.14.0	Mar 12, 2026
2.13.0	Mar 11, 2026
2.12.0	Mar 10, 2026
2.11.0	Mar 10, 2026
2.7.4	Mar 9, 2026
2.7.3	Mar 9, 2026
2.7.2 (this version)	Mar 9, 2026
2.7.1	Mar 9, 2026
2.7.0	Mar 9, 2026
2.6.0	Feb 27, 2026
2.5.2	Feb 27, 2026
2.5.1	Feb 27, 2026
2.5.0	Feb 27, 2026
2.4.0	Feb 26, 2026
2.3.0	Feb 26, 2026
2.1.8	Feb 25, 2026
2.1.7	Feb 25, 2026
2.1.6	Feb 25, 2026
2.0.3	Feb 25, 2026
1.0.0	Feb 23, 2026

Download files

File	Type	Size	Uploaded	Tags
octomil_sdk-2.7.2.tar.gz	Source Distribution	505.9 kB	Mar 9, 2026	Source
octomil_sdk-2.7.2-py3-none-any.whl	Built Distribution	334.6 kB	Mar 9, 2026	Python 3

File details

Details for the file octomil_sdk-2.7.2.tar.gz.

File metadata

Download URL: octomil_sdk-2.7.2.tar.gz
Upload date: Mar 9, 2026
Size: 505.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for octomil_sdk-2.7.2.tar.gz
Algorithm	Hash digest
SHA256	`4b5f9ca5ec9b00c7870d39ecddde9cee34a30b3895e93cb30c19a667dfefb425`
MD5	`eaf297906c85eb27f0e6f967a47a61fd`
BLAKE2b-256	`f7bc41e35e0a206bf172d0c0180fc7a863e347b7e806b540ea16f5d1a9d8bcbf`

File details

Details for the file octomil_sdk-2.7.2-py3-none-any.whl.

File metadata

Download URL: octomil_sdk-2.7.2-py3-none-any.whl
Upload date: Mar 9, 2026
Size: 334.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for octomil_sdk-2.7.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9cbbce3419614facc6ac4ff6870303d709ab19a8c787c4acc848e4e9ea26dc00`
MD5	`9718461ddabdd375269bad53631389ad`
BLAKE2b-256	`6adb475193c5610029fa2ee88e5aeeeea632d1b58e148646bb28f7fac6232884`
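
After downloading either file, you can check it against the published SHA256 digest. A short sketch using `hashlib`: the digests are copied from the tables above, and the helper names and the local file path in the usage note are illustrative.

```python
import hashlib

# SHA256 digests copied from the hash tables above.
EXPECTED_SHA256 = {
    "octomil_sdk-2.7.2.tar.gz":
        "4b5f9ca5ec9b00c7870d39ecddde9cee34a30b3895e93cb30c19a667dfefb425",
    "octomil_sdk-2.7.2-py3-none-any.whl":
        "9cbbce3419614facc6ac4ff6870303d709ab19a8c787c4acc848e4e9ea26dc00",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large downloads do not load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected):
    """True if the file's SHA256 matches the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

For example, `verify("octomil_sdk-2.7.2.tar.gz", EXPECTED_SHA256["octomil_sdk-2.7.2.tar.gz"])` should return True for an intact download in the current directory.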