# Octomil — serve, deploy, and observe ML models on edge devices
Run LLMs on your laptop, phone, or edge device. One command. OpenAI-compatible API.
## What is this?

Octomil is a CLI + Python SDK for running open-weight models locally behind an OpenAI-compatible API. It detects your hardware, picks the fastest available engine, and gives you a local-first replacement for cloud API calls on Mac, Linux, and Windows.
## Quick start

### Install

```sh
curl -fsSL https://get.octomil.com | sh
```

Or via pip:

```sh
pip install octomil
```
### Local inference (no server, no account needed)

```sh
# Chat / responses
octomil run "What can you help me with?"

# Embeddings
octomil embed "On-device AI inference at scale" --json

# Transcription
octomil transcribe meeting.wav
```
### OpenAI-compatible local server

```sh
octomil serve
```

Then use any OpenAI-compatible client:

```sh
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```
### Hosted API

```sh
export OCTOMIL_SERVER_KEY=YOUR_SERVER_KEY

curl https://api.octomil.com/v1/responses \
  -H "Authorization: Bearer $OCTOMIL_SERVER_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"default","input":"Hello"}'
```
### Unified facade (recommended for new code)

The `Octomil` facade is the simplest way to use the cloud-backed Responses API:

```sh
export OCTOMIL_SERVER_KEY=YOUR_SERVER_KEY
export OCTOMIL_ORG_ID=YOUR_ORG_ID
```

```python
import asyncio

from octomil import Octomil

async def main():
    client = Octomil.from_env()
    await client.initialize()
    response = await client.responses.create(model="phi-4-mini", input="Hello")
    print(response.output_text)

asyncio.run(main())
```
Embeddings are available through the same facade:

```python
# Embeddings
result = await client.embeddings.create(
    model="nomic-embed-text-v1.5",
    input="On-device AI inference at scale",
)
print(result.embeddings[0][:5])
```
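Embedding vectors like the ones returned above are typically compared with cosine similarity for retrieval. A stdlib-only sketch of the scoring step; the short vectors here are toy stand-ins, not real model output:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors standing in for real embeddings
query = [0.1, 0.3, 0.5, 0.1]
doc_a = [0.1, 0.3, 0.5, 0.1]   # identical direction -> similarity 1.0
doc_b = [0.5, 0.1, 0.1, 0.3]

print(round(cosine_similarity(query, doc_a), 4))  # 1.0
print(cosine_similarity(query, doc_b) < 1.0)      # True
```

Rank candidate documents by this score against the query embedding to get nearest-neighbour retrieval.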
### Migrating from OctomilClient

`OctomilClient` and the low-level `OctomilResponses` / `ResponseRequest` APIs still work exactly as before. The `Octomil` facade is a convenience wrapper for the common path; it delegates to the same underlying client internally.
## Native API

The Responses API is the primary Octomil interface for new code. It gives you local inference, routing, multimodal inputs, and conversation threading without going through the OpenAI compatibility layer.

### responses.create

```python
import asyncio

from octomil.responses import OctomilResponses, ResponseRequest, text_input

responses = OctomilResponses()

async def main():
    result = await responses.create(ResponseRequest(
        model="gemma-1b",
        input=[text_input("Explain quantum computing in one sentence")],
    ))
    print(result.output[0].text)

asyncio.run(main())
```
Pass a plain string as shorthand:

```python
result = await responses.create(ResponseRequest.text("gemma-1b", "Hello"))
print(result.output[0].text)
```
### responses.stream

```python
import asyncio

from octomil.responses import (
    DoneEvent,
    OctomilResponses,
    ResponseRequest,
    TextDeltaEvent,
    text_input,
)

responses = OctomilResponses()

async def main():
    async for event in responses.stream(ResponseRequest(
        model="gemma-1b",
        input=[text_input("Write a haiku about the ocean")],
    )):
        if isinstance(event, TextDeltaEvent):
            print(event.delta, end="", flush=True)
        elif isinstance(event, DoneEvent):
            print()
            print(f"Tokens used: {event.response.usage.total_tokens}")

asyncio.run(main())
```
### With system instructions and conversation threading

```python
result1 = await responses.create(ResponseRequest(
    model="gemma-1b",
    input=[text_input("My name is Alice.")],
    instructions="You are a helpful assistant.",
))

# Continue the conversation by referencing the previous response
result2 = await responses.create(ResponseRequest(
    model="gemma-1b",
    input=[text_input("What's my name?")],
    previous_response_id=result1.id,
))
print(result2.output[0].text)  # "Your name is Alice."
```
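The `previous_response_id` pattern is easiest to picture as a linked list of responses: each response points back at its predecessor, and the server reconstructs the conversation by walking that chain. A toy in-memory sketch of those semantics (illustrative only, not Octomil's actual implementation):

```python
import uuid

class ToyThreadStore:
    """Illustrative only: resolves previous_response_id chains into history."""

    def __init__(self):
        self._responses = {}  # response_id -> (previous_id, user_input, output)

    def record(self, user_input, output, previous_response_id=None):
        response_id = str(uuid.uuid4())
        self._responses[response_id] = (previous_response_id, user_input, output)
        return response_id

    def history(self, response_id):
        """Walk the chain backwards, then return turns oldest-first."""
        turns = []
        while response_id is not None:
            previous_id, user_input, output = self._responses[response_id]
            turns.append((user_input, output))
            response_id = previous_id
        return list(reversed(turns))

store = ToyThreadStore()
first = store.record("My name is Alice.", "Nice to meet you, Alice!")
second = store.record("What's my name?", "Your name is Alice.",
                      previous_response_id=first)

print(store.history(second))  # both turns, oldest first
```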
The OpenAI-compatible `/v1/chat/completions` endpoint remains available for existing integrations. See Migrating from OpenAI below if you are switching from the OpenAI SDK.
## Features

**Auto engine selection** -- benchmarks all available engines and picks the fastest:

```sh
octomil serve llama-3b
# => Detected: mlx-lm (38 tok/s), llama.cpp (29 tok/s), ollama (25 tok/s)
# => Using mlx-lm
```
**60+ models** -- Gemma, Llama, Phi, Qwen, DeepSeek, Mistral, Mixtral, and more:

```sh
octomil models                 # list all available models
octomil serve phi-mini         # Microsoft Phi-4 Mini (3.8B)
octomil serve deepseek-r1-7b   # DeepSeek R1 reasoning
octomil serve qwen3-4b         # Alibaba Qwen 3
octomil serve whisper-small    # speech-to-text
```
**Interactive chat** -- one command from install to conversation:

```sh
octomil chat                   # auto-picks the best model for your device
octomil chat qwen-coder-7b     # chat with a specific model
octomil chat llama-8b -s "You are a Python expert."
```
**Launch coding agents** -- power Codex, aider, or other agents with local inference:

```sh
octomil launch                 # pick an agent interactively
octomil launch codex           # launch OpenAI Codex CLI with a local model
octomil launch codex --model codestral
```
**Deploy to phones** -- push models to iOS/Android devices:

```sh
octomil deploy gemma-1b --phone --rollout 10   # canary to 10% of devices
octomil status gemma-1b                        # monitor rollout
octomil rollback gemma-1b                      # instant rollback
```
**Benchmark your hardware:**

```sh
octomil benchmark gemma-1b
# Model: gemma-1b (4bit)
# Engine: mlx-lm
# Tokens/sec: 42.3
# Memory: 1.2 GB
# Time to first token: 89ms
```
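The two headline numbers in a report like that are straightforward to derive: tokens/sec is tokens generated divided by decode time, while time-to-first-token is measured separately because prefill and decode have different costs. A sketch of the arithmetic, using made-up sample measurements:

```python
def throughput_tokens_per_sec(tokens_generated: int, decode_seconds: float) -> float:
    """Decode throughput: generated tokens divided by time spent decoding."""
    return tokens_generated / decode_seconds

# Made-up sample measurements, roughly matching the report above
tokens_generated = 254
decode_seconds = 6.0
ttft_ms = 89  # time from request to first token, measured separately

print(f"Tokens/sec: {throughput_tokens_per_sec(tokens_generated, decode_seconds):.1f}")
print(f"Time to first token: {ttft_ms}ms")
```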
**MCP server for AI tools** -- give Claude, Cursor, VS Code, and Codex access to local inference:

```sh
octomil mcp register                   # register with all detected AI tools
octomil mcp register --target claude   # register with Claude Code only
octomil mcp status                     # check registration status
```
**Model conversion** -- convert to CoreML (iOS) or TFLite (Android):

```sh
octomil convert model.pt --target ios,android
```

**Multi-model serving** -- load multiple models, route by request:

```sh
octomil serve --models smollm-360m,phi-mini,llama-3b
```
## Supported engines

| Engine | Platform | Install |
|---|---|---|
| MLX | Apple Silicon Mac | `pip install 'octomil[mlx]'` |
| llama.cpp | Mac, Linux, Windows | `pip install 'octomil[llama]'` |
| ONNX Runtime | All platforms | `pip install 'octomil[onnx]'` |
| MLC-LLM | Mac, Linux, Android | auto-detected |
| MNN | All platforms | auto-detected |
| ExecuTorch | Mobile | auto-detected |
| Whisper.cpp | All platforms | `pip install 'octomil[whisper]'` |
| Ollama | Mac, Linux | auto-detected if running |

No engine installed? `octomil serve` tells you exactly what to install.
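Detection of the pip-installed engines in the table above can be done with a plain import probe. A stdlib-only sketch of the idea; the module names probed here are the upstream packages' import names and are not guaranteed to match Octomil's internal detection logic:

```python
import importlib.util

# Hypothetical mapping of engine name -> Python import name to probe
ENGINE_MODULES = {
    "mlx-lm": "mlx_lm",
    "llama.cpp": "llama_cpp",
    "onnxruntime": "onnxruntime",
}

def detect_engines() -> list:
    """Return the engines whose Python packages are importable here."""
    return [
        engine
        for engine, module in ENGINE_MODULES.items()
        if importlib.util.find_spec(module) is not None
    ]

print(detect_engines())  # result depends on what is installed locally
```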
## Supported models

Full model list (60+ models):

| Model | Sizes | Engines |
|---|---|---|
| Gemma 3 | 1B, 4B, 12B, 27B | MLX, llama.cpp, MNN, ONNX, MLC |
| Gemma 2 | 2B, 9B, 27B | MLX, llama.cpp |
| Llama 3.2 | 1B, 3B | MLX, llama.cpp, MNN, ONNX, MLC |
| Llama 3.1/3.3 | 8B, 70B | MLX, llama.cpp |
| Phi-4 / Phi Mini | 3.8B, 14B | MLX, llama.cpp, MNN, ONNX |
| Qwen 2.5 | 1.5B, 3B, 7B | MLX, llama.cpp, MNN, ONNX |
| Qwen 3 | 0.6B - 32B | MLX, llama.cpp |
| DeepSeek R1 | 1.5B - 70B | MLX, llama.cpp |
| DeepSeek V3 | 671B (MoE) | MLX, llama.cpp |
| Mistral / Nemo / Small | 7B, 12B, 24B | MLX, llama.cpp |
| Mixtral | 8x7B, 8x22B (MoE) | MLX, llama.cpp |
| Qwen 2.5 Coder | 1.5B, 7B | MLX, llama.cpp |
| CodeLlama | 7B, 13B, 34B | MLX, llama.cpp |
| StarCoder2 | 3B, 7B, 15B | MLX, llama.cpp |
| Falcon 3 | 1B, 7B, 10B | MLX, llama.cpp |
| SmolLM | 360M, 1.7B | MLX, llama.cpp, MNN, ONNX |
| Whisper | tiny - large-v3 | Whisper.cpp |
| + many more | | |

Use aliases: `octomil serve deepseek-r1` resolves to `deepseek-r1-7b`. Each model supports 4bit, 8bit, and fp16 quantization variants.
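Alias resolution plus a quantization variant reduces to a small keyed lookup. A toy sketch of the pattern; the alias table, default quant, and `name:quant` output format here are illustrative, not the real catalog:

```python
# Toy alias table; the real catalog covers 60+ models
ALIASES = {
    "deepseek-r1": "deepseek-r1-7b",
}
QUANTS = {"4bit", "8bit", "fp16"}

def resolve(name: str, quant: str = "4bit") -> str:
    """Resolve an alias to a concrete model and attach a quantization variant."""
    if quant not in QUANTS:
        raise ValueError(f"unknown quant: {quant}")
    concrete = ALIASES.get(name, name)  # unknown names pass through unchanged
    return f"{concrete}:{quant}"

print(resolve("deepseek-r1"))        # deepseek-r1-7b:4bit
print(resolve("gemma-1b", "fp16"))   # gemma-1b:fp16
```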
How it works
curl -fsSL https://get.octomil.com | sh
│
└── octomil setup (background)
├── 1. Find system Python with venv support
├── 2. Create ~/.octomil/engines/venv/
├── 3. Install best engine (mlx-lm on Apple Silicon, llama.cpp elsewhere)
├── 4. Download recommended model for your device
└── 5. Register MCP server with AI tools (Claude, Cursor, VS Code, Codex)
octomil serve gemma-1b
│
├── 1. Resolve model name → catalog lookup (aliases, quant variants)
├── 2. Detect engines → MLX? llama.cpp? ONNX? Ollama running?
├── 3. Benchmark engines → Run each, measure tok/s, pick fastest
├── 4. Download model → HuggingFace Hub (cached after first pull)
└── 5. Start server → FastAPI on :8080, OpenAI-compatible API
├── POST /v1/chat/completions
├── POST /v1/completions
└── GET /v1/models
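Step 3 of `octomil serve` (benchmark engines, pick the fastest) reduces to a max over measured throughput. A minimal sketch, with made-up numbers matching the example output shown earlier:

```python
# Made-up benchmark results (tokens/sec), as in the example serve output
measured = {"mlx-lm": 38.0, "llama.cpp": 29.0, "ollama": 25.0}

def pick_fastest(results: dict) -> str:
    """Choose the engine with the highest measured tokens/sec."""
    return max(results, key=results.get)

print(f"Using {pick_fastest(measured)}")  # Using mlx-lm
```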
## CLI reference

| Command | Description |
|---|---|
| `octomil setup` | Install engine, download model, register MCP servers |
| `octomil serve <model>` | Start an OpenAI-compatible inference server |
| `octomil chat [model]` | Interactive chat (auto-starts server) |
| `octomil launch [agent]` | Launch a coding agent with local inference |
| `octomil models` | List available models |
| `octomil benchmark <model>` | Benchmark inference speed on your hardware |
| `octomil warmup` | Pre-download the recommended model for your device |
| `octomil mcp register` | Register MCP server with AI tools |
| `octomil mcp unregister` | Remove MCP server from AI tools |
| `octomil mcp status` | Show MCP registration status |
| `octomil mcp serve` | Start the HTTP agent server (REST + A2A) |
| `octomil deploy <model>` | Deploy a model to edge devices |
| `octomil rollback <model>` | Roll back a deployment |
| `octomil convert <file>` | Convert a model to CoreML / TFLite |
| `octomil pull <model>` | Download a model |
| `octomil push <file>` | Upload a model to the registry |
| `octomil status <model>` | Check deployment status |
| `octomil scan <path>` | Security-scan a model or app bundle |
| `octomil completions` | Print shell completion setup instructions |
| `octomil pair` | Pair with a phone for deployment |
| `octomil dashboard` | Open the web dashboard |
| `octomil login` | Authenticate with Octomil |
| `octomil init` | Initialize an organization |
## AppManifest

An `AppManifest` declares which AI capabilities your app needs and how models are delivered. All SDKs (iOS, Android, Node, Python) use `AppManifest` as a programmatic data structure: you instantiate it in code, not from a config file.
### Delivery modes

| Mode | Behaviour |
|---|---|
| `managed` | Control plane assigns the model version. SDK downloads and caches it. |
| `bundled` | Model is included in the app binary at `bundled_path`. |
| `cloud` | Inference runs remotely; no model artifact stored on device. |
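The three delivery modes imply three different artifact-acquisition paths at startup. A toy dispatch sketch of those semantics; the enum and function here are illustrative, not SDK API:

```python
from enum import Enum
from typing import Optional

class DeliveryMode(Enum):
    MANAGED = "managed"
    BUNDLED = "bundled"
    CLOUD = "cloud"

def model_source(mode: DeliveryMode, bundled_path: Optional[str] = None) -> str:
    """Describe where the model artifact comes from for each delivery mode."""
    if mode is DeliveryMode.MANAGED:
        return "download the version assigned by the control plane, then cache it"
    if mode is DeliveryMode.BUNDLED:
        return f"load from the app binary at {bundled_path}"
    return "no local artifact: inference runs remotely"

print(model_source(DeliveryMode.BUNDLED, "models/classifier.mlmodelc"))
```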
### Capabilities

Each manifest entry maps a model to a named capability the app requests at runtime:

| Capability | Use case |
|---|---|
| `chat` | Conversational generation (chat UI) |
| `transcription` | Speech-to-text (Whisper pipeline) |
| `keyboard_prediction` | Next-word suggestion chips |
| `embedding` | Vector encoding for retrieval |
| `classification` | Text or image categorisation |
### How SDKs consume it

iOS — declare in code, configure the client:

```swift
import Octomil

let client = OctomilClient(auth: .publishableKey("oct_pub_live_..."))

let manifest = AppManifest(models: [
    AppModelEntry(id: "chat-model", capability: .chat, delivery: .managed),
    AppModelEntry(id: "classifier", capability: .classification, delivery: .bundled,
                  bundledPath: "models/classifier.mlmodelc"),
])

try await client.configure(manifest: manifest,
                           auth: .publishableKey("oct_pub_live_..."),
                           monitoring: .enabled)
```

See the iOS SDK README for full integration instructions.
Android — same pattern:

```kotlin
import ai.octomil.Octomil
import ai.octomil.auth.AuthConfig
import ai.octomil.manifest.*

val manifest = AppManifest(models = listOf(
    AppModelEntry(
        id = "chat-model",
        capability = ModelCapability.CHAT,
        delivery = DeliveryMode.MANAGED,
        inputModalities = listOf(Modality.TEXT),
        outputModalities = listOf(Modality.TEXT),
    ),
))

Octomil.configure(context, manifest, auth = AuthConfig.PublishableKey("oct_pub_live_..."))
```

See the Android SDK README for full integration instructions.
### Python SDK

Two separate configure paths:

```python
import octomil
from octomil.auth_config import PublishableKeyAuth

# 1. Device registration (background thread, non-blocking)
ctx = octomil.configure(auth=PublishableKeyAuth(key="oct_pub_live_..."))

# 2. Attach a manifest for catalog-driven model resolution
from octomil import OctomilClient
from octomil.manifest.types import AppManifest, AppModelEntry
from octomil._generated.delivery_mode import DeliveryMode
from octomil._generated.model_capability import ModelCapability

client = OctomilClient.from_env()
client.configure(manifest=AppManifest(models=[
    AppModelEntry(id="chat-model", capability=ModelCapability.TEXT_GENERATION,
                  delivery=DeliveryMode.MANAGED),
]))
```

Note: the Python SDK does not auto-poll desired state. Use `client.control.get_desired_state()` to fetch it explicitly.
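Resolving a runtime capability request against a manifest is essentially a keyed lookup over the declared entries. A toy sketch of that pattern, with plain dicts standing in for the SDK's manifest types:

```python
# Plain dicts standing in for AppManifest / AppModelEntry
app_manifest = {
    "models": [
        {"id": "chat-model", "capability": "text_generation", "delivery": "managed"},
        {"id": "classifier", "capability": "classification", "delivery": "bundled"},
    ]
}

def model_for_capability(manifest: dict, capability: str) -> str:
    """Return the id of the first model entry declaring the capability."""
    for entry in manifest["models"]:
        if entry["capability"] == capability:
            return entry["id"]
    raise LookupError(f"no model declared for capability {capability!r}")

print(model_for_capability(app_manifest, "classification"))  # classifier
```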
## vs. alternatives

| | Octomil | Ollama | llama.cpp (raw) | Cloud APIs |
|---|---|---|---|---|
| One-command serve | yes | yes | no (build from source) | n/a |
| OpenAI-compatible API | yes | yes | partial | native |
| Auto engine selection | yes (benchmarks all) | no (single engine) | n/a | n/a |
| Deploy to phones | yes | no | manual | no |
| Fleet rollouts + rollback | yes | no | no | n/a |
| Model conversion (CoreML/TFLite) | yes | no | no | n/a |
| A/B testing | yes | no | no | no |
| Offline / on-device | yes | yes | yes | no |
| Cost per inference | $0 (your hardware) | $0 | $0 | $0.01-0.10 |
| 60+ models in catalog | yes | yes (different catalog) | yes (manual download) | varies |
| Python SDK | yes | yes | community | yes |
## Migrating from OpenAI

Octomil is wire-compatible with the OpenAI API. Change two lines:

```python
# Before
from openai import OpenAI
client = OpenAI(api_key="sk-...")

# After (local inference; no API key needed)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")
```

That's it. `chat.completions.create`, streaming, tool calls, and audio transcriptions all work without further changes.

For a full guide including model name mapping, error code mapping, and a comparison of what's different, see `docs/migration-from-openai.md`.
## SDKs

| SDK | Package | Status | Inference engine |
|---|---|---|---|
| Python | `octomil` (PyPI) | Production (v2.10.1) | MLX, llama.cpp, ONNX, MLC, ExecuTorch, Whisper, MNN, Ollama |
| Browser | `@octomil/browser` (npm) | Production (v1.0.0) | ONNX Runtime Web (WebGPU + WASM) |
| iOS | Swift Package Manager | Production (v1.1.0) | CoreML + MLX |
| Android | Maven (GitHub Packages) | Production (v1.2.0) | TFLite + vendor NPU |
| Node | `@octomil/sdk` (source) | v0.1.0 (not on npm) | ONNX Runtime Node |
### Python SDK

For fleet management, model registry, and A/B testing:

```python
from octomil import Octomil

client = Octomil(api_key="oct_...", org_id="org_123")

# Register and deploy a model
model = client.registry.ensure_model(name="sentiment", framework="pytorch")
client.rollouts.create(model_id=model["id"], version="1.0.0", rollout_percentage=10)

# Run an A/B test
client.experiments.create(
    name="v1-vs-v2",
    model_id=model["id"],
    control_version="1.0.0",
    treatment_version="1.1.0",
)
```
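Under the hood, an A/B experiment needs a deterministic way to assign each device to control or treatment so a device sees the same variant on every request. A common approach is hashing a stable id into a bucket; a stdlib sketch of that idea (the hashing scheme is illustrative, not Octomil's actual assignment logic):

```python
import hashlib

def assign_variant(device_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically bucket a device into control or treatment."""
    digest = hashlib.sha256(f"{experiment}:{device_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable value in 0..99
    return "treatment" if bucket < treatment_pct else "control"

# Same inputs always map to the same variant
print(assign_variant("device-42", "v1-vs-v2"))
print(assign_variant("device-42", "v1-vs-v2") == assign_variant("device-42", "v1-vs-v2"))  # True
```

Salting the hash with the experiment name keeps bucket assignments independent across experiments.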
## MCP server & AI tool integration

Octomil registers as an MCP server across your AI coding tools so they can use local inference. `octomil setup` does this automatically, or you can run it manually:

```sh
octomil mcp register                   # Claude Code, Cursor, VS Code, Codex CLI
octomil mcp register --target cursor   # single tool
octomil mcp status                     # check what's registered
octomil mcp unregister                 # remove from all tools
```
## HTTP agent server & x402 payments

Octomil also exposes its tools over HTTP with an A2A agent card, OpenAPI docs, and optional micro-payments via the x402 protocol.

```sh
octomil mcp serve              # start the HTTP agent server on :8402
octomil mcp serve --port 9000  # custom port

# With x402 payment gating (agents pay per call)
OCTOMIL_X402_ADDRESS=0xYourWallet \
OCTOMIL_SETTLER_TOKEN=s402_... \
octomil mcp serve --x402
```
How it works:

- An agent calls an Octomil tool (e.g. `/api/v1/run_inference`)
- The server returns `402 Payment Required` with x402 payment requirements
- The agent signs an EIP-3009 `transferWithAuthorization` and retries with an `x-payment` header
- The server verifies the signature, serves the response, and accumulates the payment
- When payments reach the settlement threshold ($1 USDC by default), the batch is submitted to settle402 for on-chain settlement via Multicall3
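With the defaults ($0.001 per call, $1 settlement threshold), a batch settles after 1,000 paid calls. The arithmetic, assuming USDC's standard 6 decimal places on-chain:

```python
USDC_DECIMALS = 6  # USDC uses 6 decimal places on-chain, so $1 = 1_000_000 base units

def calls_until_settlement(price_base_units: int, threshold_usd: float) -> int:
    """Number of paid calls before the accumulated batch reaches the threshold."""
    threshold_base_units = int(threshold_usd * 10 ** USDC_DECIMALS)
    # Ceiling division: settlement triggers on the call that crosses the threshold
    return -(-threshold_base_units // price_base_units)

# Defaults: OCTOMIL_X402_PRICE=1000 (= $0.001), OCTOMIL_X402_THRESHOLD=1.0
print(calls_until_settlement(1000, 1.0))  # 1000
```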
Environment variables:

| Variable | Default | Description |
|---|---|---|
| `OCTOMIL_X402_ADDRESS` | — | Your wallet address (where you get paid) |
| `OCTOMIL_X402_PRICE` | `1000` | Price per call in base units (1000 = $0.001 USDC) |
| `OCTOMIL_X402_NETWORK` | `base` | Chain: base, ethereum, polygon, arbitrum, optimism |
| `OCTOMIL_X402_THRESHOLD` | `1.0` | Settlement threshold in USD |
| `OCTOMIL_SETTLER_URL` | `https://api.settle402.dev` | settle402 batch settlement endpoint |
| `OCTOMIL_SETTLER_TOKEN` | — | settle402 API key |
## Requirements

- Python 3.9+
- At least one inference engine (see Supported engines)
- macOS, Linux, or Windows

## Contributing

See CONTRIBUTING.md.
## License