Python client for Orchard, a compute platform for Apple Silicon

Orchard

100% local, OpenAI-compatible LLM inference for Apple Silicon. Multi-model serving, prefix caching, continuous batching. No cloud APIs, no data leaves your machine.

macOS 14+ · Apple Silicon (M1+) · Python 3.12+ · Apache-2.0

Features

Drop-in OpenAI API — /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/models
Fast — C++ inference engine with prefix caching and continuous batching
Multi-model — load Qwen, Llama, and Gemma side-by-side; swap between them per request
Multimodal — vision, tool calling, thinking; native where the model was trained, grammar-constrained where it wasn't
OpenAI Responses API — streaming events for reasoning, tool calls, and messages
Use from anything — curl, Python (openai SDK), Rust (orchard-rs), or any OpenAI-compatible client
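
The /v1/embeddings endpoint listed above can be exercised with the standard library alone. The following is a minimal sketch, assuming the server accepts the usual OpenAI request body (`model`, `input`) and returns vectors under `data[i].embedding`. The helper names are hypothetical, and which models serve embeddings is not stated here.

```python
import json
import math
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address used in this README


def embeddings_request(model: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/embeddings."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_vectors(response: dict) -> list[list[float]]:
    """Return the vectors from an OpenAI-style response, ordered by `index`."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

To send the request against a running server, pass the result of `embeddings_request(...)` to `urllib.request.urlopen` and decode the body with `json.load`.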

Getting started

pip install orchard
orchard serve --model google/gemma-4-E4B-it

First run downloads the PIE engine binary (~2 GB) and the model weights from HuggingFace. Subsequent runs start in seconds.

Then point anything at http://localhost:8000/v1:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
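
If the openai package is not installed, the same request can be made with the standard library. This is a sketch equivalent to the curl call above; `chat_request`, `first_reply`, and `ask` are hypothetical helper names, and `ask` needs a running `orchard serve`.

```python
import json
import urllib.request


def chat_request(
    prompt: str,
    model: str = "google/gemma-4-E4B-it",
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions with a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response: dict) -> str:
    """Extract the assistant text from a Chat Completions response body."""
    return response["choices"][0]["message"]["content"]


def ask(prompt: str) -> str:
    """Send the request to the local server and return the reply text."""
    with urllib.request.urlopen(chat_request(prompt)) as resp:
        return first_reply(json.load(resp))
```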

Supported models

Orchard downloads open-weights models from HuggingFace on demand. Any model whose family has a profile in Pantheon works out of the box.

Model	Size (BF16)	Modalities	Best for
google/gemma-4-E4B-it (default)	~8 GB	text, vision	Multimodal, native thinking, tool calls
Qwen/Qwen3.5-4B	~9 GB	text	256k context, native thinking, tool calls
meta-llama/Llama-3.1-8B-Instruct	~16 GB	text	General-purpose, trained tool calls
google/gemma-3-4b-it	~8 GB	text, vision	Multimodal chat
google/gemma-4-E2B-it	~5 GB	text, vision	Fits on 8 GB Macs
moondream/moondream3-preview	~9 GB	vision	Pointing, detection, captioning
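
The Size (BF16) column can be approximated from the parameter count, since BF16 stores each parameter in 2 bytes. For example, 8 billion parameters at 2 bytes each is about 16 GB, which matches the Llama row. The sketch below is a rough estimate only: it counts weights alone, uses decimal gigabytes, and ignores embeddings, vision towers, and quantization metadata, so real checkpoints will differ.

```python
def weight_gb(params_billions: float, bits_per_param: int = 16) -> float:
    """Approximate weight size in decimal GB for a given precision.

    billions of parameters * (bits / 8) bytes per parameter = gigabytes
    """
    return params_billions * bits_per_param / 8


print(weight_gb(8))      # BF16 (16-bit): 16.0
print(weight_gb(8, 4))   # hypothetical 4-bit quantization: 4.0
```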

By hardware

Your Mac	Recommended
M1 / M2 / M3 (8 GB)	`google/gemma-4-E2B-it`
M-Pro / Max (16–32 GB)	`google/gemma-4-E4B-it`, `Qwen/Qwen3.5-4B`
M-Max / Ultra (32+ GB)	`meta-llama/Llama-3.1-8B-Instruct` + a small model hot-loaded

Need quantized weights? Pass any mlx-community/... repo directly. Orchard resolves any HuggingFace repo whose architecture belongs to a supported family.
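
The hardware table can be turned into a small lookup. This sketch keys on unified memory only, which is a simplification of the table (it also names chip tiers), and it assigns exactly 32 GB to the larger tier, an arbitrary choice where the two ranges overlap. `recommend` is a hypothetical helper, not part of Orchard.

```python
# (minimum unified memory in GB, recommended repos), largest tier first
RECOMMENDATIONS = [
    (32, ["meta-llama/Llama-3.1-8B-Instruct"]),
    (16, ["google/gemma-4-E4B-it", "Qwen/Qwen3.5-4B"]),
    (8, ["google/gemma-4-E2B-it"]),
]


def recommend(ram_gb: float) -> list[str]:
    """Return the repos suggested for a machine with `ram_gb` of memory.

    Returns an empty list below 8 GB, which the table does not cover.
    """
    for min_gb, repos in RECOMMENDATIONS:
        if ram_gb >= min_gb:
            return repos
    return []
```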

OpenAI Responses API

Orchard implements the OpenAI Responses API — the successor to Chat Completions. You get structured streaming events for reasoning, tool calls, and messages, plus per-item state and lifecycle events.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.responses.create(
    model="google/gemma-4-E4B-it",
    input="Explain quantum tunneling in three sentences.",
    reasoning={"effort": "medium"},
    stream=True,
)
for event in stream:
    print(event)

Supports response.output_text.delta, response.reasoning.delta, response.function_call_arguments.delta, structured output via JSON Schema, and max_tool_calls for bounded tool loops.
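
Rather than printing every event, a consumer will usually dispatch on the event type. The sketch below separates answer text from reasoning text. It assumes each event exposes `type` and `delta` attributes, as the OpenAI SDK's text delta events do, and that reasoning arrives under the `response.reasoning.delta` name listed above. The stand-in events exist only so the example runs without a server.

```python
from types import SimpleNamespace


def split_stream(events):
    """Accumulate answer and reasoning text from a Responses event stream."""
    answer, reasoning = [], []
    for event in events:
        kind = getattr(event, "type", "")
        if kind == "response.output_text.delta":
            answer.append(event.delta)
        elif kind == "response.reasoning.delta":
            reasoning.append(event.delta)
        # Other lifecycle events (created, completed, ...) are ignored here.
    return "".join(answer), "".join(reasoning)


# Dry run with stand-in events; a real stream would come from
# client.responses.create(..., stream=True).
fake_stream = [
    SimpleNamespace(type="response.reasoning.delta", delta="Think."),
    SimpleNamespace(type="response.output_text.delta", delta="Hel"),
    SimpleNamespace(type="response.output_text.delta", delta="lo"),
    SimpleNamespace(type="response.completed"),
]
answer, reasoning = split_stream(fake_stream)
print(answer)     # Hello
print(reasoning)  # Think.
```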

Python SDK (no HTTP)

For embedded use — skip the HTTP layer:

import asyncio

from orchard.engine.inference_engine import InferenceEngine


async def main() -> None:
    # `async with` and `await` are only valid inside a coroutine.
    async with InferenceEngine() as engine:
        client = engine.client()
        response = await client.achat(
            "google/gemma-4-E4B-it",
            messages=[{"role": "user", "content": "Hello!"}],
        )
        print(response.text)


asyncio.run(main())

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.
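
Several prompts can be issued concurrently with plain asyncio. The sketch below uses only the `achat(model, messages=...)` call and the `.text` attribute shown above; it is ordinary `asyncio.gather` fan-out, not Orchard's own batching API, which is not documented here. `EchoClient` is a hypothetical stand-in so the example runs without the engine.

```python
import asyncio
from types import SimpleNamespace


async def ask_many(client, model: str, prompts: list[str]) -> list[str]:
    """Run one `achat` call per prompt concurrently; replies keep prompt order."""
    calls = [
        client.achat(model, messages=[{"role": "user", "content": prompt}])
        for prompt in prompts
    ]
    responses = await asyncio.gather(*calls)
    return [response.text for response in responses]


class EchoClient:
    """Stand-in for the object returned by `engine.client()` (dry runs only)."""

    async def achat(self, model, messages):
        await asyncio.sleep(0)  # yield to the event loop like a real call would
        return SimpleNamespace(text=messages[-1]["content"].upper())


replies = asyncio.run(ask_many(EchoClient(), "demo-model", ["a", "b"]))
print(replies)  # ['A', 'B']
```

With the real engine, `ask_many(engine.client(), ...)` would be awaited inside the `async with InferenceEngine()` block.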

Privacy

Orchard runs entirely on your Mac. No telemetry, no analytics, no phone-home.

Surface	Status	What it does
Inference	✅ Local	All generation on-device via PIE (C++, Metal)
Chat templates	✅ Local	Rendered from Pantheon profiles bundled in the package
Model weights	✅ One-time	HuggingFace Hub → `~/.cache/huggingface/`
Engine binary	✅ One-time	GitHub release → `~/.orchard/bin/`
Telemetry	✅ None	No tracking SDKs — verify with `grep -r analytics orchard/`
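
The two download locations in the table can be inspected directly. This is a small sketch that reports how much disk space each one uses; the paths come from the table, while the helper names are hypothetical. Symlinks are skipped, on the assumption that the Hugging Face cache links snapshots to shared blob files and would otherwise be counted twice.

```python
from pathlib import Path

DOWNLOAD_DIRS = {
    "model weights": Path.home() / ".cache" / "huggingface",
    "engine binary": Path.home() / ".orchard" / "bin",
}


def dir_size_bytes(root: Path) -> int:
    """Total size of regular files under `root`; 0 if it does not exist."""
    if not root.exists():
        return 0
    return sum(
        path.stat().st_size
        for path in root.rglob("*")
        if path.is_file() and not path.is_symlink()
    )


def report() -> dict[str, int]:
    """Map each download location to its size in bytes."""
    return {label: dir_size_bytes(path) for label, path in DOWNLOAD_DIRS.items()}
```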

How it works

Orchard is the Python layer over a stack built for Apple Silicon:

PIE — C++ inference engine: prefix caching, continuous batching, multi-model scheduling
PAL — Metal GPU kernels
PSE — grammar-constrained generation for tool calls, structured output, and thinking
Pantheon — chat templates and capability manifests, shared across all Orchard SDKs

The Python package handles IPC, model resolution, HuggingFace downloads, prompt rendering, and the FastAPI server.

CLI

orchard serve --model <hf-repo> [--host 127.0.0.1] [--port 8000]
orchard serve --models model-a model-b model-c    # preload multiple
orchard upgrade [stable|nightly]                  # update engine binary
orchard engine stop                               # kill background engine
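
When the server is started from another Python program, the command line can be assembled from the flags above. This is a sketch, assuming `--host` and `--port` may be combined with `--models` the same way they are with `--model`; `serve_command` and `start_server` are hypothetical helpers.

```python
import subprocess


def serve_command(models: list[str], host: str = "127.0.0.1", port: int = 8000) -> list[str]:
    """Build the argv for `orchard serve` with one or several models."""
    if not models:
        raise ValueError("at least one model repo is required")
    if len(models) == 1:
        cmd = ["orchard", "serve", "--model", models[0]]
    else:
        cmd = ["orchard", "serve", "--models", *models]  # preload multiple
    return cmd + ["--host", host, "--port", str(port)]


def start_server(models: list[str], **kwargs) -> subprocess.Popen:
    """Launch the server as a child process; the caller must terminate it."""
    return subprocess.Popen(serve_command(models, **kwargs))
```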

Requirements

macOS 14+, Apple Silicon (M1 or newer)
Python 3.12+
~2 GB free disk for the engine binary, plus model weights
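
These requirements can be checked before installing. The sketch below encodes the thresholds listed above in a pure function and gathers the current machine's values with the standard library. The function names are hypothetical, and the disk check covers only the ~2 GB engine binary, not model weights.

```python
import os
import platform
import shutil
import sys


def _major(version: str) -> int:
    """Leading integer of a dotted version string, or 0 if absent."""
    head = version.split(".")[0]
    return int(head) if head.isdigit() else 0


def requirement_problems(system, machine, macos_version, python_version, free_disk_gb):
    """Return a list of unmet requirements (empty when everything passes)."""
    problems = []
    if system != "Darwin":
        problems.append("macOS is required")
    elif _major(macos_version) < 14:
        problems.append("macOS 14 or newer is required")
    if machine != "arm64":
        problems.append("Apple Silicon (arm64) is required")
    if tuple(python_version) < (3, 12):
        problems.append("Python 3.12 or newer is required")
    if free_disk_gb < 2:
        problems.append("about 2 GB of free disk is needed for the engine binary")
    return problems


def current_machine_problems():
    """Run the checks against the machine executing this script."""
    free_gb = shutil.disk_usage(os.path.expanduser("~")).free / 1e9
    return requirement_problems(
        platform.system(),
        platform.machine(),
        platform.mac_ver()[0],
        sys.version_info[:2],
        free_gb,
    )
```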

Related

orchard-rs — Rust client
orchard-swift — Swift client
Pantheon — model profiles
PIE — the engine

License

Apache-2.0


This version

2026.5.6

May 10, 2026


