Python client for Orchard, a compute platform for Apple Silicon

Orchard

100% local, OpenAI-compatible LLM inference for Apple Silicon. Multi-model serving, prefix caching, continuous batching. No cloud APIs, no data leaves your machine.

macOS 14+ · Apple Silicon (M1+) · Python 3.12+ · Apache-2.0

Features

  • Drop-in OpenAI API: /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/models
  • Fast — C++ inference engine with prefix caching and continuous batching
  • Multi-model — load Qwen, Llama, and Gemma side-by-side; swap between them per request
  • Multimodal — vision, tool calling, thinking; native where the model was trained, grammar-constrained where it wasn't
  • OpenAI Responses API — streaming events for reasoning, tool calls, and messages
  • Use from anything — curl, Python (openai SDK), Rust (orchard-rs), or any OpenAI-compatible client

Getting started

pip install orchard
orchard serve --model google/gemma-4-E4B-it

The first run downloads the PIE engine binary (~2 GB, from a GitHub release) and the model weights (from HuggingFace). Subsequent runs reuse both and start in seconds.

Then point any OpenAI-compatible client at http://localhost:8000/v1:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
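
Because the first run can spend a while downloading, a script that launches the server may want to wait until it answers before sending requests. The sketch below polls /v1/models (listed under Features) using only the standard library. It assumes the endpoint returns the usual OpenAI list shape, {"object": "list", "data": [{"id": ...}]}; the helper names and timeout values are illustrative and not part of Orchard.

```python
import json
import time
import urllib.request


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_for_models(base_url: str = "http://localhost:8000/v1",
                    timeout_s: float = 600.0,
                    poll_s: float = 2.0) -> list[str]:
    """Poll GET {base_url}/models until the server answers or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/models"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return parse_model_ids(json.load(resp))
        except OSError:
            # Covers refused connections, HTTP errors, and socket timeouts.
            time.sleep(poll_s)
    raise TimeoutError(f"no response from {url} after {timeout_s} s")
```

Calling wait_for_models() before the first chat request returns the ids the server reports, or raises TimeoutError if nothing answers in time.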

Supported models

Orchard downloads open-weights models from HuggingFace on demand. Any model whose family has a profile in Pantheon works out of the box.

| Model | Size (BF16) | Modalities | Best for |
|---|---|---|---|
| google/gemma-4-E4B-it (default) | ~8 GB | text, vision | Multimodal, native thinking, tool calls |
| Qwen/Qwen3.5-4B | ~9 GB | text | 256k context, native thinking, tool calls |
| meta-llama/Llama-3.1-8B-Instruct | ~16 GB | text | General-purpose, trained tool calls |
| google/gemma-3-4b-it | ~8 GB | text, vision | Multimodal chat |
| google/gemma-4-E2B-it | ~5 GB | text, vision | Fits on 8 GB Macs |
| moondream/moondream3-preview | ~9 GB | vision | Pointing, detection, captioning |
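
The BF16 column can be sanity-checked with simple arithmetic: BF16 stores each parameter in 2 bytes, so weights take roughly 2 GB per billion parameters. A rough sketch follows; the function name is illustrative, and the estimate covers weights only, not the KV cache or runtime overhead.

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate BF16 weight size: 2 bytes per parameter, in decimal GB."""
    bytes_per_param = 2
    return params_billion * 1e9 * bytes_per_param / 1e9


# An 8-billion-parameter model such as Llama-3.1-8B-Instruct comes out at
# about 16 GB, in line with the table.
print(bf16_weight_gb(8))  # 16.0
```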

By hardware

| Your Mac | Recommended |
|---|---|
| M1 / M2 / M3 (8 GB) | google/gemma-4-E2B-it |
| M-Pro / Max (16–32 GB) | google/gemma-4-E4B-it, Qwen/Qwen3.5-4B |
| M-Max / Ultra (32+ GB) | meta-llama/Llama-3.1-8B-Instruct + a small model hot-loaded |
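
For scripted setups, the hardware table can be encoded as a small lookup. This is only a sketch: the function name is hypothetical, and the boundary handling (under 16 GB counts as the small tier, exactly 32 GB as the middle tier) is an assumption, since the table's ranges overlap at 32 GB.

```python
def recommend_model(unified_memory_gb: int) -> str:
    """Map unified memory to the table's first recommendation for that tier."""
    if unified_memory_gb < 16:
        return "google/gemma-4-E2B-it"
    if unified_memory_gb <= 32:
        return "google/gemma-4-E4B-it"
    return "meta-llama/Llama-3.1-8B-Instruct"


print(recommend_model(8))   # google/gemma-4-E2B-it
print(recommend_model(64))  # meta-llama/Llama-3.1-8B-Instruct
```

On macOS, `sysctl -n hw.memsize` reports installed memory in bytes, which can be divided by 1024**3 to get the GB figure to pass in.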

Need quantized weights? Pass any mlx-community/... repo directly. Orchard resolves any HuggingFace repo whose architecture belongs to a supported family.

OpenAI Responses API

Orchard implements the OpenAI Responses API — the successor to Chat Completions. You get structured streaming events for reasoning, tool calls, and messages, plus per-item state and lifecycle events.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.responses.create(
    model="google/gemma-4-E4B-it",
    input="Explain quantum tunneling in three sentences.",
    reasoning={"effort": "medium"},
    stream=True,
)
for event in stream:
    print(event)

Supported streaming events include response.output_text.delta, response.reasoning.delta, and response.function_call_arguments.delta. The endpoint also supports structured output via JSON Schema and max_tool_calls for bounded tool loops.
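
Printing every event is fine for exploration, but most callers want the reasoning and the answer text kept apart. The sketch below groups text deltas by event type. It assumes each event exposes `type` and `delta` attributes, as the OpenAI SDK's event objects do; the stand-in events are made up for illustration, and `response.completed` is OpenAI's name for the final lifecycle event rather than something confirmed here for Orchard.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Iterable


def collect_deltas(events: Iterable) -> dict[str, str]:
    """Concatenate the `delta` text of streaming events, grouped by event type."""
    buffers: dict[str, list[str]] = defaultdict(list)
    for event in events:
        delta = getattr(event, "delta", None)
        if isinstance(delta, str):  # lifecycle events carry no text delta
            buffers[event.type].append(delta)
    return {kind: "".join(parts) for kind, parts in buffers.items()}


# Stand-in events; a real stream comes from client.responses.create(..., stream=True).
fake_stream = [
    SimpleNamespace(type="response.reasoning.delta", delta="Think. "),
    SimpleNamespace(type="response.output_text.delta", delta="Hel"),
    SimpleNamespace(type="response.output_text.delta", delta="lo"),
    SimpleNamespace(type="response.completed"),
]
print(collect_deltas(fake_stream)["response.output_text.delta"])  # Hello
```

Passing the `stream` object from the example above in place of `fake_stream` should work the same way, provided the events carry those two attributes.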

Python SDK (no HTTP)

For embedded use — skip the HTTP layer:

import asyncio

from orchard.engine.inference_engine import InferenceEngine


async def main() -> None:
    # `async with` and `await` are only valid inside a coroutine.
    async with InferenceEngine() as engine:
        client = engine.client()
        response = await client.achat(
            "google/gemma-4-E4B-it",
            messages=[{"role": "user", "content": "Hello!"}],
        )
        print(response.text)


asyncio.run(main())

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.
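
Several prompts can be sent concurrently with plain asyncio, using only the `achat` call shown above. This is not the SDK's own batching API (see orchard/clients/client.py for that); it is a generic sketch, and `ask_all` is a hypothetical helper name.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def ask_all(achat: Callable[..., Awaitable[Any]],
                  model: str,
                  prompts: list[str]) -> list[Any]:
    """Issue one chat call per prompt concurrently; results keep prompt order."""
    calls = [
        achat(model, messages=[{"role": "user", "content": prompt}])
        for prompt in prompts
    ]
    return await asyncio.gather(*calls)
```

Inside the `async with InferenceEngine()` block this would be called as `await ask_all(client.achat, "google/gemma-4-E4B-it", ["Hi", "Bye"])`. Whether the engine overlaps these requests depends on its continuous batching.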

Privacy

Orchard runs entirely on your Mac. No telemetry, no analytics, no phone-home.

| Surface | Status | What it does |
|---|---|---|
| Inference | ✅ Local | All generation on-device via PIE (C++, Metal) |
| Chat templates | ✅ Local | Rendered from Pantheon profiles bundled in the package |
| Model weights | ✅ One-time | HuggingFace Hub → ~/.cache/huggingface/ |
| Engine binary | ✅ One-time | GitHub release → ~/.orchard/bin/ |
| Telemetry | ✅ None | No tracking SDKs; verify with grep -r analytics orchard/ |
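
To check ahead of time whether a model's weights are already on disk, so that no download is needed, one can look in the HuggingFace cache. The sketch below assumes Orchard uses the default huggingface_hub cache layout ($HF_HOME/hub/models--<org>--<name>), which the path in the table suggests but does not confirm. Both helper names are illustrative.

```python
import os
from pathlib import Path


def hf_cache_dir(repo_id: str) -> Path:
    """Directory where huggingface_hub's default layout would place a model repo."""
    hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    return hf_home / "hub" / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id: str) -> bool:
    """True if a cache directory for the repo exists (not a completeness check)."""
    return hf_cache_dir(repo_id).is_dir()


print(hf_cache_dir("google/gemma-4-E4B-it").name)  # models--google--gemma-4-E4B-it
```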

How it works

Orchard is the Python layer over a stack built for Apple Silicon:

  • PIE — C++ inference engine: prefix caching, continuous batching, multi-model scheduling
  • PAL — Metal GPU kernels
  • PSE — grammar-constrained generation for tool calls, structured output, and thinking
  • Pantheon — chat templates and capability manifests, shared across all Orchard SDKs

The Python package handles IPC, model resolution, HuggingFace downloads, prompt rendering, and the FastAPI server.
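
As an intuition for the prefix caching listed under PIE: when two requests share a leading run of tokens, such as the same system prompt, work done for that shared prefix can be reused and only the new suffix needs fresh computation. The toy below illustrates the idea only. It is not PIE's implementation, which is not described here, and the token ids are made up.

```python
class ToyPrefixCache:
    """Remembers token prefixes already processed; illustrative only."""

    def __init__(self) -> None:
        self._seen: set[tuple[int, ...]] = set()

    def reusable_prefix_len(self, tokens: list[int]) -> int:
        """Length of the longest prefix of `tokens` that was processed before."""
        for end in range(len(tokens), 0, -1):
            if tuple(tokens[:end]) in self._seen:
                return end
        return 0

    def record(self, tokens: list[int]) -> None:
        """Mark every prefix of `tokens` as processed."""
        for end in range(1, len(tokens) + 1):
            self._seen.add(tuple(tokens[:end]))


cache = ToyPrefixCache()
system_prompt = [1, 2, 3, 4]  # hypothetical token ids shared by both requests
cache.record(system_prompt + [10, 11])
hit = cache.reusable_prefix_len(system_prompt + [20, 21, 22])
print(hit)  # 4: only the 3 new tokens would need fresh computation
```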

CLI

orchard serve --model <hf-repo> [--host 127.0.0.1] [--port 8000]
orchard serve --models model-a model-b model-c    # preload multiple
orchard upgrade [stable|nightly]                  # update engine binary
orchard engine stop                               # kill background engine
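
When launching the server from a script, it helps to build the command line and the client base URL from the same host and port so the two cannot drift apart. A minimal sketch using only the flags shown above follows. The helper names are hypothetical, and combining --models with --host/--port is an assumption, since the usage lines show those flags only alongside --model.

```python
def serve_argv(models: list[str], host: str = "127.0.0.1", port: int = 8000) -> list[str]:
    """Build an `orchard serve` command line from the flags shown above."""
    if not models:
        raise ValueError("at least one model is required")
    flag = ["--model", models[0]] if len(models) == 1 else ["--models", *models]
    return ["orchard", "serve", *flag, "--host", host, "--port", str(port)]


def base_url(host: str = "127.0.0.1", port: int = 8000) -> str:
    """Client base URL matching the host and port passed to the server."""
    return f"http://{host}:{port}/v1"


print(serve_argv(["google/gemma-4-E4B-it"], port=9000))
print(base_url(port=9000))  # http://127.0.0.1:9000/v1
```

The argv list can be handed to subprocess.Popen, and the URL to the OpenAI client's base_url parameter.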

Requirements

  • macOS 14+, Apple Silicon (M1 or newer)
  • Python 3.12+
  • ~2 GB free disk for the engine binary, plus model weights

License

Apache-2.0
