Python client for Orchard, a compute platform for Apple Silicon

Orchard

100% local, OpenAI-compatible LLM inference for Apple Silicon. Multi-model serving, prefix caching, continuous batching. No cloud APIs, no data leaves your machine.

macOS 14+ · Apple Silicon (M1+) · Python 3.12+ · Apache-2.0

Features

  • Drop-in OpenAI API — /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/models
  • Fast — C++ inference engine with prefix caching and continuous batching
  • Multi-model — load Qwen, Llama, and Gemma side-by-side; swap between them per request
  • Multimodal — vision, tool calling, thinking; native where the model was trained, grammar-constrained where it wasn't
  • OpenAI Responses API — streaming events for reasoning, tool calls, and messages
  • Use from anything — curl, Python (openai SDK), Rust (orchard-rs), or any OpenAI-compatible client

Getting started

pip install orchard
orchard serve --model google/gemma-4-E4B-it

First run downloads the PIE engine binary (~2 GB) and the model weights from HuggingFace. Subsequent runs start in seconds.

Then point anything at http://localhost:8000/v1:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Supported models

Orchard downloads open-weights models from HuggingFace on demand. Any model whose family has a profile in Pantheon works out of the box.

| Model | Size (BF16) | Modalities | Best for |
|---|---|---|---|
| google/gemma-4-E4B-it (default) | ~8 GB | text, vision | Multimodal, native thinking, tool calls |
| Qwen/Qwen3.5-4B | ~9 GB | text | 256k context, native thinking, tool calls |
| meta-llama/Llama-3.1-8B-Instruct | ~16 GB | text | General-purpose, trained tool calls |
| google/gemma-3-4b-it | ~8 GB | text, vision | Multimodal chat |
| google/gemma-4-E2B-it | ~5 GB | text, vision | Fits on 8 GB Macs |
| moondream/moondream3-preview | ~9 GB | vision | Pointing, detection, captioning |

By hardware

| Your Mac | Recommended |
|---|---|
| M1 / M2 / M3 (8 GB) | google/gemma-4-E2B-it |
| M-Pro / Max (16–32 GB) | google/gemma-4-E4B-it, Qwen/Qwen3.5-4B |
| M-Max / Ultra (32+ GB) | meta-llama/Llama-3.1-8B-Instruct + a small model hot-loaded |

Need quantized weights? Pass any mlx-community/... repo directly. Orchard resolves any HuggingFace repo whose architecture belongs to a supported family.
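The sizes in the table above follow directly from parameter count: BF16 stores two bytes per weight, so an N-billion-parameter model needs roughly 2N GB for weights alone, before KV-cache headroom. A quick sanity check (parameter counts here are approximate, for illustration only):

```python
def bf16_weight_gb(params_billion: float) -> float:
    """Approximate BF16 weight size: 2 bytes per parameter, in decimal GB."""
    return params_billion * 1e9 * 2 / 1e9

# meta-llama/Llama-3.1-8B-Instruct: ~8B params -> ~16 GB, matching the table
print(f"{bf16_weight_gb(8.0):.0f} GB")
```

4-bit quantized weights cut this by roughly 4x, which is why an mlx-community repo can fit a model the BF16 table rules out for your RAM.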

OpenAI Responses API

Orchard implements the OpenAI Responses API — the successor to Chat Completions. You get structured streaming events for reasoning, tool calls, and messages, plus per-item state and lifecycle events.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.responses.create(
    model="google/gemma-4-E4B-it",
    input="Explain quantum tunneling in three sentences.",
    reasoning={"effort": "medium"},
    stream=True,
)
for event in stream:
    print(event)

Supports response.output_text.delta, response.reasoning.delta, response.function_call_arguments.delta, structured output via JSON Schema, and max_tool_calls for bounded tool loops.
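The stream is easiest to consume by dispatching on each event's type tag. A minimal sketch of that pattern — the `Event` dataclass here is a stand-in for the SDK's typed event objects, so only the `type` and `delta` field names mirror the wire format:

```python
from dataclasses import dataclass

@dataclass
class Event:
    """Stand-in for an SDK streaming event: a type tag plus a text delta."""
    type: str
    delta: str = ""

def consume(stream):
    """Accumulate reasoning text and answer text from a Responses event stream."""
    reasoning, answer = [], []
    for event in stream:
        if event.type == "response.reasoning.delta":
            reasoning.append(event.delta)
        elif event.type == "response.output_text.delta":
            answer.append(event.delta)
        # lifecycle events (response.created, response.completed, ...) ignored here
    return "".join(reasoning), "".join(answer)

events = [
    Event("response.created"),
    Event("response.reasoning.delta", "Tunneling lets particles cross barriers. "),
    Event("response.output_text.delta", "Quantum tunneling occurs when..."),
    Event("response.completed"),
]
reasoning, answer = consume(events)
```

The same dispatch extends naturally to `response.function_call_arguments.delta` when a tool call starts streaming.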

Python SDK (no HTTP)

For embedded use — skip the HTTP layer:

import asyncio

from orchard.engine.inference_engine import InferenceEngine

async def main() -> None:
    async with InferenceEngine() as engine:
        client = engine.client()
        response = await client.achat(
            "google/gemma-4-E4B-it",
            messages=[{"role": "user", "content": "Hello!"}],
        )
        print(response.text)

asyncio.run(main())

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.
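Best-of-N reduces to fanning out concurrent requests against the same engine and keeping the best candidate. A sketch of that pattern with asyncio.gather — `achat` is stubbed here, since the fan-out logic is independent of the real engine call:

```python
import asyncio

async def achat(model: str, messages: list, seed: int) -> str:
    """Stub standing in for an async chat call; real code would hit the engine."""
    await asyncio.sleep(0)  # yield control, as a real network/IPC call would
    return f"candidate-{seed}"

async def best_of_n(model: str, messages: list, n: int, score) -> str:
    """Generate n candidates concurrently and keep the highest-scoring one."""
    candidates = await asyncio.gather(
        *(achat(model, messages, seed=i) for i in range(n))
    )
    return max(candidates, key=score)

result = asyncio.run(
    best_of_n(
        "google/gemma-4-E4B-it",
        [{"role": "user", "content": "Hello!"}],
        n=4,
        score=len,  # toy scorer: longest reply wins
    )
)
```

With prefix caching, the N candidates share the prompt's KV cache, so the marginal cost of each extra sample is only its generated tokens.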

Privacy

Orchard runs entirely on your Mac. No telemetry, no analytics, no phone-home.

| Surface | Status | What it does |
|---|---|---|
| Inference | ✅ Local | All generation on-device via PIE (C++, Metal) |
| Chat templates | ✅ Local | Rendered from Pantheon profiles bundled in the package |
| Model weights | ✅ One-time | HuggingFace Hub → ~/.cache/huggingface/ |
| Engine binary | ✅ One-time | GitHub release → ~/.orchard/bin/ |
| Telemetry | ✅ None | No tracking SDKs — verify with grep -r analytics orchard/ |

How it works

Orchard is the Python layer over a stack built for Apple Silicon:

  • PIE — C++ inference engine: prefix caching, continuous batching, multi-model scheduling
  • PAL — Metal GPU kernels
  • PSE — grammar-constrained generation for tool calls, structured output, and thinking
  • Pantheon — chat templates and capability manifests, shared across all Orchard SDKs

The Python package handles IPC, model resolution, HuggingFace downloads, prompt rendering, and the FastAPI server.
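Prefix caching is the key trick in the stack: when two requests share a token prefix (a common system prompt, say), the engine reuses the KV cache already computed for that prefix instead of re-running prefill. PIE does this in C++ over KV blocks; the toy Python sketch below only illustrates the longest-prefix-match idea:

```python
class PrefixCache:
    """Toy longest-prefix cache over token sequences (illustration only)."""

    def __init__(self):
        self._cache: dict[tuple, str] = {}  # token prefix -> cached-state id

    def insert(self, tokens: list[int], state: str) -> None:
        self._cache[tuple(tokens)] = state

    def lookup(self, tokens: list[int]):
        """Return (reused_len, state) for the longest cached prefix of tokens."""
        for end in range(len(tokens), 0, -1):
            state = self._cache.get(tuple(tokens[:end]))
            if state is not None:
                return end, state
        return 0, None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-after-3-tokens")   # e.g. a shared system prompt
reused, state = cache.lookup([1, 2, 3, 4, 5])  # new request with the same prefix
# only tokens[reused:] need fresh prefill
```

Continuous batching then interleaves the remaining prefill and decode work across all in-flight requests, which is why many concurrent clients can share one engine efficiently.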

CLI

orchard serve --model <hf-repo> [--host 127.0.0.1] [--port 8000]
orchard serve --models model-a model-b model-c    # preload multiple
orchard upgrade [stable|nightly]                  # update engine binary
orchard engine stop                               # kill background engine

Requirements

  • macOS 14+, Apple Silicon (M1 or newer)
  • Python 3.12+
  • ~2 GB free disk for the engine binary, plus model weights

License

Apache-2.0

Download files

Source Distribution

orchard-2026.5.5.tar.gz (125.9 kB)

Built Distribution

orchard-2026.5.5-py3-none-any.whl (139.7 kB)

File details

Details for the file orchard-2026.5.5.tar.gz.

File metadata

  • Download URL: orchard-2026.5.5.tar.gz
  • Size: 125.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.5.tar.gz:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 4932e42b33d80ac4570480216fb2ce75bccb63f7e36cab1e0ef2e489453fafcc |
| MD5 | bac5793ad5897763a4c17816bbf24e34 |
| BLAKE2b-256 | 334e60fd6b6ef79dd893e34a5b535d8de51c6b259df460551a4e6ef6f9ab3502 |

File details

Details for the file orchard-2026.5.5-py3-none-any.whl.

File metadata

  • Download URL: orchard-2026.5.5-py3-none-any.whl
  • Size: 139.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.5-py3-none-any.whl:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 9b717f4e7812cb084bc1202802a7c8294d1aa0be6214a01db45373b4cb02f130 |
| MD5 | 91b84206e0982f5629f2f5bf970c7e04 |
| BLAKE2b-256 | 6f3b4f7b322c617fbaa9cee3b8643da97b781f0383aff544600acf3d1757560c |
