Python client for Orchard, a compute platform for Apple Silicon

Orchard

100% local, OpenAI-compatible LLM inference for Apple Silicon. Multi-model serving, prefix caching, continuous batching. No cloud APIs, no data leaves your machine.

macOS 14+ · Apple Silicon (M1+) · Python 3.12+ · Apache-2.0

Features

  • Drop-in OpenAI API — /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/models
  • Fast — C++ inference engine with prefix caching and continuous batching
  • Multi-model — load Qwen, Llama, and Gemma side-by-side; swap between them per request
  • Multimodal — vision, tool calling, thinking; native where the model was trained, grammar-constrained where it wasn't
  • OpenAI Responses API — streaming events for reasoning, tool calls, and messages
  • Use from anything — curl, Python (openai SDK), Rust (orchard-rs), or any OpenAI-compatible client

Getting started

pip install orchard
orchard serve --model google/gemma-4-E4B-it

First run downloads the PIE engine binary (~2 GB) and the model weights from HuggingFace. Subsequent runs start in seconds.

Then point anything at http://localhost:8000/v1:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Supported models

Orchard downloads open-weights models from HuggingFace on demand. Any model whose family has a profile in Pantheon works out of the box.

Model                               Size (BF16)   Modalities     Best for
google/gemma-4-E4B-it (default)     ~8 GB         text, vision   Multimodal, native thinking, tool calls
Qwen/Qwen3.5-4B                     ~9 GB         text           256k context, native thinking, tool calls
meta-llama/Llama-3.1-8B-Instruct    ~16 GB        text           General-purpose, trained tool calls
google/gemma-3-4b-it                ~8 GB         text, vision   Multimodal chat
google/gemma-4-E2B-it               ~5 GB         text, vision   Fits on 8 GB Macs
moondream/moondream3-preview        ~9 GB         vision         Pointing, detection, captioning

By hardware

Your Mac                     Recommended
M1 / M2 / M3 (8 GB)          google/gemma-4-E2B-it
M-Pro / Max (16–32 GB)       google/gemma-4-E4B-it, Qwen/Qwen3.5-4B
M-Max / Ultra (32+ GB)       meta-llama/Llama-3.1-8B-Instruct + a small model hot-loaded

Need quantized weights? Pass any mlx-community/... repo directly. Orchard resolves any HuggingFace repo whose architecture belongs to a supported family.
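The resolution rule amounts to a lookup: read the architecture a repo declares in its config and map it to a supported family. The sketch below is illustrative only — the family table and helper are hypothetical, not Orchard's actual internals.

```python
# Hypothetical mapping from declared architecture to a Pantheon family.
SUPPORTED_FAMILIES = {
    "GemmaForCausalLM": "gemma",
    "Qwen2ForCausalLM": "qwen",
    "LlamaForCausalLM": "llama",
}

def resolve_family(architectures: list[str]) -> str:
    """Return the profile family for a repo's declared architectures,
    or raise if none of them is supported."""
    for arch in architectures:
        if arch in SUPPORTED_FAMILIES:
            return SUPPORTED_FAMILIES[arch]
    raise ValueError(f"unsupported architectures: {architectures}")

# A quantized community repo resolves the same way, because its
# config still declares the original model architecture:
family = resolve_family(["LlamaForCausalLM"])
print(family)  # llama
```

This is why quantized mirrors of a supported model work without special-casing: resolution keys on architecture, not on the repo owner.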

OpenAI Responses API

Orchard implements the OpenAI Responses API — the successor to Chat Completions. You get structured streaming events for reasoning, tool calls, and messages, plus per-item state and lifecycle events.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.responses.create(
    model="google/gemma-4-E4B-it",
    input="Explain quantum tunneling in three sentences.",
    reasoning={"effort": "medium"},
    stream=True,
)
for event in stream:
    print(event)

Supports response.output_text.delta, response.reasoning.delta, response.function_call_arguments.delta, structured output via JSON Schema, and max_tool_calls for bounded tool loops.

Python SDK (no HTTP)

For embedded use — skip the HTTP layer:

from orchard.engine.inference_engine import InferenceEngine

async with InferenceEngine() as engine:
    client = engine.client()
    response = await client.achat(
        "google/gemma-4-E4B-it",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.text)

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.
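Best-of-N is "sample N completions, keep the best by some score". A self-contained sketch with a stand-in sampler and scorer — the real client API lives in orchard/clients/client.py, and nothing here mirrors its signatures:

```python
import random

def best_of_n(sample, score, n: int = 4, seed: int = 0):
    """Draw n candidates from `sample` and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)

# Stand-ins: a sampler producing random-length strings, and a scorer
# that prefers shorter outputs (a real scorer might use log-probs).
result = best_of_n(
    sample=lambda rng: "a" * rng.randint(1, 10),
    score=lambda text: -len(text),
)
```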

Privacy

Orchard runs entirely on your Mac. No telemetry, no analytics, no phone-home.

Surface          Status        What it does
Inference        ✅ Local      All generation on-device via PIE (C++, Metal)
Chat templates   ✅ Local      Rendered from Pantheon profiles bundled in the package
Model weights    ✅ One-time   HuggingFace Hub → ~/.cache/huggingface/
Engine binary    ✅ One-time   GitHub release → ~/.orchard/bin/
Telemetry        ✅ None       No tracking SDKs — verify with grep -r analytics orchard/

How it works

Orchard is the Python layer over a stack built for Apple Silicon:

  • PIE — C++ inference engine: prefix caching, continuous batching, multi-model scheduling
  • PAL — Metal GPU kernels
  • PSE — grammar-constrained generation for tool calls, structured output, and thinking
  • Pantheon — chat templates and capability manifests, shared across all Orchard SDKs

The Python package handles IPC, model resolution, HuggingFace downloads, prompt rendering, and the FastAPI server.
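Prefix caching means the engine computes KV state for a shared prompt prefix once and reuses it across requests. A toy sketch of the bookkeeping as a token-level trie — nothing here is PIE's actual data structure:

```python
class PrefixCache:
    """Toy token-prefix trie: counts how many leading tokens of a new
    prompt are already covered by previously seen prompts."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def cached_prefix_len(self, tokens):
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, hit = node[t], hit + 1
        return hit

cache = PrefixCache()
cache.insert(["You", "are", "a", "helpful", "assistant", ".", "Hi"])
# A second request sharing the system prompt reuses six cached tokens;
# only the tail after the divergence point needs fresh computation:
n = cache.cached_prefix_len(
    ["You", "are", "a", "helpful", "assistant", ".", "Hello"])
print(n)  # 6
```

In a real engine the trie nodes would point at cached KV blocks rather than child tokens, but the hit/miss logic is the same idea.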

CLI

orchard serve --model <hf-repo> [--host 127.0.0.1] [--port 8000]
orchard serve --models model-a model-b model-c    # preload multiple
orchard upgrade [stable|nightly]                  # update engine binary
orchard engine stop                               # kill background engine

Requirements

  • macOS 14+, Apple Silicon (M1 or newer)
  • Python 3.12+
  • ~2 GB free disk for the engine binary, plus model weights

License

Apache-2.0

Download files

Source distribution: orchard-2026.4.7.tar.gz (123.4 kB)
Built distribution: orchard-2026.4.7-py3-none-any.whl (137.2 kB)

