Python client for Orchard, a compute platform for Apple Silicon

Orchard

100% local, OpenAI-compatible LLM inference for Apple Silicon. Multi-model serving, prefix caching, continuous batching. No cloud APIs, no data leaves your machine.

macOS 14+ · Apple Silicon (M1+) · Python 3.12+ · Apache-2.0

Features

  • Drop-in OpenAI API — /v1/chat/completions, /v1/responses, /v1/embeddings, /v1/models
  • Fast — C++ inference engine with prefix caching and continuous batching
  • Multi-model — load Qwen, Llama, and Gemma side-by-side; swap between them per request
  • Multimodal — vision, tool calling, thinking; native where the model was trained, grammar-constrained where it wasn't
  • OpenAI Responses API — streaming events for reasoning, tool calls, and messages
  • Use from anything — curl, Python (openai SDK), Rust (orchard-rs), or any OpenAI-compatible client

Getting started

pip install orchard
orchard serve --model google/gemma-4-E4B-it

First run downloads the PIE engine binary (~2 GB) and the model weights from HuggingFace. Subsequent runs start in seconds.

Then point anything at http://localhost:8000/v1:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Or with the OpenAI Python SDK:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E4B-it",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
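
Streaming works the same way as against the hosted API: pass stream=True and assemble the per-chunk deltas. A minimal sketch of the assembly step, assuming the standard OpenAI chat-completion chunk shape (choices[0].delta.content) — the chunks below are stand-ins, not live server output:

```python
from types import SimpleNamespace

def collect_text(chunks):
    """Concatenate the text deltas from a chat-completion stream."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:  # some chunks carry only role/finish info, no text
            parts.append(delta.content)
    return "".join(parts)

# Stand-in chunks with the same shape the SDK yields when stream=True.
fake_stream = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=c))])
    for c in ("Hel", "lo", "!")
]
print(collect_text(fake_stream))  # -> Hello!
```

With a real server, `fake_stream` is replaced by the iterator returned from `client.chat.completions.create(..., stream=True)`.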

Supported models

Orchard downloads open-weights models from HuggingFace on demand. Any model whose family has a profile in Pantheon works out of the box.

| Model | Size (BF16) | Modalities | Best for |
|---|---|---|---|
| google/gemma-4-E4B-it (default) | ~8 GB | text, vision | Multimodal, native thinking, tool calls |
| Qwen/Qwen3.5-4B | ~9 GB | text | 256k context, native thinking, tool calls |
| meta-llama/Llama-3.1-8B-Instruct | ~16 GB | text | General-purpose, trained tool calls |
| google/gemma-3-4b-it | ~8 GB | text, vision | Multimodal chat |
| google/gemma-4-E2B-it | ~5 GB | text, vision | Fits on 8 GB Macs |
| moondream/moondream3-preview | ~9 GB | vision | Pointing, detection, captioning |

By hardware

| Your Mac | Recommended |
|---|---|
| M1 / M2 / M3 (8 GB) | google/gemma-4-E2B-it |
| M-Pro / Max (16–32 GB) | google/gemma-4-E4B-it, Qwen/Qwen3.5-4B |
| M-Max / Ultra (32+ GB) | meta-llama/Llama-3.1-8B-Instruct + a small model hot-loaded |

Need quantized weights? Pass any mlx-community/... repo directly. Orchard resolves any HuggingFace repo whose architecture belongs to a supported family.

OpenAI Responses API

Orchard implements the OpenAI Responses API — the successor to Chat Completions. You get structured streaming events for reasoning, tool calls, and messages, plus per-item state and lifecycle events.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
stream = client.responses.create(
    model="google/gemma-4-E4B-it",
    input="Explain quantum tunneling in three sentences.",
    reasoning={"effort": "medium"},
    stream=True,
)
for event in stream:
    print(event)

Supports response.output_text.delta, response.reasoning.delta, response.function_call_arguments.delta, structured output via JSON Schema, and max_tool_calls for bounded tool loops.
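
Because every streamed event carries a type field, a small dispatcher can separate reasoning from the final answer. A hedged sketch using the event type strings listed above — the event objects here are stand-ins, not the SDK's classes:

```python
from types import SimpleNamespace

def split_stream(events):
    """Route streaming events into reasoning and answer buffers by type."""
    reasoning, answer = [], []
    for event in events:
        if event.type == "response.reasoning.delta":
            reasoning.append(event.delta)
        elif event.type == "response.output_text.delta":
            answer.append(event.delta)
        # lifecycle events (response.created, response.completed, ...) ignored
    return "".join(reasoning), "".join(answer)

# Stand-in events mimicking a short streamed response.
events = [
    SimpleNamespace(type="response.reasoning.delta", delta="Consider the barrier..."),
    SimpleNamespace(type="response.output_text.delta", delta="Quantum tunneling lets "),
    SimpleNamespace(type="response.output_text.delta", delta="particles cross barriers."),
]
reasoning, answer = split_stream(events)
print(answer)  # -> Quantum tunneling lets particles cross barriers.
```

With a live server, the same function consumes the iterator returned by `client.responses.create(..., stream=True)`.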

Python SDK (no HTTP)

For embedded use — skip the HTTP layer:

from orchard.engine.inference_engine import InferenceEngine

async with InferenceEngine() as engine:
    client = engine.client()
    response = await client.achat(
        "google/gemma-4-E4B-it",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.text)

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.

Privacy

Orchard runs entirely on your Mac. No telemetry, no analytics, no phone-home.

| Surface | Status | What it does |
|---|---|---|
| Inference | ✅ Local | All generation on-device via PIE (C++, Metal) |
| Chat templates | ✅ Local | Rendered from Pantheon profiles bundled in the package |
| Model weights | ✅ One-time | HuggingFace Hub → ~/.cache/huggingface/ |
| Engine binary | ✅ One-time | GitHub release → ~/.orchard/bin/ |
| Telemetry | ✅ None | No tracking SDKs — verify with grep -r analytics orchard/ |

How it works

Orchard is the Python layer over a stack built for Apple Silicon:

  • PIE — C++ inference engine: prefix caching, continuous batching, multi-model scheduling
  • PAL — Metal GPU kernels
  • PSE — grammar-constrained generation for tool calls, structured output, and thinking
  • Pantheon — chat templates and capability manifests, shared across all Orchard SDKs

The Python package handles IPC, model resolution, HuggingFace downloads, prompt rendering, and the FastAPI server.

CLI

orchard serve --model <hf-repo> [--host 127.0.0.1] [--port 8000]
orchard serve --models model-a model-b model-c    # preload multiple
orchard upgrade [stable|nightly]                  # update engine binary
orchard engine stop                               # kill background engine

Requirements

  • macOS 14+, Apple Silicon (M1 or newer)
  • Python 3.12+
  • ~2 GB free disk for the engine binary, plus model weights

License

Apache-2.0
