Python client for Orchard, a compute platform for Apple Silicon

Orchard

Standalone local inference for Apple Silicon, from Python.

orchard is the standalone Python package for Orchard. Install it into a Python environment, then either call the embedded client directly from scripts or services, or start the optional OpenAI-compatible HTTP server when another process needs to talk to local models. It wraps the Proxy Inference Engine, a local C++ and Metal runtime built for streaming, continuous batching, multiple loaded models, structured output, tool calls, and multimodal inputs.

macOS 14+ | Apple Silicon | Python 3.12+ | Apache-2.0

Install

uv venv
source .venv/bin/activate
uv pip install orchard

If you are not using uv, install inside a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install orchard

The first request downloads the Orchard engine binary and the model weights you ask for. The engine binary is cached under ~/.orchard/; Hugging Face model files use the normal Hugging Face cache.
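
To see how much disk those downloads occupy, a small stdlib-only sketch can walk both cache locations. The ~/.orchard/ path comes from the text above; the Hugging Face path assumes the library default (~/.cache/huggingface/hub, or the HF_HUB_CACHE / HF_HOME overrides). The helper names are illustrative and not part of orchard.

```python
import os
from pathlib import Path


def cache_dirs() -> dict[str, Path]:
    """Return the two cache locations described above (they may not exist yet)."""
    # Assumption: Hugging Face defaults, overridable via HF_HOME / HF_HUB_CACHE.
    hf_home = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface"))
    hf_hub = Path(os.environ.get("HF_HUB_CACHE", hf_home / "hub"))
    return {"engine": Path.home() / ".orchard", "models": hf_hub}


def dir_size_bytes(root: Path) -> int:
    """Sum regular-file sizes under root; a missing directory counts as 0."""
    if not root.exists():
        return 0
    # Symlinks are skipped so linked snapshot files are not counted twice.
    return sum(
        p.stat().st_size
        for p in root.rglob("*")
        if p.is_file() and not p.is_symlink()
    )


if __name__ == "__main__":
    for name, path in cache_dirs().items():
        print(f"{name}: {path} ({dir_size_bytes(path) / 1e6:.1f} MB)")
```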

Quickstart

Use the Python client directly when you are writing a Python app, notebook, worker, or evaluation job. You do not need to start the HTTP server for this path.

Create hello_orchard.py:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.chat(
        MODEL,
        [{"role": "user", "content": "Write one sentence about local AI."}],
        temperature=0.0,
        max_generated_tokens=64,
    )
    print(response.text)

Run it:

python hello_orchard.py

On Macs with more unified memory, try google/gemma-4-E4B-it, meta-llama/Llama-3.1-8B-Instruct, or Qwen/Qwen3.5-4B.

Streaming

With stream=True, client.chat(...) returns an iterable that yields token deltas as the engine produces them.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    stream = client.chat(
        MODEL,
        [{"role": "user", "content": "Count from one to five."}],
        stream=True,
        temperature=0.0,
        max_generated_tokens=64,
    )

    for delta in stream:
        if delta.content:
            print(delta.content, end="", flush=True)
    print()
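
When the caller also needs the full reply after streaming finishes, the deltas can be accumulated while they are printed. A minimal sketch, assuming only what the loop above shows (each delta has a content attribute that may be empty). FakeDelta is a hypothetical stand-in so the helper runs without the engine.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeDelta:
    """Hypothetical stand-in for a streamed delta; only .content is assumed."""
    content: Optional[str]


def collect_stream(stream: Iterable, echo: bool = True) -> str:
    """Optionally print deltas as they arrive and return the concatenated text."""
    parts: list[str] = []
    for delta in stream:
        if delta.content:  # skip empty or None deltas
            parts.append(delta.content)
            if echo:
                print(delta.content, end="", flush=True)
    if echo:
        print()
    return "".join(parts)


text = collect_stream(
    [FakeDelta("one, "), FakeDelta(None), FakeDelta("two")],
    echo=False,
)
```

With a live engine, pass the object returned by client.chat(..., stream=True) in place of the list of FakeDelta values.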

Responses API

Use responses() when you want OpenAI Responses-style output objects, text deltas, reasoning items, and function-call items.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="Explain why local inference is useful in two sentences.",
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)

For text-only streaming from Responses:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    for chunk in client.responses_text(
        MODEL,
        input="Give me three concise debugging tips.",
        temperature=0.0,
        max_output_tokens=96,
    ):
        print(chunk, end="", flush=True)
    print()

Async

Every client path has an async form. Use achat(), aresponses(), and aresponses_text() inside async services.

import asyncio

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"


async def main() -> None:
    async with InferenceEngine() as engine:
        await engine.load_model(MODEL)
        client = engine.client()
        response = await client.achat(
            MODEL,
            [{"role": "user", "content": "Say hello from Orchard."}],
            temperature=0.0,
            max_generated_tokens=64,
        )
        print(response.text)


asyncio.run(main())
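
Async clients pay off mainly when several requests are in flight at once. The sketch below shows one way to fan out prompts with asyncio.gather and a semaphore that caps concurrency. It is engine-agnostic: ask is any coroutine function that takes a prompt, and fake_ask is a hypothetical stub. With Orchard, ask would wrap a call such as client.achat(...) from the example above.

```python
import asyncio
from typing import Awaitable, Callable


async def fan_out(
    ask: Callable[[str], Awaitable[str]],
    prompts: list[str],
    max_in_flight: int = 4,
) -> list[str]:
    """Run ask() over prompts concurrently; results keep the prompt order."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(prompt: str) -> str:
        async with gate:  # at most max_in_flight calls run at once
            return await ask(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_ask(prompt: str) -> str:
    """Hypothetical stub standing in for a real async client call."""
    await asyncio.sleep(0)
    return prompt.upper()


results = asyncio.run(fan_out(fake_ask, ["hello", "orchard"]))
```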

HTTP Server

Start the server only when another process, curl, or an OpenAI-compatible client needs to talk to Orchard over HTTP. The normal Python path is the client above.

orchard serve --model google/gemma-4-E2B-it

By default, the server listens on http://127.0.0.1:8000.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": "Hello from Orchard."}],
)
print(response.choices[0].message.content)

The server exposes:

Endpoint	Use
`POST /v1/chat/completions`	Chat Completions, streaming, batching, tools, structured output
`POST /v1/responses`	Responses objects, event streams, reasoning, tool calls, multimodal input
`POST /v1/completions`	Text completions
`POST /v1/embeddings`	Embeddings for supported models
`GET /v1/models`	Loaded model list
`GET /health`	Server health
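
A script that launches the server itself usually has to wait until it is ready. A minimal stdlib sketch that polls GET /health from the table above. It assumes only that a healthy server answers with HTTP 200 (the response body is not inspected); the function names and timeouts are illustrative.

```python
import time
import urllib.error
import urllib.request


def health_url(base_url: str) -> str:
    """Build the /health URL from a base such as http://127.0.0.1:8000."""
    return base_url.rstrip("/") + "/health"


def wait_until_healthy(
    base_url: str, timeout_s: float = 60.0, poll_s: float = 0.5
) -> bool:
    """Poll GET /health until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(health_url(base_url), timeout=poll_s + 1.0) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short sleep
        time.sleep(poll_s)
    return False
```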

Batching

Pass a list of conversations to schedule prompts together. Orchard returns one response per prompt, in the same order as the input.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    responses = client.chat(
        MODEL,
        [
            [{"role": "user", "content": "Say hello politely."}],
            [{"role": "user", "content": "Give me a fun fact about space."}],
        ],
        temperature=0.0,
        max_generated_tokens=24,
    )

    for response in responses:
        print(response.text)

The client supports sync, async, streaming, batched, and best-of-N requests. See orchard/clients/client.py.

For the HTTP API, send the same shape in messages:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      [{"role": "user", "content": "Say hello politely."}],
      [{"role": "user", "content": "Give me a fun fact about space."}]
    ],
    "max_completion_tokens": 24,
    "temperature": 0.0
  }'
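
The same batched request can be built from Python without extra dependencies. This sketch mirrors the curl body above using urllib. The helper name is illustrative, and the request is only constructed here, not sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_batch_request(
    base_url: str,
    model: str,
    prompts: list[str],
    max_completion_tokens: int = 24,
    temperature: float = 0.0,
) -> urllib.request.Request:
    """Build a POST /v1/chat/completions request with one conversation per prompt."""
    body = {
        "model": model,
        # A list of conversations, matching the batched shape in the curl example.
        "messages": [[{"role": "user", "content": p}] for p in prompts],
        "max_completion_tokens": max_completion_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_batch_request(
    "http://localhost:8000",
    "google/gemma-4-E2B-it",
    ["Say hello politely.", "Give me a fun fact about space."],
)
```

Sending it is then a matter of urllib.request.urlopen(req). The shape of the batched response body is not described here, so parsing is left out.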

Multimodal

Use Responses-style content parts for images. Pass images as data URLs.

import base64
from pathlib import Path

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-3-4b-it"
IMAGE = Path("apple.jpg")

image_url = "data:image/jpeg;base64," + base64.b64encode(IMAGE.read_bytes()).decode()

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input=[
            {
                "type": "message",
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "What is in this image?"},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)
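
The example above hardcodes image/jpeg. For mixed inputs, a small helper can infer the MIME type from the file extension. A sketch using only the standard library; to_data_url is a hypothetical name, and whether a given model accepts formats other than JPEG is not stated here.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: Path) -> str:
    """Encode a local image file as a data URL, guessing the MIME type from its suffix."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The result can be passed as the image_url value of an input_image content part.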

Structured Output

Use JSON Schema when the caller needs machine-readable output.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

schema = {
    "type": "object",
    "properties": {
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["capital", "population"],
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the capital of France and its approximate population?",
        text={
            "format": {
                "type": "json_schema",
                "name": "city_info",
                "schema": schema,
                "strict": True,
            }
        },
        temperature=0.0,
        max_output_tokens=64,
    )
    print(response.output_text)
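
Because response.output_text is still a string, the caller typically parses it before use. A minimal sketch that loads the JSON and re-checks the two required fields from the schema above. The sample string is a made-up stand-in for model output, and parse_city_info is a hypothetical helper.

```python
import json


def parse_city_info(output_text: str) -> dict:
    """Parse schema-constrained output and verify the required fields and types."""
    data = json.loads(output_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("capital"), str):
        raise ValueError("'capital' must be a string")
    population = data.get("population")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(population, int) or isinstance(population, bool):
        raise ValueError("'population' must be an integer")
    return data


sample = '{"capital": "Paris", "population": 2100000}'  # illustrative, not real output
info = parse_city_info(sample)
```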

Tool Use

Tools use the Responses function schema. Non-streaming responses expose parsed function calls on response.tool_calls.

import json

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. San Francisco",
            }
        },
        "required": ["location"],
    },
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the weather in San Francisco?",
        tools=[weather_tool],
        tool_choice="required",
        temperature=0.0,
        max_output_tokens=128,
    )

    for call in response.tool_calls:
        print(call.name, json.loads(call.arguments))
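
Once calls are parsed, the application still has to run them. The sketch below dispatches each call to a local Python function by name. It assumes only the two attributes used above (call.name and a JSON string in call.arguments). get_weather is a hypothetical local implementation, and SimpleNamespace stands in for a real call object.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def get_weather(location: str) -> dict:
    """Hypothetical local tool; a real one would query a weather service."""
    return {"location": location, "forecast": "unknown"}


REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_calls(calls, registry=REGISTRY) -> list[dict]:
    """Execute each call via the registry and return name/result pairs in order."""
    results = []
    for call in calls:
        func = registry.get(call.name)
        if func is None:
            raise KeyError(f"model requested unknown tool: {call.name}")
        kwargs = json.loads(call.arguments)  # arguments arrive as a JSON string
        results.append({"name": call.name, "result": func(**kwargs)})
    return results


fake_call = SimpleNamespace(
    name="get_weather", arguments='{"location": "San Francisco"}'
)
outputs = run_tool_calls([fake_call])
```

Keeping an explicit registry means the model can only trigger functions the application has chosen to expose.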

Reasoning

Reasoning is model-dependent. For models with native thinking tokens, pass reasoning=True or an effort level.

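from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"  # swap in any model with native thinking tokens

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="Solve 23 * 47 and explain the steps briefly.",
        reasoning={"effort": "medium"},
        temperature=0.0,
        max_output_tokens=128,
    )
    print(response.output_text)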

Accepted effort values are minimal, low, medium, and high.
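
A typo in the effort string is easy to make, so callers may want to validate it before sending a request. A small sketch built from the four values listed above; reasoning_param is a hypothetical helper, not part of orchard.

```python
EFFORT_LEVELS = ("minimal", "low", "medium", "high")


def reasoning_param(effort: str) -> dict[str, str]:
    """Return a value for the reasoning argument, rejecting unknown effort levels."""
    normalized = effort.strip().lower()
    if normalized not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {"effort": normalized}
```

The returned dict can be passed as reasoning=reasoning_param("medium").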

Production Use

Orchard is designed for production local services, not just one-off scripts. The same package covers notebooks, batch jobs, benchmark harnesses, and long-running agents that keep several models warm.

Capability	Path
Multiple loaded models	Start `InferenceEngine(load_models=[...])` or `orchard serve --model model-a model-b`
Continuous batching	Send batched prompts or concurrent requests through the same engine process
Streaming	Use `stream=True`, `responses_text()`, or Server-Sent Events over HTTP
Structured output	Use `response_format` for Chat Completions or `text.format` for Responses
Tool use	Use `tools`, `tool_choice`, and `max_tool_calls`
Multimodal input	Use Responses content parts with `input_text` and `input_image`

The engine process is shared by Orchard clients on the machine. Stop it when you want a clean shutdown:

orchard engine stop

Update the engine binary:

orchard upgrade stable

Models

Orchard resolves local paths and Hugging Face repos on demand. The currently tested families include Gemma, Qwen, Llama, and Moondream.

Mac	Start with
8 GB unified memory	`google/gemma-4-E2B-it`
16-32 GB unified memory	`google/gemma-4-E4B-it` or `Qwen/Qwen3.5-4B`
32 GB+ unified memory	`meta-llama/Llama-3.1-8B-Instruct`
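
The table can be turned into a simple default picker. A sketch with two assumptions, both labeled in the code: the overlapping 32 GB boundary is resolved in favor of the larger model, and machines under 8 GB fall back to the smallest one. Physical memory is read through os.sysconf, which is expected to be available on macOS and Linux.

```python
import os


def pick_model(memory_gb: float) -> str:
    """Map unified memory (GB) to a starting model from the table above."""
    if memory_gb >= 32:  # assumption: exactly 32 GB counts as the "32 GB+" row
        return "meta-llama/Llama-3.1-8B-Instruct"
    if memory_gb >= 16:
        return "google/gemma-4-E4B-it"
    return "google/gemma-4-E2B-it"  # assumption: also used below 8 GB


def physical_memory_gb() -> float:
    """Total physical memory in GiB via POSIX sysconf."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 2**30


if __name__ == "__main__":
    print(pick_model(physical_memory_gb()))
```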

Other model families need a profile in Pantheon, which supplies chat templates, control tokens, and capability metadata.

Requirements

macOS 14 or newer
Apple Silicon Mac
Python 3.12 or newer
Disk space for the engine binary and model weights

Privacy

Inference runs locally on your Mac. Orchard downloads the engine binary and the model weights you request; prompts and outputs are not sent to a cloud inference API by Orchard.

Development

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
pytest

For full engine/client verification inside the Proxy Company hyper-repo, run:

./scripts/pie_cycle.sh --py-only

Related

Official Orchard docs
orchard-rs for Rust apps that embed Orchard
orchard-swift for Swift telemetry
Pantheon
Proxy Inference Engine

License

Apache-2.0

Project details

Release history

Version	Released
2026.5.11	May 21, 2026
2026.5.10	May 19, 2026
2026.5.9	May 17, 2026
2026.5.8 (this version)	May 17, 2026
2026.5.7	May 15, 2026
2026.5.6	May 10, 2026
2026.5.5	May 8, 2026
2026.5.4	May 5, 2026
2026.5.3	May 3, 2026
2026.5.2	May 2, 2026
2026.5.1	May 1, 2026
2026.4.9	Apr 29, 2026
2026.4.8	Apr 28, 2026
2026.4.7	Apr 28, 2026
2026.4.6	Apr 19, 2026
2026.4.5	Apr 19, 2026
2026.4.4	Apr 17, 2026
2026.4.3	Apr 14, 2026
2026.4.2	Apr 3, 2026
2026.3.12	Mar 29, 2026
2026.3.11	Mar 16, 2026
2026.3.10	Mar 15, 2026
2026.3.9	Mar 15, 2026
2026.3.8	Mar 15, 2026
2026.3.7	Mar 14, 2026
2026.3.6	Mar 14, 2026
2026.3.5	Mar 12, 2026
2026.3.3	Mar 11, 2026
2026.3.2	Mar 10, 2026
2026.3.1	Mar 10, 2026
2026.2.7	Feb 27, 2026
2026.2.6	Feb 27, 2026
2026.2.5	Feb 14, 2026
2026.2.4	Feb 13, 2026
2026.2.3	Feb 13, 2026
2026.2.2	Feb 10, 2026
2026.2.1	Feb 10, 2026
2026.1.3	Jan 27, 2026
2026.1.2	Jan 26, 2026

Download files

Source Distribution

orchard-2026.5.8.tar.gz (129.0 kB)

Uploaded May 17, 2026 Source

Built Distribution

orchard-2026.5.8-py3-none-any.whl (141.3 kB)

Uploaded May 17, 2026 Python 3

File details

Details for the file orchard-2026.5.8.tar.gz.

File metadata

Download URL: orchard-2026.5.8.tar.gz
Upload date: May 17, 2026
Size: 129.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.8.tar.gz
Algorithm	Hash digest
SHA256	`bc4627ef1f712d3afb5969b944d735b363e4a263e3adfe71f5b48616e6e30d36`
MD5	`08a15c61374a4abd22e0ac5ac41da940`
BLAKE2b-256	`3b98d09d584ab6ecb07e864c7c51fd1468e5c3cb6d7705372508c6bfc7e872d4`

File details

Details for the file orchard-2026.5.8-py3-none-any.whl.

File metadata

Download URL: orchard-2026.5.8-py3-none-any.whl
Upload date: May 17, 2026
Size: 141.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ebfe56261a0fb585eb49a433062510a1efee706d9ba03dd82cb4df602a4d4aa0`
MD5	`acf68d036a5818154837188e43bb8fa5`
BLAKE2b-256	`d14b6be7f289443e74e12beae2cdc692d81c5bb6b06811e06819d73845ee56f9`
