Python client for Orchard, a compute platform for Apple Silicon

Orchard

Standalone local inference for Apple Silicon, from Python.

orchard is the standalone Python package for Orchard. Install it into a Python environment, call the embedded client directly from scripts or services, or start the optional OpenAI-compatible HTTP server when another process needs to talk to local models. It wraps the Proxy Inference Engine, a local C++ and Metal runtime built for streaming, continuous batching, multiple loaded models, structured output, tool calls, and multimodal inputs.

macOS 14+ | Apple Silicon | Python 3.12+ | Apache-2.0

Install

uv venv
source .venv/bin/activate
uv pip install orchard

If you are not using uv, install inside a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install orchard

The first request downloads the Orchard engine binary and the model weights you ask for. The engine binary is cached under ~/.orchard/; Hugging Face model files use the normal Hugging Face cache.
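To see how much disk the two caches described above are using, a small standard-library sketch like the following can help. The ~/.orchard/ location comes from this README; the Hugging Face lookup follows the HF_HUB_CACHE and HF_HOME environment variables and is simplified (it ignores XDG_CACHE_HOME). The helper names are illustrative and not part of Orchard.

```python
import os
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    total = 0
    for path in root.rglob("*"):
        # Skip symlinks so files in the Hugging Face blob store are not counted twice.
        if path.is_file() and not path.is_symlink():
            total += path.stat().st_size
    return total


def hf_hub_cache_dir() -> Path:
    """Resolve the Hugging Face hub cache (simplified: XDG_CACHE_HOME is ignored)."""
    if "HF_HUB_CACHE" in os.environ:
        return Path(os.environ["HF_HUB_CACHE"]).expanduser()
    hf_home = os.environ.get("HF_HOME", "~/.cache/huggingface")
    return Path(hf_home).expanduser() / "hub"


def cache_report() -> dict[str, int]:
    """Map each cache directory to its size in bytes."""
    dirs = [Path("~/.orchard").expanduser(), hf_hub_cache_dir()]
    return {str(d): dir_size_bytes(d) for d in dirs}
```

Calling cache_report() before and after the first request shows what that request downloaded.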

Quickstart

Use the Python client directly when you are writing a Python app, notebook, worker, or evaluation job. You do not need to start the HTTP server for this path.

Create hello_orchard.py:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.chat(
        MODEL,
        [{"role": "user", "content": "Write one sentence about local AI."}],
        temperature=0.0,
        max_generated_tokens=64,
    )
    print(response.text)

Run it:

python hello_orchard.py

On Macs with more unified memory, try google/gemma-4-E4B-it, meta-llama/Llama-3.1-8B-Instruct, or Qwen/Qwen3.5-4B.
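The quickstart sends a single user message. For a multi-turn exchange, one possible approach is to keep the message list yourself and append each reply before the next request. This sketch assumes the engine accepts "assistant" role messages in the same list, following the usual chat-message convention; the helper names are illustrative.

```python
Message = dict[str, str]


def next_request(history: list[Message], user_text: str) -> list[Message]:
    """Build the message list for the next request: prior turns plus the new user message."""
    return history + [{"role": "user", "content": user_text}]


def add_turn(history: list[Message], user_text: str, assistant_text: str) -> list[Message]:
    """Return a new history with one completed user/assistant exchange appended."""
    return history + [
        {"role": "user", "content": user_text},
        {"role": "assistant", "content": assistant_text},
    ]


# With Orchard, inside the `with InferenceEngine(...)` block from the quickstart:
#     history: list[Message] = []
#     question = "Write one sentence about local AI."
#     response = client.chat(MODEL, next_request(history, question), max_generated_tokens=64)
#     history = add_turn(history, question, response.text)
```

Both helpers return new lists rather than mutating the input, so a failed request leaves the stored history unchanged.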

Streaming

client.chat(..., stream=True) returns token deltas as the engine produces them.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    stream = client.chat(
        MODEL,
        [{"role": "user", "content": "Count from one to five."}],
        stream=True,
        temperature=0.0,
        max_generated_tokens=64,
    )

    for delta in stream:
        if delta.content:
            print(delta.content, end="", flush=True)
    print()
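If you also need the full text once the stream ends, the deltas can be accumulated while they are consumed. The sketch below relies only on the delta.content attribute used in the loop above; the stand-in deltas are fabricated for illustration and are not real engine output.

```python
from collections.abc import Iterable
from types import SimpleNamespace


def collect_text(stream: Iterable) -> str:
    """Join the non-empty `content` fields of streamed deltas into one string."""
    parts = []
    for delta in stream:
        content = getattr(delta, "content", None)
        if content:
            parts.append(content)
    return "".join(parts)


# Stand-in deltas for illustration; a real stream comes from client.chat(..., stream=True).
fake_stream = [
    SimpleNamespace(content="one, "),
    SimpleNamespace(content=None),
    SimpleNamespace(content="two"),
]
print(collect_text(fake_stream))  # prints "one, two"
```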

Responses API

Use responses() when you want OpenAI Responses-style output objects, text deltas, reasoning items, and function-call items.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="Explain why local inference is useful in two sentences.",
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)

For text-only streaming from Responses:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    for chunk in client.responses_text(
        MODEL,
        input="Give me three concise debugging tips.",
        temperature=0.0,
        max_output_tokens=96,
    ):
        print(chunk, end="", flush=True)
    print()

Async

Every client path has an async form. Use achat(), aresponses(), and aresponses_text() inside async services.

import asyncio

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"


async def main() -> None:
    async with InferenceEngine() as engine:
        await engine.load_model(MODEL)
        client = engine.client()
        response = await client.achat(
            MODEL,
            [{"role": "user", "content": "Say hello from Orchard."}],
            temperature=0.0,
            max_generated_tokens=64,
        )
        print(response.text)


asyncio.run(main())
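In an async service, several requests are often in flight at once. A minimal sketch of bounding that concurrency with a semaphore follows. It assumes concurrent achat() calls on one client are safe, which this README implies but does not state outright; run_limited and fake_ask are illustrative names.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def run_limited(
    prompts: list[str],
    ask: Callable[[str], Awaitable[str]],
    limit: int = 4,
) -> list[str]:
    """Run ask(prompt) for every prompt with at most `limit` calls in flight, preserving order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(prompt: str) -> str:
        async with semaphore:
            return await ask(prompt)

    return await asyncio.gather(*(guarded(p) for p in prompts))


async def fake_ask(prompt: str) -> str:
    # Stand-in for a real model call, used only to exercise the helper.
    await asyncio.sleep(0)
    return prompt.upper()


# With Orchard, `ask` could wrap achat():
#     async def ask(prompt: str) -> str:
#         response = await client.achat(
#             MODEL, [{"role": "user", "content": prompt}], max_generated_tokens=64
#         )
#         return response.text
```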

HTTP Server

Start the server only when another process, curl, or an OpenAI-compatible client needs to talk to Orchard over HTTP. The normal Python path is the client above.

orchard serve --model google/gemma-4-E2B-it

The default server listens on http://127.0.0.1:8000.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": "Hello from Orchard."}],
)
print(response.choices[0].message.content)

The server exposes:

Endpoint                    Use
POST /v1/chat/completions   Chat Completions, streaming, batching, tools, structured output
POST /v1/responses          Responses objects, event streams, reasoning, tool calls, multimodal input
POST /v1/completions        Text completions
POST /v1/embeddings         Embeddings for supported models
GET  /v1/models             Loaded model list
GET  /health                Server health
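Before pointing another process at the server, it can be useful to probe GET /health and GET /v1/models. The sketch below uses only the standard library. It treats any HTTP 200 from /health as healthy and assumes /v1/models returns the OpenAI-style shape {"data": [{"id": ...}]}; neither response body is specified in this README.

```python
import json
import urllib.error
import urllib.request


def server_is_up(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False on any connection or HTTP error."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def model_ids(payload: dict) -> list[str]:
    """Extract ids from an OpenAI-style model list (assumed response shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://127.0.0.1:8000", timeout: float = 2.0) -> list[str]:
    """GET /v1/models and return the ids of the loaded models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
        return model_ids(json.load(resp))
```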

Batching

Pass a list of conversations to schedule prompts together. Orchard returns one response per prompt in order.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    responses = client.chat(
        MODEL,
        [
            [{"role": "user", "content": "Say hello politely."}],
            [{"role": "user", "content": "Give me a fun fact about space."}],
        ],
        temperature=0.0,
        max_generated_tokens=24,
    )

    for response in responses:
        print(response.text)

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.

For the HTTP API, send the same shape in messages:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      [{"role": "user", "content": "Say hello politely."}],
      [{"role": "user", "content": "Give me a fun fact about space."}]
    ],
    "max_completion_tokens": 24,
    "temperature": 0.0
  }'
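Building the nested list by hand gets tedious for more than a few prompts. A small sketch that converts plain prompt strings into the batched shape used above, for either the Python client or the HTTP body, might look like this. The helper names are illustrative; the field names mirror the curl example.

```python
import json

Conversation = list[dict[str, str]]


def to_conversations(prompts: list[str]) -> list[Conversation]:
    """Wrap each prompt in its own single-message conversation."""
    return [[{"role": "user", "content": prompt}] for prompt in prompts]


def batch_payload(
    model: str,
    prompts: list[str],
    max_completion_tokens: int = 24,
    temperature: float = 0.0,
) -> dict:
    """Build a request body with the same fields as the curl example above."""
    return {
        "model": model,
        "messages": to_conversations(prompts),
        "max_completion_tokens": max_completion_tokens,
        "temperature": temperature,
    }


prompts = ["Say hello politely.", "Give me a fun fact about space."]
body = json.dumps(batch_payload("google/gemma-4-E2B-it", prompts))

# With the Python client, responses come back in prompt order, so zip() pairs them up:
#     responses = client.chat(MODEL, to_conversations(prompts), max_generated_tokens=24)
#     for prompt, response in zip(prompts, responses):
#         print(prompt, "->", response.text)
```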

Multimodal

Use Responses-style content parts for images. Pass images as data URLs.

import base64
from pathlib import Path

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-3-4b-it"
IMAGE = Path("apple.jpg")

image_url = "data:image/jpeg;base64," + base64.b64encode(IMAGE.read_bytes()).decode()

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input=[
            {
                "type": "message",
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "What is in this image?"},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)
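The example above hardcodes image/jpeg. For mixed inputs, a sketch that derives the MIME type from the file extension could look like the following. Which image formats the engine accepts is not stated here, so treat anything beyond JPEG as an assumption; image_data_url is an illustrative helper.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_url(path: Path) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {path}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# Usage in a content part:
#     {"type": "input_image", "image_url": image_data_url(Path("apple.jpg"))}
```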

Structured Output

Use JSON Schema when the caller needs machine-readable output.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

schema = {
    "type": "object",
    "properties": {
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["capital", "population"],
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the capital of France and its approximate population?",
        text={
            "format": {
                "type": "json_schema",
                "name": "city_info",
                "schema": schema,
                "strict": True,
            }
        },
        temperature=0.0,
        max_output_tokens=64,
    )
    print(response.output_text)
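response.output_text is still a string, so the caller decodes it. A minimal sketch of parsing and re-checking the two fields from the schema above follows. The sample string is hand-written for illustration, not captured model output, and the extra checks are a defensive choice rather than something Orchard requires.

```python
import json


def parse_city_info(text: str) -> dict:
    """Decode JSON text and verify it matches the capital/population schema."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    capital = data.get("capital")
    population = data.get("population")
    if not isinstance(capital, str):
        raise ValueError("capital must be a string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(population, int) or isinstance(population, bool):
        raise ValueError("population must be an integer")
    return {"capital": capital, "population": population}


sample = '{"capital": "Paris", "population": 2100000}'  # hand-written sample
info = parse_city_info(sample)
print(info["capital"], info["population"])

# With Orchard: info = parse_city_info(response.output_text)
```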

Tool Use

Tools use the Responses function schema. Non-streaming responses expose parsed function calls on response.tool_calls.

import json

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. San Francisco",
            }
        },
        "required": ["location"],
    },
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the weather in San Francisco?",
        tools=[weather_tool],
        tool_choice="required",
        temperature=0.0,
        max_output_tokens=128,
    )

    for call in response.tool_calls:
        print(call.name, json.loads(call.arguments))
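Once a call is parsed, the application still has to execute it. One way to route calls to local Python functions is a name-to-callable registry, sketched below. It relies only on the call.name and call.arguments attributes shown above; get_weather, TOOLS, and run_tool_call are hypothetical, and the fake call is fabricated for illustration.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable


def get_weather(location: str) -> dict[str, Any]:
    # Hypothetical local implementation; a real one would query a weather service.
    return {"location": location, "forecast": "unknown"}


TOOLS: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_tool_call(call: Any) -> Any:
    """Look up call.name in TOOLS and invoke it with the JSON-decoded call.arguments."""
    if call.name not in TOOLS:
        raise KeyError(f"unknown tool: {call.name}")
    return TOOLS[call.name](**json.loads(call.arguments))


# Stand-in for an item from response.tool_calls.
fake_call = SimpleNamespace(name="get_weather", arguments='{"location": "San Francisco"}')
result = run_tool_call(fake_call)
```

Rejecting unknown names up front keeps a model from invoking anything that was not registered explicitly.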

Reasoning

Reasoning is model-dependent. For models with native thinking tokens, pass reasoning=True or an effort level such as reasoning={"effort": "medium"}.

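from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"  # assumes this model emits native thinking tokens

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="Solve 23 * 47 and explain the steps briefly.",
        reasoning={"effort": "medium"},
        temperature=0.0,
        max_output_tokens=128,
    )
    print(response.output_text)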

Accepted effort values are minimal, low, medium, and high.
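When the effort level comes from configuration or user input, validating it against the four accepted values before building the request gives an earlier and clearer error. A small sketch, with reasoning_param as an illustrative helper name:

```python
EFFORT_LEVELS = ("minimal", "low", "medium", "high")


def reasoning_param(effort: str) -> dict[str, str]:
    """Return a reasoning argument for responses(), rejecting unknown effort levels."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {"effort": effort}


# With Orchard: client.responses(MODEL, input="...", reasoning=reasoning_param("medium"))
```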

Production Use

Orchard is designed for production local services, not just one-off scripts. The same package covers notebooks, batch jobs, benchmark harnesses, and long-running agents that keep several models warm.

Capability               Path
Multiple loaded models   Start InferenceEngine(load_models=[...]) or orchard serve --model model-a model-b
Continuous batching      Send batched prompts or concurrent requests through the same engine process
Streaming                Use stream=True, responses_text(), or Server-Sent Events over HTTP
Structured output        Use response_format for Chat Completions or text.format for Responses
Tool use                 Use tools, tool_choice, and max_tool_calls
Multimodal input         Use Responses content parts with input_text and input_image

The engine process is shared by Orchard clients on the machine. Stop it when you want a clean shutdown:

orchard engine stop

Update the engine binary:

orchard upgrade stable

Models

Orchard resolves local paths and Hugging Face repos on demand. The currently tested families include Gemma, Qwen, Llama, and Moondream.

Mac                       Start with
8 GB unified memory       google/gemma-4-E2B-it
16-32 GB unified memory   google/gemma-4-E4B-it or Qwen/Qwen3.5-4B
32 GB+ unified memory     meta-llama/Llama-3.1-8B-Instruct
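The table above can be turned into a starting default chosen at runtime. This sketch reads total memory with sysctl, so total_memory_gb() works on macOS only. Two thresholds are assumptions, since the table does not cover them: machines with exactly 32 GB are placed in the middle tier, and anything under 16 GB falls back to the smallest model.

```python
import subprocess


def total_memory_gb() -> float:
    """Physical memory in GiB, read from `sysctl -n hw.memsize` (macOS only)."""
    out = subprocess.run(
        ["sysctl", "-n", "hw.memsize"], capture_output=True, text=True, check=True
    )
    return int(out.stdout.strip()) / 2**30


def suggest_model(memory_gb: float) -> str:
    """Pick a starting model from the table above; boundary handling is an assumption."""
    if memory_gb > 32:
        return "meta-llama/Llama-3.1-8B-Instruct"
    if memory_gb >= 16:
        return "google/gemma-4-E4B-it"
    return "google/gemma-4-E2B-it"


# On a Mac: MODEL = suggest_model(total_memory_gb())
```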

Other model families need a profile in Pantheon, which supplies chat templates, control tokens, and capability metadata.

Requirements

  • macOS 14 or newer
  • Apple Silicon Mac
  • Python 3.12 or newer
  • Disk space for the engine binary and model weights

Privacy

Inference runs locally on your Mac. Orchard downloads the engine binary and the model weights you request; prompts and outputs are not sent to a cloud inference API by Orchard.

Development

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
pytest

For full engine/client verification inside the Proxy Company hyper-repo, run:

./scripts/pie_cycle.sh --py-only

License

Apache-2.0
