Skip to main content

Python client for Orchard, a compute platform for Apple Silicon

Project description

Orchard

PyPI License macOS Apple Silicon

Standalone local inference for Apple Silicon, from Python.

orchard is the standalone Python package for Orchard. Install it into a Python environment, call the embedded client directly from scripts or services, or start the optional OpenAI-compatible HTTP server when another process needs to talk to local models. It wraps the Proxy Inference Engine, a local C++ and Metal runtime built for streaming, continuous batching, multiple loaded models, structured output, tool calls, and multimodal inputs.

macOS 14+ | Apple Silicon | Python 3.12+ | Apache-2.0

Official docs | Quickstart | Client | Streaming | Responses | Server | Batching | Multimodal | Structured Output | Tool Use | Models

Install

uv venv
source .venv/bin/activate
uv pip install orchard

If you are not using uv, install inside a virtual environment:

python -m venv .venv
source .venv/bin/activate
pip install orchard

The first request downloads the Orchard engine binary and the model weights you ask for. The engine binary is cached under ~/.orchard/; Hugging Face model files use the normal Hugging Face cache.

Quickstart

Use the Python client directly when you are writing a Python app, notebook, worker, or evaluation job. You do not need to start the HTTP server for this path.

Create hello_orchard.py:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.chat(
        MODEL,
        [{"role": "user", "content": "Write one sentence about local AI."}],
        temperature=0.0,
        max_generated_tokens=64,
    )
    print(response.text)

Run it:

python hello_orchard.py

For larger Macs, try google/gemma-4-E4B-it, meta-llama/Llama-3.1-8B-Instruct, or Qwen/Qwen3.5-4B.

Streaming

client.chat(..., stream=True) returns token deltas as the engine produces them.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    stream = client.chat(
        MODEL,
        [{"role": "user", "content": "Count from one to five."}],
        stream=True,
        temperature=0.0,
        max_generated_tokens=64,
    )

    for delta in stream:
        if delta.content:
            print(delta.content, end="", flush=True)
    print()

Responses API

Use responses() when you want OpenAI Responses-style output objects, text deltas, reasoning items, and function-call items.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="Explain why local inference is useful in two sentences.",
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)

For text-only streaming from Responses:

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    for chunk in client.responses_text(
        MODEL,
        input="Give me three concise debugging tips.",
        temperature=0.0,
        max_output_tokens=96,
    ):
        print(chunk, end="", flush=True)
    print()

Async

Every client path has an async form. Use achat(), aresponses(), and aresponses_text() inside async services.

import asyncio

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"


async def main() -> None:
    async with InferenceEngine() as engine:
        await engine.load_model(MODEL)
        client = engine.client()
        response = await client.achat(
            MODEL,
            [{"role": "user", "content": "Say hello from Orchard."}],
            temperature=0.0,
            max_generated_tokens=64,
        )
        print(response.text)


asyncio.run(main())

HTTP Server

Start the server only when another process, curl, or an OpenAI-compatible client needs to talk to Orchard over HTTP. The normal Python path is the client above.

orchard serve --model google/gemma-4-E2B-it

The default server listens on http://127.0.0.1:8000.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="google/gemma-4-E2B-it",
    messages=[{"role": "user", "content": "Hello from Orchard."}],
)
print(response.choices[0].message.content)

The server exposes:

Endpoint Use
POST /v1/chat/completions Chat Completions, streaming, batching, tools, structured output
POST /v1/responses Responses objects, event streams, reasoning, tool calls, multimodal input
POST /v1/completions Text completions
POST /v1/embeddings Embeddings for supported models
GET /v1/models Loaded model list
GET /health Server health

Batching

Pass a list of conversations to schedule prompts together. Orchard returns one response per prompt in order.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    responses = client.chat(
        MODEL,
        [
            [{"role": "user", "content": "Say hello politely."}],
            [{"role": "user", "content": "Give me a fun fact about space."}],
        ],
        temperature=0.0,
        max_generated_tokens=24,
    )

    for response in responses:
        print(response.text)

Sync, async, streaming, batching, and best-of-N are all supported. See orchard/clients/client.py.

For the HTTP API, send the same shape in messages:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-E2B-it",
    "messages": [
      [{"role": "user", "content": "Say hello politely."}],
      [{"role": "user", "content": "Give me a fun fact about space."}]
    ],
    "max_completion_tokens": 24,
    "temperature": 0.0
  }'

Multimodal

Use Responses-style content parts for images. Pass images as data URLs.

import base64
from pathlib import Path

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-3-4b-it"
IMAGE = Path("apple.jpg")

image_url = "data:image/jpeg;base64," + base64.b64encode(IMAGE.read_bytes()).decode()

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input=[
            {
                "type": "message",
                "role": "user",
                "content": [
                    {"type": "input_text", "text": "What is in this image?"},
                    {"type": "input_image", "image_url": image_url},
                ],
            }
        ],
        temperature=0.0,
        max_output_tokens=96,
    )
    print(response.output_text)

Structured Output

Use JSON Schema when the caller needs machine-readable output.

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

schema = {
    "type": "object",
    "properties": {
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["capital", "population"],
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the capital of France and its approximate population?",
        text={
            "format": {
                "type": "json_schema",
                "name": "city_info",
                "schema": schema,
                "strict": True,
            }
        },
        temperature=0.0,
        max_output_tokens=64,
    )
    print(response.output_text)

Tool Use

Tools use the Responses function schema. Non-streaming responses expose parsed function calls on response.tool_calls.

import json

from orchard.engine.inference_engine import InferenceEngine

MODEL = "google/gemma-4-E2B-it"

weather_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Get the current weather for a location.",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. San Francisco",
            }
        },
        "required": ["location"],
    },
}

with InferenceEngine(load_models=[MODEL]) as engine:
    client = engine.client()
    response = client.responses(
        MODEL,
        input="What is the weather in San Francisco?",
        tools=[weather_tool],
        tool_choice="required",
        temperature=0.0,
        max_output_tokens=128,
    )

    for call in response.tool_calls:
        print(call.name, json.loads(call.arguments))

Reasoning

Reasoning is model-dependent. For models with native thinking tokens, pass reasoning=True or an effort level.

response = client.responses(
    MODEL,
    input="Solve 23 * 47 and explain the steps briefly.",
    reasoning={"effort": "medium"},
    temperature=0.0,
    max_output_tokens=128,
)

Accepted effort values are minimal, low, medium, and high.

Production Use

Orchard is designed for production local services, not just one-off scripts. The same package covers notebooks, batch jobs, benchmark harnesses, and long-running agents that keep several models warm.

Capability Path
Multiple loaded models Start InferenceEngine(load_models=[...]) or orchard serve --model model-a model-b
Continuous batching Send batched prompts or concurrent requests through the same engine process
Streaming Use stream=True, responses_text(), or Server-Sent Events over HTTP
Structured output Use response_format for Chat Completions or text.format for Responses
Tool use Use tools, tool_choice, and max_tool_calls
Multimodal input Use Responses content parts with input_text and input_image

The engine process is shared by Orchard clients on the machine. Stop it when you want a clean shutdown:

orchard engine stop

Update the engine binary:

orchard upgrade stable

Models

Orchard resolves local paths and Hugging Face repos on demand. The currently tested families include Gemma, Qwen, Llama, and Moondream.

Mac Start with
8 GB unified memory google/gemma-4-E2B-it
16-32 GB unified memory google/gemma-4-E4B-it or Qwen/Qwen3.5-4B
32 GB+ unified memory meta-llama/Llama-3.1-8B-Instruct

Other model families need a profile in Pantheon, which supplies chat templates, control tokens, and capability metadata.

Requirements

  • macOS 14 or newer
  • Apple Silicon Mac
  • Python 3.12 or newer
  • Disk space for the engine binary and model weights

Privacy

Inference runs locally on your Mac. Orchard downloads the engine binary and the model weights you request; prompts and outputs are not sent to a cloud inference API by Orchard.

Development

uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"
pytest

For full engine/client verification inside the Proxy Company hyper-repo, run:

./scripts/pie_cycle.sh --py-only

Related

License

Apache-2.0

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

orchard-2026.5.11.tar.gz (130.7 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

orchard-2026.5.11-py3-none-any.whl (144.9 kB view details)

Uploaded Python 3

File details

Details for the file orchard-2026.5.11.tar.gz.

File metadata

  • Download URL: orchard-2026.5.11.tar.gz
  • Upload date:
  • Size: 130.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.11.tar.gz
Algorithm Hash digest
SHA256 0b1eae48618c3f1db2b4cfda0c95f40135ffa812b335fc695f65b41b09a0d701
MD5 910bc42bb7fd23f2d7b513b91f21718f
BLAKE2b-256 300d31627a56cc0823bfc49bd1916d71032898cfd6a4a4d8cb1a67ed0f99c4ea

See more details on using hashes here.

File details

Details for the file orchard-2026.5.11-py3-none-any.whl.

File metadata

  • Download URL: orchard-2026.5.11-py3-none-any.whl
  • Upload date:
  • Size: 144.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.12

File hashes

Hashes for orchard-2026.5.11-py3-none-any.whl
Algorithm Hash digest
SHA256 66fd436c220bf9196333ee9ae096ddd31aa3d451a382efed979d2f714d34dbb4
MD5 6a49c8b30b3f205de9853ded97d10637
BLAKE2b-256 0cc5630cbb447303fba261dc5d65642909f6079a9ac4ead324f6b95e31fad867

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page