# Inferential

Robotics-aware, multi-client inference orchestration on top of Ray Serve.

Inferential sits between your clients and your ML models. It receives observations over ZMQ, schedules inference requests using cadence-aware priority scoring, dispatches to Ray Serve, and streams results back — all with sub-millisecond transport overhead. Built for any scenario where multiple clients need concurrent access to shared models: robotics fleets, game agents, IoT devices, real-time ML pipelines.
## Features

- ZMQ transport — ROUTER/DEALER sockets with automatic reconnection and zero-copy tensor payloads
- Pluggable schedulers — Deadline-aware (default), batch-optimized, priority-tiered, round-robin
- Cadence learning — EMA-based tracking of each client's request pattern to predict urgency (a scoring sketch follows this list)
- Protobuf wire protocol — Typed tensor metadata (dtype, shape, encoding) with binary payload
- Queue management — Request TTL, drop-oldest overflow policy, dispatch retry
- In-memory metrics — Ring-buffer storage with label filtering and percentile stats (p50/p95/p99); a minimal sketch also follows below
- Lightweight client SDK — No Ray dependency; just `pyzmq`, `protobuf`, and `numpy`
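
To make the cadence and deadline-aware scheduling ideas concrete, here is a minimal sketch under stated assumptions: `CadenceTracker`, `priority_score`, and every parameter name below are illustrative, not Inferential's actual API.

```python
import time


class CadenceTracker:
    """Illustrative EMA over one client's inter-request interval (hypothetical, not the library's API)."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha        # EMA smoothing factor
        self.interval_ms = None   # learned cadence; None until two requests seen
        self.last_seen = None     # monotonic timestamp of the previous request

    def observe(self) -> None:
        now = time.monotonic()
        if self.last_seen is not None:
            sample_ms = (now - self.last_seen) * 1000.0
            if self.interval_ms is None:
                self.interval_ms = sample_ms
            else:
                # Blend the newest inter-arrival sample into the running estimate.
                self.interval_ms = self.alpha * sample_ms + (1 - self.alpha) * self.interval_ms
        self.last_seen = now


def priority_score(urgency: float, deadline_ms: float, waited_ms: float) -> float:
    """Deadline-aware score: high urgency and little remaining slack rank first."""
    slack_ms = max(deadline_ms - waited_ms, 1e-3)  # clamp to avoid division by zero
    return urgency / slack_ms
```

A scheduler can pop the highest-scoring request each dispatch cycle; the learned interval also lets it anticipate when a regular client's next observation is due.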
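
In the same spirit, here is a hedged sketch of ring-buffer percentile stats using only the standard library; `RingMetric` is a hypothetical name and the real storage may differ:

```python
from collections import deque


class RingMetric:
    """Illustrative fixed-capacity sample buffer with percentile queries (hypothetical)."""

    def __init__(self, capacity: int = 1024):
        self.samples = deque(maxlen=capacity)  # oldest samples evicted automatically

    def record(self, value: float) -> None:
        self.samples.append(value)

    def percentile(self, p: float) -> float:
        ordered = sorted(self.samples)  # O(n log n) per query; fine for small buffers
        idx = min(int(p / 100 * len(ordered)), len(ordered) - 1)
        return ordered[idx]


m = RingMetric(capacity=4)
for latency in (12.0, 15.5, 9.8, 30.2):
    m.record(latency)
print(m.percentile(95))  # p95 over the retained window -> 30.2
```
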
## Install

```bash
# Client SDK only
pip install inferential

# Server with Ray Serve
pip install inferential[server]

# Development
pip install inferential[dev]
```

## Quick Start

### Server

```python
import asyncio

from inferential import Server

server = Server(bind="tcp://*:5555", models=["policy-v2"])


@server.on_metric
def log(name, value, labels):
    if name == "inference_latency_ms":
        print(f"Client {labels.get('client')}: {value:.1f}ms")


asyncio.run(server.run())
```

### Client

```python
import numpy as np

from inferential import Connection

conn = Connection(server="tcp://localhost:5555", client_id="agent-01", client_type="sensor")
model = conn.model("policy-v2", latency_budget_ms=30.0)

state = np.random.randn(7).astype(np.float32)
model.observe(urgency=0.8, state=state)

result = model.get_result(timeout_ms=50)
if result is not None:
    actions = result["actions"]  # np.ndarray
```
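
Continuing from the `model` object above, a steady-cadence client such as a control loop can put the same two calls on a fixed tick. This is a usage sketch, not library code: the 30 Hz rate is assumed, and `read_sensors`/`apply_actions` are hypothetical stand-ins for your own I/O.

```python
import time

TICK_S = 1 / 30  # assumed 30 Hz control rate

while True:
    state = read_sensors()                    # hypothetical: returns a float32 np.ndarray
    model.observe(urgency=0.8, state=state)   # ship the observation to the server
    result = model.get_result(timeout_ms=25)  # bounded wait inside the tick budget
    if result is not None:
        apply_actions(result["actions"])      # hypothetical: drive your actuators
    time.sleep(TICK_S)
```

Keeping `timeout_ms` below the tick interval means a late result is skipped rather than stalling the loop.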

## Documentation

- Quick Start — Install, run server + client, get your first result
- Architecture — System design, wire protocol, schedulers, queue management, metrics, configuration
- Examples — Multi-client demos, metric callbacks, extending with custom models and schedulers
- Contributing — Commit conventions, branching, code style, pre-commit hooks, releases
## Development

```bash
# Generate protobuf code
make proto

# Run tests
make test

# Lint
make lint
```

## License

Apache-2.0