SLA/QoS-aware reverse proxy for ML inference workloads (batching, routing, latency metrics).
mlproxy-py
mlproxy-py is a minimal ML inference reverse proxy with QoS-aware routing.
It is designed for LLM and ML inference workloads where routing decisions should be based on latency, SLA targets, backend health, queue depth, and batching potential.
Features
- Reverse proxy for JSON inference requests
- Backends grouped into model pools
- SLA-aware routing (picks the backend with the lowest score, where score = latency + active_req*5)
- Optional micro-batching (collect requests for N ms)
- Concurrent health checks with connection pooling
- Prometheus metrics (request count, latency, backend latency)
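The micro-batching feature listed above (collect requests for N ms) can be pictured with a small asyncio sketch. This is an illustration only, not the package's code: the function name `collect_batch` and the parameters `window_ms` and `max_size` are hypothetical.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, window_ms: float, max_size: int = 32) -> list:
    """Wait for one request, then keep collecting until the window closes or the batch is full.

    Hypothetical helper: the name and parameters are assumptions for this sketch.
    """
    batch = [await queue.get()]  # block until the first request arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window_ms / 1000.0
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break  # the batching window has closed
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # no further request arrived inside the window
    return batch


async def demo() -> list:
    queue: asyncio.Queue = asyncio.Queue()
    for payload in ({"text": "a"}, {"text": "b"}, {"text": "c"}):
        queue.put_nowait(payload)
    return await collect_batch(queue, window_ms=10)


batch = asyncio.run(demo())
print(len(batch))
```

A longer window tends to produce larger batches at the cost of added latency for the first request in each batch, so the window would normally be chosen against the SLA target.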
Quickstart
Install
pip install mlproxy-py
Run proxy
mlproxy run -c examples/config.yml
Send request
curl -X POST http://localhost:7000/infer/modelA \
-H "Content-Type: application/json" \
-d '{"text":"hello"}'
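The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_infer_request` and `send_infer_request` are hypothetical, and the shape of the response body is not documented here, so it is returned as decoded JSON without further assumptions.

```python
import json
import urllib.request


def build_infer_request(base_url: str, model: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /infer/{model} (hypothetical helper for this sketch)."""
    return urllib.request.Request(
        url=f"{base_url}/infer/{model}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_infer_request(req: urllib.request.Request, timeout_s: float = 10.0):
    """Send the request and decode the JSON body; requires a running proxy."""
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Same host, port, model name, and body as the curl example.
req = build_infer_request("http://localhost:7000", "modelA", {"text": "hello"})
print(req.get_method(), req.full_url)
```

With the proxy running, `send_infer_request(req)` performs the call.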
Architecture
Client ──POST /infer/{model}──► FastAPI
                                   │
                         ┌─────────▼──────────┐
                         │    ModelRouter     │
                         │  choose_backend()  │
                         │  (score = latency  │
                         │   + active_req*5)  │
                         └─────────┬──────────┘
                                   │ backend URL
                         ┌─────────▼──────────┐
                         │   forward_json()   │
                         │ (httpx conn pool)  │
                         └─────────┬──────────┘
                                   ▼
                           Backend ML server
  ┌──────────────────┐   ┌──────────────────┐
  │    BatchQueue    │   │   Healthcheck    │
  │  (optional per   │   │   (concurrent,   │
  │   model pool)    │   │   per-backend)   │
  └──────────────────┘   └──────────────────┘
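The scoring rule in the diagram (score = latency + active_req*5) can be sketched as follows. Only the formula comes from the diagram. The `Backend` dataclass, its field names, the millisecond unit, and the filtering of unhealthy backends are assumptions made for illustration, not the package's actual types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Backend:
    # Hypothetical record; field names and units are assumptions.
    url: str
    latency_ms: float
    active_requests: int = 0
    healthy: bool = True


def score(backend: Backend) -> float:
    """Lower is better: observed latency plus a penalty of 5 per in-flight request."""
    return backend.latency_ms + backend.active_requests * 5


def choose_backend(pool: List[Backend]) -> Optional[Backend]:
    """Return the healthy backend with the lowest score, or None if none is healthy."""
    candidates = [b for b in pool if b.healthy]
    return min(candidates, key=score) if candidates else None


pool = [
    Backend("http://10.0.0.1:8000", latency_ms=20.0, active_requests=4),  # score 40
    Backend("http://10.0.0.2:8000", latency_ms=30.0, active_requests=1),  # score 35
    Backend("http://10.0.0.3:8000", latency_ms=5.0, healthy=False),       # skipped
]
chosen = choose_backend(pool)
print(chosen.url if chosen else None)
```

The per-request penalty means a fast but busy backend can lose to a slower idle one, which spreads load instead of piling requests onto a single low-latency server.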
Config
See examples/config.yml.
Changelog
0.1.1
- Lifespan pattern: Migrated from deprecated `@app.on_event("startup")` to FastAPI `lifespan` context manager.
- Graceful shutdown: Batch workers and healthcheck loop are properly cancelled on shutdown.
- Connection pooling: Shared `httpx.AsyncClient` singletons for proxy and healthcheck (was creating a client per request/check).
- Concurrent health checks: Backends checked in parallel via `asyncio.gather` (was sequential).
- Logging: Added structured `logging` throughout; `--log-level` CLI option.
- Bare except fixes: All `except Exception` blocks re-raise `asyncio.CancelledError`.
- Deprecated API fixes: Replaced `asyncio.get_event_loop()` with `asyncio.get_running_loop()` in batching module.
- Build system: Migrated from `setuptools` to `hatchling`. Added classifiers, keywords, optional dev/test deps, ruff/pytest config.
- Tests: Expanded from 1 test to 15+ tests covering config, router, batching, proxy, healthcheck, and backends.
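Two of the 0.1.1 changes, parallel health checks via asyncio.gather and re-raising asyncio.CancelledError, can be illustrated together. This is a sketch under assumptions, not the project's implementation: `check_one`, `check_all`, and `fake_probe` are hypothetical names, and the probe is a stand-in for a real HTTP call so the example runs without a network.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

Probe = Callable[[str], Awaitable[bool]]


async def check_one(url: str, probe: Probe, timeout_s: float = 2.0) -> bool:
    """Probe one backend; any failure or timeout marks it unhealthy."""
    try:
        return await asyncio.wait_for(probe(url), timeout=timeout_s)
    except asyncio.CancelledError:
        raise  # never swallow cancellation, so shutdown can stop the check loop
    except Exception:
        return False  # timeouts and probe errors both count as unhealthy


async def check_all(urls: List[str], probe: Probe) -> Dict[str, bool]:
    """Run all probes concurrently instead of one after another."""
    results = await asyncio.gather(*(check_one(u, probe) for u in urls))
    return dict(zip(urls, results))


async def fake_probe(url: str) -> bool:
    # Stand-in for an HTTP health request (assumption for this sketch).
    await asyncio.sleep(0.01)
    if "down" in url:
        raise ConnectionError("unreachable")
    return True


status = asyncio.run(check_all(["http://a:8000", "http://down:8000"], fake_probe))
print(status)
```

Because the probes run concurrently, one slow or unreachable backend delays the whole round by at most its own timeout, not by the sum of all timeouts.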
0.1.0
- Initial release: JSON inference proxy, model pools, SLA-aware routing, micro-batching, health checks, Prometheus metrics.
License
MIT