Forge
A lightweight ML inference server with dynamic batching, model hot-swapping, a multi-model registry and Prometheus metrics -- built from scratch in Python.
This is not model.predict() behind Flask. Forge implements the core mechanics of what TorchServe and Triton do, from first principles.
Features
| Feature | Description |
|---|---|
| Dynamic batching | Accumulates concurrent requests over a configurable time window then stacks them into a single tensor operation. Typically 4-8x throughput vs. sequential. |
| Backpressure queue | asyncio.Queue with a hard depth cap. Returns HTTP 503 immediately when saturated rather than blocking the event loop. |
| Model hot-swapping | Load a new checkpoint with zero downtime. In-flight requests finish on the old model; new requests use the new one. |
| Model registry | Serve multiple models simultaneously at independent endpoints, each with its own queue and batching config. |
| Prometheus metrics | P50/P95/P99 latency histograms, batch size distribution, queue depth, timeout counters and swap duration. |
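The backpressure row above can be illustrated with a bounded `asyncio.Queue`: `put_nowait` raises `asyncio.QueueFull` once the cap is reached, and a request handler can translate that into an HTTP 503. This is a minimal sketch, not Forge's code; `BoundedRequestQueue` and `QueueSaturated` are hypothetical names chosen for the example.

```python
import asyncio


class QueueSaturated(Exception):
    """Hypothetical error that a handler would map to HTTP 503."""


class BoundedRequestQueue:
    """Illustrative bounded queue; not Forge's actual RequestQueue."""

    def __init__(self, max_depth: int) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_depth)

    def submit(self, payload) -> asyncio.Future:
        # Each request gets a Future that a batcher would resolve later.
        future = asyncio.get_running_loop().create_future()
        try:
            # put_nowait never waits: it raises QueueFull at the depth cap.
            self._queue.put_nowait((payload, future))
        except asyncio.QueueFull:
            raise QueueSaturated("queue is full") from None
        return future

    def depth(self) -> int:
        return self._queue.qsize()


async def demo() -> tuple:
    queue = BoundedRequestQueue(max_depth=2)
    queue.submit([1.0])
    queue.submit([2.0])
    try:
        queue.submit([3.0])  # third request exceeds the cap of 2
        rejected = False
    except QueueSaturated:
        rejected = True
    return queue.depth(), rejected


print(asyncio.run(demo()))
```

Rejecting at submit time keeps the event loop free: the caller learns about saturation immediately instead of waiting on a queue that cannot drain fast enough.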
Architecture
             HTTP Request
                   |
                   v
+-------------------------------------+
|  FastAPI  POST /v1/{model}/predict  |
+------------------+------------------+
                   |  asyncio.Future
                   v
+-------------------------------------+
|   RequestQueue (backpressure cap)   |
+------------------+------------------+
                   |  blocking get / nowait drain
                   v
+-------------------------------------+
|  BatchScheduler                     |
|  +-------------------------------+  |
|  |  Collect for batch_window_ms  |  |
|  |  OR until max_batch_size      |  |  <- whichever fires first
|  +--------------+----------------+  |
|                 |  run_in_executor  |
|  +--------------v----------------+  |
|  |  torch.no_grad() forward pass |  |
|  +--------------+----------------+  |
|                 |  scatter results  |
+-----------------|-------------------+
                  |
                  v
      Future.set_result(tensor)
                  |
                  v
       HTTP Response to caller
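The scheduler stage of the diagram can be sketched as a collect, execute, scatter loop. The version below is an assumption-laden illustration rather than Forge's `BatchScheduler`: it uses timed waits instead of a nowait drain, plain Python lists instead of tensors, and a hypothetical `double_all` function in place of the `torch.no_grad()` forward pass.

```python
import asyncio


async def collect_batch(queue: asyncio.Queue, window_ms: float, max_size: int) -> list:
    """Wait for one item, then gather more until the window closes or the batch fills."""
    batch = [await queue.get()]  # blocking get: idle until work arrives
    loop = asyncio.get_running_loop()
    deadline = loop.time() + window_ms / 1000.0
    while len(batch) < max_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window expired before another request arrived
    return batch


def double_all(inputs: list) -> list:
    """Hypothetical stand-in for a batched forward pass: doubles every value."""
    return [[2.0 * x for x in row] for row in inputs]


async def run_one_batch(queue: asyncio.Queue, window_ms: float = 50.0, max_size: int = 32) -> int:
    batch = await collect_batch(queue, window_ms, max_size)
    inputs = [payload for payload, _ in batch]
    loop = asyncio.get_running_loop()
    # Run the blocking model call in a worker thread, off the event loop.
    outputs = await loop.run_in_executor(None, double_all, inputs)
    # Scatter: each caller's Future receives its own output row.
    for (_, future), output in zip(batch, outputs):
        if not future.done():
            future.set_result(output)
    return len(batch)


async def demo(max_size: int = 8) -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    futures = []
    for i in range(3):
        future = loop.create_future()
        queue.put_nowait(([float(i)], future))
        futures.append(future)
    size = await run_one_batch(queue, window_ms=10.0, max_size=max_size)
    return size, [f.result() for f in futures if f.done()]


print(asyncio.run(demo()))
```

With three queued requests and `max_size=8`, the window timer ends the batch; with a smaller `max_size`, the size cap ends it first, which is the "whichever fires first" rule in the diagram.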
Quickstart
Install
git clone https://github.com/verz0/Forge.git
cd Forge
pip install -e ".[dev]"
Serve a dummy model (no GPU needed)
forge serve examples/serve_dummy.py
curl -X POST http://localhost:8000/v1/dummy/predict \
-H "Content-Type: application/json" \
-d '{"input": [1.0, 2.0, 3.0]}'
# {"model":"dummy","output":[2.0,4.0,6.0],"latency_ms":1.2}
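The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names `build_predict_request` and `predict` are hypothetical, and the payload shape is taken from the example rather than from a published schema.

```python
import json
import urllib.request


def build_predict_request(base_url: str, model: str, values: list) -> urllib.request.Request:
    """Build the POST request that the curl example sends."""
    body = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/{model}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(base_url: str, model: str, values: list, timeout_s: float = 30.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(base_url, model, values)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage, with `forge serve examples/serve_dummy.py` running in another terminal:
#   predict("http://localhost:8000", "dummy", [1.0, 2.0, 3.0])
```

`urlopen` raises `urllib.error.HTTPError` for non-2xx responses, so a saturated queue would surface in the client as an exception with `code == 503`.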
Serve ResNet-50
pip install torchvision
forge serve examples/serve_resnet.py
Check metrics (Prometheus format)
curl http://localhost:8000/metrics
Hot-swap a model
curl -X POST http://localhost:8000/v1/resnet50/reload \
-H "Content-Type: application/json" \
-d '{"model_path": "/path/to/new_resnet.pt"}'
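One common way to get this behaviour is to load the new checkpoint fully, then replace a single shared reference, so that work already dispatched keeps the model it captured. The sketch below illustrates that pattern under stated assumptions; `SwappableModel` is a hypothetical class, not Forge's `ModelWorker`, and plain functions stand in for loaded checkpoints.

```python
import threading
from typing import Callable

Model = Callable[[list], list]


class SwappableModel:
    """Illustrative holder for the active model reference."""

    def __init__(self, model: Model) -> None:
        self._model = model
        self._lock = threading.Lock()

    def current(self) -> Model:
        # A batch captures this reference once and uses it until it finishes.
        with self._lock:
            return self._model

    def swap(self, loader: Callable[[], Model]) -> None:
        # Load outside the lock so serving continues during the slow part.
        new_model = loader()
        with self._lock:
            self._model = new_model


def old_model(batch: list) -> list:
    return [x * 2 for x in batch]


def new_model(batch: list) -> list:
    return [x * 3 for x in batch]


holder = SwappableModel(old_model)
in_flight = holder.current()    # a batch that started before the swap
holder.swap(lambda: new_model)  # stands in for loading a checkpoint from disk
print(in_flight([1, 2]), holder.current()([1, 2]))
```

The batch that captured the old reference still computes with the old model, while any call to `current()` after the swap sees the new one.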
Configuration
from forge import ModelConfig
config = ModelConfig(
    batch_window_ms=50.0,    # Collect requests for 50 ms before dispatching
    max_batch_size=32,       # Flush early if 32 requests accumulate first
    max_queue_depth=256,     # Return 503 if more than 256 requests are pending
    request_timeout_s=30.0,  # Fail requests waiting longer than 30 s
    device="cuda",           # "cpu", "cuda", "cuda:0" or "mps"
    num_threads=4,           # PyTorch intra-op threads (CPU only)
)
Benchmark
# Start the server
forge serve examples/serve_dummy.py
# Run the sweep in another terminal
python benchmarks/bench_throughput.py --concurrency 1,5,10,25,50,100
# With chart output
python benchmarks/bench_throughput.py --plot
Sample results (CPU, dummy model, 128-float input):
| Concurrency | RPS | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|---|
| 1 | 420 | 2.1 | 2.8 | 3.1 |
| 10 | 1,850 | 4.9 | 8.2 | 11.4 |
| 50 | 3,200 | 14.8 | 28.3 | 41.7 |
| 100 | 3,400 | 28.1 | 52.6 | 71.2 |
Key insight: at concurrency 50, batching yields roughly 7.6x the throughput of the concurrency-1 row (3,200 vs. 420 RPS), which approximates a naive sequential server. The cost is latency: P50 rises from 2.1 ms to 14.8 ms, mostly time spent waiting in the batch window.
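The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below computes throughput relative to the concurrency-1 row and applies Little's law (requests in flight = throughput x latency) as a sanity check; it uses P50 as a stand-in for mean latency, which is an approximation.

```python
# Rows copied from the sample results table: concurrency -> (RPS, P50 in ms).
SAMPLE_RESULTS = {1: (420, 2.1), 10: (1850, 4.9), 50: (3200, 14.8), 100: (3400, 28.1)}


def speedup(concurrency: int) -> float:
    """Throughput relative to the concurrency-1 row."""
    return SAMPLE_RESULTS[concurrency][0] / SAMPLE_RESULTS[1][0]


def estimated_in_flight(concurrency: int) -> float:
    """Little's law estimate: throughput (req/s) times latency (s)."""
    rps, p50_ms = SAMPLE_RESULTS[concurrency]
    return rps * p50_ms / 1000.0


for level in SAMPLE_RESULTS:
    print(f"concurrency={level:>3}  speedup={speedup(level):.1f}x  "
          f"in_flight~{estimated_in_flight(level):.1f}")
```

For the concurrency-50 row this gives a speedup of about 7.6x and roughly 47 requests in flight, close to the 50 concurrent clients, so the table is internally consistent.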
Running Tests
pytest tests/ -v
API Reference
| Endpoint | Method | Description |
|---|---|---|
| /v1/{model}/predict | POST | Submit tensor for inference |
| /v1/{model}/reload | POST | Hot-swap to new checkpoint |
| /v1/models | GET | List models and queue depths |
| /metrics | GET | Prometheus metrics |
| /health | GET | Liveness and readiness |
| /docs | GET | Interactive API docs (Swagger) |
Project Structure
forge/
├── forge/
│ ├── config.py # ModelConfig and ServerConfig
│ ├── queue.py # RequestQueue and InferenceRequest
│ ├── batcher.py # BatchScheduler (core engine)
│ ├── worker.py # ModelWorker and hot-swap
│ ├── registry.py # ModelRegistry
│ ├── metrics.py # Prometheus metrics
│ ├── server.py # FastAPI app
│ └── cli.py # forge serve CLI
├── tests/
├── benchmarks/
└── examples/
License
MIT