A lightweight ML inference server with dynamic batching, hot-swapping, and Prometheus metrics

Forge

A lightweight ML inference server with dynamic batching, model hot-swapping, a multi-model registry and Prometheus metrics -- built from scratch in Python.

This is not model.predict() behind Flask. Forge implements the core mechanics of what TorchServe and Triton do, from first principles.


Features

| Feature | Description |
|---|---|
| Dynamic batching | Accumulates concurrent requests over a configurable time window then stacks them into a single tensor operation. Typically 4-8x throughput vs. sequential. |
| Backpressure queue | asyncio.Queue with a hard depth cap. Returns HTTP 503 immediately when saturated rather than blocking the event loop. |
| Model hot-swapping | Load a new checkpoint with zero downtime. In-flight requests finish on the old model; new requests use the new one. |
| Model registry | Serve multiple models simultaneously at independent endpoints, each with its own queue and batching config. |
| Prometheus metrics | P50/P95/P99 latency histograms, batch size distribution, queue depth, timeout counters and swap duration. |
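The backpressure behaviour described above can be illustrated with a short sketch. This is not Forge's actual RequestQueue: the names BoundedRequestQueue, QueueSaturated and demo are hypothetical, and the only assumption taken from the description is an asyncio.Queue with a hard depth cap that rejects immediately instead of waiting.

```python
import asyncio


class QueueSaturated(Exception):
    """Raised at the depth cap; a server would translate this into HTTP 503."""


class BoundedRequestQueue:
    """Hypothetical stand-in for a backpressure queue (not Forge's real class)."""

    def __init__(self, max_queue_depth: int) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue_depth)

    def submit(self, payload) -> asyncio.Future:
        # The HTTP handler would await this future for its result.
        future = asyncio.get_running_loop().create_future()
        try:
            # put_nowait never waits: it raises QueueFull once the cap is hit.
            self._queue.put_nowait((payload, future))
        except asyncio.QueueFull:
            raise QueueSaturated("queue depth cap reached") from None
        return future

    def depth(self) -> int:
        return self._queue.qsize()


async def demo() -> list:
    queue = BoundedRequestQueue(max_queue_depth=2)
    outcomes = []
    for i in range(3):
        try:
            queue.submit({"input": [float(i)]})
            outcomes.append("accepted")
        except QueueSaturated:
            outcomes.append("rejected (503)")
    return outcomes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Rejecting at submit time keeps the event loop free: the caller gets an answer right away and can retry, rather than piling up behind a queue that is already full.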

Architecture

  HTTP Request
       |
       v
  +-------------------------------------+
  |  FastAPI  POST /v1/{model}/predict  |
  +------------------+------------------+
                     | asyncio.Future
                     v
  +-------------------------------------+
  |  RequestQueue  (backpressure cap)   |
  +------------------+------------------+
                     | blocking get / nowait drain
                     v
  +-------------------------------------+
  |  BatchScheduler                     |
  |  +-------------------------------+  |
  |  | Collect for batch_window_ms   |  |
  |  |   OR until max_batch_size     |  |  <- whichever fires first
  |  +--------------+----------------+  |
  |                 | run_in_executor   |
  |  +--------------v----------------+  |
  |  |  torch.no_grad() forward pass |  |
  |  +--------------+----------------+  |
  |                 | scatter results   |
  +-----------------|-------------------+
                    |
                    v
         Future.set_result(tensor)
                    |
                    v
         HTTP Response to caller
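The collect, execute and scatter stages in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's BatchScheduler: collect_batch, run_batch and double_all are hypothetical names, a pure-Python function stands in for the torch forward pass, and the drain uses asyncio.wait_for with a shrinking timeout rather than the get_nowait drain the diagram mentions.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_window_ms: float, max_batch_size: int) -> list:
    """Wait for the first item, then gather more until the window closes or the batch fills."""
    batch = [await queue.get()]  # idle here until work arrives
    deadline = time.monotonic() + batch_window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window elapsed
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # window elapsed while waiting
    return batch


async def run_batch(batch: list, model_fn) -> None:
    """Run the blocking model call off the event loop, then resolve each caller's future."""
    loop = asyncio.get_running_loop()
    inputs = [payload for payload, _ in batch]
    outputs = await loop.run_in_executor(None, model_fn, inputs)
    for (_, future), output in zip(batch, outputs):
        if not future.done():
            future.set_result(output)


def double_all(inputs: list) -> list:
    # Stand-in for a batched forward pass: doubles every element.
    return [[2.0 * x for x in row] for row in inputs]


async def demo():
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    futures = []
    for i in range(5):
        future = loop.create_future()
        queue.put_nowait(([float(i)], future))
        futures.append(future)
    sizes = []
    while not queue.empty():
        batch = await collect_batch(queue, batch_window_ms=20.0, max_batch_size=4)
        sizes.append(len(batch))
        await run_batch(batch, double_all)
    return sizes, [f.result() for f in futures]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With five queued requests and a cap of four, the first batch flushes early on size and the second flushes when its window expires, which is the "whichever fires first" rule in the diagram.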

Quickstart

Install

git clone https://github.com/verz0/Forge.git
cd Forge
pip install -e ".[dev]"

Serve a dummy model (no GPU needed)

forge serve examples/serve_dummy.py
curl -X POST http://localhost:8000/v1/dummy/predict \
     -H "Content-Type: application/json" \
     -d '{"input": [1.0, 2.0, 3.0]}'
# {"model":"dummy","output":[2.0,4.0,6.0],"latency_ms":1.2}

Serve ResNet-50

pip install torchvision
forge serve examples/serve_resnet.py

Check metrics (Prometheus format)

curl http://localhost:8000/metrics
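Prometheus histograms expose cumulative bucket counts rather than exact percentiles, so P50/P95/P99 are estimated at query time by interpolating inside the bucket that contains the target rank. The sketch below mirrors that interpolation in plain Python. The bucket bounds and counts are made-up illustration values, not output from Forge, and the function is a simplified analogue of PromQL's histogram_quantile rather than a copy of it.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) pairs sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Rank falls in the open-ended bucket (or an empty one):
                # report the last finite bound instead of interpolating.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in milliseconds: (upper bound, cumulative count).
EXAMPLE_BUCKETS = [(5.0, 50), (10.0, 90), (25.0, 98), (50.0, 100), (math.inf, 100)]

if __name__ == "__main__":
    for q in (0.50, 0.95, 0.99):
        print(q, histogram_quantile(q, EXAMPLE_BUCKETS))
```

The estimate is only as fine as the bucket layout, so bucket boundaries are worth choosing around the latencies you expect to see.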

Hot-swap a model

curl -X POST http://localhost:8000/v1/resnet50/reload \
     -H "Content-Type: application/json" \
     -d '{"model_path": "/path/to/new_resnet.pt"}'
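One common way to get the zero-downtime behaviour described for hot-swapping is to load the new model completely and then replace a single reference, so any batch that already holds the old reference finishes on it. The sketch below shows that idea only. SwappableModel and its methods are hypothetical, plain callables stand in for checkpoints, and nothing here is taken from Forge's worker.py.

```python
import threading


class SwappableModel:
    """Hypothetical holder illustrating swap-by-reference (not Forge's ModelWorker)."""

    def __init__(self, model) -> None:
        self._model = model
        self._lock = threading.Lock()

    def current(self):
        # A batch reads the reference once and keeps it for its whole forward pass.
        return self._model

    def swap(self, load_new):
        # Load the replacement fully before touching the live reference.
        new_model = load_new()
        with self._lock:  # serialise concurrent swap requests
            old_model, self._model = self._model, new_model
        return old_model


def demo():
    holder = SwappableModel(lambda xs: [2.0 * x for x in xs])
    in_flight = holder.current()  # a batch that started before the swap
    holder.swap(lambda: (lambda xs: [3.0 * x for x in xs]))
    after_swap = holder.current()  # a batch that starts after the swap
    return in_flight([1.0, 2.0]), after_swap([1.0, 2.0])


if __name__ == "__main__":
    print(demo())
```

The old model stays alive for as long as an in-flight batch references it and is reclaimed afterwards, which is why no request has to be rejected during the swap.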

Configuration

from forge import ModelConfig

config = ModelConfig(
    batch_window_ms=50.0,   # Collect requests for 50ms before dispatching
    max_batch_size=32,       # Early flush if 32 requests accumulate first
    max_queue_depth=256,     # Return 503 if more than 256 requests are pending
    request_timeout_s=30.0,  # Fail requests waiting longer than 30s
    device="cuda",           # "cpu", "cuda", "cuda:0" or "mps"
    num_threads=4,           # PyTorch intraop threads (CPU only)
)

Benchmark

# Start the server
forge serve examples/serve_dummy.py

# Run the sweep in another terminal
python benchmarks/bench_throughput.py --concurrency 1,5,10,25,50,100

# With chart output
python benchmarks/bench_throughput.py --plot

Sample results (CPU, dummy model, 128-float input):

| Concurrency | RPS | P50 (ms) | P95 (ms) | P99 (ms) |
|---|---|---|---|---|
| 1 | 420 | 2.1 | 2.8 | 3.1 |
| 10 | 1,850 | 4.9 | 8.2 | 11.4 |
| 50 | 3,200 | 14.8 | 28.3 | 41.7 |
| 100 | 3,400 | 28.1 | 52.6 | 71.2 |

Key insight: at concurrency 50, batching yields roughly 7.6x the throughput of a naive sequential server (3,200 RPS vs. the 420 RPS measured at concurrency 1), at the cost of approximately 15 ms of added latency (the batch window).
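The throughput ratio can be checked directly against the sample results. The short script below treats the concurrency-1 row as the sequential baseline, which is an assumption made for this check, and derives the speedup and the change in P50 latency for each row. The numbers are copied from the results above; summarize is a hypothetical helper.

```python
# Sample results from the benchmark: concurrency -> (RPS, P50 latency in ms).
SAMPLE_RESULTS = {
    1: (420, 2.1),
    10: (1850, 4.9),
    50: (3200, 14.8),
    100: (3400, 28.1),
}


def summarize(results: dict, baseline_concurrency: int = 1) -> dict:
    """Return concurrency -> (speedup vs. baseline, added P50 latency in ms)."""
    base_rps, base_p50 = results[baseline_concurrency]
    return {
        concurrency: (round(rps / base_rps, 1), round(p50 - base_p50, 1))
        for concurrency, (rps, p50) in results.items()
    }


if __name__ == "__main__":
    for concurrency, (speedup, added_ms) in summarize(SAMPLE_RESULTS).items():
        print(f"concurrency={concurrency}: {speedup}x throughput, +{added_ms} ms P50")
```

Throughput flattens between concurrency 50 and 100 while P50 latency nearly doubles, which suggests the server is close to saturation at that point on this hardware.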


Running Tests

pytest tests/ -v

API Reference

| Endpoint | Method | Description |
|---|---|---|
| /v1/{model}/predict | POST | Submit tensor for inference |
| /v1/{model}/reload | POST | Hot-swap to new checkpoint |
| /v1/models | GET | List models and queue depths |
| /metrics | GET | Prometheus metrics |
| /health | GET | Liveness and readiness |
| /docs | GET | Interactive API docs (Swagger) |
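A client for the predict endpoint might look like the sketch below, which uses only the standard library. The URL shape and the {"input": [...]} payload follow the Quickstart example. The retry-on-503 policy, the function names, and the timeout and backoff values are assumptions added for illustration and are not part of Forge.

```python
import json
import time
import urllib.error
import urllib.request


def build_predict_request(base_url: str, model: str, values: list) -> urllib.request.Request:
    """Build a POST to /v1/{model}/predict with a JSON body."""
    body = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/{model}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_with_retry(request, opener=urllib.request.urlopen, retries: int = 3, backoff_s: float = 0.1):
    """Send the request, retrying with exponential backoff when the server answers 503."""
    for attempt in range(retries + 1):
        try:
            with opener(request, timeout=30) as response:
                return json.loads(response.read())
        except urllib.error.HTTPError as err:
            # Only a saturated queue (503) is worth retrying; re-raise anything else.
            if err.code != 503 or attempt == retries:
                raise
            time.sleep(backoff_s * (2 ** attempt))


if __name__ == "__main__":
    # Requires a running server, e.g. `forge serve examples/serve_dummy.py`.
    req = build_predict_request("http://localhost:8000", "dummy", [1.0, 2.0, 3.0])
    print(predict_with_retry(req))
```

Backing off on 503 complements the server's backpressure queue: the server sheds load immediately, and the client spaces out its retries instead of hammering a saturated queue.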

Project Structure

forge/
├── forge/
│   ├── config.py      # ModelConfig and ServerConfig
│   ├── queue.py       # RequestQueue and InferenceRequest
│   ├── batcher.py     # BatchScheduler (core engine)
│   ├── worker.py      # ModelWorker and hot-swap
│   ├── registry.py    # ModelRegistry
│   ├── metrics.py     # Prometheus metrics
│   ├── server.py      # FastAPI app
│   └── cli.py         # forge serve CLI
├── tests/
├── benchmarks/
└── examples/

License

MIT
