A lightweight ML inference server with dynamic batching, hot-swapping, and Prometheus metrics

Forge

A lightweight ML inference server with dynamic batching, model hot-swapping, a multi-model registry and Prometheus metrics -- built from scratch in Python.

This is not model.predict() behind Flask. Forge implements the core mechanics of what TorchServe and Triton do, from first principles.

Features

Feature	Description
Dynamic batching	Accumulates concurrent requests over a configurable time window, then stacks them into a single tensor operation. Typically yields 4-8x the throughput of sequential, one-request-at-a-time inference.
Backpressure queue	asyncio.Queue with a hard depth cap. Returns HTTP 503 immediately when saturated rather than blocking the event loop.
Model hot-swapping	Load a new checkpoint with zero downtime. In-flight requests finish on the old model; new requests use the new one.
Model registry	Serve multiple models simultaneously at independent endpoints, each with its own queue and batching config.
Prometheus metrics	P50/P95/P99 latency histograms, batch size distribution, queue depth, timeout counters and swap duration.
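
The backpressure row can be illustrated with a minimal sketch. It assumes only what the table states (an asyncio.Queue with a hard depth cap that rejects instead of blocking); the names BoundedRequestQueue, QueueSaturated and submit are hypothetical and are not taken from Forge's source.

```python
import asyncio


class QueueSaturated(Exception):
    """Raised at the depth cap; an HTTP layer would map this to a 503 response."""


class BoundedRequestQueue:
    """Hypothetical stand-in for a capped request queue (illustrative names only)."""

    def __init__(self, max_queue_depth: int) -> None:
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=max_queue_depth)

    def submit(self, payload) -> asyncio.Future:
        # Must be called from inside a running event loop.
        future = asyncio.get_running_loop().create_future()
        try:
            # put_nowait never waits: it raises QueueFull once the cap is reached,
            # so the event loop is not blocked by a saturated queue.
            self._queue.put_nowait((payload, future))
        except asyncio.QueueFull:
            raise QueueSaturated("queue depth cap reached") from None
        return future

    def depth(self) -> int:
        return self._queue.qsize()


async def demo() -> tuple:
    queue = BoundedRequestQueue(max_queue_depth=2)
    queue.submit([1.0])
    queue.submit([2.0])
    try:
        queue.submit([3.0])  # third request exceeds the cap of 2
        rejected = False
    except QueueSaturated:
        rejected = True
    return queue.depth(), rejected


print(asyncio.run(demo()))
```

Rejecting at submit time keeps the failure cheap: the caller learns immediately that the server is saturated and can retry or shed load, rather than waiting in an unbounded queue until its own timeout fires.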

Architecture

  HTTP Request
       |
       v
  +-------------------------------------+
  |  FastAPI  POST /v1/{model}/predict  |
  +------------------+------------------+
                     | asyncio.Future
                     v
  +-------------------------------------+
  |  RequestQueue  (backpressure cap)   |
  +------------------+------------------+
                     | blocking get / nowait drain
                     v
  +-------------------------------------+
  |  BatchScheduler                     |
  |  +-------------------------------+  |
  |  | Collect for batch_window_ms   |  |
  |  |   OR until max_batch_size     |  |  <- whichever fires first
  |  +--------------+----------------+  |
  |                 | run_in_executor   |
  |  +--------------v----------------+  |
  |  |  torch.no_grad() forward pass |  |
  |  +--------------+----------------+  |
  |                 | scatter results   |
  +-----------------|-------------------+
                    |
                    v
         Future.set_result(tensor)
                    |
                    v
         HTTP Response to caller
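
The collect, execute and scatter steps in the diagram could look roughly like the sketch below. It is an approximation under stated assumptions: the function names collect_batch and run_one_batch are hypothetical, the drain uses asyncio.wait_for with a shrinking timeout rather than Forge's actual "nowait drain", and a plain Python function stands in for the torch.no_grad() forward pass.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_window_ms: float, max_batch_size: int) -> list:
    """Wait for one item, then keep collecting until the window closes or the batch is full."""
    batch = [await queue.get()]  # blocking get: stay idle until work arrives
    deadline = time.monotonic() + batch_window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window expired first
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # nothing else arrived inside the window
    return batch


async def run_one_batch(queue: asyncio.Queue, model_fn, batch_window_ms: float, max_batch_size: int) -> int:
    """Collect a batch, run it off the event loop, and scatter outputs to the waiting futures."""
    batch = await collect_batch(queue, batch_window_ms, max_batch_size)
    inputs = [payload for payload, _ in batch]
    loop = asyncio.get_running_loop()
    # The (possibly slow) forward pass runs in a worker thread so the loop stays responsive.
    outputs = await loop.run_in_executor(None, model_fn, inputs)
    for (_, future), output in zip(batch, outputs):
        if not future.done():  # skip futures already cancelled or timed out
            future.set_result(output)
    return len(batch)


def double_all(inputs: list) -> list:
    # Stand-in for a batched forward pass.
    return [[2 * x for x in row] for row in inputs]


async def demo() -> tuple:
    queue: asyncio.Queue = asyncio.Queue()
    loop = asyncio.get_running_loop()
    futures = []
    for row in ([1.0, 2.0], [3.0], [4.0, 5.0]):
        future = loop.create_future()
        queue.put_nowait((row, future))
        futures.append(future)
    size = await run_one_batch(queue, double_all, batch_window_ms=20.0, max_batch_size=2)
    resolved = [f.result() for f in futures if f.done()]
    return size, resolved


print(asyncio.run(demo()))
```

With max_batch_size=2 the first batch flushes as soon as two requests are collected, and the third request stays queued for the next batch. This is the "whichever fires first" behavior noted in the diagram.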

Quickstart

Install

git clone https://github.com/verz0/Forge.git
cd Forge
pip install -e ".[dev]"

Serve a dummy model (no GPU needed)

forge serve examples/serve_dummy.py

curl -X POST http://localhost:8000/v1/dummy/predict \
     -H "Content-Type: application/json" \
     -d '{"input": [1.0, 2.0, 3.0]}'
# {"model":"dummy","output":[2.0,4.0,6.0],"latency_ms":1.2}
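
The same request can be sent from Python with only the standard library. This is a sketch that assumes the URL and JSON body shown in the curl example above; the helper names build_predict_request and predict are hypothetical and not part of Forge.

```python
import json
import urllib.request


def build_predict_request(model: str, values: list, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/{model}/predict with a JSON body."""
    body = json.dumps({"input": values}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/{model}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(model: str, values: list, timeout_s: float = 5.0) -> dict:
    """Send the request to a running server and decode the JSON response."""
    request = build_predict_request(model, values)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_predict_request("dummy", [1.0, 2.0, 3.0])
print(request.get_method(), request.full_url)
```

Calling predict("dummy", [1.0, 2.0, 3.0]) requires the server from the previous step to be running. With urllib, a non-2xx reply such as the 503 returned by a saturated queue surfaces as urllib.error.HTTPError, so callers may want to catch it and back off.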

Serve ResNet-50

pip install torchvision
forge serve examples/serve_resnet.py

Check metrics (Prometheus format)

curl http://localhost:8000/metrics

Hot-swap a model

curl -X POST http://localhost:8000/v1/resnet50/reload \
     -H "Content-Type: application/json" \
     -d '{"model_path": "/path/to/new_resnet.pt"}'
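
One common way to get the "in-flight requests finish on the old model" behavior is an atomic reference swap. The sketch below shows that idea only; SwappableModel is a hypothetical class, not Forge's ModelWorker, and simple callables stand in for loaded checkpoints.

```python
import threading


class SwappableModel:
    """Hypothetical holder illustrating hot-swapping by replacing a reference."""

    def __init__(self, model) -> None:
        self._model = model
        self._lock = threading.Lock()

    def current(self):
        # A batch grabs the reference once and keeps using it,
        # so work that started on the old model also finishes on it.
        with self._lock:
            return self._model

    def swap(self, load_fn):
        new_model = load_fn()  # the slow load happens outside the lock
        with self._lock:
            old_model, self._model = self._model, new_model
        return old_model  # the caller can release it once in-flight work drains


holder = SwappableModel(lambda batch: [x * 2 for x in batch])
in_flight = holder.current()  # a batch that started before the swap
holder.swap(lambda: (lambda batch: [x * 3 for x in batch]))
print(in_flight([1, 2]), holder.current()([1, 2]))
```

Because the new checkpoint is fully loaded before the reference changes, requests never observe a half-initialized model; the lock is held only for the pointer exchange.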

Configuration

from forge import ModelConfig

config = ModelConfig(
    batch_window_ms=50.0,   # Collect requests for 50ms before dispatching
    max_batch_size=32,       # Early flush if 32 requests accumulate first
    max_queue_depth=256,     # Return 503 if more than 256 requests are pending
    request_timeout_s=30.0,  # Fail requests waiting longer than 30s
    device="cuda",           # "cpu", "cuda", "cuda:0" or "mps"
    num_threads=4,           # PyTorch intraop threads (CPU only)
)

Benchmark

# Start the server
forge serve examples/serve_dummy.py

# Run the sweep in another terminal
python benchmarks/bench_throughput.py --concurrency 1,5,10,25,50,100

# With chart output
python benchmarks/bench_throughput.py --plot

Sample results (CPU, dummy model, 128-float input):

Concurrency	RPS	P50 (ms)	P95 (ms)	P99 (ms)
1	420	2.1	2.8	3.1
10	1,850	4.9	8.2	11.4
50	3,200	14.8	28.3	41.7
100	3,400	28.1	52.6	71.2

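Key insight: at concurrency 50, batching yields roughly 7.6x the throughput of a naive sequential server (3,200 vs. 420 RPS, taking the concurrency-1 row as the sequential baseline). The cost is added latency from waiting in the batch window: P50 rises from 2.1 ms to 14.8 ms.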
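
The arithmetic behind that comparison can be reproduced from the table. The sketch below assumes the concurrency-1 row is a fair proxy for a sequential server, which the benchmark output does not state explicitly; the function name compare_to_baseline is hypothetical.

```python
# Rows copied from the sample results table: concurrency -> (RPS, P50 in ms).
results = {1: (420, 2.1), 10: (1850, 4.9), 50: (3200, 14.8), 100: (3400, 28.1)}


def compare_to_baseline(rows: dict, baseline: int = 1) -> dict:
    """Return concurrency -> (throughput multiple, added P50 latency in ms) vs. the baseline row."""
    base_rps, base_p50 = rows[baseline]
    return {
        concurrency: (round(rps / base_rps, 1), round(p50 - base_p50, 1))
        for concurrency, (rps, p50) in rows.items()
    }


for concurrency, (speedup, added_ms) in compare_to_baseline(results).items():
    print(f"concurrency {concurrency}: {speedup}x throughput, +{added_ms} ms P50")
```

The numbers also show diminishing returns: going from concurrency 50 to 100 adds little throughput while median latency nearly doubles.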

Running Tests

pytest tests/ -v

API Reference

Endpoint	Method	Description
`/v1/{model}/predict`	POST	Submit tensor for inference
`/v1/{model}/reload`	POST	Hot-swap to new checkpoint
`/v1/models`	GET	List models and queue depths
`/metrics`	GET	Prometheus metrics
`/health`	GET	Liveness and readiness
`/docs`	GET	Interactive API docs (Swagger)

Project Structure

forge/
├── forge/
│   ├── config.py      # ModelConfig and ServerConfig
│   ├── queue.py       # RequestQueue and InferenceRequest
│   ├── batcher.py     # BatchScheduler (core engine)
│   ├── worker.py      # ModelWorker and hot-swap
│   ├── registry.py    # ModelRegistry
│   ├── metrics.py     # Prometheus metrics
│   ├── server.py      # FastAPI app
│   └── cli.py         # forge serve CLI
├── tests/
├── benchmarks/
└── examples/

License

MIT

Release history

0.2.1

May 27, 2026

0.2.0

May 27, 2026

This version

0.1.0

May 27, 2026

Download files

Source Distribution

forge_ml_serve-0.1.0.tar.gz (18.4 kB view details)

Uploaded May 27, 2026 Source

Built Distribution

forge_ml_serve-0.1.0-py3-none-any.whl (16.8 kB view details)

Uploaded May 27, 2026 Python 3

File details

Details for the file forge_ml_serve-0.1.0.tar.gz.

File metadata

Download URL: forge_ml_serve-0.1.0.tar.gz
Upload date: May 27, 2026
Size: 18.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for forge_ml_serve-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`8cd044b83eefc76fedbfccd344ebc911dc2668bc1905eba5a98c355ec627a66d`
MD5	`0e213d44ce03b27b5cb5d86d58a9d2a9`
BLAKE2b-256	`9a502dfd58ccae84fea87a8a530881bc3adf4c8e399adbc3555c052619da3eaa`

File details

Details for the file forge_ml_serve-0.1.0-py3-none-any.whl.

File metadata

Download URL: forge_ml_serve-0.1.0-py3-none-any.whl
Upload date: May 27, 2026
Size: 16.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.9

File hashes

Hashes for forge_ml_serve-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`65c8e0928fe569e2acfa906221ee7e0d49b09264c457faa4538410ac3eb88da8`
MD5	`f89b969dac798b1d6e9baad557d7ef7b`
BLAKE2b-256	`ebaf7ff9efcef3ec3849c0ba139c1eaecb52916b58b6a727f8ce754645b22157`
