A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows.

These details have not been verified by PyPI

Project links

Project description

forge

A reliability layer for self-hosted LLM tool-calling. Forge lifts an 8B local model to the top of its class on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction). The current top self-hosted config (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across forge's 26-scenario eval suite — and 76% on the hardest tier.

Three ways to use it:

WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.
Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.
Proxy server — Drop-in OpenAI-compatible proxy (python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model.

Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.

Requirements

Python 3.12+
A running LLM backend (see below)

Install

pip install forge-guardrails                # core only
pip install "forge-guardrails[anthropic]"   # + Anthropic client

For development:

git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"

Backend setup (pick one)

llama-server (recommended — top 10 eval configs all run on llama-server):

# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Ollama (alternative — easier setup, slightly weaker on harder workloads):

# Install from https://ollama.com/download
ollama pull ministral-3:8b-instruct-2512-q4_K_M

Anthropic (API, no local GPU needed):

pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=sk-...

See Backend Setup for full instructions and Model Guide for which model fits your hardware.

Quick Start

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.

Proxy Server

Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free.

# External mode — you manage llama-server, forge proxies it
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and the proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

Then configure your client to use http://localhost:8081/v1 as the API base URL.

Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis.

Backends

Backend	Best for	Native FC?
Ollama	Easiest setup, model management built-in	Yes
llama-server	Best performance, full control	Yes (with `--jinja`)
Llamafile	Single binary, zero dependencies	No (prompt-injected)
Anthropic	Frontier baseline, hybrid workflows	Yes

See Backend Setup for installation and Model Guide for which model to pick.

Running Tests

python -m pytest tests/ -v --tb=short

python -m pytest tests/ --cov=forge --cov-report=term-missing

Eval Harness

26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See Eval Guide for full CLI reference.

# llama-server (start in another terminal first; see Eval Guide)
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf "path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf" --runs 10 --stream --verbose

# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 50

# Reports (ASCII table, HTML dashboard, markdown views)
python -m tests.eval.report eval_results.jsonl

Project Structure

src/forge/
  __init__.py          # Public API exports
  errors.py            # ForgeError hierarchy
  server.py            # setup_backend(), ServerManager, BudgetMode
  core/
    messages.py        # Message, MessageRole, MessageType, MessageMeta
    workflow.py        # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow
    inference.py       # run_inference() — shared front half (compact, fold, validate, retry)
    runner.py          # WorkflowRunner — the agentic loop
    slot_worker.py     # SlotWorker — priority-queued slot access
    steps.py           # StepTracker
  guardrails/
    nudge.py           # Nudge dataclass
    response_validator.py  # ResponseValidator, ValidationResult
    step_enforcer.py   # StepEnforcer, StepCheck
    error_tracker.py   # ErrorTracker
  clients/
    base.py            # ChunkType, StreamChunk, LLMClient protocol
    ollama.py          # OllamaClient (native FC)
    llamafile.py       # LlamafileClient (native FC or prompt-injected)
    anthropic.py       # AnthropicClient (frontier baseline)
  context/
    manager.py         # ContextManager, CompactEvent
    strategies.py      # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact
    hardware.py        # HardwareProfile, detect_hardware()
  prompts/
    templates.py       # Tool prompt builders (prompt-injected path)
    nudges.py          # Retry and step-enforcement nudge templates
  tools/
    respond.py         # Synthetic respond tool (respond_tool(), respond_spec())
  proxy/
    proxy.py           # ProxyServer — programmatic start/stop API
    server.py          # Raw asyncio HTTP server, SSE streaming
    handler.py         # Request handler — bridge between HTTP and run_inference
    convert.py         # OpenAI messages ↔ forge Messages conversion
tests/
  unit/                # 865 deterministic tests — no LLM backend required
  eval/                # Eval harness — model qualification against real backends

Documentation

User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory
Model Guide — Which model and backend for your hardware
Backend Setup — Backend installation and server setup
Eval Guide — Eval harness CLI reference, batch eval
Architecture — Full design document
Workflow Internals — Workflow design and runner internals
Contributing — How to set up, test, and add new backends or scenarios

Paper

The forge guardrail framework and ablation study are published as:

Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. https://doi.org/10.1145/3786335.3813193

A pre-publication preprint is also available at docs/forge_ieee_preprint.pdf — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing.

License

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.7.2

May 24, 2026

0.7.1

May 24, 2026

0.7.0

May 22, 2026

This version

0.6.0

May 1, 2026

0.5.0

Apr 29, 2026

0.4.3

Apr 17, 2026

0.4.2

Apr 11, 2026

0.4.1

Apr 3, 2026

0.4.0

Apr 2, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

forge_guardrails-0.6.0.tar.gz (25.4 MB view details)

Uploaded May 1, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

forge_guardrails-0.6.0-py3-none-any.whl (83.9 kB view details)

Uploaded May 1, 2026 Python 3

File details

Details for the file forge_guardrails-0.6.0.tar.gz.

File metadata

Download URL: forge_guardrails-0.6.0.tar.gz
Upload date: May 1, 2026
Size: 25.4 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for forge_guardrails-0.6.0.tar.gz
Algorithm	Hash digest
SHA256	`09d61006cb5d68993ac1c4474a7853d336b02872c19054f9cd5cedb9b5873f74`
MD5	`8bf99a973b7c26c29d5364983041c18a`
BLAKE2b-256	`ff5aa938df8a9cf7cd99d0337e78962ace4a7ca33b0927e99cf36637cf801e96`

See more details on using hashes here.

File details

Details for the file forge_guardrails-0.6.0-py3-none-any.whl.

File metadata

Download URL: forge_guardrails-0.6.0-py3-none-any.whl
Upload date: May 1, 2026
Size: 83.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for forge_guardrails-0.6.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`444453f4577f1a39d4fb93b130e6fce013d1fb92d2f13523883052dca822d3c2`
MD5	`f41d678036e0b81020886eac4e8226d5`
BLAKE2b-256	`05ead4ed85adf2b6be4d3a6ef6e94fed1f4e5f1acec73630a9cb285896f76794`

See more details on using hashes here.

forge-guardrails 0.6.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

forge

Requirements

Install

Backend setup (pick one)

Quick Start

Proxy Server

Backends

Running Tests

Eval Harness

Project Structure

Documentation

Paper

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes