Skip to main content

A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows.

Project description

forge

PyPI Tests codecov Python 3.12+ License: MIT

A reliability layer for self-hosted LLM tool-calling. Forge lifts an 8B local model to the top of its class on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction). The current top self-hosted config (Ministral-3 8B Instruct Q8 on llama-server) scores 86.5% across forge's 26-scenario eval suite — and 76% on the hardest tier.

Three ways to use it:

  • WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.

  • Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.

  • Proxy server — Drop-in OpenAI-compatible proxy (python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model.

Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.

Requirements

  • Python 3.12+
  • A running LLM backend (see below)

Install

pip install forge-guardrails                # core only
pip install "forge-guardrails[anthropic]"   # + Anthropic client

For development:

git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"

Backend setup (pick one)

llama-server (recommended — top 10 eval configs all run on llama-server):

# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf --jinja -ngl 999 --port 8080

Ollama (alternative — easier setup, slightly weaker on harder workloads):

# Install from https://ollama.com/download
ollama pull ministral-3:8b-instruct-2512-q4_K_M

Anthropic (API, no local GPU needed):

pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=sk-...

See Backend Setup for full instructions and Model Guide for which model fits your hardware.

Quick Start

import asyncio
from pydantic import BaseModel, Field
from forge import (
    Workflow, ToolDef, ToolSpec,
    WorkflowRunner, OllamaClient,
    ContextManager, TieredCompact,
)

def get_weather(city: str) -> str:
    return f"72°F and sunny in {city}"

class GetWeatherParams(BaseModel):
    city: str = Field(description="City name")

workflow = Workflow(
    name="weather",
    description="Look up weather for a city.",
    tools={
        "get_weather": ToolDef(
            spec=ToolSpec(
                name="get_weather",
                description="Get current weather",
                parameters=GetWeatherParams,
            ),
            callable=get_weather,
        ),
    },
    required_steps=[],
    terminal_tool="get_weather",
    system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)

async def main():
    client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M", recommended_sampling=True)
    ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
    runner = WorkflowRunner(client=client, context_manager=ctx)
    await runner.run(workflow, "What's the weather in Paris?")

asyncio.run(main())

For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.

Proxy Server

Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free.

# External mode — you manage llama-server, forge proxies it
python -m forge.proxy --backend-url http://localhost:8080 --port 8081

# Managed mode — forge starts llama-server and the proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081

Then configure your client to use http://localhost:8081/v1 as the API base URL.

Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis.

Backends

Backend Best for Native FC?
Ollama Easiest setup, model management built-in Yes
llama-server Best performance, full control Yes (with --jinja)
Llamafile Single binary, zero dependencies No (prompt-injected)
Anthropic Frontier baseline, hybrid workflows Yes

See Backend Setup for installation and Model Guide for which model to pick.

Running Tests

python -m pytest tests/ -v --tb=short
python -m pytest tests/ --cov=forge --cov-report=term-missing

Eval Harness

26 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows — split into an OG-18 baseline tier and an 8-scenario advanced_reasoning tier for top-end separation. See Eval Guide for full CLI reference.

# llama-server (start in another terminal first; see Eval Guide)
python -m tests.eval.eval_runner --backend llamafile --llamafile-mode prompt --gguf "path/to/Ministral-3-8B-Instruct-2512-Q8_0.gguf" --runs 10 --stream --verbose

# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 50

# Reports (ASCII table, HTML dashboard, markdown views)
python -m tests.eval.report eval_results.jsonl

Project Structure

src/forge/
  __init__.py          # Public API exports
  errors.py            # ForgeError hierarchy
  server.py            # setup_backend(), ServerManager, BudgetMode
  core/
    messages.py        # Message, MessageRole, MessageType, MessageMeta
    workflow.py        # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow
    inference.py       # run_inference() — shared front half (compact, fold, validate, retry)
    runner.py          # WorkflowRunner — the agentic loop
    slot_worker.py     # SlotWorker — priority-queued slot access
    steps.py           # StepTracker
  guardrails/
    nudge.py           # Nudge dataclass
    response_validator.py  # ResponseValidator, ValidationResult
    step_enforcer.py   # StepEnforcer, StepCheck
    error_tracker.py   # ErrorTracker
  clients/
    base.py            # ChunkType, StreamChunk, LLMClient protocol
    ollama.py          # OllamaClient (native FC)
    llamafile.py       # LlamafileClient (native FC or prompt-injected)
    anthropic.py       # AnthropicClient (frontier baseline)
  context/
    manager.py         # ContextManager, CompactEvent
    strategies.py      # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact
    hardware.py        # HardwareProfile, detect_hardware()
  prompts/
    templates.py       # Tool prompt builders (prompt-injected path)
    nudges.py          # Retry and step-enforcement nudge templates
  tools/
    respond.py         # Synthetic respond tool (respond_tool(), respond_spec())
  proxy/
    proxy.py           # ProxyServer — programmatic start/stop API
    server.py          # Raw asyncio HTTP server, SSE streaming
    handler.py         # Request handler — bridge between HTTP and run_inference
    convert.py         # OpenAI messages ↔ forge Messages conversion
tests/
  unit/                # 865 deterministic tests — no LLM backend required
  eval/                # Eval harness — model qualification against real backends

Documentation

  • User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory
  • Model Guide — Which model and backend for your hardware
  • Backend Setup — Backend installation and server setup
  • Eval Guide — Eval harness CLI reference, batch eval
  • Architecture — Full design document
  • Workflow Internals — Workflow design and runner internals
  • Contributing — How to set up, test, and add new backends or scenarios

Paper

The forge guardrail framework and ablation study are published as:

Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling. https://doi.org/10.1145/3786335.3813193

A pre-publication preprint is also available at docs/forge_ieee_preprint.pdf — kept as a historical artifact. Cite the published version above; the DOI link may not resolve immediately depending on the publisher's release timing.

License

MIT — Copyright (c) 2025-2026 Antoine Zambelli

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

forge_guardrails-0.6.0.tar.gz (25.4 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

forge_guardrails-0.6.0-py3-none-any.whl (83.9 kB view details)

Uploaded Python 3

File details

Details for the file forge_guardrails-0.6.0.tar.gz.

File metadata

  • Download URL: forge_guardrails-0.6.0.tar.gz
  • Upload date:
  • Size: 25.4 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.7

File hashes

Hashes for forge_guardrails-0.6.0.tar.gz
Algorithm Hash digest
SHA256 09d61006cb5d68993ac1c4474a7853d336b02872c19054f9cd5cedb9b5873f74
MD5 8bf99a973b7c26c29d5364983041c18a
BLAKE2b-256 ff5aa938df8a9cf7cd99d0337e78962ace4a7ca33b0927e99cf36637cf801e96

See more details on using hashes here.

File details

Details for the file forge_guardrails-0.6.0-py3-none-any.whl.

File metadata

File hashes

Hashes for forge_guardrails-0.6.0-py3-none-any.whl
Algorithm Hash digest
SHA256 444453f4577f1a39d4fb93b130e6fce013d1fb92d2f13523883052dca822d3c2
MD5 f41d678036e0b81020886eac4e8226d5
BLAKE2b-256 05ead4ed85adf2b6be4d3a6ef6e94fed1f4e5f1acec73630a9cb285896f76794

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page