A reliability layer for self-hosted LLM tool-calling. Guardrails, context management, and backend adapters for multi-step agentic workflows.
Project description
forge
A reliability layer for self-hosted LLM tool-calling. Forge takes an 8B model from ~38% to ~99% on multi-step agentic workflows through guardrails (rescue parsing, retry nudges, step enforcement) and context management (VRAM-aware budgets, tiered compaction).
Three ways to use it:
-
WorkflowRunner — Define tools, pick a backend, run structured agent loops. Forge manages the full lifecycle: system prompts, tool execution, context compaction, and guardrails. SlotWorker adds priority-queued access to a shared inference slot with auto-preemption — for multi-agent architectures where specialist workflows share a GPU slot. Best when you're building on forge directly.
-
Guardrails middleware — Use forge's reliability stack (composable middleware) inside your own orchestration loop. You control the loop; forge validates responses, rescues malformed tool calls, and enforces required steps.
-
Proxy server — Drop-in OpenAI-compatible proxy (
python -m forge.proxy) that sits between any client (opencode, Continue, aider, etc.) and a local model server. Applies guardrails transparently — the client thinks it's talking to a smarter model.
Supports Ollama, llama-server (llama.cpp), Llamafile, and Anthropic as backends.
Requirements
- Python 3.12+
- A running LLM backend (see below)
Install
pip install forge-guardrails # core only
pip install "forge-guardrails[anthropic]" # + Anthropic client
For development:
git clone https://github.com/antoinezambelli/forge.git
cd forge
pip install -e ".[dev]"
Backend setup (pick one)
Ollama (easiest):
# Install from https://ollama.com/download
ollama pull ministral-3:8b-instruct-2512-q4_K_M
llama-server (best performance):
# Install from https://github.com/ggml-org/llama.cpp/releases
llama-server -m path/to/Ministral-3-8B-Reasoning-2512-Q4_K_M.gguf --jinja -ngl 999 --port 8080
Anthropic (API, no local GPU needed):
pip install -e ".[anthropic]"
export ANTHROPIC_API_KEY=sk-...
See Backend Setup for full instructions and Model Guide for which model fits your hardware.
Quick Start
import asyncio
from pydantic import BaseModel, Field
from forge import (
Workflow, ToolDef, ToolSpec,
WorkflowRunner, OllamaClient,
ContextManager, TieredCompact,
)
def get_weather(city: str) -> str:
return f"72°F and sunny in {city}"
class GetWeatherParams(BaseModel):
city: str = Field(description="City name")
workflow = Workflow(
name="weather",
description="Look up weather for a city.",
tools={
"get_weather": ToolDef(
spec=ToolSpec(
name="get_weather",
description="Get current weather",
parameters=GetWeatherParams,
),
callable=get_weather,
),
},
required_steps=[],
terminal_tool="get_weather",
system_prompt_template="You are a helpful assistant. Use the available tools to answer the user.",
)
async def main():
client = OllamaClient(model="ministral-3:8b-instruct-2512-q4_K_M")
ctx = ContextManager(strategy=TieredCompact(keep_recent=2), budget_tokens=8192)
runner = WorkflowRunner(client=client, context_manager=ctx)
await runner.run(workflow, "What's the weather in Paris?")
asyncio.run(main())
For multi-step workflows, multi-turn conversations, and backend auto-management, see the User Guide. If you're building a long-running session (CLI, chat server, voice assistant), see the long-running session advisory for important guidance on filtering transient messages.
Proxy Server
Drop-in replacement for a local model server. Point any OpenAI-compatible client at the proxy and get forge's guardrails for free.
# External mode — you manage llama-server, forge proxies it
python -m forge.proxy --backend-url http://localhost:8080 --port 8081
# Managed mode — forge starts llama-server and the proxy together
python -m forge.proxy --backend llamaserver --gguf path/to/model.gguf --port 8081
Then configure your client to use http://localhost:8081/v1 as the API base URL.
Note: The proxy automatically injects a synthetic respond tool when tools are present in the request. The model calls respond(message="...") instead of producing bare text, keeping it in tool-calling mode where forge's full guardrail stack applies. The respond call is stripped from the outbound response — the client sees a normal text response (finish_reason: "stop") and never knows the tool exists. This is essential for small local models (~8B), which cannot be trusted to choose correctly between text and tool calls — guiding them to a tool is a must. See ADR-013 for the full analysis.
Backends
| Backend | Best for | Native FC? |
|---|---|---|
| Ollama | Easiest setup, model management built-in | Yes |
| llama-server | Best performance, full control | Yes (with --jinja) |
| Llamafile | Single binary, zero dependencies | No (prompt-injected) |
| Anthropic | Frontier baseline, hybrid workflows | Yes |
See Backend Setup for installation and Model Guide for which model to pick.
Running Tests
python -m pytest tests/ -v --tb=short
python -m pytest tests/ --cov=forge --cov-report=term-missing
Eval Harness
22 scenarios measuring how reliably a model + backend combo navigates multi-step tool-calling workflows. See Eval Guide for full CLI reference.
# Ollama
python -m tests.eval.eval_runner --backend ollama --model "ministral-3:8b-instruct-2512-q4_K_M" --runs 10 --stream --verbose
# Batch eval (JSONL output, automatic resume)
python -m tests.eval.batch_eval --config all --runs 50
# Reports (ASCII table, HTML dashboard, markdown views)
python -m tests.eval.report eval_results.jsonl
Project Structure
src/forge/
__init__.py # Public API exports
errors.py # ForgeError hierarchy
server.py # setup_backend(), ServerManager, BudgetMode
core/
messages.py # Message, MessageRole, MessageType, MessageMeta
workflow.py # ToolSpec, ToolDef, ToolCall, TextResponse, Workflow
inference.py # run_inference() — shared front half (compact, fold, validate, retry)
runner.py # WorkflowRunner — the agentic loop
slot_worker.py # SlotWorker — priority-queued slot access
steps.py # StepTracker
guardrails/
nudge.py # Nudge dataclass
response_validator.py # ResponseValidator, ValidationResult
step_enforcer.py # StepEnforcer, StepCheck
error_tracker.py # ErrorTracker
clients/
base.py # ChunkType, StreamChunk, LLMClient protocol
ollama.py # OllamaClient (native FC)
llamafile.py # LlamafileClient (native FC or prompt-injected)
anthropic.py # AnthropicClient (frontier baseline)
context/
manager.py # ContextManager, CompactEvent
strategies.py # CompactStrategy, NoCompact, TieredCompact, SlidingWindowCompact
hardware.py # HardwareProfile, detect_hardware()
prompts/
templates.py # Tool prompt builders (prompt-injected path)
nudges.py # Retry and step-enforcement nudge templates
tools/
respond.py # Synthetic respond tool (respond_tool(), respond_spec())
proxy/
proxy.py # ProxyServer — programmatic start/stop API
server.py # Raw asyncio HTTP server, SSE streaming
handler.py # Request handler — bridge between HTTP and run_inference
convert.py # OpenAI messages ↔ forge Messages conversion
tests/
unit/ # 638 deterministic tests — no LLM backend required
eval/ # Eval harness — model qualification against real backends
Documentation
- User Guide — Usage patterns, multi-turn, context management, guardrails, slot worker, long-running session advisory
- Model Guide — Which model and backend for your hardware
- Backend Setup — Backend installation and server setup
- Eval Guide — Eval harness CLI reference, batch eval
- Architecture — Full design document
- Workflow Internals — Workflow design and runner internals
- Contributing — How to set up, test, and add new backends or scenarios
License
MIT — Copyright (c) 2025-2026 Antoine Zambelli
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file forge_guardrails-0.4.2.tar.gz.
File metadata
- Download URL: forge_guardrails-0.4.2.tar.gz
- Upload date:
- Size: 25.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
be3ed202b4447b86f8036c01579804ccf96d5fa9c1b38464cbaba06312317c34
|
|
| MD5 |
1899d905e6a1e8aeabc125b07e4c35e7
|
|
| BLAKE2b-256 |
61a4fbfaef6ba9ae5f2e04b9df1c1b3bd676173aee6850fb80efa1f848cf5630
|
File details
Details for the file forge_guardrails-0.4.2-py3-none-any.whl.
File metadata
- Download URL: forge_guardrails-0.4.2-py3-none-any.whl
- Upload date:
- Size: 75.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3feabaa58763aa351ac7456f382c24625e5c8d3eb8ad33b53d488353291b6bac
|
|
| MD5 |
fa150e57bfdfb27b71e9214ebab03e36
|
|
| BLAKE2b-256 |
4f3dbde2f4a5df6db65b791fc1e32a88cc966f09885638d141547d957b31aac6
|