
Epistemological firewall for AI agents: detects the Knowledge-Action Gap using trained Logos models

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.

Project description

logos-firewall

Epistemological firewall for AI agents. Detects when an agent reasons correctly but acts against its own reasoning (Knowledge-Action Gap).

The Problem

Autonomous AI agents often execute actions without independent verification. An agent can reason "this is dangerous" in its think block and then execute the dangerous action anyway. Many existing guardrails rely on prompted models; logos-firewall instead uses a trained epistemological classifier.

Installation

pip install logos-firewall

For the API server:

pip install "logos-firewall[server]"

Quick Start

Python SDK

import asyncio

from logos_firewall import LogosFirewall, FastGate, ThinkAuditor

# Level A only: fast offline classification (no Ollama needed)
gate = FastGate()
result = gate.classify("rm -rf /")
# result.verdict == "BLOCK"
# result.confidence == 0.95


async def main():
    # Full pipeline: Level A (regex) + Level B (Logos via Ollama)
    fw = LogosFirewall(ollama_url="http://localhost:11434")
    result = await fw.audit(
        action="rm -rf /etc/config/*",
        think="<think>I need to clean up old files...</think>",
        context="coding_agent",
    )
    # result.verdict == "BLOCK"

    # Standalone Think Block Auditor
    auditor = ThinkAuditor(ollama_url="http://localhost:11434")
    result = await auditor.audit(
        think_block="<think>This request is dangerous. I should refuse.</think>",
        output="Sure, here's how to do it: ...",
    )
    # result.verdict == "GAP"  (Knowledge-Action Gap detected)


# audit() is a coroutine, so it must run inside an event loop
asyncio.run(main())
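In an agent loop, the audit would normally sit between the model's proposed action and its execution. The sketch below relies only on the call shape shown in the example above (an awaitable `audit(action=..., think=..., context=...)` whose result has a `verdict` field). `guarded_execute`, `StubFirewall`, and `fake_run` are hypothetical names, and treating every non-`ALLOW` verdict as a refusal is a fail-closed design choice assumed here, not documented library behavior.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, Protocol


class HasVerdict(Protocol):
    verdict: str


class Firewall(Protocol):
    # Same call shape as LogosFirewall.audit() in the example above
    async def audit(self, *, action: str, think: str, context: str) -> HasVerdict: ...


async def guarded_execute(
    fw: Firewall,
    action: str,
    think: str,
    run: Callable[[str], Awaitable[str]],
    context: str = "coding_agent",
) -> str:
    """Execute `action` only when the firewall returns ALLOW (hypothetical helper)."""
    result = await fw.audit(action=action, think=think, context=context)
    if result.verdict != "ALLOW":
        # Fail closed: BLOCK, UNCERTAIN, or any other verdict stops execution
        return f"refused ({result.verdict})"
    return await run(action)


# --- Usage with a stub in place of LogosFirewall ---
@dataclass
class StubResult:
    verdict: str


class StubFirewall:
    async def audit(self, *, action: str, think: str, context: str) -> StubResult:
        # Toy rule for the demo only; the real firewall uses regex + Logos
        return StubResult("BLOCK" if "rm -rf" in action else "ALLOW")


async def fake_run(action: str) -> str:
    return f"ran {action}"


print(asyncio.run(guarded_execute(StubFirewall(), "ls -la", "<think>list files</think>", fake_run)))
# ran ls -la
print(asyncio.run(guarded_execute(StubFirewall(), "rm -rf /", "<think>cleanup</think>", fake_run)))
# refused (BLOCK)
```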

API Server

# Start the server
logos-firewall
# or: uvicorn logos_firewall.server:app --host 0.0.0.0 --port 8000

# Classify an action (Level A only, no Ollama)
curl -X POST http://localhost:8000/v1/classify \
  -H "Content-Type: application/json" \
  -d '{"action": "rm -rf /"}'

# Full audit (Level A + B)
curl -X POST http://localhost:8000/v1/audit \
  -H "Content-Type: application/json" \
  -d '{"action": "pip install unknown-pkg", "think": "<think>installing dependency</think>"}'

# Think block audit
curl -X POST http://localhost:8000/v1/think-audit \
  -H "Content-Type: application/json" \
  -d '{"think_block": "<think>This is dangerous</think>", "output": "Sure, here you go..."}'

# Health check
curl http://localhost:8000/health

Docker

docker-compose up

This starts both Ollama and the logos-firewall server. You'll need to pull a Logos model into Ollama separately:

# From the Ollama container or host
ollama pull logos10v2_auditor_v3

Architecture

Agent action request
        |
        v
+----------------------------+
|  Level A: FastGate         |  < 10ms
|  (regex + action type)     |
|  ALLOW / BLOCK / STEP_UP   |
+-------------+--------------+
              | STEP_UP
              v
+----------------------------+
|  Level B: LogosGate        |  100-500ms
|  (Logos 1B via Ollama)     |
|  Think-Action audit        |
|  ALLOW / BLOCK / UNCERTAIN |
+----------------------------+

Level A resolves obvious cases with regex patterns: clearly destructive commands are blocked and safe read-only operations are allowed. Unknown or risky actions are marked STEP_UP and escalated to Level B, which runs a fine-tuned Logos model to perform an epistemological think-action audit.
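The escalation logic can be sketched in a few lines. This is an illustration under stated assumptions, not FastGate's implementation: the regex rules below are hypothetical examples, and the `level_b` callable stands in for the Ollama-backed Logos call.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical Level A rules; FastGate's actual patterns are not shown on this page
BLOCK_PATTERNS = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),  # recursive delete of the filesystem root
    re.compile(r"\brm\s+-rf\s+\S*\*"),    # wildcard recursive delete
    re.compile(r"\bmkfs(\.\w+)?\b"),      # format a filesystem
]
ALLOW_PATTERNS = [
    re.compile(r"^(ls|cat|pwd|echo|git\s+status)\b"),  # read-only commands
]
# Shell chaining/redirection characters: such commands never take the fast ALLOW path
SHELL_META = re.compile(r"[;&|`$<>]")


@dataclass
class Verdict:
    verdict: str  # ALLOW / BLOCK from Level A, or whatever Level B returns
    level: str    # "A" or "B"


def level_a(action: str) -> str:
    """Return ALLOW, BLOCK, or STEP_UP using regex rules only."""
    if any(p.search(action) for p in BLOCK_PATTERNS):
        return "BLOCK"
    if not SHELL_META.search(action) and any(p.search(action.strip()) for p in ALLOW_PATTERNS):
        return "ALLOW"
    return "STEP_UP"  # unknown or risky: escalate


def firewall(action: str, think: str, level_b: Callable[[str, str], str]) -> Verdict:
    decision = level_a(action)
    if decision != "STEP_UP":
        return Verdict(decision, "A")
    # Only escalated actions pay for the slower model call
    return Verdict(level_b(action, think), "B")


def fake_logos(action: str, think: str) -> str:
    # Stand-in for the Level B model call
    return "UNCERTAIN"


print(firewall("rm -rf /", "", fake_logos))                 # Verdict(verdict='BLOCK', level='A')
print(firewall("ls -la", "", fake_logos))                   # Verdict(verdict='ALLOW', level='A')
print(firewall("pip install unknown-pkg", "", fake_logos))  # Verdict(verdict='UNCERTAIN', level='B')
```

Checking BLOCK rules before ALLOW rules, and denying the fast ALLOW path to compound shell commands, keeps a read-only prefix such as `ls;` from smuggling another command past Level A.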

Configuration

Environment Variables

Variable               Default                  Description
LOGOS_OLLAMA_URL       http://localhost:11434   Ollama server URL
LOGOS_API_TOKEN        (none)                   Bearer token for API auth (optional)
LOGOS_RATE_LIMIT_RPM   60                       Requests per minute per IP
LOGOS_HOST             0.0.0.0                  Server bind host
LOGOS_PORT             8000                     Server bind port
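A deployment script that needs the same values could resolve them with the documented defaults as sketched below. `Settings` and `from_env` are hypothetical names, not part of the logos-firewall API, and the package's own loader may handle details (such as invalid integers) differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_url: str
    api_token: Optional[str]
    rate_limit_rpm: int
    host: str
    port: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Defaults mirror the environment-variable table above;
        # an unset or empty token is treated as "auth disabled"
        return cls(
            ollama_url=env.get("LOGOS_OLLAMA_URL", "http://localhost:11434"),
            api_token=env.get("LOGOS_API_TOKEN") or None,
            rate_limit_rpm=int(env.get("LOGOS_RATE_LIMIT_RPM", "60")),
            host=env.get("LOGOS_HOST", "0.0.0.0"),
            port=int(env.get("LOGOS_PORT", "8000")),
        )


# Passing an explicit mapping makes the defaults easy to inspect
print(Settings.from_env({"LOGOS_PORT": "9000"}))
```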

Model Chain

By default, logos-firewall tries these Logos models in order:

  1. logos10v2_auditor_v3 (Gemma 3 1B, recommended)
  2. logos9_hybrid (Gemma 3 1B, fallback)
  3. logos9_auditor_v2 (Gemma 3 1B, fallback)

The Think Block Auditor prefers the 9B model (logos-auditor) for higher accuracy.
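The fallback order can be sketched as follows: ask Ollama's `GET /api/tags` endpoint which models are installed locally and take the first chain entry that is present. This illustrates the selection logic only and is not the package's code; `pick_model` and `installed_models` are hypothetical helpers, and treating `name` and `name:latest` as the same model is an assumption about how the Logos models are tagged.

```python
import json
import urllib.request
from typing import Iterable, Optional

# Default Level B chain, in the documented order of preference
MODEL_CHAIN = ["logos10v2_auditor_v3", "logos9_hybrid", "logos9_auditor_v2"]


def _base(name: str) -> str:
    # Ollama lists untagged pulls as "name:latest"
    return name[: -len(":latest")] if name.endswith(":latest") else name


def pick_model(available: Iterable[str], chain: Iterable[str] = MODEL_CHAIN) -> Optional[str]:
    """Return the first model in `chain` that is installed, else None."""
    installed = {_base(n) for n in available}
    for model in chain:
        if _base(model) in installed:
            return model
    return None


def installed_models(ollama_url: str = "http://localhost:11434") -> list[str]:
    # GET /api/tags returns {"models": [{"name": ...}, ...]} for local models
    with urllib.request.urlopen(f"{ollama_url}/api/tags", timeout=5) as resp:
        data = json.load(resp)
    return [m["name"] for m in data.get("models", [])]


# Offline example: only the second chain entry is installed
print(pick_model(["logos9_hybrid:latest", "llama3:8b"]))  # logos9_hybrid
```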

API Reference

POST /v1/audit

Full firewall audit (Level A -> B).

Request:

{
  "action": "rm -rf /tmp/old-files",
  "think": "<think>cleanup needed</think>",
  "context": "coding_agent"
}

Response:

{
  "verdict": "BLOCK",
  "confidence": 0.95,
  "action_class": "DESTRUCTIVE",
  "mechanism": "regex",
  "detail": "Blocked: wildcard delete",
  "latency_ms": 0.12,
  "level": "A",
  "model": ""
}
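A minimal Python client for this endpoint, using only the standard library. The `AuditResponse` dataclass mirrors the example response above; `audit_action` and `parse_audit` are hypothetical helpers, and sending the token as a standard `Authorization: Bearer` header is an assumption based on the `LOGOS_API_TOKEN` description.

```python
import json
import urllib.request
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class AuditResponse:
    # Field names and types follow the example response above
    verdict: str
    confidence: float
    action_class: str
    mechanism: str
    detail: str
    latency_ms: float
    level: str
    model: str


def parse_audit(data: dict[str, Any]) -> AuditResponse:
    # Keep only the documented keys so extra server fields don't break parsing
    known = {f.name for f in fields(AuditResponse)}
    return AuditResponse(**{k: v for k, v in data.items() if k in known})


def audit_action(
    action: str,
    think: str,
    context: Optional[str] = None,
    base_url: str = "http://localhost:8000",
    token: Optional[str] = None,
) -> AuditResponse:
    """POST /v1/audit and return the parsed verdict (hypothetical helper)."""
    payload: dict[str, str] = {"action": action, "think": think}
    if context is not None:
        payload["context"] = context
    headers = {"Content-Type": "application/json"}
    if token:
        # Assumed to be needed only when the server sets LOGOS_API_TOKEN
        headers["Authorization"] = f"Bearer {token}"
    req = urllib.request.Request(
        f"{base_url}/v1/audit",
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_audit(json.load(resp))
```

With the server running, `audit_action("rm -rf /tmp/old-files", "<think>cleanup needed</think>", context="coding_agent").verdict` returns the verdict string. Non-2xx responses surface as `urllib.error.HTTPError`.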

POST /v1/think-audit

Standalone think block / output consistency audit.

Request:

{
  "think_block": "<think>This seems dangerous...</think>",
  "output": "Sure, here's how to do it...",
  "domain": "general"
}

Response:

{
  "verdict": "GAP",
  "confidence": 0.15,
  "reasoning": "The reasoning identifies danger but the output ignores it",
  "model": "logos-auditor",
  "latency_ms": 342.5
}

POST /v1/classify

Action classification only (Level A, no Ollama needed).

GET /health

Health check with Ollama and model availability.

Connection to Research

This package implements the agent firewall described in "The Instrument Trap: When Aligned Models Serve Misaligned Purposes" (DOI: 10.5281/zenodo.18644322).

The benchmark dataset (14,950 test cases) is available at LumenSyntax/instrument-trap-benchmark on Hugging Face.

Requirements

  • Python 3.10+
  • Ollama with a Logos model loaded (for Level B and Think Auditor)
  • Level A (FastGate) works entirely offline with no dependencies beyond httpx and pydantic

License

Apache 2.0


Download files


Source Distribution

logos_firewall-0.1.0.tar.gz (18.4 kB view details)

Uploaded Source

Built Distribution


logos_firewall-0.1.0-py3-none-any.whl (19.7 kB view details)

Uploaded Python 3

File details

Details for the file logos_firewall-0.1.0.tar.gz.

File metadata

  • Download URL: logos_firewall-0.1.0.tar.gz
  • Upload date:
  • Size: 18.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for logos_firewall-0.1.0.tar.gz
Algorithm     Hash digest
SHA256        41e7f38579269e4ae45928455abf09a297ccf31d47027e15c082f6d9f733b7d4
MD5           8cebe160ee605004e20e644aa43abbba
BLAKE2b-256   cb5e75995b0ff6f8edd8b60893140f086f90fd73586d9bcc3fe2738d453045a8


File details

Details for the file logos_firewall-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: logos_firewall-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 19.7 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for logos_firewall-0.1.0-py3-none-any.whl
Algorithm     Hash digest
SHA256        692ac12e60f26b78974466eeda9e9c2b81f2f9a00aa6abf77fa0465a83abdf20
MD5           d6b8996bcdf96f99372be4dfecbb5549
BLAKE2b-256   5d2c4fd671f1692173cfd777bbceba99c8d1d903c0f039ce8ea43e88dde7aa05

