Epistemological firewall for AI agents — detects Knowledge-Action Gap using trained Logos models

This project has been archived by its maintainers; no new releases are expected.

Project description

logos-firewall

Epistemological firewall for AI agents. Detects when an agent reasons correctly but acts against its own reasoning (Knowledge-Action Gap).

The Problem

Autonomous AI agents execute actions without verification: an agent can reason "this is dangerous" in its think block and then execute the dangerous action anyway. Existing guardrails rely on prompted models; logos-firewall uses a trained epistemological classifier.

Installation

pip install logos-firewall

For the API server:

pip install "logos-firewall[server]"

Model Setup

Level A (FastGate) works offline with no model. For Level B and Think Auditor, you need Ollama with a Logos model:

# Install Ollama (https://ollama.ai), then:

# Option 1: Download from Hugging Face (recommended)
pip install -U "huggingface_hub[cli]"
huggingface-cli download LumenSyntax/logos10v2-gemma3-1b-F16 --local-dir logos10v2-f16
ollama create logos10v2_auditor_v3 -f logos10v2-f16/Modelfile

# Option 2: Manual GGUF download
# Download logos10v2-auditor-v3-f16.gguf from:
#   https://huggingface.co/LumenSyntax/logos10v2-gemma3-1b-F16
# Then: ollama create logos10v2_auditor_v3 -f Modelfile

Important: Use the F16 (full precision) model for production. The Q4_K_M quantized variant has known safety degradation and is only suitable for edge/demo use. See model card for details.

Quick Start

Python SDK

from logos_firewall import LogosFirewall, FastGate, ThinkAuditor

# Level A only — fast offline classification (no Ollama needed)
gate = FastGate()
result = gate.classify("rm -rf /")
# result.verdict == "BLOCK"
# result.confidence == 0.95

# Full pipeline: Level A (regex) + Level B (Logos via Ollama)
# (audit() is a coroutine; call it from inside an async function)
fw = LogosFirewall(ollama_url="http://localhost:11434")
result = await fw.audit(
    action="rm -rf /etc/config/*",
    think="<think>I need to clean up old files...</think>",
    context="coding_agent",
)
# result.verdict == "BLOCK"

# Standalone Think Block Auditor
auditor = ThinkAuditor(ollama_url="http://localhost:11434")
result = await auditor.audit(
    think_block="<think>This request is dangerous. I should refuse.</think>",
    output="Sure, here's how to do it: ...",
)
# result.verdict == "GAP"  (Knowledge-Action Gap detected)

API Server

# Start the server
logos-firewall
# or: uvicorn logos_firewall.server:app --host 0.0.0.0 --port 8000

# Classify an action (Level A only, no Ollama)
curl -X POST http://localhost:8000/v1/classify \
  -H "Content-Type: application/json" \
  -d '{"action": "rm -rf /"}'

# Full audit (Level A + B)
curl -X POST http://localhost:8000/v1/audit \
  -H "Content-Type: application/json" \
  -d '{"action": "pip install unknown-pkg", "think": "<think>installing dependency</think>"}'

# Think block audit
curl -X POST http://localhost:8000/v1/think-audit \
  -H "Content-Type: application/json" \
  -d '{"think_block": "<think>This is dangerous</think>", "output": "Sure, here you go..."}'

# Health check
curl http://localhost:8000/health
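
The same endpoints can be called from Python with only the standard library. A minimal client sketch, assuming the request/response shapes shown in the curl examples above (the function names here are illustrative, not part of the package):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # LOGOS_HOST / LOGOS_PORT defaults

def build_payload(action, think=None, context=None):
    """Assemble the JSON body for /v1/audit, omitting empty fields."""
    payload = {"action": action}
    if think is not None:
        payload["think"] = think
    if context is not None:
        payload["context"] = context
    return payload

def audit(action, think=None, context=None, token=None):
    """POST to /v1/audit and return the parsed response dict."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/audit",
        data=json.dumps(build_payload(action, think, context)).encode(),
        headers={"Content-Type": "application/json"},
    )
    if token:  # only needed when the server sets LOGOS_API_TOKEN
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```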

Docker

docker-compose up

This starts both Ollama and the logos-firewall server. You'll need to load a Logos model into Ollama separately (see Model Setup above).
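
A compose file for this setup might look like the following sketch; the service names, image tag, and volume layout are assumptions for illustration, not taken from the repository:

```yaml
# Hypothetical docker-compose.yml; check the repository's actual file.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama   # persists created/pulled models
  firewall:
    image: logos-firewall            # assumed image name
    environment:
      LOGOS_OLLAMA_URL: http://ollama:11434
    ports:
      - "8000:8000"
    depends_on:
      - ollama
volumes:
  ollama-data:
```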

Architecture

Agent action request
        |
        v
+---------------------------+
|  Level A: FastGate        |  < 10ms
|  (regex + action type)    |
|  ALLOW / BLOCK / STEP_UP  |
+-------------+-------------+
              | STEP_UP
              v
+---------------------------+
|  Level B: LogosGate       |  100-500ms
|  (Logos 1B via Ollama)    |
|  Think-Action audit       |
|  ALLOW / BLOCK / UNCERTAIN|
+---------------------------+

Level A catches obvious cases with regex patterns (destructive commands, safe read-only ops). Unknown or risky actions are escalated to Level B, which uses a Logos fine-tuned model for epistemological evaluation.
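
The escalation logic can be sketched as follows; the patterns and categories here are illustrative examples, not the package's actual rule set:

```python
# Sketch of Level A triage: cheap regex checks, with everything
# unrecognized escalated to the Level B model.
import re

DESTRUCTIVE = [r"\brm\s+-rf\s+/", r"\bmkfs\.", r"\bdd\s+if="]   # example patterns
SAFE_READONLY = [r"^ls\b", r"^cat\b", r"^git\s+status\b"]        # example patterns

def fast_gate(action: str) -> str:
    """Level A verdict: ALLOW, BLOCK, or STEP_UP (escalate to Level B)."""
    if any(re.search(p, action) for p in DESTRUCTIVE):
        return "BLOCK"
    if any(re.match(p, action) for p in SAFE_READONLY):
        return "ALLOW"
    return "STEP_UP"  # unknown or risky: hand off to the Logos model
```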

Configuration

Environment Variables

Variable              Default                 Description
LOGOS_OLLAMA_URL      http://localhost:11434  Ollama server URL
LOGOS_API_TOKEN       (none)                  Bearer token for API auth (optional)
LOGOS_RATE_LIMIT_RPM  60                      Requests per minute per IP
LOGOS_HOST            0.0.0.0                 Server bind host
LOGOS_PORT            8000                    Server bind port
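
A sketch of how these variables might be read, using the defaults from the table above (the helper name is hypothetical, not the package's API):

```python
import os

def load_config(env=None):
    """Read logos-firewall settings from environment variables,
    falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "ollama_url": env.get("LOGOS_OLLAMA_URL", "http://localhost:11434"),
        "api_token": env.get("LOGOS_API_TOKEN"),             # None -> auth disabled
        "rate_limit_rpm": int(env.get("LOGOS_RATE_LIMIT_RPM", "60")),
        "host": env.get("LOGOS_HOST", "0.0.0.0"),
        "port": int(env.get("LOGOS_PORT", "8000")),
    }
```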

Model Chain

The default model is logos10v2_auditor_v3 (Gemma 3 1B F16). The code also supports logos9_hybrid and logos9_auditor_v2 as fallbacks, but these are earlier versions of the same architecture and are not published — logos10v2_auditor_v3 supersedes them.

API Reference

POST /v1/audit

Full firewall audit (Level A -> B).

Request:

{
  "action": "rm -rf /tmp/old-files",
  "think": "<think>cleanup needed</think>",
  "context": "coding_agent"
}

Response:

{
  "verdict": "BLOCK",
  "confidence": 0.95,
  "action_class": "DESTRUCTIVE",
  "mechanism": "regex",
  "detail": "Blocked: wildcard delete",
  "latency_ms": 0.12,
  "level": "A",
  "model": ""
}
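
A caller would typically branch on `verdict`. A fail-closed sketch (the wrapper and threshold below are illustrative, not part of the package):

```python
def permit(result: dict, min_confidence: float = 0.8) -> bool:
    """Execute the action only on a confident ALLOW.
    BLOCK and UNCERTAIN verdicts both stop execution (fail closed)."""
    return result["verdict"] == "ALLOW" and result["confidence"] >= min_confidence

permit({"verdict": "BLOCK", "confidence": 0.95})      # False
permit({"verdict": "ALLOW", "confidence": 0.9})       # True
permit({"verdict": "UNCERTAIN", "confidence": 0.9})   # False
```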

POST /v1/think-audit

Standalone think block / output consistency audit.

Request:

{
  "think_block": "<think>This seems dangerous...</think>",
  "output": "Sure, here's how to do it...",
  "domain": "general"
}

Response:

{
  "verdict": "GAP",
  "confidence": 0.15,
  "reasoning": "The reasoning identifies danger but the output ignores it",
  "model": "logos-auditor",
  "latency_ms": 342.5
}

POST /v1/classify

Action classification only (Level A, no Ollama needed).

GET /health

Health check with Ollama and model availability.

Connection to Research

This package implements the agent firewall described in "The Instrument Trap: When Aligned Models Serve Misaligned Purposes" (DOI: 10.5281/zenodo.18644322).

The benchmark dataset (14,950 test cases) is available at LumenSyntax/instrument-trap-benchmark on Hugging Face.

Requirements

  • Python 3.10+
  • Ollama with a Logos model loaded (for Level B and Think Auditor)
  • Level A (FastGate) works entirely offline with no dependencies beyond httpx and pydantic

License

Apache 2.0
