An agent regression firewall: replay saved agent traces and flag regressions by checking requirements, not text diffs. PASS / FAIL / UNCERTAIN.

reqfence

An agent regression firewall. When you change a prompt, model, or tool, reqfence replays saved agent traces and flags regressions by checking whether outputs still satisfy their requirements — not by text-diffing. Every check returns PASS / FAIL / UNCERTAIN.

Standalone package. Does not depend on ariadx or fie-sdk: the requirement critic + trace schema are vendored and dependency-cleaned. Milestone 1 (CLI).

Why two tiers

The Milestone 0 derisking experiment (🟢 GREEN) confirmed that a text diff can't tell a harmless reword from a confidently wrong answer. reqfence therefore runs two tiers over developer-declared requirements.

Each declared requirement has exactly one owner, determined by whether it can be checked deterministically:

| Tier | Owns | Role |
| --- | --- | --- |
| Deterministic (checks.py) | every checkable item (JSON-valid, field-present, tool-called, word-count, …) | Primary hard gate, ~100% precision by construction |
| Semantic (semantic.py) | only the uncheckable items (factual correctness) | Catches confidently-wrong outputs; abstains (UNCERTAIN) when the judge isn't unanimous |

The semantic judge is never asked to grade a checkable item — that alone removed the false alarms an earlier "grade everything" design produced (the LLM can't reliably count words). See RESULTS.md.
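The abstention rule for the semantic tier can be illustrated with a small sketch. It assumes the judge is sampled several times and each sample yields one verdict; the names `Verdict` and `unanimous_verdict` are hypothetical and are not taken from semantic.py.

```python
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNCERTAIN = "UNCERTAIN"


def unanimous_verdict(votes: list[Verdict]) -> Verdict:
    """Return the shared verdict if every judge sample agrees, else abstain.

    Assumption: an empty vote list (judge unavailable) also abstains.
    """
    if not votes:
        return Verdict.UNCERTAIN
    first = votes[0]
    if all(vote == first for vote in votes):
        return first
    # Any disagreement between samples means the tier abstains.
    return Verdict.UNCERTAIN
```

Under this rule a single dissenting sample is enough to turn a would-be FAIL into UNCERTAIN, which matches the stated goal of never failing a build on a shaky semantic judgment.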

Final verdict (engine.py, schema.combine): each requirement resolves to one PASS/FAIL/UNCERTAIN; the candidate FAILs if any requirement fails, PASSes iff all pass, else UNCERTAIN. A semantic UNCERTAIN never fails the build; a deterministic FAIL always does.
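The combination rule above fits in a few lines. This is an illustrative reimplementation, not the code in engine.py or schema.combine; the function name `combine_verdicts` and the choice to return UNCERTAIN for an empty checklist are assumptions.

```python
def combine_verdicts(results: list[str]) -> str:
    """Fold per-requirement verdicts into one candidate verdict.

    Each item is expected to be "PASS", "FAIL", or "UNCERTAIN".
    """
    # Any failing requirement fails the candidate.
    if any(result == "FAIL" for result in results):
        return "FAIL"
    # PASS only when there is at least one requirement and all of them pass.
    if results and all(result == "PASS" for result in results):
        return "PASS"
    # Otherwise at least one requirement abstained (or the list was empty).
    return "UNCERTAIN"
```

FAIL takes priority over UNCERTAIN, so a deterministic failure is never masked by a semantic abstention on another requirement.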

Install

pip install -e ".[groq]"     # or ".[anthropic]"; core installs with just pydantic+click

Python ≥ 3.11 (uses stdlib tomllib).

The three commands

reqfence init

Scaffolds reqfence.toml + empty fixtures.jsonl / candidates.jsonl.

reqfence record — save a baseline

Stores a frozen baseline trace + its developer-declared requirement checklist. Ingests an already-captured trace (it does not execute an agent):

# requirements.json: [{"id":"json","desc":"valid JSON","check":{"type":"valid_json"}}, ...]
reqfence record --id weather --task "Return weather as JSON" \
  --requirements requirements.json --from-trace baseline_trace.json
# or convert a framework trace:
reqfence record --id t1 --task "..." --requirements reqs.json --from-langgraph messages.json
reqfence record --id t1 --task "..." --requirements reqs.json --from-openai steps.json --openai-format run_steps
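A requirements file could be generated and sanity-checked from Python before it is passed to `--requirements`. The sketch below uses only the `id`, `desc`, and `check.type` keys shown in the comment above. The helper name `validate_requirements` and the bare `{"type": "semantic"}` entry are assumptions, and any extra parameters that individual checks accept are not shown because they are not documented here.

```python
import json


def validate_requirements(requirements: list[dict]) -> None:
    """Raise ValueError if an entry lacks the documented keys or reuses an id."""
    seen_ids = set()
    for index, req in enumerate(requirements):
        for key in ("id", "desc", "check"):
            if key not in req:
                raise ValueError(f"requirement {index} is missing {key!r}")
        if "type" not in req["check"]:
            raise ValueError(f"requirement {req['id']!r} has no check type")
        if req["id"] in seen_ids:
            raise ValueError(f"duplicate requirement id {req['id']!r}")
        seen_ids.add(req["id"])


requirements = [
    {"id": "json", "desc": "valid JSON", "check": {"type": "valid_json"}},
    # Assumed shape for an item that only the semantic tier can judge.
    {"id": "facts", "desc": "weather facts are correct", "check": {"type": "semantic"}},
]
validate_requirements(requirements)
requirements_json = json.dumps(requirements, indent=2)
# Write requirements_json to requirements.json before running `reqfence record`.
```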

reqfence check — gate a change

Replays candidate traces against baselines, runs both tiers, prints a per-requirement table, and exits non-zero if any FAIL (UNCERTAIN does not):

reqfence check                       # uses paths from reqfence.toml
reqfence check --no-semantic         # deterministic gate only (no API key needed)

The semantic tier runs only when it is enabled and a key is present in the environment (GROQ_API_KEY / ANTHROPIC_API_KEY). Keys are read from the environment; for convenience, check also loads a nearby .env file, but never prints or writes it.
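Because check signals a regression through its exit status, a CI script only needs to inspect the return code. The wrapper below is a minimal sketch; `run_gate` is a hypothetical helper, and the only reqfence behaviour it relies on is the exit-code contract described above.

```python
import subprocess
import sys


def run_gate(cmd: list[str]) -> bool:
    """Run a gate command and report whether it exited with status 0."""
    completed = subprocess.run(cmd, check=False)
    return completed.returncode == 0


# Example use in CI (assumes reqfence is installed and on PATH):
#   ok = run_gate(["reqfence", "check", "--no-semantic"])
#   sys.exit(0 if ok else 1)
```

Running the deterministic tier alone keeps the CI job free of API keys; the semantic tier can then run in a separate, optional job.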

Requirement checks (catalog)

Core six (the reliable gate, unit-tested for precision): valid_json, contains_substring (+ regex), max_words, contains_field, tool_called, no_tool_error. Extended (thin, tested): min_words, min_sources, json_array_len, file_written. Special: semantic — always abstains deterministically; only the LLM tier judges it.
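To make the idea of a deterministic check concrete, here is a rough sketch of what three of the core checks might look like. These are illustrative stand-ins, not the functions in checks.py; the signatures, the parameter names `limit` and `field`, and the whitespace-based word count are all assumptions.

```python
import json


def valid_json(output: str) -> bool:
    """True if the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def max_words(output: str, limit: int) -> bool:
    """True if the output has at most `limit` whitespace-separated words."""
    return len(output.split()) <= limit


def contains_field(output: str, field: str) -> bool:
    """True if the output is a JSON object with `field` as a top-level key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and field in data
```

Each function is a pure predicate over the output string, which is what lets this tier act as a hard gate without calling a model.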

Fixtures format

Versioned JSONL, one record per line (fixtures.jsonl = baselines + checklists, candidates.jsonl = labeled candidate traces). The Milestone 0 benchmark is migrated in under fixtures/ via python scripts/migrate_m0.py. The format is a first-class artifact designed to grow.
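Reading a JSONL file takes only a short loop. The sketch below is a generic parser; the helper name `parse_jsonl` is hypothetical, and it makes no assumptions about which fields a fixture record contains, since the record schema is not described here.

```python
import json
from collections.abc import Iterable


def parse_jsonl(lines: Iterable[str]) -> list[dict]:
    """Parse one JSON object per non-blank line, reporting the bad line number."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {line_no}: invalid JSON") from exc
    return records


# Example use with a file on disk:
#   with open("fixtures.jsonl", encoding="utf-8") as fh:
#       fixtures = parse_jsonl(fh)
```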

Tests

pip install -e ".[dev]" && pytest      # 26 tests: checks, union/abstention, fixtures, CLI
