STOP-first, evidence-grounded document extraction with audit-grade negative proof

AJT Grounded Extract

Extract structured data only when it can be proven; otherwise stop—and prove that you stopped.

Status: Production-ready (v1.0) | Constitution: Frozen | Attack Tests: 10/10 blocked


Installation

pip install ajt-grounded-extract

Zero dependencies. Pure Python stdlib.


Philosophy: STOP-first

  • This project does not aim to extract everything.
  • Extraction occurs only when evidence is sufficient.
  • When evidence is insufficient, the system stops and proves why.
  • Evidence Integrity > Recall: Only extract values with verifiable document evidence
  • Default is STOP: When evidence is insufficient, conflicting, or missing, extraction stops
  • Negative Proof: Every STOP includes an explicit reason and preserved artifacts
  • No Fine-tuning: Rule-based and LLM extraction without training pipelines
  • Local Execution: Runs entirely on local machine

What This Is NOT

This system is blocked-by-design, not secure-by-claim.

  • ❌ Multi-domain rule engine
  • ❌ Enterprise extraction with thresholds
  • ❌ Training/fine-tuning pipeline
  • ❌ High-recall extraction system
  • ❌ "Secure" or "safe" (we demonstrate how attacks are blocked, not claim safety)

What we guarantee:

  • ✅ Stoppability (DEFAULT: STOP)
  • ✅ Traceability (decision_maker required)
  • ✅ Audit trail (write-once logs)
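
One way to picture a write-once log is a writer that refuses to touch an existing file. The sketch below is an illustration under assumptions, not the project's archive code: the function name write_once, the file name, and the record contents are hypothetical. It relies only on the standard library, where opening a file in "x" mode raises FileExistsError if the path already exists.

```python
import hashlib
import json
import os
import tempfile


def write_once(path: str, record: dict) -> str:
    """Write a record to a new file and return the SHA-256 of the bytes written.

    Mode "xb" makes open() raise FileExistsError when the path exists,
    so this function never overwrites an earlier artifact.
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    with open(path, "xb") as fh:
        fh.write(payload)
    return hashlib.sha256(payload).hexdigest()


# Demo in a temporary directory (hypothetical file name and record).
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "stop_event.json")
    digest = write_once(target, {"decision": "STOP", "decision_maker": "demo"})
    print("integrity hash:", digest)
    try:
        write_once(target, {"decision": "ACCEPT", "decision_maker": "demo"})
    except FileExistsError:
        print("second write refused")
```

The returned digest can be stored in a manifest so a later reader can confirm the artifact was not altered after it was written.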

Architecture

Document → Ingest → Extract    → Ground   → Judge → Archive
           ↓        ↓            ↓          ↓       ↓
           Hash     Candidates   Evidence   STOP?   Artifacts

Pipeline Stages

  1. Ingest: Load document, compute hash, build line index
  2. Extract: Find candidate values (rule-based or LLM)
  3. Ground: Map each value to exact document span (quote + offsets)
  4. Judge: STOP-first decision: ACCEPT | STOP | NEED_REVIEW
  5. Archive: Write-once artifacts with timestamps + integrity hashes
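
The Ingest and Ground stages above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the contents of engine/ingest.py or engine/ground.py: the function names, the dict layout, and the choice of character offsets with a half-open [start, end) span are all assumptions.

```python
import bisect
import hashlib


def ingest(text: str) -> dict:
    """Hash the document and record the character offset where each line starts."""
    line_starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            line_starts.append(i + 1)
    return {
        "text": text,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "line_starts": line_starts,
    }


def ground(doc: dict, value: str):
    """Return quote, half-open [start, end) offsets, and 1-based line, or None."""
    if not value:
        return None
    start = doc["text"].find(value)
    if start == -1:
        return None  # no verbatim match, so the value cannot be grounded
    # The number of line starts at or before `start` is the 1-based line number.
    line = bisect.bisect_right(doc["line_starts"], start)
    return {"quote": value, "start": start, "end": start + len(value), "line": line}


doc = ingest("Service Agreement\nEffective Date: 01/15/2025\n")
print(ground(doc, "01/15/2025"))
# {'quote': '01/15/2025', 'start': 34, 'end': 44, 'line': 2}
print(ground(doc, "02/01/2025"))
# None
```

Slicing the text with the returned offsets gives back the quote exactly, which is the property a later verifier would check.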

Decision Taxonomy

  • ACCEPT: Evidence found, confidence sufficient, integrity verified
  • STOP: No candidates, conflict, low confidence, or integrity failure
  • NEED_REVIEW: Edge cases requiring human judgment
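
The taxonomy can be read as a function that checks every STOP condition before it considers ACCEPT. The sketch below is hypothetical: the Candidate class, the threshold of 0.8, the review margin, and every reason string other than no_candidates_found are assumptions chosen for illustration. Treating borderline confidence as the NEED_REVIEW case is likewise one possible reading of "edge cases".

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    value: str
    confidence: float
    grounded: bool  # True if an exact quote with offsets was found


def judge(candidates: list, min_confidence: float = 0.8,
          review_margin: float = 0.05) -> dict:
    """STOP-first decision: every STOP path is checked before ACCEPT."""
    if not candidates:
        return {"decision": "STOP", "stop_reason": "no_candidates_found"}
    if len({c.value for c in candidates}) > 1:
        return {"decision": "STOP", "stop_reason": "conflicting_values"}
    best = max(candidates, key=lambda c: c.confidence)
    if not best.grounded:
        return {"decision": "STOP", "stop_reason": "integrity_failure"}
    if best.confidence >= min_confidence:
        return {"decision": "ACCEPT", "value": best.value}
    if best.confidence >= min_confidence - review_margin:
        return {"decision": "NEED_REVIEW", "reason": "borderline_confidence"}
    return {"decision": "STOP", "stop_reason": "low_confidence"}


print(judge([]))
# {'decision': 'STOP', 'stop_reason': 'no_candidates_found'}
print(judge([Candidate("01/15/2025", 0.9, True)]))
# {'decision': 'ACCEPT', 'value': '01/15/2025'}
print(judge([Candidate("01/15/2025", 0.9, True), Candidate("02/01/2025", 0.9, True)]))
# {'decision': 'STOP', 'stop_reason': 'conflicting_values'}
```

Ordering the checks this way means ACCEPT is reachable only after each STOP condition has been ruled out, which matches the STOP-first stance described earlier.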

Quick Start

Run Extraction

# ACCEPT case (has clear "Effective Date: 01/15/2025")
python run.py examples/accept_example.txt

# STOP case (no explicit effective date)
python run.py examples/stop_example.txt

View Results

Open generated HTML viewer:

open viewer/accept_example_viewer.html
open viewer/stop_example_viewer.html

Output Format

JSON Result

{
  "field_name": "effective_date",
  "decision": "ACCEPT",
  "value": "01/15/2025",
  "evidence": {
    "quote": "01/15/2025",
    "start": 245,
    "end": 255,
    "line": 12,
    "context": "...Effective Date: 01/15/2025..."
  },
  "confidence": 0.9
}

STOP Event

{
  "field_name": "effective_date",
  "decision": "STOP",
  "value": null,
  "stop_reason": "no_candidates_found",
  "stop_proof": {
    "searched": true,
    "candidates_found": 0
  }
}
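
A consumer of these records might want a quick shape check before trusting them. The following sketch is an assumption-laden example, not a validator shipped with the package: check_record and its messages are hypothetical, and it assumes offsets are half-open and counted in the same units as the quote length (true for the ASCII quote shown above).

```python
import json


def check_record(record: dict) -> list:
    """Return a list of consistency problems; an empty list means none found."""
    problems = []
    decision = record.get("decision")
    if decision == "ACCEPT":
        ev = record.get("evidence") or {}
        quote = ev.get("quote")
        if record.get("value") is None or quote is None:
            problems.append("ACCEPT without value or quote")
        elif ev.get("end", 0) - ev.get("start", 0) != len(quote):
            problems.append("offset span does not match quote length")
    elif decision == "STOP":
        if record.get("value") is not None:
            problems.append("STOP must not carry a value")
        if not record.get("stop_reason"):
            problems.append("STOP without stop_reason")
    elif decision == "NEED_REVIEW":
        pass  # the README shows no NEED_REVIEW record, so nothing is checked
    else:
        problems.append(f"unexpected decision: {decision!r}")
    return problems


stop_event = json.loads("""
{"field_name": "effective_date", "decision": "STOP", "value": null,
 "stop_reason": "no_candidates_found",
 "stop_proof": {"searched": true, "candidates_found": 0}}
""")
print(check_record(stop_event))
# []
```

For the ACCEPT example above, start 245 and end 255 span 10 characters, the length of "01/15/2025", so it passes the same check.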

HTML Viewer Features

  • Evidence Highlighting: Green (ACCEPT) / Red (STOP)
  • Navigation Sidebar: Jump to extracted fields
  • "Why Stopped" Panel: Explicit reasons with proof artifacts
  • Offset Mapping: Click evidence span → see exact document location

Directory Structure

ajt-grounded-extract/
├── schema/              # Field definitions
├── engine/              # Core extraction modules
│   ├── ingest.py
│   ├── extract.py
│   ├── ground.py
│   ├── judge.py
│   └── archive.py
├── viewer/              # HTML viewer generator
├── evidence/            # Write-once artifacts (JSONL + manifests)
├── examples/            # Demo documents
└── run.py               # CLI entry point

Evidence Requirements

All extractions must satisfy:

  • require_exact_quote: Value must appear verbatim in document
  • require_offset_mapping: Quote mapped to byte offsets
  • stop_on_conflict: Multiple conflicting values → STOP
  • min_confidence: Below threshold → STOP
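
The first two requirements, exact quote and byte-offset mapping, interact once a document contains non-ASCII text, because byte offsets and character offsets diverge. The sketch below assumes offsets are measured against the UTF-8 encoding of the document, which the list above does not specify; the function names and the sample text are hypothetical.

```python
def byte_span(text: str, quote: str):
    """Return half-open UTF-8 byte offsets of the first verbatim match, or None."""
    data = text.encode("utf-8")
    needle = quote.encode("utf-8")
    if not needle:
        return None
    start = data.find(needle)
    if start == -1:
        return None  # quote is not in the document verbatim
    return start, start + len(needle)


def quote_at(text: str, start: int, end: int) -> str:
    """Recover the quoted text from byte offsets, to re-verify stored evidence."""
    return text.encode("utf-8")[start:end].decode("utf-8")


text = "Société Générale\nEffective Date: 01/15/2025\n"
print(byte_span(text, "01/15/2025"))
# (37, 47)
print(text.find("01/15/2025"))
# 33
print(quote_at(text, 37, 47))
# 01/15/2025
```

The four accented characters on the first line each take two bytes in UTF-8, which is why the byte offset (37) is four larger than the character offset (33). A verifier that mixes the two units would reject valid evidence or accept a shifted span.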

Acceptance Criteria

  • Demo shows at least one ACCEPT and one STOP
  • STOP includes explicit reason and preserved artifacts
  • Viewer navigates evidence spans correctly
  • Non-goals stated explicitly

Regulatory Mapping & Review

This system includes industry-specific regulatory risk mappings for:

  • Financial Services — Authorization scope, customer isolation, advisory vs execution separation
  • Healthcare — Patient data isolation, complete clinical evidence requirements, clinician traceability
  • Legal Practice — Attorney responsibility, client-matter isolation, conflict-of-interest prevention

Navigation: See REGULATORY_REVIEW_GUIDE.md for audience-specific entry points.

Principle: This project demonstrates how specified risks are blocked. It does not claim regulatory compliance.


Reference

Motivated by ajt-negative-proof-sim (sealed reference).

Core principle: Prove extraction succeeded OR prove why you stopped.


License

MIT
