
provenex-core


Policy enforcement for AI data access, with cryptographic proof.

Buyer framing. Platform engineering champions Provenex (a runtime guardrail they don't have to build). Security signs off (cryptographic enforcement, not promises). Compliance consumes the output (a queryable, exportable, regulator-ready record). Three reinforcing budget lines, faster close than a compliance-only sale.

Provenex is the policy enforcement layer for AI data access. You declare your security policy once — in our native YAML config (or OPA/Rego, commercial) — and Provenex enforces it on every retrieval, then emits a cryptographically signed receipt that proves which chunks were allowed, which were blocked, and under what policy.

Scope of this repo. provenex-core is the retrieval primitive — Phase 1 of the broader vision: enforce policy on what an AI system reads. Agentic tool-call enforcement (the "can this agent access Jira / Salesforce / this connector" question, anchored on the MCP ecosystem) is Phase 2 and lives in a separate Provenex repository on the same policy-and-proof spine. Provenex is always decision and proof, not execution — an admission controller for AI data access, not a proxy that brokers calls or holds tokens.

This repository contains the open source core: fingerprinting, local SQLite index, the native YAML policy DSL, receipt generation, and integrations for LangChain / LangGraph / LlamaIndex / CrewAI. The algorithm is open so it can be audited. Hosted infrastructure, the Rego adapter, the OPA service adapter, Bloom-filter acceleration, compliance-grade exports, and cross-enterprise policy interoperability are available separately at provenex.ai.

What you declare. What you get back.

A unified policy file:

version: 1
policy_id: hr-corpus-retrieval-v3

# Five-outcome verification gate
verification:
  block_unauthorized: true
  block_tampered: true
  block_stale: false

# Data-access rules
access_control:
  rules:
    - name: jurisdiction_eu_only
      when:
        request.jurisdiction: EU
      require:
        chunk.metadata.residency:
          in: [EU, EEA]
      on_violation: deny

    - name: pii_classification_gate
      when:
        chunk.metadata.contains_pii: true
      require:
        request.caller.role:
          in: [hr_admin, payroll]
      on_violation: deny

    - name: freshness_for_policy_corpus
      when:
        chunk.metadata.corpus: policy_documents
      require:
        chunk.ingested_at:
          not_older_than: 90d
      on_violation: deny

  defaults:
    unknown_metadata: deny

A signed receipt per retrieval — verifiable offline by anyone with the public key:

{
  "receipt_id": "prx_f2de431dc125ccfc6b57e6ca327fa504",
  "schema_version": "2.1.0",
  "issuer": "provenex-core/0.4.0",
  "output": { "hash": "sha256:...", "hash_algorithm": "sha256" },
  "sources": [
    { "chunk_index": 0, "fingerprint": "sha256:1ebcde39...",
      "verification_outcome": "VERIFIED", "...": "..." }
  ],
  "policy": {
    "verification": { "block_unauthorized": true, "block_tampered": true, "...": "..." },
    "access_control": {
      "evaluator": "native_yaml",
      "policy_id": "hr-corpus-retrieval-v3",
      "policy_version_hash": "sha256:e10b1df5...",
      "policy_in_transparency_log": false,
      "decisions": [
        {
          "chunk_fingerprint": "sha256:1ebcde39...",
          "decision": "allow",
          "rules_fired": ["jurisdiction_eu_only", "freshness_for_policy_corpus"],
          "inputs_hash": "sha256:a3f9c2d1...",
          "inputs": { "chunk_metadata": { "...": "..." }, "request_context": { "...": "..." } }
        }
      ]
    }
  },
  "summary": { "total_chunks": 3, "verified": 2, "unverified": 1, "overall_status": "PARTIAL" },
  "signature": { "algorithm": "hmac-sha256", "value": "fc5d40895ca2..." }
}

A chunk reaches the LLM only if it clears both gates: the verification policy AND the access-control policy. The receipt records both verdicts per chunk so an auditor can reason about them independently — and the signature covers everything.

Where Provenex fits in your stack

Standard RAG:
  documents ─▶ chunker ─▶ embedder ─▶ vector DB
                                            │
  user query ─▶ embedder ─▶ vector DB.search() ──▶ retriever ─▶ LLM ─▶ answer


Same pipeline with Provenex:
  documents ─┬─▶ chunker ─▶ embedder ─▶ vector DB
             │
             └─▶ provenex.add()   (parallel signed write)

  user query ─▶ embedder ─▶ vector DB.search() ─▶ retriever ─┐
                                                              ▼
                                ┌───────────────────────────────────────┐
                                │  policy.verification (5-outcome gate) │
                                │  policy.access_control (rule engine)  │
                                │      BOTH must allow                  │
                                └────────────┬──────────────────────────┘
                                             ▼
                                    surviving chunks ─▶ LLM ─▶ answer
                                             │
                                             ▼
                              signed policy-decision receipt ─▶ audit / compliance

The pieces

  • Provenex index: a separate database (SQLite locally, hosted in production) that stores cryptographic fingerprints of every chunk you ingested, plus metadata: document ID, version, ingestion timestamp, authorization state, and residency / classification / PII tags supplied by upstream tools. Not the embeddings. Not the chunk text. SHA-256 hashes and metadata only.
  • Ingester: at document-write time, alongside the code that writes embeddings to your vector DB, this writes fingerprints to the Provenex index. Two writes, both committed before "ingest" is done.
  • Policy evaluator: at query time, after your retriever pulls chunks from the vector DB, Provenex re-fingerprints each chunk and runs it through both gates: the verification policy (origin, freshness, tampering) and the access-control policy (jurisdiction, classification, PII tags, freshness windows, caller role).
  • Receipt: a signed JSON record of the whole transaction: chunks, verification outcomes, the unified policy, per-chunk decisions, the rules that fired, a hash of the LLM output, and a signature over the whole thing.

Where does your code change?

Not in your vector DB. Provenex doesn't talk to Pinecone, Weaviate, Milvus, or any vector store directly. There's no plugin to install, no schema migration, no managed-vendor permission to wire up. Your vector DB stays exactly as it is.

The integration lives in your application code, the same RAG glue layer that already calls your vector DB. Two spots:

  1. In your ingest pipeline. Wherever your code currently writes chunks into the vector DB, add a parallel call to provenex.add(...) for each chunk.
  2. In your retrieval path. Wherever you get chunks back from the vector DB and hand them to the LLM, run them through provenex.verify_chunks(..., policy=Policy.from_yaml("hr_policy.yaml"), request_context=...) first.

What policy can express

In scope, in the open-source core:

  • Origin / provenance — was this chunk ingested through Provenex (VERIFIED vs UNVERIFIED), is the document version current (STALE), is it authorized (UNAUTHORIZED), did the stored signature survive (TAMPERED).
  • Freshness / recency — chunk.ingested_at against a duration window.
  • Access control — fields under request.caller.* against rule expectations.
  • Jurisdiction / data residency — chunk.metadata.residency against request.jurisdiction.
  • Sensitivity / classification — chunk.metadata.classification against caller role or purpose.
  • PII presence and handling — chunk.metadata.contains_pii (or any tag your upstream PII tool sets) against caller role.
  • Authorization scope — request.purpose and arbitrary policy-defined combinations of the above.

Out of scope, deliberately:

  • Content quality assessment.
  • Factual accuracy or hallucination detection.
  • Bias detection.
  • Output safety or content moderation.
  • Cost-based routing.
  • Business logic enforcement.
  • PII detection. Provenex enforces PII tags set by upstream tools; it does not detect PII itself.
  • Quality evaluation. Provenex enforces quality decisions made by upstream data governance; it does not evaluate quality itself.

The refusal list is as important as the feature list. A policy enforcement layer that quietly drifts into hallucination detection becomes unpredictable.

Policy languages: bring your own, or use ours

Provenex is evaluator-agnostic. The runtime accepts pluggable evaluator backends:

  • Native YAML DSL (open-source core, v0.4): use when you aren't already on OPA and want a small, opinionated DSL that fits in a config file.
  • Rego adapter (commercial): use when you author authorization policies in Rego elsewhere and want one language across the stack.
  • OPA service adapter (commercial): use when you run OPA as a service and want Provenex to delegate decisions to it.

Compared to OPA alone, Provenex adds the cryptographic enforcement record, the integration with retrieval, and (in a future release) transparency-log-backed proof of which policy was in effect when. OPA tells you yes / no. Provenex tells you yes / no plus a signed receipt verifiable offline.

See docs/policy.md for the full DSL reference, supported operators, and worked examples.

Five-line integration

from provenex import (
    verify_chunks, Policy, RequestContext,
    HmacSha256Signer, SQLiteProvenanceIndex,
)

index = SQLiteProvenanceIndex("provenance.db")
policy = Policy.from_yaml("hr_policy.yaml")
request = RequestContext(
    caller={"role": "hr_admin"}, jurisdiction="EU",
    purpose="customer_support", timestamp="2026-05-13T00:00:00Z",
)
result = verify_chunks(
    chunks=retrieved_chunks, index=index,
    signer=HmacSha256Signer(),
    policy=policy, request_context=request,
    chunk_metadata=[doc.metadata for doc in retrieved_documents],
)
feed_to_llm(result.kept)            # only chunks that cleared BOTH gates
save_receipt(result.receipt)        # signed, verifiable offline

Your existing vector store is untouched. Provenex runs alongside as a parallel signed index plus a policy gate. Whether you use Pinecone, Weaviate, Milvus, Qdrant, Chroma, FAISS, pgvector, MongoDB Atlas Vector Search, Elasticsearch with vectors, Vespa, or a Postgres table you wrote yourself, Provenex doesn't know and doesn't care.

Agentic and multi-step flows

Modern RAG isn't always one retrieve-then-answer cycle. Agents reason, retrieve, reflect, retrieve again. Multiple agents collaborate. Tools fetch live data. Provenex is built for these flows alongside the simple one-shot case:

  • LangChain: ProvenexRetriever wraps any retriever. Accepts an optional trajectory= for multi-step chains.
  • LangGraph: provenex_retrieval_node(...) factory plus state helpers. Drops into any state-graph DAG; the trajectory threads through the shared state.
  • CrewAI: ProvenexCrewSession owns a per-crew trajectory; session.wrap_tool(tool) wraps any retrieval / tool / memory callable.
  • LlamaIndex: ProvenexRetriever middleware (same pattern as LangChain).
  • Anything else: provenex.verify_chunks(chunks, index=..., policy=..., request_context=..., trajectory=...) is the framework-agnostic one-liner.

Every retrieval step emits its own signed receipt with a trajectory block linking it to its parents in a DAG. After the agent finishes, provenex audit --trajectory <dir> validates the entire trajectory end-to-end: signatures, inclusion proofs, no dangling parents, no cycles, shared trajectory id, at least one root step. One audit pass, the whole run.

Receipts also carry two optional per-chunk fields useful in agent flows:

  • claims[] — self-attribution claims from the agent ("I used this chunk", "this supports the answer", "this is relevant"). Cryptographically bound to the receipt so the agent cannot deny what it asserted. Provenex does not verify the claim itself — that is the agent operator's compliance burden, made auditable by the signature.
  • content_source — origin classifier (indexed_corpus, live_tool_output, memory_store, compiled_artifact). Lets an auditor reading an UNVERIFIED outcome distinguish "this chunk was supposed to be in the index and wasn't" (alarm) from "this came from a live web search" (expected).

See docs/quickstart.md for a runnable agentic example.

How it works

Four components:

1. Ingestion. Documents are normalized (Unicode NFC, whitespace collapse, optional case folding, zero-width stripping) and run through a sliding window. Each window gets a Rabin-Karp rolling hash (base 1_000_003, modulo Mersenne prime 2^61 - 1) for cheap O(1) updates, strengthened with SHA-256 for collision-resistant identity. The fingerprints (not the document content) are written to the provenance index along with document_id, document_version, timestamp, authorization state, and customer-supplied tags. The index never stores document text.

2. Verification. When your retriever returns chunks, Provenex re-fingerprints each one using the same normalization and hash pipeline, checks the fingerprint against the index, and assigns one of five outcomes (VERIFIED, STALE, UNAUTHORIZED, UNVERIFIED, TAMPERED). A configurable policy.verification decides which outcomes are blocked before the next stage.

3. Policy evaluation. Each chunk that survived the verification gate goes through the configured policy evaluator (native YAML in the open-source core; Rego and OPA service commercial). The evaluator returns allow or deny plus the names of the rules that fired. The chunk reaches the LLM only if both gates allow it.

4. Receipt. After verification and policy evaluation, a JSON receipt is issued that records the chunks, their verification outcomes, the policy that was in effect (both halves), the per-chunk decisions and rules fired, a SHA-256 of the LLM output, and a signature over the whole thing.

For iterative agentic flows, each retrieval step emits its own receipt with a trajectory block linking it to its parents — see Agentic and multi-step flows. The five verification outcomes and the policy framework are unchanged; the trajectory metadata sits alongside them.

See docs/how_it_works.md for the full algorithm, including the architectural distinction between fingerprint-based identity and embedding-based similarity. See docs/receipt_format.md for the schema spec.

How this fits alongside vector databases (and OPA)

Vector databases store semantic similarity: dense embeddings that let you find content similar to a query. Provenex stores cryptographic identity: SHA-256 fingerprints that prove bit-exact match against a signed reference, plus a policy evaluation layer over operator-declared rules. These solve different problems and compose cleanly.

Vector DBs vs. Provenex, dimension by dimension:

  • Primary storage: dense embeddings (semantic similarity) in a vector DB; SHA-256 fingerprints (cryptographic identity) plus signed metadata in Provenex.
  • Retrieval: approximate nearest neighbor over vectors; bit-exact match against the signed index.
  • Tampering: not detectable in a vector DB (embeddings are lossy by design); detectable in Provenex (any modification produces a different SHA-256).
  • Policy enforcement: tag-based filters at query construction; an evaluator-agnostic rule engine plus a signed decision record.
  • Audit artifact: a vendor dashboard and internal logs; a signed JSON receipt, verifiable offline.
  • Trust root: the vendor's SOC 2 attestation; an HMAC (or Ed25519) signature, verifiable by anyone with the key.
  • Vendor lock-in: yes, per database; none, since Provenex works alongside any retriever.

The expected enterprise deployment is both: vector DB for retrieval performance, Provenex for the policy enforcement record.

Composing with OPA and existing data governance tools

Provenex sits above your existing governance plumbing, not in place of it. PII detection happens in your data pipeline; classification happens in your data catalog; identity is owned by your IdP; authorization rules are authored in OPA / Rego if that's your house language. Provenex consumes the tags and identity those systems produce, applies the policy at retrieval time, and emits the signed record. The Rego adapter (commercial) lets you reuse Rego policies you already have; the OPA service adapter (commercial) lets you delegate decisions to a running OPA instance. The native YAML DSL exists for teams who don't already run OPA — it covers the common retrieval policies without forcing a new platform commitment.

Why vendor-agnostic matters

If you run more than one vector DB across the enterprise — common for cost or latency reasons — you have separate audit stories with separate vendor trust roots, and no way to produce a single signed record that says "this chunk, wherever it came from, was bit-exact identical to the one we authorized AND passed the policy in effect for this caller."

Provenex works the same way against all of them, because it never talks to the vector DB. It re-fingerprints the chunks the retriever returns, runs the same unified policy across every retrieval path, and emits the same receipt schema. One signed index, one policy engine, one verifiable artifact across every retrieval path in the enterprise. Migration risk between vector DBs goes to zero.

Install

pip install provenex-core                  # core only (pure stdlib)
pip install "provenex-core[policy]"        # + native YAML policy DSL (PyYAML)
pip install "provenex-core[langchain]"     # + LangChain integration
pip install "provenex-core[langgraph]"     # + LangGraph integration
pip install "provenex-core[llamaindex]"    # + LlamaIndex integration
pip install "provenex-core[crewai]"        # + CrewAI integration
pip install "provenex-core[ed25519]"       # + Ed25519 asymmetric signing

Python 3.10+. The core has zero third-party dependencies; it's pure stdlib. Framework integrations, the native YAML DSL, and the Ed25519 signer are optional extras.

Try it in 30 seconds

pip install "provenex-core[policy]"
git clone https://github.com/provenex/provenex-core.git
export PROVENEX_SIGNING_SECRET="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
python provenex-core/examples/standalone_demo.py

For the integration-pattern story, run examples/rag_with_provenance.py: a poisoned chunk added directly to the vector store (bypassing Provenex ingest) is caught at the retrieval boundary and blocked from reaching the LLM.

CLI

provenex ingest  --index prov.db --doc-id policy_v4 policy.txt
provenex verify  --index prov.db retrieved_chunk.txt
provenex receipt --index prov.db --output llm_output.txt chunk1.txt chunk2.txt
provenex audit   receipt.json
provenex audit   receipt.json --show-policy          # render the unified policy block
provenex audit   --trajectory ./receipts/            # validate a whole agentic trajectory at once
provenex policy  validate hr_policy.yaml             # parse + validate a policy file
provenex policy  hash     hr_policy.yaml             # print canonical policy_version_hash

provenex policy validate is the CI-time check for policy files: a typo or a reserved-but-unimplemented feature fails the build instead of silently allowing at runtime. provenex policy hash prints the canonical policy_version_hash that will appear on every receipt produced under that policy.

For receipts signed with Ed25519 (asymmetric), pass --public-key audit.pub instead of relying on PROVENEX_SIGNING_SECRET. An auditor with only the public key can verify but cannot forge: the strongest version of the "verifiable by anyone" guarantee, suitable for handing receipts to external regulators.

Why open source?

Security teams won't trust a black box. If a regulator asks how your access-policy enforcement system works, "it's proprietary" is not an answer. The whole algorithm needs to be auditable end to end: normalization, rolling hash, sliding window, SHA-256 strengthening, policy evaluator semantics, receipt schema, signature payload. So it is.

Open source (this repo, MIT)

  • Fingerprinting engine (normalizer + Rabin-Karp + SHA-256)
  • Local SQLite provenance index with HMAC-signed rows
  • RFC 6962 Merkle transparency log (optional, on top of the SQLite index)
  • Receipt generation, HMAC + Ed25519 signing, offline inclusion-proof verification
  • Unified policy (schema 2.1.0): single top-level policy block with verification and access_control halves
  • Native YAML data-access policy DSL: pluggable PolicyEvaluator protocol with the YAML evaluator as the reference backend
  • metadata_binding per decision: each chunk_metadata block on the receipt declares whether it was tag-at-ingest (signed by the index row) or tag-at-evaluate (looked up at decision time). Lets an auditor see the trust class of every input at a glance.
  • Bloom-filter interface (BloomFilterIndex ABC + NoopBloomFilter + BloomAcceleratedIndex wrapper). The interface is OSS so commercial deployments are drop-in; the actual high-throughput Bloom implementation ships commercially.
  • Trajectory receipts (schema 1.3.0+): per-step receipts linked into a DAG for agentic / multi-step flows
  • Self-attribution claims (schema 1.4.0+): signed but unverified records of what the agent said it used
  • Content-source classifier (schema 1.4.0+): distinguish indexed-corpus chunks from live-tool / memory-store chunks
  • LangChain / LangGraph / LlamaIndex / CrewAI integrations
  • Framework-agnostic provenex.verify_chunks(...) for everything else
  • CLI: provenex ingest / verify / receipt / audit / policy
  • Python SDK: pip install provenex-core

Commercial (at provenex.ai)

  • Rego adapter — load Rego bundles into the same PolicyEvaluator protocol; emit the same receipt shape
  • OPA service adapter — delegate evaluation to a running OPA instance over HTTP
  • Hosted provenance index with distributed signed append-only storage
  • Transparency-log-backed policy bundle records (so policy_in_transparency_log: true lights up)
  • Bloom-filter implementation for high-throughput verification at 10M+ chunk scale (the OSS ships the interface; commercial ships the working filter)
  • Compliance-grade export formats (PDF, CSV, JSON-LD for regulator-side / semantic-web consumers)
  • Identity-provider integration (RequestContext auto-populated from Okta / Azure AD)
  • Inference attribution and temporal decay scoring
  • Enterprise SSO / RBAC, HSM-backed Ed25519, dedicated support, SLA

The interfaces (ProvenanceIndex, PolicyEvaluator, BloomFilterIndex) are the same across open source and commercial. Moving from one to the other is one line of code: the class you instantiate.

Privacy and data sovereignty

The index stores fingerprints (one-way SHA-256 hashes) and metadata. No document content, no PII, no chunk text is ever written. Anyone with the index can verify retrieval, but no one can recover document content from it. The policy.access_control.decisions[].inputs field on the receipt records the metadata the evaluator looked at (residency tags, classification, caller role) — operators who want to redact those can set inputs: null while keeping the inputs_hash for offline verification.

License

MIT. See LICENSE.

Links

Project:

  • Homepage: provenex.ai
  • Issues and discussion: GitHub Issues on this repo
  • Commercial features: contact via provenex.ai

