Policy enforcement for AI data access, with cryptographic proof
provenex-core
Platform engineering champions Provenex (a runtime guardrail they don't have to build). Security signs off (cryptographic enforcement, not promises). Compliance consumes the output (a queryable, exportable, regulator-ready record).
Provenex is the policy enforcement layer for AI data access. You declare your security policy once - in our native YAML config (or OPA/Rego, commercial) - and Provenex enforces it on every retrieval and on every agentic tool call, then emits a cryptographically signed receipt that proves which chunks reached the LLM, which tool calls were admitted, and under what policy.
Scope of this repo.
provenex-core covers both enforcement fronts on one policy-and-proof spine: retrieval enforcement (what the AI reads) and agentic tool-call admission (what the AI is allowed to do, including MCP-shaped tool calls and the "can this agent access Jira / Salesforce / this connector" question). Provenex is always decision and proof, not execution - an admission controller for AI data access, not a proxy that brokers calls or holds tokens.
This repository contains the open source core: fingerprinting, a Postgres-backed production index (SQLite for development), the native YAML policy DSL, receipt generation, the tool-call admission primitive, and integrations for LangChain / LangGraph / LlamaIndex / CrewAI / MCP. The algorithm is open so it can be audited. Hosted infrastructure, the Rego adapter, the OPA service adapter, Bloom-filter acceleration, compliance-grade exports, and cross-enterprise policy interoperability are available separately at provenex.ai.
What you declare. What you get back.
A unified policy file:
version: 1
policy_id: hr-corpus-retrieval-v3
# Five-outcome verification gate
verification:
  block_unauthorized: true
  block_tampered: true
  block_stale: false
# Data-access rules
access_control:
  rules:
    - name: jurisdiction_eu_only
      when:
        request.jurisdiction: EU
      require:
        chunk.metadata.residency:
          in: [EU, EEA]
      on_violation: deny
    - name: pii_classification_gate
      when:
        chunk.metadata.contains_pii: true
      require:
        request.caller.role:
          in: [hr_admin, payroll]
      on_violation: deny
    - name: freshness_for_policy_corpus
      when:
        chunk.metadata.corpus: policy_documents
      require:
        chunk.ingested_at:
          not_older_than: 90d
      on_violation: deny
  defaults:
    unknown_metadata: deny
# Tool-call admission rules (schema 2.2.0)
tool_call_control:
  rules:
    - name: web_search_provider_allowlist
      when: { tool.name: web_search }
      require:
        tool.target_system:
          in: [google_custom_search, bing_v7]
      on_violation: deny
    # fnmatch is glob, not regex - one rule per pattern. The DSL
    # deliberately refuses regex; globs are auditable.
    - name: no_api_key_in_query
      when: { tool.name: web_search }
      require:
        tool.parameters.q:
          not_matches_pattern: "*api_key=*"
      on_violation: deny
    - name: no_password_in_query
      when: { tool.name: web_search }
      require:
        tool.parameters.q:
          not_matches_pattern: "*password=*"
      on_violation: deny
    - name: jira_writes_require_role
      when:
        tool.name: jira
        tool.operation: { in: [create_issue, update_issue, delete_issue] }
      require:
        request.caller.role:
          in: [engineer, manager, admin]
      on_violation: deny
  defaults:
    unknown_metadata: deny
A signed receipt per retrieval or per tool-call - verifiable offline by anyone with the public key. Retrieval receipts carry sources[] and policy.access_control; tool-call receipts carry actions[] and policy.tool_call_control; mixed agentic flows link both into one trajectory.
{
"receipt_id": "prx_f2de431dc125ccfc6b57e6ca327fa504",
"schema_version": "2.4.0",
"issuer": "provenex-core/0.8.2",
"caller_hash": "sha256:7a2bf01571c43f...",
"request_binding": {
"algorithm": "sha256",
"query_hash": "sha256:b7a1e09c...",
"request_context_hash": "sha256:31d8e94c...",
"request_hash": "sha256:c2f6a18d..."
},
"output": { "hash": "sha256:...", "hash_algorithm": "sha256" },
"sources": [
{ "chunk_index": 0, "fingerprint": "sha256:1ebcde39...",
"verification_outcome": "VERIFIED", "...": "..." }
],
"actions": [
{ "action_index": 0, "name": "web_search", "operation": "query",
"parameters_hash": "sha256:7a2bf015...", "target_system": "google_custom_search",
"parameters": { "q": "..." } }
],
"policy": {
"verification": { "block_unauthorized": true, "block_tampered": true, "...": "..." },
"access_control": {
"evaluator": "native_yaml",
"policy_id": "hr-corpus-retrieval-v3",
"policy_version_hash": "sha256:e10b1df5...",
"policy_in_transparency_log": false,
"decisions": [
{
"chunk_fingerprint": "sha256:1ebcde39...",
"decision": "allow",
"rules_fired": ["jurisdiction_eu_only", "freshness_for_policy_corpus"],
"inputs_hash": "sha256:a3f9c2d1...",
"inputs": { "chunk_metadata": { "...": "..." }, "request_context": { "...": "..." } }
}
]
},
"tool_call_control": {
"evaluator": "native_yaml",
"policy_id": "hr-corpus-retrieval-v3",
"policy_version_hash": "sha256:d9fdce46...",
"policy_in_transparency_log": false,
"decisions": [
{ "action_index": 0, "decision": "allow",
"rules_fired": ["web_search_provider_allowlist", "no_api_key_in_query", "no_password_in_query"],
"inputs_hash": "sha256:b8e441f7...", "inputs": null }
]
}
},
"summary": { "total_chunks": 3, "verified": 2, "unverified": 1,
"total_actions": 1, "actions_allowed": 1, "actions_denied": 0,
"overall_status": "PARTIAL" },
"trajectory": { "trajectory_id": "trj_a3f1c0d2...", "step_index": 1,
"parent_step_ids": ["prx_c5d8e1f2..."], "step_kind": "tool_call",
"agent_id": "incident_agent",
"session_id": "session-2026-001" },
"signature": { "algorithm": "ed25519", "value": "fc5d40895ca2..." }
}
A chunk reaches the LLM only if it clears both gates: the verification policy AND the access-control policy. The receipt records both verdicts per chunk so an auditor can reason about them independently - and the signature covers everything, including the request_binding that ties the receipt cryptographically to the triggering query.
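To make the request_binding idea concrete, here is a minimal stdlib sketch of how a verifier might check that a receipt belongs to a particular query. The helper names (build_request_binding, binding_matches), the canonical JSON form, and the way request_hash combines the two component hashes are all assumptions for illustration; the actual construction is defined by the receipt format spec, not by this sketch.

```python
import hashlib
import json


def sha256_tagged(data: bytes) -> str:
    """Return a digest in the 'sha256:<hex>' form used in the receipt."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def canonical_json(obj) -> bytes:
    # Assumption: sorted keys and compact separators stand in for the
    # real canonical form.
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_request_binding(query: str, request_context: dict) -> dict:
    """Hypothetical helper: hash the query and the context, then commit to both."""
    query_hash = sha256_tagged(query.encode("utf-8"))
    context_hash = sha256_tagged(canonical_json(request_context))
    # Assumption: request_hash is a hash over the two component hashes.
    request_hash = sha256_tagged(f"{query_hash}|{context_hash}".encode("utf-8"))
    return {
        "algorithm": "sha256",
        "query_hash": query_hash,
        "request_context_hash": context_hash,
        "request_hash": request_hash,
    }


def binding_matches(receipt: dict, query: str, request_context: dict) -> bool:
    """True only if the receipt was bound to exactly this query and context."""
    return receipt.get("request_binding") == build_request_binding(query, request_context)


context = {"caller": {"role": "hr_admin"}, "jurisdiction": "EU"}
receipt = {"request_binding": build_request_binding("What is the leave policy?", context)}

print(binding_matches(receipt, "What is the leave policy?", context))  # True
print(binding_matches(receipt, "List all salaries", context))          # False
```

Because the binding sits inside the signed payload, presenting the same receipt as evidence for a different query fails this comparison without anyone needing to touch the signature.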
Signing algorithm in this showcase. This receipt is signed with Ed25519 (asymmetric) - the right default for any receipt that may be handed to a regulator, external auditor, or downstream party who must verify but should not be able to forge. The OSS core also ships HmacSha256Signer for internal-only deployments where producer and verifier share a trust boundary; both implement the same ReceiptSigner interface, so receipts are structurally identical. Pick HMAC if simplicity matters more than non-forgeability by the verifier; pick Ed25519 the moment a receipt crosses an org boundary.
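The symmetric trade-off can be seen in a few lines of stdlib Python. This is a sketch of HMAC-SHA256 signing over a receipt-shaped dict, not the HmacSha256Signer implementation: the function names, the "hmac-sha256" algorithm label, and the choice to sign the sorted-key JSON of everything except the signature field are assumptions.

```python
import hashlib
import hmac
import json


def canonical_payload(receipt: dict) -> bytes:
    # Assumption: the signature covers every field except "signature" itself,
    # serialized as sorted-key compact JSON.
    unsigned = {k: v for k, v in receipt.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_receipt(receipt: dict, secret: bytes) -> dict:
    """Return a copy of the receipt with an HMAC-SHA256 signature attached."""
    mac = hmac.new(secret, canonical_payload(receipt), hashlib.sha256).hexdigest()
    return {**receipt, "signature": {"algorithm": "hmac-sha256", "value": mac}}


def verify_receipt(receipt: dict, secret: bytes) -> bool:
    """Recompute the MAC and compare in constant time."""
    signature = receipt.get("signature") or {}
    expected = hmac.new(secret, canonical_payload(receipt), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(signature.get("value", "")))


secret = b"shared-secret-for-demo-only"
signed = sign_receipt({"receipt_id": "prx_demo", "summary": {"verified": 2}}, secret)
print(verify_receipt(signed, secret))    # True

tampered = {**signed, "summary": {"verified": 3}}
print(verify_receipt(tampered, secret))  # False
```

Note that verify_receipt needs the same secret that sign_receipt used, so any verifier can also produce valid signatures. That is the property that makes HMAC unsuitable once a receipt leaves the trust boundary.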
Source-of-record fields for downstream anomaly detectors / SIEMs (schema 2.3.0). caller_hash is the SHA-256 over the canonical JSON of request_context.caller - a stable group-by key so a detector can baseline a single user's activity across receipts without crawling per-decision input blobs. trajectory.session_id is a caller-chosen opaque string that correlates multiple trajectories under one logical session (a chat conversation, an incident-response engagement, a multi-day investigation). Both fields are decision-and-proof artifacts: they don't influence policy decisions (so inputs_hash stays deterministic), they just make receipts joinable downstream. Provenex emits the source-of-record; your detector or SIEM is the system that reads it.
Where Provenex fits in your stack
Standard RAG:
documents ─▶ chunker ─▶ embedder ─▶ vector DB
│
user query ─▶ embedder ─▶ vector DB.search() ──▶ retriever ─▶ LLM ─▶ answer
Same pipeline with Provenex:
documents ─┬─▶ chunker ─▶ embedder ─▶ vector DB
│
└─▶ provenex.add(entry_kind=whole_chunk) (parallel signed write)
user query ─▶ embedder ─▶ vector DB.search() ─▶ retriever ─┐
│ ▼
│ ┌───────────────────────────────────────┐
│ │ policy.verification (5-outcome gate) │
│ │ policy.access_control (rule engine) │
│ │ whole-chunk match only → VERIFIED │
│ │ BOTH must allow │
│ └────────────┬──────────────────────────┘
│ ▼
│ surviving chunks ─▶ LLM ─▶ answer
│ │
└─────── request_text ──▶ signed receipt + request_binding
▼
audit / compliance / SIEM
The pieces
| Piece | What it does |
|---|---|
| Provenex index | A separate database that stores cryptographic fingerprints of every chunk you ingested, plus metadata: document ID, version, ingestion timestamp, authorization state, residency / classification / PII tags supplied by upstream tools. Not the embeddings. Not the chunk text. SHA-256 hashes and metadata only. Ships with two backends: Postgres for multi-node production deployments (point at your own RDS / Aurora / Cloud SQL / on-prem cluster), and SQLite for single-node development. Same ProvenanceIndex interface, identical canonical signing payload - receipts produced against one backend verify bit-identically against the other. |
| Ingester | At document-write time, alongside the code that writes embeddings to your vector DB, this writes fingerprints to the Provenex index. Two writes, both committed before "ingest" is done. |
| Policy evaluator | At query time, after your retriever pulls chunks from the vector DB, Provenex re-fingerprints each chunk and runs it through both gates: the verification policy (origin, freshness, tampering) and the access-control policy (jurisdiction, classification, PII tags, freshness windows, caller role). |
| Receipt | A signed JSON record of the whole transaction: chunks, verification outcomes, the unified policy, per-chunk decisions, the rules that fired, a hash of the LLM output, and a signature over the whole thing. |
Where does your code change?
Not in your vector DB. Provenex doesn't talk to Pinecone, Weaviate, Milvus, or any vector store directly. There's no plugin to install, no schema migration, no managed-vendor permission to wire up. Your vector DB stays exactly as it is.
The integration lives in your application code, the same RAG glue layer that already calls your vector DB. Two spots:
- In your ingest pipeline. Wherever your code currently writes chunks into the vector DB, add a parallel call to provenex.add(...) for each chunk.
- In your retrieval path. Wherever you get chunks back from the vector DB and hand them to the LLM, run them through provenex.verify_chunks(..., policy=Policy.from_yaml("hr_policy.yaml"), request_context=...) first.
What policy can express
In scope, in the open-source core:
- Origin / provenance - was this chunk ingested through Provenex (VERIFIED vs UNVERIFIED), is the document version current (STALE), is it authorized (UNAUTHORIZED), did the stored signature survive (TAMPERED).
- Freshness / recency - chunk.ingested_at against a duration window.
- Access control - fields under request.caller.* against rule expectations.
- Jurisdiction / data residency - chunk.metadata.residency against request.jurisdiction.
- Sensitivity / classification - chunk.metadata.classification against caller role or purpose.
- PII presence and handling - chunk.metadata.contains_pii (or any tag your upstream PII tool sets) against caller role.
- Authorization scope - request.purpose and arbitrary policy-defined combinations of the above.
Out of scope, deliberately:
- Content quality assessment.
- Factual accuracy or hallucination detection.
- Bias detection.
- Output safety or content moderation.
- Cost-based routing.
- Business logic enforcement.
- PII detection. Provenex enforces PII tags set by upstream tools; it does not detect PII itself.
- Quality evaluation. Provenex enforces quality decisions made by upstream data governance; it does not evaluate quality itself.
The refusal list is as important as the feature list. A policy enforcement layer that quietly drifts into hallucination detection becomes unpredictable.
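To illustrate how small a per-decision rule engine of this shape can be, the sketch below evaluates when / require rules against a nested input dict. It is not the Provenex evaluator: the function names are hypothetical, rules are written as Python dicts to avoid a YAML dependency, only the in and not_matches_pattern operators from the example policy are handled, and the "a rule fires when its when clause matches" semantics is an assumption.

```python
from fnmatch import fnmatchcase

_MISSING = object()


def lookup(inputs: dict, dotted: str):
    """Resolve a dotted path such as 'chunk.metadata.residency'."""
    node = inputs
    for part in dotted.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


def matches(value, expected) -> bool:
    """Compare a value with a literal or a single-operator mapping."""
    if isinstance(expected, dict):
        if "in" in expected:
            return value in expected["in"]
        if "not_matches_pattern" in expected:
            # Glob matching, not regex, as the policy comments describe.
            return not fnmatchcase(str(value), expected["not_matches_pattern"])
        raise ValueError(f"unsupported operator: {sorted(expected)}")
    return value == expected


def evaluate(rules: list, inputs: dict, unknown_metadata: str = "deny"):
    """Return (decision, rules_fired) for one chunk or one tool call."""
    fired = []
    for rule in rules:
        conditions = [(lookup(inputs, key), exp) for key, exp in rule["when"].items()]
        if not all(val is not _MISSING and matches(val, exp) for val, exp in conditions):
            continue  # the rule does not apply to this input
        fired.append(rule["name"])
        for key, expected in rule["require"].items():
            value = lookup(inputs, key)
            if value is _MISSING:
                if unknown_metadata == "deny":
                    return "deny", fired
                continue
            if not matches(value, expected):
                return rule.get("on_violation", "deny"), fired
    return "allow", fired


rules = [
    {"name": "jurisdiction_eu_only",
     "when": {"request.jurisdiction": "EU"},
     "require": {"chunk.metadata.residency": {"in": ["EU", "EEA"]}},
     "on_violation": "deny"},
    {"name": "pii_classification_gate",
     "when": {"chunk.metadata.contains_pii": True},
     "require": {"request.caller.role": {"in": ["hr_admin", "payroll"]}},
     "on_violation": "deny"},
]
inputs = {
    "request": {"jurisdiction": "EU", "caller": {"role": "hr_admin"}},
    "chunk": {"metadata": {"residency": "EU", "contains_pii": True}},
}
print(evaluate(rules, inputs))
# ('allow', ['jurisdiction_eu_only', 'pii_classification_gate'])
```

Every input the function reads is passed in explicitly, which is what keeps a decision reproducible from the recorded inputs alone.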
Policy languages: bring your own, or use ours
Provenex is evaluator-agnostic. The runtime accepts pluggable evaluator backends:
| Backend | Status | Use when |
|---|---|---|
| Native YAML DSL | Open-source core (v0.4) | You aren't already on OPA. Want a small, opinionated DSL that fits in a config file. |
| Rego adapter | Commercial | You author authorization policies in Rego elsewhere and want one language across the stack. |
| OPA service adapter | Commercial | You run OPA as a service and want Provenex to delegate decisions to it. |
Compared to OPA alone, Provenex adds the cryptographic enforcement record, the integration with retrieval, and (in a future release) transparency-log-backed proof of which policy was in effect when. OPA tells you yes / no. Provenex tells you yes / no plus a signed receipt verifiable offline.
See docs/policy.md for the full DSL reference, supported operators, and worked examples.
Easy integration
Production (Postgres, multi-node)
from provenex import (
verify_chunks, Policy, RequestContext,
Ed25519Signer, PostgresProvenanceIndex,
)
index = PostgresProvenanceIndex(
dsn="postgresql://provenex:secret@db.internal:5432/provenex",
)
policy = Policy.from_yaml("hr_policy.yaml")
request = RequestContext(
caller={"role": "hr_admin"}, jurisdiction="EU",
purpose="customer_support", timestamp="2026-05-13T00:00:00Z",
)
result = verify_chunks(
chunks=retrieved_chunks, index=index,
signer=Ed25519Signer.from_private_key_file("audit-signing.pem"),
policy=policy, request_context=request,
request_text=query, # binds the receipt to this specific query
chunk_metadata=[doc.metadata for doc in retrieved_documents],
)
feed_to_llm(result.kept) # only chunks that cleared BOTH gates
save_receipt(result.receipt) # signed, verifiable offline by anyone
# with the public key
Why Ed25519 here, not HMAC? This example produces a receipt that may be handed to a regulator, an external auditor, or any party outside your operations team. Ed25519 lets them verify the receipt without giving them anything they could forge with - they hold only the public key. The internal-only deployment path uses HmacSha256Signer for the symmetric-fast case; see docs/threat_model.md for when each is the right tool.
Many verify pods plus one ingester pod is the recommended deployment shape - bulk ingest is a batch job; verify is per-request and scales horizontally via Postgres read replicas. Multi-writer ingest into the same index is supported and serialized at the document-row level. Bring your own Postgres (RDS, Aurora, Cloud SQL, Crunchy, Supabase, or self-managed) - Provenex doesn't host it.
Default for block_unverified is False. Chunks whose fingerprint isn't in the Provenex index (UNVERIFIED outcome) pass through to the LLM by default - the receipt records the outcome, but the chunk is not removed. For strict enforcement where every chunk must be Provenex-tracked, set block_unverified=True in your VerificationPolicy. Constructing VerificationPolicy() without an explicit value emits a DeprecationWarning so the choice is visible; pass block_unverified=False explicitly to acknowledge the advisory stance. The default will flip to True in a future major release.
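The effect of the advisory default is easiest to see side by side with the strict setting. The sketch below models the verification gate as a set of blocked outcomes; GateFlags and apply_gate are hypothetical names, and the defaults for the other three flags simply mirror the example policy file rather than documented library defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateFlags:
    """Hypothetical stand-in for the verification half of a policy."""
    block_unverified: bool = False   # advisory default described above
    block_unauthorized: bool = True  # mirrors the example policy file
    block_tampered: bool = True      # mirrors the example policy file
    block_stale: bool = False        # mirrors the example policy file


def blocked_outcomes(flags: GateFlags) -> set:
    """Translate the boolean flags into the set of outcomes to drop."""
    mapping = {
        "UNVERIFIED": flags.block_unverified,
        "UNAUTHORIZED": flags.block_unauthorized,
        "TAMPERED": flags.block_tampered,
        "STALE": flags.block_stale,
    }
    return {outcome for outcome, blocked in mapping.items() if blocked}


def apply_gate(chunks_with_outcomes, flags: GateFlags):
    """Split (chunk, outcome) pairs into kept and removed chunks."""
    blocked = blocked_outcomes(flags)
    kept = [chunk for chunk, outcome in chunks_with_outcomes if outcome not in blocked]
    removed = [chunk for chunk, outcome in chunks_with_outcomes if outcome in blocked]
    return kept, removed


results = [("chunk-a", "VERIFIED"), ("chunk-b", "UNVERIFIED"), ("chunk-c", "TAMPERED")]

print(apply_gate(results, GateFlags()))
# (['chunk-a', 'chunk-b'], ['chunk-c'])
print(apply_gate(results, GateFlags(block_unverified=True)))
# (['chunk-a'], ['chunk-b', 'chunk-c'])
```

With the advisory default, a chunk that was never indexed still reaches the LLM; only the strict setting removes it.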
Development (SQLite, single-node)
from provenex import SQLiteProvenanceIndex
index = SQLiteProvenanceIndex("provenance.db")
# ... rest is identical to the Postgres example
Stdlib-only, no service to stand up. Same interface, same canonical signing payload, same receipt format - a receipt produced against SQLite verifies identically against Postgres and vice versa.
Your existing vector store is untouched. Provenex runs alongside as a parallel signed index plus a policy gate. Whether you use Pinecone, Weaviate, Milvus, Qdrant, Chroma, FAISS, pgvector, MongoDB Atlas Vector Search, Elasticsearch with vectors, Vespa, or a Postgres table you wrote yourself, Provenex doesn't know and doesn't care.
Tool-call admission (schema 2.2.0)
from provenex import (
HmacSha256Signer, Policy, RequestContext,
ToolCallContext, admission_check,
)
policy = Policy.from_yaml("agent_policy.yaml") # both halves live in one file
request = RequestContext(
caller={"id": "u_42", "role": "engineer"}, jurisdiction="US",
purpose="incident_response", timestamp="2026-05-14T11:30:00Z",
)
result = admission_check(
tool=ToolCallContext(
name="jira", operation="create_issue",
parameters={"project": "INC", "summary": "..."},
target_system="acme.atlassian.net",
),
request=request, policy=policy, signer=HmacSha256Signer(),
)
if result.allowed:
    jira_client.create_issue(...) # YOUR code, YOUR credentials
save_receipt(result.receipt) # signed, verifiable offline - denies too
Decision and proof, not execution. Provenex returns a decision and emits a signed receipt; the caller makes the actual call against the target system using its own credentials. Provenex never holds OAuth tokens, never proxies traffic, and never sits on the response-data path. Use ProvenexToolWrapper to wrap any LangChain tool; use provenex_mcp_admission to decorate any MCP tools/call handler.
Memory reads, memory writes, and model-inference (0.6.5+)
Every class of action an agent takes lands on a receipt under the right trajectory.step_kind classifier - not just retrieval (step_kind="retrieval") and tool calls (step_kind="tool_call"). Three convenience entrypoints close the loop so a downstream anomaly detector / SIEM gets a complete event stream:
from provenex import (
HmacSha256Signer, RequestContext, SQLiteProvenanceIndex,
admit_memory_write, admit_model_inference, verify_memory,
)
index = SQLiteProvenanceIndex("memory.db")
signer = HmacSha256Signer()
request = RequestContext(caller={"id": "u_42", "role": "engineer"},
jurisdiction="US", purpose="incident_response",
timestamp="2026-05-14T11:30:00Z")
# Memory read - emits a receipt with step_kind="memory_read" and
# content_source="memory_store" on every source. Same five outcomes
# (VERIFIED / STALE / UNAUTHORIZED / UNVERIFIED / TAMPERED) apply.
r1 = verify_memory(["last user message: ..."], index=index, signer=signer,
request_context=request)
# Memory write - emits an admission receipt with name="memory.write",
# operation=<memory_key>. By default the verbatim value is redacted
# (memory values often contain PII); value_hash is always recorded.
r2 = admit_memory_write(memory_key="user_profile", value={"prefers": "dark_mode"},
request=request, store_id="crewai_memory", signer=signer)
# Model inference - emits an admission receipt with name=<model_name>,
# target_system=<provider>, parameters={prompt_hash, **extras}. Verbatim
# prompt redacted by default. Enables detection on "this user is calling
# claude-opus 100x baseline" or "prompts contain pattern X".
r3 = admit_model_inference(model_name="claude-opus-4-7",
prompt="Summarize TICKET-001",
request=request, target_provider="anthropic",
extra_parameters={"max_tokens": 4000}, signer=signer)
All three reuse the existing receipt schema unchanged (now 2.4.0; the v0.8.0 additions are additive). They produce admission-shaped receipts (actions[] + policy.tool_call_control) for memory_write / model_inference, and retrieval-shaped receipts (sources[] + policy.access_control) for memory_read. The unified YAML policy gates all of them the same way - a tool-call rule like when: { tool.name: "memory.write", tool.operation: "user_profile" } enforces per-key gates; a rule like when: { tool.name: "claude-opus-4-7" } gates model usage by provider/allowlist.
Streaming receipts to a SIEM / firehose (0.6.6+)
Every receipt-emitting entrypoint accepts an optional sink= parameter. Provenex publishes to the sink after the receipt is finalised - your hot path stays the same; the firehose runs alongside.
from provenex import (
HmacSha256Signer, RequestContext, ToolCallContext,
admission_check, MultiSink, FileJSONLSink,
)
from provenex.export.kafka import KafkaSink # extra: [export-kafka]
from provenex.export.aws import S3AppendSink # extra: [export-aws]
# Real-time firehose for the detector + long-term archive for compliance.
sink = MultiSink([
KafkaSink(bootstrap_servers="kafka.internal:9092", topic="provenex-receipts"),
S3AppendSink(bucket="audit-archive", prefix="provenex"),
FileJSONLSink("/var/log/provenex"),
])
result = admission_check(..., sink=sink) # the only line that changes
Reference sinks shipped: StdoutJSONLSink, FileJSONLSink (date-rotated), MultiSink (fan-out), RetryQueueSink (bounded in-process retry queue) in the stdlib core; KafkaSink, SQSSink, S3AppendSink (date-hour-partitioned), PubSubSink behind optional extras. Define-your-own via the ReceiptSink Protocol.
Error semantics - load-bearing. Sink failures are swallowed and logged via warnings.warn. Provenex never breaks the agent's hot path because export is degraded. A misconfigured Kafka cluster writes a warning to stderr; the receipt is still returned through the function value; the agent keeps running. See docs/streaming_export.md for the full reference including retry queue semantics and custom-sink implementation.
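The swallow-and-warn behaviour can be sketched independently of any real sink. The classes below are illustrative stand-ins, not the shipped MultiSink or ReceiptSink, and the publish(receipt) method name is an assumption about what a sink protocol might look like.

```python
import warnings
from typing import Protocol


class Sink(Protocol):
    """Assumed shape of a sink: one method that accepts a receipt dict."""
    def publish(self, receipt: dict) -> None: ...


class ListSink:
    """In-memory sink used here in place of Kafka / S3 / files."""
    def __init__(self) -> None:
        self.items: list = []

    def publish(self, receipt: dict) -> None:
        self.items.append(receipt)


class BrokenSink:
    """Simulates a misconfigured or unreachable backend."""
    def publish(self, receipt: dict) -> None:
        raise ConnectionError("broker unreachable")


class FanOut:
    """Publish to every sink; a failure in one never reaches the caller."""
    def __init__(self, sinks) -> None:
        self.sinks = list(sinks)

    def publish(self, receipt: dict) -> None:
        for sink in self.sinks:
            try:
                sink.publish(receipt)
            except Exception as exc:  # deliberately broad: export must not break the hot path
                warnings.warn(f"{type(sink).__name__} failed: {exc}", RuntimeWarning)


archive = ListSink()
fan_out = FanOut([BrokenSink(), archive])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fan_out.publish({"receipt_id": "prx_demo"})

print(len(archive.items))  # 1
print(len(caught))         # 1
```

The healthy sink still receives the receipt even though the broken one is listed first, and the caller sees a warning rather than an exception.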
OCSF export - receipts as cross-vendor security events (0.6.7+)
Provenex maps signed receipts to OCSF v1.3 events - the emerging cross-vendor schema (Splunk, Datadog, Elastic, Microsoft Sentinel) for security events. One function transforms; one adapter streams.
from provenex import OCSFAdapter, MultiSink, FileJSONLSink, receipt_to_ocsf
from provenex.export.kafka import KafkaSink
# Stream-and-fan-out: OCSF events to the SIEM, raw receipts to archive.
sink = MultiSink([
OCSFAdapter(
downstream=KafkaSink(bootstrap_servers="...", topic="ocsf-security-events"),
extra_metadata={"organization_uid": "acme-corp", "environment": "prod"},
),
FileJSONLSink("/var/log/provenex/raw"),
])
result = admission_check(..., sink=sink)
# Or convert ad-hoc:
events = receipt_to_ocsf(result.receipt.to_dict())
# → [{class_uid: 6003, ...}] (API Activity for allowed admissions)
| Provenex event | OCSF class | UID | Severity |
|---|---|---|---|
| Allowed retrieval / memory_read | Application Activity | 6005 | Informational |
| Allowed tool_call / memory_write / model_inference | API Activity | 6003 | Informational |
| Verification block (TAMPERED, UNAUTHORIZED, etc.) | Detection Finding | 2004 | Critical |
| Policy deny (access_control or tool_call_control) | Detection Finding | 2004 | High |
Correlation fields land where SIEMs expect them: caller_hash → actor.user.uid, trajectory_id → metadata.correlation_uid, session_id → metadata.session_uid, step_kind → metadata.labels[]. The full field-by-field spec is in docs/ocsf_mapping.md - the public artifact for SIEM vendors and enterprise security architects.
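The table and field placement above can be restated as a small lookup. This sketch is not receipt_to_ocsf: the function names and arguments are hypothetical, the class names, UIDs, and severities are copied from the table as given, and the precedence (verification block before policy deny) is an assumption.

```python
RETRIEVAL_KINDS = {"retrieval", "memory_read"}
ADMISSION_KINDS = {"tool_call", "memory_write", "model_inference"}


def classify(step_kind: str, verification_blocked: bool, policy_denied: bool) -> dict:
    """Pick the OCSF class and severity for one receipt, per the table above."""
    if verification_blocked:
        return {"class_name": "Detection Finding", "class_uid": 2004, "severity": "Critical"}
    if policy_denied:
        return {"class_name": "Detection Finding", "class_uid": 2004, "severity": "High"}
    if step_kind in RETRIEVAL_KINDS:
        return {"class_name": "Application Activity", "class_uid": 6005, "severity": "Informational"}
    if step_kind in ADMISSION_KINDS:
        return {"class_name": "API Activity", "class_uid": 6003, "severity": "Informational"}
    raise ValueError(f"unknown step_kind: {step_kind!r}")


def correlation_fields(receipt: dict) -> dict:
    """Place the receipt's join keys where the mapping above says they land."""
    trajectory = receipt.get("trajectory", {})
    return {
        "actor": {"user": {"uid": receipt.get("caller_hash")}},
        "metadata": {
            "correlation_uid": trajectory.get("trajectory_id"),
            "session_uid": trajectory.get("session_id"),
            "labels": [trajectory.get("step_kind")],
        },
    }


receipt = {
    "caller_hash": "sha256:7a2bf0...",
    "trajectory": {"trajectory_id": "trj_a3f1", "session_id": "session-2026-001",
                   "step_kind": "tool_call"},
}
event = {**classify("tool_call", False, False), **correlation_fields(receipt)}
print(event["class_uid"], event["metadata"]["correlation_uid"])  # 6003 trj_a3f1
```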
Provenex is the firewall. Your detector is the SIEM.
Provenex enforces per-decision admission and emits signed receipts. Your anomaly detector / UEBA / SIEM reads the receipt stream and does sequence / pattern detection. Two categories, two budgets, two vendors - by design.
- Provenex side: deterministic, per-decision-pure, side-effect-free, sub-millisecond. inputs_hash is reproducible by a regulator years later from the recorded inputs + the original policy bundle.
- Detector side: stateful, cross-decision, external-data-aware. Reads receipts via ReceiptSink (or OCSF events via OCSFAdapter), groups by caller_hash / trajectory_id / session_id / step_kind, baselines normal behaviour, alerts on drift.
The native YAML DSL deliberately refuses trajectory-level rules, cross-decision aggregations, and external-data lookups during evaluation. Putting those inside a per-decision admission engine breaks the audit-anchor guarantees and the latency budget. They belong downstream - in your detector reading the receipt stream. Customers who need trajectory rules in-engine use the commercial Rego adapter; the trade-off is explicit. See docs/policy.md for the design rationale.
The canonical positioning doc, including worked detection patterns: docs/anomaly_detection.md - what fields a detector reads, five worked patterns (per-caller rate, trajectory shape drift, policy near-miss, cross-trajectory correlation, content-source anomaly), trust model, and the operational reasoning for the firewall / SIEM split.
Per-deployment unlinkability for caller_hash (0.6.5+)
By default, caller_hash is a plain SHA-256 over the canonical caller dict (sha256:<hex> prefix) - anyone with the verbatim caller dict can reproduce the hash. For multi-tenant deployments that want two of their customers' detectors to NOT be able to cross-correlate users via shared caller_hash buckets, pass caller_hash_salt=b"..." to verify_chunks / admission_check / verify_memory / admit_memory_write / admit_model_inference. The hash becomes HMAC-SHA256 keyed by the salt (hmac-sha256:<hex> prefix); two deployments with different salts produce different caller_hash for the same caller. Same algorithm family (SHA-256), same wire format - the prefix tells consumers which mode produced the hash. Salting is opt-in; no caller-side migration needed for the bare-SHA-256 default.
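The two modes differ only in the primitive and the prefix. The sketch below reproduces that shape with the standard library; the function is hypothetical, and the canonical JSON form (sorted keys, compact separators) is an assumption, so its digests are not expected to match values produced by Provenex.

```python
import hashlib
import hmac
import json
from typing import Optional


def caller_hash(caller: dict, salt: Optional[bytes] = None) -> str:
    """Plain SHA-256 by default; HMAC-SHA256 keyed by the salt when one is given."""
    # Assumption: sorted-key compact JSON stands in for the canonical form.
    payload = json.dumps(caller, sort_keys=True, separators=(",", ":")).encode("utf-8")
    if salt is None:
        return "sha256:" + hashlib.sha256(payload).hexdigest()
    return "hmac-sha256:" + hmac.new(salt, payload, hashlib.sha256).hexdigest()


caller = {"id": "u_42", "role": "engineer"}

unsalted = caller_hash(caller)
tenant_a = caller_hash(caller, salt=b"tenant-a-salt")
tenant_b = caller_hash(caller, salt=b"tenant-b-salt")

print(unsalted.split(":")[0])  # sha256
print(tenant_a.split(":")[0])  # hmac-sha256
print(tenant_a == tenant_b)    # False
print(caller_hash({"role": "engineer", "id": "u_42"}) == unsalted)  # True
```

Key order in the caller dict does not change the hash, which is what makes it usable as a stable group-by key, while different salts keep the same caller from lining up across deployments.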
Agentic and multi-step flows
Modern RAG isn't always one retrieve-then-answer cycle. Agents reason, retrieve, reflect, retrieve again. Multiple agents collaborate. Tools fetch live data. Provenex is built for these flows alongside the simple one-shot case:
| Framework | Retrieval | Tool calls |
|---|---|---|
| LangChain | ProvenexRetriever wraps any retriever. Accepts an optional trajectory=. | ProvenexToolWrapper wraps any LangChain tool; same receipt shape as MCP. |
| LangGraph | provenex_retrieval_node(...) factory + state helpers. Drops into any state-graph DAG; the trajectory threads through the shared state. | Call admission_check(...) from a graph node; pass trajectory= to thread admissions into the same DAG. |
| CrewAI | ProvenexCrewSession.wrap_tool(tool) wraps any retrieval / tool / memory callable; session.verify_chunks(...) runs retrieval verification on tool output. | session.wrap_tool_admission(tool, name=..., request_factory=...) runs admission before the tool fires (denials raise ToolCallDenied). session.admission_check(tool_ctx, request) is the lower-level variant; both thread the session's trajectory automatically. |
| LlamaIndex | ProvenexRetriever middleware (same pattern as LangChain). | Use the framework-agnostic admission_check(...) directly. |
| MCP | n/a (retrieval is upstream of MCP) | provenex_mcp_admission(...) decorator wraps a tools/call handler. Standard JSON-RPC error code on deny. |
| Anything else | provenex.verify_chunks(chunks, index=..., policy=..., request_context=..., trajectory=...) | provenex.admission_check(tool=..., request=..., policy=..., signer=..., trajectory=...) |
Every retrieval, tool-call admission, memory read, memory write, and model-inference step emits its own signed receipt with a trajectory block linking it to its parents in a DAG. After the agent finishes, provenex audit --trajectory <dir> validates the entire trajectory end-to-end: signatures, inclusion proofs, no dangling parents, no cycles, shared trajectory id, at least one root step. Mixed step kinds - retrieval / tool_call / memory_read / memory_write / model_inference - are first-class under one signed audit trail. One CLI invocation covers the whole agent run.
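The structural half of that audit (shared trajectory id, at least one root, no dangling parents, no cycles) is ordinary graph checking. The sketch below is not the provenex audit implementation; validate_trajectory is a hypothetical name, it reads only the receipt_id and trajectory fields shown in the example receipt, and it skips signatures and inclusion proofs entirely.

```python
from collections import deque


def validate_trajectory(receipts: list) -> list:
    """Return a list of structural problems; an empty list means the DAG is sound."""
    problems = []
    steps = {r["receipt_id"]: r["trajectory"] for r in receipts}

    if len({t["trajectory_id"] for t in steps.values()}) > 1:
        problems.append("mixed trajectory ids")
    if not any(not t.get("parent_step_ids") for t in steps.values()):
        problems.append("no root step")

    indegree = {rid: 0 for rid in steps}
    children = {}
    for rid, t in steps.items():
        for parent in t.get("parent_step_ids", []):
            if parent not in steps:
                problems.append(f"dangling parent {parent} on {rid}")
                continue
            children.setdefault(parent, []).append(rid)
            indegree[rid] += 1

    # Kahn's algorithm: if some step is never released, it sits on a cycle.
    queue = deque(rid for rid, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for child in children.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    if visited != len(steps):
        problems.append("cycle detected")
    return problems


def step(rid, parents, trajectory_id="trj_1"):
    """Build a minimal receipt-shaped dict for the demo."""
    return {"receipt_id": rid,
            "trajectory": {"trajectory_id": trajectory_id, "parent_step_ids": parents}}


good = [step("prx_a", []), step("prx_b", ["prx_a"]), step("prx_c", ["prx_a", "prx_b"])]
print(validate_trajectory(good))  # []

bad = [step("prx_a", []), step("prx_b", ["prx_missing"])]
print(validate_trajectory(bad))   # ['dangling parent prx_missing on prx_b']
```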
Receipts also carry two optional per-chunk fields useful in agent flows:
- claims[] - self-attribution claims from the agent ("I used this chunk", "this supports the answer", "this is relevant"). Cryptographically bound to the receipt so the agent cannot deny what it asserted. Provenex does not verify the claim itself - that is the agent operator's compliance burden, made auditable by the signature.
- content_source - origin classifier (indexed_corpus, live_tool_output, memory_store, compiled_artifact). Lets an auditor reading an UNVERIFIED outcome distinguish "this chunk was supposed to be in the index and wasn't" (alarm) from "this came from a live web search" (expected).
See docs/quickstart.md for a runnable agentic example.
How it works
Four components:
1. Ingestion. Documents are normalized (Unicode NFC, whitespace collapse, optional case folding, zero-width stripping) and run through a sliding window. Each window gets a Rabin-Karp rolling hash (base 1_000_003, modulo Mersenne prime 2^61 - 1) for cheap O(1) updates, strengthened with SHA-256 for collision-resistant identity. The fingerprints (not the document content) are written to the provenance index along with document_id, document_version, timestamp, authorization state, and customer-supplied tags. The index never stores document text.
2. Verification. When your retriever returns chunks, Provenex re-fingerprints each one using the same normalization and hash pipeline, checks the fingerprint against the index, and assigns one of five outcomes (VERIFIED, STALE, UNAUTHORIZED, UNVERIFIED, TAMPERED). A configurable policy.verification decides which outcomes are blocked before the next stage.
3. Policy evaluation. Each chunk that survived the verification gate goes through the configured policy evaluator (native YAML in the open-source core; Rego and OPA service commercial). The evaluator returns allow or deny plus the names of the rules that fired. The chunk reaches the LLM only if both gates allow it.
4. Receipt. After verification and policy evaluation, a JSON receipt is issued that records the chunks, their verification outcomes, the policy that was in effect (both halves), the per-chunk decisions and rules fired, a SHA-256 of the LLM output, and a signature over the whole thing.
For iterative agentic flows, each retrieval step emits its own receipt with a trajectory block linking it to its parents - see Agentic and multi-step flows. The five verification outcomes and the policy framework are unchanged; the trajectory metadata sits alongside them.
See docs/how_it_works.md for the full algorithm, including the architectural distinction between fingerprint-based identity and embedding-based similarity. See docs/receipt_format.md for the schema spec.
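The ingestion step can be approximated in a few dozen lines of stdlib Python. This is a sketch under stated assumptions, not the Provenex algorithm: the base and modulus come from the description above, but the window size, the byte-level (UTF-8) windows, the exact set of zero-width characters, and all function names are illustrative choices.

```python
import hashlib
import re
import unicodedata

BASE = 1_000_003
MOD = (1 << 61) - 1  # Mersenne prime 2^61 - 1
# Assumption: a small, non-exhaustive set of zero-width code points to strip.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def normalize(text: str, fold_case: bool = False) -> str:
    """Unicode NFC, zero-width stripping, whitespace collapse, optional case folding."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(ZERO_WIDTH)
    text = re.sub(r"\s+", " ", text).strip()
    return text.casefold() if fold_case else text


def rolling_hashes(data: bytes, window: int) -> list:
    """Rabin-Karp hash of every window, updated in O(1) per step."""
    if window <= 0 or len(data) < window:
        return []
    high = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for byte in data[:window]:
        h = (h * BASE + byte) % MOD
    hashes = [h]
    for i in range(window, len(data)):
        h = ((h - data[i - window] * high) * BASE + data[i]) % MOD
        hashes.append(h)
    return hashes


def window_fingerprints(text: str, window: int = 16) -> list:
    """SHA-256 identity for each sliding window of the normalized text."""
    data = normalize(text).encode("utf-8")
    return ["sha256:" + hashlib.sha256(data[i:i + window]).hexdigest()
            for i in range(len(data) - window + 1)]


def chunk_fingerprint(text: str) -> str:
    """SHA-256 identity for a whole chunk after normalization."""
    return "sha256:" + hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


print(chunk_fingerprint("Annual  leave\u200b policy\n") == chunk_fingerprint("Annual leave policy"))  # True
print(chunk_fingerprint("Annual leave policy") == chunk_fingerprint("Annual leave policy!"))         # False
print(len(rolling_hashes(b"Annual leave policy", 8)))  # 12
```

Normalization makes cosmetically different copies of the same text hash identically, while any substantive edit changes the SHA-256. In this sketch the rolling hash plays no part in identity; it only shows how a cheap per-window value can be maintained incrementally.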
How this fits alongside vector databases (and OPA)
Vector databases store semantic similarity: dense embeddings that let you find content similar to a query. Provenex stores cryptographic identity: SHA-256 fingerprints that prove bit-exact match against a signed reference, plus a policy evaluation layer over operator-declared rules. These solve different problems and compose cleanly.
| | Vector DBs | Provenex |
|---|---|---|
| Primary storage | Dense embeddings (semantic similarity) | SHA-256 fingerprints (cryptographic identity) + signed metadata |
| Retrieval | Approximate nearest neighbor over vectors | Bit-exact match against signed index |
| Tampering | Not detectable. Embeddings are lossy by design | Detectable. Any modification produces a different SHA-256 |
| Policy enforcement | Tag-based filters at query construction | Evaluator-agnostic rule engine + signed decision record |
| Audit artifact | Vendor dashboard, internal logs | Signed JSON receipt, verifiable offline |
| Trust root | Vendor's SOC 2 attestation | HMAC (or Ed25519) signature, verifiable by anyone with the key |
| Vendor lock-in | Yes (per database) | None. Works alongside any retriever |
The expected enterprise deployment is both: vector DB for retrieval performance, Provenex for the policy enforcement record.
Composing with OPA and existing data governance tools
Provenex sits above your existing governance plumbing, not in place of it. PII detection happens in your data pipeline; classification happens in your data catalog; identity is owned by your IdP; authorization rules are authored in OPA / Rego if that's your house language. Provenex consumes the tags and identity those systems produce, applies the policy at retrieval time, and emits the signed record. The Rego adapter (commercial) lets you reuse Rego policies you already have; the OPA service adapter (commercial) lets you delegate decisions to a running OPA instance. The native YAML DSL exists for teams who don't already run OPA - it covers the common retrieval policies without forcing a new platform commitment.
Why vendor-agnostic matters
If you run more than one vector DB across the enterprise - common for cost or latency reasons - you have separate audit stories with separate vendor trust roots, and no way to produce a single signed record that says "this chunk, wherever it came from, was bit-exact identical to the one we authorized AND passed the policy in effect for this caller."
Provenex works the same way against all of them, because it never talks to the vector DB. It re-fingerprints the chunks the retriever returns, runs the same unified policy across every retrieval path, and emits the same receipt schema. One signed index, one policy engine, one verifiable artifact across every retrieval path in the enterprise. Migrating between vector DBs leaves the index, the policy, and the receipt format unchanged, so the audit trail carries over as-is.
Install
pip install provenex-core # core only (pure stdlib, SQLite backend)
pip install "provenex-core[postgres]" # + Postgres backend for production
pip install "provenex-core[policy]" # + native YAML policy DSL (PyYAML)
pip install "provenex-core[langchain]" # + LangChain integration
pip install "provenex-core[langgraph]" # + LangGraph integration
pip install "provenex-core[llamaindex]" # + LlamaIndex integration
pip install "provenex-core[crewai]" # + CrewAI integration
pip install "provenex-core[ed25519]" # + Ed25519 asymmetric signing
pip install "provenex-core[export-kafka]" # + KafkaSink (kafka-python)
pip install "provenex-core[export-aws]" # + SQSSink / S3AppendSink (boto3)
pip install "provenex-core[export-gcp]" # + PubSubSink (google-cloud-pubsub)
Python 3.10+. The core has zero third-party dependencies; it's pure stdlib. The Postgres backend, framework integrations, the native YAML DSL, and the Ed25519 signer are optional extras.
Try it in 30 seconds
pip install "provenex-core[policy]"
git clone https://github.com/provenex/provenex-core.git
export PROVENEX_SIGNING_SECRET="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
python provenex-core/examples/standalone_demo.py
For the integration-pattern story, run examples/rag_with_provenance.py. Watch a poisoned chunk that was added directly to the vector store, bypassing Provenex ingest, get caught at the retrieval boundary and blocked from reaching the LLM.
For the tool-call admission headline demo - a mixed retrieve → call_tool(allowed) → call_tool(denied) → retrieve agent flow producing four signed receipts validated end-to-end in one CLI invocation - run examples/agentic_admission_demo.py.
For the four-attack tour - one demo, four attack shapes, real LangChain integration (InMemoryVectorStore + @tool, no mocks) - run examples/attack_thwarted_demo.py. Install its extra dependencies first: pip install langchain-core numpy. The four acts:
- Act 1: a viewer-role insider tries jira.delete_issue, and the wrapped tool denies before the underlying function runs.
- Act 2: two RAG-poisoning variants land in the vector store - a never-indexed chunk and a window-aligned splice of an authorized doc - and both return UNVERIFIED.
- Act 3: an attacker tries to re-present a valid signed receipt as evidence for a different regulator query, and the v0.8.0 request_binding catches the replay.
- Act 4: a low-privilege insider attempts a restricted memory write and a secret-in-prompt model call, both denied via the policy.
The demo ends with an aggregate audit: 3 denies + 3 UNVERIFIED + 1 forged signature, all signed audit anchors a regulator can re-verify offline.
For MCP servers: examples/mcp_admission_demo.py - the provenex_mcp_admission decorator on a JSON-RPC tools/call handler. Three live requests (allow + deny + allow), the on_deny callback pattern emitting a structured JSON-RPC error response, plus the lower-level wrap_mcp_request for routers. Pure stdlib - no MCP server library needed.
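The decorator pattern mentioned above can be sketched in pure stdlib. This is an illustration of the shape only, not the provenex_mcp_admission API: the admission_gate decorator, the ALLOWED_TOOLS table, and reading the caller role from params._meta.role are all hypothetical, and the sketch does not emit a signed receipt. The error code -32000 sits in the range JSON-RPC 2.0 reserves for implementation-defined server errors.

```python
import functools

# Hypothetical allow-list: role -> tool names that role may call.
ALLOWED_TOOLS = {
    "engineer": {"jira.get_issue", "jira.delete_issue"},
    "viewer": {"jira.get_issue"},
}


def admission_gate(role_of):
    """Wrap a JSON-RPC tools/call handler with an allow/deny decision."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(request: dict) -> dict:
            tool = request["params"]["name"]
            role = role_of(request)
            if tool not in ALLOWED_TOOLS.get(role, set()):
                # Deny before the underlying handler runs.
                return {
                    "jsonrpc": "2.0",
                    "id": request.get("id"),
                    "error": {
                        "code": -32000,
                        "message": f"tool call denied: {tool}",
                        "data": {"role": role, "decision": "deny"},
                    },
                }
            return handler(request)
        return wrapper
    return decorator


# Assumption: the caller role travels in params._meta.role.
@admission_gate(role_of=lambda req: req["params"].get("_meta", {}).get("role", ""))
def handle_tools_call(request: dict) -> dict:
    text = f"ran {request['params']['name']}"
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"content": [{"type": "text", "text": text}]},
    }


def make_request(req_id: int, tool: str, role: str) -> dict:
    return {
        "jsonrpc": "2.0",
        "id": req_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": {}, "_meta": {"role": role}},
    }


allowed = handle_tools_call(make_request(1, "jira.get_issue", "viewer"))
denied = handle_tools_call(make_request(2, "jira.delete_issue", "viewer"))
print(allowed)
print(denied)
```

The gate only decides and reports; it never holds credentials or forwards the call itself, which mirrors the decision-and-proof split described earlier.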
For LangGraph state graphs: examples/langgraph_admission_node_demo.py - the conditional-edge pattern (admit_jira → execute_jira on allow vs admit_jira → denied_handler on deny). Two scenarios (engineer-allowed + viewer-denied), both audited end-to-end. Pure stdlib - the integration imports nothing from langgraph, so the demo runs without [langgraph] installed.
CLI
provenex ingest --index prov.db --doc-id policy_v4 policy.txt
provenex verify --index prov.db retrieved_chunk.txt
provenex receipt --index prov.db --output llm_output.txt chunk1.txt chunk2.txt
provenex audit receipt.json
provenex audit receipt.json --show-policy # render the unified policy block (both halves + tool calls)
provenex audit --trajectory ./receipts/ # validate a whole agentic trajectory at once (mixed step kinds)
provenex policy validate hr_policy.yaml # parse + validate a policy file (chunk + tool-call rules)
provenex policy hash hr_policy.yaml # print canonical policy_version_hash(es)
provenex index audit --index prov.db --threshold-days 180 # supersession lint: stale docs + unsuperseded older versions
provenex selftest # conformance check: every documented property re-derived in-process
provenex policy validate is the CI-time check for policy files: a typo or a reserved-but-unimplemented feature fails the build instead of silently allowing at runtime. provenex policy hash prints the canonical policy_version_hash that will appear on every receipt produced under that policy. provenex index audit is the cron-style check that catches a re-ingest path that skipped Provenex (and would otherwise leave stale chunks marked VERIFIED). provenex selftest is the one-command conformance check security teams ask for: it asserts every property the docs claim against the installed binary.
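To see why a canonical hash is useful as a receipt field, here is a minimal sketch. It assumes a canonical form of JSON with sorted keys and compact separators; the canonicalization provenex-core actually uses is not specified in this section, and policy_hash is a hypothetical helper, so the digests below will not match real policy_version_hash values.

```python
import hashlib
import json


def policy_hash(policy: dict) -> str:
    """SHA-256 over an assumed canonical form: sorted keys, compact separators."""
    canonical = json.dumps(
        policy, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


policy_a = {
    "version": 1,
    "policy_id": "hr-corpus-retrieval-v3",
    "verification": {
        "block_unauthorized": True,
        "block_tampered": True,
        "block_stale": False,
    },
}

# Same content as policy_a, written in a different key order.
policy_b = {
    "verification": {
        "block_stale": False,
        "block_tampered": True,
        "block_unauthorized": True,
    },
    "policy_id": "hr-corpus-retrieval-v3",
    "version": 1,
}

# One flag flipped relative to policy_a.
policy_c = {**policy_a, "verification": {**policy_a["verification"], "block_stale": True}}

print(policy_hash(policy_a) == policy_hash(policy_b))
print(policy_hash(policy_a) == policy_hash(policy_c))
```

Formatting differences do not change the hash, while any change to a rule does, so a receipt that carries the hash pins the exact rules in effect.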
For receipts signed with Ed25519 (asymmetric), pass --public-key audit.pub instead of relying on PROVENEX_SIGNING_SECRET. An auditor with only the public key can verify but cannot forge: the strongest version of the "verifiable by anyone" guarantee, suitable for handing receipts to external regulators.
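The difference between the two signing modes is easier to see next to a sketch of the symmetric case. The code below is an assumption-laden illustration, not the provenex-core receipt format: the payload fields, the canonical JSON encoding, and the sign / verify helpers are hypothetical. It does show the property that matters: with HMAC, the key that verifies a receipt is the same key that can produce one.

```python
import hashlib
import hmac
import json
import secrets


def canonical_bytes(payload: dict) -> bytes:
    """Assumed canonical encoding: sorted keys, compact separators, UTF-8."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(payload: dict, secret: bytes) -> dict:
    """Attach an HMAC-SHA256 signature over the canonical payload."""
    sig = hmac.new(secret, canonical_bytes(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": sig}


def verify(receipt: dict, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = hmac.new(
        secret, canonical_bytes(receipt["payload"]), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"])


secret = secrets.token_bytes(32)
receipt = sign({"chunk": "policy_v4#3", "outcome": "UNVERIFIED"}, secret)

# An edited payload no longer matches the signature.
tampered = {
    "payload": {**receipt["payload"], "outcome": "VERIFIED"},
    "signature": receipt["signature"],
}

print(verify(receipt, secret))
print(verify(tampered, secret))
print(verify(receipt, secrets.token_bytes(32)))
```

Anyone handed the secret in order to run verify could also call sign, which is the gap the Ed25519 mode closes for external auditors.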
Why open source?
Security teams won't trust a black box. If a regulator asks how your access-policy enforcement system works, "it's proprietary" is not an answer. The whole algorithm needs to be auditable end to end: normalization, rolling hash, sliding window, SHA-256 strengthening, policy evaluator semantics, receipt schema, signature payload. So it is.
Open source (this repo, MIT)
- Fingerprinting engine (normalizer + Rabin-Karp + SHA-256)
- Postgres provenance index for multi-node production (HMAC-signed rows, row-locked concurrent ingest)
- SQLite provenance index for single-node development (HMAC-signed rows, stdlib-only)
- RFC 6962 Merkle transparency log (optional, on top of either index)
- Receipt generation, HMAC + Ed25519 signing, offline inclusion-proof verification
- Unified policy (schema 2.3.0): single top-level policy block with verification, access_control, and tool_call_control halves
- Native YAML policy DSL for both chunk decisions and tool-call admission: pluggable PolicyEvaluator and ToolCallPolicyEvaluator protocols with the YAML evaluators as the reference backends; operators include in / not_in / not_older_than / matches_pattern / not_matches_pattern / length_at_most
- metadata_binding per decision: each chunk_metadata block on the receipt declares whether it was tag-at-ingest (signed by the index row) or tag-at-evaluate (looked up at decision time). Lets an auditor see the trust class of every input at a glance.
- Bloom-filter acceleration interface (BloomFilterIndex ABC + BloomAcceleratedIndex wrapper). High-throughput Bloom-filter acceleration available commercially.
- Tool-call admission primitive (schema 2.2.0+): provenex.admission_check(...) returns a signed receipt with actions[] + policy.tool_call_control. Reference MCP middleware (provenex.tool_call.integrations.mcp) and LangChain wrapper (ProvenexToolWrapper). Decision and proof, not execution - the wrapper never holds tokens or proxies the call.
- Source-of-record correlation fields (schema 2.3.0): top-level caller_hash (SHA-256 over the canonical caller dict; or HMAC-SHA256 with an opt-in deployment salt for per-deployment unlinkability) and optional trajectory.session_id (multi-trajectory correlation key). Decision-and-proof artifacts - they don't influence policy decisions, just make receipts joinable downstream by a SIEM / anomaly detector.
- Step-kind coverage entrypoints (0.6.5+): verify_memory(...), admit_memory_write(...), admit_model_inference(...) - convenience wrappers that produce admission-shaped receipts for the full agent surface (memory_read / memory_write / model_inference step kinds). Default redact_value=True / redact_prompt=True so verbatim values stay off the receipt by default; the hash anchor (value_hash / prompt_hash) is always recorded.
- Streaming export sinks (0.6.6+): ReceiptSink Protocol + reference sinks for StdoutJSONLSink / FileJSONLSink (date-rotated) / MultiSink (fan-out) / RetryQueueSink (bounded in-process retry) in the stdlib core. KafkaSink / SQSSink / S3AppendSink (date-hour-partitioned) / PubSubSink behind optional [export-kafka] / [export-aws] / [export-gcp] extras. Every emission entrypoint accepts sink=; failures are swallowed-and-logged so the agent's hot path is never broken by export degradation.
- OCSF v1.3 mapping (0.6.7+, stdlib core): provenex.receipt_to_ocsf(receipt_dict) transforms one signed receipt into one or more OCSF events (Application Activity / API Activity / Detection Finding). OCSFAdapter wraps any ReceiptSink so the stream emits OCSF events instead of raw receipts - instantly compatible with Splunk / Datadog / Elastic / Microsoft Sentinel. Full mapping spec in docs/ocsf_mapping.md.
- Source-of-record positioning + detection patterns (0.6.8+): docs/anomaly_detection.md - the canonical reference for how receipts integrate with downstream anomaly detectors / UEBA / SIEM. Schema field reference for detectors, five worked detection patterns, trust model, and the operational reasoning for the firewall / SIEM split. The native DSL deliberately refuses trajectory-level rules so per-decision purity (and the audit-anchor guarantees that depend on it) stays intact - see docs/policy.md.
- Window-aligned splice control (v0.8.0): entry_kind on every index row, signed as part of the canonical payload. Whole-chunk entries promote to VERIFIED; sliding-window entries are structural locators that never promote. A 128-codepoint substring of an authorized document, or a splice across two documents, returns UNVERIFIED. Red-team coverage: tests/test_window_splice_redteam.py.
- Request-to-receipt binding (schema 2.4.0, v0.8.0): top-level request_binding block hashes the triggering query + canonical request_context into the signed payload, so a valid receipt cannot be presented as evidence for a different query. Pass request_text= to verify_chunks(...) / admission_check(...). The verbatim query is never recorded; only its hash.
- Five-outcome precedence (v0.8.0): OUTCOME_PRECEDENCE = (TAMPERED, UNAUTHORIZED, STALE, UNVERIFIED, VERIFIED) codified in code with reduce_outcomes() helper. An auditor reading a denied chunk knows which condition shadowed which.
- Peppered fingerprint mode (v0.8.0): FingerprinterConfig(pepper=b"...") turns the content hash into HMAC-SHA256(pepper, normalized). Blunts the confirmation / dictionary attack against low-entropy corpora (HR templates, policy boilerplate) and gives per-deployment unlinkability of fingerprints across tenants.
- OSS witness / checkpoint log (v0.8.0): WitnessLog is a hash-chained, signed, append-only JSONL log of (tree_size, tree_root, issued_at) records. The producer-side artifact of the standard CT witness pattern. Operators publish each line to a store they cannot retroactively edit to close split-view resistance against a key-holding operator. See docs/threat_model.md.
- Conformance self-test (v0.8.0): provenex selftest runs an in-process set of checks that match every property the docs claim. Exits 0 on every check passing; 1 on any failure. Suitable for CI / pre-deploy.
- Index audit (v0.8.0): provenex index audit --index <db> surfaces documents that have not been re-ingested in the threshold window AND any older versions still flagged not-superseded after a re-ingest skipped the API. The supersession operational contract is in docs/how_it_works.md.
- Trajectory receipts (schema 1.3.0+): per-step receipts linked into a DAG for agentic / multi-step flows, mixing retrieval, tool-call, memory, and model-inference steps
- Self-attribution claims (schema 1.4.0+): signed but unverified records of what the agent said it used
- Content-source classifier (schema 1.4.0+): distinguish indexed-corpus chunks from live-tool / memory-store chunks
- LangChain / LangGraph / LlamaIndex / CrewAI / MCP integrations
- Framework-agnostic verify_chunks / verify_memory / admission_check / admit_memory_write / admit_model_inference for everything else
- Public hash helpers: compute_caller_hash(caller, salt=...) and compute_value_hash(value) so downstream consumers can independently re-derive the hashes embedded on receipts
- CLI: provenex ingest / verify / receipt / audit / policy / selftest / index audit
- Python SDK: pip install provenex-core
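The five-outcome precedence listed above lends itself to a small sketch. The tuple order is taken from the list; everything else is an assumption: that an earlier entry shadows a later one, that outcomes are plain strings, and the helper name reduce_outcomes_sketch, which is a stand-in rather than the library's reduce_outcomes().

```python
# Order copied from the feature list; earlier entries are assumed to win.
OUTCOME_PRECEDENCE = ("TAMPERED", "UNAUTHORIZED", "STALE", "UNVERIFIED", "VERIFIED")


def reduce_outcomes_sketch(outcomes):
    """Return the outcome that appears earliest in OUTCOME_PRECEDENCE."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("at least one outcome is required")
    unknown = [o for o in outcomes if o not in OUTCOME_PRECEDENCE]
    if unknown:
        raise ValueError(f"unknown outcomes: {unknown}")
    return min(outcomes, key=OUTCOME_PRECEDENCE.index)


# A chunk that is both stale and unauthorized reports UNAUTHORIZED.
print(reduce_outcomes_sketch(["VERIFIED", "STALE", "UNAUTHORIZED"]))
# Tampering shadows everything else.
print(reduce_outcomes_sketch(["UNVERIFIED", "TAMPERED"]))
```

Fixing the order in one tuple means the reported outcome for a chunk that fails several checks is deterministic, which is what lets an auditor reason about shadowing.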
Commercial (at provenex.ai)
- Rego adapter - load Rego bundles into the same PolicyEvaluator protocol; emit the same receipt shape
- OPA service adapter - delegate evaluation to a running OPA instance over HTTP
- Hosted provenance index with distributed signed append-only storage
- Transparency-log-backed policy bundle records (so policy_in_transparency_log: true lights up)
- Bloom-filter acceleration for high-throughput verification at 10M+ chunk scale
- Compliance-grade export formats (PDF, CSV, JSON-LD for regulator-side / semantic-web consumers)
- Identity-provider integration (RequestContext auto-populated from Okta / Azure AD)
- Inference attribution and temporal decay scoring
- Enterprise SSO / RBAC, HSM-backed Ed25519, dedicated support, SLA
The interfaces (ProvenanceIndex, PolicyEvaluator, BloomFilterIndex) are the same across open source and commercial. Moving from one to the other is one line of code: the class you instantiate.
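The "one line of code" claim follows from coding against an interface rather than a concrete class. A minimal sketch of that pattern, with hypothetical names throughout: IndexLike, LocalSetIndex, RemoteIndexStub, and the add / contains methods are illustrative and are not the actual ProvenanceIndex interface.

```python
from typing import Protocol


class IndexLike(Protocol):
    """Hypothetical stand-in for a shared index interface."""

    def add(self, fingerprint: str) -> None: ...
    def contains(self, fingerprint: str) -> bool: ...


class LocalSetIndex:
    """In-memory backend, standing in for a local development index."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def add(self, fingerprint: str) -> None:
        self._seen.add(fingerprint)

    def contains(self, fingerprint: str) -> bool:
        return fingerprint in self._seen


class RemoteIndexStub:
    """Pretends to be a hosted backend; here it only keeps a dict."""

    def __init__(self) -> None:
        self._rows: dict[str, bool] = {}

    def add(self, fingerprint: str) -> None:
        self._rows[fingerprint] = True

    def contains(self, fingerprint: str) -> bool:
        return self._rows.get(fingerprint, False)


def run(index: IndexLike) -> list[bool]:
    """Caller code depends only on the interface, not the backend."""
    index.add("abc123")
    return [index.contains(fp) for fp in ["abc123", "zzz999"]]


local_result = run(LocalSetIndex())    # swapping backends changes only this call
remote_result = run(RemoteIndexStub())
print(local_result, remote_result)
```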
Privacy and data sovereignty
The index stores fingerprints (one-way SHA-256 hashes) and metadata. No document content, no PII, no chunk text is ever written. Anyone with the index can verify retrieval, but the hashes cannot be inverted to recover document content. For low-entropy corpora, where someone holding the index could confirm guessed content by hashing candidates, the peppered fingerprint mode (FingerprinterConfig(pepper=b"...")) blunts that attack. The policy.access_control.decisions[].inputs field on the receipt records the metadata the evaluator looked at (residency tags, classification, caller role) - operators who want to redact those can set inputs: null while keeping the inputs_hash for offline verification.
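The confirmation attack against low-entropy content, and the effect of a pepper, can be shown in a few lines. This is a sketch under assumptions: the sample template text, the placeholder pepper, and the helper names are hypothetical, and normalization is skipped. The peppered form follows the HMAC-SHA256(pepper, normalized) construction described in the feature list.

```python
import hashlib
import hmac


def plain_fingerprint(normalized: str) -> str:
    """Unkeyed SHA-256: anyone can recompute it for a guessed input."""
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def peppered_fingerprint(normalized: str, pepper: bytes) -> str:
    """HMAC-SHA256 keyed by a per-deployment pepper."""
    return hmac.new(pepper, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# A low-entropy template: only a handful of plausible values exist.
guesses = [f"Salary band: {band}" for band in "ABCDE"]

# Unpeppered index: hashing each guess confirms which value was stored.
stored_plain = plain_fingerprint("Salary band: C")
confirmed = [g for g in guesses if plain_fingerprint(g) == stored_plain]

# Peppered index: without the pepper, no guess matches the stored value.
pepper = b"placeholder-deployment-pepper"  # hypothetical; use a real secret
stored_peppered = peppered_fingerprint("Salary band: C", pepper)
confirmed_without_pepper = [g for g in guesses if plain_fingerprint(g) == stored_peppered]

print(confirmed)
print(confirmed_without_pepper)
```

A different pepper per deployment also means the same document yields different fingerprints in each one, which is the unlinkability property mentioned for tenants.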
License
MIT. See LICENSE.
Links
Reading:
- Five Things People Mean by "AI Provenance" (And Which One Is For You): the category map, and where Provenex sits
- docs/policy.md: unified policy reference (verification + access control), DSL, worked examples, commercial roadmap
- docs/how_it_works.md: full algorithm, threat model, and architectural comparison to embedding-based systems
- docs/receipt_format.md: receipt schema 2.0.0 specification
- docs/quickstart.md: 5-minute getting-started, including a policy-driven retrieval path
- docs/threat_model.md: attacker model, defended/undefended threats, trust model for policy decisions
- docs/scaling.md: 1M-chunk benchmark numbers and policy-evaluation latency profile
Project:
- Homepage: provenex.ai
- Issues and discussion: GitHub Issues on this repo
- Commercial features: contact via provenex.ai