
provenex-core

Cryptographic provenance verification for RAG pipelines. When an enterprise AI system answers a question, this is what proves which documents the answer came from, whether they were current and authorized, and that they weren't tampered with along the way.

This repository contains the open source core: fingerprinting, local SQLite index, receipt generation, LangChain integration. The algorithm is open so it can be audited. Hosted infrastructure, Bloom-filter acceleration, compliance-grade exports, and cross-enterprise provenance graphs are available separately at provenex.ai.

Note on terminology. "Provenance" means several different things in the AI stack right now — training-data lineage, vector DB governance (Pinecone Nexus, Weaviate), retrieval verification, output faithfulness, generated-media credentials (C2PA). Provenex is the retrieval verification layer: cryptographic proof of which chunks reached the LLM, verifiable offline by anyone with the signing key, across any retriever. We've written up the full map in Five Things People Mean by "AI Provenance".

Five-line integration

from provenex.integrations.langchain import ProvenexIngestor, ProvenexRetriever
from provenex.index.sqlite_index import SQLiteProvenanceIndex

index = SQLiteProvenanceIndex("provenance.db")
ingestor = ProvenexIngestor(index=index)

ingestor.ingest(documents, doc_id="policy_v4", authorized=True)

retriever = ProvenexRetriever(base_retriever=your_existing_retriever, index=index)
result = retriever.get_relevant_documents_with_receipt(query)
print(result.receipt.to_json())

Your existing vector store is untouched. Provenex runs alongside as a parallel signed index. Whether you use Pinecone, Weaviate, Milvus, Qdrant, Chroma, FAISS, pgvector, MongoDB Atlas Vector Search, Elasticsearch with vectors, Vespa, or a Postgres table you wrote yourself — Provenex doesn't know and doesn't care. The integration surface is the retriever (LangChain today; LlamaIndex coming), not the database. your_existing_retriever keeps doing semantic similarity; Provenex adds cryptographic identity.

What a provenance receipt looks like

Every retrieval produces a JSON receipt that records exactly what went into the answer. Compliance teams hold onto it. Auditors verify it independently.

{
  "receipt_id": "prx_f2de431dc125ccfc6b57e6ca327fa504",
  "schema_version": "1.0.0",
  "issued_at": "2026-05-08T14:32:07.441Z",
  "issuer": "provenex-core/0.1.0",
  "output": {
    "hash": "sha256:6e9052525c80e43fb3612dce5edd025d350c8f0a1318097988ab4b0750c2f388",
    "hash_algorithm": "sha256"
  },
  "sources": [
    {
      "chunk_index": 0,
      "fingerprint": "sha256:1ebcde39...",
      "document_id": "policy_v4",
      "document_version": "sha256:1ebcde39...",
      "ingested_at": "2026-04-01T09:00:00Z",
      "chunk_offset": 0,
      "chunk_length": 936,
      "authorized": true,
      "verification_outcome": "VERIFIED",
      "normalization_applied": ["unicode_nfc", "strip_zero_width", "whitespace_collapse"]
    }
  ],
  "policy": { "block_unauthorized": true, "block_tampered": true, "...": "..." },
  "summary": { "total_chunks": 3, "verified": 2, "unverified": 1, "overall_status": "PARTIAL" },
  "signature": { "algorithm": "hmac-sha256", "value": "fc5d40895ca2..." }
}

Every retrieved chunk gets one of five verification outcomes:

| Outcome | Meaning |
| --- | --- |
| VERIFIED | Chunk in index, document current, authorized. |
| STALE | Chunk in index, but the document has been superseded by a newer version. |
| UNAUTHORIZED | Chunk in index, but the document is not authorized for this context. |
| UNVERIFIED | Chunk fingerprint not in index. It was never ingested through Provenex. |
| TAMPERED | Chunk in index but the stored signature failed verification. Alarm condition. |

The receipt is signed (HMAC-SHA256 by default; pluggable). Anyone with the receipt and the key can verify it didn't change since it was issued.
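Anyone holding the key can re-derive the signature and compare. Here is a minimal stdlib sketch of that check; the canonical payload (sorted-key JSON of everything except the signature field) is an assumption for illustration, not the documented wire format:

```python
import hashlib
import hmac
import json

def sign_receipt(receipt: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 signature over every field except `signature`."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    mac = hmac.new(key, payload, hashlib.sha256).hexdigest()
    receipt["signature"] = {"algorithm": "hmac-sha256", "value": mac}
    return receipt

def verify_receipt(receipt: dict, key: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    body = {k: v for k, v in receipt.items() if k != "signature"}
    payload = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, receipt["signature"]["value"])

key = b"demo-signing-secret"
receipt = sign_receipt(
    {"receipt_id": "prx_demo", "summary": {"overall_status": "VERIFIED"}}, key
)
assert verify_receipt(receipt, key)      # untouched receipt verifies
receipt["summary"]["overall_status"] = "PARTIAL"
assert not verify_receipt(receipt, key)  # any edit invalidates the signature
```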

How it works

Three components:

1. Ingestion. Documents are normalized (Unicode NFC, whitespace collapse, optional case folding, zero-width stripping) and run through a sliding window. Each window gets a Rabin-Karp rolling hash (base 1_000_003, modulo Mersenne prime 2^61 - 1) for cheap O(1) updates, strengthened with SHA-256 for collision-resistant identity. The fingerprints — not the document content — are written to the provenance index along with document_id, document_version, timestamp, and authorization state. The index never stores document text.
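A minimal sketch of that fingerprinting step, using the stated base and modulus. The window size, the output tuple shape, and the omission of the normalization pass are illustrative, not the library's actual parameters:

```python
import hashlib

BASE = 1_000_003
MOD = (1 << 61) - 1  # Mersenne prime 2**61 - 1

def fingerprints(text: str, window: int):
    """Slide a window over `text`; each step updates the Rabin-Karp hash in O(1)
    and pairs it with a collision-resistant SHA-256 of the window."""
    if len(text) < window:
        return []
    high = pow(BASE, window - 1, MOD)  # weight of the outgoing character
    h = 0
    for ch in text[:window]:           # hash of the first window, O(window)
        h = (h * BASE + ord(ch)) % MOD
    out = []
    for i in range(len(text) - window + 1):
        chunk = text[i:i + window]
        out.append((i, h, "sha256:" + hashlib.sha256(chunk.encode()).hexdigest()))
        if i + window < len(text):     # O(1) roll: drop outgoing char, add incoming
            h = (h - ord(text[i]) * high) % MOD
            h = (h * BASE + ord(text[i + window])) % MOD
    return out

fps = fingerprints("employees accrue twenty days of pto", 16)
```

Only the `(offset, strong_hash)` pairs would reach the index; the rolling hash exists so the window can advance one character at a time without rehashing from scratch.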

2. Retrieval verification. When your retriever returns chunks, Provenex re-fingerprints each one using the same normalization and hash pipeline, checks the fingerprint against the index, and assigns one of the five outcomes above. Configurable policy decides which outcomes block the chunk before it reaches the LLM.
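In sketch form, against a toy in-memory index. The record fields and the order of the checks are assumptions for illustration, not the real schema:

```python
def classify(chunk_fp: str, index: dict, latest_version: dict) -> str:
    """Map a retrieved chunk's fingerprint to one of the five outcomes."""
    record = index.get(chunk_fp)
    if record is None:
        return "UNVERIFIED"    # never ingested through Provenex
    if not record["row_signature_ok"]:
        return "TAMPERED"      # stored row failed its HMAC check
    if not record["authorized"]:
        return "UNAUTHORIZED"
    if record["document_version"] != latest_version[record["document_id"]]:
        return "STALE"
    return "VERIFIED"

index = {
    "fp1": {"row_signature_ok": True, "authorized": True,
            "document_id": "policy", "document_version": "v4"},
    "fp2": {"row_signature_ok": True, "authorized": True,
            "document_id": "policy", "document_version": "v3"},
    "fp3": {"row_signature_ok": False, "authorized": True,
            "document_id": "policy", "document_version": "v4"},
}
latest = {"policy": "v4"}
assert classify("fp1", index, latest) == "VERIFIED"
assert classify("fp2", index, latest) == "STALE"
assert classify("missing", index, latest) == "UNVERIFIED"
```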

3. Receipt. After verification, a JSON receipt is issued that records the chunks, their outcomes, the policy in effect, a SHA-256 of the LLM output, and a signature over the whole thing. The receipt is the artifact you keep.

See docs/how_it_works.md for the full algorithm, including the architectural distinction between fingerprint-based identity and embedding-based similarity. See docs/receipt_format.md for the schema spec.

How this fits alongside Pinecone Nexus, Weaviate, and other vector DBs

Vector databases store semantic similarity — dense embeddings that let you find content similar to a query. Provenex stores cryptographic identity — SHA-256 fingerprints that prove bit-exact match against a signed reference. These solve different problems and compose cleanly.

| | Vector DBs (Pinecone Nexus, Weaviate, Milvus, Qdrant, Chroma, FAISS, pgvector, ...) | Provenex |
| --- | --- | --- |
| Primary storage | Dense embeddings (semantic similarity) | SHA-256 fingerprints (cryptographic identity) |
| Retrieval | Approximate nearest neighbor over vectors | Bit-exact match against signed index |
| Tampering | Not detectable — embeddings are lossy by design | Detectable — any modification produces a different SHA-256 |
| Audit artifact | Vendor dashboard, internal logs | Signed JSON receipt, verifiable offline |
| Trust root | Vendor's SOC 2 attestation | HMAC signature, verifiable by anyone with the key |
| Vendor lock-in | Yes (per database) | None — works alongside any retriever |
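The tampering property above is a basic fact about cryptographic hashes, visible in two lines of stdlib Python:

```python
import hashlib

a = hashlib.sha256(b"Employees accrue 20 days of PTO.").hexdigest()
b = hashlib.sha256(b"Employees accrue 30 days of PTO.").hexdigest()
assert a != b  # a single changed byte yields a completely different fingerprint
```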

The expected enterprise deployment is both: vector DB for retrieval performance and vendor governance, Provenex for cryptographic audit trails compliance teams can hand to a regulator. See the blog post for the longer argument.

Why vendor-agnostic matters

Pinecone Nexus is governance inside Pinecone. Weaviate has its own governance stack. Milvus, Qdrant, Chroma, and the rest each have their own — or none. If you run Pinecone for one workload and Weaviate for another, you have two separate audit stories with two separate vendor trust roots, and no way to produce a single cryptographic record that says "this chunk, wherever it came from, is bit-exact identical to the one we authorized."

Provenex works the same way against all of them, because it never talks to the vector DB. It re-fingerprints the chunks the retriever returns, regardless of where they were stored. One signed index, one receipt schema, one verifiable artifact — across every retrieval path in the enterprise.

This also means migration risk between vector DBs goes to zero. If you decide to move from Pinecone to Weaviate, or from a managed service to something self-hosted, your provenance audit trail doesn't change. You re-ingest into the new vector DB; the Provenex index stays the same. Vector DB swaps are decoupled from compliance infrastructure.

The technical reason this works: Provenex's integration surface is the retriever (LangChain, LlamaIndex, custom Python), not the vector DB itself. As long as the retriever returns the chunk text the vector DB stored, Provenex can fingerprint it. We've smoke-tested against Chroma and FAISS in the examples; Pinecone, Weaviate, Milvus, Qdrant, and the rest are integration-trivial — a few lines of adapter code if you're not on a framework that already wraps them.
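A hypothetical adapter for a client with no framework wrapper might look like the following; the client's `query` method and hit shape are stand-ins for whatever your DB actually exposes:

```python
class CustomRetriever:
    """Duck-typed retriever: the only assumption is that your vector DB
    client can return the stored chunk text for a query."""

    def __init__(self, client, top_k: int = 5):
        self.client = client
        self.top_k = top_k

    def get_relevant_documents(self, query: str) -> list[str]:
        hits = self.client.query(query, top_k=self.top_k)  # your DB client's call
        return [hit["text"] for hit in hits]               # chunk text is all Provenex needs

class FakeClient:
    """Stand-in so the sketch runs without a real vector DB."""
    def query(self, query, top_k):
        return [{"text": "Employees accrue 20 days of PTO."}][:top_k]

docs = CustomRetriever(FakeClient()).get_relevant_documents("pto policy")
```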

Install

We haven't shipped to PyPI yet — install directly from this repository:

pip install git+https://github.com/provenex/provenex-core.git
pip install "git+https://github.com/provenex/provenex-core.git#egg=provenex-core[langchain]"

Python 3.10+. The core has zero third-party dependencies — it's pure stdlib. LangChain and LlamaIndex are optional extras.

A PyPI release (pip install provenex-core) is coming once the API stabilizes. Pin to a commit hash in the meantime if you need a fixed version.

Try it in 30 seconds

git clone https://github.com/provenex/provenex-core.git
cd provenex-core
pip install -e .

export PROVENEX_SIGNING_SECRET="$(python3 -c 'import secrets; print(secrets.token_hex(32))')"
python examples/standalone_demo.py

examples/standalone_demo.py runs the full story end-to-end — ingest a document, get a signed receipt with a cryptographic inclusion proof, watch the HMAC catch a tampered row, then re-verify the proof with the database deleted using only the receipt fields and the published tree root. It's the demo we'd show a sceptical compliance team.

Want a shareable asciicast? See docs/recording_demo.md for the asciinema recipe.

CLI

provenex ingest  --index prov.db --doc-id policy_v4 policy.txt
provenex verify  --index prov.db retrieved_chunk.txt
provenex receipt --index prov.db --output llm_output.txt chunk1.txt chunk2.txt

Set PROVENEX_SIGNING_SECRET in your environment. The verify command exits non-zero when the outcome is not VERIFIED, so it composes in shell pipelines.

Why open source?

Compliance teams won't trust a black box. If a regulator asks how your provenance system works, "it's proprietary" is not an answer. The algorithm — normalization, rolling hash, sliding window, SHA-256 strengthening, receipt schema, signature payload — needs to be auditable end to end. So it is. The commercial value is in the hosted infrastructure that runs this algorithm at scale across an enterprise, not in keeping the algorithm secret.

What's in this repo:

  • Fingerprinting engine (normalizer + Rabin-Karp + SHA-256)
  • Local SQLite provenance index with HMAC-signed rows
  • Receipt generation and signature verification
  • LangChain integration (retriever middleware + ingestor)
  • CLI: provenex ingest / verify / receipt
  • Python SDK (install from GitHub — see Install)

What's not in this repo (commercial features at provenex.ai):

  • Hosted provenance index with distributed signed append-only storage
  • Bloom-filter acceleration for high-throughput verification
  • Compliance-grade export formats (PDF, JSON-LD for regulators)
  • Cross-enterprise provenance graphs
  • Inference attribution and temporal decay scoring
  • Enterprise SSO / RBAC

The interface (ProvenanceIndex) is the same. Moving from open source to commercial is one line of code: the class you instantiate.

Privacy and data sovereignty

The index stores fingerprints — one-way SHA-256 hashes — and metadata. No document content, no PII, no chunk text is ever written. Anyone with the index can verify retrieval, but no one can recover document content from it.

License

MIT. See LICENSE.

Links

Project:

  • Homepage: provenex.ai
  • Issues and discussion: GitHub Issues on this repo
  • Commercial features: contact via provenex.ai
