Reversible PII tokenization for LLM pipelines — send documents to cloud AI without exposing real data
sovereign-vault
Reversible PII tokenization for LLM pipelines.
Send documents containing real names, SSNs, emails, and account numbers to any cloud AI — Claude, Gemini, GPT — without exposing the actual values. The AI reasons about relationships and patterns on placeholder tokens. You reconstruct the real values locally after the response comes back.
pip install sovereign-vault
The problem
You have documents with names, SSNs, emails, and account numbers. You need a cloud AI to analyze patterns, identify anomalies, or summarize findings. But you can't send the raw PII — compliance, legal, or common sense says no.
Standard redaction destroys the data permanently. The AI then can't reason about cross-entity relationships — "the same person appears in both transactions" becomes impossible once everything is [REDACTED].
The solution
Within a session, Sovereign Vault replaces each PII value with a stable, HMAC-bound token. The same value always maps to the same token, so the AI can track relationships across a document. You reconstruct the real values locally after the cloud call.
from sovereign_vault import VaultSession

with VaultSession() as vault:
    abstract = vault.tokenize(
        "John Doe (SSN: 123-45-6789) transferred funds to "
        "Jane Smith (SSN: 987-65-4321) via john@firm.com on 2024-01-15."
    )
    # abstract:
    # "[[PERSON_A1B2C3D4_e5f6a7]] (SSN: [[SSN_B8C9D0E1_f2a3b4]]) transferred
    # funds to [[PERSON_F5G6H7I8_j9k0l1]] (SSN: [[SSN_J2K3L4M5_n6o7p8]])
    # via [[EMAIL_N9O0P1Q2_r3s4t5]] on 2024-01-15."
    response = your_llm_client.complete(abstract)  # cloud sees only tokens
    result = vault.reconstruct(response)  # real values restored locally
# VaultSession.destroy() called automatically on context exit
No disk writes. No persistence between sessions. The mapping lives in RAM and is wiped on destroy().
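To make the idea of a stable, HMAC-bound token concrete, here is a minimal sketch using only the standard library. It is not the library's implementation: the class `ToyTokenizer`, its method names, the random 8-character identifier, and the 6-character truncated SHA-256 tag are all assumptions chosen to resemble the token shape shown above.

```python
import hashlib
import hmac
import secrets


class ToyTokenizer:
    """Illustrative only: one stable, HMAC-tagged token per (label, value)."""

    def __init__(self):
        self._secret = secrets.token_bytes(32)  # per-session secret, RAM only
        self._by_value = {}  # (label, value) -> token
        self._by_token = {}  # token -> real value

    def _tag(self, body):
        # Truncated HMAC-SHA256 over the token body (assumed tag scheme).
        digest = hmac.new(self._secret, body.encode(), hashlib.sha256)
        return digest.hexdigest()[:6]

    def mint(self, label, value):
        key = (label, value)
        if key not in self._by_value:
            body = f"{label}_{secrets.token_hex(4).upper()}"
            token = f"[[{body}_{self._tag(body)}]]"
            self._by_value[key] = token
            self._by_token[token] = value
        return self._by_value[key]

    def verify(self, token):
        # Recompute the tag from the body and compare in constant time.
        body, _, tag = token[2:-2].rpartition("_")
        return hmac.compare_digest(tag, self._tag(body))

    def resolve(self, token):
        return self._by_token[token]
```

Because the secret is generated per instance, a token minted in one session would not verify in another, which mirrors the session-scoped behavior described above.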
Detection layers
Three layers run in sequence. Layer 1 is always active; Layers 2 and 3 are optional, so detection never falls below Layer 1 reliability.
| Layer | Method | Confidence | Requires |
|---|---|---|---|
| 1 — Regex | Deterministic structural patterns | 1.0 | Nothing (always active) |
| 2 — GLiNER | Probabilistic NLP NER | 0.85× model score | pip install sovereign-vault[ner] |
| 3 — Ollama | Contextual LLM sweep | 0.65 | Local Ollama + pip install sovereign-vault[llm] |
Layer 3 triggers only when GLiNER finds fewer than 3 entities. It handles implicit identifiers and role references that regex and NER miss.
Regex catches: SSN, phone, email, IP address, credit card, passport, Michigan DL, court case numbers
GLiNER catches: person names, organizations, locations, addresses, DOB, financial accounts, government IDs, medical record numbers
Ollama catches: contextual identifiers — "the defendant", "Account #XYZ", implicit role-based references
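The layering can be sketched roughly as follows. This is an illustration under assumptions, not the package's code: `TOY_PATTERNS`, `regex_layer`, `scale_ner_confidence`, and `should_run_llm_sweep` are hypothetical names, the two regexes are simplified stand-ins for the real pattern table, and the `"regex"` value for `source_layer` is a guess. Only the confidence values and the fewer-than-3 gate come from the description above.

```python
import re

# Hypothetical, simplified patterns; the library's REGEX_PATTERNS may differ.
TOY_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def regex_layer(text):
    """Layer 1: deterministic structural matches, confidence fixed at 1.0."""
    hits = []
    for label, pattern in TOY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append({
                "label": label,
                "value": match.group(),
                "span": match.span(),
                "source_layer": "regex",
                "confidence": 1.0,
            })
    return sorted(hits, key=lambda hit: hit["span"])


def scale_ner_confidence(model_score):
    """Layer 2: confidence is the NER model score scaled by 0.85."""
    return 0.85 * model_score


def should_run_llm_sweep(ner_hits, threshold=3):
    """Layer 3 gate: sweep only when NER found fewer than `threshold` entities."""
    return len(ner_hits) < threshold
```

Running `regex_layer` on the round-trip example sentence used later in this README would yield one `EMAIL` hit followed by one `SSN` hit.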
Installation
# Core (regex only, no dependencies)
pip install sovereign-vault
# With NLP entity recognition (quotes keep shells such as zsh from expanding the brackets)
pip install "sovereign-vault[ner]"
# With local LLM sweep (requires Ollama running locally)
pip install "sovereign-vault[llm]"
# Everything
pip install "sovereign-vault[all]"
Usage
Basic round-trip
from sovereign_vault import VaultSession

raw = "Alice (alice@corp.com, SSN 123-45-6789) authorized the transfer."
with VaultSession(use_gliner=False, use_ollama=False) as vault:
    abstract = vault.tokenize(raw)
    # Send `abstract` to cloud AI
    cloud_response = call_your_cloud_ai(abstract)
    restored = vault.reconstruct(cloud_response)
LENIENT mode — cloud paraphrased some tokens
# Assumes ReconMode is importable from the package root
from sovereign_vault import ReconMode, VaultSession

with VaultSession(recon_mode=ReconMode.LENIENT) as vault:
    abstract = vault.tokenize(raw)
    cloud_response = call_cloud(abstract)
    # Won't raise even if cloud dropped or paraphrased some tokens
    restored = vault.reconstruct(cloud_response)
SEALED mode — abstract output only, no reconstruction
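# Assumes SealMode is importable from the package root
from sovereign_vault import SealMode, VaultSession

with VaultSession(seal_mode=SealMode.SEALED) as vault:
    abstract = vault.tokenize(raw)
    # Reconstruction is intentionally disabled
    # Use when the abstract output IS the final product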
Audit log — chain of custody, no real values
vault = VaultSession()
try:
    vault.tokenize(raw)
    for entry in vault.audit_log():
        print(entry["label"], entry["source_layer"], entry["confidence"])
finally:
    vault.destroy()  # wipe the mapping even if tokenizing or logging raises
Multi-session / server use
from sovereign_vault import new_session, get_session, drop_session
sid, vault = new_session()
abstract = vault.tokenize(raw)
# ... pass sid to the next step in your pipeline ...
vault2 = get_session(sid)
restored = vault2.reconstruct(cloud_output)
drop_session(sid) # destroys and deregisters
Security model
- RAM-only, session-scoped: no disk writes, no persistence between sessions
- HMAC-bound tokens: each token carries an HMAC tag derived from a 32-byte session secret; tampered or injected tokens raise VaultSealBreach
- Injection prevention: input containing pre-existing [[...]] vault token format is rejected immediately
- Entropy leak detection: reconstruct() flags high-entropy tokens in cloud output that may be inferred identifiers
- Best-effort memory wipe: destroy() overwrites real values with random bytes before clearing
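Two of these checks can be sketched with the standard library. The snippet below is an assumption-laden illustration, not the package's code: the function names, the `ValueError` (the real exception type for rejected input is not stated here), the 12-character minimum length, and the 3.5 bits-per-character threshold are all hypothetical choices.

```python
import math
import re
from collections import Counter

# Anything shaped like [[...]]; used for both checks below.
TOKEN_SHAPE = re.compile(r"\[\[[^\[\]]*\]\]")


def reject_preexisting_tokens(text):
    """Refuse input that already contains vault-style token syntax."""
    if TOKEN_SHAPE.search(text):
        raise ValueError("input already contains [[...]] token syntax")
    return text


def shannon_entropy(s):
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    total = len(s)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(s).values()
    )


def suspicious_words(text, min_len=12, threshold=3.5):
    """Flag long, high-entropy words outside of vault tokens."""
    stripped = TOKEN_SHAPE.sub(" ", text)
    return [
        word
        for word in re.findall(r"[A-Za-z0-9_-]+", stripped)
        if len(word) >= min_len and shannon_entropy(word) > threshold
    ]
```

Ordinary English words rarely clear both the length and entropy bars, while random-looking identifiers usually do, which is why a simple threshold can serve as a first-pass flag.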
Reconstruction modes
| Mode | Behavior |
|---|---|
| ReconMode.STRICT (default) | Raises VaultReconstructionDegraded if cloud dropped any vault token |
| ReconMode.LENIENT | Allows partial reconstruction; logs missing tokens as warnings |
| SealMode.SEALED | Disables reconstruction entirely; raises VaultSealBreach if attempted |
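The difference between STRICT and LENIENT can be illustrated with a toy reconstruction function. Everything here is a stand-in: `ToyReconMode`, `ToyDegraded`, `toy_reconstruct`, the plain `dict` mapping, and the token regex are assumptions made for the sketch, and the real classes may behave differently in detail.

```python
import logging
import re
from enum import Enum

log = logging.getLogger("toy_vault")

# Assumed token shape: [[LABEL_8HEXUPPER_6hexlower]]
TOKEN = re.compile(r"\[\[[A-Z]+_[0-9A-F]{8}_[0-9a-f]{6}\]\]")


class ToyReconMode(Enum):
    STRICT = "strict"
    LENIENT = "lenient"


class ToyDegraded(Exception):
    """Stand-in for the library's VaultReconstructionDegraded."""


def toy_reconstruct(response, mapping, mode=ToyReconMode.STRICT):
    """Swap tokens back to real values; react to dropped tokens per mode."""
    present = set(TOKEN.findall(response))
    missing = sorted(set(mapping) - present)
    if missing:
        if mode is ToyReconMode.STRICT:
            raise ToyDegraded(f"{len(missing)} token(s) missing from response")
        # LENIENT: log token names only, never the real values.
        log.warning("tokens missing from response: %s", missing)
    return TOKEN.sub(lambda m: mapping.get(m.group(), m.group()), response)
```

STRICT suits pipelines where a dropped token means the answer can no longer be trusted; LENIENT suits summaries where the model is expected to omit some entities.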
Use cases
- Forensic e-discovery — send document patterns to cloud AI without exposing real names or case numbers
- HIPAA pipelines — analyze medical records cross-entity without raw patient identifiers leaving your perimeter
- Financial fraud detection — transaction pattern analysis without raw account numbers
- Gov/defense document processing — reason about relationships in sensitive case files
- Cross-agent PII passing — sanitize data moving between local and cloud agents in an agentic pipeline
Part of the LexiPro Sovereign OS
Sovereign Vault is a component of LexiPro — a local-first agentic OS running 15 MCP servers, 228 tools, and 20 agent personas on sovereign hardware. In the full OS, it powers Workflow O (Privacy Bridge): tokenize before any cloud call, reconstruct locally after, audit trail preserved.
Powered by:
- Anthropic Claude — Tier 5 reasoning backbone for multi-file analysis
- Google Gemini — OSINT, research, and long-context processing
- Ollama — Layer 3 local LLM sweep (Gemma, Llama) for contextual entity detection
- GLiNER — Layer 2 NLP NER for named entity recognition
Contributing
Issues and PRs welcome. The detection layer system is designed for extension: add new regex patterns to REGEX_PATTERNS, add new GLiNER entity types to _GLINER_TYPES, or swap the Ollama model via the ollama_model parameter.
Known Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| RAM-only storage | Vault lost if process crashes mid-pipeline | Call vault.destroy() in a finally block; checkpoint vault keys externally if needed |
| Probabilistic NER (GLiNER/Ollama) | Novel PII formats may not be detected | Use coverage_report() after tokenize to assess detection quality |
| Regex layer only on plain text | HTML entities, encoded chars may slip through | Pre-normalize input with html.unescape() before tokenizing |
| Session-scoped tokens | Same real value gets different token in different sessions | Design your pipeline to tokenize once per document, not per chunk |
| Not a legal compliance layer | Sovereign Vault assists compliance; it cannot replace legal review | Combine with your organization's data classification policy |
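The pre-normalization mitigation from the table can be sketched as a small helper. `normalize_for_tokenizing` is a hypothetical name, and the NFKC step is an extra assumption beyond the `html.unescape()` call mentioned above; it folds look-alike forms such as fullwidth digits into their ASCII equivalents so structural patterns have a better chance of matching.

```python
import html
import unicodedata


def normalize_for_tokenizing(text):
    """Decode HTML entities, then fold compatibility characters (NFKC)."""
    text = html.unescape(text)  # "&#64;" -> "@", "&amp;" -> "&"
    return unicodedata.normalize("NFKC", text)


# Intended use (assumes a VaultSession named `vault` already exists):
# abstract = vault.tokenize(normalize_for_tokenizing(raw))
```

Normalizing once, before tokenizing, keeps the stored real values in the same form the regex layer saw.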
Comparison: Sovereign Vault vs. alternatives
| Feature | sovereign-vault | Microsoft Presidio | AWS Comprehend PII | Simple regex redaction |
|---|---|---|---|---|
| Reversible tokenization | Yes | No (replace only) | No | No |
| HMAC integrity on tokens | Yes | No | No | No |
| Offline capable | Yes (regex layer) | Partial | No (API) | Yes |
| Named entity detection | Yes (GLiNER + Ollama) | Yes (spaCy) | Yes (cloud) | No |
| STRICT mode audit trail | Yes | No | No | No |
| Cloud cost | $0 (local) | $0 (local) | Per-call | $0 |
| Setup complexity | pip install | pip + models + server | AWS credentials | None |
Compliance Disclaimer
Sovereign Vault is a technical tool that assists with PII handling in LLM pipelines. It is not a legal compliance product and does not constitute legal advice.
GDPR / HIPAA / CCPA: Tokenizing PII before sending it to a cloud model reduces exposure but does not by itself satisfy the requirements of any data protection regulation. Your compliance obligations depend on your specific use case, data classification, and organizational policies. Consult qualified legal counsel before deploying in regulated environments.
What sovereign-vault does:
- Replaces PII with HMAC-bound tokens so cloud AI never receives raw values
- Provides an audit trail of all vaulted entities (no real values in log)
- Wipes vault from RAM on destroy() call
What sovereign-vault does NOT do:
- Guarantee detection of all PII in all languages and formats
- Provide legal indemnification or certification
- Replace a data classification policy or DPO review
- Encrypt data at rest (vault is RAM-only by design)
License
MIT — see LICENSE.
Built by Broken Arrow Entertainment LLC · Sovereign Intelligence Systems Group