Reversible PII tokenization for LLM pipelines — send documents to cloud AI without exposing real data

sovereign-vault

Reversible PII tokenization for LLM pipelines.

Send documents containing real names, SSNs, emails, and account numbers to any cloud AI — Claude, Gemini, GPT — without exposing the actual values. The AI reasons about relationships and patterns on placeholder tokens. You reconstruct the real values locally after the response comes back.

pip install sovereign-vault

Python 3.10+ · License: MIT


The problem

You have documents with names, SSNs, emails, and account numbers. You need a cloud AI to analyze patterns, identify anomalies, or summarize findings. But you can't send the raw PII — compliance, legal, or common sense says no.

Standard redaction destroys the data permanently. The AI then can't reason about cross-entity relationships — "the same person appears in both transactions" becomes impossible once everything is [REDACTED].

The solution

Sovereign Vault replaces PII with HMAC-bound tokens that are stable within a session: the same value always maps to the same token, so the AI can track relationships across a document. After the cloud call, you reconstruct the real values locally.

from sovereign_vault import VaultSession

with VaultSession() as vault:
    abstract = vault.tokenize(
        "John Doe (SSN: 123-45-6789) transferred funds to "
        "Jane Smith (SSN: 987-65-4321) via john@firm.com on 2024-01-15."
    )
    # abstract:
    # "[[PERSON_A1B2C3D4_e5f6a7]] (SSN: [[SSN_B8C9D0E1_f2a3b4]]) transferred
    #  funds to [[PERSON_F5G6H7I8_j9k0l1]] (SSN: [[SSN_J2K3L4M5_n6o7p8]])
    #  via [[EMAIL_N9O0P1Q2_r3s4t5]] on 2024-01-15."

    response = your_llm_client.complete(abstract)  # cloud sees only tokens

    result = vault.reconstruct(response)  # real values restored locally
    # VaultSession.destroy() called automatically on context exit

No disk writes. No persistence between sessions. The mapping lives in RAM and is wiped on destroy().
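To make the idea of stable, HMAC-bound tokens concrete, here is a minimal sketch using only the standard library. It is not the package's implementation: the class name `TokenSketch`, the random 8-character id, and the 6-character truncated tag are assumptions chosen to resemble the token shape in the example above.

```python
import hashlib
import hmac
import secrets


class TokenSketch:
    """Illustrative only: stable, HMAC-tagged placeholders for one session."""

    def __init__(self) -> None:
        self._secret = secrets.token_bytes(32)  # 32-byte per-session secret
        self._by_value: dict[str, str] = {}     # real value -> token
        self._by_token: dict[str, str] = {}     # token -> real value

    def _tag(self, body: str) -> str:
        # Short HMAC-SHA256 tag over the token body (tag length is an assumption).
        return hmac.new(self._secret, body.encode(), hashlib.sha256).hexdigest()[:6]

    def token_for(self, label: str, value: str) -> str:
        # Reuse the existing token so repeated values stay linkable.
        if value in self._by_value:
            return self._by_value[value]
        body = f"{label}_{secrets.token_hex(4).upper()}"
        token = f"[[{body}_{self._tag(body)}]]"
        self._by_value[value] = token
        self._by_token[token] = value
        return token

    def verify(self, token: str) -> bool:
        # Recompute the tag from the body and compare in constant time.
        body, _, tag = token[2:-2].rpartition("_")
        return hmac.compare_digest(tag, self._tag(body))
```

Because the tag depends on a secret that exists only in this process, a token edited or invented outside the session fails `verify()`, and a token from one session means nothing in another.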


Detection layers

Three layers run in sequence. Layers 2 and 3 are optional; Layer 1 always runs, so detection never falls below regex-level reliability.

| Layer | Method | Confidence | Requires |
|---|---|---|---|
| 1 — Regex | Deterministic structural patterns | 1.0 | Nothing (always active) |
| 2 — GLiNER | Probabilistic NLP NER | 0.85× model score | pip install sovereign-vault[ner] |
| 3 — Ollama | Contextual LLM sweep | 0.65 | Local Ollama + pip install sovereign-vault[llm] |

Layer 3 triggers only when GLiNER finds fewer than 3 entities — handles implicit identifiers and role references that regex and NER miss.

Regex catches: SSN, phone, email, IP address, credit card, passport, Michigan DL, court case numbers

GLiNER catches: person names, organizations, locations, addresses, DOB, financial accounts, government IDs, medical record numbers

Ollama catches: contextual identifiers — "the defendant", "Account #XYZ", implicit role-based references
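The layering described above can be sketched as a single function. This is a rough illustration, not the package's code: `SKETCH_PATTERNS`, `Entity`, and `detect` are hypothetical names, the two regexes are simplified stand-ins for the real `REGEX_PATTERNS`, and the NER and LLM layers are passed in as plain callables so the example runs without GLiNER or Ollama. Running the sweep when no NER callable is supplied is also an assumption.

```python
import re
from typing import Callable, NamedTuple, Optional


class Entity(NamedTuple):
    label: str
    text: str
    confidence: float
    source_layer: str


# Two simplified patterns for illustration only.
SKETCH_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}

NerFn = Callable[[str], list]    # returns (label, text, model_score) tuples
SweepFn = Callable[[str], list]  # returns (label, text) tuples


def detect(text: str, ner: Optional[NerFn] = None,
           sweep: Optional[SweepFn] = None) -> list:
    # Layer 1: deterministic regex matches, always on, confidence 1.0.
    found = [
        Entity(label, m.group(), 1.0, "regex")
        for label, pattern in SKETCH_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    # Layer 2: optional NER; confidence is the model score scaled by 0.85.
    ner_hits = ner(text) if ner else []
    found += [Entity(label, span, 0.85 * score, "gliner")
              for label, span, score in ner_hits]
    # Layer 3: optional LLM sweep, only when NER found fewer than 3 entities.
    if sweep and len(ner_hits) < 3:
        found += [Entity(label, span, 0.65, "ollama")
                  for label, span in sweep(text)]
    return found
```

Passing the optional layers as callables keeps the regex path free of heavy dependencies, which mirrors the way the extras are split at install time.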


Installation

# Core (regex only — no dependencies)
pip install sovereign-vault

# With NLP entity recognition (quotes keep shells such as zsh from expanding the brackets)
pip install "sovereign-vault[ner]"

# With local LLM sweep (requires Ollama running locally)
pip install "sovereign-vault[llm]"

# Everything
pip install "sovereign-vault[all]"

Usage

Basic round-trip

from sovereign_vault import VaultSession

raw = "Alice (alice@corp.com, SSN 123-45-6789) authorized the transfer."

with VaultSession(use_gliner=False, use_ollama=False) as vault:
    abstract = vault.tokenize(raw)
    # Send `abstract` to cloud AI
    cloud_response = call_your_cloud_ai(abstract)
    restored = vault.reconstruct(cloud_response)

LENIENT mode — cloud paraphrased some tokens

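from sovereign_vault import VaultSession, ReconMode  # assumes ReconMode is exported at the package top level

with VaultSession(recon_mode=ReconMode.LENIENT) as vault: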
    abstract = vault.tokenize(raw)
    cloud_response = call_cloud(abstract)
    # Won't raise even if cloud dropped or paraphrased some tokens
    restored = vault.reconstruct(cloud_response)

SEALED mode — abstract output only, no reconstruction

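from sovereign_vault import VaultSession, SealMode  # assumes SealMode is exported at the package top level

with VaultSession(seal_mode=SealMode.SEALED) as vault: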
    abstract = vault.tokenize(raw)
    # Reconstruction is intentionally disabled
    # Use when the abstract output IS the final product

Audit log — chain of custody, no real values

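vault = VaultSession()
try:
    vault.tokenize(raw)
    # Each entry describes a vaulted entity; real values are never included
    for entry in vault.audit_log():
        print(entry["label"], entry["source_layer"], entry["confidence"])
finally:
    vault.destroy()  # wipe the mapping even if tokenize() raises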

Multi-session / server use

from sovereign_vault import new_session, get_session, drop_session

sid, vault = new_session()
abstract = vault.tokenize(raw)
cloud_output = call_your_cloud_ai(abstract)  # placeholder for your own client call
# ... pass sid to the next step in your pipeline ...
vault2 = get_session(sid)
restored = vault2.reconstruct(cloud_output)
drop_session(sid)  # destroys and deregisters
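For readers wondering what a session registry of this kind involves, the following is a minimal in-process sketch. It is not how `new_session`, `get_session`, and `drop_session` are implemented; the names `register`, `lookup`, and `drop` are hypothetical, and the lock and UUID-based ids are assumptions.

```python
import threading
import uuid

_sessions: dict = {}
_lock = threading.Lock()


def register(session: object) -> str:
    # Hand back an opaque id that can travel between pipeline steps.
    sid = uuid.uuid4().hex
    with _lock:
        _sessions[sid] = session
    return sid


def lookup(sid: str) -> object:
    with _lock:
        return _sessions[sid]  # KeyError if the id is unknown or already dropped


def drop(sid: str) -> None:
    # Deregister first, then destroy, so no caller can fetch a wiped session.
    with _lock:
        session = _sessions.pop(sid, None)
    if session is not None and hasattr(session, "destroy"):
        session.destroy()
```

A registry like this lives in one process's memory, so the id is only meaningful to code running in that same process.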

Security model

  • RAM-only, session-scoped — no disk writes, no persistence between sessions
  • HMAC-bound tokens — each token carries an HMAC tag derived from a 32-byte session secret; tampered or injected tokens raise VaultSealBreach
  • Injection prevention — input containing pre-existing [[...]] vault token format is rejected immediately
  • Entropy leak detection — reconstruct() flags high-entropy tokens in cloud output that may be inferred identifiers
  • Best-effort memory wipe — destroy() overwrites real values with random bytes before clearing
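Two of the checks listed above, injection prevention and entropy leak detection, can be approximated in a few lines. This is a sketch under stated assumptions rather than the package's logic: the function names are hypothetical, `ValueError` stands in for whatever exception the library raises, and the length and entropy thresholds are arbitrary illustrative values.

```python
import math
import re
from collections import Counter

TOKEN_SHAPE = re.compile(r"\[\[[^\[\]]+\]\]")


def reject_preexisting_tokens(text: str) -> None:
    # Refuse input that already contains something shaped like a vault token.
    if TOKEN_SHAPE.search(text):
        raise ValueError("input already contains [[...]] token syntax")


def shannon_entropy(word: str) -> float:
    # Bits per character, computed from the word's own character frequencies.
    counts = Counter(word)
    return -sum(n / len(word) * math.log2(n / len(word)) for n in counts.values())


def suspicious_words(text: str, min_len: int = 12, threshold: float = 3.5) -> list:
    # Flag long words outside vault tokens whose character entropy is high.
    outside = TOKEN_SHAPE.sub(" ", text)
    return [w for w in outside.split()
            if len(w) >= min_len and shannon_entropy(w) > threshold]
```

Ordinary words repeat letters and score low, while random-looking strings such as keys or account ids use many distinct characters and score high, which is why entropy is a reasonable first-pass signal.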

Reconstruction modes

| Mode | Behavior |
|---|---|
| ReconMode.STRICT (default) | Raises VaultReconstructionDegraded if cloud dropped any vault token |
| ReconMode.LENIENT | Allows partial reconstruction — logs missing tokens as warnings |
| SealMode.SEALED | Disables reconstruction entirely — raises VaultSealBreach if attempted |
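The difference between strict and lenient reconstruction comes down to what happens when an issued token is absent from the model's reply. The sketch below illustrates that decision with a plain dictionary. It is an assumption-laden stand-in: `reconstruct` here is a free function, `ReconstructionDegraded` is a local substitute for `VaultReconstructionDegraded`, and HMAC verification of incoming tokens is omitted.

```python
import logging
import re

TOKEN_RE = re.compile(r"\[\[[^\[\]]+\]\]")
log = logging.getLogger("vault_sketch")


class ReconstructionDegraded(Exception):
    """Stand-in for the package's VaultReconstructionDegraded."""


def reconstruct(response: str, mapping: dict, strict: bool = True) -> str:
    # Tokens that were issued but no longer appear in the model output.
    missing = [tok for tok in mapping if tok not in response]
    if missing and strict:
        raise ReconstructionDegraded(f"{len(missing)} token(s) missing from response")
    for tok in missing:
        log.warning("token missing from response: %s", tok)
    # Swap each known token back to its real value; unknown ones are left as-is here.
    return TOKEN_RE.sub(lambda m: mapping.get(m.group(), m.group()), response)
```

Strict mode suits pipelines where a dropped token means the answer can no longer be trusted, while lenient mode suits summaries where the model is expected to omit some entities.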

Use cases

  • Forensic e-discovery — send document patterns to cloud AI without exposing real names or case numbers
  • HIPAA pipelines — analyze medical records cross-entity without raw patient identifiers leaving your perimeter
  • Financial fraud detection — transaction pattern analysis without raw account numbers
  • Gov/defense document processing — reason about relationships in sensitive case files
  • Cross-agent PII passing — sanitize data moving between local and cloud agents in an agentic pipeline

Part of the LexiPro Sovereign OS

Sovereign Vault is a component of LexiPro — a local-first agentic OS running 15 MCP servers, 228 tools, and 20 agent personas on sovereign hardware. In the full OS, it powers Workflow O (Privacy Bridge): tokenize before any cloud call, reconstruct locally after, audit trail preserved.

Powered by:

  • Anthropic Claude — Tier 5 reasoning backbone for multi-file analysis
  • Google Gemini — OSINT, research, and long-context processing
  • Ollama — Layer 3 local LLM sweep (Gemma, Llama) for contextual entity detection
  • GLiNER — Layer 2 NLP NER for named entity recognition

Contributing

Issues and PRs welcome. The detection layer system is designed for extension — add new regex patterns to REGEX_PATTERNS, new GLiNER entity types to _GLINER_TYPES, or swap the Ollama model via the ollama_model parameter.


Known Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| RAM-only storage | Vault lost if process crashes mid-pipeline | Call vault.destroy() in a finally block; checkpoint vault keys externally if needed |
| Probabilistic NER (GLiNER/Ollama) | Novel PII formats may not be detected | Use coverage_report() after tokenize to assess detection quality |
| Regex layer only on plain text | HTML entities, encoded chars may slip through | Pre-normalize input with html.unescape() before tokenizing |
| Session-scoped tokens | Same real value gets different token in different sessions | Design your pipeline to tokenize once per document, not per chunk |
| Not a legal compliance layer | Sovereign Vault assists compliance; it cannot replace legal review | Combine with your organization's data classification policy |
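The pre-normalization mitigation for encoded input can be sketched with the standard library. `html.unescape()` is the call named in the table above; the extra NFKC step and the helper name `normalize_for_tokenizing` are assumptions added for illustration, not something the package prescribes.

```python
import html
import unicodedata


def normalize_for_tokenizing(raw: str) -> str:
    # Decode HTML entities such as &#64; or &nbsp; back to literal characters.
    text = html.unescape(raw)
    # Assumption: also fold compatibility forms (full-width digits, no-break
    # spaces) to their plain equivalents so structural patterns can match.
    return unicodedata.normalize("NFKC", text)
```

Running this before `tokenize()` means an address written as `alice&#64;corp.com` reaches the regex layer as `alice@corp.com`.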

Comparison: Sovereign Vault vs. alternatives

| Feature | sovereign-vault | Microsoft Presidio | AWS Comprehend PII | Simple regex redaction |
|---|---|---|---|---|
| Reversible tokenization | Yes | No (replace only) | No | No |
| HMAC integrity on tokens | Yes | No | No | No |
| Offline capable | Yes (regex layer) | Partial | No (API) | Yes |
| Named entity detection | Yes (GLiNER + Ollama) | Yes (spaCy) | Yes (cloud) | No |
| STRICT mode audit trail | Yes | No | No | No |
| Cloud cost | $0 (local) | $0 (local) | Per-call | $0 |
| Setup complexity | pip install | pip + models + server | AWS credentials | None |

Compliance Disclaimer

Sovereign Vault is a technical tool that assists with PII handling in LLM pipelines. It is not a legal compliance product and does not constitute legal advice.

GDPR / HIPAA / CCPA: Tokenizing PII before sending it to a cloud model reduces exposure but does not by itself satisfy the requirements of any data protection regulation. Your compliance obligations depend on your specific use case, data classification, and organizational policies. Consult qualified legal counsel before deploying in regulated environments.

What sovereign-vault does:

  • Replaces PII with HMAC-bound tokens so cloud AI never receives raw values
  • Provides an audit trail of all vaulted entities (no real values in log)
  • Wipes vault from RAM on destroy() call

What sovereign-vault does NOT do:

  • Guarantee detection of all PII in all languages and formats
  • Provide legal indemnification or certification
  • Replace a data classification policy or DPO review
  • Encrypt data at rest (vault is RAM-only by design)

License

MIT — see LICENSE.

Built by Broken Arrow Entertainment LLC · Sovereign Intelligence Systems Group
