Skip to main content

Reversible PII tokenization for LLM pipelines — send documents to cloud AI without exposing real data

Project description

sovereign-vault

Reversible PII tokenization for LLM pipelines.

Send documents containing real names, SSNs, emails, and account numbers to any cloud AI — Claude, Gemini, GPT — without exposing the actual values. The AI reasons about relationships and patterns on placeholder tokens. You reconstruct the real values locally after the response comes back.

pip install sovereign-vault

PyPI version Python 3.10+ License: MIT CI


The problem

You have documents with names, SSNs, emails, and account numbers. You need a cloud AI to analyze patterns, identify anomalies, or summarize findings. But you can't send the raw PII — compliance, legal, or common sense says no.

Standard redaction destroys the data permanently. The AI then can't reason about cross-entity relationships — "the same person appears in both transactions" becomes impossible once everything is [REDACTED].

The solution

Sovereign Vault replaces PII with stable, HMAC-bound tokens per session. The same value always maps to the same token, so AI can track relationships across a document. You reconstruct locally after the cloud call.

from sovereign_vault import VaultSession

with VaultSession() as vault:
    abstract = vault.tokenize(
        "John Doe (SSN: 123-45-6789) transferred funds to "
        "Jane Smith (SSN: 987-65-4321) via john@firm.com on 2024-01-15."
    )
    # abstract:
    # "[[PERSON_A1B2C3D4_e5f6a7]] (SSN: [[SSN_B8C9D0E1_f2a3b4]]) transferred
    #  funds to [[PERSON_F5G6H7I8_j9k0l1]] (SSN: [[SSN_J2K3L4M5_n6o7p8]])
    #  via [[EMAIL_N9O0P1Q2_r3s4t5]] on 2024-01-15."

    response = your_llm_client.complete(abstract)  # cloud sees only tokens

    result = vault.reconstruct(response)  # real values restored locally
    # VaultSession.destroy() called automatically on context exit

No disk writes. No persistence between sessions. The mapping lives in RAM and is wiped on destroy().


Detection layers

Three layers run in sequence. Each is optional — the system never falls below Layer 1 reliability.

Layer Method Confidence Requires
1 — Regex Deterministic structural patterns 1.0 Nothing (always active)
2 — GLiNER Probabilistic NLP NER 0.85× model score pip install sovereign-vault[ner]
3 — Ollama Contextual LLM sweep 0.65 Local Ollama + pip install sovereign-vault[llm]

Layer 3 triggers only when GLiNER finds fewer than 3 entities — handles implicit identifiers and role references that regex and NER miss.

Regex catches: SSN, phone, email, IP address, credit card, passport, Michigan DL, court case numbers

GLiNER catches: person names, organizations, locations, addresses, DOB, financial accounts, government IDs, medical record numbers

Ollama catches: contextual identifiers — "the defendant", "Account #XYZ", implicit role-based references


Installation

# Core (regex only — no dependencies)
pip install sovereign-vault

# With NLP entity recognition
pip install sovereign-vault[ner]

# With local LLM sweep (requires Ollama running locally)
pip install sovereign-vault[llm]

# Everything
pip install sovereign-vault[all]

Usage

Basic round-trip

from sovereign_vault import VaultSession

raw = "Alice (alice@corp.com, SSN 123-45-6789) authorized the transfer."

with VaultSession(use_gliner=False, use_ollama=False) as vault:
    abstract = vault.tokenize(raw)
    # Send `abstract` to cloud AI
    cloud_response = call_your_cloud_ai(abstract)
    restored = vault.reconstruct(cloud_response)

LENIENT mode — cloud paraphrased some tokens

with VaultSession(recon_mode=ReconMode.LENIENT) as vault:
    abstract = vault.tokenize(raw)
    cloud_response = call_cloud(abstract)
    # Won't raise even if cloud dropped or paraphrased some tokens
    restored = vault.reconstruct(cloud_response)

SEALED mode — abstract output only, no reconstruction

with VaultSession(seal_mode=SealMode.SEALED) as vault:
    abstract = vault.tokenize(raw)
    # Reconstruction is intentionally disabled
    # Use when the abstract output IS the final product

Audit log — chain of custody, no real values

vault = VaultSession()
vault.tokenize(raw)
for entry in vault.audit_log():
    print(entry["label"], entry["source_layer"], entry["confidence"])
vault.destroy()

Multi-session / server use

from sovereign_vault import new_session, get_session, drop_session

sid, vault = new_session()
abstract = vault.tokenize(raw)
# ... pass sid to the next step in your pipeline ...
vault2 = get_session(sid)
restored = vault2.reconstruct(cloud_output)
drop_session(sid)  # destroys and deregisters

Security model

  • RAM-only, session-scoped — no disk writes, no persistence between sessions
  • HMAC-bound tokens — each token carries an HMAC tag derived from a 32-byte session secret; tampered or injected tokens raise VaultSealBreach
  • Injection prevention — input containing pre-existing [[...]] vault token format is rejected immediately
  • Entropy leak detectionreconstruct() flags high-entropy tokens in cloud output that may be inferred identifiers
  • Best-effort memory wipedestroy() overwrites real values with random bytes before clearing

Reconstruction modes

Mode Behavior
ReconMode.STRICT (default) Raises VaultReconstructionDegraded if cloud dropped any vault token
ReconMode.LENIENT Allows partial reconstruction — logs missing tokens as warnings
SealMode.SEALED Disables reconstruction entirely — raises VaultSealBreach if attempted

Use cases

  • Forensic e-discovery — send document patterns to cloud AI without exposing real names or case numbers
  • HIPAA pipelines — analyze medical records cross-entity without raw patient identifiers leaving your perimeter
  • Financial fraud detection — transaction pattern analysis without raw account numbers
  • Gov/defense document processing — reason about relationships in sensitive case files
  • Cross-agent PII passing — sanitize data moving between local and cloud agents in an agentic pipeline

Part of the LexiPro Sovereign OS

Sovereign Vault is a component of LexiPro — a local-first agentic OS running 15 MCP servers, 228 tools, and 20 agent personas on sovereign hardware. In the full OS, it powers Workflow O (Privacy Bridge): tokenize before any cloud call, reconstruct locally after, audit trail preserved.

Powered by:

  • Anthropic Claude — Tier 5 reasoning backbone for multi-file analysis
  • Google Gemini — OSINT, research, and long-context processing
  • Ollama — Layer 3 local LLM sweep (Gemma, Llama) for contextual entity detection
  • GLiNER — Layer 2 NLP NER for named entity recognition

Contributing

Issues and PRs welcome. The detection layer system is designed for extension — add new regex patterns to REGEX_PATTERNS, new GLiNER entity types to _GLINER_TYPES, or swap the Ollama model via ollama_model parameter.


License

MIT — see LICENSE.

Built by Broken Arrow Entertainment LLC · Sovereign Intelligence Systems Group

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

sovereign_vault-1.0.0.tar.gz (15.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

sovereign_vault-1.0.0-py3-none-any.whl (11.5 kB view details)

Uploaded Python 3

File details

Details for the file sovereign_vault-1.0.0.tar.gz.

File metadata

  • Download URL: sovereign_vault-1.0.0.tar.gz
  • Upload date:
  • Size: 15.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for sovereign_vault-1.0.0.tar.gz
Algorithm Hash digest
SHA256 ceaa211c39e168e6e8365a236a393de0de38494d86d7938473e1891e1e6872d2
MD5 ff44bd22d7bf54651f11d08229812641
BLAKE2b-256 dd43bb28f210afeec70e09fd5088e7de8ecff8c3421cf775d0101625133900e1

See more details on using hashes here.

File details

Details for the file sovereign_vault-1.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for sovereign_vault-1.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8080bea201502331890e94ec2fbaaa1ffae115f1ec66d372549d98576e733bc1
MD5 de161a2184099eb5fbaa4b9013ae3e3f
BLAKE2b-256 a542f15b64944a0a0eebbc0a953036fc551c0401eeb013240ea9ee6f8245b161

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page