
De-identification for cloud-LLM hand-off: local detection, consistent pseudonyms, a reversible local map, and a swappable model backend.


vault-engine

A local-LLM privacy layer for anything you paste into a cloud model.

Strip the identities out of your text before it reaches ChatGPT / Claude / Gemini: a model running on your own machine finds the names, orgs, places and quasi-identifiers, replaces them with stable tokens, and keeps the only map back to the real identities on your disk. When the cloud answers in tokens, you restore the real identities locally.

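De-identify before anything goes to the cloud: local model detects → identities become tokens → the cloud analyzes the tokens → real identities are restored locally. Detection never leaves your machine, the identity map is stored only locally, and the model can be swapped in one line. Zero dependencies.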

vault-engine scrubs English and Chinese names and PII into tokens before the cloud sees them, then restores them locally

CI  ·  Python ≥3.9  ·  stdlib-only  ·  Apache-2.0

# notes.txt  ── private, on your machine
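林若曦是星澜资本的合伙人,在深圳见了字节跳动的陈大壮,邮箱 lin@xinglan.example
# (English: Lin Ruoxi, a partner at Xinglan Capital, met Chen Dazhuang of ByteDance in Shenzhen; email lin@xinglan.example)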

        ▼  vault-engine scrub  (local qwen3.6:27b)

# safe.txt  ── what the cloud sees: identities swapped for tokens
P-n1 是 ORG_1 的合伙人,在 LOC_1 见了 ORG_2 的 P-n2,邮箱 EMAIL_1
# (English: P-n1, a partner at ORG_1, met P-n2 of ORG_2 in LOC_1; email EMAIL_1)

Why

You want a frontier cloud model to analyze sensitive notes — but you don't want the cloud to learn who they're about. Masking only the names you already know leaks everything you don't: an unregistered name, an employer, a city + a rare title, a project codename. Pattern-based redaction never sees those at all.

vault-engine puts a local model in front as the detector, so the semantic identifiers get caught too — and nothing but the sanitized text ever leaves.

How it works

 private text                                    cloud model
      │                                          (sees only tokens)
      ▼                                                ▲
┌─────────────────────────── vault-engine ────────────┼───────────┐
│  ① regex PII detectors  (offline floor)              │           │
│  ② LLM detector         (local model finds names,    │           │
│                          orgs, places, quasi-IDs)    │           │
│  ③ consistent pseudonyms (张三→P-n1;                 │           │
│                          same name, same token)      │           │
│  ④ residual-risk critic  (re-scan: anything left?)   │  ① send   │
│        │                                             │           │
│   sanitized text ────────────────────────────────────┘           │
│        ▲                                                          │
│   reverse map (token → real identity) ── stays LOCAL ──┐  ② reply │
│        └───────────────────── ⑤ rehydrate ◀────────────┘          │
└──────────────────────────────────────────────────────────────────┘
      ▼
 real identities restored locally → use in your own system
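Steps ③ and ⑤ of the diagram can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the package's implementation: the detected entities are passed in by hand (standing in for steps ① and ②), the kind labels and the function names `pseudonymize` and `rehydrate` are hypothetical, and only the token shapes (P-n1, ORG_1, LOC_1, EMAIL_1) are taken from the sample above.

```python
import re
from typing import Dict, List, Tuple

Entity = Tuple[str, str]  # (surface text, kind); the kind labels are hypothetical

# Token prefixes copied from the README sample (P-n1, ORG_1, LOC_1, EMAIL_1).
PREFIX = {"PERSON": "P-n", "ORG": "ORG_", "LOC": "LOC_", "EMAIL": "EMAIL_"}


def pseudonymize(text: str, entities: List[Entity]) -> Tuple[str, Dict[str, str]]:
    """Step 3: swap each detected identity for a stable token.

    Returns the sanitized text and the reverse map (token -> real identity).
    """
    forward: Dict[str, str] = {}    # real identity -> token
    counters: Dict[str, int] = {}   # per-kind running number
    for surface, kind in entities:
        if surface not in forward:  # same name, same token
            counters[kind] = counters.get(kind, 0) + 1
            forward[surface] = f"{PREFIX[kind]}{counters[kind]}"
    if not forward:
        return text, {}
    # One pass over the text; longer surfaces first so they win over substrings.
    surfaces = sorted(forward, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in surfaces))
    sanitized = pattern.sub(lambda m: forward[m.group(0)], text)
    return sanitized, {token: surface for surface, token in forward.items()}


def rehydrate(text: str, reverse: Dict[str, str]) -> str:
    """Step 5: put real identities back using the local reverse map."""
    if not reverse:
        return text
    tokens = sorted(reverse, key=len, reverse=True)
    # (?!\d) keeps P-n1 from matching inside an unrelated P-n12.
    pattern = re.compile("(?:" + "|".join(re.escape(t) for t in tokens) + r")(?!\d)")
    return pattern.sub(lambda m: reverse[m.group(0)], text)


note = "Alice Chen of Acme met Bob in Oslo. Alice Chen wrote to bob@acme.example."
found = [("Alice Chen", "PERSON"), ("Acme", "ORG"), ("Bob", "PERSON"),
         ("Oslo", "LOC"), ("bob@acme.example", "EMAIL")]
safe, vault = pseudonymize(note, found)
print(safe)   # P-n1 of ORG_1 met P-n2 in LOC_1. P-n1 wrote to EMAIL_1.
print(rehydrate("P-n1 should follow up with P-n2 at ORG_1.", vault))
# Alice Chen should follow up with Bob at Acme.
```

Only `safe` would be sent anywhere; `vault` is the reverse map and stays on the local machine.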

Benchmark

How much identity each detector actually catches on a labelled bilingual dataset (methodology in eval/):

77 gold identities across 15 bilingual documents: easy PII plus hard cases (ambiguous common-word names, abbreviations, transliterations, @handles, a badge number, a license plate). Reproduce: python eval/run_eval.py --provider ollama --with-presidio.

⚠️ A small synthetic set for regression testing and rough comparison — not evidence of legal anonymization or complete privacy. "Recall" means flagged-for-redaction; LLM detection is non-deterministic. See the threat model.

detector                       person  org   location  project  contact  id    overall  over-redaction
regex only                     0%      0%    0%        0%       69%      33%   13%      0%
Microsoft Presidio (en/zh lg)  78%     59%   80%       33%      38%      0%    61%      4%
vault-engine (qwen3.6:27b)     100%    100%  100%      100%     100%     100%  100%     0%

On the same set where Presidio's NER scores 61% overall, the local LLM reaches 100%; the gap is widest on codenames, @handles, IDs, and Chinese names/orgs. The trade-off is speed: Presidio ~6s, the LLM ~25s/doc.

The point isn't a leaderboard — it's the shape: pattern-only redaction can't see names, organizations, locations, or codenames at all; a local LLM can.
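The two kinds of number in the table can be read as set arithmetic. The sketch below is one plausible way to compute them, not the code in eval/run_eval.py: it assumes recall is the share of gold identities that were flagged, and over-redaction is the share of flagged spans that are not gold. The `Span` type, both function names, and the sample data are made up for illustration.

```python
from typing import Set, Tuple

Span = Tuple[str, str]  # (document id, surface text); hypothetical representation


def recall(gold: Set[Span], flagged: Set[Span]) -> float:
    """Share of gold identities that the detector flagged for redaction."""
    return len(gold & flagged) / len(gold) if gold else 1.0


def over_redaction(gold: Set[Span], flagged: Set[Span]) -> float:
    """Share of flagged spans that were not identities (assumed definition)."""
    return len(flagged - gold) / len(flagged) if flagged else 0.0


gold = {("d1", "Alice"), ("d1", "Acme"), ("d2", "a@b.example"), ("d2", "Oslo")}
flagged = {("d1", "Alice"), ("d2", "a@b.example"), ("d1", "partner")}
print(f"recall={recall(gold, flagged):.0%}")                  # recall=50%
print(f"over-redaction={over_redaction(gold, flagged):.0%}")  # over-redaction=33%
```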

Install

pip install vault-engine

Or get the latest straight from source:

pip install git+https://github.com/fishonbike/vault-engine

For the default local backend, install Ollama and pull a model:

ollama pull qwen3.6:27b

No model yet? The deterministic floor (emails, phones, IDs, cards, URLs) works with zero setup via --no-llm.
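To give a feel for what a deterministic floor does, here is a minimal sketch of pattern-based detection. The patterns are illustrative assumptions, not the ones shipped in the package; they cover only emails, phones, and URLs (IDs and cards are left out), and the `detect` function is hypothetical.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; deliberately simple and far from exhaustive.
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://\S+"),
    "PHONE": re.compile(r"(?<!\d)\+?\d[\d\s().-]{7,}\d(?!\d)"),
}

Hit = Tuple[int, int, str, str]  # (start, end, kind, matched text)


def detect(text: str) -> List[Hit]:
    """Run every pattern over the text and return hits in document order."""
    hits: List[Hit] = []
    for kind, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), kind, match.group(0)))
    return sorted(hits)


sample = "Mail lin@xinglan.example or call +86 138 0013 8000, see https://example.org/a"
for start, end, kind, value in detect(sample):
    print(kind, value)
# EMAIL lin@xinglan.example
# PHONE +86 138 0013 8000
# URL https://example.org/a
```

Patterns like these never see a person's name or an employer, which is the gap the LLM detector is meant to fill.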

Quickstart

vault-engine scrub notes.txt -o notes.safe.txt

That writes notes.safe.txt (send this to the cloud) and notes.safe.txt.map.json (local only — the identities). Paste the sanitized text into your model, save its reply, then restore the real identities:

vault-engine rehydrate reply.json --map notes.safe.txt.map.json -o reply.real.json
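Conceptually, rehydrating a JSON reply means walking the parsed structure and substituting tokens inside every string. The sketch below shows that idea under loud assumptions: the on-disk layout of the map file is not documented here, so a flat JSON object of token to real identity is assumed, and `rehydrate_json` is a hypothetical helper rather than the package's function.

```python
import json
import re
from typing import Any, Dict


def rehydrate_json(node: Any, reverse: Dict[str, str]) -> Any:
    """Return a copy of a parsed JSON value with tokens replaced in every string."""
    if not reverse:
        return node
    tokens = sorted(reverse, key=len, reverse=True)
    # (?!\d) stops P-n1 from matching the front of P-n10.
    pattern = re.compile("(?:" + "|".join(map(re.escape, tokens)) + r")(?!\d)")

    def walk(value: Any) -> Any:
        if isinstance(value, str):
            return pattern.sub(lambda m: reverse[m.group(0)], value)
        if isinstance(value, list):
            return [walk(item) for item in value]
        if isinstance(value, dict):
            # Keys are strings in JSON, so a model may use tokens there too.
            return {walk(key): walk(item) for key, item in value.items()}
        return value  # numbers, booleans, null

    return walk(node)


# Assumed map layout: a flat JSON object of token -> real identity.
reverse_map = json.loads('{"P-n1": "Alice Chen", "ORG_1": "Acme"}')
reply = json.loads('{"summary": "P-n1 leads ORG_1", "people": [{"id": "P-n1", "score": 3}]}')
print(json.dumps(rehydrate_json(reply, reverse_map), ensure_ascii=False))
# {"summary": "Alice Chen leads Acme", "people": [{"id": "Alice Chen", "score": 3}]}
```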

The clipboard one-liner

The fastest path — scrub whatever you're about to paste into a chatbot, in place:

vault-engine clip               # de-identifies the clipboard
#   …paste into ChatGPT/Claude, copy its reply, then:
vault-engine clip --rehydrate   # restores the real identities in the clipboard

Works on macOS, Windows, and Linux (with xclip/xsel/wl-clipboard).

Library:

from pathlib import Path

from vaultengine import deidentify, rehydrate, Config

# send_to_cloud() and get_cloud_reply() are placeholders for your own cloud client.
text = Path("notes.txt").read_text(encoding="utf-8")
result = deidentify(text, Config(model="qwen3.6:27b"))
send_to_cloud(result.text)                  # tokens only
restored = rehydrate(get_cloud_reply(), result.vault)   # real identities, locally
result.vault.save("notes.map.json")         # the reverse map: keep it local

Use cases

  • Pseudonymize before pasting into ChatGPT/Claude — analyze private notes, contracts, or chats with direct identifiers stripped.
  • Redact logs & support tickets before sharing them or feeding an LLM.
  • Anonymize a dataset for LLM-assisted analysis, then map results back.
  • Air-gapped review loops — a model on a locked-down box only ever sees tokens.

How it compares

Presidio and LLM Guard are excellent, mature tools. vault-engine's bet is different: a local LLM as the detector catches semantic/quasi-identifiers that label-based NER misses, with zero runtime deps and first-class Chinese.

vault-engine Presidio LLM Guard (Anonymize) regex / scrubadub
Detection local LLM + regex NER (spaCy) + regex NER / transformers patterns only
Unregistered names / orgs / quasi-IDs ✅ LLM ⚠️ NER labels only ⚠️ NER-limited
Reversible round-trip ✅ local map ✅ deanonymizer ✅ Vault
Fully local / offline ✅ Ollama ⚠️ varies
Runtime dependencies none (stdlib) spaCy + models several varies
Chinese (中文) ✅ strong ⚠️ needs model ⚠️
Swap the model ✅ one line partial
Fail-loud if detector errors ✅ degrades + non-zero exit

Redaction policy (privacy ↔ utility)

--policy            Persons  Orgs / places / roles     Dates      Token shape
balanced (default)  ✅       typed (ORG_1, LOC_2)      kept       typed
max                 ✅       opaque R_1 (type hidden)  coarsened  opaque
light               ✅       left in place             kept       typed

balanced keeps coarse structure — the cloud still reads "ORG_1 hired P-n2 as ROLE_1 in LOC_1" and can reason about it, while no real identity ships. Persons are tokenized in every policy.
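The difference between the three policies comes down to what token, if any, a non-person entity receives. The sketch below illustrates that choice; `TokenMinter` and the kind labels are hypothetical, date coarsening is not modelled, and it assumes persons keep the P-n shape under every policy (the table only says they are always tokenized).

```python
from typing import Dict, Tuple

TYPED_PREFIX = {"ORG": "ORG_", "LOC": "LOC_", "ROLE": "ROLE_"}


class TokenMinter:
    """Hands out tokens for entities according to a redaction policy."""

    def __init__(self, policy: str = "balanced") -> None:
        if policy not in ("balanced", "max", "light"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self._tokens: Dict[Tuple[str, str], str] = {}
        self._counters: Dict[str, int] = {}

    def _next(self, prefix: str) -> str:
        self._counters[prefix] = self._counters.get(prefix, 0) + 1
        return f"{prefix}{self._counters[prefix]}"

    def token_for(self, surface: str, kind: str) -> str:
        key = (surface, kind)
        if key in self._tokens:              # consistent: same entity, same token
            return self._tokens[key]
        if kind == "PERSON":                 # persons are tokenized in every policy
            token = self._next("P-n")
        elif self.policy == "light":         # orgs / places / roles left in place
            return surface
        elif self.policy == "max":           # opaque: one shared counter hides the type
            token = self._next("R_")
        else:                                # balanced: the type stays visible
            token = self._next(TYPED_PREFIX[kind])
        self._tokens[key] = token
        return token


entities = [("Alice", "PERSON"), ("Acme", "ORG"), ("Oslo", "LOC"), ("Acme", "ORG")]
for policy in ("balanced", "max", "light"):
    minter = TokenMinter(policy)
    print(policy, [minter.token_for(s, k) for s, k in entities])
# balanced ['P-n1', 'ORG_1', 'LOC_1', 'ORG_1']
# max ['P-n1', 'R_1', 'R_2', 'R_1']
# light ['P-n1', 'Acme', 'Oslo', 'Acme']
```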

Swap the model

vault-engine models                                   # list local Ollama tags
vault-engine scrub notes.txt --model qwen3.6:35b-a3b  # any local model
vault-engine scrub notes.txt --provider null          # offline, regex only

Built-in providers: ollama (default), openai-compat (any OpenAI-style endpoint — opt-in; ⚠️ sends raw text to that endpoint), null (offline). Add your own by implementing one method (complete) and registering it.
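A provider that exposes a single `complete` method and is looked up by name could be shaped roughly like this. Everything below is an assumption made for illustration: the real signature of `complete`, the registration call, and what the null provider returns are not documented in this section, so `Provider`, `register`, `get_provider`, and both example classes are hypothetical.

```python
from typing import Callable, Dict, Protocol


class Provider(Protocol):
    def complete(self, prompt: str) -> str:
        """Return the model's raw text answer for a detection prompt."""
        ...


_PROVIDERS: Dict[str, Callable[[], Provider]] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator that makes a provider selectable by name."""
    def decorator(cls: type) -> type:
        _PROVIDERS[name] = cls
        return cls
    return decorator


def get_provider(name: str) -> Provider:
    if name not in _PROVIDERS:
        raise ValueError(f"unknown provider {name!r}; known: {sorted(_PROVIDERS)}")
    return _PROVIDERS[name]()


@register("null")
class NullProvider:
    """Offline stand-in: reports no entities, so only the regex floor applies."""

    def complete(self, prompt: str) -> str:
        return "[]"


@register("canned")
class CannedProvider:
    """Toy backend that always 'finds' one person; handy for offline tests."""

    def complete(self, prompt: str) -> str:
        return '[{"text": "Alice", "kind": "PERSON"}]'


print(get_provider("null").complete("find identities in: ..."))  # []
```

Any custom backend that sends the prompt to a remote endpoint sees the raw text, so the warning on openai-compat applies equally to self-written providers.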

⚠️ Security model — read this

  • The reverse map (*.map.json) is the identity. It's the only thing that links tokens back to real people. Keep it local. Never send it to a cloud model, never commit it — .gitignore excludes *.map.json and the CLI warns every run. Use --one-way to produce no map (irreversible publish).
  • Detection stays local by default. Only the sanitized text is meant to leave, and only when you send it.

Threat model & limitations (honest)

  • LLM detection is best-effort, not a guarantee of non-identifiability — a model can miss a name or a rare quasi-identifier. It is not k-anonymity or differential privacy.
  • The critic pass and the risk report reduce and surface residual risk; they don't certify its absence. Writing style and domain-unique facts can still identify with names removed — use max for higher-stakes material.
  • If the model backend is unreachable, the run degrades to regex-only and exits non-zero (--allow-degraded to override) — it will never silently ship under-redacted text.
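The last point, degrading loudly rather than silently, can be sketched as a small control-flow pattern. This is an assumed shape, not the CLI's code: `BackendUnavailable` and `detect_with_fallback` are hypothetical names, the detectors are plain callables, and the function returns an exit code instead of calling `sys.exit` so it stays testable.

```python
import sys
from typing import Callable, List, Tuple

Detector = Callable[[str], List[str]]


class BackendUnavailable(Exception):
    """Hypothetical error raised when the local model cannot be reached."""


def detect_with_fallback(text: str, regex_detect: Detector, llm_detect: Detector,
                         allow_degraded: bool = False) -> Tuple[List[str], int]:
    """Return (detected spans, process exit code)."""
    floor = regex_detect(text)  # the deterministic floor always runs
    try:
        return floor + llm_detect(text), 0
    except BackendUnavailable as exc:
        # Never silent: report on stderr and signal failure through the exit code.
        print(f"warning: model backend unreachable ({exc}); regex-only", file=sys.stderr)
        return floor, (0 if allow_degraded else 1)


def offline_llm(text: str) -> List[str]:
    raise BackendUnavailable("connection refused")


spans, code = detect_with_fallback("mail a@b.example", lambda t: ["a@b.example"], offline_llm)
print(spans, code)  # ['a@b.example'] 1
```

A caller would pass the returned code to `sys.exit`, so a script or CI job that pipes the output onward stops unless degraded runs were explicitly allowed.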

Protecting code & schemas (--format markdown)

With --format markdown (or auto, which switches on at a fenced block), anything inside fenced code blocks is preserved verbatim — a JSON reply-schema or code sample you include for the model survives untouched while the prose around it is scrubbed. Pre-existing placeholder tokens (e.g. P-7) pass through unchanged.
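One way to picture the fenced-block behaviour is to split the text on fences and scrub only the prose in between. The sketch below does exactly that with a regular expression; it is an assumption about the approach, `scrub_markdown` and `has_fence` are hypothetical names, and it handles only three-backtick fences (an unclosed fence is treated as prose).

```python
import re
from typing import Callable

# A fenced block: a line starting with three backticks, up to the next such line.
FENCE = re.compile(r"(^`{3}.*?^`{3}[ \t]*$)", re.MULTILINE | re.DOTALL)


def scrub_markdown(text: str, scrub_prose: Callable[[str], str]) -> str:
    """Apply scrub_prose to prose only; copy fenced code blocks through verbatim."""
    parts = FENCE.split(text)
    # With one capture group, split() alternates prose (even) and fences (odd).
    return "".join(part if i % 2 else scrub_prose(part) for i, part in enumerate(parts))


def has_fence(text: str) -> bool:
    """What an 'auto' mode could check before switching to markdown handling."""
    return FENCE.search(text) is not None


fence = "`" * 3
doc = "\n".join([
    "Ask Alice to fill this in:",
    fence + "json",
    '{"owner": "Alice"}',
    fence,
    "Thanks, Alice.",
])
print(scrub_markdown(doc, lambda prose: prose.replace("Alice", "P-n1")))
```

In the printed result the two prose lines read P-n1, while the JSON line inside the fence still says "Alice". That is the trade-off of this mode: anything placed in a code block is sent as-is, so real identities should not be put there.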

Development

python -m unittest discover -t . -s tests -v   # 59 tests, offline, no model
python eval/run_eval.py --provider ollama       # reproduce the benchmark

The test suite is fully offline and deterministic (null/fake providers); every fixture is synthetic, and no real data lives in this repo.

License

Apache-2.0 © 2026 fishonbike. See LICENSE and NOTICE.


