Identity de-identification for cloud-LLM hand-off: local detection, consistent pseudonyms, a reversible local map, and a swappable model backend.
vault-engine
A local-LLM privacy layer for anything you paste into a cloud model.
Strip the identities out of your text before it reaches ChatGPT / Claude / Gemini — a model running on your own machine finds the names, orgs, places and quasi-identifiers, replaces them with stable tokens, and keeps the only key-back-to-reality on disk. When the cloud answers in tokens, you put the real identities back locally.
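De-identify before anything goes to the cloud: a local model detects identities → they are replaced with pseudonyms → the cloud analyzes the pseudonymized text → real identities are restored locally. Detection never leaves your machine, the identity map is stored only locally, and the model can be swapped in one line. Zero dependencies.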
· Python ≥3.9 · stdlib-only · Apache-2.0
# notes.txt ── private, on your machine
林若曦是星澜资本的合伙人,在深圳见了字节跳动的陈大壮,邮箱 lin@xinglan.example
▼ vault-engine scrub (local qwen3.6:27b)
# safe.txt ── what the cloud sees: identities swapped for tokens
P-n1 是 ORG_1 的合伙人,在 LOC_1 见了 ORG_2 的 P-n2,邮箱 EMAIL_1
Why
You want a frontier cloud model to analyze sensitive notes — but you don't want the cloud to learn who they're about. Masking only the names you already know leaks everything you don't: an unregistered name, an employer, a city + a rare title, a project codename. Pattern-based redaction never sees those at all.
vault-engine puts a local model in front as the detector, so the semantic
identifiers get caught too — and nothing but the sanitized text ever leaves.
How it works
private text cloud model
│ (sees only tokens)
▼ ▲
┌─────────────────────────── vault-engine ────────────┼───────────┐
│ ① regex PII detectors (offline floor) │ │
│ ② LLM detector (local model finds names, │ │
│ orgs, places, quasi-IDs) │ │
│ ③ consistent pseudonyms (张三→P-n1, 同名同号) │ │
│ ④ residual-risk critic (re-scan: anything left?) │ ① send │
│ │ │ │
│ sanitized text ────────────────────────────────────┘ │
│ ▲ │
│ reverse map (token → real identity) ── stays LOCAL ──┐ ② reply │
│ └───────────────────── ⑤ rehydrate ◀────────────┘ │
└──────────────────────────────────────────────────────────────────┘
▼
real identities restored locally → use in your own system
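Steps ③ and ⑤ of the diagram can be sketched in a few lines. This is a minimal illustration under stated assumptions, not vault-engine's implementation: the Pseudonymizer class and its method names are hypothetical, the token shapes are modelled on the sample output above (P-n1, ORG_1), the sample sentence is made up, and the entities are passed in by hand rather than found by a model.

```python
import re
from collections import defaultdict


class Pseudonymizer:
    """Hypothetical helper, not part of vault-engine's API."""

    def __init__(self):
        self.forward = {}                 # (kind, real value) -> token
        self.reverse = {}                 # token -> real value; stays on this machine
        self.counters = defaultdict(int)  # running index per entity kind

    def token_for(self, kind, value):
        # The same real value always maps to the same token.
        key = (kind, value)
        if key not in self.forward:
            self.counters[kind] += 1
            n = self.counters[kind]
            token = f"P-n{n}" if kind == "PERSON" else f"{kind}_{n}"
            self.forward[key] = token
            self.reverse[token] = value
        return self.forward[key]

    def scrub(self, text, entities):
        kinds = {value: kind for kind, value in entities}
        if not kinds:
            return text
        # Single pass, longest value first, so replacements never overlap.
        pattern = re.compile("|".join(
            re.escape(v) for v in sorted(kinds, key=len, reverse=True)))
        return pattern.sub(
            lambda m: self.token_for(kinds[m.group(0)], m.group(0)), text)

    def rehydrate(self, text):
        if not self.reverse:
            return text
        # The (?!\d) guard keeps P-n1 from matching the front of P-n10.
        tokens = sorted(self.reverse, key=len, reverse=True)
        pattern = re.compile(
            "(?:" + "|".join(re.escape(t) for t in tokens) + r")(?!\d)")
        return pattern.sub(lambda m: self.reverse[m.group(0)], text)


vault = Pseudonymizer()
safe = vault.scrub(
    "Alice Lin of Starwave met Bob Chen. Alice Lin emailed Bob Chen.",
    [("PERSON", "Alice Lin"), ("ORG", "Starwave"), ("PERSON", "Bob Chen")],
)
print(safe)                   # P-n1 of ORG_1 met P-n2. P-n1 emailed P-n2.
print(vault.rehydrate(safe))  # the original sentence again
```

Tokens are numbered in order of first appearance, and only the reverse dictionary can undo the substitution, which is why the map is the one artifact that must stay local.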
Benchmark
How much identity each detector actually catches, on a labelled bilingual
dataset (reproduce with python eval/run_eval.py; methodology in
eval/):
77 gold identities across 15 bilingual documents — easy PII plus hard cases
(ambiguous common-word names, abbreviations, transliterations, @handles, a badge
number, a license plate). Reproduce:
python eval/run_eval.py --provider ollama --with-presidio.
⚠️ A small synthetic set for regression testing and rough comparison — not evidence of legal anonymization or complete privacy. "Recall" means flagged-for-redaction; LLM detection is non-deterministic. See the threat model.
| detector | person | org | location | project | contact | id | overall | over-redaction |
|---|---|---|---|---|---|---|---|---|
| regex only | 0% | 0% | 0% | 0% | 69% | 33% | 13% | 0% |
| Microsoft Presidio (en/zh lg) | 78% | 59% | 80% | 33% | 38% | 0% | 61% | 4% |
| vault-engine (qwen3.6:27b) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% |
On the same set where Presidio's NER scores 61% overall, the local LLM reaches 100%. The gap is widest on codenames, @handles, IDs, and Chinese names and orgs. The trade-off is speed: Presidio ~6s, the LLM ~25s/doc.
The point isn't a leaderboard — it's the shape: pattern-only redaction can't see names, organizations, locations, or codenames at all; a local LLM can.
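To make the table's columns concrete, here is one way such numbers could be computed. It is a sketch under assumptions, not the code in eval/run_eval.py: recall follows the definition given above (a gold identity counts once it is flagged for redaction), while the over-redaction formula used here (flagged items that match no gold identity, divided by all flagged items) is a guess at what that column measures. The score function and the sample data are hypothetical.

```python
from collections import defaultdict


def score(gold, flagged):
    """gold: list of (category, text); flagged: set of texts a detector marked.

    Returns (per-category recall, overall recall, over-redaction rate).
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for category, text in gold:
        totals[category] += 1
        if text in flagged:
            hits[category] += 1
    per_category = {c: hits[c] / totals[c] for c in totals}
    overall = sum(hits.values()) / len(gold) if gold else 0.0
    # Assumed definition: share of flagged items that are not gold identities.
    gold_texts = {text for _, text in gold}
    extra = [t for t in flagged if t not in gold_texts]
    over_redaction = len(extra) / len(flagged) if flagged else 0.0
    return per_category, overall, over_redaction


gold = [
    ("person", "Alice Lin"),
    ("org", "Starwave"),
    ("contact", "alice@starwave.example"),
]
# A pattern-only detector typically flags the email and nothing else.
per_category, overall, over_redaction = score(gold, {"alice@starwave.example"})
print(per_category, round(overall, 2), over_redaction)
```

With this toy input the pattern-only detector scores full recall on contact and zero on person and org, which is the same shape as the regex-only row.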
Install
pip install vault-engine
Or get the latest straight from source:
pip install git+https://github.com/fishonbike/vault-engine
For the default local backend, install Ollama and pull a model:
ollama pull qwen3.6:27b
No model yet? The deterministic floor (emails, phones, IDs, cards, URLs) works
with zero setup via --no-llm.
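The deterministic floor is plain pattern matching. A rough sketch of the idea follows; the patterns and the find_pii name are illustrative assumptions, deliberately simple, and are not the detectors vault-engine ships.

```python
import re

# Illustrative patterns only; production detectors need far more care.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"']+"),
    "PHONE": re.compile(r"(?<!\d)\+?\d[\d -]{6,16}\d(?!\d)"),
}


def find_pii(text):
    """Return (kind, start, end, matched text) tuples sorted by position."""
    found = []
    for kind, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((kind, m.start(), m.end(), m.group(0)))
    return sorted(found, key=lambda f: f[1])


sample = ("Mail lin@xinglan.example or call +86 138 0000 0000, "
          "see https://example.com/a")
for kind, start, end, value in find_pii(sample):
    print(kind, value)

# Names, organizations and places have no fixed shape, so nothing is found.
print(find_pii("Alice Lin met Bob Chen in Shenzhen"))
```

The second call returning an empty list is the limitation the benchmark's regex-only row reflects: structured contact data is caught, free-form identities are not.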
Quickstart
vault-engine scrub notes.txt -o notes.safe.txt
That writes notes.safe.txt (send this to the cloud) and
notes.safe.txt.map.json (local only — the identities). Paste the sanitized
text into your model, save its reply, then restore the real identities:
vault-engine rehydrate reply.json --map notes.safe.txt.map.json -o reply.real.json
The clipboard one-liner
The fastest path — scrub whatever you're about to paste into a chatbot, in place:
vault-engine clip # de-identifies the clipboard
# …paste into ChatGPT/Claude, copy its reply, then:
vault-engine clip --rehydrate # restores the real identities in the clipboard
Works on macOS, Windows, and Linux (with xclip/xsel/wl-clipboard).
Library:
from pathlib import Path
from vaultengine import deidentify, rehydrate, Config

# send_to_cloud() and get_cloud_reply() stand in for your own cloud client code.
text = Path("notes.txt").read_text(encoding="utf-8")
result = deidentify(text, Config(model="qwen3.6:27b"))
send_to_cloud(result.text)                               # tokens only
restored = rehydrate(get_cloud_reply(), result.vault)    # real identities, locally
result.vault.save("notes.map.json")                      # the reverse map; keep it local
Use cases
- Pseudonymize before pasting into ChatGPT/Claude — analyze private notes, contracts, or chats with direct identifiers stripped.
- Redact logs & support tickets before sharing them or feeding an LLM.
- Anonymize a dataset for LLM-assisted analysis, then map results back.
- Air-gapped review loops — a model on a locked-down box only ever sees tokens.
How it compares
Presidio and LLM Guard are excellent, mature tools. vault-engine's bet is different: a local LLM as the detector catches semantic/quasi-identifiers that label-based NER misses, with zero runtime deps and first-class Chinese.
| | vault-engine | Presidio | LLM Guard (Anonymize) | regex / scrubadub |
|---|---|---|---|---|
| Detection | local LLM + regex | NER (spaCy) + regex | NER / transformers | patterns only |
| Unregistered names / orgs / quasi-IDs | ✅ LLM | ⚠️ NER labels only | ⚠️ NER-limited | ❌ |
| Reversible round-trip | ✅ local map | ✅ deanonymizer | ✅ Vault | ❌ |
| Fully local / offline | ✅ Ollama | ✅ | ⚠️ varies | ✅ |
| Runtime dependencies | none (stdlib) | spaCy + models | several | varies |
| Chinese (中文) | ✅ strong | ⚠️ needs model | ⚠️ | ❌ |
| Swap the model | ✅ one line | — | partial | — |
| Fail-loud if detector errors | ✅ degrades + non-zero exit | — | — | — |
Redaction policy (privacy ↔ utility)
| --policy | Persons | Orgs / places / roles | Dates | Token shape |
|---|---|---|---|---|
| balanced (default) | ✅ | ✅ typed (ORG_1, LOC_2) | kept | typed |
| max | ✅ | ✅ opaque R_1 (type hidden) | coarsened | opaque |
| light | ✅ | left in place | kept | typed |
balanced keeps coarse structure — the cloud still reads "ORG_1 hired P-n2
as ROLE_1 in LOC_1" and can reason about it, while no real identity ships.
Persons are tokenized in every policy.
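One way to picture the three policies is as a function from entity kind to replacement token. The sketch below is illustrative only: make_token is a hypothetical name, and just the token shapes (P-n1, ORG_1, R_1) and the keep-or-replace behavior come from the table above. It assumes persons keep the P-n shape under every policy, which the table does not spell out, and it leaves out date coarsening.

```python
def make_token(policy, kind, index):
    """Return the replacement token, or None to leave the value in place.

    policy: "balanced", "max" or "light"; kind: e.g. "PERSON", "ORG", "LOC".
    The caller supplies the index; how indices are shared across kinds
    under "max" is not specified by the source.
    """
    if policy not in ("balanced", "max", "light"):
        raise ValueError(f"unknown policy: {policy!r}")
    if kind == "PERSON":
        return f"P-n{index}"      # persons are tokenized in every policy
    if policy == "light":
        return None               # orgs / places / roles stay as written
    if policy == "max":
        return f"R_{index}"       # opaque: the entity type is hidden
    return f"{kind}_{index}"      # balanced: typed, e.g. ORG_1 or LOC_2


for policy in ("balanced", "max", "light"):
    print(policy, make_token(policy, "PERSON", 1), make_token(policy, "ORG", 1))
```

Typed tokens let the cloud model reason about what kind of thing each placeholder is; opaque tokens give that up in exchange for leaking less structure.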
Swap the model
vault-engine models # list local Ollama tags
vault-engine scrub notes.txt --model qwen3.6:35b-a3b # any local model
vault-engine scrub notes.txt --provider null # offline, regex only
Built-in providers: ollama (default), openai-compat (any OpenAI-style
endpoint — opt-in; ⚠️ sends raw text to that endpoint), null (offline). Add
your own by implementing one method (complete) and registering it.
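As a rough picture of what a one-method provider could look like: the source only says that a provider implements complete and is registered, so everything else here is hypothetical, including the signature of complete, the PROVIDERS registry, the register decorator, and the EchoProvider class. Check the project's actual interface before relying on any of it.

```python
PROVIDERS = {}  # hypothetical registry: provider name -> class


def register(name):
    """Hypothetical decorator that records a provider class under a name."""
    def wrap(cls):
        PROVIDERS[name] = cls
        return cls
    return wrap


@register("echo")
class EchoProvider:
    """Toy backend that returns a canned reply instead of calling a model."""

    def __init__(self, canned="[]"):
        self.canned = canned

    def complete(self, prompt: str) -> str:
        # A real provider would send the prompt to a local model here
        # and return the model's text response.
        return self.canned


provider = PROVIDERS["echo"]()
print(provider.complete("List the identities in: ..."))
```

Keeping the contract to a single text-in, text-out method is what makes a backend swap a one-line change: the detection logic never needs to know which model answered.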
⚠️ Security model — read this
- The reverse map (*.map.json) is the identity. It's the only thing that links tokens back to real people. Keep it local. Never send it to a cloud model, never commit it: .gitignore excludes *.map.json and the CLI warns every run. Use --one-way to produce no map (irreversible publish).
- Detection stays local by default. Only the sanitized text is meant to leave, and only when you send it.
Threat model & limitations (honest)
- LLM detection is best-effort, not a guarantee of non-identifiability — a model can miss a name or a rare quasi-identifier. It is not k-anonymity or differential privacy.
- The critic pass and the risk report reduce and surface residual risk; they don't certify its absence. Writing style and domain-unique facts can still identify with names removed; use max for higher-stakes material.
- If the model backend is unreachable, the run degrades to regex-only and exits non-zero (--allow-degraded to override); it will never silently ship under-redacted text.
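The fail-loud behavior described in the last limitation can be outlined as follows. This is an illustrative sketch, not the CLI's code: run_scrub, regex_detect, llm_detect and the exit code value 2 are all assumptions. The source only states that the run degrades to regex-only and exits non-zero unless --allow-degraded is given.

```python
import sys


def run_scrub(text, regex_detect, llm_detect, allow_degraded=False):
    """Return (entities, exit_code).

    A degraded run never reports success unless the caller opted in.
    """
    entities = list(regex_detect(text))   # the deterministic floor always runs
    try:
        entities += list(llm_detect(text))
    except OSError as exc:                # ConnectionError is a subclass of OSError
        print(f"warning: model backend failed ({exc}); regex-only result",
              file=sys.stderr)
        return entities, (0 if allow_degraded else 2)
    return entities, 0


def backend_down(text):
    raise ConnectionError("backend unreachable")


entities, code = run_scrub("a@b.example", lambda t: ["a@b.example"], backend_down)
print(entities, code)
```

Returning the partial result together with a non-zero code lets a script or CI job refuse to forward the text by default, while still making the regex-only output available when degradation is explicitly allowed.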
Protecting code & schemas (--format markdown)
With --format markdown (or auto, which turns this mode on when the input contains a fenced block), anything inside fenced code blocks is preserved verbatim: a JSON reply schema or code sample you include for the model survives untouched while the prose around it is scrubbed. Pre-existing placeholder tokens (e.g. P-7) pass through unchanged.
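The fence-preserving behavior can be approximated by splitting the input on fenced blocks and scrubbing only what lies between them. A minimal sketch, assuming triple-backtick fences that start at the beginning of a line; scrub_markdown and the scrub_prose callback are hypothetical names, not vault-engine functions, and pass-through of pre-existing placeholder tokens is not modelled.

```python
import re

TICKS = "`" * 3  # fence marker, built this way to keep the example readable

# A fenced block: an opening fence line through the next closing fence line.
FENCE = re.compile(rf"^{TICKS}.*?^{TICKS}[ \t]*$", re.DOTALL | re.MULTILINE)


def scrub_markdown(text, scrub_prose):
    """Apply scrub_prose to everything outside fenced code blocks."""
    out, pos = [], 0
    for m in FENCE.finditer(text):
        out.append(scrub_prose(text[pos:m.start()]))  # prose before the fence
        out.append(m.group(0))                        # code kept verbatim
        pos = m.end()
    out.append(scrub_prose(text[pos:]))               # trailing prose
    return "".join(out)


doc = "\n".join([
    "Ask Alice for the schema.",
    TICKS + "json",
    '{"name": "Alice"}',
    TICKS,
    "Thanks, Alice.",
])
print(scrub_markdown(doc, lambda s: s.replace("Alice", "P-n1")))
```

In the printed result the two prose lines carry P-n1 while the JSON line still reads Alice, which also shows the cost of this mode: anything sensitive inside a fence is sent as-is.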
Development
python -m unittest discover -t . -s tests -v # 59 tests, offline, no model
python eval/run_eval.py --provider ollama # reproduce the benchmark
Fully offline and deterministic (null/fake providers); every fixture is synthetic — no real data lives in this repo.
License
Apache-2.0