Identity de-identification for cloud-LLM hand-off: local detection, consistent pseudonyms, a reversible local map, and a swappable model backend.

Maintainers

fishonbike

vault-engine

A local-LLM privacy layer for anything you paste into a cloud model.

Strip the identities out of your text before it reaches ChatGPT / Claude / Gemini — a model running on your own machine finds the names, orgs, places and quasi-identifiers, replaces them with stable tokens, and keeps the only key-back-to-reality on disk. When the cloud answers in tokens, you put the real identities back locally.

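De-identify before anything reaches the cloud: a local model detects identities → they are replaced with tokens → the cloud analyzes the tokenized text → the real identities are restored locally. Detection never leaves your machine, the identity map is stored only locally, and the model can be swapped in one line. Zero dependencies.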

vault-engine scrubs English and Chinese names and PII into tokens before the cloud sees them, then restores them locally

Python ≥3.9 · stdlib-only · Apache-2.0

# notes.txt  ── private, on your machine
林若曦是星澜资本的合伙人，在深圳见了字节跳动的陈大壮，邮箱 lin@xinglan.example

        ▼  vault-engine scrub  (local qwen3.6:27b)

# safe.txt  ── what the cloud sees: identities swapped for tokens
P-n1 是 ORG_1 的合伙人，在 LOC_1 见了 ORG_2 的 P-n2，邮箱 EMAIL_1

Why

You want a frontier cloud model to analyze sensitive notes — but you don't want the cloud to learn who they're about. Masking only the names you already know leaks everything you don't: an unregistered name, an employer, a city + a rare title, a project codename. Pattern-based redaction never sees those at all.

vault-engine puts a local model in front as the detector, so the semantic identifiers get caught too — and nothing but the sanitized text ever leaves.

How it works

 private text                                    cloud model
      │                                          (sees only tokens)
      ▼                                                ▲
┌─────────────────────────── vault-engine ────────────┼───────────┐
│  ① regex PII detectors  (offline floor)              │           │
│  ② LLM detector         (local model finds names,    │           │
│                          orgs, places, quasi-IDs)    │           │
│  ③ consistent pseudonyms (张三→P-n1, 同名同号)        │           │
│  ④ residual-risk critic  (re-scan: anything left?)   │  ① send   │
│        │                                             │           │
│   sanitized text ────────────────────────────────────┘           │
│        ▲                                                          │
│   reverse map (token → real identity) ── stays LOCAL ──┐  ② reply │
│        └───────────────────── ⑤ rehydrate ◀────────────┘          │
└──────────────────────────────────────────────────────────────────┘
      ▼
 real identities restored locally → use in your own system
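
Steps ③ and ⑤ of the diagram can be illustrated with a minimal sketch. This is not vault-engine's implementation: `PseudonymMap`, `scrub_entities`, `restore`, and the `PREFIX` table are hypothetical names, and the detector output is assumed to be a list of `(kind, surface)` pairs. It only shows why one stable token per identity makes the round-trip reversible.

```python
import re
from collections import defaultdict

# Hypothetical prefixes; the shapes mirror the tokens shown in the example above.
PREFIX = {"person": "P-n", "org": "ORG_", "location": "LOC_", "email": "EMAIL_"}


class PseudonymMap:
    """Assigns one stable token per (kind, surface form) and keeps the reverse map."""

    def __init__(self):
        self.forward = {}                 # (kind, real) -> token
        self.reverse = {}                 # token -> real; this part must stay local
        self.counters = defaultdict(int)  # running index per kind

    def token_for(self, kind, real):
        key = (kind, real)
        if key not in self.forward:
            self.counters[kind] += 1
            token = f"{PREFIX[kind]}{self.counters[kind]}"
            self.forward[key] = token
            self.reverse[token] = real
        return self.forward[key]


def scrub_entities(text, entities, pmap):
    """Replace every detected (kind, surface) pair with its token."""
    for kind, real in entities:           # number tokens in detection order
        pmap.token_for(kind, real)
    # Replace longer surfaces first so a short name cannot clip a longer one.
    for kind, real in sorted(entities, key=lambda e: len(e[1]), reverse=True):
        text = text.replace(real, pmap.token_for(kind, real))
    return text


def restore(text, pmap):
    """Swap tokens back to real identities using the local reverse map."""
    if not pmap.reverse:
        return text
    # Longest token first, so P-n1 is never matched inside P-n10.
    tokens = sorted(pmap.reverse, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in tokens))
    return pattern.sub(lambda m: pmap.reverse[m.group(0)], text)
```

Because the same `(kind, surface)` pair always yields the same token, a cloud reply that mentions `P-n1` twice restores to the same person both times.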

Benchmark

How much identity each detector actually catches on a labelled bilingual dataset (methodology in eval/):

77 gold identities across 15 bilingual documents: easy PII plus hard cases (ambiguous common-word names, abbreviations, transliterations, @handles, a badge number, a license plate). Reproduce with python eval/run_eval.py --provider ollama --with-presidio.

⚠️ A small synthetic set for regression testing and rough comparison — not evidence of legal anonymization or complete privacy. "Recall" means flagged-for-redaction; LLM detection is non-deterministic. See the threat model.

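| detector | person | org | location | project | contact | id | overall | over-redaction |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| regex only | 0% | 0% | 0% | 0% | 69% | 33% | 13% | 0% |
| Microsoft Presidio (en/zh `lg`) | 78% | 59% | 80% | 33% | 38% | 0% | 61% | 4% |
| vault-engine (qwen3.6:27b) | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 0% |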

On the same set, Presidio's NER scores 61% overall while the local LLM reaches 100%; the gap is widest on codenames, @handles, IDs, and Chinese names/orgs. The trade-off is speed: Presidio ~6s, the LLM ~25s/doc.

The point isn't a leaderboard — it's the shape: pattern-only redaction can't see names, organizations, locations, or codenames at all; a local LLM can.
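
For readers who want to see what a recall column means in practice, here is a minimal sketch of the arithmetic. It assumes exact string matching between gold and flagged identities, which may differ from how eval/run_eval.py scores spans; the function names and the toy data are hypothetical and are not taken from the benchmark set. Over-redaction is left out because the README does not define how it is computed.

```python
def recall(gold, flagged):
    """Share of gold identities that the detector flagged for redaction."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(flagged)) / len(gold)


def per_category_recall(gold_by_category, flagged):
    """Recall for each category, plus an 'overall' figure across all of them."""
    flagged = set(flagged)
    scores = {cat: recall(items, flagged) for cat, items in gold_by_category.items()}
    everything = [item for items in gold_by_category.values() for item in items]
    scores["overall"] = recall(everything, flagged)
    return scores


# Toy data: a pattern-only detector finds the email but neither name.
toy_gold = {"person": ["Alice Wong", "Bob"], "contact": ["alice@corp.example"]}
toy_flagged = ["alice@corp.example"]
toy_scores = per_category_recall(toy_gold, toy_flagged)
```

In this toy case the person recall is 0, the contact recall is 1, and the overall figure is one in three, the same shape as the regex-only row above.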

Install

pip install vault-engine

Or get the latest straight from source:

pip install git+https://github.com/fishonbike/vault-engine

For the default local backend, install Ollama and pull a model:

ollama pull qwen3.6:27b

No model yet? The deterministic floor (emails, phones, IDs, cards, URLs) works with zero setup via --no-llm.
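
To give a feel for what a deterministic floor can and cannot catch, here is a small sketch with three illustrative patterns. These are not vault-engine's own patterns: `PATTERNS` and `find_pii` are hypothetical names, and real-world PII formats need much broader coverage than this.

```python
import re

# Illustrative patterns only; assumptions for this sketch, not the package's detectors.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "URL": re.compile(r"https?://[^\s<>\"']+"),
    "PHONE": re.compile(r"(?<!\d)\+?\d[\d\s().-]{7,}\d(?!\d)"),
}


def find_pii(text):
    """Return (kind, start, end, surface) spans, sorted by position."""
    spans = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((kind, match.start(), match.end(), match.group(0)))
    return sorted(spans, key=lambda span: span[1])


sample = "Mail lin@xinglan.example or call +1 415 555 0100; Lin Ruoxi will answer."
found = find_pii(sample)
```

The email and the phone number are found, but the name "Lin Ruoxi" is not, which is the gap the LLM detector is meant to close.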

Quickstart

vault-engine scrub notes.txt -o notes.safe.txt

That writes notes.safe.txt (send this to the cloud) and notes.safe.txt.map.json (local only — the identities). Paste the sanitized text into your model, save its reply, then restore the real identities:

vault-engine rehydrate reply.json --map notes.safe.txt.map.json -o reply.real.json

The clipboard one-liner

The fastest path — scrub whatever you're about to paste into a chatbot, in place:

vault-engine clip               # de-identifies the clipboard
#   …paste into ChatGPT/Claude, copy its reply, then:
vault-engine clip --rehydrate   # restores the real identities in the clipboard

Works on macOS, Windows, and Linux (with xclip/xsel/wl-clipboard).

Library:

from pathlib import Path

from vaultengine import deidentify, rehydrate, Config

# send_to_cloud() and get_cloud_reply() stand in for your own cloud client.
text = Path("notes.txt").read_text(encoding="utf-8")
result = deidentify(text, Config(model="qwen3.6:27b"))
send_to_cloud(result.text)                  # tokens only
restored = rehydrate(get_cloud_reply(), result.vault)   # real identities, locally
result.vault.save("notes.map.json")         # the reverse map: keep it local

Use cases

- Pseudonymize before pasting into ChatGPT/Claude: analyze private notes, contracts, or chats with direct identifiers stripped.
- Redact logs & support tickets before sharing them or feeding an LLM.
- Anonymize a dataset for LLM-assisted analysis, then map results back.
- Air-gapped review loops: a model on a locked-down box only ever sees tokens.

How it compares

Presidio and LLM Guard are excellent, mature tools. vault-engine's bet is different: a local LLM as the detector catches semantic/quasi-identifiers that label-based NER misses, with zero runtime deps and first-class Chinese.

|  | vault-engine | Presidio | LLM Guard (Anonymize) | regex / scrubadub |
| --- | --- | --- | --- | --- |
| Detection | local LLM + regex | NER (spaCy) + regex | NER / transformers | patterns only |
| Unregistered names / orgs / quasi-IDs | ✅ LLM | ⚠️ NER labels only | ⚠️ NER-limited | ❌ |
| Reversible round-trip | ✅ local map | ✅ deanonymizer | ✅ Vault | ❌ |
| Fully local / offline | ✅ Ollama | ✅ | ⚠️ varies | ✅ |
| Runtime dependencies | none (stdlib) | spaCy + models | several | varies |
| Chinese (中文) | ✅ strong | ⚠️ needs model | ⚠️ | ❌ |
| Swap the model | ✅ one line | — | partial | — |
| Fail-loud if detector errors | ✅ degrades + non-zero exit | — | — | — |

Redaction policy (privacy ↔ utility)

| `--policy` | Persons | Orgs / places / roles | Dates | Token shape |
| --- | --- | --- | --- | --- |
| `balanced` (default) | ✅ | ✅ typed (`ORG_1`, `LOC_2`) | kept | typed |
| `max` | ✅ | ✅ opaque `R_1` (type hidden) | coarsened | opaque |
| `light` | ✅ | left in place | kept | typed |

balanced keeps coarse structure — the cloud still reads "ORG_1 hired P-n2 as ROLE_1 in LOC_1" and can reason about it, while no real identity ships. Persons are tokenized in every policy.
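
The token shapes in the table can be sketched as a single function. This is an illustration under stated assumptions, not the package's code: `token_for` is a hypothetical name, and the table does not say what person tokens look like under `max`, so the sketch keeps the `P-n` shape for persons in every policy. Date coarsening is not covered.

```python
def token_for(kind, index, policy="balanced"):
    """Return the replacement token, or None to leave the entity in place.

    kind is a short label such as "person", "org", "loc", or "role".
    Under "max" the caller should share one index sequence across kinds,
    otherwise two different entities could both become R_1.
    """
    if kind == "person":
        return f"P-n{index}"               # persons are tokenized in every policy
    if policy == "light":
        return None                        # orgs / places / roles left in place
    if policy == "max":
        return f"R_{index}"                # opaque: the entity type is hidden
    if policy == "balanced":
        return f"{kind.upper()}_{index}"   # typed: ORG_1, LOC_2, ROLE_1
    raise ValueError(f"unknown policy: {policy!r}")
```

Typed tokens let the cloud model keep reasoning about who did what to which kind of entity; opaque tokens give that up in exchange for leaking less structure.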

Swap the model

vault-engine models                                   # list local Ollama tags
vault-engine scrub notes.txt --model qwen3.6:35b-a3b  # any local model
vault-engine scrub notes.txt --provider null          # offline, regex only

Built-in providers: ollama (default), openai-compat (any OpenAI-style endpoint — opt-in; ⚠️ sends raw text to that endpoint), null (offline). Add your own by implementing one method (complete) and registering it.
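
The README names the method (`complete`) but not its signature or the registration call, so the following is only a sketch of the general pattern. Everything here is an assumption for illustration: the `Provider` protocol, the `complete(prompt) -> str` signature, `register_provider`, `get_provider`, and `CannedProvider` are hypothetical and are not vault-engine's API. Check the package source for the real interface.

```python
from typing import Callable, Dict, Protocol


class Provider(Protocol):
    """Anything with a complete() method; this signature is an assumption."""

    def complete(self, prompt: str) -> str:
        ...


# Hypothetical registry for this sketch, not vault-engine's own.
_PROVIDERS: Dict[str, Callable[[], Provider]] = {}


def register_provider(name: str, factory: Callable[[], Provider]) -> None:
    """Make a backend selectable by name."""
    _PROVIDERS[name] = factory


def get_provider(name: str) -> Provider:
    """Instantiate a registered backend, failing loudly on unknown names."""
    if name not in _PROVIDERS:
        raise ValueError(f"unknown provider: {name!r}")
    return _PROVIDERS[name]()


class CannedProvider:
    """Toy backend that always reports no entities; useful for offline tests."""

    def complete(self, prompt: str) -> str:
        return "[]"


register_provider("canned", CannedProvider)
```

Keeping the surface to a single text-in, text-out method is what makes a backend swap a one-line change for the caller.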

⚠️ Security model — read this

- The reverse map (*.map.json) is the identity. It is the only thing that links tokens back to real people. Keep it local. Never send it to a cloud model and never commit it: .gitignore excludes *.map.json and the CLI warns on every run. Use --one-way to produce no map (irreversible publish).
- Detection stays local by default. Only the sanitized text is meant to leave, and only when you send it.

Threat model & limitations (honest)

- LLM detection is best-effort, not a guarantee of non-identifiability: a model can miss a name or a rare quasi-identifier. It is not k-anonymity or differential privacy.
- The critic pass and the risk report reduce and surface residual risk; they don't certify its absence. Writing style and domain-unique facts can still identify someone with names removed, so use max for higher-stakes material.
- If the model backend is unreachable, the run degrades to regex-only and exits non-zero (--allow-degraded to override), so it will never silently ship under-redacted text.
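
The fail-loud behavior in the last point can be sketched as follows. This is a hypothetical illustration, not the CLI's code: `detect_with_fallback` and its two detector callables are made-up names, and treating `ConnectionError` as the signal for an unreachable backend is an assumption.

```python
import sys


def detect_with_fallback(text, llm_detect, regex_detect, allow_degraded=False):
    """Return (entities, exit_code).

    The regex floor always runs. If the model backend cannot be reached,
    the result is regex-only and the exit code is non-zero unless the
    caller explicitly allowed a degraded run.
    """
    floor = regex_detect(text)
    try:
        return floor + llm_detect(text), 0
    except ConnectionError as exc:
        print(f"warning: model backend unreachable ({exc}); regex-only run",
              file=sys.stderr)
        return floor, (0 if allow_degraded else 1)
```

A wrapper script can then call `sys.exit(code)` so that pipelines stop instead of forwarding text that only the regex floor has seen.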

Protecting code & schemas (`--format markdown`)

With --format markdown (or auto, which turns this mode on when the input contains a fenced block), anything inside fenced code blocks is preserved verbatim: a JSON reply-schema or code sample you include for the model survives untouched while the prose around it is scrubbed. Pre-existing placeholder tokens (e.g. P-7) pass through unchanged.
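
The idea of scrubbing prose while leaving fenced blocks alone can be sketched in a few lines. This is a simplified illustration, not the package's parser: `scrub_markdown` and `scrub_prose` are hypothetical names, and the sketch only recognizes triple-backtick fences (no tildes, no longer or nested fences).

```python
FENCE = "`" * 3  # built this way so the sketch can sit inside a fenced block itself


def scrub_markdown(text, scrub_prose):
    """Apply scrub_prose to text outside fenced blocks; keep fenced blocks verbatim."""
    out, prose, in_fence = [], [], False

    def flush():
        # Scrub the prose collected so far as one chunk.
        if prose:
            out.append(scrub_prose("".join(prose)))
            prose.clear()

    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith(FENCE):
            if not in_fence:
                flush()              # prose before the opening fence gets scrubbed
            out.append(line)         # fence lines themselves are never touched
            in_fence = not in_fence
        elif in_fence:
            out.append(line)         # inside a fence: preserved as-is
        else:
            prose.append(line)
    flush()
    return "".join(out)
```

For example, with a prose scrubber that replaces "Alice" by "P-n1", a JSON schema inside a fence keeps its literal "Alice" while the sentences around it are tokenized.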

Development

python -m unittest discover -t . -s tests -v   # 59 tests, offline, no model
python eval/run_eval.py --provider ollama       # reproduce the benchmark

Fully offline and deterministic (null/fake providers); every fixture is synthetic — no real data lives in this repo.

License

Apache-2.0

Project details

Release history

This version

0.1.0

Jun 26, 2026

Download files

Source Distribution

vault_engine-0.1.0.tar.gz (44.0 kB)

Uploaded Jun 26, 2026 Source

Built Distribution

vault_engine-0.1.0-py3-none-any.whl (40.1 kB)

Uploaded Jun 26, 2026 Python 3

File details

Details for the file vault_engine-0.1.0.tar.gz.

File metadata

Download URL: vault_engine-0.1.0.tar.gz
Upload date: Jun 26, 2026
Size: 44.0 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vault_engine-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`51f1939717c8c5667523c9a57d5ac0aada27a63bbb3ffe3d0701f76f275e094d`
MD5	`18c272f632f0d0f284cc8c2f51584fad`
BLAKE2b-256	`fc9d46f5e095e2e501f757a556ea19baf68d98fc1e21e09a3e6a58018ed4d9f9`

Provenance

The following attestation bundles were made for vault_engine-0.1.0.tar.gz:

Publisher: publish.yml on fishonbike/vault-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vault_engine-0.1.0.tar.gz
- Subject digest: 51f1939717c8c5667523c9a57d5ac0aada27a63bbb3ffe3d0701f76f275e094d
- Sigstore transparency entry: 1963829092
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: fishonbike/vault-engine@36955b895102e68411474afe23e228e87ca3681c
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/fishonbike
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@36955b895102e68411474afe23e228e87ca3681c
- Trigger Event: release

File details

Details for the file vault_engine-0.1.0-py3-none-any.whl.

File metadata

Download URL: vault_engine-0.1.0-py3-none-any.whl
Upload date: Jun 26, 2026
Size: 40.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for vault_engine-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8eeea103f1316c03b2002ef3418b972f33a771ac0752846f67589fb824c99998`
MD5	`a1ae2929689fcdc364f43c6fd981db06`
BLAKE2b-256	`a3dc4be8b7ae8a913f085ca17f797f57b60926dd788c23d5282e17c059ef3325`

Provenance

The following attestation bundles were made for vault_engine-0.1.0-py3-none-any.whl:

Publisher: publish.yml on fishonbike/vault-engine

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: vault_engine-0.1.0-py3-none-any.whl
- Subject digest: 8eeea103f1316c03b2002ef3418b972f33a771ac0752846f67589fb824c99998
- Sigstore transparency entry: 1963829358
- Sigstore integration time: Jun 26, 2026
Source repository:
- Permalink: fishonbike/vault-engine@36955b895102e68411474afe23e228e87ca3681c
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/fishonbike
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@36955b895102e68411474afe23e228e87ca3681c
- Trigger Event: release
