Reversible PII anonymization framework with an encrypted vault

Carnaval

The art of the mask - hide the identity, keep the meaning.

Carnaval is an open-source Python framework for reversible PII anonymization. It masks sensitive entities in text documents before they are sent to a cloud LLM, then restores the original values in the structured response the LLM returns.

The problem

You want to use a cloud LLM (Claude, GPT, Mistral, Gemini...) to process text documents - order acknowledgements, invoices, business emails, contracts - but those documents contain personal or confidential data that must never leave your infrastructure in clear text.

The solution

RAW DOCUMENT  ──▶  [ Carnaval ]  ──▶  MASKED DOCUMENT  ──▶  Cloud LLM
                                                                  │
FINAL DOCUMENT  ◀──  [ Carnaval ]  ◀──  JSON / XML response  ◀──┘

Before sending - sensitive entities are replaced with numbered placeholders such as [PERSON_1], [EMAIL_2], [ORG_1]. The placeholder ↔ real-value mapping is stored in an encrypted local vault.
After the response - the original values are re-injected into the JSON or XML structure returned by the LLM.

Detected sensitive values never leave your machine in clear text, and the LLM still receives a coherent, structured document it can reason about.
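
The round trip described above can be sketched in a few lines. This is a toy illustration, not Carnaval's implementation: it only detects e-mail addresses with a deliberately simple regex, keeps the mapping in a plain dict instead of an encrypted vault, and the names `mask_emails` and `unmask` are hypothetical.

```python
import re

# Deliberately simple e-mail pattern; a real recognizer would be stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail with a stable [EMAIL_n] placeholder."""
    by_value: dict[str, str] = {}  # real value -> placeholder

    def substitute(match: re.Match[str]) -> str:
        value = match.group(0)
        if value not in by_value:
            by_value[value] = f"[EMAIL_{len(by_value) + 1}]"
        return by_value[value]

    masked = EMAIL_RE.sub(substitute, text)
    # Invert the mapping so it can be used for restoration.
    mapping = {placeholder: value for value, placeholder in by_value.items()}
    return masked, mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back in place of their placeholders."""
    for placeholder, value in mapping.items():
        text = text.replace(placeholder, value)
    return text


raw = "Write to jane.doe@example.com or bob@example.org; jane.doe@example.com replies fastest."
masked, mapping = mask_emails(raw)
print(masked)
print(unmask(masked, mapping) == raw)
```

Because the same address always maps to the same placeholder, a reader of the masked text (human or LLM) can still tell that the first and third mentions refer to the same contact.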

Key features

Reversible - every distinct masked value maps to its own placeholder; the mapping lives in an AES-256-GCM encrypted vault.
Coherent - the same value always receives the same placeholder within a run, so the LLM can reason about cross-references.
Local-first - no network calls to anonymize. The optional neural model runs on your own machine.
Entity types - PERSON, ORGANIZATION, LOCATION, EMAIL, PHONE, IBAN, BIC, VAT, SIREN/SIRET, URL.
Layered detection - regex recognizers, deny lists, bundled dictionaries (GeoNames cities, first names), and an optional zero-shot neural recognizer (GLiNER).
Multilingual - 6 languages: French, English, German, Spanish, Italian, Portuguese.
Business profiles - acknowledge, invoice, email, plus private per-client profiles kept out of version control.
8 output formats - TXT, JSON, JSONL, XML, CoNLL, HTML, encrypted vault, audit metadata - all produced in a single pass.
CLI and library - use the anonymize.py / reinject.py scripts, or import carnaval directly into your Python code.
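
As an illustration of what a regex recognizer backed by validation can look like, the sketch below finds IBAN candidates and keeps only those that pass the standard mod-97 check. It is an assumption about the general technique, not Carnaval's recognizer; `iban_is_valid` and `find_ibans` are hypothetical names.

```python
import re

# Candidate pattern: country code, 2 check digits, then 11 to 30 alphanumerics,
# optionally grouped with single spaces (15 to 34 characters in total).
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def iban_is_valid(candidate: str) -> bool:
    """Mod-97 check: move the first 4 chars to the end, map letters to 10..35."""
    compact = candidate.replace(" ", "").upper()
    if not (15 <= len(compact) <= 34):
        return False
    if not (compact.isascii() and compact.isalnum()):
        return False
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text: str) -> list[str]:
    """Return the regex candidates that also pass the checksum."""
    return [m.group(0) for m in IBAN_RE.finditer(text) if iban_is_valid(m.group(0))]


text = "Pay to GB82 WEST 1234 5698 7654 32, not to GB82 WEST 1234 5698 7654 33."
print(find_ibans(text))
```

The checksum filters out look-alike strings that a pattern alone would accept. Note that this greedy pattern can over-capture when an IBAN is immediately followed by upper-case text; a production recognizer would need per-country lengths to avoid that.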

Pipeline

Carnaval is built as 7 self-contained stages, each with a clear input → output contract:

TXT ──▶ S1 Intake ──▶ S2 Preprocess ──▶ S3 Detect ──▶ S4 Resolve ──▶ S5 Mask ──▶ S6 Output
        (read)        (language,         (recognizers)  (dedup,        (placeholders  (8 formats)
                       normalize)                        arbitration)   + vault)

JSON / XML ──▶ S7 Reinject ──▶ JSON / XML with original values restored

See Architecture for details on each stage.
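
The dedup and arbitration performed in S4 is not detailed here. One common approach, shown as a sketch under that assumption, is to drop exact duplicates, rank the remaining candidate spans by score and length, and greedily keep those that do not overlap an already accepted span. The `Span` type and `resolve_overlaps` function are hypothetical and are not part of Carnaval's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    score: float


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Drop exact duplicates, then keep the best non-overlapping spans."""
    unique = set(spans)
    # Prefer higher score, then longer span, then earlier position.
    ranked = sorted(
        unique,
        key=lambda s: (-s.score, -(s.end - s.start), s.start, s.label),
    )
    kept: list[Span] = []
    for candidate in ranked:
        # Two half-open intervals are disjoint if one ends before the other starts.
        if all(candidate.end <= k.start or candidate.start >= k.end for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


candidates = [
    Span(0, 8, "PERSON", 0.9),
    Span(0, 8, "PERSON", 0.9),          # duplicate from a second recognizer
    Span(5, 20, "ORGANIZATION", 0.6),   # overlaps the PERSON span
    Span(25, 45, "EMAIL", 1.0),
]
resolved = resolve_overlaps(candidates)
print([(s.label, s.start, s.end) for s in resolved])
```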

Installation

Requires Python 3.11+ (tested on 3.13).

git clone <repository-url>
cd carnaval

python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Linux / macOS
source .venv/bin/activate

pip install -r requirements.txt

The neural recognizer (GLiNER) is included in requirements.txt. The model (~500 MB) is downloaded automatically on first use; afterwards Carnaval works fully offline. See the Installation guide for an offline / air-gapped setup.

Configure the vault password

cp .env.example .env

Then edit .env and set a strong secret (16 characters minimum, 32+ recommended):

CARNAVAL_VAULT_PASSWORD=a-strong-randomly-generated-secret
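
One way to produce such a secret is Python's standard `secrets` module. The helper below is a minimal sketch; `generate_vault_password` is a hypothetical name, not a Carnaval function.

```python
import secrets

MIN_LENGTH = 16  # minimum length stated above


def generate_vault_password(n_bytes: int = 32) -> str:
    """Return a URL-safe random secret built from n_bytes of entropy."""
    return secrets.token_urlsafe(n_bytes)


password = generate_vault_password()
assert len(password) >= MIN_LENGTH
# Paste the printed line into .env; never commit that file.
print(f"CARNAVAL_VAULT_PASSWORD={password}")
```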

Quickstart - CLI

# 1. Anonymize a document
python anonymize.py inbox/order.txt --profile acknowledge

# 2. Send outbox/txt/order_anonymise.txt to your LLM, collect a JSON response

# 3. Re-inject the real values into the LLM response
python reinject.py response.json --vault outbox/vault/order_vault.enc

anonymize.py produces, in one pass, all 8 output files under outbox/ (txt/, json/, jsonl/, xml/, conll/, html/, vault/, meta/).

Useful flags: --no-gliner (regex + deny lists only, faster), --gliner-threshold 0.6, --profile invoice, --private my_client, --console (human-readable logs).
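
To process a whole folder, the CLI can be driven from a short script. The sketch below assumes that `anonymize.py` takes one document per invocation, as in the example above, and uses only the `--profile` and `--no-gliner` flags listed there; `build_command` and `anonymize_all` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path


def build_command(document: Path, profile: str = "acknowledge",
                  use_gliner: bool = True) -> list[str]:
    """Build the argument list for one anonymize.py invocation."""
    command = [sys.executable, "anonymize.py", str(document), "--profile", profile]
    if not use_gliner:
        command.append("--no-gliner")
    return command


def anonymize_all(inbox: Path, profile: str = "acknowledge") -> None:
    """Run anonymize.py on every .txt file in inbox, stopping on the first failure."""
    for document in sorted(inbox.glob("*.txt")):
        subprocess.run(build_command(document, profile), check=True)


if __name__ == "__main__":
    anonymize_all(Path("inbox"))
```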

Quickstart - Python API

from pathlib import Path
from carnaval.pipeline import run_anonymization

masked, written, config = run_anonymization(
    input_path=Path("inbox/order.txt"),
    outbox_dir=Path("outbox"),
    vault_password="a-strong-randomly-generated-secret",
    profile="acknowledge",
    use_gliner=True,
)

print(masked.anonymized_text)      # text with placeholders
print(masked.by_category)          # {'PERSON': 2, 'ORGANIZATION': 1, ...}
print(written.json_path)           # path to the JSON output

Re-injecting an LLM response:

from carnaval.core.vault import Vault
from carnaval.stages.s7_reinject import reinject_json_data

vault = Vault(password="a-strong-randomly-generated-secret",
              path="outbox/vault/order_vault.enc")
vault.load()

llm_response = {"supplier": "[ORG_1]", "contact": "[PERSON_1]"}
restored = reinject_json_data(llm_response, vault)
# {"supplier": "Globex Inc.", "contact": "Jane Doe"}

See the Quickstart and Reinjection wiki pages for more.
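
Conceptually, re-injection walks the parsed response and substitutes placeholders wherever they appear in string values. The sketch below illustrates that idea with a plain dict in place of the vault; it is not the code behind `reinject_json_data`, and the `reinject` function and `PLACEHOLDER_RE` pattern are assumptions made for the example.

```python
import re
from typing import Any

# Assumed placeholder shape: [LABEL_n], e.g. [PERSON_1].
PLACEHOLDER_RE = re.compile(r"\[[A-Z]+_\d+\]")


def reinject(data: Any, mapping: dict[str, str]) -> Any:
    """Return a copy of data with known placeholders replaced in every string value."""
    if isinstance(data, str):
        # Placeholders missing from the mapping are left untouched.
        return PLACEHOLDER_RE.sub(lambda m: mapping.get(m.group(0), m.group(0)), data)
    if isinstance(data, list):
        return [reinject(item, mapping) for item in data]
    if isinstance(data, dict):
        # Only values are rewritten; keys are kept as returned by the LLM.
        return {key: reinject(value, mapping) for key, value in data.items()}
    return data  # numbers, booleans, None


mapping = {"[ORG_1]": "Globex Inc.", "[PERSON_1]": "Jane Doe"}
response = {
    "supplier": "[ORG_1]",
    "lines": [{"note": "Ask [PERSON_1] or [PERSON_2]", "qty": 3}],
}
restored = reinject(response, mapping)
print(restored)
```

Leaving unknown placeholders in place makes it easy to spot values the LLM invented or altered.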

Security

The placeholder ↔ value mapping is stored in an encrypted vault:

Property	Value
Symmetric cipher	AES-256-GCM (authenticated encryption)
Key derivation	PBKDF2-HMAC-SHA256, 600,000 iterations
Salt	16 random bytes per file
Nonce	16 random bytes per file
Integrity tag	16 bytes - any tampering is detected on read

Without the password, the vault is unreadable. Carnaval makes no outbound network calls once the GLiNER model has been downloaded, and its structured logger redacts sensitive keys by default. It supports pseudonymization as defined in GDPR Article 4(5). See Vault and Security.
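
The key-derivation parameters in the table can be reproduced with the standard library alone. The sketch below shows only that step, under the assumption that the derived 32 bytes are used directly as the AES-256 key; the AES-GCM encryption itself needs a third-party library and is left out. `derive_key` is a hypothetical name, not Carnaval's API.

```python
import hashlib
import os

ITERATIONS = 600_000   # from the table above
SALT_BYTES = 16        # from the table above
KEY_BYTES = 32         # 256-bit key for AES-256


def derive_key(password: str, salt: bytes) -> bytes:
    """Derive a 256-bit key from the password with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS, dklen=KEY_BYTES
    )


salt = os.urandom(SALT_BYTES)  # fresh random salt, stored alongside the ciphertext
key = derive_key("a-strong-randomly-generated-secret", salt)
print(len(key))
```

The salt is not secret: it only has to be unique per file so that the same password yields a different key each time.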

Supported languages

French (FR), English (EN), German (DE), Spanish (ES), Italian (IT) and Portuguese (PT). The language is auto-detected; mixed-language documents are handled via in-text linguistic markers. See Multilingual.

Project status

Carnaval is a functional proof of concept. Core anonymization, re-injection, the encrypted vault and the 8 output formats are implemented and covered by an extensive automated test suite.

Testing

pytest                       # full suite (skips slow neural tests)
pytest -m slow               # real GLiNER tests (downloads the model)
pytest --cov=src/carnaval    # with coverage

Documentation

The complete reference lives in the project wiki.

The original design notes are kept under docs/.

Contributing

Contributions are welcome - see CONTRIBUTING.md and our Code of Conduct. Please use only fictitious entities (Acme Corp, Globex, Jane Doe, Springfield...) in public fixtures and examples.

Contact & Security

General questions, conduct reports: carnaval.oss@gmail.com
Bug reports and feature requests: GitHub issues
Security vulnerabilities: please do not open a public issue - see SECURITY.md for responsible disclosure.

Citation

If you use Carnaval in your work, please cite it via its archived DOI:

Patrice AUBERT. Carnaval: a reversible PII anonymization framework. 2026. DOI: 10.5281/zenodo.20219604

A machine-readable CITATION.cff is included - GitHub turns it into a "Cite this repository" button.

License

Carnaval is released under the Apache License 2.0. See LICENSE.

Release history

0.1.2 - May 16, 2026
0.1.1 - May 16, 2026
0.1.0 (this version, yanked) - May 16, 2026
Reason this release was yanked: broken packaging - pip install non-functional, use 0.1.1

Download files

Source distribution: carnaval-0.1.0.tar.gz (607.3 kB), uploaded May 16, 2026, Source
Built distribution: carnaval-0.1.0-py3-none-any.whl (86.2 kB), uploaded May 16, 2026, Python 3

File details

Details for the file carnaval-0.1.0.tar.gz.

File metadata

Download URL: carnaval-0.1.0.tar.gz
Upload date: May 16, 2026
Size: 607.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for carnaval-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`e966d3f2e50e16927cf7ac628b8bdc7fcb8da03cac2ad6a6ba4181a688bea613`
MD5	`a396b97ccfaf75ef9b09d51fd9f1052f`
BLAKE2b-256	`cf0138899ff904d6cf27fe5d7245eaa60a5c0bafa9210beeaef72147db4b8ff1`

File details

Details for the file carnaval-0.1.0-py3-none-any.whl.

File metadata

Download URL: carnaval-0.1.0-py3-none-any.whl
Upload date: May 16, 2026
Size: 86.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.13.6

File hashes

Hashes for carnaval-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b1cf1228afdf808a6c2a9bee24a86348ee2178f35858aee84e24232f98dddb87`
MD5	`2ca86d81c13b04c05bf9029421902474`
BLAKE2b-256	`98d56f5016e1328fa24ceb46820f606293aee2fa90c947f5bc4444c8daa0bda8`
