noirdoc

German-first PII redaction and pseudonymization for documents. Local by default. Reversible when you need it.

Noirdoc redacts names, addresses, phone numbers, IBANs, Steuer-IDs, SVNRs, and the rest — from PDFs, DOCX, XLSX, and plain text — without sending anything to a third party. Under the hood it's a rules-based Presidio pipeline by default, and an ensemble (Presidio + GLiNER + Flair) when the [full] extra is installed. It's built for real-world German documents and mixed DE/EN text — the kind of stuff Mittelstand actually runs through an LLM.

Status: alpha (0.1.x). API will change before 1.0. Pin the minor version.

Prerequisites

  • Python 3.12 or 3.13
  • ~1 GB free disk if you install the [full] extra (spaCy + Flair + GLiNER weights)
  • Optional: a Redis instance if you want shared mapping storage across workers ([redis] extra)

Install

# Baseline — Presidio + all file extractors + reversible mapper.
pip install noirdoc

# Full ensemble (adds GLiNER + Flair, large ML weights). Recommended for real work.
pip install "noirdoc[full]"
noirdoc models pull

# Optional distributed mapper backend.
pip install "noirdoc[redis]"

For anything beyond toy examples, use noirdoc[full] — the ensemble catches what the baseline misses, especially on German lowercase text.

Quickstart

# One-shot redact (ephemeral mapping, discarded on exit).
noirdoc redact vertrag.pdf -o vertrag-clean.pdf

# Persistent namespace — placeholders stay consistent across files and sessions.
noirdoc redact --namespace mandant-mueller brief.docx -o brief-clean.docx
noirdoc reveal --namespace mandant-mueller brief-clean.docx -o brief-revealed.docx
noirdoc lookup --namespace mandant-mueller "<<PERSON_3>>"

The same workflow from Python:

from noirdoc import Redactor

r = Redactor(namespace="mandant-mueller")
r.redact_file("vertrag.pdf", output="vertrag-clean.pdf")
r.redact_file("brief.docx", output="brief-clean.docx")

# llm_response is the model's reply (a str) that still contains placeholders.
r.reveal_text(llm_response)  # un-redact the model's reply

Input:

Anna Müller, geboren am 12.03.1981 in München, erreichbar unter 0171-2345678, Steuer-ID 12 345 678 901, IBAN DE89 3704 0044 0532 0130 00.

Output:

<<PERSON_1>>, geboren am <<DATE_TIME_1>> in <<LOCATION_1>>, erreichbar unter <<PHONE_NUMBER_1>>, Steuer-ID <<DE_STEUER_ID_1>>, IBAN <<IBAN_CODE_1>>.
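The idea behind consistent, reversible placeholders can be illustrated with a toy mapper. This is a sketch under stated assumptions, not noirdoc's implementation: the class PlaceholderMapper and its methods are hypothetical, and the entity spans are supplied by hand instead of coming from a detector.

```python
import re
from collections import defaultdict


class PlaceholderMapper:
    """Toy reversible mapper: the same (type, value) pair always gets the same token."""

    def __init__(self) -> None:
        self._counters: dict[str, int] = defaultdict(int)
        self._forward: dict[tuple[str, str], str] = {}
        self._reverse: dict[str, str] = {}

    def placeholder_for(self, entity_type: str, value: str) -> str:
        key = (entity_type, value)
        if key not in self._forward:
            self._counters[entity_type] += 1
            token = f"<<{entity_type}_{self._counters[entity_type]}>>"
            self._forward[key] = token
            self._reverse[token] = value
        return self._forward[key]

    def redact(self, text: str, spans: list[tuple[int, int, str]]) -> str:
        """Replace (start, end, entity_type) spans with placeholders."""
        ordered = sorted(spans)
        # Assign tokens left to right so numbering follows reading order.
        tokens = [self.placeholder_for(t, text[s:e]) for s, e, t in ordered]
        # Substitute right to left so earlier offsets stay valid.
        for (start, end, _), token in reversed(list(zip(ordered, tokens))):
            text = text[:start] + token + text[end:]
        return text

    def reveal(self, text: str) -> str:
        """Swap known placeholders back; unknown tokens are left untouched."""
        return re.sub(
            r"<<[A-Z_]+_\d+>>",
            lambda m: self._reverse.get(m.group(0), m.group(0)),
            text,
        )


text = "Anna Müller, erreichbar unter 0171-2345678. Gruß, Anna Müller"
name, phone = "Anna Müller", "0171-2345678"
first = text.index(name)
second = text.index(name, first + 1)
spans = [
    (first, first + len(name), "PERSON"),
    (text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER"),
    (second, second + len(name), "PERSON"),
]

mapper = PlaceholderMapper()
redacted = mapper.redact(text, spans)
print(redacted)
# <<PERSON_1>>, erreichbar unter <<PHONE_NUMBER_1>>. Gruß, <<PERSON_1>>
print(mapper.reveal(redacted) == text)  # True
```

Both occurrences of the name map to the same token, which is what keeps an LLM's reply coherent and lets it be un-redacted afterwards.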

Commands

Command | What it does
noirdoc redact <files> | Redact one or more files (accepts directories; -o FILE or --output-dir DIR).
noirdoc reveal <file> | Reverse pseudonyms back to originals (DOCX / XLSX / plain; --namespace required).
noirdoc lookup <token> | Resolve a pseudonym like <<PERSON_1>> to its original value.
noirdoc ns list | List persistent namespaces under ~/.noirdoc/namespaces/.
noirdoc ns show <name> | Print the mapping summary for a namespace as JSON.
noirdoc ns delete <name> | Delete a namespace (prompts for confirmation).
noirdoc models pull | Download spaCy models and (optionally) GLiNER weights up front.

Run noirdoc <cmd> --help for the full flag list on any subcommand.

Before you start

A few honest caveats before you ship this into a pipeline:

  • Best results need [full]. On first use (or via noirdoc models pull) the full extra downloads roughly 560 MB of weights: spaCy de_core_news_lg, Flair ner-german-large, and a GLiNER multilingual model. Budget disk and bandwidth.
  • PDF reveal is not supported yet. Round-tripping placeholders back into a PDF is a hard problem (position drift, font metrics, image-based redactions). PDFs redact cleanly; reveal is pass-through. DOCX, XLSX, and plain text round-trip fully.
  • Alpha API. Classes and CLI flags may change between 0.1.x and 0.2.x. Pin accordingly.
  • Detector quality depends on the upstream models. Presidio + Flair + GLiNER do the heavy lifting. Noirdoc adds German-specific recognizers on top, but it does not train models.

German-first

Noirdoc defaults to German (language="de") with fallback to ["de", "en"] for mixed documents. What that actually means:

  • Custom recognizers in src/noirdoc/detection/presidio_detector.py:
    • GermanPhoneRecognizer — German phone formats (0171-..., +49...)
    • GermanSVNRRecognizer — Sozialversicherungsnummer with checksum
    • GermanSteuerIDRecognizer — 11-digit Steuer-ID with checksum
    • InvertedNameRecognizer — registered for both de and en to catch "Nachname, Vorname" patterns
  • Flair ner-german-large (XLM-R, F1 92.3 % on CoNLL-03 DE) handles lowercase German text — the case where spaCy tends to drop names.
  • GLiNER multilingual catches entity types the others miss.
  • German-style lowercase financial terms, German IBANs, German date formats, and German address patterns are covered in the test suite (tests/test_presidio_detector.py).

If you're working with German legal, medical, HR, or financial documents, this is what the defaults are tuned for.
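To make "with checksum" concrete: the 11th digit of a Steuer-ID is a check digit derived from the first ten, using the widely documented ISO 7064 MOD 11,10 scheme. The following is a minimal sketch of that check only, not the code inside GermanSteuerIDRecognizer. The function names are hypothetical, and the additional structural rules for real Steuer-IDs (for example digit-frequency constraints) are left out.

```python
def steuer_id_check_digit(first_ten: str) -> int:
    """Compute the 11th digit from the first ten digits (MOD 11,10)."""
    product = 10
    for ch in first_ten:
        total = (int(ch) + product) % 10
        if total == 0:
            total = 10
        product = (total * 2) % 11
    check = 11 - product
    return 0 if check == 10 else check


def has_valid_steuer_id_checksum(candidate: str) -> bool:
    """Return True if the candidate has 11 digits and a matching check digit."""
    digits = "".join(ch for ch in candidate if not ch.isspace())
    if len(digits) != 11 or not (digits.isascii() and digits.isdigit()):
        return False
    return steuer_id_check_digit(digits[:10]) == int(digits[10])


print(has_valid_steuer_id_checksum("86 095 742 719"))  # True
print(has_valid_steuer_id_checksum("86 095 742 710"))  # False, last digit altered
```

A checksum like this lets a recognizer discard most random 11-digit numbers (invoice numbers, phone fragments) that would otherwise be flagged.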

Supported formats

Format | Redact | Reveal (round-trip)
PDF | ✓ | ✗ (pass-through)
DOCX | ✓ | ✓
XLSX | ✓ | ✓
Plain text / CSV / MD / HTML | ✓ | ✓
PPTX / images | ✓ | ✗ (pass-through)

PDF reveal is an open contribution target — see CONTRIBUTING.md.

Advanced: shared mapping storage

The [redis] extra ships a RedisMappingBackend that plugs into the lower-level MappingStore — the same primitive Noirdoc Cloud uses for request-scoped, encrypted, TTL-bounded mapping persistence across workers. It is not wired into Redactor(namespace=...), which persists to the local filesystem under ~/.noirdoc/namespaces/. Use MappingStore when you have multiple workers that need to share pseudonym mappings for the same request, or when you want encrypted-at-rest mappings with automatic expiry.

import asyncio
from cryptography.fernet import Fernet
from redis.asyncio import Redis

from noirdoc.mappings.backends.redis_backend import RedisMappingBackend
from noirdoc.mappings.store import MappingStore

async def main() -> None:
    redis = Redis.from_url("redis://localhost:6379")
    store = MappingStore(
        backend=RedisMappingBackend(redis),
        encryption_key=Fernet.generate_key(),  # demo only: generate once, then load the same key in every worker
    )
    # store.save(request_id=..., tenant_id=..., mapper=...)
    # mappings = await store.load(request_id)

asyncio.run(main())

The encryption_key must be identical across workers that need to read the same mappings. MappingStore.save() accepts a ttl_days kwarg (default 30).
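One way to keep the key identical across workers is to generate it once and hand it to every process through the environment. The sketch below is an illustration using only the standard library: the variable name NOIRDOC_MAPPING_KEY and the helper load_mapping_key are hypothetical, and the shape check relies on the fact that a Fernet key is 32 bytes in url-safe base64.

```python
import base64
import os


def load_mapping_key(env_var: str = "NOIRDOC_MAPPING_KEY") -> bytes:
    """Read a shared Fernet key from the environment and sanity-check its shape."""
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set; every worker needs the same key")
    try:
        key = raw.encode("ascii")
        decoded = base64.urlsafe_b64decode(key)
    except ValueError as exc:  # covers binascii.Error and UnicodeEncodeError
        raise RuntimeError(f"{env_var} is not valid url-safe base64") from exc
    if len(decoded) != 32:
        raise RuntimeError(f"{env_var} must decode to 32 bytes, got {len(decoded)}")
    return key
```

The result would then be passed as MappingStore(encryption_key=load_mapping_key(), ...) in place of a freshly generated key.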

Noirdoc Cloud

Don't want to run this yourself? Noirdoc Cloud is the hosted API wrapper: a privacy-preserving reverse proxy for LLM calls that uses this exact pipeline, plus multi-tenancy, audit, and provider key management. Compliance story: what's on GitHub is what the cloud runs.

Contributing

Bug reports, detectors, and format support are all welcome. See CONTRIBUTING.md for dev setup, tests, and the recognizer pattern.

Security

Report vulnerabilities via GitHub's private vulnerability reporting — see SECURITY.md. Please don't open public issues for security bugs.

Changelog

See CHANGELOG.md. Follows Keep a Changelog and SemVer.

License

MIT © 2026 Antonio Maiolo / Nextaim GmbH. See LICENSE.


Built by Nextaim GmbH · noirdoc.de
