Offline document anonymizer for legal teams

anonymizer

Offline document anonymizer for legal teams. Replaces personally identifiable information (PII) in documents with structured tokens before sending them to external AI services.

Status: MVP-0 release candidate.

What it does

Drag a file (docx / xlsx / pdf, including scanned PDFs when local OCR is available) into the local web UI and get an anonymized document where:

Names, companies, financial details, addresses, emails, and phone numbers are replaced with structured tokens such as [Person_1], [Company_1], [ADDRESS_1], ...
Document metadata is cleared
No network calls during processing — runs entirely on your machine
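The token replacement described above can be illustrated with a minimal sketch. It is not the project's implementation: the `TokenMap` class, the `anonymize_emails` helper, the `EMAIL` label, and the simplified email regex are all hypothetical, and the sketch assumes that a repeated value should always receive the same token.

```python
import re
from collections import defaultdict

# Simplified email pattern for illustration only (hypothetical, not the project's detector).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TokenMap:
    """Assigns numbered tokens so the same value always maps to the same token."""

    def __init__(self):
        self._tokens = {}                  # (label, value) -> token
        self._counters = defaultdict(int)  # label -> last number issued

    def token_for(self, label, value):
        key = (label, value)
        if key not in self._tokens:
            self._counters[label] += 1
            self._tokens[key] = f"[{label}_{self._counters[label]}]"
        return self._tokens[key]


def anonymize_emails(text, tokens):
    """Replace every email address in text with its structured token."""
    return EMAIL_RE.sub(
        lambda m: tokens.token_for("EMAIL", m.group(0).lower()), text
    )


tokens = TokenMap()
print(anonymize_emails("Write to a@x.com or b@y.org, then a@x.com again.", tokens))
```

Keeping the mapping in one object means a person or address mentioned several times stays recognizable as the same entity in the anonymized text, which matters when the result is passed to an AI tool for analysis.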

Then send the result to your AI tool of choice.

MVP-0 scope

Formats: docx, xlsx, pdf with text layer, scanned PDF, and hybrid PDF
Languages: Russian, English (NER); language-agnostic detectors for emails, phones, IBAN, cards, IP/MAC/URL, dates, geocoordinates
Platforms: Windows + macOS
UI: local web app at 127.0.0.1 in your browser
Install: one-line installer script (curl on macOS / Linux, PowerShell on Windows) → uv tool install docs-anonymizer

Scanned and hybrid PDFs use local Tesseract OCR with English and Russian language packs. Password-protected files, additional languages, and editable recognized-DOCX export remain planned for later iterations.
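Language-agnostic detectors for cards and IBANs typically pair a pattern match with a checksum test to cut false positives. The sketch below shows the two standard checks (Luhn for card numbers, mod-97 for IBANs). Whether the anonymizer validates candidates this way is an assumption, and the function names are hypothetical.

```python
import re


def luhn_valid(number):
    """Luhn checksum used by payment card numbers."""
    digits = re.sub(r"[\s-]", "", number)
    if not re.fullmatch(r"[0-9]{12,19}", digits):
        return False
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:      # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban):
    """ISO 13616 mod-97 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and expect a remainder of 1."""
    s = re.sub(r"\s", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}[0-9]{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1


print(luhn_valid("4111 1111 1111 1111"))          # well-known test card number
print(iban_valid("GB82 WEST 1234 5698 7654 32"))  # widely used example IBAN
```

A checksum test like this only confirms that a string is well formed; it says nothing about whether the card or account exists.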

Installation

# macOS / Linux
curl -fsSL https://anonymizer.site/install.sh | sh

# Windows (PowerShell)
iwr -useb https://anonymizer.site/install.ps1 | iex

Then run anonymize — your browser will open at http://127.0.0.1:<port>.
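The `<port>` placeholder suggests the port is chosen at startup. One common way to do that, shown here as a sketch rather than as the tool's actual behavior, is to bind to port 0 on the loopback interface and let the operating system pick a free port. The helper names are hypothetical.

```python
import socket


def pick_loopback_port(host="127.0.0.1"):
    """Ask the OS for a free TCP port on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "any free port"
        return sock.getsockname()[1]


def local_url(port, host="127.0.0.1"):
    """Build the URL a browser would open for the local web app."""
    return f"http://{host}:{port}"


port = pick_loopback_port()
print(local_url(port))
```

Binding to 127.0.0.1 rather than 0.0.0.0 keeps the server unreachable from other machines. Note that the port is released when the probe socket closes, so another process could claim it before the server starts; a production server would usually bind once and keep the socket.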

OCR setup for scanned PDFs

Scanned and hybrid PDFs require system Tesseract with English and Russian language packs. The anonymizer installer offers to install Tesseract interactively and shows an approximate download/install size before asking. If you skip it, DOCX, XLSX, and PDFs with a text layer still work.

# macOS
brew install tesseract tesseract-lang

# Ubuntu / Debian
sudo apt install tesseract-ocr tesseract-ocr-eng tesseract-ocr-rus

# Windows (PowerShell)
winget install UB-Mannheim.TesseractOCR

On macOS, Homebrew's tesseract-lang package is large because it bundles all extra languages; expect up to roughly 720 MB on disk. Ubuntu/Debian and Windows downloads are usually smaller, and the package manager may show the exact download size.

After installing Tesseract, run:

anonymize doctor --no-network

If OCR is unavailable, scanned PDF processing is rejected with installation guidance instead of silently skipping scanned pages.
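To check the OCR prerequisites by hand, one option is to look for the `tesseract` executable and inspect the output of `tesseract --list-langs`. The sketch below is not how `anonymize doctor` works internally (that is not documented here); the function names and the `REQUIRED_LANGS` constant are hypothetical.

```python
import shutil
import subprocess

REQUIRED_LANGS = {"eng", "rus"}  # language packs named in this README


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    lines = [ln.strip() for ln in output.splitlines() if ln.strip()]
    # Skip the header line ("List of available languages ...").
    return {
        ln for ln in lines
        if not ln.lower().startswith("list of available languages")
    }


def missing_ocr_langs():
    """Return None if tesseract is not installed, else the set of missing packs."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    proc = subprocess.run(
        [exe, "--list-langs"], capture_output=True, text=True, check=False
    )
    # Older Tesseract builds print the list to stderr, so both streams are read.
    return REQUIRED_LANGS - parse_langs(proc.stdout + "\n" + proc.stderr)


result = missing_ocr_langs()
if result is None:
    print("tesseract not found on PATH")
elif result:
    print("missing language packs:", ", ".join(sorted(result)))
else:
    print("OCR prerequisites look complete")
```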

Stack

Python 3.11+, FastAPI + htmx, spaCy + Natasha, PyMuPDF, python-docx, openpyxl, lxml. Full details in the technical spec.

Architecture

Three-layer design: core (headless Python library), cli, and webapp (FastAPI on loopback), plus testkit for synthetic test corpus generation and feedback-loop tooling. Detectors are pluggable; language packs are drop-in. Manual masking is supported, and audit logging avoids PII leakage.
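A pluggable detector design can be sketched with a small interface that every detector implements. The names below (`Span`, `Detector`, `RegexDetector`, `run_detectors`) and the two patterns are hypothetical and are not taken from the project's core library.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


class Detector(Protocol):
    """Anything with a detect() method can be plugged in."""

    def detect(self, text: str) -> Iterable[Span]: ...


class RegexDetector:
    """A language-agnostic detector driven by a single regular expression."""

    def __init__(self, label, pattern):
        self.label = label
        self._re = re.compile(pattern)

    def detect(self, text):
        return [Span(m.start(), m.end(), self.label) for m in self._re.finditer(text)]


def run_detectors(text, detectors):
    """Collect spans from every detector, ordered by position in the text."""
    spans = []
    for detector in detectors:
        spans.extend(detector.detect(text))
    return sorted(spans, key=lambda s: (s.start, s.end))


detectors = [
    RegexDetector("IP", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    RegexDetector("URL", r"https?://\S+"),
]
text = "Ping 10.0.0.1 or see https://example.com"
for span in run_detectors(text, detectors):
    print(span.label, text[span.start:span.end])
```

With an interface like this, a new language pack only needs to contribute detector objects; the pipeline that collects and replaces spans stays unchanged.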

Licenses

The project is released under AGPL-3.0 because it depends on PyMuPDF (AGPL). All other dependencies are permissive open-source (MIT / Apache 2.0 / BSD / MPL). The source distribution published with each release contains the project source needed to satisfy AGPL source-availability obligations.

A page in the application UI will list all bundled libraries and models with their individual licenses.

Release history

Version	Released
0.3.0	Jun 9, 2026
0.2.32	Jun 8, 2026
0.2.31	Jun 2, 2026
0.2.30 (this version)	May 29, 2026
0.2.29	May 28, 2026
0.2.28	May 28, 2026
0.2.22	May 26, 2026
0.2.21	May 25, 2026
0.2.20	May 25, 2026
0.2.19	May 24, 2026
0.2.18	May 22, 2026
0.2.17	May 21, 2026
0.2.16	May 21, 2026
0.2.15	May 21, 2026
0.2.14	May 20, 2026
0.2.13	May 20, 2026
0.2.12	May 20, 2026
0.2.11	May 20, 2026
0.2.10	May 20, 2026
0.2.9	May 19, 2026
0.2.8	May 19, 2026
0.2.7	May 19, 2026
0.2.6	May 19, 2026
0.2.5	May 19, 2026
0.2.4	May 19, 2026
0.2.3	May 19, 2026
0.2.2	May 19, 2026
0.2.1	May 19, 2026
0.2.0	May 19, 2026
0.0.1	May 19, 2026

Download files

Source Distribution

docs_anonymizer-0.2.30.tar.gz (804.6 kB)

Uploaded May 29, 2026 Source

Built Distribution

docs_anonymizer-0.2.30-py3-none-any.whl (303.6 kB)

Uploaded May 29, 2026 Python 3

File details

Details for the file docs_anonymizer-0.2.30.tar.gz.

File metadata

Download URL: docs_anonymizer-0.2.30.tar.gz
Upload date: May 29, 2026
Size: 804.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for docs_anonymizer-0.2.30.tar.gz
Algorithm	Hash digest
SHA256	`30e42446a7824b648e6f72df8cb8f69bd8a62ac6a2831de4a85824fb60b3d9a5`
MD5	`be9d3f4e922b607bb925cb2c43452622`
BLAKE2b-256	`58468183dc4cf1dd5bde8fe307c5ed763c59ee0edf27c55317a4e1af30693c43`

File details

Details for the file docs_anonymizer-0.2.30-py3-none-any.whl.

File metadata

Download URL: docs_anonymizer-0.2.30-py3-none-any.whl
Upload date: May 29, 2026
Size: 303.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.9.5

File hashes

Hashes for docs_anonymizer-0.2.30-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bd6ce082abb1be1edd341d77b875638cd05b66520c0f7f16c2f930cde6e45e79`
MD5	`f25d7736fb926509051332f569ae0080`
BLAKE2b-256	`8c6cc861bea49cc79851fd88e26bf2c3b7a1eaf4eb6d605fea10e0bcba751a89`
