Offline document anonymizer for legal teams

anonymizer

Offline document anonymizer for legal teams. Replaces personally identifiable information (PII) in documents with structured tokens before sending them to external AI services.

Status: MVP-0 release candidate.

What it does

Drag a file (docx / pdf with text layer / xlsx) into the local web UI and get an anonymized document where:

  • Names, companies, financial details, addresses, email addresses, and phone numbers are replaced with structured tokens like [Person_1], [Company_1], [ADDRESS_1], ...
  • Document metadata is cleared
  • No network calls during processing — runs entirely on your machine

Then send the result to your AI tool of choice.
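
To illustrate the idea of structured tokens, here is a minimal sketch in which each distinct value gets one stable, numbered token per category. It is not the project's implementation: the TokenMap class, the anonymize_emails helper, the EMAIL category name, and the simplified email regex are all hypothetical, chosen only to show the principle.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a real detector: a simplified email pattern.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class TokenMap:
    """Assigns one stable token per distinct value, numbered per category."""

    def __init__(self):
        self._tokens = {}                  # (category, value) -> token
        self._counters = defaultdict(int)  # category -> last number used

    def token_for(self, category, value):
        key = (category, value)
        if key not in self._tokens:
            self._counters[category] += 1
            self._tokens[key] = f"[{category}_{self._counters[category]}]"
        return self._tokens[key]


def anonymize_emails(text, token_map):
    """Replace every email address with its token (case-insensitive match)."""
    return EMAIL_RE.sub(
        lambda m: token_map.token_for("EMAIL", m.group(0).lower()),
        text,
    )
```

Because the same value always maps to the same token, a reader of the anonymized text (human or AI) can still tell that two mentions refer to the same entity without seeing the original value.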

MVP-0 scope

  • Formats: docx, pdf with text layer, xlsx
  • Languages: Russian, English (NER); language-agnostic detectors for emails, phones, IBAN, cards, IP/MAC/URL, dates, geocoordinates
  • Platforms: Windows + macOS
  • UI: local web app at 127.0.0.1 in your browser
  • Install: a one-line installer script (curl on macOS, PowerShell on Windows) that runs uv tool install docs-anonymizer

OCR for scanned PDFs, password-protected files, and additional languages are planned for later iterations (MVP-1+).
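
Language-agnostic detectors for cards and IBANs can lean on published checksums rather than on language models. The sketch below shows the two standard checks (Luhn for card numbers, ISO 13616 mod-97 for IBANs). The function names and the length limits are assumptions for illustration; the source does not say how the project's detectors validate candidates.

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum, as used by payment card numbers."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:  # assumed lower bound, only to skip short digit runs
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 check: move the first four characters to the end,
    map letters to numbers (A=10 ... Z=35), and test the value mod 97."""
    s = iban.replace(" ", "").upper()
    if not (15 <= len(s) <= 34) or not (s.isascii() and s.isalnum()):
        return False
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1
```

A checksum pass like this is a cheap way to cut false positives: a 16-digit run that fails Luhn is more likely an invoice or account reference than a card number.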

Installation

# macOS / Linux
curl -fsSL https://anonymizer.site/install.sh | sh

# Windows (PowerShell)
iwr -useb https://anonymizer.site/install.ps1 | iex

Then run anonymize — your browser will open at http://127.0.0.1:<port>.
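
The <port> placeholder suggests the port is chosen at startup. One common way to do that, shown here only as a sketch, is to bind to port 0 on the loopback interface and let the operating system pick a free port. Whether anonymize does this is an assumption; pick_loopback_port, local_url, and open_ui are hypothetical names.

```python
import socket
import webbrowser


def pick_loopback_port() -> int:
    """Ask the OS for a currently free TCP port on 127.0.0.1."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 means "any free port"
        return s.getsockname()[1]


def local_url(port: int) -> str:
    """Build the loopback-only URL the browser should open."""
    return f"http://127.0.0.1:{port}"


def open_ui(port: int) -> bool:
    """Open the default browser at the local UI (not called here)."""
    return webbrowser.open(local_url(port))
```

Binding to 127.0.0.1 rather than 0.0.0.0 keeps the UI unreachable from other machines. Note that the port is released when the probe socket closes, so a real server would either bind once and keep the socket or retry if the port is taken in the meantime.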

Stack

Python 3.11+, FastAPI + htmx, spaCy + Natasha, PyMuPDF, python-docx, openpyxl, lxml. Full details in the technical spec.

Architecture

Three-layer design: core (headless Python library), cli, and webapp (FastAPI on loopback), plus testkit for synthetic test corpus generation and feedback loop tooling. Detectors are pluggable; language packs are drop-in. Manual masking is supported, and audit logging records no PII.
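
As a rough illustration of what a pluggable detector and a PII-free audit record could look like, consider the sketch below. None of these names (Span, Detector, RegexDetector, run_detectors, audit_summary) come from the project; they are hypothetical and only show the shape of such an interface.

```python
import re
from dataclasses import dataclass
from typing import Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class Span:
    """A detected region: character offsets and a category, never the text."""
    start: int
    end: int
    category: str


class Detector(Protocol):
    """Anything with a matching detect() method can be plugged in."""
    def detect(self, text: str) -> Iterable[Span]: ...


class RegexDetector:
    """A simple detector driven by one regular expression."""

    def __init__(self, category: str, pattern: str):
        self.category = category
        self._re = re.compile(pattern)

    def detect(self, text: str) -> Iterable[Span]:
        for m in self._re.finditer(text):
            yield Span(m.start(), m.end(), self.category)


def run_detectors(text: str, detectors: Iterable[Detector]) -> List[Span]:
    """Collect spans from all detectors, ordered by position in the text."""
    spans = [s for d in detectors for s in d.detect(text)]
    return sorted(spans, key=lambda s: (s.start, -s.end))


def audit_summary(spans: Iterable[Span]) -> Dict[str, int]:
    """Count detections per category; safe to log because it holds no PII."""
    counts: Dict[str, int] = {}
    for s in spans:
        counts[s.category] = counts.get(s.category, 0) + 1
    return counts
```

Keeping Span free of the matched text means an audit log built from spans can record what was found and where without ever storing the sensitive value itself.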

Licenses

The project is released under AGPL-3.0 because it depends on PyMuPDF (AGPL). All other dependencies are permissive open-source (MIT / Apache 2.0 / BSD / MPL). The source distribution published with each release contains the project source needed to satisfy AGPL source-availability obligations.

A page in the application UI will list all bundled libraries and models with their individual licenses.
