OCR-driven anonymization pipeline for medical reports and endoscopy frames

LX Anonymizer

LX Anonymizer is a toolkit for de-identifying endoscopy frames and medical reports. It combines OCR pipelines, spaCy-based NER, heuristic sanitizers, and report-specific rules to redact or pseudonymize sensitive information while preserving clinical context.

Highlights

  • End-to-end anonymization of PDFs and frame sequences using OCR, NER, and pseudonymization helpers.
  • Modular pipeline that lets you choose between Tesseract, TrOCR, ensemble OCR, and multiple metadata extractors.
  • Human-in-the-loop ready outputs: original/anonymized text side by side, metadata JSON, and validation artefacts.
  • Extensible ruleset covering device-specific renderers, fuzzy name matching, and language-specific replacements.
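To make the "fuzzy name matching" idea concrete, here is a minimal sketch built only on the standard library's difflib. It is not LX Anonymizer's implementation: the function names, the 0.75 threshold, and the [NAME] placeholder are assumptions chosen for illustration. It shows how a known patient name can still be caught after OCR has garbled a character or two.

```python
from difflib import SequenceMatcher


def find_fuzzy_name(text: str, name: str, threshold: float = 0.75) -> list[str]:
    """Return the tokens of `text` whose similarity to `name` meets `threshold`.

    Hypothetical helper, not part of lx_anonymizer.
    """
    matches = []
    for token in text.split():
        # Drop surrounding punctuation so "Schmidt," compares as "Schmidt".
        cleaned = token.strip(".,;:()")
        if not cleaned:
            continue
        ratio = SequenceMatcher(None, cleaned.lower(), name.lower()).ratio()
        if ratio >= threshold:
            matches.append(cleaned)
    return matches


def redact(text: str, name: str, placeholder: str = "[NAME]") -> str:
    """Replace every fuzzy hit for `name` in `text` with `placeholder`."""
    for hit in set(find_fuzzy_name(text, name)):
        text = text.replace(hit, placeholder)
    return text


if __name__ == "__main__":
    # "rn" is a common OCR misreading of "m".
    print(redact("Patient: Schrnidt, Hans", "Schmidt"))
```

A token-level ratio like this is cheap but blunt; a lower threshold catches more OCR noise at the cost of more false positives, which is one reason a human review step remains useful.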

Requirements

  • Python 3.12+
  • Linux or macOS (Windows support is experimental)
  • NVIDIA GPU recommended for real-time video anonymization (CUDA 12.x). CPU-only processing works but is slower.
  • Optional extras:
    • spaCy de_core_news_lg model (download after installation)
    • torchvision and torchaudio for video OCR workloads
    • Ollama-compatible LLMs for advanced metadata extraction

Installation

From PyPI (upcoming release)

pip install lx-anonymizer

Install extras to tailor the footprint:

pip install "lx-anonymizer[gpu,ocr,llm,dev]"

From source

git clone https://github.com/wg-lux/lx-anonymizer.git
cd lx-anonymizer
uv sync

Nix development shell

direnv allow
nix develop

This loads GPU, OCR, and tooling dependencies declared in devenv.nix.

Model downloads

After installation, fetch the German spaCy model:

python -m spacy download de_core_news_lg

The first CLI run also downloads OCR checkpoints (EAST, TrOCR, etc.). For air-gapped deployments, download the archives listed in lx_anonymizer/settings.py on a connected machine and place them in ~/.cache/lx-anonymizer on the target host.
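For air-gapped hosts, the copy step can be scripted. The sketch below assumes the archives have already been transferred into a local staging folder and only need to be copied into the cache directory named above. The helper name stage_archives is hypothetical, and whether the archives must also be unpacked is not stated here, so the sketch copies them unchanged.

```python
import shutil
from pathlib import Path


def stage_archives(src_dir, cache_dir="~/.cache/lx-anonymizer"):
    """Copy every regular file from `src_dir` into `cache_dir`.

    Hypothetical helper; returns the list of copied target paths.
    """
    src = Path(src_dir).expanduser()
    dest = Path(cache_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)

    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            target = dest / item.name
            shutil.copy2(item, target)  # copy2 also preserves timestamps
            copied.append(target)
    return copied
```

Calling stage_archives("/path/to/staging") before the first CLI run would populate the default cache location; pass a different cache_dir to test the copy without touching the real cache.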

Quickstart

CLI

python -m cli.report_reader process report.pdf --ensemble --output-dir ./anonymized

Useful options:

  • --llm-extractor {deepseek,medllama,llama3} for LLM-powered metadata extraction.
  • --use-ocr and --ensemble to switch OCR strategies.
  • batch and extract sub-commands for folder processing or metadata-only runs.
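When the CLI is driven from another script, it can help to assemble the argument list in one place. The following sketch uses only the flags shown above; the helper name build_process_command is hypothetical, and combining --llm-extractor with the process sub-command is an assumption that should be checked against the CLI's own --help output.

```python
import sys

# Extractor names as listed in the options above.
ALLOWED_EXTRACTORS = {"deepseek", "medllama", "llama3"}


def build_process_command(pdf, output_dir, ensemble=False, llm_extractor=None):
    """Return an argv list for `python -m cli.report_reader process ...`.

    Hypothetical helper, not part of lx_anonymizer.
    """
    cmd = [sys.executable, "-m", "cli.report_reader", "process", str(pdf)]
    if ensemble:
        cmd.append("--ensemble")
    if llm_extractor is not None:
        if llm_extractor not in ALLOWED_EXTRACTORS:
            raise ValueError(f"unknown extractor: {llm_extractor!r}")
        cmd += ["--llm-extractor", llm_extractor]
    cmd += ["--output-dir", str(output_dir)]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to execute it.
    print(build_process_command("report.pdf", "./anonymized", ensemble=True))
```

Building a list rather than a shell string avoids quoting problems with file names that contain spaces.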

Python API

from lx_anonymizer import ReportReader

reader = ReportReader(locale="de_DE")
original, anonymized, meta = reader.process_report(
    pdf_path="/path/to/report.pdf",
    use_ensemble=True,
    use_llm_extractor="deepseek",
)

See tests/test_cli_integration.py for more examples.
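The three values returned by process_report can be persisted for side-by-side review. This sketch assumes that original and anonymized are plain strings and that meta is a dictionary; the return types are not documented here, so treat that as an assumption. The helper name and the file naming scheme are hypothetical.

```python
import json
from pathlib import Path


def save_report_outputs(original, anonymized, meta, out_dir, stem="report"):
    """Write original text, anonymized text, and metadata next to each other.

    Hypothetical helper; assumes two strings and one dict.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = {
        "original": out / f"{stem}.original.txt",
        "anonymized": out / f"{stem}.anonymized.txt",
        "meta": out / f"{stem}.meta.json",
    }
    paths["original"].write_text(original, encoding="utf-8")
    paths["anonymized"].write_text(anonymized, encoding="utf-8")
    # default=str keeps the dump from failing on values such as dates.
    paths["meta"].write_text(
        json.dumps(meta, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return paths
```

Because the original text still contains identifying data, the output directory for the original file should be as protected as the source PDF.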

Data directories

By default, outputs are written to ~/etc/lx-anonymizer/{data,temp}. Change these locations in lx_anonymizer/directory_setup.py. Clean the temp directory regularly so that large intermediate artefacts do not accumulate.
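Cleaning the temp directory can be automated with a small script. The sketch below deletes files older than a given age; the helper name prune_temp and the seven-day default are assumptions, and the package may ship its own cleanup logic that should be preferred if it exists.

```python
import time
from pathlib import Path


def prune_temp(temp_dir, max_age_days=7.0, now=None):
    """Delete files under `temp_dir` last modified more than `max_age_days` ago.

    Hypothetical helper; returns the list of removed paths.
    """
    root = Path(temp_dir).expanduser()
    if not root.is_dir():
        return []

    current = time.time() if now is None else now
    cutoff = current - max_age_days * 86400  # seconds per day

    removed = []
    # sorted() materializes the listing before anything is deleted.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

For the default layout this would be called as prune_temp("~/etc/lx-anonymizer/temp"). Only files are removed; empty subdirectories are left in place.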

Development workflow

  • Lint: uv run flake8
  • Tests (CPU friendly): uv run pytest -m "not gpu"
    • GPU tests are marked and can be run with -m gpu
  • Build wheel for release: uv run python -m build
  • Full local check helper: scripts/run_checks.sh

Project roadmap

  1. Publish CPU-only wheel to TestPyPI.
  2. Add optional extras for GPU/LLM workloads and slim default install.
  3. Automate release workflow (wheel + sdist upload, GitHub release notes).
  4. Expose REST/gRPC service with validation UI.

Contributing

See CONTRIBUTING.md for contribution guidelines, testing instructions, and communication channels.

License

Released under the MIT License.

Contact

Questions? Email lux@coloreg.de.
