OCR-driven anonymization pipeline for medical reports and endoscopy frames
LX Anonymizer
LX Anonymizer is a toolkit for de-identifying endoscopy frames and medical reports. It combines OCR pipelines, spaCy-based NER, heuristic sanitizers, and report-specific rules to redact or pseudonymize sensitive information while preserving clinical context.
Highlights
- End-to-end anonymization of PDFs and frame sequences using OCR, NER, and pseudonymization helpers.
- Modular pipeline that lets you choose between Tesseract, TrOCR, ensemble OCR, and multiple metadata extractors.
- Human-in-the-loop ready outputs: original/anonymized text side by side, metadata JSON, and validation artefacts.
- Extensible ruleset covering device-specific renderers, fuzzy name matching, and language-specific replacements.
Requirements
- Python 3.12+
- Linux or macOS (Windows support is experimental)
- NVIDIA GPU recommended for real-time video anonymization (CUDA 12.x). CPU-only processing works but is slower.
- Optional extras:
  - spaCy de_core_news_lg model (download after installation)
  - Torch vision/audio for video OCR workloads
  - Ollama-compatible LLMs for advanced metadata extraction
Installation
From PyPI (upcoming release)
pip install lx-anonymizer
Install extras to tailor the footprint:
pip install "lx-anonymizer[gpu,ocr,llm,dev]"
From source
git clone https://github.com/wg-lux/lx-anonymizer.git
cd lx-anonymizer
uv sync
Nix development shell
direnv allow
nix develop
This loads GPU, OCR, and tooling dependencies declared in devenv.nix.
Model downloads
After installation, fetch the German spaCy model:
python -m spacy download de_core_news_lg
First CLI runs also download OCR checkpoints (EAST, TrOCR, etc.). For air-gapped deployments, grab the archives listed in lx_anonymizer/settings.py and place them in ~/.cache/lx-anonymizer.
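For an air-gapped setup, a small pre-flight check can confirm that the archives were copied before the first run. This is only a sketch: the helper name missing_archives and the file names passed to it are hypothetical placeholders, and the real archive names are the ones listed in lx_anonymizer/settings.py.

```python
from pathlib import Path


def missing_archives(cache_dir, expected_names):
    """Return the expected file names that are not present in cache_dir."""
    cache = Path(cache_dir).expanduser()
    return [name for name in expected_names if not (cache / name).is_file()]


if __name__ == "__main__":
    # Placeholder names (assumption); replace with the archives from settings.py.
    expected = ["example_checkpoint_a.bin", "example_checkpoint_b.bin"]
    for name in missing_archives("~/.cache/lx-anonymizer", expected):
        print(f"missing: {name}")
```

An empty result means every expected file is in place; the check only looks at file presence and does not validate archive contents.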
Quickstart
CLI
python -m cli.report_reader process report.pdf --ensemble --output-dir ./anonymized
Useful options:
- --llm-extractor {deepseek,medllama,llama3} for LLM-powered metadata extraction.
- --use-ocr and --ensemble to switch OCR strategies.
- batch and extract sub-commands for folder processing or metadata-only runs.
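When the CLI is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the flags named above; the helper build_process_command is hypothetical, and it assumes that --llm-extractor can be combined with the process sub-command in the same call.

```python
import shlex
import sys


def build_process_command(pdf_path, output_dir, ensemble=False, llm_extractor=None):
    """Assemble the argument list for one `process` run of the report reader CLI."""
    cmd = [sys.executable, "-m", "cli.report_reader", "process", str(pdf_path)]
    if ensemble:
        cmd.append("--ensemble")
    if llm_extractor is not None:
        # Assumption: the extractor flag is accepted alongside `process`.
        cmd += ["--llm-extractor", llm_extractor]
    cmd += ["--output-dir", str(output_dir)]
    return cmd


cmd = build_process_command(
    "report.pdf", "./anonymized", ensemble=True, llm_extractor="deepseek"
)
# Show the arguments that follow the interpreter path.
print(shlex.join(cmd[1:]))
```

To execute the run, pass the list to subprocess.run(cmd, check=True). Keeping the arguments as a list avoids shell quoting problems with paths that contain spaces.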
Python API
from lx_anonymizer import ReportReader
reader = ReportReader(locale="de_DE")
original, anonymized, meta = reader.process_report(
pdf_path="/path/to/report.pdf",
use_ensemble=True,
use_llm_extractor="deepseek",
)
See tests/test_cli_integration.py for more examples.
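For a human review step, the three return values can be written next to each other on disk. This is a minimal sketch, not part of the library: save_review_bundle and the file naming are hypothetical, and it assumes that original and anonymized are strings and that meta is a dictionary that can be dumped as JSON.

```python
import json
from pathlib import Path


def save_review_bundle(original, anonymized, meta, out_dir, stem="report"):
    """Write original text, anonymized text, and metadata side by side."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = {
        "original": out / f"{stem}.original.txt",
        "anonymized": out / f"{stem}.anonymized.txt",
        "meta": out / f"{stem}.meta.json",
    }
    paths["original"].write_text(original, encoding="utf-8")
    paths["anonymized"].write_text(anonymized, encoding="utf-8")
    # default=str keeps the dump from failing on dates or paths in the metadata.
    paths["meta"].write_text(
        json.dumps(meta, ensure_ascii=False, indent=2, default=str),
        encoding="utf-8",
    )
    return paths
```

Called as save_review_bundle(original, anonymized, meta, "./anonymized"), it returns the three paths. Note that the original text still contains identifying data, so the output directory needs the same protection as the source reports.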
Data directories
By default, outputs live in ~/etc/lx-anonymizer/{data,temp}. Adjust them in lx_anonymizer/directory_setup.py. Clean temp regularly to avoid large intermediate artefacts.
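Regular cleanup of the temp directory can be scripted. The sketch below is an assumption-laden example rather than a project utility: clean_temp is a hypothetical helper, the seven-day threshold is arbitrary, and it only removes regular files based on their modification time.

```python
import time
from pathlib import Path


def clean_temp(temp_dir, max_age_days=7.0, dry_run=True):
    """Find files under temp_dir older than max_age_days; delete them unless dry_run."""
    root = Path(temp_dir).expanduser()
    if not root.is_dir():
        return []
    cutoff = time.time() - max_age_days * 86400
    stale = [p for p in root.rglob("*") if p.is_file() and p.stat().st_mtime < cutoff]
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

A call such as clean_temp("~/etc/lx-anonymizer/temp", dry_run=True) lists candidates without deleting anything; switch to dry_run=False once the list looks right. Empty sub-directories are left in place.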
Development workflow
- Format & lint: uv run flake8
- Tests (CPU friendly): uv run pytest -m "not gpu"
  - GPU tests are marked and can be run with -m gpu
- Build wheel for release: uv run python -m build
- Full local check helper: scripts/run_checks.sh
Project roadmap
- Publish CPU-only wheel to TestPyPI.
- Add optional extras for GPU/LLM workloads and slim default install.
- Automate release workflow (wheel + sdist upload, GitHub release notes).
- Expose REST/gRPC service with validation UI.
Contributing
See CONTRIBUTING.md for contribution guidelines, testing instructions, and communication channels.
License
Released under the MIT License.
Contact
Questions? Email lux@coloreg.de .
Download files
File details
Details for the file lx_anonymizer-0.8.5.tar.gz.
File metadata
- Download URL: lx_anonymizer-0.8.5.tar.gz
- Upload date:
- Size: 564.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | b6b3f74133334ac3ffca0b3264c8a226d21ea211fd61908ca04c070653c1e502 |
| MD5 | 60215402ba7ae0e8725b99ba02ba01bc |
| BLAKE2b-256 | 1c7f2aa00932efc64e9af2b00b09256afe68ca83838c257ab267c21a796375a4 |
File details
Details for the file lx_anonymizer-0.8.5-py3-none-any.whl.
File metadata
- Download URL: lx_anonymizer-0.8.5-py3-none-any.whl
- Upload date:
- Size: 725.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e80af35e26b482c97e05d4cfb2b9a1fcfd1a8d4e37077ef8112d1da1fbf5d0a2 |
| MD5 | a9881e28cce8f8c3a72e3e5f0b34c842 |
| BLAKE2b-256 | 9f6706ec6904523b1d11e35378bccc4217450553eea39eba0289247f6f7e82bb |
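A downloaded archive can be checked against the published SHA256 digest before installation. The sketch below uses only the Python standard library; the helper names sha256_of and matches_published_digest are hypothetical, and the file is assumed to sit in the current directory under its published name.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 with a published digest, ignoring case and whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would be matches_published_digest("lx_anonymizer-0.8.5.tar.gz", "b6b3f74133334ac3ffca0b3264c8a226d21ea211fd61908ca04c070653c1e502"). A False result means the file differs from the one that was uploaded and should not be installed.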