Detect and redact PHI/PII from documents (.txt, .csv, .docx, .xlsx, .pdf)

no-phi

A command-line tool for detecting and redacting PHI/PII (protected health information / personally identifiable information) from documents. It reads .txt, .csv, .docx, .xlsx, and .pdf files, finds personal data, writes redacted copies, and produces an Excel findings report.

It is tuned for healthcare documents: a layer of biomedical recognizers suppresses the false positives that general-purpose NER produces on clinical text (e.g. tagging a drug name like Perindopril as a PERSON, or Cardiology as an ORGANIZATION).

# Scan a file or folder, write redacted copies + phi_report.xlsx
python main.py scan report.pdf
python main.py scan ./records/ --output ./records_cleaned/

# Detect only, don't write redacted files
python main.py scan report.docx --dry-run

# Restrict to specific entity types
python main.py scan data.csv --entities PERSON,PHONE_NUMBER,US_SSN

# Map detected values to stable IDs instead of <ENTITY_TYPE> (CSV cols: id,mapped_id)
python main.py scan notes.txt --mappings mappings.csv

# Ignore known-safe values (.txt/.csv/.xlsx/.json) — not redacted or reported
python main.py scan ./records/ --exclude allowlist.txt

# Pre-download all NLP models (otherwise downloaded on first scan)
python main.py download-models

Pipeline

Every file flows through four stages: extract → recognize → redact → report. The tools used at each stage are listed below.

                ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   file  ─────► │ 1. EXTRACT  ├──►│ 2. RECOGNIZE├──►│ 3. REDACT   ├──►│ 4. REPORT   │
                │  text +     │   │  PII spans  │   │  anonymize/ │   │  Excel      │
                │  positions  │   │  (Presidio) │   │  black-box  │   │  findings   │
                └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘

The CLI orchestration lives in nophi/cli.py: it collects input files (_collect_files), dispatches each by extension to a handler in nophi/handlers/, aggregates findings, and prints a Rich summary table. (main.py is a thin shim that calls into it.)
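
The collect-then-dispatch step described above might look roughly like the sketch below. It is an illustration only: `collect_files` and `handler_name` are hypothetical names (the real `_collect_files` in nophi/cli.py may differ), and recursing into sub-folders is an assumption.

```python
from pathlib import Path

# Extensions the README lists as supported.
SUPPORTED = {".txt", ".csv", ".docx", ".xlsx", ".pdf"}


def collect_files(target: Path) -> list[Path]:
    """Return the supported files under `target` (a file or a folder), sorted."""
    if target.is_file():
        candidates = [target]
    else:
        # Assumption: folders are walked recursively.
        candidates = [p for p in target.rglob("*") if p.is_file()]
    return sorted(p for p in candidates if p.suffix.lower() in SUPPORTED)


def handler_name(path: Path) -> str:
    """Map a file to the handler module that would process it."""
    ext = path.suffix.lower()
    # .txt and .csv share text.py; every other format has its own module.
    return "text" if ext in {".txt", ".csv"} else ext.lstrip(".")
```

Lower-casing the suffix means a file such as `REPORT.PDF` is routed the same way as `report.pdf`.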

| Layer | Package / tool |
| --- | --- |
| CLI, options, sub-commands | Typer |
| Terminal progress bars & tables | Rich |
| Entity detection engine | Presidio Analyzer |
| Anonymization engine | Presidio Anonymizer |
| General NER backend | spaCy `en_core_web_lg` |
| Biomedical NER | scispaCy `en_ner_bc5cdr_md`, `en_ner_bionlp13cg_md` |
| Drug-name matching | drug-named-entity-recognition (DrugBank) + bundled RxNorm name list |
| Report output | openpyxl |

1. Extract — text + positions

Each file type has a handler in nophi/handlers/ that pulls out the text to scan. For formats with layout (PDF), it also tracks where each piece of text sits so redactions can be placed precisely.

| Type | Handler | Library | Notes |
| --- | --- | --- | --- |
| `.txt` | text.py | stdlib | Whole file read as one string. |
| `.csv` | text.py | `csv` | Dialect auto-sniffed; scanned per cell. |
| `.docx` | docx.py | `python-docx` | Each paragraph and table cell. |
| `.xlsx` | xlsx.py | `openpyxl` | Every string cell across all sheets. |
| `.pdf` | pdf.py | `PyMuPDF` | Words + bounding boxes via `get_text("words")`, reassembled into text with a char-offset → word-box map. |
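
To make the PDF row more concrete, here is a minimal sketch of a char-offset → word-box map. It takes rows shaped like the tuples PyMuPDF's `get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, so it runs without PyMuPDF installed. The function names and the single-space joining rule are assumptions, not the actual pdf.py code.

```python
def build_text_and_map(words):
    """Join words into one string and record each word's character range.

    Returns (text, index), where index holds (char_start, char_end, box)
    entries and box is (x0, y0, x1, y1).
    """
    parts = []
    index = []
    pos = 0
    for x0, y0, x1, y1, word, *_ in words:
        if parts:
            pos += 1  # account for the single space that joins words
        index.append((pos, pos + len(word), (x0, y0, x1, y1)))
        parts.append(word)
        pos += len(word)
    return " ".join(parts), index


def boxes_for_span(index, start, end):
    """Return the boxes of every word overlapping the half-open span [start, end)."""
    return [box for s, e, box in index if s < end and start < e]
```

A detection's character offsets can then be turned back into the rectangles that need to be blacked out on the page.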

2. Recognize — find PII spans

nophi/analyzer.py builds a Presidio AnalyzerEngine (backed by the spaCy en_core_web_lg model) and exposes scan_text, which returns the detected entities with character offsets and confidence scores.

Detection comes from three sources working together:

Presidio built-ins — spaCy NER for PERSON, ORGANIZATION, LOCATION, DATE_TIME, NRP, plus pattern/checksum recognizers for PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, IBAN_CODE, IP_ADDRESS, URL, MEDICAL_LICENSE, and other structured identifiers.

Custom biomedical recognizers (nophi/recognizers.py) — these do not add PII. They recognize medical vocabulary and tag it with the internal type MEDICAL_TERM, which is used to protect that text from being scrubbed — not to redact it.

They form four complementary layers, each catching what the others miss (a curated deny-list, two drug-name lists, and an ML model). All four emit MEDICAL_TERM:

| # | Recognizer | Backed by | Catches | Matching |
| --- | --- | --- | --- | --- |
| 1 | `MedicalTermRecognizer` | deny-list in nophi/data/medical_terms.py | Hospital departments, specialties, wards, symptoms, diagnoses, procedures, labs/imaging, shorthand. | Exact (case-insensitive) |
| 2 | `MedicationRecognizer` | `drug-named-entity-recognition` (DrugBank) | Drug names, incl. common misspellings. | Fuzzy |
| 3 | `RxNormRecognizer` | bundled RxNorm name list (nophi/data/rxnorm_names.txt.gz) | Drug brand + ingredient names from RxNorm (incl. many vitamins/minerals under their ingredient names). | Exact n-gram |
| 4 | `BiomedicalNerRecognizer` | scispaCy `en_ner_bc5cdr_md` + `en_ner_bionlp13cg_md` | Chemicals, diseases, anatomy, genes, organisms, tissues; recognized by ML context, so it catches substances in no list. | ML model |

Layers 2–4 overlap on purpose: the two drug lists give high-precision exact/fuzzy hits, and the ML layer is the backstop for substances not in any list. Coverage of supplements/vitamins is therefore good for clinical/ingredient names (e.g. ascorbic acid, cholecalciferol) but thinner for lay/botanical names (e.g. fish oil, ginkgo biloba); the ML layer is the main net for those.
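
As an illustration of the "exact n-gram" matching used by the list-based layers, the sketch below looks up every run of one to three words in a set of lower-cased names. The function name, the three-word cap, and the longest-match-first rule are assumptions for the example; the real recognizers may tokenize and match differently.

```python
import re


def ngram_matches(text, names, max_words=3):
    """Return (start, end, matched_text) for word n-grams found in `names`.

    `names` is a set of lower-cased terms; matching is case-insensitive.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    matches = []
    for i in range(len(tokens)):
        # Try the longest n-gram first so "ascorbic acid" wins over "ascorbic".
        for n in range(max_words, 0, -1):
            if i + n > len(tokens):
                continue
            start, end = tokens[i][0], tokens[i + n - 1][1]
            if text[start:end].lower() in names:
                matches.append((start, end, text[start:end]))
                break
    return matches
```

Each hit would be emitted as a MEDICAL_TERM span, which protects that text rather than redacting it.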

Suppression logic in scan_text: any PERSON / ORGANIZATION / NRP / LOCATION detection that overlaps a MEDICAL_TERM span is dropped, and the MEDICAL_TERM spans themselves are removed from the output (they are not PII). The net effect is that genuine names/places survive while clinical vocabulary stops being mislabeled as identifiers.
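
The suppression rule can be sketched in a few lines. `Span` and `suppress_medical` are hypothetical names used for illustration (Presidio's own result objects carry more fields), so treat this as a model of the logic rather than the code in scan_text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int


# Entity types that are dropped when they overlap medical vocabulary.
SUPPRESSIBLE = {"PERSON", "ORGANIZATION", "NRP", "LOCATION"}


def suppress_medical(spans):
    """Drop suppressible spans that overlap a MEDICAL_TERM, then drop MEDICAL_TERM itself."""
    medical = [s for s in spans if s.entity_type == "MEDICAL_TERM"]

    def overlaps_medical(span):
        return any(span.start < m.end and m.start < span.end for m in medical)

    return [
        s
        for s in spans
        if s.entity_type != "MEDICAL_TERM"
        and not (s.entity_type in SUPPRESSIBLE and overlaps_medical(s))
    ]
```

Structured identifiers such as PHONE_NUMBER are left alone even when they overlap a medical term, because only the four NER-derived types are suppressible.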

StreetAddressRecognizer (nophi/recognizers.py) — unlike the biomedical recognizers, this one adds PII that Presidio's defaults miss. spaCy NER tags cities/regions (Scarborough) but not street lines, so a regex matches a house number + 1–3 street-name words + a known street-type suffix (Rd, Street, Ave, Blvd, Dr, …) and reports it as LOCATION. It handles bare addresses (2867 Ellesmere Rd) as well as full ones, plus alphanumeric house numbers (221B) and ordinal street names (350 5th Avenue). Requiring a leading house number keeps it from matching a Dr. title, and requiring a street-type suffix after the name words keeps it from matching dosages like 100 mg tablet.
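
A rough approximation of such a pattern is shown below. The suffix list is a small sample, matching is case-sensitive, and `find_street_addresses` is a hypothetical helper; the regex actually used in nophi/recognizers.py is likely more complete.

```python
import re

# Sample of street-type suffixes; the real list is assumed to be longer.
STREET_TYPES = r"(?:Rd|Road|St|Street|Ave|Avenue|Blvd|Boulevard|Dr|Drive)"

STREET_RE = re.compile(
    r"\b\d+[A-Za-z]?"             # house number, optionally alphanumeric (221B)
    r"(?:\s+[A-Za-z0-9]+){1,3}?"  # 1-3 street-name words, incl. ordinals (5th)
    r"\s+" + STREET_TYPES + r"\b\.?"  # street-type suffix, optional trailing dot
)


def find_street_addresses(text):
    """Return every substring that looks like a street line."""
    return [m.group(0) for m in STREET_RE.finditer(text)]
```

For example, `find_street_addresses("Lives at 2867 Ellesmere Rd, Scarborough")` picks out the street line and leaves the city to the NER model.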

3. Redact — anonymize or black-box

nophi/redactor.py builds a Presidio AnonymizerEngine and the operator set used to replace each entity. By default an entity becomes <ENTITY_TYPE>; with --mappings (CSV columns id,mapped_id), a detection whose text matches an id is replaced by its mapped_id instead (token-overlap match, applied across all entity types so it works regardless of how Presidio classified the value). The --exclude option takes a .txt/.csv/.xlsx/.json list of values to ignore — any detection matching one (case-insensitive) is dropped in scan_text before redaction or reporting.
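
The replacement rules can be modelled without Presidio as follows. This is a sketch under stated assumptions: "token-overlap match" is interpreted here as sharing at least one whitespace-separated token (case-insensitive), detections are assumed not to overlap, and `replacement_for` and `redact` are hypothetical names.

```python
def replacement_for(entity_type, value, mappings):
    """Return the mapped_id whose id shares a token with `value`, else <ENTITY_TYPE>."""
    tokens = set(value.lower().split())
    for id_value, mapped_id in mappings.items():
        if tokens & set(id_value.lower().split()):
            return mapped_id
    return f"<{entity_type}>"


def redact(text, detections, mappings=None, exclude=()):
    """Replace each (entity_type, start, end) detection in `text`.

    Values listed in `exclude` (case-insensitive) are left untouched.
    """
    mappings = mappings or {}
    excluded = {v.lower() for v in exclude}
    # Work right to left so earlier offsets stay valid after each substitution.
    for entity_type, start, end in sorted(detections, key=lambda d: d[1], reverse=True):
        value = text[start:end]
        if value.lower() in excluded:
            continue
        text = text[:start] + replacement_for(entity_type, value, mappings) + text[end:]
    return text
```

With `mappings={"John Doe": "PATIENT_001"}`, every detection of that name is replaced by the same stable ID, whatever entity type it was classified as.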

How the replacement is applied depends on the format:

.txt / .csv — Presidio rewrites the string in place (anonymize_text).
.docx — the anonymized text is written back into the paragraph/cell, preserving document structure.
.xlsx — matching cell values are overwritten.
.pdf — each detected span is mapped back to the exact word bounding boxes it covers; page.add_redact_annot() draws a filled black box with a short white label, and page.apply_redactions() permanently removes the underlying text from the PDF content stream (a true irreversible redaction, not just a visual cover).

4. Report — Excel findings

nophi/reporter.py writes an openpyxl workbook (default phi_report.xlsx) with two sheets:

Findings — one row per detection: file, entity type, original text, replacement, character position.
Summary — entity-type counts and per-file PHI counts.
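
The counts on the Summary sheet amount to two tallies over the findings rows. A minimal sketch, assuming each finding is a dict with `file` and `entity_type` keys (the real row structure in nophi/reporter.py is not shown here):

```python
from collections import Counter


def summarize(findings):
    """Return (counts per entity type, counts per file) for a list of findings."""
    by_type = Counter(f["entity_type"] for f in findings)
    by_file = Counter(f["file"] for f in findings)
    return by_type, by_file
```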

Models & first run

The NLP models are downloaded on first use and cached under ~/.cache/no-phi/models/ — they are not bundled into the program. nophi/models.py handles fetching, extracting, and (for the scispaCy models) patching them to load under the installed spaCy version.

| Model | Size | Source |
| --- | --- | --- |
| `en_core_web_lg` (base NER) | ~560 MB | spaCy GitHub releases (pip wheel) |
| `en_ner_bc5cdr_md` | ~115 MB | scispaCy S3 release (`.tar.gz`) |
| `en_ner_bionlp13cg_md` | ~120 MB | scispaCy S3 release (`.tar.gz`) |

The scispaCy biomedical models load with plain spaCy — the heavyweight scispacy package (and its nmslib/scipy/scikit-learn dependencies) is not required. nophi/models.py rewrites a stale boolean in each model's config.cfg during extraction so it validates under spaCy 3.8.

Run python main.py download-models to fetch everything ahead of time, or just run a scan and the models download automatically on first invocation.
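
To check ahead of time which models a first scan would still need to download, something like the sketch below could be used. It assumes each model is cached as a directory named after the model under ~/.cache/no-phi/models/; that layout, and the `missing_models` helper, are guesses rather than documented behavior.

```python
from pathlib import Path

# Model names taken from the table above.
MODEL_NAMES = ("en_core_web_lg", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md")


def default_cache_dir() -> Path:
    """Cache location stated in the README."""
    return Path.home() / ".cache" / "no-phi" / "models"


def missing_models(cache_dir=None):
    """Return the model names that have no directory in the cache yet."""
    cache_dir = cache_dir or default_cache_dir()
    # Assumption: one sub-directory per model, named after the model.
    return [name for name in MODEL_NAMES if not (cache_dir / name).is_dir()]
```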

Project layout

pyproject.toml       # packaging + dependencies + `nophi` console entry point
main.py              # thin entry-point shim → nophi.cli:main (used by Nuitka build)
nophi/               # the package
├── __main__.py      # enables `python -m nophi`
├── cli.py           # Typer app + orchestration
├── analyzer.py      # build_analyzer() + scan_text()  (detection)
├── recognizers.py   # custom recognizers (MEDICAL_TERM layers + StreetAddressRecognizer)
├── redactor.py      # anonymization
├── reporter.py      # Excel findings report
├── models.py        # model download / cache
├── handlers/        # per-format read/write/redact (text, docx, xlsx, pdf)
└── data/            # medical_terms.py + bundled rxnorm_names.txt.gz
scripts/             # build_rxnorm_list.py (refreshes the bundled RxNorm list)
docs/                # expansion_notes.md (user guide lives in the repo-root docs/)

Install

As a pip package (recommended)

Installs a nophi command on your PATH:

pip install .                  # or `pip install nophi` once published to PyPI
nophi download-models         # one-time: fetch all NLP models (~800 MB total)
nophi scan report.pdf

You can also run it without installing the script via python -m nophi scan ....

For development

pip install -e .               # editable install (deps come from pyproject.toml)
# or: pip install -r requirements.txt
python main.py download-models # optional: pre-fetch models

See the user guide for end-user instructions.


This version

0.1.0

Jun 29, 2026

