Detect and redact PHI/PII from documents (.txt, .csv, .docx, .xlsx, .pdf)

no-phi

A command-line tool for detecting and redacting PHI/PII (protected health information / personally identifiable information) from documents. It reads .txt, .csv, .docx, .xlsx, and .pdf files, finds personal data, writes redacted copies, and produces an Excel findings report.

It is tuned for healthcare documents: a layer of biomedical recognizers suppresses the false positives that general-purpose NER produces on clinical text (e.g. tagging a drug name like Perindopril as a PERSON, or Cardiology as an ORGANIZATION).

# Scan a file or folder, write redacted copies + phi_report.xlsx
python main.py scan report.pdf
python main.py scan ./records/ --output ./records_cleaned/

# Detect only, don't write redacted files
python main.py scan report.docx --dry-run

# Restrict to specific entity types
python main.py scan data.csv --entities PERSON,PHONE_NUMBER,US_SSN

# Map detected values to stable IDs instead of <ENTITY_TYPE> (CSV cols: id,mapped_id)
python main.py scan notes.txt --mappings mappings.csv

# Ignore known-safe values (.txt/.csv/.xlsx/.json) — not redacted or reported
python main.py scan ./records/ --exclude allowlist.txt

# Pre-download all NLP models (otherwise downloaded on first scan)
python main.py download-models

Pipeline

Every file flows through four stages: extract → recognize → redact → report. The tools used at each stage are listed below.

                ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
   file  ─────► │ 1. EXTRACT  ├──►│ 2. RECOGNIZE├──►│ 3. REDACT   ├──►│ 4. REPORT   │
                │  text +     │   │  PII spans  │   │  anonymize/ │   │  Excel      │
                │  positions  │   │  (Presidio) │   │  black-box  │   │  findings   │
                └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘

The CLI orchestration lives in nophi/cli.py: it collects input files (_collect_files), dispatches each by extension to a handler in nophi/handlers/, aggregates findings, and prints a Rich summary table. (main.py is a thin shim that calls into it.)
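
The collect-then-dispatch flow described above can be sketched roughly as follows. This is an illustration only, not the code in nophi/cli.py: the names collect_files, dispatch, and the handlers dictionary are hypothetical, and recursing into sub-folders is an assumption.

```python
from pathlib import Path

# Extensions listed in this README.
SUPPORTED = {".txt", ".csv", ".docx", ".xlsx", ".pdf"}


def collect_files(target: Path) -> list[Path]:
    """Return the supported files for a target that is a file or a folder."""
    if target.is_file():
        return [target] if target.suffix.lower() in SUPPORTED else []
    # Assumption: folders are walked recursively.
    return sorted(
        p for p in target.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


def dispatch(path: Path, handlers: dict) -> list:
    """Pick the handler registered for this extension and run it on the file."""
    handler = handlers[path.suffix.lower()]
    return handler(path)
```

Keying the handlers on the lower-cased suffix means a file such as REPORT.PDF is routed the same way as report.pdf.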

| Layer | Package / tool |
|-------|----------------|
| CLI, options, sub-commands | Typer |
| Terminal progress bars & tables | Rich |
| Entity detection engine | Presidio Analyzer |
| Anonymization engine | Presidio Anonymizer |
| General NER backend | spaCy en_core_web_lg |
| Biomedical NER | scispaCy en_ner_bc5cdr_md, en_ner_bionlp13cg_md |
| Drug-name matching | drug-named-entity-recognition (DrugBank) + bundled RxNorm name list |
| Report output | openpyxl |

1. Extract — text + positions

Each file type has a handler in nophi/handlers/ that pulls out the text to scan. For formats with layout (PDF), it also tracks where each piece of text sits so redactions can be placed precisely.

| Type | Handler | Library | Notes |
|------|---------|---------|-------|
| .txt | text.py | stdlib | Whole file read as one string. |
| .csv | text.py | csv | Dialect auto-sniffed; scanned per cell. |
| .docx | docx.py | python-docx | Each paragraph and table cell. |
| .xlsx | xlsx.py | openpyxl | Every string cell across all sheets. |
| .pdf | pdf.py | PyMuPDF | Words + bounding boxes via get_text("words"), reassembled into text with a char-offset → word-box map. |
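
To make the char-offset → word-box map concrete, here is a minimal sketch in plain Python. It assumes word tuples shaped like the ones PyMuPDF returns from get_text("words"), that is (x0, y0, x1, y1, word, block_no, line_no, word_no), and it joins words with single spaces, which is a simplification. The function names are hypothetical and are not taken from pdf.py.

```python
def build_text_map(words):
    """Join words with single spaces and record (start, end, box) for each word."""
    parts, spans, pos = [], [], 0
    for x0, y0, x1, y1, word, *_ in words:
        if parts:
            pos += 1  # account for the joining space
        spans.append((pos, pos + len(word), (x0, y0, x1, y1)))
        parts.append(word)
        pos += len(word)
    return " ".join(parts), spans


def boxes_for_span(spans, start, end):
    """Return the boxes of all words that overlap the half-open range [start, end)."""
    return [box for s, e, box in spans if s < end and e > start]
```

A detection reported as character offsets into the reassembled text can then be turned back into the rectangles that need covering, one per word it touches.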

2. Recognize — find PII spans

nophi/analyzer.py builds a Presidio AnalyzerEngine (backed by the spaCy en_core_web_lg model) and exposes scan_text, which returns the detected entities with character offsets and confidence scores.

Detection comes from three sources working together:

  • Presidio built-ins — spaCy NER for PERSON, ORGANIZATION, LOCATION, DATE_TIME, NRP, plus pattern/checksum recognizers for PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, IBAN_CODE, IP_ADDRESS, URL, MEDICAL_LICENSE, and other structured identifiers.

  • Custom biomedical recognizers (nophi/recognizers.py) — these do not add PII. They recognize medical vocabulary and tag it with the internal type MEDICAL_TERM, which is used to protect that text from being scrubbed — not to redact it.

    They form four complementary layers, each catching what the others miss (a curated deny-list, two drug-name lists, and an ML model). All four emit MEDICAL_TERM:

    | # | Recognizer | Backed by | Catches | Matching |
    |---|------------|-----------|---------|----------|
    | 1 | MedicalTermRecognizer | deny-list in nophi/data/medical_terms.py | Hospital departments, specialties, wards, symptoms, diagnoses, procedures, labs/imaging, shorthand. | Exact (case-insensitive) |
    | 2 | MedicationRecognizer | drug-named-entity-recognition (DrugBank) | Drug names, incl. common misspellings. | Fuzzy |
    | 3 | RxNormRecognizer | bundled RxNorm name list (nophi/data/rxnorm_names.txt.gz) | Drug brand + ingredient names from RxNorm (incl. many vitamins/minerals under their ingredient names). | Exact n-gram |
    | 4 | BiomedicalNerRecognizer | scispaCy en_ner_bc5cdr_md + en_ner_bionlp13cg_md | Chemicals, diseases, anatomy, genes, organisms, tissues; recognized from context by the ML models, so it catches substances that appear in no list. | ML model |

    Layers 2–4 overlap on purpose: the two drug lists give high-precision exact/fuzzy hits, and the ML layer is the backstop for substances not in any list. Coverage of supplements/vitamins is therefore good for clinical/ingredient names (e.g. ascorbic acid, cholecalciferol) but thinner for lay/botanical names (e.g. fish oil, ginkgo biloba); the ML layer is the main net for those.

    Suppression logic in scan_text: any PERSON / ORGANIZATION / NRP / LOCATION detection that overlaps a MEDICAL_TERM span is dropped, and the MEDICAL_TERM spans themselves are removed from the output (they are not PII). The net effect is that genuine names/places survive while clinical vocabulary stops being mislabeled as identifiers.

  • StreetAddressRecognizer (nophi/recognizers.py) — unlike the biomedical recognizers, this one adds PII that Presidio's defaults miss. spaCy NER tags cities/regions (Scarborough) but not street lines, so a regex matches a house number + 1–3 street-name words + a known street-type suffix (Rd, Street, Ave, Blvd, Dr, …) and reports it as LOCATION. It handles bare addresses (2867 Ellesmere Rd) as well as full ones, plus alphanumeric house numbers (221B) and ordinal street names (350 5th Avenue). Requiring a leading number keeps it from matching a Dr. title or dosages like 100 mg tablet.
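
As a rough illustration of the two custom behaviours above, the sketch below pairs a simplified street-address pattern with the overlap-based MEDICAL_TERM suppression. It uses only the standard library rather than Presidio's recognizer classes, and everything in it (the Span type, the suffix list, the exact regex, the function names) is a hypothetical stand-in for what nophi/recognizers.py and scan_text actually do.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int


# Assumed, abbreviated suffix list; the real recognizer likely knows more.
STREET_TYPES = r"(?:Street|St|Road|Rd|Avenue|Ave|Boulevard|Blvd|Drive|Dr|Lane|Ln)"
ADDRESS_RE = re.compile(
    r"\b\d+[A-Za-z]?\s+"          # house number, optionally alphanumeric (221B)
    r"(?:[A-Za-z0-9]+\s+){1,3}?"  # 1-3 street-name words, incl. ordinals (5th)
    + STREET_TYPES
    + r"\b"
)

# Entity types that a MEDICAL_TERM overlap is allowed to cancel.
SUPPRESSIBLE = {"PERSON", "ORGANIZATION", "NRP", "LOCATION"}


def find_addresses(text: str) -> list[Span]:
    """Report each street-address match as a LOCATION span."""
    return [Span("LOCATION", m.start(), m.end()) for m in ADDRESS_RE.finditer(text)]


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def suppress(spans: list[Span]) -> list[Span]:
    """Drop MEDICAL_TERM spans and any suppressible span that overlaps one."""
    medical = [s for s in spans if s.entity_type == "MEDICAL_TERM"]
    kept = []
    for s in spans:
        if s.entity_type == "MEDICAL_TERM":
            continue  # internal marker only, never reported as PII
        if s.entity_type in SUPPRESSIBLE and any(overlaps(s, m) for m in medical):
            continue  # clinical vocabulary mislabeled as an identifier
        kept.append(s)
    return kept
```

Because the pattern starts with a digit run, a bare "Dr. Smith" or "100 mg tablet" has nothing to anchor on, which mirrors the leading-number requirement described above.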

3. Redact — anonymize or black-box

nophi/redactor.py builds a Presidio AnonymizerEngine and the operator set used to replace each entity. By default an entity becomes <ENTITY_TYPE>; with --mappings (CSV columns id,mapped_id), a detection whose text matches an id is replaced by its mapped_id instead (token-overlap match, applied across all entity types so it works regardless of how Presidio classified the value). The --exclude option takes a .txt/.csv/.xlsx/.json list of values to ignore — any detection matching one (case-insensitive) is dropped in scan_text before redaction or reporting.
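
A minimal sketch of how a replacement might be chosen under these options is shown below. The README does not spell out the exact matching rule, so "token overlap" is interpreted here as "shares at least one lower-cased, whitespace-separated token", which is an assumption; the helper names are hypothetical and not taken from redactor.py.

```python
import csv


def load_mappings(fileobj) -> dict[str, str]:
    """Read a CSV with the columns id,mapped_id into a dictionary."""
    return {row["id"]: row["mapped_id"] for row in csv.DictReader(fileobj)}


def tokens(value: str) -> set[str]:
    return set(value.lower().split())


def choose_replacement(entity_type: str, text: str, mappings: dict[str, str]) -> str:
    """Return the mapped ID on a token overlap, else the default <ENTITY_TYPE>."""
    detected = tokens(text)
    for original, mapped in mappings.items():
        # Assumption: any shared token counts as a match.
        if detected & tokens(original):
            return mapped
    return f"<{entity_type}>"


def is_excluded(text: str, exclusions: list[str]) -> bool:
    """Case-insensitive check against the --exclude values."""
    return text.strip().lower() in {e.strip().lower() for e in exclusions}
```

Under this interpretation a partial detection such as "Smith" still resolves to the ID mapped for "John Smith", whatever entity type it was tagged with.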

How the replacement is applied depends on the format:

  • .txt / .csv — Presidio rewrites the string in place (anonymize_text).
  • .docx — the anonymized text is written back into the paragraph/cell, preserving document structure.
  • .xlsx — matching cell values are overwritten.
  • .pdf — each detected span is mapped back to the exact word bounding boxes it covers; page.add_redact_annot() draws a filled black box with a short white label, and page.apply_redactions() permanently removes the underlying text from the PDF content stream (a true irreversible redaction, not just a visual cover).
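
For the plain-string formats, the idea behind offset-based replacement can be sketched in a few lines. This is not Presidio's implementation, only an illustration of why spans are applied from the end of the string backwards; apply_replacements is a hypothetical name and the findings are assumed not to overlap.

```python
def apply_replacements(text: str, findings) -> str:
    """Apply (start, end, replacement) triples to text.

    Working from the last span to the first keeps the earlier offsets
    valid even when a replacement changes the length of the string.
    """
    out = text
    for start, end, replacement in sorted(findings, reverse=True):
        out = out[:start] + replacement + out[end:]
    return out
```

Applying the same triples front to back would shift every later offset as soon as one replacement was longer or shorter than the text it replaced.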

4. Report — Excel findings

nophi/reporter.py writes an openpyxl workbook (default phi_report.xlsx) with two sheets:

  • Findings — one row per detection: file, entity type, original text, replacement, character position.
  • Summary — entity-type counts and per-file PHI counts.
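
The Summary sheet's two tallies can be derived from the Findings rows with a couple of counters. The sketch below leaves out the openpyxl writing step and assumes each finding is a dictionary with "file" and "entity_type" keys; those field names and the summarize function are hypothetical.

```python
from collections import Counter


def summarize(findings):
    """Return (counts per entity type, counts per file) for a list of findings."""
    rows = list(findings)  # materialize so the rows can be read twice
    by_type = Counter(row["entity_type"] for row in rows)
    by_file = Counter(row["file"] for row in rows)
    return by_type, by_file
```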

Models & first run

The NLP models are downloaded on first use and cached under ~/.cache/no-phi/models/ — they are not bundled into the program. nophi/models.py handles fetching, extracting, and (for the scispaCy models) patching them to load under the installed spaCy version.

| Model | Size | Source |
|-------|------|--------|
| en_core_web_lg (base NER) | ~560 MB | spaCy GitHub releases (pip wheel) |
| en_ner_bc5cdr_md | ~115 MB | scispaCy S3 release (.tar.gz) |
| en_ner_bionlp13cg_md | ~120 MB | scispaCy S3 release (.tar.gz) |

The scispaCy biomedical models load with plain spaCy — the heavyweight scispacy package (and its nmslib/scipy/scikit-learn dependencies) is not required. nophi/models.py rewrites a stale boolean in each model's config.cfg during extraction so it validates under spaCy 3.8.

Run python main.py download-models to fetch everything ahead of time, or just run a scan and the models download automatically on first invocation.
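
The download-on-first-use behaviour amounts to a check-then-fetch cache. The sketch below shows that pattern under stated assumptions: ensure_model and the fetch callback are hypothetical names, and the real nophi/models.py also handles extraction and config patching, which is omitted here.

```python
from pathlib import Path
from typing import Callable

# Cache location named in this README.
DEFAULT_CACHE = Path.home() / ".cache" / "no-phi" / "models"


def ensure_model(
    name: str,
    fetch: Callable[[str, Path], None],
    cache_dir: Path = DEFAULT_CACHE,
) -> Path:
    """Return the cached model path, calling fetch(name, target) only if missing."""
    target = cache_dir / name
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        fetch(name, target)  # hypothetical downloader supplied by the caller
    return target
```

Passing the downloader in as a callback keeps the caching logic easy to exercise without network access.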


Project layout

pyproject.toml       # packaging + dependencies + `nophi` console entry point
main.py              # thin entry-point shim → nophi.cli:main (used by Nuitka build)
nophi/               # the package
├── __main__.py      # enables `python -m nophi`
├── cli.py           # Typer app + orchestration
├── analyzer.py      # build_analyzer() + scan_text()  (detection)
├── recognizers.py   # custom recognizers (MEDICAL_TERM layers + StreetAddressRecognizer)
├── redactor.py      # anonymization
├── reporter.py      # Excel findings report
├── models.py        # model download / cache
├── handlers/        # per-format read/write/redact (text, docx, xlsx, pdf)
└── data/            # medical_terms.py + bundled rxnorm_names.txt.gz
scripts/             # build_rxnorm_list.py (refreshes the bundled RxNorm list)
docs/                # expansion_notes.md (user guide lives in the repo-root docs/)

Install

As a pip package (recommended)

Installs a nophi command on your PATH:

pip install .                  # or `pip install nophi` once published to PyPI
nophi download-models         # one-time: fetch NLP models (~795 MB in total)
nophi scan report.pdf

You can also run it without installing the script via python -m nophi scan ....

For development

pip install -e .               # editable install (deps come from pyproject.toml)
# or: pip install -r requirements.txt
python main.py download-models # optional: pre-fetch models

See the user guide for end-user instructions.

