Detect and redact PHI/PII from documents (.txt, .csv, .docx, .xlsx, .pdf)
no-phi
A command-line tool for detecting and redacting PHI/PII (protected health
information / personally identifiable information) from documents. It reads
.txt, .csv, .docx, .xlsx, and .pdf files, finds personal data, writes
redacted copies, and produces an Excel findings report.
It is tuned for healthcare documents: a layer of biomedical recognizers suppresses the false positives that general-purpose NER produces on clinical text (e.g. tagging a drug name like Perindopril as a PERSON, or Cardiology as an ORGANIZATION).
# Scan a file or folder, write redacted copies + phi_report.xlsx
python main.py scan report.pdf
python main.py scan ./records/ --output ./records_cleaned/
# Detect only, don't write redacted files
python main.py scan report.docx --dry-run
# Restrict to specific entity types
python main.py scan data.csv --entities PERSON,PHONE_NUMBER,US_SSN
# Map detected values to stable IDs instead of <ENTITY_TYPE> (CSV cols: id,mapped_id)
python main.py scan notes.txt --mappings mappings.csv
# Ignore known-safe values (.txt/.csv/.xlsx/.json) — not redacted or reported
python main.py scan ./records/ --exclude allowlist.txt
# Pre-download all NLP models (otherwise downloaded on first scan)
python main.py download-models
Pipeline
Every file flows through four stages: extract → recognize → redact → report. The tools used at each stage are listed below.
            ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐
file ─────► │ 1. EXTRACT  ├──►│ 2. RECOGNIZE├──►│ 3. REDACT   ├──►│ 4. REPORT   │
            │ text +      │   │ PII spans   │   │ anonymize/  │   │ Excel       │
            │ positions   │   │ (Presidio)  │   │ black-box   │   │ findings    │
            └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘
The CLI orchestration lives in nophi/cli.py: it collects input
files (_collect_files), dispatches each by extension to a
handler in nophi/handlers/, aggregates findings, and prints a
Rich summary table. (main.py is a thin shim that calls into it.)
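The collect-and-dispatch step described above can be sketched roughly as follows. This is a minimal illustration, not the code in nophi/cli.py: the handler functions, the `HANDLERS` table, and the finding dictionaries are hypothetical stand-ins, and only three extensions are wired up.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical handler signature: takes a path, returns a list of findings.
Handler = Callable[[Path], List[dict]]


def scan_text_file(path: Path) -> List[dict]:
    """Placeholder handler; a real one would extract, recognize and redact."""
    return [{"file": path.name, "handler": "text"}]


def scan_pdf_file(path: Path) -> List[dict]:
    """Placeholder handler for PDFs."""
    return [{"file": path.name, "handler": "pdf"}]


# Extension -> handler table (illustrative subset of the supported formats).
HANDLERS: Dict[str, Handler] = {
    ".txt": scan_text_file,
    ".csv": scan_text_file,
    ".pdf": scan_pdf_file,
}


def collect_files(target: Path) -> List[Path]:
    """Return supported files for `target`, which may be a file or a folder."""
    candidates = [target] if target.is_file() else sorted(target.rglob("*"))
    return [p for p in candidates if p.is_file() and p.suffix.lower() in HANDLERS]


def scan(target: Path) -> List[dict]:
    """Dispatch each collected file to its handler and aggregate the findings."""
    findings: List[dict] = []
    for path in collect_files(target):
        handler = HANDLERS[path.suffix.lower()]
        findings.extend(handler(path))
    return findings
```

Keying the table on the lower-cased suffix means `report.PDF` and `report.pdf` reach the same handler, and unsupported files are skipped rather than raising.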
| Layer | Package / tool |
|---|---|
| CLI, options, sub-commands | Typer |
| Terminal progress bars & tables | Rich |
| Entity detection engine | Presidio Analyzer |
| Anonymization engine | Presidio Anonymizer |
| General NER backend | spaCy en_core_web_lg |
| Biomedical NER | scispaCy en_ner_bc5cdr_md, en_ner_bionlp13cg_md |
| Drug-name matching | drug-named-entity-recognition (DrugBank) + bundled RxNorm name list |
| Report output | openpyxl |
1. Extract — text + positions
Each file type has a handler in nophi/handlers/ that pulls out the text to scan. For formats with layout (PDF), it also tracks where each piece of text sits so redactions can be placed precisely.
| Type | Handler | Library | Notes |
|---|---|---|---|
| .txt | text.py | stdlib | Whole file read as one string. |
| .csv | text.py | csv | Dialect auto-sniffed; scanned per cell. |
| .docx | docx.py | python-docx | Each paragraph and table cell. |
| .xlsx | xlsx.py | openpyxl | Every string cell across all sheets. |
| .pdf | pdf.py | PyMuPDF | Words + bounding boxes via get_text("words"), reassembled into text with a char-offset → word-box map. |
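To make the char-offset → word-box map for PDFs concrete, here is a small sketch. It assumes word tuples shaped like the ones PyMuPDF's get_text("words") returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and joins words with single spaces; the function names are hypothetical and the real handler in pdf.py may reassemble text differently.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]
WordSpan = Tuple[int, int, Box]  # (start offset, end offset exclusive, box)


def build_offset_map(words: Sequence[tuple]) -> Tuple[str, List[WordSpan]]:
    """Join words with single spaces and record where each word lands."""
    parts: List[str] = []
    spans: List[WordSpan] = []
    pos = 0
    for x0, y0, x1, y1, word, *_ in words:
        spans.append((pos, pos + len(word), (x0, y0, x1, y1)))
        parts.append(word)
        pos += len(word) + 1  # +1 accounts for the joining space
    return " ".join(parts), spans


def boxes_for_span(start: int, end: int, spans: Sequence[WordSpan]) -> List[Box]:
    """Return the box of every word that overlaps the range [start, end)."""
    return [box for w_start, w_end, box in spans if w_start < end and start < w_end]
```

A detection reported at character offsets 8 to 18 of the reassembled text can then be turned into the list of rectangles to redact with `boxes_for_span(8, 18, spans)`.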
2. Recognize — find PII spans
nophi/analyzer.py builds a Presidio AnalyzerEngine (backed
by the spaCy en_core_web_lg model) and exposes scan_text,
which returns the detected entities with character offsets and confidence scores.
Detection comes from three sources working together:
- Presidio built-ins: spaCy NER for PERSON, ORGANIZATION, LOCATION, DATE_TIME, NRP, plus pattern/checksum recognizers for PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, IBAN_CODE, IP_ADDRESS, URL, MEDICAL_LICENSE, and other structured identifiers.
- Custom biomedical recognizers (nophi/recognizers.py): these do not add PII. They recognize medical vocabulary and tag it with the internal type MEDICAL_TERM, which is used to protect that text from being scrubbed, not to redact it.

  They form four complementary layers, each catching what the others miss (a curated deny-list, two drug-name lists, and an ML model). All four emit MEDICAL_TERM:

  | # | Recognizer | Backed by | Catches | Matching |
  |---|---|---|---|---|
  | 1 | MedicalTermRecognizer | deny-list in nophi/data/medical_terms.py | Hospital departments, specialties, wards, symptoms, diagnoses, procedures, labs/imaging, shorthand. | Exact (case-insensitive) |
  | 2 | MedicationRecognizer | drug-named-entity-recognition (DrugBank) | Drug names, incl. common misspellings. | Fuzzy |
  | 3 | RxNormRecognizer | bundled RxNorm name list (nophi/data/rxnorm_names.txt.gz) | Drug brand + ingredient names from RxNorm (incl. many vitamins/minerals under their ingredient names). | Exact n-gram |
  | 4 | BiomedicalNerRecognizer | scispaCy en_ner_bc5cdr_md + en_ner_bionlp13cg_md | Chemicals, diseases, anatomy, genes, organisms, tissues; recognized by ML context, so it catches substances in no list. | ML model |

  Layers 2–4 overlap on purpose: the two drug lists give high-precision exact/fuzzy hits, and the ML layer is the backstop for substances not in any list. Coverage of supplements/vitamins is therefore good for clinical/ingredient names (e.g. ascorbic acid, cholecalciferol) but thinner for lay/botanical names (e.g. fish oil, ginkgo biloba); the ML layer is the main net for those.

  Suppression logic in scan_text: any PERSON/ORGANIZATION/NRP/LOCATION detection that overlaps a MEDICAL_TERM span is dropped, and the MEDICAL_TERM spans themselves are removed from the output (they are not PII). The net effect is that genuine names/places survive while clinical vocabulary stops being mislabeled as identifiers.
- StreetAddressRecognizer (nophi/recognizers.py): unlike the biomedical recognizers, this one adds PII that Presidio's defaults miss. spaCy NER tags cities/regions (Scarborough) but not street lines, so a regex matches a house number + 1–3 street-name words + a known street-type suffix (Rd, Street, Ave, Blvd, Dr, …) and reports it as LOCATION. It handles bare addresses (2867 Ellesmere Rd) as well as full ones, plus alphanumeric house numbers (221B) and ordinal street names (350 5th Avenue). Requiring a leading number keeps it from matching a Dr. title, and requiring a street-type suffix keeps it from matching dosages like 100 mg tablet.
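The suppression rule described for scan_text can be illustrated with a short, self-contained sketch. The `Span` type and the `suppress_medical` function are hypothetical names chosen for this example; the real code works on Presidio result objects, which are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    entity_type: str
    start: int
    end: int  # exclusive


# Entity types that a MEDICAL_TERM overlap is allowed to cancel.
SUPPRESSIBLE = {"PERSON", "ORGANIZATION", "NRP", "LOCATION"}


def overlaps(a: Span, b: Span) -> bool:
    """True when the two half-open ranges share at least one character."""
    return a.start < b.end and b.start < a.end


def suppress_medical(spans: List[Span]) -> List[Span]:
    """Drop suppressible spans that overlap a MEDICAL_TERM, then drop the
    MEDICAL_TERM spans themselves, since they are not PII."""
    medical = [s for s in spans if s.entity_type == "MEDICAL_TERM"]
    kept: List[Span] = []
    for span in spans:
        if span.entity_type == "MEDICAL_TERM":
            continue
        if span.entity_type in SUPPRESSIBLE and any(overlaps(span, m) for m in medical):
            continue
        kept.append(span)
    return kept
```

For the text "Dr. Jane Doe prescribed Perindopril", a PERSON span on "Jane Doe" survives, while a PERSON span on "Perindopril" is dropped because a MEDICAL_TERM span covers the same characters. Structured identifiers such as US_SSN are never suppressed in this sketch, matching the four entity types listed above.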
3. Redact — anonymize or black-box
nophi/redactor.py builds a Presidio AnonymizerEngine and the
operator set used to replace each entity. By default an entity becomes
<ENTITY_TYPE>; with --mappings (CSV columns id,mapped_id), a detection whose
text matches an id is replaced by its mapped_id instead (token-overlap match,
applied across all entity types so it works regardless of how Presidio classified
the value). The --exclude option takes a .txt/.csv/.xlsx/.json list of
values to ignore — any detection matching one (case-insensitive) is dropped in
scan_text before redaction or reporting.
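As a rough sketch of how the replacement could be chosen, the snippet below reads "token-overlap match" as "the detection shares at least one whitespace-delimited token with an id, ignoring case". That reading is an assumption, as are all function names; the actual matching in nophi/redactor.py may be stricter.

```python
import csv
import io
from typing import Dict, Set


def load_mappings(csv_text: str) -> Dict[str, str]:
    """Parse CSV text with columns id,mapped_id into {id: mapped_id}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["id"].strip(): row["mapped_id"].strip() for row in reader}


def _tokens(value: str) -> Set[str]:
    """Lower-cased whitespace tokens of a value."""
    return set(value.lower().split())


def replacement_for(entity_type: str, detected: str, mappings: Dict[str, str]) -> str:
    """Return the mapped_id of the first id sharing a token with the detection
    (assumed meaning of token-overlap), else the default <ENTITY_TYPE> tag."""
    detected_tokens = _tokens(detected)
    for original, mapped in mappings.items():
        if detected_tokens & _tokens(original):
            return mapped
    return f"<{entity_type}>"


def is_excluded(detected: str, exclude: Set[str]) -> bool:
    """Case-insensitive membership test; `exclude` holds lower-cased values."""
    return detected.strip().lower() in exclude
```

With a mappings file containing the row `Jane Doe,PT-0001`, a detection of just "Doe" would map to `PT-0001` under this reading, whatever entity type it was classified as, while an unrelated name falls back to `<PERSON>`.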
How the replacement is applied depends on the format:
- .txt/.csv: Presidio rewrites the string in place (anonymize_text).
- .docx: the anonymized text is written back into the paragraph/cell, preserving document structure.
- .xlsx: matching cell values are overwritten.
- .pdf: each detected span is mapped back to the exact word bounding boxes it covers; page.add_redact_annot() draws a filled black box with a short white label, and page.apply_redactions() permanently removes the underlying text from the PDF content stream (a true irreversible redaction, not just a visual cover).
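For the plain-text case, the effect of rewriting a string from a set of detected spans can be shown without Presidio. The sketch below is only an illustration of offset-based replacement, not Presidio's implementation; `apply_replacements` is a hypothetical helper and it assumes the spans do not overlap.

```python
from typing import List, Tuple

Edit = Tuple[int, int, str]  # (start, end exclusive, replacement)


def apply_replacements(text: str, edits: List[Edit]) -> str:
    """Apply edits from the rightmost span to the leftmost, so that the
    offsets of spans still to be processed are never shifted."""
    for start, end, replacement in sorted(edits, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text
```

Working right to left matters because replacements such as `<PHONE_NUMBER>` are usually a different length from the text they replace; applying them left to right would invalidate every later offset.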
4. Report — Excel findings
nophi/reporter.py writes an openpyxl workbook (default
phi_report.xlsx) with two sheets:
- Findings — one row per detection: file, entity type, original text, replacement, character position.
- Summary — entity-type counts and per-file PHI counts.
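The counts on the Summary sheet amount to two simple aggregations over the findings. A minimal sketch, assuming each finding is a dictionary with hypothetical `file` and `entity_type` keys (the real reporter's data structure is not shown in this README):

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize(findings: List[Dict[str, str]]) -> Tuple[Counter, Counter]:
    """Return (count per entity type, count per file) for a list of findings."""
    by_type = Counter(f["entity_type"] for f in findings)
    by_file = Counter(f["file"] for f in findings)
    return by_type, by_file
```

Each counter can then be written out row by row, for example with `by_type.most_common()` to list the most frequent entity types first.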
Models & first run
The NLP models are downloaded on first use and cached under
~/.cache/no-phi/models/ — they are not bundled into the program.
nophi/models.py handles fetching, extracting, and
(for the scispaCy models) patching them to load under the installed spaCy
version.
| Model | Size | Source |
|---|---|---|
| en_core_web_lg (base NER) | ~560 MB | spaCy GitHub releases (pip wheel) |
| en_ner_bc5cdr_md | ~115 MB | scispaCy S3 release (.tar.gz) |
| en_ner_bionlp13cg_md | ~120 MB | scispaCy S3 release (.tar.gz) |
The scispaCy biomedical models load with plain spaCy: the heavyweight
scispacy package (and its nmslib/scipy/scikit-learn dependencies) is not required. nophi/models.py rewrites a stale boolean in each model's config.cfg during extraction so it validates under spaCy 3.8.
Run python main.py download-models to fetch everything ahead of time, or just
run a scan and the models download automatically on first invocation.
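The download-on-first-use behaviour boils down to checking the cache before loading. The sketch below assumes, purely for illustration, that each model is extracted into a subdirectory named after the model under ~/.cache/no-phi/models/; the actual layout and the function names used in nophi/models.py may differ.

```python
from pathlib import Path
from typing import List, Optional

MODEL_NAMES = ["en_core_web_lg", "en_ner_bc5cdr_md", "en_ner_bionlp13cg_md"]


def cache_dir(home: Optional[Path] = None) -> Path:
    """Cache root; `home` is overridable so the check can be tested."""
    return (home or Path.home()) / ".cache" / "no-phi" / "models"


def missing_models(home: Optional[Path] = None) -> List[str]:
    """Names of models with no directory in the cache (assumed layout)."""
    root = cache_dir(home)
    return [name for name in MODEL_NAMES if not (root / name).is_dir()]
```

A scan would fetch only what `missing_models()` returns, which is why a second run starts without any download.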
Project layout
pyproject.toml # packaging + dependencies + `nophi` console entry point
main.py # thin entry-point shim → nophi.cli:main (used by Nuitka build)
nophi/ # the package
├── __main__.py # enables `python -m nophi`
├── cli.py # Typer app + orchestration
├── analyzer.py # build_analyzer() + scan_text() (detection)
├── recognizers.py # custom recognizers (MEDICAL_TERM layers + StreetAddressRecognizer)
├── redactor.py # anonymization
├── reporter.py # Excel findings report
├── models.py # model download / cache
├── handlers/ # per-format read/write/redact (text, docx, xlsx, pdf)
└── data/ # medical_terms.py + bundled rxnorm_names.txt.gz
scripts/ # build_rxnorm_list.py (refreshes the bundled RxNorm list)
docs/ # expansion_notes.md (user guide lives in the repo-root docs/)
Install
As a pip package (recommended)
Installs a nophi command on your PATH:
pip install . # or `pip install nophi` once published to PyPI
nophi download-models # one-time: fetch NLP models (~800 MB total)
nophi scan report.pdf
You can also run it without installing the script via python -m nophi scan ....
For development
pip install -e . # editable install (deps come from pyproject.toml)
# or: pip install -r requirements.txt
python main.py download-models # optional: pre-fetch models
See the user guide for end-user instructions.