Structured feature extraction from clinical notes
ehrextract
Structured feature extraction from clinical notes. Three steps:
- Bring your notes — CSV, JSONL, JSON, XLSX, plain text, or a pandas DataFrame.
- Pick a task — a built-in task (comorbidity, clinical_vars, full) or your own YAML file with your own fields and prompt.
- Pick a model — a fine-tuned LoRA adapter on a local base model, your own local HuggingFace weights, or an API model (OpenAI-compatible or Anthropic).
One command (or one function call) later you have a results table — CSV, JSONL, JSON, XLSX, or Parquet — with one column per extracted field.
Important — read before use. ehrextract is research-grade software. It is NOT a medical device, is NOT FDA-cleared / Health Canada-approved, and MUST NOT be used for clinical decision-making, patient triage, eligibility determination, re-identification, surveillance, or any setting where its outputs affect a person's access to care, insurance, employment, or legal status. Outputs may hallucinate; any research use requires per-row human review. The egress-warning system is informational, not a privacy compliance control. Users are solely responsible for HIPAA / PHIPA / PIPEDA / GDPR / REB compliance. See NOTICE for the full acceptable-use scope.
Install
pip install ehrextract # core (~50 MB)
pip install 'ehrextract[hf]' # + torch + transformers + peft (~3 GB)
pip install 'ehrextract[openai]' # + openai SDK
pip install 'ehrextract[anthropic]' # + anthropic SDK
Python ≥ 3.11. For a development install from a clone, see CONTRIBUTING.md.
30-second example
ehrextract \
  --task comorbidity \
  --model Qwen/Qwen3.5-27B --adapter /path/to/adapter \
  --input notes.csv --output results.csv
or, as a library:
from pathlib import Path

from ehrextract import extract

df = extract(
    Path("notes.csv"),
    "comorbidity",
    model="Qwen/Qwen3.5-27B",
    adapter="/path/to/adapter",
    output="results.csv",
)
The input needs a note_text column (configurable via --text-column); a
note_id column is added automatically when absent. The output has one
column per task field plus parse_success, validation_errors,
raw_response, finish_reason, repair_attempts, and token counts.
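Because every row requires human review, it can help to surface rows that failed to parse or validate first. The sketch below uses only the standard library and an inline sample instead of a real results file. Only the column names parse_success and validation_errors come from the description above; the value encodings ("True"/"False", an empty string for no errors) and the diagnoses column are assumptions made for illustration.

```python
import csv
import io

# Hypothetical results content. The encodings of parse_success and
# validation_errors are assumed, not taken from the package documentation.
SAMPLE = """note_id,diagnoses,parse_success,validation_errors
n1,asthma,True,
n2,,False,
n3,epilepsy,True,field 'x' must be Y or N
"""


def split_for_review(rows):
    """Separate rows that parsed and validated cleanly from rows that did not."""
    clean, flagged = [], []
    for row in rows:
        parsed = row["parse_success"].strip().lower() == "true"
        has_errors = bool(row["validation_errors"].strip())
        (clean if parsed and not has_errors else flagged).append(row)
    return clean, flagged


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
clean, flagged = split_for_review(rows)
clean_ids = [r["note_id"] for r in clean]
flagged_ids = [r["note_id"] for r in flagged]
print("clean:", clean_ids)
print("flagged:", flagged_ids)
```

The flagged list is only a way to order the review queue; rows in the clean list still need the same per-row human check.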
On API providers, --batch submits the whole run as one provider-side
batch at 50% API cost, and --max-repairs N re-prompts the model with the
exact field errors when a response fails to parse or validate. See
quickstart.md.
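The repair behaviour described above can be pictured as a bounded retry loop. The following is a conceptual sketch, not the package's implementation: the one-field schema, the validate helper, the prompt wording, and the stub model are all hypothetical.

```python
import json

# Hypothetical one-field schema: field name -> allowed values.
REQUIRED = {"diabetes": {"Y", "N"}}


def validate(text):
    """Return a list of error strings; an empty list means the response passes."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, allowed in REQUIRED.items():
        if field not in data:
            errors.append(f"missing field {field!r}")
        elif not isinstance(data[field], str) or data[field] not in allowed:
            errors.append(f"{field!r} must be one of {sorted(allowed)}")
    return errors


def extract_with_repairs(call_model, prompt, max_repairs):
    """Call the model, re-prompting with the exact errors up to max_repairs times."""
    response = call_model(prompt)
    errors = validate(response)
    attempts = 0
    while errors and attempts < max_repairs:
        attempts += 1
        repair_prompt = (
            f"{prompt}\n\nYour previous answer had these errors:\n" + "\n".join(errors)
        )
        response = call_model(repair_prompt)
        errors = validate(response)
    return response, errors, attempts


def fake_model(prompt):
    """Stub model: wrong on the first call, correct once it sees the error list."""
    return '{"diabetes": "Y"}' if "errors" in prompt else '{"diabetes": "yes"}'


result = extract_with_repairs(fake_model, "Extract diabetes status.", max_repairs=2)
print(result)
```

Capping the loop at max_repairs keeps the worst-case number of model calls per note at max_repairs + 1, which matters when each call is billed.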
Built-in tasks
| Task | Fields | What it extracts |
|---|---|---|
| comorbidity | 17 | Free-text diagnosis list + 16 Y/N comorbidity categories |
| clinical_vars | 4 | Feeding and neurologic variables (tube/oral feeding, aspiration risk, NI trajectory) |
| full | 20 | Joint task: the 16 comorbidity categories + the 4 clinical variables |
Built-in tasks ship inside the package; --task <name> works without any
extra files. Define your own task in YAML — see
schema-reference.md.
Note on the full task. The full task enables constrained JSON decoding by default on the local HuggingFace provider — the same mechanism the research pipeline used for the published joint-task numbers, forcing structurally valid, schema-conformant output at the token level. It requires the [hf] extra (which includes lm-format-enforcer); disable it with --no-constrained. API providers ignore the setting (Anthropic already forces the schema via tool-use).
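To make "forcing output at the token level" concrete, here is a toy sketch of the general idea behind constrained decoding: at each step, only tokens that keep the output on track toward an allowed string are eligible, and the highest-scoring eligible token wins. It is an illustration only and does not reflect how lm-format-enforcer or ehrextract are implemented; the vocabulary, the scores, and the field name flag are hypothetical.

```python
def constrained_decode(score_fn, vocab, allowed_outputs):
    """Greedy decoding restricted to prefixes of the allowed output strings."""
    prefix = ""
    while prefix not in allowed_outputs:
        # Keep only tokens that extend the prefix toward some allowed output.
        candidates = [
            tok
            for tok in vocab
            if tok and any(out.startswith(prefix + tok) for out in allowed_outputs)
        ]
        if not candidates:
            raise ValueError(f"no valid continuation from {prefix!r}")
        prefix += max(candidates, key=lambda tok: score_fn(prefix, tok))
    return prefix


# Hypothetical vocabulary and a stand-in "model" that prefers invalid tokens.
VOCAB = ['{"flag": "', "Y", "N", "maybe", '"}', "Sure, "]
SCORES = {"Sure, ": 5, "maybe": 4, "N": 3, "Y": 2, '{"flag": "': 1, '"}': 0}
ALLOWED = {'{"flag": "Y"}', '{"flag": "N"}'}


def toy_score(prefix, token):
    """Context-free scores; a real model would condition on the prefix."""
    return SCORES[token]


decoded = constrained_decode(toy_score, VOCAB, ALLOWED)
print(decoded)
```

Unconstrained greedy decoding with these scores would start with "Sure, ", which is not valid JSON. With the mask, the top-scoring tokens are skipped whenever they would break the structure, so the result always parses.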
Data handling
If your input may contain PHI, read data-handling.md
BEFORE running with any API provider. The package writes a data-egress
notice to stderr (once per process per destination) on API use; it never
blocks, and it does not (and cannot) guarantee compliance for you. The
local HuggingFace provider keeps all data on your machine.
Documentation
- quickstart.md — fine-tuned adapters, custom tasks, API providers
- schema-reference.md — the task-file YAML reference
- data-handling.md — PHI, egress notice, BAA-eligible providers
- extending-providers.md — plug in a custom provider
Authors and institutions
ehrextract was developed by:
- Chen Zhang (lead author)
- Yibing Xia (co-author)
- Sanjay Mahant, MD -- supervisor, The Hospital for Sick Children (SickKids)
- Nathan Taback, PhD -- supervisor, University of Toronto
at The Hospital for Sick Children (Toronto, Canada) and the University of Toronto (Toronto, Canada). Please cite the project if you use it in published work.
License
Licensed under the Apache License, Version 2.0. See LICENSE
for the full license text and NOTICE for
attribution, the no-endorsement clause, the clinical-use disclaimer, and the
acceptable-use restrictions that supplement (but do not override) the License.