Structured feature extraction from clinical notes
ehrextract
Structured feature extraction from clinical notes. Three steps:
- Bring your notes: CSV, JSONL, JSON, XLSX, plain text, or a pandas DataFrame.
- Pick a task: a built-in task (comorbidity, clinical_vars, full) or your own YAML file with your own fields and prompt.
- Pick a model: a fine-tuned LoRA adapter on a local base model, your own local HuggingFace weights, or an API model (OpenAI-compatible or Anthropic).
One command (or one function call) later you have a results table — CSV, JSONL, JSON, XLSX, or Parquet — with one column per extracted field.
Important — read before use. ehrextract is research-grade software. It is NOT a medical device, is NOT FDA-cleared / Health Canada-approved, and MUST NOT be used for clinical decision-making, patient triage, eligibility determination, re-identification, surveillance, or any setting where its outputs affect a person's access to care, insurance, employment, or legal status. Outputs may hallucinate; any research use requires per-row human review. The egress-warning system is informational, not a privacy compliance control. Users are solely responsible for HIPAA / PHIPA / PIPEDA / GDPR / REB compliance. See
NOTICE for the full acceptable-use scope.
Install
Until the PyPI release, install from source (current method):
git clone https://github.com/shifosss/ehrextract
pip install './ehrextract[hf]' # or [openai], [anthropic]
Once published to PyPI:
pip install ehrextract # core (~50 MB)
pip install 'ehrextract[hf]' # + torch + transformers + peft (~3 GB)
pip install 'ehrextract[openai]' # + openai SDK
pip install 'ehrextract[anthropic]' # + anthropic SDK
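Each extra pulls in optional third-party packages, so one quick way to see which backends are usable in the current environment is to probe for those packages. This is a minimal sketch, assuming the import names match the package names listed above (torch, transformers, peft, openai, anthropic). The helper `available_backends` is hypothetical and is not part of ehrextract:

```python
from importlib.util import find_spec

# Extra name -> packages that extra is described as pulling in (from the install list above).
# Assumption: each package's import name equals the name shown in the install comments.
EXTRA_PACKAGES = {
    "hf": ("torch", "transformers", "peft"),
    "openai": ("openai",),
    "anthropic": ("anthropic",),
}


def available_backends() -> dict[str, bool]:
    """Return, per extra, whether all of its packages can be located for import."""
    return {
        extra: all(find_spec(pkg) is not None for pkg in pkgs)
        for extra, pkgs in EXTRA_PACKAGES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_backends().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```

The check only locates the packages; it does not import them, so it stays fast even when torch is installed.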
30-second example
ehrextract \
  --task comorbidity \
  --model Qwen/Qwen3.5-27B --adapter /path/to/adapter \
  --input notes.csv --output results.csv
or, as a library:
from pathlib import Path

from ehrextract import extract

df = extract(
    Path("notes.csv"),
    "comorbidity",
    model="Qwen/Qwen3.5-27B",
    adapter="/path/to/adapter",
    output="results.csv",
)
The input needs a note_text column (configurable via --text-column); a
note_id column is added automatically when absent. The output has one
column per task field plus parse_success, validation_errors,
raw_response, finish_reason, and token counts.
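Since every row needs human review anyway, it can help to surface first the rows that failed parsing or carry validation errors. This is a minimal sketch using pandas, assuming parse_success is boolean-like and validation_errors is empty, "[]", or missing when a row is clean. The exact dtypes ehrextract writes are not specified here, and the helper `needs_attention` and the sample rows are hypothetical:

```python
import pandas as pd


def needs_attention(results: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose parse failed or that carry validation errors."""
    # Assumption: parse_success holds True/False, or "True"/"False" after a CSV round trip.
    parsed = results["parse_success"].astype(str).str.lower().eq("true")
    # Assumption: a clean row has an empty, "[]", or missing validation_errors value.
    is_missing = results["validation_errors"].isna()
    errors = results["validation_errors"].astype(str).str.strip()
    has_errors = ~is_missing & ~errors.isin(["", "[]"])
    return results[~parsed | has_errors]


# Hypothetical stand-in for pd.read_csv("results.csv").
sample = pd.DataFrame(
    {
        "note_id": [1, 2, 3],
        "parse_success": [True, False, True],
        "validation_errors": ["", "", "unknown value for field X"],
    }
)
flagged = needs_attention(sample)
print(flagged["note_id"].tolist())  # → [2, 3]
```

Rows that pass this filter still need review; the filter only orders the work, it does not certify anything.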
Built-in tasks
| Task | Fields | What it extracts |
|---|---|---|
| comorbidity | 17 | Free-text diagnosis list + 16 Y/N comorbidity categories |
| clinical_vars | 4 | Feeding and neurologic variables (tube/oral feeding, aspiration risk, NI trajectory) |
| full | 20 | Joint task: the 16 comorbidity categories + the 4 clinical variables |
Built-in tasks ship inside the package; --task <name> works without any
extra files. Define your own task in YAML — see
schema-reference.md.
Note on the full task. The research pipeline that produced the published evaluation numbers for the joint 20-field task used constrained JSON decoding to force the output shape. ehrextract v0.2.0 does not constrain decoding (planned as a future feature), so full-task outputs can diverge from the published numbers on hard notes. Watch the parse_success and validation_errors columns.
Data handling
If your input may contain PHI, read data-handling.md
BEFORE running with any API provider. The package writes a data-egress
notice to stderr (once per process per destination) on API use; it never
blocks, and it does not (and cannot) guarantee compliance for you. The
local HuggingFace provider keeps all data on your machine.
Documentation
- quickstart.md: fine-tuned adapters, custom tasks, API providers
- schema-reference.md: the task-file YAML reference
- data-handling.md: PHI, egress notice, BAA-eligible providers
- extending-providers.md: plug in a custom provider
Authors and institutions
ehrextract was developed by:
- Chen Zhang (lead author)
- Yibing Xia (co-author)
- Sanjay Mahant, MD -- supervisor, The Hospital for Sick Children (SickKids)
- Nathan Taback, PhD -- supervisor, University of Toronto
at The Hospital for Sick Children (Toronto, Canada) and the University of Toronto (Toronto, Canada). Please cite the project if you use it in published work.
License
Licensed under the Apache License, Version 2.0. See LICENSE
for the full license text and NOTICE for
attribution, the no-endorsement clause, the clinical-use disclaimer, and the
acceptable-use restrictions that supplement (but do not override) the License.