Structured feature extraction from clinical notes

ehrextract

ehrextract turns free-text clinical notes into a structured results table in three steps:

  1. Bring your notes — CSV, JSONL, JSON, XLSX, plain text, or a pandas DataFrame.
  2. Pick a task — a built-in task (comorbidity, clinical_vars, full) or your own YAML file with your own fields and prompt.
  3. Pick a model — a fine-tuned LoRA adapter on a local base model, your own local HuggingFace weights, or an API model (OpenAI-compatible or Anthropic).

One command (or one function call) later you have a results table — CSV, JSONL, JSON, XLSX, or Parquet — with one column per extracted field.

Important — read before use. ehrextract is research-grade software. It is NOT a medical device, is NOT FDA-cleared / Health Canada-approved, and MUST NOT be used for clinical decision-making, patient triage, eligibility determination, re-identification, surveillance, or any setting where its outputs affect a person's access to care, insurance, employment, or legal status. Outputs may hallucinate; any research use requires per-row human review. The egress-warning system is informational, not a privacy compliance control. Users are solely responsible for HIPAA / PHIPA / PIPEDA / GDPR / REB compliance. See NOTICE for the full acceptable-use scope.

Install

Until the PyPI release, install from source (current method):

git clone https://github.com/shifosss/ehrextract
pip install './ehrextract[hf]'          # or [openai], [anthropic]

Once published to PyPI:

pip install ehrextract                  # core (~50 MB)
pip install 'ehrextract[hf]'            # + torch + transformers + peft (~3 GB)
pip install 'ehrextract[openai]'        # + openai SDK
pip install 'ehrextract[anthropic]'     # + anthropic SDK
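
After installing, it can be useful to confirm which optional backends are actually importable. The following is a minimal sketch and not part of ehrextract: the function name `available_backends` is hypothetical, and the mapping assumes the import names match the packages named in the install comments above (torch, transformers, peft, openai, anthropic).

```python
from importlib.util import find_spec

# Import names assumed to correspond to each extra, based on the install comments.
EXTRAS = {
    "hf": ["torch", "transformers", "peft"],
    "openai": ["openai"],
    "anthropic": ["anthropic"],
}


def available_backends(extras=EXTRAS):
    """Return {extra: bool}; True when every module of that extra is importable."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in extras.items()
    }


if __name__ == "__main__":
    for extra, ok in available_backends().items():
        print(f"{extra}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when torch is installed.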

30-second example

ehrextract \
  --task comorbidity \
  --model Qwen/Qwen3.5-27B --adapter /path/to/adapter \
  --input notes.csv --output results.csv

or, as a library:

from pathlib import Path
from ehrextract import extract

df = extract(
    Path("notes.csv"),
    "comorbidity",
    model="Qwen/Qwen3.5-27B",
    adapter="/path/to/adapter",
    output="results.csv",
)

The input needs a note_text column (configurable via --text-column); a note_id column is added automatically when absent. The output has one column per task field plus parse_success, validation_errors, raw_response, finish_reason, and token counts.
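
The input contract described above can be sketched with the standard library. This is an illustration only, not ehrextract's own loader: `prepare_rows` is a hypothetical helper, the sequential-integer note_id scheme is an assumption (the package does not document how it generates IDs here), and the sample notes are made up.

```python
import csv
import io


def prepare_rows(csv_text, text_column="note_text"):
    """Parse CSV text, require the text column, and add note_id when it is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if text_column not in fieldnames:
        raise ValueError(f"missing required column: {text_column!r}")
    rows = list(reader)
    if "note_id" not in fieldnames:
        # Assumed ID scheme: sequential integers starting at 0.
        for i, row in enumerate(rows):
            row["note_id"] = str(i)
    return rows


if __name__ == "__main__":
    # Made-up sample notes, for illustration only.
    sample = "note_text\nG-tube dependent since infancy.\nTolerating oral feeds.\n"
    for row in prepare_rows(sample):
        print(row["note_id"], row["note_text"])
```

Passing a different `text_column` plays the same role as `--text-column` on the command line.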

Built-in tasks

Task           Fields  What it extracts
comorbidity    17      Free-text diagnosis list + 16 Y/N comorbidity categories
clinical_vars  4       Feeding and neurologic variables (tube/oral feeding, aspiration risk, NI trajectory)
full           20      Joint task: the 16 comorbidity categories + the 4 clinical variables

Built-in tasks ship inside the package; --task <name> works without any extra files. Define your own task in YAML — see schema-reference.md.

Note on the full task. The research pipeline that produced the published evaluation numbers for the joint 20-field task used constrained JSON decoding to force the output shape. ehrextract v0.2.0 does not constrain decoding (planned as a future feature), so full-task outputs can diverge from the published numbers on hard notes — watch the parse_success and validation_errors columns.
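
One way to watch those two columns is to pull out the rows that failed to parse or carry validation errors so they are reviewed first. The sketch below makes assumptions that should be checked against real output: that parse_success is serialized as "True"/"False" in CSV, and that an empty cell or "[]" in validation_errors means no errors. `rows_needing_review` is a hypothetical helper, not part of the package.

```python
import csv
import io


def rows_needing_review(results_csv_text):
    """Return note_ids whose parse failed or whose validation_errors cell is non-empty."""
    flagged = []
    for row in csv.DictReader(io.StringIO(results_csv_text)):
        # Assumption: parse_success is written as "True"/"False".
        parsed = (row.get("parse_success") or "").strip().lower() == "true"
        # Assumption: an empty cell or "[]" means no validation errors.
        errors = (row.get("validation_errors") or "").strip()
        if not parsed or errors not in ("", "[]"):
            flagged.append(row["note_id"])
    return flagged


if __name__ == "__main__":
    # Made-up results rows, for illustration only.
    sample = (
        "note_id,parse_success,validation_errors\n"
        "1,True,\n"
        "2,False,\n"
        "3,True,unexpected value\n"
    )
    print(rows_needing_review(sample))
```

Flagged rows are only a starting point; unflagged rows still need the per-row human review described in the notice near the top.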

Data handling

If your input may contain PHI, read data-handling.md BEFORE running with any API provider. The package writes a data-egress notice to stderr (once per process per destination) on API use; it never blocks, and it does not (and cannot) guarantee compliance for you. The local HuggingFace provider keeps all data on your machine.
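
The "once per process per destination" behaviour can be pictured as a small in-memory set of destinations already warned about. This is a sketch of the general pattern under that assumption, not ehrextract's actual implementation; `warn_egress`, the message text, and the example hostname are all hypothetical.

```python
import sys

# Destinations already warned about in this process (reset on every new process).
_warned = set()


def warn_egress(destination, stream=None):
    """Write a one-time notice for `destination`; return True if a notice was written."""
    if destination in _warned:
        return False
    _warned.add(destination)
    out = stream if stream is not None else sys.stderr
    print(f"[data-egress] note text will be sent to {destination}", file=out)
    # Informational only: the caller proceeds regardless of the return value.
    return True


if __name__ == "__main__":
    warn_egress("api.example.com")  # writes the notice to stderr
    warn_egress("api.example.com")  # silent: already warned in this process
```

Because the function only prints and returns, it cannot stop a request from being sent, which is why a notice of this kind is not a compliance control.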

Documentation

Authors and institutions

ehrextract was developed by:

  • Chen Zhang (lead author)
  • Yibing Xia (co-author)
  • Sanjay Mahant, MD -- supervisor, The Hospital for Sick Children (SickKids)
  • Nathan Taback, PhD -- supervisor, University of Toronto

at The Hospital for Sick Children (Toronto, Canada) and the University of Toronto (Toronto, Canada). Please cite the project if you use it in published work.

License

Licensed under the Apache License, Version 2.0. See LICENSE for the full license text and NOTICE for attribution, the no-endorsement clause, the clinical-use disclaimer, and the acceptable-use restrictions that supplement (but do not override) the License.
