Structured feature extraction from clinical notes

ehrextract

Structured feature extraction from clinical notes. Three steps:

  1. Bring your notes — CSV, JSONL, JSON, XLSX, plain text, or a pandas DataFrame.
  2. Pick a task — a built-in task (comorbidity, clinical_vars, full) or your own YAML file with your own fields and prompt.
  3. Pick a model — a fine-tuned LoRA adapter on a local base model, your own local HuggingFace weights, or an API model (OpenAI-compatible or Anthropic).

One command (or one function call) later you have a results table — CSV, JSONL, JSON, XLSX, or Parquet — with one column per extracted field.
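For step 1, a notes file can be as small as a two-column CSV. The sketch below uses only the standard library and assumes the default text column name note_text described under the 30-second example; the helper name write_notes_csv and the note-1, note-2 identifier scheme are hypothetical, and the note text is synthetic.

```python
import csv
from pathlib import Path


def write_notes_csv(path: Path, notes: list[str]) -> None:
    """Write one note per row under the default text column name, note_text."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["note_id", "note_text"])
        writer.writeheader()
        for index, text in enumerate(notes, start=1):
            # The note_id format is illustrative only; any unique value would do.
            writer.writerow({"note_id": f"note-{index}", "note_text": text})


if __name__ == "__main__":
    # Synthetic text only; never put real patient data in an example file.
    write_notes_csv(Path("notes.csv"), ["Synthetic note one.", "Synthetic note two."])
```

The note_id column is optional here, since the package adds one when it is absent.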

Important — read before use. ehrextract is research-grade software. It is NOT a medical device, is NOT FDA-cleared / Health Canada-approved, and MUST NOT be used for clinical decision-making, patient triage, eligibility determination, re-identification, surveillance, or any setting where its outputs affect a person's access to care, insurance, employment, or legal status. Outputs may hallucinate; any research use requires per-row human review. The egress-warning system is informational, not a privacy compliance control. Users are solely responsible for HIPAA / PHIPA / PIPEDA / GDPR / REB compliance. See NOTICE for the full acceptable-use scope.

Install

pip install ehrextract                  # core (~50 MB)
pip install 'ehrextract[hf]'            # + torch + transformers + peft (~3 GB)
pip install 'ehrextract[openai]'        # + openai SDK
pip install 'ehrextract[anthropic]'     # + anthropic SDK

Python ≥ 3.11. For a development install from a clone, see CONTRIBUTING.md.
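To see which optional extras are usable in the current environment, one option is to probe for the modules each extra is documented to pull in. This is a minimal sketch, not part of ehrextract: the function name available_extras and the extra-to-module mapping are assumptions based on the install comments above.

```python
from importlib.util import find_spec

# Extra name -> importable modules that extra is described as adding (assumed mapping).
EXTRA_MODULES = {
    "hf": ["torch", "transformers", "peft"],
    "openai": ["openai"],
    "anthropic": ["anthropic"],
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether all of its modules can be located without importing them."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, present in available_extras().items():
        print(f"{extra}: {'available' if present else 'missing'}")
```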

30-second example

ehrextract \
  --task comorbidity \
  --model Qwen/Qwen3.5-27B --adapter /path/to/adapter \
  --input notes.csv --output results.csv

or, as a library:

from pathlib import Path
from ehrextract import extract

df = extract(
    Path("notes.csv"),
    "comorbidity",
    model="Qwen/Qwen3.5-27B",
    adapter="/path/to/adapter",
    output="results.csv",
)

The input needs a note_text column (configurable via --text-column); a note_id column is added automatically when absent. The output has one column per task field plus parse_success, validation_errors, raw_response, finish_reason, repair_attempts, and token counts.
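Rows that failed to parse are a natural place to start a review pass. The sketch below separates them from the rest of a results CSV using the parse_success column. How that column is serialized as text is an assumption here (the TRUE_STRINGS set), as are the helper name and the sample rows. Rows that parsed still need the per-row human review described above.

```python
import csv
import io

# Assumption: textual forms that might represent a true parse_success value in a CSV.
TRUE_STRINGS = {"true", "1", "yes", "y"}


def split_by_parse_success(rows):
    """Return (parsed, failed) lists of row dicts based on the parse_success column."""
    parsed, failed = [], []
    for row in rows:
        flag = str(row.get("parse_success", "")).strip().lower()
        (parsed if flag in TRUE_STRINGS else failed).append(row)
    return parsed, failed


if __name__ == "__main__":
    # Synthetic stand-in for a results file; real files carry more columns.
    sample = io.StringIO(
        "note_id,parse_success,finish_reason\n"
        "a,True,stop\n"
        "b,False,length\n"
    )
    parsed, failed = split_by_parse_success(csv.DictReader(sample))
    print(f"{len(parsed)} parsed, {len(failed)} failed: {[r['note_id'] for r in failed]}")
```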

On API providers, --batch submits the whole run as a single provider-side batch at 50% of the standard API cost, and --max-repairs N re-prompts the model with the exact field errors whenever a response fails to parse or validate, up to N times. See quickstart.md.
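The repair idea can be illustrated generically: parse the response, collect field-level errors, and feed those errors back into the next prompt. The following is a sketch of that pattern under stated assumptions, not ehrextract's implementation. call_model is a hypothetical callable taking a prompt and returning text, the Y/N-only validation is simplified, and the result keys merely echo the output column names.

```python
import json
from typing import Callable

ALLOWED_VALUES = ("Y", "N")  # simplified: every field is treated as a Y/N flag


def validate(payload: dict, fields: list[str]) -> list[str]:
    """Return one human-readable error per missing or invalid field."""
    errors = []
    for name in fields:
        if name not in payload:
            errors.append(f"{name}: missing")
        elif payload[name] not in ALLOWED_VALUES:
            errors.append(f"{name}: expected Y or N, got {payload[name]!r}")
    return errors


def extract_with_repairs(
    call_model: Callable[[str], str], prompt: str, fields: list[str], max_repairs: int
) -> dict:
    """Call the model once, then re-prompt with the exact errors up to max_repairs times."""
    if max_repairs < 0:
        raise ValueError("max_repairs must be >= 0")
    attempt_prompt = prompt
    errors: list[str] = []
    for attempt in range(max_repairs + 1):
        raw = call_model(attempt_prompt)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            payload, errors = None, [f"invalid JSON: {exc.msg}"]
        else:
            if isinstance(payload, dict):
                errors = validate(payload, fields)
            else:
                payload, errors = None, ["response is not a JSON object"]
        if not errors:
            return {"parsed": payload, "repair_attempts": attempt, "validation_errors": []}
        # Feed the specific problems back so the next attempt can correct them.
        attempt_prompt = f"{prompt}\n\nYour previous answer had these problems:\n" + "\n".join(errors)
    return {"parsed": None, "repair_attempts": max_repairs, "validation_errors": errors}
```

With a stub model that answers badly once and correctly on the second call, the function returns the parsed payload with repair_attempts equal to 1.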

Built-in tasks

Task           Fields  What it extracts
comorbidity    17      Free-text diagnosis list + 16 Y/N comorbidity categories
clinical_vars  4       Feeding and neurologic variables (tube/oral feeding, aspiration risk, NI trajectory)
full           20      Joint task: the 16 comorbidity categories + the 4 clinical variables

Built-in tasks ship inside the package; --task <name> works without any extra files. Define your own task in YAML — see schema-reference.md.

Note on the full task. The full task enables constrained JSON decoding by default on the local HuggingFace provider — the same mechanism the research pipeline used for the published joint-task numbers, forcing structurally valid, schema-conformant output at the token level. It requires the [hf] extra (which includes lm-format-enforcer); disable it with --no-constrained. API providers ignore the setting (Anthropic already forces the schema via tool-use).
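As a rough intuition for what "at the token level" means, the toy below masks out any next character that could not lead to a valid output, then picks the best-scoring survivor. It is a deliberately simplified sketch. Real enforcers such as lm-format-enforcer derive the allowed set from a schema over model tokens, whereas this example enumerates the valid strings directly. The field name dm and the scoring functions are hypothetical.

```python
from typing import Callable


def allowed_next_chars(prefix: str, valid_outputs: list[str]) -> set[str]:
    """Characters that keep the prefix extendable to at least one valid output."""
    return {
        text[len(prefix)]
        for text in valid_outputs
        if text.startswith(prefix) and len(text) > len(prefix)
    }


def constrained_decode(score: Callable[[str, str], float], valid_outputs: list[str]) -> str:
    """Greedy decoding where the 'model' (score) may only choose among allowed characters."""
    if not valid_outputs:
        raise ValueError("valid_outputs must not be empty")
    output = ""
    while output not in valid_outputs:
        candidates = sorted(allowed_next_chars(output, valid_outputs))
        # The mask is the candidate list; the scorer only ranks what the mask permits.
        output += max(candidates, key=lambda ch: score(output, ch))
    return output


if __name__ == "__main__":
    valid = ['{"dm":"Y"}', '{"dm":"N"}']
    # A toy scorer that simply prefers characters with higher code points.
    print(constrained_decode(lambda prefix, ch: ord(ch), valid))
```

Whatever the scorer prefers, the result is always one of the valid strings, which is the property constrained decoding provides.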

Data handling

If your input may contain PHI, read data-handling.md BEFORE running with any API provider. The package writes a data-egress notice to stderr (once per process per destination) on API use; it never blocks, and it does not (and cannot) guarantee compliance for you. The local HuggingFace provider keeps all data on your machine.
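The "once per process per destination" behaviour can be pictured as a small in-memory registry. This is an illustrative sketch only, not the package's code: the function name notify_egress, the message wording, and the example destinations are all hypothetical.

```python
import sys

# Destinations already announced in this process; reset when the process exits.
_announced: set[str] = set()


def notify_egress(destination: str, stream=None) -> bool:
    """Write a one-time notice for this destination; return True if a notice was written."""
    if destination in _announced:
        return False
    _announced.add(destination)
    target = stream if stream is not None else sys.stderr
    # Informational only: the notice never blocks the run.
    print(f"[data egress] note text will be sent to {destination}", file=target)
    return True


if __name__ == "__main__":
    notify_egress("provider-a.example")  # written
    notify_egress("provider-a.example")  # suppressed: already announced
    notify_egress("provider-b.example")  # written: different destination
```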

Documentation

Authors and institutions

ehrextract was developed by:

  • Chen Zhang (lead author)
  • Yibing Xia (co-author)
  • Sanjay Mahant, MD -- supervisor, The Hospital for Sick Children (SickKids)
  • Nathan Taback, PhD -- supervisor, University of Toronto

at The Hospital for Sick Children (Toronto, Canada) and the University of Toronto (Toronto, Canada). Please cite the project if you use it in published work.

License

Licensed under the Apache License, Version 2.0. See LICENSE for the full license text and NOTICE for attribution, the no-endorsement clause, the clinical-use disclaimer, and the acceptable-use restrictions that supplement (but do not override) the License.
