Offline HPO code extraction from free-text clinical notes

Project description

pleio-hpo

Offline extraction of Human Phenotype Ontology (HPO) codes from free-text clinical notes. Runs fully offline on CPU — no data leaves the machine, which makes it usable with PHI.

The pipeline has three stages: lexical matching (Aho-Corasick over HPO labels + synonyms, with morphology-aware and order-free token-set matching for plural and reordered phrasings) → biomedical-embedding nearest-neighbour search (SapBERT) for paraphrases → a PubMedBERT cross-encoder validator that filters the fuzzy candidates.
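To make the "order-free token-set matching" idea in the first stage concrete, here is a minimal sketch. It is not the package's implementation: it swaps the Aho-Corasick automaton for a plain sliding window, uses a crude trailing-"s" rule in place of real morphology, and works on a four-label toy table. The names `LABELS`, `normalize`, `tokens`, and `lexical_match` are hypothetical.

```python
import re

# Toy label table (assumption): a real index would hold every HPO label and synonym.
LABELS = {
    "HP:0000256": "Macrocephaly",
    "HP:0001252": "Hypotonia",
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
}


def normalize(token):
    """Lowercase and strip a plural 's' (a crude stand-in for real morphology)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        token = token[:-1]
    return token


def tokens(text):
    """Split text into normalized alphabetic tokens."""
    return [normalize(t) for t in re.findall(r"[A-Za-z]+", text)]


# Each label is keyed by its *set* of tokens, so word order does not matter.
INDEX = {frozenset(tokens(label)): hpo_id for hpo_id, label in LABELS.items()}


def lexical_match(text):
    """Return HPO ids whose label token set equals some window of the text."""
    toks = tokens(text)
    hits = []
    for size in sorted({len(key) for key in INDEX}):
        for start in range(len(toks) - size + 1):
            key = frozenset(toks[start:start + size])
            hpo_id = INDEX.get(key)
            if hpo_id is not None and hpo_id not in hits:
                hits.append(hpo_id)
    return hits


print(lexical_match("Developmental delay, global, with seizures."))
```

The reordered phrase and the plural both resolve because matching compares normalized token sets rather than exact strings. Paraphrases that share no tokens with a label are out of reach for this stage, which is the gap the embedding stage is described as covering.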

Install

pip install pleio-hpo            # Python >=3.10
python -m spacy download en_core_web_sm
pleio-hpo download              # fetch the embedding index + validator (~570 MB, one time)

The large model assets are fetched on first use, not bundled in the wheel. Runtime needs roughly 1.2 GB RAM (SapBERT + the PubMedBERT validator on CPU).

Quickstart — CLI

$ pleio-hpo "The patient has macrocephaly and hypotonia."
HP:0000256  Macrocephaly                             1.00
HP:0001252  Hypotonia                                1.00

# TSV / JSON, from a file or stdin, with context control:
pleio-hpo "Bilateral hearing loss; no microcephaly." --format tsv --include-negated
pleio-hpo --file note.txt --format json --output codes.json
echo "Short stature and seizures." | pleio-hpo --format tsv

pleio-hpo info shows the version, pinned HPO release, and asset paths.
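If only the default text output is available (for example in a captured log), it can be parsed back into records. The sketch below assumes the layout of the two sample rows shown above, that is, an HPO id, a label, and a score separated by runs of two or more spaces. That layout is inferred from the sample and is not a documented contract, so `--format tsv` or `--format json` is the safer choice for anything automated. `Hit` and `parse_default_output` are hypothetical names.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    hpo_id: str
    label: str
    score: float


# Assumed row layout: "HP:0000256  Macrocephaly      1.00"
ROW = re.compile(r"^(HP:\d{7})\s{2,}(.+?)\s{2,}(\d+(?:\.\d+)?)$")


def parse_default_output(text):
    """Parse the human-readable CLI rows; lines that do not match are skipped."""
    hits = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            hits.append(Hit(match.group(1), match.group(2), float(match.group(3))))
    return hits


sample = (
    "HP:0000256  Macrocephaly                             1.00\n"
    "HP:0001252  Hypotonia                                1.00\n"
)
for hit in parse_default_output(sample):
    print(hit.hpo_id, hit.label, hit.score)
```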

Quickstart — Python

from pleio_hpo import Annotator

annotator = Annotator()  # construct once, reuse (models load lazily on first call)
result = annotator.annotate("Global developmental delay and seizures.")

for code in result.codes:
    print(code.hpo_id, code.label, code.score, code.score_source)

print(result.to_json())

code.score is an opaque per-source strength signal (1.0 for an exact lexical match, cosine similarity for an embedding match) — not a calibrated probability. See examples/ for batch processing, context filtering, and threshold tuning.
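Because the score means something different for each source, a single global cut-off mixes incomparable numbers. One way to handle this is a separate threshold per `score_source`, sketched below. The `Code` dataclass is a stand-in that mirrors the four attributes printed in the quickstart; the source names `"lexical"` and `"embedding"` and the threshold values are assumptions for illustration, not values taken from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Code:
    """Stand-in for the items in result.codes (same four attributes)."""
    hpo_id: str
    label: str
    score: float
    score_source: str


# Hypothetical source names and cut-offs; tune them on your own labelled data.
THRESHOLDS = {"lexical": 1.0, "embedding": 0.80}


def keep(codes, thresholds=THRESHOLDS, default=1.0):
    """Keep codes whose score clears the threshold for their own source.

    Unknown sources fall back to the strict `default`.
    """
    return [c for c in codes if c.score >= thresholds.get(c.score_source, default)]


codes = [
    Code("HP:0001263", "Global developmental delay", 1.00, "lexical"),
    Code("HP:0001250", "Seizure", 0.91, "embedding"),
    Code("HP:0001252", "Hypotonia", 0.62, "embedding"),
]
for code in keep(codes):
    print(code.hpo_id, code.label, code.score, code.score_source)
```

With the real library, the same function could be applied to `result.codes` once the actual `score_source` values have been checked.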

Evaluation

On GSC+ (228 PubMed abstracts, community human gold), under a uniform document-level micro-F1 protocol (HPO v2026-02-16):

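| Tool            | GSC+ F1 | Offline? | Footprint                |
|-----------------|---------|----------|--------------------------|
| pleio-hpo       | 0.660   | yes      | pip, CPU, ~1.5 GB        |
| PhenoTagger     | 0.606   | yes      | TensorFlow; GPU optional |
| txt2hpo         | 0.556   | yes      | pip, CPU                 |
| Doc2HPO (acdat) | 0.516   | no¹      | web API, or local + UMLS |
| PhenoGPT        | 0.341   | yes      | GPU, ~22 GB              |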

¹ The Doc2HPO numbers use its public web API (text leaves the machine); a local, PHI-safe install needs a UMLS license.

pleio-hpo tops the full benchmark, ahead of PhenoTagger by roughly 0.05 F1 (0.660 vs 0.606, p<0.001). Part of that margin comes from inheritance and onset terms, which the benchmark scores; on phenotype-only recognition the two are level (0.651 vs 0.647, p=0.67). Across three human-gold corpora (GSC+, BC8, and 112 out-of-distribution genetics case reports), pleio-hpo and PhenoTagger are statistically tied (the nominal leader varies by corpus), and the strongest frontier-LLM extractor performs comparably. Note that these corpora are abstracts, exam observations, and case reports, not free-text clinical notes.
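For readers unfamiliar with the metric, document-level micro-F1 can be sketched as follows: each document contributes a set of gold HPO ids and a set of predicted ids, true and false positives and false negatives are pooled over all documents, and a single F1 is computed from the pooled counts. This is the standard definition and an assumption about the protocol; details such as ancestor handling or term filtering are described in docs/RESULTS.md and are not modelled here. The function name `micro_f1` and the example data are made up.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-document sets of HPO ids.

    gold, pred: dicts mapping a document id to a set of HPO ids.
    """
    tp = fp = fn = 0
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted and in gold
        fp += len(p - g)   # predicted but not in gold
        fn += len(g - p)   # in gold but missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up two-document example, not benchmark data.
gold = {"doc1": {"HP:0000256", "HP:0001252"}, "doc2": {"HP:0001250"}}
pred = {"doc1": {"HP:0000256"}, "doc2": {"HP:0001250", "HP:0004322"}}
print(round(micro_f1(gold, pred), 3))
```

Pooling counts before computing F1 means documents with many annotations weigh more than sparse ones, which is the difference from a macro average over documents.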

Condensed results, figures, ablation, and caveats: docs/RESULTS.md; per-tool comparison: COMPARISON.md.

HPO version

The library runtime tracks the latest HPO release (the index + validator are built against the current hp.obo). The reported evaluation is pinned to v2026-02-16 for comparability, so a user's runtime ontology may be newer than the evaluation snapshot (docs/RESULTS.md).

Scope and disclaimer

This tool is decision-support, not clinical decision-making. It is not FDA-cleared and not for autonomous use in patient care; inferred codes may be incorrect and should be verified by a qualified user. English-only. HPO is a Western/English-language ontology — apply cautiously to other populations.

License & citation

Apache-2.0 (see LICENSE and NOTICE). If you use this work, please cite it — see CITATION.cff.

Download files

Source Distribution

pleio_hpo-0.1.0.tar.gz (2.1 MB)

Uploaded Source

Built Distribution

pleio_hpo-0.1.0-py3-none-any.whl (2.1 MB)

Uploaded Python 3

File details

Details for the file pleio_hpo-0.1.0.tar.gz.

File metadata

  • Download URL: pleio_hpo-0.1.0.tar.gz
  • Upload date:
  • Size: 2.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pleio_hpo-0.1.0.tar.gz
Algorithm Hash digest
SHA256 e13db67fa3290e629e17f9b6f1de16c3c3ebde9a6778b56ebb2e61ff6cd96195
MD5 355b11769dc270789180d12467603de8
BLAKE2b-256 68bed4c05829c75ad9f9f36a135b1f57e86de3a88516b4badcff68fe15e62d08

Provenance

The following attestation bundles were made for pleio_hpo-0.1.0.tar.gz:

Publisher: release.yml on Pleio-Labs/pleio-hpo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file pleio_hpo-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: pleio_hpo-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 2.1 MB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for pleio_hpo-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7ea65e73715d276e1954626dd0b3f6bed9e09b75e807d9fee5d522e86a12e724
MD5 da9ec112e1eb546f2fbde316755ce633
BLAKE2b-256 d351e8003609a4fc3285b56aed53c503466aaacb5ed0831ed55b1e5802cc14fd
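A downloaded file can be checked against the SHA256 digests listed on this page using only the standard library. The sketch below is one way to do it; `EXPECTED_SHA256`, `sha256_of`, and `verify` are hypothetical helper names, and the digests are copied from the hash listings for the two 0.1.0 files.

```python
import hashlib
import os

# SHA256 digests as listed for the 0.1.0 release files.
EXPECTED_SHA256 = {
    "pleio_hpo-0.1.0.tar.gz":
        "e13db67fa3290e629e17f9b6f1de16c3c3ebde9a6778b56ebb2e61ff6cd96195",
    "pleio_hpo-0.1.0-py3-none-any.whl":
        "7ea65e73715d276e1954626dd0b3f6bed9e09b75e807d9fee5d522e86a12e724",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_SHA256):
    """Return True if the file's digest matches the entry for its file name.

    Raises KeyError when the file name has no listed digest.
    """
    return sha256_of(path) == expected[os.path.basename(path)]
```

For example, `verify("pleio_hpo-0.1.0.tar.gz")` would be expected to return `True` for an intact download of the source distribution. pip can also enforce hashes at install time through its hash-checking mode.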

Provenance

The following attestation bundles were made for pleio_hpo-0.1.0-py3-none-any.whl:

Publisher: release.yml on Pleio-Labs/pleio-hpo

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
