
PetTag is a Python package for automated disease coding of veterinary clinical text using a pre-trained model.


🧠 DiseaseCoder

Automated disease code mapping for veterinary and human clinical text using ICD-11, ICD-10, or SNOMED frameworks.

🧩 Overview

DiseaseCoder automatically maps free-text medical or veterinary notes to structured disease codes (ICD-11, ICD-10, or SNOMED).
It integrates:

  • 🧬 Named Entity Recognition (NER) to identify relevant medical terms
  • 🧠 Sentence embeddings for semantic understanding
  • 🗂️ Embedding-based lookup for precise code mapping
  • Caching and batch support for large-scale EHR datasets

This tool is designed for researchers and practitioners working with electronic health records (EHRs), epidemiological data, or clinical NLP pipelines.


⚙️ Installation

pip install pettag

If you're using GPU acceleration, ensure you have the CUDA-enabled version of PyTorch installed.
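
A quick way to check whether a CUDA-enabled PyTorch build can see a GPU before choosing a device string. This is a minimal sketch: `pick_device` is a hypothetical helper, not part of the package, and it falls back to CPU when PyTorch cannot be imported.

```python
def pick_device() -> str:
    """Return 'cuda:0' when a CUDA-enabled PyTorch build sees a GPU, else 'cpu'."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The resulting string has the same form as the values listed for the `device` parameter.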


🚀 Quickstart

🔹 Single Text Input

from disease_coder import DiseaseCoder

# Initialize the coder
coder = DiseaseCoder(
    framework="icd11",
    model="seanfarrell/bert-base-uncased",
    embedding_model="sentence-transformers/embeddinggemma-300m-medical"
)

# Example text
text = "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis."

# Predict the disease code(s)
output = coder.predict(text=text)
print(output)

Example Output:

{
    "text": "COOKIE PRESENT WITH VOMITING AND DIARRHEA. SUSPECTED GASTROENTERITIS.",
    "Code": [
        {
            "Chapter": "Certain infectious or parasitic diseases",
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
            "URI": "https://icd.who.int/browse/2025-01/mms/en#1688127370/"
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": [
        "vomiting",
        "diarrhea"
    ]
}
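
The returned dictionary can be post-processed with plain Python. The sketch below copies the example output above, trimmed to the fields it uses, and keeps only matches at or above a similarity cutoff. The `confident_codes` helper and the 0.9 cutoff are illustrative assumptions, not package defaults.

```python
# Trimmed copy of the example output shown above.
output = {
    "Code": [
        {
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": ["vomiting", "diarrhea"],
}


def confident_codes(result: dict, min_similarity: float = 0.9) -> list[tuple[str, str]]:
    """Return (input disease, code) pairs whose similarity meets the cutoff."""
    return [
        (match["Input Disease"], match["Code"])
        for match in result.get("Code", [])
        if match["Similarity"] >= min_similarity
    ]


print(confident_codes(output))                       # [('gastroenteritis', '1A40.Z')]
print(confident_codes(output, min_similarity=0.95))  # []
```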

🔹 Dataset Input

from disease_coder import DiseaseCoder

coder = DiseaseCoder(
    dataset="data/clinical_notes.csv",
    text_column="note",
    framework="icd11",
    output_dir="outputs/icd11_coded/"
)

# Run predictions on an entire dataset
coder.predict()

The coded dataset will be saved automatically to the specified output_dir.
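
For reference, the input file in the call above needs one column holding the free text, named to match `text_column`. The following sketch writes such a CSV with the standard library. The `write_notes_csv` helper, the second note, and the scratch directory are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_notes_csv(path: Path, notes: list[str], text_column: str = "note") -> Path:
    """Write one clinical note per row under a single text column."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[text_column])
        writer.writeheader()
        for note in notes:
            writer.writerow({text_column: note})
    return path


# A scratch directory keeps the example from touching real project data.
project_dir = Path(tempfile.mkdtemp())
csv_path = write_notes_csv(
    project_dir / "data" / "clinical_notes.csv",
    [
        "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis.",
        "Rex has a persistent cough and nasal discharge.",
    ],
)
print(csv_path)
```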


🧠 Parameters

Parameter                    Type            Default                                               Description
framework                    str             'icd11'                                               Coding framework: 'icd11', 'icd10', or 'snomed'
dataset                      str or Dataset  None                                                  Path to dataset or Hugging Face Dataset
split                        str             'train'                                               Dataset split (e.g., 'train', 'test')
model                        str             'seanfarrell/bert-base-uncased'                       Token classification model
tokenizer                    str             None                                                  Tokenizer name (defaults to model)
embedding_model              str             'sentence-transformers/embeddinggemma-300m-medical'   Sentence embedding model
synonyms_dataset             str             'seanfarrell/ICD-11_synonyms'                         ICD synonym dataset
synonyms_embeddings_dataset  str             'cache/ICD-11_synonyms_embeddings.pt'                 Cached ICD embeddings
text_column                  str             'text'                                                Text column name
label_column                 str             'labels'                                              Label column name
cache                        bool            True                                                  Enable caching
cache_path                   str             'petharbor_cache/'                                    Cache directory
logs                         str             None                                                  Log file path (logs to console if None)
device                       str             'cuda:0' or 'cpu'                                     Device for computation
output_dir                   str             None                                                  Directory to save outputs

🧩 How It Works

  1. Entity Extraction
    Identifies medically relevant entities using a pretrained token-classification model.

  2. Semantic Embedding
    Converts entities to dense embeddings with a SentenceTransformer model.

  3. Code Matching
    Finds the most semantically similar ICD-11 / ICD-10 / SNOMED entry using cosine similarity.

  4. Caching & Efficiency
    ICD embeddings are saved to disk on the first run (.pt format) for faster reuse later.
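
Steps 2 and 3 amount to a nearest-neighbour search under cosine similarity. The sketch below shows that idea in pure Python, with hand-written three-dimensional toy vectors standing in for real sentence embeddings. The vectors, the two candidate titles, and the `best_match` helper are illustrative assumptions and do not reproduce the package's internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query: list[float], lookup: dict[str, list[float]]) -> tuple[str, float]:
    """Return the lookup key most similar to the query, with its score."""
    scored = {title: cosine_similarity(query, vec) for title, vec in lookup.items()}
    title = max(scored, key=scored.get)
    return title, scored[title]


# Toy "embeddings": real ones would come from a sentence-embedding model.
code_lookup = {
    "Infectious gastroenteritis or colitis": [0.9, 0.1, 0.0],
    "Fracture of femur": [0.0, 0.2, 0.9],
}
entity_vector = [0.8, 0.2, 0.1]  # stand-in for the embedding of "gastroenteritis"

title, score = best_match(entity_vector, code_lookup)
print(title, round(score, 4))
```

The score plays the same role as the `Similarity` field in the example output: values close to 1 mean the extracted entity and the candidate entry point in nearly the same direction in embedding space.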


📦 Output

Depending on the input mode:

  • Single text input: returns a structured Python dictionary with predicted codes.
  • Dataset input: saves a processed dataset with new code columns to output_dir.

🔧 Advanced Usage

💾 Regenerate ICD Embedding Store

If the ICD embedding store doesn’t exist, it will be created automatically.
To rebuild it manually:

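from datasets import load_dataset
from disease_coder import DiseaseCoder

coder = DiseaseCoder()

# Load the synonym table, then recompute its embeddings and save them to disk.
dataset = load_dataset("seanfarrell/ICD-11_synonyms", split="train")
coder._preprocess_icd_lookup(disease_code_lookup=dataset, save_path="cache/icd_lookup.pt")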
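
The create-if-missing behaviour described above follows a common load-or-build caching pattern. Here is a generic sketch of that pattern using only the standard library. It relies on `pickle` and a hypothetical `load_or_build` helper rather than the package's `.pt` files, so treat it as an illustration of the idea, not the actual implementation.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(cache_file: Path, build: Callable[[], dict]) -> dict:
    """Load a cached object if present; otherwise build it and save it."""
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    store = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as handle:
        pickle.dump(store, handle)
    return store


build_calls = []  # records how often the expensive step actually runs


def build_embeddings() -> dict:
    """Stand-in for an expensive embedding step (toy vectors only)."""
    build_calls.append(1)
    return {"gastroenteritis": [0.9, 0.1, 0.0]}


cache_file = Path(tempfile.mkdtemp()) / "cache" / "lookup.pkl"
first = load_or_build(cache_file, build_embeddings)   # builds and writes the file
second = load_or_build(cache_file, build_embeddings)  # reads the file back
print(first == second, len(build_calls))  # True 1
```

In this sketch, deleting the cached file forces a rebuild on the next call.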

🧾 Logging

Enable persistent logs:

coder = DiseaseCoder(logs="logs/run.log")
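
The `logs` parameter is described as writing to a file when a path is given and to the console otherwise. With Python's standard `logging` module, that behaviour could look like the following. `make_logger` is a hypothetical helper written for illustration, and the package's own logging setup may differ.

```python
import logging
from pathlib import Path
from typing import Optional


def make_logger(name: str, log_path: Optional[str] = None) -> logging.Logger:
    """Log to a file when log_path is given, otherwise to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    if log_path is None:
        handler: logging.Handler = logging.StreamHandler()
    else:
        Path(log_path).parent.mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


console_logger = make_logger("coder_demo")
console_logger.info("Logging to the console because no path was given.")
```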

🧬 Framework Switching

Switch easily between ICD and SNOMED frameworks:

coder = DiseaseCoder(framework="snomed")

📂 Recommended Project Structure

project/
│
├── data/
│   └── clinical_notes.csv
│
├── cache/
│   └── ICD-11_synonyms_embeddings.pt
│
├── outputs/
│   └── icd11_coded/
│
├── logs/
│   └── run.log
│
└── disease_coder.py

🧾 Citation

If you use this tool in your research, please cite the PetHarbor and PetTag projects:

@misc{pettag2025,
  author       = {Farrell, Sean},
  title        = {PetHarbor: Veterinary Language Models for Structured Health Record Coding},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/sean-farrell/petharbor}
}

❤️ Acknowledgements

This package is part of the PetTag / PetHarbor ecosystem —
a suite of NLP tools for large-scale veterinary EHR data analysis.


🐾 License

This project is licensed under the MIT License — see the LICENSE file for details.
