PetTag is a Python package for automated disease coding of veterinary clinical texts using a pre-trained model.

🧠 DiseaseCoder

Automated disease code mapping for veterinary and human clinical text using ICD-11, ICD-10, or SNOMED frameworks.


🧩 Overview

DiseaseCoder automatically maps free-text medical or veterinary notes to structured disease codes (ICD-11, ICD-10, or SNOMED).
It integrates:

  • 🧬 Named Entity Recognition (NER) to identify relevant medical terms
  • 🧠 Sentence embeddings for semantic understanding
  • 🗂️ Embedding-based lookup for precise code mapping
  • Caching and batch support for large-scale EHR datasets

This tool is designed for researchers and practitioners working with electronic health records (EHRs), epidemiological data, or clinical NLP pipelines.


⚙️ Installation

pip install pettag

If you're using GPU acceleration, ensure you have the CUDA-enabled version of PyTorch installed.
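Before constructing the coder, it can help to check which device is actually usable. The sketch below is one way to do that; `pick_device` is a hypothetical helper, not part of the package. It relies only on `torch.cuda.is_available()` and falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return 'cuda:0' when a CUDA-enabled PyTorch build can see a GPU, else 'cpu'."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The resulting string can then be passed as the `device` parameter listed in the Parameters section.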


🚀 Quickstart

🔹 Single Text Input

from disease_coder import DiseaseCoder

# Initialize the coder
coder = DiseaseCoder()

# Example text
text = "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis."

# Predict the disease code(s)
output = coder.predict(text=text)
print(output)

Example Output:

{
    "Code": [
        {
            "Chapter": "Certain infectious or parasitic diseases",
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
            "URI": "https://icd.who.int/browse/2025-01/mms/en#1688127370/"
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": [
        "vomiting",
        "diarrhea"
    ]
}
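
As a minimal sketch of how this dictionary might be consumed downstream, the snippet below filters matches by similarity. The helper name `confident_codes` and the 0.9 threshold are illustrative assumptions, not part of the package; the dictionary is the example output above, trimmed to the fields used.

```python
# Example output copied from above (trimmed to the fields used here).
output = {
    "Code": [
        {
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": ["vomiting", "diarrhea"],
}


def confident_codes(result: dict, min_similarity: float = 0.9) -> list:
    """Return (input disease, code, similarity) for matches at or above the threshold."""
    return [
        (match["Input Disease"], match["Code"], match["Similarity"])
        for match in result.get("Code", [])
        if match["Similarity"] >= min_similarity
    ]


matches = confident_codes(output)
print(matches)  # [('gastroenteritis', '1A40.Z', 0.9393)]
```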

🔹 Dataset Input

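from disease_coder import DiseaseCoder

# Point the coder at a dataset and name the column that holds the free text
coder = DiseaseCoder(
    dataset="data/clinical_notes.csv",
    text_column="note",
    framework="icd11",
    output_dir="outputs/icd11_coded/"
)

# Run predictions on the entire dataset
coder.predict()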

The coded dataset will be saved automatically to the specified output_dir.
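
For readers who need to prepare an input file, here is a small sketch that writes a CSV with a single `note` column, matching the `text_column="note"` setting above. The helper `write_notes_csv` is hypothetical, and whether the package expects anything beyond the text column is an assumption; the file is written to a temporary directory so the example has no side effects.

```python
import csv
import tempfile
from pathlib import Path

# The note text is the Quickstart example from this README.
notes = [
    "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis.",
]


def write_notes_csv(path: Path, notes: list, text_column: str = "note") -> None:
    """Write one clinical note per row under a single text column."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=[text_column])
        writer.writeheader()
        writer.writerows({text_column: note} for note in notes)


csv_path = Path(tempfile.mkdtemp()) / "data" / "clinical_notes.csv"
write_notes_csv(csv_path, notes)

# Read the file back to confirm the column layout.
with csv_path.open(newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows[0]["note"])
```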


🧠 Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| framework | str | 'icd11' | Coding framework: 'icd11', 'icd10', or 'snomed' |
| dataset | str or Dataset | None | Path to dataset or HuggingFace Dataset |
| split | str | 'train' | Dataset split (e.g., 'train', 'test') |
| model | str | 'seanfarrell/bert-base-uncased' | Token classification model |
| tokenizer | str | None | Tokenizer name (defaults to model) |
| embedding_model | str | 'sentence-transformers/embeddinggemma-300m-medical' | Sentence embedding model |
| synonyms_dataset | str | 'seanfarrell/ICD-11_synonyms' | ICD synonym dataset |
| synonyms_embeddings_dataset | str | 'cache/ICD-11_synonyms_embeddings.pt' | Cached ICD embeddings |
| text_column | str | 'text' | Text column name |
| label_column | str | 'labels' | Label column name |
| cache | bool | True | Enable caching |
| cache_path | str | 'petharbor_cache/' | Cache directory |
| logs | str | None | Log file path (logs to console if None) |
| device | str | 'cuda:0' or 'cpu' | Device for computation |
| output_dir | str | None | Directory to save outputs |

🧩 How It Works

  1. Entity Extraction
    Identifies medically relevant entities using a pretrained token-classification model.

  2. Semantic Embedding
    Converts entities to dense embeddings with a SentenceTransformer model.

  3. Code Matching
    Finds the most semantically similar ICD-11 / ICD-10 / SNOMED entry using cosine similarity.

  4. Caching & Efficiency
    ICD embeddings are saved to disk on the first run (.pt format) for faster reuse later.
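
The code-matching step can be illustrated with a small, dependency-free sketch. It is not the package's implementation: the vectors are toy 3-dimensional stand-ins for real sentence embeddings, and the names `cosine_similarity`, `best_match`, `CODE_A`, and `CODE_B` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query, lookup):
    """Return (code, similarity) for the lookup entry most similar to the query."""
    return max(
        ((code, cosine_similarity(query, vec)) for code, vec in lookup.items()),
        key=lambda pair: pair[1],
    )


# Toy embeddings; real sentence embeddings have hundreds of dimensions.
lookup = {
    "CODE_A": [1.0, 0.0, 0.0],
    "CODE_B": [0.0, 1.0, 0.0],
}
entity_embedding = [0.9, 0.1, 0.0]

code, score = best_match(entity_embedding, lookup)
print(code, round(score, 4))
```

Here the entity vector points almost entirely along the first axis, so `CODE_A` is selected with a similarity close to 1.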


📦 Output

Depending on the input mode:

  • Single text input: returns a structured Python dictionary with predicted codes.
  • Dataset input: saves a processed dataset with new code columns to output_dir.

🔧 Advanced Usage

💾 Regenerate ICD Embedding Store

If the ICD embedding store doesn’t exist, it will be created automatically.
To rebuild it manually:

from datasets import load_dataset
from disease_coder import DiseaseCoder

coder = DiseaseCoder()

# Load the synonym dataset and rebuild the embedding lookup on disk
dataset = load_dataset("seanfarrell/ICD-11_synonyms", split="train")
coder._preprocess_icd_lookup(disease_code_lookup=dataset, save_path="cache/icd_lookup.pt")
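
The "create on first run, reuse afterwards" behaviour described above follows a common build-or-load pattern, sketched here under stated assumptions. The package stores embeddings in .pt format; this sketch substitutes the standard-library `pickle` module so it runs without PyTorch, and `load_or_build` and `build_embeddings` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Load a cached object from cache_file, or build it once and save it."""
    if cache_file.exists():
        with cache_file.open("rb") as f:
            return pickle.load(f)
    value = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(value, f)
    return value


calls = []


def build_embeddings():
    # Stand-in for the expensive embedding step; records each invocation.
    calls.append(1)
    return {"CODE_A": [1.0, 0.0, 0.0]}


cache_file = Path(tempfile.mkdtemp()) / "cache" / "lookup.pkl"
first = load_or_build(cache_file, build_embeddings)   # builds and saves
second = load_or_build(cache_file, build_embeddings)  # loads from disk
print(len(calls))  # 1
```

Deleting the cache file forces a rebuild on the next call, which is the same effect as regenerating the store manually.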

🧾 Logging

Enable persistent logs:

coder = DiseaseCoder(logs="logs/run.log")
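
The parameter table describes `logs` as a file path that falls back to the console when it is None. A rough sketch of that behaviour using the standard `logging` module is shown below; `make_logger` and the logger name are assumptions for illustration and say nothing about how the package configures logging internally.

```python
import logging
import tempfile
from pathlib import Path
from typing import Optional


def make_logger(logs: Optional[str] = None) -> logging.Logger:
    """Log to the given file path, or to the console when logs is None."""
    logger = logging.getLogger("disease_coder_sketch")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    if logs is None:
        handler = logging.StreamHandler()
    else:
        Path(logs).parent.mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(logs, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


log_path = Path(tempfile.mkdtemp()) / "logs" / "run.log"
logger = make_logger(str(log_path))
logger.info("coding started")
logger.handlers[0].flush()
```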

🧬 Framework Switching

Switch easily between ICD and SNOMED frameworks:

coder = DiseaseCoder(framework="snomed")

📂 Recommended Project Structure

project/
│
├── data/
│   └── clinical_notes.csv
│
├── cache/
│   └── ICD-11_synonyms_embeddings.pt
│
├── outputs/
│   └── icd11_coded/
│
├── logs/
│   └── run.log
│
└── disease_coder.py

🧾 Citation

If you use this tool in your research, please cite the PetHarbor and PetTag projects:

@misc{pettag2025,
  author       = {Farrell, Sean},
  title        = {PetHarbor: Veterinary Language Models for Structured Health Record Coding},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/sean-farrell/petharbor}
}

❤️ Acknowledgements

This package is part of the PetTag / PetHarbor ecosystem —
a suite of NLP tools for large-scale veterinary EHR data analysis.

🐾 License

This project is licensed under the MIT License — see the LICENSE file for details.
