PetTag is a Python package designed for automated disease coding of veterinary clinical texts using a pre-trained model.

🧠 DiseaseCoder

Automated disease code mapping for veterinary and human clinical text using ICD-11, ICD-10, or SNOMED frameworks.

🧩 Overview

DiseaseCoder automatically maps free-text medical or veterinary notes to structured disease codes (ICD-11, ICD-10, or SNOMED).
It integrates:

🧬 Named Entity Recognition (NER) to identify relevant medical terms
🧠 Sentence embeddings for semantic understanding
🗂️ Embedding-based lookup for precise code mapping
⚡ Caching and batch support for large-scale EHR datasets

This tool is designed for researchers and practitioners working with electronic health records (EHRs), epidemiological data, or clinical NLP pipelines.

⚙️ Installation

pip install pettag

If you're using GPU acceleration, ensure you have the CUDA-enabled version of PyTorch installed.
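A quick way to check which device is usable before constructing the coder is sketched below. The helper name `pick_device` is made up for this example; `torch.cuda.is_available()` is the standard PyTorch check, and the code falls back to CPU when PyTorch is not installed.

```python
import importlib.util


def pick_device() -> str:
    """Return 'cuda:0' when a CUDA-enabled PyTorch is usable, else 'cpu'."""
    # PyTorch may not be installed at all; treat that as CPU-only.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)
```

The returned string has the same form as the `device` parameter listed under Parameters, so it could be passed along as `device=device`.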

🚀 Quickstart

🔹 Single Text Input

from disease_coder import DiseaseCoder

# Initialize the coder
coder = DiseaseCoder()

# Example text
text = "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis."

# Predict the disease code(s)
output = coder.predict(text=text)
print(output)

Example Output:

{
    "Code": [
        {
            "Chapter": "Certain infectious or parasitic diseases",
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
            "URI": "https://icd.who.int/browse/2025-01/mms/en#1688127370/"
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": [
        "vomiting",
        "diarrhea"
    ]
}
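The returned dictionary can be post-processed like any other Python dict. The sketch below keeps only matches above a similarity threshold; the helper `confident_codes` and the 0.9 cut-off are illustrative assumptions, not part of the package.

```python
# Copied from the example output above (URI field omitted for brevity).
output = {
    "Code": [
        {
            "Chapter": "Certain infectious or parasitic diseases",
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": ["vomiting", "diarrhea"],
}


def confident_codes(result: dict, min_similarity: float = 0.9) -> list[tuple[str, str]]:
    """Return (input disease, code) pairs whose similarity meets the threshold."""
    return [
        (match["Input Disease"], match["Code"])
        for match in result.get("Code", [])
        if match["Similarity"] >= min_similarity
    ]


print(confident_codes(output))                       # [('gastroenteritis', '1A40.Z')]
print(confident_codes(output, min_similarity=0.95))  # []
```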

🔹 Dataset Input

coder = DiseaseCoder(
    dataset="data/clinical_notes.csv",
    text_column="note",
    framework="icd11",
    output_dir="outputs/icd11_coded/"
)

# Run predictions on an entire dataset
coder.predict()

The coded dataset will be saved automatically to the specified output_dir.
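To try dataset mode without real records, a small CSV with a `note` column can be produced with the standard library. This is only a sketch: the helper `write_notes_csv` and the second sample row are invented for illustration, and the file is written to a temporary directory so nothing is left behind.

```python
import csv
import tempfile
from pathlib import Path


def write_notes_csv(path: str, notes: list[str], text_column: str = "note") -> Path:
    """Write one clinical note per row under a single text column."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[text_column])
        writer.writeheader()
        for note in notes:
            writer.writerow({text_column: note})
    return target


# Invented example rows; replace with real notes.
sample_notes = [
    "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis.",
    "Routine vaccination. No abnormalities detected.",
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = write_notes_csv(f"{tmp}/data/clinical_notes.csv", sample_notes)
    # Read the file back to confirm the column layout.
    with csv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

print(len(rows), list(rows[0]))  # 2 ['note']
```

The column name matches the `text_column="note"` argument used in the dataset example.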

🧠 Parameters

Parameter	Type	Default	Description
`framework`	`str`	`'icd11'`	Coding framework: `'icd11'`, `'icd10'`, or `'snomed'`
`dataset`	`str` or `Dataset`	`None`	Path to dataset or HuggingFace `Dataset`
`split`	`str`	`'train'`	Dataset split (e.g., `'train'`, `'test'`)
`model`	`str`	`'seanfarrell/bert-base-uncased'`	Token classification model
`tokenizer`	`str`	`None`	Tokenizer name (defaults to model)
`embedding_model`	`str`	`'sentence-transformers/embeddinggemma-300m-medical'`	Sentence embedding model
`synonyms_dataset`	`str`	`'seanfarrell/ICD-11_synonyms'`	ICD synonym dataset
`synonyms_embeddings_dataset`	`str`	`'cache/ICD-11_synonyms_embeddings.pt'`	Cached ICD embeddings
`text_column`	`str`	`'text'`	Text column name
`label_column`	`str`	`'labels'`	Label column name
`cache`	`bool`	`True`	Enable caching
`cache_path`	`str`	`'petharbor_cache/'`	Cache directory
`logs`	`str`	`None`	Log file path (logs to console if `None`)
`device`	`str`	`'cuda:0'` or `'cpu'`	Device for computation
`output_dir`	`str`	`None`	Directory to save outputs
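When the arguments come from user input or a config file, it can help to validate them before building the coder. The helper below is hypothetical (it is not part of the package); it only uses the framework values and defaults listed in the table, and lower-casing the framework name is a choice made for this sketch.

```python
ALLOWED_FRAMEWORKS = ("icd11", "icd10", "snomed")


def coder_kwargs(framework: str = "icd11", **overrides) -> dict:
    """Build keyword arguments for DiseaseCoder, rejecting unknown frameworks early."""
    normalized = framework.lower()
    if normalized not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"framework must be one of {ALLOWED_FRAMEWORKS}, got {framework!r}")
    # Defaults mirror the table above; overrides replace them.
    kwargs = {"framework": normalized, "text_column": "text", "cache": True}
    kwargs.update(overrides)
    return kwargs


print(coder_kwargs("SNOMED", text_column="note"))
# {'framework': 'snomed', 'text_column': 'note', 'cache': True}
```

The resulting dictionary could then be unpacked as `DiseaseCoder(**coder_kwargs(...))`.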

🧩 How It Works

Entity Extraction: identifies medically relevant entities using a pretrained token-classification model.
Semantic Embedding: converts entities to dense embeddings with a SentenceTransformer model.
Code Matching: finds the most semantically similar ICD-11, ICD-10, or SNOMED entry using cosine similarity.
Caching & Efficiency: saves ICD embeddings to disk on the first run (.pt format) for faster reuse later.
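The code-matching step can be illustrated with a minimal cosine-similarity lookup. This is a sketch of the general technique, not the package's implementation: the three-dimensional vectors and the `code_A`/`code_B`/`code_C` labels are invented, whereas real sentence embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings standing in for the precomputed code lookup.
code_embeddings = {
    "code_A": [0.9, 0.1, 0.0],
    "code_B": [0.0, 1.0, 0.2],
    "code_C": [0.1, 0.0, 1.0],
}


def best_match(entity_embedding: list[float], lookup: dict) -> tuple[str, float]:
    """Return the code whose embedding is most similar to the entity embedding."""
    scored = {code: cosine_similarity(entity_embedding, vec) for code, vec in lookup.items()}
    code = max(scored, key=scored.get)
    return code, scored[code]


# Toy embedding standing in for an extracted entity such as "gastroenteritis".
entity = [0.8, 0.2, 0.1]
code, score = best_match(entity, code_embeddings)
print(code, round(score, 4))
```

The winning score plays the same role as the `Similarity` field in the example output.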

📦 Output

Depending on the input mode:

Single text input: returns a structured Python dictionary with predicted codes.
Dataset input: saves a processed dataset with new code columns to output_dir.

🔧 Advanced Usage

💾 Regenerate ICD Embedding Store

If the ICD embedding store doesn’t exist, it will be created automatically.
To rebuild it manually:

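from datasets import load_dataset
from disease_coder import DiseaseCoder

coder = DiseaseCoder()
# Load the ICD-11 synonym list that serves as the code lookup
dataset = load_dataset("seanfarrell/ICD-11_synonyms", split="train")
# Rebuild the embedding store and write it to save_path
coder._preprocess_icd_lookup(disease_code_lookup=dataset, save_path="cache/icd_lookup.pt")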
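The "create on first run, reuse afterwards" behavior follows a common load-or-build pattern, sketched here under stated assumptions: `pickle` stands in for the `.pt` format so the example needs no extra dependencies, and `load_or_build` and `build_embeddings` are hypothetical names.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_file: Path, build):
    """Return cached data if the file exists; otherwise build it and save it."""
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as handle:
        pickle.dump(data, handle)
    return data


calls = []


def build_embeddings() -> dict:
    """Stand-in for the expensive embedding step."""
    calls.append(1)  # record each time the expensive step actually runs
    return {"code_A": [0.9, 0.1, 0.0]}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache" / "lookup.pkl"
    first = load_or_build(cache_file, build_embeddings)   # builds and saves
    second = load_or_build(cache_file, build_embeddings)  # served from disk

print(len(calls), first == second)  # 1 True
```

Deleting the cache file is then enough to force a rebuild on the next run.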

🧾 Logging

Enable persistent logs:

coder = DiseaseCoder(logs="logs/run.log")
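A "file if a path is given, console otherwise" switch of this kind is commonly wired with the standard `logging` module. The sketch below shows that pattern only; `make_logger` and the logger name are hypothetical, and the package's own logging setup may differ.

```python
import logging


def make_logger(logs=None, name="disease_coder_demo"):
    """Log to the file at `logs` when given, otherwise to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    # FileHandler needs the parent directory (e.g. logs/) to exist already.
    handler = logging.FileHandler(logs) if logs else logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


logger = make_logger()  # no path given, so messages go to the console
logger.info("coding run started")
```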

🧬 Framework Switching

Switch easily between ICD and SNOMED frameworks:

coder = DiseaseCoder(framework="snomed")

📂 Recommended Project Structure

project/
│
├── data/
│   └── clinical_notes.csv
│
├── cache/
│   └── ICD-11_synonyms_embeddings.pt
│
├── outputs/
│   └── icd11_coded/
│
├── logs/
│   └── run.log
│
└── disease_coder.py

🧾 Citation

If you use this tool in your research, please cite the PetHarbor and PetTag projects:

@misc{pettag2025,
  author       = {Farrell, Sean},
  title        = {PetHarbor: Veterinary Language Models for Structured Health Record Coding},
  year         = {2025},
  publisher    = {GitHub},
  url          = {https://github.com/sean-farrell/petharbor}
}

❤️ Acknowledgements

This package is part of the PetTag / PetHarbor ecosystem —
a suite of NLP tools for large-scale veterinary EHR data analysis.

🐾 License

This project is licensed under the MIT License — see the LICENSE file for details.
