PetTag is a Python package designed for automated disease coding of veterinary clinical texts using a pre-trained model.
🧠 DiseaseCoder
Automated disease code mapping for veterinary and human clinical text using ICD-11, ICD-10, or SNOMED frameworks.
🧩 Overview
DiseaseCoder automatically maps free-text medical or veterinary notes to structured disease codes (ICD-11, ICD-10, or SNOMED).
It integrates:
- 🧬 Named Entity Recognition (NER) to identify relevant medical terms
- 🧠 Sentence embeddings for semantic understanding
- 🗂️ Embedding-based lookup for precise code mapping
- ⚡ Caching and batch support for large-scale EHR datasets
This tool is designed for researchers and practitioners working with electronic health records (EHRs), epidemiological data, or clinical NLP pipelines.
⚙️ Installation
pip install pettag
If you're using GPU acceleration, ensure you have the CUDA-enabled version of PyTorch installed.
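A minimal sketch for choosing a value for the `device` parameter, assuming PyTorch may or may not be installed. The helper name `pick_device` is hypothetical and not part of the package.

```python
def pick_device() -> str:
    """Return 'cuda:0' when a CUDA-enabled PyTorch build can see a GPU, else 'cpu'."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Using device: {device}")
```

The result could then be passed on as `DiseaseCoder(device=device)`.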
🚀 Quickstart
🔹 Single Text Input
from disease_coder import DiseaseCoder
# Initialize the coder
coder = DiseaseCoder()
# Example text
text = "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis."
# Predict the disease code(s)
output = coder.predict(text=text)
print(output)
Example Output:
{
    "Code": [
        {
            "Chapter": "Certain infectious or parasitic diseases",
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
            "URI": "https://icd.who.int/browse/2025-01/mms/en#1688127370/"
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": [
        "vomiting",
        "diarrhea"
    ]
}
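As a sketch of how the returned dictionary might be consumed, the snippet below filters matches by similarity. The dictionary is copied from the example output above with some fields trimmed; the `confident_codes` helper and the 0.80 cut-off are assumptions for illustration, not package defaults.

```python
# Copied from the example output above (Chapter and URI omitted for brevity).
output = {
    "Code": [
        {
            "Code": "1A40.Z",
            "Framework": "ICD11",
            "Input Disease": "gastroenteritis",
            "Similarity": 0.9393,
            "Title": "Infectious gastroenteritis or colitis without specification of infectious agent",
        }
    ],
    "pathogen_extraction": [],
    "symptom_extraction": ["vomiting", "diarrhea"],
}

MIN_SIMILARITY = 0.80  # hypothetical cut-off, not a package default


def confident_codes(result: dict, threshold: float = MIN_SIMILARITY) -> list[tuple[str, str, float]]:
    """Return (input disease, code, similarity) for matches at or above the threshold."""
    return [
        (match["Input Disease"], match["Code"], match["Similarity"])
        for match in result.get("Code", [])
        if match["Similarity"] >= threshold
    ]


for disease, code, score in confident_codes(output):
    print(f"{disease} -> {code} ({score:.2f})")
```

Low-similarity matches may be worth routing to manual review rather than discarding outright.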
🔹 Dataset Input
coder = DiseaseCoder(
    dataset="data/clinical_notes.csv",
    text_column="note",
    framework="icd11",
    output_dir="outputs/icd11_coded/"
)
# Run predictions on an entire dataset
coder.predict()
The coded dataset will be saved automatically to the specified output_dir.
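The dataset example assumes a CSV file with a `note` column. A minimal sketch of building such a file with the standard library follows; the helper `write_notes_csv` and the sample rows are illustrative, and whether the package expects any further columns is not stated here.

```python
import csv
from pathlib import Path


def write_notes_csv(path: str, notes: list[str], text_column: str = "note") -> Path:
    """Write one clinical note per row under a single text column."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing
    with target.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=[text_column])
        writer.writeheader()
        for note in notes:
            writer.writerow({text_column: note})
    return target


# Illustrative notes only; replace with real records.
sample_notes = [
    "Cookie presented with vomiting and diarrhea. Suspected gastroenteritis.",
    "Annual check-up. No abnormalities detected.",
]
csv_path = write_notes_csv("data/clinical_notes.csv", sample_notes)
print(f"Wrote {len(sample_notes)} notes to {csv_path}")
```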
🧠 Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| framework | str | 'icd11' | Coding framework: 'icd11', 'icd10', or 'snomed' |
| dataset | str or Dataset | None | Path to dataset or HuggingFace Dataset |
| split | str | 'train' | Dataset split (e.g., 'train', 'test') |
| model | str | 'seanfarrell/bert-base-uncased' | Token classification model |
| tokenizer | str | None | Tokenizer name (defaults to model) |
| embedding_model | str | 'sentence-transformers/embeddinggemma-300m-medical' | Sentence embedding model |
| synonyms_dataset | str | 'seanfarrell/ICD-11_synonyms' | ICD synonym dataset |
| synonyms_embeddings_dataset | str | 'cache/ICD-11_synonyms_embeddings.pt' | Cached ICD embeddings |
| text_column | str | 'text' | Text column name |
| label_column | str | 'labels' | Label column name |
| cache | bool | True | Enable caching |
| cache_path | str | 'petharbor_cache/' | Cache directory |
| logs | str | None | Log file path (logs to console if None) |
| device | str | 'cuda:0' or 'cpu' | Device for computation |
| output_dir | str | None | Directory to save outputs |
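As a sketch of how these parameters might be combined, the snippet below collects keyword arguments in a dictionary and checks the `framework` value before use. The helper `build_config` is hypothetical; only the parameter names, defaults, and allowed framework values come from the table.

```python
ALLOWED_FRAMEWORKS = {"icd11", "icd10", "snomed"}  # values listed in the table


def build_config(framework: str = "icd11", **overrides) -> dict:
    """Assemble keyword arguments, rejecting unknown framework names early."""
    if framework not in ALLOWED_FRAMEWORKS:
        raise ValueError(f"Unsupported framework: {framework!r}")
    # Start from two of the documented defaults, then apply caller overrides.
    config = {"framework": framework, "text_column": "text", "cache": True}
    config.update(overrides)
    return config


config = build_config(
    framework="icd10",
    dataset="data/clinical_notes.csv",
    text_column="note",
    output_dir="outputs/icd10_coded/",
)
print(config)
# Assumed usage: coder = DiseaseCoder(**config)
```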
🧩 How It Works
- Entity Extraction: Identifies medically relevant entities using a pretrained token-classification model.
- Semantic Embedding: Converts entities to dense embeddings with a SentenceTransformer model.
- Code Matching: Finds the most semantically similar ICD-11 / ICD-10 / SNOMED entry using cosine similarity.
- Caching & Efficiency: ICD embeddings are saved to disk on the first run (.pt format) for faster reuse later.
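The code-matching step can be sketched with toy vectors. This illustrates cosine-similarity lookup in general and is not the package's implementation: the three-dimensional embeddings, the placeholder codes `CODE-B` and `CODE-C`, and the `best_match` helper are invented for the example, whereas real embeddings would come from the sentence-embedding model.

```python
import math

# Toy lookup table: code -> (title, embedding). Vectors are made up for illustration.
CODE_EMBEDDINGS = {
    "1A40.Z": ("Infectious gastroenteritis or colitis", [0.9, 0.1, 0.0]),
    "CODE-B": ("Hypothetical respiratory entry", [0.1, 0.9, 0.1]),
    "CODE-C": ("Hypothetical skin entry", [0.0, 0.2, 0.9]),
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(entity_embedding: list[float]) -> tuple[str, str, float]:
    """Return (code, title, similarity) for the most similar lookup entry."""
    scored = (
        (code, title, cosine_similarity(entity_embedding, vector))
        for code, (title, vector) in CODE_EMBEDDINGS.items()
    )
    return max(scored, key=lambda item: item[2])


# Pretend this vector is the embedding of the extracted entity "gastroenteritis".
code, title, similarity = best_match([0.8, 0.2, 0.1])
print(code, title, round(similarity, 4))
```

A linear scan like this is fine for a toy table; with a full code list, a vectorised matrix product over precomputed embeddings would be the more usual choice.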
📦 Output
Depending on the input mode:
- Single text input: returns a structured Python dictionary with predicted codes.
- Dataset input: saves a processed dataset with new code columns to output_dir.
🔧 Advanced Usage
💾 Regenerate ICD Embedding Store
If the ICD embedding store doesn’t exist, it will be created automatically.
To rebuild it manually:
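from datasets import load_dataset
from disease_coder import DiseaseCoder

coder = DiseaseCoder()
# Load the ICD-11 synonym dataset used as the code lookup
dataset = load_dataset("seanfarrell/ICD-11_synonyms", split="train")
# Rebuild the embedding store and write it to save_path
coder._preprocess_icd_lookup(disease_code_lookup=dataset, save_path="cache/icd_lookup.pt")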
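The create-if-missing behaviour described in this section can be sketched as a load-or-build function. This is an illustration under stated assumptions: it uses `pickle` from the standard library as a stand-in for the `.pt` files the package writes, and `fake_embed` is a placeholder for a real embedding model.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def load_or_build(cache_file: Path, terms: list[str],
                  embed: Callable[[str], list[float]]) -> dict[str, list[float]]:
    """Load cached embeddings if the file exists; otherwise compute and save them."""
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)  # only unpickle files you created yourself
    store = {term: embed(term) for term in terms}
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as handle:
        pickle.dump(store, handle)
    return store


def fake_embed(term: str) -> list[float]:
    """Placeholder embedding: term length and vowel count (illustration only)."""
    return [float(len(term)), float(sum(ch in "aeiou" for ch in term))]


# A temporary directory keeps the demo from touching a real cache folder.
cache_file = Path(tempfile.mkdtemp()) / "cache" / "demo_embeddings.pkl"
store = load_or_build(cache_file, ["gastroenteritis", "otitis"], fake_embed)
print(sorted(store))
```

Note that a cache built this way goes stale if the term list changes, which is one reason a manual rebuild step is useful.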
🧾 Logging
Enable persistent logs:
coder = DiseaseCoder(logs="logs/run.log")
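The `logs` behaviour (a file when a path is given, the console when it is `None`) resembles a common pattern with Python's standard `logging` module. The sketch below shows that pattern; the `make_logger` helper is hypothetical, and the package's own logging setup may differ.

```python
import logging
from pathlib import Path
from typing import Optional


def make_logger(log_path: Optional[str] = None, name: str = "demo_coder") -> logging.Logger:
    """Log to a file when log_path is given, otherwise to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    if log_path is None:
        handler: logging.Handler = logging.StreamHandler()
    else:
        Path(log_path).parent.mkdir(parents=True, exist_ok=True)
        handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


logger = make_logger()  # console logging, mirroring logs=None
logger.info("Coding run started")
```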
🧬 Framework Switching
Switch easily between ICD and SNOMED frameworks:
coder = DiseaseCoder(framework="snomed")
📂 Recommended Project Structure
project/
│
├── data/
│   └── clinical_notes.csv
│
├── cache/
│   └── ICD-11_synonyms_embeddings.pt
│
├── outputs/
│   └── icd11_coded/
│
├── logs/
│   └── run.log
│
└── disease_coder.py
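A small sketch for creating the folders in this layout with `pathlib`. The `scaffold` helper is hypothetical, and the files themselves (notes, cached embeddings, logs) are left for you or the tool to produce.

```python
from pathlib import Path


def scaffold(root: str = "project") -> list[Path]:
    """Create the data/, cache/, outputs/icd11_coded/ and logs/ folders under root."""
    folders = ["data", "cache", "outputs/icd11_coded", "logs"]
    created = []
    for folder in folders:
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


for path in scaffold():
    print(path)
```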
🧾 Citation
If you use this tool in your research, please cite the PetHarbor and PetTag projects:
@misc{pettag2025,
  author = {Farrell, Sean},
  title = {PetHarbor: Veterinary Language Models for Structured Health Record Coding},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/sean-farrell/petharbor}
}
❤️ Acknowledgements
This package is part of the PetTag / PetHarbor ecosystem, a suite of NLP tools for large-scale veterinary EHR data analysis.
🐾 License
This project is licensed under the MIT License — see the LICENSE file for details.
File details
Details for the file pettag-0.0.12.tar.gz.
File metadata
- Download URL: pettag-0.0.12.tar.gz
- Upload date:
- Size: 20.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3c7afa2ccc6890df95427c0f51711c365bbf566ff88e5ae6e38891af41f0935d |
| MD5 | 9d518e1cd38f69ef68f078ecff3f76e5 |
| BLAKE2b-256 | 0f5cf465033603664167241b7e3f4cb14ac8578d18e3c5c009db1857fd66264c |
File details
Details for the file pettag-0.0.12-py3-none-any.whl.
File metadata
- Download URL: pettag-0.0.12-py3-none-any.whl
- Upload date:
- Size: 21.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6218730541142a3f641dd6f9515983926bc0c88f693748d8bbcc5b079ad77fb2 |
| MD5 | 454edbc51d9f4a7a0937f3ae16bd2bde |
| BLAKE2b-256 | 134702afd40467d137bf17d60d1e5cf085145cb974e7e94d1e731bc328df0794 |