Generate vector embeddings from OWL ontologies using Graph Neural Networks with HuggingFace integration and MTEB benchmarking
on2vec
A toolkit for generating vector embeddings from OWL ontologies using Graph Neural Networks (GNNs), with HuggingFace Sentence Transformers integration and MTEB benchmarking.
🚀 Quick Start
Installation
pip install on2vec
Create production-ready Sentence Transformers models with ontology knowledge in one command:
# Complete end-to-end workflow
on2vec hf biomedical.owl my-biomedical-model
Use like any sentence transformer:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('./hf_models/my-biomedical-model')
embeddings = model.encode(['heart disease', 'cardiovascular problems'])
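A common way to compare the two encoded phrases is cosine similarity. The sketch below is a plain-Python illustration and not part of on2vec; the `cosine_similarity` helper and the toy vectors are assumptions made for the example. With real model output you would pass `embeddings[0]` and `embeddings[1]` instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors; avoids division by zero
    return dot / (norm_a * norm_b)

# Toy stand-ins for two rows of model.encode(...) output (hypothetical values).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]
similarity = cosine_similarity(v1, v2)
```

Values close to 1 indicate that the model places the two phrases near each other in embedding space.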
📋 Table of Contents
- 🚀 Quick Start
- 📥 Installation
- 🤗 HuggingFace Integration
- 🧪 MTEB Benchmarking
- 💻 Core on2vec Usage
- 🏗️ Architecture
- 📚 Documentation
📥 Installation
From PyPI (Recommended)
# Basic installation
pip install on2vec
# With MTEB benchmarking support
pip install "on2vec[benchmark]"
# With all optional dependencies
pip install "on2vec[all]"
From Source
git clone <repository-url>
cd on2vec
pip install -e .
Dependencies
- Python >= 3.10
- PyTorch + torch-geometric
- owlready2, sentence-transformers
- polars, matplotlib, umap-learn
🤗 HuggingFace Integration
One-Command Model Creation
# Create complete model with auto-generated documentation
on2vec hf ontology.owl model-name
# With custom settings
on2vec hf ontology.owl model-name \
--base-model all-mpnet-base-v2 \
--fusion gated \
--epochs 200
Step-by-Step Workflow
# 1. Train ontology embeddings
on2vec hf-train ontology.owl --output embeddings.parquet
# 2. Create HuggingFace model (auto-detects base model)
on2vec hf-create embeddings.parquet model-name
# 3. Test model functionality
on2vec hf-test ./hf_models/model-name
# 4. Inspect model details
on2vec inspect ./hf_models/model-name
Batch Processing
# Process multiple ontologies
on2vec hf-batch owl_files/ ./output \
--base-models all-MiniLM-L6-v2 all-mpnet-base-v2 \
--fusion-methods concat gated \
--max-workers 4
Features
- ✅ Auto-generated model cards with comprehensive metadata
- ✅ Smart base model detection from embeddings
- ✅ Upload instructions and HuggingFace Hub preparation
- ✅ Domain detection and appropriate tagging
- ✅ Multiple fusion methods: concat, attention, gated, weighted_avg
- ✅ Batch processing for multiple ontologies
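To give a rough idea of what the fusion methods listed above do, here is a minimal plain-Python sketch of three of them (concat, weighted_avg, gated). It is not on2vec's implementation: the function names, the fixed gate logits, and the equal-dimension requirement are assumptions for illustration, and in a trained model the gate would come from learned parameters. Attention fusion is omitted for brevity.

```python
import math

def fuse_concat(text_vec, struct_vec):
    """Concatenate: output dimension is len(text_vec) + len(struct_vec)."""
    return list(text_vec) + list(struct_vec)

def fuse_weighted_avg(text_vec, struct_vec, text_weight=0.5):
    """Convex combination; both vectors must share one dimension."""
    if len(text_vec) != len(struct_vec):
        raise ValueError("weighted average needs equal dimensions")
    w = text_weight
    return [w * t + (1.0 - w) * s for t, s in zip(text_vec, struct_vec)]

def fuse_gated(text_vec, struct_vec, gate_logits):
    """Per-dimension gate g = sigmoid(logit); output is g*text + (1-g)*struct.

    The logits are passed in directly here (an assumption for illustration);
    a trained fusion layer would compute them from learned weights.
    """
    if not (len(text_vec) == len(struct_vec) == len(gate_logits)):
        raise ValueError("gated fusion needs equal dimensions")
    gates = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    return [g * t + (1.0 - g) * s for g, t, s in zip(gates, text_vec, struct_vec)]

# Hypothetical 2-dimensional text and structural embeddings.
text = [1.0, 0.0]
struct = [0.0, 1.0]
concat = fuse_concat(text, struct)               # 4 dimensions
averaged = fuse_weighted_avg(text, struct, 0.5)  # 2 dimensions
gated = fuse_gated(text, struct, [0.0, 0.0])     # logit 0 gives gate 0.5
```

Concatenation keeps both signals intact at the cost of a larger output dimension, while the averaging and gated variants keep the dimension fixed but require the two embeddings to be projected to the same size first.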
🧪 MTEB Benchmarking
Evaluate your models against the Massive Text Embedding Benchmark:
Quick Benchmark
# Fast evaluation on subset of tasks
on2vec benchmark ./hf_models/my-model --quick
# Focus on specific task types
on2vec benchmark ./hf_models/my-model --task-types STS Classification
# Full MTEB benchmark
on2vec benchmark ./hf_models/my-model
Compare Models
# Benchmark vanilla baseline
on2vec benchmark sentence-transformers/all-MiniLM-L6-v2 \
--model-name vanilla-baseline --quick
# Compare ontology vs vanilla models
on2vec compare ./hf_models/my-model --detailed
Features
- ✅ Full MTEB integration with 58+ evaluation tasks
- ✅ Task filtering by category (STS, Classification, Clustering, etc.)
- ✅ Automated reporting with JSON summaries and markdown reports
- ✅ Resource management with configurable batch sizes
- ✅ Comparison tools for baseline evaluation
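For context on what an STS score measures: STS-style tasks are typically scored by the Spearman rank correlation between the model's cosine similarities and human similarity ratings. The sketch below computes that correlation in plain Python. The helper names are hypothetical and the numbers are toy placeholders, not benchmark results; MTEB itself handles this internally.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy placeholders: model cosine similarities vs. human ratings per pair.
predicted = [0.9, 0.2, 0.5, 0.7]
gold = [4.5, 1.0, 2.0, 3.0]
score = spearman(predicted, gold)
```

Because only the ranking matters, a model scores well when it orders sentence pairs the same way human annotators do, regardless of the absolute similarity values.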
💻 Core on2vec Usage
Basic Training
# Train GCN model
on2vec train ontology.owl --output model.pt --model-type gcn --epochs 100
# Train with text features (for HuggingFace integration)
on2vec train ontology.owl --output embeddings.parquet --use-text-features
# Train a multi-relation model using all ObjectProperties
on2vec train ontology.owl --output model.pt --use-multi-relation --model-type rgcn
Generate Embeddings
# Generate embeddings from trained model
on2vec embed model.pt ontology.owl --output embeddings.parquet
Visualization
# Create UMAP visualization
on2vec visualize embeddings.parquet --output visualization.png
Python API
from sentence_transformers import SentenceTransformer
from on2vec import train_ontology_embeddings, embed_ontology_with_model
# Train model
result = train_ontology_embeddings(
owl_file="ontology.owl",
model_output="model.pt",
model_type="gcn",
hidden_dim=128,
out_dim=64
)
# Generate embeddings
embeddings = embed_ontology_with_model(
model_path="model.pt",
owl_file="ontology.owl",
output_file="embeddings.parquet"
)
# Use HuggingFace model
model = SentenceTransformer('./hf_models/my-model')
vectors = model.encode(['concept 1', 'concept 2'])
🏗️ Architecture
Core Components
- Graph Construction: Converts OWL ontologies to graph representations
- GNN Training: Supports GCN, GAT, RGCN, and heterogeneous architectures
- Text Integration: Combines structural and semantic features using sentence transformers
- Fusion Methods: Multiple approaches to combine text + structural embeddings
- HuggingFace Bridge: Creates sentence-transformers compatible models
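The graph construction step can be pictured as turning subclass axioms into an edge list over integer node ids. The sketch below uses hand-written (child, parent) pairs instead of parsing an OWL file; the `build_graph` name and the class names are hypothetical, and on2vec's actual builder may differ. The two-row `edge_index` layout mirrors the COO format that torch-geometric expects.

```python
def build_graph(subclass_pairs):
    """Map class names to integer ids and return (node_ids, edge_index).

    edge_index is [sources, targets], two parallel lists holding one
    directed edge per (child, parent) pair.
    """
    node_ids = {}
    sources, targets = [], []
    for child, parent in subclass_pairs:
        for name in (child, parent):
            if name not in node_ids:
                node_ids[name] = len(node_ids)  # assign ids in first-seen order
        sources.append(node_ids[child])
        targets.append(node_ids[parent])
    return node_ids, [sources, targets]

# Hypothetical subclass axioms: (child, parent).
pairs = [
    ("HeartDisease", "CardiovascularDisease"),
    ("CardiovascularDisease", "Disease"),
    ("LungDisease", "Disease"),
]
node_ids, edge_index = build_graph(pairs)
```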
Model Pipeline
OWL Ontology → Graph → GNN Training → Structural Embeddings
                                              ↓
Text Features → Sentence Transformer → Text Embeddings
                                              ↓
                                 Fusion Layer → Final Model
                                              ↓
                              HuggingFace Model + Model Card
Supported Architectures
- GCN: Graph Convolutional Networks
- GAT: Graph Attention Networks
- RGCN: Relational GCN for multi-relation graphs
- Heterogeneous: Relation-specific layers with attention
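For intuition about what a GCN layer computes, the sketch below runs one propagation step in plain Python: each node aggregates its own and its neighbors' features with symmetric degree normalization, then applies a linear map and ReLU. This is an illustration only and not on2vec's code, which builds on PyTorch and torch-geometric; the `gcn_layer` name, the toy graph, and the fixed identity weight matrix are assumptions.

```python
import math

def gcn_layer(features, edges, weight):
    """One GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W) on an undirected graph."""
    n = len(features)
    # Undirected neighborhoods with self-loops (the "+ I" term).
    neighbors = [{i} for i in range(n)]
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = [len(nb) for nb in neighbors]
    in_dim = len(features[0])
    out_dim = len(weight[0])
    output = []
    for i in range(n):
        # Symmetrically normalized sum over the neighborhood.
        agg = [0.0] * in_dim
        for j in neighbors[i]:
            coeff = 1.0 / math.sqrt(degree[i] * degree[j])
            for d in range(in_dim):
                agg[d] += coeff * features[j][d]
        # Linear transform followed by ReLU.
        row = [
            max(0.0, sum(agg[d] * weight[d][o] for d in range(in_dim)))
            for o in range(out_dim)
        ]
        output.append(row)
    return output

# Toy path graph 0 - 1 - 2 with 2-dimensional node features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for a learned weight matrix
hidden = gcn_layer(features, edges, identity)
```

GAT replaces the fixed normalization coefficients with learned attention weights, and RGCN keeps a separate weight matrix per relation type.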
📚 Documentation
- 📚 CLI Quick Reference - All commands and examples
- 📖 HuggingFace Integration - Complete workflow guide
- 🧪 MTEB Benchmarking - Evaluation framework
- 🧬 Project Instructions - Development guidelines
🎯 Key Features
- 🤗 HuggingFace Ready: One-command model creation with professional documentation
- 🧪 MTEB Integration: Comprehensive benchmarking against standard tasks
- 📊 Rich Metadata: Auto-generated model cards with complete technical details
- 🔧 Smart Automation: Auto-detects base models, domains, and configurations
- ⚡ Batch Processing: Handle multiple ontologies efficiently
- 🎨 Multiple Fusion Methods: Flexible combination of text and structural features
- 📈 Comprehensive Evaluation: Built-in comparison and testing tools
🚀 Example Workflow
# 1. Install on2vec
pip install "on2vec[benchmark]"
# 2. Create a model from biomedical ontology
on2vec hf EDAM.owl edam-biomedical
# 3. Quick benchmark evaluation
on2vec benchmark ./hf_models/edam-biomedical --quick
# 4. Compare with vanilla models
on2vec compare ./hf_models/edam-biomedical --detailed
# 5. Inspect model details
on2vec inspect ./hf_models/edam-biomedical
# 6. Upload to HuggingFace Hub (instructions auto-generated)
# See ./hf_models/edam-biomedical/UPLOAD_INSTRUCTIONS.md
The resulting model loads like any other Sentence Transformers model and adds domain knowledge derived from the ontology.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Citation
If you use on2vec in your research, please cite:
@software{on2vec2025,
title={on2vec: Ontology Embeddings with Graph Neural Networks},
author={David Steinberg},
year={2025},
url={https://github.com/david4096/on2vec}
}