Skip to main content

A Self-Supervised Learning Library

Project description

MK_SSL Logo

MK_SSL: A Modular Self-Supervised Learning Library for Audio, Vision, Graph, and Cross-Modal Data

A research-driven library with high-level APIs, tightly integrated with HuggingFace, PyTorch Lightning, and state-of-the-art tools for self-supervised learning.


📚 Table of Contents


📍 Overview

Say hello to MK_SSL — a library born from late-night debugging sessions, too much coffee, and the realization that self-supervised learning didn’t need to feel like solving a Rubik’s cube in the dark. In our research, we bounced between half-finished repos, clashing APIs, and “it worked on my machine” moments. Out of that chaos, we decided to build something cleaner: one place where SSL across audio, vision, graph, and cross-modal data actually makes sense.

At its core, MK_SSL is a unified playground for SSL. Imagine a command center where you can test state-of-the-art methods, swap modalities with a single line change, and still keep your sanity intact. Everything is modular, transparent, and reproducible — because science should be fun, not frustrating.

We also wanted MK_SSL to be welcoming. Whether you’re a student curious about representation learning, a researcher hunting for benchmarks, or a practitioner putting SSL into production, this library has your back. With HuggingFace and PyTorch Lightning baked in, plus support for distributed training, hyperparameter tuning, and lightweight fine-tuning, you’ll spend less time wrestling with setup and more time exploring ideas.

In short: MK_SSL is where rigor meets playfulness. Built from academic struggles but polished for the community, it lowers the barriers to SSL while giving you the tools to push the boundaries further. It also represents an improved version of an earlier research project, AK_SSL, developed by two previous students. That library contained the implementation of some other ssl methods, and the good news is: everything from AK_SSL is now accessible directly from MK_SSL with the same syntax. If you’d like to read more about AK_SSL or see the original methods, check the link above — but for practical use, everything has been consolidated here into one unified framework.


🧠 What is Self-Supervised Learning?

Self-Supervised Learning (SSL) is basically the art of teaching machines to make up their own homework and then solve it. Instead of us spoon-feeding models with expensive, hand-labeled data, SSL lets them invent clever tasks using only the raw input. Mask part of an audio signal and predict it? Shuffle an image and put it back together? Align speech with text? All of these are ways for models to get smarter without needing humans to sit down and annotate millions of examples.

From an academic angle, SSL has become a game-changer. It powers breakthroughs in speech recognition for low-resource languages, revolutionizes medical imaging where labels are scarce, and even helps scientists model molecules and proteins. At the same time, it’s the secret sauce behind today’s most powerful foundation models — making it both theoretically fascinating and practically indispensable.

But SSL isn’t just serious science — it’s also a bit of fun. There’s something delightful about watching a model reconstruct missing audio or fill in the gaps of an image, almost like it’s playing puzzles at scale. That blend of rigor and playfulness is exactly why we built MK_SSL: to give you a sandbox where curiosity, research, and real-world applications all come together.


🚀 Supported Methods

🎧 Audio-based Methods

Self-supervised audio modeling has transformed speech processing by enabling models to generalize from unlabeled sound. MK_SSL includes all the major paradigms, each capturing a different angle of how machines can learn to understand sound.

Wav2Vec2

Wav2Vec2 masks segments of raw audio and predicts them using latent features. The clever trick is that it forces the model to capture contextual information in speech without needing phonetic labels. This method has shown that even with minimal annotated data, models can reach near state-of-the-art performance in automatic speech recognition. It is especially impactful for languages and domains where labeled datasets are scarce.

HuBERT

HuBERT (Hidden-Unit BERT) takes the Wav2Vec2 philosophy further. It introduces pseudo-labeling through k-means clustering of hidden representations and uses those as targets for a BERT-like masked prediction. This iterative process of clustering and prediction refines the model over time, resulting in more robust and generalizable embeddings that can transfer effectively to multiple downstream tasks.

SpeechSimCLR

SpeechSimCLR adapts the contrastive learning approach SimCLR from vision to the audio domain. By applying augmentations such as time warping, noise injection, and speed perturbation, it teaches models to bring augmented versions of the same audio close together in representation space. This results in representations that are robust to noise and variations, and useful for speaker verification, classification, and general audio understanding.

COLA

COLA (Contrastive Learning with Alignment) emphasizes the temporal aspect of speech. Instead of treating audio as independent segments, it enforces alignment such that temporally close segments are nearby in the embedding space, while distant segments are pushed apart. This design makes embeddings more faithful to the sequential nature of speech, aiding tasks like dialogue modeling and speech segmentation.

EAT

The Embedding Audio Transformer (EAT) introduces the concept of masked autoencoders into the audio domain. It converts audio into spectrogram patches, masks random sections, and trains the model to reconstruct them. This pushes the model to learn high-level acoustic structures and relationships, similar to how vision transformers learn about images. EAT is especially promising for music understanding and large-scale pretraining where context-rich embeddings matter.


🖼️ Vision-based Method

MAE (Masked AutoEncoder)

MAE is a vision SSL method that masks random patches of an image and reconstructs them. The beauty of MAE is that it does not require labels yet learns powerful visual representations by solving this reconstruction puzzle. It has proven highly effective as a pretraining approach, enabling models to perform well with fewer labels in transfer tasks like object classification, segmentation, and fine-grained recognition.


🧬 Graph-based Method

GraphCL

GraphCL applies contrastive learning to graph-structured data. It creates multiple augmented versions of the same graph through techniques such as edge perturbation, node dropping, and attribute masking, and then aligns their embeddings. By doing so, it captures structural invariances that are central to understanding graphs. This makes it valuable for applications such as molecular property prediction, biological network analysis, and social network embeddings.


🔀 Cross-Modal Methods

Cross-modal SSL allows models to bridge domains like text, audio, and images, which is crucial for multimodal AI systems.

CLAP

CLAP learns joint embeddings for paired audio and text data. It aligns sound with natural language, enabling models to perform cross-modal retrieval and semantic classification. This makes it possible to, for instance, search for sound effects by typing text queries, or build systems that understand both speech and textual descriptions.

AudioCLIP

AudioCLIP extends the CLIP architecture into the audio domain, aligning text, audio, and image together. This tri-modal alignment creates a rich shared embedding space that can be applied to multimedia search, generative AI, and multimodal classification tasks. It essentially gives models the ability to understand and connect three different modalities at once.

Wav2CLIP

Wav2CLIP simplifies the cross-modal problem by directly mapping raw audio into the pretrained CLIP embedding space. With frozen CLIP encoders guiding the training, it leverages the vast visual-text knowledge already baked into CLIP and transfers it to audio. This opens doors to creative tasks like audio-to-image retrieval and multimodal creative applications.


📦 Installation

pip install mk-ssl

Requirements:

  • Python ≥ 3.8
  • PyTorch ≥ 1.12
  • CUDA-enabled GPU recommended for large-scale training

🛠️ Usage Tutorial

With MK_SSL, you can go from raw data to results in minutes. The design philosophy is plug-and-play, letting you switch methods or modalities seamlessly.

🧩 Trainer Initialization (Audio Example)

from MK_SSL.audio.Trainer import Trainer

trainer = Trainer(
    method = 'wav2vec2',
    backbone = None,
    save_dir = './',
    wandb_project = 'wav2vec2-pretext',
    wandb_mode = "online",
    use_data_parallel = True,
    checkpoint_interval = 5,
    verbose = True,
    reload_checkpoint=False,
    mixed_precision_training=False
)

🎯 Train the Model

trainer.train(
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    batch_size=16,
    epochs=100,
    lr=1e-4,
    weight_decay=1e-2,
    optimizer="adamw",
    use_hpo=True,
    n_trials=20,
    tuning_epochs=5,
    use_embedding_logger=True,
    logger_loader=logger_loader
)

🧪 Evaluate on Downstream Task

trainer.evaluate(
    train_dataset=train_dataset,
    test_dataset=test_dataset,
    num_classes=39,
    batch_size=64,
    lr=1e-3,
    epochs=10,
    freeze_backbone=True
)

📊 Benchmarks

MK_SSL is designed for reproducible benchmarking across domains.

🎧 Audio (Wav2Vec2 - TESS Emotion Dataset)

Wav2Vec2 pretrained with MK_SSL.

libri_wav2vec2
timit_wav2vec2
vctk_wav2vec2

Task Dataset Model Accuracy
Emotion Clf Speaker Recognition (2 speakers) Speech SimCLR 72.5%
Emotion Clf TESS COLA 88.39%
Speaker Clf TESS EAT 93.21%

🔀 Cross-Modal (Wav2CLIP)

Wav2CLIP learns powerful joint embeddings, enabling intuitive cross-modal retrieval.

wav2clip_zero_shot
wav2clip_dog_prediction
wav2clip_cat_prediction
wav2clip_sim


🖼️ Vision (MAE on CIFAR-10)

MAE pretrained with MK_SSL yields competitive performance with limited fine-tuning.

Setting Accuracy
Linear Probing 61.84%
Fine-tuned 87.98%
MAE Result


🧬 Graph (GraphCL)

GraphCL learns molecular-level embeddings competitive with supervised baselines.

GraphCL BBBP

Dataset Accuracy AUC
BBBP 89.76% 92.62%
Tox21 task0: 96.61%
Tox21 task1: 97.25%
Tox21 task2: 87.28%
Tox21 task3: 91.39%
Tox21 task4: 86.73%
Tox21 task5: 96.30%
Tox21 task6: 96.11%
Tox21 task7: 76.65%
Tox21 task8: 94.61%
Tox21 task9: 91.71%
Tox21 task10: 83.11%
Tox21 task11: 88.78%
Tox21 12-task avg: 90.54%

🔧 Extra Superpowers

MK_SSL isn’t just a collection of SSL methods — it’s armed with extra superpowers that make your research life smoother, faster, and a lot more fun. Think of these as the cheat codes we always wished existed when we were wrestling with messy experiments:

  • 🖥️ Distributed Deep Learning (DDL) — Scale your experiments across multiple GPUs or nodes without needing to summon a cluster-wrangling wizard. Big models? Big data? Bring it on.
  • 🎯 Hyperparameter Optimization (HPO) — Stop playing guessing games. Automated tuning with Optuna helps you find the sweet spots without losing weeks of your life.
  • 🧠 LoRA Finetuning — Efficiently adapt giant models with lightweight parameter updates. It’s like upgrading your model’s brain without burning your GPU.
  • 📊 WandB Integration — Track, visualize, and share every training run like a pro. Who doesn’t love pretty dashboards?
  • 🧾 Logging System — Clean, colorful, and customizable logs that won’t make your terminal cry.
  • 🤗 HuggingFace Compatibility — Plug and play with transformers and pretrained backbones. Because reinventing the wheel is overrated.
  • 🎥 Dynamic Visualizations — Watch your embeddings evolve over time with animated plots. It’s science, but make it art.

In other words: MK_SSL doesn’t just help you run experiments — it helps you run better experiments, with less pain and more insight.


🧬 HuggingFace Example

from transformers import BertForPreTraining, AutoTokenizer
model = BertForPreTraining.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

trainer = GenericSSLTrainer(
    model=model,
    loss_fn=bert_loss_fn,
    dataloader=dataloader,
    optimizer_ctor=optimizer,
    epochs=10
)
trainer.fit()

🤝 Collaborators and Advisors

This project was made possible through our collaborative research and academic mentorship. The main contributors are:

Our combined efforts shaped the design, implementation, and structure of MK_SSL. The project was further enriched by the guidance of Dr. Peyman Adibi and Dr. Hossein Karshenas, whose academic mentorship ensured rigor and practical impact.


📜 License

We’re keeping things chill with the MIT License. In plain English: do whatever you want with this code — use it, remix it, build something wild on top of it. Just don’t sue us if your GPU explodes or your cat walks across your keyboard mid-training and somehow invents AGI. Fair game? Cool. 🚀

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mk_ssl-0.1.4.tar.gz (156.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

mk_ssl-0.1.4-py3-none-any.whl (199.6 kB view details)

Uploaded Python 3

File details

Details for the file mk_ssl-0.1.4.tar.gz.

File metadata

  • Download URL: mk_ssl-0.1.4.tar.gz
  • Upload date:
  • Size: 156.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for mk_ssl-0.1.4.tar.gz
Algorithm Hash digest
SHA256 671f844dccb2ef0387d5d42d031678683ca1e199306680f68324648d517254e6
MD5 bbfd48c366258aed08f1a958bd97515e
BLAKE2b-256 21e48776a99548f1485209db7b681828a60bd06837c63c35fee5a927f57dab09

See more details on using hashes here.

File details

Details for the file mk_ssl-0.1.4-py3-none-any.whl.

File metadata

  • Download URL: mk_ssl-0.1.4-py3-none-any.whl
  • Upload date:
  • Size: 199.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.11.11

File hashes

Hashes for mk_ssl-0.1.4-py3-none-any.whl
Algorithm Hash digest
SHA256 77568cbcf4113fd223b083404642dbb96c586603fa55ab241f8bdcf068c9028e
MD5 c54dbc23add755504bc0c90b3528fc1d
BLAKE2b-256 93c238f9e20bc47cae49552004b40fe33c32a86d5cf1b14a8b6332067cd07c26

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page