SMILES-based Matryoshka Representation Learning Embedding Model

Maintainers

ecortes

CHEM-MRL

Chem-MRL is a SMILES embedding transformer model that leverages Matryoshka Representation Learning (MRL) to generate efficient, truncatable embeddings for downstream tasks such as classification, clustering, and database querying.
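As a rough illustration of what "truncatable" means here, the sketch below keeps only the leading dimensions of a vector and rescales it to unit length before comparing. The vectors are made-up numbers, not model output, and the helper names truncate_embedding and cosine are hypothetical and not part of chem-mrl.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 8-dimensional "embeddings" (illustrative values only)
emb_a = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.03]
emb_b = [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.02, 0.01]

full_sim = cosine(emb_a, emb_b)
short_sim = cosine(truncate_embedding(emb_a, 4), truncate_embedding(emb_b, 4))
print(f"full: {full_sim:.3f}, truncated to 4 dims: {short_sim:.3f}")
```

With an MRL-trained model the leading dimensions are trained to carry most of the signal, so the truncated similarity is expected to stay close to the full one while the vectors take less storage and compare faster.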

The dataset (split 75%/15%/10% into train/validation/test) consists of SMILES pairs labeled with the Tanimoto similarity of their Morgan fingerprints (8192-bit vectors). The model uses SentenceTransformers' (SBERT) 2D Matryoshka Sentence Embeddings loss (Matryoshka2dLoss) to produce embeddings that can be truncated with minimal accuracy loss, which improves query performance in downstream applications.
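For reference, the Tanimoto similarity of two bit-vector fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal sketch, assuming each fingerprint is given as a collection of "on" bit indices; the helper name tanimoto and the example bits are illustrative, and generating real Morgan fingerprints (for example with a cheminformatics toolkit) is omitted:

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


# Made-up bit indices standing in for two 8192-bit fingerprints
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}

print(tanimoto(fp_a, fp_b))  # 3 shared bits / 6 distinct bits = 0.5
```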

Hyperparameter tuning indicates that a custom Tanimoto similarity loss function, based on CoSENTLoss, outperforms the other losses evaluated: Tanimoto similarity, CoSENTLoss, AnglELoss, and cosine similarity.
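The package's own loss is not reproduced here. As a loosely related sketch, the code below shows a CoSENT-style ranking objective in which the predicted pair similarity is a continuous Tanimoto similarity rather than cosine. Whether chem-mrl combines the two in exactly this way is an assumption; the function names, the scale value, and the toy vectors are all hypothetical.

```python
import math
from itertools import permutations


def tanimoto_similarity(u, v):
    """Continuous Tanimoto similarity: u.v / (|u|^2 + |v|^2 - u.v)."""
    dot = sum(x * y for x, y in zip(u, v))
    denom = sum(x * x for x in u) + sum(y * y for y in v) - dot
    return dot / denom if denom else 0.0


def cosent_style_loss(pred_sims, labels, scale=20.0):
    """log(1 + sum exp(scale * (s_j - s_i))) over pairs with label_i > label_j."""
    terms = [0.0]  # exp(0) supplies the "1 +" inside the logarithm
    for i, j in permutations(range(len(labels)), 2):
        if labels[i] > labels[j]:
            terms.append(scale * (pred_sims[j] - pred_sims[i]))
    peak = max(terms)  # log-sum-exp with the maximum factored out for stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Toy embedding pairs (illustrative numbers) and their target similarities
pairs = [
    ([1.0, 0.2, 0.1], [0.9, 0.3, 0.1]),  # near-duplicates
    ([1.0, 0.0, 0.0], [0.1, 0.9, 0.4]),  # dissimilar
]
labels = [0.85, 0.10]

pred = [tanimoto_similarity(u, v) for u, v in pairs]
print(f"loss: {cosent_style_loss(pred, labels):.6f}")
```

The objective only penalizes pairs whose predicted ordering disagrees with the label ordering, which is why it is small when the near-duplicate pair scores higher than the dissimilar one.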

Installation

pip install chem-mrl

Usage

Hydra & Training Scripts

Hydra configuration files are in chem_mrl/conf. The base config defines shared arguments, while model-specific configs are located in chem_mrl/conf/model. Use chem_mrl_config.yaml or classifier_config.yaml to run specific models.

The scripts directory provides training scripts with Hydra for parameter management:

Train Chem-MRL model:

python scripts/train_chem_mrl.py train_dataset_path=/path/to/training.parquet val_dataset_path=/path/to/val.parquet

Train a linear classifier:

python scripts/train_classifier.py train_dataset_path=/path/to/training.parquet val_dataset_path=/path/to/val.parquet

Basic Training Workflow

To train a model, initialize the configuration with dataset paths and model parameters, then pass it to ChemMRLTrainer for training.

# BaseConfig is assumed to be exported by chem_mrl.schemas alongside ChemMRLConfig
from chem_mrl.schemas import BaseConfig, ChemMRLConfig
from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.trainers import ChemMRLTrainer

# Define training configuration
config = BaseConfig(
    model=ChemMRLConfig(
        # Predefined model name; can be any transformer model name or path
        # that is compatible with sentence-transformers
        model_name=BASE_MODEL_NAME,
        smiles_a_column_name="smiles_a",  # Column with first molecule SMILES representation
        smiles_b_column_name="smiles_b",  # Column with second molecule SMILES representation
        label_column_name="similarity",  # Similarity score between molecules
        n_dims_per_step=3,  # Model-specific hyperparameter
        use_2d_matryoshka=True,  # Enable 2d MRL
        # Additional parameters specific to 2D MRL models
        n_layers_per_step=2,
        kl_div_weight=0.7,  # Weight for KL divergence regularization
        kl_temperature=0.5,  # Temperature parameter for KL loss
    ),
    train_dataset_path="train.parquet",  # Path to training data
    val_dataset_path="val.parquet",  # Path to validation data
    test_dataset_path="test.parquet",  # Optional test dataset
)

# Initialize trainer and start training
trainer = ChemMRLTrainer(config)
# Returns the test evaluation metric if a test dataset is configured,
# otherwise the final validation evaluation metric
test_eval_metric = trainer.train()
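The configuration above reads Parquet files whose columns match smiles_a_column_name, smiles_b_column_name, and label_column_name. A minimal sketch of preparing such a file follows; the helper names, the example molecules, and the similarity values are placeholders, and writing the file assumes pandas with a Parquet engine (pyarrow or fastparquet) is installed.

```python
def build_pair_rows(pairs):
    """Turn (smiles_a, smiles_b, similarity) tuples into column-named rows."""
    rows = []
    for smiles_a, smiles_b, similarity in pairs:
        if not 0.0 <= similarity <= 1.0:
            raise ValueError(f"similarity out of range: {similarity}")
        rows.append(
            {"smiles_a": smiles_a, "smiles_b": smiles_b, "similarity": similarity}
        )
    return rows


def write_parquet(rows, path):
    """Write rows to Parquet; pandas is imported lazily as an optional dependency."""
    import pandas as pd

    pd.DataFrame(rows).to_parquet(path, index=False)


# Placeholder pairs; the similarity values here are invented for illustration
rows = build_pair_rows(
    [
        ("CCO", "CCN", 0.25),
        ("c1ccccc1", "Cc1ccccc1", 0.40),
    ]
)
print(rows[0])
# write_parquet(rows, "train.parquet")  # uncomment once pandas and pyarrow are available
```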

Custom Evaluation Callbacks

You can provide a callback function that is executed every evaluation_steps steps, allowing custom logic such as logging, early stopping, or model checkpointing.

# BaseConfig is assumed to be exported by chem_mrl.schemas alongside ChemMRLConfig
from chem_mrl.schemas import BaseConfig, ChemMRLConfig
from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.trainers import ChemMRLTrainer


# Define a callback function for logging evaluation metrics
def eval_callback(score: float, epoch: int, steps: int):
    print(f"Step {steps}, Epoch {epoch}: Evaluation Score = {score}")


config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
        smiles_a_column_name="smiles_a",
        smiles_b_column_name="smiles_b",
        label_column_name="similarity",
    ),
    train_dataset_path="train.parquet",
    val_dataset_path="val.parquet",
)

# Train with callback
trainer = ChemMRLTrainer(config)
# The callback is executed every `evaluation_steps` steps
val_eval_metric = trainer.train(eval_callback=eval_callback)
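A callback can also carry state between evaluations. The sketch below is a hypothetical callable class, not part of chem-mrl, that keeps the same (score, epoch, steps) signature, remembers the best score, and raises a flag after several evaluations without improvement. How to actually halt the trainer from inside a callback is not covered by the text above, so the flag is only recorded.

```python
class PatienceTracker:
    """Callable callback that tracks the best score and a patience counter."""

    def __init__(self, patience: int = 3, higher_is_better: bool = True):
        self.patience = patience
        self.higher_is_better = higher_is_better
        self.best_score = None
        self.best_steps = None
        self.stale_evals = 0
        self.history = []

    def __call__(self, score: float, epoch: int, steps: int) -> None:
        self.history.append((steps, epoch, score))
        if self.best_score is None:
            improved = True
        elif self.higher_is_better:
            improved = score > self.best_score
        else:
            improved = score < self.best_score
        if improved:
            self.best_score, self.best_steps = score, steps
            self.stale_evals = 0
        else:
            self.stale_evals += 1

    @property
    def should_stop(self) -> bool:
        """True once `patience` consecutive evaluations failed to improve."""
        return self.stale_evals >= self.patience


# Simulated evaluation scores, standing in for what a trainer would report
tracker = PatienceTracker(patience=2)
for step, score in [(1000, 0.70), (2000, 0.74), (3000, 0.73), (4000, 0.72)]:
    tracker(score, epoch=1, steps=step)

print(tracker.best_score, tracker.best_steps, tracker.should_stop)
```

An instance would be passed in the same way as the function above, for example trainer.train(eval_callback=tracker).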

W&B Integration

This library includes a WandBTrainerExecutor class for Weights & Biases (W&B) integration. It handles authentication, initialization, and logging at the frequency specified by evaluation_steps, so that runs can be tracked and model performance visualized and monitored.

# BaseConfig and WandbConfig are assumed to be exported by chem_mrl.schemas
from chem_mrl.schemas import BaseConfig, ChemMRLConfig, WandbConfig
from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.trainers import ChemMRLTrainer, WandBTrainerExecutor

# Define W&B configuration for experiment tracking
wandb_config = WandbConfig(
    project_name="chem_mrl_test",  # W&B project name
    run_name="test",  # Name for the experiment run
    use_watch=True,  # Enables model watching for tracking gradients
    watch_log="all",  # Logs all model parameters and gradients
    watch_log_freq=1000,  # Logging frequency
    watch_log_graph=True,  # Logs model computation graph
)

# Configure training with W&B integration
config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
        smiles_a_column_name="smiles_a",
        smiles_b_column_name="smiles_b",
        label_column_name="similarity",
    ),
    train_dataset_path="train.parquet",
    val_dataset_path="val.parquet",
    evaluation_steps=1000,
    use_wandb=True,  # Enables W&B logging
    wandb_config=wandb_config,
)

# Initialize trainer and W&B executor
trainer = ChemMRLTrainer(config)
executor = WandBTrainerExecutor(trainer)
executor.execute()  # Handles training and W&B logging

Classifier

This repository includes code for training a linear classifier with optional dropout regularization. The classifier categorizes substances based on SMILES and category features. While demonstrated on the Isomer Design dataset, it generalizes to any dataset containing smiles and label columns. The training scripts (see Hydra & Training Scripts) allow users to specify these column names.

Currently, the dataset must be in Parquet format.

Hyperparameter tuning shows that cross-entropy loss (softmax option) outperforms self-adjusting dice loss in terms of accuracy, making it the preferred choice for molecular property classification.

Usage

Basic Classification Training

To train a classifier, configure the model with dataset paths and column names, then initialize ClassifierTrainer to start training.

# BaseConfig is assumed to be exported by chem_mrl.schemas alongside ClassifierConfig
from chem_mrl.schemas import BaseConfig, ClassifierConfig
from chem_mrl.trainers import ClassifierTrainer

# Define classification training configuration
config = BaseConfig(
    model=ClassifierConfig(
        model_name="path/to/trained_mrl_model",  # Pretrained MRL model path
        smiles_column_name="smiles",  # Column containing SMILES representations of molecules
        label_column_name="label",  # Column containing classification labels
    ),
    train_dataset_path="train_classification.parquet",  # Path to training dataset
    val_dataset_path="val_classification.parquet",  # Path to validation dataset
)

# Initialize and train the classifier
trainer = ClassifierTrainer(config)
trainer.train()

Training with Dice Loss

For imbalanced classification tasks, Dice Loss can improve performance by focusing on hard-to-classify samples. Below is a configuration using DiceLossClassifierConfig, which introduces additional hyperparameters.

# BaseConfig is assumed to be exported by chem_mrl.schemas alongside DiceLossClassifierConfig
from chem_mrl.schemas import BaseConfig, DiceLossClassifierConfig
from chem_mrl.trainers import ClassifierTrainer

# Define classification training configuration with Dice Loss
config = BaseConfig(
    model=DiceLossClassifierConfig(
        model_name="path/to/trained_mrl_model",
        smiles_column_name="smiles",
        label_column_name="label",
        dice_reduction="sum",  # Reduction method for Dice Loss (e.g., 'mean' or 'sum')
        dice_gamma=1.0,  # Dice loss hyperparameter
    ),
    train_dataset_path="train_classification.parquet",  # Path to training dataset
    val_dataset_path="val_classification.parquet",  # Path to validation dataset
)

# Initialize and train the classifier with Dice Loss
trainer = ClassifierTrainer(config)
trainer.train()
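To make the two loss options concrete, the sketch below computes per-sample cross-entropy and a self-adjusting dice loss in the form given by Li et al. (see References). It is a standalone illustration and not the package's implementation: the function names and the alpha parameter are hypothetical, and mapping dice_gamma and dice_reduction onto gamma and reduction here is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(logits)[target])


def self_adjusting_dice(logits, target, alpha=1.0, gamma=1.0):
    """1 - (2 * w + gamma) / (w + 1 + gamma), with w = (1 - p)^alpha * p."""
    p = softmax(logits)[target]
    weighted = ((1.0 - p) ** alpha) * p  # down-weights easy, confident samples
    return 1.0 - (2.0 * weighted + gamma) / (weighted + 1.0 + gamma)


def reduce_losses(losses, reduction="mean"):
    """Combine per-sample losses with 'mean' or 'sum'."""
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction}")


# Toy batch: (logits, true class index); values are illustrative
batch = [([2.0, 0.5, -1.0], 0), ([0.1, 0.2, 0.0], 2)]
ce = reduce_losses([cross_entropy(l, t) for l, t in batch], "mean")
dice = reduce_losses([self_adjusting_dice(l, t) for l, t in batch], "sum")
print(f"cross-entropy (mean): {ce:.4f}, dice (sum): {dice:.4f}")
```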

References:

Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv [cs.LG], 2020.
Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." arXiv [cs.LG], 2022.
Kusupati, Aditya, et al. "Matryoshka Representation Learning." arXiv [cs.LG], 2022.
Li, Xianming, et al. "2D Matryoshka Sentence Embeddings." arXiv [cs.CL], 2024.
Bajusz, Dávid, et al. "Why is the Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations?" J Cheminform, 7, 20 (2015).
Li, Xiaoya, et al. "Dice Loss for Data-imbalanced NLP Tasks." arXiv [cs.CL], 2020.

Release history

Version	Release date
0.8.3	Dec 14, 2025
0.8.2	Nov 16, 2025
0.8.1	Nov 3, 2025
0.8.0	Oct 27, 2025
0.7.3	Aug 19, 2025
0.7.2	Jul 25, 2025
0.7.1	Jul 24, 2025
0.7.0	Jul 22, 2025
0.6.3	Jun 11, 2025
0.6.2	Jun 5, 2025
0.6.1	Jun 4, 2025
0.6.0	Jun 3, 2025
0.5.9	May 30, 2025
0.5.8	Mar 29, 2025
0.5.7	Mar 29, 2025
0.5.6	Feb 28, 2025
0.5.5	Feb 26, 2025
0.5.4	Feb 24, 2025
0.5.3	Feb 14, 2025
0.5.2	Feb 14, 2025
0.5.1	Feb 13, 2025
0.5.0	Feb 9, 2025
0.4.1	Feb 7, 2025
0.4.0	Feb 6, 2025 (this version)
0.3.7	Feb 4, 2025
0.3.6	Feb 3, 2025
0.3.5	Feb 3, 2025
0.3.4	Feb 3, 2025
0.3.3	Feb 2, 2025
0.3.2	Feb 2, 2025

Download files

Source Distribution

chem_mrl-0.4.0.tar.gz (391.8 kB)

Uploaded Feb 6, 2025 Source

Built Distribution

chem_mrl-0.4.0-py3-none-any.whl (57.9 kB)

Uploaded Feb 6, 2025 Python 3

File details

Details for the file chem_mrl-0.4.0.tar.gz.

File metadata

Download URL: chem_mrl-0.4.0.tar.gz
Upload date: Feb 6, 2025
Size: 391.8 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for chem_mrl-0.4.0.tar.gz
Algorithm	Hash digest
SHA256	`c8a28e8bfc256a2919c5aa1a09ad45e336beeb2d3f7142aa5bd6ee2f35c4ebf8`
MD5	`63eb27f1b55513c5192f162144b12997`
BLAKE2b-256	`74c74ce44656f171f8bb0162cdccb64d5d772f94b4fe5286e3e086ef006fc4d5`
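
A downloaded archive can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch: the helper names are illustrative, and the local file path in the usage comment is an assumption about where the archive was saved.

```python
import hashlib

# SHA256 digest listed above for chem_mrl-0.4.0.tar.gz
EXPECTED_SDIST_SHA256 = (
    "c8a28e8bfc256a2919c5aa1a09ad45e336beeb2d3f7142aa5bd6ee2f35c4ebf8"
)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the archive sits in the current directory:
# print(matches("chem_mrl-0.4.0.tar.gz", EXPECTED_SDIST_SHA256))
```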

Provenance

The following attestation bundles were made for chem_mrl-0.4.0.tar.gz:

Publisher: release.yml on emapco/chem-mrl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chem_mrl-0.4.0.tar.gz
- Subject digest: c8a28e8bfc256a2919c5aa1a09ad45e336beeb2d3f7142aa5bd6ee2f35c4ebf8
- Sigstore transparency entry: 169230376
- Sigstore integration time: Feb 6, 2025
Source repository:
- Permalink: emapco/chem-mrl@699d565aeb855ab6bccabd9fe527fc005460c01c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/emapco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@699d565aeb855ab6bccabd9fe527fc005460c01c
- Trigger Event: push

File details

Details for the file chem_mrl-0.4.0-py3-none-any.whl.

File metadata

Download URL: chem_mrl-0.4.0-py3-none-any.whl
Upload date: Feb 6, 2025
Size: 57.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.12.8

File hashes

Hashes for chem_mrl-0.4.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`1820c74d394087640d67129cd08d7230fcfd0fafc76773ade03a43d9b3d89c3b`
MD5	`7da2fbacc712f2214caccd01cfbe9593`
BLAKE2b-256	`5062f21639935870ea63efddd3cd6e7428a0f6ff57149ac25d0685ce93a780ba`

Provenance

The following attestation bundles were made for chem_mrl-0.4.0-py3-none-any.whl:

Publisher: release.yml on emapco/chem-mrl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chem_mrl-0.4.0-py3-none-any.whl
- Subject digest: 1820c74d394087640d67129cd08d7230fcfd0fafc76773ade03a43d9b3d89c3b
- Sigstore transparency entry: 169230377
- Sigstore integration time: Feb 6, 2025
Source repository:
- Permalink: emapco/chem-mrl@699d565aeb855ab6bccabd9fe527fc005460c01c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/emapco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@699d565aeb855ab6bccabd9fe527fc005460c01c
- Trigger Event: push

chem-mrl 0.4.0
