
SMILES-based Matryoshka Representation Learning Embedding Model


CHEM-MRL

Chem-MRL is a SMILES embedding transformer model that leverages Matryoshka Representation Learning (MRL) to generate efficient, truncatable embeddings for downstream tasks such as classification, clustering, and database querying.

The model employs SentenceTransformers' (SBERT) 2D Matryoshka Sentence Embeddings (Matryoshka2dLoss) to enable truncatable embeddings with minimal accuracy loss, improving query performance and flexibility in downstream applications.
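As a rough illustration of what "truncatable" means here, the sketch below keeps the first few dimensions of an embedding and renormalizes them before comparing. The vectors are made-up values rather than model output, and the helper names are hypothetical; it assumes only the usual Matryoshka convention that the leading dimensions carry most of the information.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit L2 norm."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 8-dimensional embeddings standing in for model output.
emb_a = [0.9, 0.1, 0.3, -0.2, 0.05, 0.02, -0.01, 0.03]
emb_b = [0.8, 0.2, 0.25, -0.1, -0.04, 0.01, 0.02, -0.02]

full_score = cosine(emb_a, emb_b)

# Compare again using only the first 4 dimensions.
short_a = truncate_and_normalize(emb_a, 4)
short_b = truncate_and_normalize(emb_b, 4)
short_score = cosine(short_a, short_b)

print(f"full: {full_score:.4f}  truncated: {short_score:.4f}")
```

With a trained model, sentence-transformers offers a `truncate_dim` argument on `SentenceTransformer` that serves the same purpose, so a manual helper like this is mainly useful for embeddings that are already stored.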

Datasets should consist of SMILES pairs and their corresponding Morgan fingerprint Tanimoto similarity scores.
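To make the expected row shape concrete, here is a minimal sketch of the Tanimoto calculation on fingerprint bit sets. The on-bit indices are invented for illustration (real ones would come from a Morgan fingerprint generator such as the one in RDKit), and the column names simply mirror the ones used in the training examples further down.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return len(bits_a & bits_b) / union


# Hypothetical on-bit indices; these are NOT real Morgan fingerprints.
fingerprints = {
    "CCO": {1, 5, 9, 12},
    "CCN": {1, 5, 9, 20},
    "c1ccccc1": {3, 7, 30},
}

pairs = [("CCO", "CCN"), ("CCO", "c1ccccc1")]

# One row per SMILES pair, labeled with the fingerprint similarity.
rows = [
    {
        "smiles_a": a,
        "smiles_b": b,
        "similarity": tanimoto(fingerprints[a], fingerprints[b]),
    }
    for a, b in pairs
]

for row in rows:
    print(row)
```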

Hyperparameter tuning indicates that TanimotoSentLoss, a custom Tanimoto similarity loss function based on CoSENTLoss, outperforms the other losses tested: Tanimoto similarity loss, CoSENTLoss, AnglELoss, and cosine similarity loss.
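The following is a sketch of a CoSENT-style ranking objective, not the library's TanimotoSentLoss itself. It assumes the custom loss keeps the CoSENT formulation and scores each embedding pair with a continuous Tanimoto similarity instead of cosine similarity; the function names and the `scale` value are illustrative.

```python
import math


def continuous_tanimoto(a, b):
    """Tanimoto similarity generalized to real-valued vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom


def cosent_style_loss(scores, labels, scale=20.0):
    """log(1 + sum of exp(scale * (s_i - s_j))) over pairs with label_i < label_j.

    A pair ordering is penalized when the lower-labeled pair receives the
    higher predicted score. Numerically naive; a real implementation would
    use a log-sum-exp routine.
    """
    total = 1.0
    for s_i, y_i in zip(scores, labels):
        for s_j, y_j in zip(scores, labels):
            if y_i < y_j:
                total += math.exp(scale * (s_i - s_j))
    return math.log(total)


# On binary vectors the continuous form matches the bit-set definition.
score = continuous_tanimoto([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])

# Hypothetical predicted scores for three pairs with target similarities.
labels = [0.8, 0.4, 0.2]
well_ordered = cosent_style_loss([0.9, 0.5, 0.1], labels)
mis_ordered = cosent_style_loss([0.1, 0.5, 0.9], labels)

print(score, well_ordered, mis_ordered)
```

The loss depends only on the relative ordering of the predicted scores, which is why it pairs naturally with similarity labels such as Tanimoto scores.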

v0.6.0 Release

The Chem-MRL library is now built on the sentence-transformers v4.0.0 API and uses the datasets library to load data from local files or the Hugging Face Hub. The sentence-transformers library supports checkpoint resuming and extends transformers.Trainer and transformers.TrainingArguments for enhanced training capabilities. When using the Hydra-enabled training scripts, modify the training arguments under training_args in chem_mrl/conf/base.yaml.

The PubChem 10M GenMol Fingerprint Similarity Dataset is available on Hugging Face Hub for training Chem-MRL models.

Installation

Install with pip

pip install chem-mrl

Install from source code

pip install -e .

Usage

Hydra & Training Scripts

Hydra configuration files are in chem_mrl/conf. The base config defines shared arguments, while model-specific configs are located in chem_mrl/conf/model. Use chem_mrl_config.yaml or classifier_config.yaml to run specific models.

The scripts directory provides training scripts with Hydra for parameter management:

  • Train Chem-MRL model:
    python scripts/train_chem_mrl.py train_dataset_path=/path/to/training.parquet val_dataset_path=/path/to/val.parquet
    
  • Train a linear classifier:
    python scripts/train_classifier.py train_dataset_path=/path/to/training.parquet val_dataset_path=/path/to/val.parquet
    

Basic Training Workflow

To train a model, initialize the configuration with dataset paths and model parameters, then pass it to ChemMRLTrainer for training.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig
from chem_mrl.trainers import ChemMRLTrainer

# Define training configuration
config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,  # Predefined model name - Can be any transformer model name or path that is compatible with sentence-transformers
        n_dims_per_step=3,  # Model-specific hyperparameter
        use_2d_matryoshka=True,  # Enable 2d MRL
        # Additional parameters specific to 2D MRL models
        n_layers_per_step=2,
        kl_div_weight=0.7,  # Weight for KL divergence regularization
        kl_temperature=0.5,  # Temperature parameter for KL loss
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train.parquet",  # Path to training data
    val_dataset_path="val.parquet",  # Path to validation data
    test_dataset_path="test.parquet",  # Optional test dataset
    smiles_a_column_name="smiles_a",  # Column with first molecule SMILES representation
    smiles_b_column_name="smiles_b",  # Column with second molecule SMILES representation
    label_column_name="similarity",  # Similarity score between molecules
)

# Initialize trainer and start training
trainer = ChemMRLTrainer(config)
# Returns the test evaluation metric if a test dataset is provided;
# otherwise returns the final validation evaluation metric.
test_eval_metric = trainer.train()

Experimental

Train a Query Model

To train a querying model, configure the model to utilize the specialized query tokenizer.

The query tokenizer supports the following query types:

  • similar: Computes the similarity between two molecular structures; used to retrieve similar SMILES.
  • substructure: Determines the presence of a substructure within the second SMILES string.

Supported query formats for smiles_a column:

  • similar {smiles}
  • substructure {smiles}

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig
from chem_mrl.trainers import ChemMRLTrainer

config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
        use_query_tokenizer=True,  # Train a query model
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train.parquet",
    val_dataset_path="val.parquet",
    smiles_a_column_name="query",
    smiles_b_column_name="target_smiles",
    label_column_name="similarity",
)
trainer = ChemMRLTrainer(config)
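The query formats listed above are plain strings, so the query column can be prepared with a small helper before the dataset is written. This is a sketch only: `build_query` is a hypothetical function, not part of chem-mrl, and the SMILES strings are arbitrary examples.

```python
QUERY_TYPES = ("similar", "substructure")


def build_query(query_type, smiles):
    """Format a query string as '<query_type> <smiles>'."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unsupported query type: {query_type!r}")
    return f"{query_type} {smiles}"


# Values destined for the column named by smiles_a_column_name.
queries = [
    build_query("similar", "CCO"),
    build_query("substructure", "c1ccccc1"),
]

print(queries)
```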

Latent Attention Layer

The Latent Attention Layer is an experimental component designed to enhance the representation learning of transformer-based models by introducing a trainable latent dictionary. It applies cross-attention between token embeddings and a set of learnable latent vectors before pooling. The output of this layer contributes to both the 1D Matryoshka loss (as the final layer output) and the 2D Matryoshka loss (by integrating into the all-layer outputs). Note: initial tests suggest that, with the default configuration, the latent attention layer leads to overfitting.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig, LatentAttentionConfig
from chem_mrl.trainers import ChemMRLTrainer

config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
        latent_attention_config=LatentAttentionConfig(
            hidden_dim=768,  # Transformer hidden size
            num_latents=512,  # Number of learnable latents
            num_cross_heads=8,  # Number of attention heads
            cross_head_dim=32,  # Dimensionality of each head
            output_normalize=True,  # Apply L2 normalization to outputs
        ),
        use_2d_matryoshka=True,
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train.parquet",
    val_dataset_path="val.parquet",
)

# Train a model with latent attention
trainer = ChemMRLTrainer(config)
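For intuition about the cross-attention step described above, here is a deliberately simplified, single-head sketch in plain Python. It assumes token embeddings act as queries and the latent vectors act as both keys and values, as in NV-Embed; it omits the learned projections, multiple heads, and any output layers, so it should not be read as the chem-mrl implementation.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def latent_cross_attention(tokens, latents):
    """Single-head cross-attention: tokens are queries, latents are keys and values."""
    scale = math.sqrt(len(latents[0]))
    attended = []
    for token in tokens:
        scores = [
            sum(t * l for t, l in zip(token, latent)) / scale for latent in latents
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the latent vectors.
        attended.append(
            [
                sum(w * latent[d] for w, latent in zip(weights, latents))
                for d in range(len(token))
            ]
        )
    return attended


def mean_pool(vectors):
    """Average the per-token vectors into one embedding."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# Tiny random stand-ins; in a real model the latents would be trainable.
rng = random.Random(0)
hidden_dim, num_latents, num_tokens = 4, 3, 5
latents = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_latents)]
tokens = [[rng.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_tokens)]

pooled = mean_pool(latent_cross_attention(tokens, latents))
print(pooled)
```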

Custom Callbacks

You can provide a list of transformers.TrainerCallback instances to execute during training.

from typing import Any

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainingArguments,
)
from transformers import TrainerCallback, TrainerControl, TrainerState

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig
from chem_mrl.trainers import ChemMRLTrainer


# Define a callback class for logging evaluation metrics
class EvalCallback(TrainerCallback):
    def on_evaluate(
        self,
        args: SentenceTransformerTrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        metrics: dict[str, Any],
        model: SentenceTransformer,
        **kwargs,
    ) -> None:
        """
        Event called after an evaluation phase.
        """
        pass


config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train.parquet",
    val_dataset_path="val.parquet",
    smiles_a_column_name="smiles_a",
    smiles_b_column_name="smiles_b",
    label_column_name="similarity",
)

# Train with callback
trainer = ChemMRLTrainer(config)
val_eval_metric = trainer.train(callbacks=[EvalCallback()])

Classifier

This repository includes code for training a linear classifier with optional dropout regularization. The classifier categorizes substances based on SMILES and category features.

Hyperparameter tuning shows that cross-entropy loss (softmax option) outperforms self-adjusting dice loss in terms of accuracy, making it the preferred choice for molecular property classification.

Usage

Basic Classification Training

To train a classifier, configure the model with dataset paths and column names, then initialize ClassifierTrainer to start training.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.schemas import BaseConfig, ClassifierConfig
from chem_mrl.trainers import ClassifierTrainer

# Define classification training configuration
config = BaseConfig(
    model=ClassifierConfig(
        model_name="path/to/trained_mrl_model",  # Pretrained MRL model path
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train_classification.parquet",  # Path to training dataset
    val_dataset_path="val_classification.parquet",  # Path to validation dataset
    smiles_a_column_name="smiles",  # Column containing SMILES representations of molecules
    label_column_name="label",  # Column containing classification labels
)

# Initialize and train the classifier
trainer = ClassifierTrainer(config)
trainer.train()

Training with Dice Loss

For imbalanced classification tasks, Dice Loss can improve performance by focusing on hard-to-classify samples. Below is a configuration that selects the self-adjusting Dice loss through ClassifierConfig, which introduces additional hyperparameters.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.schemas import BaseConfig, ClassifierConfig
from chem_mrl.schemas.Enums import ClassifierLossFctOption, DiceReductionOption
from chem_mrl.trainers import ClassifierTrainer

# Define classification training configuration with Dice Loss
config = BaseConfig(
    model=ClassifierConfig(
        model_name="path/to/trained_mrl_model",
        loss_func=ClassifierLossFctOption.selfadjdice,
        dice_reduction=DiceReductionOption.sum,  # Reduction method for Dice Loss (e.g., 'mean' or 'sum')
        dice_gamma=1.0,  # Smoothing factor hyperparameter
    ),
    training_args=SentenceTransformerTrainingArguments("training_output"),
    train_dataset_path="train_classification.parquet",  # Path to training dataset
    val_dataset_path="val_classification.parquet",  # Path to validation dataset
    smiles_a_column_name="smiles",
    label_column_name="label",
)

# Initialize and train the classifier with Dice Loss
trainer = ClassifierTrainer(config)
trainer.train()
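To show what the Dice hyperparameters control, the sketch below computes a self-adjusting Dice loss in the form given by Li et al. (2020), applied to the probability each sample assigns to its true class. It is an assumption that chem-mrl follows this exact form; the function name, the `alpha` parameter, and the probabilities are illustrative.

```python
def self_adjusting_dice_loss(target_probs, alpha=1.0, gamma=1.0, reduction="mean"):
    """Self-adjusting Dice loss over true-class probabilities.

    alpha controls how strongly the (1 - p) ** alpha factor shrinks the
    contribution of confidently classified samples; gamma is the smoothing
    term added to numerator and denominator.
    """
    losses = []
    for p in target_probs:
        weighted = ((1.0 - p) ** alpha) * p
        dice = (2.0 * weighted + gamma) / (weighted + 1.0 + gamma)
        losses.append(1.0 - dice)
    if reduction == "sum":
        return sum(losses)
    return sum(losses) / len(losses)


# Hypothetical true-class probabilities for a batch of three samples.
probs = [0.9, 0.6, 0.2]
mean_loss = self_adjusting_dice_loss(probs, reduction="mean")
sum_loss = self_adjusting_dice_loss(probs, reduction="sum")

print(mean_loss, sum_loss)
```

The two reductions differ only by the batch size, which corresponds to the choice exposed through dice_reduction in the configuration above.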

References:

  • Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv [Cs.LG], 2020.
  • Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." arXiv [Cs.LG], 2022.
  • Kusupati, Aditya, et al. "Matryoshka Representation Learning." arXiv [Cs.LG], 2022.
  • Li, Xianming, et al. "2D Matryoshka Sentence Embeddings." arXiv [Cs.CL], 2024.
  • Bajusz, Dávid, et al. "Why is the Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations?" J Cheminform, 7, 20 (2015).
  • Li, Xiaoya, et al. "Dice Loss for Data-imbalanced NLP Tasks." arXiv [Cs.CL], 2020.
  • Reimers, Nils, and Gurevych, Iryna. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.
  • Lee, Chankyu, et al. "NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models." arXiv [Cs.CL], 2025.
