SMILES-based Matryoshka Representation Learning Embedding Model

CHEM-MRL

Chem-MRL is a SMILES embedding transformer model that leverages Matryoshka Representation Learning (MRL) to generate efficient, truncatable embeddings for downstream tasks such as classification, clustering, and database indexing.

Datasets should consist of SMILES pairs and their corresponding Morgan fingerprint Tanimoto similarity scores.
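
As a rough illustration of what one dataset row contains, the sketch below computes a Tanimoto score from two fingerprints represented as sets of on-bit indices. The bit sets are made up for the example; in practice they would come from a Morgan fingerprint generator in a cheminformatics toolkit. The column names mirror those used in the training examples later in this document, and the helper tanimoto is hypothetical rather than part of chem-mrl.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    # Convention chosen for this sketch: two empty fingerprints score 0.0.
    return len(bits_a & bits_b) / union if union else 0.0


# Made-up on-bit indices standing in for real Morgan fingerprints.
fingerprints = {
    "CCO": {1, 5, 9, 12},
    "CCN": {1, 5, 7, 12},
}

row = {
    "smiles_a": "CCO",
    "smiles_b": "CCN",
    "similarity": tanimoto(fingerprints["CCO"], fingerprints["CCN"]),
}
print(row)  # 3 shared bits out of 5 distinct bits gives 0.6
```

Rows of this shape, collected into a Parquet file, match the smiles_a, smiles_b, and similarity columns referenced by the training configuration examples.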

Hyperparameter optimization indicates that TanimotoSentLoss, a custom Tanimoto similarity loss function based on CoSENTLoss, outperforms CoSENTLoss, AnglELoss, and the plain Tanimoto similarity and cosine similarity losses.

Installation

Install with pip

pip install chem-mrl

Install from source code

pip install -e .

Install Flash Attention (optional, improves training speed)

The default base model, Derify/ModChemBERT-IR-BASE, benefits from Flash Attention for faster training and inference. Install it via pip:

MAX_JOBS=4 pip install flash-attn --no-build-isolation

For more information and installation options, refer to the Flash Attention repository.

Usage

Inference

ChemMRL provides pre-trained models for generating molecular embeddings from SMILES strings. The model supports various precision configurations to balance accuracy, speed, and memory usage.

Quick Start

from chem_mrl import ChemMRL

# Load the model from the 🤗 Hub
model = ChemMRL(
    similarity_fn_name="tanimoto",  # tanimoto (default) | cosine | dot
    model_kwargs={
        "dtype": "float16",  # float32 | float16 | bfloat16
        "attn_implementation": "sdpa",  # eager | sdpa | flash_attention_2
    },
)

# Encode SMILES to embeddings
smiles = [
    "OCCCc1cc(F)cc(F)c1",
    "Fc1cc(F)cc(-n2cc[o+]n2)c1",
    "CCC(C)C(=O)C1(C(NN)C(C)C)CCCC1",
]
embeddings = model.embed(smiles)

# Calculate similarity matrix using Tanimoto similarity
similarities = model.similarity(embeddings, embeddings)
# tensor([[1.0000, 0.3875, 0.0080],
#         [0.3875, 1.0000, 0.0029],
#         [0.0080, 0.0029, 1.0000]], dtype=torch.float16)

# Calculate the similarity between one query molecule and each SMILES in a list
query = 5 * ["CN(C)CCc1c[nH]c2cccc(OP(=O)(O)O)c12"]
docs = [
    "CN(C)CCc1c[nH]c2cccc(OP(=O)(O)Cl)c12",
    "CN(C)CCc1c[nH]c2cccc(OP(=O)(O)OP(=O)(O)O)c12",
    "CCN(C)CCc1c[nH]c2cccc(OP(=O)(O)O)c12",
    "C[N+](C)(C)CCc1c[nH]c2cccc(OP(=O)(O)O)c12",
    "CN(C)CCc1c[nH]c2cccc(OP(=O)(Cl)Cl)c12",
]
query_embeddings = model.embed(query)
docs_embeddings = model.embed(docs)
similarities = model.similarity_pairwise(query_embeddings, docs_embeddings)
# tensor([0.9697, 0.9214, 0.9751, 0.8892, 0.9067], dtype=torch.float16)
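
Because the model is trained with Matryoshka Representation Learning, the leading dimensions of an embedding are intended to remain useful on their own. A minimal sketch of truncation on plain Python lists follows. The helper truncate_embedding is hypothetical, the vector is a toy value rather than model output, and whether ChemMRL exposes a built-in truncation option is not covered here.

```python
import math


def truncate_embedding(vector: list[float], dim: int) -> list[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero prefix untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


full = [3.0, 4.0, 12.0]  # toy 3-dimensional "embedding"
short = truncate_embedding(full, 2)
print(short)  # [0.6, 0.8]
```

Rescaling after truncation keeps cosine and dot-product comparisons on the same footing, since both agree on unit-length vectors.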

Precision Configurations

Different precision and optimization settings offer trade-offs between accuracy, inference speed, and memory usage. The table below lists recommended configurations with their performance characteristics. All metrics were benchmarked with scripts/evaluate_precision.py on 131K samples (batch size = 1024), comparing speed and memory usage against the float32 baseline.

Configuration	Speedup*†	Memory Savings*†	Accuracy Impact
bf16 (sdpa)	2.12x / 1.99x	49.9% / 49.9%	Minimal (~0.01%)
bf16 + torch.compile (sdpa)	2.55x / 2.36x	41.9% / 41.8%	Minimal (~0.01%)
bf16 (flash-attn)	2.54x / 2.25x	64.3% / 64.3%	Minimal (~0.01%)
fp16 (sdpa)	2.09x / 2.00x	49.9% / 49.9%	Negligible (<0.01%)
fp16 + torch.compile (sdpa)	2.56x / 2.35x	44.7% / 44.0%	Negligible (<0.01%)
fp16 (flash-attn)	2.52x / 2.25x	64.2% / 64.1%	Negligible (<0.01%)

* NVIDIA 4070 Ti Super and NVIDIA 3090 FE values, respectively
† Higher is better
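
To give a feel for where the small accuracy differences between float16 and bfloat16 come from, the following sketch rounds a single value to each format using only the Python standard library. It illustrates the number formats themselves, not the benchmark script, and the helper names are hypothetical. It assumes finite inputs within each format's range.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE 754 half precision (5 exponent bits, 10 fraction bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 fraction bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, before dropping the low 16 bits.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 0.3875
print(to_float16(value))   # 0.387451171875
print(to_bfloat16(value))  # 0.38671875
```

float16 keeps more fraction bits (10 versus 7), so it rounds values in this range more finely, while bfloat16 keeps the wider exponent range of float32.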

Code examples for each configuration:

# bfloat16 with SDPA
model = ChemMRL(
    model_kwargs={
        "dtype": "bfloat16",
        "attn_implementation": "sdpa"
    }
)

# bfloat16 with torch.compile and SDPA
model = ChemMRL(
    model_kwargs={
        "dtype": "bfloat16",
        "attn_implementation": "sdpa"
    },
    compile_kwargs={
        "backend": "inductor",
        "dynamic": True
    }
)

# bfloat16 with Flash Attention
model = ChemMRL(
    model_kwargs={
        "dtype": "bfloat16",
        "attn_implementation": "flash_attention_2"
    }
)

# float16 with SDPA
model = ChemMRL(
    model_kwargs={
        "dtype": "float16",
        "attn_implementation": "sdpa"
    }
)

# float16 with torch.compile and SDPA
model = ChemMRL(
    model_kwargs={
        "dtype": "float16",
        "attn_implementation": "sdpa"
    },
    compile_kwargs={
        "backend": "inductor",
        "dynamic": True
    }
)

# float16 with Flash Attention
model = ChemMRL(
    model_kwargs={
        "dtype": "float16",
        "attn_implementation": "flash_attention_2"
    }
)

Hydra & Training Scripts

Hydra configuration files are in chem_mrl/conf. The base config (base.yaml) defines shared arguments and includes model-specific configurations from chem_mrl/conf/model. Supported models: chem_mrl, chem_2d_mrl, classifier, and dice_loss_classifier.

Training Examples:

# Default (chem_mrl model)
python scripts/train_chem_mrl.py

# Specify model type
python scripts/train_chem_mrl.py model=chem_2d_mrl
python scripts/train_chem_mrl.py model=classifier

# Override parameters
python scripts/train_chem_mrl.py model=chem_2d_mrl training_args.num_train_epochs=5 'datasets[0].train_dataset.name=/path/to/data.parquet'

# Use a different custom config also located in `chem_mrl/conf`
python scripts/train_chem_mrl.py --config-name=my_custom_config.yaml

Configuration Options:

Command line overrides: Use model=<type> and parameter overrides as shown above.
Modify base.yaml directly: Edit the "- /model: chem_mrl" line in the defaults section to change the default model, or change any other parameter in place.
Override config file: Use --config-name=<config_name> to specify a different base configuration file instead of the default base.yaml.
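
When training runs are launched from another Python program, the command-line overrides above can be assembled as an argument list. The sketch below is one way to do that under stated assumptions: build_train_command is a hypothetical helper, not part of chem-mrl, and it only builds the command without running it.

```python
import shlex
from typing import Optional


def build_train_command(
    model: Optional[str] = None,
    overrides: Optional[dict] = None,
    config_name: Optional[str] = None,
) -> list:
    """Assemble an argument list for scripts/train_chem_mrl.py."""
    cmd = ["python", "scripts/train_chem_mrl.py"]
    if config_name:
        cmd.append(f"--config-name={config_name}")
    if model:
        cmd.append(f"model={model}")
    for key, value in (overrides or {}).items():
        cmd.append(f"{key}={value}")
    return cmd


cmd = build_train_command(
    model="chem_2d_mrl",
    overrides={
        "training_args.num_train_epochs": 5,
        "datasets[0].train_dataset.name": "/path/to/data.parquet",
    },
)
# shlex.join quotes arguments that contain shell-special characters such as [ ].
print(shlex.join(cmd))
# To launch the run: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting issues entirely, while shlex.join produces a string that is safe to paste into a terminal.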

Basic Training Workflow

To train a model, initialize the configuration with dataset paths and model parameters, then pass it to ChemMRLTrainer for training.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig, DatasetConfig, SplitConfig
from chem_mrl.schemas.Enums import FieldTypeOption
from chem_mrl.trainers import ChemMRLTrainer

dataset_config = DatasetConfig(
    key="my_dataset",
    train_dataset=SplitConfig(
        name="train.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float32,
        sample_size=1000,
    ),
    val_dataset=SplitConfig(
        name="val.parquet",
        split_key="train",  # Use "train" for local files
        label_cast_type=FieldTypeOption.float16,
        sample_size=500,
    ),
    test_dataset=SplitConfig(
        name="test.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float16,
        sample_size=500,
    ),
    smiles_a_column_name="smiles_a",
    smiles_b_column_name="smiles_b",
    label_column_name="similarity",
)

config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,  # Predefined model name - Can be any transformer model name or path that is compatible with sentence-transformers
        n_dims_per_step=3,  # Model-specific hyperparameter
        use_2d_matryoshka=True,  # Enable 2d MRL
        # Additional parameters specific to 2D MRL models
        n_layers_per_step=2,
        kl_div_weight=0.7,  # Weight for KL divergence regularization
        kl_temperature=0.5,  # Temperature parameter for KL loss
    ),
    datasets=[dataset_config],  # List of dataset configurations
    training_args=SentenceTransformerTrainingArguments("training_output"),
)

# Initialize trainer and start training
trainer = ChemMRLTrainer(config)
# train() returns the test evaluation metric if a test dataset is provided;
# otherwise it returns the final validation evaluation metric.
test_eval_metric = trainer.train()

Custom Callbacks

You can pass a list of transformers.TrainerCallback instances to trainer.train() through the callbacks argument; they are invoked during training.

import torch
from sentence_transformers import (
    SentenceTransformerTrainingArguments,
)
from transformers import PreTrainedModel
from transformers.trainer_callback import TrainerCallback, TrainerControl, TrainerState
from transformers.training_args import TrainingArguments

from chem_mrl.constants import BASE_MODEL_NAME
from chem_mrl.schemas import BaseConfig, ChemMRLConfig, DatasetConfig, SplitConfig
from chem_mrl.schemas.Enums import FieldTypeOption
from chem_mrl.trainers import ChemMRLTrainer


# Define a callback class for logging evaluation metrics
# https://huggingface.co/docs/transformers/main/en/main_classes/callback#transformers.TrainerCallback
class EvalCallback(TrainerCallback):
    def on_evaluate(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        model: PreTrainedModel | torch.nn.Module,
        metrics: dict[str, float],
        **kwargs,
    ) -> None:
        """Event called after an evaluation phase."""
        pass


dataset_config = DatasetConfig(
    key="callback_dataset",
    train_dataset=SplitConfig(
        name="train.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float32,
        sample_size=1000,
    ),
    val_dataset=SplitConfig(
        name="val.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float16,
        sample_size=500,
    ),
    smiles_a_column_name="smiles_a",
    smiles_b_column_name="smiles_b",
    label_column_name="similarity",
)

config = BaseConfig(
    model=ChemMRLConfig(
        model_name=BASE_MODEL_NAME,
    ),
    datasets=[dataset_config],
    training_args=SentenceTransformerTrainingArguments("training_output"),
)

# Train with callback
trainer = ChemMRLTrainer(config)
val_eval_metric = trainer.train(callbacks=[EvalCallback()])
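
The on_evaluate body above is left empty. One possible way to fill it is to delegate to a small helper that records a metric per evaluation and reports whether it is the best seen so far. The sketch below is self-contained and does not import transformers; record_eval and the metric name "eval_score" are placeholders, not names confirmed by chem-mrl, and it assumes higher values are better.

```python
def record_eval(
    history: list[tuple[int, float]],
    step: int,
    metrics: dict[str, float],
    key: str,
) -> bool:
    """Store metrics[key] for this step; return True if it is the best so far."""
    value = metrics[key]
    is_best = all(value > previous for _, previous in history)
    history.append((step, value))
    return is_best


history: list[tuple[int, float]] = []
print(record_eval(history, 100, {"eval_score": 0.81}, "eval_score"))  # True
print(record_eval(history, 200, {"eval_score": 0.79}, "eval_score"))  # False
print(record_eval(history, 300, {"eval_score": 0.85}, "eval_score"))  # True
```

Inside a callback, the helper could be called as record_eval(self.history, state.global_step, metrics, key), with self.history initialized in the callback's constructor.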

Classifier

This repository includes code for training a linear classifier with optional dropout regularization. The classifier categorizes substances based on SMILES and category features.

Hyperparameter tuning shows that cross-entropy loss (softmax option) outperforms self-adjusting dice loss in terms of accuracy, making it the preferred choice for molecular property classification.

Usage

Basic Classification Training

To train a classifier, configure the model with dataset paths and column names, then initialize ClassifierTrainer to start training.

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.schemas import BaseConfig, ClassifierConfig, DatasetConfig, SplitConfig
from chem_mrl.schemas.Enums import FieldTypeOption
from chem_mrl.trainers import ClassifierTrainer

dataset_config = DatasetConfig(
    key="classification_dataset",
    train_dataset=SplitConfig(
        name="train_classification.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float32,
        sample_size=1000,
    ),
    val_dataset=SplitConfig(
        name="val_classification.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float16,
        sample_size=500,
    ),
    smiles_a_column_name="smiles",
    smiles_b_column_name=None,  # Not needed for classification
    label_column_name="label",
)

# Define classification training configuration
config = BaseConfig(
    model=ClassifierConfig(
        model_name="path/to/trained_mrl_model",  # Pretrained MRL model path
    ),
    datasets=[dataset_config],
    training_args=SentenceTransformerTrainingArguments("training_output"),
)

# Initialize and train the classifier
trainer = ClassifierTrainer(config)
trainer.train()

Training with Dice Loss

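For imbalanced classification tasks, Dice Loss can improve performance by focusing on hard-to-classify samples. The configuration below selects self-adjusting Dice loss by setting loss_func=ClassifierLossFctOption.selfadjdice on ClassifierConfig, which brings in the additional hyperparameters dice_reduction and dice_gamma.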

from sentence_transformers import SentenceTransformerTrainingArguments

from chem_mrl.schemas import BaseConfig, ClassifierConfig, DatasetConfig, SplitConfig
from chem_mrl.schemas.Enums import ClassifierLossFctOption, DiceReductionOption, FieldTypeOption
from chem_mrl.trainers import ClassifierTrainer

dataset_config = DatasetConfig(
    key="dice_loss_dataset",
    train_dataset=SplitConfig(
        name="train_classification.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float32,
        sample_size=1000,
    ),
    val_dataset=SplitConfig(
        name="val_classification.parquet",
        split_key="train",
        label_cast_type=FieldTypeOption.float16,
        sample_size=500,
    ),
    smiles_a_column_name="smiles",
    smiles_b_column_name=None,  # Not needed for classification
    label_column_name="label",
)

# Define classification training configuration with Dice Loss
config = BaseConfig(
    model=ClassifierConfig(
        model_name="path/to/trained_mrl_model",
        loss_func=ClassifierLossFctOption.selfadjdice,
        dice_reduction=DiceReductionOption.sum,  # Reduction method for Dice Loss (e.g., 'mean' or 'sum')
        dice_gamma=1.0,  # Smoothing factor hyperparameter
    ),
    datasets=[dataset_config],
    training_args=SentenceTransformerTrainingArguments("training_output"),
)

# Initialize and train the classifier with Dice Loss
trainer = ClassifierTrainer(config)
trainer.train()

References:

Chithrananda, Seyone, et al. "ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction." arXiv [cs.LG], 2020.
Ahmad, Walid, et al. "ChemBERTa-2: Towards Chemical Foundation Models." arXiv [cs.LG], 2022.
Kusupati, Aditya, et al. "Matryoshka Representation Learning." arXiv [cs.LG], 2022.
Li, Xianming, et al. "2D Matryoshka Sentence Embeddings." arXiv [cs.CL], 2024.
Bajusz, Dávid, et al. "Why is the Tanimoto Index an Appropriate Choice for Fingerprint-Based Similarity Calculations?" J Cheminform, 7, 20 (2015).
Li, Xiaoya, et al. "Dice Loss for Data-imbalanced NLP Tasks." arXiv [cs.CL], 2020.
Reimers, Nils, and Gurevych, Iryna. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019.

Citation

If you use this code or model in your research, please cite:

@software{cortes-2025-chem-mrl,
    author    = {Emmanuel Cortes},
    title     = {CHEM-MRL: SMILES-based Matryoshka Representation Learning Embedding Transformer},
    year      = {2025},
    publisher = {GitHub},
    howpublished = {GitHub repository},
    url       = {https://github.com/emapco/chem-mrl},
}

Release history

0.8.3

Dec 14, 2025

0.8.2

Nov 16, 2025

0.8.1 (this version)

Nov 3, 2025

0.8.0

Oct 27, 2025

0.7.3

Aug 19, 2025

0.7.2

Jul 25, 2025

0.7.1

Jul 24, 2025

0.7.0

Jul 22, 2025

0.6.3

Jun 11, 2025

0.6.2

Jun 5, 2025

0.6.1

Jun 4, 2025

0.6.0

Jun 3, 2025

0.5.9

May 30, 2025

0.5.8

Mar 29, 2025

0.5.7

Mar 29, 2025

0.5.6

Feb 28, 2025

0.5.5

Feb 26, 2025

0.5.4

Feb 24, 2025

0.5.3

Feb 14, 2025

0.5.2

Feb 14, 2025

0.5.1

Feb 13, 2025

0.5.0

Feb 9, 2025

0.4.1

Feb 7, 2025

0.4.0

Feb 6, 2025

0.3.7

Feb 4, 2025

0.3.6

Feb 3, 2025

0.3.5

Feb 3, 2025

0.3.4

Feb 3, 2025

0.3.3

Feb 2, 2025

0.3.2

Feb 2, 2025

Download files

Source Distribution

chem_mrl-0.8.1.tar.gz (423.1 kB)

Uploaded Nov 3, 2025 Source

Built Distribution

chem_mrl-0.8.1-py3-none-any.whl (67.8 kB)

Uploaded Nov 3, 2025 Python 3

File details

Details for the file chem_mrl-0.8.1.tar.gz.

File metadata

Download URL: chem_mrl-0.8.1.tar.gz
Upload date: Nov 3, 2025
Size: 423.1 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chem_mrl-0.8.1.tar.gz
Algorithm	Hash digest
SHA256	`ef80e95aa4b9ae3ed6ddff1842c1095b538eaeb73eac68e93dbedbe2e3181c93`
MD5	`bc06f2cfa317bd967c92b0d84e8a7cfa`
BLAKE2b-256	`f300f65c4ae046931e00e61be5e0c6b9de8ceccd0a7da6a1490af0af5a0c65b3`

Provenance

The following attestation bundles were made for chem_mrl-0.8.1.tar.gz:

Publisher: release.yml on emapco/chem-mrl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chem_mrl-0.8.1.tar.gz
- Subject digest: ef80e95aa4b9ae3ed6ddff1842c1095b538eaeb73eac68e93dbedbe2e3181c93
- Sigstore transparency entry: 662297078
- Sigstore integration time: Nov 3, 2025
Source repository:
- Permalink: emapco/chem-mrl@022f515fc393266e266dcf8a7957531e6aaf6854
- Branch / Tag: refs/heads/main
- Owner: https://github.com/emapco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@022f515fc393266e266dcf8a7957531e6aaf6854
- Trigger Event: push

File details

Details for the file chem_mrl-0.8.1-py3-none-any.whl.

File metadata

Download URL: chem_mrl-0.8.1-py3-none-any.whl
Upload date: Nov 3, 2025
Size: 67.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for chem_mrl-0.8.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d2c38e259eab4b23c423e779365609e342cb0a80c02c418941728f30eb1145f5`
MD5	`1eaa08d49343fa3b95a6fd52843c49e5`
BLAKE2b-256	`3ad0ebb4ff7745fa34e7bdfa915156a521edaf1b44e8da2a657966eabac13501`

Provenance

The following attestation bundles were made for chem_mrl-0.8.1-py3-none-any.whl:

Publisher: release.yml on emapco/chem-mrl

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chem_mrl-0.8.1-py3-none-any.whl
- Subject digest: d2c38e259eab4b23c423e779365609e342cb0a80c02c418941728f30eb1145f5
- Sigstore transparency entry: 662297088
- Sigstore integration time: Nov 3, 2025
Source repository:
- Permalink: emapco/chem-mrl@022f515fc393266e266dcf8a7957531e6aaf6854
- Branch / Tag: refs/heads/main
- Owner: https://github.com/emapco
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@022f515fc393266e266dcf8a7957531e6aaf6854
- Trigger Event: push
