Molecule Benchmarks

PyPI version Python 3.12+ License: MIT

A comprehensive benchmark suite for evaluating generative models for molecules. This package provides standardized metrics and evaluation protocols for assessing the quality of molecular generation models in drug discovery and cheminformatics.

Features

  • Comprehensive Metrics: Validity, uniqueness, novelty, diversity, and similarity metrics
  • Standard Benchmarks: Implements metrics from Moses, GuacaMol, and FCD papers
  • Easy Integration: Simple interface for integrating with any generative model
  • Direct SMILES Evaluation: Benchmark pre-generated SMILES lists without implementing a model interface
  • Multiple Datasets: Built-in support for QM9, Moses, and GuacaMol datasets
  • Efficient Computation: Optimized for large-scale evaluation with multiprocessing support

Installation

pip install molecule-benchmarks

Quick Start

You can use the benchmark suite in two ways:

Option 1: Direct SMILES Evaluation (Simplified)

If you already have generated SMILES strings, you can benchmark them directly. Just ensure you have at least the number of samples specified in num_samples_to_generate.

from molecule_benchmarks import Benchmarker, SmilesDataset

# Load a dataset
dataset = SmilesDataset.load_qm9_dataset(subset_size=10000)

# Initialize benchmarker
benchmarker = Benchmarker(
    dataset=dataset,
    num_samples_to_generate=10000,  # You need to supply at least this many samples
    device="cpu",  # or "cuda" for GPU
)

# Your generated SMILES (replace with your actual generated molecules)
generated_smiles = [
    "CCO",           # Ethanol
    "CC(=O)O",       # Acetic acid
    "c1ccccc1",      # Benzene
    "CC(C)O",        # Isopropanol
    "CCN",           # Ethylamine
    None,            # Invalid molecule (use None for failures)
    # ... more molecules up to num_samples_to_generate
]

# Run benchmarks directly on the SMILES list
results = benchmarker.benchmark(generated_smiles)
print(results)
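
Because the benchmarker expects at least num_samples_to_generate entries, it can help to check the list length before calling benchmark. The helper below is a minimal sketch: prepare_samples is a hypothetical name, not part of the package, and trimming to exactly the requested count is an assumption about what you want rather than documented library behavior.

```python
def prepare_samples(smiles, num_samples):
    """Return the first num_samples entries, or raise if too few were generated.

    Hypothetical helper for illustration; not part of molecule_benchmarks.
    """
    if len(smiles) < num_samples:
        raise ValueError(
            f"Need at least {num_samples} samples, got {len(smiles)}"
        )
    # Keep failed generations (None) in place so they still count as invalid.
    return list(smiles[:num_samples])


samples = prepare_samples(["CCO", None, "CCN", "c1ccccc1"], num_samples=3)
print(samples)  # ['CCO', None, 'CCN']
```

The returned list can then be passed to benchmarker.benchmark as in the example above.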

Option 2: Model-Based Evaluation

To use the benchmark suite with a generative model, implement the MoleculeGenerationModel protocol. The benchmarker then calls your model to generate the required number of samples and runs the benchmarks on them.

from molecule_benchmarks.model import MoleculeGenerationModel

class MyGenerativeModel(MoleculeGenerationModel):
    def __init__(self, model_path):
        # Initialize your model here (load_model is a placeholder for your own loader)
        self.model = load_model(model_path)
    
    def generate_molecule_batch(self) -> list[str | None]:
        """Generate a batch of molecules as SMILES strings.
        
        Returns:
            List of SMILES strings. Return None for invalid molecules.
        """
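        # Your generation logic here (convert_to_smiles is a placeholder for
        # your own conversion method; it should return None on failure)
        batch = self.model.generate(batch_size=100)
        return [self.convert_to_smiles(mol) for mol in batch]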

# Initialize your model
model = MyGenerativeModel("path/to/model")

# Run benchmarks using the model (benchmarker is created as in Option 1)
results = benchmarker.benchmark_model(model)
print(results)
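
The text above says the benchmarker draws the required number of samples from the model. One plausible way to do that is to call generate_molecule_batch repeatedly until enough samples have been collected. The sketch below illustrates such a loop with a stand-in model. CyclingModel and collect_samples are hypothetical names, and the actual loop inside benchmark_model may differ.

```python
from itertools import cycle, islice


class CyclingModel:
    """Stand-in model that replays a fixed list of SMILES (hypothetical)."""

    def __init__(self, smiles, batch_size=4):
        self._source = cycle(smiles)
        self._batch_size = batch_size

    def generate_molecule_batch(self):
        # Same method name as the MoleculeGenerationModel protocol.
        return list(islice(self._source, self._batch_size))


def collect_samples(model, num_samples):
    """Call the model until num_samples entries are available (assumed behavior)."""
    samples = []
    while len(samples) < num_samples:
        batch = model.generate_molecule_batch()
        if not batch:
            raise RuntimeError("Model returned an empty batch")
        samples.extend(batch)
    # Drop the surplus from the last batch.
    return samples[:num_samples]


demo_model = CyclingModel(["CCO", "CCN", None])
print(collect_samples(demo_model, 5))  # ['CCO', 'CCN', None, 'CCO', 'CCN']
```

A practical consequence of this pattern is that the batch size only affects how many calls are made, not how many molecules end up being evaluated.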

Analyze Results

The benchmark returns comprehensive metrics:

# Validity metrics
print(f"Valid molecules: {results['validity']['valid_fraction']:.3f}")
print(f"Valid & unique: {results['validity']['valid_and_unique_fraction']:.3f}")
print(f"Valid & unique & novel: {results['validity']['valid_and_unique_and_novel_fraction']:.3f}")

# Diversity and similarity metrics
print(f"Internal diversity: {results['moses']['IntDiv']:.3f}")
print(f"SNN score: {results['moses']['snn_score']:.3f}")

# Chemical property distribution similarity
print(f"KL divergence score: {results['kl_score']:.3f}")

# Fréchet ChemNet Distance
print(f"FCD score: {results['fcd']['fcd']:.3f}")
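
For logging or tabulating results, it can be convenient to flatten the nested structure into dotted metric names. The sketch below assumes that results behaves like a plain nested dict of numbers, as the indexing above suggests. flatten_results is a hypothetical helper and the values in example are made up.

```python
def flatten_results(results, prefix=""):
    """Flatten a nested metrics dict into {'group.metric': value} pairs."""
    flat = {}
    for key, value in results.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into metric groups such as 'validity' or 'moses'.
            flat.update(flatten_results(value, name))
        else:
            flat[name] = value
    return flat


# Made-up numbers in the same nested shape as the examples above.
example = {"validity": {"valid_fraction": 0.95}, "kl_score": 0.8}
for name, value in flatten_results(example).items():
    print(f"{name}: {value:.3f}")
```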

Complete Examples

Example 1: Direct SMILES Benchmarking (Recommended for Simplicity)

from molecule_benchmarks import Benchmarker, SmilesDataset

# Load dataset
print("Loading dataset...")
dataset = SmilesDataset.load_qm9_dataset(max_train_samples=1000)

# Create benchmarker
benchmarker = Benchmarker(
    dataset=dataset,
    num_samples_to_generate=100,
    device="cpu"
)

# Your generated SMILES (replace with your actual generated molecules)
generated_smiles = [
    "CCO",           # Ethanol
    "CC(=O)O",       # Acetic acid
    "c1ccccc1",      # Benzene
    "CC(C)O",        # Isopropanol
    "CCN",           # Ethylamine
    None,            # Invalid molecule
    # ... add more molecules up to 100 total
] + [None] * (100 - 6)  # Pad with None to reach desired count

# Run benchmarks directly
print("Running benchmarks...")
results = benchmarker.benchmark(generated_smiles)

# Print results
print("\n=== Validity Metrics ===")
print(f"Valid molecules: {results['validity']['valid_fraction']:.3f}")
print(f"Unique molecules: {results['validity']['unique_fraction']:.3f}")
print(f"Valid & unique: {results['validity']['valid_and_unique_fraction']:.3f}")
print(f"Valid & unique & novel: {results['validity']['valid_and_unique_and_novel_fraction']:.3f}")

print("\n=== Moses Metrics ===")
print(f"Passing Moses filters: {results['moses']['fraction_passing_moses_filters']:.3f}")
print(f"SNN score: {results['moses']['snn_score']:.3f}")
print(f"Internal diversity (p=1): {results['moses']['IntDiv']:.3f}")
print(f"Internal diversity (p=2): {results['moses']['IntDiv2']:.3f}")

print("\n=== Distribution Metrics ===")
print(f"KL divergence score: {results['kl_score']:.3f}")
print(f"FCD score: {results['fcd']['fcd']:.3f}")
print(f"FCD (valid only): {results['fcd']['fcd_valid']:.3f}")

Example 2: Model-Based Benchmarking

Here's a complete example using the built-in dummy model:

from molecule_benchmarks import Benchmarker, SmilesDataset
from molecule_benchmarks.model import DummyMoleculeGenerationModel

# Load dataset
print("Loading dataset...")
dataset = SmilesDataset.load_qm9_dataset(max_train_samples=1000)

# Create benchmarker
benchmarker = Benchmarker(
    dataset=dataset,
    num_samples_to_generate=100,
    device="cpu"
)

# Create a dummy model (replace with your model)
model = DummyMoleculeGenerationModel([
    "CCO",           # Ethanol
    "CC(=O)O",       # Acetic acid
    "c1ccccc1",      # Benzene
    "CC(C)O",        # Isopropanol
    "CCN",           # Ethylamine
    None,            # Invalid molecule
])

# Run benchmarks using the model
print("Running benchmarks...")
results = benchmarker.benchmark_model(model)

# Print results
print("\n=== Validity Metrics ===")
print(f"Valid molecules: {results['validity']['valid_fraction']:.3f}")
print(f"Unique molecules: {results['validity']['unique_fraction']:.3f}")
print(f"Valid & unique: {results['validity']['valid_and_unique_fraction']:.3f}")
print(f"Valid & unique & novel: {results['validity']['valid_and_unique_and_novel_fraction']:.3f}")

print("\n=== Moses Metrics ===")
print(f"Passing Moses filters: {results['moses']['fraction_passing_moses_filters']:.3f}")
print(f"SNN score: {results['moses']['snn_score']:.3f}")
print(f"Internal diversity (p=1): {results['moses']['IntDiv']:.3f}")
print(f"Internal diversity (p=2): {results['moses']['IntDiv2']:.3f}")

print("\n=== Distribution Metrics ===")
print(f"KL divergence score: {results['kl_score']:.3f}")
print(f"FCD score: {results['fcd']['fcd']:.3f}")
print(f"FCD (valid only): {results['fcd']['fcd_valid']:.3f}")

Supported Datasets

The package includes several built-in datasets:

from molecule_benchmarks import SmilesDataset

# QM9 dataset (small molecules)
dataset = SmilesDataset.load_qm9_dataset(subset_size=10000)

# Moses dataset (larger, drug-like molecules)
dataset = SmilesDataset.load_moses_dataset(fraction=0.1)

# GuacaMol dataset
dataset = SmilesDataset.load_guacamol_dataset(fraction=0.1)

# Custom dataset from files
dataset = SmilesDataset(
    train_smiles="path/to/train.txt",
    validation_smiles="path/to/valid.txt"
)

Metrics Explained

Validity Metrics

  • Valid fraction: Fraction of generated molecules that are chemically valid
  • Unique fraction: Fraction of generated molecules that are unique
  • Novel fraction: Fraction of generated molecules that are valid, unique, and not seen in the training data (reported as valid_and_unique_and_novel_fraction)
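
The three fractions can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the package's implementation: validity_fractions is a hypothetical name, is_valid stands in for a real chemistry check (for example, parsing with a cheminformatics toolkit), uniqueness is decided by plain string equality rather than canonical SMILES, and every fraction is taken over the total number of generated samples.

```python
def validity_fractions(generated, train_smiles, is_valid):
    """Compute validity, uniqueness, and novelty fractions over all samples."""
    total = len(generated)
    if total == 0:
        raise ValueError("No generated samples")
    # None marks a failed generation and is never valid.
    valid = [s for s in generated if s is not None and is_valid(s)]
    unique = set(valid)
    novel = unique - set(train_smiles)
    return {
        "valid_fraction": len(valid) / total,
        "valid_and_unique_fraction": len(unique) / total,
        "valid_and_unique_and_novel_fraction": len(novel) / total,
    }


generated = ["CCO", "CCO", "CCN", None]
# Toy validity check that accepts every non-None string.
scores = validity_fractions(generated, train_smiles=["CCO"], is_valid=lambda s: True)
print(scores)
```

In this toy input, three of four samples are valid, two distinct valid strings remain after deduplication, and only one of them is absent from the training list.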

Moses Metrics

Based on the Moses paper:

  • SNN score: Similarity to nearest neighbor in training set
  • Internal diversity: Average pairwise Tanimoto distance within generated set
  • Scaffold similarity: Similarity of molecular scaffolds to training set
  • Fragment similarity: Similarity of molecular fragments to training set
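
To make the first two definitions concrete, the sketch below computes Tanimoto similarity, internal diversity, and SNN on toy fingerprints represented as sets of on-bits. It follows the formulas from the Moses paper, IntDiv_p = 1 - (mean of T(a, b)^p)^(1/p) over all pairs, and SNN as the mean nearest-neighbor similarity to a reference set. The function names are hypothetical, the fingerprints are made up, and the package's own implementation (fingerprint type, handling of self-pairs) may differ.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def internal_diversity(fps, p=1):
    """1 - (mean of T(a, b)^p over all ordered pairs, self-pairs included)^(1/p)."""
    n = len(fps)
    mean_sim = sum(tanimoto(a, b) ** p for a in fps for b in fps) / (n * n)
    return 1.0 - mean_sim ** (1.0 / p)


def snn(generated_fps, reference_fps):
    """Mean similarity of each generated fingerprint to its nearest reference."""
    nearest = [max(tanimoto(g, r) for r in reference_fps) for g in generated_fps]
    return sum(nearest) / len(nearest)


# Made-up fingerprints: each set lists the indices of the bits that are set.
generated_fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
reference_fps = [{1, 2, 3}, {7, 8, 9}]

print(f"IntDiv: {internal_diversity(generated_fps):.3f}")
print(f"SNN:    {snn(generated_fps, reference_fps):.3f}")
```

Higher internal diversity means the generated molecules are less similar to one another, while a higher SNN means they sit closer to the reference set.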

Distribution Metrics

  • KL divergence score: Measures similarity of molecular property distributions
  • FCD score: Fréchet ChemNet Distance, measures distribution similarity in learned feature space
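
The following sketch illustrates the arithmetic behind both metrics on toy inputs. It assumes the GuacaMol convention of mapping each per-property KL divergence D to exp(-D) and averaging the results. Whether this package aggregates in exactly the same way is an assumption. For the Fréchet distance, only the univariate special case is shown. The general form between two Gaussians is d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), evaluated on ChemNet activations in the case of FCD. All function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-10):
    """D_KL(P || Q) for two discrete distributions given as probability lists."""
    # eps avoids division by zero when Q has empty bins.
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def kl_score(divergences):
    """Map per-property divergences into (0, 1] via exp(-D) and average them."""
    return sum(math.exp(-d) for d in divergences) / len(divergences)


def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between two univariate Gaussians."""
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2


# Made-up histograms of one property for training and generated molecules.
train_hist = [0.2, 0.5, 0.3]
generated_hist = [0.3, 0.4, 0.3]
divergence = kl_divergence(generated_hist, train_hist)
print(f"KL divergence: {divergence:.4f}")
print(f"KL score: {kl_score([divergence]):.4f}")
print(f"Fréchet distance (1D): {frechet_distance_1d(1.0, 2.0, 4.0, 2.0):.1f}")
```

Identical distributions give a divergence of 0 and therefore a KL score of 1, while a Fréchet distance of 0 means both Gaussians coincide.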

Advanced Usage

Direct SMILES Evaluation

For most use cases, directly evaluating a list of generated SMILES is the simplest approach:

# Custom number of samples and device
benchmarker = Benchmarker(
    dataset=dataset,
    num_samples_to_generate=50000,
    device="cuda"  # Use GPU for faster computation
)

# Your generated SMILES list (with None for failed generations;
# unparsable strings such as "invalid_smiles" are detected as invalid)
my_generated_smiles = [
    "CCO", "c1ccccc1", "CC(=O)O", None, "invalid_smiles",
    # ... up to 50000 molecules
]

# Run benchmarks directly
results = benchmarker.benchmark(my_generated_smiles)

# Access specific metric computations
# (underscore-prefixed methods are internal and may change between releases)
validity_scores = benchmarker._compute_validity_scores(my_generated_smiles)
fcd_scores = benchmarker._compute_fcd_scores(my_generated_smiles)

Model-Based Evaluation

For integration with generative models:

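from molecule_benchmarks.model import MoleculeGenerationModel

class BatchedModel(MoleculeGenerationModel):
    def generate_molecule_batch(self) -> list[str | None]:
        # Generate larger batches for efficiency
        # (self.model stands for your own sampler, set up in __init__)
        return self.model.sample(batch_size=1000)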

# Use the model with benchmarker
results = benchmarker.benchmark_model(BatchedModel())

Important Notes

  • SMILES format: Use None for molecules that failed to generate or are invalid
  • Sample count: The num_samples_to_generate parameter determines how many molecules are evaluated, so supply at least that many
  • Validation: Invalid SMILES strings are automatically detected and handled in the metrics
  • Performance: For large evaluations (>10k molecules), consider using GPU acceleration with device="cuda"
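
One way to follow the first note is to wrap the conversion step so that every failure becomes None instead of raising. The helper below is a minimal sketch: to_smiles_or_none is a hypothetical name, convert stands for whatever function turns your model's raw output into a SMILES string, and treating empty strings as failures is an assumption.

```python
def to_smiles_or_none(raw_outputs, convert):
    """Apply convert to each raw output, mapping every failure to None."""
    smiles = []
    for item in raw_outputs:
        try:
            result = convert(item)
        except Exception:
            # Any conversion error counts as a failed generation.
            result = None
        # Treat empty strings as failed generations as well (assumption).
        smiles.append(result if result else None)
    return smiles


# Toy conversion: strip whitespace; fails with TypeError on non-string input.
print(to_smiles_or_none(["CCO ", "", None], convert=str.strip))  # ['CCO', None, None]
```

Keeping the failures as None, rather than dropping them, preserves the sample count that the benchmarker expects.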

Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This benchmark suite implements and builds upon metrics from the Moses, GuacaMol, and FCD papers.
