Skip to main content

A modern platform for benchmarking ligand-based virtual screening methods with automated data processing, ML training, and evaluation. Supports molecular fingerprints, neural embeddings, similarity search, and ML classifiers.

Project description

Litmus

Ligand-based virtual screening is one of the most popular chemoinformatics approaches to screening for bioactive compounds in large molecular databases. It utilizes molecule vectorization approaches and machine learning (ML) models to identify potential bioactive molecules for novel problems. It poses a significant challenge due to extreme data imbalance (often <1% positive class), large datasets, and frequency of false positives. There is a pressing need for modern benchmark for evaluating algorithms in this area, with the most popular software published in 2013 and based on Python 2. The created platform should have an extendible, modular structure, allowing usage of many benchmarking datasets, molecular embedding algorithms, and evaluation methods. In particular, it should automate the data downloading, preprocessing, training of ML algorithms, evaluation, and statistical analysis of results. Practical tests of created software should include benchmarking multiple molecular fingerprints. Embedding from pretrained neural networks can also be analyzed. Bioactive search should support both efficient similarity searching based on appropriate distance metrics and ML classifiers like logistic regression and boosted ensembles.

A comprehensive platform for benchmarking ligand-based virtual screening methods with scikit-learn compatible interfaces.

Features

  • Standardized benchmarking protocols for virtual screening methods
  • Scikit-learn compatible API
  • Built-in dataset loaders and metrics
  • Comprehensive documentation and examples

Glossary

  • Dataset A single protein-ligand pairing task, consisting of one target protein (y) and a set of associated ligands (X). It represents one learning or prediction unit, typically used in structure-based drug discovery workflows.

  • Benchmark A curated collection of datasets proposed in the literature to evaluate models across a range of protein-ligand tasks. Benchmarks often represent a standard suite of datasets used to compare algorithmic performance under consistent conditions.

  • Platform An overarching collection of datasets, possibly post-processed (e.g., deduplicated, filtered) to meet certain criteria such as target uniqueness. A platform is designed to serve as a consistent, reusable foundation for large-scale experimentation and model development.

Available datasets

All datasets are hosted on HuggingFace Hub and are automatically downloaded when used:

  • MUV - 17 targets, Parquet format
  • LIT-PCBA - 15 targets, Parquet format
  • WelQrate - 9 targets, Parquet format, 5 seeds for cross-validation
  • DUD-AD - 55 targets, CSV format

Datasets are available at: https://huggingface.co/datasets/scikit-fingerprints/litmus

How to use this benchmark?

All datasets are automatically downloaded from HuggingFace Hub when first used. The example below demonstrates how to use the Benchmark class to evaluate the performance of an embedding method. In this case, we will benchmark the ECFPFingerprint from the skfp library.

Note: For WelQrate dataset, the benchmark automatically runs cross-validation across all 5 seeds and reports both per-seed results and averaged results.

from skfp.fingerprints import ECFPFingerprint
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.pipeline import make_pipeline

from litmus.benchmarking.benchmark import Benchmark

# Define a pipeline that transforms SMILES strings into numerical vectors.
# The pipeline must expose a `fit_transform` method.
pipeline = make_pipeline(
    MolFromSmilesTransformer(suppress_warnings=True), ECFPFingerprint()
)

# Create a Benchmark object and provide the pipline to the benchmark
benchmark = Benchmark(pipeline=pipeline)

# Run the benchmark
results = benchmark.run()

# Print a summary of the results
print(results.summary)

Dataset Splitting

Litmus provides advanced dataset splitting capabilities for virtual screening experiments, ensuring proper train/test separation while maintaining class balance and molecular diversity.

Max-Min Splitting Algorithm

The platform implements a max-min splitting algorithm that:

  • Separates active and inactive compounds independently
  • Maintains class balance in both train and test sets
  • Ensures molecular diversity through maximum dissimilarity selection
  • Supports reproducible splits with configurable random seeds
  • Handles extreme class imbalance common in virtual screening datasets

Working with Dataset Splits

Loading Pre-generated Splits

All datasets come with pre-generated train/test splits that are automatically downloaded from HuggingFace Hub. These splits are generated using the max-min algorithm to ensure molecular diversity and class balance.

Access pre-generated splits for reproducible experiments:

from litmus.datasets import LITPCBADownloader, MUVDownloader, WelQrateDownloader

# Load full dataset
lit_pcba_downloader = LITPCBADownloader()
df = lit_pcba_downloader.load(target="OPRK1")

# Get split indices (automatically downloaded from HuggingFace)
splits = lit_pcba_downloader.get_splits(target="OPRK1")

# Create train and test subsets
train_df = df.iloc[splits["train"]].copy()
test_df = df.iloc[splits["test"]].copy()

# For WelQrate, you can specify a seed for cross-validation (1-5)
welqrate_downloader = WelQrateDownloader()
df = welqrate_downloader.load(target="AID2258")
splits = welqrate_downloader.get_splits(target="AID2258", seed=1)  # Use seed 1

Custom Split Generation (Not Recommended)

⚠️ Warning: Generating custom splits is not recommended for reproducible research. Use pre-generated splits whenever possible to ensure consistent benchmarking across studies.

However, if you need custom splits for specific research requirements:

from litmus.model_selection.splitters.maxmin_split import maxmin_train_test_split

# Load your dataset
df = lit_pcba_downloader.load(target="OPRK1")

# Separate active and inactive compounds
active_smiles = df[df["OPRK1"] == 1]["SMILES"].tolist()
inactive_smiles = df[df["OPRK1"] == 0]["SMILES"].tolist()

# Generate custom splits with different parameters
train_active, test_active = maxmin_train_test_split(
    data=active_smiles,
    train_size=0.8,  # 80% for training (different from default 75%)
    random_state=123,  # Custom random seed
    show_progress=True
)

train_inactive, test_inactive = maxmin_train_test_split(
    data=inactive_smiles,
    train_size=0.8,
    random_state=123,
    show_progress=True
)

# Combine indices
train_idx = train_active + train_inactive
test_idx = test_active + test_inactive

# Save custom splits (optional)
import numpy as np
np.save("custom_train_idx.npy", train_idx)
np.save("custom_test_idx.npy", test_idx)

Important considerations for custom splits:

  • Document your splitting parameters and random seed
  • Ensure class balance is maintained
  • Consider molecular diversity in your split
  • Validate split quality before proceeding with experiments

Supported File Formats

  • Parquet: High-performance columnar format (MUV, LIT-PCBA, WelQrate)
  • CSV: Standard comma-separated values format (DUD-AD)
  • NumPy arrays: All splits are provided as .npy files with train/test indices

All datasets and splits are automatically downloaded from HuggingFace Hub when first accessed.

Example Scripts

Check out the example scripts in the examples/ directory:

  • work_with_dataset_splits.py: Comprehensive example of working with dataset splits
  • download_muv_dataset.py: Example of downloading and using MUV dataset
  • download_lit_pcba_dataset.py: Example of downloading and using LIT-PCBA dataset
  • download_welqrate_dataset.py: Example of downloading and using WelQrate dataset with seeds
  • benchmarking_example.py: Example of running benchmarks (includes WelQrate cross-validation)

Visualization Guide

Litmus provides comprehensive visualization tools for analyzing and interpreting benchmark results. This section describes each chart type, what it shows, and how to interpret the results.

Quick Start

from litmus.visualization import plot_all, plot_model_comparison, plot_top_targets

# Generate all plots at once
plot_all(results, output_dir="plots")

# Or generate specific plots
plot_model_comparison(results, output_dir="plots")
plot_top_targets(results, n_targets=15, output_dir="plots")

Available Charts

1. Model Comparison (model_comparison.png)

What it shows: Boxplots comparing the distribution of metric scores (AUROC, AUPRC, BEDROC) across different models.

How to interpret:

  • Each subplot represents one metric (e.g., AUROC, AUPRC, BEDROC)
  • The box shows the interquartile range (IQR) - middle 50% of scores
  • The horizontal line inside the box is the median score
  • Whiskers extend to 1.5× IQR; points beyond are outliers
  • Mean values are annotated on top of each box
  • Higher and tighter boxes indicate better and more consistent performance
  • Compare boxes side-by-side to see which model performs better overall

Use case: Quick comparison of overall model performance across all targets and datasets.


2. Benchmark Performance (benchmark_performance.png)

What it shows: Grouped bar charts showing average metric scores for each model across different benchmarks/datasets.

How to interpret:

  • Each subplot represents one metric (AUROC, AUPRC)
  • X-axis shows different benchmarks (MUV, LIT-PCBA, DUD-AD, WELQRATE)
  • Y-axis shows the average score
  • Bars are grouped by model, using consistent colors
  • Taller bars indicate better performance on that benchmark
  • Compare bar heights within each benchmark to see relative model performance

Use case: Understanding how models perform on different types of datasets and identifying dataset-specific strengths/weaknesses.


3. Benchmark Heatmap (benchmark_heatmap.png)

What it shows: Heatmaps displaying average scores for each benchmark-metric combination, with separate heatmaps per model.

How to interpret:

  • Rows represent benchmarks (datasets)
  • Columns represent different metrics
  • Color intensity indicates score magnitude (darker = higher for YlOrRd colormap)
  • Numerical values are annotated in each cell
  • Look for patterns: consistent high scores across metrics indicate robust performance
  • Red/orange cells indicate strong performance; yellow cells indicate weaker performance

Use case: Comprehensive overview of model performance across all dimensions; identifying which metrics a model excels at.


4. Violin Grid (violin_grid.png)

What it shows: A grid of violin plots where rows are datasets and columns are metrics, showing the full distribution of scores.

How to interpret:

  • Each cell shows the score distribution for one dataset-metric combination
  • Violin width indicates density of scores at that value (wider = more common)
  • Red diamond markers show the mean value
  • Inner points show individual data points
  • Wider violins at high values indicate consistently good performance
  • Bimodal distributions (two peaks) may indicate different behavior on different targets

Use case: Detailed analysis of score distributions; understanding variance and identifying potential outliers or subgroups.


5. Metric Bars (metric_bars.png)

What it shows: Grouped bar charts for each benchmark, showing average scores for different metrics side-by-side.

How to interpret:

  • Each subplot represents one benchmark/dataset
  • X-axis shows different metrics
  • Bars are grouped by model
  • Compare bar heights to see which model performs better on each metric
  • Compare across subplots to see how the same model-metric combination varies by dataset

Use case: Detailed metric-by-metric comparison within each benchmark.


6. Distribution Violins (distribution_violins.png)

What it shows: A 2×2 grid of violin plots showing how metric scores are distributed across different benchmarks.

How to interpret:

  • Each subplot shows one metric (AUROC, AUPRC, BEDROC, RIE)
  • X-axis shows different benchmarks
  • Split violins (when 2 models) or grouped violins show model comparison
  • Overlapping violins indicate similar performance
  • Non-overlapping violins indicate significant performance differences
  • Long tails indicate high variance in performance

Use case: Comparing model performance distributions across benchmarks; statistical significance assessment.


7. Enrichment Factors (enrichment_factors.png)

What it shows: Bar chart showing Enrichment Factor (EF) values at 1% and 5% thresholds for each benchmark-model combination.

How to interpret:

  • EF measures how much better than random the model is at ranking actives
  • EF 1%: How enriched are actives in the top 1% of ranked compounds
  • EF 5%: How enriched are actives in the top 5% of ranked compounds
  • EF = 1 means random performance; EF = 10 means 10× better than random
  • Higher bars indicate better early enrichment (more actives found early)
  • EF 1% is more stringent; EF 5% captures broader early enrichment

Use case: Virtual screening applications where only top-ranked compounds will be tested experimentally.


8. Metric Correlation (metric_correlation.png)

What it shows: Correlation heatmaps showing how different metrics correlate with each other, with separate heatmaps per model.

How to interpret:

  • Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation)
  • Red cells indicate positive correlation (metrics move together)
  • Blue cells indicate negative correlation (metrics move oppositely)
  • Diagonal is always 1.0 (perfect self-correlation)
  • High correlations between metrics suggest they measure similar aspects
  • Low correlations indicate the metrics capture different information

Use case: Understanding metric relationships; identifying redundant metrics; validating that chosen metrics provide diverse evaluation.


9. Metric Histograms (metric_histograms.png)

What it shows: Overlapping histograms with kernel density estimation (KDE) curves showing the frequency distribution of metric scores.

How to interpret:

  • X-axis shows score values
  • Y-axis shows frequency (count of targets with that score)
  • Bars show histogram bins; curves show smoothed density estimates
  • Non-overlapping distributions indicate clear performance differences between models
  • Right-shifted distributions indicate overall better performance
  • Narrow, tall peaks indicate consistent performance
  • Wide, flat distributions indicate high variance

Use case: Understanding the overall score distribution; identifying whether differences are statistically meaningful.


10. Top Targets (top_targets.png)

What it shows: Horizontal bar charts showing the top N best-performing targets for each metric, with target labels including the dataset name.

How to interpret:

  • Y-axis shows target names with dataset in parentheses, e.g., "OPRK1 (LIT-PCBA)"
  • X-axis shows the score value
  • Bars are grouped by model for each target
  • Longer bars indicate higher scores
  • Similar bar lengths for both models indicate comparable performance on that target
  • Large differences highlight targets where one model significantly outperforms the other

Use case: Identifying which specific protein targets are easiest/hardest to predict; finding model-specific strengths on particular targets.


Customizing Plots

All plot functions accept the following common parameters:

Parameter Default Description
output_dir "plots" Directory to save the plot
style "whitegrid" Seaborn style (whitegrid, darkgrid, white, dark)
figure_dpi 100 DPI for figure display
savefig_dpi 300 DPI for saved figures (higher = better quality)
font_size 10 Base font size for labels

Some plots have additional parameters:

# Customize number of top targets
plot_top_targets(results, n_targets=20)

# Customize histogram bins
plot_metric_histograms(results, bins=20)

# Customize which metrics to show
plot_violin_grid(results, metrics=["AUROC", "AUPRC"])

Color Consistency

All plots use a consistent color palette (MODEL_PALETTE) to ensure models are represented with the same colors across all visualizations:

  • Model 1: Teal (#66c2a5)
  • Model 2: Orange (#fc8d62)
  • Model 3: Blue (#8da0cb)
  • Model 4: Pink (#e78ac3)
  • Model 5: Green (#a6d854)

This consistency makes it easy to track model performance across different chart types.

Development

Setup

  1. Install Poetry (package manager):

    pip install poetry
    
  2. Install dependencies:

    poetry install
    
  3. Install pre-commit hooks:

    poetry run pre-commit install
    

Code Quality

This project uses Ruff for code formatting and linting. To format your code:

# Format and lint all files
poetry run ruff format .
poetry run ruff check . --fix

# Format and lint specific files
poetry run ruff format path/to/file.py
poetry run ruff check path/to/file.py --fix

The code formatting will also run automatically when you commit changes, thanks to pre-commit hooks.

Running Tests

To run the tests:

# Run all tests
poetry run pytest

# Run specific test file
poetry run pytest tests/test_specific.py

# Run tests with coverage report
poetry run pytest --cov=litmus --cov-report=term-missing

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lbvslitmus-1.0.1.tar.gz (52.9 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lbvslitmus-1.0.1-py3-none-any.whl (76.0 kB view details)

Uploaded Python 3

File details

Details for the file lbvslitmus-1.0.1.tar.gz.

File metadata

  • Download URL: lbvslitmus-1.0.1.tar.gz
  • Upload date:
  • Size: 52.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for lbvslitmus-1.0.1.tar.gz
Algorithm Hash digest
SHA256 bcfd94ad3369405df204dfd3dc14419139515dbe2dadc5a8d37cabf3bf868414
MD5 90017401aed74edaef43f70d2f8ceeb0
BLAKE2b-256 7be392a4ac363879b2cbc6abad126fbc0e103aa2fddcf1f3bf512b4ed1ae5f48

See more details on using hashes here.

File details

Details for the file lbvslitmus-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: lbvslitmus-1.0.1-py3-none-any.whl
  • Upload date:
  • Size: 76.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.18 {"installer":{"name":"uv","version":"0.9.18","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":null,"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":null}

File hashes

Hashes for lbvslitmus-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 4303b570b9845a8a68a040151f91cd68ed3feebab9bbb26db3dba87a9531b017
MD5 a030d621953b8451d2c77356e0ec911b
BLAKE2b-256 a1f8cd529f7e591e8ca470b16eb29ca4dc7166561a30e21dd621e3053d611656

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page