Skip to main content

A modern, extendible platform for ligand-based virtual screening benchmarking. Automates data downloading, preprocessing, ML training, evaluation, and statistical analysis. Supports multiple datasets, molecular embeddings (fingerprints and pretrained neural networks), similarity search, and ML classifiers for identifying bioactive compounds.

Project description

Litmus

Ligand-based virtual screening is one of the most popular chemoinformatics approaches to screening for bioactive compounds in large molecular databases. It utilizes molecule vectorization approaches and machine learning (ML) models to identify potential bioactive molecules for novel problems. It poses a significant challenge due to extreme data imbalance (often <1% positive class), large datasets, and frequency of false positives. There is a pressing need for modern benchmark for evaluating algorithms in this area, with the most popular software published in 2013 and based on Python 2. The created platform should have an extendible, modular structure, allowing usage of many benchmarking datasets, molecular embedding algorithms, and evaluation methods. In particular, it should automate the data downloading, preprocessing, training of ML algorithms, evaluation, and statistical analysis of results. Practical tests of created software should include benchmarking multiple molecular fingerprints. Embedding from pretrained neural networks can also be analyzed. Bioactive search should support both efficient similarity searching based on appropriate distance metrics and ML classifiers like logistic regression and boosted ensembles.

A comprehensive platform for benchmarking ligand-based virtual screening methods with scikit-learn compatible interfaces.

Features

  • Standardized benchmarking protocols for virtual screening methods
  • Scikit-learn compatible API
  • Built-in dataset loaders and metrics
  • Comprehensive documentation and examples

Glossary

  • Dataset A single protein-ligand pairing task, consisting of one target protein (y) and a set of associated ligands (X). It represents one learning or prediction unit, typically used in structure-based drug discovery workflows.

  • Benchmark A curated collection of datasets proposed in the literature to evaluate models across a range of protein-ligand tasks. Benchmarks often represent a standard suite of datasets used to compare algorithmic performance under consistent conditions.

  • Platform An overarching collection of datasets, possibly post-processed (e.g., deduplicated, filtered) to meet certain criteria such as target uniqueness. A platform is designed to serve as a consistent, reusable foundation for large-scale experimentation and model development.

Available datasets

All datasets are hosted on HuggingFace Hub and are automatically downloaded when used:

  • MUV - 17 targets, Parquet format
  • LIT-PCBA - 15 targets, Parquet format
  • WelQrate - 9 targets, Parquet format, 5 seeds for cross-validation
  • DUD-AD - 55 targets, CSV format

Datasets are available at: https://huggingface.co/datasets/scikit-fingerprints/litmus

How to use this benchmark?

All datasets are automatically downloaded from HuggingFace Hub when first used. The example below demonstrates how to use the Benchmark class to evaluate the performance of an embedding method. In this case, we will benchmark the ECFPFingerprint from the skfp library.

Note: For WelQrate dataset, the benchmark automatically runs cross-validation across all 5 seeds and reports both per-seed results and averaged results.

from skfp.fingerprints import ECFPFingerprint
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.pipeline import make_pipeline

from lbvslitmus.benchmarking.benchmark import Benchmark

# Define a pipeline that transforms SMILES strings into numerical vectors.
# The pipeline must expose a `fit_transform` method.
pipeline = make_pipeline(
   MolFromSmilesTransformer(suppress_warnings=True), ECFPFingerprint()
)

# Create a Benchmark object and provide the pipline to the benchmark
benchmark = Benchmark(pipeline=pipeline)

# Run the benchmark
results = benchmark.run()

# Print a summary of the results
print(results.summary)

Dataset Splitting

Litmus provides advanced dataset splitting capabilities for virtual screening experiments, ensuring proper train/test separation while maintaining class balance and molecular diversity.

Max-Min Splitting Algorithm

The platform implements a max-min splitting algorithm that:

  • Separates active and inactive compounds independently
  • Maintains class balance in both train and test sets
  • Ensures molecular diversity through maximum dissimilarity selection
  • Supports reproducible splits with configurable random seeds
  • Handles extreme class imbalance common in virtual screening datasets

Working with Dataset Splits

Loading Pre-generated Splits

All datasets come with pre-generated train/test splits that are automatically downloaded from HuggingFace Hub. These splits are generated using the max-min algorithm to ensure molecular diversity and class balance.

Access pre-generated splits for reproducible experiments:

from lbvslitmus.datasets import LITPCBADownloader, MUVDownloader, WelQrateDownloader

# Load full dataset
lit_pcba_downloader = LITPCBADownloader()
df = lit_pcba_downloader.load(target="OPRK1")

# Get split indices (automatically downloaded from HuggingFace)
splits = lit_pcba_downloader.get_splits(target="OPRK1")

# Create train and test subsets
train_df = df.iloc[splits["train"]].copy()
test_df = df.iloc[splits["test"]].copy()

# For WelQrate, you can specify a seed for cross-validation (1-5)
welqrate_downloader = WelQrateDownloader()
df = welqrate_downloader.load(target="AID2258")
splits = welqrate_downloader.get_splits(target="AID2258", seed=1)  # Use seed 1

Custom Split Generation (Not Recommended)

⚠️ Warning: Generating custom splits is not recommended for reproducible research. Use pre-generated splits whenever possible to ensure consistent benchmarking across studies.

However, if you need custom splits for specific research requirements:

from lbvslitmus.model_selection.splitters.maxmin_split import maxmin_train_test_split

# Load your dataset
df = lit_pcba_downloader.load(target="OPRK1")

# Separate active and inactive compounds
active_smiles = df[df["OPRK1"] == 1]["SMILES"].tolist()
inactive_smiles = df[df["OPRK1"] == 0]["SMILES"].tolist()

# Generate custom splits with different parameters
train_active, test_active = maxmin_train_test_split(
   data=active_smiles,
   train_size=0.8,  # 80% for training (different from default 75%)
   random_state=123,  # Custom random seed
   show_progress=True
)

train_inactive, test_inactive = maxmin_train_test_split(
   data=inactive_smiles,
   train_size=0.8,
   random_state=123,
   show_progress=True
)

# Combine indices
train_idx = train_active + train_inactive
test_idx = test_active + test_inactive

# Save custom splits (optional)
import numpy as np

np.save("custom_train_idx.npy", train_idx)
np.save("custom_test_idx.npy", test_idx)

Important considerations for custom splits:

  • Document your splitting parameters and random seed
  • Ensure class balance is maintained
  • Consider molecular diversity in your split
  • Validate split quality before proceeding with experiments

Supported File Formats

  • Parquet: High-performance columnar format (MUV, LIT-PCBA, WelQrate)
  • CSV: Standard comma-separated values format (DUD-AD)
  • NumPy arrays: All splits are provided as .npy files with train/test indices

All datasets and splits are automatically downloaded from HuggingFace Hub when first accessed.

Example Scripts

Check out the example scripts in the examples/ directory:

  • work_with_dataset_splits.py: Comprehensive example of working with dataset splits
  • download_muv_dataset.py: Example of downloading and using MUV dataset
  • download_lit_pcba_dataset.py: Example of downloading and using LIT-PCBA dataset
  • download_welqrate_dataset.py: Example of downloading and using WelQrate dataset with seeds
  • benchmarking_example.py: Example of running benchmarks (includes WelQrate cross-validation)
  • compare_with_baselines_example.py: Bayesian statistical comparison of molecular fingerprints with baselines

Bayesian Fingerprint Comparison

Litmus provides a comprehensive framework for statistically rigorous comparison of molecular fingerprints using Bayesian methods. This enables you to determine whether differences between fingerprints are meaningful or negligible.

Quick Start:

from lbvslitmus.benchmarking import Benchmark, BaselineLoader
from lbvslitmus.comparison import BaselineComparator

# Run benchmark with your fingerprint
benchmark = Benchmark(pipeline=your_pipeline, benchmarks=["MUV"])
results = benchmark.run()

# Compare with baseline fingerprints
comparison = results.compare_with_baseline(
    baseline=["ECFP4", "MACCS", "AtomPair"],
    metric="AUROC",
    fingerprint_name="MyFingerprint",
    control_fingerprint="MyFingerprint",  # Compare all vs yours
)

# Generate visualizations
comparison.plot(output_dir="plots/comparison")

Key Features:

  • Pairwise Bayesian comparisons using Bradley-Terry model
  • ROPE (Region of Practical Equivalence) framework for meaningful differences
  • Support for multiple metrics (AUROC, AUPRC, BEDROC, etc.)
  • Automatic generation of heatmaps and win count visualizations
  • Configurable MCMC sampling parameters
  • Pre-computed baseline results for standard fingerprints (ECFP4, MACCS, etc.)

Available Baseline Fingerprints: All baseline results are hosted on HuggingFace Hub and automatically downloaded:

  • ECFP4, ECFP4_Count, ECFP6, ECFP6_Count, ECFP8, ECFP8_Count
  • MACCS, AtomPair, PubChem, TopologicalTorsion

See examples/compare_with_baselines_example.py for a complete example.

Visualization Guide

Litmus provides comprehensive visualization tools for analyzing and interpreting benchmark results. This section describes each chart type, what it shows, and how to interpret the results.

Quick Start

from lbvslitmus.visualization import plot_all, plot_model_comparison, plot_top_targets

# Generate all plots at once
plot_all(results, output_dir="plots")

# Or generate specific plots
plot_model_comparison(results, output_dir="plots")
plot_top_targets(results, n_targets=15, output_dir="plots")

Available Charts

1. Model Comparison (model_comparison.png)

What it shows: Boxplots comparing the distribution of metric scores (AUROC, AUPRC, BEDROC) across different models.

How to interpret:

  • Each subplot represents one metric (e.g., AUROC, AUPRC, BEDROC)
  • The box shows the interquartile range (IQR) - middle 50% of scores
  • The horizontal line inside the box is the median score
  • Whiskers extend to 1.5× IQR; points beyond are outliers
  • Mean values are annotated on top of each box
  • Higher and tighter boxes indicate better and more consistent performance
  • Compare boxes side-by-side to see which model performs better overall

Use case: Quick comparison of overall model performance across all targets and datasets.


2. Benchmark Performance (benchmark_performance.png)

What it shows: Grouped bar charts showing average metric scores for each model across different benchmarks/datasets.

How to interpret:

  • Each subplot represents one metric (AUROC, AUPRC)
  • X-axis shows different benchmarks (MUV, LIT-PCBA, DUD-AD, WELQRATE)
  • Y-axis shows the average score
  • Bars are grouped by model, using consistent colors
  • Taller bars indicate better performance on that benchmark
  • Compare bar heights within each benchmark to see relative model performance

Use case: Understanding how models perform on different types of datasets and identifying dataset-specific strengths/weaknesses.


3. Benchmark Heatmap (benchmark_heatmap.png)

What it shows: Heatmaps displaying average scores for each benchmark-metric combination, with separate heatmaps per model.

How to interpret:

  • Rows represent benchmarks (datasets)
  • Columns represent different metrics
  • Color intensity indicates score magnitude (darker = higher for YlOrRd colormap)
  • Numerical values are annotated in each cell
  • Look for patterns: consistent high scores across metrics indicate robust performance
  • Red/orange cells indicate strong performance; yellow cells indicate weaker performance

Use case: Comprehensive overview of model performance across all dimensions; identifying which metrics a model excels at.


4. Violin Grid (violin_grid.png)

What it shows: A grid of violin plots where rows are datasets and columns are metrics, showing the full distribution of scores.

How to interpret:

  • Each cell shows the score distribution for one dataset-metric combination
  • Violin width indicates density of scores at that value (wider = more common)
  • Red diamond markers show the mean value
  • Inner points show individual data points
  • Wider violins at high values indicate consistently good performance
  • Bimodal distributions (two peaks) may indicate different behavior on different targets

Use case: Detailed analysis of score distributions; understanding variance and identifying potential outliers or subgroups.


5. Metric Bars (metric_bars.png)

What it shows: Grouped bar charts for each benchmark, showing average scores for different metrics side-by-side.

How to interpret:

  • Each subplot represents one benchmark/dataset
  • X-axis shows different metrics
  • Bars are grouped by model
  • Compare bar heights to see which model performs better on each metric
  • Compare across subplots to see how the same model-metric combination varies by dataset

Use case: Detailed metric-by-metric comparison within each benchmark.


6. Distribution Violins (distribution_violins.png)

What it shows: A 2×2 grid of violin plots showing how metric scores are distributed across different benchmarks.

How to interpret:

  • Each subplot shows one metric (AUROC, AUPRC, BEDROC, RIE)
  • X-axis shows different benchmarks
  • Split violins (when 2 models) or grouped violins show model comparison
  • Overlapping violins indicate similar performance
  • Non-overlapping violins indicate significant performance differences
  • Long tails indicate high variance in performance

Use case: Comparing model performance distributions across benchmarks; statistical significance assessment.


7. Enrichment Factors (enrichment_factors.png)

What it shows: Bar chart showing Enrichment Factor (EF) values at 1% and 5% thresholds for each benchmark-model combination.

How to interpret:

  • EF measures how much better than random the model is at ranking actives
  • EF 1%: How enriched are actives in the top 1% of ranked compounds
  • EF 5%: How enriched are actives in the top 5% of ranked compounds
  • EF = 1 means random performance; EF = 10 means 10× better than random
  • Higher bars indicate better early enrichment (more actives found early)
  • EF 1% is more stringent; EF 5% captures broader early enrichment

Use case: Virtual screening applications where only top-ranked compounds will be tested experimentally.


8. Metric Correlation (metric_correlation.png)

What it shows: Correlation heatmaps showing how different metrics correlate with each other, with separate heatmaps per model.

How to interpret:

  • Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation)
  • Red cells indicate positive correlation (metrics move together)
  • Blue cells indicate negative correlation (metrics move oppositely)
  • Diagonal is always 1.0 (perfect self-correlation)
  • High correlations between metrics suggest they measure similar aspects
  • Low correlations indicate the metrics capture different information

Use case: Understanding metric relationships; identifying redundant metrics; validating that chosen metrics provide diverse evaluation.


9. Metric Histograms (metric_histograms.png)

What it shows: Overlapping histograms with kernel density estimation (KDE) curves showing the frequency distribution of metric scores.

How to interpret:

  • X-axis shows score values
  • Y-axis shows frequency (count of targets with that score)
  • Bars show histogram bins; curves show smoothed density estimates
  • Non-overlapping distributions indicate clear performance differences between models
  • Right-shifted distributions indicate overall better performance
  • Narrow, tall peaks indicate consistent performance
  • Wide, flat distributions indicate high variance

Use case: Understanding the overall score distribution; identifying whether differences are statistically meaningful.


10. Top Targets (top_targets.png)

What it shows: Horizontal bar charts showing the top N best-performing targets for each metric, with target labels including the dataset name.

How to interpret:

  • Y-axis shows target names with dataset in parentheses, e.g., "OPRK1 (LIT-PCBA)"
  • X-axis shows the score value
  • Bars are grouped by model for each target
  • Longer bars indicate higher scores
  • Similar bar lengths for both models indicate comparable performance on that target
  • Large differences highlight targets where one model significantly outperforms the other

Use case: Identifying which specific protein targets are easiest/hardest to predict; finding model-specific strengths on particular targets.


Comparison Visualizations

In addition to benchmark result visualizations, Litmus provides specialized visualizations for fingerprint comparison results: Example:

from lbvslitmus.comparison import plot_comparison_heatmap, plot_win_counts

# Generate comparison visualizations
comparison = results.compare_with_baseline(
    baseline=["ECFP4", "MACCS", "AtomPair"],
    metric="AUROC",
    fingerprint_name="MyFP",
    control_fingerprint="MyFP"
)

# Generate all visualizations at once
comparison.plot(output_dir="plots/comparison")

Customizing Plots

All plot functions accept the following common parameters:

Parameter Default Description
output_dir "plots" Directory to save the plot
style "whitegrid" Seaborn style (whitegrid, darkgrid, white, dark)
figure_dpi 100 DPI for figure display
savefig_dpi 300 DPI for saved figures (higher = better quality)
font_size 10 Base font size for labels

Some plots have additional parameters:

# Customize number of top targets
plot_top_targets(results, n_targets=20)

# Customize histogram bins
plot_metric_histograms(results, bins=20)

# Customize which metrics to show
plot_violin_grid(results, metrics=["AUROC", "AUPRC"])

Color Consistency

All plots use a consistent color palette (MODEL_PALETTE) to ensure models are represented with the same colors across all visualizations:

  • Model 1: Teal (#66c2a5)
  • Model 2: Orange (#fc8d62)
  • Model 3: Blue (#8da0cb)
  • Model 4: Pink (#e78ac3)
  • Model 5: Green (#a6d854)

This consistency makes it easy to track model performance across different chart types.

Installation

Install the latest stable release from PyPI:

pip install LBVSLitmus

Development

Setup

To install from source for development:

  1. Clone the repository:

    git clone https://github.com/your-org/virtual_screening_platform.git
    cd virtual_screening_platform
    
  2. Install uv using official guide

  3. Install dependencies:

    uv sync
    
  4. Install pre-commit hooks:

    uv run pre-commit install
    

Code Quality

This project uses Ruff for code formatting and linting. To format your code:

# Format and lint all files
uv run ruff format .
uv run ruff check . --fix

# Format and lint specific files
uv run ruff format path/to/file.py
uv run ruff check path/to/file.py --fix

The code formatting will also run automatically when you commit changes, thanks to pre-commit hooks.

Running Tests

To run the tests:

# Run all tests
uv run pytest

# Run specific test file
uv run pytest tests/test_specific.py

# Run tests with coverage report
uv run pytest --cov=lbvslitmus --cov-report=term-missing

License

MIT License

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lbvslitmus-1.1.0.tar.gz (61.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lbvslitmus-1.1.0-py3-none-any.whl (89.1 kB view details)

Uploaded Python 3

File details

Details for the file lbvslitmus-1.1.0.tar.gz.

File metadata

  • Download URL: lbvslitmus-1.1.0.tar.gz
  • Upload date:
  • Size: 61.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lbvslitmus-1.1.0.tar.gz
Algorithm Hash digest
SHA256 aba896561c80b6a0894fd8b70d8650ddd90bd729281b780dfdc0dd5b10c09c39
MD5 63e709dfddcef425d8baff19876b1ed8
BLAKE2b-256 3a968a7a387c6f8a1b5b0d208342507f61bc13877465767003713676cce02586

See more details on using hashes here.

File details

Details for the file lbvslitmus-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: lbvslitmus-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 89.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for lbvslitmus-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 312afc7dcc118b21807a9cceca84b4b744cd4831282b115bc2d7c615def77f12
MD5 5e5ae27d0124d49ce025cd43fe855126
BLAKE2b-256 fbab7afd65d8cba0ce707fd89e5737673419cfac9b7b95f7ea93fb14a7f271bc

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page