A modern, extendible platform for ligand-based virtual screening benchmarking. Automates data downloading, preprocessing, ML training, evaluation, and statistical analysis. Supports multiple datasets, molecular embeddings (fingerprints and pretrained neural networks), similarity search, and ML classifiers for identifying bioactive compounds.
Project description
Litmus
Ligand-based virtual screening is one of the most popular chemoinformatics approaches to screening for bioactive compounds in large molecular databases. It utilizes molecule vectorization approaches and machine learning (ML) models to identify potential bioactive molecules for novel problems. It poses a significant challenge due to extreme data imbalance (often <1% positive class), large datasets, and frequency of false positives. There is a pressing need for modern benchmark for evaluating algorithms in this area, with the most popular software published in 2013 and based on Python 2. The created platform should have an extendible, modular structure, allowing usage of many benchmarking datasets, molecular embedding algorithms, and evaluation methods. In particular, it should automate the data downloading, preprocessing, training of ML algorithms, evaluation, and statistical analysis of results. Practical tests of created software should include benchmarking multiple molecular fingerprints. Embedding from pretrained neural networks can also be analyzed. Bioactive search should support both efficient similarity searching based on appropriate distance metrics and ML classifiers like logistic regression and boosted ensembles.
A comprehensive platform for benchmarking ligand-based virtual screening methods with scikit-learn compatible interfaces.
Features
- Standardized benchmarking protocols for virtual screening methods
- Scikit-learn compatible API
- Built-in dataset loaders and metrics
- Comprehensive documentation and examples
Glossary
-
Dataset A single protein-ligand pairing task, consisting of one target protein (y) and a set of associated ligands (X). It represents one learning or prediction unit, typically used in structure-based drug discovery workflows.
-
Benchmark A curated collection of datasets proposed in the literature to evaluate models across a range of protein-ligand tasks. Benchmarks often represent a standard suite of datasets used to compare algorithmic performance under consistent conditions.
-
Platform An overarching collection of datasets, possibly post-processed (e.g., deduplicated, filtered) to meet certain criteria such as target uniqueness. A platform is designed to serve as a consistent, reusable foundation for large-scale experimentation and model development.
Available datasets
All datasets are hosted on HuggingFace Hub and are automatically downloaded when used:
- MUV - 17 targets, Parquet format
- LIT-PCBA - 15 targets, Parquet format
- WelQrate - 9 targets, Parquet format, 5 seeds for cross-validation
- DUD-AD - 55 targets, CSV format
Datasets are available at: https://huggingface.co/datasets/scikit-fingerprints/litmus
How to use this benchmark?
All datasets are automatically downloaded from HuggingFace Hub when first used.
The example below demonstrates how to use the Benchmark class to evaluate the performance
of an embedding method. In this case, we will benchmark the ECFPFingerprint from the skfp library.
Note: For WelQrate dataset, the benchmark automatically runs cross-validation across all 5 seeds and reports both per-seed results and averaged results.
from skfp.fingerprints import ECFPFingerprint
from skfp.preprocessing import MolFromSmilesTransformer
from sklearn.pipeline import make_pipeline
from lbvslitmus.benchmarking.benchmark import Benchmark
# Define a pipeline that transforms SMILES strings into numerical vectors.
# The pipeline must expose a `fit_transform` method.
pipeline = make_pipeline(
MolFromSmilesTransformer(suppress_warnings=True), ECFPFingerprint()
)
# Create a Benchmark object and provide the pipline to the benchmark
benchmark = Benchmark(pipeline=pipeline)
# Run the benchmark
results = benchmark.run()
# Print a summary of the results
print(results.summary)
Dataset Splitting
Litmus provides advanced dataset splitting capabilities for virtual screening experiments, ensuring proper train/test separation while maintaining class balance and molecular diversity.
Max-Min Splitting Algorithm
The platform implements a max-min splitting algorithm that:
- Separates active and inactive compounds independently
- Maintains class balance in both train and test sets
- Ensures molecular diversity through maximum dissimilarity selection
- Supports reproducible splits with configurable random seeds
- Handles extreme class imbalance common in virtual screening datasets
Working with Dataset Splits
Loading Pre-generated Splits
All datasets come with pre-generated train/test splits that are automatically downloaded from HuggingFace Hub. These splits are generated using the max-min algorithm to ensure molecular diversity and class balance.
Access pre-generated splits for reproducible experiments:
from lbvslitmus.datasets import LITPCBADownloader, MUVDownloader, WelQrateDownloader
# Load full dataset
lit_pcba_downloader = LITPCBADownloader()
df = lit_pcba_downloader.load(target="OPRK1")
# Get split indices (automatically downloaded from HuggingFace)
splits = lit_pcba_downloader.get_splits(target="OPRK1")
# Create train and test subsets
train_df = df.iloc[splits["train"]].copy()
test_df = df.iloc[splits["test"]].copy()
# For WelQrate, you can specify a seed for cross-validation (1-5)
welqrate_downloader = WelQrateDownloader()
df = welqrate_downloader.load(target="AID2258")
splits = welqrate_downloader.get_splits(target="AID2258", seed=1) # Use seed 1
Custom Split Generation (Not Recommended)
⚠️ Warning: Generating custom splits is not recommended for reproducible research. Use pre-generated splits whenever possible to ensure consistent benchmarking across studies.
However, if you need custom splits for specific research requirements:
from lbvslitmus.model_selection.splitters.maxmin_split import maxmin_train_test_split
# Load your dataset
df = lit_pcba_downloader.load(target="OPRK1")
# Separate active and inactive compounds
active_smiles = df[df["OPRK1"] == 1]["SMILES"].tolist()
inactive_smiles = df[df["OPRK1"] == 0]["SMILES"].tolist()
# Generate custom splits with different parameters
train_active, test_active = maxmin_train_test_split(
data=active_smiles,
train_size=0.8, # 80% for training (different from default 75%)
random_state=123, # Custom random seed
show_progress=True
)
train_inactive, test_inactive = maxmin_train_test_split(
data=inactive_smiles,
train_size=0.8,
random_state=123,
show_progress=True
)
# Combine indices
train_idx = train_active + train_inactive
test_idx = test_active + test_inactive
# Save custom splits (optional)
import numpy as np
np.save("custom_train_idx.npy", train_idx)
np.save("custom_test_idx.npy", test_idx)
Important considerations for custom splits:
- Document your splitting parameters and random seed
- Ensure class balance is maintained
- Consider molecular diversity in your split
- Validate split quality before proceeding with experiments
Supported File Formats
- Parquet: High-performance columnar format (MUV, LIT-PCBA, WelQrate)
- CSV: Standard comma-separated values format (DUD-AD)
- NumPy arrays: All splits are provided as
.npyfiles with train/test indices
All datasets and splits are automatically downloaded from HuggingFace Hub when first accessed.
Example Scripts
Check out the example scripts in the examples/ directory:
work_with_dataset_splits.py: Comprehensive example of working with dataset splitsdownload_muv_dataset.py: Example of downloading and using MUV datasetdownload_lit_pcba_dataset.py: Example of downloading and using LIT-PCBA datasetdownload_welqrate_dataset.py: Example of downloading and using WelQrate dataset with seedsbenchmarking_example.py: Example of running benchmarks (includes WelQrate cross-validation)compare_with_baselines_example.py: Bayesian statistical comparison of molecular fingerprints with baselines
Bayesian Fingerprint Comparison
Litmus provides a comprehensive framework for statistically rigorous comparison of molecular fingerprints using Bayesian methods. This enables you to determine whether differences between fingerprints are meaningful or negligible.
Quick Start:
from lbvslitmus.benchmarking import Benchmark, BaselineLoader
from lbvslitmus.comparison import BaselineComparator
# Run benchmark with your fingerprint
benchmark = Benchmark(pipeline=your_pipeline, benchmarks=["MUV"])
results = benchmark.run()
# Compare with baseline fingerprints
comparison = results.compare_with_baseline(
baseline=["ECFP4", "MACCS", "AtomPair"],
metric="AUROC",
fingerprint_name="MyFingerprint",
control_fingerprint="MyFingerprint", # Compare all vs yours
)
# Generate visualizations
comparison.plot(output_dir="plots/comparison")
Key Features:
- Pairwise Bayesian comparisons using Bradley-Terry model
- ROPE (Region of Practical Equivalence) framework for meaningful differences
- Support for multiple metrics (AUROC, AUPRC, BEDROC, etc.)
- Automatic generation of heatmaps and win count visualizations
- Configurable MCMC sampling parameters
- Pre-computed baseline results for standard fingerprints (ECFP4, MACCS, etc.)
Available Baseline Fingerprints: All baseline results are hosted on HuggingFace Hub and automatically downloaded:
- ECFP4, ECFP4_Count, ECFP6, ECFP6_Count, ECFP8, ECFP8_Count
- MACCS, AtomPair, PubChem, TopologicalTorsion
See examples/compare_with_baselines_example.py for a complete example.
Visualization Guide
Litmus provides comprehensive visualization tools for analyzing and interpreting benchmark results. This section describes each chart type, what it shows, and how to interpret the results.
Quick Start
from lbvslitmus.visualization import plot_all, plot_model_comparison, plot_top_targets
# Generate all plots at once
plot_all(results, output_dir="plots")
# Or generate specific plots
plot_model_comparison(results, output_dir="plots")
plot_top_targets(results, n_targets=15, output_dir="plots")
Available Charts
1. Model Comparison (model_comparison.png)
What it shows: Boxplots comparing the distribution of metric scores (AUROC, AUPRC, BEDROC) across different models.
How to interpret:
- Each subplot represents one metric (e.g., AUROC, AUPRC, BEDROC)
- The box shows the interquartile range (IQR) - middle 50% of scores
- The horizontal line inside the box is the median score
- Whiskers extend to 1.5× IQR; points beyond are outliers
- Mean values are annotated on top of each box
- Higher and tighter boxes indicate better and more consistent performance
- Compare boxes side-by-side to see which model performs better overall
Use case: Quick comparison of overall model performance across all targets and datasets.
2. Benchmark Performance (benchmark_performance.png)
What it shows: Grouped bar charts showing average metric scores for each model across different benchmarks/datasets.
How to interpret:
- Each subplot represents one metric (AUROC, AUPRC)
- X-axis shows different benchmarks (MUV, LIT-PCBA, DUD-AD, WELQRATE)
- Y-axis shows the average score
- Bars are grouped by model, using consistent colors
- Taller bars indicate better performance on that benchmark
- Compare bar heights within each benchmark to see relative model performance
Use case: Understanding how models perform on different types of datasets and identifying dataset-specific strengths/weaknesses.
3. Benchmark Heatmap (benchmark_heatmap.png)
What it shows: Heatmaps displaying average scores for each benchmark-metric combination, with separate heatmaps per model.
How to interpret:
- Rows represent benchmarks (datasets)
- Columns represent different metrics
- Color intensity indicates score magnitude (darker = higher for YlOrRd colormap)
- Numerical values are annotated in each cell
- Look for patterns: consistent high scores across metrics indicate robust performance
- Red/orange cells indicate strong performance; yellow cells indicate weaker performance
Use case: Comprehensive overview of model performance across all dimensions; identifying which metrics a model excels at.
4. Violin Grid (violin_grid.png)
What it shows: A grid of violin plots where rows are datasets and columns are metrics, showing the full distribution of scores.
How to interpret:
- Each cell shows the score distribution for one dataset-metric combination
- Violin width indicates density of scores at that value (wider = more common)
- Red diamond markers show the mean value
- Inner points show individual data points
- Wider violins at high values indicate consistently good performance
- Bimodal distributions (two peaks) may indicate different behavior on different targets
Use case: Detailed analysis of score distributions; understanding variance and identifying potential outliers or subgroups.
5. Metric Bars (metric_bars.png)
What it shows: Grouped bar charts for each benchmark, showing average scores for different metrics side-by-side.
How to interpret:
- Each subplot represents one benchmark/dataset
- X-axis shows different metrics
- Bars are grouped by model
- Compare bar heights to see which model performs better on each metric
- Compare across subplots to see how the same model-metric combination varies by dataset
Use case: Detailed metric-by-metric comparison within each benchmark.
6. Distribution Violins (distribution_violins.png)
What it shows: A 2×2 grid of violin plots showing how metric scores are distributed across different benchmarks.
How to interpret:
- Each subplot shows one metric (AUROC, AUPRC, BEDROC, RIE)
- X-axis shows different benchmarks
- Split violins (when 2 models) or grouped violins show model comparison
- Overlapping violins indicate similar performance
- Non-overlapping violins indicate significant performance differences
- Long tails indicate high variance in performance
Use case: Comparing model performance distributions across benchmarks; statistical significance assessment.
7. Enrichment Factors (enrichment_factors.png)
What it shows: Bar chart showing Enrichment Factor (EF) values at 1% and 5% thresholds for each benchmark-model combination.
How to interpret:
- EF measures how much better than random the model is at ranking actives
- EF 1%: How enriched are actives in the top 1% of ranked compounds
- EF 5%: How enriched are actives in the top 5% of ranked compounds
- EF = 1 means random performance; EF = 10 means 10× better than random
- Higher bars indicate better early enrichment (more actives found early)
- EF 1% is more stringent; EF 5% captures broader early enrichment
Use case: Virtual screening applications where only top-ranked compounds will be tested experimentally.
8. Metric Correlation (metric_correlation.png)
What it shows: Correlation heatmaps showing how different metrics correlate with each other, with separate heatmaps per model.
How to interpret:
- Values range from -1 (perfect negative correlation) to +1 (perfect positive correlation)
- Red cells indicate positive correlation (metrics move together)
- Blue cells indicate negative correlation (metrics move oppositely)
- Diagonal is always 1.0 (perfect self-correlation)
- High correlations between metrics suggest they measure similar aspects
- Low correlations indicate the metrics capture different information
Use case: Understanding metric relationships; identifying redundant metrics; validating that chosen metrics provide diverse evaluation.
9. Metric Histograms (metric_histograms.png)
What it shows: Overlapping histograms with kernel density estimation (KDE) curves showing the frequency distribution of metric scores.
How to interpret:
- X-axis shows score values
- Y-axis shows frequency (count of targets with that score)
- Bars show histogram bins; curves show smoothed density estimates
- Non-overlapping distributions indicate clear performance differences between models
- Right-shifted distributions indicate overall better performance
- Narrow, tall peaks indicate consistent performance
- Wide, flat distributions indicate high variance
Use case: Understanding the overall score distribution; identifying whether differences are statistically meaningful.
10. Top Targets (top_targets.png)
What it shows: Horizontal bar charts showing the top N best-performing targets for each metric, with target labels including the dataset name.
How to interpret:
- Y-axis shows target names with dataset in parentheses, e.g., "OPRK1 (LIT-PCBA)"
- X-axis shows the score value
- Bars are grouped by model for each target
- Longer bars indicate higher scores
- Similar bar lengths for both models indicate comparable performance on that target
- Large differences highlight targets where one model significantly outperforms the other
Use case: Identifying which specific protein targets are easiest/hardest to predict; finding model-specific strengths on particular targets.
Comparison Visualizations
In addition to benchmark result visualizations, Litmus provides specialized visualizations for fingerprint comparison results: Example:
from lbvslitmus.comparison import plot_comparison_heatmap, plot_win_counts
# Generate comparison visualizations
comparison = results.compare_with_baseline(
baseline=["ECFP4", "MACCS", "AtomPair"],
metric="AUROC",
fingerprint_name="MyFP",
control_fingerprint="MyFP"
)
# Generate all visualizations at once
comparison.plot(output_dir="plots/comparison")
Customizing Plots
All plot functions accept the following common parameters:
| Parameter | Default | Description |
|---|---|---|
output_dir |
"plots" |
Directory to save the plot |
style |
"whitegrid" |
Seaborn style (whitegrid, darkgrid, white, dark) |
figure_dpi |
100 |
DPI for figure display |
savefig_dpi |
300 |
DPI for saved figures (higher = better quality) |
font_size |
10 |
Base font size for labels |
Some plots have additional parameters:
# Customize number of top targets
plot_top_targets(results, n_targets=20)
# Customize histogram bins
plot_metric_histograms(results, bins=20)
# Customize which metrics to show
plot_violin_grid(results, metrics=["AUROC", "AUPRC"])
Color Consistency
All plots use a consistent color palette (MODEL_PALETTE) to ensure models are represented with the same colors across all visualizations:
- Model 1: Teal (
#66c2a5) - Model 2: Orange (
#fc8d62) - Model 3: Blue (
#8da0cb) - Model 4: Pink (
#e78ac3) - Model 5: Green (
#a6d854)
This consistency makes it easy to track model performance across different chart types.
Installation
Install the latest stable release from PyPI:
pip install LBVSLitmus
Development
Setup
To install from source for development:
-
Clone the repository:
git clone https://github.com/your-org/virtual_screening_platform.git cd virtual_screening_platform
-
Install uv using official guide
-
Install dependencies:
uv sync -
Install pre-commit hooks:
uv run pre-commit install
Code Quality
This project uses Ruff for code formatting and linting. To format your code:
# Format and lint all files
uv run ruff format .
uv run ruff check . --fix
# Format and lint specific files
uv run ruff format path/to/file.py
uv run ruff check path/to/file.py --fix
The code formatting will also run automatically when you commit changes, thanks to pre-commit hooks.
Running Tests
To run the tests:
# Run all tests
uv run pytest
# Run specific test file
uv run pytest tests/test_specific.py
# Run tests with coverage report
uv run pytest --cov=lbvslitmus --cov-report=term-missing
License
MIT License
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file lbvslitmus-1.1.0.tar.gz.
File metadata
- Download URL: lbvslitmus-1.1.0.tar.gz
- Upload date:
- Size: 61.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
aba896561c80b6a0894fd8b70d8650ddd90bd729281b780dfdc0dd5b10c09c39
|
|
| MD5 |
63e709dfddcef425d8baff19876b1ed8
|
|
| BLAKE2b-256 |
3a968a7a387c6f8a1b5b0d208342507f61bc13877465767003713676cce02586
|
File details
Details for the file lbvslitmus-1.1.0-py3-none-any.whl.
File metadata
- Download URL: lbvslitmus-1.1.0-py3-none-any.whl
- Upload date:
- Size: 89.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.9.24 {"installer":{"name":"uv","version":"0.9.24","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
312afc7dcc118b21807a9cceca84b4b744cd4831282b115bc2d7c615def77f12
|
|
| MD5 |
5e5ae27d0124d49ce025cd43fe855126
|
|
| BLAKE2b-256 |
fbab7afd65d8cba0ce707fd89e5737673419cfac9b7b95f7ea93fb14a7f271bc
|