Set of functionalities to assess molecular property prediction models.

Maturity level-0

Mosses - Model Assessment Toolkit

Description

Mosses is a library for assessing molecular property prediction models, such as QSAR/QSPR models. The library currently includes:

  • Predictive Validity Module (predictive_validity.py) - Builds on the concept of predictive validity described by Scannell et al., Nat Rev Drug Discov. 2022;21(12):915-931, doi:10.1038/s41573-022-00552-x. The function predictive_validity.evaluate_pv() analyses the quality of predictions on a given data set (e.g., a prospective test set of compounds) against a desired threshold.

  • Heatmap Module (heatmap.py) - Summarises the predictive validity results in a single table. For each series in the data and for the selected experimental threshold (SET), the heatmap shows the PPV and FOR percentages, the recommended thresholds with the resulting optimised PPV and FOR percentages, and a qualitative label indicating whether the model is Good, Medium, or Bad.

  • Multi-Parameter Optimization (MPO) Module (mpo.py) - Provides a comprehensive toolkit for computing and optimizing MPO scores. MPO combines multiple molecular properties into a single score using sigmoid-based desirability functions.
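
The PPV and FOR percentages reported by the predictive validity and heatmap modules can be illustrated with a small sketch. It assumes the usual definitions (positive predictive value and false omission rate) and is not the implementation used in mosses; the function name ppv_and_for and its parameters are hypothetical.

```python
import numpy as np

def ppv_and_for(y_true, y_pred, experimental_threshold, prediction_threshold,
                higher_is_better=True):
    """Return (PPV %, FOR %) for a pair of thresholds. Hypothetical helper."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if higher_is_better:
        actual_pos = y_true >= experimental_threshold
        predicted_pos = y_pred >= prediction_threshold
    else:
        actual_pos = y_true <= experimental_threshold
        predicted_pos = y_pred <= prediction_threshold

    tp = np.sum(predicted_pos & actual_pos)    # predicted good, measured good
    fp = np.sum(predicted_pos & ~actual_pos)   # predicted good, measured bad
    fn = np.sum(~predicted_pos & actual_pos)   # predicted bad, measured good
    tn = np.sum(~predicted_pos & ~actual_pos)  # predicted bad, measured bad

    # PPV: share of predicted positives that are truly positive
    ppv = 100.0 * tp / (tp + fp) if (tp + fp) else float("nan")
    # FOR: share of predicted negatives that are actually positive
    false_omission = 100.0 * fn / (fn + tn) if (fn + tn) else float("nan")
    return ppv, false_omission

# Made-up observed and predicted values for six compounds
observed = [5.0, 7.2, 6.1, 4.3, 8.0, 5.5]
predicted = [5.4, 6.8, 5.2, 4.9, 7.5, 6.3]
ppv, false_omission = ppv_and_for(observed, predicted, 6.0, 6.0)
print(f"PPV: {ppv:.1f}%  FOR: {false_omission:.1f}%")
```

Moving the prediction threshold away from the experimental threshold trades PPV against FOR, which is presumably what the recommended thresholds in the heatmap aim to balance.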

Software Requirements

The library is written in Python and requires Python 3.10 or later. Its dependencies are defined in pyproject.toml and are installed automatically with the library.

How to Install mosses

You can install the library with pip install mosses, or clone this repository and then run make build && make install.

Examples of Usage

Jupyter notebooks with examples can be found in the examples folder. We recommend following them when adapting your data, configs, and code to work with the modules in mosses.


Multi-Parameter Optimization (MPO) Module

The mosses.mpo module provides a high-level API for Multi-Parameter Optimization (MPO) analysis of compound data. MPO is commonly used in drug discovery to combine multiple ADMET properties into a single desirability score.

Key Features

  • Sigmoid-based scoring functions for transforming raw values to 0-1 scores
  • Multiple optimization algorithms for weight optimization
  • ML-based weight estimation using Random Forest, Ridge, and logistic regression models
  • Feature importance analysis via mutual information
  • Enrichment and correlation statistics
  • Visualization tools for analysis and comparison

Quick Start

from mosses import mpo
import pandas as pd

# Load your compound data
df = pd.read_csv("compounds.csv")

# Define parameter configurations
config = {
    "LogD": {
        "preference": "middle",      # Optimal range preferred
        "threshold": (0.0, 3.0),     # Values in this range score highest
        "weight": 1.0,
    },
    "Solubility": {
        "preference": "maximize",    # Higher is better
        "threshold": 50.0,           # Values > 50 score high
        "weight": 1.5,
    },
    "Clearance": {
        "preference": "minimize",    # Lower is better
        "threshold": 50.0,           # Values < 50 score high
        "weight": 1.0,
    },
}

# Compute MPO scores
result = mpo.compute_scores(df, config, return_intermediate=True)
print(result[["Compound Name", "MPO_Score"]].head())
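
The configuration above gives each parameter a weight, but this README does not state how the per-parameter scores are combined. A common choice, assumed in the sketch below, is a weighted arithmetic mean; the function name weighted_mpo and the example scores are hypothetical and only illustrate the idea.

```python
import numpy as np

def weighted_mpo(scores, weights):
    """Weighted mean of per-parameter scores (assumed aggregation rule)."""
    names = list(weights)
    w = np.array([weights[name] for name in names], dtype=float)
    # One row per compound, one column per parameter
    s = np.column_stack([np.asarray(scores[name], dtype=float) for name in names])
    return s @ w / w.sum()

# Made-up 0-1 scores for two compounds
scores = {
    "LogD": [1.0, 0.5],
    "Solubility": [0.8, 0.2],
    "Clearance": [0.6, 0.4],
}
weights = {"LogD": 1.0, "Solubility": 1.5, "Clearance": 1.0}

mpo_score = weighted_mpo(scores, weights)
print(mpo_score)
```

Dividing by the sum of the weights keeps the combined score in the 0-1 range whenever the individual scores are.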

Preference Types

The module supports three optimization preferences that determine how raw values are transformed into scores:

| Preference | Description | Threshold | Scoring Function |
| --- | --- | --- | --- |
| maximize | Higher values are better | Single value (inflection point) | sigmoid() |
| minimize | Lower values are better | Single value (inflection point) | reverse_sigmoid() |
| middle | Optimal range preferred | Tuple (lower, upper) | double_sigmoid() |
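
To make the three shapes concrete, here is a minimal NumPy sketch of logistic-style scoring functions. The exact formulas used by mosses are not given in this README, so the forms below are assumptions, and the names carry a _sketch suffix to keep them distinct from the library functions.

```python
import numpy as np

def sigmoid_sketch(x, threshold, steepness=1.0):
    """Rises from 0 to 1, crossing 0.5 at the threshold."""
    x = np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-steepness * (x - threshold)))

def reverse_sigmoid_sketch(x, threshold, steepness=1.0):
    """Falls from 1 to 0, crossing 0.5 at the threshold."""
    return 1.0 - sigmoid_sketch(x, threshold, steepness)

def double_sigmoid_sketch(x, lower_threshold, upper_threshold, steepness=1.0):
    """High inside [lower, upper], low outside (product of the two curves)."""
    rising = sigmoid_sketch(x, lower_threshold, steepness)
    falling = reverse_sigmoid_sketch(x, upper_threshold, steepness)
    return rising * falling

at_threshold = float(sigmoid_sketch(2.0, threshold=2.0, steepness=2.0))
inside_range = float(double_sigmoid_sketch(2.5, 1.0, 4.0, steepness=3.0))
outside_range = float(double_sigmoid_sketch(-5.0, 1.0, 4.0, steepness=3.0))
print(at_threshold, inside_range, outside_range)
```

In this form the threshold is the inflection point where the score equals 0.5, and a larger steepness makes the transition sharper.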

Visualizing Scoring Functions

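from mosses import mpo
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-2, 6, 200)

# Maximize: values above threshold score high
maximize_scores = mpo.sigmoid(x, threshold=2.0, steepness=2.0)

# Minimize: values below threshold score high
minimize_scores = mpo.reverse_sigmoid(x, threshold=3.0, steepness=2.0)

# Middle: values in range score high
middle_scores = mpo.double_sigmoid(x, lower_threshold=1.0, upper_threshold=4.0, steepness=3.0)

# Plot the three curves on shared axes
plt.plot(x, maximize_scores, label="maximize")
plt.plot(x, minimize_scores, label="minimize")
plt.plot(x, middle_scores, label="middle")
plt.xlabel("Raw value")
plt.ylabel("Score")
plt.legend()
plt.show()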

Weight Optimization

Optimize weights to match experimental target data using various algorithms:

# First compute individual parameter scores
result = mpo.compute_scores(df, config, return_intermediate=True)
df_with_scores = df.merge(result, on="Compound Name")

# Define score columns
score_columns = ["LogD_score", "Solubility_score", "Clearance_score"]

# Optimize weights against experimental activity
optimized_weights, opt_result = mpo.optimize_mpo_weights(
    df_with_scores,
    score_columns,
    target_column="Activity",
    method="differential_evolution",  # or "least_squares", "minimize", "powell"
    verbose=True,
)

print("Optimized weights:", optimized_weights)

Available Optimization Methods

| Method | Description | Use Case |
| --- | --- | --- |
| least_squares | Linear least squares | Fast, good baseline |
| minimize | Scipy minimize (L-BFGS-B) | General purpose |
| differential_evolution | Global evolutionary algorithm | Robust, handles local minima |
| dual_annealing | Simulated annealing variant | Complex landscapes |
| powell | Powell's method | Derivative-free |
| pygad | Genetic algorithm (optional) | Highly customizable |
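
As an illustration of the simplest entry in the table, the sketch below fits weights by linear least squares with NumPy on synthetic data. It is a stand-in for the idea, not the code behind method="least_squares"; clipping to non-negative values and normalising to a sum of one are assumptions made here for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-parameter scores for 50 compounds and 3 parameters
scores = rng.uniform(0.0, 1.0, size=(50, 3))

# Synthetic target built from known weights, so the fit can be checked
true_weights = np.array([0.2, 0.5, 0.3])
target = scores @ true_weights

# Solve min ||scores @ w - target||^2 for w
fitted, *_ = np.linalg.lstsq(scores, target, rcond=None)

# Assumed post-processing: keep weights non-negative and normalised
fitted = np.clip(fitted, 0.0, None)
fitted = fitted / fitted.sum()

print("Fitted weights:", np.round(fitted, 3))
```

Least squares has a closed-form solution, which is why it makes a fast baseline; the global methods in the table are more relevant when the objective is not a plain squared error.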

ML-Based Weight Estimation

Use machine learning models to estimate feature importances, which can then serve as weights:

# Random Forest regression
ml_result = mpo.rf_regression(
    df_with_scores,
    score_columns,
    reference_col="Activity",
)

print("Feature importance:", ml_result.weights)
print(f"R² Score: {ml_result.metrics['test_r2']:.3f}")

# Other estimators available:
# mpo.rf_classifier() - Random Forest classification
# mpo.logistic_classifier() - Logistic regression
# mpo.ridge_classifier() - Ridge classifier
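
The idea of turning a fitted model into weights can be sketched without any ML library. The example below swaps in closed-form ridge regression (not Random Forest, and not the mosses estimators) and normalises the absolute coefficients; the function name ridge_weights, the alpha value, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def ridge_weights(score_matrix, target, alpha=1.0):
    """Normalised absolute ridge coefficients, used as illustrative weights."""
    X = np.asarray(score_matrix, dtype=float)
    y = np.asarray(target, dtype=float)
    # Centre the data so that no intercept term is needed
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    n_features = X.shape[1]
    # Closed-form ridge solution: (X'X + alpha * I)^-1 X'y
    coef = np.linalg.solve(
        X_centered.T @ X_centered + alpha * np.eye(n_features),
        X_centered.T @ y_centered,
    )
    importance = np.abs(coef)
    return importance / importance.sum()

rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
# The first score matters most, the third not at all
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.normal(size=200)

weights = ridge_weights(X, y)
print("Estimated weights:", np.round(weights, 3))
```

Tree-based importances, as in mpo.rf_regression(), can also capture non-linear effects, which a linear model like this one cannot.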

Evaluation Metrics

Evaluate MPO performance against experimental data:

stats = mpo.evaluate_mpo(
    df_with_scores,
    mpo_column="MPO_Score",
    reference_column="Activity",
    top_percent=10.0,  # Analyze top 10% of compounds
)

# Access metrics
print(f"Enrichment: {stats.enrichment:.2f}")
print(f"Spearman correlation: {stats.spearman_correlation:.3f}")
print(f"F1 score: {stats.f1_score:.3f}")
print(f"RMSE: {stats.rmse:.3f}")
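
For readers unfamiliar with these metrics, the sketch below computes an enrichment factor and a Spearman correlation by hand. The enrichment definition used here (hit rate among the top N% by MPO score, divided by the overall rate of top-N% reference compounds) is an assumption and may differ from what evaluate_mpo() reports; the helper names are hypothetical, and the Spearman helper ignores ties.

```python
import numpy as np

def top_fraction_mask(values, top_percent):
    """Boolean mask marking the top N% of values (at least one item)."""
    values = np.asarray(values, dtype=float)
    n_top = max(1, int(round(len(values) * top_percent / 100.0)))
    order = np.argsort(-values, kind="stable")
    mask = np.zeros(len(values), dtype=bool)
    mask[order[:n_top]] = True
    return mask

def enrichment_sketch(mpo_scores, reference, top_percent):
    """Hit rate in the MPO-selected set relative to the base rate."""
    selected = top_fraction_mask(mpo_scores, top_percent)
    active = top_fraction_mask(reference, top_percent)
    hit_rate = (selected & active).sum() / selected.sum()
    base_rate = active.sum() / len(active)
    return hit_rate / base_rate

def spearman_sketch(a, b):
    """Pearson correlation of ranks (no tie correction)."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

reference = np.arange(20, dtype=float)  # made-up experimental values
perfect = reference.copy()              # ranks compounds perfectly
inverted = reference[::-1].copy()       # ranks compounds in reverse

ef_perfect = enrichment_sketch(perfect, reference, top_percent=10.0)
ef_inverted = enrichment_sketch(inverted, reference, top_percent=10.0)
rho_perfect = spearman_sketch(perfect, reference)
rho_inverted = spearman_sketch(inverted, reference)
print(ef_perfect, ef_inverted, rho_perfect, rho_inverted)
```

Under this definition, an enrichment of 1 corresponds to random selection, and the maximum at a 10% cutoff is 10.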

Feature Importance Analysis

Analyze which parameters contribute most to the target:

importance_result = mpo.analyze_feature_importance(
    df_with_scores,
    score_columns,
    reference_col="Activity",
)

# Visualize
mpo.plot_mutual_info(importance_result, title="Feature Importance")
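
Mutual information measures how much knowing one variable reduces uncertainty about another. The sketch below uses a simple histogram estimate on synthetic data to show the idea; it is a swapped-in technique for illustration and not necessarily the estimator used by analyze_feature_importance(). The function name binned_mutual_info and the bin count are assumptions.

```python
import numpy as np

def binned_mutual_info(x, y, bins=8):
    """Histogram estimate of mutual information, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    independent = p_x @ p_y                # joint distribution if independent
    nonzero = p_xy > 0
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / independent[nonzero])))

rng = np.random.default_rng(1)
informative = rng.uniform(0.0, 1.0, 2000)  # drives the target
noise = rng.uniform(0.0, 1.0, 2000)        # unrelated to the target
target = informative + 0.05 * rng.normal(size=2000)

mi_informative = binned_mutual_info(informative, target)
mi_noise = binned_mutual_info(noise, target)
print(f"informative: {mi_informative:.3f}  noise: {mi_noise:.3f}")
```

Unlike a correlation coefficient, mutual information also picks up non-monotonic relationships, which matters for parameters with a "middle" preference.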

Complete Pipeline

For end-to-end analysis with automatic threshold detection and optional weight optimization:

result = mpo.build_mpo_pipeline(
    df,
    experimental_columns=["LogD", "Solubility", "Clearance", "Permeability"],
    target_column="Activity",
    preferences={
        "LogD": "middle",
        "Solubility": "maximize",
        "Clearance": "minimize",
        "Permeability": "maximize",
    },
    auto_threshold=True,  # Calculate thresholds from data
    optimize_weights_method="least_squares",  # Optional: optimize weights
)

# Result contains individual scores and final MPO_Score
print(result[["Compound Name", "MPO_Score"]].head())
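
The rule behind auto_threshold=True is not described in this README. Purely as an illustration of what deriving thresholds from data can look like, the sketch below uses the median for maximize and minimize and the interquartile range for middle; this heuristic and the function name guess_threshold are assumptions, not the behaviour of mosses.

```python
import numpy as np

def guess_threshold(values, preference):
    """Hypothetical data-driven threshold for a given preference."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # ignore missing measurements
    if preference == "middle":
        # Assumed rule: the central 50% of the data is the preferred range
        lower, upper = np.percentile(values, [25, 75])
        return (float(lower), float(upper))
    if preference in ("maximize", "minimize"):
        # Assumed rule: the median is the inflection point
        return float(np.median(values))
    raise ValueError(f"Unknown preference: {preference!r}")

measurements = np.arange(101, dtype=float)  # made-up values 0..100
single = guess_threshold(measurements, "maximize")
window = guess_threshold(measurements, "middle")
print(single, window)
```

Thresholds chosen from domain knowledge, as in the Quick Start configuration, are usually preferable when they are available.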

Visualization Functions

The module provides several plotting utilities:

# Score distribution histogram
mpo.plot_mpo_histogram(result["MPO_Score"], title="MPO Score Distribution")

# Scatter plot with regression line
mpo.plot_best_fit_scatter(
    result["Activity"],
    result["MPO_Score"],
    label="MPO vs Activity"
)

# Correlation matrix
mpo.plot_parameter_correlation_matrix(
    df,
    columns=["LogD", "Solubility", "Clearance"],
    title="Parameter Correlations",
)

# Compare multiple methods
# (assumes an "Optimized_MPO" column has already been added to df_with_scores)
mpo.plot_comparison(
    df_with_scores,
    method_columns=["MPO_Score", "Optimized_MPO"],
    reference_column="Activity"
)

API Reference

Main Functions

| Function | Description |
| --- | --- |
| compute_scores(df, config) | Compute MPO scores from parameter configuration |
| optimize_mpo_weights(df, score_cols, target) | Optimize weights against target column |
| evaluate_mpo(df, mpo_col, ref_col) | Compute enrichment and correlation statistics |
| build_mpo_pipeline(df, columns, ...) | End-to-end MPO workflow |

Scoring Functions

| Function | Description |
| --- | --- |
| sigmoid(x, threshold, steepness) | Standard sigmoid for maximization |
| reverse_sigmoid(x, threshold, steepness) | Reversed sigmoid for minimization |
| double_sigmoid(x, lower, upper, steepness) | Double sigmoid for middle preference |

Statistics Functions

| Function | Description |
| --- | --- |
| calculate_enrichment(percent_top, df, ref_col, method_col) | Enrichment factor calculation |
| calculate_spearman_correlation(df, col1, col2) | Spearman rank correlation |
| find_top_n_percent_ids(percent_top, df, score_col) | Get IDs of top N% compounds |
| collect_stats(...) | Comprehensive statistics collection |

Plotting Functions

| Function | Description |
| --- | --- |
| plot_mpo_histogram(scores) | Distribution of MPO scores |
| plot_best_fit_scatter(x, y) | Scatter plot with regression |
| plot_parameter_correlation_matrix(df, columns) | Correlation heatmap |
| plot_experimental_correlation_matrix(df, cols) | Experimental parameter correlations |
| plot_predicted_correlation_matrix(df, cols) | Predicted parameter correlations |
| plot_mutual_info(importance) | Feature importance bar chart |
| plot_comparison(df, methods, ref) | Side-by-side method comparison |
| plot_scoring_curves(config) | Visualize sigmoid functions |

Example Notebook

See examples/mpo_example.ipynb for a complete walkthrough including:

  1. Loading and exploring compound data
  2. Configuring parameters with different preferences
  3. Computing and visualizing MPO scores
  4. Optimizing weights against experimental data
  5. Evaluating MPO performance
  6. Using ML-based weight estimation
  7. Feature importance analysis
  8. Building complete pipelines

License

See LICENSE.md for details.
