A Chemically Biased Parametric Data Splitting Method

Analogue Split

Overview

The Analogue Split method is designed to analyze and improve the robustness of machine learning models by considering activity cliffs in molecular datasets. Activity cliffs are pairs of similar molecules with significantly different biological activities, which can challenge the performance of predictive models.

This package provides tools to:

  1. Ensure a specified fraction of the test set molecules are involved in activity cliffs.
  2. Analyze model performance as a function of the proportion of activity cliffs in the test set.
  3. Visualize these analyses through gamma plots.

Installation

You can install the package from PyPI using:

pip install analoguesplit

Usage

Parameters


  • gamma: Fraction of the test set made up of molecules involved in activity cliffs.
  • omega: Similarity threshold to create edges between molecules.
  • test_size: Fraction of the dataset to be used as the test set.
  • X: Feature vector (molecular fingerprints).
  • y: Label vector (biological activities).

API


func set_random_seed

Sets a random seed for reproducibility.

def set_random_seed(seed: int) -> None:

func calculate_fp

Calculates molecular fingerprints for a list of molecules.

def calculate_fp(mols: list[Chem.rdchem.Mol], fp: str = "ecfp4") -> np.ndarray:

func convert_smiles_to_mol

Converts a list of SMILES strings to RDKit molecule objects.

def convert_smiles_to_mol(smis: list[str]) -> list[Chem.rdchem.Mol]:

func calculate_simmat

Calculates the similarity matrix for molecular fingerprints using a specified similarity function.

def calculate_simmat(fps: np.ndarray, similarity_function) -> np.ndarray:
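To make the shape of the result concrete, here is a minimal NumPy sketch of a pairwise similarity matrix built from a user-supplied similarity function. It is not the package's implementation: `pairwise_similarity` and `overlap_fraction` are hypothetical names, and the choice to fix the diagonal at 1 is an assumption.

```python
import numpy as np


def pairwise_similarity(fps, similarity_function):
    """Return an (n, n) symmetric matrix of pairwise similarities (sketch)."""
    n = len(fps)
    # Assumption: every fingerprint is maximally similar to itself.
    simmat = np.eye(n, dtype=float)
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity_function(fps[i], fps[j])
            simmat[i, j] = simmat[j, i] = s  # fill both triangles
    return simmat


def overlap_fraction(a, b):
    # Hypothetical stand-in similarity: fraction of positions where bits agree.
    return float(np.mean(a == b))


fps = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]])
simmat = pairwise_similarity(fps, overlap_fraction)
print(simmat)
```

Any function that maps two fingerprint vectors to a float can be passed in place of the stand-in.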

func tanimoto_similarity

Calculates the Tanimoto similarity coefficient between two binary vectors.

def tanimoto_similarity(fp1: np.ndarray, fp2: np.ndarray) -> float:
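For binary vectors, the Tanimoto coefficient is the number of bits set in both vectors divided by the number of bits set in either. The sketch below computes it with NumPy; the function name `tanimoto` and the choice to return 0.0 for two all-zero vectors are assumptions, and the package may handle that edge case differently.

```python
import numpy as np


def tanimoto(fp1, fp2):
    """Tanimoto coefficient of two binary vectors: |A and B| / |A or B|."""
    a = np.asarray(fp1, dtype=bool)
    b = np.asarray(fp2, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumption: two empty fingerprints are treated as dissimilar.
        return 0.0
    return float(np.logical_and(a, b).sum() / union)


# Two shared bits out of three bits set overall.
score = tanimoto([1, 1, 0, 1], [1, 0, 0, 1])
print(score)
```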

func find_activity_cliffs

Identifies activity cliffs in the dataset.

def find_activity_cliffs(fps: np.ndarray, labels: np.ndarray, threshold: float) -> list[tuple[int, int]]:
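One plausible reading of this step, sketched below, is that a pair counts as an activity cliff when its similarity reaches the threshold and the two class labels differ. This is an illustration only: the hypothetical `find_cliff_pairs` takes a precomputed similarity matrix rather than fingerprints, and the use of `>=` rather than `>` is an assumption.

```python
import numpy as np


def find_cliff_pairs(simmat, labels, threshold):
    """Return index pairs (i, j), i < j, that are similar but labelled differently."""
    labels = np.asarray(labels)
    similar = simmat >= threshold                    # assumption: inclusive threshold
    different = labels[:, None] != labels[None, :]   # pairwise label mismatch
    # Keep the strict upper triangle so each pair is reported once.
    i_idx, j_idx = np.where(np.triu(similar & different, k=1))
    return [(int(i), int(j)) for i, j in zip(i_idx, j_idx)]


simmat = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.8],
    [0.2, 0.8, 1.0],
])
labels = np.array([1, 0, 0])
pairs = find_cliff_pairs(simmat, labels, threshold=0.7)
print(pairs)
```

In this toy input, molecules 1 and 2 are similar but share a label, and molecules 0 and 2 differ in label but are not similar, so only the pair (0, 1) qualifies.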

func analogue_split

Splits the dataset into training and test sets, ensuring a specified fraction of the test set molecules are activity cliffs.

def analogue_split(fps: np.ndarray, labels: np.ndarray, test_size: float, gamma: float, omega: float) -> tuple[np.ndarray, np.ndarray]:
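The role of gamma and test_size can be illustrated with a simplified sketch that draws the requested share of the test set from cliff molecules and the remainder from the rest. This is not the package's algorithm: it ignores omega and the similarity graph, and `sketch_gamma_split`, its `is_cliff` mask, and its `seed` argument are hypothetical. Returning index arrays is also an assumption about what the two output arrays hold.

```python
import numpy as np


def sketch_gamma_split(is_cliff, test_size, gamma, seed=0):
    """Pick test indices so that a gamma fraction of them are cliff molecules."""
    rng = np.random.default_rng(seed)
    is_cliff = np.asarray(is_cliff, dtype=bool)
    n = is_cliff.size
    n_test = int(round(test_size * n))
    n_cliff = int(round(gamma * n_test))
    cliff_idx = np.flatnonzero(is_cliff)
    other_idx = np.flatnonzero(~is_cliff)
    if n_cliff > cliff_idx.size or n_test - n_cliff > other_idx.size:
        raise ValueError("not enough molecules to satisfy gamma and test_size")
    test_idx = np.concatenate([
        rng.choice(cliff_idx, size=n_cliff, replace=False),
        rng.choice(other_idx, size=n_test - n_cliff, replace=False),
    ])
    train_idx = np.setdiff1d(np.arange(n), test_idx)  # everything not in the test set
    return train_idx, np.sort(test_idx)


# 20 molecules, 6 of which are flagged as cliff molecules.
is_cliff = np.array([True] * 6 + [False] * 14)
train_idx, test_idx = sketch_gamma_split(is_cliff, test_size=0.25, gamma=0.4)
print(test_idx, is_cliff[test_idx].mean())
```

The explicit error makes the constraint visible: a high gamma cannot be met when the dataset contains too few cliff molecules for the requested test size.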

func train_and_evaluate_models

Trains and evaluates models using the analogue split and returns evaluation results.

def train_and_evaluate_models(gammas: list[float], fps: np.ndarray, labels: np.ndarray, models: dict, test_size: float, omega: float) -> dict:

func plot_evaluation_results

Plots evaluation results for different gamma values.

def plot_evaluation_results(results: dict, gammas: list[float], title: str) -> None:

How to use analoguesplit

  1. Identify Activity Cliff Molecules: Determine which molecules are part of activity cliffs based on their similarity and class labels.
  2. Generate Test Sets: For each gamma value, create test sets with the desired proportion of activity cliff molecules.
  3. Evaluate Model Performance: Train models on the training set and evaluate them on the test sets, calculating metrics such as accuracy, precision, recall, and F1 score.
  4. Create Gamma Plot: Visualize the model performance metrics against gamma values to understand the impact of activity cliffs on model robustness.
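Step 3 names four metrics. As a reference for what each one measures, the sketch below computes them for binary labels with plain NumPy. The helper `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention. In practice scikit-learn's metric functions would normally be used instead.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))    # true positives
    fp = int(np.sum(~y_true & y_pred))   # false positives
    fn = int(np.sum(y_true & ~y_pred))   # false negatives
    tn = int(np.sum(~y_true & ~y_pred))  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / y_true.size,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

Computing these metrics once per gamma value gives the series that a gamma plot draws against gamma.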

Example

Please see the Notebook to learn how to use analoguesplit.

License

This project is licensed under the MIT License.

Acknowledgments

This package relies on several excellent Python libraries including RDKit, scikit-learn, NumPy, and Matplotlib.
