A Chemically Biased Parametric Data Splitting Method
Analogue Split
Overview
The Analogue Split method is designed to analyze and improve the robustness of machine learning models by considering activity cliffs in molecular datasets. Activity cliffs are pairs of similar molecules with significantly different biological activities, which can challenge the performance of predictive models.
This package provides tools to:
- Ensure a specified fraction of the test set molecules are involved in activity cliffs.
- Analyze model performance as a function of the proportion of activity cliffs in the test set.
- Visualize these analyses through gamma plots.
Installation
You can install the package from PyPI using:
pip install analoguesplit
Usage
Parameters
- gamma: Fraction of the test set made up of activity cliff molecules.
- omega: Similarity threshold to create edges between molecules.
- test_size: Fraction of the dataset to be used as the test set.
- X: Feature vector (molecular fingerprints).
- y: Label vector (biological activities).
API
set_random_seed
Sets a random seed for reproducibility.
def set_random_seed(seed: int) -> None:
calculate_fp
Calculates molecular fingerprints for a list of molecules.
def calculate_fp(mols: list[Chem.rdchem.Mol], fp: str = "ecfp4") -> np.ndarray:
convert_smiles_to_mol
Converts a list of SMILES strings to RDKit molecule objects.
def convert_smiles_to_mol(smis: list[str]) -> list[Chem.rdchem.Mol]:
calculate_simmat
Calculates the similarity matrix for molecular fingerprints using a specified similarity function.
def calculate_simmat(fps: np.ndarray, similarity_function) -> np.ndarray:
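As a rough illustration of what a pairwise similarity matrix with a pluggable similarity function looks like, here is a minimal dependency-free sketch. The names `simmat_sketch` and `matching_fraction` are hypothetical and not part of the package. The sketch uses nested lists instead of NumPy arrays and assumes the similarity function is symmetric.

```python
from typing import Callable, List, Sequence


def simmat_sketch(
    fps: Sequence[Sequence[int]],
    similarity_function: Callable[[Sequence[int], Sequence[int]], float],
) -> List[List[float]]:
    """Build an n x n similarity matrix (illustrative, not the package's code)."""
    n = len(fps)
    simmat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        simmat[i][i] = similarity_function(fps[i], fps[i])
        for j in range(i + 1, n):
            # Assumption: similarity is symmetric, so each pair is computed once.
            s = similarity_function(fps[i], fps[j])
            simmat[i][j] = s
            simmat[j][i] = s
    return simmat


def matching_fraction(a: Sequence[int], b: Sequence[int]) -> float:
    """Toy similarity: fraction of positions where the two vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


fps = [[1, 0, 1], [1, 1, 1], [0, 0, 0]]
sim = simmat_sketch(fps, matching_fraction)
for row in sim:
    print(row)
```

Any function that takes two fingerprints and returns a float could be passed in place of `matching_fraction`.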
tanimoto_similarity
Calculates the Tanimoto similarity coefficient between two binary vectors.
def tanimoto_similarity(fp1: np.ndarray, fp2: np.ndarray) -> float:
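For reference, the Tanimoto coefficient of two binary vectors is the number of bits set in both divided by the number of bits set in either. The sketch below is a plain-Python illustration of that formula, not the package's implementation. The name `tanimoto_sketch` is hypothetical, and returning 0.0 for two all-zero vectors is an assumption.

```python
from typing import Sequence


def tanimoto_sketch(fp1: Sequence[int], fp2: Sequence[int]) -> float:
    """Tanimoto coefficient of two equal-length binary vectors (illustrative)."""
    if len(fp1) != len(fp2):
        raise ValueError("fingerprints must have the same length")
    # Bits set in both vectors (intersection) and in at least one (union).
    both = sum(1 for a, b in zip(fp1, fp2) if a and b)
    either = sum(1 for a, b in zip(fp1, fp2) if a or b)
    # Assumption: two all-zero vectors are given a similarity of 0.0.
    return both / either if either else 0.0


fp_a = [1, 1, 0, 1, 0, 0]
fp_b = [1, 0, 0, 1, 1, 0]
print(tanimoto_sketch(fp_a, fp_b))  # 2 shared bits / 4 bits set in either = 0.5
```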
find_activity_cliffs
Identifies activity cliffs in the dataset.
def find_activity_cliffs(fps: np.ndarray, labels: np.ndarray, threshold: float) -> list[tuple[int, int]]:
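One plausible reading of this step, for classification data, is that a pair counts as an activity cliff when its similarity reaches the threshold and the two class labels differ. The sketch below works under those assumptions. The name `cliff_pairs_sketch` is hypothetical, it takes a precomputed similarity matrix rather than fingerprints to stay short, and the `>=` comparison is a guess at how the threshold is applied.

```python
from typing import List, Sequence, Tuple


def cliff_pairs_sketch(
    simmat: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, that are similar but differently labelled."""
    n = len(labels)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        # Assumption: "similar" means similarity >= threshold.
        if simmat[i][j] >= threshold and labels[i] != labels[j]
    ]


simmat = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.8],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
]
labels = [1, 0, 1, 0]

pairs = cliff_pairs_sketch(simmat, labels, threshold=0.7)
print(pairs)  # [(0, 1)]

# Molecules that take part in at least one cliff pair.
cliff_molecules = sorted({idx for pair in pairs for idx in pair})
print(cliff_molecules)  # [0, 1]
```

Molecules 1 and 3 are also similar (0.8), but they share a label, so they are not reported as a cliff.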
analogue_split
Splits the dataset into training and test sets, ensuring a specified fraction of the test set molecules are activity cliffs.
def analogue_split(fps: np.ndarray, labels: np.ndarray, test_size: float, gamma: float, omega: float) -> tuple[np.ndarray, np.ndarray]:
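To make the roles of `test_size` and `gamma` concrete, here is a minimal sketch of a gamma-biased split over molecule indices. It is not the package's algorithm. The name `gamma_split_sketch` and the `seed` argument are hypothetical, the cliff molecules are assumed to be known already (so `omega` does not appear), and the rounding and error handling are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def gamma_split_sketch(
    n_molecules: int,
    cliff_indices: Sequence[int],
    test_size: float,
    gamma: float,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) with a gamma fraction of cliff molecules."""
    rng = random.Random(seed)
    cliffs = sorted(set(cliff_indices))
    cliff_set = set(cliffs)
    non_cliffs = [i for i in range(n_molecules) if i not in cliff_set]

    # Assumption: sizes are rounded to the nearest integer.
    n_test = round(test_size * n_molecules)
    n_cliff_test = round(gamma * n_test)
    n_other_test = n_test - n_cliff_test
    if n_cliff_test > len(cliffs) or n_other_test > len(non_cliffs):
        raise ValueError("not enough molecules to satisfy gamma and test_size")

    # Draw the cliff and non-cliff parts of the test set separately.
    test = rng.sample(cliffs, n_cliff_test) + rng.sample(non_cliffs, n_other_test)
    test_set = set(test)
    train = [i for i in range(n_molecules) if i not in test_set]
    return train, sorted(test)


cliffs = [0, 1, 2, 3, 4, 5]
train_idx, test_idx = gamma_split_sketch(20, cliffs, test_size=0.2, gamma=0.5)
print(len(train_idx), len(test_idx))  # 16 4
```

With 20 molecules, `test_size=0.2` and `gamma=0.5`, the test set holds 4 molecules, 2 of which come from the cliff group.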
train_and_evaluate_models
Trains and evaluates models using the analogue split and returns evaluation results.
def train_and_evaluate_models(gammas: list[float], fps: np.ndarray, labels: np.ndarray, models: dict, test_size: float, omega: float) -> dict:
plot_evaluation_results
Plots evaluation results for different gamma values.
def plot_evaluation_results(results: dict, gammas: list[float], title: str) -> None:
How to use analoguesplit?
- Identify Activity Cliff Molecules: Determine which molecules are part of activity cliffs based on their similarity and class labels.
- Generate Test Sets: For each gamma value, create test sets with the desired proportion of activity cliff molecules.
- Evaluate Model Performance: Train models on the training set and evaluate them on the test sets, calculating metrics such as accuracy, precision, recall, and F1 score.
- Create Gamma Plot: Visualize the model performance metrics against gamma values to understand the impact of activity cliffs on model robustness.
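The metrics named in the evaluation step can be computed from confusion counts. The sketch below shows the standard definitions for binary labels in plain Python. The name `binary_metrics_sketch` is hypothetical, returning 0.0 when a denominator is zero is an assumption, and the package itself may compute these differently, for example through scikit-learn.

```python
from typing import Dict, Sequence


def binary_metrics_sketch(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (illustrative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics_sketch(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing these metrics once per gamma value and plotting them against gamma gives the kind of curve the gamma plot is meant to show.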
Example
Please check the Notebook to learn how to use analoguesplit.
License
This project is licensed under the MIT License.
Acknowledgments
This package relies on several excellent Python libraries including RDKit, scikit-learn, NumPy, and Matplotlib.