A tool to create well-balanced data splits for multi-task learning
Project description
Globally balanced multi-task splits
A tool to create well-balanced multi-task splits without data leakage between different tasks for QSAR modelling.
This package is based on the work of Giovanni Tricarico presented in Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets.
Three split types are available: random, dissimilarity-based (clustering on Tanimoto similarity of fingerprints with MaxMin or LeaderPicker) and scaffold-based (clustering on Murcko scaffolds).
Installation
python -m pip install gbmtsplits
Getting started
CLI
The split can be easily created from the command line with
gbmtsplits -i <dataset.csv> -c <random/dissimilarity_maxmin/dissimilarity_leader/scaffold>
with <dataset.csv> a pivoted dataset where each row corresponds to a unique molecule and each task has its own column. For more options use -h/--help.
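If your data is in long format (one row per molecule/task measurement), it can be pivoted into the required wide format with pandas before splitting. A minimal sketch; the column names `SMILES`, `task` and `value` are illustrative, not required by gbmtsplits, only the pivoted shape matters:

```python
import pandas as pd

# Hypothetical long-format data: one row per (molecule, task) measurement.
long_df = pd.DataFrame({
    'SMILES': ['CCO', 'CCO', 'c1ccccc1', 'CCN'],
    'task':   ['T1',  'T2',  'T1',       'T2'],
    'value':  [5.2,   6.1,   4.8,        7.3],
})

# Pivot so each row is a unique molecule and each task has its own column;
# molecules without a measurement for a task get NaN.
pivoted = long_df.pivot_table(index='SMILES', columns='task', values='value').reset_index()

# Save in the shape expected by the CLI:
# pivoted.to_csv('dataset.csv', index=False)
```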
API
The splits can also be created and visualised with the API, which offers more options (e.g. for the linear programming used to merge the initial clusters).
import pandas as pd
from gbmtsplits.split import GloballyBalancedSplit
from gbmtsplits.clustering import RandomClustering, MaxMinClustering, LeaderPickerClustering, MurckoScaffoldClustering
# Load dataset or create pivoted dataset (each row corresponds to a unique molecule and each task has its own column)
dataset = pd.read_csv('dataset.csv')
# Set up splitter with an initial clustering method
clustering_method = MaxMinClustering() # For dissimilarity-based clustering using the MaxMin algorithm to pick cluster centroids
splitter = GloballyBalancedSplit(clustering_method=clustering_method)
# or use a dictionary of precalculated clusters, with cluster indices as keys and lists of indices of the molecules belonging to each cluster as values
clusters = {0 : [1,4,7,...], 1 : [2,3,8,...], ...}
splitter = GloballyBalancedSplit(clusters=clusters)
# Split the data
data = splitter(data=dataset)
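The precalculated-clusters dictionary can be assembled from any per-molecule cluster labels, whatever clustering produced them. A minimal sketch with a made-up label list (the labels here are hypothetical, only the resulting dict shape matters):

```python
from collections import defaultdict

# Hypothetical cluster label of each molecule, indexed by row position
labels = [0, 1, 1, 0, 2]

# Group row indices by their cluster label
clusters = defaultdict(list)
for mol_idx, label in enumerate(labels):
    clusters[label].append(mol_idx)
clusters = dict(clusters)  # {0: [0, 3], 1: [1, 2], 2: [4]}
```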
The chemical (dis)similarity of the subsets and the balance of subsets per task can be visualized, either for a single dataset/split:
from gbmtsplits.plot import PlottingSingleDataset
plotter = PlottingSingleDataset(data_rgbs)
plotter.plot_all()
or to compare multiple datasets/splits:
from gbmtsplits.plot import PlottingCompareDatasets
data_rgbs['Dataset'] = 'RGBS'
data_dgbs['Dataset'] = 'DGBS'
data_both = pd.concat([data_rgbs, data_dgbs], ignore_index=True)
plotter = PlottingCompareDatasets(data_both, compare_col='Dataset')
plotter.plot_all()
File details
Details for the file gbmtsplits-0.0.8.tar.gz
File metadata
- Download URL: gbmtsplits-0.0.8.tar.gz
- Upload date:
- Size: 73.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 7e61341cb070ce7d7ab166cbb1a41db6d5f3384657fad4626a2debfb837c83c4
MD5 | 28c51ec43a8fbed7c7de72b0ce0712c9
BLAKE2b-256 | 5c040aa57e3182f9793b6bbe257122975b45586215bcdb1ad107cfbc99abe16f
File details
Details for the file gbmtsplits-0.0.8-py3-none-any.whl
File metadata
- Download URL: gbmtsplits-0.0.8-py3-none-any.whl
- Upload date:
- Size: 45.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.0 CPython/3.11.5
File hashes
Algorithm | Hash digest
---|---
SHA256 | 779f28f3404ebaa796917c019f215e92b8a0a59a65286d21d6cd1ba3b4d3693c
MD5 | ba02b6943ed55875ca1adea85eeb048f
BLAKE2b-256 | 1b3ec0cd35eae6ae617c97ad2cc7bb80170787fe312854f6302c1996fabb3403