
A tool to create well-balanced data splits for multi-task learning


Globally balanced multi-task splits

A tool to create well-balanced multi-task splits without data leakage between different tasks for QSAR modelling.

This package is based on the work of Giovanni Tricarico presented in "Construction of balanced, chemically dissimilar training, validation and test sets for machine learning on molecular datasets".

Three types of split are available: random, dissimilarity-based (clustering on the Tanimoto similarity of fingerprints, using the MaxMin or LeaderPicker algorithm) and scaffold-based (clustering on Murcko scaffolds).
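
To illustrate the idea behind the dissimilarity-based option, here is a minimal pure-Python sketch of MaxMin picking on Tanimoto distances. It is not the package's implementation: fingerprints are represented as plain sets of on-bit indices, and the helper names (tanimoto, maxmin_pick, assign_to_centroids) are hypothetical and chosen for this example only.

```python
from typing import Dict, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def maxmin_pick(fps: List[Set[int]], n_pick: int, first: int = 0) -> List[int]:
    """Greedily pick indices, each maximising its minimum distance to earlier picks."""
    picked = [first]
    # Minimum Tanimoto distance from every fingerprint to the picked set so far.
    min_dist = [1.0 - tanimoto(fp, fps[first]) for fp in fps]
    while len(picked) < min(n_pick, len(fps)):
        candidates = [i for i in range(len(fps)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fps):
            min_dist[i] = min(min_dist[i], 1.0 - tanimoto(fp, fps[nxt]))
    return picked


def assign_to_centroids(fps: List[Set[int]], centroids: List[int]) -> Dict[int, List[int]]:
    """Assign each fingerprint to its most similar centroid."""
    clusters: Dict[int, List[int]] = {c: [] for c in centroids}
    for i, fp in enumerate(fps):
        best = max(centroids, key=lambda c: tanimoto(fp, fps[c]))
        clusters[best].append(i)
    return clusters


# Toy fingerprints (assumed data): two similar pairs and one outlier.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}, {20, 21}]
centroids = maxmin_pick(fps, n_pick=3)
clusters = assign_to_centroids(fps, centroids)
print(centroids, clusters)
```

The resulting dictionary maps each centroid index to the indices of the molecules assigned to it, so chemically similar molecules end up in the same initial cluster.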

Installation

python -m pip install gbmtsplits

Getting started

CLI

A split can be created from the command line with:

gbmtsplits -i <dataset.csv> -c <random/dissimilarity_maxmin/dissimilarity_leader/scaffold> 

where <dataset.csv> is a pivoted dataset in which each row corresponds to a unique molecule and each task has its own column. For more options, use -h/--help.
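
If the data is stored in long format (one row per measurement), it can be pivoted with pandas before being passed to the tool. The sketch below is only an illustration: the column names SMILES, task and value and the toy values are assumptions, not names required by gbmtsplits.

```python
import pandas as pd

# Hypothetical long-format data: one row per (molecule, task) measurement.
long_df = pd.DataFrame({
    "SMILES": ["CCO", "CCO", "c1ccccc1", "CCN"],
    "task": ["task_A", "task_B", "task_A", "task_B"],
    "value": [5.1, 6.3, 7.0, 4.2],
})

# One row per unique molecule, one column per task.
# Molecules without a measurement for a task get NaN in that column.
pivoted = (
    long_df.pivot_table(index="SMILES", columns="task", values="value", aggfunc="mean")
    .reset_index()
)
pivoted.columns.name = None  # drop the leftover "task" label on the column axis

print(pivoted)
# To save it for the command line tool: pivoted.to_csv("dataset.csv", index=False)
```

Duplicate measurements of the same molecule and task are averaged here; another aggregation may be more appropriate depending on the data.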

API

Splits can also be created and visualised through the Python API, which exposes more options for the linear programming step that merges the initial clusters.

import pandas as pd
from gbmtsplits.split import GloballyBalancedSplit
from gbmtsplits.clustering import RandomClustering, MaxMinClustering, LeaderPickerClustering, MurckoScaffoldClustering

# Load or create a pivoted dataset (each row corresponds to a unique molecule and each task has its own column)
data = pd.read_csv('dataset.csv')

# Set up the splitter with an initial clustering method
clustering_method = MaxMinClustering() # Dissimilarity-based clustering using the MaxMin algorithm to pick cluster centroids
splitter = GloballyBalancedSplit(clustering_method=clustering_method)

# or use a dictionary of precalculated clusters, with cluster indices as keys and lists of the indices of the molecules in each cluster as values
clusters = {0 : [1,4,7,...], 1 : [2,3,8,...], ...}
splitter = GloballyBalancedSplit(clusters=clusters)

# Split the data
data = splitter(data=data)
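
To give an intuition for the cluster-merging step, the sketch below assigns whole clusters to subsets so that every task stays close to the target fractions. It is a toy stand-in that uses exhaustive search instead of linear programming, and it is not the package's solver: the function merge_clusters, its arguments and the example counts are all hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def merge_clusters(
    cluster_sizes: Dict[int, List[int]], fractions: List[float]
) -> Tuple[Dict[int, int], float]:
    """Assign each cluster to one subset, minimising the per-task deviation from the targets.

    cluster_sizes maps a cluster id to its number of data points per task.
    fractions holds the desired fraction of data in each subset.
    """
    ids = list(cluster_sizes)
    n_tasks = len(next(iter(cluster_sizes.values())))
    totals = [sum(cluster_sizes[c][t] for c in ids) for t in range(n_tasks)]

    best, best_cost = {}, float("inf")
    # Try every way of mapping clusters to subsets (only feasible for tiny inputs).
    for assignment in product(range(len(fractions)), repeat=len(ids)):
        cost = 0.0
        for s, frac in enumerate(fractions):
            for t in range(n_tasks):
                got = sum(cluster_sizes[c][t] for c, a in zip(ids, assignment) if a == s)
                cost += abs(got - frac * totals[t])
        if cost < best_cost:
            best, best_cost = dict(zip(ids, assignment)), cost
    return best, best_cost


# Four clusters, two tasks; aim for an 80/20 split on both tasks.
sizes = {0: [4, 2], 1: [4, 2], 2: [1, 1], 3: [1, 0]}
assignment, cost = merge_clusters(sizes, fractions=[0.8, 0.2])
print(assignment, cost)
```

Because whole clusters are moved together, molecules from one cluster never end up in different subsets, which is what prevents leakage between tasks. The exhaustive search grows exponentially with the number of clusters, which is why an optimisation formulation such as linear programming is needed for real datasets.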

The chemical (dis)similarity of the subsets and the balance of the subsets per task can be visualized either for a single dataset/split:

from gbmtsplits.plot import PlottingSingleDataset
plotter = PlottingSingleDataset(data_rgbs)
plotter.plot_all()

or to compare multiple datasets/splits:

from gbmtsplits.plot import PlottingCompareDatasets

data_rgbs['Dataset'] = 'RGBS'
data_dgbs['Dataset'] = 'DGBS'
data_both = pd.concat([data_rgbs, data_dgbs], ignore_index=True)

plotter = PlottingCompareDatasets(data_both, compare_col='Dataset')
plotter.plot_all()
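
The per-task balance can also be checked numerically with plain pandas. The sketch below is independent of the plotting classes and assumes that the split assignment is stored in a column named Split; that column name, the helper task_balance and the toy data are assumptions, so adapt them to the actual output of the splitter.

```python
import pandas as pd


def task_balance(df: pd.DataFrame, task_cols: list, split_col: str = "Split") -> pd.DataFrame:
    """Fraction of each task's non-missing data points that falls in each subset."""
    counts = df.groupby(split_col)[task_cols].count()  # count() ignores NaN
    return counts / counts.sum()  # normalise each task column to sum to 1


# Toy split dataset (assumed layout): one row per molecule, one column per task.
df = pd.DataFrame({
    "Split": ["train", "train", "train", "test"],
    "task_A": [1.0, 2.0, None, 3.0],
    "task_B": [None, 1.0, 2.0, 3.0],
})

balance = task_balance(df, ["task_A", "task_B"])
print(balance)
```

In a well-balanced split, every task column shows fractions close to the requested subset sizes.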

