A modular platform for testing data valuation methods on high-throughput screening (HTS) data applications

DataValuationPlatform

This package can be used to implement a range of data valuation methods. It has the following features:

Features

Data Loading and Preprocessing: The platform includes an HTS data preprocessor (HTSDataPreprocessor) that allows easy preprocessing of PubChem datasets.
Model Integration: The platform supports various data valuation models, each offering unique approaches to data valuation.
Ready-to-use applications: Applications such as active learning, false positive detection, and importance undersampling are implemented for all data valuation models and ready to use.

The package contains the data valuation models described in our manuscript:

Data Valuation Models

CatBoost Model: An implementation of the CatBoost algorithm, known for handling categorical data efficiently.
DVRL Model: Integrates the DVRL (Data Valuation using Reinforcement Learning) approach for data valuation.
KNN Shapley Model: Applies the KNN Shapley method, a technique based on Shapley values, for assessing data influence.
TracIn Model: Applies the TracIn method, calculating sample influence by tracing gradient descent.
MVSA Model: Implements the MVSA (Most Valuable Subset Analysis) for evaluating data subsets.
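
To make the idea behind KNN Shapley more concrete, the following is a minimal, standard-library sketch of the exact recursion published by Jia et al. (2019) for an unweighted KNN classifier. It is not the implementation shipped in this package: the function name `knn_shapley`, its arguments, and the toy data are assumptions chosen for illustration only.

```python
import math


def knn_shapley(train_x, train_y, test_x, test_y, k=5):
    """Exact KNN Shapley values (Jia et al., 2019), averaged over test points.

    train_x / test_x are lists of numeric feature vectors, train_y / test_y
    are lists of class labels. Returns one value per training sample.
    """
    n = len(train_x)
    values = [0.0] * n
    for x, y in zip(test_x, test_y):
        # Rank training samples from nearest to farthest from the test point.
        order = sorted(range(n), key=lambda j: math.dist(train_x[j], x))
        match = [1.0 if train_y[j] == y else 0.0 for j in order]

        # The recursion runs from the farthest neighbour back to the nearest.
        s = [0.0] * n
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the sample at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / k * min(k, rank) / rank

        # Map the values back to the original sample order and average.
        for pos, j in enumerate(order):
            values[j] += s[pos] / len(test_x)
    return values


# Toy data (made up): 1-D "descriptors" with binary activity labels.
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [1, 0, 1, 0]
toy_values = knn_shapley(train_x, train_y, test_x=[[0.1]], test_y=[1], k=2)
```

In this toy case the values sum to the KNN utility of the full training set (the share of the two nearest neighbours that carry the test label), and the close neighbour with the wrong label receives a negative value.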

Tutorial

The following tutorial shows how to load some of the datasets included in this repository into a Jupyter notebook, calculate molecular descriptors, and use one of the data valuation methods for false positive prediction.

Dataset Loading

from DataValuationPlatform import HTSDataPreprocessor

#1. You can load the preinstalled datasets used in this publication by name
preprocessor = HTSDataPreprocessor(["GPCR_3", "GPCR_2", "GPCR"])
preprocessor.load_preprocessed_data()
preprocessor.create_descriptors(descriptor_type = "ecfp")

dataset_gpcr3 = preprocessor.get_dataset("GPCR_3")
dataset_gpcr2 = preprocessor.get_dataset("GPCR_2")
dataset_gpcr = preprocessor.get_dataset("GPCR")

#2. You can add PubChem assay combinations that pair a primary and a confirmatory assay by downloading the raw files and adding
#the dataset to the existing collection (the AIDs in this example are made up)
preprocessor = HTSDataPreprocessor([])
preprocessor.add_dataset_by_AID(codename = "MadeUpAssayName", primary_AID = "001", confirmatory_AID = "002")
preprocessor.add_dataset_by_AID(codename = "MadeUpAssayName2", primary_AID = "003", confirmatory_AID = "004")
preprocessor.preprocess_data(path_to_raw="Path/To/Raw_data/")
preprocessor.create_descriptors("ecfp")

dataset_MadeUpAssayName = preprocessor.get_dataset("MadeUpAssayName")
dataset_MadeUpAssayName2 = preprocessor.get_dataset("MadeUpAssayName2")

#3. You can add your own data directly as a custom dataset
preprocessor = HTSDataPreprocessor([])
preprocessor.create_custom_dataset(
    dataset_name = "CustomDataset",
    training_set_smiles=train_smiles,
    training_set_labels=train_labels,
    training_set_confirmatory_labels=train_confirmatory_labels) #only needed for false positive identification
preprocessor.create_descriptors("ecfp")

dataset_CustomDataset = preprocessor.get_dataset("CustomDataset")
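
The custom dataset example above uses train_smiles, train_labels, and train_confirmatory_labels without showing where they come from. One possible way to build them from a CSV export is sketched below. The column names, the inline toy data, and the choice of plain Python lists of integers are assumptions; the container types the package actually expects may differ (for example NumPy arrays).

```python
import csv
import io

# Hypothetical export: one compound per row with its primary-screen label and
# confirmatory-screen label (1 = active, 0 = inactive). Values are made up.
raw_csv = """smiles,primary_active,confirmed_active
CCO,0,0
c1ccccc1O,1,1
CC(=O)Oc1ccccc1C(=O)O,1,0
CCN(CC)CC,0,0
"""

train_smiles, train_labels, train_confirmatory_labels = [], [], []
for row in csv.DictReader(io.StringIO(raw_csv)):
    train_smiles.append(row["smiles"])
    train_labels.append(int(row["primary_active"]))
    train_confirmatory_labels.append(int(row["confirmed_active"]))

# Assumed reading of the labels: a primary active that fails confirmation
# is treated as a false positive.
false_positive_smiles = [
    smi
    for smi, primary, confirmed in zip(
        train_smiles, train_labels, train_confirmatory_labels
    )
    if primary == 1 and confirmed == 0
]
```

For a real file, replacing io.StringIO(raw_csv) with an opened file handle would leave the rest of the loop unchanged.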

Model usage

from DataValuationPlatform import HTSDataPreprocessor, MVSA, TracIn, CatBoost, DVRL


#create a preprocessor object and load the datasets you are interested in (e.g. the preprocessed datasets supplied in this repository by using their names)
preprocessor = HTSDataPreprocessor(["GPCR_3", "GPCR_2", "GPCR"])
preprocessor.load_preprocessed_data()

#calculate their molecular descriptors (currently implemented are ECFPs, a set of 208 RDKit descriptors, and SMILES)
preprocessor.create_descriptors(descriptor_type = "ecfp")

# create dataset objects for each dataset, which contain their train and test sets, molecular descriptors, labels
dataset_gpcr3 = preprocessor.get_dataset("GPCR_3")
dataset_gpcr2 = preprocessor.get_dataset("GPCR_2")

#create a data valuation model
mvsa_model = MVSA()

#you can either use these models just for calculating importance scores for a dataset
gpcr3_influence_scores = mvsa_model.calculate_influence(dataset_gpcr3)

#or apply one of the applications explained in the paper

#false positive prediction
gpcr3_false_positives_mvsa_results, gpcr3_mvsa_logs = mvsa_model.apply_false_positive_identification(dataset = dataset_gpcr3, replicates = 3)

#active learning
gpcr3_active_learning_mvsa_results = mvsa_model.apply_active_learning(dataset = dataset_gpcr3, step_size = 1, steps = 6, regression_function = "gpr", sampling_function = "greedy")

#importance undersampling
gpcr3_undersampling_mvsa_results = mvsa_model.apply_undersampling(dataset = dataset_gpcr3, steps = 19)
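
The structure of the objects returned by the application methods above is not described here, so the sketch below does not rely on it. It only illustrates, on plain Python lists, how a ranking of primary actives by importance score could be checked against confirmatory labels. The helper name `precision_at_top`, the toy numbers, and the assumption that a higher score marks a more suspicious compound are all hypothetical; the score direction in particular may differ between models.

```python
def precision_at_top(scores, is_false_positive, fraction=0.1):
    """Share of true false positives among the top-scoring fraction of samples."""
    if len(scores) != len(is_false_positive):
        raise ValueError("scores and labels must have the same length")
    n_top = max(1, round(len(scores) * fraction))
    # Assumption: a higher score marks a more suspicious sample.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(is_false_positive[i] for i in ranked[:n_top])
    return hits / n_top


# Made-up scores for ten primary actives; 1 marks a confirmed false positive.
toy_scores = [0.9, 0.1, 0.8, 0.3, 0.2, 0.7, 0.05, 0.4, 0.6, 0.15]
toy_flags = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]
toy_precision = precision_at_top(toy_scores, toy_flags, fraction=0.3)
baseline = sum(toy_flags) / len(toy_flags)  # precision expected from random picking
```

Comparing the result with the baseline gives a rough sense of whether the ranking enriches false positives over random selection.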

Project details

Release history

1.3 (this version): Dec 21, 2023
1.2: Dec 20, 2023
1.1: Dec 20, 2023
1.0: Dec 20, 2023
0.8: Dec 20, 2023
0.7: Dec 20, 2023
0.6: Dec 20, 2023
0.5: Dec 20, 2023
0.4: Dec 20, 2023
0.3: Dec 20, 2023
0.2: Dec 20, 2023
0.1: Dec 20, 2023

Download files

Source Distribution

DataValuationPlatform-1.3.tar.gz (70.5 MB)

Uploaded Dec 21, 2023 Source

Built Distribution

DataValuationPlatform-1.3-py3-none-any.whl (75.1 MB)

Uploaded Dec 21, 2023 Python 3

Hashes for DataValuationPlatform-1.3.tar.gz

Algorithm	Hash digest
SHA256	`96e0794dc630727efbfa55f92af29aa8301358b4afc487e3c32d0aa348d4086e`
MD5	`6ebec92207a96ae3b1cba343727d9b50`
BLAKE2b-256	`d157a9ffd29f25ffbc359e41b4bd99673ef311ac91a12552dbaa82d62424fff2`

Hashes for DataValuationPlatform-1.3-py3-none-any.whl

Algorithm	Hash digest
SHA256	`95e90bec6706ef0a17a20c15d6da02d5e037e87ed4791a7cd2de9f6fd28da7a9`
MD5	`4d0e970c187102c7d977ac017d10cd7b`
BLAKE2b-256	`123aa410893310d4d1f4cb7eb834658fdd237abe5663f016ea1b6e9a4ef8247f`