
A modular platform to test data valuation methods on high-throughput screening (HTS) data applications

Project description

DataValuationPlatform

This package provides implementations of a range of data valuation methods. It has the following features:

Features

  • Data Loading and Preprocessing: The platform includes an HTS Data Processor that allows easy preprocessing of PubChem datasets. View Code
  • Model Integration: The platform supports various data valuation models, each offering unique approaches to data valuation.
  • Ready-to-use applications: Applications such as active learning, false positive detection, and importance undersampling are implemented for all data valuation models and ready to use.
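The package is distributed on PyPI (see the files at the bottom of this page), so it can be installed with pip:

pip install DataValuationPlatform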

The package contains the data valuation models described in our manuscript:

Data Valuation Models

  1. CatBoost Model: An implementation of the CatBoost algorithm, known for handling categorical data efficiently. View Code
  2. DVRL Model: Integrates the DVRL (Data Valuation using Reinforcement Learning) approach for data valuation. View Code
  3. KNN Shapley Model: Applies the KNN Shapley method, a technique based on Shapley values, for assessing data influence. View Code
  4. TracIn Model: Applies the TracIn method, calculating sample influence by tracing gradient descent. View Code
  5. MVSA Model: Implements the MVSA (Most Valuable Subset Analysis) for evaluating data subsets. View Code
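All models are used through a shared interface: instantiate the model class, then either compute influence scores directly or run one of the applications. A minimal sketch of swapping methods on the same dataset, assuming each model class exposes the same calculate_influence method shown for MVSA in the Model usage section below:

from DataValuationPlatform import HTSDataPreprocessor, MVSA, TracIn, CatBoost, DVRL

# Load one of the preprocessed datasets and compute ECFP descriptors
preprocessor = HTSDataPreprocessor(["GPCR_3"])
preprocessor.load_preprocessed_data()
preprocessor.create_descriptors(descriptor_type="ecfp")
dataset = preprocessor.get_dataset("GPCR_3")

# Assumption: the model classes are interchangeable, so the same dataset
# can be scored with each method in turn
for model_class in (MVSA, TracIn, CatBoost, DVRL):
    scores = model_class().calculate_influence(dataset)
    print(model_class.__name__, scores)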

Tutorial

The following tutorial shows how to load some of the datasets included in this repository into a Jupyter notebook, calculate molecular descriptors, and use one of the data valuation methods for false positive prediction.

Dataset Loading

# 1. You can load the preprocessed datasets used in this publication by name
from DataValuationPlatform import HTSDataPreprocessor

preprocessor = HTSDataPreprocessor(["GPCR_3", "GPCR_2", "GPCR"])
preprocessor.load_preprocessed_data()
preprocessor.create_descriptors(descriptor_type="ecfp")

dataset_gpcr3 = preprocessor.get_dataset("GPCR_3")
dataset_gpcr2 = preprocessor.get_dataset("GPCR_2")
dataset_gpcr = preprocessor.get_dataset("GPCR")

# 2. You can add PubChem assay combinations that pair a primary and a confirmatory
# assay by downloading the raw files and adding the dataset to the existing
# collection (example with made-up AIDs)
preprocessor = HTSDataPreprocessor([])
preprocessor.add_dataset_by_AID(codename="MadeUpAssayName", primary_AID="001", confirmatory_AID="002")
preprocessor.add_dataset_by_AID(codename="MadeUpAssayName2", primary_AID="003", confirmatory_AID="004")
preprocessor.preprocess_data(path_to_raw="Path/To/Raw_data/")
preprocessor.create_descriptors("ecfp")

dataset_MadeUpAssayName = preprocessor.get_dataset("MadeUpAssayName")
dataset_MadeUpAssayName2 = preprocessor.get_dataset("MadeUpAssayName2")

# 3. You can add your own data directly as a custom dataset
preprocessor = HTSDataPreprocessor([])
preprocessor.create_custom_dataset(
    dataset_name="CustomDataset",
    training_set_smiles=train_smiles,
    training_set_labels=train_labels,
    training_set_confirmatory_labels=train_confirmatory_labels)  # only necessary for false positive identification
preprocessor.create_descriptors("ecfp")

dataset_CustomDataset = preprocessor.get_dataset("CustomDataset")
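In the custom dataset example above, train_smiles, train_labels, and train_confirmatory_labels are assumed to already exist. They are plain per-compound sequences; a made-up illustration (the molecules and label values below are invented, and binary labels are an assumption):

# Invented example inputs for create_custom_dataset
train_smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # one SMILES string per compound
train_labels = [0, 1, 1]                                     # primary assay outcome (assumed binary)
train_confirmatory_labels = [0, 0, 1]                        # confirmatory assay outcome; only needed
                                                             # for false positive identification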

Model usage

from DataValuationPlatform import HTSDataPreprocessor, MVSA, TracIn, CatBoost, DVRL


# Create a preprocessor object and load the datasets you are interested in
# (e.g. the preprocessed datasets supplied in this repository, by using their names)
preprocessor = HTSDataPreprocessor(["GPCR_3", "GPCR_2", "GPCR"])
preprocessor.load_preprocessed_data()

# Calculate their molecular descriptors (currently implemented are ECFPs,
# a set of 208 RDKit descriptors, and SMILES)
preprocessor.create_descriptors(descriptor_type="ecfp")

# Create dataset objects for each dataset, which contain their train and test sets,
# molecular descriptors, and labels
dataset_gpcr3 = preprocessor.get_dataset("GPCR_3")
dataset_gpcr2 = preprocessor.get_dataset("GPCR_2")

# Create a data valuation model
mvsa_model = MVSA()

# You can either use these models just to calculate influence scores for a dataset
gpcr3_influence_scores = mvsa_model.calculate_influence(dataset_gpcr3)

# ... or apply one of the applications explained in the paper:

# False positive prediction
gpcr3_false_positives_mvsa_results, gpcr3_mvsa_logs = mvsa_model.apply_false_positive_identification(
    dataset=dataset_gpcr3, replicates=3)

# Active learning
gpcr3_active_learning_mvsa_results = mvsa_model.apply_active_learning(
    dataset=dataset_gpcr3, step_size=1, steps=6,
    regression_function="gpr", sampling_function="greedy")

# Importance undersampling
gpcr3_undersampling_mvsa_results = mvsa_model.apply_undersampling(
    dataset=dataset_gpcr3, steps=19)
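The influence scores themselves can also be used for custom analyses. A minimal sketch, under the assumption that calculate_influence returns one numeric score per training sample in training-set order (not guaranteed by this page), ranking the most influential compounds:

import numpy as np

# Assumption: one influence score per training sample, aligned with the training set
scores = np.asarray(gpcr3_influence_scores).ravel()
top10 = np.argsort(scores)[::-1][:10]  # indices of the ten highest-scoring samples
print("Most influential training sample indices:", top10)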

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

DataValuationPlatform-1.3.tar.gz (70.5 MB)

Uploaded Source

Built Distribution

DataValuationPlatform-1.3-py3-none-any.whl (75.1 MB)

Uploaded Python 3

File details

Details for the file DataValuationPlatform-1.3.tar.gz.

File metadata

  • Download URL: DataValuationPlatform-1.3.tar.gz
  • Size: 70.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.8.5

File hashes

Hashes for DataValuationPlatform-1.3.tar.gz

  • SHA256: 96e0794dc630727efbfa55f92af29aa8301358b4afc487e3c32d0aa348d4086e
  • MD5: 6ebec92207a96ae3b1cba343727d9b50
  • BLAKE2b-256: d157a9ffd29f25ffbc359e41b4bd99673ef311ac91a12552dbaa82d62424fff2


File details

Details for the file DataValuationPlatform-1.3-py3-none-any.whl.


File hashes

Hashes for DataValuationPlatform-1.3-py3-none-any.whl

  • SHA256: 95e90bec6706ef0a17a20c15d6da02d5e037e87ed4791a7cd2de9f6fd28da7a9
  • MD5: 4d0e970c187102c7d977ac017d10cd7b
  • BLAKE2b-256: 123aa410893310d4d1f4cb7eb834658fdd237abe5663f016ea1b6e9a4ef8247f

