Data Valuation Platform

Repository Structure

DataValuationPlatform: This folder contains the codebase for the DataValuationPlatform Package. This package allows the easy application of Data Valuation Methods with predefined Data Loader and Model classes as well as implemented applications such as false positive detection, active learning, and undersampling.
Datasets: This folder contains the 25 preprocessed datasets used for the false positive detection application and the active learning application as .csv files, split into training and validation set.
Experiments: This folder contains all scripts used to generate the results presented in our publication, as well as the results themselves.

DataValuationPlatform

This package can be used to implement a range of data valuation methods. It has the following features:

Features

Data Loading and Preprocessing: The platform includes an HTS Data Processor that allows easy preprocessing of PubChem datasets.
Model Integration: The platform supports various data valuation models, each offering unique approaches to data valuation.
Ready-to-use applications: Applications such as active learning, false positive detection, and importance undersampling are implemented for all data valuation models and ready to use.

The package contains the data valuation models described in our manuscript:

Data Valuation Models

CatBoost Model: An implementation of the CatBoost algorithm, known for handling categorical data efficiently.
DVRL Model: Integrates the DVRL (Data Valuation using Reinforcement Learning) approach for data valuation.
KNN Shapley Model: Applies the KNN Shapley method, a technique based on Shapley values, for assessing data influence.
TracIn Model: Applies the TracIn method, calculating sample influence by tracing gradient descent.
MVSA Model: Implements the MVSA (Most Valuable Subset Analysis) for evaluating data subsets.
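
For intuition about what a per-sample importance score looks like, the KNN Shapley value has a closed-form recursion for an unweighted K-nearest-neighbour classifier and a single test point. The sketch below implements that recursion in plain Python. It is independent of this package (whose KNN Shapley model depends on datascope, per the installation notes), and the function name `knn_shapley_single` and the toy data are hypothetical.

```python
from typing import List, Sequence


def knn_shapley_single(
    train_x: Sequence[Sequence[float]],
    train_y: Sequence[int],
    test_x: Sequence[float],
    test_y: int,
    k: int,
) -> List[float]:
    """Exact Shapley value of each training point for one test point,
    with an unweighted KNN classifier as the utility (illustrative sketch)."""
    n = len(train_x)
    if n == 0 or k < 1:
        raise ValueError("need at least one training point and k >= 1")

    # Rank training points from nearest to farthest (squared Euclidean distance).
    order = sorted(
        range(n),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], test_x)),
    )
    # match[pos] is 1.0 if the pos-th nearest neighbour has the test label.
    match = [1.0 if train_y[i] == test_y else 0.0 for i in order]

    values = [0.0] * n
    # Base case: the farthest point.
    values[order[-1]] = match[-1] / n
    # Walk from the second-farthest point towards the nearest one.
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of the current point
        values[order[pos]] = (
            values[order[pos + 1]]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )
    return values


# Toy data: three 1-D training points, one test point with label 1.
toy_values = knn_shapley_single([[0.0], [1.0], [2.0]], [1, 0, 1], [0.1], 1, k=1)
print(toy_values)
```

A useful sanity check is that the values sum to the classifier's utility on the test point, i.e. the fraction of the K nearest neighbours that carry the test label.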

Tutorial

The following tutorial shows how to load some of the datasets included in this repository into a Jupyter notebook, calculate molecular descriptors, and use one of the data valuation methods for false positive prediction, active learning, and undersampling.

from DataValuationPlatform import HTSDataPreprocessor, MVSA, TracIn, CatBoost, DVRL

# Create a preprocessor object and load the datasets you are interested in
# (e.g. the preprocessed datasets supplied in this repository, referenced by name).
preprocessor = HTSDataPreprocessor(["GPCR_3", "GPCR_2", "GPCR"])
preprocessor.load_preprocessed_data()

# Calculate their molecular descriptors (currently implemented are ECFPs,
# a set of 208 RDKit descriptors, and SMILES).
preprocessor.create_descriptors(descriptor_type="ecfp")

# Create dataset objects for each dataset; they contain the train and test sets,
# molecular descriptors, and labels.
dataset_gpcr3 = preprocessor.get_dataset("GPCR_3")
dataset_gpcr2 = preprocessor.get_dataset("GPCR_2")

# Create a data valuation model.
mvsa_model = MVSA()

# You can use a model just to calculate importance scores for a dataset ...
gpcr3_influence_scores = mvsa_model.calculate_influence(dataset_gpcr3)

# ... or apply one of the applications explained in the paper.

# False positive prediction
gpcr3_false_positives_mvsa_results, gpcr3_mvsa_logs = mvsa_model.apply_false_positive_identification(
    dataset=dataset_gpcr3, replicates=3
)

# Active learning
gpcr3_active_learning_mvsa_results = mvsa_model.apply_active_learning(
    dataset=dataset_gpcr3,
    step_size=1,
    steps=6,
    regression_function="gpr",
    sampling_function="greedy",
)

# Importance undersampling
gpcr3_undersampling_mvsa_results = mvsa_model.apply_undersampling(dataset=dataset_gpcr3, steps=19)
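
As a rough illustration of how per-sample importance scores might be turned into a false positive shortlist, the sketch below ranks the primary-screen actives by score and flags the top fraction. It is not the package's implementation. It assumes, hypothetically, that the scores are one float per training compound and that a higher score means "more likely a false positive". The helper names `flag_top_fraction` and `precision_of_flags` and all data are made up for this example.

```python
import math
from typing import List, Sequence


def flag_top_fraction(
    scores: Sequence[float],
    is_active: Sequence[bool],
    fraction: float = 0.1,
) -> List[int]:
    """Return indices of the highest-scoring primary actives (hypothetical helper)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    active_idx = [i for i, active in enumerate(is_active) if active]
    if not active_idx:
        return []
    # Flag at least one compound, rounding the requested fraction up.
    n_flag = max(1, math.ceil(fraction * len(active_idx)))
    ranked = sorted(active_idx, key=lambda i: scores[i], reverse=True)
    return ranked[:n_flag]


def precision_of_flags(flagged: Sequence[int], is_false_positive: Sequence[bool]) -> float:
    """Fraction of flagged compounds that are confirmed false positives."""
    if not flagged:
        return 0.0
    hits = sum(1 for i in flagged if is_false_positive[i])
    return hits / len(flagged)


# Made-up scores and labels for six compounds.
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7]
is_active = [True, True, False, True, False, True]
is_false_positive = [True, False, False, False, False, True]

flagged = flag_top_fraction(scores, is_active, fraction=0.5)
print(flagged, precision_of_flags(flagged, is_false_positive))
```

Restricting the ranking to primary actives keeps inactive compounds with high scores (index 2 above) out of the shortlist.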

Experiments

The files included here are the preprocessing scripts:

cleanup_pipeline: creates the Dataset files from the raw files downloaded from PubChem
descr_export_pipeline_jh: creates molecular descriptors from the Dataset files
mol_data_descr_export_pipeline_jh: creates molecular descriptors for the MolData dataset

as well as the scripts and utility files used to perform the experiments shown in the paper:

[FalsePositivePrediction](https://github.com/JoshuaHesse/DataValuationPlatform/tree/master/Experiments/Scripts/FalsePositivePrediction): contains the eval_pipeline used for the false positive prediction application as well as the necessary utility files
ActiveLearning: contains the active_learning_pipeline used for the active learning application as well as the necessary utility files
Undersampling: contains the undersampling_pipeline used for the undersampling application as well as the necessary utility files

Usage

To reproduce the results, first use descr_export_pipeline_jh to create the molecular descriptors for the datasets you are interested in:

cd DataValuationPlatform/Experiments/Scripts
python3 descr_export_pipeline_jh.py --dataset all --representation ECFP

False Positive Prediction

Here is an example of how to use the eval_pipeline to test the false and true positive prediction performance of MVS-A, TracIn, and CatBoost on one dataset, using PAINS fragment filters and the Score method as benchmarks, with 5 replicates:

cd DataValuationPlatform/Experiments/Scripts/FalsePositivePrediction
python3 eval_pipeline_jh.py --dataset GPCR_3 --knn no --dvrl no --tracin yes --mvs_a yes --catboost yes --score yes --fragment_filter yes --representation ECFP --replicates 5 --filename output --log_predictions yes --environment others

More information is given in the FalsePositivePrediction folder.
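
To repeat the same evaluation over several datasets, a small wrapper can assemble the command shown above once per dataset. The sketch below only reuses flags that appear in that command. It assumes that `--dataset` accepts the other dataset names from the tutorial and that `--filename` takes an arbitrary output name; `build_eval_command` and `run_all` are hypothetical helpers, not part of the repository. It must be run from the FalsePositivePrediction folder, and defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List, Sequence


def build_eval_command(dataset: str, replicates: int = 5, filename: str = "output") -> List[str]:
    """Assemble the eval_pipeline_jh.py call from the README for one dataset."""
    return [
        "python3", "eval_pipeline_jh.py",
        "--dataset", dataset,
        "--knn", "no",
        "--dvrl", "no",
        "--tracin", "yes",
        "--mvs_a", "yes",
        "--catboost", "yes",
        "--score", "yes",
        "--fragment_filter", "yes",
        "--representation", "ECFP",
        "--replicates", str(replicates),
        "--filename", filename,
        "--log_predictions", "yes",
        "--environment", "others",
    ]


def run_all(datasets: Sequence[str], dry_run: bool = True) -> None:
    """Print (dry run) or execute the evaluation for each dataset in turn."""
    for name in datasets:
        # Assumption: a per-dataset output name avoids overwriting results.
        command = build_eval_command(name, filename=f"output_{name}")
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


run_all(["GPCR_3", "GPCR_2"])
```

Passing the arguments as a list rather than a single shell string avoids quoting problems, and `check=True` stops the loop as soon as one run fails.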

Active Learning

Here is an example of how to run the active_learning_pipeline on all datasets using the standard parameters:

cd DataValuationPlatform/Experiments/Scripts/ActiveLearning
python3 active_learning_pipeline_jh.py

Undersampling

This part of the project was done using the MolData benchmark instead of our curated dataset group. To reproduce it, first clone the MolData benchmark into this folder and calculate the molecular descriptors:

cd DataValuationPlatform/Experiments/Scripts
git clone https://github.com/LumosBio/MolData
python3 mol_data_descr_export_pipeline_jh.py --group_type disease --dataset_group aging --representation ECFP
cd DataValuationPlatform/Experiments/Scripts
python3 undersampling_pipeline_jh.py --dataset_group aging --influence MVSA --replicates 5

Prerequisites

The platform currently supports Python 3.8. Some required packages are not included in the pip install:

Tensorflow (2.4.0)
Datascope (0.0.10)
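
Before running the platform, it can help to confirm that the interpreter and the two manually installed packages match the versions listed above. The sketch below uses only the standard library. It assumes the distributions are registered under the names `tensorflow` and `datascope`; `installed_version` and `check_environment` are hypothetical helper names.

```python
import sys
from importlib import metadata
from typing import Dict, Optional

# Versions taken from the prerequisites list; distribution names are assumed.
REQUIRED = {"tensorflow": "2.4.0", "datascope": "0.0.10"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment(required: Dict[str, str] = REQUIRED) -> Dict[str, str]:
    """Compare the running interpreter and installed packages with the expected versions."""
    report: Dict[str, str] = {}
    major, minor = sys.version_info[:2]
    report["python"] = "ok" if (major, minor) == (3, 8) else f"found {major}.{minor}, expected 3.8"
    for package, wanted in required.items():
        found = installed_version(package)
        if found is None:
            report[package] = f"missing (expected {wanted})"
        elif found == wanted:
            report[package] = "ok"
        else:
            report[package] = f"found {found}, expected {wanted}"
    return report


for name, status in check_environment().items():
    print(f"{name}: {status}")
```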

Installation

The DataValuationPlatform will be installable using pip. Alternatively, you can clone this repository manually. The KNN Shapley model is not available in the pip package due to incompatibilities with the rest of the platform (datascope uses numpy version 1.24.2, while the remaining packages use 1.19.2).

