A Python package for synthetic proteomics data augmentation using ProtoGAIN
GenerativeProteomics
WORK STILL IN PROGRESS
This repository contains a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) [1] for imputing missing iBAQ values in proteomics datasets.
Installation
- Clone this repository:
  git clone https://github.com/QuantitativeBiology/ProtoGain/
- Create a Python environment (if you have conda installed):
  conda create -n proto python=3.10
- Activate the previously created environment:
  conda activate proto
- Install the necessary packages:
  pip install -r requirements.txt
How to Use
To impute a general dataset, the simplest way to run ProtoGain is: python protogain.py -i /path/to/file_to_impute.csv
Running in this manner results in two separate training phases:
- Evaluation run: A percentage of the values (10% by default) is concealed during the training phase and the dataset is then imputed. The RMSE is calculated with those hidden values as targets. At the end of the training phase a test_imputed.csv file is created containing the original hidden values and the resulting imputations, which gives you an estimate of the imputation accuracy.
- Imputation run: A proper training phase then takes place using the entire dataset. An imputed.csv file is created containing the imputed dataset.
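The evaluation run described above can be illustrated with a minimal, standard-library sketch. It is not ProtoGain's code: the function names are hypothetical, and a column-mean imputer stands in for the GAIN model purely so that the example runs on its own. What it shows is the general idea of concealing a fraction of the observed values, imputing, and scoring the RMSE only on the concealed entries.

```python
import math
import random


def conceal_values(rows, miss=0.1, seed=0):
    """Hide a fraction `miss` of the observed (non-NaN) entries.

    Returns a masked copy of `rows` and a dict {(row, col): original_value}.
    """
    rng = random.Random(seed)
    observed = [
        (i, j)
        for i, row in enumerate(rows)
        for j, value in enumerate(row)
        if not math.isnan(value)
    ]
    hidden_positions = rng.sample(observed, int(round(miss * len(observed))))
    masked = [list(row) for row in rows]  # copy, so the input stays untouched
    hidden = {}
    for i, j in hidden_positions:
        hidden[(i, j)] = masked[i][j]
        masked[i][j] = math.nan
    return masked, hidden


def impute_column_mean(rows):
    """Placeholder imputer (NOT GAIN): fill each NaN with its column mean."""
    means = []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(present) / len(present) if present else 0.0)
    return [
        [means[j] if math.isnan(value) else value for j, value in enumerate(row)]
        for row in rows
    ]


def rmse_on_hidden(imputed, hidden):
    """RMSE computed only over the concealed entries."""
    if not hidden:
        raise ValueError("no values were concealed")
    squared = [(imputed[i][j] - target) ** 2 for (i, j), target in hidden.items()]
    return math.sqrt(sum(squared) / len(squared))


# Toy dataset with two values that are missing from the start.
data = [
    [1.0, 2.0, math.nan],
    [2.0, math.nan, 6.0],
    [3.0, 6.0, 9.0],
    [4.0, 8.0, 12.0],
    [5.0, 10.0, 15.0],
]
masked, hidden = conceal_values(data, miss=0.25, seed=0)
imputed = impute_column_mean(masked)
score = rmse_on_hidden(imputed, hidden)
print(f"concealed {len(hidden)} values, RMSE on them: {score:.3f}")
```

Values that were already missing in the input are never used as targets, since there is no ground truth for them; only the deliberately concealed entries contribute to the score.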
You may want to change some of the default arguments. You can set them in a parameters.json file (an example is available in GenerativeProteomics/breast/parameters.json) or pass them directly on the command line.
Run with a parameters.json file: python protogain.py --parameters /path/to/parameters.json
Run with command line arguments: python protogain.py -i /path/to/file_to_impute.csv -o imputed_name --ofolder ./results/ --it 2001
Arguments:
-i: Path to file to impute
-o: Name of imputed file
--ofolder: Path to the output folder
--it: Number of iterations to train the model
--miss: The fraction of values to be concealed during the evaluation run (from 0 to 1)
--outall: Set this argument to 1 if you want to output every metric
--override: Set this argument to 1 if you want to delete the previously created files when writing the new output
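When several runs need to be launched from a script, the documented flags can be assembled programmatically. The sketch below is only an illustration: the dictionary keys and the build_command helper are hypothetical names invented here, and only the flags themselves come from the argument list above.

```python
import shlex

# Flags are taken from the argument list above; the dictionary keys are
# hypothetical names chosen for this sketch.
FLAG_FOR_KEY = {
    "input": "-i",
    "output": "-o",
    "output_folder": "--ofolder",
    "iterations": "--it",
    "miss": "--miss",
    "outall": "--outall",
    "override": "--override",
}


def build_command(settings, script="protogain.py", python="python"):
    """Turn a settings dict into an argument list for the documented CLI."""
    unknown = set(settings) - set(FLAG_FOR_KEY)
    if unknown:
        raise KeyError(f"unsupported settings: {sorted(unknown)}")
    miss = settings.get("miss")
    if miss is not None and not 0 <= miss <= 1:
        raise ValueError("miss must be between 0 and 1")
    command = [python, script]
    for key, flag in FLAG_FOR_KEY.items():
        if key in settings:
            command += [flag, str(settings[key])]
    return command


command = build_command(
    {
        "input": "data.csv",
        "output": "imputed_name",
        "output_folder": "./results/",
        "iterations": 2001,
        "miss": 0.1,
    }
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from the repository folder. Building a list rather than a single string avoids quoting problems with paths that contain spaces.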
To test the imputation accuracy, you may provide a reference file containing a complete version of the dataset (without missing values): python protogain.py -i /path/to/file_to_impute.csv --ref /path/to/complete_dataset.csv
Running this way will calculate the RMSE of the imputation in relation to the complete dataset.
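As a rough illustration of this comparison, the sketch below computes an RMSE between an imputed dataset and a complete reference. It rests on assumptions that the text above does not confirm: the files are taken to be purely numeric CSVs with a header row, missing values are taken to be empty cells, and the error is measured only at the positions that were missing in the input. The helper names are hypothetical, and in-memory strings stand in for real files.

```python
import csv
import io
import math


def read_matrix(handle):
    """Read a numeric CSV with a header row; empty cells become NaN."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [
        [float(cell) if cell.strip() else math.nan for cell in row]
        for row in reader
    ]


def rmse_against_reference(incomplete, imputed, reference):
    """RMSE over the positions that were missing in `incomplete`."""
    squared = [
        (imputed[i][j] - reference[i][j]) ** 2
        for i, row in enumerate(incomplete)
        for j, value in enumerate(row)
        if math.isnan(value)
    ]
    if not squared:
        raise ValueError("the incomplete dataset has no missing values")
    return math.sqrt(sum(squared) / len(squared))


# In-memory stand-ins for the input, the imputed output and the reference.
incomplete = read_matrix(io.StringIO("a,b\n1.0,\n,4.0\n5.0,6.0\n"))
imputed = read_matrix(io.StringIO("a,b\n1.0,2.5\n2.0,4.0\n5.0,6.0\n"))
reference = read_matrix(io.StringIO("a,b\n1.0,2.0\n3.0,4.0\n5.0,6.0\n"))

error = rmse_against_reference(incomplete, imputed, reference)
print(f"RMSE against the reference: {error:.4f}")
```

Restricting the error to the originally missing positions keeps the observed values, which an imputer normally leaves unchanged, from diluting the score.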
Demo
This repository contains a folder named breast, which holds a breast cancer diagnostic dataset [2] that you may use to try out the code.
breast.csv: complete dataset
breastMissing_20.csv: the same dataset but with 20% of its values taken out
To simply impute breastMissing_20.csv run: python protogain.py -i ./breast/breastMissing_20.csv
If you want to compare the imputation with the original dataset run: python protogain.py -i ./breast/breastMissing_20.csv --ref ./breast/breast.csv or python protogain.py --parameters ./breast/parameters.json
To analyse every metric in more depth, either set --outall to 1 or run the code in an IPython console. In the console you can access any variable you want in the metrics object, e.g. metrics.loss_D.
References
[1] J. Yoon, J. Jordon & M. van der Schaar (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets.
[2] Breast Cancer Wisconsin (Diagnostic) dataset, UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic