
A Python package for synthetic proteomics data augmentation using ProtoGAIN


GenerativeProteomics

WORK STILL IN PROGRESS

This repository contains a PyTorch implementation of Generative Adversarial Imputation Networks (GAIN) [1] for imputing missing iBAQ values in proteomics datasets.
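As a rough illustration of the idea behind GAIN [1]: a mask marks which entries are observed, the missing entries are filled with noise before being passed to the generator, and the final output keeps every observed value while taking the generator's output only where data was missing. The sketch below uses plain Python lists and a placeholder `toy_generator` (a hypothetical stand-in that returns row means, not a trained network and not this package's code); the noise range is also an assumption.

```python
import random


def combine(mask, data, fill):
    """Keep `data` where the mask is 1 (observed); use `fill` where it is 0 (missing)."""
    return [
        [x if m else f for m, x, f in zip(mask_row, data_row, fill_row)]
        for mask_row, data_row, fill_row in zip(mask, data, fill)
    ]


def toy_generator(x_tilde):
    """Hypothetical placeholder for the trained generator: predicts each row's mean."""
    return [[sum(row) / len(row)] * len(row) for row in x_tilde]


rng = random.Random(0)

# None marks a missing measurement.
data = [[0.2, None, 0.7], [None, 0.5, 0.9]]
mask = [[0 if v is None else 1 for v in row] for row in data]

# Missing entries are replaced by small random noise (range is an assumption).
noise = [[rng.uniform(0.0, 0.01) for _ in row] for row in data]
x_tilde = combine(mask, data, noise)

# Observed values are kept as-is; only missing cells take generated values.
generated = toy_generator(x_tilde)
x_hat = combine(mask, data, generated)
print(x_hat)
```

In the full method a discriminator is trained alongside the generator to tell observed entries from imputed ones; that adversarial part is omitted here.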


Installation

  1. Clone this repository: git clone https://github.com/QuantitativeBiology/ProtoGain/
  2. Create a Python environment (if you have conda installed): conda create -n proto python=3.10
  3. Activate the previously created environment: conda activate proto
  4. Install the necessary packages: pip install -r requirements.txt

How to Use

To impute a general dataset, the simplest way to run ProtoGain is: python protogain.py -i /path/to/file_to_impute.csv

Running it this way results in two separate training phases:

  1. Evaluation run: A percentage of the values (10% by default) is concealed during the training phase, and the dataset is then imputed. The RMSE is calculated using the hidden values as targets. At the end of the training phase, a test_imputed.csv file is created containing the original hidden values and their imputed counterparts, which gives you an estimate of the imputation accuracy.

  2. Imputation run: A full training phase then takes place using the entire dataset, and an imputed.csv file is created containing the imputed dataset.
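The evaluation run described above can be sketched as follows. This is a minimal illustration, not the code in protogain.py: the helper names are hypothetical, the data is synthetic, and a column-mean imputer stands in for the GAIN model so that the example runs with the standard library alone.

```python
import math
import random


def conceal(data, miss=0.1, seed=0):
    """Hide a fraction `miss` of the observed cells; return the masked copy and the hidden positions."""
    rng = random.Random(seed)
    observed = [
        (r, c)
        for r, row in enumerate(data)
        for c, value in enumerate(row)
        if not math.isnan(value)
    ]
    hidden = rng.sample(observed, round(miss * len(observed)))
    masked = [row[:] for row in data]
    for r, c in hidden:
        masked[r][c] = math.nan
    return masked, hidden


def impute_column_mean(data):
    """Stand-in imputer (NOT GAIN): fill each missing cell with its column mean."""
    means = []
    for c in range(len(data[0])):
        column = [row[c] for row in data if not math.isnan(row[c])]
        means.append(sum(column) / len(column) if column else 0.0)
    return [
        [means[c] if math.isnan(value) else value for c, value in enumerate(row)]
        for row in data
    ]


def rmse_on_hidden(original, imputed, hidden):
    """RMSE computed only over the cells that were concealed."""
    squared = [(original[r][c] - imputed[r][c]) ** 2 for r, c in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic 40 x 5 dataset standing in for a real intensity matrix.
rng = random.Random(1)
data = [[rng.gauss(10.0, 2.0) for _ in range(5)] for _ in range(40)]

masked, hidden = conceal(data, miss=0.1)
imputed = impute_column_mean(masked)
score = rmse_on_hidden(data, imputed, hidden)
print(f"hidden cells: {len(hidden)}, RMSE on hidden cells: {score:.3f}")
```

Because the concealed values are known, the RMSE on them gives an accuracy estimate even when no complete reference dataset exists.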

You may want to change some of the arguments. You can set them in a parameters.json file (see the example in GenerativeProteomics/breast/parameters.json) or pass them directly on the command line.

Run with a parameters.json file: python protogain.py --parameters /path/to/parameters.json
Run with command line arguments: python protogain.py -i /path/to/file_to_impute.csv -o imputed_name --ofolder ./results/ --it 2001

Arguments:

-i: Path to file to impute
-o: Name of imputed file
--ofolder: Path to the output folder
--it: Number of iterations to train the model
--miss: The fraction of values to be concealed during the evaluation run (from 0 to 1)
--outall: Set this argument to 1 if you want to output every metric
--override: Set this argument to 1 if you want to delete the previously created files when writing the new output
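For orientation, the documented flags could be declared with argparse roughly as below. This is a sketch under assumptions, not the parser used by protogain.py: the flag names come from this README, but the destination names, types, and most defaults are guesses (only the 10% default for --miss is stated above).

```python
import argparse


def fraction(text):
    """Parse a float and require it to lie in the interval [0, 1]."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("must be between 0 and 1")
    return value


def build_parser():
    """Build a parser mirroring the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="protogain.py")
    parser.add_argument("-i", dest="input", help="path to file to impute")
    parser.add_argument("-o", dest="output", default="imputed",  # assumed default
                        help="name of imputed file")
    parser.add_argument("--ofolder", default="./results/",  # assumed default
                        help="path to the output folder")
    parser.add_argument("--it", type=int, default=2001,  # assumed default
                        help="number of training iterations")
    parser.add_argument("--miss", type=fraction, default=0.1,
                        help="fraction of values concealed in the evaluation run")
    parser.add_argument("--outall", type=int, choices=[0, 1], default=0,
                        help="set to 1 to output every metric")
    parser.add_argument("--override", type=int, choices=[0, 1], default=0,
                        help="set to 1 to delete previously created files")
    parser.add_argument("--ref", help="path to a complete reference dataset")
    parser.add_argument("--parameters", help="path to a parameters.json file")
    return parser


args = build_parser().parse_args(["-i", "data.csv", "--it", "500", "--miss", "0.2"])
print(args.input, args.it, args.miss, args.outall)
```

Validating --miss at parse time means an out-of-range value is rejected before any training starts.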

To test how well the imputation performs, you can provide a reference file containing a complete version of the dataset (without missing values): python protogain.py -i /path/to/file_to_impute.csv --ref /path/to/complete_dataset.csv

Running it this way calculates the RMSE of the imputation relative to the complete dataset.
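A comparison against a reference file can be sketched like this. The function names and the tiny inline CSVs are hypothetical, and restricting the RMSE to cells that were missing in the input is an assumption about how the comparison is made, not a statement about protogain.py.

```python
import csv
import io
import math


def read_matrix(text):
    """Parse CSV text with a header row; empty cells become NaN."""
    rows = list(csv.reader(io.StringIO(text)))
    return [[float(v) if v.strip() else math.nan for v in row] for row in rows[1:]]


def rmse_vs_reference(incomplete, imputed, reference):
    """RMSE over the cells that were missing in `incomplete`."""
    errors = [
        (imputed[r][c] - reference[r][c]) ** 2
        for r, row in enumerate(incomplete)
        for c, value in enumerate(row)
        if math.isnan(value)
    ]
    return math.sqrt(sum(errors) / len(errors))


# Toy stand-ins for file_to_impute.csv, complete_dataset.csv and the imputed output.
incomplete = read_matrix("a,b\n1.0,\n,4.0\n5.0,6.0\n")
reference = read_matrix("a,b\n1.0,2.0\n3.0,4.0\n5.0,6.0\n")
imputed = read_matrix("a,b\n1.0,2.5\n2.0,4.0\n5.0,6.0\n")

rmse = rmse_vs_reference(incomplete, imputed, reference)
print(f"RMSE against reference: {rmse:.4f}")
```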

Demo

This repository contains a folder named breast with a breast cancer diagnostic dataset [2] that you can use to try out the code.

breast.csv: complete dataset
breastMissing_20.csv: the same dataset but with 20% of its values taken out

To simply impute breastMissing_20.csv, run: python protogain.py -i ./breast/breastMissing_20.csv
To compare the imputation with the original dataset, run either python protogain.py -i ./breast/breastMissing_20.csv --ref ./breast/breast.csv or python protogain.py --parameters ./breast/parameters.json

For a deeper analysis of every metric, either set --outall to 1 or run the code in an IPython console, where you can access any variable in the metrics object, e.g. metrics.loss_D.

References

[1] J. Yoon, J. Jordon & M. van der Schaar (2018). GAIN: Missing Data Imputation using Generative Adversarial Nets. Proceedings of the 35th International Conference on Machine Learning (ICML).
[2] https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
