
Lightweight package that allows for the generation of augmented RNA-seq data from a base dataset, for expanding training datasets or large-scale dataset analysis.

augmentRNA

augmentRNA is a simple toolbox for RNA-seq datasets that is compatible with both pandas and Polars.

Features

  • Normalize data based on read counts
  • Augment new samples of data for different labels based on negative binomial distributions or generative adversarial models, with tuneable noise
  • Down-sample data to equalize sample counts across class labels

Installation

augmentRNA can be installed via pip, the package manager for Python:

pip install augmentRNA

Functions


augment_data

data = augment_data(data, num_samples, label, selected_label = 0, evals = False, epochs = 20, augment_type = 'nbinom', polars = False, normalize = False, noise = 0)

Augments new data samples for RNA-seq analysis

Inputs

data : polars df, pandas df, str A dataframe containing the RNA-seq data, or a path to a .csv file of the dataframe

num_samples : int The number of additional samples to generate from the data

label : str The label of the df column containing the classification label

selected_label : str, int The selected label that should be amplified. 'all' will amplify all labels to the selected amount

augment_type : str The type of augmentation to perform. A string containing 'nbinom' samples from a negative binomial distribution where applicable and from a normal distribution otherwise; genes with no expression in the sample are output as zeroes. A string containing 'gan' trains a generative adversarial network and samples from it. Defaults to 'nbinom'

noise: int, float The amount of noise that should be applied to the model (uniform noise based on the existing gene distribution). Defaults to zero

polars : bool Whether the input is a polars dataframe (True) or a pandas dataframe (False). Defaults to False

normalize_data : bool Whether the data should be normalized based on read counts. Defaults to False

epochs : int The number of epochs to train the model for when a GAN is used. Defaults to 20

Outputs

data : polars df, pandas df Output dataframe containing augmented data and old data.
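The 'nbinom' behaviour described above (negative binomial where applicable, normal otherwise, zeroes for unexpressed genes) could look roughly like the following sketch. It is not the package's implementation: the helper names sample_gene_counts and augment_label, the method-of-moments fit, and the rule "use the negative binomial only when the variance exceeds the mean" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def sample_gene_counts(counts, num_samples, rng):
    """Draw synthetic counts for one gene (hypothetical helper, not augmentRNA code)."""
    counts = np.asarray(counts, dtype=float)
    mean, var = counts.mean(), counts.var()
    if mean == 0:
        # Gene is not expressed in this class: output zeroes.
        return np.zeros(num_samples, dtype=int)
    if var > mean:
        # Over-dispersed counts: method-of-moments negative binomial fit (assumed).
        p = mean / var
        n = mean ** 2 / (var - mean)
        return rng.negative_binomial(n, p, size=num_samples)
    # Otherwise fall back to a normal draw, rounded and clipped at zero.
    draws = rng.normal(mean, np.sqrt(var), size=num_samples)
    return np.clip(np.rint(draws), 0, None).astype(int)


def augment_label(df, label_col, selected_label, num_samples, seed=0):
    """Append num_samples synthetic rows for one class label (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    subset = df[df[label_col] == selected_label].drop(columns=[label_col])
    new = pd.DataFrame(
        {gene: sample_gene_counts(subset[gene], num_samples, rng) for gene in subset.columns}
    )
    new[label_col] = selected_label
    # Return old and new samples together, as the Outputs section describes.
    return pd.concat([df, new], ignore_index=True)


# Toy data: three samples with label 1, one with label 0.
df = pd.DataFrame({
    "geneA": [10, 50, 5, 120],
    "geneB": [0, 0, 0, 0],
    "geneC": [7, 7, 7, 7],
    "RA": [1, 1, 1, 0],
})
augmented = augment_label(df, "RA", selected_label=1, num_samples=5)
```

In this toy example geneA is over-dispersed within label 1 and takes the negative binomial path, geneB is unexpressed and stays at zero, and geneC has no variance and takes the normal fallback.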

add_noise

data = add_noise(data, label = 'RA', noise = 0.1, noise_type = None, polars = True)

Adds several different forms of noise to a dataframe of data

Inputs

data: polars df, pandas df A dataframe containing the data that noise should be added to

label: str, int The label column name containing classes

noise: float, int The proportion of noise added to the data (as a proportion of the dataset mean or of the individual sample noise). Defaults to 0.1

noise_type: string The type of noise. Can either be mean (mean of the dataset noise) or uniform (across a uniform distribution of the maximum/minimum of the column). Defaults to uniform

polars: bool Whether polars or pandas should be used. Defaults to True

Outputs

data : polars df, pandas df Original dataframe with noise added
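To make the two noise types more concrete, here is one possible reading of them as a pandas sketch. The function add_noise_sketch is hypothetical, and the exact scaling (a fraction of each gene's mean for 'mean', a fraction of each gene's max-min range for 'uniform') is an assumption; the package may scale noise differently.

```python
import numpy as np
import pandas as pd


def add_noise_sketch(df, label, noise=0.1, noise_type="uniform", seed=0):
    """Illustrative noise injection (hypothetical, not augmentRNA's add_noise)."""
    rng = np.random.default_rng(seed)
    genes = df.drop(columns=[label]).astype(float)
    if noise_type == "mean":
        # Assumed: noise magnitude is a fraction of each gene's mean.
        scale = noise * genes.mean()
    else:
        # Assumed: noise magnitude is a fraction of each gene's max-min range.
        scale = noise * (genes.max() - genes.min())
    # Uniform jitter in [-scale, +scale] per gene, broadcast across samples.
    jitter = rng.uniform(-1.0, 1.0, size=genes.shape) * scale.to_numpy()
    noisy = (genes + jitter).clip(lower=0)  # expression values cannot be negative
    noisy[label] = df[label].to_numpy()     # labels are left untouched
    return noisy


df = pd.DataFrame({
    "geneA": [100, 120, 110, 90],
    "geneB": [500, 480, 520, 510],
    "RA": [0, 0, 1, 1],
})
noisy = add_noise_sketch(df, "RA", noise=0.1)
```

With noise = 0.1, each value of geneA moves by at most 3 (10% of its range of 30) and each value of geneB by at most 4.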


normalize_data

data = normalize_data(data, polars = False, round_data = True)

Inputs

data : polars df, pandas df Input dataframe to normalize

polars : bool Whether a polars dataframe (True) or pandas dataframe (False) should be used

round_data : bool Whether the output values should be converted to integers or kept as floats

Outputs

data : polars df, pandas df Output normalized dataframe
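The description only says that normalization is "based on read counts". One common way to do that is counts-per-million (CPM) scaling, sketched below with pandas. This is an assumption about the general technique, not a statement of what normalize_data does internally; cpm_normalize and its label argument are hypothetical.

```python
import pandas as pd


def cpm_normalize(df, label=None, round_data=True):
    """Counts-per-million scaling per sample (hypothetical, not augmentRNA code)."""
    genes = df.drop(columns=[label]) if label is not None else df
    # Total reads per sample; replace 0 with 1 so empty samples stay at zero.
    library_size = genes.sum(axis=1).replace(0, 1)
    cpm = genes.mul(1_000_000).div(library_size, axis=0)
    if round_data:
        # Mirror the round_data option: integers instead of floats.
        cpm = cpm.round().astype(int)
    if label is not None:
        cpm[label] = df[label]
    return cpm


df = pd.DataFrame({
    "geneA": [10, 30],
    "geneB": [90, 70],
    "RA": ["ctrl", "case"],
})
normalized = cpm_normalize(df, label="RA")
```

Each sample here has 100 reads in total, so geneA becomes 100000 and 300000 counts per million and geneB becomes 900000 and 700000.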


relevant_genes

data = relevant_genes(data, label = 'RA', polars = False)

Filters the dataset to only contain genes that have non-zero values in all columns, or zero values in all columns, for every label. This seeks to minimize bias from different sequencing/sampling methods used for different labels, and to make the training dataset more representative.

Inputs

data : polars df, pandas df, str RNA-seq expression dataframe

label : str Dataframe column containing labels

polars : bool Whether pandas (False) or polars (True) dataframe is the input

Outputs

data : polars df, pandas df An output dataframe containing only genes that are relevant across all samples
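A rough pandas sketch of this kind of filter follows. It rests on an interpretation that the source does not spell out: a gene counts as "detected" for a label if any sample with that label has a non-zero value, and a gene is kept only when it is detected for every label or for none. The function relevant_genes_sketch is hypothetical and may differ from the package's actual rule.

```python
import pandas as pd


def relevant_genes_sketch(df, label):
    """Keep genes whose detection status agrees across labels (hypothetical)."""
    genes = df.drop(columns=[label])
    # For each label: is the gene non-zero in at least one sample? (assumed rule)
    detected = (genes != 0).groupby(df[label]).any()
    # Keep genes detected under every label, or under no label at all.
    consistent = detected.all(axis=0) | (~detected).all(axis=0)
    keep = consistent[consistent].index.tolist()
    return df[keep + [label]]


df = pd.DataFrame({
    "geneA": [1, 2, 3, 4],   # detected under both labels: kept
    "geneB": [0, 0, 5, 6],   # only detected under label 1: dropped
    "geneC": [0, 0, 0, 0],   # detected under neither label: kept
    "RA": [0, 0, 1, 1],
})
filtered = relevant_genes_sketch(df, "RA")
```

geneB is the kind of column the filter targets: it separates the labels perfectly, but possibly only because the two groups were sequenced differently.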

Development

This project is currently under active beta development and new features are being added. If there is an additional processing feature that would fit the toolbox, please reach out to the lead developer at christian@defrondeville.com.

License

MIT


Download files

Source Distribution

augmentRNA-1.0.1.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

augmentRNA-1.0.1-py3-none-any.whl (13.5 kB)

Uploaded Python 3

File details

Details for the file augmentRNA-1.0.1.tar.gz.

File metadata

  • Download URL: augmentRNA-1.0.1.tar.gz
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.0

File hashes

Hashes for augmentRNA-1.0.1.tar.gz
Algorithm Hash digest
SHA256 d4f71835ba9044814a41cd58908076bc967b77f24adb12a9e50f859f6f6ec76d
MD5 e13845c5bec78c1277a276c1de14ccd6
BLAKE2b-256 8aea4051e4a198256b974317c6de9cbc3d1a2f2f393ba98ec283effa2d046958

File details

Details for the file augmentRNA-1.0.1-py3-none-any.whl.

File metadata

  • Download URL: augmentRNA-1.0.1-py3-none-any.whl
  • Size: 13.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.11.0

File hashes

Hashes for augmentRNA-1.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 36f0efc3de5313193af798f6cae9fd7ca7d0874f0fbeb9df4441807fde993b5a
MD5 4fac3433636ac1f85c5a94a176ad6587
BLAKE2b-256 4c3ae22bea226c0c1763dd4480c39a9342d615943c98a6aa9968decec265fc97
