Tensor framework for mutational signature analysis.
Project description
DISCLAIMER: TensorSignatures is still under development and not yet stable. Although the current version is in principle fully functional, you may run into problems using the software; if so, please don’t hesitate to get in touch.
TensorSignatures is a tensor factorization framework for mutational signature analysis which, in contrast to other methods, deciphers mutational processes not only in terms of mutational spectra but also assesses their properties with respect to various genomic variables.
Quick install
There are several ways to install TensorSignatures.
Via GitHub
To obtain the most recent version of TensorSignatures, we recommend creating a virtual environment and downloading the repository directly from GitHub. To get started, clone the repository by executing the following command in your terminal:
$ git clone https://github.com/gerstung-lab/tensorsignatures.git && cd tensorsignatures
Then, create a new virtual environment and install all dependencies.
$ python -m venv env
$ source env/bin/activate
$ pip install --upgrade pip setuptools wheel && pip install -r requirements.txt
Finally, install TensorSignatures.
$ pip install -e .
Via PyPI
To install tensorsignatures via PyPI, simply type
$ pip install tensorsignatures
into your shell. To get started with tensorsignatures, please refer to the documentation.
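A quick way to confirm that either installation route succeeded is to query the installed version from Python. This is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8 or later); the helper name `installed_version` is illustrative and not part of tensorsignatures.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "tensorsignatures" is the distribution name used in the pip command above.
version = installed_version("tensorsignatures")
print("tensorsignatures:", version if version else "not installed")
```

If this prints "not installed" inside a virtual environment, check that the environment is activated before running pip.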
Via Docker (& Jupyter)
To run TensorSignatures within a Docker environment (with Jupyter), first clone the repository
$ git clone https://github.com/gerstung-lab/tensorsignatures.git
$ cd tensorsignatures
and then spin up the container using docker-compose
$ docker-compose up --build
Free software: MIT license
Documentation: https://tensorsignatures.readthedocs.io.
Getting started
Step 1: Data preparation
To apply TensorSignatures to your data, single nucleotide variants (SNVs) need to
be split according to their genomic context and represented in a high-dimensional
count tensor. Similarly, multinucleotide variants (MNVs), deletions and insertions
(indels) have to be classified and represented in a count matrix (we do not
currently provide an automated way of generating a structural variant table).
Although TensorSignatures is written in Python, this part of the pipeline runs
in R and depends on the Bioconductor packages VariantAnnotation and rhdf5. Make
sure you have R 3.4.x and both packages installed. You can install them, if
necessary, by executing
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('VariantAnnotation')"
and
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('rhdf5')"
from your command line.
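To illustrate what "splitting SNVs by genomic context" means, here is a minimal Python sketch that counts SNVs by transcription orientation, trinucleotide context and alternative base. It is not the layout produced by processVcf.R: the function names, the "coding"/"template" labels and the pyrimidine-reference convention used here are assumptions chosen for illustration, and the real pipeline annotates further genomic states.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_snv(context, alt, gene_strand):
    """Return (orientation, trinucleotide, alt) with a pyrimidine reference base.

    context: reference trinucleotide centred on the mutated base, e.g. "ACG"
    alt: the mutated base
    gene_strand: "+" or "-" (hypothetical annotation of the overlapping gene)
    """
    pyrimidine_strand = "+"
    if context[1] in "AG":
        # Purine reference base: describe the change on the opposite strand.
        context, alt = revcomp(context), revcomp(alt)
        pyrimidine_strand = "-"
    orientation = "coding" if pyrimidine_strand == gene_strand else "template"
    return orientation, context, alt


# Toy input: (trinucleotide context, alternative base, gene strand)
snvs = [("ACG", "T", "+"), ("CGT", "A", "+"), ("TCA", "G", "-")]
counts = Counter(classify_snv(*snv) for snv in snvs)
for key, value in sorted(counts.items()):
    print(key, value)
```

Each additional genomic annotation adds one more key component, which is why the result is a tensor rather than a matrix.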
To get started, download the following files and place them in the same directory:
Constants.RData (contains GRanges objects that annotate transcription/replication orientation, nucleosomal and epigenetic states)
mutations.R (all required functions to partition SNVs, MNVs and indels)
processVcf.R (loads vcf files and creates the SNV count tensor, MNV and indel count matrix; may need custom modification to make the script run on your vcfs)
genome.zip (optional)
To obtain the SNV count tensor and the matrices containing all other mutation types, run
$ Rscript processVcf.R yourVcfFile1.vcf.gz yourVcfFile2.vcf.gz ... yourVcfFileN.vcf.gz outputHdf5File.h5
which ideally outputs an hdf5 file that can be used as input for the TensorSignatures
software. In case of errors, please check whether you have correctly specified the paths
in lines 6-8. Also, take a look at the readVcfSave function and adjust it if necessary.
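Before moving on, it can help to confirm that the script really wrote an HDF5 file. The sketch below checks for the 8-byte HDF5 format signature at the start of the file using only the standard library. The function name is illustrative, and the check only looks at offset 0, so a file with a leading user block would be reported as not matching.

```python
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file at `path` starts with the HDF5 format signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Example call (file name taken from the command above):
# looks_like_hdf5("outputHdf5File.h5")
```

This only validates the container format; it says nothing about whether the expected count tensor and matrices are present inside.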
Before you can run TensorSignatures, a trinucleotide normalization constant needs to be
added to the hdf5 data file. You can do this by calling the prep subroutine
of the TensorSignatures command-line program:
$ tensorsignatures prep outputHdf5File.h5 tsData.h5
Step 2: Run TensorSignatures
Once you have obtained the prepared input file, there are two ways to run
TensorSignatures: the refit option, which fits the exposures of a set of
pre-defined signatures to a new dataset, or the train subroutine, which
performs a de novo extraction of tensor signatures. Both options have advantages
and disadvantages: refitting tensor signatures is computationally fast but cannot
discover new signatures, while fitting new signatures requires a large
number of samples and is computationally intensive (GPU required). For most use cases
with a small number of samples, we advise using the refit option:
$ tensorsignatures --verbose refit tsData.h5 refit.pkl -n
Here is an example call to run a de novo extraction of tensor signatures:
$ tensorsignatures --verbose train tsData.h5 denovo.pkl <rank> -k <size> -n -ep <epochs>
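When several runs are needed (for example, one de novo extraction per candidate rank), the two calls above can be assembled from Python. The sketch below only mirrors the command lines shown in this section; the helper names are hypothetical, and the rank, size and epoch values in the loop are arbitrary placeholders rather than recommendations.

```python
import shlex


def refit_command(data, output):
    """Argument list mirroring the refit call shown above."""
    return ["tensorsignatures", "--verbose", "refit", data, output, "-n"]


def train_command(data, output, rank, size, epochs):
    """Argument list mirroring the train call shown above."""
    return [
        "tensorsignatures", "--verbose", "train", data, output,
        str(rank), "-k", str(size), "-n", "-ep", str(epochs),
    ]


print(shlex.join(refit_command("tsData.h5", "refit.pkl")))
for rank in (2, 3, 4):  # placeholder ranks, chosen only for illustration
    command = train_command("tsData.h5", f"denovo_rank{rank}.pkl", rank, 50, 1000)
    print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` to execute the run once the tensorsignatures command is on your PATH.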
Running TensorSignatures will yield a pickle dump, which can subsequently be
inspected using the tensorsignatures package (tutorials will follow soon).
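Until the tutorials are available, a generic first look at the pickle dump is possible with the standard library. This is a sketch that assumes nothing about the stored object beyond it being unpicklable in your environment (which in practice requires tensorsignatures to be installed); `load_result` and `describe` are illustrative names, not package functions.

```python
import pickle


def load_result(path):
    """Load a pickle dump; only use this on files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def describe(obj):
    """Return the type name and the public attribute names of `obj`."""
    public = [name for name in dir(obj) if not name.startswith("_")]
    return type(obj).__name__, public


# Example call (file name taken from the refit command above):
# type_name, attributes = describe(load_result("refit.pkl"))
# print(type_name, attributes)
```

The attribute list gives a starting point for exploring the fitted parameters interactively.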
Features
Run tensorsignatures on your dataset using the TensorSignature class provided by the package or via the command line tool.
Compute percentile-based bootstrap confidence intervals for inferred parameters.
Basic plotting tools to visualize tensor signatures and inferred parameters.
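For readers unfamiliar with percentile-based bootstrap confidence intervals, the following sketch shows the general technique on a plain list of numbers. It is a generic illustration, not the package's implementation: the function name, its parameters and the toy values are assumptions.

```python
import random
import statistics


def percentile_bootstrap_ci(values, statistic=statistics.mean,
                            n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    # Cut off alpha/2 of the bootstrap estimates in each tail.
    lower_index = int(n_resamples * alpha / 2)
    upper_index = n_resamples - 1 - lower_index
    return estimates[lower_index], estimates[upper_index]


# Toy values standing in for repeated estimates of one inferred parameter.
toy_values = [0.12, 0.18, 0.15, 0.22, 0.09, 0.17, 0.14, 0.20]
low, high = percentile_bootstrap_ci(toy_values)
print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

The interval is read directly off the sorted bootstrap estimates, so no distributional assumption about the parameter is required.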
Credits
Harald Vöhringer and Moritz Gerstung
History
0.4.0 (2019-11-25)
added subroutine prep, which adds the normalization constant to an hdf5 input file of tensorsignatures
added subroutine refit, which refits a set of predefined signatures to a new dataset
updated README.rst
fixed issue with package data
0.3.0 (2019-10-03)
various fixes
design changes
fixed setup.py
0.1.0 (2019-08-21)
First release on PyPI.
Download files
Hashes for tensorsignatures-0.4.0-py2.py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 0d948b233aa56ed9eba5380d8ce38f639056ec4bf8678724d2a0c1d1d9d052f1
MD5 | 0298d66e2f6bdbe039b636a28d47110b
BLAKE2b-256 | 7e5f901d39e7c46f6087cb3e1e7c63c80a030aa1e02d92f7d4effcf6dafad1b2