Tensor framework for mutational signature analysis.
TensorSignatures is a tensor factorisation framework for mutational signature analysis. In contrast to other methods, it deciphers mutational processes not only in terms of mutational spectra, but also assesses their properties with respect to various genomic variables, allows the inclusion of different mutation types, and integrates a robust noise model to perform the inference.
TensorSignatures is a young project and breaking changes are to be expected. We keep a changelog in which any breaking changes are clearly documented.
Quick install
TensorSignatures makes use of the TensorFlow 1.5.x framework, which requires the user to install a separate package to enable GPU support, i.e. tensorflow-gpu instead of tensorflow. We highly recommend installing TensorSignatures into an environment with tensorflow-gpu, as the tensor computations benefit greatly from GPU acceleration.
Via GitHub
To obtain the most recent version of TensorSignatures, we recommend cloning the repository directly from GitHub and installing the package into a virtual environment. To get started, clone the repository by executing the following command in your terminal:
$ git clone https://github.com/gerstung-lab/tensorsignatures.git && cd tensorsignatures
Then, create a new virtual environment and install all dependencies. If you have access to a GPU with CUDA support, use requirements-gpu.txt instead of requirements.txt.
$ python -m venv env
$ source env/bin/activate
$ pip install --upgrade pip setuptools wheel && pip install -r requirements.txt
Finally, install TensorSignatures.
$ python setup.py install
Via PyPI
To install tensorsignatures via PyPI, simply type
$ pip install tensorsignatures
into your shell.
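After installing from GitHub or PyPI, a quick check like the following can confirm that the packages are visible to the active Python environment. This is a minimal sketch using only the standard library; the helper name `installed` is hypothetical and not part of TensorSignatures. Note that both tensorflow and tensorflow-gpu provide the same importable module, `tensorflow`, so this check cannot tell them apart.

```python
import importlib.util


def installed(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Look up the two top-level modules this README relies on.
status = {name: installed(name) for name in ("tensorsignatures", "tensorflow")}
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```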
Via docker (& jupyter)
To run TensorSignatures within a docker environment, clone the repository
$ git clone https://github.com/gerstung-lab/tensorsignatures.git
$ cd tensorsignatures
and spin up the container using docker-compose
$ docker-compose up --build
This spins up a jupyter server including notebooks with tutorials on http://localhost:8889.
Free software: MIT license
Documentation: https://tensorsignatures.readthedocs.io.
Getting started
Step 1: Data preparation
Running TensorSignatures involves three steps: preparing the input data, i.e. creating the mutation count tensor as well as the mutation count matrix, computing a trinucleotide normalisation to account for differences in the nucleotide composition of different genomic regions, and running TensorSignatures.
Preparing input data using docker
We provide a docker image that contains all R and Bioconductor dependencies needed to create the variant tensor and the other mutation type matrix. To use it, pull the image from Docker Hub. Note that the image is approximately 5 GB in size.
$ docker pull sagar87/tensorsignatures-data:latest
To use the image, switch into the folder containing your VCF data. Then run the image using the following command, supplying the VCF files as well as the name of the hdf5 output file (which must be the last argument) as arguments.
$ docker run -v $PWD:/usr/src/app sagar87/tensorsignatures-data <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>
Then continue with Step 2.
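When many VCF files are involved, the docker invocation above can be assembled programmatically. The following is a minimal sketch, not part of TensorSignatures: the function name `docker_prep_command` and the example paths are hypothetical, while the image name, the mount point and the argument order (output file last) are taken from the command above.

```python
import os
import shlex

IMAGE = "sagar87/tensorsignatures-data"


def docker_prep_command(vcf_files, output_h5, workdir=None):
    """Build the docker argument list: VCF files first, hdf5 output file last."""
    if not vcf_files:
        raise ValueError("at least one VCF file is required")
    # Mount the folder containing the VCF files (defaults to the current directory).
    workdir = workdir or os.getcwd()
    return ["docker", "run", "-v", f"{workdir}:/usr/src/app", IMAGE,
            *vcf_files, output_h5]


# Hypothetical file names and folder, for illustration only.
cmd = docker_prep_command(["sample1.vcf", "sample2.vcf"], "output.h5",
                          workdir="/data/vcfs")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to actually start the container.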
Preparing input data using a custom installation
Make sure you have R 3.4.x (!) and the packages VariantAnnotation and rhdf5 installed. You can install them, if necessary, by executing
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('VariantAnnotation')"
and
$ Rscript -e "source('https://bioconductor.org/biocLite.R'); biocLite('rhdf5')"
from your command line.
To get started, download the following files and place them in the same directory:
Constants.RData (contains GRanges objects that annotate transcription/replication orientation, nucleosomal and epigenetic states)
mutations.R (all required functions to partition SNVs, MNVs and indels)
processVcf.R (loads VCF files and creates the SNV count tensor and the MNV and indel count matrices; may need custom modification to run on your VCFs)
To obtain the SNV count tensor and the matrices containing other mutation types, execute processVcf.R and pass the VCF files you want to convert, as well as a name for an output hdf5 file, as command line arguments, e.g.
$ Rscript processVcf.R <vcf1.vcf> <vcf2.vcf> ... <vcfn.vcf> <output.h5>
In case of errors, please check whether you have correctly specified the paths in lines 6-8 of processVcf.R. Also, take a look at the readVcfSave function and adjust it if it fails.
Step 2: Computing trinucleotide normalisation
TensorSignatures requires a trinucleotide normalisation constant to account for differences in the nucleotide composition of genomic states. To compute it, invoke the prep subroutine of TensorSignatures and pass the hdf5 file from Step 1 as well as the path for the output file as positional arguments to the programme.
$ tensorsignatures prep <output.h5> <tsdata.h5>
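To give an intuition for why such a normalisation is needed, the toy sketch below counts trinucleotides in two made-up sequences and compares each region's composition with the pooled composition. This is only an illustration of the idea that genomic states differ in nucleotide composition; it is not the computation performed by the prep subroutine, and all names and sequences are hypothetical.

```python
from collections import Counter


def trinucleotide_counts(sequence):
    """Count overlapping trinucleotides in a DNA sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))


def relative_composition(region_seqs):
    """Per region: trinucleotide frequency divided by the pooled frequency."""
    counts = {name: trinucleotide_counts(seq) for name, seq in region_seqs.items()}
    pooled = Counter()
    for region_counts in counts.values():
        pooled.update(region_counts)
    pooled_total = sum(pooled.values())
    ratios = {}
    for name, region_counts in counts.items():
        total = sum(region_counts.values())
        # A ratio above 1 means the context is over-represented in this region.
        ratios[name] = {
            tri: (region_counts[tri] / total) / (pooled[tri] / pooled_total)
            for tri in region_counts
        }
    return ratios


# Made-up sequences standing in for two genomic states.
toy_regions = {"state_a": "ACGACGACGT", "state_b": "TTTTTTACGA"}
for state, values in relative_composition(toy_regions).items():
    print(state, {tri: round(ratio, 2) for tri, ratio in values.items()})
```

A region enriched for a given trinucleotide context offers more opportunities for mutations in that context, which is why raw counts from different states are not directly comparable.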
Step 3: Run TensorSignatures
There are two ways to run TensorSignatures: using the refit option, which fits the exposures of a set of pre-defined signatures extracted from the PCAWG cohort to your dataset, or using the train subroutine, which performs a de novo extraction of tensor signatures. Refitting tensor signatures is computationally fast but cannot discover new signatures, while extracting new signatures from scratch is computationally intensive (GPU required) and ideally requires a larger number of samples. For most use cases with a small number of samples, we advise using the refit option:
$ tensorsignatures --verbose refit tsData.h5 refit.pkl -n
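Steps 2 and 3 can be chained from a script. The sketch below only assembles the two command lines shown in this README; the helper names `prep_command` and `refit_command` are hypothetical and not part of the package, and the flags are copied verbatim from the examples above.

```python
import shlex


def prep_command(input_h5, output_h5):
    """Step 2: add the trinucleotide normalisation to the Step 1 output."""
    return ["tensorsignatures", "prep", input_h5, output_h5]


def refit_command(data_h5, output_pkl, verbose=True):
    """Step 3: refit the pre-defined signatures to the prepared dataset."""
    cmd = ["tensorsignatures"]
    if verbose:
        cmd.append("--verbose")
    # "-n" is copied from the README example; its meaning is not described here.
    cmd += ["refit", data_h5, output_pkl, "-n"]
    return cmd


pipeline = [
    prep_command("output.h5", "tsData.h5"),
    refit_command("tsData.h5", "refit.pkl"),
]
for cmd in pipeline:
    print(shlex.join(cmd))
```

Each list can be handed to `subprocess.run(cmd, check=True)` so that the refit only starts if the prep step succeeded.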
To run a de novo extraction, use
$ tensorsignatures --verbose train tsData.h5 denovo.pkl <rank> -k <size> -n -ep <epochs>
where rank specifies the decomposition rank, size controls the dispersion of the model, and epochs sets the number of epochs used to fit the model. TensorSignatures outputs the value of the objective function (log likelihood) that is minimised during training, as well as the change of the objective during an epoch interval (delta). When deciding on the number of epochs, ensure that it is large enough for the objective function to converge, i.e. the delta value is close to, or fluctuates around, zero. For more information on how to run TensorSignatures in a practical setting, see the documentation. Running TensorSignatures yields a pickle dump, which can subsequently be inspected using the tensorsignatures package.
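The convergence criterion can be made concrete with a simple heuristic: average the most recent delta values and check that the average is near zero. The sketch below assumes the delta values have been collected into a list by hand (for example, copied from the verbose output); the function name, window size and tolerance are hypothetical choices, not settings used by TensorSignatures.

```python
def has_converged(deltas, window=5, tolerance=1e-3):
    """Return True if the mean of the last `window` deltas is within `tolerance` of zero."""
    if len(deltas) < window:
        return False  # not enough history to judge
    recent = deltas[-window:]
    return abs(sum(recent) / window) <= tolerance


# Made-up delta values: large changes early on, then small fluctuations around zero.
example_deltas = [-5.0, -2.0, -1.0, 0.0004, -0.0003, 0.0002, -0.0001, 0.0001]
print(has_converged(example_deltas))
```

Averaging over a window tolerates small fluctuations in sign, but the tolerance should be chosen relative to the scale of the objective for the dataset at hand.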
Features
Run tensorsignatures on your dataset using the TensorSignature class provided by the package or via the command line tool.
Compute percentile-based bootstrap confidence intervals for inferred parameters.
Basic plotting tools to visualize tensor signatures and inferred parameters.
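For readers unfamiliar with percentile-based bootstrap confidence intervals, the following generic sketch shows the technique on a plain list of numbers using only the standard library. It does not use the tensorsignatures API, whose bootstrapping interface is not described here; the function name and defaults are hypothetical.

```python
import random
import statistics


def percentile_bootstrap_ci(values, statistic=statistics.mean,
                            n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for `statistic` of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Take the alpha/2 and 1 - alpha/2 percentiles of the bootstrap estimates.
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Made-up measurements, for illustration only.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
lower, upper = percentile_bootstrap_ci(sample)
print(f"95% CI for the mean: [{lower:.2f}, {upper:.2f}]")
```

The interval is read directly off the sorted bootstrap estimates, so no distributional assumption about the statistic is needed.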
Credits
Harald Vöhringer and Moritz Gerstung
History
0.5.0 (2020-08-25)
updated installation and quick start guide
added/updated tutorials to the documentation
added tutorial jupyter notebooks
added new docker image to install R pipeline required to format input data
Experiment class reads hdf5 files with “r” flag
Data class provides tensor factors directly via a, b and m attributes
0.4.1 (2019-07-29)
modified reshape of normalisation constant to enable tissue specific normalisations
0.4.0 (2019-11-25)
added subroutine prep which adds the normalization constant to a hdf5 input file of tensorsignatures
added subroutine refit which refits a set of predefined signatures to a new dataset
updated README.rst
fixed issue with package data
0.3.0 (2019-10-03)
various fixes
design changes
fixed setup.py
0.1.0 (2019-08-21)
First release on PyPI.