Within-sample CNV calling
Wisestork
Wisestork is a complete re-implementation of the original Wisecondor program. Wisecondor was originally designed to detect trisomies and smaller CNVs in maternal plasma samples using low-coverage WGS.
Wisestork adds practical support for small bin sizes, and is intended to be useful on regular WGS and exome sequencing data as well.
For a full overview of differences with the original Wisecondor, see section Differences.
Installation
PyPI
Install wisestork from PyPI with a simple:
pip install wisestork
Manually (development versions)
The following system dependencies are required:
- Python 3.5+
Furthermore, the following Python packages are required:
- numpy
- matplotlib
- biopython
- statsmodels
- scikit-learn (imported as sklearn)
- pysam
- pyfaidx
- click
It is recommended you use a virtualenv.
To install wisestork, create a virtualenv, install the Python requirements using pip install -r requirements.txt, and then run:
python setup.py develop
Input
Wisestork takes BAM files as input. These BAM files must be indexed.
Additionally, you must provide a reference Fasta file, which should likewise be indexed with samtools faidx <fasta>.
Running
A typical workflow starts with BAM files. Those BAM files must be sorted and indexed.
The first step in a Wisestork analysis is the count step. This generates read counts per bin and writes them to a BED file. The command looks like the following:
wisestork count -I <input.bam> -R <fasta.fa> -O <out.bed> -B <binsize>
The -B flag can be left out: Wisestork defaults to a bin size of 50 kb. However, you will likely want a different bin size.
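To make the idea of the count step concrete, here is a minimal sketch of fixed-size binning on a single contig. It is not Wisestork's implementation: the function name count_reads_per_bin and the input (a plain list of 0-based read start positions) are assumptions made for illustration, and a real run would read positions from the BAM file instead.

```python
from collections import Counter


def count_reads_per_bin(read_starts, contig, contig_length, binsize=50000):
    """Count reads per fixed-size bin and return BED-like rows.

    read_starts: iterable of 0-based read start positions on `contig`
    (hypothetical input; a real tool would take these from a BAM file).
    Returns a list of (contig, start, end, count) tuples, one per bin.
    """
    # Integer division maps every position to the index of its bin.
    counts = Counter(pos // binsize for pos in read_starts)
    rows = []
    for start in range(0, contig_length, binsize):
        end = min(start + binsize, contig_length)  # the last bin may be shorter
        rows.append((contig, start, end, counts.get(start // binsize, 0)))
    return rows


rows = count_reads_per_bin([10, 49999, 50000, 120000], "chr1", 130000)
for row in rows:
    # Tab-separated output, as in a BED file with one extra count column.
    print("\t".join(map(str, row)))
```

With the default bin size of 50 kb, the 130 kb example contig yields three bins, the last of which is only 30 kb long.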
Once you have the count BED file, you have to correct it for GC bias. The command to do this is:
wisestork gc-correct -I <input.bed> -R <fasta.fa> -O <out.gc.bed> -B <binsize>
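The following sketch illustrates what a GC correction does in principle. Note that it deliberately swaps in a much simpler technique than the one Wisestork uses: instead of a LOWESS fit, it scales each bin by the median count of bins with the same (rounded) GC fraction. The function names gc_fraction and gc_correct and the rounding to two decimals are assumptions for illustration only.

```python
from statistics import median


def gc_fraction(sequence):
    """Fraction of G/C bases among the unambiguous (A/C/G/T) bases."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def gc_correct(counts, gc_values, digits=2):
    """Scale each count by (global median / median of its GC stratum).

    A simplified stand-in for a LOWESS-based correction: bins are grouped
    by GC fraction rounded to `digits` decimals.
    """
    strata = {}
    for count, gc in zip(counts, gc_values):
        strata.setdefault(round(gc, digits), []).append(count)
    global_median = median(counts)
    corrected = []
    for count, gc in zip(counts, gc_values):
        stratum_median = median(strata[round(gc, digits)])
        if stratum_median:
            corrected.append(count * global_median / stratum_median)
        else:
            corrected.append(0.0)  # no usable signal in this GC stratum
    return corrected


counts = [100, 110, 200, 220]
gc_values = [0.40, 0.40, 0.60, 0.60]
corrected = gc_correct(counts, gc_values)
print(corrected)
```

In the example, the high-GC bins have twice the raw coverage of the low-GC bins; after correction, corresponding bins in both strata end up with the same value.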
For the next step, the result must be bgzipped and tabix-indexed, so you'll have to execute:
bgzip <out.gc.bed> && tabix -pbed <out.gc.bed.gz>
The last step, the zscore step, calculates Z-scores for each bin. It requires you to have generated a reference dictionary beforehand. The command to create Z-scores looks similar to the earlier two:
wisestork zscore -I <input.bed.gz> -R <fasta.fa> -O <out.z.bed> -D <dictionary.bed.gz> -B <binsize>
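As a reminder of what a per-bin Z-score expresses, here is a minimal sketch that compares one query bin with the values of its reference bins. The function name bin_zscore and the use of the sample standard deviation are assumptions; Wisestork's exact statistics may differ.

```python
from statistics import mean, stdev


def bin_zscore(query_value, reference_values):
    """Z-score of one bin against the values of its reference bins.

    Assumes the sample standard deviation; returns NaN when the
    reference bins show no spread at all.
    """
    if len(reference_values) < 2:
        raise ValueError("need at least two reference bins")
    spread = stdev(reference_values)
    if spread == 0:
        return float("nan")
    return (query_value - mean(reference_values)) / spread


z = bin_zscore(106, [98, 100, 102])
print(z)
```

Here the reference bins have mean 100 and standard deviation 2, so a query value of 106 lies three standard deviations above the reference.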
User-supplied bins
Instead of supplying a bin size for each step, you may also supply a bin file. This file should be a (preferably sorted) BED file with regions that exist in the input BAM file. This option is primarily useful for WES analyses, where the bin file would correspond to a target/bait region file. Please note that contig names must be identical to those in the input BAM file.
You can supply a bin file using the -L flag for any subcommand. This will supersede any usage of the -B flag.
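Before passing a bin file to Wisestork, it can help to check it against the requirements above. The sketch below parses BED lines into sorted regions and reports contigs that do not occur in a given list of BAM contig names. The helper names read_bins and unknown_contigs are hypothetical, and the contig list is assumed to have been obtained separately (for example from the BAM header).

```python
def read_bins(lines):
    """Parse BED lines into (contig, start, end) tuples.

    The result is sorted lexicographically by contig name, then by start.
    """
    bins = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and the usual BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        contig, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        if end <= start:
            raise ValueError("empty or inverted region: " + line)
        bins.append((contig, start, end))
    return sorted(bins)


def unknown_contigs(bins, bam_contigs):
    """Return contig names used in the bins but absent from the BAM."""
    return sorted({contig for contig, _, _ in bins} - set(bam_contigs))


bed_lines = ["# targets", "chr2\t500\t900", "chr1\t100\t250\ttargetA"]
bins = read_bins(bed_lines)
print(bins)
print(unknown_contigs(bins, ["chr1"]))
```

A non-empty result from unknown_contigs indicates a naming mismatch (for example chr1 versus 1) that should be resolved before running any subcommand with -L.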
Creating reference dictionaries
The above assumes you have already created a reference dictionary. If this is not the case, you will have to generate this file.
To create the reference dictionary you will need a set of GC-corrected BED files (from wisestork gc-correct) of normal samples, which you feed to wisestork newref. The newref command then finds the nearest neighbours of every bin. Later on, in the zscore command, this information is used to get a set of "reference bins" from the query sample.
Command to be used:
wisestork newref -I <input.gz.bed> -I <input2.gz.bed> [...] -O <out.ref.bed> -R <fasta.fa> -B <binsize>
The output of this must be sorted with bedtools, and then bgzipped and tabixed.
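When many normal samples are involved, it may be convenient to assemble the newref command programmatically. The sketch below only builds the argument list, using the flags described in this document (-I, -O, -R, -B and -L); the helper name newref_command and the file names are hypothetical, and nothing is executed.

```python
import shlex


def newref_command(inputs, output, fasta, binsize=None, bin_file=None):
    """Build the argument list for `wisestork newref`.

    Each input file gets its own -I flag. A bin file (-L) takes
    precedence over a bin size (-B), mirroring the documented behaviour.
    """
    if not inputs:
        raise ValueError("at least one input file is required")
    args = ["wisestork", "newref"]
    for path in inputs:
        args += ["-I", path]
    args += ["-O", output, "-R", fasta]
    if bin_file is not None:
        args += ["-L", bin_file]
    elif binsize is not None:
        args += ["-B", str(binsize)]
    return args


# Placeholder file names, for illustration only.
args = newref_command(["n1.gc.bed.gz", "n2.gc.bed.gz"], "ref.bed", "ref.fa", binsize=5000)
print(" ".join(shlex.quote(arg) for arg in args))
# The list could then be passed to subprocess.run(args, check=True).
```

Passing a list rather than a single shell string avoids quoting problems with unusual file names.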
Usage
Usage: wisestork [OPTIONS] COMMAND [ARGS]...
Discover CNVs from BAM files.
A typical workflow first extracts regions from a BAM file
The resulting BED tracks must then be GC-corrected.
Using a reference track of region similarity,
One can then calculate Z-scores for every region.
The following sub-commands are supported:
- count: count coverage per bin
- gc-correct: GC-correct bins
- zscore: calculate Z-scores
- newref: Generate a new reference dictionary of bin similarities
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
count Count coverages
gc-correct GC correct
newref Create new reference
zscore Calculate Z-scores
You can get additional help by typing wisestork <command> --help
Differences
There are several important differences between this re-implementation and the original wisecondor.
- This re-implementation is organized as a regular python package, while exposing several command-line tools.
- Python 3 support. In fact, it's only tested on python 3.
- All command-line tools now have UNIX-style argument parsing.
- Generating reference sets for small bin sizes is now possible in much less time.
- Pickle files are no longer used. The output format is now regular BED, with a possible additional column. This means results can be used by common downstream tools like Bedtools.
- User-supplied bin files in regular BED format are supported.
- The countgc step is now redundant. Its functionality is now integrated in the gcc step.
- The reference bin selection method was modified. The original wisecondor calculated differences for every bin against every bin of every sample, and then repeated this calculation for every chromosome. As the number of comparisons grows quadratically with the number of bins, this made reference bin selection prohibitively slow and memory-consuming for smaller bin sizes. Instead of calculating pairwise differences, the new method applies an aggregate function (e.g. the median) over the same bins of all samples, and then sorts the resulting list of bins. Similar bins can then be selected using regular list slicing. This means the time complexity of creating a new reference set is now just log-linear, O(n log n), in the number of bins. Additional filtering steps were left unchanged.
- Use of the statsmodels lowess function, rather than biopython's. This results in a significant speed-up of the GC correction.
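The sort-and-slice idea behind the modified reference bin selection can be sketched as follows. This is an illustration under stated assumptions, not Wisestork's code: the function name reference_neighbours, the in-memory list-of-lists input, and the symmetric window of n_neighbours ranks on either side are all choices made for the example.

```python
from statistics import median


def reference_neighbours(samples, n_neighbours=2):
    """Select similar bins by sorting per-bin medians and slicing.

    samples: list of equal-length lists with per-bin values, one per sample.
    Returns a dict mapping each bin index to the indices of the bins that
    are closest to it in the sorted order of medians.
    """
    n_bins = len(samples[0])
    # Aggregate the same bin over all samples.
    medians = [median(sample[i] for sample in samples) for i in range(n_bins)]
    # One sort, O(n log n), replaces all pairwise comparisons.
    order = sorted(range(n_bins), key=medians.__getitem__)
    neighbours = {}
    for rank, bin_index in enumerate(order):
        lo = max(0, rank - n_neighbours)
        hi = min(n_bins, rank + n_neighbours + 1)
        # Neighbours are a plain slice of the sorted order, minus the bin itself.
        neighbours[bin_index] = [b for b in order[lo:hi] if b != bin_index]
    return neighbours


samples = [[10, 50, 12, 48], [12, 52, 10, 50]]
neighbours = reference_neighbours(samples, n_neighbours=1)
print(neighbours)
```

In this sketch, bins at either end of the sorted order receive fewer neighbours because the slice is clipped at the list boundaries.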
Naming
Why name this tool wisestork, you might ask? Well, a condor is a bird. As this is a re-implementation / fork of wisecondor, I figured another bird would be a nice name. As I live in The Hague, and The Hague has a stork as its city symbol, I put two and two together. Thus, wisestork was born.
License
GPLv3