A top-up tool to enhance SNV calling from Nanopore sequencing data.

Project description

Improving SNV detection from low coverage nanopore sequencing data (<30x)

Installation
- Using pip
- From source
SNVoter Modules
Tutorial
Example

Installation

NOTE: Before installing SNVoter, install the dependencies listed in environment.yaml. SNVoter pins several of its dependencies to fixed versions in that file, so we encourage users to use a conda or similar environment to isolate these packages from their default Python installation. Once the environment is activated, install SNVoter using pip, or clone the git repo and use it from source.
You can create the conda environment and install all dependencies by downloading the environment.yaml file and running these commands:

conda env create -f environment.yaml
conda activate snvoter

Now you can install SNVoter using pip or use it from source in the dedicated environment with all dependencies installed.

Using pip

pip install snvoter

From source

git clone https://github.com/vahidAK/SNVoter.git
cd SNVoter
./snvoter.py

SNVoter Modules

prediction:

To predict whether detected SNVs are true calls or false positives.

usage: snvoter prediction [-h] --input INPUT --bam BAM --reference REFERENCE
                          --output OUTPUT [--model_file MODEL_FILE]
                          [--mappingQuality MAPPINGQUALITY] [--depth DEPTH]
                          [--window_bam WINDOW_BAM]
                          [--threads THREADS] [--chunk_size CHUNK_SIZE]

Predict based on a model.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT, -i INPUT
                        The path to the input vcf or bed file. NOTE. Files
                        must end with .bed or .vcf. vcf files are 1-based and
                        beds are zero-based
  --bam BAM, -b BAM     The path to the alignment bam file
  --reference REFERENCE, -r REFERENCE
                        The path to the reference file. File must be indexed
                        by samtools faidx.
  --output OUTPUT, -o OUTPUT
                        The path to the output directory and prefix for output
                        file.

optional arguments:
  --model_file MODEL_FILE, -mf MODEL_FILE
                        Path to the trained model. Default is
                        NA12878_20FC_model.h5
  --mappingQuality MAPPINGQUALITY, -mq MAPPINGQUALITY
                        Cutoff for filtering out low quality mapped reads
                        from bam. Default is 0
  --depth DEPTH, -d DEPTH
                        Cutoff for filtering out regions with low depth to
                        have frequencies. Default >= 1
  --window_bam WINDOW_BAM, -w WINDOW_BAM
                        If you want to process only a region or chromosome,
                        give the region like this: chr1 or chr1:1000-100000.
  --threads THREADS, -t THREADS
                        Number of threads. Default is 4.
  --chunk_size CHUNK_SIZE, -cs CHUNK_SIZE
                        Chunk size. Default is 100.
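
The --input note above distinguishes 1-based vcf positions from zero-based bed positions. As a minimal sketch of what that means for a single-base SNV, the two conventions convert as shown below. The function names are illustrative and are not part of SNVoter; the bed interval is assumed to be half-open, as in the standard bed convention.

```python
def vcf_pos_to_bed(chrom, pos):
    """Convert a 1-based VCF position to a 0-based, half-open BED interval."""
    if pos < 1:
        raise ValueError("VCF positions start at 1")
    return chrom, pos - 1, pos


def bed_to_vcf_pos(chrom, start, end):
    """Convert a single-base BED interval back to a 1-based VCF position."""
    if end - start != 1:
        raise ValueError("expected a single-base interval")
    return chrom, start + 1


print(vcf_pos_to_bed("chr1", 1000))       # ('chr1', 999, 1000)
print(bed_to_vcf_pos("chr1", 999, 1000))  # ('chr1', 1000)
```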

extraction:

Extract features to train a new model.

usage: snvoter extraction [-h] --input INPUT --status STATUS --bam BAM
                          --reference REFERENCE
                          [--mappingQuality MAPPINGQUALITY] [--depth DEPTH]
                          [--window_bam WINDOW_BAM]
                          [--threads THREADS] [--chunk_size CHUNK_SIZE]

Extract mutation frequencies in 5-mer windows.

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --input INPUT, -i INPUT
                        The path to the input vcf file.
  --status STATUS, -s STATUS
                        0 or 1. If you are extracting frequencies to train a
                        model, give the status of your vcf file: true calls
                        (1) or false calls (0).
  --bam BAM, -b BAM     The path to the alignment bam file
  --reference REFERENCE, -r REFERENCE
                        The path to the reference file. File must be indexed
                        by samtools faidx

optional arguments:
  --mappingQuality MAPPINGQUALITY, -mq MAPPINGQUALITY
                        Cutoff for filtering out low quality mapped reads
                        from bam. Default is 0
  --depth DEPTH, -d DEPTH
                        Cutoff for filtering out regions with low depth to
                        have frequencies. Default >= 1
  --window_bam WINDOW_BAM, -w WINDOW_BAM
                        If you want to process only a region or chromosome,
                        give the region like this: chr1 or chr1:1000-100000.
  --threads THREADS, -t THREADS
                        Number of threads
  --chunk_size CHUNK_SIZE, -cs CHUNK_SIZE
                        Number of sites sent to each process for parallel
                        processing. Default is 50.
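
The --chunk_size option sets how many sites are handed to each worker process at a time. The sketch below only illustrates that batching idea; it is not SNVoter's actual code, and the helper name chunked is an assumption made for this example.

```python
from itertools import islice


def chunked(sites, chunk_size):
    """Yield successive lists holding at most chunk_size sites each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(sites)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 120 illustrative sites split with the extraction default of 50 per chunk.
sites = [("chr1", pos) for pos in range(1, 121)]
print([len(chunk) for chunk in chunked(sites, 50)])  # [50, 50, 20]
```

Each chunk would then be passed to one worker. Larger chunks mean fewer hand-offs between processes, while smaller chunks spread the work more evenly.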

train:

To train a new model using extracted features.

usage: snvoter train [-h] --train TRAIN --test TEST --out_dir OUT_DIR
                     [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--plot]

train a new model

optional arguments:
  -h, --help            show this help message and exit

required arguments:
  --train TRAIN, -tr TRAIN
                        The path to the shuffled and ready file for training
  --test TEST, -te TEST
                        The path to the shuffled and ready file for testing.
  --out_dir OUT_DIR, -o OUT_DIR
                        Output directory and prefix for saving the model and
                        figures

optional arguments:
  --epochs EPOCHS, -e EPOCHS
                        Number of epochs. Default is 100
  --batch_size BATCH_SIZE, -batch BATCH_SIZE
                        batch size for model training. Default is 400.
  --plot, -plt          Select this option if you wish to output training
                        plots.

Tutorial

Variant Calling:

You first need to call variants using Clair.

You can call variants for each chromosome using the following command and then concatenate all the files:

for i in chr{1..22} chrX chrY; do callVarBam --chkpnt_fn <path to model file> --ref_fn <reference_genome.fa> --bam_fn <sorted_indexed.bam> --ctgName "$i" --sampleName <your sample name> --call_fn "$i.vcf" --threshold 0.2 --samtools <path to executable samtools software> --pypy <path to executable pypy> --threads <number of threads>; done

For the full tutorial please refer to Clair page on GitHub.
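
One possible way to concatenate the per-chromosome vcf files is sketched below. It assumes the files are uncompressed and share the same header, so header lines are kept from the first file only. The function name concat_vcfs is illustrative and is not part of Clair or SNVoter.

```python
def concat_vcfs(vcf_paths, out_path):
    """Concatenate VCF files, keeping header lines from the first file only."""
    with open(out_path, "w") as out:
        for index, path in enumerate(vcf_paths):
            with open(path) as handle:
                for line in handle:
                    # Header lines start with '#'; skip them after file one.
                    if index > 0 and line.startswith("#"):
                        continue
                    out.write(line)


# Example call, assuming chr1.vcf ... chrY.vcf exist in the working directory:
# chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
# concat_vcfs([f"{c}.vcf" for c in chroms], "SNVs_Clair.vcf")
```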

Improving SNV calling using SNVoter:

snvoter prediction -i <SNVs_Clair.vcf> -b <sorted_indexed.bam> -r <reference_genome.fa> -t <number_of_threads> -o <output_prefix>

It will produce two files.

1- A prediction file that includes a prediction for each 5-mer. The first 10 columns are from the vcf file and the last seven columns indicate:

chrom: the chromosome name
pos_start: 0-based position of the 5-mer start
pos_end: 0-based position of the 5-mer end
pos: 0-based position of the SNV
5-mer sequence: sequence of five-mer
Coverage: this might be different from Clair's coverage as SNVoter uses different mapping quality threshold
Prediction

2- The second file is the final vcf file with weighted qualities. You can plot the distribution of the weighted qualities to find the optimal threshold for filtering. The distribution usually looks like the following plots:

Quality distribution of 10x coverage data

Quality distribution of 18x coverage data

Quality distribution of 22x coverage data

The optimal threshold lies at the end of the first peak and the start of the valley (the highlighted regions in the plots).
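
As a rough aid for locating that point, the sketch below builds a histogram of quality values and walks from the first peak down to where the counts stop falling. It assumes the weighted quality is stored in the QUAL column (the sixth column) of an uncompressed vcf file, which is an assumption about SNVoter's output. The function names are illustrative. Real distributions are noisy, so treat the result as a starting point and confirm it against a plot.

```python
from collections import Counter


def read_quals(vcf_path):
    """Collect QUAL values (6th VCF column) from an uncompressed VCF file."""
    quals = []
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 5 and fields[5] != ".":
                quals.append(float(fields[5]))
    return quals


def histogram(values, bin_width=1.0):
    """Return (bin_start, count) pairs covering the full range of values."""
    counts = Counter(int(value // bin_width) for value in values)
    if not counts:
        return []
    return [(b * bin_width, counts.get(b, 0))
            for b in range(min(counts), max(counts) + 1)]


def first_valley_start(hist):
    """Climb to the first peak, then descend until counts stop falling."""
    if not hist:
        return None
    counts = [count for _, count in hist]
    i = 0
    while i + 1 < len(counts) and counts[i + 1] >= counts[i]:
        i += 1  # still rising towards the first peak
    while i + 1 < len(counts) and counts[i + 1] < counts[i]:
        i += 1  # still falling after the first peak
    return hist[i][0]
```

A call such as first_valley_start(histogram(read_quals("output.vcf"), bin_width=1.0)) returns a candidate threshold, where output.vcf is a placeholder for the vcf file produced by the prediction module. A wider bin_width smooths noisy counts at the cost of resolution.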

By default, SNVoter uses the model we trained on NA12878 data from 20 flow cells, so you only need to specify a model path if you want to use a different model.

Train a New Model:

To train a new model you need two vcf files: one containing true SNVs and the other containing false positive SNV calls. With these files you can extract the features using the snvoter extraction module and then train a new model on the extracted features using the snvoter train module.

Extracting Features:

snvoter extraction -i True_SNVs.vcf -b alignment.bam -r reference.fa -s 1 -t 24 > Extracted_Features.csv
snvoter extraction -i False_SNVs.vcf -b alignment.bam -r reference.fa -s 0 -t 24 >> Extracted_Features.csv

After extracting the features you need to shuffle the file.

cat Extracted_Features.csv | shuf | shuf | shuf > Shuffled_Extracted_Features.csv

Subsequently, separate the shuffled file into training and test sets as you wish. We recommend using at least 10% of the shuffled file as the test set and the rest as the training set.
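
One simple way to make this split is sketched below. Because the file is already shuffled, it takes the first fraction of lines as the test set and the rest as the training set. It assumes the file has no header line and fits in memory. The function name split_train_test is illustrative and is not part of SNVoter.

```python
def split_train_test(shuffled_path, train_path, test_path, test_fraction=0.1):
    """Write the first test_fraction of lines to test_path, the rest to train_path."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    with open(shuffled_path) as handle:
        lines = handle.readlines()
    # Keep at least one line in the test set.
    n_test = max(1, round(len(lines) * test_fraction))
    with open(test_path, "w") as test_out:
        test_out.writelines(lines[:n_test])
    with open(train_path, "w") as train_out:
        train_out.writelines(lines[n_test:])
    return len(lines) - n_test, n_test


# Example call with the file names used in this tutorial:
# split_train_test("Shuffled_Extracted_Features.csv",
#                  "training_set.csv", "test_set.csv", test_fraction=0.1)
```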

Training the Model

snvoter train -tr training_set.csv -te test_set.csv -o ./Trained_model --plot

Training will produce a .h5 file and a .h5.pkl file. If the --plot option is selected, it will also output accuracy, precision, recall, loss, and ROC plots. To use this model with the snvoter prediction module, keep the .h5 and .h5.pkl files in the same directory and give the path to the .h5 file using the --model_file flag.
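
Before running a prediction with a custom model, you can check that both files are in place. The sketch below assumes the .h5.pkl file has the same name as the .h5 file with .pkl appended, which is how the two extensions above read but is not confirmed here. The function name check_model_files is illustrative.

```python
import os


def check_model_files(model_path):
    """Return the companion .pkl path if both model files exist, else raise."""
    # Assumption: <model>.h5 is paired with <model>.h5.pkl in the same directory.
    pkl_path = model_path + ".pkl"
    missing = [p for p in (model_path, pkl_path) if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError("missing model file(s): " + ", ".join(missing))
    return pkl_path
```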

Example

We have included example data in the Example_data folder, which you can use for a quick prediction.

Release history

- 1.0 (this version): Jan 26, 2021
- 0.1.2: Aug 19, 2020
- 0.1.1: Aug 19, 2020
- 0.0 (yanked): Aug 18, 2020. Reason this release was yanked: Tensorflow dependency unmet in python3.9

Download files

Source Distribution

snvoter-1.0.tar.gz (2.1 MB)

Uploaded Jan 26, 2021 Source

Built Distribution

snvoter-1.0-py3-none-any.whl (2.0 MB)

Uploaded Jan 26, 2021 Python 3

File details

Details for the file snvoter-1.0.tar.gz.

File metadata

Download URL: snvoter-1.0.tar.gz
Upload date: Jan 26, 2021
Size: 2.1 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.5

File hashes

Hashes for snvoter-1.0.tar.gz
Algorithm	Hash digest
SHA256	`5ec349c6d5136090601d5ff9bf2d939ab8b8265163418eced2c9f32fd4596bb3`
MD5	`b6b3de0aabd629cc84afe4cb58aafe9e`
BLAKE2b-256	`959401bb0ce05cb78838e2a47f5deadec08bf357e90d94b52db6a19cc0767b3c`
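
To check a downloaded file against the SHA256 digest listed above, a minimal sketch using Python's standard hashlib module follows. The function name sha256_of_file is illustrative, and the example assumes the archive sits in the working directory.

```python
import hashlib


def sha256_of_file(path, block_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Digest listed above for snvoter-1.0.tar.gz.
EXPECTED = "5ec349c6d5136090601d5ff9bf2d939ab8b8265163418eced2c9f32fd4596bb3"

# Example check, assuming the archive has been downloaded:
# assert sha256_of_file("snvoter-1.0.tar.gz") == EXPECTED
```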

File details

Details for the file snvoter-1.0-py3-none-any.whl.

File metadata

Download URL: snvoter-1.0-py3-none-any.whl
Upload date: Jan 26, 2021
Size: 2.0 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.4.2 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.8.0 tqdm/4.30.0 CPython/3.8.5

File hashes

Hashes for snvoter-1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`ba0e63840c3ec83a7701eea24280bf09c23f08b01b9bea8d376dcd1b58edd51a`
MD5	`38afb91a5a50ac30ee5f8ec850ab8af6`
BLAKE2b-256	`be8361b2a3ca6eddc012bc361624a10c0b7416db726d7f9d0eda42e2bbd59e60`
