
SplitStrains detects and separates mixed strains of Mycobacterium tuberculosis.


SplitStrains

This repository introduces SplitStrains, a tool for detecting and separating mixed strains of M. tuberculosis.
It is grounded in a rigorous statistical framework that formulates two hypotheses for a given set of WGS reads: the reads belong to a single strain (null hypothesis), or they belong to a mixture of two or more strains (alternative hypothesis). The parameters of both hypotheses are fitted by maximum likelihood estimation (MLE), and the resulting likelihoods are compared to draw a conclusion. As a result, we obtain:

  1. A determination on whether the sample represents a simple or mixed infection
  2. A likelihood ratio for this determination (between the null and the alternative hypothesis)
  3. If mixed, the proportion of each constituent strain and its identity defined by its SNPs (single-nucleotide polymorphisms) relative to a reference genome.
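
The comparison of the two hypotheses can be illustrated with a toy likelihood ratio test. The sketch below is not the model implemented in SplitStrains. It assumes a deliberately simplified setting: each variant site contributes a minor-allele read count k out of depth n, the null hypothesis explains minor alleles by a fixed sequencing error rate, and the alternative fits a single minor-strain proportion by MLE. The function names and the error_rate default are hypothetical.

```python
import math

def binom_loglik(k, n, p):
    """Log of the Binomial(n, p) probability of observing k, in log space."""
    p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() away from 0 and 1
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def log_likelihood_ratio(sites, error_rate=0.01):
    """Compare 'single strain' against 'two-strain mixture' on (k, n) pairs.

    sites: list of (minor_allele_reads, depth) tuples (assumed input format).
    Returns (log-likelihood ratio, MLE of the minor-strain proportion).
    """
    total_k = sum(k for k, _ in sites)
    total_n = sum(n for _, n in sites)
    p_hat = total_k / total_n  # MLE of a shared binomial proportion
    ll_null = sum(binom_loglik(k, n, error_rate) for k, n in sites)
    ll_alt = sum(binom_loglik(k, n, p_hat) for k, n in sites)
    return ll_alt - ll_null, p_hat

# Illustrative counts only, not real data.
llr, proportion = log_likelihood_ratio([(30, 100), (28, 100), (33, 100)])
print(f"log-likelihood ratio: {llr:.1f}, minor proportion: {proportion:.3f}")
```

A large positive ratio favours the mixture hypothesis, while a ratio near zero means the error-only explanation fits about as well.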

Installation

SplitStrains is available on PyPI. It can be installed using the pip install splitstrains command.

Usage:

Run SplitStrains -h to view help.
gsc.sh is a master shell script that generates 50 mixed synthetic samples with different proportions, aligns them, and runs splitStrains.py.

Examples:

Examples can be found in the example folder of the GitHub repository. Several bash scripts show how SplitStrains can be used in a pipeline. The two main example scripts are gsc.sh and gsc_3_strains.sh. A Snakemake pipeline is also provided as an example. Examples need to be run from the related example folder.

Reusing results:

First run:
SplitStrains writes its results to stdout.

SplitStrains -g 2 -s 100 -e 4000000 -o output_dir -fd min_depth indexed_sorted.bam > result.txt

Second run:
After the first run, the cached data (freqVec.csv) can be reused with --reuse for faster analysis and parameter tuning.

SplitStrains --reuse -fe 0.7 -g 2 -s 100 -e 4000000 -o output_dir -fd min_depth indexed_sorted.bam > result.txt

The directory output_dir contains freqVec.csv and plots for visual inspection.
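
When the two runs are driven from a script, the same commands can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above. The helper names build_command and run_to_file are hypothetical, the parameters are named after the flags themselves (see SplitStrains -h for their meaning), and the min_depth value of 10 is a placeholder.

```python
import subprocess

def build_command(bam, output_dir, min_depth, reuse=False, fe=None,
                  g=2, s=100, e=4000000):
    """Assemble a SplitStrains argument list from the flags shown above."""
    cmd = ["SplitStrains"]
    if reuse:
        cmd.append("--reuse")          # reuse cached freqVec.csv in output_dir
    if fe is not None:
        cmd += ["-fe", str(fe)]
    cmd += ["-g", str(g), "-s", str(s), "-e", str(e),
            "-o", output_dir, "-fd", str(min_depth), bam]
    return cmd

def run_to_file(cmd, result_path):
    """Run the command and send stdout to a file, like '> result.txt'."""
    with open(result_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

# min_depth=10 is a placeholder value, not a recommendation.
first_run = build_command("indexed_sorted.bam", "output_dir", min_depth=10)
second_run = build_command("indexed_sorted.bam", "output_dir", min_depth=10,
                           reuse=True, fe=0.7)
print(" ".join(first_run))
print(" ".join(second_run))
# run_to_file(first_run, "result.txt")   # requires SplitStrains to be installed
```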

Notes:

Before running splitStrains.py, make sure that the sorted and indexed BAM file and the indexed FASTA reference use the same sequence ID. In other words, the BAM file must be used with the same reference that was used for alignment.

For example, if BAM aligned sequences refer to "gi|41353971|emb|AL123456.2| " then the fasta reference file should start with ">gi|41353971|emb|AL123456.2| ".
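
This check can be automated before a run. The sketch below is an illustration, not part of SplitStrains: it compares the SN: fields of the @SQ lines in a SAM/BAM header with the sequence IDs of a FASTA file, both passed in as text. The function names are hypothetical, and the header and FASTA snippets are illustrative only.

```python
def sam_reference_names(header_text):
    """Collect the SN: value of every @SQ line in a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            for field in line.split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return names

def fasta_sequence_ids(fasta_text):
    """Return the ID (first whitespace-delimited token) of each FASTA record."""
    return [line[1:].split()[0] for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]

def missing_from_fasta(header_text, fasta_text):
    """List BAM reference names that have no matching FASTA record."""
    fasta_ids = set(fasta_sequence_ids(fasta_text))
    return [name for name in sam_reference_names(header_text)
            if name not in fasta_ids]

# Illustrative inputs only.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:gi|41353971|emb|AL123456.2|\tLN:4411532\n"
fasta = ">gi|41353971|emb|AL123456.2| Mycobacterium tuberculosis H37Rv\nTTGACC\n"
print(missing_from_fasta(header, fasta))
```

The header text can be obtained with samtools view -H indexed_sorted.bam. An empty list means every BAM reference name has a matching FASTA record.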

Tips:

After the first run of SplitStrains, set --reuse to reuse the results for faster analysis.
Alternatively, set reuse=1 in gsc.sh.
Always check the generated plots, both for visual inspection and for parameter tuning.

Alignment guide:

bwa mem does not align M. tuberculosis reads well. It produces short genome regions with a high rate of false SNPs or an abnormally high number of variants. This can be observed in the scatter plots generated by splitStrains.py.
The best alignment results can be achieved using bwa aln. This workflow is implemented in runSplitStrains.sh, which is called from the master script gsc.sh.
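
For orientation, a typical paired-end bwa aln workflow can be sketched as a list of commands. This is not the content of runSplitStrains.sh, which may use different steps and options. The file names and the bwa_aln_commands helper are hypothetical.

```python
def bwa_aln_commands(reference, reads_1, reads_2, prefix):
    """Build the commands for a paired-end bwa aln alignment (assumed layout)."""
    sai_1, sai_2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam, bam = f"{prefix}.sam", f"{prefix}.sorted.bam"
    return [
        ["bwa", "index", reference],                     # index the reference for bwa
        ["samtools", "faidx", reference],                # indexed FASTA reference
        ["bwa", "aln", "-f", sai_1, reference, reads_1], # align each mate separately
        ["bwa", "aln", "-f", sai_2, reference, reads_2],
        ["bwa", "sampe", "-f", sam, reference,           # pair the mates into SAM
         sai_1, sai_2, reads_1, reads_2],
        ["samtools", "sort", "-o", bam, sam],            # coordinate-sorted BAM
        ["samtools", "index", bam],                      # BAM index
    ]

# Hypothetical file names.
commands = bwa_aln_commands("reference.fasta", "reads_1.fastq",
                            "reads_2.fastq", "sample")
for command in commands:
    print(" ".join(command))
```

Each entry can be passed to subprocess.run(command, check=True) on a machine where bwa and samtools are installed.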

TODO:

  1. Need to run a check on the gff file (version).
  2. Compute filtering depth based on the provided float value from 0.1 to 1. (important)
  3. Separate splitStrains.py code into files.
  4. Introduce the option of working with single-end reads

Possible bad behavior:

When depth of coverage is high (300 and greater), it is possible that the likelyhood_ratio_test function can overflow.
This error has not been reproduced so far.
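
A common way to keep likelihood computations numerically stable at high depth is to work in log space. The sketch below is a generic illustration, not taken from the SplitStrains source. Whether it applies to likelyhood_ratio_test is an assumption, and the function names are hypothetical.

```python
import math

def logsumexp(log_values):
    """Stable log(sum(exp(v))) that avoids underflow and overflow."""
    m = max(log_values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))

def mixture_log_likelihood(log_weights, component_logliks):
    """Log-likelihood of a mixture from per-component log-likelihoods."""
    return logsumexp([w + l for w, l in zip(log_weights, component_logliks)])

# At high depth, per-site log-likelihoods can be very negative:
# exponentiating them directly underflows to 0.0, and log(0.0) then fails.
naive = 0.5 * math.exp(-1000) + 0.5 * math.exp(-1001)
stable = mixture_log_likelihood([math.log(0.5), math.log(0.5)], [-1000, -1001])
print(naive, stable)
```

The naive sum collapses to zero, while the log-space version stays finite.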

Citations

SplitStrains original paper: Einar Gabbasov, Miguel Moreno-Molina, Iñaki Comas, Maxwell Libbrecht, Leonid Chindelevitch. SplitStrains, a tool to identify and separate mixed Mycobacterium tuberculosis infections from WGS data. Microbial Genomics. https://doi.org/10.1099/mgen.0.000607

