Python implementation of YAMB (Yet Another Metagenome Binner)

pyYAMB

pyYAMB is an implementation of YAMB (Yet Another Metagenome Binner) in Python (>=3.8). YAMB was originally described in the preprint https://www.biorxiv.org/content/10.1101/521286.abstract; its main idea is to use tSNE and HDBSCAN to process tetramer frequencies and coverage depth of metagenome fragments. pyYAMB strives to use parallel computing wherever possible.

pyYAMB data processing includes

  • contig filtering and fragmentation
  • read mapping with minimap2
  • processing of mapping files with pysam and coverage depth extraction with pycoverm
  • k-mer (by default tetramer) frequency calculation
  • dimensionality reduction with tSNE
  • data clustering with HDBSCAN
  • writing bins to FASTA
  • writing plots to PNG and SVG
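
Two of the steps above, contig fragmentation and tetramer frequency calculation, can be sketched in plain Python. This is a minimal illustration and not pyYAMB's actual code: the function names, the 10,000 bp fragment length, the rule for merging a short last fragment, and the decision to ignore k-mers with ambiguous bases are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def fragment_contig(seq, fragment_length=10000):
    """Split a contig into consecutive fragments (hypothetical helper).

    Assumption: a trailing piece shorter than half of fragment_length
    is merged into the previous fragment instead of being kept alone.
    """
    pieces = [seq[i:i + fragment_length]
              for i in range(0, len(seq), fragment_length)]
    if len(pieces) > 1 and len(pieces[-1]) < fragment_length / 2:
        tail = pieces.pop()
        pieces[-1] += tail
    return pieces


def kmer_frequencies(seq, k=4):
    """Return relative frequencies of all 4**k k-mers in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Only k-mers made of A, C, G, T are counted (assumption).
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]
```

Each fragment then yields a 256-value vector for k=4, which, together with coverage depth, is the kind of input that dimensionality reduction and clustering operate on.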

Possible features in the far future

  • read processing
  • metagenome assembly
  • bin QC

How to start

Installation

PyPI

pyYAMB is available at PyPI and may be installed with:

pip install pyYAMB

You also need to install dependencies (see below).

GitHub

Another way (not recommended) is to clone the repository with either

git clone https://github.com/laxeye/pyYAMB.git

or

gh repo clone laxeye/pyYAMB

and then run either

python setup.py install

or

pip install .

inside the cloned directory. This installs pyYAMB and the Python libraries it requires. Problems may appear with the hdbscan module and Cython. In that case, reinstall hdbscan using pip install hdbscan and run python setup.py install again.

Dependencies

If you installed pyYAMB from PyPI or GitHub, you need to install dependencies: minimap2 and samtools (e.g. using conda).

conda install -c bioconda minimap2 "samtools>=1.9"
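
Before running the pipeline it can help to confirm that both external tools are reachable. The sketch below uses only the Python standard library; the helper name missing_tools is hypothetical, and the check covers presence on PATH only, not the samtools version requirement.

```python
import shutil


def missing_tools(tools=("minimap2", "samtools")):
    """Return the names of required executables that are not on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("minimap2 and samtools were found on PATH")
```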

Conda

Currently only outdated versions of pyYAMB are available at Anaconda. They may be installed with all dependencies:

conda install -c laxeye pyyamb

or

mamba install -c laxeye pyyamb

Usage

The pyYAMB entry point is the all-in-one command pyyamb. It takes about two dozen arguments; their descriptions are shown by running pyyamb -h

You may start from a metagenome assembly and processed (quality-trimmed, etc.) reads, e.g.:

pyyamb --task all -1 Sample_1.R1.fastq.gz Sample_2.R1.fastq.gz -2 Sample_1.R2.fastq.gz Sample_2.R2.fastq.gz -i assembly.fasta -o results/will/be/here --threads 8
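
When many samples are involved, the same call can be assembled from Python. This is a minimal sketch that uses only the options shown in the command above; the helper name build_pyyamb_command is hypothetical, and the check that both read lists have the same length is an added assumption about paired-end input.

```python
import shlex


def build_pyyamb_command(forward, reverse, assembly, outdir, threads=8):
    """Build the argument list for an all-in-one pyyamb run (hypothetical helper)."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse read lists must have equal length")
    return [
        "pyyamb", "--task", "all",
        "-1", *forward,
        "-2", *reverse,
        "-i", assembly,
        "-o", outdir,
        "--threads", str(threads),
    ]


cmd = build_pyyamb_command(
    forward=["Sample_1.R1.fastq.gz", "Sample_2.R1.fastq.gz"],
    reverse=["Sample_1.R2.fastq.gz", "Sample_2.R2.fastq.gz"],
    assembly="assembly.fasta",
    outdir="results/will/be/here",
)
print(shlex.join(cmd))
# To launch the run, pass the list to subprocess.run(cmd, check=True).
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.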

After completion, bins can be found in the bins subfolder of the output folder. The "-1" bin collects unbinned sequences. A quality check of the resulting bins is strongly recommended; you may use CheckM or CheckM2.
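
The name of the "-1" bin presumably follows from HDBSCAN, which labels points it cannot assign to any cluster as -1. The sketch below shows how cluster labels could be turned into bins; the function name, the contig names and the labels are made-up illustration data, not pyYAMB output.

```python
from collections import defaultdict


def group_by_label(names, labels):
    """Map each cluster label to the sequence names assigned to it."""
    bins = defaultdict(list)
    for name, label in zip(names, labels):
        bins[label].append(name)
    return dict(bins)


names = ["contig_1", "contig_2", "contig_3", "contig_4"]
labels = [0, 1, 0, -1]  # toy labels; -1 marks noise in HDBSCAN output
bins = group_by_label(names, labels)
unbinned = bins.pop(-1, [])  # sequences that belong to no cluster
print(bins)
print(unbinned)
```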

Results and benchmarks

pyYAMB was tested on the low-complexity data set from the 1st CAMI challenge (simulated Illumina HiSeq data, small insert size).

  • Number of samples: 1
  • Total Size: 15 Gbp
  • Read length: 2x150 bp
  • Insert size mean: 270 bp
  • Insert size stddev: 27 bp

The run took 12 minutes and 17 seconds on an AMD Ryzen 3900X using 8 threads. Completeness and purity results are given below:

Property                      Value
Average completeness (bp)     94.4%
Average completeness (seq)    84.2%
Average purity (bp)           67.1%
Average purity (seq)          56.5%

Earlier, YAMB showed quality comparable to that of the CONCOCT binner (see the preprint for details).
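
The text does not state how these figures were computed. As a rough guide to reading them, the sketch below assumes the definitions commonly used in CAMI-style evaluations: each bin is assigned to the reference genome that contributes most of its base pairs, purity is that contribution divided by the bin size, and completeness is that contribution divided by the genome size. The function name and the numbers are hypothetical.

```python
def bin_purity_completeness(bin_bp_by_genome, genome_sizes):
    """Return (purity, completeness) of one bin, measured in base pairs.

    bin_bp_by_genome: bp in the bin that originate from each reference genome.
    genome_sizes: total bp of each reference genome.
    """
    # Assumption: the bin is mapped to the genome with the largest share.
    best_genome = max(bin_bp_by_genome, key=bin_bp_by_genome.get)
    true_positive_bp = bin_bp_by_genome[best_genome]
    purity = true_positive_bp / sum(bin_bp_by_genome.values())
    completeness = true_positive_bp / genome_sizes[best_genome]
    return purity, completeness


# Toy numbers, not taken from the benchmark above.
purity, completeness = bin_purity_completeness(
    {"genome_A": 900_000, "genome_B": 100_000},
    {"genome_A": 1_000_000, "genome_B": 2_000_000},
)
print(f"purity={purity:.2f} completeness={completeness:.2f}")
```

Under these assumptions the "(seq)" variants would count sequences instead of base pairs.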

References

Van Der Maaten, L. (2014). Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1), 3221-3245.

Campello, R. J., Moulavi, D., & Sander, J. (2013, April). Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining (pp. 160-172). Springer, Berlin, Heidelberg.

Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34(18), 3094-3100. https://dx.doi.org/10.1093/bioinformatics/bty191

Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P., & Tyson, G. W. (2015). CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome research, 25(7), 1043-1055. https://dx.doi.org/10.1101/gr.186072.114

Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., ... & De Hoon, M. J. (2009). Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11), 1422-1423. https://doi.org/10.1093/bioinformatics/btp163
