Multiple sequence alignment analysis with Affinity Tree generation

PangtreeBuild

This repository contains a tool for multiple sequence alignment analysis. It implements the idea of a pan-genome (Ref. 1) by representing the multialignment as a PO-MSA structure (Partial Order Alignment Graph, Ref. 2). The main purpose of this software is to construct an Affinity Tree: a phylogenetic-like tree with an agreed (consensus) sequence assigned to each node. The result is saved in a JSON file (see its schema in pangtree/pangtreebuild/serialization/affinity_tree_schema.json). Its content can be visualised using PangtreeVis.

This software accompanies the article: P. Dziadkiewicz, N. Dojer, 'Getting insight into the pan-genome structure with Pangtree', which will be published soon in BMC Genomics.

Getting Started

Prerequisites

Running:

Testing:

Installing

pip install pangtreebuild

Quick installation check

The following command builds a pan-genome model for an example alignment of 160 Ebola virus sequences and saves it to a JSON file.

python3 -m pangtreebuild --multialignment example_data/Ebola/multialignment.maf

Usage

  1. Import package pangtreebuild to your Python program and use it according to the documentation.

or

  2. Use pangtreebuild via the command line with the following arguments:

python3 -m pangtreebuild [args]

Name | CLI | Required | Description
Arguments affecting PO-MSA construction:
MULTIALIGNMENT | --multialignment | Yes | Path to the multialignment file (.maf or .po).
METADATA | --metadata | No | Optional information about sequences in CSV format. The only required column is 'seqid'; its values must match the multialignment file identifiers as described in Sequence Naming Convention (below). Example: example_data/Ebola/metadata.csv
RAW_MAF | --raw_maf | No, default=False | Build the PO-MSA without transforming the multialignment (MAF file) to a DAG. A PO-MSA built this way does not reflect real-life sequences.
FASTA_PROVIDER | --fasta_provider | No | Source of nucleotides if any residues are missing from the multialignment file. Possible values: 'ncbi', 'file'. If not specified, MISSING_SYMBOL is used.
MISSING_SYMBOL | --missing_symbol | No, default='?' | Symbol for missing nucleotides, used if no FASTA_PROVIDER is given.
CACHE | --cache | No | If set, sequences downloaded from NCBI are stored on the local disk and reused between program calls. Used if FASTA_PROVIDER is 'ncbi'.
FASTA_PATH | --fasta_path | Yes if FASTA_PROVIDER='file' | Path to a FASTA file or zipped FASTA files with the whole sequences present in the multialignment. Used if FASTA_PROVIDER is 'file'.
Arguments affecting Affinity Tree construction:
AFFINITY | --affinity | No | Possible values: 'TREE' (default algorithm, described in Documentation.md), 'POA' (simplified version, based solely on Ref. 2).
BLOSUM | --blosum | No, default=bin\blosum80.mat | Path to the BLOSUM file. The BLOSUM file must include MISSING_SYMBOL.
HBMIN | --hbmin | No, default=0.9 | 'POA' parameter. The minimum value of sequence compatibility to the generated consensus.
STOP | --stop | No, default=0.99 | 'TREE' parameter. Minimum value of compatibility in tree leaves.
P | -p | No, default=1 | 'TREE' parameter. Compatibilities are raised to the power of P, which changes their linear meaning during cutoff finding. For 0 < P < 1 it increases distances between small compatibilities and decreases distances between the larger ones. For P > 1 it decreases distances between small compatibilities and increases distances between the larger ones.
Arguments affecting output generation:
OUTPUT_DIR | --output_dir, -o | No, default=timestamped folder in current working directory | Output directory path.
OUTPUT_FULL | --output_full | No, default=False | Set if the list of pangenome nodes for sequences and consensuses should be included in pangenome.json.
VERBOSE | --verbose, -v | No, default=False | Set if detailed log files must be produced.
QUIET | --quiet, -q | No, default=False | Set to turn off console logging.
FASTA | --output_fasta | No, default=False | Set to create FASTA files with consensuses.
PO | --output_po | No, default=False | Set to create a .po file with the multialignment (without consensuses).
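
The effect of the P parameter can be checked numerically. The sketch below is not taken from pangtreebuild; it assumes that the transformation is a plain x**P applied to compatibilities in the range [0, 1], and the helper name gaps_after_power and the sample values are made up for illustration.

```python
def gaps_after_power(compatibilities, p):
    """Raise each compatibility to the power p and return the gaps
    between neighbouring values after sorting (rounded for display)."""
    powered = sorted(c ** p for c in compatibilities)
    return [round(b - a, 4) for a, b in zip(powered, powered[1:])]


# Hypothetical compatibilities: two small values and two large ones.
values = [0.01, 0.04, 0.81, 1.0]

for p in (0.5, 1, 2):
    # First gap: between the two small values; last gap: between the two large ones.
    print(p, gaps_after_power(values, p))
```

With these inputs the gap between the two small values grows from 0.03 (P = 1) to 0.1 for P = 0.5 and shrinks to 0.0015 for P = 2, while the gap between the two large values moves the other way (0.19, 0.1 and 0.3439 respectively).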

Sequence Naming Convention

[seqid].[anything after first dot is ignored]
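
The convention above, together with the requirement that the metadata 'seqid' column matches the multialignment identifiers, can be sketched as follows. This is an illustrative check, not code from pangtreebuild: the function names and the sample sequence names are hypothetical, and only the Python standard library is used.

```python
import csv
import io


def seqid_from_name(name):
    """Return the part of a sequence name before the first dot."""
    return name.split(".", 1)[0]


def missing_metadata_seqids(alignment_names, metadata_csv_text):
    """Return seqids that occur in the alignment but not in the metadata.

    The metadata is expected to have a 'seqid' column, as described above.
    """
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    if reader.fieldnames is None or "seqid" not in reader.fieldnames:
        raise ValueError("metadata must contain a 'seqid' column")
    known = {row["seqid"] for row in reader}
    return sorted({seqid_from_name(n) for n in alignment_names} - known)


# Hypothetical sequence names and metadata content.
names = ["seqA.chr1", "seqB.1.extra", "seqC"]
metadata = "seqid,group\nseqA,x\nseqC,y\n"
print([seqid_from_name(n) for n in names])
print(missing_metadata_seqids(names, metadata))
```

Running such a check before building the model makes it easier to spot metadata rows whose identifiers include the part after the dot.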

Example use cases

  1. Build PO-MSA using default settings (transform to DAG, download missing nucleotides from NCBI) and save to a .po file:
python3 -m pangtreebuild --multialignment example_data/Ebola/multialignment.maf --output_po

will produce:

  • pangenome.json
  • poagraph.po
  2. Generate an Affinity Tree using metadata, detailed logging and default algorithm settings:
python3 -m pangtreebuild --multialignment example_data/Ebola/multialignment.maf --metadata example_data/Ebola/metadata.csv --affinity tree -v

will produce:

  • pangenome.json
  • details.log
  • affinitytree/
    • tresholds.csv
    • .po files from internal calls to poa software
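
The same command-line calls can be driven from a Python script. The sketch below only wraps the CLI with the standard subprocess module and uses flags listed in the arguments table; it does not use the package's Python API. The helper names build_command and run_and_load are hypothetical, and it is an assumption that pangenome.json is written directly into the directory given by --output_dir.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_command(multialignment, metadata=None, output_dir=None, verbose=False):
    """Assemble the argument list for `python -m pangtreebuild`."""
    cmd = [sys.executable, "-m", "pangtreebuild",
           "--multialignment", str(multialignment)]
    if metadata is not None:
        cmd += ["--metadata", str(metadata)]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_and_load(multialignment, output_dir, **kwargs):
    """Run pangtreebuild and return the parsed pangenome.json.

    Assumes the JSON file lands directly in output_dir.
    """
    cmd = build_command(multialignment, output_dir=output_dir, **kwargs)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    with open(Path(output_dir) / "pangenome.json", encoding="utf-8") as handle:
        return json.load(handle)


# Show the command without running it.
print(build_command("example_data/Ebola/multialignment.maf",
                    metadata="example_data/Ebola/metadata.csv",
                    output_dir="out", verbose=True))
```

Calling run_and_load("example_data/Ebola/multialignment.maf", "out") requires pangtreebuild to be installed; the structure of the returned dictionary is defined by the JSON schema mentioned in the introduction.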

Tests

python3 -m unittest discover -s pangtreebuild -p "tests_*"

or

nosetests pangtreebuild

Authors

This software is developed with the support of the OPUS 11 scientific project of the National Science Centre: Incorporating genomic variation information into DNA sequencing data analysis.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

Bibliography

  1. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 118–135.

  2. C. Lee. Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics, Volume 19, Issue 8, 22 May 2003, Pages 999–1008.
