Cultivate your MSA to get better trees

These details have not been verified by PyPI

Project links

Project description

Arbow: cultivate your multiple sequence aligment to get better trees

PyPI PyPI - Python Version PyPI - License

Name

We named this tool arbow as that would be the phonetic pronounciation of the short, endearing, term for an arborist in Australia.

What it does

The goal of arbow is to automate and simplify the production of trees from multiple sequence alignments. The tool has been developed in the context of viral phylogenomics.

In the current version (0.5.*) it:

Reads an alignment in multiFASTA format
Calculates stats for each sequence in the alignment
Trims 5/3 prime UTR regions --- defaults set to SARS-COV-2 (Genbank accession: NC_045512.2)
Calculates stats per column in the alignment
Allows the user to set a threshold of tolerable missing data in a column, and removes all non-conforming columns from the alignment
From the remaining columns, arbow finds all the constant columns according to two user defined criteria: allow missing data (i.e., a column with missing data can still count to towards constant sites if it meets other criteria), and the frequency of the major allow is equal to or larger than a trheshold (i.e., if the threshold is set to 0.99 and there are 100 samples, 99 of which are A and one is G, that column would be counted as a constant A). Filtering by frequency allows one to remove potential sequencing error.
It then filters out all the variable columns, and outputs the variable alignment as a multiFASTA alignment.
It runs IQTree with a few sensible presets

Currently, in step 4 above, columns that have a single observed nucleotide (e.g., C) but still have missing data that were not filtered out in step 3 are counted towards the overall frequency of that base in the alignment. In other words, if a user specifies a maximum number of 20 missing bases, and a column with 5 missing bases but with A in all other samples, that column will count towards the overall frequency of A in the alignment (i.e., majority consensus imputation). This assumptions is less risky the larger the number of samples in the alignment.

For step 5, missing data (i.e., - and N) are all codes as N.

Tests are underway to figure out how these assumptions might affect the output.

Dependencies

Python >=3.6
IQTree 1.6+ (not tested on IQTree 2 as it is not production ready yet)
BioPython
Pandas
NumPy

Installation

Brew

brew install iqtree
pip<3> install arbow

Conda

conda install -c bioconda iqtree
pip<3> install arbow

Running

Generate a mulitple sequence alignment with your favourite aligner (e.g., MAFTT). Output a multiFASTA file.
Run arbow <aln.fa>
Open tree-YYYY-MM-DD_HHMMSS.treefile in your favourite tree viewer (e.g, FigTree)
Open tree-YYYY-MM-DD_HHMMSS_bb.treefile or tree-YYYY-MM-DD_HHMMSS_alrt.treefile for branches with ultra-fast bootstrap support or SH-aLRT support only, respectively.

Data stream

When running arbow, by default a stream is output to the console (stdout).

Data about the each sequence in the alignment is prefixed with [SEQ], and is followed by:

Count of each base (A, C, G, T, and N – N is any character other than ACGT)
Percent missing data
A status column that has 0, 1, 2, or 3 * depending on whether the percent missing data is <0.5, >=0.5 and <1.0, >=1.0 and <5.0, or >=5, respectively.

Data about each column in the alignment is prefixed with [ALN], and is followed by:

Position in the alignment
Count of each base (bases counted will depend on whether all IUPAC codes are allowed or not - see below in usage)

Command line

Usage

Usage: arbow [OPTIONS] ALN

Options:
  --version
  -i, --all-iupac               Print count of all IUPAC code for column
                                stats?

  -s, --no-stream               Stop streaming stats to console
  -mm, --max-missing INTEGER    Remove sites with 'mm' missing sites or more
                                [default: 20]

  -x, --major-allele-freq FLOAT  If major allele frequency is equal or larger
                                 than consider the site constant.  [default:
                                 0.99]

  -o, --out-var-aln TEXT        Filename for alignment of variable sites.
                                [default: aln-2020-04-07-150443.aln]

  -p, --prefix TEXT             Prefix to append to IQTree output files.
                                [default: tree-2020-04-07-150443]

  -t, --iqtree-threads INTEGER  Number of cores to run IQtree  [default: 4]
  -m, --iqtree-models TEXT      Substitution models to test.  [default:
                                HKY,TIM2,GTR]

  -f, --iqtree-freq TEXT        Base frequency models to test.  [default: F]
  -r, --iqtree-rates TEXT       Rate category models to test.  [default: G,R]
  -b, --iqtree-bb INTEGER       Maximum number of UltraFast Bootstrap
                                iterations to attempt.  [default: 1000]

  -a, --iqtree-alrt INTEGER     Number of replicates to perform SH-aLRT.
                                [default: 1000]

  -c, --iqtree-cmax INTEGER     Maximum number of rate categories to test.
                                [default: 5]

  -r, --ref-id TEXT              Sequence ID of the reference  [default:
                                 MN908947.3]

  --five-prime-end INTEGER       Last base of the 5' UTR region in 1-index in
                                 the ref sequence  [default: 265]

  --three-prime-start INTEGER    First base of the 3' UTR region in 1-index in
                                 the ref sequence  [default: 29675]

  --include-const                When outputting the clean alignment, leave
                                 constant sites in the alignment. [default is
                                 to remove]

  --help                        Show this message and exit.

Default behaviour explained

By default, arbow will remove any site in the alignment that has 20 missing data points or more, will trim the 5' and 3' UTR regions, and will consider as constant any site that has a major allele frequency larger or equal to 0.997.

Remove sites with any gaps in the alignment

Let us say that you wish to remove all sites in the alignment that have any missing data, and retain all complete columns:

arbow -x 1.0 -mm 0 <in.aln>

Keep all sites in an alignment (i.e., skip any filtering)

Let us say that you wish to keep all sites in the alignment, and you have an alignment with 200 sequences:

arbow -x 1.0 -m 200 <in.aln>

Keep constant sites in the clean alignment

arbow --include-const <in.aln>

Get help

arbow <-h|--help>

Get version

arbow --version

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.6.0

May 11, 2020

0.5.3

May 5, 2020

This version

0.5.2

May 5, 2020

0.5.1

May 5, 2020

0.5.0

Apr 27, 2020

0.4.3

Apr 27, 2020

0.4.1

Apr 26, 2020

0.4.0

Apr 26, 2020

0.3.1

Apr 20, 2020

0.3.0

Apr 20, 2020

0.2.0

Apr 19, 2020

0.1.3

Apr 8, 2020

0.1.2

Apr 7, 2020

0.1.1

Apr 7, 2020

0.1.0

Apr 7, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

arbow-0.5.2.tar.gz (10.5 kB view details)

Uploaded May 5, 2020 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

arbow-0.5.2-py3-none-any.whl (9.6 kB view details)

Uploaded May 5, 2020 Python 3

File details

Details for the file arbow-0.5.2.tar.gz.

File metadata

Download URL: arbow-0.5.2.tar.gz
Upload date: May 5, 2020
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.7

File hashes

Hashes for arbow-0.5.2.tar.gz
Algorithm	Hash digest
SHA256	`a4f331271a721a1fa3a76498afd59a4dffb31ba912620c568bb58eb96a532193`
MD5	`e9c575fab5e5e012b5656df3dd796ce4`
BLAKE2b-256	`d7dac6f2934c85aa5741f8b3f3959d08b6e21363cd7298fc09bac86340e87b65`

See more details on using hashes here.

File details

Details for the file arbow-0.5.2-py3-none-any.whl.

File metadata

Download URL: arbow-0.5.2-py3-none-any.whl
Upload date: May 5, 2020
Size: 9.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.1.1 pkginfo/1.5.0.1 requests/2.23.0 setuptools/46.0.0 requests-toolbelt/0.9.1 tqdm/4.45.0 CPython/3.7.7

File hashes

Hashes for arbow-0.5.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`421a4b1d8e2489d8450066450dcdea3fe3cd7f0dea1e3ae841d2655ce5d3e790`
MD5	`8384b5d83fbc01dc043408875ac79f4d`
BLAKE2b-256	`2a98366722b569f0173daf8bf8f7d17da292ac18cb8fae28c31176a6088407e9`

See more details on using hashes here.

arbow 0.5.2

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Arbow: cultivate your multiple sequence aligment to get better trees

Name

What it does

Dependencies

Installation

Brew

Conda

Running

Data stream

Command line

Usage

Default behaviour explained

Remove sites with any gaps in the alignment

Keep all sites in an alignment (i.e., skip any filtering)

Keep constant sites in the clean alignment

Get help

Get version

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes