Cultivate your MSA to get better trees
Project description
Arbow: cultivate your multiple sequence alignment to get better trees
Name
We named this tool arbow because that is the phonetic pronunciation of the short, endearing
term for an arborist in Australia.
What it does
The goal of arbow is to automate and simplify the production of trees from multiple sequence alignments. The tool
has been developed in the context of viral phylogenomics.
In the current version (0.5.*) it:
- Reads an alignment in multiFASTA format
- Calculates stats for each sequence in the alignment
- Trims the 5' and 3' UTR regions; defaults are set for SARS-CoV-2 (GenBank accession: NC_045512.2)
- Calculates stats per column in the alignment
- Allows the user to set a threshold of tolerable missing data in a column, and removes all non-conforming columns from the alignment
- From the remaining columns, arbow finds all the constant columns according to two user-defined criteria: allow missing data (i.e., a column with missing data can still count towards constant sites if it meets the other criteria), and the frequency of the major allele is equal to or larger than a threshold (i.e., if the threshold is set to 0.99 and there are 100 samples, 99 of which are A and one is G, that column would be counted as a constant A). Filtering by frequency allows one to remove potential sequencing errors.
- It then keeps only the variable columns, and outputs the variable alignment as a multiFASTA alignment.
- It runs IQTree with a few sensible presets
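The column filtering described above can be sketched as follows. This is a minimal illustration, not arbow's actual code: the function name `classify_column` and its parameters are hypothetical, any character other than A, C, G, or T is assumed to be missing data, and the major allele frequency is assumed to be computed over the observed bases only.

```python
from collections import Counter

BASES = set("ACGT")  # assumption: anything else is treated as missing data


def classify_column(column, max_missing=20, major_allele_freq=0.99, allow_missing=True):
    """Return 'removed', 'constant', or 'variable' for one alignment column.

    Hypothetical helper; the thresholds mirror the documented defaults.
    """
    column = column.upper()
    n_missing = sum(1 for base in column if base not in BASES)
    # Following the --max-missing help text: drop columns with
    # 'max_missing' missing data points or more.
    if n_missing >= max_missing:
        return "removed"
    counts = Counter(base for base in column if base in BASES)
    if not counts:
        return "removed"  # nothing observed in this column
    # Assumption: when missing data is not allowed, a column with any
    # missing data cannot be counted as constant.
    if n_missing > 0 and not allow_missing:
        return "variable"
    _, major_count = counts.most_common(1)[0]
    freq = major_count / sum(counts.values())
    return "constant" if freq >= major_allele_freq else "variable"


if __name__ == "__main__":
    print(classify_column("A" * 99 + "G"))
    print(classify_column("A" * 98 + "GG"))
```

With the default threshold of 0.99, the first call reproduces the 99 A and one G example from the list, while the second column falls below the threshold and is kept as variable.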
Currently, when constant columns are identified, columns that have a single observed nucleotide (e.g., C) but still have missing data that were not removed by the missing-data filter are counted towards the overall frequency of that base in the alignment. In other words, if a user specifies a maximum of 20 missing bases, a column with 5 missing bases but with A in all other samples will count towards the overall frequency of A in the alignment (i.e., majority consensus imputation). This assumption is less risky the larger the number of samples in the alignment.
In the output alignment of variable sites, missing data (i.e., - and N) are all coded as N.
Tests are underway to figure out how these assumptions might affect the output.
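To make the two assumptions concrete, here is a small sketch of how constant columns could be tallied per base with missing data ignored (which is equivalent to imputing the majority base), and how gaps could be recoded as N. The function names are hypothetical and this is not taken from arbow's source.

```python
from collections import Counter


def constant_site_counts(columns, major_allele_freq=0.99):
    """Tally constant columns by their major base.

    Missing data is skipped when computing the frequency, so a column
    such as 'AA-A' counts as a constant A (majority consensus imputation).
    """
    tally = {base: 0 for base in "ACGT"}
    for column in columns:
        observed = Counter(b for b in column.upper() if b in "ACGT")
        if not observed:
            continue  # column has no observed bases at all
        base, count = observed.most_common(1)[0]
        if count / sum(observed.values()) >= major_allele_freq:
            tally[base] += 1
    return tally


def recode_missing(sequence):
    """Code gaps as N, so that '-' and 'N' are represented the same way."""
    return sequence.upper().replace("-", "N")


if __name__ == "__main__":
    print(constant_site_counts(["AAAA", "AA-A", "ACAC", "CCCN", "GGGG"]))
    print(recode_missing("AC-GN-"))
```

In the toy input, `AA-A` and `CCCN` are counted as constant A and C respectively, even though each has one missing data point.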
Dependencies
- Python >=3.6
- IQTree 1.6+ (not tested on IQTree 2 as it is not production ready yet)
- BioPython
- Pandas
- NumPy
Installation
Brew
brew install iqtree
pip<3> install arbow
Conda
conda install -c bioconda iqtree
pip<3> install arbow
Running
- Generate a multiple sequence alignment with your favourite aligner (e.g., MAFFT). Output a multiFASTA file.
- Run arbow <aln.fa>
- Open tree-YYYY-MM-DD_HHMMSS.treefile in your favourite tree viewer (e.g., FigTree)
- Open tree-YYYY-MM-DD_HHMMSS_bb.treefile or tree-YYYY-MM-DD_HHMMSS_alrt.treefile for branches with ultra-fast bootstrap support or SH-aLRT support only, respectively.
Data stream
When running arbow, by default a stream is output to the console (stdout).
Data about each sequence in the alignment is prefixed with [SEQ], and is followed by:
- Count of each base (A, C, G, T, and N; N is any character other than ACGT)
- Percent missing data
- A status column that has 0, 1, 2, or 3, depending on whether the percent missing data is <0.5, >=0.5 and <1.0, >=1.0 and <5.0, or >=5, respectively.
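The per-sequence stats and status thresholds above can be sketched as below. The function names `missing_status` and `sequence_stats` are hypothetical, and percent missing is assumed to be the share of non-ACGT characters in the aligned sequence.

```python
def missing_status(percent_missing):
    """Map percent missing data to the 0-3 status code described above."""
    if percent_missing < 0.5:
        return 0
    if percent_missing < 1.0:
        return 1
    if percent_missing < 5.0:
        return 2
    return 3


def sequence_stats(sequence):
    """Count A, C, G, T and N (any other character) for one aligned sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    counts = {base: 0 for base in "ACGTN"}
    for char in sequence.upper():
        counts[char if char in "ACGT" else "N"] += 1
    percent_missing = 100 * counts["N"] / len(sequence)
    return {
        "counts": counts,
        "percent_missing": percent_missing,
        "status": missing_status(percent_missing),
    }


if __name__ == "__main__":
    print(sequence_stats("A" * 95 + "-" * 5))
```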
Data about each column in the alignment is prefixed with [ALN], and is followed by:
- Position in the alignment
- Count of each base (bases counted will depend on whether all IUPAC codes are allowed or not - see below in usage)
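If you capture the stream to a file, the two kinds of records can be separated by their prefix. The sketch below only relies on the [SEQ] and [ALN] prefixes documented above; the layout of the rest of each line is not assumed, and the sample lines in the example are made up for illustration.

```python
def split_stream(lines):
    """Group streamed lines by their [SEQ] or [ALN] prefix.

    Lines without a known prefix are ignored. The text after the prefix
    is returned as-is, because its exact layout is not assumed here.
    """
    records = {"SEQ": [], "ALN": []}
    for line in lines:
        line = line.strip()
        for tag, bucket in records.items():
            prefix = f"[{tag}]"
            if line.startswith(prefix):
                bucket.append(line[len(prefix):].strip())
                break
    return records


if __name__ == "__main__":
    # Hypothetical sample lines, not real arbow output.
    sample = [
        "[SEQ] sample_1 ...",
        "[ALN] 1 ...",
        "some other console message",
    ]
    print(split_stream(sample))
```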
Command line
Usage
Usage: arbow [OPTIONS] ALN
Options:
  --version
  -i, --all-iupac                Print count of all IUPAC code for column
                                 stats?
  -s, --no-stream                Stop streaming stats to console
  -mm, --max-missing INTEGER     Remove sites with 'mm' missing sites or more
                                 [default: 20]
  -x, --major-allele-freq FLOAT  If major allele frequency is equal or larger
                                 than consider the site constant. [default:
                                 0.99]
  -o, --out-var-aln TEXT         Filename for alignment of variable sites.
                                 [default: aln-2020-04-07-150443.aln]
  -p, --prefix TEXT              Prefix to append to IQTree output files.
                                 [default: tree-2020-04-07-150443]
  -t, --iqtree-threads INTEGER   Number of cores to run IQtree [default: 4]
  -m, --iqtree-models TEXT       Substitution models to test. [default:
                                 HKY,TIM2,GTR]
  -f, --iqtree-freq TEXT         Base frequency models to test. [default: F]
  -r, --iqtree-rates TEXT        Rate category models to test. [default: G,R]
  -b, --iqtree-bb INTEGER        Maximum number of UltraFast Bootstrap
                                 iterations to attempt. [default: 1000]
  -a, --iqtree-alrt INTEGER      Number of replicates to perform SH-aLRT.
                                 [default: 1000]
  -c, --iqtree-cmax INTEGER      Maximum number of rate categories to test.
                                 [default: 5]
  -r, --ref-id TEXT              Sequence ID of the reference [default:
                                 MN908947.3]
  --five-prime-end INTEGER       Last base of the 5' UTR region in 1-index in
                                 the ref sequence [default: 265]
  --three-prime-start INTEGER    First base of the 3' UTR region in 1-index in
                                 the ref sequence [default: 29675]
  --include-const                When outputting the clean alignment, leave
                                 constant sites in the alignment. [default is
                                 to remove]
  --help                         Show this message and exit.
Default behaviour explained
By default, arbow will remove any site in the alignment that has 20 missing data points or more, will trim the 5' and 3' UTR regions, and will consider as constant any site that has a major allele frequency larger than or equal to 0.99.
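Because the UTR coordinates are given in 1-indexed positions of the reference sequence, they have to be mapped to alignment columns whenever the aligned reference contains gaps. The sketch below shows one way to do that mapping; the function names are hypothetical, the alignment is assumed to be a plain dict of sequence ID to aligned string, and this is not arbow's own implementation.

```python
def utr_trim_bounds(ref_aligned, five_prime_end=265, three_prime_start=29675):
    """Return a (start, stop) 0-based, half-open slice of alignment columns.

    The slice keeps reference positions five_prime_end + 1 up to
    three_prime_start - 1 (both 1-indexed, ungapped reference coordinates).
    """
    ref_pos = 0  # 1-indexed position in the ungapped reference
    start = stop = None
    for col, base in enumerate(ref_aligned):
        if base == "-":
            continue  # gap in the reference: no reference position consumed
        ref_pos += 1
        if start is None and ref_pos == five_prime_end + 1:
            start = col
        if ref_pos == three_prime_start:
            stop = col
            break
    if start is None:
        raise ValueError("reference is shorter than the 5' UTR end")
    if stop is None:
        stop = len(ref_aligned)  # reference ends before the 3' UTR start
    return start, stop


def trim_utrs(alignment, ref_id="MN908947.3", five_prime_end=265, three_prime_start=29675):
    """Apply the same column slice to every sequence in the alignment."""
    start, stop = utr_trim_bounds(alignment[ref_id], five_prime_end, three_prime_start)
    return {name: seq[start:stop] for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"ref": "AC-GTAC--GT", "s1": "ACAGTACTTGT"}
    print(trim_utrs(toy, ref_id="ref", five_prime_end=2, three_prime_start=6))
```

In this sketch, columns where the reference has a gap immediately after the 5' UTR are trimmed along with it, while such columns immediately before the 3' UTR are kept. That is a design choice of the example only.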
Remove sites with any gaps in the alignment
Let us say that you wish to remove all sites in the alignment that have any missing data, and retain all complete columns:
arbow -x 1.0 -mm 0 <in.aln>
Keep all sites in an alignment (i.e., skip any filtering)
Let us say that you wish to keep all sites in the alignment, and you have an alignment with 200 sequences:
arbow -x 1.0 -mm 200 <in.aln>
Keep constant sites in the clean alignment
arbow --include-const <in.aln>
Get help
arbow <-h|--help>
Get version
arbow --version