Map BLAST results onto a common-knowledge taxonomic (phylogenetic) tree
ProTaxoVis
This package contains tools to map BLAST results on the NCBI taxonomy phylogenetic (taxonomic) tree. It helps to analyse the presence/absence of proteins or genes in various taxa. This information can be useful, for example, for the analysis of two competing pathways (see e.g. Bockwoldt et al. 2019).
If you used this tool in your work, please cite:
TODO: Citation of our paper.
Installation
You will need Python 3.6+ and, optionally, BLAST+. Everything else can be installed using pip. We suggest installing everything into a virtual environment or Conda environment. The examples below use virtual environments, with the Conda equivalents given in the comments.
# Create a virtual environment. Not strictly necessary,
# but generally recommended to avoid problems with conflicting versions
python3 -m venv taxoenv
# conda create -n taxoenv
# Activate the environment. You have to do this every time you
# open a new shell to work with ProTaxoVis.
source /path/to/taxotest/taxoenv/bin/activate
# conda activate taxoenv
# Install ProTaxoVis
pip install protaxovis
# You need to tell taxfinder once to download the latest NCBI
# taxonomy information
taxfinder_update
# Now, you can run all the good stuff. Start, for example, with:
taxovis --init
After the installation, you only have to reactivate the virtual/Conda environment when opening a new terminal:
source /path/to/taxotest/taxoenv/bin/activate
# conda activate taxoenv
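To confirm that the installation worked, a small check like the following may help. It is a minimal sketch and not part of ProTaxoVis: the helper names `missing_tools` and `check_environment` are made up for illustration, and the executable names are the ones mentioned in this README plus `blastp` from BLAST+, which is optional.

```python
import shutil
import sys

# Command line tools named in this README; blastp (BLAST+) is optional.
REQUIRED_TOOLS = ["taxovis", "taxotree", "blast2fasta", "taxfinder_update"]
OPTIONAL_TOOLS = ["blastp"]


def missing_tools(names):
    """Return the subset of `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def check_environment():
    """Print a short report; return True if all required parts are present."""
    ok = sys.version_info >= (3, 6)
    if not ok:
        print("Python 3.6+ is required, found", sys.version.split()[0])
    for name in missing_tools(REQUIRED_TOOLS):
        print("Missing required tool:", name)
        ok = False
    for name in missing_tools(OPTIONAL_TOOLS):
        print("Optional tool not found:", name)
    return ok
```

Run `check_environment()` from a Python session started inside the activated environment. If a required tool is reported as missing, the environment is most likely not activated.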
Components
This package consists of four command line tools and one importable Python module. All command line tools can be run with --help to get help.
Workflow
Example data can be downloaded from https://github.com/MolecularBioinformatics/ProTaxoVis-examples
This data includes the results of a BLAST run against nr and has config files ready, allowing you to skip steps 2, 3, and 5 of the workflow below.
A typical workflow is described in (TODO: Our paper) and can be summarized in these steps:
1. Create a folder for your project. Open a terminal/shell in this folder and run taxovis --init.
2. Collect fasta files of your proteins or genes of interest. These are called seeds. Fasta files of multiple species for the same protein or gene are possible. Place the fasta files in the fasta folder of your project. Modify the files proteinlist.txt and limits.txt.
3. Run BLAST against the database of your choice (e.g. nr for proteins, nt for genes). This can be done using the command line tool or the NCBI BLAST website. It is important that taxonomy ids are included in the results. This is the case for all NCBI databases. For your own databases, please consult the documentation for makeblastdb on how to include taxonomy ids. If you run BLAST on the NCBI website, make sure to download the results as "Single-file XML2" and save them to the blastresults folder with the same filename as the corresponding fasta file but with .xml instead of .fasta.
4. Run the command taxovis --all.
5. Modify the files tree_config.txt and tree_to_prune.txt.
6. Run the command taxotree --show. You may want to go back and forth between modifying tree_to_prune.txt and visualizing the tree until the focus of the tree is right.
7. Save the tree using taxotree --outfile tree_name.pdf.
8. If you are interested in the interactive heatmap, modify heatmap_config.txt and run taxovis --only intheat.
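The file naming rule from step 3 can be checked before running taxovis --all. The following is a minimal sketch, not part of ProTaxoVis: the function name `pending_blast_commands` is hypothetical, the folder names `fasta` and `blastresults` are taken from the steps above, and the use of `-outfmt 16` (the BLAST+ code for "Single-file BLAST XML2") as the matching local output format is an assumption. For every seed that has no result file yet, it builds the BLAST+ argument list that would write one.

```python
from pathlib import Path


def pending_blast_commands(project_dir, db="nr", program="blastp"):
    """Yield one BLAST+ argument list per seed that has no result file yet."""
    project = Path(project_dir)
    fasta_dir = project / "fasta"
    result_dir = project / "blastresults"
    for fasta in sorted(fasta_dir.glob("*.fasta")):
        # Result files share the seed's filename, with .xml instead of .fasta.
        result = result_dir / (fasta.stem + ".xml")
        if result.exists():
            continue
        yield [
            program,
            "-query", str(fasta),
            "-db", db,
            "-outfmt", "16",  # assumption: Single-file BLAST XML2 in BLAST+
            "-out", str(result),
        ]
```

Each yielded list can be passed to `subprocess.run(cmd, check=True)` if BLAST+ and the database are available locally. For gene seeds you would pass `program="blastn"` and `db="nt"`.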
Output files
histograms/ shows the length distribution of BLAST results for each seed. blastmappings/ shows where the seeds map on the BLAST results depending on the e-value cutoff. Both can be used to tweak limits.txt.
trees/ contains the trees for each seed. These trees contain all species in which the seed was found by BLAST. general.tre is the combined tree for all seeds.
heatmap.html is an interactive heatmap of selected taxa. You can select taxa and other parameters in heatmap_config.txt and re-run taxovis --only intheat.
matrix.csv is a table that shows how similar two seeds are. The number in the cell where seed A and seed B cross is the e-value at which the BLAST search of seed A found seed B.
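To look up a single pair of seeds in such a table, something like the following sketch may be enough. The exact layout of matrix.csv is not described here, so the code assumes a comma-separated file whose first row and first column hold the seed names, with the query seed (seed A) in the row. The function name `read_matrix` and the example data are made up for illustration.

```python
import csv
import io


def read_matrix(handle):
    """Read a seed-by-seed table into {row_seed: {column_seed: value}}."""
    rows = list(csv.reader(handle))
    column_seeds = rows[0][1:]  # assumption: first row holds the seed names
    matrix = {}
    for row in rows[1:]:
        if not row:
            continue
        # assumption: first column holds the name of the query seed
        matrix[row[0]] = dict(zip(column_seeds, row[1:]))
    return matrix


# Made-up example data, not real ProTaxoVis output.
example = io.StringIO("seed,seedA,seedB\nseedA,0.0,1e-30\nseedB,2e-28,0.0\n")
matrix = read_matrix(example)
print(matrix["seedA"]["seedB"])
```

For a real file, open it with `open("matrix.csv", newline="")` and pass the handle to `read_matrix`. Values are kept as strings because cells may be empty or non-numeric; convert with `float()` where needed.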
Other command line tools
blast2fasta can be used to download sequences based on BLAST results.
Importable modules
taxovis can be used not only as a command line tool, but can also be imported for your own workflows, without the hardcoded filenames etc.
venn can create Venn diagrams. There is no direct integration with the other tools in this package, but it may prove useful for your own custom workflows.
sample_taxids can be used to randomly sample taxids from a file of taxids and write them out to a heatmap configuration file. Rerunning the heatmap step of taxovis will then redraw the heatmap with the sampled taxids.
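The sampling idea itself can be sketched with the standard library. This is not the sample_taxids module and does not reproduce its interface: `draw_taxids` is a hypothetical helper, the input is assumed to contain one taxid per line, and writing the result into the heatmap configuration file is left out because that file's format is not described here.

```python
import random


def draw_taxids(lines, sample_size, seed=None):
    """Return `sample_size` distinct taxids drawn at random from `lines`."""
    # Keep non-empty lines only and drop duplicates while preserving order.
    taxids = list(dict.fromkeys(line.strip() for line in lines if line.strip()))
    if sample_size > len(taxids):
        raise ValueError("sample size exceeds the number of available taxids")
    # A private Random instance makes the draw reproducible for a given seed.
    return random.Random(seed).sample(taxids, sample_size)


# Example input: 9606 (human), 10090 (mouse), 7227 (fruit fly),
# 6239 (C. elegans), plus a blank line and a duplicate to show the cleanup.
example_lines = ["9606\n", "10090\n", "7227\n", "6239\n", "\n", "9606\n"]
picked = draw_taxids(example_lines, 2, seed=1)
```

Passing a fixed `seed` gives the same selection on every run, which is convenient when the heatmap should be reproducible. An open file handle can be passed directly as `lines`.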
Utilities
There are some utility scripts that are only on GitHub and are not installed by pip. You can find them in the utils folder in the repository. These scripts are not polished and are meant to be changed before using them. They might come in handy, though, so we did not delete them outright. Each script has a short description at the top of the file.
lineage_value.py gives an overview of how well given seeds are represented along a phylogenetic lineage. If you, for example, give human (taxonomy id 9606) as the target, it will show how well the seeds are found in Hominidae, Simiiformes, Primates, Mammalia, etc.
static_heatmap.py is similar to the intheat step in taxovis. As the name says, it is not interactive, but it can be used as a possible starting image for a publication.
sample_taxids.py is a helper library that reads in a list of NCBI taxonomy ids and a sample size and returns randomly sampled taxids. This can be used to randomly sample organisms for the taxovis heatmap. Make sure to add your email address first!