Simple program to more efficiently run blast in multicore systems, as well as rough taxonomomic annoation using BASTA LCA
Project description
justblast
This is a simple program to more efficiently run blast on multicore machines, have a simple extension to run and plot the last common ancestor (LCA) using Tim Kahlke's BASTA, and allowing the input to be in fastq format.
Requirements
READ THIS BEFORE INSTALLING
To run this program you will need to have the blast+ tools install in your machine. Go to the link above and follow the instructions. The installation instruction below should try to install BASTA, however, BASTA has as requirement the database LevelDB that needs to be installed on your system. On Linux systems you can do:
sudo apt-get update
sudo apt-get install python-leveldb
If you dont have administrative privileges, contact your sys admin.
On mac OS:
brew install leveldb
A note for Compute Canada Users
Before installing you will need to load the python packages and the levelDB modules by:
module load nixpkgs/16.09 scipy-stack/2018b # python modules
module load gcc/5.4.0 leveldb/1.20 # leveldb
Installation
justblast in on the PyPI repository, and can be installed by:
python3 -m pip install justblast
Or you can clone this repository and run the setup.py install
command.
If you do not have admin privileges, you can add the --user
option.
Usage
You can explore the options by typing:
justblast -h
and you will get
usage: justblast [-h] [-e EVALUE] [-p PERCENT_ID] [-m MAX_TARGET_SEQS]
[-q QUERY_COVERAGE] [-c CPUS] [-i] [-o OUT_FILENAME] [-f OUTFMT]
query db
positional arguments:
query Fasta file with query sequences
db path to blast database
optional arguments:
-h, --help show this help message and exit
-e EVALUE, --evalue EVALUE
evalue for blast search (default: 10)
-p PERCENT_ID, --percent_id PERCENT_ID
Minimum percent identity on blast search (default: 0)
-m MAX_TARGET_SEQS, --max_target_seqs MAX_TARGET_SEQS
Number of aligned sequences to keep (default: 500)
-q QUERY_COVERAGE, --query_coverage QUERY_COVERAGE
Minimum query coverage to retain (default: None)
-c CPUS, --cpus CPUS Number of cpus to use (default: -1)
-i, --identify Whether to use basta to assign taxopnomy to the hits
based on LCA. This is a rough estimate and should be
revised carefully (default: False)
-o OUT_FILENAME, --out_filename OUT_FILENAME
name of output (filtered) file (default: hit.hits)
-f OUTFMT, --outfmt OUTFMT
Custom format for BLAST (default: qseqid sseqid pident
evalue qcovs qlen length staxid stitle)
There are only two positional arguments, the query file and the path to
the BLAST database. Most of the optional characters will filter and/or
modify the blast search. The two exceptions are identify
, which will
run basta, and cpus, that can be tailored to your machine (by default it
uses all cores in your machine). NOTE: if you are in Compute Canada
you HAVE to pass this value matching the number of cores you requested.
Notes on the BASTA run
justblast performs a rough assignment of taxonomy based on BASTA. Here I use the following parameters:
- -m 10: A minimum of 10 hits have to agree to assign the given taxonomy
- -n 50: Uses the top 50 hits to make the assignment, regardless of you MAX_TARGET_SEQS
- The rest of parameters are either default, or use the same as for the blast.
For basta to run your outfmt must contain AT LEAST:
- qseqid
- sseqid
- length
- evalue
- pident
Dummy Example
Let's say that you have a fasta file called seqs.fasta
, and you want to
run a blast against the nucleotide database (nt) located on you home folder
(/home/user). You want to restrict your blast to an evalue of 1E-10, a percent id
of 95%, and retrieve only 50 target sequences that have a query coverage of over 90%.
You also want to explore roughly the taxonomic landscape using BASTA. Then you can
call the program by:
justblast.py seqs.fasta /home/user/nt -e 1E-10 -p 95 -m 50 -q 90 -i -o results.hits
This will generate a hits file named results.hits
and will contain the following
columns (note that the outfmt was left default):
- qseqid
- sseqid
- pident
- evalue
- qcovs
- qlen
- length
- staxid
- stitle
Also a file called results_annotated.hits
that besides the columns above, will also
contain the column lineage.
It will also contain a PDF with the histograms of all the taxonomic levels identified called `results_taxadist.pdf'
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distributions
File details
Details for the file justblast-2020.0.4.tar.gz
.
File metadata
- Download URL: justblast-2020.0.4.tar.gz
- Upload date:
- Size: 11.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 24c81eae423610beca378b4a34852b263a9e069b2592008b6f84ac2203be0484 |
|
MD5 | 68f8b3825ac2b880a7aaddee17a038e6 |
|
BLAKE2b-256 | 27a709716699e681043a21d57b0a377578d0cacf353210e4ba93dccff6d46614 |
File details
Details for the file justblast-2020.0.4-py3.6.egg
.
File metadata
- Download URL: justblast-2020.0.4-py3.6.egg
- Upload date:
- Size: 22.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | b2138b5c26ff584e078bdc53ae258d8d6d08ea2c2abb27ffddc8fbb4eee5aef9 |
|
MD5 | 783663b799a5a135b03eed0ade0fe84d |
|
BLAKE2b-256 | ccb182b881af890abd306bc30b5e4f69a241d7559d0172cdc17aaccfef073446 |
File details
Details for the file justblast-2020.0.4-py3-none-any.whl
.
File metadata
- Download URL: justblast-2020.0.4-py3-none-any.whl
- Upload date:
- Size: 12.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.49.0 CPython/3.7.5
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | bbf4407e8022ec596ce32481f195ca94e8bcd95589e169e6d4d92174cdc83d6f |
|
MD5 | d6e36720fe320e81fead1b7b2aecd98a |
|
BLAKE2b-256 | e59c9e3cc38737c32525d9cd335898563225013f15177ddb9c27efa2b38be3bf |