blasttools
Commands for turning blast queries into pandas dataframes.
Blast against any built blast databases
blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot
Install
Install with
python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'
Once installed, you can update with blasttools update.
Common Usages:
Build some blast databases from Ensembl Plants.
blasttools plants --release=40 build triticum_aestivum zea_mays
Find out what species are available:
blasttools plants --release=40 species
Blast against my.fasta and save the dataframe as a pickle file (the default is to save as a csv file named my.fasta.csv).
blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays
Get your blast data!
import pandas as pd
df = pd.read_pickle('dataframe.pkl')
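Once the dataframe is loaded, ordinary pandas filtering applies. The sketch below is only an illustration: it builds a small stand-in dataframe instead of running blast, and the column names qaccver, saccver and pident are assumptions borrowed from the usual blast tabular output (only evalue is named in this README), so check df.columns on your own results first.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for a blasttools result; real column names may differ.
hits = pd.DataFrame(
    {
        "qaccver": ["q1", "q1", "q2"],
        "saccver": ["s1", "s2", "s3"],
        "evalue": [1e-30, 0.5, 1e-8],
        "pident": [98.0, 35.0, 80.0],
    }
)

# Round-trip through a pickle file, as `--out=dataframe.pkl` would produce.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataframe.pkl")
    hits.to_pickle(path)
    df = pd.read_pickle(path)

# Keep only hits at or below an evalue cutoff, most significant first.
significant = df[df["evalue"] <= 1e-5].sort_values("evalue").reset_index(drop=True)
print(significant)
```

The 1e-5 cutoff is an arbitrary example value, not a blasttools default.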
Parallelization
When blasting, you can specify --num-threads, which is passed directly to the underlying blast command. If you want to parallelize over species, databases or fasta files, I suggest you use GNU Parallel [Tutorial]. parallel has a much better set of options for controlling how the parallelization works and is also quite simple for simple things.
For example, build blast databases from a set of fasta files concurrently:
parallel blasttools build ::: *.fa.gz
Or blast everything!
species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species
Then gather them all together...
blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl
or programmatically:
from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)
Remember: if you parallelize your blasts and also use --num-threads > 1, then your jobs are probably going to be fighting each other for cpu time!
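One rough way to avoid oversubscribing the machine is to divide the available cpus by the thread count given to each blast. The helper below is a hypothetical sketch (plan_jobs is not part of blasttools); it only does the arithmetic and prints a suggested job count for GNU Parallel's -j option.

```python
import os


def plan_jobs(total_cpus: int, num_threads: int) -> int:
    """Return how many concurrent blast jobs fit on total_cpus
    when each job uses num_threads threads (always at least 1)."""
    if total_cpus < 1 or num_threads < 1:
        raise ValueError("total_cpus and num_threads must be >= 1")
    return max(1, total_cpus // num_threads)


# os.cpu_count() can return None, so fall back to a single cpu.
cpus = os.cpu_count() or 1
threads = 4
jobs = plan_jobs(cpus, threads)
print(f"with {cpus} cpus and --num-threads={threads}, try: parallel -j{jobs}")
```

This ignores memory and disk limits, which can matter as much as cpu count for large databases.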
Best matches
If you only want the top hits, --best=3 will select the 3 lowest evalues for each query sequence. However, if you want "best" to mean, say, the longest query match, then you can add --expr='qstart - qend'. (Remember that we are looking for the lowest values.)
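To make the "lowest value wins" rule concrete, here is a pandas sketch of the same idea. It is not how blasttools implements --best or --expr; best_per_query is a hypothetical helper, and the query column name qaccver is an assumption (evalue, qstart and qend are the names used above).

```python
import pandas as pd


def best_per_query(df, n, expr="evalue", query_col="qaccver"):
    """Keep the n rows per query with the lowest value of expr."""
    # Evaluate the ranking expression against the dataframe columns.
    ranked = df.assign(_rank=df.eval(expr))
    return (
        ranked.sort_values([query_col, "_rank"])
        .groupby(query_col, sort=False)
        .head(n)
        .drop(columns="_rank")
        .reset_index(drop=True)
    )


# Hypothetical hits: s1 has the lowest evalue, s2 the longest query match.
hits = pd.DataFrame(
    {
        "qaccver": ["q1", "q1", "q1", "q2"],
        "saccver": ["s1", "s2", "s3", "s4"],
        "evalue": [1e-50, 1e-20, 1e-5, 1e-3],
        "qstart": [1, 1, 10, 5],
        "qend": [50, 200, 20, 100],
    }
)

by_evalue = best_per_query(hits, 1)
by_length = best_per_query(hits, 1, expr="qstart - qend")
print(by_evalue["saccver"].tolist())
print(by_length["saccver"].tolist())
```

Because qstart - qend is more negative for longer matches, sorting ascending puts the longest match first.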
XML
Blast offers an xml (--xml) output format that adds query, match, sbjct strings. The other fields are equivalent to adding --columns='+score gaps nident positive qlen slen'.
blasttools also offers a way to display the blast match as a pairwise alignment.
import pandas as pd
from blasttools.blastxml import hsp_match
df = pd.read_csv('results.csv')
df['alignment'] = df.apply(hsp_match, axis=1)
print(df.iloc[0].alignment)
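If you are curious what such a display involves, the three strings from the xml output are enough to build one by hand. The function below is a hypothetical sketch, not the implementation of hsp_match, and its exact layout (labels, line width) is an assumption; the sequences are made up.

```python
def format_alignment(query: str, match: str, sbjct: str, width: int = 60) -> str:
    """Stack query, match and sbjct strings in blocks of `width` columns."""
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError("query, match and sbjct must have equal length")
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        blocks.append(
            "\n".join(
                [
                    f"Query  {query[start:end]}",
                    f"       {match[start:end]}",
                    f"Sbjct  {sbjct[start:end]}",
                ]
            )
        )
    # Separate consecutive blocks with a blank line.
    return "\n\n".join(blocks)


# Made-up aligned strings of equal length (8 columns each).
print(format_alignment("MKT-AYIA", "MK+ AY+A", "MKSLAYLA", width=4))
```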