blasttools

Commands for turning blast queries into pandas dataframes.

Blast against any blast databases you have already built:

blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot

Install

Install with:

python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'

Once installed, you can update with blasttools update.

Common Usages:

Build some blast databases from Ensembl Plants.

blasttools plants --release=40 build triticum_aestivum zea_mays

Find out what species are available:

blasttools plants --release=40 species

Blast my.fasta against the built databases and save the dataframe as a pickle file (the default is to save a CSV file named my.fasta.csv).

blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays

Get your blast data!

import pandas as pd
df = pd.read_pickle('dataframe.pkl')

Parallelization

When blasting, you can specify --num-threads, which is passed directly to the underlying blast command. If you want to parallelize over species, databases or fasta files, I suggest you use GNU Parallel [Tutorial].

parallel offers far finer control over how the work is distributed than blasttools could provide itself, and it stays simple for simple cases.

e.g. build blast databases from a set of fasta files concurrently:

parallel blasttools build ::: *.fa.gz

Or blast everything!

species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species

Then gather them all together...

blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl

or programmatically:

from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)

Remember: if you parallelize your blasts and also use --num-threads > 1, the concurrent blast jobs will probably end up competing with each other for CPU time!
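As a rough illustration of that trade-off, here is a minimal sketch for picking how many blast jobs to run at once so that jobs × threads does not exceed the number of CPUs. The helper name max_parallel_jobs is hypothetical and is not part of blasttools.

```python
import os


def max_parallel_jobs(num_threads, cpus=None):
    """Number of concurrent jobs so that jobs * num_threads <= cpus.

    Hypothetical helper for illustration; not part of blasttools.
    """
    # Fall back to the machine's CPU count (or 1 if it cannot be determined).
    cpus = cpus or os.cpu_count() or 1
    # Always allow at least one job, even if num_threads exceeds cpus.
    return max(1, cpus // num_threads)


if __name__ == "__main__":
    # e.g. with --num-threads=4
    print(max_parallel_jobs(4))
```

The result could then be passed to parallel with its -j option to cap the number of simultaneous jobs.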

Best matches

If you only want the top hits, --best=3 keeps the three hits with the lowest evalue for each query sequence. If you would rather rank by something else, say the longest query match, add an expression such as --expr='qstart - qend'. (Remember that the lowest values win, so a longer match must produce a smaller number.)
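The same kind of selection can be sketched on an already loaded dataframe with plain pandas. This is not how blasttools implements --best or --expr; it only illustrates the "lowest value per query wins" idea. The function best_hits is hypothetical, and the query and subject column names (qaccver, saccver) are assumptions about the dataframe layout.

```python
import pandas as pd


def best_hits(df, n=3, expr="evalue", query_col="qaccver"):
    """Keep the n rows with the lowest value of `expr` for each query.

    Hypothetical helper; `query_col` is an assumed column name.
    """
    # Evaluate the ranking expression against the dataframe's columns.
    ranked = df.assign(_rank=df.eval(expr))
    # Sort so that the lowest-ranked rows come first within each query.
    ranked = ranked.sort_values([query_col, "_rank"], kind="stable")
    return ranked.groupby(query_col, sort=False).head(n).drop(columns="_rank")


# Small made-up table purely for demonstration.
hits = pd.DataFrame({
    "qaccver": ["q1", "q1", "q1", "q2", "q2"],
    "saccver": ["s1", "s2", "s3", "s4", "s5"],
    "qstart": [1, 10, 5, 1, 1],
    "qend": [50, 200, 30, 80, 20],
    "evalue": [1e-5, 1e-3, 1e-10, 1e-2, 1e-8],
})

by_evalue = best_hits(hits, n=1)                        # lowest evalue per query
by_length = best_hits(hits, n=1, expr="qstart - qend")  # longest query match per query
```

With qstart - qend, a match spanning 10 to 200 scores -190 and beats one spanning 1 to 50 (-49), which is why the subtraction is written in that order.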

XML

Blast offers an XML output format (--xml) that adds the query, match and sbjct strings. The other fields it provides are equivalent to adding --columns='+score gaps nident positive qlen slen'.

blasttools also offers a way to display a blast match as a pairwise alignment:

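import pandas as pd
from blasttools.blastxml import hsp_match
# load results saved from a blast run with --xml
df = pd.read_csv('results.csv')
# build a pairwise alignment string for each row
df['alignment'] = df.apply(hsp_match, axis=1)
print(df.iloc[0].alignment)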
