Commands for turning blast queries into pandas dataframes.

blasttools

Blast against any built blast databases

blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot

Install

Install with

python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'

Once installed, you can update with blasttools update.

Common Usages:

Build some blast databases from Ensembl Plants.

blasttools plants --release=40 build triticum_aestivum zea_mays

Find out what species are available:

blasttools plants --release=40 species

Blast the sequences in my.fasta against the built databases and save the resulting dataframe as a pickle file (the default is to save as a csv file named my.fasta.csv).

blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays

Get your blast data!

import pandas as pd
df = pd.read_pickle('dataframe.pkl')
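Once the dataframe is loaded, a common first step is to keep only the significant hits. The following is a minimal sketch, not part of blasttools: the helper name significant_hits and the 1e-5 cutoff are hypothetical, and the toy dataframe stands in for the real pickle. It relies only on the evalue column mentioned elsewhere in this README.

```python
import pandas as pd


def significant_hits(df: pd.DataFrame, max_evalue: float = 1e-5) -> pd.DataFrame:
    """Return rows with evalue <= max_evalue, most significant first.

    Hypothetical helper for illustration; the cutoff is an arbitrary choice.
    """
    keep = df[df["evalue"] <= max_evalue]
    return keep.sort_values("evalue").reset_index(drop=True)


# Toy stand-in for pd.read_pickle('dataframe.pkl'); the values are made up.
toy = pd.DataFrame(
    {
        "evalue": [1e-30, 0.5, 1e-8],
        "qstart": [1, 10, 5],
        "qend": [200, 40, 90],
    }
)
hits = significant_hits(toy)
print(hits)
```

With real data, replace toy with the dataframe returned by pd.read_pickle.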

Parallelization

When blasting, you can specify --num-threads, which is passed directly to the underlying blast command. If you want to parallelize over species, databases or fasta files, I suggest you use GNU Parallel [Tutorial].

parallel has a rich set of options for controlling how the parallelization works, and it stays simple for simple jobs.

For example, build blast databases from a set of fasta files concurrently:

parallel blasttools build ::: *.fa.gz

Or blast everything!

species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species

Then gather them all together...

blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl

or programmatically:

from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)
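When gathering per-species results it can help to remember which file each row came from. The sketch below is one possible way to do that, not something blasttools provides: the function name concat_pickles and the source_file column are hypothetical, and the two small pickles written to a temporary directory are made-up stand-ins for real blast output.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_pickles(directory, pattern="my*.pkl"):
    """Concatenate pickled dataframes, tagging each row with its file name.

    Hypothetical helper; 'source_file' is a column name chosen for this sketch.
    """
    frames = []
    for path in sorted(Path(directory).glob(pattern)):
        frame = pd.read_pickle(path)
        frame["source_file"] = path.name
        frames.append(frame)
    if not frames:
        # Nothing matched the pattern: return an empty dataframe.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


# Demonstration with toy pickles written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"evalue": [1e-20]}).to_pickle(Path(tmp) / "my1.pkl")
    pd.DataFrame({"evalue": [1e-5, 0.1]}).to_pickle(Path(tmp) / "my2.pkl")
    combined = concat_pickles(tmp)

print(combined)
```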

Remember: if you parallelize your blasts and also use --num-threads > 1, the concurrent blast processes will probably be competing with each other for cpu time!
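One way to avoid oversubscribing the machine is to divide the available cores among the concurrent jobs. This is a rough sketch of that arithmetic only: the function name threads_per_job is hypothetical, and parallel_jobs means however many blast processes you run at once (for instance, the job count given to GNU Parallel).

```python
import os
from typing import Optional


def threads_per_job(parallel_jobs: int, total_cpus: Optional[int] = None) -> int:
    """Suggest a --num-threads value so that jobs * threads <= available cpus.

    Hypothetical helper; always returns at least 1 thread per job.
    """
    if parallel_jobs < 1:
        raise ValueError("parallel_jobs must be at least 1")
    # Fall back to the number of cpus the OS reports (or 1 if unknown).
    cpus = total_cpus if total_cpus is not None else (os.cpu_count() or 1)
    return max(1, cpus // parallel_jobs)


# With 16 cores and 4 concurrent jobs, each job can use 4 threads.
print(threads_per_job(4, total_cpus=16))
```

If there are more jobs than cores, the sketch falls back to one thread per job, which is the point where --num-threads stops helping.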

Best matches

Usually, if you want the top hits, --best=3 will select the three lowest evalues for each query sequence. However, if you want "best" to mean, say, the longest query match, you can add --expr='qstart - qend'. (Remember that we are looking for the lowest values, so the longest match gives the most negative result.)
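The selection described above can be imitated in pandas to see what "lowest value per query" means. This is an illustrative sketch, not the blasttools implementation: best_per_query is a hypothetical name, and the query identifier column qaccver is an assumption (check df.columns for the real name). The toy rows are made up.

```python
import pandas as pd


def best_per_query(df, n=3, expr="evalue", query_col="qaccver"):
    """Keep the n rows with the lowest value of expr for each query.

    Hypothetical helper; 'qaccver' is an assumed name for the query id column.
    """
    # Evaluate the ranking expression (e.g. 'qstart - qend') for every row.
    ranked = df.assign(_rank=df.eval(expr))
    ranked = ranked.sort_values([query_col, "_rank"])
    best = ranked.groupby(query_col, sort=False).head(n)
    return best.drop(columns="_rank").reset_index(drop=True)


toy = pd.DataFrame(
    {
        "qaccver": ["q1", "q1", "q1", "q2"],
        "qstart": [1, 1, 10, 5],
        "qend": [50, 200, 30, 100],
        "evalue": [1e-10, 1e-5, 1e-20, 1e-3],
    }
)

# Lowest evalue per query versus longest query match per query.
by_evalue = best_per_query(toy, n=1)
by_length = best_per_query(toy, n=1, expr="qstart - qend")
print(by_evalue)
print(by_length)
```

For q1 the two criteria pick different rows: the lowest evalue belongs to the short match, while 'qstart - qend' favours the match that spans the most query positions.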

XML

Blast offers an xml (--xml) output format that adds the query, match and sbjct strings. The other fields are equivalent to adding --columns='+score gaps nident positive qlen slen'.

It also offers a way to display the blast match as a pairwise alignment.

import pandas as pd
from blasttools.blastxml import hsp_match

# results.csv is assumed to come from a blast run with --xml
df = pd.read_csv('results.csv')
df['alignment'] = df.apply(hsp_match, axis=1)
print(df.iloc[0].alignment)
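To give a feel for what a pairwise alignment display involves, here is a small stand-alone sketch that stacks the query, match and sbjct strings in fixed-width blocks. It is not the hsp_match implementation and its output format is not claimed to match it; format_alignment, the line labels and the example sequences are all hypothetical.

```python
def format_alignment(query: str, match: str, sbjct: str, width: int = 60) -> str:
    """Stack three equal-length alignment strings in blocks of `width` columns.

    Hypothetical helper for illustration only.
    """
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError("query, match and sbjct must have the same length")
    blocks = []
    for start in range(0, len(query), width):
        end = start + width
        blocks.append(
            "\n".join(
                [
                    "Query  " + query[start:end],
                    "       " + match[start:end],
                    "Sbjct  " + sbjct[start:end],
                ]
            )
        )
    # Separate consecutive blocks with a blank line.
    return "\n\n".join(blocks)


# Made-up example strings; width=5 forces the alignment onto two blocks.
text = format_alignment("MKT-AYIAK", "MK+ AYI K", "MKSQAYIGK", width=5)
print(text)
```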
