blasttools
Commands for turning blast queries into pandas dataframes.
Blast against any built blast databases
blasttools blast --out=my.pkl query.fasta my_blastdbs_dir/*.pot
Install
Install with
python -m pip install -U blasttools
# *OR*
python -m pip install -U 'git+https://github.com/arabidopsis/blasttools.git'
Once installed you can update with blasttools update
Common Usages:
Build some blast databases from Ensembl Plants.
blasttools plants --release=40 build triticum_aestivum zea_mays
Find out what species are available:
blasttools plants --release=40 species
Blast against my.fasta and save the dataframe as a pickle file (the default is to save as a csv file named my.fasta.csv).
blasttools plants blast --out=dataframe.pkl my.fasta triticum_aestivum zea_mays
Get your blast data!
import pandas as pd
df = pd.read_pickle('dataframe.pkl')
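Once loaded, the result is an ordinary pandas dataframe, so the usual filtering and grouping apply. The sketch below writes a toy dataframe to a temporary pickle file so that it runs on its own. The column names (qseqid, sseqid, pident, evalue) follow BLAST's tabular output and are an assumption about what blasttools writes, as is the 1e-5 cutoff.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for blast output. The column names follow BLAST's tabular
# format and are an assumption, not taken from blasttools itself.
hits = pd.DataFrame(
    {
        "qseqid": ["q1", "q1", "q2", "q2"],
        "sseqid": ["s1", "s2", "s3", "s4"],
        "pident": [98.5, 71.0, 88.2, 60.1],
        "evalue": [1e-50, 1e-3, 1e-20, 0.5],
    }
)

# Round-trip through a pickle file, as you would with dataframe.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataframe.pkl")
    hits.to_pickle(path)
    df = pd.read_pickle(path)

# Keep only hits at or below an (arbitrary) evalue cutoff,
# then count how many remain for each query sequence.
significant = df[df["evalue"] <= 1e-5]
hits_per_query = significant.groupby("qseqid").size()
print(hits_per_query)
```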
Parallelization
When blasting, you can specify --num-threads, which is passed directly to the underlying blast command. If you want to parallelize over species, databases or fasta files, I suggest you use GNU Parallel [Tutorial]. parallel has a much better set of options for controlling how the parallelization works and is also quite simple for simple things.
e.g. build blast databases from a set of fasta files concurrently:
parallel blasttools build ::: *.fa.gz
Or blast everything!
species=$(blasttools plants species)
parallel blasttools plants build ::: $species
# must have different output files here...
parallel blasttools plants blast --out=my{}.pkl my.fasta ::: $species
# or in batches of 4 species at a time
parallel -N4 blasttools plants blast --out='my{#}.pkl' my.fasta ::: $species
Then gather them all together...
blasttools concat --out=alldone.xlsx my*.pkl && rm my*.pkl
or programmatically:
from glob import glob
import pandas as pd
df = pd.concat([pd.read_pickle(f) for f in glob('my*.pkl')], ignore_index=True)
Remember: if you parallelize your blasts and also use --num-threads > 1, the concurrent jobs will probably end up fighting each other for cpu time!
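One rough way to avoid oversubscribing the machine is to keep (number of concurrent jobs) x (--num-threads) at or below the number of cpus. The helper below is a hypothetical illustration of that arithmetic and is not part of blasttools.

```python
import os
from typing import Optional


def threads_per_job(jobs: int, cpus: Optional[int] = None) -> int:
    """Largest thread count such that jobs * threads <= cpus (never below 1)."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    if cpus is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cpus = os.cpu_count() or 1
    return max(1, cpus // jobs)


print(threads_per_job(jobs=4, cpus=16))  # 4
print(threads_per_job(jobs=8, cpus=4))   # 1
```

With 4 concurrent parallel jobs on a 16 cpu machine this suggests --num-threads=4. When there are more jobs than cpus it falls back to 1 thread per job.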
Best matches
Usually, if you want the top/best matches, --best=3 will select the three hits with the lowest evalues for each query sequence. However, if you want "best" to mean, say, the longest query match, you can add --expr='qstart - qend'. (Remember we are looking for the lowest values, so the longest match gives the most negative number.)
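To make the "lowest value wins" rule concrete, here is a minimal pandas sketch of the same idea applied to an already loaded dataframe. It is not the blasttools implementation: the best_hits helper and the qseqid column name are assumptions made for illustration.

```python
import pandas as pd

# Toy hits for two query sequences (column names are assumed).
hits = pd.DataFrame(
    {
        "qseqid": ["q1", "q1", "q1", "q2", "q2"],
        "sseqid": ["s1", "s2", "s3", "s4", "s5"],
        "qstart": [1, 10, 5, 1, 20],
        "qend": [50, 200, 30, 80, 60],
        "evalue": [1e-30, 1e-10, 1e-5, 1e-8, 1e-40],
    }
)


def best_hits(df, n, expr="evalue"):
    """Keep the n rows per query with the lowest value of expr."""
    # Evaluate the expression per row, sort ascending, take the first n per query.
    ranked = df.assign(_rank=df.eval(expr)).sort_values("_rank", kind="stable")
    return ranked.groupby("qseqid", sort=False).head(n).drop(columns="_rank")


by_evalue = best_hits(hits, 1)                        # lowest evalue per query
by_length = best_hits(hits, 1, expr="qstart - qend")  # longest query span per query
print(by_evalue[["qseqid", "sseqid"]])
print(by_length[["qseqid", "sseqid"]])
```

Here the lowest evalue picks s1 and s5, while 'qstart - qend' picks s2 and s4, because those hits cover the longest stretch of the query.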
XML
Blast offers an xml output format (--xml) that adds query, match and sbjct strings. The other fields are equivalent to adding --columns='+score gaps nident positive qlen slen'.
blasttools also offers a way to display a blast match as a pairwise alignment.
import pandas as pd
from blasttools.blastxml import hsp_match
df = pd.read_csv('results.csv')
df['alignment'] = df.apply(hsp_match, axis=1)
print(df.iloc[0].alignment)
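For readers who want to see how the query, match and sbjct strings relate, the sketch below stacks them into a simple pairwise view. It is a hypothetical illustration only: format_alignment is not a blasttools function, and the real hsp_match output may well look different.

```python
def format_alignment(query: str, match: str, sbjct: str, width: int = 60) -> str:
    """Stack three equal-length alignment strings in blocks of `width` columns."""
    if not (len(query) == len(match) == len(sbjct)):
        raise ValueError("query, match and sbjct must have the same length")
    blocks = []
    for i in range(0, len(query), width):
        blocks.append(
            "\n".join(
                [
                    f"Query  {query[i:i + width]}",
                    f"       {match[i:i + width]}",
                    f"Sbjct  {sbjct[i:i + width]}",
                ]
            )
        )
    # Separate consecutive blocks with a blank line.
    return "\n\n".join(blocks)


# Made-up alignment strings, wrapped at 5 columns to show the blocking.
text = format_alignment("MKT-AYIAK", "MK+ AY+AK", "MKSLAYLAK", width=5)
print(text)
```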