A simple library for executing BLAST searches with ncbi-blast+
Project description
simple_blast
This is a library that provides a (decreasingly) basic wrapper around
ncbi-blast+. Currently, the library supports searches with blastn only, but I
may expand the library to include wrappers for other BLAST executables if I need
them.
Dependencies
- Pandas (>= 1.5.6)
- ncbi-blast+ (>= 2.12.0+)
Optional
- Biopython (>= 1.84, for SAM parsing and support for searching with SeqRecord)
- pyblast4_archive (>= 0.0.4, for decoding query/subject IDs in SAM format)
Basic usage
You can define a blastn search to be carried out using the BlastnSearch
class. BlastnSearchobjects are constructed with two required
arguments—the subject sequence and the query sequence files, in that
order. For example, to set up a blastn search for sequences in seqs1.fasta
against those in seqs2.fasta using output format 12 (Seqalign), you could
construct a BlastnSearch object like this:
from simple_blast import BlastnSearch
search = BlastnSearch(21, "seqs1.fasta", "seqs2.fasta")
The BLAST search is not carried out until you ask for the results by running the
get_output() function.
results = search.get_output()
blastn can output binary data, so the get_output() function appropriately
returns bytes.
Often, it's convenient to use output format 6, a tabular representation of the
HSPs. For that purpose, you can use TabularBlastnSearch.
from simple_blast import TabularBlastnSearch
search = TabularBlastnSearch("seqs1.fasta", "seqs2.fasta")
The hits property of the search returns a Pandas dataframe containing the HSPs
identified in the BLAST search.
results = search.hits
The columns in the output may be configured by passing either the out_columns
or additional_columns arguments when constructing the
TabularBlastnSearch. The former argument overrides the set of output columns;
the latter argument is added to the list of default output columns.
Sequences from memory
simple_blast can handle BLAST searches with sequences stored in memory (i.e.,
not in a file). It works with sequences stored as strings or in
BioPython SeqRecord objects.
from Bio.SeqRecord import SeqRecord, Seq
from simple_blast import TabularBlastnSearch
# Define some data.
subjects = [
SeqRecord(
Seq(
"AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
"TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
"AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
"CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
),
id="My Sequence 1"
),
SeqRecord(
Seq(
"TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
"TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
"CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
"CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
),
id="My Sequence 2"
)
]
queries = [
SeqRecord(
Seq("TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"),
id="Query 1"
)
]
with TabularBlastnSearch.from_sequences(queries, subjects) as search:
results = search.hits
or, using a list of strings:
from simple_blast import TabularBlastnSearch
# Define some data.
subjects = [
(
"AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
"TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
"AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
"CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
),
(
"TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
"TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
"CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
"CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
)
]
queries = ["TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"]
with TabularBlastnSearch.from_sequences(queries, subjects as search:
results = search.hits
When using a list of strings, sequences are automatically named seq_i, where
i is the position of the sequence in the list.
You can use SeqRecords together with lists of strings, and you can also use
in-memory sequences together with files by providing the subject or query
keyword arguments to from_sequences.
TabularBlastnSearch.from_sequences(
subject_seqs=["CATGAACTA"],
query="seqs1.fasta"
)
Since using a context manager is slightly cumbersome, you can also use the
blastn_from_sequences convenience function to get the hits for a search.
from simple_blast import blastn_from_sequences
# Define some data.
subjects = [
(
"AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
"TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
"AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
"CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
),
(
"TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
"TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
"CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
"CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
)
]
queries = ["TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"]
results = blastn_from_sequences(queries, subjects)
Note: Searching from in-memory sequences is implemented using Unix FIFOs, so this feature currently will not work on Windows.
DB caches
When the same sequence file is used as a subject in multiple searches, it can be
efficient to build a BLAST database up front. The BlastDBCache class can be
used to handle this mostly automatically. To make a BlastDBCache, you need
to specify the location of the on the file system.
from simple_blast import BlastDBCache
cache = BlastDBCache("cache_dir")
To add a file to the cache, use the makedb method.
cache.makedb("seqs2.fasta")
When constructing a BlastnSearch object, give it the BlastDBCache as the
db_cache parameter to make the BlastnSearch object use the cache for
searches.
search = BlastnSearch(12, "seqs1.fasta", "seqs2.fasta", db_cache=cache)
Now search will use the database we created for seqs2.fasta.
Explicit database searches
Rather than searching against a FASTA file or a database created implicitly with
BlastDBCache, you can also explicitly specify a database to query with the db
keyword argument.
search = BlastnSearch(12, "seqs1.fasta", db="mydb")
Remote searches
You can query the NCBI databases remotely using the remote parameter.
search = BlastnSearch(12, "seqs1.fasta", db="nr", remote=True)
Format conversions
It's sometimes useful to convert between different BLAST output
formats. ncbi-blast+ comes with a utility, blast_formatter, that can convert
output in the "Blast4 Archive" format (ASN.1, output format 11) to any other
BLAST format.
Using blast_formatter with simple_blast.convert
You can use blast_formatter directly with the simple_blast.convert
module. For example,
from simple_blast.convert import blast_format_file
# Convert to output format 11.
blast_format_file(12, "my_blast_results.asn1", "my_blast_results.json")
If you don't specify the output file, you can get the output as bytes.
seqalign_bytes = blast_format_file(12, "my_blast_results.asn1")
You can also use the similar blast_format_bytes to provide bytes as input.
Using MultiformatBlastnSearch
You can create a search with output format 11 using the
MultiformatBlastnSearch class.
from simple_blast.multiformat import MultiformatBlastnSearch
search = MultiformatBlastnSearch("seqs1.fasta", "seqs2.fasta")
You can convert the output to another format using the to method.
seqalign_bytes = search.to(12)
For output formats with an associated subclass of BlastnSearch, you can also
convert directly to that subclass with to_search..
tabular_search = search.to_search(6)
results = tabular_search.hits # A Pandas DataFrame
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file simple_blast-0.7.0.tar.gz.
File metadata
- Download URL: simple_blast-0.7.0.tar.gz
- Upload date:
- Size: 87.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
08244f242022a687663e2d1bfeb5c1801412a1e9f19627c97f193b1ce6db7e94
|
|
| MD5 |
d752d57850bfa14d21905d4b1e5037a0
|
|
| BLAKE2b-256 |
54f967764fecd1cb206a458afa2711e5bf57949642531f0b2f93f4ebe1028d5a
|
Provenance
The following attestation bundles were made for simple_blast-0.7.0.tar.gz:
Publisher:
package.yml on actapia/simple_blast
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
simple_blast-0.7.0.tar.gz -
Subject digest:
08244f242022a687663e2d1bfeb5c1801412a1e9f19627c97f193b1ce6db7e94 - Sigstore transparency entry: 226913302
- Sigstore integration time:
-
Permalink:
actapia/simple_blast@c16273b5f1e18ab9346e35be07124b422b275f51 -
Branch / Tag:
refs/tags/v0.7.0 - Owner: https://github.com/actapia
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
package.yml@c16273b5f1e18ab9346e35be07124b422b275f51 -
Trigger Event:
release
-
Statement type:
File details
Details for the file simple_blast-0.7.0-py3-none-any.whl.
File metadata
- Download URL: simple_blast-0.7.0-py3-none-any.whl
- Upload date:
- Size: 19.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
9a1ef412dcdfb21795e50f8a12aec3c1e00efb31881b946b2205b44b28d34db7
|
|
| MD5 |
0b29a43cfdafcee6953ba46bce74f4fd
|
|
| BLAKE2b-256 |
ae1862508d82d9a88ba8ba0288e5ac1278bdedf22842aec46298fe93d312eed6
|
Provenance
The following attestation bundles were made for simple_blast-0.7.0-py3-none-any.whl:
Publisher:
package.yml on actapia/simple_blast
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
simple_blast-0.7.0-py3-none-any.whl -
Subject digest:
9a1ef412dcdfb21795e50f8a12aec3c1e00efb31881b946b2205b44b28d34db7 - Sigstore transparency entry: 226913303
- Sigstore integration time:
-
Permalink:
actapia/simple_blast@c16273b5f1e18ab9346e35be07124b422b275f51 -
Branch / Tag:
refs/tags/v0.7.0 - Owner: https://github.com/actapia
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
package.yml@c16273b5f1e18ab9346e35be07124b422b275f51 -
Trigger Event:
release
-
Statement type: