A simple library for executing BLAST searches with ncbi-blast+

simple_blast

This is a library that provides a (decreasingly) basic wrapper around ncbi-blast+. Currently, the library supports searches with blastn only, but I may expand the library to include wrappers for other BLAST executables if I need them.

Dependencies

  • Pandas (>= 1.5.6)
  • ncbi-blast+ (>= 2.12.0+)

Optional

  • Biopython (>= 1.84, for SAM parsing and support for searching with SeqRecord)
  • pyblast4_archive (>= 0.0.7, for conversion to SAM format)
  • pysam (>= 0.23.1, for conversion to SAM format)

Basic usage

You can define a blastn search using the BlastnSearch class. BlastnSearch objects are constructed with three required arguments: the output format, the query sequence file, and the subject sequence file, in that order. For example, to set up a blastn search for sequences in seqs1.fasta against those in seqs2.fasta using output format 12 (Seqalign), you could construct a BlastnSearch object like this:

from simple_blast import BlastnSearch

search = BlastnSearch(12, "seqs1.fasta", "seqs2.fasta")

The BLAST search is not carried out until you ask for the results by running the get_output() function.

results = search.get_output()

blastn can output binary data, so the get_output() function appropriately returns bytes.
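Because get_output() returns raw bytes, decoding is left to the caller. The following is a minimal sketch, assuming the search used a JSON-based output format such as 12. The helper name decode_json_output and the sample payload are hypothetical stand-ins; they are not part of simple_blast and do not reflect the real structure of BLAST output.

```python
import json


def decode_json_output(raw: bytes):
    """Decode bytes that are assumed to hold UTF-8 encoded JSON."""
    return json.loads(raw.decode("utf-8"))


# Stand-in for the bytes a search might return; not real BLAST output.
sample = b'{"placeholder": {"hits": 2}}'
parsed = decode_json_output(sample)
print(parsed["placeholder"]["hits"])
```

For binary formats, the bytes would instead be written to a file or handed to a format-specific parser.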

Often, it's convenient to use output format 6, a tabular representation of the HSPs. For that purpose, you can use TabularBlastnSearch.

from simple_blast import TabularBlastnSearch

search = TabularBlastnSearch("seqs1.fasta", "seqs2.fasta")

The hits property of the search returns a Pandas dataframe containing the HSPs identified in the BLAST search.

results = search.hits

The columns in the output may be configured by passing either the out_columns or the additional_columns argument when constructing the TabularBlastnSearch. The former replaces the default set of output columns; the latter appends to it.
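To make the difference between the two arguments concrete, here is a small standard-library sketch of the semantics described above. It assumes the defaults are BLAST's standard twelve columns for output format 6 and that out_columns wins if both arguments are given. The functions resolve_columns and parse_tabular are hypothetical illustrations, not simple_blast's implementation, and the sample row is made up.

```python
import csv
import io

# BLAST+'s default columns for -outfmt 6 (assumed to be the library's defaults).
BLAST_DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def resolve_columns(out_columns=None, additional_columns=None):
    """Return the column list: override with out_columns, else extend defaults."""
    if out_columns is not None:
        # Assumption: an explicit override takes precedence over additions.
        return list(out_columns)
    return BLAST_DEFAULT_COLUMNS + list(additional_columns or [])


def parse_tabular(text, columns):
    """Parse tab-separated HSP rows into dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]


# Appending "qlen" (query length) to the defaults gives thirteen columns.
columns = resolve_columns(additional_columns=["qlen"])

# A made-up row in that layout; the e-value and bit score are placeholders.
sample = "seq_0\tseq_0\t100.000\t64\t0\t0\t1\t64\t71\t134\t1e-30\t119\t64\n"
rows = parse_tabular(sample, columns)
print(rows[0]["sseqid"], rows[0]["qlen"])
```

In practice the hits property already does the parsing and returns a DataFrame; the sketch only shows how the column list lines up with the tab-separated fields.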

Sequences from memory

simple_blast can handle BLAST searches with sequences stored in memory (i.e., not in a file). It works with sequences stored as strings or in Biopython SeqRecord objects.

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from simple_blast import TabularBlastnSearch

# Define some data.
subjects = [
    SeqRecord(
        Seq(
            "AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
            "TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
            "AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
            "CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
        ),
        id="My Sequence 1"
    ),
    SeqRecord(
        Seq(
            "TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
            "TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
            "CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
            "CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
        ),
        id="My Sequence 2"
    )
]
queries = [
    SeqRecord(
        Seq("TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"),
        id="Query 1"
    )
]

with TabularBlastnSearch.from_sequences(queries, subjects) as search:
    results = search.hits

or, using a list of strings:

from simple_blast import TabularBlastnSearch

# Define some data.
subjects = [
    (
        "AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
        "TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
        "AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
        "CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
    ),
    (
        "TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
        "TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
        "CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
        "CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
    )
]
queries = ["TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"]

with TabularBlastnSearch.from_sequences(queries, subjects) as search:
    results = search.hits

When using a list of strings, sequences are automatically named seq_i, where i is the position of the sequence in the list.
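As an illustration of this naming scheme, the sketch below renders a list of strings as FASTA text with headers seq_0, seq_1, and so on. The to_fasta helper is hypothetical, and starting the numbering at 0 is an assumption; the text above only says that i is the position in the list.

```python
def to_fasta(seqs, prefix="seq_"):
    """Render plain sequence strings as FASTA text, one record per string."""
    # Assumption: positions are counted from 0, as in Python list indexing.
    return "".join(f">{prefix}{i}\n{seq}\n" for i, seq in enumerate(seqs))


print(to_fasta(["CATGAACTA", "TTGACC"]), end="")
```

These generated names are what you would expect to see in the query and subject ID columns of the hits.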

You can use SeqRecords together with lists of strings, and you can also use in-memory sequences together with files by providing the subject or query keyword arguments to from_sequences.

TabularBlastnSearch.from_sequences(
    subject_seqs=["CATGAACTA"],
    query="seqs1.fasta"
)

Since using a context manager is slightly cumbersome, you can also use the blastn_from_sequences convenience function to get the hits for a search.

from simple_blast import blastn_from_sequences

# Define some data.
subjects = [
    (
        "AAGGCGTACGGGCCTTTCGCTTCCGAAAACTTCCTCTTAGGTCGCTGTTACTGGATGTCGAGTCAGCACA"
        "TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGTATCGGT"
        "AACAAATATCTTTGGGATACTACAGGAATATCCGTAGGAGTTCGCCGCGATTAGGTGCCTCGATGATATG"
        "CAGCTGTCACTGGAGATAACACACTATGCAGCAGTAATGGATGTTATTGCTACTAAGGTTCCCTGTCACC"
    ),
    (
        "TTCATTGGTGGGCTTTCTGGTTCACGCCCATCTCAATGTACATTTTCCGTGACGTGATGATAATCATAAC"
        "TCGTTGGTAGTAATAGGGTAAGGGAATTTGGCAGGTAGTCGGGGCAAGACTGCCGTTACAAGCTAATCAT"
        "CTGCCAACTAACTTTAGCCGTAATTGGCACTAACAGTTAACCTTCGCGCGTTTCTCAGTGTAGAGTGAGA"
        "CTATGTGATTACTTTCAGCGCCCAGCGGTGGTAGGTAGTAAAAAGTGGCCACCGAACCGAATGCT"
    )
]
queries = ["TGGGAAACTCCACGCATCGGCGGGATTTCACAACGCCTAGAACACCGGTAATGCGAGTATCCGT"]

results = blastn_from_sequences(queries, subjects)

Note: Searching from in-memory sequences is implemented using Unix FIFOs, so this feature currently will not work on Windows.
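For readers unfamiliar with FIFOs, the following sketch shows the general technique of passing in-memory text to a consumer through a named pipe, so that nothing is written to a regular file on disk. It is a generic illustration using only the standard library, not simple_blast's actual code; the function names are hypothetical, and here the reader is the same Python process rather than blastn.

```python
import os
import tempfile
import threading


def _write_text(path, text):
    # Opening a FIFO for writing blocks until a reader opens the other end,
    # which is why the writer runs in its own thread.
    with open(path, "w") as fifo:
        fifo.write(text)


def pass_through_fifo(text):
    """Send text through a named pipe and return what the reader receives."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        fifo_path = os.path.join(tmp_dir, "seqs.fasta")
        os.mkfifo(fifo_path)  # Unix only; os.mkfifo is not available on Windows.
        writer = threading.Thread(target=_write_text, args=(fifo_path, text))
        writer.start()
        # A real consumer would be an external program given fifo_path as input.
        with open(fifo_path) as reader:
            received = reader.read()
        writer.join()
    return received


fasta = ">seq_0\nCATGAACTA\n"
print(pass_through_fifo(fasta) == fasta)
```

The call to os.mkfifo is the step that has no Windows equivalent, which is consistent with the platform limitation noted above.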

DB caches

When the same sequence file is used as a subject in multiple searches, it can be efficient to build a BLAST database up front. The BlastDBCache class can be used to handle this mostly automatically. To make a BlastDBCache, you need to specify the location of the cache on the file system.

from simple_blast import BlastDBCache

cache = BlastDBCache("cache_dir")

To add a file to the cache, use the makedb method.

cache.makedb("seqs2.fasta")

When constructing a BlastnSearch object, give it the BlastDBCache as the db_cache parameter to make the BlastnSearch object use the cache for searches.

search = BlastnSearch(12, "seqs1.fasta", "seqs2.fasta", db_cache=cache)

Now search will use the database we created for seqs2.fasta.
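Building a database up front corresponds to running ncbi-blast+'s makeblastdb tool once per subject file. The sketch below only assembles such a command line; the makeblastdb_command helper and the choice to name the database after the file stem inside the cache directory are assumptions for illustration, and BlastDBCache may organize its files differently.

```python
from pathlib import Path


def makeblastdb_command(fasta, cache_dir):
    """Build a makeblastdb command for a nucleotide FASTA file."""
    fasta = Path(fasta)
    # Hypothetical naming scheme: <cache_dir>/<file name without extension>.
    db_path = Path(cache_dir) / fasta.stem
    return [
        "makeblastdb",
        "-in", str(fasta),
        "-dbtype", "nucl",
        "-out", str(db_path),
    ]


print(" ".join(makeblastdb_command("seqs2.fasta", "cache_dir")))
```

The resulting list could be passed to subprocess.run on a machine with ncbi-blast+ installed.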

Explicit database searches

Rather than searching against a FASTA file or a database created implicitly with BlastDBCache, you can also explicitly specify a database to query with the db keyword argument.

search = BlastnSearch(12, "seqs1.fasta", db="mydb")

Remote searches

You can query the NCBI databases remotely using the remote parameter.

search = BlastnSearch(12, "seqs1.fasta", db="nt", remote=True)
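It can help to see how these options relate to the blastn command line. The sketch below builds an argument list from the same kinds of inputs (output format, query, and either a subject file or a database, optionally remote). The blastn_command helper is hypothetical and is not necessarily the command simple_blast runs internally; only the flags themselves (-outfmt, -query, -subject, -db, -remote) are standard blastn options.

```python
def blastn_command(out_format, query, subject=None, db=None, remote=False):
    """Build a blastn argument list for a subject file or a database search."""
    if (subject is None) == (db is None):
        raise ValueError("give exactly one of subject or db")
    if remote and db is None:
        raise ValueError("remote searches need a database name")
    command = ["blastn", "-outfmt", str(out_format), "-query", query]
    if subject is not None:
        command += ["-subject", subject]
    else:
        command += ["-db", db]
    if remote:
        command.append("-remote")
    return command


print(" ".join(blastn_command(12, "seqs1.fasta", subject="seqs2.fasta")))
print(" ".join(blastn_command(12, "seqs1.fasta", db="mydb")))
print(" ".join(blastn_command(12, "seqs1.fasta", db="mydb", remote=True)))
```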

Format conversions

It's sometimes useful to convert between different BLAST output formats. ncbi-blast+ comes with a utility, blast_formatter, that can convert output in the "Blast4 Archive" format (ASN.1, output format 11) to any other BLAST format.

Using blast_formatter with simple_blast.convert

You can use blast_formatter directly with the simple_blast.convert module. For example,

from simple_blast.convert import blast_format_file

# Convert a Blast4 archive (output format 11) to output format 12.
blast_format_file(12, "my_blast_results.asn1", "my_blast_results.json")

If you don't specify the output file, you can get the output as bytes.

seqalign_bytes = blast_format_file(12, "my_blast_results.asn1")

You can also use the similar blast_format_bytes to provide bytes as input.
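For reference, the underlying blast_formatter tool takes the archive through -archive, the target format through -outfmt, and an optional destination through -out. The sketch below assembles that command line; blast_formatter_command is a hypothetical helper for illustration, not a function in simple_blast.convert.

```python
def blast_formatter_command(out_format, archive, output=None):
    """Build a blast_formatter command converting an archive to out_format."""
    command = ["blast_formatter", "-archive", archive, "-outfmt", str(out_format)]
    if output is not None:
        # Without -out, blast_formatter writes the converted result to stdout.
        command += ["-out", output]
    return command


print(" ".join(blast_formatter_command(12, "my_blast_results.asn1", "my_blast_results.json")))
print(" ".join(blast_formatter_command(6, "my_blast_results.asn1")))
```

Leaving out the output file mirrors the bytes-returning behavior described above, since the converted result can then be captured from standard output.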

Using MultiformatBlastnSearch

You can create a search with output format 11 using the MultiformatBlastnSearch class.

from simple_blast.multiformat import MultiformatBlastnSearch

search = MultiformatBlastnSearch("seqs1.fasta", "seqs2.fasta")

You can convert the output to another format using the to method.

seqalign_bytes = search.to(12)

For output formats with an associated subclass of BlastnSearch, you can also convert directly to that subclass with to_search.

tabular_search = search.to_search(6)
results = tabular_search.hits # A Pandas DataFrame
