BLAST Utilities
Project description
blastutils: BLAST Utilities
When using BLAST, you often need to manipulate the results, such as reading, sorting, and filtering. Biopython is a good choice, but the newer version of Biopython seems to be phasing out support for tab-delimited text files, opting for XML and XML2 formats instead. This change brings several issues:
- XML text is not very human-readable.
- XML files are significantly larger than tab-delimited text files. I encountered a case where the results file was 1.5TB in XML format, while the tab-delimited version was only 40GB.
- Bio.Blast.parse has a bug that sometimes raises a CorruptedXMLError (this is critical).
- Biopython's support for sorting, filtering, and other operations is somewhat weak.
Installation
pip install blastutils
HSP, Hit, Record
HSP
In BLAST, an HSP (High-Scoring Segment Pair) refers to a segment of alignment between two sequences that exhibits high similarity based on certain evaluation criteria such as score, e-value, or sequence coverage. HSPs are often the most important part of a BLAST result, as they represent the strongest regions of similarity between two sequences. The HSP object has 11 attributes:
qstart– The start position of the alignment on the query sequence.qend– The end position of the alignment on the query sequence.sstart– The start position of the alignment on the subject sequence.send– The end position of the alignment on the subject sequence.mismatch– The number of mismatches in the alignment.gapopen– The number of gaps in the alignment.length– The length of the alignment.pident– The percentage of identical matches.qcovhsp– The coverage of the query sequence by this HSP.bitscore– The bit score of the alignment.evalue– The e-value of the alignment.
Hit
In BLAST, a Hit refers to the result of an alignment between a query sequence and a subject sequence from the database. A Hit typically contains one or more HSPs. The Hit object has 3 attributes:
sseqid– The identifier of the subject sequence.slen– The length of the subject sequence.hsps– The HSPs associated with the hit.
Record
A Record represents all the results of a single query sequence, typically containing one or more Hits. The Record object has 3 attributes:
qseqid– The identifier of the query sequence.qlen– The length of the query sequence.hits– The Hits associated with the record.
Usage
Reading and writing BLAST result file
Note: To parse BLAST alignment results using the BlastOutputFile, the outfmt parameter must be set to
6 qseqid qlen sseqid slen qstart qend sstart send mismatch gapopen length pident qcovhsp bitscore evalue.
from blastutils import BlastOutputFile, Reader, Writer
with BlastOutputFile('example.txt') as file:
reader = Reader(file)
records = []
for record in reader:
records.append(record)
with BlastOutputFile('new-example.txt', 'w') as file:
writer = Writer(file)
for record in records:
writer.write(record)
BlastOutputFile is a wrapper for BLAST result files and has two parameters:
path– The file path.mode– The mode for opening the file; 'r' for reading (default), 'w' for writing.
You can use it with a context manager as shown above, or manage the file manually:
from blastutils import BlastOutputFile, Reader
file = BlastOutputFile('example.txt')
file.open()
row = file.read() # ['seq1', 100, 'ref1', 100, 1, 100, 1, 100, 0, 0, 100, 100.0, 100.0, 185.0, 8.22e-50]
file.close()
Pass the BlastOutputFile object to the Reader to read the results. The Reader is an iterator, and you can iterate over it using a for loop, or use next():
from blastutils import BlastOutputFile, Reader
file = BlastOutputFile('example.txt')
file.open()
reader = Reader(file)
record1 = next(reader)
record2 = next(reader)
file.close()
Filtering
Filtering is a very common task, for example, filtering out Hits with low similarity or low coverage. However, filtering criteria can vary widely, so blastutils does not provide a lot of built-in filters. Instead, it provides an API for users to implement custom filtering rules.
blastutils defines a Filter base class, and all subclasses must implement the __call__ method, which takes a Hit object as input and returns True if the Hit should be kept, or False if it should be filtered out.
For example, to implement a filtering rule that removes Hits with an HSP's identity (pident) lower than 90:
from blastutils import Filter
class MinSimilarity(Filter):
def __init__(self, threshold):
self.threshold = threshold
def __call__(self, hit):
if hit.is_empty(): # Filter out Hits with no HSPs
return False
return hit.hsps[0].pident >= self.threshold
record.filter(MinSimilarity(90)) # Filter out Hits with identity below 90
Sorting
After obtaining BLAST results, a common task is to find the best match for each query sequence. One common approach is to sort all the Hits and select the first one. Like filtering, sorting criteria can vary, and blastutils does not provide a universal sorting method. Instead, it defines a Compare base class, and all subclasses must implement the __call__ method, which takes two Hit objects as input:
- Return
-1(or any value less than0) ifhit1is better thanhit2. - Return
0ifhit1andhit2are equally good. - Return
1(or any value greater that 0) ifhit2is better thanhit1.
For example, to implement a sorting rule that prefers Hits with smaller e-values and larger bit scores:
from blastutils import Compare
class ByEvalueBitscore(Compare):
def __call__(self, hit1, hit2):
if hit2.is_empty(): # If hit2 has no HSP, hit1 is better
return -1
if hit1.is_empty(): # If hit1 has no HSP, hit2 is better
return 1
hsp1 = hit1.hsps[0]
hsp2 = hit2.hsps[0]
if hsp1.evalue < hsp2.evalue:
return -1
if hsp1.evalue > hsp2.evalue:
return 1
return -1 if hsp1.bitscore >= hsp2.bitscore else 1
record.best(ByEvalueBitscore()) # You can get the best Hit without sorting
record.sort(ByEvalueBitscore()) # Sort the Hits
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file blastutils-1.0.0.tar.gz.
File metadata
- Download URL: blastutils-1.0.0.tar.gz
- Upload date:
- Size: 7.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
6a547898772e061475116a10ca805cc72a912e3c9da738a4178dc574db14a939
|
|
| MD5 |
7768064ea17933462affa61e1d49e887
|
|
| BLAKE2b-256 |
96385fdc3198c1e812f07c59137ed749714dc8b40c57e0348a2757149f5ca875
|
File details
Details for the file blastutils-1.0.0-py2.py3-none-any.whl.
File metadata
- Download URL: blastutils-1.0.0-py2.py3-none-any.whl
- Upload date:
- Size: 7.1 kB
- Tags: Python 2, Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.2
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
cb36dc17e11ec7a0a32b3f25fa9b8ab875e5970b9254dfd7f9af58e3bb78e06f
|
|
| MD5 |
e28357982c03ce748513b01b2a947973
|
|
| BLAKE2b-256 |
87bb3b55cbd80cf17becfecc6c8890aa83fd763f19d64db5e0c44778e5153d77
|