Filtering sam/bam files by percent identity or percent of matched sequence
Tools to filter alignments in SAM/BAM files by percent identity or percent of matched sequence.
Percent identity is computed as:
$$PI = 100 \times \frac{N_m}{N_m + N_i}$$
where $N_m$ is the number of matches and $N_i$ is the number of mismatches.
Percent of matched sequence is computed as:
$$PM = 100 \times \frac{N_m}{L}$$
where $N_m$ is the number of matches and $L$ is the query sequence length.
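As a quick illustration of the two measures, here is a minimal sketch. It assumes percent identity is 100 * Nm / (Nm + Ni) and percent of matched sequence is 100 * Nm / L, as the definitions above suggest. The function names are hypothetical and are not part of the filtersam API.

```python
def percent_identity(n_matches: int, n_mismatches: int) -> float:
    """Percent identity: matches over matches plus mismatches (assumed formula)."""
    aligned = n_matches + n_mismatches
    if aligned == 0:
        return 0.0  # no aligned bases: report 0% rather than divide by zero
    return 100.0 * n_matches / aligned


def percent_matched(n_matches: int, query_length: int) -> float:
    """Percent of matched sequence: matches over query length (assumed formula)."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return 100.0 * n_matches / query_length


# A 100-base query with 76 matches and 4 mismatches (the other 20 bases unaligned)
pi = percent_identity(76, 4)   # 95.0
pm = percent_matched(76, 100)  # 76.0
```

The same alignment can therefore pass a 95% identity cutoff while covering only about three quarters of the query, which is why the two filters are offered separately.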
NOTES
BAM/SAM files must contain MD tags for filtering by percent identity. Aligners such as BWA add an MD tag to each aligned query sequence. MD tags can also be generated afterwards with samtools (the calmd subcommand).
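The MD tag matters because the CIGAR "M" operation does not distinguish matches from mismatches, whereas the MD string records each mismatched reference base. The sketch below counts matches and mismatches from an MD string. It is an illustration only: the helper name is hypothetical, and how filtersam itself counts them is not shown here.

```python
import re
from typing import Tuple

# MD tokens: a run of matching bases, a deletion (^ plus bases), or one mismatch
_MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def count_md(md_tag: str) -> Tuple[int, int]:
    """Return (n_matches, n_mismatches) parsed from an MD tag value."""
    n_matches = 0
    n_mismatches = 0
    for run, _deletion, mismatch in _MD_TOKEN.findall(md_tag):
        if run:
            n_matches += int(run)
        elif mismatch:
            n_mismatches += 1
        # Deletions consume reference bases only; this sketch ignores them.
    return n_matches, n_mismatches


# 10 matches, mismatch at reference base A, 5 matches, deletion of AC, 6 matches
counts = count_md("10A5^AC6")  # (21, 1)
```

Whether deletions and insertions should count against identity depends on the definition of percent identity being used. This sketch leaves them out.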
Dependencies
Installation
pip3 install filtersam
It is best to install the package within an isolated environment, such as a conda environment, to avoid path conflicts with the included bash scripts.
TODO
- Make it command line callable
- If possible, add a dedicated tag to each BAM/SAM record containing the computed percent identity
- Include several definitions of percent identity and/or let the user define one
Usage
This package contains two main functions, filterSAMbyIdentity and filterSAMbyPercentMatched, which filter BAM files by percent identity or percent of matched sequence, respectively.
To exemplify its usage, let's filter a BAM file by percent identity and percent of matched sequence.
from filtersam.filtersam import filterSAMbyIdentity, filterSAMbyPercentMatched

# Keep alignments with percent identity greater than or equal to 95%
filterSAMbyIdentity(input_path='ERS491274.bam',
                    output_path='ERS491274_PI95.bam',
                    identity_cutoff=95)

# Keep alignments with percent of matched sequence greater than or equal to 50%
filterSAMbyPercentMatched(input_path='ERS491274.bam',
                          output_path='ERS491274_PM50.bam',
                          matched_cutoff=50)
Parallelizing filtersam
Filtering large BAM files can take a while. However, filtersam can be parallelized with an additional Python package: parallelbam. In effect, parallelbam splits a large BAM file into chunks and calls filtersam on each chunk in a dedicated process.
Let's try this out by running the percent identity filter above in 8 processes.
from parallelbam.parallelbam import parallelizeBAMoperation, getNumberOfReads

# Keep alignments with percent identity greater than or equal to 95%, in parallel
parallelizeBAMoperation('ERS491274.bam',
                        callback=filterSAMbyIdentity,
                        callback_additional_args=[95],
                        n_processes=8,
                        output_path='ERS491274_PI95_parallel.bam')
We can then check whether the filtered BAM files produced in a single process and in parallel contain the same number of segments, using the getNumberOfReads function of parallelbam.
# Number of segments in the original bam
getNumberOfReads('ERS491274.bam')
1113119
# Number of segments in the single-process PI-filtered bam file
getNumberOfReads('ERS491274_PI95.bam')
11384
# Number of segments in the parallelized PI-filtered bam file
getNumberOfReads('ERS491274_PI95_parallel.bam')
11384
We see that both bam files contain the same number of (filtered) segments (fewer than in the original bam file).
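The split, filter, and merge pattern behind this result can be sketched on plain Python lists. This is not parallelbam's implementation. The function names are hypothetical, the callback filters numbers rather than BAM records, and the executor class is a parameter so that the sketch can also run with threads.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def keep_above(chunk, cutoff):
    """Stand-in for a filter callback: keep values >= cutoff."""
    return [value for value in chunk if value >= cutoff]


def filter_in_chunks(items, cutoff, n_chunks, executor_cls=ProcessPoolExecutor):
    """Filter each chunk in its own worker, then merge the results in order."""
    chunks = split_into_chunks(items, n_chunks)
    with executor_cls(max_workers=n_chunks) as pool:
        filtered = pool.map(partial(keep_above, cutoff=cutoff), chunks)
        return [value for chunk in filtered for value in chunk]
```

Because each record is kept or dropped independently of the others, filtering chunk by chunk and concatenating the results yields the same records as a single pass, which is consistent with the identical read counts above. When using the default process pool from a script, call filter_in_chunks under an `if __name__ == "__main__":` guard.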
Command-line usage
Filtersam can also be called as a command line program in the following way:
filtersam [-h] [-i] [-m] [-p] [-o] bam
where bam is the path to the bam/sam file.
Run filtersam --help to display help text about the arguments.