SAMsift - sift your alignments
Project description
SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments using Python expressions.
Getting started
# clone this repo and add it to PATH
git clone http://github.com/karel-brinda/samsift
cd samsift
export PATH=$(pwd)/samsift:$PATH
# filtering: keep only alignments with score >94, save them as filtered.bam
samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
# filtering: keep only unaligned reads
samsift -i tests/test.bam -f 'FLAG & 0x04'
# filtering: keep only aligned reads
samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
# filtering: keep only sequences containing ACCAGAGGAT
samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
# filtering: keep only sequences containing A and T only (defined using regular expressions)
samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
# filtering: sample alignments with 25% rate
samsift -i tests/test.bam -f 'random.random()<0.25'
# filtering: sample alignments with 25% rate with a fixed RNG seed
samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
# filtering: keep only alignments of reads specified in tests/qnames.txt
samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
# filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
# tagging: add tags 'ln' with sequence length and 'ab' with average base quality
samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
# tagging: add a tag 'ii' with the number of the current alignment
samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
# updating: removing sequences and base qualities
samsift -i tests/test.bam -c 'a.query_sequence=""'
# updating: switching all reads to unaligned
samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'
Installation
Using Bioconda:
# add all necessary Bioconda channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
# install samsift
conda install samsift
Using PIP from PyPI:
pip install --upgrade samsift
Using PIP from Github:
pip install --upgrade git+https://github.com/karel-brinda/samsift
Command-line parameters
Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions)
Version: 0.2.5
Author: Karel Brinda <kbrinda@hsph.harvard.edu>
Usage: samsift.py [-i FILE] [-o FILE] [-f PY_EXPR] [-c PY_CODE] [-m STR]
[-0 PY_CODE] [-d PY_EXPR] [-t PY_EXPR]
Basic options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-i FILE input SAM/BAM file [-]
-o FILE output SAM/BAM file [-]
-f PY_EXPR filtering expression [True]
-c PY_CODE code to be executed (e.g., assigning new tags) [None]
-m STR mode: strict (stop on first error)
nonstop-keep (keep alignments causing errors)
nonstop-remove (remove alignments causing errors) [strict]
Advanced options:
-0 PY_CODE initialization [None]
-d PY_EXPR debugging expression to print [None]
-t PY_EXPR debugging trigger [True]
Algorithm
exec(INITIALIZATION)
for ALIGNMENT in ALIGNMENTS:
if eval(DEBUG_TRIGER):
print(eval(DEBUG_EXPR))
if eval(FILTER):
exec(CODE)
print(ALIGNMENT)
Python expressions and code. All expressions and code should be valid with respect to Python 3. Expressions are evaluated using the eval function and code is executed using the exec function. Initialization can be used for importing Python modules, setting global variables (e.g., counters) or loading data from disk. Some modules (namely random and re) are loaded without an explicit request.
Example (printing all alignments):
samsift -i tests/test.bam -f 'True'
SAM fields. Expressions and code can access variables mirroring the fields from the alignment section of the SAM specification, i.e., QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL. Several additional variables are defined to simply accessing some useful information: QUALa stores the base qualities as an integer array; SEQs, QUALs, QUALsa skip soft-clipped bases; and RNAMEi and RNEXTi store the reference ids as integers.
Example (keeping only the alignments with leftmost position <= 10000):
samsift -i tests/test.bam -f 'POS<=10000'
SAMsift internally uses the PySam library and the representation of the current alignment (an instance of the class pysam.AlignedSegment) is available as a variable a. Therefore, the previous example is equivalent to
samsift -i tests/test.bam -f 'a.reference_start+1<=10000'
The a variable can also be used for modifying the current alignment record.
Example (removing the sequence and the bases from every record):
samsift -i tests/test.bam -c 'a.query_sequence=""'
SAM tags. Every SAM tag is translated to a variable with the same name.
Example (removing alignments with a score smaller or equal to the sequence length):
samsift -i tests/test.bam -f 'AS>len(SEQ)'
If CODE is provided, all two-letter variables except re (the Python regex module) are back-translated to tags after the code execution.
Example (adding a tag ab carrying the average base quality):
samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'
Errors. If an error occurs during an evalution of an expression or an execution of a code (e.g., due to accessing an undefined tag), then SAMsift behavior depends on the specified mode (-m). With the strict mode (-m strict, default), SAMsift will immediately interrupt the computation and report an error. With the -m nonstop-keep option, SAMsift will continue processing the alignments while keeping the error-causing alignments in the output. With the -m nonstop-remove option, all error-causing alignments are skipped and ommited from the output.
Similar programs
samtools view can filter alignments based on FLAGS, read group tags, and CIGAR strings.
sambamba view supports, in addition to SAMtools, a filtration using simple Perl-like expressions. However, it is not possible to use floats or compare different tags.
BamQL provides a simple query language for filtering SAM/BAM files.
bamPals adds tags XB, XE, XP and XL.
SamJavascript can filter alignments using JavaScript expressions.
Picard FilterSamReads can also filter alignments using JavaScript expressions.
Issues
Please use Github issues.
Changelog
See Releases.
Licence
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.