Advanced filtering and tagging of SAM/BAM alignments using Python expressions

https://github.com/karel-brinda/samsift/actions/workflows/ci.yml/badge.svg https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat-square https://badge.fury.io/py/samsift.svg https://zenodo.org/badge/DOI/10.5281/zenodo.1048211.svg

SAMsift is a program for advanced filtering and tagging of SAM/BAM alignments using Python expressions.

Getting started

# clone this repo and add it to PATH
git clone http://github.com/karel-brinda/samsift
cd samsift
export PATH=$(pwd)/samsift:$PATH

# filtering: keep only alignments with score >94, save them as filtered.bam
samsift -i tests/test.bam -o filtered.bam -f 'AS>94'
# filtering: keep only unaligned reads
samsift -i tests/test.bam -f 'FLAG & 0x04'
# filtering: keep only aligned reads
samsift -i tests/test.bam -f 'not(FLAG & 0x04)'
# filtering: keep only sequences containing ACCAGAGGAT
samsift -i tests/test.bam -f 'SEQ.find("ACCAGAGGAT")!=-1'
# filtering: keep only sequences containing A and T only (defined using regular expressions)
samsift -i tests/test.bam -f 're.match(r"^[AT]*$", SEQ)'
# filtering: sample alignments with 25% rate
samsift -i tests/test.bam -f 'random.random()<0.25'
# filtering: sample alignments with 25% rate with a fixed RNG seed
samsift -i tests/test.bam -f 'random.random()<0.25' -0 'random.seed(42)'
# filtering: keep only alignments of reads specified in tests/qnames.txt
samsift -i tests/test.bam -0 'q=open("tests/qnames.txt").read().splitlines()' -f 'QNAME in q'
# filtering: keep only first 5000 reads from chr1 and 5000 reads from chr2
samsift -i tests/test.bam -0 'c={"chr1":5000,"chr2":5000}' -f 'c[RNAME]>0' -c 'c[RNAME]-=1' -m nonstop-remove
# tagging: add tags 'ln' with sequence length and 'ab' with average base quality
samsift -i tests/test.bam -c 'ln=len(SEQ);ab=1.0*sum(QUALa)/ln'
# tagging: add a tag 'ii' with the number of the current alignment
samsift -i tests/test.bam -0 'i=0' -c 'i+=1;ii=i'
# updating: removing sequences and base qualities
samsift -i tests/test.bam -c 'a.query_sequence=""'
# updating: switching all reads to unaligned
samsift -i tests/test.bam -c 'a.flag|=0x4;a.reference_start=-1;a.cigarstring="";a.reference_id=-1;a.mapping_quality=0'
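The per-chromosome quota example above combines an initialization (-0), a filter (-f), an update (-c) and the nonstop-remove mode. The following plain-Python sketch imitates that interplay on a made-up list of reference names. It is only an illustration, not SAMsift's code, and it assumes that reads on references missing from the dictionary raise a KeyError in the filter and are therefore dropped by nonstop-remove.

```python
# Hypothetical toy input: one reference name per alignment.
rnames = ["chr1", "chr1", "chr3", "chr1", "chr2", "chr2"]

c = {"chr1": 2, "chr2": 1}      # -0: per-reference quotas (tiny, for illustration)
kept = []
for RNAME in rnames:
    try:
        if c[RNAME] > 0:        # -f: keep while the quota is not exhausted
            c[RNAME] -= 1       # -c: consume one slot of the quota
            kept.append(RNAME)
    except KeyError:            # -m nonstop-remove: drop error-causing alignments
        continue

print(kept)
```

In this toy run the third chr1 read and the second chr2 read are filtered out because their quotas are used up, while the chr3 read is dropped through the error path.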

Installation

Using Bioconda:

# add all necessary Bioconda channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

# install samsift
conda install samsift

Using pip from PyPI:

pip install --upgrade samsift

Using pip from GitHub:

pip install --upgrade git+https://github.com/karel-brinda/samsift

Command-line parameters

Program: samsift (advanced filtering and tagging of SAM/BAM alignments using Python expressions)
Version: 0.3.1
Author:  Karel Brinda <karel.brinda@inria.fr>

Usage:   samsift.py [-i FILE] [-o FILE] [-f [PY_EXPR ...]] [-c [PY_CODE ...]] [-m STR]
                    [-0 [PY_CODE ...]] [-d [PY_EXPR ...]] [-t [PY_EXPR ...]]

Basic options:
  -h, --help        show this help message and exit
  -v, --version     show program's version number and exit
  -i FILE           input SAM/BAM file [-]
  -o FILE           output SAM/BAM file [-]
  -f [PY_EXPR ...]  filtering expression [True]
  -c [PY_CODE ...]  code to be executed (e.g., assigning new tags) [None]
  -m STR            mode: strict (stop on first error)
                          nonstop-keep (keep alignments causing errors)
                          nonstop-remove (remove alignments causing errors) [strict]

Advanced options:
  -0 [PY_CODE ...]  initialization [None]
  -d [PY_EXPR ...]  debugging expression to print [None]
  -t [PY_EXPR ...]  debugging trigger [True]

Algorithm

exec(INITIALIZATION)
for ALIGNMENT in ALIGNMENTS:
        if eval(DEBUG_TRIGGER):
                print(eval(DEBUG_EXPR))
        if eval(FILTER):
                exec(CODE)
                print(ALIGNMENT)
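
The pseudocode above can be turned into a small runnable model. The sketch below is not SAMsift's implementation: the function sift, its parameters, and the use of plain dictionaries in place of alignment records are all assumptions made for illustration.

```python
def sift(alignments, init="", filter_expr="True", code="",
         debug_expr=None, debug_trigger="True"):
    """Toy model of the loop above; each alignment is a dict of variables."""
    env = {}                          # namespace shared by all alignments
    exec(init, env)                   # initialization runs exactly once
    kept = []
    for alignment in alignments:
        env.update(alignment)         # expose fields and tags as variables
        if debug_expr is not None and eval(debug_trigger, env):
            print(eval(debug_expr, env))
        if eval(filter_expr, env):
            exec(code, env)           # may update counters set up by init
            kept.append(alignment)
        for name in alignment:        # do not leak variables into the next record
            env.pop(name, None)
    return kept

# Hypothetical records with an alignment score tag AS.
records = [
    {"QNAME": "r1", "AS": 90},
    {"QNAME": "r2", "AS": 97},
    {"QNAME": "r3", "AS": 99},
]
kept = sift(records, init="n=0", filter_expr="AS>94", code="n+=1")
print([r["QNAME"] for r in kept])
```

As in the pseudocode, the code is executed only for alignments that pass the filter.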

Python expressions and code. All expressions and code should be valid with respect to Python 3. Expressions are evaluated using the eval function and code is executed using the exec function. Initialization can be used for importing Python modules, setting global variables (e.g., counters) or loading data from disk. Some modules (namely datetime, math, random, and re) are loaded without an explicit request, and the internal RNG seed is set to 42.

Example (printing all alignments):

samsift -i tests/test.bam -f 'True'
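
Because the RNG seed is fixed, a sampling filter such as random.random()<0.25 should select the same alignments on every run. The sketch below shows the underlying Python behaviour only; the helper sample_indices is hypothetical and processes integer indices instead of alignments.

```python
import random

def sample_indices(n, rate, seed):
    """Return the indices in range(n) that a seeded sampling filter would keep."""
    random.seed(seed)                 # same role as -0 'random.seed(42)'
    return [i for i in range(n) if random.random() < rate]

first = sample_indices(1000, 0.25, 42)
second = sample_indices(1000, 0.25, 42)
print(first == second)                # identical seeds give identical selections
```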

SAM fields. Expressions and code can access variables mirroring the fields from the alignment section of the SAM specification, i.e., QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, and QUAL. Several additional variables are defined to simplify access to some useful information: QUALa stores the base qualities as an integer array; SEQs, QUALs, and QUALsa are the counterparts of SEQ, QUAL, and QUALa with soft-clipped bases skipped; and RNAMEi and RNEXTi store the reference ids as integers.

Example (keeping only the alignments with leftmost position <= 10000):

samsift -i tests/test.bam -f 'POS<=10000'
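
To make the derived variables more concrete, the following sketch computes analogues of QUALa, SEQs, QUALs, and QUALsa from raw SAM fields. It assumes Phred+33 quality encoding and parses the CIGAR string directly. The helper soft_clip_bounds and the sample record are made up for illustration, and SAMsift itself may obtain these values differently (e.g., through pysam).

```python
import re

def soft_clip_bounds(cigar, seq_len):
    """Return (start, end) of the part of SEQ that is not soft-clipped."""
    # Hard clips (H) are not present in SEQ, so they are ignored here.
    ops = [(int(n), op)
           for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op != "H"]
    start = ops[0][0] if ops and ops[0][1] == "S" else 0
    end = seq_len - (ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0)
    return start, end

# Hypothetical record.
SEQ = "ACGTACGTAC"
QUAL = "IIIIIHHHH#"
CIGAR = "2S6M2S"

QUALa = [ord(ch) - 33 for ch in QUAL]     # Phred+33 characters -> integers
start, end = soft_clip_bounds(CIGAR, len(SEQ))
SEQs = SEQ[start:end]                     # sequence without soft-clipped bases
QUALs = QUAL[start:end]                   # quality string, same trimming
QUALsa = QUALa[start:end]                 # integer qualities, same trimming

print(SEQs, QUALs, QUALsa)
```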

SAMsift internally uses the pysam library, and the current alignment (an instance of the class pysam.AlignedSegment) is available as the variable a. Since reference_start is 0-based whereas POS is 1-based, the previous example is equivalent to

samsift -i tests/test.bam -f 'a.reference_start+1<=10000'

The a variable can also be used for modifying the current alignment record.

Example (removing the sequence and the base qualities from every record):

samsift -i tests/test.bam -c 'a.query_sequence=""'

SAM tags. Every SAM tag is translated to a variable with the same name.

Example (removing alignments with a score smaller than or equal to the sequence length):

samsift -i tests/test.bam -f 'AS>len(SEQ)'

If CODE is provided, all two-letter variables except re (the Python regex module) are back-translated to tags after the code execution.

Example (adding a tag ab carrying the average base quality):

samsift -i tests/test.bam -c 'ab=1.0*sum(QUALa)/len(QUALa)'
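
The back-translation step can be pictured as scanning the namespace for two-character names after the code has run. The sketch below is an assumption about how such a step might look, not SAMsift's code: the helper collect_tags is hypothetical, and the name pattern follows the tag naming rule of the SAM specification.

```python
import re

# SAM tag names: a letter followed by a letter or a digit.
TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9]")

def collect_tags(namespace):
    """Pick the two-character variables that would be written back as tags."""
    return {name: value for name, value in namespace.items()
            if TAG_NAME.fullmatch(name) and name != "re"}

# Hypothetical namespace before the code runs: the regex module and one field.
namespace = {"re": re, "QUALa": [40, 40, 30, 10]}
exec("ln=len(QUALa);ab=1.0*sum(QUALa)/ln", namespace)

tags = collect_tags(namespace)
print(tags)
```

Longer or shorter names, such as QUALa or a one-letter counter, are left out, which is why helper variables in the examples use single-letter names.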

Errors. If an error occurs while an expression is evaluated or code is executed (e.g., due to accessing an undefined tag), the behavior of SAMsift depends on the specified mode (-m). In the strict mode (-m strict, default), SAMsift immediately interrupts the computation and reports an error. With the -m nonstop-keep option, SAMsift continues processing the alignments and keeps the error-causing alignments in the output. With the -m nonstop-remove option, all error-causing alignments are skipped and omitted from the output.
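
The three modes can be compared on a toy example in which one record lacks the AS tag. This is a sketch under simplifying assumptions: the function sift_with_mode is hypothetical, records are plain dictionaries, and only the filter expression is considered.

```python
def sift_with_mode(alignments, filter_expr, mode="strict"):
    """Toy filter loop showing how each mode treats error-causing records."""
    kept = []
    for alignment in alignments:
        try:
            passed = eval(filter_expr, {}, dict(alignment))
        except Exception:
            if mode == "strict":
                raise                          # stop on the first error
            passed = (mode == "nonstop-keep")  # keep or drop the offending record
        if passed:
            kept.append(alignment)
    return kept

# Hypothetical records; r2 has no AS tag, so 'AS>94' raises a NameError for it.
records = [{"QNAME": "r1", "AS": 99}, {"QNAME": "r2"}, {"QNAME": "r3", "AS": 50}]

keep = [r["QNAME"] for r in sift_with_mode(records, "AS>94", "nonstop-keep")]
remove = [r["QNAME"] for r in sift_with_mode(records, "AS>94", "nonstop-remove")]
print(keep, remove)
```

Calling sift_with_mode(records, "AS>94", "strict") on the same input raises the NameError instead of returning a list.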

Similar programs

  • samtools view can filter alignments based on FLAGS, read group tags, and CIGAR strings.

  • sambamba view supports, in addition to the filters offered by SAMtools, filtering using simple Perl-like expressions. However, it is not possible to use floats or compare different tags.

  • BamQL provides a simple query language for filtering SAM/BAM files.

  • bamPals adds tags XB, XE, XP and XL.

  • SamJavascript can filter alignments using JavaScript expressions.

  • Picard FilterSamReads can also filter alignments using JavaScript expressions.

Issues

Please use GitHub issues.

Changelog

See Releases.

Licence

MIT

Author

Karel Brinda <karel.brinda@inria.fr>
