
Clips overlapping regions in read mates of SAM/BAM files.


clip


To install:

pip install clipoverlap

Or:

git clone git@github.com:innovate-invent/clip.git
cd clip
python3 setup.py install

To run:

$ clip -h
Clip Overlap v1.0
Clip overlapping reads from SAM/BAM/CRAM file
Use: clip [-tmabcosv] [input file path | < infile > outfile] [output file path]
If no paths are given stdin and stdout are used.
-t # Threads to use for processing (Default=1)
-m # Maximum template length guaranteeing no read overlap (Default=1000)
-a Alternate strand being clipped to avoid strand bias (RAM intensive)
-b Trim tails of reads that extend past end of mate. Used to trim barcode remnants.
-c Clip only, do not merge clipped region into mate.
-o [sbuc] Output format: s=SAM (Default), b=BAM compressed, bu=BAM uncompressed, c=CRAM
-s Maintain input order (High depth regions may fill RAM), if not set will output in arbitrary order (Minimal RAM)
-v Verbose status output

If you run clip with no parameters, it will appear to do nothing. This is because it defaults to reading its input from stdin.
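For pipelines driven from Python, the invocation can be wrapped with the standard subprocess module. The following is a minimal sketch and not part of clip itself: the helper names build_clip_command and run_clip and the file names are hypothetical, and it assumes the options are passed as separate arguments in the form the help text suggests (for example -t 4 and -o b).

```python
import subprocess


def build_clip_command(in_path=None, out_path=None, threads=1, out_format="s"):
    """Build an argument list for clip (hypothetical helper, not part of clip).

    Assumes options are given as '-t N' and '-o FMT', as the help text suggests.
    """
    cmd = ["clip", "-t", str(threads), "-o", out_format]
    if in_path is not None:
        cmd.append(in_path)
        # An output path is only meaningful after an input path.
        if out_path is not None:
            cmd.append(out_path)
    return cmd


def run_clip(**kwargs):
    """Run clip and raise CalledProcessError on a non-zero exit status.

    With no paths given, clip inherits this process's stdin and stdout.
    """
    return subprocess.run(build_clip_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute.
    print(" ".join(build_clip_command("input.bam", "clipped.bam",
                                      threads=4, out_format="b")))
```

Calling run_clip(in_path="input.bam", out_path="clipped.bam") would execute the command, provided clip is installed and on the PATH.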

Notes:

clip uses a minimum of two subprocesses regardless of the -t option.

-a will alternate between clipping the tail of the leftmost strand and clipping the head of the rightmost strand. This is to avoid possible strand bias later in a processing pipeline.

If your reads had barcodes ligated and the 5’ barcode was removed in a previous step (see ProDuSe:trim), use the -b option to remove any 3’ barcode sequence that would be appended if sequencing ran to the end of the molecule.

Using -s and -a together will force clip to try to sort by reference start coordinate. If the input is unsorted, this could run out of RAM.

Merge Algorithm

The mate read cigars are assumed to align 1-1 with an offset determined by the difference in the reference start positions.

  • If -c is unset then clip will retain the highest quality base at a given position in the overlapping region of the mate pairs.

  • If the base qualities are equal then it will keep the base that does not match the reference.

  • If base qualities are equal and both bases are different variants, then the quality score is set to 3 (a Phred score of 3 corresponds to an error probability of about 50%, so either base is about equally likely).

  • If the operations between the aligned cigars do not match then the operations from the mate with the lowest alignment cost are retained.
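The per-base rules above can be sketched as a small function. This is an illustrative sketch under stated assumptions, not clip's actual code: the name merge_base and its signature are hypothetical, and the README does not say which base is kept when both are different variants of equal quality, so keeping the first one is an assumption.

```python
def merge_base(base_a, qual_a, base_b, qual_b, ref_base):
    """Return (base, quality) for one position in the overlapping region.

    Hypothetical helper illustrating the rules listed above; not clip's code.
    """
    if qual_a != qual_b:
        # Keep the higher-quality base.
        return (base_a, qual_a) if qual_a > qual_b else (base_b, qual_b)
    if base_a == base_b:
        # Same base and same quality: nothing to decide.
        return base_a, qual_a
    # Equal quality, different bases: prefer the one that differs from the reference.
    if base_a == ref_base:
        return base_b, qual_b
    if base_b == ref_base:
        return base_a, qual_a
    # Both bases are variants and disagree: quality becomes 3.
    # Phred 3 means an error probability of 10 ** (-3 / 10), roughly 0.50.
    # Keeping base_a here is an assumption; the README does not specify it.
    return base_a, 3
```

For example, merge_base("A", 30, "C", 30, "A") keeps the non-reference "C", while merge_base("G", 30, "C", 30, "A") returns a base with quality 3.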

The alignment cost is calculated for the overlapping region only. The cost is summed using these values:

Operation     Value
M, X, =, N    -1
I             6 to start, +1 to lengthen
D             3 to start, +1 to lengthen
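As a rough illustration of how these values could be summed, here is a minimal sketch. It is not clip's implementation: the function name alignment_cost and the (operation, length) input format are hypothetical. It assumes that -1 applies per base of an M, X, = or N operation, and that "N to start, +1 to lengthen" means N for the first base of a gap plus 1 for each additional base.

```python
# Values from the table above; the per-base interpretation is an assumption.
MATCH_OPS = {"M", "X", "=", "N"}
GAP_OPEN = {"I": 6, "D": 3}


def alignment_cost(cigar):
    """Sum the cost of a list of (operation, length) pairs.

    The list is expected to cover only the overlapping region.
    """
    cost = 0
    for op, length in cigar:
        if length <= 0:
            continue
        if op in MATCH_OPS:
            cost -= length
        elif op in GAP_OPEN:
            # Opening cost for the first base, +1 for each further base.
            cost += GAP_OPEN[op] + (length - 1)
        # Operations not listed in the table (S, H, P) are ignored here.
    return cost
```

With this sketch, the mate whose operations are retained would be chosen with something like min((cigar_a, cigar_b), key=alignment_cost).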

TODO:

Significant speed and memory optimisations are planned. Need to eliminate the pysam dependency first.


