
Samfile long-read filtering script.

This project has been archived.

The maintainers of this project have marked this project as archived. No new releases are expected.


GERENUQ

A simple command-line tool and Python functions for filtering long reads from bam, sam and paf read alignment files according to user-defined parameters.

Installation

Using Conda

  $ conda install -c conda-forge -c bioconda -c abahcheli gerenuq

Using Pip

  $ pip install gerenuq

Using Docker

  $ docker pull abahcheli/gerenuq

Manual

  $ git clone https://github.com/abahcheli/gerenuq
  $ cd gerenuq
  $ python setup.py install

Usage

gerenuq

Required inputs:
-i / --input <input raw samfile>
-o / --output <output filtered samfile>

Optional inputs:
-l / --length <minimum read length for cutoff (default 1000)>
-m / --matchlength <sequence identity, also known as minimum ratio of matches to read length (default 0.5)>
-s / --score <minimum score for the whole alignment (default 1)>
-q / --lengthscore <minimum ratio of length to score, may be considered as the fraction of bases that have a positive score (default 2)>
-t / --threads <number of processes to run (default 1)>

gerenuq_filter_file(input_file, output_file, min_score = 1, min_len_to_score = 2, min_length = 1000, min_match_to_length = 0.5)
'''
Filters minimap2-mapped reads by mapping score, length, match-to-length and length-to-score ratios. Paf format files only filter by query cutoff.

Requires input_file in bam, sam or paf format and output_file (output in the same format as input).
'''

gerenuq_filter_read_list(read_list, format='sam', min_score = 1, min_len_to_score = 2, min_length = 1000, min_match_to_length = 0.5)
'''
Filters minimap2-mapped reads by mapping score, length, match-to-length and length-to-score ratios. Paf format files only filter by query cutoff.

Requires read_list as list of mapped read lines from sam or paf file (in tsv format). Returns a list of reads in sam or paf (tsv) format that passed filtering parameters. Headers will be ignored and not returned.
'''
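
Because gerenuq_filter_read_list expects mapped read lines and ignores headers, it can help to separate the two before calling it. The following is a minimal sketch of that preparation step only; the helper name split_sam_lines is hypothetical and not part of gerenuq, and it relies only on the SAM convention that header lines begin with "@".

```python
def split_sam_lines(sam_text):
    """Split the text of a SAM file into header lines and alignment lines.

    Hypothetical helper for illustration; it does not call gerenuq.
    """
    headers, reads = [], []
    for line in sam_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("@"):
            headers.append(line)  # SAM header lines start with "@"
        else:
            reads.append(line)  # tab-separated alignment record
    return headers, reads


example = (
    "@HD\tVN:1.6\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "read1\t0\tchr1\t1\t60\t4M\t*\t0\t0\tACGT\t*\n"
)
headers, reads = split_sam_lines(example)
```

The reads list produced this way has the shape described for read_list; the headers can be written back in front of the filtered reads if a complete samfile is needed.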

Getting Started

Background and Theory

Gerenuq is based on a series of commands used to filter reads, originating from the filtering process used in the cmags paper. Instead of requiring a number of inputs and outputs, the script is a single command that takes a samfile as input and returns a filtered samfile.

The script filters reads mapped against a reference from a minimap2 results samfile. The required input parameters are a samfile (-i or --input) (see Getting Started) and an output file (-o or --output).

The script parses the samfile and, by default, keeps reads that are primary alignments, are at least 1,000 bases long, meet a minimum ratio of 0.5 for the number of matches to the read length (sequence identity), and stay below a maximum ratio of 2 for the length divided by the score (the inverse of the average score per base).
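
The default filters described above can be sketched for a single samfile line as follows. This is an illustration under stated assumptions, not gerenuq's actual implementation: the function name passes_filters is hypothetical, matches are assumed to be the sum of M and = operations in the CIGAR string, the score is assumed to come from the AS:i tag that minimap2 writes, "primary" is assumed to exclude unmapped, secondary and supplementary records, and the length-to-score threshold is treated as inclusive.

```python
import re

# One CIGAR operation: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def passes_filters(sam_line, min_length=1000, min_match_to_length=0.5,
                   min_score=1, max_len_to_score=2):
    """Return True if one SAM alignment line passes all four filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar, seq = int(fields[1]), fields[5], fields[9]

    # Assumption: drop unmapped (0x4), secondary (0x100), supplementary (0x800).
    if flag & (0x4 | 0x100 | 0x800):
        return False

    length = len(seq)
    if length < min_length:
        return False

    # Assumption: matches = total length of M and = CIGAR operations.
    matches = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=")
    if matches / length < min_match_to_length:
        return False

    # Assumption: alignment score is taken from the optional AS:i tag.
    score = None
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            score = int(tag[5:])
            break
    if score is None or score < min_score or score <= 0:
        return False

    return length / score <= max_len_to_score


def make_line(flag, cigar, seq_len, score):
    """Build a synthetic SAM line for the example."""
    return "\t".join(["read1", str(flag), "chr1", "100", "60", cigar,
                      "*", "0", "0", "A" * seq_len, "*",
                      "NM:i:5", "AS:i:%d" % score])


good = passes_filters(make_line(0, "1000M200S", 1200, 900))
secondary = passes_filters(make_line(256, "1000M200S", 1200, 900))
too_short = passes_filters(make_line(0, "500M", 500, 450))
low_score = passes_filters(make_line(0, "1000M200S", 1200, 100))
```

In this sketch only the first synthetic read passes: the second is a secondary alignment, the third is shorter than 1,000 bases, and the fourth has a length-to-score ratio of 12.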

Optional parameters can change the filters in a number of ways (refer to the help command when running the script). It is highly recommended that you multi-thread to speed up the filtering process.

Quick Start

For appropriate inputs, type python3 gerenuq.py --help.

The samfile should be the output from a minimap2 alignment that may be filtered by samtools.

The required input parameters are a samfile, bamfile or paf file (-i or --input) and an output file (-o or --output). The output will be a samfile, bamfile or paf file filtered according to the input or default parameters. For example, a simple command would be:

gerenuq -i raw_samfile.sam -o filtered_samfile.sam

Processing time increases exponentially for each additional mapped read. Multi-processing is recommended; the number of processes to run can be set with the -t flag (or --threads).

gerenuq -i raw_samfile.sam -o filtered_samfile.sam --length 50000 --threads 20
