Command-line tool for parsing NGS reads by multiple fuzzy regex operations
itermae
See the concept here and tutorial here.
itermae is a command-line utility that recognizes patterns in input sequences
and generates outputs from the groups it recognizes. Basically, it applies fuzzy
regular expression operations to (primarily) DNA sequences for purposes of DNA
barcode/tag/UMI parsing, sequence- and quality-based filtering,
and general output rearrangement.
itermae reads and writes FASTQ, FASTA, text-file, and SAM (tab-delimited)
files, using Biopython sequence records to represent slices and to handle
the input and output formats.
Pattern matching uses the regex library,
and the tool is designed to function in command-line pipes from tools like
GNU parallel to permit light-weight parallelization.
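To make "fuzzy" matching concrete, here is a minimal sketch in plain Python. It is not how itermae or the regex library work internally: it only tolerates substitutions, whereas the regex library can also allow insertions and deletions. The function name `find_with_mismatches` and the example read are made up for illustration.

```python
def find_with_mismatches(pattern, sequence, max_mismatches):
    """Find the leftmost window of `sequence` matching `pattern`.

    Returns (start, mismatches) for the first window with at most
    `max_mismatches` substitutions, or None if no window qualifies.
    An 'N' in the pattern matches any base.
    """
    width = len(pattern)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(
            1 for p, s in zip(pattern, window) if p != "N" and p != s
        )
        if mismatches <= max_mismatches:
            return start, mismatches
    return None


# Hypothetical read: 5 leading bases, the fixed sequence with one
# substitution (A -> T), then 7 trailing bases.
read = "ACGTA" + "GTCCTCGTGGTCTCT" + "TTGACCA"

print(find_with_mismatches("GTCCTCGAGGTCTCT", read, 2))  # (5, 1)
print(find_with_mismatches("GTCCTCGAGGTCTCT", read, 0))  # None
```

With two errors allowed the fixed sequence is found despite the substitution, while an exact match fails. That tolerance is what lets barcode parsing cope with sequencing errors.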
Its usage might look something like this:
zcat seq_data.fastqz | itermae --config my_config.yml -v > output.sam
or
zcat seq_data.fastqz \
| parallel --quote --pipe -l 4 --keep-order -N 10000 \
itermae --config my_config.yml -v > output.sam
with a my_config.yml
file that may look something like this:
matches:
    - use: input
      pattern: NNNNNGTCCTCGAGGTCTCTNNNNNNNNNNNNNNNNNNNNCGTACGCTGCAGGTC
      marking: aaaaaBBBBBBBBBBBBBBBccccccccccccccccccccDDDDDDDDDDDDDDD
      marked_groups:
          a:
              name: sampleIndex
              repeat: 5
          B:
              allowed_errors: 2
          c:
              name: barcode
              repeat_min: 18
              repeat_max: 22
          D:
              allowed_insertions: 1
              allowed_deletions: 2
              allowed_substititions: 2
output_list:
    - name: 'barcode'
      description: 'description+" sample="+sampleIndex'
      seq: 'barcode'
      filter: 'statistics.median(barcode.quality) >= 35'
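The sketch below shows one way to read the config above: each run of identical characters in `marking` labels the stretch of `pattern` directly above it, and the filter compares a median quality score against a threshold. This is an illustration, not itermae's implementation. The helper names `marked_segments` and `phred_scores` are hypothetical, and it assumes that `barcode.quality` is a list of Phred scores decoded with the common offset of 33.

```python
from itertools import groupby
import statistics


def marked_segments(pattern, marking):
    """Pair each run of identical marking characters with its pattern slice."""
    if len(pattern) != len(marking):
        raise ValueError("pattern and marking must be the same length")
    segments = []
    position = 0
    for mark, run in groupby(marking):
        length = len(list(run))
        segments.append((mark, pattern[position:position + length]))
        position += length
    return segments


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (assumed offset)."""
    return [ord(character) - offset for character in quality_string]


# The pattern and marking from the config, written as runs so the
# lengths (5, 15, 20, 15) are easy to see.
pattern = "N" * 5 + "GTCCTCGAGGTCTCT" + "N" * 20 + "CGTACGCTGCAGGTC"
marking = "a" * 5 + "B" * 15 + "c" * 20 + "D" * 15

for mark, piece in marked_segments(pattern, marking):
    print(mark, len(piece), piece)

# A made-up quality string: 'I' is 40, 'F' is 37, '#' is 2.
barcode_quality = phred_scores("IIIIFFFF##")
print(statistics.median(barcode_quality) >= 35)  # True (median is 37.0)
```

Groups `a` and `c` cover the `N` stretches that get captured under the names `sampleIndex` and `barcode`, while `B` and `D` cover the fixed flanking sequences that are matched with some errors allowed.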
Availability, installation, 'installation'
Options:
- Use pip to install itermae, so python3 -m pip install itermae
- You can clone this repo and install it locally. Dependencies are in requirements.txt, so python3 -m pip install -r requirements.txt will install those.
- You can use Singularity to pull and run a Singularity image of itermae.py, where everything is already installed. This is the recommended usage. This image is built with a few other tools, like g/mawk, perl, and parallel, to make command line munging easier.
Usage
itermae is envisioned to be used in a pipeline where you just got your
DNA sequencing FASTQ reads back, and you want to parse them.
The recommended interface is the YAML config file, as demonstrated
in the tutorial and detailed again in the configuration details.
You can also use a command-line argument interface, as detailed more
in the examples.
I recommend you test this on small batches of data,
then stick it behind GNU parallel and feed the whole FASTQ file in
on standard input via zcat.
This parallelizes with a small memory footprint, and then
you write the output to disk (or stream it into another tool).
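The small memory footprint comes from streaming the input in batches instead of loading the whole file. As a rough sketch of that idea, and of how the `-l 4` and `-N 10000` options in the parallel command shown earlier appear to divide the input (four lines per FASTQ record, 10000 records per job), the code below groups lines into records and records into batches. The function names and the tiny example reads are made up for illustration.

```python
from itertools import islice


def fastq_records(lines, lines_per_record=4):
    """Yield tuples of consecutive lines, one tuple per FASTQ record."""
    iterator = iter(lines)
    while True:
        record = tuple(islice(iterator, lines_per_record))
        if len(record) < lines_per_record:
            return  # end of input; an incomplete trailing record is dropped
        yield record


def batches(records, records_per_batch):
    """Yield lists of up to `records_per_batch` records, in input order."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, records_per_batch))
        if not batch:
            return
        yield batch


# Three made-up FASTQ records, split into batches of two.
lines = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "TTGA", "+", "FFFF",
    "@read3", "GGCC", "+", "IIFF",
]
batch_sizes = [len(batch) for batch in batches(fastq_records(lines), 2)]
print(batch_sizes)  # [2, 1]
```

Because both functions are generators, only one batch is held in memory at a time, which is the same property that keeps the piped setup light.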
Thanks
Again, the tool is built upon the excellent work of the tools named above:
Biopython, the regex library, and GNU parallel.
Development, helping
Any issues or advice are welcome as an issue on the GitLab repo. Complaints are especially welcome.
For development, see the documentation as rendered from docstrings.
A set of tests is written with the pytest module, and can be run from inside
the cloned repo with make test.
See make help for more options, such as building, installing, and uploading.
There's also a bash script in profiling_tests that generates longer runs
for profiling purposes with cProfile and snakeviz, but it is out of date.
The to-do is to re-configure and retest that for speed.