Command-line tool for parsing NGS reads by multiple fuzzy regex operations
itermae
Command-line utility to recognize patterns in input sequences and generate outputs from the groups recognized. Basically, a utility for applying fuzzy regular expression operations to (primarily) DNA sequences for purposes of DNA barcode/tag/UMI parsing, sequence- and quality-based filtering, and general output rearrangement.
Reads and writes FASTQ, FASTA, text-file, and SAM (tab-delimited) formats.
Designed to function with sequence piped in from tools like GNU parallel
to permit light-weight parallelization.
Matching is handled as strings with the regex library, and Biopython is used to represent and slice the sequences and to read and write the formats.
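To give a feel for what "fuzzy" matching buys you when parsing barcodes, here is a minimal sketch. It is not how itermae does it: itermae relies on the regex library, while this is a standard-library stand-in that tolerates substitutions only (Hamming distance), with no insertions or deletions. The function names, the example read, and the anchor sequence are all made up for illustration.

```python
from typing import Optional, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_fuzzy(read: str, anchor: str, max_subs: int = 1) -> Optional[Tuple[int, int]]:
    """Return (start, mismatches) for the first window within max_subs of anchor."""
    for start in range(len(read) - len(anchor) + 1):
        mismatches = hamming(read[start:start + len(anchor)], anchor)
        if mismatches <= max_subs:
            return start, mismatches
    return None


def extract_barcode(read: str, anchor: str, barcode_len: int,
                    max_subs: int = 1) -> Optional[str]:
    """Return the barcode_len bases immediately upstream of the anchor, or None."""
    hit = find_fuzzy(read, anchor, max_subs)
    if hit is None or hit[0] < barcode_len:
        return None
    start, _ = hit
    return read[start - barcode_len:start]


if __name__ == "__main__":
    # Hypothetical read: a 5-base barcode, then the anchor GGTACC carrying
    # one sequencing error (GGTTCC), then trailing sequence.
    read = "ACGTAGGTTCCTTTT"
    print(extract_barcode(read, "GGTACC", barcode_len=5))              # ACGTA
    print(extract_barcode(read, "GGTACC", barcode_len=5, max_subs=0))  # None
```

An exact match would throw this read away because of the single error in the anchor; allowing one substitution recovers the barcode.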
Designed for use in command-line shells on a *nix machine.
Availability, installation, 'installation'
Options:
- Use pip to install itermae, so python3 -m pip install itermae
- You can clone this repo, and install it locally. Dependencies are in requirements.txt, so python3 -m pip install -r requirements.txt will install those. But if you're not using pip anyways, then you... do you.
- You can use Singularity to pull and run a Singularity image of itermae.py, where everything is already installed. This is the recommended usage. This image is built with a few other tools, like gawk, perl, and parallel, to make command line munging easier.
Usage
itermae is envisioned to be used in a pipeline where you just got your DNA sequencing FASTQ reads back, and you want to parse them.
You feed small chunks of the file into the tool with match-level verbosity and record-level reports to develop good patterns. These patterns, filters, and outputs are used to pull out and assemble the output you want.
Then you wrap it up behind parallel and feed the whole FASTQ file in on standard input via zcat.
This parallelizes with a small memory footprint (will measure later), then you write it out to disk (or stream into another tool).
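The reason this streams with a small footprint is that each FASTQ record is handled on its own and then forgotten. A rough sketch of that record-at-a-time style, using only the standard library, is below. It is not itermae's code (itermae uses Biopython for reading and writing), and the function names, the Phred+33 offset, and the quality cutoff of 20 are assumptions picked for the example.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        yield header[1:].strip(), seq, qual


def mean_quality(qual: str, offset: int = 33) -> float:
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_stream(source: TextIO, sink: TextIO, min_mean_q: float = 20.0) -> int:
    """Copy records whose mean quality passes the cutoff; return how many were kept."""
    kept = 0
    for read_id, seq, qual in read_fastq(source):
        if mean_quality(qual) >= min_mean_q:
            sink.write(f"@{read_id}\n{seq}\n+\n{qual}\n")
            kept += 1
    return kept


if __name__ == "__main__":
    # Two made-up records: r1 has high quality ('I' = Q40), r2 has none ('!' = Q0).
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
    out = io.StringIO()
    print(filter_stream(demo, out), "record(s) kept")
    print(out.getvalue(), end="")
```

In a real pipe you would call filter_stream(sys.stdin, sys.stdout) instead of the in-memory demo, so the script can sit behind zcat and GNU parallel in its --pipe mode. The chunks handed to each worker have to be split on four-line record boundaries, or the reader will see broken records.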
Tutorial / demo - there's a Jupyter notebook in this root directory (demos_and_tutorial_itermae.ipynb) and the rendered output HTML.
That should have some examples and ideas for how to use it.
There are also some longer runs launched by a bash script in profiling_tests, for profiling purposes with cProfile and snakeviz.
Hashes for itermae-0.5.9.9-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6a2e25b0ff43ada0706d36ab75faa069e4724e9896ab5dc2d10b560aba164b6a
MD5 | c3f4f49609e376854d5245f365cfe4dd
BLAKE2b-256 | aae5f37995b7810bb483ab159485753063b3d3519779b8150ab219c2a339dec2