Commandline tool for parsing NGS reads by multiple fuzzy regex operations

Project description

itermae

See the concept here and tutorial here.

itermae is a command-line utility that recognizes patterns in input sequences and generates outputs from the groups it recognizes. It applies fuzzy regular expression operations to (primarily) DNA sequences for DNA barcode/tag/UMI parsing, sequence- and quality-based filtering, and general output rearrangement.

itermae reads and writes FASTQ, FASTA, text-file, and SAM (tab-delimited) files, using Biopython sequence records to represent slices and to handle the input and output formats. Pattern matching uses the regex library, and the tool is designed to run in command-line pipes from tools like GNU parallel to permit lightweight parallelization.
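As a rough illustration of what "fuzzy" matching means here, the sketch below finds a fixed anchor sequence in a read while tolerating a limited number of mismatches. It is a standard-library stand-in that handles substitutions only; it is not how itermae works internally, since itermae relies on the regex library, which also handles insertions and deletions. The function name find_fuzzy and the example sequences are hypothetical.

```python
def find_fuzzy(pattern, sequence, max_substitutions):
    """Return (start, mismatches) for the best substitution-only match, or None.

    Hypothetical helper for illustration: it slides the pattern along the
    sequence and counts mismatching positions in each window.
    """
    best = None
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches > max_substitutions:
            continue
        # Keep the earliest window with the fewest mismatches.
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    return best


# The read carries the anchor with one substituted base (A -> T).
anchor = "GTCCTCGAGG"
read = "ACGTAGTCCTCGTGGTTTT"
print(find_fuzzy(anchor, read, max_substitutions=1))
print(find_fuzzy(anchor, read, max_substitutions=0))
```

Allowing one substitution locates the anchor despite the sequencing error, while an exact search finds nothing. This is the reason for using fuzzy rather than exact patterns on sequencing reads.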

Its usage might look something like this:

zcat seq_data.fastqz | itermae --config my_config.yml -v > output.sam

or

zcat seq_data.fastqz \
    | parallel --quote --pipe -l 4 --keep-order -N 10000 \
        itermae --config my_config.yml -v > output.sam

with a my_config.yml file that may look something like this:

matches:
    - use: input
      pattern: NNNNNGTCCTCGAGGTCTCTNNNNNNNNNNNNNNNNNNNNCGTACGCTGCAGGTC
      marking: aaaaaBBBBBBBBBBBBBBBccccccccccccccccccccDDDDDDDDDDDDDDD
      marked_groups:
          a:
              name: sampleIndex
              repeat: 5
          B:
              allowed_errors: 2
          c:
              name: barcode
              repeat_min: 18
              repeat_max: 22
          D:
              allowed_insertions: 1
              allowed_deletions: 2
              allowed_substitutions: 2
output_list:
    -   name: 'barcode'
        description: 'description+" sample="+sampleIndex'
        seq: 'barcode'
        filter: 'statistics.median(barcode.quality) >= 35'
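To make the pattern/marking idea more concrete, here is a minimal sketch of how such a pair could be translated into a single fuzzy regular expression in the syntax of the regex library (named groups plus error limits such as {e<=2}). This is an assumption about the general approach, not itermae's actual code. The function build_fuzzy_regex and the shortened example config are hypothetical, and the sketch assumes the substitution option is spelled allowed_substitutions.

```python
from itertools import groupby


def build_fuzzy_regex(pattern, marking, marked_groups):
    """Translate a pattern/marking pair into one fuzzy regex string.

    Illustrative only: each run of identical marking characters becomes
    one named group, using that group's options from marked_groups.
    """
    limit_codes = (
        ("allowed_errors", "e"),
        ("allowed_insertions", "i"),
        ("allowed_deletions", "d"),
        ("allowed_substitutions", "s"),
    )
    parts = []
    position = 0
    for mark, run in groupby(marking):
        length = len(list(run))
        chunk = pattern[position:position + length]
        position += length
        options = marked_groups.get(mark, {})
        name = options.get("name", mark)
        if set(chunk) == {"N"}:
            # A run of N is a wildcard; honor repeat bounds if given.
            if "repeat" in options:
                low = high = options["repeat"]
            else:
                low = options.get("repeat_min", length)
                high = options.get("repeat_max", length)
            body = "[ATCGN]{%d,%d}" % (low, high)
        else:
            body = chunk
        limits = ["%s<=%d" % (code, options[key])
                  for key, code in limit_codes if key in options]
        fuzzy = "{" + ",".join(limits) + "}" if limits else ""
        parts.append("(?P<%s>(?:%s)%s)" % (name, body, fuzzy))
    return "".join(parts)


# Shortened, hypothetical version of the config above.
groups = {
    "a": {"name": "sampleIndex", "repeat": 3},
    "B": {"allowed_errors": 1},
    "c": {"name": "barcode", "repeat_min": 3, "repeat_max": 5},
    "D": {"allowed_insertions": 1, "allowed_deletions": 1,
          "allowed_substitutions": 1},
}
print(build_fuzzy_regex("NNNGTCCNNNNACG", "aaaBBBBccccDDD", groups))
```

The resulting string could then be compiled with the regex library, and the named groups (here sampleIndex and barcode) are what entries in output_list refer to.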

Availability, installation, 'installation'

Options:

  1. Use pip to install itermae:

    python3 -m pip install itermae

  2. You can clone this repo, and install it locally. Dependencies are in requirements.txt, so python3 -m pip install -r requirements.txt will install those.

  3. You can use Singularity to pull and run a Singularity image of itermae.py, where everything is already installed. This is the recommended usage.

    This image is built with a few other tools, like g/mawk, perl, and parallel, to make command line munging easier.

Usage

itermae is intended for use in a pipeline: you have just received your DNA sequencing FASTQ reads and want to parse them. The recommended interface is the YAML config file, as demonstrated in the tutorial and detailed again in the configuration details. You can also use a command-line argument interface, as detailed further in the examples.

I recommend you test this on small batches of data, then put it behind GNU parallel and feed in the whole FASTQ file via zcat on standard input. This parallelizes with a small memory footprint, and you can then write the output to disk (or stream it into another tool).
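The memory footprint stays small because the input is streamed in record-aligned chunks and never loaded whole. The sketch below illustrates that idea in Python. It is not how GNU parallel is implemented, and it assumes each FASTQ record spans exactly four lines. The function name fastq_chunks is hypothetical.

```python
from itertools import islice


def fastq_chunks(lines, records_per_chunk, lines_per_record=4):
    """Yield lists of lines, each holding whole FASTQ records.

    Hypothetical helper: only one chunk is held in memory at a time, and
    a record is never split across two chunks.
    """
    iterator = iter(lines)
    chunk_size = records_per_chunk * lines_per_record
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Ten made-up records, split into chunks of at most four records.
reads = []
for i in range(10):
    reads += ["@read%d" % i, "ACGT", "+", "IIII"]

sizes = [len(chunk) for chunk in fastq_chunks(reads, records_per_chunk=4)]
print(sizes)
```

Taking only the first chunk from such a generator is also a simple way to get a small test batch before running the full file.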

Thanks

Again, the tool is built upon the excellent work of the regex library, Biopython, and GNU parallel.

Development, helping

Issues and advice are welcome as issues on the GitLab repo. Complaints are especially welcome.

For development, see the documentation as rendered from docstrings.

A set of tests is written with the pytest module and can be run from inside the cloned repo with make test. See make help for more options, such as building, installing, and uploading.

There is also a bash script in profiling_tests that generates longer runs for profiling with cProfile and snakeviz, but it is out of date. A to-do item is to reconfigure and retest it for speed.

Download files

Source Distribution

itermae-0.6.0.1.tar.gz (20.6 kB)

Uploaded Source

Built Distribution

itermae-0.6.0.1-py3-none-any.whl (20.8 kB)

Uploaded Python 3

File details

Details for the file itermae-0.6.0.1.tar.gz.

File metadata

  • Download URL: itermae-0.6.0.1.tar.gz
  • Upload date:
  • Size: 20.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.5

File hashes

Hashes for itermae-0.6.0.1.tar.gz
Algorithm Hash digest
SHA256 1b16aa3624ed911279cbb1601d239872c533ad326f953d482c99d843e6065d59
MD5 5451221409f3011b6431c11305b52cce
BLAKE2b-256 b81db01f55374ec6dd07e84786f4f832e1196bfd63a42d9fbc389e3ea5bee46b

File details

Details for the file itermae-0.6.0.1-py3-none-any.whl.

File metadata

  • Download URL: itermae-0.6.0.1-py3-none-any.whl
  • Upload date:
  • Size: 20.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.3.0 pkginfo/1.7.0 requests/2.22.0 setuptools/45.2.0 requests-toolbelt/0.9.1 tqdm/4.57.0 CPython/3.8.5

File hashes

Hashes for itermae-0.6.0.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0e9b0d129d1644d51928be6c18b126221c855d5ea9f5f03fb7929adaff482385
MD5 690844f6a11a94e0115e5f2e9c412fec
BLAKE2b-256 5df4671bfe83d6c14d3f6a57f386d1d1a570a8e2b44399ee13a5e917ce7c9fb1
