Analyze deep sequencing of complex libraries

ngs-analysis

Convenient analysis of sequencing reads that span multiple DNA or protein parts. For instance, given a library of protein variants linked to DNA barcodes, this tool can answer questions like:

  • How accurate are the variant sequences, at the DNA or protein level?
  • How frequently is the same barcode linked to two different variants?
  • Which reads contain parts required for function (e.g., a Kozak start sequence, or a fused protein tag)?
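
As a rough illustration of the second question, the sketch below counts barcodes that appear with more than one variant. It is plain Python and does not use ngs-analysis itself; the function name `ambiguous_barcodes` and the example (barcode, variant) pairs are made up for this illustration.

```python
from collections import defaultdict

def ambiguous_barcodes(pairs):
    """Return barcodes observed with more than one distinct variant."""
    seen = defaultdict(set)
    for barcode, variant in pairs:
        seen[barcode].add(variant)
    return {bc: sorted(vs) for bc, vs in seen.items() if len(vs) > 1}

# Hypothetical (barcode, variant) pairs, one per read.
pairs = [
    ("AAAC", "MKT"),
    ("AAAC", "MKT"),
    ("GGTT", "MRT"),
    ("GGTT", "MKT"),
    ("CCGA", "MLT"),
]

print(ambiguous_barcodes(pairs))
```

Here only the barcode GGTT is linked to two different variants, so one of the three barcodes is ambiguous.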

This kind of analysis often involves parsing raw sequencing reads for DNA and/or protein sub-sequences (parts), then mapping the parts to a reference of anticipated part combinations. This package offers a simple workflow:

  1. Define how to parse reads into parts using plain text expressions (no code)
  2. Run the parser on your anticipated DNA sequences to generate a reference
  3. Parse a batch of sequencing samples
  4. Map the parts found in each read to the reference
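
The four steps can be sketched in plain Python as a toy model. This is not how ngs-analysis is implemented or configured (the package uses plain text expressions in config.yaml and MMseqs2 for mapping); the read layout, the regular expression, and all sequences below are assumptions made for illustration.

```python
import re

# Hypothetical layout: adapter ACGT, variant, linker TTTT, 4-nt barcode, adapter GGCC.
PATTERN = re.compile(r"ACGT(?P<variant>[ACGT]+?)TTTT(?P<barcode>[ACGT]{4})GGCC")

def parse(seq):
    """Step 1: split a sequence into named parts, or return None if it does not fit."""
    match = PATTERN.search(seq)
    return match.groupdict() if match else None

# Step 2: parse the anticipated DNA sequences to build a reference of part combinations.
reference_dna = ["ACGTATGAAATTTTCAGTGGCC", "ACGTATGCCCTTTTGACAGGCC"]
reference = {(p["variant"], p["barcode"]) for p in map(parse, reference_dna)}

# Step 3: parse a batch of reads (these three are invented).
reads = [
    "ACGTATGAAATTTTCAGTGGCC",  # identical to the first reference sequence
    "ACGTATGAAATTTTGACAGGCC",  # first variant paired with the second barcode
    "TTTTTTTT",                # does not fit the layout
]

# Step 4: map the parts found in each read to the reference (exact match only here).
results = []
for read in reads:
    parsed = parse(read)
    if parsed is None:
        results.append("unparsed")
    elif (parsed["variant"], parsed["barcode"]) in reference:
        results.append("match")
    else:
        results.append("no match")

print(results)
```

The second read parses cleanly but pairs a variant with the wrong barcode, which is the kind of event the mapping step is meant to expose.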

It’s been tested with Illumina paired-end reads and Oxford Nanopore long reads. Under the hood it uses NGmerge to merge paired reads and MMseqs2 for sequence mapping. It is moderately performant: 1 million paired-end reads can be mapped to a reference of 100,000 variant-barcode pairs in ~1 minute.

Workflow

A cartoon example with two reference sequences, each consisting of a variant linked to a barcode:

[Figure: two reference sequences, each a variant linked to a barcode]

Here's the analysis workflow and outputs:

[Figure: analysis workflow and outputs]

Note that in the last two columns, the parsed variant is mapped to a reference variant defined by the barcode present in the same read, rather than all possible reference variants. Check out the example notebook for paired-end reads for details.
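
To make the barcode-conditioned comparison concrete, here is a minimal sketch that compares a parsed variant only against the reference variant assigned to the barcode found in the same read. It uses a simple Hamming distance as a stand-in for the package's MMseqs2-based mapping; the reference dictionary and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions; None if lengths differ (this sketch does no alignment)."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))

# Hypothetical reference: barcode -> the variant it was designed to accompany.
reference = {"CAGT": "ATGAAA", "GACA": "ATGCCC"}

def compare_to_barcode_reference(variant, barcode):
    """Distance between the parsed variant and the variant expected for this barcode."""
    expected = reference.get(barcode)
    if expected is None:
        return None  # barcode is not in the reference
    return hamming(variant, expected)

print(compare_to_barcode_reference("ATGAGA", "CAGT"))  # one mismatch to ATGAAA
print(compare_to_barcode_reference("ATGAAA", "GACA"))  # right variant, wrong barcode
```

A read carrying a perfect copy of one variant next to another variant's barcode scores poorly here, even though it would match exactly if compared against all reference variants.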

TL;DR

Run ngs-analysis --help to see available commands.

  1. Make an empty directory, add config.yaml and samples.csv based on the example.
  2. Add reference_dna.csv with anticipated DNA sequences (including adapters).
  3. Run ngs-analysis setup. Add --clean to start the analysis from scratch.
  4. Check that designs.csv is accurate; if not, fix config.yaml.
    • If you have paired-end data, put it in 0_paired_reads/ and run ngs-analysis merge_read_pairs <sample>.
    • If you have single-end data (e.g., nanopore), put it in 1_reads/.
  5. Run ngs-analysis parse_reads <sample>. Check that 2_parsed/<sample>.parsed.pq looks reasonable (with pandas, use pd.read_parquet).
  6. Run ngs-analysis map_parsed_reads <sample>. Results are in 3_mapped/<sample>.mapped.csv.
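
For checking the outputs of steps 5 and 6, a small pandas helper along these lines may be useful. The helper names `load_table` and `preview` are made up here, and the sketch makes no assumption about which columns the files contain; reading parquet additionally requires pyarrow or fastparquet.

```python
from pathlib import Path

import pandas as pd

def load_table(path):
    """Load a .pq/.parquet file or a .csv file into a DataFrame, based on its suffix."""
    path = Path(path)
    if path.suffix in {".pq", ".parquet"}:
        return pd.read_parquet(path)  # needs pyarrow or fastparquet installed
    return pd.read_csv(path)

def preview(path, n=5):
    """Print the row count and column names, then return the first n rows."""
    df = load_table(path)
    print(f"{path}: {len(df)} rows, columns = {list(df.columns)}")
    return df.head(n)
```

For example, calling preview on 2_parsed/<sample>.parsed.pq or 3_mapped/<sample>.mapped.csv (with <sample> replaced by a name from samples.csv) shows whether the tables are non-empty and have the columns you expect.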

Install

pip install ngs-analysis

Make sure that the mmseqs and NGmerge executables are available (NGmerge is only needed for paired reads).

On Linux and Intel-based macOS, you can use conda install -c bioconda -c conda-forge mmseqs2 ngmerge. On Apple Silicon, mmseqs can be installed via Homebrew with brew install mmseqs2, and NGmerge can be installed from source or via brew install brewsci/bio/ngmerge.

Tested on Linux and macOS (Apple Silicon).
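
One way to confirm that the executables are on your PATH is a quick check with Python's standard library. The function name `find_tools` is made up for this sketch; the executable names mmseqs and NGmerge are the ones mentioned above.

```python
import shutil

def find_tools(names=("mmseqs", "NGmerge")):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}

for name, path in find_tools().items():
    print(f"{name}: {path if path else 'not found'}")
```

A missing NGmerge only matters if you plan to merge paired-end reads.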
