Analyze deep sequencing of complex libraries
ngs-analysis
Convenient analysis of sequencing reads that span multiple DNA or protein parts. For instance, given a library of protein variants linked to DNA barcodes, this tool can answer questions like:
- How accurate are the variant sequences, at the DNA or protein level?
- How frequently is the same barcode linked to two different variants?
- Which reads contain parts required for function (e.g., a Kozak start sequence, or a fused protein tag)?
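To make the second question concrete, here is a minimal sketch of how barcode collisions could be counted once each read has been reduced to a (barcode, variant) pair. The function name and the toy data are hypothetical; this is plain Python for illustration, not code from ngs-analysis.

```python
from collections import defaultdict


def barcodes_with_multiple_variants(pairs):
    """Return {barcode: sorted variants} for barcodes seen with more than one variant.

    `pairs` is any iterable of (barcode, variant) tuples, one per read.
    """
    variants_by_barcode = defaultdict(set)
    for barcode, variant in pairs:
        variants_by_barcode[barcode].add(variant)
    return {
        barcode: sorted(variants)
        for barcode, variants in variants_by_barcode.items()
        if len(variants) > 1
    }


# Toy data (hypothetical): barcode AAAA is observed with two different variants.
toy_pairs = [("AAAA", "v1"), ("AAAA", "v1"), ("AAAA", "v2"), ("CCCC", "v3")]
collisions = barcodes_with_multiple_variants(toy_pairs)
```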
This kind of analysis often involves parsing raw sequencing reads for DNA and/or protein sub-sequences (parts), then mapping the parts to a reference of anticipated part combinations. This package offers a simple workflow:
- Define how to parse reads into parts using plain text expressions (no code)
- Run the parser on your anticipated DNA sequences to generate a reference
- Parse a batch of sequencing samples
- Map the parts found in each read to the reference
It’s been tested with Illumina paired-end reads and Oxford Nanopore long reads. Under the hood it uses NGmerge to merge paired reads and MMseqs2 for sequence mapping. It is moderately performant: 1 million paired-end reads can be mapped to a reference of 100,000 variant-barcode pairs in ~1 minute.
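The parse-then-map idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the adapter and linker sequences, the design names, and the use of a regular expression with named groups in place of the package's own plain-text expressions. The lookup is an exact match, whereas the package itself maps with MMseqs2.

```python
import re

# Hypothetical read layout: adapter, variant, linker, 8-nt barcode, adapter.
# This regex stands in for the parsing expressions; it is not the config.yaml syntax.
PATTERN = re.compile(
    r"ACGTAC(?P<variant>[ACGT]+?)GGATCC(?P<barcode>[ACGT]{8})TTGACA"
)


def parse_read(read):
    """Return the parts found in a read as a dict, or None if the layout is absent."""
    match = PATTERN.search(read)
    return match.groupdict() if match else None


# Run the parser on the anticipated DNA sequences to build a reference.
anticipated = {
    "design_1": "ACGTAC" + "ATGGCTAAA" + "GGATCC" + "AAAACCCC" + "TTGACA",
    "design_2": "ACGTAC" + "ATGGCTCGT" + "GGATCC" + "GGGGTTTT" + "TTGACA",
}
reference = {}
for name, sequence in anticipated.items():
    parts = parse_read(sequence)
    reference[(parts["variant"], parts["barcode"])] = name


def map_read(read):
    """Parse a read and look up its (variant, barcode) pair; exact matches only."""
    parts = parse_read(read)
    if parts is None:
        return None
    return reference.get((parts["variant"], parts["barcode"]))
```

A read that carries the variant of one design and the barcode of another parses cleanly but returns None from map_read, because that combination is not in the reference.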
Workflow
A cartoon example with two reference sequences, each consisting of a variant linked to a barcode:
Here's the analysis workflow and outputs:
Note that in the last two columns, the parsed variant is mapped to a reference variant defined by the barcode present in the same read, rather than all possible reference variants. Check out the example notebook for paired-end reads for details.
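As a rough illustration of barcode-directed comparison, the sketch below scores a parsed variant only against the reference variant that shares its barcode. The sequences, function names, and the choice of Levenshtein distance as the score are assumptions for this example; the package's actual output columns and scoring may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


# Hypothetical reference: each barcode is designed to pair with one variant.
reference_by_barcode = {"AAAACCCC": "ATGGCTAAA", "GGGGTTTT": "ATGGCTCGT"}


def compare_to_barcode_partner(variant, barcode):
    """Compare a parsed variant to the variant expected for its barcode.

    Returns None when the barcode is not in the reference.
    """
    expected = reference_by_barcode.get(barcode)
    if expected is None:
        return None
    return {"expected_variant": expected, "distance": edit_distance(variant, expected)}


# This read's variant is an exact copy of the second design's variant, but its
# barcode belongs to the first design, so it is scored against the first design.
result = compare_to_barcode_partner("ATGGCTCGT", "AAAACCCC")
```

Scoring against the barcode's partner makes a swapped variant show up as a mismatch, where a search over all reference variants would report a perfect hit.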
TL;DR
Run ngs-analysis --help to see available commands.
- Make an empty directory, add config.yaml and samples.csv based on the example.
- Add reference_dna.csv with anticipated DNA sequences (including adapters).
- Run ngs-analysis setup. Add --clean to start the analysis from scratch.
- Check that designs.csv is accurate; if not, fix config.yaml.
- If you have paired-end data, put it in 0_paired_reads/ and run ngs-analysis merge_read_pairs <sample>.
- If you have single-end data (e.g., nanopore), put it in 1_reads/.
- Run ngs-analysis parse_reads <sample>. Check that 2_parsed/<sample>.parsed.pq looks alright (with pandas, use pd.read_parquet).
- Run ngs-analysis map_parsed_reads <sample>. Results are in 3_mapped/<sample>.mapped.csv.
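For several samples, the per-sample commands above could be driven from a small Python script. This is a sketch under stated assumptions: the helper names are hypothetical, ngs-analysis setup is assumed to have been run already, and the commands are assumed to be run from inside the analysis directory. Only the subcommands and output paths come from the steps above.

```python
import subprocess
from pathlib import Path


def build_commands(sample, paired_end):
    """List the ngs-analysis commands for one sample, in order."""
    commands = []
    if paired_end:
        # Only paired-end data needs merging; single-end reads start in 1_reads/.
        commands.append(["ngs-analysis", "merge_read_pairs", sample])
    commands.append(["ngs-analysis", "parse_reads", sample])
    commands.append(["ngs-analysis", "map_parsed_reads", sample])
    return commands


def expected_outputs(sample, workdir="."):
    """Paths where the parsed and mapped results should appear."""
    root = Path(workdir)
    return {
        "parsed": root / "2_parsed" / f"{sample}.parsed.pq",
        "mapped": root / "3_mapped" / f"{sample}.mapped.csv",
    }


def run_sample(sample, paired_end, workdir="."):
    """Run each command in the analysis directory, stopping at the first failure."""
    for command in build_commands(sample, paired_end):
        subprocess.run(command, cwd=workdir, check=True)
    return expected_outputs(sample, workdir)
```

Calling run_sample("sample_A", paired_end=True) with a sample name from samples.csv would run the three steps in order and return the two output paths; check=True raises an exception if any step exits with an error.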
Install
pip install ngs-analysis
Make sure that the mmseqs and NGmerge executables are available (NGmerge is only needed for paired reads).
On Linux and Intel-based macOS, you can use conda install -c bioconda -c conda-forge mmseqs2 ngmerge. On Apple Silicon, mmseqs can be installed via Homebrew with brew install mmseqs2, and NGmerge can be installed from source, or via brew install brewsci/bio/ngmerge.
Tested on Linux and macOS (Apple Silicon).
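A quick way to confirm that both executables are on your PATH is to look them up with Python's standard library. The helper names below are hypothetical and not part of ngs-analysis; only the executable names mmseqs and NGmerge are taken from the instructions above.

```python
import shutil


def find_executables(names=("mmseqs", "NGmerge")):
    """Return {name: path or None} for each executable looked up on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_executables(paired_end=True):
    """List required executables that were not found on PATH.

    NGmerge is only required when working with paired-end reads.
    """
    required = ["mmseqs"] + (["NGmerge"] if paired_end else [])
    found = find_executables(required)
    return [name for name in required if found[name] is None]
```

An empty list from missing_executables() means both tools were found; otherwise the returned names indicate what still needs to be installed.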