One-step genotyping tools for targeted long-read sequencing

DAJIN2 is a genotyping tool for genome-edited samples that uses nanopore targeted sequencing.

The name DAJIN is derived from the phrase 一網打尽 (Ichimou DAJIN or Yīwǎng Dǎjìn), symbolizing the concept of capturing everything in one sweep.

🌟 Features

  • Comprehensive Mutation Detection: DAJIN2 detects a broad spectrum of genome editing events, from small changes to large structural variations.
    • DAJIN2 can also detect complex mutations characteristic of genome editing, such as "insertions occurring in regions where deletions have occurred."
  • Intuitive Visualization: Genome editing outcomes are visualized intuitively, so mutations can be identified and analyzed quickly and easily.
  • Multi-Sample Compatibility: Multiple samples can be processed in parallel, which makes large-scale experiments and comparative studies more efficient.

🛠 Installation

Prerequisites

  • Python >= 3.8
  • Unix-like environment (Linux, macOS, WSL2, etc.)

From Bioconda (Recommended)

conda create -n env-dajin2 -c conda-forge -c bioconda python=3.10 DAJIN2 -y
conda activate env-dajin2

From PyPI

pip install DAJIN2

[!CAUTION] If you encounter any issues during installation, please refer to the Troubleshooting Guide.

💻 Usage

Required Files

FASTQ/FASTA/BAM Files for Sample and Control

In DAJIN2, a control that has not undergone genome editing is necessary to detect genome-editing-specific mutations. Specify a directory containing the FASTQ/FASTA (both gzip compressed and uncompressed) or BAM files of the genome editing sample and control.

Basecalling with Guppy

After basecalling with Guppy, the following file structure will be output:

fastq_pass
├── barcode01
│   ├── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_0_0.fastq.gz
│   ├── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_10_0.fastq.gz
│   └── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_11_0.fastq.gz
└── barcode02
    ├── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_0_0.fastq.gz
    ├── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_10_0.fastq.gz
    └── fastq_runid_b347657c88dced2d15bf90ee6a1112a3ae91c1af_11_0.fastq.gz

Assuming barcode01 is the control and barcode02 is the sample, the respective directories are specified as follows:

  • Control: fastq_pass/barcode01
  • Sample: fastq_pass/barcode02

Basecalling with Dorado

For basecalling with Dorado (dorado demux), the following file structure will be output:

dorado_demultiplex
├── EXP-PBC096_barcode01.bam
└── EXP-PBC096_barcode02.bam

[!IMPORTANT] Store each BAM file in a separate directory. The directory names can be set arbitrarily.

dorado_demultiplex
├── barcode01
│   └── EXP-PBC096_barcode01.bam
└── barcode02
    └── EXP-PBC096_barcode02.bam
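
The rearrangement above can be scripted. The following is a minimal sketch, not part of DAJIN2: the helper names `barcode_dir_name` and `separate_bam_files` are hypothetical, and using the text after the last underscore as the directory name is only a convenience, since directory names can be set arbitrarily.

```python
import shutil
from pathlib import Path
from typing import List


def barcode_dir_name(bam_path: Path) -> str:
    """Return the text after the last underscore of the file stem.

    Example: EXP-PBC096_barcode01.bam -> barcode01.
    This naming rule is an assumption made for illustration only.
    """
    return bam_path.stem.rsplit("_", 1)[-1]


def separate_bam_files(demux_dir: Path) -> List[Path]:
    """Move every top-level *.bam file into its own subdirectory."""
    moved = []
    for bam in sorted(demux_dir.glob("*.bam")):
        target_dir = demux_dir / barcode_dir_name(bam)
        target_dir.mkdir(exist_ok=True)
        destination = shutil.move(str(bam), str(target_dir / bam.name))
        moved.append(Path(destination))
    return moved
```

Calling `separate_bam_files(Path("dorado_demultiplex"))` on the flat layout should produce the nested layout shown above.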

Similarly, store the FASTA files produced by sequence error correction with dorado correct in separate directories.

dorado_correct
├── barcode01
│   └── EXP-PBC096_barcode01.fasta
└── barcode02
    └── EXP-PBC096_barcode02.fasta

Assuming barcode01 is the control and barcode02 is the sample, the respective directories are specified as follows:

  • Control: dorado_demultiplex/barcode01 / dorado_correct/barcode01
  • Sample: dorado_demultiplex/barcode02 / dorado_correct/barcode02

FASTA File Including Anticipated Allele Sequences

The FASTA file should contain descriptions of the alleles anticipated as a result of genome editing.

[!IMPORTANT] A header name >control and its sequence are mandatory.

If there are anticipated alleles (e.g., knock-ins or knock-outs), include their sequences in the FASTA file too. These anticipated alleles can be named arbitrarily.

Below is an example of a FASTA file:

>control
ACGTACGTACGTACGT
>knock-in
ACGTACGTCCCCACGTACGT
>knock-out
ACGTACGT

Here, >control represents the sequence of the control allele, while >knock-in and >knock-out represent the sequences of the anticipated knock-in and knock-out alleles, respectively.
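
Before running DAJIN2, it can help to confirm that the allele FASTA file meets the requirement above. The sketch below is an independent check, not DAJIN2's own parser: `parse_fasta` and `check_allele_fasta` are hypothetical helper names, and the rejection of duplicate headers and empty sequences is an extra sanity check that this README does not state.

```python
from typing import Dict, Optional


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into a {header: sequence} mapping."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            if name in records:
                raise ValueError(f"duplicate header: >{name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence found before the first header")
        else:
            # Multi-line sequences are concatenated.
            records[name] += line.upper()
    return records


def check_allele_fasta(text: str) -> Dict[str, str]:
    """Raise ValueError unless a non-empty >control record is present."""
    records = parse_fasta(text)
    if "control" not in records:
        raise ValueError("a >control record is mandatory")
    empty = [header for header, seq in records.items() if not seq]
    if empty:
        raise ValueError(f"records without sequence: {empty}")
    return records
```

For a file on disk, pass its contents, for example `check_allele_fasta(Path("alleles.fa").read_text())`, where `alleles.fa` is a placeholder file name.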

Single Sample Analysis

DAJIN2 allows for the analysis of single samples (one sample vs one control).

DAJIN2 <-s|--sample> <-c|--control> <-a|--allele> [-n|--name] \
  [-g|--genome] [-t|--threads] [-h|--help] [-v|--version]

Options:
-s, --sample              Specify the path to the directory containing sample FASTQ/FASTA/BAM files.
-c, --control             Specify the path to the directory containing control FASTQ/FASTA/BAM files.
-a, --allele              Specify the path to the FASTA file.
-n, --name (Optional)     Set the output directory name. Default: 'Results'.
-g, --genome (Optional)   Specify the reference UCSC genome ID (e.g., hg38, mm39). Default: '' (empty string).
-t, --threads (Optional)  Set the number of threads. Default: 1.
-h, --help                Display this help message and exit.
-v, --version             Display the version number and exit.

Example

# Download example dataset
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_single.tar.gz
tar -xf example_single.tar.gz

# Run DAJIN2
DAJIN2 \
    --control example_single/control \
    --sample example_single/sample \
    --allele example_single/stx2_deletion.fa \
    --name stx2_deletion \
    --genome mm39 \
    --threads 4
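
The same command can be launched from a Python script, for instance inside a larger pipeline. This is a minimal sketch that assumes the `DAJIN2` executable is on `PATH`; `build_dajin2_command` and `run_dajin2` are hypothetical helper names, and only the options documented above are used.

```python
import subprocess
from typing import List


def build_dajin2_command(
    sample: str,
    control: str,
    allele: str,
    name: str = "Results",
    genome: str = "",
    threads: int = 1,
) -> List[str]:
    """Assemble the argument list for a single-sample DAJIN2 run."""
    command = [
        "DAJIN2",
        "--sample", sample,
        "--control", control,
        "--allele", allele,
        "--name", name,
    ]
    if genome:
        # --genome is optional; omit it when no UCSC genome ID is given.
        command += ["--genome", genome]
    command += ["--threads", str(threads)]
    return command


def run_dajin2(**options) -> None:
    """Run DAJIN2 and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_dajin2_command(**options), check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.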

Batch Processing

By using the batch subcommand, you can process multiple files simultaneously.
For this purpose, a CSV or Excel file consolidating the sample information is required.

[!NOTE] For guidance on how to compile sample information, please refer to this document.

DAJIN2 batch <-f|--file> [-t|--threads] [-h|--help]

options:
  -f, --file                Specify the path to the CSV or Excel file.
  -t, --threads (Optional)  Set the number of threads. Default: 1.
  -h, --help                Display this help message and exit.

Example

# Download the example dataset
curl -LJO https://github.com/akikuno/DAJIN2/raw/main/examples/example_batch.tar.gz
tar -xf example_batch.tar.gz

# Run DAJIN2
DAJIN2 batch --file example_batch/batch.csv --threads 4

📈 Report Contents

When DAJIN2 finishes, it creates a directory named DAJIN_Results.
The results of each analysis are stored in a subdirectory named after the --name value (tyr-substitution in this example):

DAJIN_Results/tyr-substitution
├── BAM
│   ├── tyr_c230gt_01
│   ├── tyr_c230gt_10
│   ├── tyr_c230gt_50
│   └── tyr_control
├── FASTA
│   ├── tyr_c230gt_01
│   ├── tyr_c230gt_10
│   └── tyr_c230gt_50
├── HTML
│   ├── tyr_c230gt_01
│   ├── tyr_c230gt_10
│   └── tyr_c230gt_50
├── MUTATION_INFO
│   ├── tyr_c230gt_01.csv
│   ├── tyr_c230gt_10.csv
│   └── tyr_c230gt_50.csv
├── read_plot.html
├── read_plot.pdf
└── read_summary.xlsx

1. BAM

The BAM directory contains the BAM files of reads classified per allele.

[!NOTE] If a reference genome is specified with the --genome option, the reads are aligned to that genome.
Without the --genome option, the reads are aligned to the control allele in the input FASTA file.

2. FASTA and HTML

The FASTA directory stores the FASTA files of each allele.
The HTML directory contains HTML files for each allele, where mutation sites are color-highlighted.
For example, a Tyr point mutation is highlighted in green.

3. MUTATION_INFO

The MUTATION_INFO directory stores tables listing the mutation sites of each allele.
For example, a Tyr point mutation is described by its position on the chromosome and the type of mutation.
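
The tables can be collected for downstream processing with the standard library. The sketch below assumes only the directory layout shown above (one CSV per sample under MUTATION_INFO) and makes no assumption about the column names; `load_mutation_info` is a hypothetical helper name.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_mutation_info(result_dir: Path) -> Dict[str, List[List[str]]]:
    """Map each sample name to the raw rows of its MUTATION_INFO CSV."""
    tables: Dict[str, List[List[str]]] = {}
    for csv_path in sorted((result_dir / "MUTATION_INFO").glob("*.csv")):
        with csv_path.open(newline="") as handle:
            # Rows are kept as lists of strings, header row included.
            tables[csv_path.stem] = list(csv.reader(handle))
    return tables
```

For the example tree, `load_mutation_info(Path("DAJIN_Results/tyr-substitution"))` would return one entry for each of the three tyr_c230gt samples.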

4. read_summary.xlsx, read_plot.html and read_plot.pdf

read_summary.xlsx lists the number of reads and the proportion of each allele.
Both read_plot.html and read_plot.pdf plot the proportion of each allele.
In the chart, Allele type indicates the type of allele, and Percent of reads shows the proportion of reads for each allele.

The Allele type includes:

  • Intact: Alleles that perfectly match the input FASTA allele.
  • Indels: Substitutions, deletions, insertions, or inversions within 50 bases.
  • SV: Substitutions, deletions, insertions, or inversions beyond 50 bases.
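
The size rule in the list above can be written as a small function. This is only an illustration of the thresholds as worded here, not DAJIN2's classification code: `allele_type` is a hypothetical name, and treating exactly 50 bases as Indels is an assumed reading of "within 50 bases".

```python
INDEL_MAX_BASES = 50  # threshold taken from the list above


def allele_type(largest_mutation_size: int) -> str:
    """Classify an allele by the size in bases of its largest mutation.

    A size of 0 means the allele matches the input FASTA allele exactly.
    """
    if largest_mutation_size < 0:
        raise ValueError("mutation size must not be negative")
    if largest_mutation_size == 0:
        return "Intact"
    if largest_mutation_size <= INDEL_MAX_BASES:
        return "Indels"
    return "SV"
```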

[!WARNING] In PCR amplicon sequencing, Percent of reads might not match the actual allele proportions because of amplification bias.
In particular, alleles with large deletions yield shorter amplicons that may be amplified preferentially, so their proportion can be overestimated.

📣 Feedback and Support

[!NOTE] For frequently asked questions, please refer to this page.

For more questions, bug reports, or other forms of feedback, we'd love to hear from you!
Please use GitHub Issues for all reporting purposes.

Please refer to CONTRIBUTING for how to contribute and how to verify your contributions.

🤝 Code of Conduct

Please note that this project is released with a Contributor Code of Conduct.
By participating in this project you agree to abide by its terms.

📄 References

For more information, please refer to the following publication:

Kuno A, et al. (2022) DAJIN enables multiplex genotyping to simultaneously validate intended and unintended target genome editing outcomes. PLoS Biology 20(1): e3001507.
