A small tool to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing.
Giraffe_View
Giraffe_View is designed to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing, including both DNA and RNA data. It provides four main functions to validate read quality:
- observe - calculates the observed read accuracy, mismatch proportion, and homopolymer identification.
- estimate - calculates the estimated read accuracy, which is equal to the Quality Score.
- gc_bias - compares the relationship between GC content and read coverage.
- modi_bin - computes statistics on the distribution of modifications based on the BED file.
Install
To use this software, you need to install additional dependencies including samtools, minimap2, seqkit, and bedtools for read processing. The following commands can help you to install the package and dependencies.
pip install Giraffe-View
conda install -c bioconda -c conda-forge samtools minimap2 seqkit bedtools -y
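Before running the tool, it can be useful to confirm that the four external dependencies are actually reachable. The following is a minimal sketch, not part of Giraffe_View itself; the function name `missing_tools` is hypothetical, and it only checks that each executable is on `PATH`, not that its version is compatible.

```python
import shutil

# External tools named in the install instructions above.
REQUIRED_TOOLS = ["samtools", "minimap2", "seqkit", "bedtools"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without the
    real binaries installed.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```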
General Usage
Giraffe_View is run with the following commands:
giraffe -h
usage: giraffe [-h] {observe,estimate,gc_bias,modi_bin} ...
A tool to help you assess quality of ONT data.
positional arguments:
{observe,estimate,gc_bias,modi_bin}
observe Observed quality in accuracy, mismatch, and homopolymer identification
estimate Estimated read accuracy
gc_bias Relationship between GC content and depth
modi_bin Average modification proportion of bins
optional arguments:
-h, --help show this help message and exit
The available sub-commands are:
observe
giraffe observe -h
usage: giraffe observe [-h] --input <fastq> --ref <reference> [--cpu <number>] [--plot]
optional arguments:
-h, --help show this help message and exit
--input <fastq> input reads
--ref <reference> input reference
--cpu <number> number of cpu (default:10)
--plot Results visualization
- fastq - the raw FASTQ data. Reads are filtered before analysis: short reads (< 200 bp) and low-quality reads (quality < 7) are removed.
- reference - the reference file in FASTA format.
- cpu - the number of CPUs used during processing.
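The filtering step described above can be illustrated with a short sketch. This is not the tool's own code: the thresholds come from the text, but the helper names `mean_quality` and `keep_read` are hypothetical, and two things are assumptions: that read quality is the arithmetic mean of the per-base Phred scores, and that the FASTQ uses the Phred+33 encoding.

```python
MIN_LENGTH = 200   # reads shorter than this are removed
MIN_QUALITY = 7    # reads with a lower mean quality are removed


def mean_quality(quality_string, offset=33):
    """Arithmetic mean of the per-base Phred scores (assumed definition)."""
    scores = [ord(char) - offset for char in quality_string]
    return sum(scores) / len(scores)


def keep_read(sequence, quality_string):
    """Return True when a read passes both the length and quality filters."""
    # The length check runs first, so mean_quality never sees an empty read.
    return len(sequence) >= MIN_LENGTH and mean_quality(quality_string) >= MIN_QUALITY
```

Other definitions of mean read quality exist (for example, averaging error probabilities instead of scores), and they can give different results near the threshold.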
estimate
giraffe estimate -h
usage: giraffe estimate [-h] --input <fastq> [--cpu <number>] [--plot]
optional arguments:
-h, --help show this help message and exit
--input <fastq> input reads
--cpu <number> number of cpu (default:10)
--plot Results visualization
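To make the link between quality scores and estimated accuracy concrete, here is a minimal sketch based on the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10). The function name `estimated_accuracy` is hypothetical, and averaging the per-base error probabilities over the read is an assumption; the tool may aggregate differently. Phred+33 encoding is also assumed.

```python
def estimated_accuracy(quality_string, offset=33):
    """Estimate read accuracy as 1 minus the mean per-base error probability."""
    # Phred definition: Q = -10 * log10(p), so p = 10 ** (-Q / 10).
    error_probs = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return 1 - sum(error_probs) / len(error_probs)
```

For instance, a read whose bases all have Q20 has a per-base error probability of 0.01, so this sketch returns an accuracy of about 0.99.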
gc_bias
giraffe gc_bias -h
usage: giraffe gc_bias [-h] --ref <reference> --input <sam/bam> [--binsize] [--plot]
optional arguments:
-h, --help show this help message and exit
--ref <reference> input reference file
--input <sam/bam> input bam/sam file
--binsize input bin size (default:1000)
--plot Results visualization
- reference - the reference file in FASTA format.
- sam/bam - the mapping result in SAM/BAM format. If you have used the observe function to process your data, the resulting tmp.sort.bam file can be used as the input.
- binsize - the length of a bin. A bin is the smallest unit used to count read coverage and GC content.
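The binning idea can be sketched for the GC side of the comparison. This is an illustration under stated assumptions, not the tool's implementation: the function name `gc_content_per_bin` is hypothetical, bins are assumed to be consecutive and non-overlapping, and the last bin is allowed to be shorter than `binsize`. Read coverage per bin would come from the SAM/BAM file and is not shown.

```python
def gc_content_per_bin(sequence, binsize=1000):
    """Return (start, end, gc_fraction) for consecutive bins of one sequence."""
    sequence = sequence.upper()
    bins = []
    for start in range(0, len(sequence), binsize):
        window = sequence[start:start + binsize]
        gc_count = window.count("G") + window.count("C")
        # Coordinates are 0-based and half-open, like BED intervals.
        bins.append((start, start + len(window), gc_count / len(window)))
    return bins
```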
modi_bin
giraffe modi_bin -h
usage: giraffe modi_bin [-h] --input <bed> --ref <reference> [--cpu <number>] [--plot]
optional arguments:
-h, --help show this help message and exit
--input <bed> input modificated bed file, please use the .bed as the file suffix
--ref <reference> input position file with CSV format, please use the .csv as the file suffix
--cpu <number> number of cpu (default:10)
--plot Results visualization
- bed - a BED file with four columns (three for position, one for methylation proportion). Separate the columns with tabs ("\t"), not spaces (" ").
    #chrom	start	end	value
    chr1	81	83	0.8
    chr1	21314	21315	0.3
    chr1	32421	32422	0.85
- reference - a CSV file with target regions.
    chr1,0,100000,1_0_100000
    chr1,100000,200000,1_100000_200000
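The two input formats above can be combined as in the following sketch, which averages the modification proportion of the sites falling inside each target region. It is a simplified illustration, not the tool's algorithm: the function name `average_by_region` is hypothetical, and it assumes a site counts toward a region only when it lies entirely inside it.

```python
import csv
import io


def average_by_region(bed_text, regions_text):
    """Map each region name to the mean value of the BED sites inside it."""
    regions = []
    for row in csv.reader(io.StringIO(regions_text)):
        if row:  # skip blank lines
            chrom, start, end, name = row
            regions.append((chrom, int(start), int(end), name))

    totals = {name: [0.0, 0] for _, _, _, name in regions}
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, start, end, value = line.split("\t")
        for r_chrom, r_start, r_end, name in regions:
            # Assumption: the site must lie fully inside the region.
            if chrom == r_chrom and r_start <= int(start) and int(end) <= r_end:
                totals[name][0] += float(value)
                totals[name][1] += 1

    # Regions without any site get None instead of a misleading zero.
    return {name: (total / count if count else None)
            for name, (total, count) in totals.items()}
```

With the example files shown above, all three sites fall inside the first region and none in the second.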
Workflow
graph TD
A(raw signal) -.-> |Basecall| B(FASTQ)
A(raw signal) -.-> |Basecall| C(modified BED)
C(modified BED) --> |modi_bin| D(modification distribution)
B(FASTQ) --> |estimate|e(estimated accuracy)
B(FASTQ) --> |observe| f(clean reads)
f(clean reads) --> |observe| g(aligned BAM)
g(aligned BAM) --> |observe|h(homopolymer identification)
g(aligned BAM) --> |observe|i(observed accuracy)
g(aligned BAM) --> |gc_bias|j(GC bias)
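The observe and gc_bias steps of the workflow can be chained from a script. The sketch below only builds the command lines from the flags documented in the usage sections; the helper names and the file names `reads.fastq` and `ref.fasta` are hypothetical. It also assumes that `--binsize` takes an integer value and that `tmp.sort.bam` is written to the working directory, neither of which is confirmed by the help text.

```python
import subprocess


def observe_command(fastq, reference, cpu=10, plot=False):
    """Build the argument list for `giraffe observe`."""
    command = ["giraffe", "observe", "--input", fastq, "--ref", reference,
               "--cpu", str(cpu)]
    if plot:
        command.append("--plot")
    return command


def gc_bias_command(reference, alignment, binsize=1000, plot=False):
    """Build the argument list for `giraffe gc_bias`."""
    # Assumption: --binsize is followed by an integer value.
    command = ["giraffe", "gc_bias", "--ref", reference, "--input", alignment,
               "--binsize", str(binsize)]
    if plot:
        command.append("--plot")
    return command


def run_pipeline(fastq, reference):
    """Run observe, then gc_bias on the BAM file that observe produces."""
    subprocess.run(observe_command(fastq, reference, plot=True), check=True)
    # Assumption: observe leaves tmp.sort.bam in the current directory.
    subprocess.run(gc_bias_command(reference, "tmp.sort.bam", plot=True), check=True)
```

Calling `run_pipeline("reads.fastq", "ref.fasta")` would then run both steps in order and stop at the first one that fails.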
Developing
- run the homopolymer identification with multiprocessing