
A small tool to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing.

Project description

Giraffe_View

Giraffe_View is designed to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing, including DNA and RNA data. It provides four main functions for validating read quality.

  • observe calculates the observed read accuracy, mismatch proportion, and homopolymer identification.
  • estimate calculates the estimated read accuracy, which is derived from the read quality scores.
  • gcbias compares the relationship between GC content and read coverage.
  • modibin computes statistics on the distribution of modifications based on the BED file.

Install

To use this software, you need to install additional dependencies for read processing: samtools, minimap2, seqkit, and bedtools. The following commands install the package and its dependencies.

pip install Giraffe-View
conda install -c bioconda -c conda-forge samtools minimap2 seqkit bedtools -y
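
After installing, it can be useful to confirm that the external tools are actually reachable on your PATH. The snippet below is a minimal sketch of such a check and is not part of Giraffe_View; the helper name `missing_tools` is hypothetical, and only the four tool names come from the instructions above.

```python
import shutil

# External tools named in the install instructions above.
REQUIRED_TOOLS = ("samtools", "minimap2", "seqkit", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

An empty result means every listed tool was found; it does not verify tool versions.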

General Usage

Giraffe_View is run with the following commands:

usage: giraffe [-h] {observe,estimate,gcbias,modibin} ...

A tool to help you assess quality of ONT data.

positional arguments:
  {observe,estimate,gcbias,modibin}
    observe             Observed quality in accuracy, mismatch, and homopolymer identification
    estimate            Estimated read accuracy
    gcbias              Relationship between GC content and depth
    modibin             Average modification proportion of bins

optional arguments:
  -h, --help            show this help message and exit

The available sub-commands are:

observe

giraffe observe -h
usage: giraffe observe [-h] --input <fastq> --ref <reference> [--cpu <number>] [--plot]

optional arguments:
  -h, --help         show this help message and exit
  --input <fastq>    input reads
  --ref <reference>  input reference
  --cpu <number>     number of cpu (default:10)
  --plot             Results visualization
  • fastq - the raw FASTQ data. Filtering steps are applied, including removal of short reads (< 200 bp) and low-quality reads (quality < 7).
  • reference - the reference file in FASTA format.
  • cpu - the number of CPUs used during processing.
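
To make the read filter described above concrete, here is a minimal sketch of a length and quality check on a single read. The thresholds (200 bp, quality 7) come from the description above; how Giraffe_View averages the quality of a read is not stated, so averaging over per-base error probabilities and the Phred+33 encoding are assumptions, and the function names are hypothetical.

```python
import math


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read, averaged over per-base error probabilities.

    Assumption: Phred+33 encoding and probability-space averaging.
    """
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence, quality_string, min_length=200, min_quality=7):
    """Return True if the read passes both the length and the quality filter."""
    if len(sequence) < min_length:
        return False
    return mean_quality(quality_string) >= min_quality
```

For example, `keep_read("A" * 250, "I" * 250)` passes both checks, while a 100 bp read is rejected on length alone.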

estimate

giraffe estimate -h
usage: giraffe estimate [-h] --input <fastq> [--cpu <number>] [--plot]

optional arguments:
  -h, --help       show this help message and exit
  --input <fastq>  input reads
  --cpu <number>   number of cpu (default:10)
  --plot           Results visualization
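
The link between a quality score and an estimated accuracy follows the standard Phred definition, where a score Q corresponds to an error probability of 10^(-Q/10). The sketch below illustrates that relation for a single read; whether Giraffe_View summarizes a read in exactly this way is an assumption, as are the Phred+33 encoding and the function names.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to an accuracy in [0, 1]."""
    return 1 - 10 ** (-q / 10)


def read_accuracy(quality_string, offset=33):
    """Estimated accuracy of a read: 1 minus the mean per-base error probability.

    Assumption: Phred+33 encoded quality string.
    """
    errors = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    return 1 - sum(errors) / len(errors)
```

Under this relation, Q10 corresponds to 90% accuracy and Q20 to 99%.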

gcbias

giraffe gcbias -h
usage: giraffe gcbias [-h] --ref <reference> --input <sam/bam> [--binsize] [--plot]

optional arguments:
  -h, --help         show this help message and exit
  --ref <reference>  input reference file
  --input <sam/bam>  input bam/sam file
  --binsize          input bin size (default:1000)
  --plot             Results visualization
  • reference - the reference file in FASTA format.
  • sam / bam - the mapping result in SAM/BAM format. If you have already processed your data with the observe function, the resulting tmp.sort.bam file can be used as the input.
  • binsize - the bin length. A bin is the smallest unit over which read coverage and GC content are counted.
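
As an illustration of what a bin means here, the following sketch splits a reference sequence into consecutive bins and computes the GC fraction of each. It is not the Giraffe_View implementation: the function name is hypothetical, and the handling of a shorter final bin and of non-ACGT bases (counted in the denominator) are assumptions.

```python
def gc_content_by_bin(sequence, binsize=1000):
    """Yield (start, end, gc_fraction) for consecutive bins of `sequence`.

    The last bin may be shorter than `binsize`.
    """
    sequence = sequence.upper()
    for start in range(0, len(sequence), binsize):
        window = sequence[start:start + binsize]
        gc = window.count("G") + window.count("C")
        yield start, start + len(window), gc / len(window)
```

Pairing each bin's GC fraction with the read depth over the same interval gives the relationship that gcbias reports.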

modibin

giraffe modibin -h
usage: giraffe modibin [-h] --input <bed> --ref <reference> [--cpu <number>] [--plot]

optional arguments:
  -h, --help         show this help message and exit
  --input <bed>      input modificated bed file, please use the .bed as the file suffix
  --ref <reference>  input position file with CSV format, please use the .csv as the file suffix
  --cpu <number>     number of cpu (default:10)
  --plot             Results visualization
  • bed - a BED file with four columns (three for position, one for methylation proportion). Please separate the columns with tabs ("\t") rather than spaces (" ").

    #chrom	start	end	value
    chr1	81	83	0.8
    chr1	21314	21315	0.3
    chr1	32421	32422	0.85
    
  • reference - a CSV file with target regions.

    chr1,0,100000,1_0_100000
    chr1,100000,200000,1_100000_200000
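
To show how the two inputs relate, the sketch below averages the BED values that fall inside each CSV region. It is an illustration under assumptions, not the modibin implementation: the function name is hypothetical, a site is counted only when it lies entirely within a region, and the mean is unweighted.

```python
import csv
import io
from collections import defaultdict


def mean_value_per_region(bed_text, csv_text):
    """Return {region_name: mean value} for BED sites inside each CSV region."""
    regions = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip empty lines
            chrom, start, end, name = row
            regions.append((chrom, int(start), int(end), name))

    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, start, end, value = line.split("\t")
        start, end = int(start), int(end)
        for r_chrom, r_start, r_end, name in regions:
            if chrom == r_chrom and r_start <= start and end <= r_end:
                totals[name] += float(value)
                counts[name] += 1
    # Regions without any site are left out of the result.
    return {name: totals[name] / counts[name] for name in counts}
```

With the example files above, all three sites fall in the first region, so only `1_0_100000` appears in the result.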
    

Workflow

graph TD
	A(raw signal) -.-> |Basecall| B(FASTQ)
	A(raw signal) -.-> |Basecall| C(modificated BED)
	C(modificated BED) --> |modibin| D(modification distribution)
	B(FASTQ) --> |estimate| e(estimated accuracy)
	B(FASTQ) --> |observe| f(clean reads)
	f(clean reads) --> |observe| g(aligned BAM)

	g(aligned BAM) --> |observe| h(homopolymer identification)
	g(aligned BAM) --> |observe| i(observed accuracy)
	g(aligned BAM) --> |gcbias| j(GC bias)

Developing

  • run homopolymer identification with multiple processes.
