
A small tool to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing.

Giraffe_View

Giraffe_View is designed to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing, including DNA and RNA data. It provides four main functions for validating read quality.

  • observe calculates the observed read accuracy and mismatch proportion, and identifies homopolymers.
  • estimate calculates the estimated read accuracy, i.e. the accuracy implied by the reads' Quality Scores.
  • GC_bias examines the relationship between GC content and read coverage.
  • modi summarizes the distribution of modifications based on a bed file.
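
As an illustration of what "observed read accuracy" can mean, the sketch below computes it from an alignment's CIGAR string. This is not taken from the Giraffe_View source: the function name `observed_accuracy` is hypothetical, and the definition used here (matches divided by matches, mismatches, insertions, and deletions) is one common convention that is only assumed to resemble what observe reports. It also assumes an extended CIGAR in which `=` marks matches and `X` marks mismatches; plain `M` operations are ignored because they do not distinguish the two.

```python
import re

# Matches one CIGAR operation, e.g. "120=" or "3X".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def observed_accuracy(cigar: str) -> float:
    """Return matches / (matches + mismatches + insertions + deletions).

    Assumption: the CIGAR uses '=' for matches and 'X' for mismatches.
    Soft clips, hard clips, and other operations are not counted.
    """
    counts = {"=": 0, "X": 0, "I": 0, "D": 0}
    for length, op in CIGAR_OP.findall(cigar):
        if op in counts:
            counts[op] += int(length)
    aligned = sum(counts.values())
    if aligned == 0:
        raise ValueError("CIGAR has no =, X, I or D operations")
    return counts["="] / aligned


# 90 matches, 4 mismatches, 3 inserted and 3 deleted bases (5 bases soft-clipped).
print(observed_accuracy("5S90=4X3I3D"))  # 0.9
```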

Install

To use this software, you will need to install additional dependencies, including samtools, minimap2, seqkit, bedtools, pysam, numpy, pandas, and rpy2, plus the R packages ggplot2 and patchwork for plotting. You can install them with the following commands.

# for data processing
pip install rpy2==3.0 pysam numpy pandas
conda install -c bioconda -c conda-forge samtools minimap2 seqkit bedtools -y

# for figure plotting
conda install -c R ggplot2 patchwork -y
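
After installation, it can be useful to confirm that the command-line tools are actually reachable. The snippet below is a small convenience sketch, not part of Giraffe_View; the helper name `missing_tools` is hypothetical, and it only checks that each executable is on `PATH`, not its version.

```python
import shutil

# Command-line tools named in the install instructions above.
REQUIRED_TOOLS = ("samtools", "minimap2", "seqkit", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All command-line dependencies found.")
```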

General Usage

Giraffe_View is run with the following command:

python Giraffe_View.py --help
usage: Giraffe_view [-h] {observe,modi,GC_bias,estimate} ...

A tool to help you assess quality of your ONT data.

positional arguments:
  {observe,modi,GC_bias,estimate}
    observe             Observed quality in accuracy, mismatch, and homopolymer
    modi                Average modification proportion of regions
    GC_bias             Relationship between GC content and depth
    estimate            Estimated read accuracy

optional arguments:
  -h, --help            show this help message and exit

The available sub-commands are:

observe

python Giraffe_View.py observe --help
usage: Giraffe_view observe [-h] --input <fastq> --ref <reference> [--cpu <number>]

optional arguments:
  -h, --help         show this help message and exit
  --input <fastq>    input reads
  --ref <reference>  input reference
  --cpu <number>     number of cpu (default:10)
  • fastq - the raw FASTQ data. Reads are filtered before analysis: short reads (< 200 bp) and low-quality reads (quality < 7) are removed.
  • reference - the reference file in FASTA format.
  • cpu - the number of CPUs to use during processing.
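
The filtering described above can be pictured with the following sketch. It is an illustration only: the names `mean_phred` and `passes_filter` are hypothetical, the quality of a read is assumed to be the arithmetic mean of its Phred scores (Phred+33 encoding), and the tool itself may compute read quality differently.

```python
def mean_phred(quality: str, offset: int = 33) -> float:
    """Arithmetic mean of the Phred scores in a FASTQ quality string."""
    if not quality:
        raise ValueError("empty quality string")
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def passes_filter(sequence: str, quality: str,
                  min_length: int = 200, min_quality: float = 7.0) -> bool:
    """Keep a read only if it is long enough and of sufficient mean quality."""
    if len(sequence) < min_length:
        return False
    return mean_phred(quality) >= min_quality


# '5' encodes Q20 and '#' encodes Q2 in Phred+33.
print(passes_filter("A" * 250, "5" * 250))  # True: long enough, mean Q20
print(passes_filter("A" * 100, "5" * 100))  # False: shorter than 200 bp
print(passes_filter("A" * 250, "#" * 250))  # False: mean Q2 is below 7
```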

estimate

python Giraffe_View.py estimate --help
usage: Giraffe_view estimate [-h] --input <fastq> [--cpu <number>]

optional arguments:
  -h, --help       show this help message and exit
  --input <fastq>  input reads
  --cpu <number>   number of cpu (default:10)
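
For intuition about how a Quality Score translates into an estimated accuracy: a Phred score Q corresponds to an error probability of 10^(-Q/10). The sketch below applies that standard relation to a FASTQ quality string. The function name `estimated_accuracy` is hypothetical, and averaging the per-base error probabilities over the read is an assumption about how a read-level value might be obtained, not a description of the tool's internals.

```python
def estimated_accuracy(quality: str, offset: int = 33) -> float:
    """Estimate read accuracy as 1 minus the mean per-base error probability.

    Each Phred score Q maps to an error probability of 10 ** (-Q / 10).
    Assumption: Phred+33 encoding and a plain mean over all bases.
    """
    if not quality:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality]
    return 1.0 - sum(error_probs) / len(error_probs)


# '5' encodes Q20 (1% error) and '+' encodes Q10 (10% error) in Phred+33.
print(round(estimated_accuracy("5" * 10), 4))  # 0.99
print(round(estimated_accuracy("+" * 10), 4))  # 0.9
```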

GC_bias

python Giraffe_View.py GC_bias --help
usage: Giraffe_view GC_bias [-h] --ref <reference> --input <sam/bam> [--binsize]

optional arguments:
  -h, --help         show this help message and exit
  --ref <reference>  input reference file
  --input <sam/bam>  input bam/sam file
  --binsize          input bin size (default:1000)
  • reference - the reference file in fasta format.
  • sam / bam - the mapping result in SAM/BAM format. If you have used the observe function to process your data, the resulting tmp.sort.bam file can be used as the input.
  • binsize - the length of bin. A bin is the smallest unit to count the read coverage and GC content.
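
To make the idea of a bin concrete, the sketch below splits a reference sequence into fixed-size bins and reports the GC fraction of each. It covers only the GC-content half of the comparison (read coverage per bin would come from the alignment file), and the function name `gc_content_per_bin` is hypothetical. Counting ambiguous bases such as N in the denominator is an assumption.

```python
def gc_content_per_bin(sequence: str, binsize: int = 1000):
    """Return a list of (start, end, gc_fraction) for consecutive bins.

    The last bin may be shorter than binsize. All bases in a bin,
    including N, count towards the denominator (an assumption).
    """
    if binsize <= 0:
        raise ValueError("binsize must be positive")
    seq = sequence.upper()
    bins = []
    for start in range(0, len(seq), binsize):
        chunk = seq[start:start + binsize]
        gc = chunk.count("G") + chunk.count("C")
        bins.append((start, start + len(chunk), gc / len(chunk)))
    return bins


# Two bins of 4 bp: the first is all G/C, the second all A/T.
print(gc_content_per_bin("GGCCAATT", binsize=4))
# [(0, 4, 1.0), (4, 8, 0.0)]
```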

modi

python Giraffe_View.py modi --help
usage: Giraffe_view modi [-h] --input <bed> --ref <reference> [--cpu <number>]

optional arguments:
  -h, --help         show this help message and exit
  --input <bed>      input bed file
  --ref <reference>  input reference
  --cpu <number>     number of cpu (default:10)
  • bed - a bed file with four columns (three columns for position, one for methylation proportion). Separate the columns with tabs ("\t"), not spaces (" ").

    #chrom	start	end	value
    chr1	81	83	0.8
    chr1	21314	21315	0.3
    chr1	32421	32422	0.85
    
  • reference - a csv file listing the target regions, one per line (chromosome, start, end, and region name).

    chr1,0,100000,1_0_100000
    chr1,100000,200000,1_100000_200000
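
Using the two example inputs above, the sketch below shows one way an average modification proportion per region could be computed. It is a simplified illustration rather than the tool's implementation: the function name `mean_modification_per_region` is hypothetical, a site is assumed to belong to a region when its start coordinate falls in the half-open interval [start, end), and regions without any site are reported as `None`.

```python
import csv
import io


def mean_modification_per_region(bed_text: str, regions_text: str) -> dict:
    """Map each region name to the mean value of the sites inside it."""
    sites = []
    for line in bed_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        chrom, start, _end, value = line.split("\t")[:4]
        sites.append((chrom, int(start), float(value)))

    result = {}
    for row in csv.reader(io.StringIO(regions_text)):
        if not row:
            continue
        chrom, r_start, r_end, name = row[0], int(row[1]), int(row[2]), row[3]
        values = [v for c, s, v in sites if c == chrom and r_start <= s < r_end]
        result[name] = sum(values) / len(values) if values else None
    return result


BED_TEXT = (
    "#chrom\tstart\tend\tvalue\n"
    "chr1\t81\t83\t0.8\n"
    "chr1\t21314\t21315\t0.3\n"
    "chr1\t32421\t32422\t0.85\n"
)
REGIONS_TEXT = (
    "chr1,0,100000,1_0_100000\n"
    "chr1,100000,200000,1_100000_200000\n"
)

# All three sites fall in the first region; the second region has none.
for name, mean in mean_modification_per_region(BED_TEXT, REGIONS_TEXT).items():
    print(name, mean)
```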
    

Workflow

graph TD
raw_data --> |Quality control| clean_data
raw_data --> |Basecall| modification_file
modification_file --> modification_distribution
clean_data --> Estimated_accuracy
clean_data --> |Reference| aligned_file
aligned_file --> Homopolymer_analysis
aligned_file --> GC_bias 
aligned_file --> Observed_accuracy
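
The workflow above can be sketched as a sequence of command lines assembled from the sub-command usage shown earlier. This is a hedged example, not an official driver script: the helper `workflow_commands` and all input file names are hypothetical, `tmp.sort.bam` is assumed to be written to the current directory by observe, and the commands are only printed, not executed.

```python
import shlex


def workflow_commands(fastq, reference, bed, regions_csv,
                      cpu=10, script="Giraffe_View.py"):
    """Build the four sub-command invocations in workflow order.

    Only flags shown in the --help output are used; GC_bias runs with
    its default bin size.
    """
    cpu = str(cpu)
    return [
        ["python", script, "estimate", "--input", fastq, "--cpu", cpu],
        ["python", script, "observe", "--input", fastq,
         "--ref", reference, "--cpu", cpu],
        # Assumption: observe has left tmp.sort.bam in the working directory.
        ["python", script, "GC_bias", "--ref", reference,
         "--input", "tmp.sort.bam"],
        ["python", script, "modi", "--input", bed,
         "--ref", regions_csv, "--cpu", cpu],
    ]


# Hypothetical file names, for illustration only.
for command in workflow_commands("reads.fastq", "reference.fasta",
                                 "modification.bed", "regions.csv", cpu=4):
    print(shlex.join(command))
```

To actually run the steps, each list could be passed to `subprocess.run(command, check=True)` in order.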

Developing

  • Add an example showing how to run the tool
  • Polish the result figures
  • Run the homopolymer identification with multiple processes
