A small tool to help assess and visualize the accuracy of a sequencing dataset, specifically for Oxford Nanopore Technologies (ONT) long-read sequencing.

# Giraffe_View 

**Giraffe_View** helps assess and visualize the accuracy of a sequencing dataset, specifically Oxford Nanopore Technologies (ONT) long-read sequencing data, including both DNA and RNA. It provides four main functions for validating read quality.

- `observe` calculates the observed read accuracy and the mismatch proportion, and identifies homopolymers.
- `estimate` calculates the estimated read accuracy, which corresponds to the read's quality score (Q score).
- `GC_bias` examines the relationship between GC content and read coverage.
- `modi` computes statistics on the distribution of modifications based on a bed file.
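
To make the link between quality score and estimated accuracy concrete, here is a minimal sketch of the standard Phred relation, accuracy = 1 - 10^(-Q/10). The function names are hypothetical and the README does not state that Giraffe_View computes its estimate exactly this way, so treat it as an illustration rather than the tool's implementation.

```python
def phred_to_accuracy(q: float) -> float:
    """Expected accuracy for a Phred quality score Q (standard definition)."""
    return 1.0 - 10 ** (-q / 10)


def estimated_read_accuracy(quality_string: str, offset: int = 33) -> float:
    """Mean expected accuracy of one read from its FASTQ quality string.

    Assumes Phred+33 encoding. Error probabilities are averaged directly,
    because averaging the Q scores themselves would overstate accuracy.
    """
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return 1.0 - sum(error_probs) / len(error_probs)


# Q10 corresponds to 90 % accuracy, Q20 to 99 %.
for q in (10, 20):
    print(q, phred_to_accuracy(q))
```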



## Install

To use this software, you will need to install additional dependencies: the command-line tools samtools, minimap2, seqkit, and bedtools; the Python packages pysam, numpy, pandas, and rpy2; and the R packages ggplot2 and patchwork for plotting. You can install them with the following commands.

```shell
# for data processing
pip install rpy2==3.0 pysam numpy pandas
conda install -c bioconda -c conda-forge samtools minimap2 seqkit bedtools -y

# for figure plotting (conda names R packages with an "r-" prefix)
conda install -c conda-forge r-ggplot2 r-patchwork -y
```



## General Usage

Giraffe_View is run with the following command:

```shell
python Giraffe_View.py --help
```

```shell
usage: Giraffe_view [-h] {observe,modi,GC_bias,estimate} ...

A tool to help you assess quality of your ONT data.

positional arguments:
  {observe,modi,GC_bias,estimate}
    observe             Observed quality in accuracy, mismatch, and homopolymer
    modi                Average modification proportion of regions
    GC_bias             Relationship between GC content and depth
    estimate            Estimated read accuracy

optional arguments:
  -h, --help            show this help message and exit
```



The available sub-commands are:

### observe

```shell
python Giraffe_View.py observe --help
```

```shell
usage: Giraffe_view observe [-h] --input <fastq> --ref <reference> [--cpu <number>]

optional arguments:
  -h, --help         show this help message and exit
  --input <fastq>    input reads
  --ref <reference>  input reference
  --cpu <number>     number of cpu (default:10)
```

- `fastq` - the raw FASTQ data. Before analysis, short reads (< 200 bp) and low-quality reads (quality < 7) are removed.
- `reference` - the reference file in FASTA format.
- `cpu` - the number of CPUs to use during processing.
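
The filtering step described above can be sketched as follows. This is an illustration only: the README gives the thresholds (200 bp, quality 7) but not how read quality is computed, so the mean Phred score used here (averaged over error probabilities, Phred+33 encoding) and all function names are assumptions.

```python
import io
import math

MIN_LENGTH = 200    # reads shorter than this are dropped (from the README)
MIN_QUALITY = 7.0   # reads below this quality are dropped (from the README)


def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read, averaged in error-probability space (assumed metric)."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return -10 * math.log10(sum(probs) / len(probs))


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ file."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def keep_read(seq, qual):
    """True if the read passes both the length and the quality threshold."""
    return len(seq) >= MIN_LENGTH and mean_quality(qual) >= MIN_QUALITY


# Two synthetic reads: one long and high quality, one too short.
records = [("@long_good", "A" * 250, "I" * 250), ("@short", "A" * 100, "I" * 100)]
demo = io.StringIO("".join(f"{h}\n{s}\n+\n{q}\n" for h, s, q in records))
kept = [h for h, s, q in read_fastq(demo) if keep_read(s, q)]
print(kept)
```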



### estimate

```shell
python Giraffe_View.py estimate --help
```

```shell
usage: Giraffe_view estimate [-h] --input <fastq> [--cpu <number>]

optional arguments:
  -h, --help       show this help message and exit
  --input <fastq>  input reads
  --cpu <number>   number of cpu (default:10)
```



### GC_bias

```shell
python Giraffe_View.py GC_bias --help
```

```shell
usage: Giraffe_view GC_bias [-h] --ref <reference> --input <sam/bam> [--binsize]

optional arguments:
  -h, --help         show this help message and exit
  --ref <reference>  input reference file
  --input <sam/bam>  input bam/sam file
  --binsize          input bin size (default:1000)
```

- `reference` - the reference file in FASTA format.
- `sam` / `bam` - the mapping result in SAM/BAM format. If you have already processed your data with the `observe` function, the resulting `tmp.sort.bam` file can be used as input.
- `binsize` - the bin length. A bin is the smallest unit over which read coverage and GC content are counted.
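
As a rough illustration of what a bin means here, the sketch below splits a reference sequence into consecutive bins and reports the GC fraction of each. The function name is hypothetical and this is not the tool's own code; the per-bin read coverage, which would come from the SAM/BAM file, is not shown.

```python
def gc_content_per_bin(sequence, binsize=1000):
    """Return (start, end, gc_fraction) for consecutive bins of a sequence.

    The last bin may be shorter than binsize. Ambiguous bases such as N
    count toward the bin length but not toward GC (an assumption).
    """
    if binsize <= 0:
        raise ValueError("binsize must be positive")
    seq = sequence.upper()
    bins = []
    for start in range(0, len(seq), binsize):
        chunk = seq[start:start + binsize]
        gc = chunk.count("G") + chunk.count("C")
        bins.append((start, start + len(chunk), gc / len(chunk)))
    return bins


# Toy example with a bin size of 4 instead of the default 1000.
for start, end, gc in gc_content_per_bin("GGCCATATGC", binsize=4):
    print(start, end, gc)
```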



### modi

```shell
python Giraffe_View.py modi --help
```

```shell
usage: Giraffe_view modi [-h] --input <bed> --ref <reference> [--cpu <number>]

optional arguments:
  -h, --help         show this help message and exit
  --input <bed>      input bed file
  --ref <reference>  input reference
  --cpu <number>     number of cpu (default:10)
```

- `bed` - a bed file with four columns (three for the position, one for the methylation proportion). Columns must be separated by tabs ("\t"), not spaces (" ").

```shell
#chrom start end value
chr1 81 83 0.8
chr1 21314 21315 0.3
chr1 32421 32422 0.85
```

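- `reference` - a CSV file with the target regions, one region per line. In the example below, each line gives the chromosome, start, end, and a region label.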

```shell
chr1,0,100000,1_0_100000
chr1,100000,200000,1_100000_200000
```
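
To show how the two inputs fit together, here is a minimal sketch that averages the modification proportion of the sites falling in each target region. It is an assumption-laden illustration, not the tool's code: the function names are hypothetical, and a site is assigned to a region by its start coordinate.

```python
import csv
import io
from collections import defaultdict


def load_regions(handle):
    """Read regions from CSV lines of the form chrom,start,end,label."""
    regions = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        chrom, start, end, label = row
        regions.append((chrom, int(start), int(end), label))
    return regions


def mean_modification_by_region(bed_handle, regions):
    """Average the fourth bed column over the sites inside each region."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in bed_handle:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header comment
        chrom, start, _end, value = line.rstrip("\n").split("\t")[:4]
        for r_chrom, r_start, r_end, label in regions:
            if chrom == r_chrom and r_start <= int(start) < r_end:
                totals[label] += float(value)
                counts[label] += 1
    return {label: totals[label] / counts[label] for label in counts}


# The example inputs from this section (bed columns separated by tabs).
bed = io.StringIO(
    "#chrom\tstart\tend\tvalue\n"
    "chr1\t81\t83\t0.8\n"
    "chr1\t21314\t21315\t0.3\n"
    "chr1\t32421\t32422\t0.85\n"
)
regions = load_regions(io.StringIO(
    "chr1,0,100000,1_0_100000\nchr1,100000,200000,1_100000_200000\n"
))
result = mean_modification_by_region(bed, regions)
print(result)
```

Regions that contain no sites are simply absent from the returned dictionary in this sketch.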



## Workflow

```mermaid
graph TD
raw_data --> |Quality control| clean_data
raw_data --> |Basecall| modification_file
modification_file --> modification_distribution
clean_data --> Estimated_accuracy
clean_data --> |Reference| aligned_file
aligned_file --> Homopolymer_analysis
aligned_file --> GC_bias
aligned_file --> Observed_accuracy
```
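
The alignment-based branch of the workflow can be chained from a small driver script. The sketch below only uses the sub-commands and options listed above; the input file names, the helper names, the assumption that `--binsize` takes a value, and the assumption that `observe` writes `tmp.sort.bam` to the working directory are all illustrative.

```python
import shlex
import subprocess

SCRIPT = "Giraffe_View.py"


def build_commands(fastq, reference, bam="tmp.sort.bam", cpu=10, binsize=1000):
    """Build the estimate, observe, and GC_bias command lines in order."""
    return [
        ["python", SCRIPT, "estimate", "--input", fastq, "--cpu", str(cpu)],
        ["python", SCRIPT, "observe", "--input", fastq, "--ref", reference,
         "--cpu", str(cpu)],
        # GC_bias reuses the sorted BAM produced by the observe step.
        ["python", SCRIPT, "GC_bias", "--ref", reference, "--input", bam,
         "--binsize", str(binsize)],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Hypothetical input names; the dry run only prints the commands.
commands = build_commands("reads.fastq", "ref.fasta")
run_all(commands)
```

Keeping the dry run as the default makes it easy to check the command lines before launching the long-running steps.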



## Developing

- Add an example showing how to run the tool
- Polish the result figures
- Run the homopolymer identification with multiple processes
