Giraffe View

Giraffe_View provides a comprehensive assessment of the accuracy of long-read sequencing datasets obtained from both the PacBio and Nanopore platforms. It offers four functions:

estimate - Calculation of estimated read accuracy (Q score), length, and GC content.
observe - Calculation of observed read accuracy, mismatch proportion, and homopolymer identification (e.g. AAAA).
gcbias - Calculation of the relationship between GC content and sequencing depth.
modbin - Calculation of the distribution of modification (e.g. 5mC or 6mA methylation) at the regional level.

Installation

Before using this tool, install the dependencies it needs for read processing: samtools, minimap2, and bedtools. The following commands install both the dependencies and the software package.

conda install -c bioconda -c conda-forge samtools minimap2 bedtools -y
pip install Giraffe-View

If you are unfamiliar with installing conda, refer to the official conda documentation for detailed instructions.

General Usage

Run giraffe with the following commands.

estimate

giraffe estimate --input {read_list.txt} --cpu 4 --plot

read_list.txt - a table with your sample ID, sequencing platform (ONT/Pacbio), and the path to your sequencing reads (FASTQ format).

# A demo of read_list.txt
# Note: separate the columns with a single space (" ").
R1 ONT /home/user/test/reads/S1.fastq
R2 Pacbio /home/user/test/reads/S2.fastq
R3 ONT /home/user/test/reads/S3.fastq
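A malformed list file is an easy mistake to make, so it can help to check the table before starting a run. The following is a minimal sketch, not part of Giraffe_View: the `Sample` type and `parse_read_list` function are hypothetical names, and skipping `#` lines and rejecting platforms other than ONT/Pacbio are assumptions based on the demo above.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    sample_id: str
    platform: str
    path: str


def parse_read_list(text: str) -> List[Sample]:
    """Parse a header-less table with three space-separated columns."""
    samples = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        # Assumption: blank lines and "#" comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        fields = line.split(" ")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_number}: expected 3 space-separated columns, "
                f"got {len(fields)}"
            )
        sample_id, platform, path = fields
        # Assumption: only the two platform labels shown in the demo are valid.
        if platform not in {"ONT", "Pacbio"}:
            raise ValueError(f"line {line_number}: unknown platform {platform!r}")
        samples.append(Sample(sample_id, platform, path))
    return samples


demo = """R1 ONT /home/user/test/reads/S1.fastq
R2 Pacbio /home/user/test/reads/S2.fastq
"""
for sample in parse_read_list(demo):
    print(sample.sample_id, sample.platform, sample.path)
```

Because the columns are split on single spaces, a path that contains a space would be reported as having too many columns.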

observe

giraffe observe --input {read_list.txt} --ref {genome.fa} --cpu 4 --plot

read_list.txt - the same table as described for estimate above.

gcbias

giraffe gcbias --input {bam_list.txt} --ref {genome.fa} --plot

bam_list.txt - a table with your sample ID, sequencing platform, and the path to your alignment files (SAM/BAM format).

# A demo of bam_list.txt
# Note: separate the columns with a single space (" ").
# If you have already processed your data with the observe function, the resulting BAM files can be used as input.
R1 ONT /home/user/test/Giraffe_Results/2_Observed_quality/S1.bam
R2 Pacbio /home/user/test/Giraffe_Results/2_Observed_quality/S2.bam
R3 ONT /home/user/test/Giraffe_Results/2_Observed_quality/S3.bam

modbin

giraffe modbin --input {methylation_list.txt} --pos {promoter.csv} --cpu 4 --plot

methylation_list.txt - a table with your sample ID, sequencing platform, and the path to your methylation profiling files (BED format).

# A demo of methylation_list.txt
# Note: separate the columns with a single space (" ").
R1 ONT test/reads/5mC_S1.txt
R2 Pacbio test/reads/5mC_S2.txt
R3 ONT test/reads/5mC_S3.txt

# A demo of your methylation file (e.g. 5mC_S1.txt).
# Separate the columns with a tab ("\t").
# chromosome start end methylation_proportion
chr1	81	83	0.8
chr1	21314	21315	0.3
chr1	32421	32422	0.85

# A demo of promoter.csv
#chromosome, start, end, geneID
chr1,12027,17027,ENSDARG00000099104
chr1,6822,11822,ENSDARG00000102407

# Note: none of these tables has a header line.

Example

Here, we provide demo datasets for testing giraffe. The following commands download them.

# The input file list
wget https://figshare.com/ndownloader/files/44967445 -O fastq.list
wget https://figshare.com/ndownloader/files/44967442 -O bed.list
wget https://figshare.com/ndownloader/files/44967499 -O bam.list

# The reference and ONT reads (R10.4.1 and R9.4.1) of E. coli
wget https://figshare.com/ndownloader/files/44967436 -O Read.tar.gz

# The 5mC methylation files of zebrafish blood and kidney samples.
# The position file contains the gene promoter regions of chromosome 1.
wget https://figshare.com/ndownloader/files/44967427 -O Methylation.tar.gz

tar -xzvf Read.tar.gz
tar -xzvf Methylation.tar.gz
rm Read.tar.gz Methylation.tar.gz

Run the following commands to start the analysis.

giraffe estimate --input fastq.list --plot --cpu 4
giraffe observe --input fastq.list --plot --cpu 4 --ref Read/ecoli_chrom.fa
giraffe gcbias --input bam.list --plot --ref Read/ecoli_chrom.fa
giraffe modbin --input bed.list --cpu 4 --plot --pos Methylation/zf_promoter.db

Results

If you run the demo data in the example, you will obtain a folder named Giraffe_Results with the following structure.

Giraffe_Results/
├── 1_Estimated_quality
│   ├── 1_Read_accuracy.pdf
│   ├── 2_Read_length.pdf
│   ├── 3_Read_GC_content.pdf
│   └── Estimated_information.txt
├── 2_Observed_quality
│   ├── 1_Observed_read_accuracy.pdf
│   ├── 2_Observed_mismatch_proportion.pdf
│   ├── 3_Homoploymer_summary.pdf
│   ├── Homoploymer_summary.txt
│   ├── Observed_information.txt
│   ├── R1041.bam
│   ├── R1041.bam.bai
│   ├── R1041_homopolymer_detail.txt
│   ├── R1041_homopolymer_in_reference.txt
│   ├── R941.bam
│   ├── R941.bam.bai
│   ├── R941_homopolymer_detail.txt
│   └── R941_homopolymer_in_reference.txt
├── 3_GC_bias
│   ├── 1_Bin_distribution.pdf
│   ├── 2_Relationship_normalization.pdf
│   ├── Bin_distribution.txt
│   ├── R1041_relationship_raw.txt
│   ├── R941_relationship_raw.txt
│   └── Relationship_normalization.txt
└── 4_Regional_modification
    ├── 1_Regional_modification.pdf
    ├── Blood.bed
    └── Kidney.bed

1_Estimated_quality

Estimated_information.txt - File with read ID, estimated read accuracy, estimated read error, Q score, read length, GC content, and sample ID.

ReadID	Accuracy	Error	Q_value	Length	GC_content	Group
@9154e0a0	0.935	0.065	11.857	316	0.503	R1041
@fa8f2a80	0.948	0.052	12.877	9621	0.498	R1041

1_Read_accuracy.pdf - Distribution of estimated read accuracy (Fig A).
2_Read_length.pdf - Distribution of read length (Fig B).
3_Read_GC_content.pdf - Distribution of read GC content (Fig C).

[Figure with panels A, B, and C]
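To make the Accuracy, Error, and Q_value columns concrete, here is a minimal sketch of one common way to estimate read quality from a FASTQ quality string. It is not taken from the Giraffe_View source: the function name `estimate_read_quality`, the Phred+33 encoding, and averaging the per-base error probabilities are all assumptions. The demo rows above are at least approximately consistent with Q_value = -10 * log10(Error), given that the displayed Error values are rounded.

```python
import math


def estimate_read_quality(quality_string: str, phred_offset: int = 33):
    """Return (accuracy, error, q_value) for one FASTQ quality string."""
    if not quality_string:
        raise ValueError("empty quality string")
    # Each quality character encodes Q; its error probability is 10^(-Q/10).
    error_probs = [
        10 ** (-(ord(char) - phred_offset) / 10) for char in quality_string
    ]
    # Assumption: the read-level error is the mean per-base error probability.
    error = sum(error_probs) / len(error_probs)
    accuracy = 1.0 - error
    q_value = -10 * math.log10(error)
    return accuracy, error, q_value


# "I" encodes Q40 and "5" encodes Q20 under Phred+33.
accuracy, error, q_value = estimate_read_quality("IIII5555")
print(f"accuracy={accuracy:.3f} error={error:.3f} Q={q_value:.3f}")
```

Averaging error probabilities rather than Q scores keeps a few low-quality bases from being hidden by many high-quality ones, which is why the read-level Q is usually lower than the arithmetic mean of the per-base Q scores.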

2_Observed_quality

Homoploymer_summary.txt - Accuracy of identification for each homopolymer type (only homopolymers longer than 3 bp are counted, e.g. AAAA and TTTTT).

Base	Accuracy	Group
T	0.909	R1041
G	0.857	R1041
A	0.907	R1041
C	0.859	R1041

Observed_information.txt - Summary of observed accuracy for each read: read ID, insertion length, deletion length, substitution length, matched length, observed identification rate, observed accuracy, and sample ID.

ID	Ins	Del	Sub	Mat	Iden	Acc	Group
70fbffe6	3	1	1	354	0.9972	0.9861	R1041
96a5c10b	3	11	2	342	0.9942	0.9553	R1041
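The Iden and Acc columns can be reproduced from the four count columns. The sketch below uses formulas inferred from the two demo rows, not formulas stated by the tool's documentation: Iden = Mat / (Mat + Sub) and Acc = Mat / (Mat + Sub + Ins + Del). The function name `observed_metrics` is hypothetical.

```python
def observed_metrics(ins: int, dele: int, sub: int, mat: int):
    """Return (identification rate, accuracy), rounded to 4 decimals."""
    # Assumption inferred from the demo rows: identity ignores indels.
    identity = mat / (mat + sub)
    # Assumption inferred from the demo rows: accuracy counts every error type.
    accuracy = mat / (mat + sub + ins + dele)
    return round(identity, 4), round(accuracy, 4)


# The two demo reads from Observed_information.txt.
print(observed_metrics(ins=3, dele=1, sub=1, mat=354))   # (0.9972, 0.9861)
print(observed_metrics(ins=3, dele=11, sub=2, mat=342))  # (0.9942, 0.9553)
```

Under these formulas the second read has a similar identification rate to the first but a visibly lower accuracy, because its 11 deleted bases only enter the second denominator.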

XXX_homopolymer_detail.txt - Detailed information for homopolymer identification includes the chromosome, start position, end position, homopolymer length, homopolymer type, matched base number, deleted base number, inserted base number, substituted base number, read ID, and sample ID (read level).

Chrom	Start	End	length	type	Matched base	Deleted base	Inserted base	Substituted base	ReadID	SampleID
ecoli_chrom	3083	3086	4	T	4	0	0	0	c322bcea	R941
ecoli_chrom	3382	3386	5	A	5	0	0	0	c322bcea	R941
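For readers who want to see what locating homopolymers in a reference looks like, here is a minimal sketch. It is an illustration rather than the tool's implementation: `find_homopolymers` is a hypothetical name, and reporting 1-based coordinates is an assumption. The demo rows only show that the coordinates are inclusive (End - Start + 1 equals the length).

```python
import re

# One base followed by at least three more copies of itself: runs longer than 3 bp.
HOMOPOLYMER = re.compile(r"([ACGT])\1{3,}")


def find_homopolymers(sequence: str):
    """Yield (start, end, length, base) with inclusive coordinates."""
    for match in HOMOPOLYMER.finditer(sequence.upper()):
        start = match.start() + 1  # assumption: 1-based start
        end = match.end()          # inclusive end in 1-based coordinates
        yield start, end, end - start + 1, match.group(1)


for record in find_homopolymers("ACGTTTTACAAAAAG"):
    print(record)
# (4, 7, 4, 'T')
# (10, 14, 5, 'A')
```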

XXX_homopolymer_in_reference.txt - Summarized information includes the position of the homopolymer in the reference, the number of perfectly matched reads, the total number of mapped reads, the homopolymer feature, and sample ID (reference level).

pos	num_of_mat	depth	type	Group
ecoli_chrom_3083_3086	1	1	4T	R941
ecoli_chrom_3382_3386	1	1	5A	R941

XXX.bam - BAM file generated by aligning the reads against the reference genome.
XXX.bam.bai - Index for the BAM file.
1_Observed_read_accuracy.pdf - Distribution of observed read accuracy (Fig A).
2_Observed_mismatch_proportion.pdf - Distribution of mismatch proportion (Fig B).
3_Homoploymer_summary.pdf - Accuracy of homopolymer identification (Fig C).

[Figure with panels A, B, and C]

3_GC_bias

Bin_distribution.txt - Number of BINs at each GC content (GC content and number of BINs).
XXX_relationship_raw.txt - Read coverage across all GC contents (GC content, average depth among the BINs, number of BINs, and sample ID).
Relationship_normalization.txt - Normalized read coverage for selected GC contents (GC content, average depth, number of BINs, sample ID, and normalized depth).

GC_content	Depth	Number	Group	Normalized_depth
40	7.832	55	R1041	1.066
41	7.655	59	R1041	1.067

1_Bin_distribution.pdf - Visualization of the number of BINs at each GC content (Fig A).
2_Relationship_normalization.pdf - Relationship between normalized depth and GC content (Fig B).

[Figure with panels A and B]
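As an illustration of what a BIN distribution over GC content means, the sketch below splits a reference sequence into fixed-size windows and counts how many windows fall at each integer GC percentage. This is an assumption-laden sketch, not the tool's method: the function name `gc_bin_distribution`, the default bin size of 1000 bp, the rounding to whole percentages, and dropping a trailing partial window are all hypothetical choices.

```python
from collections import Counter


def gc_bin_distribution(sequence: str, bin_size: int = 1000) -> Counter:
    """Count fixed-size bins per integer GC percentage."""
    counts = Counter()
    seq = sequence.upper()
    # Step through full windows only; a trailing partial window is dropped.
    for start in range(0, len(seq) - bin_size + 1, bin_size):
        window = seq[start:start + bin_size]
        gc_bases = window.count("G") + window.count("C")
        counts[round(100 * gc_bases / bin_size)] += 1
    return counts


# Two full 8 bp windows: "GGCCAATT" (50% GC) and "AAAAAAAA" (0% GC).
print(gc_bin_distribution("GGCCAATTAAAAAAAAGG", bin_size=8))
```

Pairing each window's GC percentage with its average sequencing depth is what allows depth to be compared across GC contents.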

4_Regional_modification

XXX.bed - Average modification proportion for each BIN (BIN name, average value, and sample ID).

BIN name	5mC proportion	Group
ENSDARG00000102097	0.6	Blood
ENSDARG00000099319	0.830	Blood

1_Regional_modification.pdf - Distribution of modification at the regional level.

[Figure]
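The regional averaging can be pictured with a small sketch that takes a tab-separated methylation table and a comma-separated region table in the formats shown under General Usage. It is a naive illustration, not the tool's implementation: `regional_average` is a hypothetical name, and treating a site as belonging to a region only when it lies fully inside that region is an assumption.

```python
import csv
import io
from collections import defaultdict


def regional_average(methylation_text: str, regions_text: str) -> dict:
    """Average the methylation proportion of the sites inside each region."""
    regions = [
        (row[0], int(row[1]), int(row[2]), row[3])
        for row in csv.reader(io.StringIO(regions_text))
        if row  # skip blank lines
    ]
    values = defaultdict(list)
    for row in csv.reader(io.StringIO(methylation_text), delimiter="\t"):
        if not row:
            continue
        chrom, start, end, proportion = row[0], int(row[1]), int(row[2]), float(row[3])
        for r_chrom, r_start, r_end, gene_id in regions:
            # Assumption: a site counts only if it lies fully inside the region.
            if chrom == r_chrom and r_start <= start and end <= r_end:
                values[gene_id].append(proportion)
    return {gene_id: sum(v) / len(v) for gene_id, v in values.items()}


# Toy inputs with made-up coordinates and region names.
regions = "chr1,10,20,geneA\nchr1,30,40,geneB\n"
sites = "chr1\t11\t12\t0.5\nchr1\t15\t16\t1.0\nchr1\t35\t36\t0.25\nchr1\t50\t51\t0.9\n"
print(regional_average(sites, regions))  # {'geneA': 0.75, 'geneB': 0.25}
```

The nested loop compares every site with every region, which is fine for a toy example but slow for genome-scale inputs, where an interval tool would be the better choice.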

Workflow

graph TD
	A(Raw signal) -.-> |Basecall| B(Sequence reads)
	A -.-> |Basecall| C(Modification files)
	C --> |modbin| D(Modification distribution)
	B --> |estimate| e(Estimated table)
	e --> f(Estimated accuracy)
	e --> l(Read length)
	e --> x(Read GC content)

	B --> |observe| g(Aligned files)

	g --> |observe| h(Homopolymer identification)
	g --> |observe| i(Observed accuracy)
	g --> |observe| c(Mismatch proportion)
	g --> |gcbias| j(GC bias comparison)
