GenCube enables researchers to search for, download, and unify genome assemblies and diverse types of annotations, and retrieve metadata for sequencing-based experimental data suitable for specific requirements.

Gencube

Efficient retrieval, download, and unification of genomic data from leading biodiversity databases

Keun Hong Son (1,2,3) and Je-Yoel Cho (1,2,3)

1 Department of Biochemistry, College of Veterinary Medicine, Seoul National University, Seoul, Korea
2 Comparative Medicine and Disease Research Center (CDRC), Science Research Center (SRC), Seoul National University, Seoul, Korea
3 BK21 PLUS Program for Creative Veterinary Science Research and Research Institute for Veterinary Science, Seoul National University, Seoul, Korea

Manuscript

bioRxiv (uploaded: 2024.07.01)


[Figure: gencube overview]

Databases accessed from gencube

  • GenBank: NCBI GenBank Nucleotide Sequence Database
  • RefSeq: NCBI Reference Sequence Database
  • GenArk: UCSC Genome Archive
  • Ensembl Rapid Release: Ensembl genome browser that provides frequent updates for newly sequenced species
  • Zoonomia TOGA: Tool to infer Orthologs from Genome Alignments
  • INSDC: International Nucleotide Sequence Database Collaboration
  • SRA: NCBI Sequence Read Archive
  • ENA: EMBL-EBI European Nucleotide Archive
  • DDBJ: DNA Data Bank of Japan

Detailed information of each database

Installation

Install the latest release with pip:

pip install gencube

Alternatively, install it from Bioconda:

conda install -c bioconda gencube

Tutorials

gencube consists of six subcommands

$ gencube
usage: gencube [-h] {genome,geneset,sequence,annotation,crossgenome,seqmeta} ...

gencube v1.0.0

positional arguments:
  {genome,geneset,sequence,annotation,crossgenome,seqmeta}
    genome              Search, download, and modify chromosome labels for genome assemblies
    geneset             Search, download, and modify chromosome labels for genesets (gene annotations)
    sequence            Search and download sequence data of genesets
    annotation          Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats
    crossgenome         Search and download comparative genomics data, such as homology, and codon or protein alignment
    seqmeta             Search, fetch, and integrate metadata of experimental sequencing data

options:
  -h, --help            show this help message and exit

Positional arguments and options shared by the genome, geneset, sequence, annotation, and crossgenome subcommands

When using the five subcommands above, you first need to find the genome assemblies your research requires. The positional arguments and options below are shared by these subcommands and are used to browse and search for specific genome assemblies.

positional arguments:
  keywords              Taxonomic names to search for genomes. You can provide various forms 
                        such as species names or accession numbers.  
                        Examples: homo_sapiens, human, GCF_000001405.40, GCA_000001405.29, GRCh38, hg38 
                        
                        Multiple names can be combined and will be merged in the search results.
                        To specify multiple names, separate them with spaces.

options:
  -h, --help            show this help message and exit
  -v level, --level level
                        Specify the genome assembly level (default: complete,chromosome)
                        complete   : Fully assembled genomes
                        chromosome : Assembled at the chromosome level
                        scaffold   : Assembled into scaffolds, but not to the chromosome level
                        contig     : Contiguous sequences without gaps
                        
  -r, --refseq          Show genomes that have RefSeq accession (GCF_* format)
  -u, --ucsc            Show genomes that have UCSC name
  -l, --latest          Show genomes corresponding to the latest version
  -m, --metadata        Save metadata for the searched genomes

Examples

# Search using scientific or common name
$ gencube genome homo_sapiens canis_lupus_familiaris
$ gencube genome human dog

# Search using assembly name
$ gencube genome T2T-CHM13v2.0 GRCh38

# Search using UCSC name
$ gencube genome hg38 hg19

# Search using RefSeq (GCF_*) or GenBank (GCA_*) accession
$ gencube genome GCF_000001405.40 GCA_021950905.1

# Show searched genomes corresponding to all genome assembly levels
$ gencube genome homo_sapiens --level complete,chromosome,scaffold,contig

# Only show genomes that have RefSeq accession and UCSC name, and correspond to the latest version
$ gencube genome homo_sapiens --refseq --ucsc --latest

# Download the full information metadata of searched genomes
$ gencube genome homo_sapiens --metadata
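
The keywords above mix several identifier types. As the --refseq option notes, RefSeq assembly accessions use the GCF_* format, while GenBank assembly accessions start with GCA_. The sketch below shows one way to tell them apart before passing a list of keywords to gencube. The helper name classify_accession is hypothetical and is not part of gencube.

```shell
# Classify a search keyword by its NCBI assembly accession prefix.
# classify_accession is a hypothetical helper, not a gencube command.
classify_accession() {
  case "$1" in
    GCF_*) echo "refseq" ;;   # RefSeq assembly accession
    GCA_*) echo "genbank" ;;  # GenBank assembly accession
    *)     echo "other" ;;    # species name, assembly name, UCSC name, etc.
  esac
}

# Print each keyword next to its detected type.
for keyword in GCF_000001405.40 GCA_021950905.1 hg38; do
  printf '%s\t%s\n' "$keyword" "$(classify_accession "$keyword")"
done
```

This only inspects the prefix; it does not check that the accession exists in either database.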

Example output displayed in the terminal

$ gencube genome GCF_000001405.40 GCA_021950905.1

# Search assemblies in NCBI database
  Keyword: ['GCF_000001405.40', 'GCA_021950905.1']

  Total 3 genomes are searched.

# Convert JSON to dataframe format.
  Filter options
  Level:   ['Complete', 'Chromosome']
  RefSeq:  False
  UCSC:    False
  Latest:  False

# Check accessibility to GenArk, Ensembl Rapid Release
  UCSC GenArk  : 4167 genomes across 2813 species
  Ensembl Rapid: 2272 genomes across 1522 species

+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|    | Assembly name          |   Taxid | Release    | NCBI             | UCSC   | GenArk   | Ensembl   |
+====+========================+=========+============+==================+========+==========+===========+
|  0 | HG002.mat.cur.20211005 |    9606 | 2022/02/04 | GCA_021951015.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  1 | HG002.pat.cur.20211005 |    9606 | 2022/02/04 | GCA_021950905.1  |        | v        | v         |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+
|  2 | GRCh38.p14             |    9606 | 2022/02/03 | GCF_000001405.40 | hg38   | v        |           |
+----+------------------------+---------+------------+------------------+--------+----------+-----------+

genome: Search, download, and modify chromosome labels for genome assemblies

You can download genome data in FASTA format from four different databases (GenBank, RefSeq, GenArk, Ensembl Rapid Release). Each database uses a different soft-masking method, and you can selectively download the data as needed. You can also download unmasked and hard-masked genomes from the Ensembl Rapid Release database.

options:
  -d, --download        Download "fasta" formatted genome file.
  -f types, --fasta types
                        Type of "fasta" formatted genome file (default: refseq).
                        Default is from the RefSeq database.
                        If not available, download from the GenBank database.
                        genbank    : soft-masked genome by NCBI GenBank
                        refseq     : soft-masked genome by NCBI RefSeq
                        genark     : soft-masked genome by UCSC GenArk
                        ensembl    : soft-masked genome by Ensembl Rapid Release
                        ensembl_hm : hard-masked genome by Ensembl Rapid Release
                        ensembl_um : unmasked genome by Ensembl Rapid Release
  -c type, --chr_style type
                        Chromosome label style used in the download file (default: ensembl)
                        ensembl : 1, 2, X, MT. Unknowns use GenBank IDs.
                        gencode : chr1, chr2, chrX, chrM. Unknowns use GenBank IDs.
                        ucsc    : chr1, chr2, chrX, chrM. Uses UCSC-specific IDs for unknowns.
                                  (!! Limited use if UCSC IDs are not issued.)
                        raw     : Uses raw file labels without modification. Format depends on the database:
                                 - NCBI GenBank: CM_* or other-form IDs
                                 - NCBI RefSeq : NC_*, NW_* or other-form IDs
                                 - GenArk      : GenBank or RefSeq IDs
                                 - Ensembl     : Ensembl IDs
  -p 1-9, --compresslevel 1-9
                        Compression level for output data (default: 6).
                        Lower numbers are faster but have lower compression.
  --recursive           Download file regardless of their presence only if integrity check is not possible.

Examples

# Download genome files under the default conditions (RefSeq or GenBank)
$ gencube genome GCF_011100685.1 --download
# Download multiple genomes from various databases
$ gencube genome GCF_011100685.1 --download --fasta refseq,genark,ensembl
# Change the chromosome labels to the GENCODE style and set the compression level of the file to 9.
$ gencube genome GCF_011100685.1 --download --chr_style gencode --compresslevel 9
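
To make the --chr_style descriptions concrete, the sketch below maps Ensembl-style labels to the GENCODE style listed above (1 to chr1, X to chrX, MT to chrM), leaving any other identifier unchanged. It illustrates the naming convention only and is not gencube's implementation; the helper name to_gencode_label, the handling of Y, and the example scaffold ID are assumptions.

```shell
# Convert an Ensembl-style chromosome label to the GENCODE style.
# to_gencode_label is a hypothetical helper for illustration only.
to_gencode_label() {
  case "$1" in
    MT)                   echo "chrM" ;;   # mitochondrial genome
    [0-9]|[0-9][0-9]|X|Y) echo "chr$1" ;;  # autosomes and sex chromosomes (Y is an assumption)
    *)                    echo "$1" ;;     # unknown sequences keep their original ID
  esac
}

# EXAMPLE01000001.1 is a made-up placeholder for an unplaced scaffold ID.
for label in 1 2 X MT EXAMPLE01000001.1; do
  printf '%s -> %s\n' "$label" "$(to_gencode_label "$label")"
done
```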

geneset: Search, download, and modify chromosome labels for genesets (gene annotations)

options:
  -d types, --download types
                        Type of gene set
                        refseq_gtf    : RefSeq gene set (GTF format)
                        refseq_gff    : RefSeq gene set (GFF)
                        gnomon        : RefSeq Gnomon gene prediction (GFF)
                        cross         : RefSeq Cross-species alignments (GFF)
                        same          : RefSeq Same-species alignments (GFF)
                        agustus       : GenArk Augustus gene prediction (GFF)
                        xenoref       : GenArk XenoRefGene (GFF)
                        genark_ref    : GenArk RefSeq gene models (GFF)
                        ensembl_gff   : Ensembl Rapid Release gene set (GFF)
                        toga_gtf      : Zoonomia TOGA gene set (GTF)
                        toga_bed      : Zoonomia TOGA gene set (BED)
                        toga_pseudo   : Zoonomia TOGA processed pseudogenes (BED)

Examples

# Search usable and accessible data
$ gencube geneset GCF_011100685.1

# Download multiple genesets from various databases
$ gencube geneset GCF_011100685.1 --download refseq_gtf,agustus,toga_gtf

sequence: Search and download sequence data of genesets

options:
  -d types, --download types
                        Download "fasta" formatted sequence file. 
                        1. Nucleotide sequences:
                           refseq_rna         : Accessioned RNA sequences annotated on the genome assembly.
                           refseq_rna_genomic : RNA features based on the genome sequence.
                           refseq_cds_genomic : CDS features based on the genome sequence.
                           refseq_pseudo      : Pseudogene and other gene regions without transcribed RNA or translated protein products.
                           ensembl_cdna       : Ensembl Rapid Release cDNA sequences of transcripts.
                           ensembl_cds        : Ensembl Rapid Release coding sequences (CDS).
                           ensembl_repeat     : Ensembl repeat modeler sequences.
                        
                        2. Protein sequences:
                           refseq_pep         : Accessioned protein sequences annotated on the genome assembly.
                           refseq_pep_cds     : CDS features translated into protein sequences.
                           ensembl_pep        : Ensembl Rapid Release protein sequences.

Examples

# Search usable and accessible data
$ gencube sequence GCF_011100685.1

# Download multiple sequence files from various databases
$ gencube sequence GCF_011100685.1 --download refseq_rna,ensembl_cdna,refseq_pep,ensembl_pep

annotation: Search, download, and modify chromosome labels for various genome annotations, such as gaps and repeats

options:
  -d types, --download types
                        Download annotation file.
                        gap  : Genomic gaps - AGP defined (bigBed format)
                        sr   : Simple tandem repeats by TRF (bigBed) 
                        td   : Tandem duplications (bigBed) 
                        wm   : Genomic intervals masked by WindowMasker + SDust (bigBed) 
                        rmsk : Repeated elements annotated by RepeatMasker (bigBed) 
                        cpg  : CpG Islands - Islands < 300 bases are light green (bigBed) 
                        gc   : GC percent in 5-Base window (bigWig)

Examples

# Search usable and accessible data
$ gencube annotation GCF_011100685.1

# Download multiple annotations
$ gencube annotation GCF_011100685.1 --download sr,td,rmsk,gc

crossgenome: Search and download comparative genomics data, such as homology, and codon or protein alignment

options:
  -d types, --download types
                        ensembl_homology   : Homology data from Ensembl Rapid Release, detailing gene orthology relationships across species.
                        toga_homology      : Homology data from TOGA, providing predictions of orthologous genes based on genome alignments.
                        toga_align_codon   : Codon alignment data from TOGA, showing aligned codon sequences between reference and query species.
                        toga_align_protein : Protein alignment data from TOGA, detailing aligned protein sequences between reference and query species.
                        toga_inact_mut     : List of inactivating mutations from TOGA, identifying mutations that disrupt gene function.

Examples

# Search usable and accessible data
$ gencube crossgenome GCF_011100685.1

# Download multiple crossgenome data
$ gencube crossgenome GCF_011100685.1 --download toga_homology,toga_align_codon

seqmeta: Search, fetch, and integrate metadata of experimental sequencing data

[Figure: seqmeta scheme]

$ gencube seqmeta
usage: gencube seqmeta [-h] [--info] [-o string] [-st string] [-sr string] [-l string] [-ex keywords] [-m] [keywords ...]

Search, fetch, and integrate metadata of experimental sequencing data

positional arguments:
  keywords              Keywords to search for sequencing-based experimental data. You can provide various forms 
                        Examples: tissue name, cell line, disease name, etc 
                                  liver, k562, cancer, breast_cancer
                        Multiple keywords can be combined and will be merged in the search results.
                        To specify multiple names, separate them with spaces.

options:
  -h, --help            show this help message and exit
  --info                Show full information about organism, strategy, source and layout 
                         
  -o string, --organism string
                        Scientific name or common name 
                        Example: homo_sapiens or human 
                        
                        Available common names:
                        human, mouse, dog, dingo, wolf, cat, pig, pig_domestic, cow, dairy_cow, chicken, horse, rice, wheat
                        elephant, whale, naked_mole_rat, blind_mole_rat, gorilla, rhesus_monkey, cynomolgus_monkey, baboon
                        chimpanzee, marmoset, macaque, capuchin_monkey, squirrel_monkey, bonobo, yeast, fruit_fly, nematode
                        zebrafish, african clawed frog, rat, guinea pig, rabbit, opossum
                         
  -st string, --strategy string
                        Available strategies 
                        wgs, wga, wxs, targeted, synthetic_long_read, gbs, rad, tn, clone_end, amplicon, clone, rna, mrna
                        ncrna, ribo, rip, mirna, ssrna, est, fl_cdna, atac, dnase, faire, chip, mre, bisulfite, mbd, medip
                        hic, chiapet, tethered
                         
  -sr string, --source string
                        Available sources 
                        genomic, genomic_single_cell, transcriptomic, transcriptomic_single_cell, metagenomic
                        metatranscriptomic, synthetic, viral, other
                         
  -l string, --layout string
                        Available layout: paired, single (default: paired,single) 
                         
  -ex keywords, --exclude keywords
                        Exclude the results for the keywords used in this option  
                         
  -m, --metadata        Save integrated metadata

Examples

$ gencube seqmeta --organism dog --strategy chip,chip_seq

$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor

$ gencube seqmeta --organism dog --strategy chip,chip_seq liver,lung cancer,tumor --exclude crispr

((((("human"[Organism]) AND ("rna seq"[Strategy])) AND ("illumina"[Platform])) AND ("public"[Access])) AND (("liver" OR "lung") AND ("cancer" OR "tumor"))) NOT "crispr"
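
Judging from the query above, comma-separated terms inside one keyword appear to be joined with OR, and space-separated keywords with AND. The sketch below reproduces only that keyword clause. It is an illustration under that assumption, not gencube's actual code, and or_group and build_keyword_query are hypothetical helper names.

```shell
# or_group and build_keyword_query are hypothetical helpers, not gencube functions.

# Turn one comma-separated keyword into an OR group:
#   liver,lung -> ("liver" OR "lung")
or_group() {
  printf '("%s")\n' "$(printf '%s' "$1" | sed 's/,/" OR "/g')"
}

# Join every positional keyword group with AND and wrap the result in parentheses.
build_keyword_query() {
  query=""
  for group in "$@"; do
    if [ -z "$query" ]; then
      query="$(or_group "$group")"
    else
      query="$query AND $(or_group "$group")"
    fi
  done
  printf '(%s)\n' "$query"
}

build_keyword_query liver,lung cancer,tumor
# prints: (("liver" OR "lung") AND ("cancer" OR "tumor"))
```

The printed clause matches the keyword part of the query shown above; the organism, strategy, platform, access, and NOT terms are left out of this sketch.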

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
