Query Ensembl for genes using free form search words, look up genes/transcripts by Ensembl ID, or fetch the latest FTP links by species.

Environment
- Console
Framework
- Jupyter
Intended Audience
- Science/Research
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering :: Bio-Informatics
- Utilities

Project description

gget (gene-get)

gget has four main commands:

gget ref
Fetch links to GTF and FASTA files from the Ensembl FTP site.
gget search
Query Ensembl for genes using free form search words.
gget info
Look up genes or transcripts by their Ensembl ID.
gget seq
Fetch DNA sequences of Ensembl IDs.

Installation

pip install gget

For use in Jupyter Lab / Google Colab:

from gget import ref, search, info, seq

Getting started

# Fetch Homo sapiens GTF, DNA, and cDNA FTPs from Ensembl release 104
$ gget ref -s homo_sapiens -r 104

# Search zebra finch genes with "mito" in their description and limit to the top 10 genes
$ gget search -sw mito -s taeniopygia_guttata -l 10

# Look up Ensembl ID ENSSCUG00000017183 and also return its homology information
$ gget info -id ENSSCUG00000017183 -H

# Fetch the sequences of Ensembl ID ENSMUSG00000025040 and all its transcript isoforms
$ gget seq -id ENSMUSG00000025040 -i

Manual

gget ref

Function to fetch GTF and FASTA (cDNA and DNA) URLs from the Ensembl FTP site. Returns a dictionary/json containing the requested URLs with their respective Ensembl version and release date and time.

Options

-l --list
List all available species.

-s --species
Species for which the FTPs will be fetched in the format genus_species, e.g. homo_sapiens.

-w --which
Defines which results to return. Possible entries are:
'all' - Returns GTF, cDNA, and DNA links and associated info (default).
Or one or a combination of the following:
'gtf' - Returns the GTF FTP link and associated info.
'cdna' - Returns the cDNA FTP link and associated info.
'dna' - Returns the DNA FTP link and associated info.

-r --release
Ensembl release the FTPs will be fetched from, e.g. 104 (default: None → uses latest Ensembl release).

-ftp --ftp
If True: returns only a list containing the requested FTP links (default: False).

-d --download
Download the requested FTPs to the current directory.

-o --out
Path to the file the results will be saved in, e.g. path/to/directory/results.json (default: None → just prints results).
For Jupyter Lab / Google Colab: save=True will save the output to the current working directory.

Examples

Show all available species

# Jupyter Lab / Google Colab:
!gget ref --list

# Terminal:
$ gget ref --list

→ Returns a list with all available species from the latest Ensembl release.

Fetch GTF, DNA, and cDNA FTP links for a specific species

# Jupyter Lab / Google Colab:
ref("homo_sapiens")

# Terminal:
$ gget ref -s homo_sapiens

→ Returns a json with the latest links to human GTF and FASTA files, their respective release dates and time, and the Ensembl release from which the links were fetched, in the format:

{
            species: {
                "transcriptome_cdna": {
                    "ftp": cDNA FTP download URL,
                    "ensembl_release": Ensembl release,
                    "release_date": Day-Month-Year,
                    "release_time": HH:MM,
                    "bytes": cDNA FTP file size in bytes
                },
                "genome_dna": {
                    "ftp": DNA FTP download URL,
                    "ensembl_release": Ensembl release,
                    "release_date": Day-Month-Year,
                    "release_time": HH:MM,
                    "bytes": DNA FTP file size in bytes
                },
                "annotation_gtf": {
                    "ftp": GTF FTP download URL,
                    "ensembl_release": Ensembl release,
                    "release_date": Day-Month-Year,
                    "release_time": HH:MM,
                    "bytes": GTF FTP file size in bytes
                }
            }
        }
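As an illustration of how this structure might be consumed, the following sketch collects the FTP links from a dictionary shaped like the format above. The dictionary literal, its placeholder URLs and values, and the helper name `collect_ftp_links` are assumptions made for this example; none of them is output produced by gget.

```python
# Hypothetical stand-in for a result shaped like the documented format.
# Every value below is a placeholder, not real gget output.
example_result = {
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://example.org/placeholder.cdna.fa.gz",
            "ensembl_release": 104,
            "release_date": "01-Jan-2000",
            "release_time": "00:00",
            "bytes": 1,
        },
        "genome_dna": {
            "ftp": "http://example.org/placeholder.dna.fa.gz",
            "ensembl_release": 104,
            "release_date": "01-Jan-2000",
            "release_time": "00:00",
            "bytes": 2,
        },
        "annotation_gtf": {
            "ftp": "http://example.org/placeholder.gtf.gz",
            "ensembl_release": 104,
            "release_date": "01-Jan-2000",
            "release_time": "00:00",
            "bytes": 3,
        },
    }
}


def collect_ftp_links(result):
    """Return {species: {entry_name: ftp_url}} for every entry with an "ftp" key."""
    links = {}
    for species, entries in result.items():
        links[species] = {
            name: entry["ftp"] for name, entry in entries.items() if "ftp" in entry
        }
    return links


links = collect_ftp_links(example_result)
for name, url in links["homo_sapiens"].items():
    print(name, url)
```

The same loop would work on a result restricted with the which option, since it only reads the entries that are present.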

Fetch GTF, DNA, and cDNA FTP links for a specific species from a specific Ensembl release

For example, for Ensembl release 104:

# Jupyter Lab / Google Colab:
ref("homo_sapiens", release=104)

# Terminal
$ gget ref -s homo_sapiens -r 104

→ Returns a json with the human reference genome GTF, DNA, and cDNA links, and their respective release dates and time, from Ensembl release 104.

Save the results

# Jupyter Lab / Google Colab:
ref("homo_sapiens", save=True)

# Terminal 
$ gget ref -s homo_sapiens -o path/to/directory/ref_results.json

→ Saves the results in path/to/directory/ref_results.json.
For Jupyter Lab / Google Colab: Saves the results in a json file named ref_results.json in the current working directory.

Note: To download the files linked to by the FTPs into the current directory, add flag -d.

Fetch only certain types of links for a specific species

# Jupyter Lab / Google Colab:
ref("homo_sapiens", which=["gtf", "dna"])

# Terminal 
$ gget ref -s homo_sapiens -w gtf,dna

→ Returns a dictionary/json containing the latest human reference GTF and DNA files, in this order, and their respective release dates and time.

Fetch only certain types of links for a specific species and return only the links

# Jupyter Lab / Google Colab:
ref("homo_sapiens", which=["gtf", "dna"], ftp=True)

# Terminal 
$ gget ref -s homo_sapiens -w gtf,dna -ftp

→ Returns only the links (without additional information) to the latest human reference GTF and DNA files, in this order, in a space-separated list (terminal), or comma-separated list (Jupyter Lab / Google Colab).
For Jupyter Lab / Google Colab: Combining this command with save=True will save the results in a text file named ref_results.txt in the current working directory.
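
When scripting the terminal interface, it can help to assemble the gget ref command from its parts. The sketch below only builds the argument list from the flags documented in this section (-s, -w, -r, -ftp, -o) and does not run anything; the helper name `build_ref_command` is hypothetical and not part of gget.

```python
def build_ref_command(species, which=None, release=None, ftp=False, out=None):
    """Assemble the argument list for a `gget ref` terminal call (hypothetical helper)."""
    command = ["gget", "ref", "-s", species]
    if which:
        # The terminal examples pass several entries as one comma-separated value.
        command += ["-w", ",".join(which)]
    if release is not None:
        command += ["-r", str(release)]
    if ftp:
        command.append("-ftp")
    if out:
        command += ["-o", out]
    return command


# Prints: gget ref -s homo_sapiens -w gtf,dna -ftp
print(" ".join(build_ref_command("homo_sapiens", which=["gtf", "dna"], ftp=True)))
```

The returned list could be handed to Python's subprocess.run on a machine where gget is installed; that step is left out here so the sketch stays runnable without network access.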

gget search

Query Ensembl for genes or transcripts from a defined species using free form search words.

Warning: gget search currently only supports genes listed in the Ensembl core API, which includes limited external references. Manually searching the Ensembl website might yield more results.

Options

-sw --searchwords
One or more free form searchwords for the query, e.g. gaba, nmda. Searchwords are not case-sensitive.

-s --species
Species or database to be searched.
Species can be passed in the format 'genus_species', e.g. 'homo_sapiens'. To pass a specific CORE database (e.g. a specific mouse strain), enter the name of the CORE database, e.g. 'mus_musculus_dba2j_core_105_1'. All available species databases can be found here: http://ftp.ensembl.org/pub/release-105/mysql/

-t --d_type
Possible entries: 'gene' (default), 'transcript'.
Returns either genes or transcripts, respectively, which match the searchwords.

-ao --andor
Possible entries: 'or', 'and'.
'or': ID descriptions must include at least one of the searchwords (default).
'and': Only return IDs whose descriptions include all searchwords.

-l --limit
Limits the number of search results to the top [limit] genes found (default: None).

-o --out
Path to the file the results will be saved in, e.g. path/to/directory/results.csv (default: None → just prints results).
For Jupyter Lab / Google Colab: save=True will save the output to the current working directory.
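
To make the 'or' / 'and' semantics and the case-insensitive matching concrete, here is a minimal plain-Python sketch. It is not gget's implementation, which queries the Ensembl database; the function name `matches` is an assumption made for this illustration, and the description string is the GABARAPL2 description shown in the example output of this section.

```python
def matches(description, searchwords, andor="or"):
    """Case-insensitive check of one description against a list of searchwords."""
    text = description.lower()
    hits = [word.lower() in text for word in searchwords]
    if andor == "and":
        # Every searchword must occur in the description.
        return all(hits)
    if andor == "or":
        # At least one searchword must occur in the description.
        return any(hits)
    raise ValueError("andor must be 'or' or 'and'")


description = "GABA type A receptor associated protein like 2"
words = ["gaba", "gamma-aminobutyric"]

print(matches(description, words, andor="or"))   # True: "gaba" is present
print(matches(description, words, andor="and"))  # False: "gamma-aminobutyric" is absent
```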

Examples

Query Ensembl for genes from a specific species using multiple searchwords

# Jupyter Lab / Google Colab:
search(["gaba", "gamma-aminobutyric"], "homo_sapiens")

# Terminal 
$ gget search -sw gaba,gamma-aminobutyric -s homo_sapiens

→ Returns all genes that contain at least one of the searchwords in their Ensembl or external reference description, in the format:

Ensembl_ID	Ensembl_description	Ext_ref_description	Biotype	Gene_name	URL
ENSG00000034713	GABA type A receptor associated protein like 2 [Source:HGNC Symbol;Acc:HGNC:13291]	GABA type A receptor associated protein like 2	protein_coding	GABARAPL2	https://uswest.ensembl.org/homo_sapiens/Gene/Summary?g=ENSG00000034713
. . .	. . .	. . .	. . .	. . .	. . .

Query Ensembl for transcripts from a specific species which include ALL searchwords

# Jupyter Lab / Google Colab:
search(["gaba", "gamma-aminobutyric"], "nothobranchius_furzeri", d_type="transcript", andor="and")

# Terminal 
$ gget search -sw gaba,gamma-aminobutyric -s nothobranchius_furzeri -t transcript -ao and

→ Returns all killifish transcripts that contain all of the searchwords in their Ensembl or external reference description.

Query Ensembl for genes from a specific species using a single searchword while limiting the number of returned search results

# Jupyter Lab / Google Colab:
search("gaba", "homo_sapiens", limit=10)

# Terminal 
$ gget search -sw gaba -s homo_sapiens -l 10

→ Returns the first 10 genes that contain the searchword in their Ensembl or external reference description. If more than one searchword is passed, limit will limit the number of genes per searchword.

Query Ensembl for genes from any of the 236 species databases listed at http://ftp.ensembl.org/pub/release-105/mysql/, e.g. a specific mouse strain

# Jupyter Lab / Google Colab:
search("brain", "mus_musculus_cbaj_core_105_1")

# Terminal 
$ gget search -sw brain -s mus_musculus_cbaj_core_105_1

→ Returns genes from the CBA/J mouse strain that contain the searchword in their Ensembl or external reference description.

gget info

Look up gene or transcript Ensembl IDs. Returns their common name, description, homologs, synonyms, corresponding transcript/gene, transcript isoforms and more from the Ensembl database as well as external references.

Options

-id --ens_ids
One or more Ensembl IDs.

-e --expand
Expand returned information (default: False). For genes: add isoform information. For transcripts: add translation and exon information.

-H --homology
Returns homology information of ID (default: False).

-x --xref
Returns information from external references (default: False).

Examples

Look up a list of gene Ensembl IDs including information on all isoforms

# Jupyter Lab / Google Colab:
info(["ENSG00000034713", "ENSG00000104853", "ENSG00000170296"], expand=True)

# Terminal 
$ gget info -id ENSG00000034713,ENSG00000104853,ENSG00000170296 -e

→ Returns a json containing information about each ID, amongst others the common name, description, and corresponding transcript/gene, in the format:

{
            "Ensembl ID": {
                        "species": genus_species,
                        "object_type": e.g. Gene,
                        "biotype": Gene biotype, e.g. protein_coding,
                        "display_name": Common gene name,
                        "description": Ensembl description,
                        "assembly_name": Name of species assembly,
                        "seq_region_name": Sequence region,
                        "start": Sequence start position,
                        "end": Sequence end position,
                        "strand": Strand,
                        "canonical_transcript": Transcript ID,
                        # All transcript isoforms:
                        "Transcript": [{'display_name': Transcript name,
                                        'biotype': Transcript biotype,
                                        'id': Transcript ID}, ...]
                        },
}

Note: When looking up Ensembl IDs of transcripts instead of genes, the "Transcript" entry above will be replaced by "Translation" and "Exon" information.
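
The sketch below shows one way the isoform list in such a result might be read, assuming a dictionary shaped like the format above for a gene looked up with expand=True. The IDs, names, and biotypes in the literal are placeholders, and the helper `isoform_ids` is hypothetical rather than part of gget.

```python
# Hypothetical stand-in for a result shaped like the documented format.
# IDs, names, and biotypes are placeholders, not real Ensembl entries.
example_info = {
    "GENE_ID_PLACEHOLDER": {
        "object_type": "Gene",
        "display_name": "GENE_NAME",
        "canonical_transcript": "TRANSCRIPT_ID_1",
        "Transcript": [
            {"display_name": "GENE_NAME-201", "biotype": "protein_coding", "id": "TRANSCRIPT_ID_1"},
            {"display_name": "GENE_NAME-202", "biotype": "retained_intron", "id": "TRANSCRIPT_ID_2"},
        ],
    }
}


def isoform_ids(info_result, biotype=None):
    """Return {gene_id: [transcript IDs]}, optionally keeping only one biotype."""
    isoforms = {}
    for gene_id, record in info_result.items():
        # "Transcript" is only present for genes looked up with expanded information.
        transcripts = record.get("Transcript", [])
        isoforms[gene_id] = [
            t["id"] for t in transcripts if biotype is None or t["biotype"] == biotype
        ]
    return isoforms


print(isoform_ids(example_info))
print(isoform_ids(example_info, biotype="protein_coding"))
```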

Look up a transcript Ensembl ID and include external reference descriptions

# Jupyter Lab / Google Colab:
info("ENSDART00000135343", xref=True)

# Terminal 
$ gget info -id ENSDART00000135343 -x

→ Returns a json containing the external reference description of the ID in addition to the standard information mentioned above.

gget seq

Fetch DNA sequences from gene or transcript Ensembl IDs.

Options

-id --ens_ids
One or more Ensembl IDs.

-i --isoforms
If a gene Ensembl ID is passed, this returns sequences of all known transcript isoforms.

-o --out
Path to the file the results will be saved in, e.g. path/to/directory/results.fa (default: None → just prints results).
For Jupyter Lab / Google Colab: save=True will save the output FASTA to the current working directory.

Examples

Fetch the sequences of several transcript Ensembl IDs

# Jupyter Lab / Google Colab:
seq(["ENST00000441207","ENST00000587537"])

# Terminal 
$ gget seq -id ENST00000441207,ENST00000587537

→ Returns a FASTA containing the sequence of each ID, in the format:

>Ensembl_ID chromosome:assembly:seq_region_name:seq_region_start:seq_region_end:strand
GGGAATGGAAATCTGTCCCTCGTGCTGGAAGCCAACCAGTGGTGATGACTCTGTGTGCCACTCCGCCTCCTACAGCGCGGATCCTCTG  
CGTGTGTCCTCGCAAGACAAGCTCGATGAAATGGCCGAGTCCAGTCAAGCAAACTTTGAGGGAA...
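
For downstream processing, FASTA text in this layout can be split into records with a few lines of standard-library Python. The following is a minimal sketch that assumes the header format shown above; the function name `parse_fasta` and the two short placeholder records are made up for the example and are not real Ensembl sequences.

```python
def parse_fasta(text):
    """Split FASTA text into records with ID, location fields, and sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: ">Ensembl_ID chromosome:assembly:name:start:end:strand"
            ens_id, _, location = line[1:].partition(" ")
            records.append({"id": ens_id, "location": location.split(":"), "sequence": ""})
        elif records:
            # Sequence lines are concatenated onto the most recent record.
            records[-1]["sequence"] += line
    return records


# Placeholder FASTA text following the documented header format.
example_fasta = """>EXAMPLE_ID_1 chromosome:ASSEMBLY:1:100:111:1
ACGTAC
GTACGT
>EXAMPLE_ID_2 chromosome:ASSEMBLY:2:200:205:-1
TTGACA
"""

for record in parse_fasta(example_fasta):
    print(record["id"], record["location"], len(record["sequence"]))
```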

Fetch the sequences of a gene Ensembl ID and all its transcript isoforms

# Jupyter Lab / Google Colab:
seq("ENSMUSG00000025040", isoforms=True)

# Terminal 
$ gget seq -id ENSMUSG00000025040 -i

→ Returns a FASTA containing the sequence of the gene ID and the sequences of all of its transcripts.

Author: Laura Luebbert

Release history

0.28.4

Feb 1, 2024

0.28.3

Jan 22, 2024

0.28.2

Nov 16, 2023

0.28.0

Nov 12, 2023

0.27.9

Aug 7, 2023

0.27.8

Jul 12, 2023

0.27.7

May 16, 2023

0.27.6 yanked

May 2, 2023

Reason this release was yanked:

Requirement clashes

0.27.5

Apr 6, 2023

0.27.4

Mar 19, 2023

0.27.3

Mar 11, 2023

0.27.2

Jan 1, 2023

0.27.1

Dec 30, 2022

0.27.0

Dec 10, 2022

0.3.13

Nov 11, 2022

0.3.12

Nov 10, 2022

0.3.11

Sep 7, 2022

0.3.10

Sep 2, 2022

0.3.9

Aug 25, 2022

0.3.8

Aug 12, 2022

0.3.7

Aug 9, 2022

0.3.5

Aug 6, 2022

0.3.4 yanked

Aug 6, 2022

Reason this release was yanked:

Bug in gget alphafold reading .fa files

0.3.3 yanked

Aug 5, 2022

Reason this release was yanked:

Bug in gget alphafold reading .fa files

0.3.1 yanked

Aug 5, 2022

Reason this release was yanked:

Bug in gget alphafold relax flag

0.3.0 yanked

Aug 4, 2022

Reason this release was yanked:

Bug in gget alphafold relax flag

0.2.7

Jul 29, 2022

0.2.6

Jul 8, 2022

0.2.5

Jun 30, 2022

0.2.4

Jun 29, 2022

0.2.3

Jun 27, 2022

0.2.2

Jun 24, 2022

0.2.1

Jun 9, 2022

0.2.0

Jun 8, 2022

0.1.2

Jun 3, 2022

0.1.1

May 28, 2022

0.1.0

May 25, 2022

0.0.24

May 17, 2022

0.0.23 yanked

May 17, 2022

Reason this release was yanked:

Bug in terminal functionality

0.0.22

May 10, 2022

0.0.17

Mar 2, 2022

This version

0.0.16

Mar 2, 2022

0.0.6

Feb 26, 2022

0.0.5

Feb 25, 2022

0.0.4

Feb 22, 2022

Download files

Source Distribution

gget-0.0.16.tar.gz (19.9 kB)

Uploaded Mar 2, 2022 Source

Built Distribution

gget-0.0.16-py3-none-any.whl (16.7 kB)

Uploaded Mar 2, 2022 Python 3

Hashes for gget-0.0.16.tar.gz
Algorithm	Hash digest
SHA256	`112f877de1a7b60c3837eac61c304b0b28a3451053917278490710bd2730b929`
MD5	`7735d9fe1f5ee7f872b77e88d21cb963`
BLAKE2b-256	`5a774eee5a214dfea90be9578a14fb48f75ab4b30de3d18e4d08173bdc957ccb`

Hashes for gget-0.0.16-py3-none-any.whl
Algorithm	Hash digest
SHA256	`797558fcccfc97a3b4f758c30b15efe70420f46bf18f19ce1c73560a3886da6f`
MD5	`43de943f62792bcee05fd02234022eb0`
BLAKE2b-256	`1b79401e96384daeda1192230af52bc438acffe7dd9ce19d5615d5a3408a1159`

gget 0.0.16
