Python interface to Ensembl reference genome metadata
PyEnsembl
PyEnsembl is a Python interface to Ensembl reference genome metadata such as exons and transcripts. PyEnsembl downloads GTF and FASTA files from the Ensembl FTP server and loads them into a local database. PyEnsembl can also work with custom reference data specified using user-supplied GTF and FASTA files.
Example Usage
from pyensembl import EnsemblRelease
# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)
# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
Installation
You can install PyEnsembl using pip:
pip install pyensembl
This should also install any required packages such as datacache.
Before using PyEnsembl, run the following command to download and install Ensembl data:
pyensembl install --release <list of Ensembl release numbers> --species <species-name>
For example, the following command will download and install all human reference data from Ensembl releases 75 and 76:
pyensembl install --release 75 76 --species human
Alternatively, you can create the EnsemblRelease object from inside a Python process and call ensembl_object.download() followed by ensembl_object.index().
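When installation has to be scripted, for instance in a setup step, the command line shown above can be assembled programmatically. The following is a minimal sketch: the helper name `build_install_command` is hypothetical, and only the `install`, `--release` and `--species` arguments from the command above are used. Running the command is left to the caller.

```python
import shlex


def build_install_command(releases, species):
    """Build the argument list for `pyensembl install` (hypothetical helper)."""
    command = ["pyensembl", "install", "--release"]
    command += [str(release) for release in releases]  # one or more release numbers
    command += ["--species", species]
    return command


cmd = build_install_command([75, 76], "human")
print(shlex.join(cmd))  # pyensembl install --release 75 76 --species human
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where PyEnsembl is installed.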
Cache Location
By default, PyEnsembl uses the platform-specific Cache folder and caches the files into the pyensembl sub-directory.
You can override this default by setting the environment variable PYENSEMBL_CACHE_DIR to your preferred location for caching:
export PYENSEMBL_CACHE_DIR=/custom/cache/dir
or
import os
os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
# ... PyEnsembl API usage
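The Python variant above can be wrapped in a small helper so that the cache location is set in one place. This is only a sketch: `use_cache_dir` is a hypothetical name, creating the directory up front is a precaution rather than a documented requirement, and a temporary directory stands in for a real cache path.

```python
import os
import tempfile


def use_cache_dir(path):
    """Point PYENSEMBL_CACHE_DIR at `path` (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)  # precaution; not stated as required
    os.environ["PYENSEMBL_CACHE_DIR"] = path
    return path


# A throwaway location for illustration; use a persistent path in practice.
cache_dir = use_cache_dir(os.path.join(tempfile.mkdtemp(), "pyensembl_cache"))
print(os.environ["PYENSEMBL_CACHE_DIR"] == cache_dir)  # True
```

As in the snippet above, the variable is set before any PyEnsembl API usage.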
Usage tips
List installed genomes
To see the genomes for which PyEnsembl has already downloaded and indexed metadata you can run:
pyensembl list
Or equivalently do this in Python:
from pyensembl.shell import collect_all_installed_ensembl_releases
collect_all_installed_ensembl_releases()
Load genome in Python
Here's an example Python snippet that loads fly genome data from Ensembl release v100:
from pyensembl import EnsemblRelease
data = EnsemblRelease(release=100, species='drosophila_melanogaster')
Data structures
Gene
gene = data.gene_by_id(gene_id='FBgn0011747')
Transcript
transcript = gene.transcripts[0]
Protein information
transcript.protein_id
transcript.protein_sequence
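Taking `gene.transcripts[0]` may return a transcript without a protein product. The sketch below filters a gene's transcripts down to those that report a protein ID. It assumes, without confirmation from the text above, that such transcripts expose a falsy `protein_id`; the helper name is hypothetical, and `SimpleNamespace` objects with made-up IDs stand in for real Gene and Transcript objects so the snippet runs without any downloaded data.

```python
from types import SimpleNamespace


def transcripts_with_protein(gene):
    """Return the transcripts of `gene` that have a protein ID (hypothetical helper)."""
    return [t for t in gene.transcripts if getattr(t, "protein_id", None)]


# Stand-in objects with invented identifiers, for illustration only.
gene = SimpleNamespace(transcripts=[
    SimpleNamespace(transcript_id="T1", protein_id=None, protein_sequence=None),
    SimpleNamespace(transcript_id="T2", protein_id="P2", protein_sequence="MKV"),
])

coding = transcripts_with_protein(gene)
print([t.transcript_id for t in coding])  # ['T2']
```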
Non-Ensembl Data
PyEnsembl also allows arbitrary genomes via the specification of local file paths or remote URLs to both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)
For example:
from pyensembl import Genome
data = Genome(
    reference_name='GRCh38',
    annotation_name='my_genome_features',
    # annotation_version=None,
    gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf',  # Path or URL of GTF file
    # transcript_fasta_paths_or_urls=None,  # List of paths or URLs of FASTA files containing transcript sequences
    # protein_fasta_paths_or_urls=None,  # List of paths or URLs of FASTA files containing protein sequences
    # cache_directory_path=None,  # Where to place downloaded and cached files for this genome
)
# parse GTF and construct database of genomic features
data.index()
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
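Because GTF formats can vary, it may help to inspect a custom file before indexing it. The sketch below parses the attribute column (the ninth tab-separated field of a GTF line) and reports which expected keys are absent. The helper names are hypothetical, and the default keys `gene_id` and `transcript_id` are illustrative: the text above does not state which attributes PyEnsembl requires. The sample line is made up.

```python
import re

# Matches `key "value";` pairs; unquoted values are ignored by this sketch.
_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;?')


def parse_gtf_attributes(column):
    """Turn a GTF attribute column into a dict of key/value strings."""
    return dict(_ATTRIBUTE.findall(column))


def missing_attributes(gtf_line, required=("gene_id", "transcript_id")):
    """List the `required` keys absent from one GTF line (hypothetical helper)."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, found {len(fields)}")
    attributes = parse_gtf_attributes(fields[8])
    return [key for key in required if key not in attributes]


# Invented example line: a gene-level record without a transcript_id.
line = "\t".join(
    ["chr1", "example", "gene", "100", "200", ".", "+", ".",
     'gene_id "GENE1"; gene_name "demo";']
)
print(missing_attributes(line))  # ['transcript_id']
```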
API
The EnsemblRelease object has methods to let you access all possible combinations of the annotation features gene_name, gene_id, transcript_name, transcript_id, exon_id as well as the location of these genomic elements (contig, start position, end position, strand).
Genes
- genes(contig=None, strand=None)
- Returns a list of Gene objects, optionally restricted to a particular contig or strand.
- genes_at_locus(contig, position, end=None, strand=None)
- Returns a list of Gene objects overlapping a particular position on a contig. Pass end to extend the query to a range, and pass strand='+' or strand='-' to restrict it to the forward or reverse strand.
- gene_by_id(gene_id)
- Return a Gene object for given Ensembl gene ID (e.g. "ENSG00000068793").
- gene_names(contig=None, strand=None)
- Returns all gene names in the annotation database, optionally restricted to a particular contig or strand.
- genes_by_name(gene_name)
- Get all the unique genes with the given name (there might be multiple due to copies in the genome); returns a list containing a Gene object for each distinct ID.
- gene_by_protein_id(protein_id)
- Find Gene associated with the given Ensembl protein ID (e.g. "ENSP00000350283")
- gene_names_at_locus(contig, position, end=None, strand=None)
- Names of genes overlapping with the given locus, optionally restricted by strand. (returns a list to account for overlapping genes)
- gene_name_of_gene_id(gene_id)
- Returns name of gene with given gene ID.
- gene_name_of_transcript_id(transcript_id)
- Returns name of gene associated with given transcript ID.
- gene_name_of_transcript_name(transcript_name)
- Returns name of gene associated with given transcript name.
- gene_name_of_exon_id(exon_id)
- Returns name of gene associated with given exon ID.
- gene_ids(contig=None, strand=None)
- Return all gene IDs in the annotation database, optionally restricted by chromosome name or strand.
- gene_ids_of_gene_name(gene_name)
- Returns all Ensembl gene IDs with the given name.
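The locus queries above (a position, an optional end, an optional strand) can be pictured with a small in-memory model. This is not PyEnsembl's implementation, which works against its local database; it is a sketch that assumes inclusive coordinates, and the `Feature` class, the function name and the sample features are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """Simplified stand-in for a genomic feature (hypothetical)."""
    name: str
    contig: str
    start: int
    end: int
    strand: str


def features_at_locus(features, contig, position, end=None, strand=None):
    """Return features overlapping [position, end] on `contig`."""
    if end is None:
        end = position  # a single position is a range of length one
    contig = str(contig)  # accept contig=6 as well as contig='6'
    return [
        f for f in features
        if f.contig == contig
        and f.start <= end and f.end >= position  # interval overlap test
        and (strand is None or f.strand == strand)
    ]


features = [
    Feature("A", "6", 100, 200, "+"),
    Feature("B", "6", 150, 300, "-"),
    Feature("C", "7", 100, 200, "+"),
]
print([f.name for f in features_at_locus(features, 6, 160)])  # ['A', 'B']
print([f.name for f in features_at_locus(features, 6, 90, end=120)])  # ['A']
```

Returning a list rather than a single item reflects the fact that features can overlap at the same locus.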
Transcripts
- transcripts(contig=None, strand=None)
- Returns a list of Transcript objects for all transcript entries in the Ensembl database, optionally restricted to a particular contig or strand.
- transcript_by_id(transcript_id)
- Construct a Transcript object for given Ensembl transcript ID (e.g. "ENST00000369985")
- transcripts_by_name(transcript_name)
- Returns a list of Transcript objects for every transcript matching the given name.
- transcript_names(contig=None, strand=None)
- Returns all transcript names in the annotation database.
- transcript_ids(contig=None, strand=None)
- Returns all transcript IDs in the annotation database.
- transcript_ids_of_gene_id(gene_id)
- Return IDs of all transcripts associated with given gene ID.
- transcript_ids_of_gene_name(gene_name)
- Return IDs of all transcripts associated with given gene name.
- transcript_ids_of_transcript_name(transcript_name)
- Find all Ensembl transcript IDs with the given name.
- transcript_ids_of_exon_id(exon_id)
- Return IDs of all transcripts associated with given exon ID.
Exons
- exon_ids(contig=None, strand=None)
- Returns a list of exon IDs in the annotation database, optionally restricted by the given chromosome and strand.
- exon_by_id(exon_id)
- Construct an Exon object for given Ensembl exon ID (e.g. "ENSE00001209410")
- exon_ids_of_gene_id(gene_id)
- Returns a list of exon IDs associated with a given gene ID.
- exon_ids_of_gene_name(gene_name)
- Returns a list of exon IDs associated with a given gene name.
- exon_ids_of_transcript_id(transcript_id)
- Returns a list of exon IDs associated with a given transcript ID.
- exon_ids_of_transcript_name(transcript_name)
- Returns a list of exon IDs associated with a given transcript name.
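The ID-mapping methods listed above can be chained. As a sketch, the hypothetical helper below groups exon IDs by transcript for a given gene name, using only `transcript_ids_of_gene_name` and `exon_ids_of_transcript_id`. A stand-in class with invented IDs replaces a real genome object so the snippet runs without downloaded data.

```python
def exon_ids_by_transcript(genome, gene_name):
    """Map each transcript ID of `gene_name` to its exon IDs (hypothetical helper)."""
    return {
        transcript_id: genome.exon_ids_of_transcript_id(transcript_id)
        for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
    }


class FakeGenome:
    """Stand-in exposing two of the methods listed above, with invented IDs."""

    _transcripts = {"GENE1": ["T1", "T2"]}
    _exons = {"T1": ["E1", "E2"], "T2": ["E2", "E3"]}

    def transcript_ids_of_gene_name(self, gene_name):
        return list(self._transcripts[gene_name])

    def exon_ids_of_transcript_id(self, transcript_id):
        return list(self._exons[transcript_id])


mapping = exon_ids_by_transcript(FakeGenome(), "GENE1")
print(mapping)  # {'T1': ['E1', 'E2'], 'T2': ['E2', 'E3']}
```

With an installed release, the same helper could be called as `exon_ids_by_transcript(EnsemblRelease(77), 'HLA-A')`.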