Python interface to Ensembl reference genome metadata
PyEnsembl
Python interface to Ensembl reference genome metadata (exons, transcripts, etc.)
Example Usage
from pyensembl import EnsemblRelease
# release 77 uses human reference genome GRCh38
data = EnsemblRelease(77)
# will return ['HLA-A']
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
# get all exons associated with HLA-A
exon_ids = data.exon_ids_of_gene_name('HLA-A')
Installation
You can install PyEnsembl using pip:
pip install pyensembl
This should also install any required packages, such as datacache and BioPython.
Before using PyEnsembl, run the following command to download and install Ensembl data:
pyensembl install --release <list of Ensembl release numbers>
For example, pyensembl install --release 75 76 will download and install all data for Ensembl releases 75 and 76.
Alternatively, you can create the EnsemblRelease object with auto_download=True. PyEnsembl will then download your data as you need it, and there will be a delay of several minutes after your first command.
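The install command shown above can also be scripted, for instance from a setup step in a pipeline. The following is a minimal sketch, not part of PyEnsembl itself: build_install_command and install_releases are hypothetical helper names, and the sketch assumes the pyensembl executable is on your PATH.

```python
import subprocess


def build_install_command(releases):
    """Build the argument list for `pyensembl install --release ...`.

    Hypothetical helper: `releases` is an iterable of Ensembl release numbers.
    """
    release_args = [str(int(r)) for r in releases]
    if not release_args:
        raise ValueError("at least one Ensembl release number is required")
    return ["pyensembl", "install", "--release"] + release_args


def install_releases(releases):
    """Run the install command; raises CalledProcessError if it fails."""
    subprocess.run(build_install_command(releases), check=True)
```

Calling install_releases([75, 76]) would run the same command as the example above.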
Non-Ensembl Data
PyEnsembl also allows arbitrary genomes via the specification of local file paths or remote URLs to both Ensembl and non-Ensembl GTF and FASTA files. (Warning: GTF formats can vary, and handling of non-Ensembl data is still very much in development.)
For example:
# Genome is assumed here to be importable from the top-level package
from pyensembl import Genome
data = Genome(reference_name='GRCh38', gtf_path_or_url='/My/local/gtf/path.gtf')
gene_names = data.gene_names_at_locus(contig=6, position=29945884)
API
The EnsemblRelease object has methods to let you access all possible combinations of the annotation features gene_name, gene_id, transcript_name, transcript_id, exon_id as well as the location of these genomic elements (contig, start position, end position, strand).
Genes
genes(contig=None, strand=None) : returns list of Gene objects, optionally restricted to a particular contig or strand.
genes_at_locus(contig, position, end=None, strand=None) : returns list of Gene objects overlapping a particular position on a contig; optionally extend the position into a range with the end parameter, and restrict to the forward or backward strand by passing strand='+' or strand='-'.
gene_by_id(gene_id) : return Gene object for given Ensembl gene ID (e.g. “ENSG00000068793”)
gene_names(contig=None, strand=None) : returns all gene names in the annotation database, optionally restricted to a particular contig or strand.
genes_by_name(gene_name) : get all the unique genes with the given name (there might be multiple due to copies in the genome); returns a list containing a Gene object for each distinct ID.
gene_by_protein_id(protein_id) : find Gene associated with the given Ensembl protein ID (e.g. “ENSP00000350283”)
gene_names_at_locus(contig, position, end=None, strand=None) : names of genes overlapping with the given locus (returns a list to account for overlapping genes)
gene_name_of_gene_id(gene_id) : name of gene with given ID
gene_name_of_transcript_id(transcript_id) : name of gene associated with given transcript ID
gene_name_of_transcript_name(transcript_name) : name of gene associated with given transcript name
gene_name_of_exon_id(exon_id) : name of gene associated with given exon ID
gene_ids(contig=None, strand=None) : all gene IDs in the annotation database
gene_ids_of_gene_name(gene_name) : all Ensembl gene IDs with the given name
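As an illustration of how the gene methods combine, the sketch below maps every gene name found at a locus to its Ensembl gene IDs. describe_locus is a hypothetical helper, not part of PyEnsembl; it only assumes that the object passed in exposes gene_names_at_locus and gene_ids_of_gene_name with the signatures listed above.

```python
def describe_locus(genome, contig, position, strand=None):
    """Map each gene name overlapping a locus to its list of gene IDs.

    `genome` is assumed to be an EnsemblRelease-like object (hypothetical
    helper; relies only on the two methods named in the list above).
    """
    names = genome.gene_names_at_locus(
        contig=contig, position=position, strand=strand)
    return {name: list(genome.gene_ids_of_gene_name(name)) for name in names}


# Usage sketch (requires installed Ensembl data, so it is left commented out):
# from pyensembl import EnsemblRelease
# data = EnsemblRelease(77)
# print(describe_locus(data, contig=6, position=29945884))
```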
Transcripts
transcripts(contig=None, strand=None) : returns list of Transcript objects for all transcript entries in the Ensembl database, optionally restricted to a particular contig or strand.
transcript_by_id(transcript_id) : construct Transcript object for given Ensembl transcript ID (e.g. “ENST00000369985”)
transcripts_by_name(transcript_name) : returns list of Transcript objects for every transcript matching the given name.
transcript_names(contig=None, strand=None) : all transcript names in the annotation database
transcript_ids(contig=None, strand=None) : returns all transcript IDs in the annotation database
transcript_ids_of_gene_id(gene_id) : return IDs of all transcripts associated with given gene ID
transcript_ids_of_gene_name(gene_name) : return IDs of all transcripts associated with given gene name
transcript_ids_of_transcript_name(transcript_name) : find all Ensembl transcript IDs with the given name
transcript_ids_of_exon_id(exon_id) : return IDs of all transcripts associated with given exon ID
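The transcript lookups can be chained with the exon lookups. The sketch below, assuming an EnsemblRelease-like object, lists the exon IDs of every transcript of a named gene. transcripts_to_exons is a hypothetical helper name; the two methods it calls are the ones documented in this section.

```python
def transcripts_to_exons(genome, gene_name):
    """Return {transcript_id: [exon_id, ...]} for one gene name.

    Hypothetical helper; `genome` is assumed to provide
    transcript_ids_of_gene_name and exon_ids_of_transcript_id.
    """
    transcript_ids = genome.transcript_ids_of_gene_name(gene_name)
    return {
        transcript_id: list(genome.exon_ids_of_transcript_id(transcript_id))
        for transcript_id in transcript_ids
    }


# Usage sketch (requires installed Ensembl data, so it is left commented out):
# from pyensembl import EnsemblRelease
# data = EnsemblRelease(77)
# exons_by_transcript = transcripts_to_exons(data, 'HLA-A')
```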
Exons
exon_ids(contig=None, strand=None) : returns all exon IDs in the annotation database, optionally restricted to a particular contig or strand.
exon_ids_of_gene_id(gene_id)
exon_ids_of_gene_name(gene_name)
exon_ids_of_transcript_name(transcript_name)
exon_ids_of_transcript_id(transcript_id)
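The exon methods above carry no descriptions, so here is a small sketch of one way they might be used: counting the distinct exon IDs annotated for each of several gene names. exon_counts is a hypothetical helper, and the sketch assumes exon_ids_of_gene_name returns an iterable of exon ID strings, as the Example Usage section suggests.

```python
def exon_counts(genome, gene_names):
    """Return {gene_name: number of distinct exon IDs} for the given names.

    Hypothetical helper; `genome` is assumed to be an EnsemblRelease-like
    object providing exon_ids_of_gene_name.
    """
    return {
        name: len(set(genome.exon_ids_of_gene_name(name)))
        for name in gene_names
    }


# Usage sketch (requires installed Ensembl data, so it is left commented out):
# from pyensembl import EnsemblRelease
# data = EnsemblRelease(77)
# print(exon_counts(data, ['HLA-A']))
```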