Python interface to Ensembl reference genome metadata

PyEnsembl
=========

PyEnsembl is a Python interface to `Ensembl <http://www.ensembl.org>`__
reference genome metadata such as exons and transcripts. PyEnsembl
downloads `GTF <https://en.wikipedia.org/wiki/Gene_transfer_format>`__
and `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`__ files from
the `Ensembl FTP server <ftp://ftp.ensembl.org>`__ and loads them into a
local database. PyEnsembl can also work with custom reference data
specified using user-supplied GTF and FASTA files.

Example Usage
=============

.. code:: python

    from pyensembl import EnsemblRelease

    # release 77 uses human reference genome GRCh38
    data = EnsemblRelease(77)

    # will return ['HLA-A']
    gene_names = data.gene_names_at_locus(contig=6, position=29945884)

    # get all exons associated with HLA-A
    exon_ids = data.exon_ids_of_gene_name('HLA-A')

Installation
============

You can install PyEnsembl using
`pip <https://pip.pypa.io/en/latest/quickstart.html>`__:

.. code:: sh

    pip install pyensembl

This should also install any required packages, such as
`datacache <https://github.com/openvax/datacache>`__ and
`BioPython <http://biopython.org/>`__.

Before using PyEnsembl, run the following command to download and
install Ensembl data:

::

    pyensembl install --release <list of Ensembl release numbers> --species <species-name>

For example, ``pyensembl install --release 75 76 --species human`` will
download and install all human reference data from Ensembl releases 75
and 76.

Alternatively, you can create the ``EnsemblRelease`` object from inside
a Python process and call ``ensembl_object.download()`` followed by
``ensembl_object.index()``.
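
The in-process route can be wrapped in a small helper. The following is a
minimal sketch: ``ensure_installed`` is a hypothetical name, not part of
PyEnsembl, and it relies only on the ``download()`` and ``index()`` methods
mentioned above.

.. code:: python

    def ensure_installed(genome):
        """Download and index the data behind a PyEnsembl genome object.

        ``genome`` is expected to be an ``EnsemblRelease`` (or any object
        exposing ``download()`` and ``index()``). This helper is hypothetical
        and only illustrates the call order described above.
        """
        genome.download()  # fetch the GTF and FASTA files into the local cache
        genome.index()     # build the local database from the downloaded files
        return genome


    # Example (requires pyensembl and network access):
    #   from pyensembl import EnsemblRelease
    #   data = ensure_installed(EnsemblRelease(77))
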

Cache Location
--------------

By default, PyEnsembl uses the platform-specific ``Cache`` folder and
caches the files into the ``pyensembl`` sub-directory. You can override
this default by setting the environment variable ``PYENSEMBL_CACHE_DIR``
to your preferred cache location:

.. code:: sh

    export PYENSEMBL_CACHE_DIR=/custom/cache/dir

or

.. code:: python

    import os

    os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'
    # ... PyEnsembl API usage

Non-Ensembl Data
================

PyEnsembl also supports arbitrary genomes: you can specify local file
paths or remote URLs for both Ensembl and non-Ensembl GTF and FASTA
files. (Warning: GTF formats can vary, and handling of non-Ensembl data
is still very much in development.)

For example:

.. code:: python

    from pyensembl import Genome

    data = Genome(
        reference_name='GRCh38',
        annotation_name='my_genome_features',
        gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf')
    # parse GTF and construct database of genomic features
    data.index()
    gene_names = data.gene_names_at_locus(contig=6, position=29945884)

API
===

The ``EnsemblRelease`` object has methods to let you access all possible
combinations of the annotation features *gene_name*, *gene_id*,
*transcript_name*, *transcript_id*, *exon_id* as well as the location of
these genomic elements (contig, start position, end position, strand).
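
As a rough illustration of how these lookups compose, the sketch below walks
from a gene name to gene IDs, transcript IDs and exon IDs. The function name
``exon_ids_by_transcript`` is hypothetical; it uses only methods listed in
the sections below and accepts any object that provides them, such as an
``EnsemblRelease``.

.. code:: python

    def exon_ids_by_transcript(genome, gene_name):
        """Map gene ID -> transcript ID -> exon IDs for one gene name.

        Hypothetical helper, not part of PyEnsembl.
        """
        result = {}
        for gene_id in genome.gene_ids_of_gene_name(gene_name):
            result[gene_id] = {
                transcript_id: genome.exon_ids_of_transcript_id(transcript_id)
                for transcript_id in genome.transcript_ids_of_gene_id(gene_id)
            }
        return result


    # Example (requires installed Ensembl data):
    #   from pyensembl import EnsemblRelease
    #   nested = exon_ids_by_transcript(EnsemblRelease(77), 'HLA-A')
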

Genes
-----

.. raw:: html

<dl>

.. raw:: html

<dt>

genes(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of Gene objects, optionally restricted to a particular
contig or strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

genes_at_locus(contig, position, end=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of Gene objects overlapping a particular position on a
contig. Pass ``end`` to extend the query to a range, and pass
``strand='+'`` or ``strand='-'`` to restrict it to the forward or
backward strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_by_id(gene_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a Gene object for the given Ensembl gene ID (e.g. "ENSG00000068793").

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_names(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

genes_by_name(gene_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns all the unique genes with the given name (there might be several
due to copies in the genome) as a list containing one Gene object for
each distinct ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_by_protein_id(protein_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns the Gene associated with the given Ensembl protein ID (e.g.
"ENSP00000350283").

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_names_at_locus(contig, position, end=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Names of genes overlapping with the given locus, optionally restricted
by strand. (returns a list to account for overlapping genes)

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_name_of_gene_id(gene_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns name of gene with given gene ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_name_of_transcript_id(transcript_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns name of gene associated with given transcript ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_name_of_transcript_name(transcript_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns name of gene associated with given transcript name.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_name_of_exon_id(exon_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns name of gene associated with given exon ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_ids(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

gene_ids_of_gene_name(gene_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns all Ensembl gene IDs with the given name.

.. raw:: html

</dd>

.. raw:: html

</dl>
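
To illustrate the ``strand`` and ``end`` parameters of the locus queries
above, here is a minimal sketch that splits the gene names at a locus by
strand. ``gene_names_by_strand`` is a hypothetical helper; it calls only
``gene_names_at_locus`` with the signature listed above.

.. code:: python

    def gene_names_by_strand(genome, contig, position, end=None):
        """Return {'+': [...], '-': [...]} gene names overlapping a locus.

        Hypothetical helper, not part of PyEnsembl.
        """
        return {
            strand: genome.gene_names_at_locus(
                contig, position, end=end, strand=strand)
            for strand in ("+", "-")
        }


    # Example (requires installed Ensembl data):
    #   from pyensembl import EnsemblRelease
    #   names = gene_names_by_strand(EnsemblRelease(77), 6, 29945884)
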

Transcripts
-----------

.. raw:: html

<dl>

.. raw:: html

<dt>

transcripts(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or
strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_by_id(transcript_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Construct a Transcript object for given Ensembl transcript ID (e.g.
“ENST00000369985”)

.. raw:: html

</dd>

.. raw:: html

<dt>

transcripts_by_name(transcript_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of Transcript objects for every transcript matching the
given name.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_names(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns all transcript names in the annotation database.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_ids(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns all transcript IDs in the annotation database.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_ids_of_gene_id(gene_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Return IDs of all transcripts associated with given gene ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_ids_of_gene_name(gene_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Return IDs of all transcripts associated with given gene name.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_ids_of_transcript_name(transcript_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Find all Ensembl transcript IDs with the given name.

.. raw:: html

</dd>

.. raw:: html

<dt>

transcript_ids_of_exon_id(exon_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Return IDs of all transcripts associated with given exon ID.

.. raw:: html

</dd>

.. raw:: html

</dl>
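
The ID lookups and object constructors above can be combined. The sketch
below collects a Transcript object for every transcript of a named gene;
``transcripts_of_gene_name`` is a hypothetical helper built only on
``transcript_ids_of_gene_name`` and ``transcript_by_id``.

.. code:: python

    def transcripts_of_gene_name(genome, gene_name):
        """Map transcript ID -> Transcript object for one gene name.

        Hypothetical helper, not part of PyEnsembl.
        """
        return {
            transcript_id: genome.transcript_by_id(transcript_id)
            for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
        }


    # Example (requires installed Ensembl data):
    #   from pyensembl import EnsemblRelease
    #   transcripts = transcripts_of_gene_name(EnsemblRelease(77), 'HLA-A')
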

Exons
-----

.. raw:: html

<dl>

.. raw:: html

<dt>

exon_ids(contig=None, strand=None)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of exon IDs in the annotation database, optionally
restricted by the given chromosome and strand.

.. raw:: html

</dd>

.. raw:: html

<dt>

exon_ids_of_gene_id(gene_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of exon IDs associated with a given gene ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

exon_ids_of_gene_name(gene_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of exon IDs associated with a given gene name.

.. raw:: html

</dd>

.. raw:: html

<dt>

exon_ids_of_transcript_id(transcript_id)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of exon IDs associated with a given transcript ID.

.. raw:: html

</dd>

.. raw:: html

<dt>

exon_ids_of_transcript_name(transcript_name)

.. raw:: html

</dt>

.. raw:: html

<dd>

Returns a list of exon IDs associated with a given transcript name.

.. raw:: html

</dd>

.. raw:: html

</dl>
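
As one possible use of the exon lookups above, the following sketch finds
the exon IDs that appear in every transcript of a gene.
``exon_ids_shared_by_all_transcripts`` is a hypothetical helper; it uses
only ``transcript_ids_of_gene_name`` and ``exon_ids_of_transcript_id``.

.. code:: python

    def exon_ids_shared_by_all_transcripts(genome, gene_name):
        """Return the set of exon IDs common to all transcripts of a gene.

        Hypothetical helper, not part of PyEnsembl. Returns an empty set
        when the gene name has no transcripts.
        """
        exon_sets = [
            set(genome.exon_ids_of_transcript_id(transcript_id))
            for transcript_id in genome.transcript_ids_of_gene_name(gene_name)
        ]
        if not exon_sets:
            return set()
        return set.intersection(*exon_sets)


    # Example (requires installed Ensembl data):
    #   from pyensembl import EnsemblRelease
    #   shared = exon_ids_shared_by_all_transcripts(EnsemblRelease(77), 'HLA-A')
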
