Python interface to Ensembl reference genome metadata
PyEnsembl
=========
PyEnsembl is a Python interface to `Ensembl <http://www.ensembl.org>`__
reference genome metadata such as exons and transcripts. PyEnsembl
downloads `GTF <https://en.wikipedia.org/wiki/Gene_transfer_format>`__
and `FASTA <https://en.wikipedia.org/wiki/FASTA_format>`__ files from
the `Ensembl FTP server <ftp://ftp.ensembl.org>`__ and loads them into a
local database. PyEnsembl can also work with custom reference data
specified using user-supplied GTF and FASTA files.
Example Usage
=============
.. code:: python

    from pyensembl import EnsemblRelease

    # release 77 uses human reference genome GRCh38
    data = EnsemblRelease(77)

    # will return ['HLA-A']
    gene_names = data.gene_names_at_locus(contig=6, position=29945884)

    # get all exons associated with HLA-A
    exon_ids = data.exon_ids_of_gene_name('HLA-A')
Installation
============
You can install PyEnsembl using
`pip <https://pip.pypa.io/en/latest/quickstart.html>`__:
.. code:: sh

    pip install pyensembl
This should also install any required packages, such as
`datacache <https://github.com/openvax/datacache>`__ and
`BioPython <http://biopython.org/>`__.
Before using PyEnsembl, run the following command to download and
install Ensembl data:
::

    pyensembl install --release <list of Ensembl release numbers> --species <species-name>
For example, ``pyensembl install --release 75 76 --species human`` will
download and install all human reference data from Ensembl releases 75
and 76.
Alternatively, you can create the ``EnsemblRelease`` object from inside
a Python process and call ``ensembl_object.download()`` followed by
``ensembl_object.index()``.
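The in-process alternative can be sketched as a small helper. This is a minimal sketch, not part of PyEnsembl: ``ensure_installed`` is a hypothetical name, and it relies only on the ``download()`` and ``index()`` methods mentioned above. It accepts any object exposing those two methods, so it can be tried without network access.

```python
def ensure_installed(genome):
    """Download reference files for `genome`, then build its local index.

    `genome` is assumed to expose the download() and index() methods
    described above (for example an EnsemblRelease instance). The helper
    name is hypothetical and not part of PyEnsembl.
    """
    genome.download()  # fetch the GTF and FASTA files
    genome.index()     # parse the files and build the local database
    return genome
```

With PyEnsembl installed, this would be called as ``ensure_installed(EnsemblRelease(77))``.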
Cache Location
--------------
By default, PyEnsembl caches downloaded files in a ``pyensembl``
sub-directory of the platform-specific ``Cache`` folder. You can override
this default by setting the environment variable ``PYENSEMBL_CACHE_DIR``
to your preferred cache location:
.. code:: sh

    export PYENSEMBL_CACHE_DIR=/custom/cache/dir

or

.. code:: python

    import os

    os.environ['PYENSEMBL_CACHE_DIR'] = '/custom/cache/dir'

    # ... PyEnsembl API usage
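When the override is only needed for part of a program (a test run, say), the variable can be set and restored around that code. The following is a minimal sketch using only the standard library; ``temporary_cache_dir`` is a hypothetical helper, not a PyEnsembl function. Whether PyEnsembl re-reads the variable after it has already been used is not stated here, so it is safest to enter the block before any PyEnsembl call.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_cache_dir(path):
    """Temporarily set PYENSEMBL_CACHE_DIR, restoring the old value on exit."""
    key = 'PYENSEMBL_CACHE_DIR'
    previous = os.environ.get(key)  # None if the variable was not set
    os.environ[key] = path
    try:
        yield path
    finally:
        # Put the environment back exactly as it was found.
        if previous is None:
            os.environ.pop(key, None)
        else:
            os.environ[key] = previous
```

Usage would look like ``with temporary_cache_dir('/custom/cache/dir'): ...``, with the PyEnsembl calls placed inside the ``with`` block.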
Non-Ensembl Data
================
PyEnsembl also supports arbitrary genomes, specified by local file paths
or remote URLs to GTF and FASTA files from Ensembl or from other sources.
(Warning: GTF formats can vary, and handling of non-Ensembl data
is still very much in development.)
For example:
.. code:: python

    from pyensembl import Genome

    data = Genome(
        reference_name='GRCh38',
        annotation_name='my_genome_features',
        gtf_path_or_url='/My/local/gtf/path_to_my_genome_features.gtf')

    # parse GTF and construct database of genomic features
    data.index()

    gene_names = data.gene_names_at_locus(contig=6, position=29945884)
API
===
The ``EnsemblRelease`` object has methods to let you access all possible
combinations of the annotation features *gene_name*, *gene_id*,
*transcript_name*, *transcript_id*, *exon_id* as well as the location of
these genomic elements (contig, start position, end position, strand).
Genes
-----
.. raw:: html
<dl>
.. raw:: html
<dt>
genes(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of Gene objects, optionally restricted to a particular
contig or strand.
.. raw:: html
</dd>
.. raw:: html
<dt>
genes_at_locus(contig, position, end=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of Gene objects overlapping a particular position on a
contig. Optionally extend the query into a range with the ``end``
parameter, and restrict it to the forward or backward strand by passing
``strand='+'`` or ``strand='-'``.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_by_id(gene_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Return a Gene object for given Ensembl gene ID (e.g. “ENSG00000068793”).
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_names(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns all gene names in the annotation database, optionally restricted
to a particular contig or strand.
.. raw:: html
</dd>
.. raw:: html
<dt>
genes_by_name(gene_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Gets all the unique genes with the given name (there might be multiple
due to copies in the genome) and returns a list containing a Gene object
for each distinct ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_by_protein_id(protein_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Finds the Gene associated with the given Ensembl protein ID (e.g.
“ENSP00000350283”).
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_names_at_locus(contig, position, end=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns the names of genes overlapping the given locus, optionally
restricted by strand. The result is a list because genes can overlap.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_name_of_gene_id(gene_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns name of gene with given gene ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_name_of_transcript_id(transcript_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns name of gene associated with given transcript ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_name_of_transcript_name(transcript_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns name of gene associated with given transcript name.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_name_of_exon_id(exon_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns name of gene associated with given exon ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_ids(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Return all gene IDs in the annotation database, optionally restricted by
chromosome name or strand.
.. raw:: html
</dd>
.. raw:: html
<dt>
gene_ids_of_gene_name(gene_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns all Ensembl gene IDs with the given name.
.. raw:: html
</dd>
.. raw:: html
</dl>
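The gene lookups above compose naturally. As a minimal sketch, the hypothetical helper below (``gene_ids_at_locus`` is not part of PyEnsembl) combines ``gene_names_at_locus`` and ``gene_ids_of_gene_name`` to map every gene name at a locus to its gene IDs. It takes the genome object as an argument, so any object with those two methods will do.

```python
def gene_ids_at_locus(genome, contig, position):
    """Map each gene name overlapping a locus to its list of gene IDs.

    `genome` is assumed to provide gene_names_at_locus() and
    gene_ids_of_gene_name() as listed above. This helper is hypothetical.
    """
    names = genome.gene_names_at_locus(contig=contig, position=position)
    # One name can correspond to several IDs, so keep a list per name.
    return {name: list(genome.gene_ids_of_gene_name(name)) for name in names}
```

With installed data, ``gene_ids_at_locus(EnsemblRelease(77), 6, 29945884)`` would be keyed by the gene names found at that position.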
Transcripts
-----------
.. raw:: html
<dl>
.. raw:: html
<dt>
transcripts(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of Transcript objects for all transcript entries in the
Ensembl database, optionally restricted to a particular contig or
strand.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_by_id(transcript_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Constructs a Transcript object for given Ensembl transcript ID (e.g.
“ENST00000369985”).
.. raw:: html
</dd>
.. raw:: html
<dt>
transcripts_by_name(transcript_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of Transcript objects for every transcript matching the
given name.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_names(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns all transcript names in the annotation database.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_ids(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns all transcript IDs in the annotation database.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_ids_of_gene_id(gene_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Return IDs of all transcripts associated with given gene ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_ids_of_gene_name(gene_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Return IDs of all transcripts associated with given gene name.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_ids_of_transcript_name(transcript_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Find all Ensembl transcript IDs with the given name.
.. raw:: html
</dd>
.. raw:: html
<dt>
transcript_ids_of_exon_id(exon_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Return IDs of all transcripts associated with given exon ID.
.. raw:: html
</dd>
.. raw:: html
</dl>
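Because the transcript and gene lookups are inverses of one another, they can be used to sanity-check an annotation database. The sketch below is illustrative only: ``mismatched_transcripts`` is a hypothetical helper that uses ``transcript_ids_of_gene_name`` and ``gene_name_of_transcript_id`` from the lists above and reports any transcript whose gene name does not round-trip.

```python
def mismatched_transcripts(genome, gene_name):
    """Return transcript IDs of `gene_name` that map back to another name.

    `genome` is assumed to provide transcript_ids_of_gene_name() and
    gene_name_of_transcript_id(). An empty list means every transcript
    round-trips to the same gene name. This helper is hypothetical.
    """
    mismatched = []
    for transcript_id in genome.transcript_ids_of_gene_name(gene_name):
        if genome.gene_name_of_transcript_id(transcript_id) != gene_name:
            mismatched.append(transcript_id)
    return mismatched
```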
Exons
-----
.. raw:: html
<dl>
.. raw:: html
<dt>
exon_ids(contig=None, strand=None)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of exon IDs in the annotation database, optionally
restricted by the given chromosome and strand.
.. raw:: html
</dd>
.. raw:: html
<dt>
exon_ids_of_gene_id(gene_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of exon IDs associated with a given gene ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
exon_ids_of_gene_name(gene_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of exon IDs associated with a given gene name.
.. raw:: html
</dd>
.. raw:: html
<dt>
exon_ids_of_transcript_id(transcript_id)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of exon IDs associated with a given transcript ID.
.. raw:: html
</dd>
.. raw:: html
<dt>
exon_ids_of_transcript_name(transcript_name)
.. raw:: html
</dt>
.. raw:: html
<dd>
Returns a list of exon IDs associated with a given transcript name.
.. raw:: html
</dd>
.. raw:: html
</dl>
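A common use of the exon methods is to break a gene down into per-transcript exon lists. The following is a minimal sketch under the assumption that the genome object provides ``transcript_ids_of_gene_name`` and ``exon_ids_of_transcript_id`` as listed above; ``exon_ids_by_transcript`` itself is a hypothetical helper, not part of PyEnsembl.

```python
def exon_ids_by_transcript(genome, gene_name):
    """Map each transcript ID of `gene_name` to that transcript's exon IDs.

    Only methods listed in this README are called on `genome`.
    This helper is hypothetical.
    """
    result = {}
    for transcript_id in genome.transcript_ids_of_gene_name(gene_name):
        result[transcript_id] = list(
            genome.exon_ids_of_transcript_id(transcript_id))
    return result
```

Using the object from the example at the top, ``exon_ids_by_transcript(data, 'HLA-A')`` would group the HLA-A exon IDs by transcript.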