pybio genomics

pybio: basic genomics toolset

pybio is a Python framework for handling genomics operations and a direct interface to Ensembl genomes.

Downloading an Ensembl genome+annotation takes one line:

$ pybio genome homo_sapiens

Features include genome and annotation download from Ensembl with optional STAR and salmon index building, support for FASTA, FASTQ, bedGraph and other file formats, motif sequence searches, and alternative polyadenylation (APA) specific features such as polyadenylation site-pair classification (same-exon, skipped-exon, composite-exon).

Installation

The easiest way to install pybio is running:

$ pip install pybio

Note that on some systems pip installs the executable scripts under ~/.local/bin. If this folder is not in your PATH, running pybio on the command line will fail with "command not found". To fix this, run export PATH="$PATH:$HOME/.local/bin" (and add the same line to your .profile). Alternatively, install pybio inside a virtual environment (for example with virtualenv).
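If you are unsure whether the scripts folder is on your PATH, a small Python check can tell you. This is only a sketch: dir_on_path is a hypothetical helper written for this example and is not part of pybio; shutil.which is the standard library function that looks up an executable on the PATH.

```python
import os
import shutil


def dir_on_path(directory, path_value):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    target = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return target in entries


# Location of the pybio executable, or None if it cannot be found on the PATH
print("pybio executable:", shutil.which("pybio"))

# Whether ~/.local/bin is listed in the current PATH
print("~/.local/bin on PATH:", dir_on_path("~/.local/bin", os.environ.get("PATH", "")))
```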

If you would instead like to install the latest development version from this repository:

# clone the pybio repository locally and enter it
git clone https://github.com/grexor/pybio.git
cd pybio

# build and install
./build.sh

Quick Start

pybio is strongly integrated with Ensembl and provides genomic loci search for diverse annotated features (genes -> transcripts -> exons + 5UTR + 3UTR).

Let's say we are interested in the human genome. First download and prepare the genome with a single command:

pybio genome homo_sapiens

Searching a genomic position for features is easy in Python, for example:

import pybio
genes, transcripts, exons, UTR5, UTR3 = pybio.genomes.annotate("homo_sapiens", "1", "+", 11012344)

This returns five lists of feature objects (genes, transcripts, exons, 5'-UTRs and 3'-UTRs) found at the position; see the classes in pybio/core/genomes.py for details of these objects.

If you would like to know all genes that span the provided position, you could then write:

for gene in genes:
    print(gene.gene_id, gene.gene_name, gene.start, gene.stop)

And to list all transcripts of each gene, you could extend the code like this:

for gene in genes:
    print(gene.gene_id, gene.gene_name, gene.start, gene.stop)
    for transcript in gene.transcripts:
        print(transcript.transcript_id)

You could also start directly with the transcripts and print the gene each transcript is assigned to:

for transcript in transcripts:
    print(transcript.gene.gene_id, transcript.transcript_id)

The relationships between feature objects can be drawn as a graph:

gene <-> transcript_1 <-> exon_1
                      <-> exon_2
                      ...
                      <-> utr5
                      <-> utr3
     <-> transcript_2 <-> exon_1
                      <-> exon_2
                      ...
                      <-> utr5
                      <-> utr3

The same relationships, expressed as object attributes:

                gene = Gene instance object
    gene.transcripts = list of all transcript objects of the gene
          transcript = Transcript instance object
     transcript.gene = points to the gene of the transcript
    transcript.exons = list of all exon objects of the transcript
transcript.utr5/utr3 = points to the UTR5 / UTR3 of the transcript
                exon = Exon instance object
     exon.transcript = points to the transcript of the exon
           utr5/utr3 = Utr5 / Utr3 instance object
utr5/utr3.transcript = points to the transcript of the UTR5/UTR3
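To make the two-way links concrete, here is a minimal toy model of the same structure. It is an illustration only: the classes below are simplified stand-ins written for this example (the real classes live in pybio/core/genomes.py and carry more attributes), and the identifiers G1 and T1 are made up.

```python
class Gene:
    def __init__(self, gene_id):
        self.gene_id = gene_id
        self.transcripts = []  # gene.transcripts: all transcripts of the gene


class Transcript:
    def __init__(self, transcript_id, gene):
        self.transcript_id = transcript_id
        self.gene = gene       # transcript.gene: back-pointer to the gene
        self.exons = []        # transcript.exons: all exons of the transcript
        self.utr5 = None       # left empty in this toy model
        self.utr3 = None
        gene.transcripts.append(self)


class Exon:
    def __init__(self, start, stop, transcript):
        self.start = start
        self.stop = stop
        self.transcript = transcript  # exon.transcript: back-pointer
        transcript.exons.append(self)


# Build a tiny graph: one gene, one transcript, two exons (made-up values)
gene = Gene("G1")
transcript = Transcript("T1", gene)
exon_1 = Exon(100, 200, transcript)
exon_2 = Exon(300, 400, transcript)

# Walk down from the gene ...
for t in gene.transcripts:
    for e in t.exons:
        print(gene.gene_id, t.transcript_id, e.start, e.stop)

# ... or up from an exon
print(exon_2.transcript.gene.gene_id)
```

Because every object keeps a reference to its parent, you can start from whichever feature list is most convenient and reach the rest of the graph from there.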

Documentation

Here we provide basic pybio usage examples.

Downloading Ensembl genomes

To download Ensembl genomes simply run a few commands on the command line. For example:

$ pybio genome homo_sapiens      # downloads the latest version of Ensembl homo_sapiens assembly and annotation
$ pybio genome homo_sapiens 109  # downloads a specific version (in this case, v109) of Ensembl homo_sapiens assembly and annotation
$ pybio genome elephant          # will list all available elephant genomes and make you choose which one to download

The above will download the FASTA sequence and GTF annotation. If you have STAR and salmon installed on your system, pybio will also build an index of the genome for both.

Data will be stored in the folder specified in the file pybio.config. The genomes folder structure is as follows:

homo_sapiens.assembly.ensembl109             # FASTA files of the genome, each chromosome in a separate file
homo_sapiens.annotation.ensembl109           # Annotation in GTF and TAB format
homo_sapiens.assembly.ensembl109.star        # STAR index, GTF annotation aware
homo_sapiens.transcripts.ensembl109          # transcriptome, this is the Ensembl "cDNA" file in FASTA format
homo_sapiens.transcripts.ensembl109.salmon   # Salmon index of the transcriptome
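If you need these folder names in a script, the naming pattern from the listing above can be reproduced with a few lines of Python. This is a sketch based only on that listing: genome_folders is a hypothetical helper, not a pybio function, and it assumes the "ensembl" + version suffix shown for the main Ensembl release.

```python
def genome_folders(species, version):
    """Build the folder names shown in the listing for a species and Ensembl version."""
    tag = f"ensembl{version}"
    return {
        "assembly": f"{species}.assembly.{tag}",
        "annotation": f"{species}.annotation.{tag}",
        "star_index": f"{species}.assembly.{tag}.star",
        "transcripts": f"{species}.transcripts.{tag}",
        "salmon_index": f"{species}.transcripts.{tag}.salmon",
    }


for name, folder in genome_folders("homo_sapiens", 109).items():
    print(f"{name:>12}: {folder}")
```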

pybio also supports downloads from Ensembl Genomes (Ensembl Fungi, Ensembl Plants, Ensembl Protists, Ensembl Metazoa). You simply provide the name of the species on the command line and pybio downloads the assembly and annotation and prepares the STAR and salmon indices.

For example, to download the latest version of the Dictyostelium discoideum genome, you would write:

$ pybio genome dicty                      # searches for genomes with "dicty" in the species name or description
$ pybio genome dictyostelium_discoideum   # providing the exact species name also works

Another example is downloading the latest Arabidopsis thaliana genome:

$ pybio genome arabidopsis_thaliana

To see all available species, simply run pybio species. Moreover, to see all available arabidopsis genomes, you could run:

$ pybio species arabidopsis
arabidopsis_halleri	Ahal2.2	ensemblgenomes	plants	ensemblgenomes56
arabidopsis_lyrata	v.1.0	ensemblgenomes	plants	ensemblgenomes56
arabidopsis_thaliana	TAIR10	ensemblgenomes	plants	ensemblgenomes56

Voila.

Retrieving genomic sequences

To retrieve stretches of genomic sequence, we use the seq(genome, chr, strand, position, upstream, downstream) method:

import pybio
seq = pybio.core.genomes.seq("homo_sapiens", "1", "+", 450000, -20, 20)

The above command fetches the chromosome 1 sequence from position 450000-20 to 450000+20, both ends included. The resulting sequence has length 41: TACCCTGATTCTGAAACGAAAAAGCTTTACAAAATCCAAGA.
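The length of 41 follows from both ends of the window being included (20 + 1 + 20). The sketch below reproduces just that arithmetic on an ordinary Python string. It is not pybio's implementation: window is a hypothetical helper, the toy sequence is made up, and strand handling is left out entirely.

```python
def window(sequence, position, upstream, downstream):
    """Return sequence[position+upstream .. position+downstream], both ends included.

    Coordinates are 0-based; `upstream` is typically negative, as in the
    seq(...) call above.
    """
    start = max(position + upstream, 0)  # clamp at the start of the sequence
    stop = position + downstream
    return sequence[start:stop + 1]      # +1 because Python slices exclude the end


toy_chromosome = "ACGT" * 25  # 100 bases of made-up sequence
fragment = window(toy_chromosome, 50, -20, 20)
print(len(fragment))
```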

Annotating genomic positions

Given a genomic position, we can quickly retrieve the gene, transcript, exon and utr5/3 information at the given position. If there are several features (genes, transcripts, exons, UTR regions) at the specified position, they are all reported by pybio.

import pybio

# annotate position
genes, transcripts, exons, utr5, utr3 = pybio.genomes.annotate("homo_sapiens", "1", "+", 11012344)

# print all genes that cover the position
for gene in genes:
    print(gene.gene_id, gene.gene_name, gene.start, gene.stop)

The above code would print:

[pybio] loading genome annotation for homo_sapiens with Ensembl version 109
ENSG00000120948 TARDBP 11012343 11030527

We can also easily access all transcripts of each gene with gene.transcripts and all exons of each transcript with transcript.exons.
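Conceptually, annotating a position means finding every feature whose interval contains it. The sketch below shows that idea with a plain linear scan. It is an assumption-laden illustration, not pybio's actual lookup: features_at is a hypothetical helper, the features are plain dictionaries, chromosome and strand are ignored, and the second gene is made up. The TARDBP coordinates are the ones printed above.

```python
def features_at(features, position):
    """Return all features whose [start, stop] interval (inclusive) contains position."""
    return [f for f in features if f["start"] <= position <= f["stop"]]


toy_genes = [
    {"gene_name": "TARDBP", "start": 11012343, "stop": 11030527},
    {"gene_name": "MADE_UP_GENE", "start": 100, "stop": 200},  # hypothetical entry
]

hits = features_at(toy_genes, 11012344)
print([gene["gene_name"] for gene in hits])
```

Because several features can overlap the same position, the result is always a list, which mirrors why pybio reports lists of genes, transcripts, exons and UTRs.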

Dependencies

Basic dependencies include pysam, numpy and samtools; they should be installed automatically by pip when you run pip install pybio.

Optional dependencies include STAR and salmon if you would like to build genome/transcriptome indices and align reads.

File formats

Supported file formats. Work in progress.

Genomic Coordinates

All genomic coordinates inside pybio are 0-based and inclusive on both ends. For example, the interval 100-103 covers the coordinates 100, 101, 102 and 103. The first coordinate is 0.

Important

RefSeq and Ensembl GTF files are 1-indexed. When we read files from RefSeq/Ensembl, we subtract 1 from all coordinates to keep them in line with the other coordinate structures inside pybio (which are all 0-indexed).
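As a quick worked example of this convention, the sketch below converts a 1-based, inclusive GTF interval to the 0-based, inclusive form and computes an interval length. Both functions are hypothetical helpers written for this example; they are not part of the pybio API.

```python
def gtf_to_zero_based(start, end):
    """Convert a 1-based inclusive GTF interval to a 0-based inclusive one."""
    return start - 1, end - 1


def interval_length(start, stop):
    """Number of positions in an interval that includes both ends."""
    return stop - start + 1


# A GTF feature spanning 101..104 becomes 100..103 and covers 4 positions
start, stop = gtf_to_zero_based(101, 104)
print(start, stop, interval_length(start, stop))
```

Note that the length is stop - start + 1, not stop - start, because the right end is included.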

Authors

pybio is developed and supported by Gregor Rot.

Issues and Suggestions

Use the issues page to report issues and leave suggestions.
