bioinfo_tools

Python library that parses GFF and FASTA files into Python classes

These details have not been verified by PyPI

Project links

Homepage

Environment
- Web Environment
Intended Audience
- Developers
License
- OSI Approved :: BSD License
Operating System
- OS Independent
Programming Language
- Python
Topic
- Scientific/Engineering :: Bio-Informatics

Project description

# bioinfo_tools 0.3.0

## Installation

```bash
pip install bioinfo_tools
```

## Parsers

*HEADS UP!* These parsers are still under development and usage is not consistent from one parser to another.

### Fasta parser

```python
from bioinfo_tools.parsers.fasta import FastaParser

fasta_parser = FastaParser()

# by default, the header is split at the first '|' or ':' found to get the sequence ID
for seqid, sequence in fasta_parser.read("/path/to/file.fasta"):
    print(seqid, sequence)

# you may specify another separator for your sequence ID (e.g. white space):
for seqid, sequence in fasta_parser.read("/path/to/file.fasta", id_separator=" "):
    print(seqid, sequence)
```
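
To make the separator rule concrete, here is a minimal standalone sketch of how a FASTA header could be reduced to a sequence ID under the behaviour described above. The helper `split_seqid` is hypothetical and is not part of `bioinfo_tools`; the library's actual implementation may differ (for example in how it handles surrounding whitespace).

```python
import re


def split_seqid(header, id_separator=None):
    """Hypothetical helper: return the sequence ID part of a FASTA header.

    Illustrates the rule described above; this is NOT bioinfo_tools' code.
    """
    header = header.lstrip(">").strip()
    if id_separator is None:
        # assumed default rule: cut at the first '|' or ':' found
        return re.split(r"[|:]", header, maxsplit=1)[0]
    # explicit separator: cut at its first occurrence
    return header.split(id_separator, 1)[0]


print(split_seqid(">chr1|some description"))
print(split_seqid(">chr1 some description", id_separator=" "))
```

Both calls print `chr1`. A header containing no separator is returned unchanged.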

### GFF parser

```python
from bioinfo_tools.parsers.gff import Gff3

gff_parser = Gff3()
with open("/path/to/file.gff", "r") as fh:
    for gene in gff_parser.read(fh):
        print(gene)

# gzip-compressed files are read the same way, through a gzip file handle
import gzip
with gzip.open("/path/to/file.gz", "rb") as fh:
    for gene in gff_parser.read(fh):
        print(gene)
```

### OBO parser

```python
from bioinfo_tools.parsers.obo import OboParser

obo_parser = OboParser()
with open("/path/to/file.obo") as fh:
    go_terms = obo_parser.read(fh)

for go_term in go_terms.values():
    print(go_term)

# you may also get the GO term parents via the parser
parents = obo_parser.get_parents(go_term)
```

## Usage Examples

### Extract all intron sequences by parsing GFF and FASTA files

In this example, we focus on a genome assembly. We first load a GFF file containing the gene annotations for this
assembly, then load a FASTA file containing the nucleic sequence of each chromosome in the genome.
Finally, we collect all transcript introns and extract their nucleic sequences.

**__DISCLAIMER__**: for this example to work, your GFF file must expose at least the following feature types in column #3:
- `gene`
- one of `transcript`, `mRNA` or `RNA` (or their lowercase versions)
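
Before running the example, it can help to check that a GFF file meets this requirement. The following is a minimal sketch that uses only the standard library; `feature_types` and `has_required_types` are hypothetical helpers written for this illustration, not functions provided by `bioinfo_tools`.

```python
import gzip


def feature_types(gff_path):
    """Return the set of feature types found in column #3 of a GFF file."""
    opener = gzip.open if gff_path.endswith(".gz") else open
    types = set()
    with opener(gff_path, "rt") as fh:
        for line in fh:
            if line.startswith("##FASTA"):
                break  # embedded sequences follow, no more feature lines
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            columns = line.rstrip("\n").split("\t")
            if len(columns) >= 3:
                types.add(columns[2])
    return types


def has_required_types(types):
    """Check for 'gene' plus one accepted transcript-level feature type."""
    transcript_types = {"transcript", "mRNA", "RNA", "mrna", "rna"}
    return "gene" in types and bool(types & transcript_types)


# usage: has_required_types(feature_types("/path/to/gene_models.gff"))
print(has_required_types({"gene", "mRNA", "exon"}))
```

The last line prints `True`; a set such as `{"gene", "exon"}` would give `False` because no transcript-level type is present.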

```python
from bioinfo_tools.genomic_features.chromosome import Chromosome
from bioinfo_tools.parsers.gff import Gff3
from bioinfo_tools.parsers.fasta import FastaParser

chromosomes = dict() # {<chromosome_id>: <bioinfo_tools.genomic_features.Chromosome>}

# start with parsing a GFF file
gff_parser = Gff3()
with open("/path/to/gene_models.gff", "r") as fh:
    for gene in gff_parser.read(fh):
        chromosome = gene['seqid']

        if chromosome not in chromosomes:
            chromosomes[chromosome] = Chromosome(chromosome) # init a new Chromosome object

        chromosomes[chromosome].add_gene(gene) # add the current gene to our Chromosome object

# load our chromosome sequences in memory
fasta_parser = FastaParser()
for chromosome, nucleic_sequence in fasta_parser.read("/path/to/genome_chromosomes.fasta"):
    if chromosome not in chromosomes:
        chromosomes[chromosome] = Chromosome(chromosome)
    # attach parsed chromosome sequence to our Chromosome object
    chromosomes[chromosome].attach_nucleic_sequence(nucleic_sequence)

# now, collect introns and extract their nucleic sequence
introns_sequences = dict() # {<intron_id>: <intron_sequence>}
for chromosome in chromosomes.values():
    for gene in chromosome.genes:
        for transcript in gene.transcripts:
            for idx, intron in enumerate(transcript.introns):
                intron_id = "%s_intron_%s" % (transcript.transcript_id, idx)
                intron_seq = intron.extract(chromosome.nucleic_sequence) # that we attached above
                introns_sequences[intron_id] = intron_seq

# from here, you can do what you want with the intron sequences (eg. write them to a fasta file, etc)
# ...
```
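
As one possible follow-up, the `introns_sequences` dictionary can be written to a FASTA file with plain Python. This is a minimal sketch: `format_fasta` and `write_fasta` are hypothetical helpers, not part of `bioinfo_tools`, and it assumes each extracted intron sequence is a string (or converts cleanly with `str()`). The toy dictionary stands in for real data.

```python
def format_fasta(sequences, line_width=60):
    """Render a {sequence_id: sequence} dict as FASTA-formatted text."""
    lines = []
    for seq_id, sequence in sequences.items():
        sequence = str(sequence)  # assumption: sequences convert to plain strings
        lines.append(">%s" % seq_id)
        # wrap the sequence so that no line exceeds line_width characters
        for start in range(0, len(sequence), line_width):
            lines.append(sequence[start:start + line_width])
    return "".join(line + "\n" for line in lines)


def write_fasta(sequences, path, line_width=60):
    """Write a {sequence_id: sequence} dict to a FASTA file."""
    with open(path, "w") as fh:
        fh.write(format_fasta(sequences, line_width))


# toy data standing in for the introns_sequences dict built above
toy_introns = {"tx1_intron_0": "GTAAGTCCTTAG", "tx1_intron_1": "GTGAGTAAAG"}
print(format_fasta(toy_introns, line_width=8), end="")
```

With real data, `write_fasta(introns_sequences, "/path/to/introns.fasta")` would produce one record per intron.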

__Note:__ at the transcript level, you can retrieve the features of a given type, as named in your GFF file, like so:
```python
for feature in transcript._get_features("exon"):
    print(feature) # I'm an exon
```
For convenience and clarity, the following properties are available on transcript objects:
```python
print(transcript.introns) # will call transcript._get_features('intron') behind the scenes
print(transcript.exons) # will call transcript._get_features('exon') behind the scenes
print(transcript.cds) # will call transcript._get_features('cds') behind the scenes
print(transcript.polypeptide) # will call transcript._get_features('polypeptide') behind the scenes
print(transcript.five_prime_utr) # will call transcript._get_features('five_prime_utr') behind the scenes
print(transcript.three_prime_utr) # will call transcript._get_features('three_prime_utr') behind the scenes
```

Release history

- 0.3.1: Apr 3, 2018
- 0.3.0: Feb 22, 2018 (this version)
- 0.2.7: Feb 2, 2018
- 0.2.6.1: Dec 7, 2017
- 0.2.6: Dec 7, 2017
- 0.2.5: Dec 7, 2017
- 0.2.4: Nov 22, 2017
- 0.2.3: Nov 21, 2017
- 0.2.2: Nov 20, 2017
- 0.2.1: Nov 15, 2017
- 0.2: Nov 14, 2017
- 0.1.13: Oct 26, 2017
- 0.1.12: Oct 26, 2017
- 0.1.11: Oct 14, 2017
- 0.1.10: Oct 14, 2017
- 0.1.9.1: Oct 14, 2017
- 0.1.9: Oct 14, 2017
- 0.1.8: Oct 14, 2017
- 0.1.7: Sep 21, 2017
- 0.1.6: Sep 21, 2017
- 0.1.5: Sep 20, 2017
- 0.1.4: Sep 18, 2017
- 0.1.3: Sep 18, 2017
- 0.1.2: Sep 4, 2017
- 0.1.1: Sep 4, 2017
- 0.1.0: Aug 8, 2017
- 0.0.9: Jul 10, 2017
- 0.0.8: May 23, 2017
- 0.0.4: Apr 17, 2017
- 0.0.3: Apr 7, 2017
- 0.0.2: Mar 30, 2017
- 0.0.1: Mar 30, 2017

Download files

Source Distribution

bioinfo_tools-0.3.0.tar.gz (12.3 kB)

Uploaded Feb 22, 2018 Source

File details

Details for the file bioinfo_tools-0.3.0.tar.gz.

File metadata

Download URL: bioinfo_tools-0.3.0.tar.gz
Upload date: Feb 22, 2018
Size: 12.3 kB
Tags: Source
Uploaded using Trusted Publishing? No

File hashes

Hashes for bioinfo_tools-0.3.0.tar.gz
Algorithm	Hash digest
SHA256	`f01bcd2b18258cf603c26990cc8e775f2e82b7e2b8c223e1cd4e97633d6cf464`
MD5	`30eadc7b125ec020fa839fbae2ea8f87`
BLAKE2b-256	`37a998e108946c7ce861f5adb7c65aa8fa9343f4dc372a77422bd8d69d7c0960`
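
A downloaded archive can be checked against the SHA256 digest listed above. The sketch below uses only the standard library's `hashlib`; the helper name `sha256_of` and the local file path in the usage comment are assumptions made for illustration.

```python
import hashlib

# SHA256 digest published above for bioinfo_tools-0.3.0.tar.gz
EXPECTED_SHA256 = "f01bcd2b18258cf603c26990cc8e775f2e82b7e2b8c223e1cd4e97633d6cf464"


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# usage (the path is hypothetical, adjust it to where the archive was saved):
# if sha256_of("bioinfo_tools-0.3.0.tar.gz") != EXPECTED_SHA256:
#     raise SystemExit("checksum mismatch: do not install this archive")
```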