Python library that parses GFF and FASTA files into Python classes

# bioinfo_tools 0.3.1

## Installation

    pip install bioinfo_tools

## Parsers

*HEADS UP!* These parsers are still under development and usage is not consistent from one parser to another.

### Fasta parser

    from bioinfo_tools.parsers.fasta import FastaParser

    fasta_parser = FastaParser()

    # by default, the sequence ID is delimited by the first '|' or ':' found
    for seqid, sequence in fasta_parser.…("/path/to/file.fasta"):
        print(seqid, sequence)

    # you may specify a custom separator for your sequence ID (e.g. whitespace):
    for seqid, sequence in fasta_parser.…("/path/to/file.fasta", id_separator=" "):
        print(seqid, sequence)
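
To illustrate the separator behaviour described in the comments above, here is a minimal standalone sketch. It is not taken from bioinfo_tools: the function name `split_seqid` and its exact behaviour are assumptions made for illustration only.

```python
import re


def split_seqid(header, id_separator=None):
    """Return the ID portion of a FASTA header line (hypothetical helper)."""
    header = header.lstrip(">").strip()
    if id_separator is None:
        # assumed default: cut at the first '|' or ':' found
        return re.split(r"[|:]", header, maxsplit=1)[0]
    # otherwise cut at the first occurrence of the given separator
    return header.split(id_separator, 1)[0]


print(split_seqid(">chr1|some description"))                    # chr1
print(split_seqid(">chr1 some description", id_separator=" "))  # chr1
```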

### GFF parser

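    from bioinfo_tools.parsers.gff import Gff3

    gff_parser = Gff3()
    with open("/path/to/file.gff", "r") as fh:
        for gene in gff_parser.…(fh):
            ...

    # gzip-compressed files can be opened with the standard gzip module
    import gzip
    with gzip.open("/path/to/file.gz", "rb") as fh:
        for gene in gff_parser.…(fh):
            ...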
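
For readers unfamiliar with the format, a GFF3 feature line is tab-separated with nine columns, the last of which holds `key=value` attributes. The sketch below shows that layout on a single line. It is an illustration only, not the library's parser: `parse_gff3_line` is a hypothetical name, and percent-decoding of attribute values is left out.

```python
GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # column 9 holds semicolon-separated key=value pairs
    attributes = {}
    for pair in record["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    record["attributes"] = attributes
    return record


example = "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1;Name=demo"
record = parse_gff3_line(example)
print(record["seqid"], record["type"], record["start"], record["end"])
```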

### OBO parser

    from bioinfo_tools.parsers.obo import OboParser

    obo_parser = OboParser()
    with open("/path/to/file.obo") as fh:
        go_terms = obo_parser.…(fh)

    for go_term in go_terms.values():
        # you may also get the GO term parents via the parser
        parents = obo_parser.get_parents(go_term)
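
As background on what "parents" means here: in an OBO file, each `[Term]` stanza lists its direct parents with `is_a:` tags. The following standalone sketch collects those links and walks them upward. It is not the `OboParser` implementation; the function names and the `EX:` identifiers are made up for illustration.

```python
def parse_obo_terms(lines):
    """Collect {term_id: {"name": ..., "is_a": [...]}} from OBO lines (simplified sketch)."""
    terms = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # a new stanza begins; only [Term] stanzas are kept
            current = {"name": None, "is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "id":
                terms[value] = current
            elif tag == "name":
                current["name"] = value
            elif tag == "is_a":
                # drop the trailing "! human readable name" comment
                current["is_a"].append(value.split("!")[0].strip())
    return terms


def get_ancestors(terms, term_id):
    """Return every term reachable by following is_a links upward."""
    seen = set()
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms.get(parent, {"is_a": []})["is_a"])
    return seen


# hypothetical three-term ontology
sample = """format-version: 1.2

[Term]
id: EX:0000001
name: root term

[Term]
id: EX:0000002
name: middle term
is_a: EX:0000001 ! root term

[Term]
id: EX:0000003
name: leaf term
is_a: EX:0000002 ! middle term
"""

terms = parse_obo_terms(sample.splitlines())
print(sorted(get_ancestors(terms, "EX:0000003")))
```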

## Usage Examples

### Extract all intron sequences by parsing GFF and FASTA files

In this example, we focus on a genome assembly. We first load a GFF file containing the gene annotations for this
assembly, then load a FASTA file containing the nucleic sequence of each chromosome in the genome.
Finally, we collect the introns of every transcript and extract their nucleic sequences.
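
The intron step rests on a simple idea: within one transcript, introns are the gaps between consecutive exons. The sketch below shows that idea on plain coordinate pairs, assuming 1-based inclusive coordinates as used in GFF. The function names are hypothetical, this is not how bioinfo_tools is known to compute introns, and minus-strand handling (reverse complement) is deliberately left out.

```python
def introns_from_exons(exons):
    """Given (start, end) exon pairs (1-based, inclusive), return the gaps between them."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # only keep a gap if the two exons are not adjacent or overlapping
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def extract_region(sequence, start, end):
    """Slice a 1-based inclusive region out of a Python string."""
    return sequence[start - 1:end]


exons = [(1, 4), (9, 12)]
chromosome_sequence = "AAAAGTAGCCCC"
for start, end in introns_from_exons(exons):
    print(start, end, extract_region(chromosome_sequence, start, end))  # 5 8 GTAG
```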

**__DISCLAIMER__**: for this example to work, your GFF file must expose at least the following feature types in column #3:
- `gene`
- one of `transcript`, `mRNA` or `RNA` (or the lowercase version)
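
Before running the example, it can help to check which feature types your GFF file actually contains. The sketch below counts the values found in column 3 and tests the requirement stated above. It uses only the standard library; `count_feature_types` and `has_required_types` are hypothetical helpers, not part of bioinfo_tools.

```python
from collections import Counter

TRANSCRIPT_TYPES = {"transcript", "mrna", "rna"}


def count_feature_types(lines):
    """Count the values found in column 3 of GFF feature lines."""
    counts = Counter()
    for line in lines:
        # skip directives, comments and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def has_required_types(counts):
    """True if 'gene' and at least one transcript-like type are present."""
    has_transcript = any(t.lower() in TRANSCRIPT_TYPES for t in counts)
    return "gene" in counts and has_transcript


sample = [
    "##gff-version 3",
    "chr1\tdemo\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tdemo\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tdemo\texon\t1\t40\t.\t+\t.\tParent=t1",
]
counts = count_feature_types(sample)
print(dict(counts))
print(has_required_types(counts))  # True
```

To check a real file, pass an open file handle instead of the `sample` list.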

    from bioinfo_tools.genomic_features.chromosome import Chromosome
    from bioinfo_tools.parsers.gff import Gff3
    from bioinfo_tools.parsers.fasta import FastaParser

    chromosomes = dict()  # {<chromosome_id>: <bioinfo_tools.genomic_features.Chromosome>}

    # start with parsing a GFF file
    gff_parser = Gff3()
    with open("/path/to/gene_models.gff", "r") as fh:
        for gene in gff_parser.…(fh):
            chromosome = gene['seqid']

            if chromosome not in chromosomes:
                chromosomes[chromosome] = Chromosome(chromosome)  # init a new Chromosome object

            chromosomes[chromosome].add_gene(gene)  # add the current gene to our Chromosome object

    # load our chromosome sequences in memory
    fasta_parser = FastaParser()
    for chromosome, nucleic_sequence in fasta_parser.…("/path/to/genome_chromosomes.fasta"):
        if chromosome not in chromosomes:
            chromosomes[chromosome] = Chromosome(chromosome)
        # attach parsed chromosome sequence to our Chromosome object
        ...

    # now, collect introns and extract their nucleic sequence
    introns_sequences = dict()  # {<intron_id>: <intron_sequence>}
    for chromosome in chromosomes.values():
        for gene in chromosome.genes:
            for transcript in gene.transcripts:
                for idx, intron in enumerate(transcript.introns):
                    intron_id = "%s_intron_%s" % (transcript.transcript_id, idx)
                    intron_seq = intron.extract(chromosome.nucleic_sequence)  # that we attached above
                    introns_sequences[intron_id] = intron_seq

    # from here, you can do what you want with the intron sequences (e.g. write them to a FASTA file, etc.)
    # ...
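
As one possible follow-up, the collected sequences can be written to a FASTA file with the standard library alone. This is a minimal sketch: `write_fasta` is a hypothetical helper, not part of bioinfo_tools, and it assumes the values of `introns_sequences` are plain strings, which the example above does not confirm.

```python
def write_fasta(records, path, line_width=60):
    """Write a {seqid: sequence} dict to a FASTA file, wrapping sequence lines."""
    with open(path, "w") as fh:
        for seqid, sequence in records.items():
            fh.write(">%s\n" % seqid)
            # wrap the sequence at a fixed width
            for i in range(0, len(sequence), line_width):
                fh.write(sequence[i:i + line_width] + "\n")


# usage, with the dict built in the example above:
# write_fasta(introns_sequences, "/path/to/introns.fasta")
```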

__Note:__ at the transcript level, you can retrieve features by the types declared in your GFF file:

    for feature in transcript._get_features("exon"):
        print(feature)  # I'm an exon

For convenience and clarity, the following properties are available on transcript objects:

    print(transcript.introns)  # will call transcript._get_features('intron') behind the scenes
    print(transcript.exons)  # will call transcript._get_features('exon') behind the scenes
    print(transcript.cds)  # will call transcript._get_features('cds') behind the scenes
    print(transcript.polypeptide)  # will call transcript._get_features('polypeptide') behind the scenes
    print(transcript.five_prime_utr)  # will call transcript._get_features('five_prime_utr') behind the scenes
    print(transcript.three_prime_utr)  # will call transcript._get_features('three_prime_utr') behind the scenes
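
The delegation described above, where a property forwards to `_get_features`, could look roughly like the following. This is a hypothetical stand-in class written for illustration, not the bioinfo_tools transcript class; in particular, the list-of-dicts storage and the case-insensitive type matching are assumptions.

```python
class TranscriptSketch:
    """Hypothetical stand-in for a transcript object (not the library class)."""

    def __init__(self, features):
        # features: list of dicts, each with at least a "type" key
        self._features = features

    def _get_features(self, feature_type):
        # assumption: feature types are matched case-insensitively
        wanted = feature_type.lower()
        return [f for f in self._features if f["type"].lower() == wanted]

    @property
    def exons(self):
        return self._get_features("exon")

    @property
    def cds(self):
        return self._get_features("cds")


transcript = TranscriptSketch([
    {"type": "exon", "start": 1, "end": 40},
    {"type": "CDS", "start": 10, "end": 40},
    {"type": "exon", "start": 60, "end": 100},
])
print(len(transcript.exons), len(transcript.cds))  # 2 1
```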

