Python library that parses GFF, Fasta files into python classes
Project description
# bioinfo_tools 0.3.0
## Installation
```bash
pip install bioinfo_tools
```
## Parsers
*HEADS UP!* These parsers are still under development and usage is not consistent from one parser to another.
### Fasta parser
```python
from bioinfo_tools.parsers.fasta import FastaParser
fasta_parser = FastaParser()
# by default, sequence IDs are separated by the firstly found '|' or ':'
for seqid, sequence in fasta_parser.read("/path/to/file.fasta"):
print(seqid, sequence)
# you may specify a specific separator for your sequence ID (e.g white space):
for seqid, sequence in fasta_parser.read("/path/to/file.fasta", id_separator=" "):
print(seqid, sequence)
```
### GFF parser
```python
from bioinfo_tools.parsers.gff import Gff3
gff_parser = Gff3()
with open("/path/to/file.gff", "r") as fh:
for gene in gff_parser.read(fh):
print(gene)
import gzip
with gzip.open("/path/to/file.gz", "rb") as fh:
for gene in gff_parser.read(fh):
print(gene)
```
### OBO parser
```python
from bioinfo_tools.parsers.obo import OboParser
obo_parser = OboParser()
with open("/path/to/file.obo") as fh:
go_terms = obo_parser.read(fh)
for go_term in go_terms.values():
print(go_term)
# you may also get the GO term parents via the parser
parents = obo_parser.get_parents(go_term)
```
## Usage Examples
### Extract all introns sequences by parsing GFF and fasta files
In this example, we focus on a genome assembly. We will first load a GFF file containing gene annotations for this
assembly, then load a fastA file containing the nucleic sequences of each chromosome in the genome.
We will then collect all transcript introns and extract their nucleic sequences.
**__DISCLAIMER__**: for this example to work, your GFF file must expose at least the following feature types in column #3:
- `gene`
- one of `transcript|mRNA|RNA` (or lowercased version)
```python
from bioinfo_tools.genomic_features.chromosome import Chromosome
from bioinfo_tools.parsers.gff import Gff3
from bioinfo_tools.parsers.fasta import FastaParser
chromosomes = dict() # {<chromosome_id>: <bioinfo_tools.genomic_features.Chromosome>}
# start with parsing a GFF file
gff_parser = Gff3()
with open("/path/to/gene_models.gff", "r") as fh:
for gene in gff_parser.read(fh):
chromosome = gene['seqid']
if chromosome not in chromosomes:
chromosomes[chromosome] = Chromosome(chromosome) # init a new Chromosome object
chromosomes[chromosome].add_gene(gene) # add the current gene to our Chromosome object
# load our chromosome sequences in memory
fasta_parser = FastaParser()
for chromosome, nucleic_sequence in fasta_parser.read("/path/to/genome_chromosomes.fasta"):
if chromosome not in chromosomes:
chromosomes[chromosome] = Chromosome(chromosome)
# attach parsed chromosome sequence to our Chromosome object
chromosomes[chromosome].attach_nucleic_sequence(nucleic_sequence)
# now, collect introns and extact their nucleic sequence
introns_sequences = dict() # {<intron_id>: <intron_sequence>}
for chromosome in chromosomes.values():
for gene in chromosome.genes:
for transcript in gene.transcripts:
for idx, intron in enumerate(transcript.introns):
intron_id = "%s_intron_%s" % (transcript.transcript_id, idx)
intron_seq = intron.extract(chromosome.nucleic_sequence) # that we attached above
introns_sequences[intron_id] = intron_seq
# from here, you can do what you want with the intron sequences (eg. write them to a fasta file, etc)
# ...
```
__Note:__ when at the transcript level, you can grab its feature types as described in your GFF file by doing so:
```python
for feature in transcript._get_features("exon"):
print(feature) # I'm an exon
```
For convenience and clarity, following properties are available on transcript objects:
```python
print(transcript.introns) # will call transcript._get_features('intron') behind the scenes
print(transcript.exons) # will call transcript._get_features('exon') behind the scenes
print(transcript.cds) # will call transcript._get_features('cds') behind the scenes
print(transcript.polypeptide) # will call transcript._get_features('polypeptide') behind the scenes
print(transcript.five_prime_utr) # will call transcript._get_features('five_prime_utr') behind the scenes
print(transcript.three_prime_utr) # will call transcript._get_features('three_prime_utr') behind the scenes
```
## Installation
```bash
pip install bioinfo_tools
```
## Parsers
*HEADS UP!* These parsers are still under development and usage is not consistent from one parser to another.
### Fasta parser
```python
from bioinfo_tools.parsers.fasta import FastaParser
fasta_parser = FastaParser()
# by default, sequence IDs are separated by the firstly found '|' or ':'
for seqid, sequence in fasta_parser.read("/path/to/file.fasta"):
print(seqid, sequence)
# you may specify a specific separator for your sequence ID (e.g white space):
for seqid, sequence in fasta_parser.read("/path/to/file.fasta", id_separator=" "):
print(seqid, sequence)
```
### GFF parser
```python
from bioinfo_tools.parsers.gff import Gff3
gff_parser = Gff3()
with open("/path/to/file.gff", "r") as fh:
for gene in gff_parser.read(fh):
print(gene)
import gzip
with gzip.open("/path/to/file.gz", "rb") as fh:
for gene in gff_parser.read(fh):
print(gene)
```
### OBO parser
```python
from bioinfo_tools.parsers.obo import OboParser
obo_parser = OboParser()
with open("/path/to/file.obo") as fh:
go_terms = obo_parser.read(fh)
for go_term in go_terms.values():
print(go_term)
# you may also get the GO term parents via the parser
parents = obo_parser.get_parents(go_term)
```
## Usage Examples
### Extract all introns sequences by parsing GFF and fasta files
In this example, we focus on a genome assembly. We will first load a GFF file containing gene annotations for this
assembly, then load a fastA file containing the nucleic sequences of each chromosome in the genome.
We will then collect all transcript introns and extract their nucleic sequences.
**__DISCLAIMER__**: for this example to work, your GFF file must expose at least the following feature types in column #3:
- `gene`
- one of `transcript|mRNA|RNA` (or lowercased version)
```python
from bioinfo_tools.genomic_features.chromosome import Chromosome
from bioinfo_tools.parsers.gff import Gff3
from bioinfo_tools.parsers.fasta import FastaParser
chromosomes = dict() # {<chromosome_id>: <bioinfo_tools.genomic_features.Chromosome>}
# start with parsing a GFF file
gff_parser = Gff3()
with open("/path/to/gene_models.gff", "r") as fh:
for gene in gff_parser.read(fh):
chromosome = gene['seqid']
if chromosome not in chromosomes:
chromosomes[chromosome] = Chromosome(chromosome) # init a new Chromosome object
chromosomes[chromosome].add_gene(gene) # add the current gene to our Chromosome object
# load our chromosome sequences in memory
fasta_parser = FastaParser()
for chromosome, nucleic_sequence in fasta_parser.read("/path/to/genome_chromosomes.fasta"):
if chromosome not in chromosomes:
chromosomes[chromosome] = Chromosome(chromosome)
# attach parsed chromosome sequence to our Chromosome object
chromosomes[chromosome].attach_nucleic_sequence(nucleic_sequence)
# now, collect introns and extact their nucleic sequence
introns_sequences = dict() # {<intron_id>: <intron_sequence>}
for chromosome in chromosomes.values():
for gene in chromosome.genes:
for transcript in gene.transcripts:
for idx, intron in enumerate(transcript.introns):
intron_id = "%s_intron_%s" % (transcript.transcript_id, idx)
intron_seq = intron.extract(chromosome.nucleic_sequence) # that we attached above
introns_sequences[intron_id] = intron_seq
# from here, you can do what you want with the intron sequences (eg. write them to a fasta file, etc)
# ...
```
__Note:__ when at the transcript level, you can grab its feature types as described in your GFF file by doing so:
```python
for feature in transcript._get_features("exon"):
print(feature) # I'm an exon
```
For convenience and clarity, following properties are available on transcript objects:
```python
print(transcript.introns) # will call transcript._get_features('intron') behind the scenes
print(transcript.exons) # will call transcript._get_features('exon') behind the scenes
print(transcript.cds) # will call transcript._get_features('cds') behind the scenes
print(transcript.polypeptide) # will call transcript._get_features('polypeptide') behind the scenes
print(transcript.five_prime_utr) # will call transcript._get_features('five_prime_utr') behind the scenes
print(transcript.three_prime_utr) # will call transcript._get_features('three_prime_utr') behind the scenes
```
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
bioinfo_tools-0.3.0.tar.gz
(12.3 kB
view details)
File details
Details for the file bioinfo_tools-0.3.0.tar.gz
.
File metadata
- Download URL: bioinfo_tools-0.3.0.tar.gz
- Upload date:
- Size: 12.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | f01bcd2b18258cf603c26990cc8e775f2e82b7e2b8c223e1cd4e97633d6cf464 |
|
MD5 | 30eadc7b125ec020fa839fbae2ea8f87 |
|
BLAKE2b-256 | 37a998e108946c7ce861f5adb7c65aa8fa9343f4dc372a77422bd8d69d7c0960 |