
Manipulate genomic features and validate the syntax and reference sequence of your GFF3 files.

Features

  • Simple data structures: Parses a GFF3 file into a structure composed of simple Python dicts and lists.

  • Validation: Validates the GFF3 syntax on parse and saves the error messages in the parsed structure.

  • Best-effort parsing: Despite any detected errors, continues to parse the whole file and makes as much sense of it as possible.

  • Uses the Python logging library to log error messages, with support for custom loggers.

  • Parses embedded or external FASTA sequences to check bounds and the number of Ns.

  • Check and correct the phase for CDS features.

  • The tree traversal methods ancestors and descendants return a simple list in breadth-first search order.

  • Transfer children and parents using the adopt and adopted methods.

  • Test for overlapping features using the overlap method.

  • Remove a feature and its associated features using the remove method.

  • Write the modified structure to a GFF3 file using the write method.
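
To make the "simple data structures" and "best-effort parsing" ideas above concrete, here is a minimal, self-contained Python 3 sketch that turns one GFF3 feature line into a plain dict and records problems instead of raising. It does not use or reproduce the gff3 package's parser: the function name parse_feature_line, the GFF3_COLUMNS constant, and the line_raw and line_errors keys are assumptions chosen for illustration, and storing every attribute value as a list is a simplification.

```python
from urllib.parse import unquote

# The eight fixed GFF3 columns that precede the attributes column.
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end", "score", "strand", "phase"]


def parse_feature_line(line):
    """Parse one GFF3 feature line into a dict, collecting errors instead of raising."""
    record = {"line_raw": line, "line_errors": []}
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        record["line_errors"].append("expected 9 columns, got %d" % len(fields))
        # Best effort: pad or truncate so the remaining steps can still run.
        fields = (fields + ["."] * 9)[:9]
    record.update(zip(GFF3_COLUMNS, fields[:8]))

    # start and end must be integers with start <= end.
    for key in ("start", "end"):
        try:
            record[key] = int(record[key])
        except ValueError:
            record["line_errors"].append("%s is not an integer: %r" % (key, record[key]))
    if isinstance(record["start"], int) and isinstance(record["end"], int):
        if record["start"] > record["end"]:
            record["line_errors"].append("start is greater than end")

    # Attributes are ;-separated key=value pairs; values may be comma-separated lists.
    attributes = {}
    if fields[8] != ".":
        for pair in fields[8].split(";"):
            if not pair:
                continue
            key, _, value = pair.partition("=")
            attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


good = parse_feature_line("chr1\tsrc\texon\t100\t200\t.\t+\t.\tID=exon1;Parent=mRNA1,mRNA2")
bad = parse_feature_line("chr1\tsrc\texon\tabc\t200\t.\t+\t.\tID=exon2")
print(good["attributes"]["Parent"])  # ['mRNA1', 'mRNA2']
print(bad["line_errors"])            # one message about the non-integer start
```

Keeping the errors next to the parsed fields means a caller can continue through the whole file and report every problem at the end, which is the behaviour the feature list describes.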

Quick Start

The following example parses a GFF3 file named annotations.gff and validates it against an external FASTA file named annotations.fa:

# validate.py
# ============
from gff3 import Gff3

# initialize a Gff3 object
gff = Gff3()
# parse GFF3 file and do syntax checking, this populates gff.lines and gff.features
# if an embedded ##FASTA directive is found, parse the sequences into gff.fasta_embedded
gff.parse('annotations.gff')
# parse the external FASTA file into gff.fasta_external
gff.parse_fasta_external('annotations.fa')
# Check seqid, bounds and the number of Ns in each feature using one or more reference sources
gff.check_reference(allowed_num_of_n=0, feature_types=['CDS'])
# Checks whether child features are within the coordinate boundaries of parent features
gff.check_parent_boundary()
# Calculates the correct phase and checks if it matches the given phase for CDS features
gff.check_phase()
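
The phase check mentioned in the last comment can be illustrated with a short worked computation. This is a sketch of the GFF3 phase rule, not the package's check_phase implementation: the function name expected_phases and the (start, end) tuple input are hypothetical, and it assumes the coding sequence is complete, so the first CDS segment in transcription order has phase 0.

```python
def expected_phases(cds_segments, strand="+"):
    """Return (start, end, phase) tuples for CDS segments in transcription order.

    cds_segments holds 1-based, inclusive (start, end) pairs.
    Assumption: translation begins at the first base of the first segment.
    """
    # Transcription order is ascending on '+' and descending on '-'.
    ordered = sorted(cds_segments, reverse=(strand == "-"))
    phase = 0
    result = []
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last full codon determine how many bases
        # the next segment must skip to reach the start of a codon.
        phase = (3 - (length - phase) % 3) % 3
    return result


for start, end, phase in expected_phases([(1, 10), (21, 30), (41, 50)]):
    print(start, end, phase)
# 1 10 0
# 21 30 2
# 41 50 1
```

In the first segment, 10 bases form three codons with one base left over, so the second segment needs two more bases to finish that codon and gets phase 2. A validator can compare these expected values with the phase column of each CDS line and report any mismatch.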

A more feature-complete GFF3 validator with a command-line interface, which also generates a validation report in Markdown, is available under examples/gff_valid.py.

The following example demonstrates how to filter, traverse, and modify the parsed list of GFF3 lines.

  1. Change features with type exon to pseudogenic_exon and type transcript to pseudogenic_transcript if the feature has an ancestor of type pseudogene.

  2. If a pseudogene feature overlaps a gene feature, move all of the children from the pseudogene feature to the gene feature, and remove the pseudogene feature.

# fix_pseudogene.py
# =================
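from gff3 import Gff3

# parse the GFF3 file when the object is created
gff = Gff3('annotations.gff')
# map each feature type to its pseudogenic counterpart
type_map = {'exon': 'pseudogenic_exon', 'transcript': 'pseudogenic_transcript'}
# collect every feature line of type pseudogene
pseudogenes = [line for line in gff.lines if line['line_type'] == 'feature' and line['type'] == 'pseudogene']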
for pseudogene in pseudogenes:
    # convert types
    for line in gff.descendants(pseudogene):
        if line['type'] in type_map:
            line['type'] = type_map[line['type']]
    # find overlapping gene
    overlapping_genes = [line for line in gff.lines if line['line_type'] == 'feature' and line['type'] == 'gene' and gff.overlap(line, pseudogene)]
    if overlapping_genes:
        # move pseudogene children to overlapping gene
        gff.adopt(pseudogene, overlapping_genes[0])
        # remove pseudogene
        gff.remove(pseudogene)
gff.write('annotations_fixed.gff')
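
The script relies on two ideas: listing a feature's descendants in breadth-first order and testing whether two features overlap. The sketch below shows one way these could work on plain dicts. It is not the package's code: the children key, the helper names, and the decision to ignore strand when testing overlap are assumptions made for illustration, and the real descendants and overlap methods may behave differently.

```python
from collections import deque


def descendants(feature):
    """List all descendants of a feature dict in breadth-first order."""
    found = []
    seen = {id(feature)}
    queue = deque(feature.get("children", []))
    while queue:
        node = queue.popleft()
        if id(node) in seen:
            continue  # a feature with several parents is reported only once
        seen.add(id(node))
        found.append(node)
        queue.extend(node.get("children", []))
    return found


def overlap(a, b):
    """True if two features share a seqid and their 1-based closed intervals intersect."""
    return a["seqid"] == b["seqid"] and a["start"] <= b["end"] and b["start"] <= a["end"]


exon1 = {"name": "exon1", "seqid": "chr1", "start": 100, "end": 200, "children": []}
exon2 = {"name": "exon2", "seqid": "chr1", "start": 300, "end": 400, "children": []}
mrna1 = {"name": "mRNA1", "seqid": "chr1", "start": 100, "end": 400, "children": [exon1, exon2]}
mrna2 = {"name": "mRNA2", "seqid": "chr1", "start": 100, "end": 200, "children": [exon1]}
gene = {"name": "gene1", "seqid": "chr1", "start": 100, "end": 400, "children": [mrna1, mrna2]}

print([f["name"] for f in descendants(gene)])  # ['mRNA1', 'mRNA2', 'exon1', 'exon2']
print(overlap(exon1, exon2))                   # False
print(overlap(mrna1, exon2))                   # True
```

Breadth-first order returns all direct children before any grandchildren, so a caller that renames feature types level by level sees transcripts before their exons.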

History

1.0.0 (2018-12-01)

  • Fix Python3 issues

  • Added sequence functions: complement(seq) and translate(seq)

  • Added fasta write function: fasta_dict_to_file(fasta_dict, fasta_file, line_char_limit=None)

  • Added Gff method to return the sequence of line_data: sequence(self, line_data, child_type=None, reference=None)

  • Gff.write no longer prints redundant ‘###’ when the whole gene is marked as removed

0.3.0 (2015-03-10)

  • Fixed phase checking.

0.2.0 (2015-01-28)

  • Supports Python 2.6, 2.7, 3.3, 3.4, and PyPy.

  • Don’t report empty attributes as errors.

  • Improved documentation.

0.1.0 (2014-12-11)

  • First release on PyPI.

