Skip to main content

A fast gtf/gff parser.

Project description

gxf is a fast gtf/gff parser based pandas.

GFF/GTF file format

The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.

The GTF (General Transfer Format) is identical to GFF version 2.

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

  • chr_id - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
  • source - name of the program that generated this feature, or the data source (database or project name)
  • type - feature type name, e.g. Gene, Variation, Similarity
  • start - Start position* of the feature, with sequence numbering starting at 1.
  • end - End position* of the feature, with sequence numbering starting at 1.
  • score - A floating point value.
  • strand - defined as + (forward) or - (reverse).
  • phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
  • attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. *- Both, the start and end position are included. For example, setting start-end to 1-2 describes two bases, the first and second base in the sequence.

Note that where the attributes contain identifiers that link the features together into a larger structure, these will be used by Ensembl to display the features as joined blocks.

GFF file format

Usage

query all lines that type is 'gene'

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(type='gene')

Multi-condition query

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(type='gene' strand=1)

You can query not only equality, but also inequality.

The query name is field_name + __ + oper, and oper is one of the geleeqnegtlt.

query start >= 200

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)

gff.filter(start__ge=200)

query end < 100

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(end__lt=100)

preprocessing data

You can use Inherits GXF to rewrite some method to preprocess or Post-process.

the method format is before/after + _handle_ + field_name, eg. after_handle_attributes, and the method need one arg.

from gxf import GXF
filename = 'test.gff'

class MyGXF(GXF):

    def before_handle_type(self, x):
        return x.lower()

    def after_handle_type(self, x):
        return x.upper()

gff = MyGXF(filename)

gff.filter(type='gene')

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gxf-0.0.5.tar.gz (7.8 kB view hashes)

Uploaded Source

Built Distribution

gxf-0.0.5-py3-none-any.whl (6.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page