A fast gtf/gff parser.

gxf is a fast GTF/GFF parser based on pandas.

GFF/GTF file format

The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The following documentation is based on the Version 2 specifications.

The GTF (General Transfer Format) is identical to GFF version 2.

Fields must be tab-separated. Also, all but the final field in each feature line must contain a value; "empty" columns should be denoted with a '.'

  • chr_id - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
  • source - name of the program that generated this feature, or the data source (database or project name)
  • type - feature type name, e.g. Gene, Variation, Similarity
  • start - Start position* of the feature, with sequence numbering starting at 1.
  • end - End position* of the feature, with sequence numbering starting at 1.
  • score - A floating point value.
  • strand - defined as + (forward) or - (reverse).
  • phase - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.
  • attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

* Both the start and end positions are included. For example, setting start-end to 1-2 describes two bases, the first and second base in the sequence.

Note that where the attributes contain identifiers that link the features together into a larger structure, these will be used by Ensembl to display the features as joined blocks.
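
To make the nine-column layout concrete, here is a minimal sketch of splitting one feature line by hand with only the standard library. It is not how gxf parses files; the names Feature, parse_attributes and parse_line, and the sample line itself, are made up for illustration, and the attribute syntax assumed is the GTF-style `tag "value";` form.

```python
from typing import Dict, NamedTuple, Optional


class Feature(NamedTuple):
    """One GFF/GTF feature line (hypothetical container, not part of gxf)."""
    chr_id: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: Dict[str, str]


def parse_attributes(text: str) -> Dict[str, str]:
    """Parse attributes written as: gene_id "G1"; gene_name "demo";"""
    if text.strip() in ("", "."):
        return {}
    attrs = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        # Tag and value are separated by the first space.
        tag, _, value = item.partition(" ")
        attrs[tag] = value.strip().strip('"')
    return attrs


def parse_line(line: str) -> Feature:
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    chr_id, source, ftype, start, end, score, strand, phase, attributes = cols
    return Feature(
        chr_id, source, ftype, int(start), int(end),
        None if score == "." else float(score),   # '.' marks an empty column
        strand,
        None if phase == "." else int(phase),
        parse_attributes(attributes),
    )


# Made-up sample line; start-end of 1-2 covers two bases.
line = 'chr1\texample\tgene\t1\t2\t.\t+\t.\tgene_id "G1"; gene_name "demo";'
feature = parse_line(line)
length = feature.end - feature.start + 1  # both ends are included
```

Because start and end are both inclusive and 1-based, the feature length is end - start + 1, which gives 2 for the sample line.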

Usage

Query all lines whose type is 'gene'

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(type='gene')

Multi-condition query

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(type='gene', strand=1)

You can query not only equality, but also inequality.

The query name is field_name + __ + oper, where oper is one of ge, le, eq, ne, gt, lt.

Query start >= 200

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)

gff.filter(start__ge=200)

Query end < 100

from gxf import GXF
filename = 'test.gff'

gff = GXF(filename)


gff.filter(end__lt=100)
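
The field_name + __ + oper naming can be pictured as a lookup into Python's operator module. The sketch below is one way such a convention might be resolved over plain dicts; it is an assumption for illustration, not gxf's actual implementation, and OPERATORS, split_lookup, matches and the sample records are hypothetical.

```python
import operator

# Suffix -> comparison function, for the six operators named in the text.
OPERATORS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "gt": operator.gt,
    "ge": operator.ge,
    "lt": operator.lt,
    "le": operator.le,
}


def split_lookup(key):
    """Split 'start__ge' into ('start', 'ge'); a bare 'type' means equality."""
    field, sep, oper = key.partition("__")
    if not sep:
        return field, "eq"
    if oper not in OPERATORS:
        raise ValueError(f"unknown operator: {oper!r}")
    return field, oper


def matches(record, **conditions):
    """Return True if the record (a dict) satisfies every condition."""
    for key, expected in conditions.items():
        field, oper = split_lookup(key)
        if not OPERATORS[oper](record[field], expected):
            return False
    return True


# Made-up records standing in for parsed feature lines.
records = [
    {"type": "gene", "start": 150, "end": 400},
    {"type": "gene", "start": 250, "end": 900},
    {"type": "exon", "start": 250, "end": 300},
]
selected = [r for r in records if matches(r, type="gene", start__ge=200)]
```

Here only the second record is selected, since it is the only gene whose start is at least 200.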

Preprocessing data

You can subclass GXF and override methods to preprocess or post-process field values.

The method name follows the pattern before/after + _handle_ + field_name, e.g. after_handle_attributes, and the method takes one argument in addition to self.

from gxf import GXF
filename = 'test.gff'

class MyGXF(GXF):

    def before_handle_type(self, x):
        return x.lower()

    def after_handle_type(self, x):
        return x.upper()

gff = MyGXF(filename)

gff.filter(type='gene')
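
For readers curious how name-based hooks of this kind are typically dispatched, here is a minimal sketch using getattr. It is a guess at the general pattern, not gxf's code: HookedParser, handle and process are hypothetical names, and the built-in per-field conversion is a no-op here.

```python
class HookedParser:
    """Hypothetical stand-in for a parser; not the gxf implementation."""

    def handle(self, field_name, value):
        # The parser's own per-field conversion; a no-op in this sketch.
        return value

    def process(self, field_name, value):
        # Look up optional hooks by name; skip them when not defined.
        before = getattr(self, f"before_handle_{field_name}", None)
        after = getattr(self, f"after_handle_{field_name}", None)
        if before is not None:
            value = before(value)
        value = self.handle(field_name, value)
        if after is not None:
            value = after(value)
        return value


class MyParser(HookedParser):
    def before_handle_type(self, x):
        return x.lower()

    def after_handle_type(self, x):
        return x.upper()


parser = MyParser()
result = parser.process("type", "Gene")          # hooks run: lower, then upper
untouched = parser.process("source", "Ensembl")  # no hooks defined for 'source'
```

In this sketch, fields without a matching hook pass through unchanged, so a subclass only needs to define the hooks it cares about.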
