A package for parsing gff3 (general feature format) files into pandas dataframes

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Project description

gff3_parser

This is a simple Python package for parsing gff3 (Generic Feature Format) files into pandas dataframes. This file format is used for genetic annotation files, and I couldn't find a parser that worked with Python, so I wrote this one. It is still a work in progress, and I hope to add features soon.

Background

What is the gff3 file format?

I took this nice explanation from NGS Analysis:

General Feature Format (GFF) is a tab-delimited text file that holds information about any and every feature that can be applied to a nucleic acid or protein sequence. Everything from CDS, microRNAs, binding domains, ORFs, and more can be handled by this format. Unfortunately there have been many variations of the original GFF format and many have since become incompatible with each other. The latest accepted format (GFF3) has attempted to address many of the shortcomings of previous versions.
GFF3 has 9 required fields, though not all are utilized (either blank or a default value of ‘.’).

Sequence ID

Source - Describes the algorithm or the procedure that generated this feature. Typically Genscan or GenBank, respectively.

Feature Type - Describes what the feature is (mRNA, domain, exon, etc.). These terms are constrained to the Sequence Ontology terms.

Feature Start

Feature End

Score - Typically E-values for sequence similarity and P-values for predictions.

Strand (+ or -)

Phase - Indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.

Attributes - A semicolon-separated list of tag-value pairs, providing additional information about each feature. Some of these tags are predefined, e.g. ID, Name, Alias, Parent. The full list is given in the GFF3 specification.
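To make the nine fields concrete, here is a minimal sketch that splits one feature line into named fields. The lowercase field names and the `split_feature_line` helper are illustrative assumptions, not part of gff3_parser.

```python
# Field names follow the nine fields listed above; these identifiers are
# illustrative and are not taken from any library.
GFF3_FIELDS = [
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
]


def split_feature_line(line):
    """Split one tab-delimited GFF3 feature line into a dict of its nine fields."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(GFF3_FIELDS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(parts)}")
    record = dict(zip(GFF3_FIELDS, parts))
    # A lone "." marks an unused field; represent it as None.
    for key, value in record.items():
        if value == ".":
            record[key] = None
    # Start and end are 1-based integer coordinates.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


example = (
    "chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t"
    "ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2"
)
fields = split_feature_line(example)
print(fields["type"], fields["start"], fields["end"], fields["score"])
# prints: exon 11869 12227 None
```

The attributes field is left as a raw string here because it needs its own parsing step.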

Example File

##gff-version 3
#description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104)
#provider: GENCODE
#contact: gencode-help@ebi.ac.uk
#format: gff3
#date: 2021-03-12
##sequence-region chr1 1 248956422
chr1	HAVANA	gene	11869	14409	.	+	.	ID=ENSG00000223972.5;gene_id=ENSG00000223972.5;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;level=2;hgnc_id=HGNC:37102;havana_gene=OTTHUMG00000000961.2
chr1	HAVANA	transcript	11869	14409	.	+	.	ID=ENST00000456328.2;Parent=ENSG00000223972.5;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
chr1	HAVANA	exon	11869	12227	.	+	.	ID=exon:ENST00000456328.2:1;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;exon_number=1;exon_id=ENSE00002234944.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
chr1	HAVANA	exon	12613	12721	.	+	.	ID=exon:ENST00000456328.2:2;Parent=ENST00000456328.2;gene_id=ENSG00000223972.5;transcript_id=ENST00000456328.2;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;transcript_type=processed_transcript;transcript_name=DDX11L1-202;exon_number=2;exon_id=ENSE00003582793.1;level=2;transcript_support_level=1;hgnc_id=HGNC:37102;tag=basic;havana_gene=OTTHUMG00000000961.2;havana_transcript=OTTHUMT00000362751.1
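The file above mixes three kinds of lines: `##` directives, `#` comments, and tab-delimited feature rows. As a rough sketch (the `split_header_and_features` helper is a made-up name, and this is not necessarily how gff3_parser handles the header), they can be separated like this:

```python
import io


def split_header_and_features(handle):
    """Sort the lines of a GFF3 file into directives, comments, and feature rows."""
    directives, comments, features = [], [], []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            directives.append(line[2:])   # e.g. "gff-version 3"
        elif line.startswith("#"):
            comments.append(line[1:])     # e.g. "provider: GENCODE"
        else:
            features.append(line)         # a nine-column feature row
    return directives, comments, features


# A shortened copy of the example file, held in memory for the demonstration.
sample = io.StringIO(
    "##gff-version 3\n"
    "#provider: GENCODE\n"
    "#format: gff3\n"
    "##sequence-region chr1 1 248956422\n"
    "chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tID=ENSG00000223972.5\n"
)
directives, comments, features = split_header_and_features(sample)
print(directives)
print(comments)
print(len(features))
```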

Why this is super annoying to parse

Basically, the first 8 columns are nicely structured tabular data, but the last column holds an arbitrary number of key-value pairs that varies from row to row. This is somewhat like a SQL table paired with a NoSQL database, but because of the way these files are distributed, you can't use those tools directly.

How to parse

I just found every unique key in the last column, made it its own column, and then reorganized the data accordingly. The result can be reasonably sparse, and parsing takes a good amount of time and space (the files are often pretty large), but the end result is a normal structured pandas dataframe.
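The idea can be sketched in plain Python as follows. This is an illustration of the approach under simple assumptions (well-formed `key=value` pairs, no escaping), not the package's actual code, and the helper names `parse_attributes` and `expand_attributes` are hypothetical.

```python
def parse_attributes(attribute_string):
    """Turn 'key=value;key=value' into a dict, skipping empty pieces."""
    pairs = {}
    for piece in attribute_string.strip().split(";"):
        if not piece:
            continue
        key, _, value = piece.partition("=")
        pairs[key] = value
    return pairs


def expand_attributes(records, attribute_key="attributes"):
    """Give every unique attribute key its own column; missing values become None."""
    parsed = [parse_attributes(r[attribute_key]) for r in records]
    # First pass: collect every unique key, keeping first-seen order.
    all_keys, seen = [], set()
    for attrs in parsed:
        for key in attrs:
            if key not in seen:
                seen.add(key)
                all_keys.append(key)
    # Second pass: rebuild each row with one slot per key.
    expanded = []
    for record, attrs in zip(records, parsed):
        row = {k: v for k, v in record.items() if k != attribute_key}
        for key in all_keys:
            row[key] = attrs.get(key)
        expanded.append(row)
    return all_keys, expanded


records = [
    {"type": "gene", "attributes": "ID=ENSG00000223972.5;gene_name=DDX11L1"},
    {"type": "transcript",
     "attributes": "ID=ENST00000456328.2;Parent=ENSG00000223972.5;tag=basic"},
]
all_keys, expanded = expand_attributes(records)
print(all_keys)
print(expanded[0])
```

The gene row ends up with `None` for `Parent` and `tag`, which is where the sparsity comes from. A list of dicts like `expanded` can be passed to `pandas.DataFrame(...)` to get a dataframe.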

Installation

pip install gff3-parser

I'd recommend updating this often, as I am still finding and fixing issues.

Example Usage

>>> import gff3_parser
>>> filepath = "gencode.v38.annotation.gff3"
>>> just_tabular = gff3_parser.parse_gff3(filepath, verbose=True, parse_attributes=False)
description: evidence-based annotation of the human genome (GRCh38), version 38 (Ensembl 104)

provider: GENCODE

contact: gencode-help@ebi.ac.uk

format: gff3

date: 2021-03-12

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3148167/3148167 [00:07<00:00, 421099.43it/s]

>>> just_tabular.head()
  Seqid  Source        Type  Start    End  Score Strand Phase
0  chr1  HAVANA        gene  11869  14409    NaN      +   NaN
1  chr1  HAVANA  transcript  11869  14409    NaN      +   NaN
2  chr1  HAVANA        exon  11869  12227    NaN      +   NaN
3  chr1  HAVANA        exon  12613  12721    NaN      +   NaN
4  chr1  HAVANA        exon  13221  14409    NaN      +   NaN

>>> full_data = gff3_parser.parse_gff3('gencode.v38.annotation.gff3', verbose=False, parse_attributes=True)

>>> full_data.head()
  Seqid  Source        Type  Start    End  Score Strand Phase  seqid  ...                        ID transcript_support_level             Parent  ont       transcript_type    tag       havana_transcript transcript_name ccdsid
0  chr1  HAVANA        gene  11869  14409    NaN      +   NaN    NaN  ...         ENSG00000223972.5                      NaN                NaN  NaN                   NaN    NaN                     NaN             NaN    NaN
1  chr1  HAVANA  transcript  11869  14409    NaN      +   NaN    NaN  ...         ENST00000456328.2                        1  ENSG00000223972.5  NaN  processed_transcript  basic  OTTHUMT00000362751.1\n     DDX11L1-202    NaN
2  chr1  HAVANA        exon  11869  12227    NaN      +   NaN    NaN  ...  exon:ENST00000456328.2:1                        1  ENST00000456328.2  NaN  processed_transcript  basic  OTTHUMT00000362751.1\n     DDX11L1-202    NaN
3  chr1  HAVANA        exon  12613  12721    NaN      +   NaN    NaN  ...  exon:ENST00000456328.2:2                        1  ENST00000456328.2  NaN  processed_transcript  basic  OTTHUMT00000362751.1\n     DDX11L1-202    NaN
4  chr1  HAVANA        exon  13221  14409    NaN      +   NaN    NaN  ...  exon:ENST00000456328.2:3                        1  ENST00000456328.2  NaN  processed_transcript  basic  OTTHUMT00000362751.1\n     DDX11L1-202    NaN

>>> full_data.columns
Index(['Seqid', 'Source', 'Type', 'Start', 'End', 'Score', 'Strand', 'Phase',
       'seqid', 'transcript_id', 'havana_gene', 'gene_type', 'gene_name',
       'gene_id', 'exon_id', 'level', 'protein_id', 'hgnc_id', 'exon_number',
       'ID', 'transcript_support_level', 'Parent', 'ont', 'transcript_type',
       'tag', 'havana_transcript', 'transcript_name', 'ccdsid'],
      dtype='object')
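Once the data is in a dataframe, ordinary pandas operations apply. The sketch below uses a small hand-built dataframe as a stand-in for `full_data`, so it runs without the annotation file; only the column names are taken from the output above, and the `Length` column is added for illustration.

```python
import pandas as pd

# Hand-built stand-in for the parsed dataframe, using column names from the
# output above. The real dataframe has many more rows and columns.
full_data = pd.DataFrame(
    {
        "Seqid": ["chr1", "chr1", "chr1", "chr1"],
        "Type": ["gene", "transcript", "exon", "exon"],
        "Start": [11869, 11869, 11869, 12613],
        "End": [14409, 14409, 12227, 12721],
        "Parent": [
            None,
            "ENSG00000223972.5",
            "ENST00000456328.2",
            "ENST00000456328.2",
        ],
    }
)

# Keep only exon rows and compute each exon's length
# (GFF3 start and end coordinates are both inclusive).
exons = full_data[full_data["Type"] == "exon"].copy()
exons["Length"] = exons["End"] - exons["Start"] + 1

# Count exons per parent transcript.
exons_per_transcript = exons.groupby("Parent").size()

print(exons[["Start", "End", "Length"]])
print(exons_per_transcript)
```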

Full Documentation

This whole project has literally one public function so far, so I'm just going to document it here until I feel like it needs more.

I'll get around to it eventually. There are only two uses, and both are in the example.

Release history

- 0.0.5 (Aug 19, 2021), this version
- 0.0.4 (Aug 19, 2021)
- 0.0.3 (Aug 19, 2021)
- 0.0.2 (Aug 19, 2021)
- 0.0.1 (Aug 19, 2021)

Download files

Source Distribution

gff3_parser-0.0.5.tar.gz (6.6 kB view details)

Uploaded Aug 19, 2021 Source

Built Distribution

gff3_parser-0.0.5-py3-none-any.whl (6.7 kB view details)

Uploaded Aug 19, 2021 Python 3

File details

Details for the file gff3_parser-0.0.5.tar.gz.

File metadata

Download URL: gff3_parser-0.0.5.tar.gz
Upload date: Aug 19, 2021
Size: 6.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.2

File hashes

Hashes for gff3_parser-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`459269031331d5f9e6d6c91b1cc2ee4b9f0791f749ff93c8e2bcd0443cd5e2c8`
MD5	`5048b46daf5182249f61b0251dae6751`
BLAKE2b-256	`452c97ced34849d3fbdb03db9822547bc36419e82316a5af64a9bc56e441c900`

File details

Details for the file gff3_parser-0.0.5-py3-none-any.whl.

File metadata

Download URL: gff3_parser-0.0.5-py3-none-any.whl
Upload date: Aug 19, 2021
Size: 6.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.4.2 importlib_metadata/4.6.4 pkginfo/1.7.1 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.1 CPython/3.8.2

File hashes

Hashes for gff3_parser-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e6b9ffb7b5a1153d01ed7ce3bdc2daafacf5c5736ed11138fcf0b040c564017`
MD5	`8949c0d92fa1d13f4207178b019b6a60`
BLAKE2b-256	`da0d888b116170ca5c3a7d66667ba06892465f0981638b28b7f7934cddc357bd`
