Fast GTF parser
***************************************
Overview
***************************************
We want an extremely fast, lightweight way to access gene data stored in GTF format.
The parsed data is held in an intuitive hierarchy::

    Gene
        -> transcript
        -> transcript

with exons being stored as intervals.
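As a rough illustration of that hierarchy, the sketch below groups the exon records of a GTF file by gene and transcript, keeping each exon as a (start, end) integer pair. The names ``Gene``, ``Transcript`` and ``build_genes`` are hypothetical and are not the classes used by this package; only the nine-column, tab-separated GTF layout is taken from the format itself.

```python
import re
from dataclasses import dataclass, field


# Hypothetical containers for illustration; not the package's own classes.
@dataclass
class Transcript:
    transcript_id: str
    exons: list = field(default_factory=list)        # (start, end) int pairs


@dataclass
class Gene:
    gene_id: str
    transcripts: dict = field(default_factory=dict)  # transcript_id -> Transcript


# Matches attribute pairs such as: gene_id "G1";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def build_genes(lines):
    """Group exon records from GTF lines into Gene -> Transcript -> exons."""
    genes = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                                  # skip blanks and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue                                  # only exon features here
        attrs = dict(ATTRIBUTE.findall(cols[8]))
        gene = genes.setdefault(attrs["gene_id"], Gene(attrs["gene_id"]))
        transcript = gene.transcripts.setdefault(
            attrs["transcript_id"], Transcript(attrs["transcript_id"]))
        transcript.exons.append((int(cols[3]), int(cols[4])))
    return genes


# Made-up records, used only to exercise the sketch.
example = [
    '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\tprotein_coding\texon\t100\t250\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
]
genes = build_genes(example)
```

Storing exons as plain integer pairs keeps each transcript small and makes overlap tests a matter of comparing two numbers.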
Our aim is to cache the data in a binary format that can be re-read in < 10s for even the largest genomes.
Currently, the initial parse of Ensembl Homo sapiens release 56 takes around 4.5 minutes.
The binary data can then be reloaded in < 10s.
The cache contains *all* of the data structure in the original GTF file.
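The parse-once, reload-quickly pattern can be sketched as follows. This uses Python's standard ``pickle`` module as a stand-in; the package's actual binary format is not described here, and ``load_with_cache``, ``slow_parse`` and the ``.cache`` suffix are hypothetical names chosen for the example.

```python
import os
import pickle
import tempfile


def load_with_cache(source_path, parse, ignore_cache=False):
    """Return parse(source_path), reusing a binary cache when it is up to date."""
    cache_path = source_path + ".cache"        # hypothetical naming scheme
    if (not ignore_cache
            and os.path.exists(cache_path)
            and os.path.getmtime(cache_path) >= os.path.getmtime(source_path)):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)         # fast path: no text parsing
    data = parse(source_path)                  # slow path: parse the text file
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return data


calls = []                                     # records every real parse


def slow_parse(path):
    """Toy parser standing in for the expensive GTF parse."""
    calls.append(path)
    with open(path) as handle:
        return [line.split("\t") for line in handle.read().splitlines()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.gtf")
    with open(path, "w") as handle:
        handle.write("1\tsrc\texon\t100\t200\n")
    first = load_with_cache(path, slow_parse)                      # parses, writes cache
    second = load_with_cache(path, slow_parse)                     # served from cache
    third = load_with_cache(path, slow_parse, ignore_cache=True)   # forces a re-parse
```

Comparing modification times is one simple way to notice that the source file has changed since the cache was written.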
Note that we sacrifice memory usage for speed. This is seldom a problem for modern computers
and genome sizes (there are around 400,000 exons, but they are stored as intervals / int pairs).
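To give a feel for the memory cost, the sketch below packs exon coordinates into a flat ``array`` of C integers, two per exon. The 400,000 figure is the exon count quoted above; the use of ``array`` and of the ``"i"`` type code is an assumption made for illustration, not a description of the package's internal layout.

```python
from array import array

exon_count = 400_000

# One flat array holding start, end, start, end, ...
# "i" is a signed C int, typically 4 bytes wide.
coords = array("i", [0]) * (2 * exon_count)

# Exon number k lives at positions 2*k (start) and 2*k + 1 (end).
coords[0], coords[1] = 100, 200

total_bytes = coords.itemsize * len(coords)
print(f"{total_bytes / 1e6:.1f} MB for {exon_count} exons")
```

With 4-byte integers this comes to 400,000 x 2 x 4 bytes, roughly 3.2 MB, which is small next to the memory of a typical workstation.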
***************************************
A simple example
***************************************
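::

    # assumed import path for this package
    from gtf_to_genes import t_parse_gtf

    # hypothetical wrapper so that the final ``return`` is valid;
    # gtf_file is the path to a GTF file, logger is a logging.Logger
    def get_protein_coding_genes(gtf_file, logger):
        gene_structures = t_parse_gtf("Mus musculus")
        #
        # use cached data for speed
        #
        ignore_cache = False
        #
        # get protein coding genes only
        #
        genes_by_type = gene_structures.get_genes(gtf_file, logger, ["protein_coding"],
                                                  ignore_cache=ignore_cache)
        #
        # print out gene counts
        #
        t_parse_gtf.log_gene_types(logger, genes_by_type)
        return genes_by_type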
Source Distribution
gtf_to_genes-1.0beta2.tar.gz (18.5 kB)
File metadata
- Size: 18.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ad86d85d3c555e8605a32ca6d9ad34c537e5c2b7c79094b895bc28b45b30718f |
| MD5 | 6de7b3da40b7147fc92ceb6b048cde9c |
| BLAKE2b-256 | d89bdbb884d7ddba138899608665954b0da8a749127f9d800fb1b67c606acf67 |