Parse GTF files with Polars

Project description

gtf-polars

Implements a memory-efficient GTF parser that stays fully lazy until .collect() is called. For more information on the Polars Lazy API, see the Polars documentation.

Scripts

scripts/subset_gtf_feature.py

Filters a GTF file to keep only rows matching one or more feature types (e.g. gene, transcript, exon) and writes the result as a tab-separated GTF file.

python scripts/subset_gtf_feature.py gencode.v39.annotation.gtf \
    --feature gene transcript \
    --output subset.gtf
Argument    Description
gtf_file    Path to the input GTF or gzipped GTF file
--feature   One or more feature types to keep
--output    Output path (default: subset.gtf)
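The filtering that scripts/subset_gtf_feature.py performs can be sketched with the standard library alone. This is not the script's actual implementation (which presumably uses Polars); the function names `open_gtf` and `subset_gtf` are hypothetical, and the sketch assumes that comment lines starting with `#` are dropped from the output.

```python
import gzip
from pathlib import Path


def open_gtf(path):
    """Open a plain or gzipped GTF file in text mode."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")


def subset_gtf(gtf_file, features, output="subset.gtf"):
    """Write rows whose feature column (3rd field) is in `features`; return the count."""
    wanted = set(features)
    kept = 0
    with open_gtf(gtf_file) as src, open(output, "w") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip header/comment lines (an assumption of this sketch)
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] in wanted:
                dst.write(line)  # rows are copied unchanged, so output stays tab-separated
                kept += 1
    return kept
```

Calling `subset_gtf("gencode.v39.annotation.gtf", ["gene", "transcript"])` would mirror the command shown above.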

scripts/transcript_to_gene.py

Builds a transcript-to-gene mapping CSV from a GTF file by extracting transcript_id, gene_id, and gene_name from transcript rows. Useful for downstream tools (e.g. alevin, tximeta) that require a tx2gene table.

python scripts/transcript_to_gene.py isoseq.gtf --output transcript_to_gene.csv
Argument    Description
gtf_file    Path to the input GTF file
--output    Output CSV path (default: transcript_to_gene.csv)
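As a rough illustration of how a tx2gene table can be derived from transcript rows, the sketch below parses the GTF attribute column with the standard library. It is not the script's own code: the helper names, the CSV header row, and the column order are assumptions, and missing attributes are written as empty strings.

```python
import csv
import re

# GTF attributes look like: key "value"; key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict (first occurrence of each key wins)."""
    attrs = {}
    for key, value in ATTR_RE.findall(attr_field):
        attrs.setdefault(key, value)
    return attrs


def transcript_to_gene(gtf_file, output="transcript_to_gene.csv"):
    """Write one (transcript_id, gene_id, gene_name) row per transcript; return the rows."""
    rows = []
    with open(gtf_file) as src:
        for line in src:
            if line.startswith("#"):
                continue  # skip header/comment lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != "transcript":
                continue  # only transcript rows contribute to the mapping
            attrs = parse_attributes(fields[8])
            rows.append(
                (
                    attrs.get("transcript_id", ""),
                    attrs.get("gene_id", ""),
                    attrs.get("gene_name", ""),
                )
            )
    with open(output, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["transcript_id", "gene_id", "gene_name"])
        writer.writerows(rows)
    return rows
```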

Library usage

from gtf_polars import parse_gtf
import polars as pl

lf = parse_gtf("gencode.v39.annotation.sorted.gtf", attributes_to_extract=["gene_id", "gene_name"])

# Keep transcript rows and a few columns; nothing is read until collect().
df = (
    lf.filter(pl.col("feature") == "transcript")
    .select(["seqname", "start", "end", "gene_id", "gene_name"])
    .collect()
)

df.head()
shape: (5, 5)
┌─────────┬───────┬───────┬───────────────────┬─────────────┐
│ seqname ┆ start ┆ end   ┆ gene_id           ┆ gene_name   │
│ ---     ┆ ---   ┆ ---   ┆ ---               ┆ ---         │
│ str     ┆ i64   ┆ i64   ┆ str               ┆ str         │
╞═════════╪═══════╪═══════╪═══════════════════╪═════════════╡
│ chr1    ┆ 11869 ┆ 14409 ┆ ENSG00000223972.5 ┆ DDX11L1     │
│ chr1    ┆ 12010 ┆ 13670 ┆ ENSG00000223972.5 ┆ DDX11L1     │
│ chr1    ┆ 14404 ┆ 29570 ┆ ENSG00000227232.5 ┆ WASH7P      │
│ chr1    ┆ 17369 ┆ 17436 ┆ ENSG00000278267.1 ┆ MIR6859-1   │
│ chr1    ┆ 29554 ┆ 31097 ┆ ENSG00000243485.5 ┆ MIR1302-2HG │
└─────────┴───────┴───────┴───────────────────┴─────────────┘

Download files

Source Distribution

gtf_polars-0.1.0.tar.gz (5.4 kB)

Uploaded Source

Built Distribution

gtf_polars-0.1.0-py3-none-any.whl (4.6 kB)

Uploaded Python 3

File details

Details for the file gtf_polars-0.1.0.tar.gz.

File metadata

  • Download URL: gtf_polars-0.1.0.tar.gz
  • Upload date:
  • Size: 5.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for gtf_polars-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       7981cc75f1c52a7ab73e50663a2e274f5fa22f838a9d458a9ff9b15a814e6662
MD5          6b11c7363f3ec20ebba65e236e91fc2f
BLAKE2b-256  53f8cca6d43631048c8e28c833b9018334c7ddd3573ed8512d69786662b759f6

File details

Details for the file gtf_polars-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: gtf_polars-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.9.12

File hashes

Hashes for gtf_polars-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       1ee4051a3b3fa3c43a7c63a6314f6ee7d59e734ffd138309c438e5150ce1e655
MD5          d22f3846af1e12b2a2a90c6953489b9d
BLAKE2b-256  8aa8ea675db0738a26abc2f853b6358b28412237e865211f7e027b3f28a755a4
