
GFFBase — Rust-accelerated GFF3/GTF parser with a DuckDB-backed storage engine and a drop-in gffutils-compatible Python API.


GFFBase



What is GFFBase?

GFFBase is a high-performance genomic-annotation engine combining a SIMD Rust parser, a DuckDB columnar backend, and a zero-copy PyArrow interface — purpose-built for whole-genome-scale ingest and bulk machine-learning feature extraction, while remaining a drop-in successor to gffutils.

A SIMD Rust+PyO3 parser feeds DuckDB's columnar storage through record-batch Arrow handoffs. A smart query router auto-picks an R-tree or B-tree spatial index per query, and a closure-cache / recursive-CTE relational dispatcher selects the right strategy based on the corpus's actual hierarchy depth. The full FeatureDB / Feature / create_db / DataIterator / GFFWriter / merge_criteria legacy API is preserved verbatim — most users migrate by changing one import line.

Three reasons it matters

  1. 🚀 ≥ 32× faster GENCODE GTF ingest (v49, 6.07 M lines) — and mathematically more efficient: legacy needs a Python loop + ~5 million correlated SQLite subqueries to invent the missing gene/transcript rows, while gffbase does the same work in two set-based DuckDB GROUP BY aggregations + one recursive CTE. (Proven by a same-release GTF/GFF3 head-to-head)
  2. ⚡ 36.68× faster bulk ML extraction — children_batched(format='arrow') returns 50 000 transcripts → 1.6 M exons as a zero-copy PyArrow table in 1.16 s. No Python Feature objects, ever. (How?)
  3. 🛡️ Validated NCBI compliance — all four canonical human-genome annotations (GENCODE / RefSeq / MANE / CHESS 3) ingest cleanly with zero strict-mode warnings. RefSeq's split-CDS duplicate-ID convention is handled automatically.

⚡ Comprehensive Human Genome Annotations — validated across every canonical corpus

Validated head-to-head against legacy gffutils on the four canonical human-genome annotation sources, including the GENCODE v49 GTF and GFF3 versions of the same release — a same-biology, same-features, different-format pairing that exposes the GTF Synthesis Advantage in its purest form:

Corpus | Format | Lines | gffbase ingest | legacy ingest | speedup | spatial qps | batched (5 k anchors)
GENCODE v49 (basic) | GTF | 6,068,892 | 4 min 37 s | ≥ 2 hr 30 min[^1] | 🚀 ≥ 32× | 1,204 | 172 ms / 596 k desc
GENCODE v49 (basic) | GFF3 | 6,066,054 | 6 min 7 s | 11 min 23 s | 1.86× | 1,292 | 422 ms / 1.93 M desc
RefSeq GRCh38.p14 | GFF3 | 4,932,571 | 4 min 12 s[^2] | 6 min 5 s | 1.45× | 1,011 | 263 ms / 999 k desc
MANE v1.5 (Ensembl) | GFF3 | 524,834 | 21.6 s | 45.1 s | 2.09× | 1,766 | 78 ms / 156 k desc
CHESS 3.1.3 | GFF3 | 2,761,061 | 53.6 s | 2 min 13.1 s | 2.48× | 1,175 | 91 ms / 161 k desc

[^1]: Legacy gffutils.create_db() on GENCODE v49 GTF (6.07 M lines) hits the bench's safety-valve cap (75 min). The reported wall is a conservative 2× extrapolation — the canonical GENCODE v45 GTF (2.0 M lines, 3× smaller) ran uncapped at 3,582 s (59 min 42 s) on the same hardware, so the v49 wall is well past 2 hours. See Performance Comparison §"GTF Synthesis Advantage" for the formal cost model.

[^2]: Result of the v0.1.0 ingest-pipeline optimization — the same RefSeq corpus used to take 7 min 49 s before the GFF3 path was re-architected to stamp seqid_y and bbox inline during the Arrow batch INSERT.

The same biological release, ingested in two different formats, by two different engines — that's the load-bearing comparison. Legacy GFF3 ingest finishes in 11 min because every parent edge is explicit; legacy GTF ingest takes hours because the parent rows have to be invented from the data (one Python ↔ SQLite round-trip per missing row). gffbase replaces those millions of round-trips with two set-based DuckDB GROUP BY aggregations + one recursive CTE — the same code path runs for GTF and GFF3, which is why the gffbase column barely shifts (4 min 37 s → 6 min 7 s) between the two rows while the legacy column balloons by 13×–20×.
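
To make the set-based claim concrete, here is a minimal, self-contained sketch using DuckDB directly. The toy schema and rows are illustrative only — they are not gffbase's internal tables or SQL — but they show how every missing transcript parent falls out of a single GROUP BY instead of one correlated lookup per missing row:

import duckdb

con = duckdb.connect()  # in-memory database, just for the sketch
con.execute("""
    CREATE TABLE features (
        id VARCHAR, featuretype VARCHAR, seqid VARCHAR,
        start BIGINT, "end" BIGINT, transcript_id VARCHAR
    )
""")
con.execute("""
    INSERT INTO features VALUES
        ('exon:1', 'exon', 'chr1', 100, 200, 'tx1'),
        ('exon:2', 'exon', 'chr1', 300, 450, 'tx1'),
        ('exon:3', 'exon', 'chr1', 900, 990, 'tx2')
""")

# One set-based pass synthesizes every missing transcript row at once:
# a parent's span is simply MIN(start)..MAX(end) over its children.
con.execute("""
    INSERT INTO features
    SELECT transcript_id, 'transcript', seqid,
           MIN(start), MAX("end"), transcript_id
    FROM features
    WHERE featuretype = 'exon'
    GROUP BY transcript_id, seqid
""")
print(con.execute("SELECT * FROM features WHERE featuretype = 'transcript'").fetchall())

# A second GROUP BY over the gene attribute synthesizes the gene rows the
# same way; legacy gffutils instead issues one SQLite round-trip per
# missing parent from inside a Python loop.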

Robustness: every corpus ingests cleanly with zero strict-mode warnings from the NCBI-spec-hardened Rust parser (9 enforced rules, line-numbered GFFFormatError, opt-in non-strict mode). RefSeq's notorious duplicate-ID=cds-NP_xxx convention (split CDS segments) is handled transparently — gffbase mirrors gffutils.merge_strategy="create_unique" automatically and records the remap in the duplicates table. No config knobs to flip.
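
For context, the knob this replaces looks roughly like the following in legacy gffutils (an illustrative sketch — the RefSeq filename is a placeholder; gffbase needs none of this):

import gffutils

# Legacy gffutils: RefSeq's repeated ID=cds-NP_xxx rows abort ingest under
# the default merge_strategy="error", so you opt into deduplication yourself.
legacy_db = gffutils.create_db(
    "GRCh38_latest_genomic.gff.gz",     # placeholder path
    "refseq_legacy.sqlite3",
    merge_strategy="create_unique",     # the behaviour gffbase applies automatically
    force=True,
)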

📊 Full reproducible numbers + per-corpus root-cause analysis: PERFORMANCE_COMPARISON.md. Re-run via python benchmarks/06_mega.py --legacy-timeout 900.


🚀 The Killer Feature — zero-copy PyArrow for ML pipelines

Modern ML genomics pipelines have one shape: pull every exon for 50 000 transcripts, push the column-oriented table into a tensor, train. Legacy gffutils forces a per-feature Python loop — constructing 1.6 M throwaway Feature objects per pull, which crushes both wall time and memory. gffbase bypasses Python entirely with a single batched call that returns DuckDB's internal Arrow buffers directly:

# 50 000 transcript IDs → every exon, in one query.
# Returns a zero-copy pyarrow.Table — no Python `Feature` object
# is constructed at any layer.
exons = db.children_batched(
    transcript_ids,
    featuretype="exon",
    format="arrow",        # or "df" / "polars"
)

# Hand off directly to PyTorch / Hugging Face datasets / JAX / Lance.
import torch
starts = torch.from_numpy(exons.column("start").to_numpy())
ends   = torch.from_numpy(exons.column("end").to_numpy())
# The "anchor" column carries the input id for each row, so you can
# reconstruct per-transcript groups without re-issuing N queries.
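
If you do want per-transcript groups back on the Python side, one option (a sketch, not part of the gffbase API) is to group the flat table by that anchor column — shown here with polars, one of the supported return formats:

import polars as pl

# Sketch: rebuild per-transcript exon groups from the flat batched result.
# Assumes the id-carrying column is literally named "anchor" (as described
# above) and a reasonably recent polars (from_arrow / group_by).
exon_df = pl.from_arrow(exons)
per_tx = exon_df.group_by("anchor").agg(
    pl.col("start"),   # exon starts collected into one list per transcript
    pl.col("end"),
)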

Numbers for that one call (50 000 transcripts, GENCODE basic annotation, returning 1.6 M exon rows):

Path | Wall | vs legacy
gffbase children_batched(format='arrow') | 1.16 s | 36.68× faster
legacy gffutils row-by-row loop | 42.55 s | 1.0× (baseline)
gffbase row-by-row loop | ≥ 642 s | 0.07× (slower!)

This is the reason GFFBase exists. Iterating for x in ids: db.children(x) with DuckDB pays vectorization startup per call and is slower than legacy's SQLite row-by-row path — but the batched API obliterates both row-by-row paths because it issues one set-based SQL query and avoids constructing any Python Feature objects whatsoever.

region_batched(...) and parents_batched(...) have the same zero-copy contract for spatial and parent workloads.


📦 Installation

pip install gffbase

Universal abi3-py39 wheels — single binary per arch covers CPython 3.9 → 3.13. No Rust toolchain required at install time.

For source/dev installs (Rust ≥ 1.69 + maturin):

pip install -e .[dev]
maturin develop --release

🏃 Quick start — row-by-row (drop-in for gffutils)

from gffbase import create_db

# 1. Ingest a GTF/GFF3 in seconds (auto-detects format, gzipped OK).
db = create_db("gencode.v49.chr_patch_hapl_scaff.basic.annotation.gtf.gz",
               "gencode.duckdb", force=True)

# 2. Walk a single gene's hierarchy.
for tx in db.children("ENSG00000139618", level=1, featuretype="transcript"):
    print(tx.id, tx.start, tx.end)

# 3. Spatial overlap query — uses the per-seqid R-tree under the hood.
for f in db.region("chr17:43044295-43125483", featuretype="exon"):
    print(f)

If you're migrating from gffutils, change one line:

import gffbase as gffutils    # one-line alias migration
db = gffutils.create_db(...)  # everything else identical

(But please read the Migration Guide first — it has one important note about ML loops.)
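
The short version of that note, using only calls shown on this page: keep the drop-in loop for one-off lookups, but move hot ML loops to the batched API.

# Fine for a handful of lookups, but slow if you loop over tens of
# thousands of IDs on DuckDB (see the benchmark table above):
for tx_id in transcript_ids:
    for exon in db.children(tx_id, featuretype="exon"):
        ...

# The batched form issues one set-based query instead:
exons = db.children_batched(transcript_ids, featuretype="exon", format="arrow")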


🤖 Quick start — vectorized for ML

from gffbase import FeatureDB

db = FeatureDB("gencode.duckdb")

# Pull every exon for 50 000 transcripts — one set-based SQL query.
exons = db.children_batched(
    transcript_ids,                # iterable of 50 000 IDs
    featuretype="exon",
    format="arrow",                # "df" / "polars" also supported
)
# exons is a pyarrow.Table sharing memory with DuckDB. No copies.

# Spatial: "for each ATAC-seq peak, find every overlapping CDS."
peaks = [("chr1", 100_000, 110_000), ("chr1", 200_000, 210_000), ...]
overlaps = db.region_batched(peaks, featuretype="CDS", format="arrow")

See the Machine Learning Workflows Cookbook for end-to-end pipelines with PyTorch and Hugging Face datasets.


✨ What's inside

  • Rust + PyO3 parser — SIMD line/tab splitting, lazy URL-decoding, GTF semicolon-in-quotes safe, gzipped input transparent. Hardened against the NCBI GFF3 spec (line-numbered GFFFormatError, strict / non-strict modes, 9 enforced rules).
  • DuckDB columnar storage — 7-table schema, set-based GTF gene/transcript synthesis, recursive-CTE transitive closure, per-seqid-banded R-tree spatial index built inline during ingest.
  • Smart routing — region() auto-picks R-tree vs B-tree; children() auto-picks closure cache vs dynamic CTE based on measured corpus depth.
  • Vectorized batched API — children_batched, parents_batched, region_batched return pyarrow.Table / pandas.DataFrame / polars.DataFrame directly out of DuckDB's buffer pool.
  • Drop-in legacy API — FeatureDB, Feature, create_db, DataIterator, GFFWriter, merge_criteria, interfeatures, bed12, the execute() SQL escape hatch (sketched just after this list), export_sqlite().
  • abi3 wheels — single binary per arch covers CPython 3.9–3.13.
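
A quick taste of the execute() escape hatch — a sketch only: the features table name is an assumption about the schema rather than something documented here, and the exact cursor type execute() returns may differ:

from gffbase import FeatureDB

db = FeatureDB("gencode.duckdb")
# Assumption: execute() mirrors the gffutils escape hatch and yields rows,
# and a table named "features" holds one row per annotation record.
for row in db.execute(
    "SELECT featuretype, COUNT(*) AS n FROM features "
    "GROUP BY featuretype ORDER BY n DESC"
):
    print(row)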

📚 Documentation

Full site (rendered with MkDocs Material) — build it locally:

pip install -e .[docs]
mkdocs serve            # http://localhost:8000
Page | What's there
Usage Gallery | Copy-pasteable snippets for every public API method
Performance comparison | Head-to-head numbers across every canonical human-genome annotation + per-corpus root-cause analysis
Migration guide for gffutils users | Drop-in compat checklist + the one OLAP/OLTP gotcha you must understand
Cookbooks | GENCODE/Ensembl, RefSeq, MANE, ML workflows
API reference | Every public method, full signatures + docstrings

🧪 Testing

pip install -e .[test]
pytest                  # 523 passed, 7 skipped, 99.19% coverage

CI runs the full matrix on Linux + macOS + Windows, both R-tree and B-tree fallback paths, on Python 3.9 / 3.11 / 3.13.


🤝 Contributing

GFFBase welcomes pull requests, bug reports, and feature suggestions. Start with CONTRIBUTING.md for the full guide:

  • Rust + Python development setup (maturin develop --release)
  • Running the test suite + the 99 % coverage gate
  • Branch naming, Conventional Commits, the PR checklist

The repo ships standard issue templates and a PR template so new contributions land with the context maintainers need to triage them quickly.


🪪 License

Apache License 2.0. See LICENSE.


Citation: if GFFBase helps your research, please cite the project at the Releases page.

