GFFBase — Rust-accelerated GFF3/GTF parser with a DuckDB-backed storage engine and a drop-in gffutils-compatible Python API.
GFFBase
What is GFFBase?
GFFBase is a high-performance genomic-annotation engine combining a
SIMD Rust parser, a DuckDB columnar backend, and a zero-copy PyArrow
interface — purpose-built for whole-genome-scale ingest and bulk
machine-learning feature extraction, while remaining a drop-in
successor to gffutils.
A SIMD Rust+PyO3 parser feeds DuckDB's columnar storage through
record-batch Arrow handoffs. A smart query router auto-picks an
R-tree or B-tree spatial index per query, and a closure-cache /
recursive-CTE relational dispatcher selects the right strategy based
on the corpus's actual hierarchy depth. The full FeatureDB /
Feature / create_db / DataIterator / GFFWriter /
merge_criteria legacy API is preserved verbatim — most users
migrate by changing one import line.
Three reasons it matters
- 🚀 ≥ 32× faster GENCODE GTF ingest (v49, 6.07 M lines), and algorithmically more efficient: legacy gffutils needs a Python loop plus ~5 million correlated SQLite subqueries to synthesize the missing gene/transcript rows, while gffbase does the same work in two set-based DuckDB GROUP BY aggregations and one recursive CTE. (Proven by a same-release GTF/GFF3 head-to-head)
- ⚡ 36.68× faster bulk ML extraction: children_batched(format='arrow') returns 50 000 transcripts → 1.6 M exons as a zero-copy PyArrow table in 1.16 s. No Python Feature objects, ever. (How?)
- 🛡️ Validated NCBI compliance: all four canonical human-genome annotations (GENCODE / RefSeq / MANE / CHESS 3) ingest cleanly with zero strict-mode warnings. RefSeq's split-CDS duplicate-ID convention is handled automatically.
⚡ Comprehensive Human Genome Annotations — validated across every canonical corpus
Validated head-to-head against legacy gffutils on the four canonical
human-genome annotation sources, including the GENCODE v49 GTF and
GFF3 versions of the same release — a same-biology, same-features,
different-format pairing that exposes the GTF Synthesis Advantage in
its purest form:
| Corpus | Format | Lines | gffbase ingest | legacy ingest | speedup | spatial qps | batched (5 k anchors) |
|---|---|---|---|---|---|---|---|
| GENCODE v49 (basic) | GTF | 6,068,892 | 4 min 37 s | ≥ 2 hr 30 min[^1] | 🚀 ≥ 32× | 1,204 | 172 ms / 596 k desc |
| GENCODE v49 (basic) | GFF3 | 6,066,054 | 6 min 7 s | 11 min 23 s | 1.86× | 1,292 | 422 ms / 1.93 M desc |
| RefSeq GRCh38.p14 | GFF3 | 4,932,571 | 4 min 12 s[^2] | 6 min 5 s | 1.45× | 1,011 | 263 ms / 999 k desc |
| MANE v1.5 (Ensembl) | GFF3 | 524,834 | 21.6 s | 45.1 s | 2.09× | 1,766 | 78 ms / 156 k desc |
| CHESS 3.1.3 | GFF3 | 2,761,061 | 53.6 s | 2 min 13.1 s | 2.48× | 1,175 | 91 ms / 161 k desc |
[^1]: Legacy gffutils.create_db() on GENCODE v49 GTF (6.07 M lines) hits the bench's safety-valve cap (75 min). The reported wall is a conservative 2× extrapolation of that cap (2 × 75 min = 2 hr 30 min). The canonical GENCODE v45 GTF (2.0 M lines, 3× smaller) ran uncapped at 3,582 s (59 min 42 s) on the same hardware, so the v49 wall is well past 2 hours. See Performance Comparison §"GTF Synthesis Advantage" for the formal cost model.
[^2]: Result of the v0.1.0 ingest-pipeline optimization — the same RefSeq corpus used to take 7 min 49 s before the GFF3 path was re-architected to stamp seqid_y and bbox inline during the Arrow batch INSERT.
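The speedup column can be re-derived from the two ingest columns. The sketch below is only an arithmetic check, not part of gffbase; `to_seconds`, `ROWS`, and `speedups` are illustrative names. It converts the wall times quoted in the table to seconds and recomputes each ratio. The GENCODE GTF legacy figure is the extrapolated lower bound from footnote 1, so its ratio is a lower bound too.

```python
import re


def to_seconds(wall: str) -> float:
    """Convert strings like '2 hr 30 min', '4 min 37 s' or '21.6 s' to seconds."""
    units = {"hr": 3600.0, "min": 60.0, "s": 1.0}
    parts = re.findall(r"([\d.]+)\s*(hr|min|s)\b", wall)
    return sum(float(value) * units[unit] for value, unit in parts)


# (corpus, format, gffbase ingest, legacy ingest), copied from the table above.
ROWS = [
    ("GENCODE v49 (basic)", "GTF", "4 min 37 s", "2 hr 30 min"),  # legacy: lower bound
    ("GENCODE v49 (basic)", "GFF3", "6 min 7 s", "11 min 23 s"),
    ("RefSeq GRCh38.p14", "GFF3", "4 min 12 s", "6 min 5 s"),
    ("MANE v1.5 (Ensembl)", "GFF3", "21.6 s", "45.1 s"),
    ("CHESS 3.1.3", "GFF3", "53.6 s", "2 min 13.1 s"),
]


def speedups(rows):
    """Map (corpus, format) to legacy_seconds / gffbase_seconds."""
    return {
        (corpus, fmt): to_seconds(legacy) / to_seconds(new)
        for corpus, fmt, new, legacy in rows
    }


if __name__ == "__main__":
    for (corpus, fmt), ratio in speedups(ROWS).items():
        print(f"{corpus:22s} {fmt:5s} {ratio:6.2f}x")
```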
The same biological release, ingested in two different formats, by
two different engines — that's the load-bearing comparison. Legacy
GFF3 ingest finishes in 11 min because every parent edge is explicit;
legacy GTF ingest takes hours because the parent rows have to be
invented from the data (one Python ↔ SQLite round-trip per missing
row). gffbase replaces those millions of round-trips with two
set-based DuckDB GROUP BY aggregations + one recursive CTE — the
same code path runs for GTF and GFF3, which is why the gffbase
column barely shifts (4 min 37 s → 6 min 7 s) between the two rows
while the legacy column balloons by 13×–20×.
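To make the "two GROUP BY aggregations + one recursive CTE" idea concrete, here is a minimal sketch. It uses Python's built-in sqlite3 purely as a stand-in so it runs without extra dependencies; the table and column names (`features`, `relations`, `start_pos`, `end_pos`) are hypothetical and are not gffbase's actual schema. It also assumes that each aggregation synthesizes one missing level (transcripts, then genes), which the text above does not spell out.

```python
import sqlite3


def build_demo():
    """Load exon-only rows (as in a GTF without gene/transcript lines),
    then synthesize the parents with set-based SQL instead of a Python loop."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT, seqid TEXT,
                               start_pos INTEGER, end_pos INTEGER,
                               gene_id TEXT, transcript_id TEXT);
        CREATE TABLE relations (parent TEXT, child TEXT);
    """)
    exons = [
        ("e1", "chr1", 100, 200, "G1", "T1"),
        ("e2", "chr1", 300, 400, "G1", "T1"),
        ("e3", "chr1", 150, 250, "G1", "T2"),
        ("e4", "chr2", 10, 50, "G2", "T3"),
    ]
    con.executemany(
        "INSERT INTO features VALUES (?, 'exon', ?, ?, ?, ?, ?)", exons)
    con.executescript("""
        -- Aggregation 1: one synthesized transcript row per transcript_id.
        INSERT INTO features
        SELECT transcript_id, 'transcript', MIN(seqid), MIN(start_pos),
               MAX(end_pos), MIN(gene_id), transcript_id
        FROM features WHERE featuretype = 'exon' GROUP BY transcript_id;

        -- Aggregation 2: one synthesized gene row per gene_id.
        INSERT INTO features
        SELECT gene_id, 'gene', MIN(seqid), MIN(start_pos),
               MAX(end_pos), gene_id, NULL
        FROM features WHERE featuretype = 'exon' GROUP BY gene_id;

        -- Direct parent -> child edges.
        INSERT INTO relations
        SELECT transcript_id, id FROM features WHERE featuretype = 'exon';
        INSERT INTO relations
        SELECT gene_id, id FROM features WHERE featuretype = 'transcript';
    """)
    return con


def closure(con):
    """Transitive closure of the parent/child edges via one recursive CTE."""
    return con.execute("""
        WITH RECURSIVE closure(ancestor, descendant, depth) AS (
            SELECT parent, child, 1 FROM relations
            UNION ALL
            SELECT c.ancestor, r.child, c.depth + 1
            FROM closure AS c JOIN relations AS r ON r.parent = c.descendant
        )
        SELECT ancestor, descendant, depth FROM closure
        ORDER BY ancestor, depth, descendant
    """).fetchall()


if __name__ == "__main__":
    for row in closure(build_demo()):
        print(row)
```

The point of the sketch is the shape of the work: the number of SQL statements stays constant as the corpus grows, whereas a per-row approach issues one round-trip per missing parent.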
Robustness: every corpus ingests cleanly with zero strict-mode
warnings from the NCBI-spec-hardened Rust parser (9 enforced rules,
line-numbered GFFFormatError, opt-in non-strict mode). RefSeq's
notorious duplicate-ID=cds-NP_xxx convention (split CDS segments) is
handled transparently — gffbase mirrors
gffutils.merge_strategy="create_unique" automatically and records the
remap in the duplicates table. No config knobs to flip.
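As a rough illustration of what a create_unique-style policy does, the sketch below keeps the first occurrence of an ID as-is and gives later occurrences a numbered suffix, recording every rename. The function name, the `_1`, `_2` suffix scheme, and the shape of the remap records are assumptions made for this example; the actual layout of gffbase's duplicates table is not shown here.

```python
from collections import defaultdict


def assign_unique_ids(ids, sep="_"):
    """Return (unique_ids, remap).

    The first occurrence of an ID is kept unchanged; the n-th repeat becomes
    f"{id}{sep}{n}". Each rename is recorded as (line_no, original, new_id).
    Note: this sketch does not guard against a suffixed ID colliding with an
    ID that already exists in the input.
    """
    seen = defaultdict(int)
    unique, remap = [], []
    for line_no, fid in enumerate(ids, start=1):
        n = seen[fid]
        seen[fid] += 1
        if n == 0:
            unique.append(fid)
        else:
            new_id = f"{fid}{sep}{n}"
            unique.append(new_id)
            remap.append((line_no, fid, new_id))
    return unique, remap


if __name__ == "__main__":
    # A split CDS reuses the same ID on several lines (illustrative IDs).
    ids = ["cds-NP_1", "cds-NP_1", "gene-A", "cds-NP_1"]
    unique, remap = assign_unique_ids(ids)
    print(unique)
    print(remap)
```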
📊 Full reproducible numbers + per-corpus root-cause analysis:
PERFORMANCE_COMPARISON.md. Re-run via
python benchmarks/06_mega.py --legacy-timeout 900.
🚀 The Killer Feature — zero-copy PyArrow for ML pipelines
Modern ML genomics pipelines have one shape: pull every exon for
50 000 transcripts, push the column-oriented table into a tensor,
train. Legacy gffutils forces a per-feature Python loop —
constructing 1.6 M throwaway Feature objects per pull, which crushes
both wall time and memory. gffbase bypasses Python entirely with a
single batched call that returns DuckDB's internal Arrow buffers
directly:
# 50 000 transcript IDs → every exon, in one query.
# Returns a zero-copy pyarrow.Table — no Python `Feature` object
# is constructed at any layer.
exons = db.children_batched(
    transcript_ids,
    featuretype="exon",
    format="arrow",  # or "df" / "polars"
)
# Hand off directly to PyTorch / Hugging Face datasets / JAX / Lance.
import torch
starts = torch.from_numpy(exons.column("start").to_numpy())
ends = torch.from_numpy(exons.column("end").to_numpy())
# The "anchor" column carries the input id for each row, so you can
# reconstruct per-transcript groups without re-issuing N queries.
Numbers for that one call (50 000 transcripts, GENCODE basic annotation, returning 1.6 M exon rows):
| Path | Wall | vs legacy |
|---|---|---|
| gffbase children_batched(format='arrow') | 1.16 s | 36.68× faster |
| legacy gffutils row-by-row loop | 42.55 s | 1.0× (baseline) |
| gffbase row-by-row loop | ≥ 642 s | 0.07× (slower!) |
This is the reason GFFBase exists. Iterating
for x in ids: db.children(x) with DuckDB pays vectorization startup
per call and is slower than legacy's SQLite row-by-row path — but
the batched API obliterates both row-by-row paths because it issues
one set-based SQL query and avoids constructing any Python Feature
objects whatsoever.
region_batched(...) and parents_batched(...) have the same
zero-copy contract for spatial and parent workloads.
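The difference between the two access patterns, and the role of the anchor column, can be sketched with nothing but the standard library. The example below uses sqlite3 and a hypothetical single-table layout (`exons` with `transcript_id`, `exon_id`, `start_pos`, `end_pos`); it illustrates the query shapes only and is not gffbase's implementation or schema.

```python
import sqlite3
from collections import defaultdict


def make_db():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE exons (transcript_id TEXT, exon_id TEXT, "
                "start_pos INTEGER, end_pos INTEGER)")
    con.executemany("INSERT INTO exons VALUES (?, ?, ?, ?)", [
        ("T1", "e1", 100, 200), ("T1", "e2", 300, 400),
        ("T2", "e3", 150, 250), ("T3", "e4", 10, 50),
    ])
    return con


def exons_row_by_row(con, transcript_ids):
    """N queries: one round-trip per input ID."""
    out = []
    for tid in transcript_ids:
        out.extend(con.execute(
            "SELECT transcript_id, exon_id, start_pos, end_pos FROM exons "
            "WHERE transcript_id = ? ORDER BY start_pos", (tid,)))
    return out


def exons_batched(con, transcript_ids):
    """One set-based query; the 'anchor' column carries the input ID per row."""
    ids = list(transcript_ids)
    placeholders = ", ".join("?" for _ in ids)
    return con.execute(
        "SELECT transcript_id AS anchor, exon_id, start_pos, end_pos "
        f"FROM exons WHERE transcript_id IN ({placeholders}) "
        "ORDER BY anchor, start_pos", ids).fetchall()


def group_by_anchor(rows):
    """Rebuild per-transcript groups from the flat batched result."""
    groups = defaultdict(list)
    for anchor, exon_id, start, end in rows:
        groups[anchor].append((exon_id, start, end))
    return dict(groups)


if __name__ == "__main__":
    con = make_db()
    print(group_by_anchor(exons_batched(con, ["T1", "T2"])))
```

Both functions return the same rows; the batched form simply moves the loop out of Python and into the query engine, which is where a columnar, vectorized backend does its best work.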
📦 Installation
pip install gffbase
Universal abi3-py39 wheels — single binary per arch covers CPython
3.9 → 3.13. No Rust toolchain required at install time.
For source/dev installs (Rust ≥ 1.69 + maturin):
pip install -e .[dev]
maturin develop --release
🏃 Quick start — row-by-row (drop-in for gffutils)
from gffbase import create_db
# 1. Ingest a GTF/GFF3 (auto-detects format, gzipped OK). A whole-genome
#    GENCODE GTF takes minutes rather than seconds (see the table above).
db = create_db("gencode.v49.chr_patch_hapl_scaff.basic.annotation.gtf.gz",
               "gencode.duckdb", force=True)
# 2. Walk a single gene's hierarchy.
for tx in db.children("ENSG00000139618", level=1, featuretype="transcript"):
    print(tx.id, tx.start, tx.end)
# 3. Spatial overlap query — uses the per-seqid R-tree under the hood.
for f in db.region("chr17:43044295-43125483", featuretype="exon"):
    print(f)
If you're migrating from gffutils, change one line:
import gffbase as gffutils # one-line alias migration
db = gffutils.create_db(...) # everything else identical
(But please read the Migration Guide first — it has one important note about ML loops.)
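For codebases that must keep working where gffbase is not installed, one possible pattern is a guarded import that prefers gffbase and falls back to gffutils. This is a suggestion rather than something the project prescribes; `pick_backend` and `load_backend` are hypothetical helper names. Keep in mind that the batched methods exist only on gffbase, so code relying on them cannot fall back.

```python
import importlib
import importlib.util


def pick_backend(candidates=("gffbase", "gffutils")):
    """Return the first importable top-level module name, or None."""
    for name in candidates:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


def load_backend(candidates=("gffbase", "gffutils")):
    """Import and return the preferred available backend module."""
    name = pick_backend(candidates)
    if name is None:
        raise ImportError(f"none of {candidates} is installed")
    return importlib.import_module(name)


if __name__ == "__main__":
    print("selected backend:", pick_backend())
```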
🤖 Quick start — vectorized for ML
from gffbase import FeatureDB
db = FeatureDB("gencode.duckdb")
# Pull every exon for 50 000 transcripts — one set-based SQL query.
exons = db.children_batched(
    transcript_ids,          # iterable of 50 000 IDs
    featuretype="exon",
    format="arrow",          # "df" / "polars" also supported
)
# exons is a pyarrow.Table sharing memory with DuckDB. No copies.
# Spatial: "for each ATAC-seq peak, find every overlapping CDS."
peaks = [("chr1", 100_000, 110_000), ("chr1", 200_000, 210_000)]  # ...and so on
overlaps = db.region_batched(peaks, featuretype="CDS", format="arrow")
See the Machine Learning Workflows Cookbook for end-to-end pipelines with PyTorch and Hugging Face datasets.
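For readers who want to reason about what "overlapping" means in the spatial example, here is a brute-force reference in plain Python. It assumes closed intervals (GFF/GTF coordinates are 1-based and inclusive) and that each result row is tagged with the index of the peak that produced it; whether region_batched interprets peak tuples exactly this way is an assumption. It is O(peaks × features) and is meant only to pin down the semantics, not to mirror the indexed engine.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Closed-interval overlap test: the intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def region_overlaps(peaks, features):
    """Return (anchor_index, feature_id) for every peak/feature overlap.

    peaks:    iterable of (seqid, start, end)
    features: iterable of (seqid, start, end, feature_id)
    """
    rows = []
    for anchor, (p_seqid, p_start, p_end) in enumerate(peaks):
        for f_seqid, f_start, f_end, f_id in features:
            if f_seqid == p_seqid and overlaps(p_start, p_end, f_start, f_end):
                rows.append((anchor, f_id))
    return rows


if __name__ == "__main__":
    peaks = [("chr1", 100, 200), ("chr1", 300, 400)]
    features = [
        ("chr1", 200, 250, "cds1"),  # touches peak 0 at base 200
        ("chr1", 201, 299, "cds2"),  # falls between the two peaks
        ("chr2", 100, 200, "cds3"),  # same coordinates, different seqid
        ("chr1", 350, 360, "cds4"),  # inside peak 1
    ]
    print(region_overlaps(peaks, features))
```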
✨ What's inside
- Rust + PyO3 parser: SIMD line/tab splitting, lazy URL-decoding, GTF semicolon-in-quotes safe, transparent gzipped input. Hardened against the NCBI GFF3 spec (line-numbered GFFFormatError, strict / non-strict modes, 9 enforced rules).
- DuckDB columnar storage: 7-table schema, set-based GTF gene/transcript synthesis, recursive-CTE transitive closure, per-seqid-banded R-tree spatial index built inline during ingest.
- Smart routing: region() auto-picks R-tree vs B-tree; children() auto-picks closure cache vs dynamic CTE based on measured corpus depth.
- Vectorized batched API: children_batched, parents_batched, region_batched return pyarrow.Table / pandas.DataFrame / polars.DataFrame directly out of DuckDB's buffer pool.
- Drop-in legacy API: FeatureDB, Feature, create_db, DataIterator, GFFWriter, merge_criteria, interfeatures, bed12, execute() SQL escape hatch, export_sqlite().
- abi3 wheels: single binary per arch covers CPython 3.9–3.13.
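"Semicolon-in-quotes safe" refers to a classic GTF pitfall: attribute values may themselves contain `;`, so a naive `split(";")` cuts them in half. The sketch below shows the idea in plain Python. It is not the Rust parser, `parse_gtf_attributes` is a hypothetical name, and it deliberately ignores escaped quotes inside values.

```python
def parse_gtf_attributes(field: str) -> dict:
    """Parse a GTF column-9 string into {key: [values]}.

    A semicolon only ends an attribute when it is outside double quotes.
    Repeated keys (e.g. several 'tag' entries) accumulate in a list.
    """
    pieces, chunk, in_quotes = [], [], False
    for ch in field:
        if ch == '"':
            in_quotes = not in_quotes
        if ch == ";" and not in_quotes:
            pieces.append("".join(chunk))
            chunk = []
        else:
            chunk.append(ch)
    pieces.append("".join(chunk))  # trailing piece without a final ';'

    attrs = {}
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        key, _, value = piece.partition(" ")
        value = value.strip()
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]  # drop the surrounding quotes
        attrs.setdefault(key, []).append(value)
    return attrs


if __name__ == "__main__":
    line = 'gene_id "G1"; gene_name "a;b"; tag "basic"; tag "CCDS"; level 2;'
    print(parse_gtf_attributes(line))
```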
📚 Documentation
Full site (rendered with MkDocs Material) — build it locally:
pip install -e .[docs]
mkdocs serve # http://localhost:8000
| Page | What's there |
|---|---|
| Usage Gallery | Copy-pasteable snippets for every public API method |
| Performance comparison | Head-to-head numbers across every canonical human-genome annotation + per-corpus root-cause analysis |
| Migration guide for gffutils users | Drop-in compat checklist + the one OLAP/OLTP gotcha you must understand |
| Cookbooks | GENCODE/Ensembl, RefSeq, MANE, ML workflows |
| API reference | Every public method, full signatures + docstrings |
🧪 Testing
pip install -e .[test]
pytest # 523 passed, 7 skipped, 99.19% coverage
CI runs the full matrix on Linux + macOS + Windows, both R-tree and B-tree fallback paths, on Python 3.9 / 3.11 / 3.13.
🤝 Contributing
GFFBase welcomes pull requests, bug reports, and feature suggestions.
Start with CONTRIBUTING.md for the full guide:
- Rust + Python development setup (maturin develop --release)
- Running the test suite + the 99 % coverage gate
- Branch naming, Conventional Commits, the PR checklist
The repo ships standard issue templates and a PR template so new contributions land with the context maintainers need to triage them quickly.
🪪 License
Apache License 2.0. See LICENSE.
Citation: if GFFBase helps your research, please cite the project at the Releases page.