Command-line toolkit for GFF4 graph annotation files
GFF4tools
GFF4tools is a command-line toolkit for validating, indexing, querying, inspecting, extracting, and exporting GFF4 graph annotation files.
GFF4 is the graph-coordinate annotation exchange format. GFF4tools is the user
toolkit around that format. The Python import module remains gff4, while the
primary CLI binary is gff4tools; gff4 is retained as a compatibility alias.
Status: 0.4.0 is the current stable release target. The v0.3 real-data gate,
the v0.4 compressed benchmark gate, and the TestPyPI release-candidate install
smoke check all currently pass.
Current Formats
GFF4 currently supports two output shapes:
- --format single: the v0.2 canonical single-file .gff4 table, recommended for user-facing demos and handoff. This is the default. --format single also supports .gff4.gz paths for compressed read/write of the same canonical table layout.
- --format package: the v0.1 directory debug package, retained for internal inspection and existing smoke workflows.
GFF4tools accepts GFA1 graphs with embedded paths plus GFF3 annotations on those paths, then writes graph-projected feature records that can be validated, queried, indexed, inspected, and exported back to path-specific GFF3.
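To make the idea of a graph-projected feature more concrete, the sketch below clips a path interval to the path steps it overlaps. It is only an illustration under stated assumptions, not the GFF4tools implementation: the function name, the tuple layouts, the 0-based half-open coordinates, and the toy step lengths are all hypothetical.

```python
from typing import List, Tuple

Step = Tuple[str, str, int]       # (node_id, orient, node_length), hypothetical layout
Span = Tuple[str, str, int, int]  # (node_id, orient, start, end), offsets local to the step


def path_interval_to_walk(steps: List[Step], start: int, end: int) -> List[Span]:
    """Clip the 0-based half-open path interval [start, end) to per-step spans."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    walk: List[Span] = []
    offset = 0  # path coordinate at which the current step begins
    for node_id, orient, length in steps:
        step_start, step_end = offset, offset + length
        lo, hi = max(start, step_start), min(end, step_end)
        if lo < hi:
            # Offsets are measured from the start of the step as traversed;
            # a reverse-oriented step would still need flipping to node coordinates.
            walk.append((node_id, orient, lo - step_start, hi - step_start))
        offset = step_end
    if end > offset:
        raise ValueError("interval extends past the end of the path")
    return walk


# Toy path with made-up node lengths: n1 (50 bp), n2 (100 bp), n3 (40 bp).
steps = [("n1", "+", 50), ("n2", "+", 100), ("n3", "+", 40)]
walk = path_interval_to_walk(steps, 135, 175)
print(walk)
```

An interval that crosses a node boundary becomes one span per node, which is why a single GFF3 feature can map to several location spans.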
Supported:
- parse GFA1 S, L, and P records
- build embedded path step indexes
- convert path intervals to graph walks
- import GFF3 features onto graph coordinates
- validate graph walks and feature hierarchy
- query by node, edge, path interval, or gene
- query by feature type or sample ID
- summarize sample-level annotation-footprint presence/absence by graph node, node interval, or graph walk
- calculate footprint coverage using gene-span, exon, or CDS bases and emit long TSV or sample-by-feature matrices
- export path-specific GFF3 for round-trip checks
- inspect .gfa/.gfa.gz files with GFA1.1 W lines
- subset a W-line pangenome graph into a small P-line GFA region
- build and use a sidecar SQLite .gfi.sqlite query index
- reject stale indexes using source size, modification time, and SHA-256
- read, write, validate, query, index, and export compressed .gff4.gz single-file tables
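The stale-index check in the list above can be pictured as comparing a recorded fingerprint of the source file against its current state. The following is a minimal sketch of that general technique, not the actual GFF4tools code: the SourceFingerprint class and both function names are hypothetical, and how the real sidecar index stores these values is not shown here.

```python
import hashlib
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFingerprint:
    """Hypothetical record of the source file state at index-build time."""
    size: int
    mtime_ns: int
    sha256: str


def fingerprint(path: str) -> SourceFingerprint:
    """Collect size, modification time, and a streamed SHA-256 digest."""
    st = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return SourceFingerprint(st.st_size, st.st_mtime_ns, digest.hexdigest())


def is_stale(recorded: SourceFingerprint, path: str) -> bool:
    """Treat the index as stale if any of the three recorded values differs."""
    st = os.stat(path)
    # Cheap checks first: size and mtime avoid hashing when they already differ.
    if st.st_size != recorded.size or st.st_mtime_ns != recorded.mtime_ns:
        return True
    return fingerprint(path).sha256 != recorded.sha256


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.gff4")
    with open(src, "w", encoding="utf-8") as handle:
        handle.write("##gff4-version 0.2\n")
    recorded = fingerprint(src)
    stale_before = is_stale(recorded, src)  # file untouched since fingerprinting
    with open(src, "a", encoding="utf-8") as handle:
        handle.write("##section manifest\n")
    stale_after = is_stale(recorded, src)   # file grew, so the size check fails
    print(stale_before, stale_after)
```

Checking size and modification time before hashing keeps the common case cheap, while the digest catches edits that preserve both.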
Not currently in scope: de novo gene prediction, graph-aware alignment, snarls, orthogroup clustering, gene function inference, frameshift detection, reference synteny block inference, production database storage, web visualization, or polyploid-specific modeling.
What GFF4tools Does Not Infer
GFF4tools reports graph-coordinate annotation records and
annotation-footprint overlap over sample paths. It does not infer
orthogroups, pangene membership, de novo genes, gene function, expression,
frameshift status, or biological proof of gene presence. The footprint-pav
command and its pav compatibility alias should be interpreted as
annotation-footprint overlap summaries, not orthogroup-level gene PAV.
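As a rough picture of what an annotation-footprint overlap summary measures, the sketch below sums the bases of each sample's feature spans that fall inside a queried node interval, then thresholds the totals into a sample-by-feature matrix. Everything here is an assumption made for illustration: the record layout, function names, boolean status values, and toy data are hypothetical and do not reproduce the footprint-pav output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record: (sample_id, feature_id, node_id, start, end), 0-based half-open.
Span = Tuple[str, str, str, int, int]


def overlap_bp(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def footprint_overlap(
    spans: Iterable[Span], node_id: str, q_start: int, q_end: int
) -> Dict[Tuple[str, str], int]:
    """Sum overlap per (sample, feature); assumes a feature's spans do not overlap each other."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for sample_id, feature_id, span_node, start, end in spans:
        if span_node == node_id:
            totals[(sample_id, feature_id)] += overlap_bp(start, end, q_start, q_end)
    return dict(totals)


def status_matrix(
    totals: Dict[Tuple[str, str], int],
    samples: List[str],
    features: List[str],
    min_overlap_bp: int = 1,
) -> Dict[str, Dict[str, bool]]:
    """Pivot long-form totals into a sample-by-feature boolean matrix."""
    return {
        s: {f: totals.get((s, f), 0) >= min_overlap_bp for f in features}
        for s in samples
    }


spans = [
    ("SampleA", "gene1", "n2", 40, 90),
    ("SampleB", "gene1", "n2", 40, 60),
    ("SampleA", "gene2", "n3", 0, 30),
]
totals = footprint_overlap(spans, "n2", 45, 80)
matrix = status_matrix(totals, ["SampleA", "SampleB"], ["gene1", "gene2"], min_overlap_bp=20)
print(totals)
print(matrix)
```

A True cell only says that annotated bases overlap the queried region on that sample's path, which is why these summaries should not be read as orthogroup-level gene presence.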
Later releases will add copy/allele modeling, anchors, snarls, PAV/CNV matrices, graph-SV/GWAS annotation, and production storage.
Quick Start From Source
Use this path from a source checkout to understand and verify the project in
about 10 minutes. The commands below reference files under examples/, which
are shipped in the source distribution and repository but are not installed as
package data in the wheel.
- Install from a checkout:
python -m pip install -e .
gff4tools --help
- Build the v0.2 multi-sample PAV demo as a single .gff4 file:
gff4tools import-gff3 \
  --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
  --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
  --path-map examples/pav_multi_sample/path_map.tsv \
  --out /tmp/pav_multi_sample.gff4
- Inspect, validate, index, and query the generated file:
gff4tools stats /tmp/pav_multi_sample.gff4
gff4tools view /tmp/pav_multi_sample.gff4 --section features --head 5
gff4tools paths /tmp/pav_multi_sample.gff4
gff4tools validate /tmp/pav_multi_sample.gff4 \
  --gfa examples/pav_multi_sample/pav_multi_sample.gfa
gff4tools index /tmp/pav_multi_sample.gff4
gff4tools query /tmp/pav_multi_sample.gff4 --feature-type gene --format tsv
- Summarize sample-level annotation-footprint overlap directly on graph coordinates:
gff4tools footprint-pav /tmp/pav_multi_sample.gff4 --node n2
gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
  --node-interval n2:45-80
gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
  --walk '>n2:85-100>n3:0-25'
gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
  --node n2 \
  --coverage-basis CDS \
  --min-overlap-bp 1 \
  --matrix status
- Use the same workflow with compressed v0.4 exchange files:
gff4tools import-gff3 \
  --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
  --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
  --path-map examples/pav_multi_sample/path_map.tsv \
  --out /tmp/pav_multi_sample.gff4.gz
gff4tools validate /tmp/pav_multi_sample.gff4.gz \
  --gfa examples/pav_multi_sample/pav_multi_sample.gfa
gff4tools index /tmp/pav_multi_sample.gff4.gz
gff4tools query /tmp/pav_multi_sample.gff4.gz \
  --feature-type gene \
  --format tsv
gff4tools export-gff3 /tmp/pav_multi_sample.gff4.gz \
  --path SampleA#1#chr1 \
  --out /tmp/pav_multi_sample.export.gff3
- Run the test suite:
python -m pytest
- Run all demo workflows:
bash scripts/smoke_demos.sh
This writes demo outputs to /tmp/gff4_smoke/ and runs:
  - toy import, validate, and edge query
  - real-small import, validate, and reverse-edge query
  - Arabidopsis public package and single-file import, validate, query, and GFF3 export
  - multi-sample import, validate, node footprint-PAV, node-interval footprint-PAV, and walk footprint-PAV
- Smoke-check an installed wheel or PyPI package against your own GFF4 file:
gff4tools --version
gff4tools --help
gff4 --help
gff4tools stats path/to/file.gff4
gff4tools query path/to/file.gff4 --feature-type gene --format tsv
To run the bundled demos after a wheel install, use a source checkout or unpack the source distribution so the examples/ directory is available.
- Read the toy tutorial for the full command-by-command walkthrough.
- For v0.3 real pangenome graph work, inspect and subset a W-line graph:
gff4tools graph-inspect \
  --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
  --out /tmp/quinoa_graph_stats.json
gff4tools graph-subset \
  --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
  --reference-sample CquZ \
  --seqid Cq1A \
  --start0 0 \
  --end0 250000 \
  --out-gfa /tmp/quinoa_region.gfa \
  --out-path-map /tmp/quinoa_path_map.tsv \
  --out-stats /tmp/quinoa_region.stats.json
Or run the full local quinoa real-data gate:
bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh
Expected final signal:
QUINOA_CQU_V03_GATE: PASS
Run the release-facing multi-region gate:
bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh --multi-region
Expected final signal:
QUINOA_CQU_V03_MULTI_REGION_GATE: PASS
Run the v0.4 compressed single-file benchmark gate:
bash scripts/quinoa_realdata/run_quinoa_cqu_v04_benchmark.sh
Expected final signal:
QUINOA_CQU_V04_BENCHMARK_GATE: PASS
The v0.4 benchmark table records:
| Field | Meaning |
|---|---|
| plain_gff4_bytes | Size of the uncompressed canonical .gff4 file. |
| gzip_gff4_bytes | Size of the compressed .gff4.gz file. |
| compression_ratio | gzip_gff4_bytes / plain_gff4_bytes; lower is smaller. |
| wall_seconds | Elapsed time for the benchmarked command. |
| user_seconds / sys_seconds | CPU time used by the benchmarked command. |
| peak_rss_kib | Peak resident memory in KiB, normalized by the wrapper. |
| query/export parity | The gate asserts indexed vs scan query parity and plain vs gzip export parity. |
- Inspect the single-file layout:
##gff4-version 0.2
##format gff4-feature-table
##section manifest
#key value
gff4_version 0.2
##section sources
#source_id source_kind source_role ...
##section features
#feature_uid annotation_set_id source_feature_id ...
##section locations
#location_id feature_uid projection_set_id ...
##section location_spans
#location_id span_rank path_id step_rank node_id ...
##section nodes
#node_id node_length
##section edges
#edge_id from_node_id from_orient to_node_id to_orient ...
##section paths
#path_id sample_id haplotype_id contig_id path_length path_role
##section path_steps
#path_id step_rank node_id orient ...
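For readers who want to look at such a file without the CLI, the layout above can be read with a few lines of standard-library Python. This is a minimal sketch rather than the gff4 module's API: the function names are hypothetical, and it assumes that columns are tab-separated, that each section starts with a ##section line followed by one #-prefixed header line, and that other ## lines are file-level directives.

```python
import gzip
import io
from typing import Dict, IO, List


def open_text(path: str) -> IO[str]:
    """Open a plain .gff4 or compressed .gff4.gz file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_sections(handle: IO[str]) -> Dict[str, List[Dict[str, str]]]:
    """Group data rows by section, keyed by each section's header columns."""
    sections: Dict[str, List[Dict[str, str]]] = {}
    current = None
    header: List[str] = []
    for raw in handle:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##section "):
            current = line[len("##section "):].strip()
            sections[current] = []
            header = []
        elif line.startswith("##"):
            continue  # file-level directives such as ##gff4-version
        elif line.startswith("#"):
            header = line[1:].split("\t")  # assumed tab-separated column names
        elif current is not None:
            sections[current].append(dict(zip(header, line.split("\t"))))
    return sections


# Tiny in-memory example; the node rows are made up for illustration.
demo = (
    "##gff4-version 0.2\n"
    "##format gff4-feature-table\n"
    "##section manifest\n"
    "#key\tvalue\n"
    "gff4_version\t0.2\n"
    "##section nodes\n"
    "#node_id\tnode_length\n"
    "n1\t50\n"
    "n2\t100\n"
)
sections = read_sections(io.StringIO(demo))
print(sorted(sections))
print(sections["nodes"])
```

For a file on disk, the same reader can be used as read_sections(open_text(path)); all values stay strings, so numeric columns need explicit conversion.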
Demo Data
- examples/toy/: hand-checkable graph for learning the v0.1 coordinate model.
- examples/real_small/: semi-real demo with PanSN path names, GFF3 seqid aliases, UTR records, and a reverse-oriented path step.
- examples/arabidopsis_public/: public reference-backed Arabidopsis TAIR10/Araport11 AT1G01010 region projected onto a small single-path GFA.
- examples/pav_multi_sample/: v0.2 landing demo with multiple samples, long nodes containing multiple genes, a cross-node gene, and sample-level PAV.
- examples/quinoa_cqu_real/: v0.3 real pangenome graph gate for the local quinoa Minigraph-Cactus W-line graph.
Each demo has its own README with import, validate, query, and export commands.
Documentation
- MVP scope: what v0.1 does and does not attempt.
- v0.1 package spec: debug TSV package layout, coordinate rules, query surface, and path_map.tsv.
- v0.2 single-file spec: canonical single-file .gff4 graph annotation table layout.
- Roadmap: five major milestones from v0.2 format hardening to indexed queries and release-ready ecosystem integration.
- v0.2 release notes: stable MVP status, feature summary, real-data gate, scope, and known limitations.
- v0.2 release checklist: standard and real-data stable-gate checks.
- v0.3 spec: W-line graph input and sidecar index profile.
- v0.3 release notes: real pangenome graph readiness release line.
- v0.3 release checklist: quinoa real-data gate checks.
- v0.4 release checklist: compressed single-file benchmark checks.
- v0.4.0 release notes: stable compressed I/O and benchmark hardening release scope.
- v0.4.0rc1 release notes: compressed I/O, benchmark hardening, semantic guardrails, and release scope.
- v0.5 store/indexed PAV RFC: proposed internal store boundary and index tables for future indexed footprint-pav.
- GFF4tools CLI guide: productized command-line usage for inspection, query, indexing, export, and validation.
- MVP tutorial: executable toy workflow.
- Release checklist: verification steps before a v0.1 release-facing commit or tag.
- v0.1 release notes: release highlights, verification, demo coverage, and known limits.
Development
python -m pytest
Run the demo smoke workflow before release-facing changes:
bash scripts/smoke_demos.sh
The initial development target is the hand-checkable toy graph under examples/toy/.