Command-line toolkit for GFF4 graph annotation files

GFF4tools

GFF4tools is a command-line toolkit for validating, indexing, querying, inspecting, extracting, and exporting GFF4 graph annotation files.

GFF4 is the graph-coordinate annotation exchange format; GFF4tools is the user toolkit around it. The Python import module remains gff4. The primary CLI binary is gff4tools, and the gff4 command is retained as a compatibility alias.

Status: 0.4.0 is the current stable release target. The v0.3 real-data gate, the v0.4 compressed benchmark gate, and the TestPyPI release-candidate install smoke test all pass.

Current Formats

GFF4tools currently writes two output shapes, selected with --format:

  • --format single: the v0.2 canonical single-file .gff4 table, recommended for user-facing demos and handoff. This is the default. Paths ending in .gff4.gz are read and written as gzip-compressed files with the same canonical table layout.
  • --format package: the v0.1 directory debug package, retained for internal inspection and existing smoke workflows.
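
As an illustration of how one table layout can serve both plain .gff4 and compressed .gff4.gz paths, here is a minimal Python sketch that chooses gzip from the file suffix. The helper names open_table and compression_ratio are hypothetical and are not part of the gff4 module. The toy table content is made up for the example.

```python
import gzip
import os
import tempfile


def open_table(path, mode="rt"):
    """Open a .gff4 or .gff4.gz path as text, choosing gzip by suffix (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode, encoding="utf-8", newline="")
    return open(path, mode, encoding="utf-8", newline="")


def compression_ratio(plain_path, gzip_path):
    """Return gzip bytes / plain bytes; values below 1.0 mean the gzip file is smaller."""
    return os.path.getsize(gzip_path) / os.path.getsize(plain_path)


# Write the same toy table in both forms, then read the compressed copy back.
table = "##gff4-version 0.2\n##section nodes\n#node_id\tnode_length\n" + "".join(
    f"n{i}\t100\n" for i in range(1000)
)
workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "toy.gff4")
gzip_path = os.path.join(workdir, "toy.gff4.gz")
for path in (plain_path, gzip_path):
    with open_table(path, "wt") as handle:
        handle.write(table)

with open_table(gzip_path) as handle:
    round_trip = handle.read()
ratio = compression_ratio(plain_path, gzip_path)
```

Because both paths go through the same text interface, any reader or writer built on top of it does not need to know which form it was given.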

GFF4tools accepts GFA1 graphs with embedded paths plus GFF3 annotations on those paths, then writes graph-projected feature records that can be validated, queried, indexed, inspected, and exported back to path-specific GFF3.
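
The projection of a path interval onto graph coordinates can be sketched in Python as follows. This is not the GFF4tools implementation. It assumes 0-based half-open coordinates, S lines that carry a literal sequence rather than `*` with an LN tag, and a `>node:start-end` walk notation modeled on the --walk example in the quick start. The `<` prefix and the forward-strand reporting of reverse steps are assumptions of this sketch. All function names are hypothetical.

```python
from bisect import bisect_right

GFA_TEXT = (
    "H\tVN:Z:1.0\n"
    "S\tn1\tACGTACGTAC\n"
    "S\tn2\tGGGGGCCCCC\n"
    "S\tn3\tTTTTT\n"
    "L\tn1\t+\tn2\t+\t0M\n"
    "L\tn2\t+\tn3\t+\t0M\n"
    "P\tSampleA#1#chr1\tn1+,n2+,n3+\t*\n"
)


def parse_gfa1(text):
    """Collect node lengths, links, and path steps from GFA1 S, L, and P lines."""
    node_length, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            node_length[fields[1]] = len(fields[2])
        elif fields[0] == "L":
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # Each step is a node name followed by its orientation sign.
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return node_length, links, paths


def step_offsets(steps, node_length):
    """Return the path offset at which each step starts, plus the path length."""
    offsets, position = [], 0
    for node_id, _orient in steps:
        offsets.append(position)
        position += node_length[node_id]
    return offsets, position


def interval_to_walk(steps, node_length, start0, end0):
    """Project a 0-based half-open path interval onto oriented node spans."""
    offsets, path_length = step_offsets(steps, node_length)
    if not 0 <= start0 < end0 <= path_length:
        raise ValueError("interval outside path")
    walk = []
    rank = bisect_right(offsets, start0) - 1  # step that contains start0
    while rank < len(steps) and offsets[rank] < end0:
        node_id, orient = steps[rank]
        length = node_length[node_id]
        lo = max(start0, offsets[rank]) - offsets[rank]
        hi = min(end0, offsets[rank] + length) - offsets[rank]
        if orient == "-":
            # Assumption: reverse steps are reported in forward-strand node coordinates.
            lo, hi = length - hi, length - lo
        walk.append((">" if orient == "+" else "<", node_id, lo, hi))
        rank += 1
    return walk


def format_walk(walk):
    """Render spans in a '>node:start-end' style string."""
    return "".join(f"{sign}{node}:{lo}-{hi}" for sign, node, lo, hi in walk)


node_length, links, paths = parse_gfa1(GFA_TEXT)
walk = interval_to_walk(paths["SampleA#1#chr1"], node_length, 5, 22)
walk_text = format_walk(walk)
reverse_walk = interval_to_walk([("n1", "-")], node_length, 2, 5)
```

The cumulative step offsets play the role of a path step index: a binary search finds the first overlapping step, and the loop then clips each following step to the requested interval.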

Supported:

  • parse GFA1 S, L, and P records
  • build embedded path step indexes
  • convert path intervals to graph walks
  • import GFF3 features onto graph coordinates
  • validate graph walks and feature hierarchy
  • query by node, edge, path interval, or gene
  • query by feature type or sample ID
  • summarize sample-level annotation-footprint presence/absence by graph node, node interval, or graph walk
  • calculate footprint coverage using gene-span, exon, or CDS bases and emit long TSV or sample-by-feature matrices
  • export path-specific GFF3 for round-trip checks
  • inspect .gfa/.gfa.gz files with GFA1.1 W lines
  • subset a W-line pangenome graph into a small P-line GFA region
  • build and use a sidecar SQLite .gfi.sqlite query index
  • reject stale indexes using source size, modification time, and SHA-256
  • read, write, validate, query, index, and export compressed .gff4.gz single-file tables
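
To make the stale-index item above concrete, the following Python sketch stores a source fingerprint (size, modification time, SHA-256) in a SQLite sidecar and compares it before the index is trusted. The table name, column names, and function names are hypothetical. The actual .gfi.sqlite schema is not described here, so treat this as one possible arrangement rather than the format GFF4tools writes.

```python
import hashlib
import os
import sqlite3
import tempfile
from contextlib import closing


def source_fingerprint(path):
    """Return (size, mtime_ns, sha256 hex digest) for a source file."""
    stat = os.stat(path)
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return stat.st_size, stat.st_mtime_ns, digest.hexdigest()


def build_index(source_path, index_path):
    """Write a toy sidecar that records only the source fingerprint."""
    with closing(sqlite3.connect(index_path)) as db:
        db.execute(
            "CREATE TABLE IF NOT EXISTS source "
            "(size INTEGER, mtime_ns INTEGER, sha256 TEXT)"
        )
        db.execute("DELETE FROM source")
        db.execute("INSERT INTO source VALUES (?, ?, ?)", source_fingerprint(source_path))
        db.commit()


def index_is_fresh(source_path, index_path):
    """Return True only if the recorded fingerprint still matches the source."""
    with closing(sqlite3.connect(index_path)) as db:
        row = db.execute("SELECT size, mtime_ns, sha256 FROM source").fetchone()
    return row is not None and tuple(row) == source_fingerprint(source_path)


workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "toy.gff4")
index_path = source_path + ".gfi.sqlite"
with open(source_path, "w", encoding="utf-8") as handle:
    handle.write("##gff4-version 0.2\n")

build_index(source_path, index_path)
fresh_before = index_is_fresh(source_path, index_path)

# Any edit to the source changes its size and digest, so the index is rejected.
with open(source_path, "a", encoding="utf-8") as handle:
    handle.write("##section nodes\n")
fresh_after = index_is_fresh(source_path, index_path)
```

A production check would likely compare size and modification time first and hash only when those match, since hashing a large file is the expensive step.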

Not currently in scope: de novo gene prediction, graph-aware alignment, snarls, orthogroup clustering, gene function inference, frameshift detection, reference synteny block inference, production database storage, web visualization, or polyploid-specific modeling.

What GFF4tools Does Not Infer

GFF4tools reports graph-coordinate annotation records and annotation-footprint overlap over sample paths. It does not infer orthogroups, pangene membership, de novo genes, gene function, expression, frameshift status, or biological proof of gene presence. The footprint-pav command and its pav compatibility alias should be interpreted as annotation-footprint overlap summaries, not orthogroup-level gene PAV.
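
The distinction above can be illustrated with a small Python sketch of an annotation-footprint overlap summary. It only measures how many annotated bases each sample has inside a queried node interval and applies a minimum-overlap threshold. It makes no statement about orthology. The record layout, status labels, and function names are hypothetical, coordinates are assumed to be 0-based half-open, and the data are invented for the example.

```python
# Hypothetical span records: (sample_id, feature_id, node_id, start0, end0)
SPANS = [
    ("SampleA", "geneA1", "n2", 40, 90),
    ("SampleB", "geneB1", "n2", 78, 95),
    ("SampleC", "geneC1", "n3", 0, 25),
]
SAMPLES = ["SampleA", "SampleB", "SampleC"]


def parse_node_interval(text):
    """Split 'node:start-end' into (node_id, start0, end0)."""
    node_id, _, coords = text.rpartition(":")
    start0, end0 = (int(value) for value in coords.split("-"))
    return node_id, start0, end0


def footprint_overlap(spans, samples, query, min_overlap_bp=1):
    """Report, per sample, the largest single-feature overlap with the query interval."""
    node_id, q_start, q_end = parse_node_interval(query)
    best = {sample_id: 0 for sample_id in samples}
    for sample_id, _feature_id, span_node, start0, end0 in spans:
        if span_node != node_id:
            continue
        # Negative values mean the span lies outside the query; max() keeps 0.
        overlap = min(end0, q_end) - max(start0, q_start)
        best[sample_id] = max(best[sample_id], overlap)
    return {
        sample_id: (bp, "present" if bp >= min_overlap_bp else "absent")
        for sample_id, bp in best.items()
    }


summary = footprint_overlap(SPANS, SAMPLES, "n2:45-80")
strict = footprint_overlap(SPANS, SAMPLES, "n2:45-80", min_overlap_bp=5)
```

Raising the threshold flips SampleB from present to absent even though its annotation has not changed, which is why these summaries describe overlap and not biological gene presence.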

Later releases will add copy/allele modeling, anchors, snarls, PAV/CNV matrices, graph-SV/GWAS annotation, and production storage.

Quick Start From Source

Use this path from a source checkout to understand and verify the project in about 10 minutes. The commands below reference files under examples/, which are shipped in the source distribution and repository but are not installed as package data in the wheel.

Install from a checkout:

python -m pip install -e .
gff4tools --help

  1. Build the v0.2 multi-sample PAV demo as a single .gff4 file:

    gff4tools import-gff3 \
      --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
      --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
      --path-map examples/pav_multi_sample/path_map.tsv \
      --out /tmp/pav_multi_sample.gff4
    
  2. Inspect, validate, index, and query the generated file:

    gff4tools stats /tmp/pav_multi_sample.gff4
    
    gff4tools view /tmp/pav_multi_sample.gff4 --section features --head 5
    
    gff4tools paths /tmp/pav_multi_sample.gff4
    
    gff4tools validate /tmp/pav_multi_sample.gff4 \
      --gfa examples/pav_multi_sample/pav_multi_sample.gfa
    
    gff4tools index /tmp/pav_multi_sample.gff4
    
    gff4tools query /tmp/pav_multi_sample.gff4 --feature-type gene --format tsv
    
  3. Summarize sample-level annotation-footprint overlap directly on graph coordinates:

    gff4tools footprint-pav /tmp/pav_multi_sample.gff4 --node n2
    
    gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
      --node-interval n2:45-80
    
    gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
      --walk '>n2:85-100>n3:0-25'
    
    gff4tools footprint-pav /tmp/pav_multi_sample.gff4 \
      --node n2 \
      --coverage-basis CDS \
      --min-overlap-bp 1 \
      --matrix status
    
  4. Use the same workflow with compressed v0.4 exchange files:

    gff4tools import-gff3 \
      --gfa examples/pav_multi_sample/pav_multi_sample.gfa \
      --gff3 examples/pav_multi_sample/pav_multi_sample.gff3 \
      --path-map examples/pav_multi_sample/path_map.tsv \
      --out /tmp/pav_multi_sample.gff4.gz
    
    gff4tools validate /tmp/pav_multi_sample.gff4.gz \
      --gfa examples/pav_multi_sample/pav_multi_sample.gfa
    
    gff4tools index /tmp/pav_multi_sample.gff4.gz
    
    gff4tools query /tmp/pav_multi_sample.gff4.gz \
      --feature-type gene \
      --format tsv
    
    gff4tools export-gff3 /tmp/pav_multi_sample.gff4.gz \
      --path SampleA#1#chr1 \
      --out /tmp/pav_multi_sample.export.gff3
    
  5. Run the test suite:

    python -m pytest
    
  6. Run all demo workflows:

    bash scripts/smoke_demos.sh
    

    This writes demo outputs to /tmp/gff4_smoke/ and runs:

    • toy import, validate, and edge query
    • real-small import, validate, and reverse-edge query
    • Arabidopsis public package and single-file import, validate, query, and GFF3 export
    • multi-sample import, validate, node footprint-PAV, node-interval footprint-PAV, and walk footprint-PAV
  7. Smoke-check an installed wheel or PyPI package against your own GFF4 file:

    gff4tools --version
    gff4tools --help
    gff4 --help
    gff4tools stats path/to/file.gff4
    gff4tools query path/to/file.gff4 --feature-type gene --format tsv
    

    To run the bundled demos after a wheel install, use a source checkout or unpack the source distribution so the examples/ directory is available.

  8. Read the toy tutorial for the full command-by-command walkthrough.

  9. For v0.3 real pangenome graph work, inspect and subset a W-line graph:

    gff4tools graph-inspect \
      --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
      --out /tmp/quinoa_graph_stats.json
    
    gff4tools graph-subset \
      --gfa data/real_cqu-pangenome/pangenome.gfa.gz \
      --reference-sample CquZ \
      --seqid Cq1A \
      --start0 0 \
      --end0 250000 \
      --out-gfa /tmp/quinoa_region.gfa \
      --out-path-map /tmp/quinoa_path_map.tsv \
      --out-stats /tmp/quinoa_region.stats.json
    

    Or run the full local quinoa real-data gate:

    bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh
    

    Expected final signal:

    QUINOA_CQU_V03_GATE: PASS
    

    Run the release-facing multi-region gate:

    bash scripts/quinoa_realdata/run_quinoa_cqu_demo.sh --multi-region
    

    Expected final signal:

    QUINOA_CQU_V03_MULTI_REGION_GATE: PASS
    

    Run the v0.4 compressed single-file benchmark gate:

    bash scripts/quinoa_realdata/run_quinoa_cqu_v04_benchmark.sh
    

    Expected final signal:

    QUINOA_CQU_V04_BENCHMARK_GATE: PASS
    

    The v0.4 benchmark table records:

    Field                        Meaning
    plain_gff4_bytes             Size of the uncompressed canonical .gff4 file.
    gzip_gff4_bytes              Size of the compressed .gff4.gz file.
    compression_ratio            gzip_gff4_bytes / plain_gff4_bytes; lower means better compression.
    wall_seconds                 Elapsed time for the benchmarked command.
    user_seconds / sys_seconds   CPU time used by the benchmarked command.
    peak_rss_kib                 Peak resident memory in KiB, normalized by the wrapper.
    query/export parity          The gate asserts indexed vs scan query parity and plain vs gzip export parity.
  10. Inspect the single-file layout:

##gff4-version 0.2
##format gff4-feature-table
##section manifest
#key	value
gff4_version	0.2
##section sources
#source_id	source_kind	source_role	...
##section features
#feature_uid	annotation_set_id	source_feature_id	...
##section locations
#location_id	feature_uid	projection_set_id	...
##section location_spans
#location_id	span_rank	path_id	step_rank	node_id	...
##section nodes
#node_id	node_length
##section edges
#edge_id	from_node_id	from_orient	to_node_id	to_orient	...
##section paths
#path_id	sample_id	haplotype_id	contig_id	path_length	path_role
##section path_steps
#path_id	step_rank	node_id	orient	...
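
The layout above is line-oriented: `##section` lines open a section, a single `#` line gives its tab-separated column names, and the remaining lines are rows. A minimal Python reader, assuming exactly that reading of the layout and nothing more, could look like the sketch below. The function name read_sections is hypothetical, and the node rows are invented sample values. Only sections whose columns are fully listed above are used.

```python
import io

SAMPLE = (
    "##gff4-version 0.2\n"
    "##format gff4-feature-table\n"
    "##section manifest\n"
    "#key\tvalue\n"
    "gff4_version\t0.2\n"
    "##section nodes\n"
    "#node_id\tnode_length\n"
    "n1\t100\n"
    "n2\t12\n"
)


def read_sections(handle):
    """Group rows by section name; each row becomes a dict keyed by column name."""
    sections, name, header = {}, None, None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##section "):
            name = line.split(" ", 1)[1].strip()
            sections[name], header = [], None
        elif line.startswith("##"):
            continue  # other file-level directives such as ##gff4-version
        elif line.startswith("#"):
            header = line[1:].split("\t")
        elif name is not None and header is not None:
            sections[name].append(dict(zip(header, line.split("\t"))))
    return sections


sections = read_sections(io.StringIO(SAMPLE))
node_lengths = {row["node_id"]: int(row["node_length"]) for row in sections["nodes"]}
```

Values are kept as strings until a caller converts them, since this sketch does not know the column types of the elided fields.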

Demo Data

  • examples/toy/: hand-checkable graph for learning the v0.1 coordinate model.
  • examples/real_small/: semi-real demo with PanSN path names, GFF3 seqid aliases, UTR records, and a reverse-oriented path step.
  • examples/arabidopsis_public/: public reference-backed Arabidopsis TAIR10/Araport11 AT1G01010 region projected onto a small single-path GFA.
  • examples/pav_multi_sample/: v0.2 landing demo with multiple samples, long nodes containing multiple genes, a cross-node gene, and sample-level PAV.
  • examples/quinoa_cqu_real/: v0.3 real pangenome graph gate for the local quinoa Minigraph-Cactus W-line graph.

Each demo has its own README with import, validate, query, and export commands.

Documentation

Development

python -m pytest

Run the demo smoke workflow before release-facing changes:

bash scripts/smoke_demos.sh

The initial development target is the hand-checkable toy graph under examples/toy/.

Download files

Source Distribution

gff4tools-0.4.0.tar.gz (140.7 kB)

Uploaded Source

Built Distribution

gff4tools-0.4.0-py3-none-any.whl (49.2 kB)

Uploaded Python 3

File details

Details for the file gff4tools-0.4.0.tar.gz.

File metadata

  • Download URL: gff4tools-0.4.0.tar.gz
  • Upload date:
  • Size: 140.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gff4tools-0.4.0.tar.gz
Algorithm    Hash digest
SHA256       275d7afbeb22d342d270636a27afe58ae14815065763667578fccdba765ca5b0
MD5          fa1baa9f6ac9e48bb8ad02df11a7f060
BLAKE2b-256  b8cc574a080c2c70c6f952ae0c329c42b990fc9460fda239ccecd2b724f3b869

Provenance

The following attestation bundles were made for gff4tools-0.4.0.tar.gz:

Publisher: publish-python.yml on Qgzeng-Bio/Granno

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file gff4tools-0.4.0-py3-none-any.whl.

File metadata

  • Download URL: gff4tools-0.4.0-py3-none-any.whl
  • Upload date:
  • Size: 49.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for gff4tools-0.4.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       bc31d9d5ee9186175967bf893c3728cb8d362608b4472a57292b112c373de400
MD5          2a2193efe1906797238761f5265cb35d
BLAKE2b-256  9ca6f37d737ecde309741c5c862d8631be83bdb3336fc3083e1e8740a838ec7d

Provenance

The following attestation bundles were made for gff4tools-0.4.0-py3-none-any.whl:

Publisher: publish-python.yml on Qgzeng-Bio/Granno

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
