Fast, low-RAM GFF3/GTF tools: annotrieve-compatible stats, bulk JSONL, GTF-to-GFF3 conversion; large files sorted by seqid (GNU sort) then streamed per-chromosome

gffy

Fast, low-RAM GFF3 feature statistics aligned with the annotrieve GFFStats schema (gene_category_stats and transcript_type_stats), plus a GTF → GFF3 converter (gffy-convert).

Use gffy for statistics JSON, gffy-convert for format conversion, or import either workflow in Python pipelines and services.

Installation

From PyPI

pip install gffy

From source (development)

git clone https://github.com/guigolab/gffy.git
cd gffy
pip install -e ".[dev]"

Runtime: Python 3.9+, stdlib only. Low-memory mode additionally requires GNU sort on PATH.

Command-line usage

After installation, the gffy console script is on your PATH. You can also run:

python -m gffy

Options

Argument / flag     Description
gff_source          Local path or URL to a GFF3 file (plain or .gz); mutually exclusive with --from-file
--from-file         Newline-separated list of paths/URLs (- = stdin); writes JSONL (requires -o)
-o, --output        Write JSON (single file) or JSONL (bulk) to this path
--pretty            Pretty-print JSON (single-source only; not valid with --from-file)
--low-memory        Force sort-to-temp + per-seqid mode (applies to every source in bulk)
--workers N         Parallel worker processes for --from-file (default 1, max min(cpu_count, 8))
--fail-fast         Stop bulk run on first failing source (default: record error and continue)
--failed-urls-out   Override path for unfetchable URL list (default: <output>.failed_urls.txt)
--usage             Print user/system time and peak RAM to stderr at end of run
--annotrieve-json   Write one Annotrieve Favorites import JSON file (single source)
--custom-name       Override custom_name for --annotrieve-json (single source only)
--annotrieve-jsonl  Write import-ready JSONL for --from-file bulk (one CustomAnnotation per line)

Annotrieve Favorites import (offline)

Compute metadata for Annotrieve Favorites locally, without uploading the GFF to the server (this avoids the daily GFF upload limit). The output matches what the Import from JSON / Import from JSONL drawer expects: kind, annotation_id, uploaded_md5 (MD5 of the sorted, uncompressed GFF bytes), features_summary, and features_statistics.

# Single annotation → JSON file
gffy annotation.gff3 --annotrieve-json my-annotation.json

# Optional display name
gffy annotation.gff3 --annotrieve-json out.json --custom-name "My experiment"

# Stats and Annotrieve export together
gffy annotation.gff3 -o stats.json --annotrieve-json favorites.json

# Bulk: diagnostic JSONL (-o) plus import-ready library file
gffy --from-file urls.txt -o run.jsonl --annotrieve-jsonl favorites-import.jsonl

In the UI: Favorites → Add custom → Import from JSON or Import from JSONL. Re-importing the same sorted content updates the entry by MD5. Schema: annotrieve-schema.json (shipped in the PyPI package).
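The uploaded_md5 field is described above as the MD5 of the sorted, uncompressed GFF bytes. A minimal sketch of hashing the uncompressed bytes of a plain or gzip file in chunks is shown below. The helper name md5_of_uncompressed is hypothetical and not part of gffy, and the sketch assumes the file has already been sorted the way gffy sorts it, so the digest for an unsorted file may differ from what gffy reports.

```python
import gzip
import hashlib


def md5_of_uncompressed(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 of a file's uncompressed bytes (hypothetical helper, not gffy API).

    Assumption: the file at `path` is already sorted; no sorting happens here.
    """
    # Pick a reader by extension so .gz inputs are hashed after decompression.
    opener = gzip.open if path.endswith(".gz") else open
    digest = hashlib.md5()
    with opener(path, "rb") as handle:
        # Read fixed-size chunks so memory stays flat for large files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash covers uncompressed bytes, a .gff3 file and its .gff3.gz copy give the same digest in this sketch.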

Examples

# JSON to stdout (progress and RAM on stderr)
gffy annotation.gff3

# Save formatted JSON
gffy annotation.gff3.gz --output stats.json --pretty

# Remote URL
gffy https://example.com/annotation.gff3.gz -o stats.json

# Large files: sort by seqid into a temp file, stream per chromosome, then delete temp
gffy large.gff.gz --low-memory

Streams: statistics JSON goes to stdout (or --output). Mode label and progress go to stderr; add --usage for timing and peak RAM at the end.

Bulk stats from a list

Provide a text file with one path or URL per line (plain or .gz). Blank lines and lines starting with # are ignored. Duplicate lines are skipped (first occurrence kept); duplicates trigger warnings on stderr.

gffy --from-file urls.txt -o stats.jsonl
gffy --from-file urls.txt -o stats.jsonl --workers 4 --low-memory
cat urls.txt | gffy --from-file - -o stats.jsonl
gffy --from-file urls.txt -o stats.jsonl --fail-fast

Example urls.txt:

# Ensembl release
https://example.com/annotation1.gff3.gz
/path/to/local/annotation2.gff3
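The list rules above (skip blank lines and # comments, keep the first occurrence of a duplicate) can be sketched roughly as follows. This is an illustration only: parse_source_lines is a hypothetical name, not the read_source_list implementation, and stripping surrounding whitespace before the checks is an assumption.

```python
from typing import Iterable, List, Tuple


def parse_source_lines(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (sources, duplicates) from raw list lines (hypothetical helper)."""
    seen = set()
    sources: List[str] = []
    duplicates: List[str] = []
    for raw in lines:
        entry = raw.strip()  # assumption: surrounding whitespace is ignored
        if not entry or entry.startswith("#"):
            continue  # blank line or comment
        if entry in seen:
            duplicates.append(entry)  # later occurrence: skipped, reported
            continue
        seen.add(entry)
        sources.append(entry)
    return sources, duplicates
```

Input order is preserved for the kept entries, so the first occurrence of each source decides its position.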

JSONL output — one compact JSON object per line (not a JSON array):

Success:

{"source": "https://example.com/a.gff3.gz", "mode": "single-pass", "elapsed_ms": 1234, "stats": {"gene_category_stats": {}, "transcript_type_stats": {}}}

Failure (when not using --fail-fast):

{"source": "/missing.gff3", "mode": "single-pass", "elapsed_ms": 12, "error": "FileNotFoundError: GFF file not found: /missing.gff3", "error_category": "not_found"}

HTTP / connection failures on URLs also set error_category (http or network) and optional http_status.

Each stats object matches the annotrieve GFFStats schema (schema.json).
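Since the bulk output is one JSON object per line, downstream code can read it with the standard json module. The sketch below splits records into successes and failures by checking for the error key shown in the examples above. split_jsonl_records is a hypothetical helper name, and treating "has an error key" as the failure test is an assumption based on those two examples.

```python
import json
from typing import Iterable, List, Tuple


def split_jsonl_records(lines: Iterable[str]) -> Tuple[List[dict], List[dict]]:
    """Split JSONL lines into (succeeded, failed) records (hypothetical helper)."""
    succeeded: List[dict] = []
    failed: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        record = json.loads(line)
        # Assumption: failure records carry an "error" key, successes carry "stats".
        if "error" in record:
            failed.append(record)
        else:
            succeeded.append(record)
    return succeeded, failed
```

Typical use would be passing an open file handle, for example split_jsonl_records(open("stats.jsonl", encoding="utf-8")).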

Error handling

Errors are classified without extra I/O on the success path (classification runs only when an exception occurs).

error_category  Typical cause
not_found       Local path does not exist
network         Connection refused, timeout, DNS failure
http            HTTP 4xx/5xx from server (http_status field set)
sort            GNU sort pipeline failed in low-memory mode
unknown         Other failures

Bulk (--from-file): failures are non-blocking by default — each source gets one JSONL line (success or error). Duplicate entries in the input list print [gffy] WARNING: duplicate source (skipped): ... on stderr.

Failed URL retry list: when a URL fails with network or http, it is appended to <output>.failed_urls.txt (e.g. stats.jsonl → stats.jsonl.failed_urls.txt). Use that file to retry only the unfetchable URLs. Local not_found paths are not included. Override with --failed-urls-out PATH.
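As a rough illustration of the retry-list rule, the sketch below derives the default file name from the output path and selects the sources whose error_category is network or http. Both function names are hypothetical and operate on already-parsed JSONL records; they are not gffy functions.

```python
from typing import Iterable, List

# Categories that the text above describes as eligible for the retry list.
RETRYABLE = {"network", "http"}


def default_failed_urls_path(output_path: str) -> str:
    """Default retry-list path: <output>.failed_urls.txt (hypothetical helper)."""
    return output_path + ".failed_urls.txt"


def retryable_sources(records: Iterable[dict]) -> List[str]:
    """Sources worth retrying; not_found and other categories are left out."""
    return [r["source"] for r in records if r.get("error_category") in RETRYABLE]
```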

Single-source gffy and gffy-convert: failures print [gffy] ERROR (<category>): ... on stderr and exit with code 1 (no partial output).

Memory: bulk mode runs each source in a separate process (--workers). Every worker uses the same single-pass / low-memory rules as a normal gffy run. Approximate peak RAM scales with workers × per-file usage (~500 MB target per file). Default --workers 1 keeps memory predictable; increase for throughput when you have spare RAM.
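The workers × per-file estimate can be turned into a quick sizing rule. The sketch below picks a worker count from a RAM budget, using the ~500 MB per-file target and the min(cpu_count, 8) cap stated in this README. The function name and the idea of sizing from a budget are assumptions for illustration, and real peak usage will vary by file.

```python
import os


def max_workers_for_budget(ram_budget_mb: int, per_file_mb: int = 500) -> int:
    """Suggest a --workers value for a RAM budget (hypothetical helper)."""
    # Upper bound documented for --workers: min(cpu_count, 8).
    cpu_cap = min(os.cpu_count() or 1, 8)
    # Rough estimate: each worker may need about per_file_mb at peak.
    by_ram = ram_budget_mb // per_file_mb
    # Never go below one worker, even with a tiny budget.
    return max(1, min(cpu_cap, by_ram))
```

For example, a 2000 MB budget allows at most 4 workers under this estimate, fewer on machines with fewer than 4 CPUs.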

Python API:

from gffy import compute_bulk_stats, read_source_list, warn_source_list_issues

parsed = read_source_list("urls.txt")
warn_source_list_issues(parsed)
summary = compute_bulk_stats(
    parsed.sources,
    "stats.jsonl",
    workers=2,
    force_low_memory=False,
    continue_on_error=True,
)
print(summary)
# {"total": N, "succeeded": K, "failed": M, "failed_urls_written": "...", "duplicates_skipped": D}

GTF → GFF3 conversion

Use gffy-convert to translate GTF annotations into GFF3 with ID / Parent attributes (separate from the stats tool):

gffy-convert annotation.gtf -o annotation.gff3
gffy-convert annotation.gtf.gz -o annotation.gff3.gz
gffy-convert https://example.com/annotation.gtf.gz -o local.gff3 --low-memory
gffy-convert annotation.gtf -o out.gff3 --no-sort

You can also run:

python -m gffy.convert_cli annotation.gtf -o annotation.gff3

gffy-convert options

Argument / flag  Description
gtf_source       Local path or URL to a GTF file (plain or .gz)
-o, --output     Required output GFF3 path (use .gz for gzip)
--low-memory     Force sort-by-seqid into a temp file, then convert
--no-sort        Keep input row order (skip seqid sort)
--usage          Print user/system time and peak RAM to stderr at end of run

Streams: GFF3 goes to --output. Mode label and feature counts go to stderr; add --usage for timing and peak RAM at the end.

Conversion rules

Input (GTF)                                            Output (GFF3)
gene_id "G1" on a root feature                         ID=G1
transcript_id "T1" + gene_id "G1" on a transcript row  ID=T1;Parent=G1
exon / CDS / UTR rows with transcript_id               Parent=T1 (and ID= when exon_id is present)
Feature type transcript with CDS children              Type rewritten to mRNA
Feature type transcript without CDS                    Type stays transcript
Root feature whose transcripts have CDS                biotype=protein_coding added if no biotype already set

GTF-only keys (gene_id, transcript_id, exon_id) are not copied verbatim; other attributes (e.g. gene_name, transcript_biotype, ccds_id) are percent-encoded in GFF3 form.

Example:

# GTF
chr1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";
chr1\tEnsembl\ttranscript\t1100\t4900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";
chr1\tEnsembl\tCDS\t1100\t1200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";

# GFF3 (abbreviated column 9)
chr1\t...\tgene\t...\tID=G1;biotype=protein_coding;gene_biotype=protein_coding
chr1\t...\tmRNA\t...\tID=T1;Parent=G1
chr1\t...\tCDS\t...\tParent=T1
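To make the attribute rules concrete, here is a simplified sketch of mapping a GTF column 9 to GFF3 attributes. It is not the gffy-convert implementation: the function names are hypothetical, only quoted GTF values are parsed, "gene" stands in for any root feature, the biotype=protein_coding rule is left out, and the set of percent-encoded characters follows the reserved characters of GFF3 column 9 as an assumption about what the converter escapes.

```python
import re

# Matches key "value" pairs; unquoted values are ignored in this sketch.
GTF_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')
GTF_ONLY_KEYS = {"gene_id", "transcript_id", "exon_id"}
RESERVED = set(";=&,%")  # characters with special meaning in GFF3 column 9


def gff3_escape(value: str) -> str:
    """Percent-encode reserved and control characters."""
    return "".join(
        f"%{ord(ch):02X}" if ch in RESERVED or ord(ch) < 0x20 else ch
        for ch in value
    )


def gtf_attrs_to_gff3(feature_type: str, attr_column: str) -> str:
    """Build a GFF3 attribute string from a GTF one (hypothetical helper)."""
    pairs = GTF_ATTR.findall(attr_column)
    attrs = dict(pairs)
    out = []
    if feature_type == "gene":  # simplification: treat gene as the root feature
        out.append(("ID", attrs["gene_id"]))
    elif feature_type == "transcript":
        out.append(("ID", attrs["transcript_id"]))
        out.append(("Parent", attrs["gene_id"]))
    else:  # exon / CDS / UTR rows
        if "exon_id" in attrs:
            out.append(("ID", attrs["exon_id"]))
        out.append(("Parent", attrs["transcript_id"]))
    # GTF-only keys are dropped; everything else is carried over, escaped.
    out.extend((k, v) for k, v in pairs if k not in GTF_ONLY_KEYS)
    return ";".join(f"{key}={gff3_escape(value)}" for key, value in out)
```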

Output row order

gffy-convert does not reorder features during conversion itself. The GFF3 file is written in the same line order as the input stream used for pass 2 (which matches pass 1). Whether that stream is the original GTF or a sorted copy depends on file size and flags.

When order matches the original GTF

  • Local file, below the size thresholds (gz ≤ 100 MB, plain ≤ 1 GB — same as gffy stats), and without --low-memory: the tool reads your file twice in place and emits GFF3 rows in identical order to the input (aside from the added ##gff-version 3 header).
  • Any input with --no-sort: sorting is disabled even for large files or with --low-memory. Row order follows the staged/downloaded file as-is.

When rows are reordered (seqid sort)

Sorting runs only when both are true:

  1. Sorting is enabled (default; turned off with --no-sort), and
  2. --low-memory is set, or the source is considered big (gz > 100 MB or plain > 1 GB by Content-Length / file size).

In that case the GTF is staged and run through GNU sort with -k1,1 (first column = seqid only). Effects:

  • All rows for a given seqid are grouped together.
  • Across seqids, order becomes lexicographic by seqid (e.g. chr1, chr10, chr2 under LC_ALL=C).
  • Within a seqid, order is preserved relative to the input because sort uses -s (stable) — gene / transcript / exon / CDS lines on the same chromosome keep their previous relative order.

The converter never sorts by start coordinate, feature type, or hierarchy; only seqid grouping is applied.
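The effect of a stable sort on the first column only can be reproduced in a few lines of Python, whose built-in sort is also stable. This is only an illustration of the resulting order, not how gffy invokes GNU sort; the function name is hypothetical, and comparing the seqid as UTF-8 bytes is used here to mimic byte-wise LC_ALL=C ordering.

```python
from typing import Iterable, List


def sort_rows_by_seqid(rows: Iterable[str]) -> List[str]:
    """Stable sort of tab-separated rows on column 1 only (illustrative)."""
    # sorted() is stable, so rows sharing a seqid keep their input order,
    # which mirrors what sort -s -k1,1 guarantees.
    return sorted(rows, key=lambda row: row.split("\t", 1)[0].encode("utf-8"))


rows = [
    "chr2\tsrc\tgene",
    "chr10\tsrc\tgene",
    "chr1\tsrc\tgene",
    "chr2\tsrc\texon",
    "chr1\tsrc\texon",
]
for row in sort_rows_by_seqid(rows):
    print(row)
```

The seqids come out as chr1, chr10, chr2, and within each seqid the gene row still precedes the exon row, as in the input.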

URLs

Remote sources are always downloaded once into a temporary file (so two passes do not hit the network twice). That temp file is sorted only under the rules above; otherwise output order matches the download order.

Pass 1 vs pass 2

Pass 1 scans the file to find which transcripts have CDS (for mRNA / biotype=protein_coding rules). Pass 2 writes GFF3 lines in the same order as the same input stream. No second shuffle happens between passes.
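The two-pass idea can be sketched on an in-memory list of GTF rows: the first loop collects transcript IDs that have a CDS, the second rewrites the type of those transcripts to mRNA without changing row order. The real tool reads the file (or sorted temp file) twice rather than holding rows in memory, and rewrite_transcript_types is a hypothetical name used only for this illustration.

```python
import re
from typing import List

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def rewrite_transcript_types(rows: List[str]) -> List[str]:
    """Rename transcript rows with CDS children to mRNA (illustrative)."""
    # Pass 1: remember every transcript_id seen on a CDS row.
    coding = set()
    for row in rows:
        cols = row.split("\t")
        if len(cols) == 9 and cols[2] == "CDS":
            match = TRANSCRIPT_ID.search(cols[8])
            if match:
                coding.add(match.group(1))

    # Pass 2: emit rows in the same order, changing only the type column.
    out = []
    for row in rows:
        cols = row.split("\t")
        if len(cols) == 9 and cols[2] == "transcript":
            match = TRANSCRIPT_ID.search(cols[8])
            if match and match.group(1) in coding:
                cols[2] = "mRNA"
        out.append("\t".join(cols))
    return out
```

A first pass is needed because a CDS row may appear after its transcript row, so the type cannot be decided while streaming once.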

Quick reference

Scenario                         Output line order
Small local file, default flags  Same as input GTF
Large local file, default flags  Sorted by seqid (stable within seqid)
Any size + --low-memory          Sorted by seqid (unless --no-sort)
Any size + --no-sort             Same as input (or download order for URLs)
Small URL, default flags         Same as download order (no sort)
Large URL, default flags         Sorted by seqid

Use --no-sort when downstream tools require the exact row order of the source GTF (e.g. diffing against the original). Use the default (or --low-memory on smaller files) when you want seqids grouped for streaming or parity with gffy stats low-memory mode.
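The ordering rules reduce to a small decision function. The sketch below encodes them as described in this section; output_order is a hypothetical name, and is_big stands for the size check (gz > 100 MB or plain > 1 GB) being already done elsewhere.

```python
def output_order(is_big: bool, low_memory: bool = False, no_sort: bool = False) -> str:
    """Predict gffy-convert row order from flags and size (illustrative)."""
    if no_sort:
        # --no-sort wins over everything, including --low-memory.
        return "input order"
    if low_memory or is_big:
        # Sorting is enabled and either trigger applies.
        return "sorted by seqid"
    return "input order"
```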

Conversion memory modes

Mode         When                                              Behavior
single-pass  Small files (same size thresholds as stats)       Two passes over the source path; input row order preserved
low-memory   Large files, --low-memory, or URL staging + sort  Download/stage once, sort by seqid into a temporary file, convert, delete temp

Nothing is cached permanently on disk.

Python API (conversion)

from gffy import convert_gtf_to_gff3

summary = convert_gtf_to_gff3(
    "/path/to/annotation.gtf.gz",
    "out.gff3.gz",
    force_low_memory=False,
    sort=True,
)
print(summary["feature_count"], summary["genes_with_cds"])

Symbol                                    Use case
convert_gtf_to_gff3(source, output, ...)  Convert GTF → GFF3 file
describe_convert_mode(source, ...)        Mode label + SourceInfo for logging

Converted GFF3 can be passed directly to compute_gff_stats for annotrieve-compatible statistics.

Python library usage

Quick start

from gffy import compute_gff_stats

stats = compute_gff_stats("/path/to/annotation.gff3.gz")

coding = stats["gene_category_stats"].get("coding", {})
print(coding.get("total_count", 0))

for ttype, tstats in stats["transcript_type_stats"].items():
    print(ttype, tstats["total_count"])

Primary API

compute_gff_stats(
    source: str,
    *,
    force_low_memory: bool = False,
) -> dict

  • source — local file path or http(s)/ftp URL (plain or gzip).
  • force_low_memory — always use sort-to-temp + per-seqid processing (see Memory modes).
  • Returns a dict with gene_category_stats and transcript_type_stats (annotrieve-compatible).

Additional exports

Symbol                                               Use case
compute_gff_stats_from_lines(lines)                  Stats from an in-memory iterable of GFF lines
inspect_source(source)                               Size/metadata for a path or URL (SourceInfo)
is_big(info)                                         Whether auto low-memory mode applies
describe_compute_mode(source, force_low_memory=...)  Human-readable mode label + SourceInfo
build_sorted_gff(source, output_path, info=None)     Low-level: sort into a gzipped file you manage
convert_gtf_to_gff3(source, output, ...)             Convert GTF → GFF3 (see GTF → GFF3 conversion)
describe_convert_mode(source, ...)                   Mode label for conversion
build_custom_annotation(source, ...)                 Annotrieve Favorites CustomAnnotation dict (offline import)
derive_custom_name(source, override=None)            Display name from local path or URL path only (no host) for custom annotations
compute_bulk_stats(sources, output, ...)             Bulk stats → JSONL (see Bulk stats from a list)
read_source_list(path_or_dash)                       Parse a newline-separated source list (SourceListResult)
warn_source_list_issues(result)                      Print duplicate-line warnings to stderr
ErrorInfo, classify_exception                        Structured error classification
__version__                                          Package version string

Validate output JSON

The output matches schema.json at the repo root. When installed, a copy is shipped inside the package:

from importlib.resources import files
import json

schema = json.loads(
    files("gffy").joinpath("schema.json").read_text(encoding="utf-8")
)

Large files in code

stats = compute_gff_stats(
    "https://example.com/large.gff.gz",
    force_low_memory=True,
)

Memory modes

Mode         When                                                       Peak RAM
single-pass  gz ≤ 100 MB and plain ≤ 1 GB (by size check)               Low for small/medium files
low-memory   Above thresholds, or force_low_memory=True / --low-memory  Bounded per seqid (~500 MB target)

Low-memory mode:

  1. Sorts the GFF by seqid with system sort (LC_ALL=C, -S 200M buffer cap) into a temporary .gff.gz.
  2. Streams one seqid at a time, updates global stats, frees per-seqid scratch.
  3. Deletes the temporary sorted file when done (nothing persisted on disk).
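Step 2 above relies on the input being grouped by seqid, so one group can be processed and discarded at a time. A minimal sketch of that pattern with itertools.groupby is shown below. It only counts feature types and is not the gffy statistics code; the function name is hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import Iterable


def count_types_streaming(sorted_lines: Iterable[str]) -> Counter:
    """Count feature types one seqid at a time (illustrative only).

    Assumption: lines are already grouped by seqid, e.g. by a seqid sort.
    """
    totals: Counter = Counter()
    # Lazily split data rows; comment and blank lines are skipped.
    rows = (
        line.rstrip("\n").split("\t")
        for line in sorted_lines
        if line.strip() and not line.startswith("#")
    )
    rows = (cols for cols in rows if len(cols) >= 3)
    for _seqid, group in groupby(rows, key=lambda cols: cols[0]):
        # Per-seqid scratch: built for this group only, then merged.
        scratch = Counter(cols[2] for cols in group)
        totals.update(scratch)
        # scratch is rebound on the next iteration, so the old one can be freed.
    return totals
```

Only the global totals grow with the file; the scratch structure is bounded by the largest single seqid.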

Environment overrides:

  • GFFY_GZ_THRESHOLD_BYTES (default 104857600 — 100 MB)
  • GFFY_PLAIN_THRESHOLD_BYTES (default 1073741824 — 1 GB)

URL auto-detection uses HEAD Content-Length when available; missing length defaults to single-pass.
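The threshold logic with its environment overrides can be approximated as follows. This is a sketch, not the is_big implementation: looks_big is a hypothetical name, and reading the environment on every call (rather than once at import) is an assumption.

```python
import os
from typing import Optional

DEFAULT_GZ_THRESHOLD = 104857600      # 100 MB
DEFAULT_PLAIN_THRESHOLD = 1073741824  # 1 GB


def looks_big(size_bytes: Optional[int], is_gzip: bool) -> bool:
    """Whether a source would exceed the auto low-memory threshold (illustrative)."""
    if size_bytes is None:
        # Unknown size (e.g. no Content-Length): stay in single-pass mode.
        return False
    gz_limit = int(os.environ.get("GFFY_GZ_THRESHOLD_BYTES", DEFAULT_GZ_THRESHOLD))
    plain_limit = int(os.environ.get("GFFY_PLAIN_THRESHOLD_BYTES", DEFAULT_PLAIN_THRESHOLD))
    # Strictly greater than the limit, matching "gz > 100 MB or plain > 1 GB".
    return size_bytes > (gz_limit if is_gzip else plain_limit)
```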

Output structure

Top-level keys:

  • gene_category_stats: coding, non_coding, pseudogene (each present only if count > 0)
  • transcript_type_stats: keyed by transcript type (e.g. mRNA), sorted by descending total_count

Example (abbreviated):

{
  "gene_category_stats": {
    "coding": {
      "total_count": 22178,
      "length_stats": { "min": 10, "max": 2960899, "mean": 48306.64 },
      "biotype_counts": { "protein_coding": 20000 },
      "transcript_type_counts": { "mRNA": 66153 }
    }
  },
  "transcript_type_stats": {
    "mRNA": {
      "total_count": 66153,
      "length_stats": { "min": 100, "max": 50000, "mean": 3500.0 },
      "biotype_counts": { "protein_coding": 60000 },
      "associated_genes": {
        "total_count": 20000,
        "gene_categories": { "coding": 20000 }
      },
      "exon_stats": { "total_count": 500000, "length": { "min": 10, "max": 5000, "mean": 150.0 } },
      "cds_stats": { "..." : "..." }
    }
  }
}

Gene categories:

  • coding: genes with CDS-bearing transcripts or protein_coding biotype
  • non_coding: genes with exons but no CDS
  • pseudogene: pseudogene feature type
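Read as a decision rule, the three categories might look like the sketch below. The order in which the checks run (pseudogene first, then coding, then non_coding) is an assumption, since the list above does not state a precedence, and gene_category is a hypothetical name rather than a gffy function.

```python
from typing import Optional


def gene_category(
    feature_type: str,
    biotype: Optional[str],
    has_cds: bool,
    has_exons: bool,
) -> Optional[str]:
    """Assign a gene category from the rules above (precedence is assumed)."""
    if feature_type == "pseudogene":
        return "pseudogene"
    if has_cds or biotype == "protein_coding":
        return "coding"
    if has_exons:
        return "non_coding"
    return None  # no rule matched; the real tool's handling is not documented here
```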

Development

pip install -e ".[dev]"
pytest -v
python -m build    # smoke-test packaging
flake8
black .

CI runs on push and pull requests to main and master (Python 3.9 and 3.12).

Releasing to PyPI

  1. Bump version in pyproject.toml and src/gffy/__init__.py.
  2. Commit, tag (e.g. v0.1.1), and push the tag.
  3. Create a GitHub Release from that tag (publish event triggers .github/workflows/publish.yml).

One-time PyPI trusted publishing setup:

  1. Register the project on PyPI as gffy (if needed).
  2. PyPI → Publishing → add a trusted publisher for GitHub:
    • Owner: guigolab
    • Repository: gffy
    • Workflow: publish.yml
    • Environment: pypi (matches the workflow environment.name)

No long-lived PYPI_API_TOKEN is required when trusted publishing is configured.

License

MIT — see LICENSE.
