Fast, low-RAM GFF3/GTF tools: annotrieve-compatible stats, bulk JSONL, GTF-to-GFF3 conversion; large files sorted by seqid (GNU sort) then streamed per-chromosome
gffy
Fast, low-RAM GFF3 feature statistics aligned with the annotrieve GFFStats schema (gene_category_stats and transcript_type_stats), plus a GTF → GFF3 converter (gffy-convert).
Use gffy for statistics JSON, gffy-convert for format conversion, or import either workflow in Python pipelines and services.
Installation
From PyPI
pip install gffy
From source (development)
git clone https://github.com/guigolab/gffy.git
cd gffy
pip install -e ".[dev]"
Runtime: Python 3.9+, stdlib only. Low-memory mode additionally requires GNU sort on PATH.
Command-line usage
After installation, the gffy console script is on your PATH. You can also run:
python -m gffy
Options
| Argument / flag | Description |
|---|---|
| `gff_source` | Local path or URL to a GFF3 file (plain or .gz); mutually exclusive with --from-file |
| `--from-file` | Newline-separated list of paths/URLs (- = stdin); writes JSONL (requires -o) |
| `-o`, `--output` | Write JSON (single file) or JSONL (bulk) to this path |
| `--pretty` | Pretty-print JSON (single-source only; not valid with --from-file) |
| `--low-memory` | Force sort-to-temp + per-seqid mode (applies to every source in bulk) |
| `--workers N` | Parallel worker processes for --from-file (default 1, max min(cpu_count, 8)) |
| `--fail-fast` | Stop bulk run on first failing source (default: record error and continue) |
| `--failed-urls-out` | Override path for unfetchable URL list (default: <output>.failed_urls.txt) |
| `--usage` | Print user/system time and peak RAM to stderr at end of run |
| `--annotrieve-json` | Write one Annotrieve Favorites import JSON file (single source) |
| `--custom-name` | Override custom_name for --annotrieve-json (single source only) |
| `--annotrieve-jsonl` | Write import-ready JSONL for --from-file bulk (one CustomAnnotation per line) |
Annotrieve Favorites import (offline)
Compute metadata for Annotrieve Favorites without uploading GFF to the server (avoids the daily GFF upload limit). Output matches the Import from JSON / Import from JSONL drawer: kind, annotation_id, uploaded_md5 (MD5 of sorted uncompressed GFF bytes), features_summary, and features_statistics.
# Single annotation → JSON file
gffy annotation.gff3 --annotrieve-json my-annotation.json
# Optional display name
gffy annotation.gff3 --annotrieve-json out.json --custom-name "My experiment"
# Stats and Annotrieve export together
gffy annotation.gff3 -o stats.json --annotrieve-json favorites.json
# Bulk: diagnostic JSONL (-o) plus import-ready library file
gffy --from-file urls.txt -o run.jsonl --annotrieve-jsonl favorites-import.jsonl
In the UI: Favorites → Add custom → Import from JSON or Import from JSONL. Re-importing the same sorted content updates the entry by MD5. Schema: annotrieve-schema.json (shipped in the PyPI package).
Examples
# JSON to stdout (progress and RAM on stderr)
gffy annotation.gff3
# Save formatted JSON
gffy annotation.gff3.gz --output stats.json --pretty
# Remote URL
gffy https://example.com/annotation.gff3.gz -o stats.json
# Large files: sort by seqid into a temp file, stream per chromosome, then delete temp
gffy large.gff.gz --low-memory
Streams: statistics JSON goes to stdout (or --output). Mode label and progress go to stderr; add --usage for timing and peak RAM at the end.
Bulk stats from a list
Provide a text file with one path or URL per line (plain or .gz). Blank lines and lines starting with # are ignored. Duplicate lines are skipped (first occurrence kept); duplicates trigger warnings on stderr.
gffy --from-file urls.txt -o stats.jsonl
gffy --from-file urls.txt -o stats.jsonl --workers 4 --low-memory
cat urls.txt | gffy --from-file - -o stats.jsonl
gffy --from-file urls.txt -o stats.jsonl --fail-fast
Example urls.txt:
# Ensembl release
https://example.com/annotation1.gff3.gz
/path/to/local/annotation2.gff3
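The list rules above (skip blank lines and `#` comments, keep the first occurrence of a duplicate) can be sketched in a few lines of Python. This is only an illustration of the described behavior, not gffy's own parser: `parse_source_lines` is a hypothetical helper, and stripping surrounding whitespace before the checks is an assumption.

```python
from typing import Iterable, List, Tuple


def parse_source_lines(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (unique sources in first-seen order, duplicate entries that were skipped)."""
    seen = set()
    sources: List[str] = []
    duplicates: List[str] = []
    for raw in lines:
        entry = raw.strip()  # assumption: surrounding whitespace is ignored
        # Blank lines and comment lines carry no source.
        if not entry or entry.startswith("#"):
            continue
        if entry in seen:
            duplicates.append(entry)  # first occurrence is kept, later ones skipped
            continue
        seen.add(entry)
        sources.append(entry)
    return sources, duplicates


example = [
    "# Ensembl release",
    "https://example.com/annotation1.gff3.gz",
    "",
    "/path/to/local/annotation2.gff3",
    "https://example.com/annotation1.gff3.gz",
]
sources, duplicates = parse_source_lines(example)
print(sources)
print(duplicates)
```

In a real pipeline, read_source_list and warn_source_list_issues from the Python API cover this step.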
JSONL output — one compact JSON object per line (not a JSON array):
Success:
{"source": "https://example.com/a.gff3.gz", "mode": "single-pass", "elapsed_ms": 1234, "stats": {"gene_category_stats": {}, "transcript_type_stats": {}}}
Failure (when not using --fail-fast):
{"source": "/missing.gff3", "mode": "single-pass", "elapsed_ms": 12, "error": "FileNotFoundError: GFF file not found: /missing.gff3", "error_category": "not_found"}
HTTP / connection failures on URLs also set error_category (http or network) and optional http_status.
Each stats object matches the annotrieve GFFStats schema (schema.json).
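Because each JSONL line is a standalone object, a bulk run can be post-processed line by line. The sketch below splits records into successes and failures and tallies failures by error_category. It uses the two sample records shown above; `split_records` is a hypothetical helper, and treating the presence of an "error" key as the failure marker is an assumption based on those samples.

```python
import json
from collections import Counter

# The two sample records from this section, one JSON object per line.
sample_jsonl = "\n".join([
    '{"source": "https://example.com/a.gff3.gz", "mode": "single-pass", "elapsed_ms": 1234, '
    '"stats": {"gene_category_stats": {}, "transcript_type_stats": {}}}',
    '{"source": "/missing.gff3", "mode": "single-pass", "elapsed_ms": 12, '
    '"error": "FileNotFoundError: GFF file not found: /missing.gff3", "error_category": "not_found"}',
])


def split_records(text):
    """Parse JSONL text and return (successful records, failed records)."""
    ok, failed = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate a trailing blank line
        record = json.loads(line)
        # Assumption: failures are the records that carry an "error" key.
        (failed if "error" in record else ok).append(record)
    return ok, failed


ok, failed = split_records(sample_jsonl)
by_category = Counter(r.get("error_category", "unknown") for r in failed)
print(len(ok), "succeeded;", dict(by_category))
```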
Error handling
Errors are classified without extra I/O on the success path (classification runs only when an exception occurs).
| error_category | Typical cause |
|---|---|
| `not_found` | Local path does not exist |
| `network` | Connection refused, timeout, DNS failure |
| `http` | HTTP 4xx/5xx from server (http_status field set) |
| `sort` | GNU sort pipeline failed in low-memory mode |
| `unknown` | Other failures |
Bulk (--from-file): failures are non-blocking by default — each source gets one JSONL line (success or error). Duplicate entries in the input list print [gffy] WARNING: duplicate source (skipped): ... on stderr.
Failed URL retry list: when a URL fails with network or http, it is appended to <output>.failed_urls.txt (e.g. stats.jsonl → stats.jsonl.failed_urls.txt). Use that file to retry only the unfetchable URLs. Local not_found paths are not included. Override with --failed-urls-out PATH.
Single-source gffy and gffy-convert: failures print [gffy] ERROR (<category>): ... on stderr and exit with code 1 (no partial output).
Memory: bulk mode runs each source in a separate process (--workers). Every worker uses the same single-pass / low-memory rules as a normal gffy run. Approximate peak RAM scales with workers × per-file usage (~500 MB target per file). Default --workers 1 keeps memory predictable; increase for throughput when you have spare RAM.
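As a rough planning aid, the workers × per-file rule can be turned into arithmetic. The sketch below is a back-of-the-envelope estimate only: it takes the ~500 MB per-file target and the min(cpu_count, 8) worker cap from this README at face value, and `workers_for_budget` is a hypothetical helper, not part of gffy.

```python
PER_FILE_TARGET_MB = 500  # approximate per-file target quoted in this README


def max_workers(cpu_count: int) -> int:
    """Upper bound on --workers as documented: min(cpu_count, 8)."""
    return min(cpu_count, 8)


def estimated_peak_mb(workers: int, per_file_mb: int = PER_FILE_TARGET_MB) -> int:
    """Approximate peak RAM for a bulk run: workers x per-file usage."""
    return workers * per_file_mb


def workers_for_budget(ram_budget_mb: int, cpu_count: int) -> int:
    """Largest worker count whose estimated peak fits the budget (never below 1)."""
    fit = ram_budget_mb // PER_FILE_TARGET_MB
    return max(1, min(fit, max_workers(cpu_count)))


print(estimated_peak_mb(4))          # estimate for --workers 4
print(workers_for_budget(1800, 16))  # workers that fit in roughly 1.8 GB
```

Actual usage depends on the input files, so treat the result as a starting point and confirm with --usage.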
Python API:
from gffy import compute_bulk_stats, read_source_list, warn_source_list_issues
parsed = read_source_list("urls.txt")
warn_source_list_issues(parsed)
summary = compute_bulk_stats(
    parsed.sources,
    "stats.jsonl",
    workers=2,
    force_low_memory=False,
    continue_on_error=True,
)
print(summary)
# {"total": N, "succeeded": K, "failed": M, "failed_urls_written": "...", "duplicates_skipped": D}
GTF → GFF3 conversion
Use gffy-convert to translate GTF annotations into GFF3 with ID / Parent attributes (separate from the stats tool):
gffy-convert annotation.gtf -o annotation.gff3
gffy-convert annotation.gtf.gz -o annotation.gff3.gz
gffy-convert https://example.com/annotation.gtf.gz -o local.gff3 --low-memory
gffy-convert annotation.gtf -o out.gff3 --no-sort
You can also run:
python -m gffy.convert_cli annotation.gtf -o annotation.gff3
gffy-convert options
| Argument / flag | Description |
|---|---|
| `gtf_source` | Local path or URL to a GTF file (plain or .gz) |
| `-o`, `--output` | Required output GFF3 path (use .gz for gzip) |
| `--low-memory` | Force sort-by-seqid into a temp file, then convert |
| `--no-sort` | Keep input row order (skip seqid sort) |
| `--usage` | Print user/system time and peak RAM to stderr at end of run |
Streams: GFF3 goes to --output. Mode label and feature counts go to stderr; add --usage for timing and peak RAM at the end.
Conversion rules
| Input (GTF) | Output (GFF3) |
|---|---|
| gene_id "G1" on a root feature | ID=G1 |
| transcript_id "T1" + gene_id "G1" on a transcript row | ID=T1;Parent=G1 |
| exon / CDS / UTR rows with transcript_id | Parent=T1 (and ID= when exon_id is present) |
| Feature type transcript with CDS children | Type rewritten to mRNA |
| Feature type transcript without CDS | Type stays transcript |
| Root feature whose transcripts have CDS | biotype=protein_coding added if no biotype already set |
GTF-only keys (gene_id, transcript_id, exon_id) are not copied verbatim; other attributes (e.g. gene_name, transcript_biotype, ccds_id) are percent-encoded in GFF3 form.
Example:
# GTF
chr1\tEnsembl\tgene\t1000\t5000\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";
chr1\tEnsembl\ttranscript\t1100\t4900\t.\t+\t.\tgene_id "G1"; transcript_id "T1";
chr1\tEnsembl\tCDS\t1100\t1200\t.\t+\t0\tgene_id "G1"; transcript_id "T1";
# GFF3 (abbreviated column 9)
chr1\t...\tgene\t...\tID=G1;biotype=protein_coding;gene_biotype=protein_coding
chr1\t...\tmRNA\t...\tID=T1;Parent=G1
chr1\t...\tCDS\t...\tParent=T1
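The attribute mapping in the table can be illustrated with a small standalone function. This is a simplified sketch, not gffy-convert's implementation: `to_gff3_attributes` and its helpers are hypothetical, only quoted GTF values are parsed, a feature of type gene is assumed to be the root feature, and the type rewrite to mRNA and the added biotype are left out.

```python
import re

# Matches `key "value"` pairs in GTF column 9 (unquoted values are ignored here).
GTF_ATTR = re.compile(r'(\S+)\s+"([^"]*)"')
GTF_ONLY_KEYS = {"gene_id", "transcript_id", "exon_id"}
# Characters with a reserved meaning in GFF3 column 9, percent-encoded.
RESERVED = {";": "%3B", "=": "%3D", "&": "%26", ",": "%2C", "%": "%25", "\t": "%09"}


def escape(value: str) -> str:
    """Percent-encode the reserved characters of a GFF3 attribute value."""
    return "".join(RESERVED.get(ch, ch) for ch in value)


def to_gff3_attributes(feature_type: str, column9: str) -> str:
    """Build a GFF3 attribute string from a GTF attribute string."""
    pairs = GTF_ATTR.findall(column9)
    attrs = dict(pairs)
    out = []
    if feature_type == "gene":  # assumption: "gene" is the root feature
        out.append("ID=" + escape(attrs["gene_id"]))
    elif feature_type == "transcript":
        out.append("ID=" + escape(attrs["transcript_id"]))
        out.append("Parent=" + escape(attrs["gene_id"]))
    else:  # exon / CDS / UTR rows
        if "exon_id" in attrs:
            out.append("ID=" + escape(attrs["exon_id"]))
        if "transcript_id" in attrs:
            out.append("Parent=" + escape(attrs["transcript_id"]))
    # Remaining attributes are carried over; GTF-only keys are not copied verbatim.
    for key, value in pairs:
        if key not in GTF_ONLY_KEYS:
            out.append(f"{key}={escape(value)}")
    return ";".join(out)


print(to_gff3_attributes("transcript", 'gene_id "G1"; transcript_id "T1";'))
print(to_gff3_attributes("CDS", 'gene_id "G1"; transcript_id "T1";'))
```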
Output row order
gffy-convert does not reorder features during conversion itself. The GFF3 file is written in the same line order as the input stream used for pass 2 (which matches pass 1). Whether that stream is the original GTF or a sorted copy depends on file size and flags.
When order matches the original GTF
- Local file, below the size thresholds (gz ≤ 100 MB, plain ≤ 1 GB, same as gffy stats), and without --low-memory: the tool reads your file twice in place and emits GFF3 rows in identical order to the input (aside from the added ##gff-version 3 header).
- Any input with --no-sort: sorting is disabled even for large files or with --low-memory. Row order follows the staged/downloaded file as-is.
When rows are reordered (seqid sort)
Sorting runs only when both are true:
- Sorting is enabled (default; turned off with --no-sort), and
- --low-memory is set, or the source is considered big (gz > 100 MB or plain > 1 GB by Content-Length / file size).
In that case the GTF is staged and run through GNU sort with -k1,1 (first column = seqid only). Effects:
- All rows for a given seqid are grouped together.
- Across seqids, order becomes lexicographic by seqid (e.g. chr1, chr10, chr2 under LC_ALL=C).
- Within a seqid, order is preserved relative to the input because sort uses -s (stable): gene / transcript / exon / CDS lines on the same chromosome keep their previous relative order.
The converter never sorts by start coordinate, feature type, or hierarchy; only seqid grouping is applied.
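The effect of a stable, seqid-only sort can be reproduced in Python, whose sorted() is also stable. The rows below are made up for illustration; for ASCII seqids, plain string comparison gives the same order as byte-wise comparison under LC_ALL=C.

```python
# (seqid, feature type, start) tuples in an arbitrary input order.
rows = [
    ("chr2", "gene", 100),
    ("chr10", "gene", 500),
    ("chr1", "gene", 900),
    ("chr2", "exon", 50),
    ("chr1", "exon", 10),
]

# Sort on the seqid only; ties keep their input order because sorted() is stable,
# which mirrors `sort -s -k1,1`.
ordered = sorted(rows, key=lambda row: row[0])

for row in ordered:
    print(row)
```

Note that chr10 lands before chr2, and that within chr1 and chr2 the gene row still precedes the exon row even though its start coordinate is larger.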
URLs
Remote sources are always downloaded once into a temporary file (so two passes do not hit the network twice). That temp file is sorted only under the rules above; otherwise output order matches the download order.
Pass 1 vs pass 2
Pass 1 scans the file to find which transcripts have CDS (for mRNA / biotype=protein_coding rules). Pass 2 writes GFF3 lines in the same order as the same input stream. No second shuffle happens between passes.
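The need for two passes follows from row order: a transcript row usually appears before its CDS rows, so its final type is not known when it is first read. A minimal sketch of the idea, using made-up (feature type, transcript id) pairs and hypothetical function names rather than gffy's internals:

```python
def transcripts_with_cds(rows):
    """Pass 1: collect the ids of transcripts that have at least one CDS row."""
    return {transcript_id for feature_type, transcript_id in rows if feature_type == "CDS"}


def rewrite_types(rows):
    """Pass 2: emit rows in input order, renaming CDS-bearing transcripts to mRNA."""
    coding = transcripts_with_cds(rows)
    out = []
    for feature_type, transcript_id in rows:
        if feature_type == "transcript" and transcript_id in coding:
            feature_type = "mRNA"
        out.append((feature_type, transcript_id))
    return out


rows = [
    ("transcript", "T1"),
    ("CDS", "T1"),
    ("transcript", "T2"),
    ("exon", "T2"),
]
print(rewrite_types(rows))
```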
Quick reference
| Scenario | Output line order |
|---|---|
| Small local file, default flags | Same as input GTF |
| Large local file, default flags | Sorted by seqid (stable within seqid) |
| Any size + --low-memory | Sorted by seqid (unless --no-sort) |
| Any size + --no-sort | Same as input (or download order for URLs) |
| Small URL, default flags | Same as download order (no sort) |
| Large URL, default flags | Sorted by seqid |
Use --no-sort when downstream tools require the exact row order of the source GTF (e.g. diffing against the original). Use the default (or --low-memory on smaller files) when you want seqids grouped for streaming or parity with gffy stats low-memory mode.
Conversion memory modes
| Mode | When | Behavior |
|---|---|---|
| single-pass | Small files (same size thresholds as stats) | Two passes over the source path; input row order preserved |
| low-memory | Large files, --low-memory, or URL staging + sort | Download/stage once, sort by seqid into a temporary file, convert, delete temp |
Nothing is cached permanently on disk.
Python API (conversion)
from gffy import convert_gtf_to_gff3
summary = convert_gtf_to_gff3(
    "/path/to/annotation.gtf.gz",
    "out.gff3.gz",
    force_low_memory=False,
    sort=True,
)
print(summary["feature_count"], summary["genes_with_cds"])
| Symbol | Use case |
|---|---|
| `convert_gtf_to_gff3(source, output, ...)` | Convert GTF → GFF3 file |
| `describe_convert_mode(source, ...)` | Mode label + SourceInfo for logging |
Converted GFF3 can be passed directly to compute_gff_stats for annotrieve-compatible statistics.
Python library usage
Quick start
from gffy import compute_gff_stats
stats = compute_gff_stats("/path/to/annotation.gff3.gz")
coding = stats["gene_category_stats"].get("coding", {})
print(coding.get("total_count", 0))
for ttype, tstats in stats["transcript_type_stats"].items():
    print(ttype, tstats["total_count"])
Primary API
compute_gff_stats(
    source: str,
    *,
    force_low_memory: bool = False,
) -> dict
- source: local file path or http(s) / ftp URL (plain or gzip).
- force_low_memory: always use sort-to-temp + per-seqid processing (see Memory modes).
- Returns a dict with gene_category_stats and transcript_type_stats (annotrieve-compatible).
Additional exports
| Symbol | Use case |
|---|---|
| `compute_gff_stats_from_lines(lines)` | Stats from an in-memory iterable of GFF lines |
| `inspect_source(source)` | Size/metadata for a path or URL (SourceInfo) |
| `is_big(info)` | Whether auto low-memory mode applies |
| `describe_compute_mode(source, force_low_memory=...)` | Human-readable mode label + SourceInfo |
| `build_sorted_gff(source, output_path, info=None)` | Low-level: sort into a gzipped file you manage |
| `convert_gtf_to_gff3(source, output, ...)` | Convert GTF → GFF3 (see GTF → GFF3 conversion) |
| `describe_convert_mode(source, ...)` | Mode label for conversion |
| `build_custom_annotation(source, ...)` | Annotrieve Favorites CustomAnnotation dict (offline import) |
| `derive_custom_name(source, override=None)` | Display name from local path or URL path only (no host) for custom annotations |
| `compute_bulk_stats(sources, output, ...)` | Bulk stats → JSONL (see Bulk stats from a list) |
| `read_source_list(path_or_dash)` | Parse a newline-separated source list (SourceListResult) |
| `warn_source_list_issues(result)` | Print duplicate-line warnings to stderr |
| `ErrorInfo`, `classify_exception` | Structured error classification |
| `__version__` | Package version string |
Validate output JSON
The output matches schema.json at the repo root. When installed, a copy is shipped inside the package:
from importlib.resources import files
import json
schema = json.loads(
    files("gffy").joinpath("schema.json").read_text(encoding="utf-8")
)
Large files in code
stats = compute_gff_stats(
    "https://example.com/large.gff.gz",
    force_low_memory=True,
)
Memory modes
| Mode | When | Peak RAM |
|---|---|---|
| single-pass | gz ≤ 100 MB and plain ≤ 1 GB (by size check) | Low for small/medium files |
| low-memory | Above thresholds, or force_low_memory=True / --low-memory | Bounded per seqid (~500 MB target) |
Low-memory mode:
- Sorts the GFF by seqid with system sort (LC_ALL=C, -S 200M buffer cap) into a temporary .gff.gz.
- Streams one seqid at a time, updates global stats, frees per-seqid scratch.
- Deletes the temporary sorted file when done (nothing persisted on disk).
Environment overrides:
- GFFY_GZ_THRESHOLD_BYTES (default 104857600, i.e. 100 MB)
- GFFY_PLAIN_THRESHOLD_BYTES (default 1073741824, i.e. 1 GB)
URL auto-detection uses HEAD Content-Length when available; missing length defaults to single-pass.
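Taken together, the thresholds, environment overrides, and the missing-length fallback amount to a small decision rule. The sketch below restates that rule for illustration; `pick_mode` is a hypothetical function (the package exposes is_big and describe_compute_mode for the real decision), and reading the overrides as plain integers is an assumption.

```python
import os

DEFAULT_GZ_THRESHOLD = 104857600      # 100 MB
DEFAULT_PLAIN_THRESHOLD = 1073741824  # 1 GB


def pick_mode(size_bytes, is_gzip, force_low_memory=False, env=None):
    """Return "single-pass" or "low-memory" following the documented rules."""
    env = os.environ if env is None else env
    if force_low_memory:
        return "low-memory"
    if size_bytes is None:
        # Unknown size (e.g. no Content-Length on a URL) defaults to single-pass.
        return "single-pass"
    gz_limit = int(env.get("GFFY_GZ_THRESHOLD_BYTES", DEFAULT_GZ_THRESHOLD))
    plain_limit = int(env.get("GFFY_PLAIN_THRESHOLD_BYTES", DEFAULT_PLAIN_THRESHOLD))
    limit = gz_limit if is_gzip else plain_limit
    # Sizes at or below the threshold stay single-pass.
    return "low-memory" if size_bytes > limit else "single-pass"


print(pick_mode(50_000_000, is_gzip=True, env={}))
print(pick_mode(2_000_000_000, is_gzip=False, env={}))
```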
Output structure
Top-level keys:
- gene_category_stats: coding, non_coding, pseudogene (each present only if count > 0)
- transcript_type_stats: keyed by transcript type (e.g. mRNA), sorted by descending total_count
Example (abbreviated):
{
"gene_category_stats": {
"coding": {
"total_count": 22178,
"length_stats": { "min": 10, "max": 2960899, "mean": 48306.64 },
"biotype_counts": { "protein_coding": 20000 },
"transcript_type_counts": { "mRNA": 66153 }
}
},
"transcript_type_stats": {
"mRNA": {
"total_count": 66153,
"length_stats": { "min": 100, "max": 50000, "mean": 3500.0 },
"biotype_counts": { "protein_coding": 60000 },
"associated_genes": {
"total_count": 20000,
"gene_categories": { "coding": 20000 }
},
"exon_stats": { "total_count": 500000, "length": { "min": 10, "max": 5000, "mean": 150.0 } },
"cds_stats": { "..." : "..." }
}
}
}
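A short sketch of walking this structure in Python follows. The dict is a trimmed copy of the abbreviated example, plus one made-up lnc_RNA entry so that the ordering by total_count is visible; `summarize` is a hypothetical helper.

```python
stats = {
    "gene_category_stats": {
        "coding": {"total_count": 22178},
    },
    "transcript_type_stats": {
        "lnc_RNA": {"total_count": 1200},  # made-up entry, for illustration only
        "mRNA": {"total_count": 66153},
    },
}


def summarize(stats):
    """Return (kind, name, total_count) rows, transcripts ordered by descending count."""
    rows = []
    for category, entry in stats["gene_category_stats"].items():
        rows.append(("gene", category, entry["total_count"]))
    ranked = sorted(
        stats["transcript_type_stats"].items(),
        key=lambda item: item[1]["total_count"],
        reverse=True,
    )
    for transcript_type, entry in ranked:
        rows.append(("transcript", transcript_type, entry["total_count"]))
    return rows


for kind, name, count in summarize(stats):
    print(f"{kind}\t{name}\t{count}")
```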
Gene categories:
- coding: genes with CDS-bearing transcripts or protein_coding biotype
- non_coding: genes with exons but no CDS
- pseudogene: pseudogene feature type
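The three categories can be read as a simple decision function. The sketch below is one possible reading, not gffy's code: `gene_category` is hypothetical, and the order in which the rules are checked (pseudogene first, then coding, then non_coding) is an assumption that the README does not state.

```python
def gene_category(feature_type, biotype, has_cds, has_exon):
    """Classify a gene-level feature; returns None when no rule applies."""
    # Assumption: the pseudogene feature type takes precedence over the other rules.
    if feature_type == "pseudogene":
        return "pseudogene"
    if has_cds or biotype == "protein_coding":
        return "coding"
    if has_exon:
        return "non_coding"
    return None


print(gene_category("gene", "protein_coding", has_cds=True, has_exon=True))
print(gene_category("gene", "lncRNA", has_cds=False, has_exon=True))
print(gene_category("pseudogene", None, has_cds=False, has_exon=True))
```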
Development
pip install -e ".[dev]"
pytest -v
python -m build # smoke-test packaging
flake8
black .
CI runs on push and pull requests to main and master (Python 3.9 and 3.12).
Releasing to PyPI
- Bump version in pyproject.toml and src/gffy/__init__.py.
- Commit, tag (e.g. v0.1.1), and push the tag.
- Create a GitHub Release from that tag (publish event triggers .github/workflows/publish.yml).
One-time PyPI trusted publishing setup:
- Register the project on PyPI as gffy (if needed).
- PyPI → Publishing → add a trusted publisher for GitHub:
  - Owner: guigolab
  - Repository: gffy
  - Workflow: publish.yml
  - Environment: pypi (matches the workflow environment.name)
No long-lived PYPI_API_TOKEN is required when trusted publishing is configured.
License
MIT — see LICENSE.