Build standardized, indexed reference files (genome, annotation, repeats, RNAcentral, cCRE) for genome browsers.

refbox

Build standardized, indexed reference files for genome browsers — in one command.

refbox turns a YAML registry of species/assemblies into ready-to-load browser inputs:

  • Genome FASTA → bgzip + samtools faidx (.fa.gz + .fai + .gzi) + chrom.sizes
  • Transcriptome FASTA → bgzip + samtools faidx (auto-derived from genome + GTF via gffread when no upstream URL exists)
  • GTF / GFF3 annotations → sorted + bgzip + tabix
  • Repeats (UCSC RepeatMasker rmsk.txt.gz → BED + GTF; .fa.out.gz report)
  • RNAcentral non-coding RNA annotations (direct download or liftover from another assembly)
  • ENCODE SCREEN cCREs

It ships a registry of 26 species / 42 assemblies (human, mouse, rat, dog, cow, pig, chimp, gorilla, zebrafish, fly, worm, sea urchin, yeast, plants, bacteria, viruses…) covering GENCODE, Ensembl, Ensembl Genomes, UCSC golden path, NCBI, RNAcentral, and ENCODE SCREEN.
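
The chrom.sizes file listed for the genome resource holds the same information as the first two columns of a samtools faidx index (sequence name and length). The following is a minimal sketch of that derivation, not refbox's own code; `fai_to_chrom_sizes` is a hypothetical helper name, and the byte offsets in the example index are illustrative.

```python
def fai_to_chrom_sizes(fai_text: str) -> str:
    """Turn the text of a .fai index into chrom.sizes text.

    A .fai line has five tab-separated columns:
    NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    chrom.sizes keeps only NAME and LENGTH.
    """
    rows = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        name, length = line.split("\t")[:2]
        rows.append(f"{name}\t{int(length)}")
    return ("\n".join(rows) + "\n") if rows else ""


# Illustrative index text; the offset column is made up for the example.
example_fai = "chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105714\t60\t61\n"
print(fai_to_chrom_sizes(example_fai), end="")
```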


Install

pip install refbox

External tools (install once via conda / mamba):

mamba install -c bioconda htslib samtools gffread ucsc-bedtobigbed ucsc-liftover

Required: bgzip, tabix, samtools, gffread, liftOver (for RNAcentral cross-assembly liftover), bedToBigBed (for refbox build -bed), GNU sort/grep.


CLI overview

refbox download   # only fetch raw files configured in species.yaml
refbox pull       # full pipeline: download (if missing) + build + test
refbox test       # validate build/ outputs
refbox build      # single-file / directory build for arbitrary inputs

refbox pull — the registry-driven pipeline

# Fetch + build + validate Human GRCh38 (uses bundled species.yaml).
refbox pull --species Homo_sapiens --assembly GRCh38

# --species is optional; it is inferred from --assembly via the registry.
refbox pull --assembly GRCm38

# Loop every assembly in the registry, including ones marked enabled: false.
refbox pull --include-disabled --resource genome transcriptome \
            annotation_gtf annotation_gff3 repeats_rmsk rnacentral

# Pull only specific resources.
refbox pull --assembly GRCh38 --resource ccre

Flag                Meaning
--species           optional filter; inferred from --assembly when omitted
--assembly          optional filter; omit to run every assembly that matches --species
--resource          subset of genome transcriptome annotation_gtf annotation_gff3 repeats_rmsk repeats_bed repeats_gtf repeats_fa rnacentral ccre
--include-disabled  also process assemblies with enabled: false
--out DIR           output root (default: $REFBOX_OUT or $PWD)
--force             rebuild even when outputs exist
--no-download       skip the auto-download phase
--no-test           skip the post-build validation
-v                  verbose / DEBUG logging
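
Inferring --species from --assembly amounts to a reverse lookup in the species → assembly mapping. The sketch below shows one way to do it, assuming the registry has been loaded into a nested dict; `infer_species` is a hypothetical function, and the `Mus_musculus` key is an assumed registry name used only for illustration.

```python
def infer_species(registry: dict, assembly: str) -> str:
    """Return the single species whose entry contains `assembly`.

    `registry` maps species name -> {assembly name -> resources}.
    """
    matches = [sp for sp, assemblies in registry.items() if assembly in assemblies]
    if not matches:
        raise KeyError(f"assembly {assembly!r} not found in registry")
    if len(matches) > 1:
        raise ValueError(f"assembly {assembly!r} is ambiguous: {matches}")
    return matches[0]


# Toy registry; resource entries are left empty for brevity.
registry = {
    "Homo_sapiens": {"GRCh38": {}, "GRCh37": {}},
    "Mus_musculus": {"GRCm38": {}, "GRCm39": {}},
}
print(infer_species(registry, "GRCm38"))  # Mus_musculus
```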

refbox build — single-file / directory build

For custom data that does not have a species.yaml entry. Each call detects the input by extension (or by explicit flag), verifies bgzip / sort order, runs the canonical transformation, and emits indexed outputs.
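
tabix needs records grouped by sequence and ordered by start position within each sequence. The source does not show how refbox checks this, so the following is only a sketch of such a check; `is_position_sorted` and its parameters are hypothetical. Column indices are 0-based: GTF/GFF3 keep the start in column 3, BED in column 1.

```python
def is_position_sorted(lines, seq_col=0, start_col=3):
    """True if records are contiguous per sequence and start-sorted within it."""
    seen = set()        # sequences already finished or in progress
    current = None      # sequence of the previous record
    last_start = -1
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue    # skip blank lines and header/comment lines
        fields = line.rstrip("\n").split("\t")
        seq, start = fields[seq_col], int(fields[start_col])
        if seq != current:
            if seq in seen:
                return False    # sequence block is not contiguous
            seen.add(seq)
            current, last_start = seq, start
        elif start < last_start:
            return False        # start went backwards within a sequence
        else:
            last_start = start
    return True


gtf = [
    "#!genome-build example\n",
    "chr1\tsrc\tgene\t100\t200\n",
    "chr1\tsrc\tgene\t150\t300\n",
    "chr2\tsrc\tgene\t10\t50\n",
]
print(is_position_sorted(gtf))                       # True
print(is_position_sorted(list(reversed(gtf[1:]))))   # False
```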

# Genome FASTA → bgzipped + faidx + chrom.sizes
refbox build -fa  GENOME.fa [-o OUT.fa.gz]

# GTF or GFF3 → sorted + bgzip + tabix
refbox build -gtf ANNOT.gtf [-o OUT.gtf.gz]
refbox build -gff ANNOT.gff3 [-o OUT.gff3.gz]

# Genome + annotation → transcriptome FASTA (via gffread) + faidx
refbox build -fa GENOME.fa -gtf ANNOT.gtf -o transcriptome.fa.gz

# BED → sorted + bgzip + tabix + bigBed
#   chrom.sizes resolved from --chrom-sizes FILE or --assembly NAME
#   (the latter delegates to zlbio's per-species lookup).
refbox build -bed FEATURES.bed [--chrom-sizes FILE | --assembly NAME]

# UCSC rmsk.txt[.gz] → repeats.sorted.bed.gz + repeats.sorted.gtf.gz (+ .tbi)
refbox build -rmsk rmsk.txt.gz [-o OUT_DIR]

# Directory of user files (auto-classified by extension) → full layout under {Species}/{Assembly}/
refbox build -i DIR --assembly NAME [--species NAME]

# Auto-detect a single file by extension
refbox build SOMEFILE

Inputs may be plain or .gz. The builder transparently re-bgzips a plain gzip, copies an already-bgzipped file as-is, and re-sorts an unsorted GFF/BED before indexing.
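
Telling a plain gzip from an already-bgzipped file can be done from the first bytes alone: a BGZF block is a gzip member with the FEXTRA flag set and a "BC" extra subfield of length 2. The sketch below illustrates that header test; `classify_compression` is a hypothetical name, and the source does not say whether refbox uses this exact check.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def classify_compression(header: bytes) -> str:
    """Return 'bgzf', 'gzip', or 'plain' from the first 16 bytes of a file."""
    if header[:2] != GZIP_MAGIC:
        return "plain"
    # Byte 3 is the gzip FLG field; bit 0x04 (FEXTRA) announces an extra field.
    # In BGZF the first extra subfield at offset 12 is 'B', 'C', length 2.
    has_extra = len(header) >= 16 and bool(header[3] & 0x04)
    if has_extra and header[12:16] == b"BC\x02\x00":
        return "bgzf"
    return "gzip"


# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)

print(classify_compression(BGZF_EOF))                 # bgzf
print(classify_compression(gzip.compress(b"ACGT")))   # gzip
print(classify_compression(b">chr1\nACGT\n"))         # plain
```

In practice only the first 16 bytes of the input need to be read, e.g. `open(path, "rb").read(16)`.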

refbox download / refbox test

refbox download --assembly GRCh38                  # raw files only
refbox test     --assembly GRCh38                  # validate existing build/
refbox test     --include-disabled                 # everything in the registry

Output layout

{REFBOX_OUT}/
  {Species}/
    {Assembly}/
      raw/                          # original downloads / copies
        genome.fa
        transcriptome.fa             # may be derived from genome + GTF
        annotation_gtf.gtf
        annotation_gff3.gff3
        repeats_rmsk.tsv
        repeats_fa.fa
        rnacentral.gff3              # may be lifted from another assembly
        ccre.bed
      build/                        # browser-loadable
        genome.fa.gz                 + .fai + .gzi
        chrom.sizes
        transcriptome.fa.gz          + .fai
        transcriptome.derived.fa.gz  + .fai   # gffread-extracted, when GTF available
        annotation.sorted.gtf.gz     + .tbi
        annotation.sorted.gff3.gz    + .tbi
        repeats.sorted.bed.gz        + .tbi
        repeats.sorted.gtf.gz        + .tbi
        rnacentral.sorted.gff3.gz    + .tbi
        ccre.sorted.bed.gz           + .tbi

Programmatic API

from refbox.config   import load_config, iter_targets
from refbox.download import download_targets
from refbox.build    import build_targets
from refbox.test     import test_targets
from refbox          import file_build as fb     # single-file builders
from refbox.ingest   import ingest_directory     # directory ingest
from refbox.report   import build_report         # Markdown status report

Tutorial — adding a new assembly

Edit config/species.yaml. Each entry is three levels deep: species → assembly → resource. All 10 resource keys must appear; use null for ones with no upstream source.

species:
  Homo_sapiens:
    GRCh38:
      enabled: true
      gencode_version: 44
      ucsc_db: hg38
      genome:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz
      transcriptome:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.transcripts.fa.gz
      annotation_gtf:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
      annotation_gff3:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gff3.gz
      repeats_rmsk:
        url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
      repeats_bed: null
      repeats_gtf: null
      repeats_fa:
        url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.out.gz
      rnacentral:
        url: https://ftp.ebi.ac.uk/pub/databases/RNAcentral/current_release/genome_coordinates/gff3/homo_sapiens.GRCh38.gff3.gz
      ccre:
        url: https://downloads.wenglab.org/V3/GRCh38-cCREs.bed
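
Because all 10 resource keys must appear, a new entry is easy to sanity-check before running the pipeline. The sketch below works on an assembly entry that has already been parsed into a dict; `missing_resource_keys` is a hypothetical helper, not a refbox function. A key set to null counts as present.

```python
RESOURCE_KEYS = (
    "genome", "transcriptome", "annotation_gtf", "annotation_gff3",
    "repeats_rmsk", "repeats_bed", "repeats_gtf", "repeats_fa",
    "rnacentral", "ccre",
)


def missing_resource_keys(entry: dict) -> list:
    """List the canonical resource keys absent from an assembly entry."""
    return [key for key in RESOURCE_KEYS if key not in entry]


# An incomplete entry: only three resources were written down.
entry = {
    "enabled": True,
    "genome": {"url": "https://example.org/genome.fa.gz"},  # placeholder URL
    "transcriptome": None,   # null is fine: the key is present
    "ccre": None,
}
print(missing_resource_keys(entry))
```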

Fallbacks

my_resource:
  local_path: /path/on/disk/file.fa.gz     # copy if present
  url:        https://.../file.fa.gz       # else download

# Concatenate multiple upstream files into one raw file (e.g. Ensembl cdna+ncrna):
transcriptome:
  url:        https://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/cdna/Danio_rerio.GRCz11.cdna.all.fa.gz
  extra_urls:
    - https://ftp.ensembl.org/pub/release-111/fasta/danio_rerio/ncrna/Danio_rerio.GRCz11.ncrna.fa.gz

# RNAcentral cross-assembly liftover (no direct URL upstream):
rnacentral:
  liftover_from:
    source_assembly: GRCh38
    url:       https://.../homo_sapiens.GRCh38.gff3.gz
    chain_url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/liftOver/hg38ToHg19.over.chain.gz

# Transcriptome auto-derivation: leave it null and refbox will build
# transcriptome.fa.gz from genome + GTF/GFF via gffread.
transcriptome: null
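
The fallbacks above can be read as a small decision procedure over one resource entry. The following sketch spells that out under assumptions: the function name, the action labels, and the relative order of `url` and `liftover_from` are hypothetical, and only the "local copy first, else download" order comes from the comments above.

```python
import os


def plan_source(spec, exists=os.path.exists):
    """Decide how to obtain one resource from its registry entry.

    Returns (action, payload). Action labels are illustrative only.
    """
    if spec is None:
        # null entry: nothing upstream; a transcriptome may be derived instead
        return ("derive_or_skip", None)
    local = spec.get("local_path")
    if local and exists(local):
        return ("copy", local)                  # copy if present
    if spec.get("url"):
        # extra_urls are concatenated after the primary file
        return ("download", [spec["url"], *spec.get("extra_urls", [])])
    if spec.get("liftover_from"):
        return ("liftover", spec["liftover_from"])
    return ("skip", None)


spec = {"local_path": "/no/such/file.fa.gz", "url": "https://example.org/file.fa.gz"}
print(plan_source(spec, exists=lambda p: False))
```

Passing `exists` as a parameter keeps the sketch testable without touching the filesystem.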

Canonical resource names

Name             raw/                  build/
genome           genome.fa             genome.fa.gz + .fai + .gzi, chrom.sizes
transcriptome    transcriptome.fa      transcriptome.fa.gz + .fai, transcriptome.derived.fa.gz + .fai
annotation_gtf   annotation_gtf.gtf    annotation.sorted.gtf.gz + .tbi
annotation_gff3  annotation_gff3.gff3  annotation.sorted.gff3.gz + .tbi
repeats_rmsk     repeats_rmsk.tsv      repeats.sorted.bed.gz + repeats.sorted.gtf.gz
repeats_bed      repeats_bed.bed       repeats.sorted.bed.gz + .tbi
repeats_gtf      repeats_gtf.gtf       repeats.sorted.gtf.gz + .tbi
repeats_fa       repeats_fa.fa         (RepeatMasker .fa.out report)
rnacentral       rnacentral.gff3       rnacentral.sorted.gff3.gz + .tbi
ccre             ccre.bed              ccre.sorted.bed.gz + .tbi

Environment

Variable       Meaning
REFBOX_OUT     default output root for {Species}/{Assembly}/{raw,build}/
REFBOX_CONFIG  path to a custom species.yaml (overrides the bundled registry)
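
The output root follows the precedence stated for --out: an explicit value, then $REFBOX_OUT, then the current directory. The sketch below expresses that rule and the {Species}/{Assembly}/build/ layout; `resolve_out_root` and `build_dir` are hypothetical helpers written for illustration.

```python
import os
from pathlib import Path


def resolve_out_root(cli_out=None, env=None) -> Path:
    """Pick the output root: --out value, else $REFBOX_OUT, else the cwd."""
    env = os.environ if env is None else env
    if cli_out:
        return Path(cli_out)
    if env.get("REFBOX_OUT"):
        return Path(env["REFBOX_OUT"])
    return Path.cwd()


def build_dir(root, species: str, assembly: str) -> Path:
    """Path of the browser-loadable outputs for one assembly."""
    return Path(root) / species / assembly / "build"


root = resolve_out_root(env={"REFBOX_OUT": "/data/reference"})
print(build_dir(root, "Homo_sapiens", "GRCh38"))
```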

Status reporter

python -m refbox.report --out /path/to/reference > report.md

Walks the output tree and emits a Chinese-language Markdown report listing, per assembly, the status (✓ done / ⚠ missing index / ✗ missing) and size of each artifact.
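
The three statuses map onto a simple existence check of an artifact and its index files. The sketch below shows that classification only; `artifact_status` is a hypothetical function, not the reporter's actual code, and it does not compute sizes or render Markdown.

```python
import tempfile
from pathlib import Path


def artifact_status(path, index_suffixes=(".tbi",)) -> str:
    """Classify one build artifact as 'done', 'missing index', or 'missing'."""
    path = Path(path)
    if not path.exists():
        return "missing"
    for suffix in index_suffixes:
        # Index files sit next to the artifact: NAME.gz -> NAME.gz.tbi
        if not Path(str(path) + suffix).exists():
            return "missing index"
    return "done"


with tempfile.TemporaryDirectory() as tmp:
    gtf = Path(tmp) / "annotation.sorted.gtf.gz"
    print(artifact_status(gtf))                 # missing
    gtf.write_bytes(b"")
    print(artifact_status(gtf))                 # missing index
    Path(str(gtf) + ".tbi").write_bytes(b"")
    print(artifact_status(gtf))                 # done
```

For the genome, the same call would take `index_suffixes=(".fai", ".gzi")`.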


Development

git clone https://github.com/typekey/refbox.git
cd refbox
pip install -e .
pytest -q                 # unit tests for `refbox build` single-file modes
refbox --help

Release

Tags matching v* automatically build and publish to PyPI via GitHub Actions (.github/workflows/workflow.yml) using PyPI trusted publishing.

git tag v0.3.0
git push origin v0.3.0

Changelog

  • v0.3.2: Robust download backend fallback. _download now tries axel → aria2c → wget → wget --no-check-certificate → requests → requests verify=False in order, so a single broken TLS host (e.g. ftp.ensemblgenomes.org, whose certificate is not valid for its own hostname) no longer aborts a resource; the download silently retries with the next backend. Also silenced axel/aria2c/wget progress noise.
  • v0.3.0: CLI refactor (build → pull); new refbox build for arbitrary single-file/directory inputs (with auto bgzip / sort / bigBed); transcriptome auto-derivation via gffread; unit-tested + CI; --include-disabled flag; Chinese refbox.report generator.
  • v0.2.0: refbox import subcommand; optional --species; refbox build auto-downloads missing raws and chains into refbox test; full canonical resource fields for all 42 assemblies; RNAcentral liftover entries for GRCh37 / mm9 / rn6.
  • v0.1.5: RNAcentral liftover_from (download chain + source GFF, liftOver + chrom-name normalization) for assemblies without upstream files.
  • v0.1.4: RNAcentral chrom-name normalization (Ensembl → UCSC).
  • v0.1.3: atomic bgzip writes (.tmp + rename) to prevent truncated outputs.
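
The backend fallback described for v0.3.2 is an instance of a "first success wins" loop. The sketch below illustrates the pattern with stand-in callables; `first_success` and the two fake backends are hypothetical, and the real `_download` implementation is not shown in this document.

```python
def first_success(backends, url, dest):
    """Try each (name, fetch) pair in order; return the name that worked."""
    errors = []
    for name, fetch in backends:
        try:
            fetch(url, dest)
            return name
        except Exception as exc:        # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-ins for real downloaders, used only to exercise the loop.
def broken_tls(url, dest):
    raise ConnectionError("certificate verify failed")


def works(url, dest):
    return None


backends = [("wget", broken_tls), ("requests", works)]
print(first_success(backends, "https://example.org/file.fa.gz", "file.fa.gz"))  # requests
```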

License

MIT — see LICENSE.
