
Build standardized, indexed reference files (genome, annotation, repeats, RNAcentral, cCRE) for genome browsers.


refbox


Build standardized, indexed reference files for genome browsers — in one command.

refbox turns a YAML registry of species/assemblies into ready-to-load browser inputs:

  • Genome FASTA → bgzip + samtools faidx (.gz + .fai + .gzi)
  • Transcriptome FASTA → bgzip + samtools faidx
  • GTF / GFF3 annotations → sorted + bgzip + tabix
  • Repeats (UCSC RepeatMasker rmsk.txt.gz + .fa.out.gz)
  • RNAcentral non-coding RNA annotations
  • ENCODE SCREEN cCREs
  • chrom.sizes derived from the genome .fai
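
As an illustration of the last item, a chrom.sizes file is just the first two columns (sequence name, length) of a samtools .fai index. The sketch below is not refbox's own code; the function name fai_to_chrom_sizes and the byte offsets in the example lines are made up for illustration.

```python
def fai_to_chrom_sizes(fai_text: str) -> str:
    """Keep the name and length columns of each .fai line."""
    out = []
    for line in fai_text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, length = line.split("\t")[:2]
        out.append(f"{name}\t{int(length)}\n")
    return "".join(out)


# Two index lines in the five-column .fai layout
# (name, length, offset, linebases, linewidth); offsets are invented.
example_fai = "chr1\t248956422\t6\t60\t61\nchrM\t16569\t253105708\t60\t61\n"
print(fai_to_chrom_sizes(example_fai), end="")
```

On the command line the same result is usually obtained with cut -f1,2 on the .fai file.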

It ships a registry of 26 species / 42 assemblies (human, mouse, rat, dog, cow, pig, chimp, gorilla, zebrafish, fly, worm, sea urchin, yeast, plants, bacteria, viruses…) covering GENCODE, Ensembl, Ensembl Genomes, UCSC golden path, NCBI, RNAcentral, and ENCODE SCREEN.


Why

Genome browsers (e.g. WashU Epigenome Browser, rbrowser) expect a strict set of file formats and indices. Manually downloading, sorting, bgzipping, tabixing, and checking each one is tedious and error-prone. refbox makes the process:

  • declarative: one YAML, every URL pinned
  • idempotent: re-running skips finished files; --force to rebuild
  • filterable: --species, --assembly, --resource to scope work
  • verifiable: a test subcommand runs real samtools faidx / tabix queries
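
The idempotent behaviour can be pictured with a small check like the one below. This is a sketch under assumptions, not refbox's implementation: the helper name needs_build is hypothetical, and treating an empty file as unfinished is a choice made here for illustration.

```python
from pathlib import Path
from typing import Iterable


def needs_build(outputs: Iterable[Path], force: bool = False) -> bool:
    """Return True when a step should run.

    A step runs if force is set (the --force case) or if any expected
    output is missing or empty (the "empty" rule is an assumption).
    """
    if force:
        return True
    return any(not p.is_file() or p.stat().st_size == 0 for p in outputs)
```

A caller would pass the files a step is expected to produce, for example the .gz file and its index, and skip the step when needs_build returns False.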

Install

pip install refbox

refbox shells out to the standard htslib tooling. Install once via conda/mamba (recommended):

mamba install -c bioconda htslib samtools

Required CLI tools: bgzip, tabix, samtools, GNU sort, grep.
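
A quick way to confirm the tools are on PATH before a long build is a check like this one. It is a convenience sketch, not part of refbox; missing_tools is a hypothetical helper, and it only tests that a sort executable exists, not that it is the GNU version.

```python
import shutil

# Executable names for the tools listed above (GNU sort is invoked as "sort").
REQUIRED_TOOLS = ["bgzip", "tabix", "samtools", "sort", "grep"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the subset of tools that cannot be found on PATH."""
    return [t for t in tools if shutil.which(t) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```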


Quick start

# Download → build → validate Human GRCh38 (uses bundled species.yaml)
refbox download --species Homo_sapiens --assembly GRCh38
refbox build    --species Homo_sapiens --assembly GRCh38
refbox test     --species Homo_sapiens --assembly GRCh38

Or use the one-shot driver:

git clone https://github.com/typekey/refbox.git
cd refbox
./build.sh Homo_sapiens GRCh38

CLI reference

refbox download [--species ...] [--assembly ...] [--resource ...] [--out DIR] [--force]
refbox build    [--species ...] [--assembly ...] [--resource ...] [--out DIR] [--force]
refbox test     [--species ...] [--assembly ...] [--out DIR]

Flag            Default              Meaning
--species       all enabled          filter to one or more species (e.g. Homo_sapiens)
--assembly      all enabled          filter to one or more assemblies (e.g. GRCh38)
--resource      all 10               subset: genome transcriptome annotation_gtf annotation_gff3 repeats_rmsk repeats_bed repeats_gtf repeats_fa rnacentral ccre
--out           $REFBOX_OUT or $PWD  output root
--force         off                  rebuild even when outputs exist
-v / --verbose                       enable DEBUG logging

Environment variables:

Name           Meaning
REFBOX_OUT     default output root for {Species}/{Assembly}/{raw,build}/
REFBOX_CONFIG  path to a custom species.yaml (overrides the bundled registry)
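
The default for --out given in the flag table suggests the following resolution order. The sketch below is an assumption about that order, not refbox's code; resolve_out is a hypothetical name, and Path.cwd() stands in for $PWD.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_out(cli_out: Optional[str] = None,
                env: Mapping[str, str] = os.environ) -> Path:
    """Pick the output root: --out first, then $REFBOX_OUT, then the cwd."""
    if cli_out:
        return Path(cli_out)
    if env.get("REFBOX_OUT"):
        return Path(env["REFBOX_OUT"])
    return Path.cwd()


print(resolve_out("/data/refs"))
```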

Output layout

{REFBOX_OUT}/
  {Species}/
    {Assembly}/
      raw/                          # original downloads / copies
        genome.fa
        transcriptome.fa
        annotation_gtf.gtf
        annotation_gff3.gff3
        repeats_rmsk.tsv
        repeats_fa.fa
        rnacentral.gff3
        ccre.bed
      build/                        # browser-loadable
        genome.fa.gz                + .fai + .gzi
        chrom.sizes
        transcripts.fa.gz           + .fai
        annotation.sorted.gtf.gz    + .tbi
        annotation.sorted.gff3.gz   + .tbi
        repeats.sorted.bed.gz       + .tbi
        repeats.sorted.gtf.gz       + .tbi
        rnacentral.sorted.gff3.gz   + .tbi
        ccre.sorted.bed.gz          + .tbi
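
Given this layout, a script that consumes the build directory can check for the genome files before loading them. The sketch assumes the index files are named by appending .fai and .gzi to genome.fa.gz, which is what samtools faidx writes for a bgzipped FASTA; the helper names are hypothetical.

```python
from pathlib import Path

# Assumed file names for the genome resource, read off the layout above.
GENOME_OUTPUTS = ["genome.fa.gz", "genome.fa.gz.fai", "genome.fa.gz.gzi", "chrom.sizes"]


def build_dir(out, species: str, assembly: str) -> Path:
    """Return {out}/{Species}/{Assembly}/build."""
    return Path(out) / species / assembly / "build"


def missing_outputs(out, species: str, assembly: str, names=GENOME_OUTPUTS):
    """List the expected files that are not present in the build directory."""
    d = build_dir(out, species, assembly)
    return [n for n in names if not (d / n).is_file()]


print(build_dir("/data/refs", "Homo_sapiens", "GRCh38"))
```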

Examples

1. Build only cCREs for Human GRCh38

refbox download --species Homo_sapiens --assembly GRCh38 --resource ccre
refbox build    --species Homo_sapiens --assembly GRCh38 --resource ccre
refbox test     --species Homo_sapiens --assembly GRCh38

2. Build a full reference into a specific directory

refbox download --species Mus_musculus --assembly GRCm38 --out /data/refs
refbox build    --species Mus_musculus --assembly GRCm38 --out /data/refs
refbox test     --species Mus_musculus --assembly GRCm38 --out /data/refs

3. Use a private YAML registry

export REFBOX_CONFIG=/path/to/my_species.yaml
refbox build

4. Drive everything from build.sh

./build.sh                              # all enabled assemblies
./build.sh Homo_sapiens                 # one species, all enabled assemblies
./build.sh Homo_sapiens GRCh38          # one species + assembly
./build.sh Homo_sapiens GRCh38 -- --resource genome ccre
FORCE=1 ./build.sh Mus_musculus GRCm38  # rebuild even when outputs exist
STEPS="test" ./build.sh                 # only run validation

Tutorial — adding a new assembly

The registry lives in config/species.yaml. Each entry is three levels deep: species → assembly → resource.

Step 1. Add an assembly block

species:
  Homo_sapiens:
    GRCh38:
      enabled: true                  # set false to keep it idle
      gencode_version: 44
      ucsc_db: hg38

      genome:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz

      annotation_gtf:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz

      repeats_rmsk:
        url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz

      ccre:
        url: https://downloads.wenglab.org/V3/GRCh38-cCREs.bed
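
To make the three-level shape concrete, here is how such a registry could be walked once it has been parsed into nested dicts. This is only a sketch: the URLs are placeholders, the disabled hg19 block is invented for the example, enabled_targets is not refbox's iter_targets, and treating a missing enabled key as disabled is an assumption.

```python
from typing import Iterator, List, Tuple

# Hypothetical parsed registry with placeholder URLs, same shape as the YAML.
registry = {
    "species": {
        "Homo_sapiens": {
            "GRCh38": {
                "enabled": True,
                "gencode_version": 44,
                "ucsc_db": "hg38",
                "genome": {"url": "https://example.org/genome.fa.gz"},
                "ccre": {"url": "https://example.org/ccre.bed"},
            },
            "hg19": {
                "enabled": False,
                "genome": {"url": "https://example.org/hg19.fa.gz"},
            },
        }
    }
}


def enabled_targets(reg: dict) -> Iterator[Tuple[str, str, List[str]]]:
    """Yield (species, assembly, resource names) for enabled assemblies."""
    for species, assemblies in reg.get("species", {}).items():
        for assembly, block in assemblies.items():
            if not block.get("enabled", False):
                continue
            # Resources are mappings; scalar keys such as ucsc_db are metadata.
            resources = [k for k, v in block.items() if isinstance(v, dict)]
            yield species, assembly, resources


for target in enabled_targets(registry):
    print(target)
```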

Step 2. Each resource entry follows a fallback rule

my_resource:
  local_path: /path/on/disk/file.fa.gz   # used if it exists
  url:        https://.../file.fa.gz     # else downloaded
# omit / null = skipped silently

  • local_path exists → copy into raw/ (auto-gunzips .gz)
  • otherwise url → download into raw/
  • otherwise → skip (no error)
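
The fallback rule can be written as a small decision function. This is a sketch of the rule as described, not refbox's code; resolve_source and its return values are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_source(entry: Optional[dict]) -> Tuple[str, Optional[str]]:
    """Decide what to do with one resource entry.

    Returns ("copy", path), ("download", url) or ("skip", None).
    """
    if not entry:                      # omitted or null entry
        return ("skip", None)
    local = entry.get("local_path")
    if local and Path(local).exists():
        return ("copy", local)
    url = entry.get("url")
    if url:
        return ("download", url)
    return ("skip", None)


print(resolve_source({"url": "https://example.org/file.fa.gz"}))
```

Note that a local_path pointing at a file that does not exist is not an error under this rule; the entry simply falls through to url.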

Step 3. Canonical resource names

Name             Output (raw)          Output (build)
genome           genome.fa             genome.fa.gz + .fai + .gzi, chrom.sizes
transcriptome    transcriptome.fa      transcripts.fa.gz + .fai
annotation_gtf   annotation_gtf.gtf    annotation.sorted.gtf.gz + .tbi
annotation_gff3  annotation_gff3.gff3  annotation.sorted.gff3.gz + .tbi
repeats_rmsk     repeats_rmsk.tsv      (raw input for repeats_bed/gtf; derivation TODO)
repeats_bed      repeats_bed.bed       repeats.sorted.bed.gz + .tbi
repeats_gtf      repeats_gtf.gtf       repeats.sorted.gtf.gz + .tbi
repeats_fa       repeats_fa.fa         (RepeatMasker .fa.out report)
rnacentral       rnacentral.gff3       rnacentral.sorted.gff3.gz + .tbi
ccre             ccre.bed              ccre.sorted.bed.gz + .tbi
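
The .sorted outputs exist because tabix needs records grouped by sequence and ordered by position before it can index them. The sketch below shows one way to order GTF lines in memory, keeping header lines first; it is an illustration only (refbox shells out to GNU sort, and its exact sort keys are not documented here), and sort_gtf_lines is a hypothetical name.

```python
def sort_gtf_lines(lines):
    """Order GTF records by sequence name, then start, then end.

    Comment/header lines (starting with '#') stay at the top.
    """
    header = [l for l in lines if l.startswith("#")]
    body = [l for l in lines if l.strip() and not l.startswith("#")]

    def key(line):
        fields = line.split("\t")
        # GTF columns: 1 = seqname, 4 = start, 5 = end (1-based).
        return (fields[0], int(fields[3]), int(fields[4]))

    return header + sorted(body, key=key)


records = [
    "#!genome-build example",
    "chr2\tsrc\tgene\t10\t90\t.\t+\t.\tgene_id \"b\";",
    "chr1\tsrc\tgene\t200\t300\t.\t+\t.\tgene_id \"c\";",
    "chr1\tsrc\tgene\t50\t150\t.\t-\t.\tgene_id \"a\";",
]
for line in sort_gtf_lines(records):
    print(line)
```

For whole-genome annotations an external sort is the practical choice, since it does not need to hold the file in memory.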

Step 4. Run

refbox download --species Homo_sapiens --assembly GRCh38
refbox build    --species Homo_sapiens --assembly GRCh38
refbox test     --species Homo_sapiens --assembly GRCh38

Programmatic API

from refbox.config import load_config, iter_targets
from refbox.download import download_targets
from refbox.build import build_targets
from refbox.test import test_targets

cfg = load_config()
for t in iter_targets(cfg, species=["Homo_sapiens"]):
    print(t.species, t.assembly, list(t.resources))

download_targets(species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")
build_targets(   species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")
test_targets(    species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")

Development

git clone https://github.com/typekey/refbox.git
cd refbox
pip install -e .
refbox --help

Release

Tags matching v* automatically build and publish to PyPI via GitHub Actions (.github/workflows/workflow.yml) using PyPI trusted publishing (no API token required).

git tag v0.1.0
git push origin v0.1.0

Manual build (no upload):

./release.sh build       # writes dist/

License

MIT — see LICENSE.
