Build standardized, indexed reference files (genome, annotation, repeats, RNAcentral, cCRE) for genome browsers.
refbox
Build standardized, indexed reference files for genome browsers — in one command.
refbox turns a YAML registry of species/assemblies into ready-to-load browser
inputs:
- Genome FASTA → bgzip + samtools faidx (.gz + .fai + .gzi)
- Transcriptome FASTA → bgzip + samtools faidx
- GTF / GFF3 annotations → sorted + bgzip + tabix
- Repeats (UCSC RepeatMasker rmsk.txt.gz + .fa.out.gz)
- RNAcentral non-coding RNA annotations
- ENCODE SCREEN cCREs
- chrom.sizes derived from the genome .fai
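As an illustration of the last item: a samtools .fai index is a tab-separated file whose first two columns are the sequence name and its length, which is exactly what a chrom.sizes file holds. The sketch below is not refbox's own code; the function name fai_to_chrom_sizes is hypothetical, and the offsets in the sample lines are made up for the example.

```python
def fai_to_chrom_sizes(fai_lines):
    """Yield 'name<TAB>length' strings from the lines of a samtools .fai index."""
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, length = fields[0], fields[1]
        # int() validates that the second column really is a length
        yield f"{name}\t{int(length)}"


# Sample .fai lines: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
# (offsets here are illustrative, not taken from a real index)
example_fai = [
    "chr1\t248956422\t6\t60\t61\n",
    "chrM\t16569\t253105708\t60\t61\n",
]
sizes = list(fai_to_chrom_sizes(example_fai))
```

To work on real files, pass an open file handle for genome.fa.gz.fai and write each yielded string followed by a newline.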
It ships a registry of 26 species / 42 assemblies (human, mouse, rat, dog, cow, pig, chimp, gorilla, zebrafish, fly, worm, sea urchin, yeast, plants, bacteria, viruses…) covering GENCODE, Ensembl, Ensembl Genomes, UCSC golden path, NCBI, RNAcentral, and ENCODE SCREEN.
Why
Genome browsers (e.g. WashU Epigenome Browser,
rbrowser) expect a strict set of file
formats and indices. Manually downloading, sorting, bgzipping, tabixing, and
checking each one is tedious and error-prone. refbox makes the process:
- declarative: one YAML, every URL pinned
- idempotent: re-running skips finished files; --force to rebuild
- filterable: --species, --assembly, --resource to scope work
- verifiable: a test subcommand runs real samtools faidx / tabix queries
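The idempotency rule can be pictured as a small check made before each build step. This is a minimal sketch under assumptions, not refbox's implementation: the helper name needs_build is hypothetical, and treating an empty file as unfinished is a choice made for this example.

```python
from pathlib import Path


def needs_build(outputs, force=False):
    """Return True when a step should run.

    A step runs if force is set, or if any expected output file is
    missing or empty (assumption: an empty file means an aborted run).
    """
    if force:
        return True
    return any(
        not Path(p).is_file() or Path(p).stat().st_size == 0
        for p in outputs
    )
```

A caller would pass every file a step is expected to produce, for example the compressed FASTA together with its index files, so that a partially finished step is rebuilt.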
Install
pip install refbox
refbox shells out to the standard htslib tooling. Install once via
conda/mamba (recommended):
mamba install -c bioconda htslib samtools
Required CLI tools: bgzip, tabix, samtools, GNU sort, grep.
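Before a long run it can help to confirm that these tools are on PATH. The following is a sketch using only the Python standard library; the helper name missing_tools is hypothetical, and the check cannot tell whether sort is the GNU implementation.

```python
import shutil

# Tool names taken from the list above; "sort" stands in for GNU sort.
REQUIRED_TOOLS = ("bgzip", "tabix", "samtools", "sort", "grep")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that shutil.which cannot find on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```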
Quick start
# Download → build → validate Human GRCh38 (uses bundled species.yaml)
refbox download --species Homo_sapiens --assembly GRCh38
refbox build --species Homo_sapiens --assembly GRCh38
refbox test --species Homo_sapiens --assembly GRCh38
Or use the one-shot driver:
git clone https://github.com/typekey/refbox.git
cd refbox
./build.sh Homo_sapiens GRCh38
CLI reference
refbox download [--species ...] [--assembly ...] [--resource ...] [--out DIR] [--force]
refbox build [--species ...] [--assembly ...] [--resource ...] [--out DIR] [--force]
refbox test [--species ...] [--assembly ...] [--out DIR]
| Flag | Default | Meaning |
|---|---|---|
| --species | all enabled | filter to one or more species (e.g. Homo_sapiens) |
| --assembly | all enabled | filter to one or more assemblies (e.g. GRCh38) |
| --resource | all 10 | subset: genome transcriptome annotation_gtf annotation_gff3 repeats_rmsk repeats_bed repeats_gtf repeats_fa rnacentral ccre |
| --out | $REFBOX_OUT or $PWD | output root |
| --force | off | rebuild even when outputs exist |
| -v / --verbose | | enable DEBUG logging |
Environment variables:
| Name | Meaning |
|---|---|
| REFBOX_OUT | default output root for {Species}/{Assembly}/{raw,build}/ |
| REFBOX_CONFIG | path to a custom species.yaml (overrides the bundled registry) |
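The default for --out described above amounts to a short precedence chain. The sketch below assumes that an explicit --out value wins over REFBOX_OUT, which in turn wins over the current directory; the function name resolve_out_root is hypothetical and this is not taken from refbox's source.

```python
import os
from pathlib import Path


def resolve_out_root(cli_out=None, env=None):
    """Pick the output root: --out, then $REFBOX_OUT, then the working directory."""
    env = os.environ if env is None else env
    if cli_out:
        return Path(cli_out)
    if env.get("REFBOX_OUT"):
        return Path(env["REFBOX_OUT"])
    # Path.cwd() is used here as the equivalent of $PWD
    return Path.cwd()
```

Passing env explicitly keeps the function easy to exercise without touching the real environment.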
Output layout
{REFBOX_OUT}/
  {Species}/
    {Assembly}/
      raw/                              # original downloads / copies
        genome.fa
        transcriptome.fa
        annotation_gtf.gtf
        annotation_gff3.gff3
        repeats_rmsk.tsv
        repeats_fa.fa
        rnacentral.gff3
        ccre.bed
      build/                            # browser-loadable
        genome.fa.gz + .fai + .gzi
        chrom.sizes
        transcripts.fa.gz + .fai
        annotation.sorted.gtf.gz + .tbi
        annotation.sorted.gff3.gz + .tbi
        repeats.sorted.bed.gz + .tbi
        repeats.sorted.gtf.gz + .tbi
        rnacentral.sorted.gff3.gz + .tbi
        ccre.sorted.bed.gz + .tbi
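For scripts that consume these files, the layout above maps onto a simple path computation. This is a sketch only; target_dirs is a hypothetical helper, not part of the refbox API.

```python
from pathlib import Path


def target_dirs(out_root, species, assembly):
    """Return the (raw, build) directories for one species/assembly pair."""
    base = Path(out_root) / species / assembly
    return base / "raw", base / "build"


raw_dir, build_dir = target_dirs("/data/refs", "Mus_musculus", "GRCm38")
# Path to the sorted cCRE track listed in the layout above
ccre_track = build_dir / "ccre.sorted.bed.gz"
```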
Examples
1. Build only cCREs for Human GRCh38
refbox download --species Homo_sapiens --assembly GRCh38 --resource ccre
refbox build --species Homo_sapiens --assembly GRCh38 --resource ccre
refbox test --species Homo_sapiens --assembly GRCh38
2. Build a full reference into a specific directory
refbox download --species Mus_musculus --assembly GRCm38 --out /data/refs
refbox build --species Mus_musculus --assembly GRCm38 --out /data/refs
refbox test --species Mus_musculus --assembly GRCm38 --out /data/refs
3. Use a private YAML registry
export REFBOX_CONFIG=/path/to/my_species.yaml
refbox build
4. Drive everything from build.sh
./build.sh # all enabled assemblies
./build.sh Homo_sapiens # one species, all enabled assemblies
./build.sh Homo_sapiens GRCh38 # one species + assembly
./build.sh Homo_sapiens GRCh38 -- --resource genome ccre
FORCE=1 ./build.sh Mus_musculus GRCm38 # rebuild even when outputs exist
STEPS="test" ./build.sh # only run validation
Tutorial — adding a new assembly
The registry lives in config/species.yaml. Each entry
is three levels deep: species → assembly → resource.
Step 1. Add an assembly block
species:
  Homo_sapiens:
    GRCh38:
      enabled: true          # set false to keep it idle
      gencode_version: 44
      ucsc_db: hg38
      genome:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz
      annotation_gtf:
        url: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
      repeats_rmsk:
        url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz
      ccre:
        url: https://downloads.wenglab.org/V3/GRCh38-cCREs.bed
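To make the three-level structure concrete, the sketch below walks a registry that has already been parsed into a plain dict. refbox has its own load_config and iter_targets for this; the function here is a hypothetical stand-in for illustration. Two assumptions are made: a missing enabled key is treated as disabled, and only the ten canonical resource names count as resources. The Mus_musculus entry uses a placeholder URL.

```python
RESOURCE_NAMES = {
    "genome", "transcriptome", "annotation_gtf", "annotation_gff3",
    "repeats_rmsk", "repeats_bed", "repeats_gtf", "repeats_fa",
    "rnacentral", "ccre",
}


def iter_enabled_resources(registry):
    """Yield (species, assembly, resource, entry) for every enabled assembly."""
    for species, assemblies in registry.get("species", {}).items():
        for assembly, block in assemblies.items():
            if not block.get("enabled", False):  # assumption: absent means disabled
                continue
            for key, value in block.items():
                # Metadata keys such as ucsc_db are skipped; null entries too
                if key in RESOURCE_NAMES and value:
                    yield species, assembly, key, value


registry = {
    "species": {
        "Homo_sapiens": {
            "GRCh38": {
                "enabled": True,
                "ucsc_db": "hg38",
                "genome": {"url": "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz"},
                "ccre": {"url": "https://downloads.wenglab.org/V3/GRCh38-cCREs.bed"},
            }
        },
        "Mus_musculus": {
            # Placeholder URL, for illustration only
            "GRCm38": {"enabled": False, "genome": {"url": "https://example.org/genome.fa.gz"}},
        },
    }
}
found = [(s, a, r) for s, a, r, _ in iter_enabled_resources(registry)]
```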
Step 2. Each resource entry follows a fallback rule
my_resource:
  local_path: /path/on/disk/file.fa.gz   # used if it exists
  url: https://.../file.fa.gz            # else downloaded
  # omit / null = skipped silently
- local_path exists → copy into raw/ (auto-gunzips .gz)
- otherwise url → download into raw/
- otherwise → skip (no error)
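The fallback rule above can be expressed as one small decision function. This is a sketch of the rule as stated, not refbox's code; resolve_source and the action labels "copy", "download", and "skip" are names invented for this example.

```python
import os


def resolve_source(entry):
    """Decide how to obtain one resource from its registry entry.

    Returns an (action, location) pair where action is "copy",
    "download", or "skip".
    """
    if not entry:  # omitted or null entry
        return ("skip", None)
    local_path = entry.get("local_path")
    if local_path and os.path.exists(local_path):
        return ("copy", local_path)
    url = entry.get("url")
    if url:
        return ("download", url)
    return ("skip", None)
```

Note that a local_path which does not exist on disk falls through to url rather than raising an error, matching the "used if it exists" comment in the YAML.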
Step 3. Canonical resource names
| Name | Output (raw) | Output (build) |
|---|---|---|
| genome | genome.fa | genome.fa.gz + .fai + .gzi, chrom.sizes |
| transcriptome | transcriptome.fa | transcripts.fa.gz + .fai |
| annotation_gtf | annotation_gtf.gtf | annotation.sorted.gtf.gz + .tbi |
| annotation_gff3 | annotation_gff3.gff3 | annotation.sorted.gff3.gz + .tbi |
| repeats_rmsk | repeats_rmsk.tsv | (raw input for repeats_bed/gtf; derivation TODO) |
| repeats_bed | repeats_bed.bed | repeats.sorted.bed.gz + .tbi |
| repeats_gtf | repeats_gtf.gtf | repeats.sorted.gtf.gz + .tbi |
| repeats_fa | repeats_fa.fa | (RepeatMasker .fa.out report) |
| rnacentral | rnacentral.gff3 | rnacentral.sorted.gff3.gz + .tbi |
| ccre | ccre.bed | ccre.sorted.bed.gz + .tbi |
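The .sorted outputs in the table exist because tabix needs records grouped by sequence name and ordered by start position before it can index a bgzipped file. refbox lists GNU sort and grep among its requirements, so it presumably does this in the shell; the Python sketch below only illustrates the ordering for GTF/GFF3 lines, and sort_annotation_lines is a hypothetical name.

```python
def sort_annotation_lines(lines):
    """Return GTF/GFF3 lines with '#' header lines first, then records
    ordered by sequence name (column 1) and start position (column 4)."""
    lines = list(lines)
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]

    def sort_key(line):
        fields = line.split("\t")
        return (fields[0], int(fields[3]))

    return header + sorted(records, key=sort_key)


# Minimal made-up records, for illustration only
unsorted = [
    "chr2\tsrc\tgene\t50\t90\t.\t+\t.\tgene_id \"g3\";\n",
    "#!genome-build example\n",
    "chr1\tsrc\tgene\t500\t900\t.\t+\t.\tgene_id \"g2\";\n",
    "chr1\tsrc\tgene\t100\t400\t.\t-\t.\tgene_id \"g1\";\n",
]
ordered = sort_annotation_lines(unsorted)
```

For BED files the same idea applies with the start position in column 2 instead of column 4.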
Step 4. Run
refbox download --species Homo_sapiens --assembly GRCh38
refbox build --species Homo_sapiens --assembly GRCh38
refbox test --species Homo_sapiens --assembly GRCh38
Programmatic API
from refbox.config import load_config, iter_targets
from refbox.download import download_targets
from refbox.build import build_targets
from refbox.test import test_targets

cfg = load_config()
for t in iter_targets(cfg, species=["Homo_sapiens"]):
    print(t.species, t.assembly, list(t.resources))

download_targets(species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")
build_targets(species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")
test_targets(species=["Homo_sapiens"], assembly=["GRCh38"], out="/data/refs")
Development
git clone https://github.com/typekey/refbox.git
cd refbox
pip install -e .
refbox --help
Release
Tags matching v* automatically build and publish to PyPI via GitHub Actions
(.github/workflows/workflow.yml) using
PyPI trusted publishing (no API
token required).
git tag v0.1.0
git push origin v0.1.0
Manual build (no upload):
./release.sh build # writes dist/
License
MIT — see LICENSE.