Docker-first VCF to RDF conversion wrapper with compression/decompression modes
VCF-RDFizer
VCF-RDFizer is a Docker-first CLI wrapper for:
- VCF -> RDF (N-Triples) with RMLStreamer
- Optional RDF compression/decompression
Requirements
- Python 3.10+
- Docker (installed and running)
Install options:
pip install vcf-rdfizer
or
pipx install vcf-rdfizer
or
conda install -c conda-forge vcf-rdfizer
or pull the prebuilt Docker image directly:
docker pull ecrum19/vcf-rdfizer:latest
Important CLI Rule
--out is required for all modes.
This is the run output root directory. VCF-RDFizer places:
- final RDF/compression outputs
- run metrics/logs
- hidden intermediates
inside this directory.
Modes
- full: VCF -> TSV -> RDF -> compression
- tsv: VCF -> TSV only (benchmarking)
- compress: compress an existing .nt
- decompress: decompress .nt.gz, .nt.br, or .hdt
In full mode with multiple VCF inputs, failures are isolated per input:
- the run continues with remaining files
- failed inputs are summarized in run_metrics/<RUN_ID>/failed_inputs.csv
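After a multi-input run it can be handy to list the failed inputs programmatically. The sketch below only assumes that failed_inputs.csv has a header row; the column names are not documented here, so the rows are returned as generic dictionaries. The function name read_failed_inputs is hypothetical and not part of VCF-RDFizer.

```python
import csv
from pathlib import Path


def read_failed_inputs(out_root: str, run_id: str) -> list[dict[str, str]]:
    """Return the rows of failed_inputs.csv for one run.

    Assumption: the file has a header row. An empty list is returned
    when the file does not exist (for example, when nothing failed).
    """
    path = Path(out_root) / "run_metrics" / run_id / "failed_inputs.csv"
    if not path.exists():
        return []
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Calling read_failed_inputs("./results", "<RUN_ID>") with the real run identifier returns one dictionary per failed input, keyed by whatever column names the file actually uses.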
Main Flags (Most Used)
- -m, --mode {full,compress,decompress,tsv}
- -o, --out: required output root directory
- -c, --compression: methods: gzip, brotli, hdt, hdt_gzip, hdt_brotli, none
- -I, --image: Docker image repo (default ecrum19/vcf-rdfizer)
- -v, --image-version: Docker tag/version
- -b, --build: force Docker build
- -B, --no-build: fail if image not found
- -h, --help: show full usage
Full Mode Flags
- -i, --input: required VCF file or directory
- -r, --rules: mapping rules file (.ttl)
  - default: rules/default_rules.ttl
- -l, --rdf-layout {aggregate,batch}: required in full mode
- -P, --spark-partitions: optional Spark partition hint (positive integer)
  - low-cost way to reduce output part count by setting spark.default.parallelism and spark.sql.shuffle.partitions
- -k, --keep-tsv: keep hidden TSV intermediates
- -R, --keep-rdf: keep raw .nt after compression
- -e, --estimate-size: preflight size estimate
TSV Mode Flags
- -i, --input: required VCF file or directory
- Outputs per-run benchmark summary in run_metrics/<RUN_ID>/tsv_metrics.csv
- Raw TSV timing + artifact JSON per input in run_metrics/<RUN_ID>/raw_metrics/tsv_*
Compression Mode Flags
- -q, --rdf, --nt: required input .nt file
Decompression Mode Flags
- -C, --compressed-input: required .nt.gz, .nt.br, or .hdt
- -d, --decompress-out: optional explicit output .nt path (must be inside --out)
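The rule that --decompress-out must point inside --out can be checked up front in a calling script. This is a sketch of one way to do that with pathlib; is_inside_out_root is a hypothetical name, and the wrapper's own validation may differ in detail.

```python
from pathlib import Path


def is_inside_out_root(decompress_out: str, out_root: str) -> bool:
    """Return True if decompress_out resolves to a path below out_root."""
    # resolve() normalises ".." segments and symlinks, so a path such as
    # "results/../x.nt" is correctly treated as outside "results".
    target = Path(decompress_out).resolve()
    root = Path(out_root).resolve()
    return target != root and target.is_relative_to(root)
```

For example, is_inside_out_root("./results/sample/sample.nt", "./results") is true, while a path that climbs out of the root with ".." is rejected.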
Quick Start
Show help:
vcf-rdfizer --help
Full pipeline (aggregate RDF):
vcf-rdfizer \
--mode full \
--input ./vcf_files \
--rdf-layout aggregate \
--out ./results
Full pipeline (batch RDF parts):
vcf-rdfizer \
--mode full \
--input ./vcf_files \
--rdf-layout batch \
--compression hdt \
--out ./results
Full pipeline with low-cost partition cap (helps avoid too many tiny batch files):
vcf-rdfizer \
--mode full \
--input ./vcf_files \
--rdf-layout batch \
--spark-partitions 8 \
--compression hdt \
--out ./results
Full pipeline with custom rules + keep RDF:
vcf-rdfizer \
--mode full \
--input ./vcf_files \
--rules ./rules/my_rules.ttl \
--rdf-layout aggregate \
--compression hdt,brotli \
--keep-rdf \
--out ./results
TSV-only benchmark:
vcf-rdfizer \
--mode tsv \
--input ./vcf_files \
--out ./results
Compression-only:
vcf-rdfizer \
--mode compress \
--rdf ./results/sample/sample.nt \
--compression hdt_gzip \
--out ./results
Decompression-only:
vcf-rdfizer \
--mode decompress \
--compressed-input ./results/sample/sample.hdt \
--out ./results
Output Layout
Given --out ./results:
- final outputs: ./results/<sample>/...
- per-run metrics/logs: ./results/run_metrics/<RUN_ID>/...
- hidden intermediates: ./results/.intermediate/tsv/
Intermediates are hidden by default.
Raw .nt files are removed after compression unless --keep-rdf is provided.
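For scripts that post-process a run, the layout above can be captured in a small helper. This is only a sketch based on the directories listed in this section; RunPaths and run_paths are hypothetical names, and the sample and run identifiers must come from an actual run.

```python
from pathlib import Path
from typing import NamedTuple


class RunPaths(NamedTuple):
    sample_dir: Path          # final RDF/compression outputs for one sample
    metrics_dir: Path         # per-run metrics and logs
    tsv_intermediates: Path   # hidden TSV intermediates


def run_paths(out_root: str, sample: str, run_id: str) -> RunPaths:
    """Map an output root to the directories described in Output Layout."""
    root = Path(out_root)
    return RunPaths(
        sample_dir=root / sample,
        metrics_dir=root / "run_metrics" / run_id,
        tsv_intermediates=root / ".intermediate" / "tsv",
    )
```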
Metrics
For each run, VCF-RDFizer writes:
- run_metrics/<RUN_ID>/metrics.csv
- run_metrics/<RUN_ID>/wrapper_execution_times.csv
- run_metrics/<RUN_ID>/progress.log
Compression metrics now include per-method:
- wall_seconds_*
- user_seconds_*
- sys_seconds_*
- max_rss_kb_*
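To see which per-method compression metrics a given run recorded, the header of a metrics file can be grouped by the prefixes listed above. The sketch assumes these names appear as CSV column headers with a method suffix (for example a column starting with wall_seconds_); that layout is an assumption, and compression_columns is a hypothetical helper.

```python
import csv

# Documented prefixes of the per-method compression metrics.
PREFIXES = ("wall_seconds_", "user_seconds_", "sys_seconds_", "max_rss_kb_")


def compression_columns(metrics_csv: str) -> dict[str, list[str]]:
    """Group the header columns of a metrics CSV by documented prefix."""
    with open(metrics_csv, newline="", encoding="utf-8") as handle:
        # Read only the header row; an empty file yields an empty header.
        header = next(csv.reader(handle), [])
    return {
        prefix: [name for name in header if name.startswith(prefix)]
        for prefix in PREFIXES
    }
```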
Rules
- default rules file: rules/default_rules.ttl
- rules guide: rules/README.md
Troubleshooting
If Docker permission issues occur, rerun as a user who is allowed to run Docker (or configure Docker group/sudo access on your system).
If HDT compression fails on very large .nt files, use batch layout and/or non-HDT compression methods.
Safe termination:
- Press Ctrl+C to interrupt a run.
- The wrapper exits with code 130, writes progress to run_metrics/<RUN_ID>/progress.log, and performs best-effort cleanup of tracked intermediates.
- Raw RDF cleanup on interrupt follows --keep-rdf:
  - with --keep-rdf, raw .nt files are preserved
  - without --keep-rdf, tracked raw .nt files are removed during interrupt cleanup
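A calling script can tell an interrupted run apart from a genuine failure by checking for exit code 130. The sketch below is illustrative only: describe_exit and run_wrapper are hypothetical helpers, and any non-zero code other than 130 is simply reported as a failure.

```python
import subprocess

# Exit code the wrapper is documented to return after Ctrl+C.
INTERRUPTED_EXIT_CODE = 130


def describe_exit(returncode: int) -> str:
    """Classify a wrapper exit code as 'ok', 'interrupted', or 'failed'."""
    if returncode == 0:
        return "ok"
    if returncode == INTERRUPTED_EXIT_CODE:
        return "interrupted"
    return "failed"


def run_wrapper(cmd: list[str]) -> str:
    """Run a vcf-rdfizer command line and classify its outcome."""
    # check=False: inspect the return code instead of raising on failure.
    completed = subprocess.run(cmd, check=False)
    return describe_exit(completed.returncode)
```

When the result is "interrupted", run_metrics/<RUN_ID>/progress.log is the place to see how far the run got before cleanup.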
Citation
If you use VCF-RDFizer in a publication, please cite:
VCF-RDFizer maintainers. (2026). VCF-RDFizer (Version 1.1.0) [Computer software]. GitHub. https://github.com/ecrum19/VCF-RDFizer
BibTeX:
@software{vcf_rdfizer_2026,
author = {{VCF-RDFizer maintainers}},
title = {VCF-RDFizer},
year = {2026},
version = {1.1.0},
url = {https://github.com/ecrum19/VCF-RDFizer},
note = {Computer software}
}
You can also use the machine-readable citation file: CITATION.cff.
Contributing
Contributions are welcome. If you want to improve VCF-RDFizer:
- Open an issue first for bug reports, feature requests, or design changes.
- Fork the repo and create a feature branch from main.
- Keep changes focused and include/update tests for behavior changes.
- Run the unit tests locally before opening a PR:
python3 -m unittest discover -s test -p "test_*_unit.py" -q
- In your PR, include what changed, why it changed, and how you validated it.
- Use clear commit messages (for Docker publish control, include [publish-docker] only when intended).
Licensing
- Project license: LICENSE (MIT)
- Third-party runtime notices: THIRD_PARTY_NOTICES.md