Docker-first VCF to RDF conversion wrapper with compression/decompression modes

VCF-RDFizer is a Docker-first CLI wrapper for:

  1. VCF -> RDF (N-Triples) with RMLStreamer
  2. Optional RDF compression/decompression

The VCF-RDFizer vocabulary is available at https://w3id.org/vcf-rdfizer/vocab#.

Requirements

  • Python 3.10+
  • Docker (installed and running)
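
The two requirements above can be checked before installing. The following is a minimal sketch, not part of VCF-RDFizer: `preflight_problems` is a hypothetical helper, and it only confirms that a `docker` executable is on `PATH`, not that the Docker daemon is running.

```python
import shutil
import sys


def preflight_problems(min_python=(3, 10), docker_name="docker"):
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    # Requirement 1: Python 3.10 or newer.
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {found}")
    # Requirement 2: a Docker executable on PATH (daemon status is not checked).
    if shutil.which(docker_name) is None:
        problems.append(f"'{docker_name}' was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print(problem)
```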

Install options:

pip install vcf-rdfizer

or

pipx install vcf-rdfizer

or

conda install -c conda-forge vcf-rdfizer

or pull the prebuilt Docker image directly:

docker pull ecrum19/vcf-rdfizer:latest

Important CLI Rule

--out is required for all modes.

It is the root directory for a run's output. VCF-RDFizer writes the following inside it:

  • final RDF/compression outputs
  • run metrics/logs
  • hidden intermediates

Modes

  • full: VCF -> TSV -> RDF -> compression
  • tsv: VCF -> TSV only (benchmarking)
  • compress: compress an existing .nt
  • decompress: decompress .nt.gz, .nt.br, or .hdt

In full mode with multiple VCF inputs, failures are isolated per input:

  • the run continues with remaining files
  • failed inputs are summarized in run_metrics/<RUN_ID>/failed_inputs.csv
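
After a multi-input run, the failure summary can be inspected programmatically. This sketch assumes the CSV has a header row and treats a missing file as "no failures recorded". Neither is stated above, and `read_failed_inputs` is a hypothetical helper. No column names are assumed, so each row is returned as a plain dictionary.

```python
import csv
from pathlib import Path


def read_failed_inputs(out_root, run_id):
    """Return the rows of run_metrics/<RUN_ID>/failed_inputs.csv as dictionaries."""
    path = Path(out_root) / "run_metrics" / run_id / "failed_inputs.csv"
    # Assumption: no file means the run recorded no failed inputs.
    if not path.exists():
        return []
    with path.open(newline="") as handle:
        # Assumption: the first line is a header naming the columns.
        return list(csv.DictReader(handle))
```

Printing the rows as-is (`for row in read_failed_inputs("./results", run_id): print(row)`) avoids relying on any particular column layout.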

Main Flags (Most Used)

  • -m, --mode {full,compress,decompress,tsv}
  • -o, --out required output root directory
  • -c, --compression methods: gzip,brotli,hdt,hdt_gzip,hdt_brotli,none
  • -I, --image Docker image repo (default ecrum19/vcf-rdfizer)
  • -v, --image-version Docker tag/version
  • -b, --build force Docker build
  • -B, --no-build fail if image not found
  • -h, --help show full usage

Full Mode Flags

  • -i, --input required VCF file or directory
  • -r, --rules mapping rules file (.ttl)
    • default: rules/default_rules.ttl
  • -l, --rdf-layout {aggregate,batch} required in full mode
  • -P, --spark-partitions optional Spark partition hint (positive integer)
    • low-cost way to reduce output part count by setting spark.default.parallelism and spark.sql.shuffle.partitions
  • -k, --keep-tsv keep hidden TSV intermediates
  • -R, --keep-rdf keep raw .nt after compression
  • -e, --estimate-size preflight size estimate

TSV Mode Flags

  • -i, --input required VCF file or directory
  • Outputs per-run benchmark summary in run_metrics/<RUN_ID>/tsv_metrics.csv
  • Raw TSV timing + artifact JSON per input in run_metrics/<RUN_ID>/raw_metrics/tsv_*

Compression Mode Flags

  • -q, --rdf, --nt required input .nt file

Decompression Mode Flags

  • -C, --compressed-input required .nt.gz, .nt.br, or .hdt
  • -d, --decompress-out optional explicit output .nt path (must be inside --out)
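
The two constraints above (accepted input suffixes, and `--decompress-out` living inside `--out`) can be checked before launching a run. This is a sketch under assumptions: `check_decompress_args` is a hypothetical helper, and the wrapper's own validation may differ in detail.

```python
from pathlib import Path
from typing import List, Optional

# Input suffixes listed for decompress mode.
COMPRESSED_SUFFIXES = (".nt.gz", ".nt.br", ".hdt")


def check_decompress_args(compressed_input: str, out_root: str,
                          decompress_out: Optional[str] = None) -> List[str]:
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    if not compressed_input.endswith(COMPRESSED_SUFFIXES):
        problems.append(f"unsupported input suffix: {compressed_input}")
    if decompress_out is not None:
        root = Path(out_root).resolve()
        target = Path(decompress_out).resolve()
        # The explicit output path must sit somewhere under the --out root.
        if not target.is_relative_to(root):
            problems.append(f"{decompress_out} is not inside {out_root}")
    return problems
```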

Quick Start

Show help:

vcf-rdfizer --help

Full pipeline (aggregate RDF):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout aggregate \
  --out ./results

Full pipeline (batch RDF parts):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --compression hdt \
  --out ./results

Full pipeline with low-cost partition cap (helps avoid too many tiny batch files):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --spark-partitions 8 \
  --compression hdt \
  --out ./results

Full pipeline with custom rules + keep RDF:

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rules ./rules/my_rules.ttl \
  --rdf-layout aggregate \
  --compression hdt,brotli \
  --keep-rdf \
  --out ./results

TSV-only benchmark:

vcf-rdfizer \
  --mode tsv \
  --input ./vcf_files \
  --out ./results

Compression-only:

vcf-rdfizer \
  --mode compress \
  --rdf ./results/sample/sample.nt \
  --compression hdt_gzip \
  --out ./results

Decompression-only:

vcf-rdfizer \
  --mode decompress \
  --compressed-input ./results/sample/sample.hdt \
  --out ./results
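
When runs are launched from a script, the full-mode commands above can be assembled as an argument list. The sketch below uses only the long flags documented in this README. `build_full_command` is a hypothetical helper, and its validation is a convenience that may be stricter or looser than the CLI itself.

```python
import shlex

LAYOUTS = ("aggregate", "batch")
COMPRESSION_METHODS = ("gzip", "brotli", "hdt", "hdt_gzip", "hdt_brotli", "none")


def build_full_command(input_path, out_root, rdf_layout, compression=None,
                       spark_partitions=None, keep_rdf=False):
    """Build the argv list for a full-mode run without executing it."""
    if rdf_layout not in LAYOUTS:
        raise ValueError(f"rdf_layout must be one of {LAYOUTS}")
    argv = ["vcf-rdfizer", "--mode", "full",
            "--input", input_path, "--rdf-layout", rdf_layout]
    if spark_partitions is not None:
        if spark_partitions < 1:
            raise ValueError("spark_partitions must be a positive integer")
        argv += ["--spark-partitions", str(spark_partitions)]
    if compression:
        unknown = [m for m in compression if m not in COMPRESSION_METHODS]
        if unknown:
            raise ValueError(f"unknown compression methods: {unknown}")
        # Multiple methods are passed as one comma-separated value.
        argv += ["--compression", ",".join(compression)]
    if keep_rdf:
        argv.append("--keep-rdf")
    argv += ["--out", out_root]
    return argv


if __name__ == "__main__":
    argv = build_full_command("./vcf_files", "./results", "batch",
                              compression=["hdt"], spark_partitions=8)
    print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv, check=True)`. Keeping it as a list avoids shell quoting problems with paths that contain spaces.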

Output Layout

Given --out ./results:

  • final outputs:
    • ./results/<sample>/...
  • per-run metrics/logs:
    • ./results/run_metrics/<RUN_ID>/...
  • hidden intermediates:
    • ./results/.intermediate/tsv/

Intermediates are hidden by default. Raw .nt files are removed after compression unless --keep-rdf is provided.
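
For scripts that post-process a run, the layout above maps directly onto `pathlib` paths. In this sketch `describe_layout` is a hypothetical helper, and `run_id` and `sample` are placeholders for values the wrapper produces; their exact format is not specified here.

```python
from pathlib import Path


def describe_layout(out_root, run_id, sample):
    """Map the documented output locations to concrete paths under out_root."""
    root = Path(out_root)
    return {
        "final_outputs": root / sample,
        "run_metrics": root / "run_metrics" / run_id,
        "tsv_intermediates": root / ".intermediate" / "tsv",
    }
```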

Metrics

For each run, VCF-RDFizer writes:

  • run_metrics/<RUN_ID>/metrics.csv
  • run_metrics/<RUN_ID>/wrapper_execution_times.csv
  • run_metrics/<RUN_ID>/progress.log

Compression metrics include the following columns for each method:

  • wall_seconds_*
  • user_seconds_*
  • sys_seconds_*
  • max_rss_kb_*
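
The per-method columns can be regrouped for comparison across compression methods. This sketch assumes that `metrics.csv` has a header row and that the part of each column name after the prefix identifies the method (for example, a hypothetical `wall_seconds_gzip`). Both are assumptions to verify against a real run, and the helper names are illustrative.

```python
import csv
from pathlib import Path

PREFIXES = ("wall_seconds_", "user_seconds_", "sys_seconds_", "max_rss_kb_")


def group_compression_metrics(row):
    """Regroup one CSV row as {suffix: {metric_name: raw_string_value}}."""
    grouped = {}
    for column, value in row.items():
        for prefix in PREFIXES:
            if column.startswith(prefix):
                suffix = column[len(prefix):]
                # Drop the trailing underscore to get the metric name.
                grouped.setdefault(suffix, {})[prefix[:-1]] = value
    return grouped


def load_metrics(out_root, run_id):
    """Read run_metrics/<RUN_ID>/metrics.csv and regroup every row."""
    path = Path(out_root) / "run_metrics" / run_id / "metrics.csv"
    with path.open(newline="") as handle:
        return [group_compression_metrics(row) for row in csv.DictReader(handle)]
```

Values are left as strings so that empty cells for methods that were not run do not raise conversion errors.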

Rules

  • default rules file: rules/default_rules.ttl
  • rules guide: rules/README.md

Troubleshooting

If you hit Docker permission errors, rerun as a user who is allowed to run Docker (or configure Docker group/sudo access on your system).

If HDT compression fails on very large .nt files, use batch layout and/or non-HDT compression methods.

Safe termination:

  • Press Ctrl+C to interrupt a run.
  • The wrapper exits with code 130, writes progress to run_metrics/<RUN_ID>/progress.log, and performs best-effort cleanup of tracked intermediates.
  • Raw RDF cleanup on interrupt follows --keep-rdf:
    • with --keep-rdf, raw .nt files are preserved
    • without --keep-rdf, tracked raw .nt files are removed during interrupt cleanup
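
A calling script can tell an interrupted run apart from a failed one by checking for exit code 130. The sketch below is a hypothetical helper, not part of VCF-RDFizer. It fits cases where the interrupt reaches only the wrapper process; in an interactive terminal, Ctrl+C also raises `KeyboardInterrupt` in the calling Python process.

```python
import subprocess

INTERRUPTED_EXIT_CODE = 130  # exit code documented for an interrupted run


def run_and_classify(argv):
    """Run a command and return 'ok', 'interrupted', or 'failed'."""
    completed = subprocess.run(argv)
    if completed.returncode == 0:
        return "ok"
    if completed.returncode == INTERRUPTED_EXIT_CODE:
        return "interrupted"
    return "failed"
```

On `"interrupted"`, `run_metrics/<RUN_ID>/progress.log` is the place to look for how far the run got.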

Citation

If you use VCF-RDFizer in a publication, please cite:

VCF-RDFizer maintainers. (2026). VCF-RDFizer (Version 1.2.3) [Computer software]. GitHub. https://github.com/ecrum19/VCF-RDFizer

BibTeX:

@software{vcf_rdfizer_2026,
  author  = {{VCF-RDFizer maintainers}},
  title   = {VCF-RDFizer},
  year    = {2026},
  version = {1.2.3},
  url     = {https://github.com/ecrum19/VCF-RDFizer},
  note    = {Computer software}
}

You can also use the machine-readable citation file: CITATION.cff.

Contributing

Contributions are welcome. If you want to improve VCF-RDFizer:

  • Open an issue first for bug reports, feature requests, or design changes.
  • Fork the repo and create a feature branch from main.
  • Keep changes focused and include/update tests for behavior changes.
  • Run the unit tests locally before opening a PR:
      python3 -m unittest discover -s test -p "test_*_unit.py" -q
  • In your PR, include what changed, why it changed, and how you validated it.
  • Use clear commit messages (for Docker publish control, include [publish-docker] only when intended).

Licensing

  • Project license: LICENSE (MIT)
  • Third-party runtime notices: THIRD_PARTY_NOTICES.md
