Docker-first VCF to RDF conversion wrapper with compression/decompression modes

VCF-RDFizer is a Docker-first CLI wrapper for:

  1. VCF -> RDF (N-Triples) with RMLStreamer
  2. Optional RDF compression/decompression

The VCF-RDFizer vocabulary is available at https://w3id.org/vcf-rdfizer/vocab#.

Requirements

  • Python 3.10+
  • Docker (installed and running)

Install options:

pip install vcf-rdfizer

or

pipx install vcf-rdfizer

or

conda install -c conda-forge vcf-rdfizer

or pull the prebuilt Docker image directly:

docker pull ecrum19/vcf-rdfizer:latest

Important CLI Rule

--out is required in every mode.

It sets the root output directory for the run. VCF-RDFizer writes the following inside it:

  • final RDF/compression outputs
  • run metrics/logs
  • hidden intermediates

Modes

  • full: VCF -> TSV -> RDF -> compression
  • tsv: VCF -> TSV only (benchmarking)
  • compress: compress an existing .nt
  • decompress: decompress .nt.gz, .nt.br, or .hdt
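
As a small illustration of the decompress inputs listed above, the following sketch classifies a file by its suffix before handing it to the wrapper. The function name compressed_kind and the labels "gzip", "brotli", and "hdt" are assumptions made for this example; they are inferred from the conventional meaning of each suffix and are not taken from the wrapper's code.

```python
from pathlib import Path
from typing import Optional

# Suffixes the README lists for decompress mode. The labels on the right are
# this sketch's own names, inferred from the usual meaning of each suffix.
SUFFIX_TO_KIND = {".nt.gz": "gzip", ".nt.br": "brotli", ".hdt": "hdt"}


def compressed_kind(path: str) -> Optional[str]:
    """Return a label for a supported compressed input, or None otherwise."""
    name = Path(path).name.lower()
    for suffix, kind in SUFFIX_TO_KIND.items():
        if name.endswith(suffix):
            return kind
    return None
```

A None result means the file does not carry one of the three documented suffixes, so it is probably not a valid --compressed-input.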

In full mode with multiple VCF inputs, failures are isolated per input:

  • the run continues with remaining files
  • failed inputs are summarized in run_metrics/<RUN_ID>/failed_inputs.csv
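
The failure summary can be inspected after a run with a few lines of Python. This is a minimal sketch, assuming the file sits under the --out directory as described in Output Layout; the helper name read_failed_inputs is hypothetical, and no column names are assumed because the README does not list them.

```python
import csv
from pathlib import Path


def read_failed_inputs(out_root: str, run_id: str) -> list[dict[str, str]]:
    """Return the rows of failed_inputs.csv for one run as dictionaries."""
    # Layout from the README: <out>/run_metrics/<RUN_ID>/failed_inputs.csv
    report = Path(out_root) / "run_metrics" / run_id / "failed_inputs.csv"
    if not report.exists():
        # Assumption: a missing report is treated as "no failures recorded".
        # The README does not say whether the file is written for clean runs.
        return []
    with report.open(newline="", encoding="utf-8") as handle:
        # DictReader keys each row by whatever header the file actually has.
        return list(csv.DictReader(handle))
```

Each returned dictionary is keyed by the CSV header row, so printing one row is a quick way to see which columns a given version of the wrapper writes.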

Main Flags (Most Used)

  • -m, --mode {full,compress,decompress,tsv}
  • -o, --out required output root directory
  • -c, --compression methods: gzip,brotli,hdt,hdt_gzip,hdt_brotli,none
  • -I, --image Docker image repo (default ecrum19/vcf-rdfizer)
  • -v, --image-version Docker tag/version
  • -b, --build force Docker build
  • -B, --no-build fail if image not found
  • -h, --help show full usage

Full Mode Flags

  • -i, --input required VCF file or directory
  • -r, --rules mapping rules file (.ttl)
    • default: rules/default_rules.ttl
  • -l, --rdf-layout {aggregate,batch} required in full mode
  • -P, --spark-partitions optional Spark partition hint (positive integer)
    • low-cost way to reduce output part count by setting spark.default.parallelism and spark.sql.shuffle.partitions
  • -k, --keep-tsv keep hidden TSV intermediates
  • -R, --keep-rdf keep raw .nt after compression
  • -e, --estimate-size preflight size estimate

TSV Mode Flags

  • -i, --input required VCF file or directory
  • Writes a per-run benchmark summary to run_metrics/<RUN_ID>/tsv_metrics.csv
  • Writes raw TSV timing and artifact JSON for each input to run_metrics/<RUN_ID>/raw_metrics/tsv_*

Compression Mode Flags

  • -q, --rdf, --nt required input .nt file

Decompression Mode Flags

  • -C, --compressed-input required .nt.gz, .nt.br, or .hdt
  • -d, --decompress-out optional explicit output .nt path (must be inside --out)
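
The "must be inside --out" constraint on --decompress-out can be checked before launching a run. The sketch below shows one way to test path containment with pathlib; the function name is_inside is hypothetical, and the wrapper's own validation may differ in detail.

```python
from pathlib import Path


def is_inside(out_root: str, candidate: str) -> bool:
    """True if candidate resolves to a location strictly inside out_root."""
    # resolve() normalises ".." segments and symlinks, so a path such as
    # "results/../elsewhere/x.nt" is not mistaken for a child of "results".
    root = Path(out_root).resolve()
    target = Path(candidate).resolve()
    return target != root and target.is_relative_to(root)
```

For example, is_inside("./results", "./results/sample/sample.nt") is true, while a path that escapes the root through ".." is rejected.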

Quick Start

Show help:

vcf-rdfizer --help

Full pipeline (aggregate RDF):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout aggregate \
  --out ./results

Full pipeline (batch RDF parts):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --compression hdt \
  --out ./results

Full pipeline with low-cost partition cap (helps avoid too many tiny batch files):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --spark-partitions 8 \
  --compression hdt \
  --out ./results

Full pipeline with custom rules + keep RDF:

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rules ./rules/my_rules.ttl \
  --rdf-layout aggregate \
  --compression hdt,brotli \
  --keep-rdf \
  --out ./results

TSV-only benchmark:

vcf-rdfizer \
  --mode tsv \
  --input ./vcf_files \
  --out ./results

Compression-only:

vcf-rdfizer \
  --mode compress \
  --rdf ./results/sample/sample.nt \
  --compression hdt_gzip \
  --out ./results

Decompression-only:

vcf-rdfizer \
  --mode decompress \
  --compressed-input ./results/sample/sample.hdt \
  --out ./results
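
When the wrapper is driven from a script, the same full-mode invocations can be assembled programmatically. The sketch below uses only flags documented above; the helper names build_full_command and run_full are hypothetical and not part of the package.

```python
import subprocess


def build_full_command(input_path, out_dir, layout, spark_partitions=None,
                       compression=None, keep_rdf=False):
    """Assemble the argv list for a full-mode run (hypothetical helper)."""
    if layout not in ("aggregate", "batch"):
        raise ValueError("layout must be 'aggregate' or 'batch'")
    cmd = ["vcf-rdfizer", "--mode", "full",
           "--input", input_path, "--rdf-layout", layout]
    if spark_partitions is not None:
        if spark_partitions <= 0:
            raise ValueError("spark_partitions must be a positive integer")
        cmd += ["--spark-partitions", str(spark_partitions)]
    if compression:
        # Methods are passed as one comma-separated value, e.g. "hdt,brotli".
        cmd += ["--compression", ",".join(compression)]
    if keep_rdf:
        cmd.append("--keep-rdf")
    cmd += ["--out", out_dir]
    return cmd


def run_full(**options) -> int:
    """Run the wrapper and return its exit code (needs vcf-rdfizer on PATH)."""
    return subprocess.run(build_full_command(**options), check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with input paths that contain spaces.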

Output Layout

Given --out ./results:

  • final outputs:
    • ./results/<sample>/...
  • per-run metrics/logs:
    • ./results/run_metrics/<RUN_ID>/...
  • hidden intermediates:
    • ./results/.intermediate/tsv/

Intermediates are hidden by default. Raw .nt files are removed after compression unless --keep-rdf is provided.
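
Because every run writes to its own run_metrics/<RUN_ID> directory, the runs recorded under an output root can be listed by scanning that folder. This is a minimal sketch with a hypothetical helper name, list_run_ids; it makes no assumption about the format of RUN_ID and simply returns the directory names in sorted order.

```python
from pathlib import Path


def list_run_ids(out_root: str) -> list[str]:
    """Return the names of the per-run directories under <out>/run_metrics."""
    metrics_root = Path(out_root) / "run_metrics"
    if not metrics_root.is_dir():
        # Nothing has been written to this output root yet.
        return []
    return sorted(p.name for p in metrics_root.iterdir() if p.is_dir())
```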

Metrics

For each run, VCF-RDFizer writes:

  • run_metrics/<RUN_ID>/metrics.csv
  • run_metrics/<RUN_ID>/wrapper_execution_times.csv
  • run_metrics/<RUN_ID>/progress.log

Compression metrics include the following columns for each method:

  • wall_seconds_*
  • user_seconds_*
  • sys_seconds_*
  • max_rss_kb_*
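
The per-method columns can be pulled out of metrics.csv by prefix. The sketch below collects every wall_seconds_* value from each row; it assumes the text after the prefix identifies the compression method and that the values are numeric, neither of which is spelled out in this README. The function name wall_seconds is hypothetical.

```python
import csv
from typing import Iterable

PREFIX = "wall_seconds_"


def wall_seconds(lines: Iterable[str]) -> list[dict[str, float]]:
    """Map each CSV row to {column suffix: wall-clock seconds}."""
    results = []
    for row in csv.DictReader(lines):
        timings = {}
        for column, value in row.items():
            if not column or not column.startswith(PREFIX) or not value:
                continue  # unrelated column, or no measurement for this method
            try:
                timings[column[len(PREFIX):]] = float(value)
            except ValueError:
                pass  # assumption: non-numeric cells are skipped
        results.append(timings)
    return results
```

To use it on a real run, open run_metrics/<RUN_ID>/metrics.csv with newline="" and pass the file object as lines.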

Rules

  • default rules file: rules/default_rules.ttl
  • rules guide: rules/README.md

Troubleshooting

If you hit Docker permission errors, rerun as a user who is allowed to run Docker (or configure Docker group/sudo access on your system).

If HDT compression fails on very large .nt files, use batch layout and/or non-HDT compression methods.

Safe termination:

  • Press Ctrl+C to interrupt a run.
  • The wrapper exits with code 130, writes progress to run_metrics/<RUN_ID>/progress.log, and performs best-effort cleanup of tracked intermediates.
  • Raw RDF cleanup on interrupt follows --keep-rdf:
    • with --keep-rdf, raw .nt files are preserved
    • without --keep-rdf, tracked raw .nt files are removed during interrupt cleanup
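
A calling script can tell an interrupted run apart from a failed one by checking for the documented exit code. The sketch below is one way to do that; the names classify_exit and run_and_classify and the labels they return are hypothetical.

```python
import subprocess
import sys

INTERRUPTED = 130  # exit code the README documents for a Ctrl+C interrupt


def classify_exit(returncode: int) -> str:
    """Translate a wrapper exit code into a coarse status label."""
    if returncode == 0:
        return "ok"
    if returncode == INTERRUPTED:
        return "interrupted"
    return "failed"


def run_and_classify(argv: list[str]) -> str:
    """Run a command and classify its exit code."""
    return classify_exit(subprocess.run(argv, check=False).returncode)


# Demonstration with a stand-in child process that exits with code 130.
status = run_and_classify([sys.executable, "-c", "import sys; sys.exit(130)"])
```

After an "interrupted" result, run_metrics/<RUN_ID>/progress.log is the place to look for how far the run got.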

Citation

If you use VCF-RDFizer in a publication, please cite:

VCF-RDFizer maintainers. (2026). VCF-RDFizer (Version 1.2.0) [Computer software]. GitHub. https://github.com/ecrum19/VCF-RDFizer

BibTeX:

@software{vcf_rdfizer_2026,
  author  = {{VCF-RDFizer maintainers}},
  title   = {VCF-RDFizer},
  year    = {2026},
  version = {1.2.0},
  url     = {https://github.com/ecrum19/VCF-RDFizer},
  note    = {Computer software}
}

You can also use the machine-readable citation file: CITATION.cff.

Contributing

Contributions are welcome. If you want to improve VCF-RDFizer:

  • Open an issue first for bug reports, feature requests, or design changes.
  • Fork the repo and create a feature branch from main.
  • Keep changes focused and include/update tests for behavior changes.
  • Run the unit tests locally before opening a PR:
    python3 -m unittest discover -s test -p "test_*_unit.py" -q
  • In your PR, include what changed, why it changed, and how you validated it.
  • Use clear commit messages. To control Docker publishing, include [publish-docker] in a commit message only when you intend it.

Licensing

  • Project license: LICENSE (MIT)
  • Third-party runtime notices: THIRD_PARTY_NOTICES.md
