Docker-first VCF to RDF conversion wrapper with compression/decompression modes

VCF-RDFizer

VCF-RDFizer is a Docker-first CLI wrapper for:

  1. VCF -> RDF (N-Triples) with RMLStreamer
  2. Optional RDF compression/decompression

Requirements

  • Python 3.10+
  • Docker (installed and running)

Install options:

pip install vcf-rdfizer

or

pipx install vcf-rdfizer

or

conda install -c conda-forge vcf-rdfizer

or pull the prebuilt Docker image directly:

docker pull ecrum19/vcf-rdfizer:latest

Important CLI Rule

--out is required for all modes.

It names the run's output root directory. VCF-RDFizer writes the following inside it:

  • final RDF/compression outputs
  • run metrics/logs
  • hidden intermediates

Modes

  • full: VCF -> TSV -> RDF -> compression
  • tsv: VCF -> TSV only (benchmarking)
  • compress: compress an existing .nt
  • decompress: decompress .nt.gz, .nt.br, or .hdt

In full mode with multiple VCF inputs, failures are isolated per input:

  • the run continues with remaining files
  • failed inputs are summarized in run_metrics/<RUN_ID>/failed_inputs.csv
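
A minimal sketch of inspecting that summary after a run. It assumes only that failed_inputs.csv is a CSV file with a header row; its column names are not documented here, so the rows are returned as whatever columns the file contains. The helper name read_failed_inputs is hypothetical and is not part of VCF-RDFizer.

```python
import csv
from pathlib import Path


def read_failed_inputs(out_dir, run_id):
    """Return the rows of run_metrics/<RUN_ID>/failed_inputs.csv as dicts.

    Returns an empty list when the file is absent. Treating a missing
    file as "no recorded failures" is an assumption of this sketch,
    not documented wrapper behavior.
    """
    path = Path(out_dir) / "run_metrics" / run_id / "failed_inputs.csv"
    if not path.is_file():
        return []
    with path.open(newline="") as handle:
        # DictReader takes column names from the header row, so no
        # particular schema is assumed here.
        return list(csv.DictReader(handle))
```

Calling read_failed_inputs("./results", run_id) with the run ID of interest returns one dict per failed input, which a script can print or use to retry those files.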

Main Flags (Most Used)

  • -m, --mode {full,compress,decompress,tsv}
  • -o, --out required output root directory
  • -c, --compression comma-separated compression methods: gzip, brotli, hdt, hdt_gzip, hdt_brotli, none
  • -I, --image Docker image repo (default ecrum19/vcf-rdfizer)
  • -v, --image-version Docker tag/version
  • -b, --build force Docker build
  • -B, --no-build fail if image not found
  • -h, --help show full usage
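
When a script assembles the --compression value, it can check the method names before launching a run. The sketch below is a pre-check a calling script might do, assuming the value is a comma-separated list of the method names listed above; the wrapper's own validation is not shown here and may differ. The names KNOWN_METHODS and parse_compression are hypothetical.

```python
# Method names copied from the --compression flag description above.
KNOWN_METHODS = {"gzip", "brotli", "hdt", "hdt_gzip", "hdt_brotli", "none"}


def parse_compression(value):
    """Split a comma-separated --compression value and reject unknown names."""
    methods = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [name for name in methods if name not in KNOWN_METHODS]
    if unknown:
        raise ValueError(f"unknown compression method(s): {', '.join(unknown)}")
    return methods
```

For example, parse_compression("hdt,brotli") returns the two method names in order, while a misspelled method raises ValueError before any Docker work starts.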

Full Mode Flags

  • -i, --input required VCF file or directory
  • -r, --rules mapping rules file (.ttl)
    • default: rules/default_rules.ttl
  • -l, --rdf-layout {aggregate,batch} required in full mode
  • -P, --spark-partitions optional Spark partition hint (positive integer)
    • low-cost way to reduce output part count by setting spark.default.parallelism and spark.sql.shuffle.partitions
  • -k, --keep-tsv keep hidden TSV intermediates
  • -R, --keep-rdf keep raw .nt after compression
  • -e, --estimate-size preflight size estimate

TSV Mode Flags

  • -i, --input required VCF file or directory
  • Outputs per-run benchmark summary in run_metrics/<RUN_ID>/tsv_metrics.csv
  • Raw TSV timing + artifact JSON per input in run_metrics/<RUN_ID>/raw_metrics/tsv_*

Compression Mode Flags

  • -q, --rdf, --nt required input .nt file

Decompression Mode Flags

  • -C, --compressed-input required .nt.gz, .nt.br, or .hdt
  • -d, --decompress-out optional explicit output .nt path (must be inside --out)
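
The "must be inside --out" constraint can be checked in a calling script before the wrapper is invoked. This is a minimal sketch using pathlib; how the wrapper itself enforces the rule is not shown here, and the helper name is_inside is hypothetical.

```python
from pathlib import Path


def is_inside(out_dir, candidate):
    """Return True if candidate resolves to a path strictly under out_dir."""
    root = Path(out_dir).resolve()
    target = Path(candidate).resolve()
    # resolve() normalizes ".." segments, so a path that climbs out of
    # out_dir is rejected even if it textually starts with out_dir.
    return root in target.parents
```

With --out ./results, a path such as ./results/sample/sample.nt passes the check, while ./results/../sample.nt does not.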

Quick Start

Show help:

vcf-rdfizer --help

Full pipeline (aggregate RDF):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout aggregate \
  --out ./results

Full pipeline (batch RDF parts):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --compression hdt \
  --out ./results

Full pipeline with low-cost partition cap (helps avoid too many tiny batch files):

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rdf-layout batch \
  --spark-partitions 8 \
  --compression hdt \
  --out ./results

Full pipeline with custom rules + keep RDF:

vcf-rdfizer \
  --mode full \
  --input ./vcf_files \
  --rules ./rules/my_rules.ttl \
  --rdf-layout aggregate \
  --compression hdt,brotli \
  --keep-rdf \
  --out ./results

TSV-only benchmark:

vcf-rdfizer \
  --mode tsv \
  --input ./vcf_files \
  --out ./results

Compression-only:

vcf-rdfizer \
  --mode compress \
  --rdf ./results/sample/sample.nt \
  --compression hdt_gzip \
  --out ./results

Decompression-only:

vcf-rdfizer \
  --mode decompress \
  --compressed-input ./results/sample/sample.hdt \
  --out ./results
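
The commands above can also be driven from a script. The following sketch assembles a full-mode command line using only flags shown in this document and runs it with subprocess. The function names build_full_command and run_full are hypothetical, and actually running the command requires vcf-rdfizer on PATH and a running Docker daemon.

```python
import subprocess


def build_full_command(input_path, out_dir, layout, compression=None):
    """Assemble argv for a full-mode run using only flags listed above."""
    if layout not in ("aggregate", "batch"):
        raise ValueError("layout must be 'aggregate' or 'batch'")
    argv = [
        "vcf-rdfizer",
        "--mode", "full",
        "--input", str(input_path),
        "--rdf-layout", layout,
        "--out", str(out_dir),
    ]
    if compression:
        # Multiple methods are passed as one comma-separated value.
        argv += ["--compression", ",".join(compression)]
    return argv


def run_full(input_path, out_dir, layout, compression=None):
    """Run the CLI and return its exit code (needs vcf-rdfizer and Docker)."""
    argv = build_full_command(input_path, out_dir, layout, compression)
    return subprocess.run(argv, check=False).returncode
```

Passing the arguments as a list avoids shell quoting problems with input paths that contain spaces.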

Output Layout

Given --out ./results:

  • final outputs:
    • ./results/<sample>/...
  • per-run metrics/logs:
    • ./results/run_metrics/<RUN_ID>/...
  • hidden intermediates:
    • ./results/.intermediate/tsv/

Intermediates are hidden by default. Raw .nt files are removed after compression unless --keep-rdf is provided.

Metrics

For each run, VCF-RDFizer writes:

  • run_metrics/<RUN_ID>/metrics.csv
  • run_metrics/<RUN_ID>/wrapper_execution_times.csv
  • run_metrics/<RUN_ID>/progress.log

Compression metrics include, for each method:

  • wall_seconds_*
  • user_seconds_*
  • sys_seconds_*
  • max_rss_kb_*
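
A sketch of finding these fields in a metrics file. It assumes they appear as CSV header columns in metrics.csv and that the * stands for a per-method suffix; neither detail is spelled out above, so the code matches on the documented prefixes only. The names PREFIXES and compression_columns are hypothetical.

```python
import csv
from pathlib import Path

# Prefixes copied from the list above; the suffix after each is not assumed.
PREFIXES = ("wall_seconds_", "user_seconds_", "sys_seconds_", "max_rss_kb_")


def compression_columns(metrics_csv):
    """Map each documented prefix to the matching column names in the header."""
    with Path(metrics_csv).open(newline="") as handle:
        header = next(csv.reader(handle), [])
    return {
        prefix: [name for name in header if name.startswith(prefix)]
        for prefix in PREFIXES
    }
```

The returned mapping shows which per-method columns a given run actually produced, which is useful when comparing runs that used different --compression values.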

Rules

  • default rules file: rules/default_rules.ttl
  • rules guide: rules/README.md

Troubleshooting

If Docker reports permission errors, rerun as a user who is allowed to run Docker (or configure Docker group or sudo access on your system).

If HDT compression fails on very large .nt files, use batch layout and/or non-HDT compression methods.

Safe termination:

  • Press Ctrl+C to interrupt a run.
  • The wrapper exits with code 130, writes progress to run_metrics/<RUN_ID>/progress.log, and performs best-effort cleanup of tracked intermediates.
  • Raw RDF cleanup on interrupt follows --keep-rdf:
    • with --keep-rdf, raw .nt files are preserved
    • without --keep-rdf, tracked raw .nt files are removed during interrupt cleanup
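
A calling script can use the exit code to tell an interrupted run from a failed one. The sketch below relies on the documented exit code 130 for interrupts and assumes the usual convention that 0 means success, which is not stated above. The names INTERRUPTED, describe_exit, and run_and_describe are hypothetical.

```python
import subprocess

INTERRUPTED = 130  # exit code documented above for a Ctrl+C interrupt


def describe_exit(returncode):
    """Classify a wrapper exit code as completed, interrupted, or failed."""
    if returncode == 0:  # assumption: 0 signals success
        return "completed"
    if returncode == INTERRUPTED:
        return "interrupted"
    return "failed"


def run_and_describe(argv):
    """Run a command and classify its exit code."""
    return describe_exit(subprocess.run(argv, check=False).returncode)
```

After an "interrupted" result, run_metrics/<RUN_ID>/progress.log is the place to check how far the run got.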

Citation

If you use VCF-RDFizer in a publication, please cite:

VCF-RDFizer maintainers. (2026). VCF-RDFizer (Version 1.1.0) [Computer software]. GitHub. https://github.com/ecrum19/VCF-RDFizer

BibTeX:

@software{vcf_rdfizer_2026,
  author  = {{VCF-RDFizer maintainers}},
  title   = {VCF-RDFizer},
  year    = {2026},
  version = {1.1.0},
  url     = {https://github.com/ecrum19/VCF-RDFizer},
  note    = {Computer software}
}

You can also use the machine-readable citation file: CITATION.cff.

Contributing

Contributions are welcome. If you want to improve VCF-RDFizer:

  • Open an issue first for bug reports, feature requests, or design changes.
  • Fork the repo and create a feature branch from main.
  • Keep changes focused and include/update tests for behavior changes.
  • Run the unit tests locally before opening a PR:
python3 -m unittest discover -s test -p "test_*_unit.py" -q
  • In your PR, include what changed, why it changed, and how you validated it.
  • Use clear commit messages (for Docker publish control, include [publish-docker] only when intended).

Licensing

  • Project license: LICENSE (MIT)
  • Third-party runtime notices: THIRD_PARTY_NOTICES.md
