Small VCF databases. One per cohort. Embedded ClickHouse engine, embedded DuckDB annotations, MCP natural-language layer.

vcfclick

A modern VCF database for research labs and bioinformatics teams. Embedded chDB (ClickHouse engine, no server) for sample data, embedded DuckDB for reference annotations, and a natural-language query layer that turns plain English into SQL you can read.

Single binary. uv run vcfclick. No Docker, no port, no server, no Gatekeeper dialog. The headline demo runs from a clean git clone.

Status: research preview. Architecture validated against real 1000 Genomes data.

Why

Two complaints heard repeatedly in research bioinformatics:

  1. "My cohort grew and bcftools | pandas stopped scaling." Once you have 500+ samples, ad-hoc cohort correlation queries become painfully slow. The standard answer is "go install Hail," which is correct but operationally expensive.

  2. "I can write the SQL, but I shouldn't have to type the boilerplate every time — and when it's written for me, I want to see it." Bioinformaticians don't want SQL hidden. They want it generated and visible, because trust comes from being able to read what ran.

vcfclick closes both:

  • chDB (ClickHouse embedded as a library) handles cohort scale. Measured on 1000 Genomes data: roughly 950 variants/sec single-process ingest, sparse storage at about 6% of the dense call count, and queries that run in-process at native ClickHouse speed.
  • The MCP server lets any LLM client translate plain English into the SQL underneath. The generated SQL is shown alongside the result — it's part of the answer, not a debug trace.

Architecture

┌────────────────────────────────────┐
│  Tiny web UI (separate repo)       │   English in → SQL + result out
└────────────────┬───────────────────┘
                 │
┌────────────────▼───────────────────┐
│  MCP server (Python)               │   Composes the two embedded stores
│  Tools: get_schema, run_sql,       │
│    position_for_gene, gene_at,     │
│    clinvar_lookup                  │
└────┬─────────────────────────┬─────┘
     │                         │
┌────▼──────────────┐  ┌───────▼────────────┐
│  chDB             │  │  DuckDB            │
│  (embedded)       │  │  (embedded)        │
│  sample data      │  │  reference data    │
│  - variants       │  │  - genes (RefSeq)  │
│  - genotypes      │  │  - clinvar_*       │
│  - samples        │  │                    │
│  - ingestions     │  │                    │
└───────────────────┘  └────────────────────┘

Two embedded stores, distinct purposes:

  • chDB holds sample data: wide pre-declared schema for VCF 4.3 reserved + common GATK INFO/FORMAT fields, with Map(String, String) overflow for anything else. Same SQL surface, same MergeTree engines, same projections as full ClickHouse — no server. Persistent on disk under .chdb/.
  • DuckDB holds reference data: RefSeq genes, ClinVar. Embedded, swappable, monthly refresh. Never touches sample data.

The MCP server composes across them at query time. Annotation lookups happen first (DuckDB), then their results parameterise the sample query (chDB). The chain of reasoning is visible in the UI.
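The two-step composition can be sketched as follows. This is a minimal illustration, not vcfclick's code: two in-memory sqlite3 connections from the Python standard library stand in for DuckDB and chDB so the example runs anywhere, and the column names, the gene GENE_A, its coordinates, and the signatures of position_for_gene and variants_in_gene are all assumptions made for the example.

```python
import sqlite3

# Stand-in for the DuckDB annotation store (reference data only).
annotations = sqlite3.connect(":memory:")
annotations.execute(
    "CREATE TABLE genes (symbol TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER)"
)
# Hypothetical gene with made-up coordinates.
annotations.execute("INSERT INTO genes VALUES ('GENE_A', 'chr17', 1000, 2000)")

# Stand-in for the chDB sample store.
samples = sqlite3.connect(":memory:")
samples.execute("CREATE TABLE variants (chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)")
samples.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?)",
    [
        ("chr17", 900, "A", "G"),
        ("chr17", 1500, "C", "T"),
        ("chr17", 1999, "G", "A"),
        ("chr1", 1500, "T", "C"),
    ],
)


def position_for_gene(symbol):
    """Step 1: resolve a gene symbol to a region in the annotation store."""
    row = annotations.execute(
        "SELECT chrom, start_pos, end_pos FROM genes WHERE symbol = ?", (symbol,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown gene: {symbol}")
    return row


def variants_in_gene(symbol):
    """Step 2: use the resolved region to parameterise the sample query."""
    chrom, start, end = position_for_gene(symbol)
    sql = (
        "SELECT chrom, pos, ref, alt FROM variants "
        "WHERE chrom = ? AND pos BETWEEN ? AND ? ORDER BY pos"
    )
    rows = samples.execute(sql, (chrom, start, end)).fetchall()
    # Return the SQL with the rows so the caller can show what actually ran.
    return sql, rows


sql, rows = variants_in_gene("GENE_A")
print(sql)
print(rows)
```

Returning the SQL string together with the rows mirrors the stated design goal that the generated query is part of the answer rather than a debug trace.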

Using vcfclick

Each cohort / study / VCF lives in its own small database under ~/.vcfclick/dbs/<name>/. The vcfclick CLI manages them.

# Normalise the VCF (one-time per file)
bcftools norm -m - input.vcf.gz | bgzip > normalised.vcf.gz

# Create a database for this cohort
vcfclick db create my-cohort

# Ingest the VCF into it
vcfclick db ingest my-cohort normalised.vcf.gz \
    --cohort demo --ingest-id batch_a

# Inspect what's in it
vcfclick db info my-cohort

# Run SQL directly
vcfclick db query my-cohort "SELECT count() FROM variants"

# Export the whole database as Parquet (interop with DuckDB,
# Snowflake, BigQuery, Spark, Iceberg)
vcfclick db dump my-cohort --out my-cohort-export/

# Bundle a database as a single tar.gz for sharing
vcfclick db push my-cohort /path/to/my-cohort.tar.gz

# Restore from a bundle — local file or HTTPS URL
vcfclick db pull other-cohort https://example.com/other-cohort.tar.gz

# List, remove
vcfclick db list
vcfclick db rm my-cohort

Each database is a self-contained chDB session whose on-disk format is byte-identical to that of a full ClickHouse server. Multiple databases sit side by side; each is cheap to create, dump, share, or delete.

The ingester prints a classification of the VCF's INFO/FORMAT fields on startup — what landed in typed columns vs. the overflow Maps. That log line is the "adapts to any VCF" claim made literally visible.
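The classification idea can be illustrated with a short sketch. The set of typed columns below is a hypothetical subset chosen for the example (the real list lives in schema/), and classify_fields and route_info are illustrative helper names, not functions from vcfclick.

```python
# Hypothetical subset of the pre-declared typed INFO columns.
TYPED_INFO = {"AC", "AF", "AN", "DP"}


def classify_fields(header_ids, typed):
    """Split header field IDs into typed-column hits and Map overflow."""
    in_typed = sorted(f for f in header_ids if f in typed)
    overflow = sorted(f for f in header_ids if f not in typed)
    return in_typed, overflow


def route_info(record_info, typed):
    """Route one record's INFO dict into typed columns plus a string-to-string map."""
    columns = {k: v for k, v in record_info.items() if k in typed}
    # Overflow values are stringified, matching a Map(String, String) column.
    extra = {k: str(v) for k, v in record_info.items() if k not in typed}
    return columns, extra


info_ids = ["AC", "AF", "AN", "DP", "VQSLOD", "culprit"]
typed_ids, overflow_ids = classify_fields(info_ids, TYPED_INFO)
print(f"INFO typed={typed_ids} overflow={overflow_ids}")

columns, extra = route_info({"AC": 2, "AF": 0.01, "VQSLOD": 3.2}, TYPED_INFO)
print(columns, extra)
```

Fields the schema does not know about are never dropped in this sketch; they only lose their type and land in the overflow map as strings.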

Per-ingestion identity inside a database. Every row carries ingest_id. Rows are NOT merged across uploads — the same (chrom, pos, ref, alt) observed in two different VCFs is two rows, because annotations and QC origin can differ. Re-running with the same --ingest-id is idempotent (silently replaces prior rows via ReplacingMergeTree). Using a new --ingest-id appends.
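The replace-versus-append behaviour can be modelled in a few lines of plain Python. This is a toy model, not the storage engine: a dict stands in for the table, and the key (ingest_id, chrom, pos, ref, alt) is an assumption about which columns identify a row.

```python
def apply_ingest(table, ingest_id, records):
    """Insert records under ingest_id, replacing prior rows that share the same key."""
    for chrom, pos, ref, alt, qual in records:
        key = (ingest_id, chrom, pos, ref, alt)  # assumed row identity
        table[key] = qual
    return table


table = {}
vcf = [("chr17", 100, "A", "G", 50.0), ("chr17", 200, "C", "T", 30.0)]

apply_ingest(table, "batch_a", vcf)
apply_ingest(table, "batch_a", vcf)  # same ingest id: rows are replaced, not duplicated
n_after_rerun = len(table)

apply_ingest(table, "batch_b", vcf)  # new ingest id: the same variants become new rows
n_after_append = len(table)

print(n_after_rerun, n_after_append)
```

One difference from the toy model: in ClickHouse, ReplacingMergeTree removes superseded rows when parts are merged, so a query may need FINAL to see the collapsed view before a merge has run. Whether vcfclick's generated queries do this is not stated here.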

Parallel ingestion is the default; pass --serial to force the single-process loader. The parallel splitter makes a single pass over the tabix .tbi index (~1 ms) to count variants per 100 kb position bucket, then greedily splits each contig into ranges of approximately equal variant count, so dense subregions (gene panels, exomes) don't leave N-1 workers idle.
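A variant-count-aware split of this kind can be sketched as below. It shows one plausible greedy rule and is not necessarily the one vcfclick uses; reading counts out of the .tbi index is not shown, so the per-bucket counts are supplied as a plain list, and split_contig is a hypothetical name.

```python
BUCKET_BP = 100_000  # bucket width in base pairs


def split_contig(bucket_counts, n_workers, bucket_bp=BUCKET_BP):
    """Greedily cut a contig into at most n_workers contiguous ranges of similar variant count.

    bucket_counts[i] is the number of variants in [i * bucket_bp, (i + 1) * bucket_bp).
    Returns (start, end, count) tuples with half-open coordinates.
    """
    total = sum(bucket_counts)
    if total == 0 or n_workers < 1:
        return []
    ranges = []
    start_bucket = 0
    acc = 0
    remaining = total
    for i, count in enumerate(bucket_counts):
        acc += count
        parts_left = n_workers - len(ranges)
        # Re-aim at an equal share of whatever has not been assigned yet.
        target = remaining / parts_left
        is_last = i == len(bucket_counts) - 1
        if is_last or (parts_left > 1 and acc >= target):
            if acc > 0:
                ranges.append((start_bucket * bucket_bp, (i + 1) * bucket_bp, acc))
            remaining -= acc
            start_bucket = i + 1
            acc = 0
    return ranges


# Dense region in the middle of an otherwise sparse contig.
counts = [5, 5, 40, 50, 45, 5, 40, 10]
ranges = split_contig(counts, n_workers=4)
for start, end, n in ranges:
    print(f"{start:>7}-{end:<7} {n} variants")

# For comparison: equal-width ranges of two buckets each.
equal_width = [sum(counts[i:i + 2]) for i in range(0, len(counts), 2)]
print(equal_width)
```

On this input the count-aware split gives four ranges of 50 variants each, while equal-width ranges would hold 10, 90, 50, and 50. Balance is limited by bucket granularity: a single bucket holding most of a contig's variants cannot be divided further at this resolution.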

Pointing the MCP server at a specific database

In your Claude Desktop / MCP-client config, set VCFCLICK_DB_NAME to the database you want the LLM to talk to:

"vcfclick": {
  "command": "/path/to/vcfclick/.venv/bin/python",
  "args": ["-m", "vcfclick_mcp.server"],
  "cwd": "/path/to/vcfclick",
  "env": {
    "PYTHONPATH": "/path/to/vcfclick",
    "VCFCLICK_DB_NAME": "my-cohort"
  }
}

Register multiple vcfclick-<dbname> entries if you want the LLM to be able to switch between cohorts in a single Claude Desktop session.
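For several cohorts, the entries can be generated rather than hand-edited. The sketch below reuses the fields from the single-entry config above; mcp_entries is a hypothetical helper, the paths are the same placeholders, and the mcpServers wrapper key is the one Claude Desktop's config file uses (other MCP clients may expect a different layout).

```python
import json


def mcp_entries(db_names, repo="/path/to/vcfclick"):
    """Build one MCP server entry per database, keyed vcfclick-<dbname>."""
    return {
        f"vcfclick-{name}": {
            "command": f"{repo}/.venv/bin/python",
            "args": ["-m", "vcfclick_mcp.server"],
            "cwd": repo,
            "env": {"PYTHONPATH": repo, "VCFCLICK_DB_NAME": name},
        }
        for name in db_names
    }


config = {"mcpServers": mcp_entries(["my-cohort", "other-cohort"])}
print(json.dumps(config, indent=2))
```

Each entry differs only in its key and in VCFCLICK_DB_NAME, so the client sees one tool set per cohort.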

Legacy Python-module entry points

The pre-CLI module commands still work for scripted use:

# Single database at ./.chdb/  (no CLI involvement)
uv run python -m ingest.parallel normalised.vcf.gz \
    --cohort demo --ingest-id batch --workers 4

uv run python -m export.parquet variants /path/out.parquet
uv run python -m export.parquet --all /path/output_dir/

These ingest into and read from ./.chdb/ (or the directory that VCFCLICK_DB points to) and ignore the named-DB layout.

Layout

  • schema/ — ClickHouse DDL (chDB applies it unchanged).
  • storage/db.py — chDB session singleton; apply_schema() helper.
  • ingest/vcf_load.py — serial cyvcf2-based ingester.
  • ingest/parallel.py — multi-process variant; Parquet staging.
  • ingest/_arrow.py — pyarrow schemas matching the ClickHouse tables.
  • export/parquet.py — table → Parquet export CLI.
  • annotations/db.py — DuckDB annotation API (gene, ClinVar).
  • annotations/transcripts.py — transcript/exon/CDS API stubs (Phase 2).
  • vcfclick_mcp/server.py — MCP server (chDB + DuckDB tool surface). Renamed from mcp/ so the directory does not shadow the upstream mcp Python SDK.
  • data/ — VCF inputs (gitignored).

Validated against real data

| Workload | Variants | Samples | Calls stored | Throughput (variants/s) |
|---|---|---|---|---|
| BRCA1 region (1000G 30x) | 1,863 | 3,202 | 369,776 | small-VCF baseline |
| 10 Mb chr17 (1000G 30x), serial | 235,768 | 3,202 | 44,986,737 | 952 |
| 10 Mb chr17 (1000G 30x), parallel, 4 workers | 235,768 | 3,202 | 44,986,737 | 1,983 (2.1×) |
| 10 Mb chr17 (1000G 30x), parallel, 8 workers | 235,768 | 3,202 | 44,986,737 | 2,466 (2.6×) |

Parallel speedup comes from the variant-count-aware splitter: each worker gets approximately equal work regardless of where the variants fall along the chromosome. Empirically, the sparse tables store 6.2% of the dense theoretical maximum (variants × samples).
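The ratios quoted in this section can be rechecked from the table's own numbers. The sketch assumes the dense theoretical maximum means one stored call per variant per sample, which is an interpretation rather than something stated explicitly; the helper names are illustrative.

```python
def fill_ratio(variants, samples, calls_stored):
    """Stored calls as a fraction of the dense variants x samples grid (assumed definition)."""
    return calls_stored / (variants * samples)


def speedup(rate, baseline):
    """Throughput relative to the serial baseline."""
    return rate / baseline


brca1 = fill_ratio(1_863, 3_202, 369_776)
chr17 = fill_ratio(235_768, 3_202, 44_986_737)
print(f"BRCA1 region: {brca1:.1%} of dense")
print(f"10 Mb chr17:  {chr17:.1%} of dense")

serial = 952  # variants/s, serial row of the table
for workers, rate in [(4, 1_983), (8, 2_466)]:
    print(f"{workers} workers: {speedup(rate, serial):.1f}x over serial")
```

Under that definition the BRCA1 region works out to 6.2% and the 10 Mb chr17 workload to about 6.0%, and the 4- and 8-worker speedups come out at 2.1x and 2.6x, matching the table.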

TileDB-VCF comparison

End-to-end on the same 235k-variant / 3,202-sample workload, native arm64 (vcfclick) vs Rosetta-emulated linux/amd64 (TileDB-VCF Docker):

| | vcfclick | TileDB-VCF |
|---|---|---|
| Source VCF format | joint VCF ingested directly | per-sample VCFs only ("Combined VCFs are currently not supported") |
| Pre-processing | none | bcftools +split + tabix × 3,202 ≈ 8+ min |
| Source VCF disk | 114 MB | 15.1 GB (132× inflation) |
| Ingest, best stable config | 69 s (parallel-8) | ~79 min projected (single-thread, multi-thread failed) |
| End-to-end | ~1 min | ~87 min |

Full methodology, caveats (including the Rosetta penalty), and reproduction commands: bench/BENCHMARK.md.

License

License choice (AGPL-3.0 vs BSL) is pending. A standard LICENSE file will be added before the first tagged release. See LICENSING.md.

Open work

  • VCF schema auto-discovery utility (vcf-discover).
  • ClinVar VCF loader under annotations/loaders/ (the GENCODE gene loader is in; ClinVar significance lookup is still stubbed).
  • Phase 2: transcript / exon / CDS hierarchy + corresponding MCP tools.
  • End-to-end MCP integration test with a real LLM client — the SCHEMA_DESCRIPTION prompt is theoretical until it's stress-tested.
