A VGI (Vector Gateway Interface) worker exposing scikit-bio sequence analysis, diversity, ordination, and phylogenetics to DuckDB/SQL
Project description
vgi-scikit-bio
scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.
INSTALL vgi FROM community; LOAD vgi;
-- Attach with zero local install — from PyPI via uvx, or from ghcr via oci://
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
-- ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG'); -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG'); -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT'); -- MRITG
Everything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.
Contents
- Install & attach
- What's inside
- A worked microbiome example
- Data-shape conventions
- Development
- Deployment
- License & support
Install & attach
First install the DuckDB vgi community extension, then attach the worker via a
LOCATION. Pick whichever way suits you — you don't have to install anything
locally; uvx and oci:// both fetch and run the published worker for you.
INSTALL vgi FROM community;
LOAD vgi;
Zero-install from PyPI (uvx) — recommended for a laptop with uv:
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG'); -- 0.5
uvx downloads vgi-scikit-bio from
PyPI into a cache and runs its stdio console script; the first attach resolves
the package, later ones are instant.
Container from ghcr (oci://) — recommended when you'd rather not touch
Python at all (just needs Docker or Podman):
ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');
The extension turns the oci:// address into a pooled docker run -i --rm … stdio and keeps the container warm across queries. Use :latest to track the
newest release, or docker://… (an alias). A struct LOCATION adds options —
{'image': 'oci://…', 'runtime': 'podman', 'volumes': ['skbio_state:/data']}.
Installed console script — if you pip install vgi-scikit-bio, the
vgi-scikit-bio (stdio) and vgi-scikit-bio-http (HTTP) scripts are on PATH:
ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');
From a source checkout (development):
ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');
HTTP server — for a warm, shared, long-lived worker, run the container (or
vgi-scikit-bio-http) and attach over its URL:
docker run -d -p 8000:8000 -e VGI_SIGNING_KEY=change-me ghcr.io/query-farm/vgi-scikit-bio:0.1.0
ATTACH 'skbio' (TYPE vgi, LOCATION 'http://localhost:8000');
Functions are organised into five schemas — sequence (the default),
alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...)
also resolves as skbio.gc_content(...). There are ~90 functions; the tables
below list them by area (grouped by the schema's category sections, which the
worker also exposes for navigation).
What's inside
sequence — biological sequences
Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is
upper-cased and stripped; a NULL or invalid sequence yields NULL rather than
failing the whole query.
- Transforms —
gc_content,gc_frequency,reverse_complement,complement,transcribe,reverse_transcribe,translate,degap - Validation —
is_valid_dna,is_valid_protein,has_gaps,has_degenerates,is_reverse_complement - Distance —
hamming_distance,mismatch_count,match_count - Composition (table functions, long output) —
kmer_frequencies,residue_frequencies,count_subsequence,translate_six_frames
-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);
-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);
alignment — pairwise alignment
- Score —
align_score_nucleotide(a, b),align_score_protein(a, b)→ optimal global-alignment score - Pairwise —
pairwise_align_nucleotide/pairwise_align_protein→ aligned strings + score + length;mode := 'global'(Needleman–Wunsch) or'local'(Smith–Waterman)
SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
seq1 := 'ref', seq2 := 'read');
diversity — alpha, beta, phylogenetic, rarefaction
Community-ecology diversity over a long feature table — one row per
(sample, feature, count).
- Alpha (aggregates over
count,GROUP BY sample) — the full scikit-bio metric family:shannon,simpson,inv_simpson,observed_features,chao1,pielou_evenness,dominance,ace,berger_parker_d,brillouin_d,fisher_alpha,gini_index,goods_coverage,heip_evenness,kempton_taylor_q,margalef,mcintosh_d,mcintosh_e,menhinick,robbins,simpson_d,simpson_e,strong,singles,doubles,enspie; parameterizedhill(count, q := …)/renyi/tsallis; and interval/triple metricschao1_ci,esty_ci,osd(returnDOUBLE[]) - Beta —
beta_diversity(tbl, metric := …)→ the full distance matrix long (braycurtis,jaccard,euclidean,canberra,cosine,jensenshannon, …) - Phylogenetic (given a
tree := '<newick>') —faith_pd(per-sample PD),unifrac(weighted/unweighted distance matrix) - Preprocessing —
subsample_counts(tbl, depth := N)(rarefaction)
SELECT sample_id,
skbio.diversity.shannon(count) AS shannon,
skbio.diversity.chao1(count) AS chao1
FROM feature_table GROUP BY sample_id;
SELECT * FROM skbio.diversity.beta_diversity(
(SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');
stats — ordination, distance tests, composition, differential abundance
- Ordination —
pcoa(dm)(distance matrix),pca(tbl)/ca(tbl)(feature table) →(sample_id, <axis>_1 … k) - Hypothesis tests —
permanova/anosim(grouping in a 4th column),mantel(two distance columns) - Composition — transforms
clr,ilr,alr,closure,centralize,rclr,multi_replace,power; inversesclr_inv,ilr_inv,alr_inv; associationpairwise_vlr - Differential abundance —
ancom,dirmult_ttest(grouping in a 4th column)
pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix
from beta_diversity. pca/ca and the composition/ANCOM functions read a long
feature table.
-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
n_components := 2);
tree — phylogenetics
- Construction (distance matrix → one
newickrow) —neighbor_joining,upgma,gme,bme - Inspection (scalars over a Newick string) —
tip_count,total_branch_length,tree_height - Comparison (two Newick trees → distance) —
robinson_foulds,weighted_robinson_foulds,cophenetic_distance
SELECT newick
FROM skbio.tree.neighbor_joining(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));
Not exposed (they don't map to a single SQL function): constrained ordination
cca/rdaandbioenv/pwmantel(need a second matrix beyond the one subquery slot);permdisp(an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOMTable, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*,michaelis_menten_fit).
A worked microbiome example
Starting from a long OTU/feature table feature_table(sample_id, feature_id, count)
and a metadata(sample_id, group) table:
-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;
-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));
SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);
-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
SELECT bc.id_1, bc.id_2, bc.distance, m.group
FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));
-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));
Data-shape conventions
- Feature tables are long:
(sample_id, feature_id, count)(orvalue). This is the natural SQL shape and avoids a fixed, data-dependent column width.PIVOTif you need a wide matrix. - Distance matrices are long:
(id_1, id_2, distance)— exactly whatbeta_diversityemits, and whatpcoa/permanova/anosim/mantel/neighbor_joiningconsume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear asid_1, which the full square frombeta_diversityguarantees). - Column names default positionally and can be overridden with named args
(
sample :=,feature :=,count :=,id_1 :=, …). - Bad rows degrade to NULL in the sequence scalars rather than raising.
Development
The repo is an installable package (pyproject.toml, hatchling, uv.lock).
uv sync # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q # unit tests
uv run ruff check . # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/
make test-stdio # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http # same suite against a local HTTP server
The unit tests exercise each function's logic in-process; the SQL suite
(test/sql/*.test) is authoritative because it drives the real DuckDB → VGI →
worker path. It needs a sqllogictest runner with the vgi extension — CI uses a
prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++
build). See ci/README.md.
To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it
from that module's *_FUNCTIONS list, and splice the list into
_SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the
framework patterns and sharp edges.
Deployment
One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both
transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built,
tested against the signed vgi extension, and signed by CI on every tag. The
image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.
License & support
MIT — see LICENSE. Copyright 2026 Query Farm LLC.
Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.
Built on scikit-bio — if you use it in research, please also cite scikit-bio.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vgi_scikit_bio-0.1.1.tar.gz.
File metadata
- Download URL: vgi_scikit_bio-0.1.1.tar.gz
- Upload date:
- Size: 80.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5d93138d99775355eef40da7bd989447496b9f1ae5536cb0bb7728149170c19d
|
|
| MD5 |
40fe503a0877e1edd3f65813bb2667f2
|
|
| BLAKE2b-256 |
0c4d826d68b854d8160041be2574f9da7e04b9f774e2dab1f77289d037bd90b7
|
File details
Details for the file vgi_scikit_bio-0.1.1-py3-none-any.whl.
File metadata
- Download URL: vgi_scikit_bio-0.1.1-py3-none-any.whl
- Upload date:
- Size: 72.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5c5be850f7bfc1770d20fccf681ff5cc08708a8e57bf59877a40467d562e3000
|
|
| MD5 |
da09af629de3b178a8d0f48e490dc272
|
|
| BLAKE2b-256 |
25a7bab679444882514a0d77b7a6ca8e61b5a77921937a26ee12890c7b01c306
|