A VGI (Vector Gateway Interface) worker exposing scikit-bio sequence analysis, diversity, ordination, and phylogenetics to DuckDB/SQL
Project description
vgi-scikit-bio
scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.
-- Attach the worker (installed console script, or `uv run scikit_bio_worker.py`)
ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG'); -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG'); -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT'); -- MRITG
Everything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.
Contents
- Install & attach
- What's inside
- A worked microbiome example
- Data-shape conventions
- Development
- Deployment
- License & support
Install & attach
The worker is a normal Python package that ships two console scripts:
vgi-scikit-bio (stdio, spawned by DuckDB) and vgi-scikit-bio-http (HTTP
server).
pip install vgi-scikit-bio # or: uv pip install vgi-scikit-bio
Attach it from DuckDB (the DuckDB vgi community extension must be installed):
INSTALL vgi FROM community;
LOAD vgi;
-- stdio: DuckDB spawns the worker as a subprocess
ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');
-- or run straight from a source checkout without installing:
ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');
Functions are organised into five schemas — sequence (the default),
alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...)
also resolves as skbio.gc_content(...). There are ~90 functions; the tables
below list them by area (grouped by the schema's category sections, which the
worker also exposes for navigation).
What's inside
sequence — biological sequences
Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is
upper-cased and stripped; a NULL or invalid sequence yields NULL rather than
failing the whole query.
- Transforms —
gc_content,gc_frequency,reverse_complement,complement,transcribe,reverse_transcribe,translate,degap - Validation —
is_valid_dna,is_valid_protein,has_gaps,has_degenerates,is_reverse_complement - Distance —
hamming_distance,mismatch_count,match_count - Composition (table functions, long output) —
kmer_frequencies,residue_frequencies,count_subsequence,translate_six_frames
-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);
-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);
alignment — pairwise alignment
- Score —
align_score_nucleotide(a, b),align_score_protein(a, b)→ optimal global-alignment score - Pairwise —
pairwise_align_nucleotide/pairwise_align_protein→ aligned strings + score + length;mode := 'global'(Needleman–Wunsch) or'local'(Smith–Waterman)
SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
seq1 := 'ref', seq2 := 'read');
diversity — alpha, beta, phylogenetic, rarefaction
Community-ecology diversity over a long feature table — one row per
(sample, feature, count).
- Alpha (aggregates over
count,GROUP BY sample) — the full scikit-bio metric family:shannon,simpson,inv_simpson,observed_features,chao1,pielou_evenness,dominance,ace,berger_parker_d,brillouin_d,fisher_alpha,gini_index,goods_coverage,heip_evenness,kempton_taylor_q,margalef,mcintosh_d,mcintosh_e,menhinick,robbins,simpson_d,simpson_e,strong,singles,doubles,enspie; parameterizedhill(count, q := …)/renyi/tsallis; and interval/triple metricschao1_ci,esty_ci,osd(returnDOUBLE[]) - Beta —
beta_diversity(tbl, metric := …)→ the full distance matrix long (braycurtis,jaccard,euclidean,canberra,cosine,jensenshannon, …) - Phylogenetic (given a
tree := '<newick>') —faith_pd(per-sample PD),unifrac(weighted/unweighted distance matrix) - Preprocessing —
subsample_counts(tbl, depth := N)(rarefaction)
SELECT sample_id,
skbio.diversity.shannon(count) AS shannon,
skbio.diversity.chao1(count) AS chao1
FROM feature_table GROUP BY sample_id;
SELECT * FROM skbio.diversity.beta_diversity(
(SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');
stats — ordination, distance tests, composition, differential abundance
- Ordination —
pcoa(dm)(distance matrix),pca(tbl)/ca(tbl)(feature table) →(sample_id, <axis>_1 … k) - Hypothesis tests —
permanova/anosim(grouping in a 4th column),mantel(two distance columns) - Composition — transforms
clr,ilr,alr,closure,centralize,rclr,multi_replace,power; inversesclr_inv,ilr_inv,alr_inv; associationpairwise_vlr - Differential abundance —
ancom,dirmult_ttest(grouping in a 4th column)
pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix
from beta_diversity. pca/ca and the composition/ANCOM functions read a long
feature table.
-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
n_components := 2);
tree — phylogenetics
- Construction (distance matrix → one
newickrow) —neighbor_joining,upgma,gme,bme - Inspection (scalars over a Newick string) —
tip_count,total_branch_length,tree_height - Comparison (two Newick trees → distance) —
robinson_foulds,weighted_robinson_foulds,cophenetic_distance
SELECT newick
FROM skbio.tree.neighbor_joining(
(SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));
Not exposed (they don't map to a single SQL function): constrained ordination
cca/rdaandbioenv/pwmantel(need a second matrix beyond the one subquery slot);permdisp(an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOMTable, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*,michaelis_menten_fit).
A worked microbiome example
Starting from a long OTU/feature table feature_table(sample_id, feature_id, count)
and a metadata(sample_id, group) table:
-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;
-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));
SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);
-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
SELECT bc.id_1, bc.id_2, bc.distance, m.group
FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));
-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));
Data-shape conventions
- Feature tables are long:
(sample_id, feature_id, count)(orvalue). This is the natural SQL shape and avoids a fixed, data-dependent column width.PIVOTif you need a wide matrix. - Distance matrices are long:
(id_1, id_2, distance)— exactly whatbeta_diversityemits, and whatpcoa/permanova/anosim/mantel/neighbor_joiningconsume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear asid_1, which the full square frombeta_diversityguarantees). - Column names default positionally and can be overridden with named args
(
sample :=,feature :=,count :=,id_1 :=, …). - Bad rows degrade to NULL in the sequence scalars rather than raising.
Development
The repo is an installable package (pyproject.toml, hatchling, uv.lock).
uv sync # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q # unit tests
uv run ruff check . # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/
make test-stdio # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http # same suite against a local HTTP server
The unit tests exercise each function's logic in-process; the SQL suite
(test/sql/*.test) is authoritative because it drives the real DuckDB → VGI →
worker path. It needs a sqllogictest runner with the vgi extension — CI uses a
prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++
build). See ci/README.md.
To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it
from that module's *_FUNCTIONS list, and splice the list into
_SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the
framework patterns and sharp edges.
Deployment
One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both
transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built,
tested against the signed vgi extension, and signed by CI on every tag. The
image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.
License & support
MIT — see LICENSE. Copyright 2026 Query Farm LLC.
Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.
Built on scikit-bio — if you use it in research, please also cite scikit-bio.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file vgi_scikit_bio-0.1.0.tar.gz.
File metadata
- Download URL: vgi_scikit_bio-0.1.0.tar.gz
- Upload date:
- Size: 79.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
58b59206b58ac3e6092a599fdb114adc2c3f24026e0d967b7b855cb04c01eea5
|
|
| MD5 |
badd1f047555afd32abbfa2ba174f773
|
|
| BLAKE2b-256 |
2da72ca117246523b9adee7d8e199b59cf6c8f991cdcded245a93765ef07b31e
|
File details
Details for the file vgi_scikit_bio-0.1.0-py3-none-any.whl.
File metadata
- Download URL: vgi_scikit_bio-0.1.0-py3-none-any.whl
- Upload date:
- Size: 71.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
5ab0492a9d3409800030e444752aebfc9dcbe39e72d5b8fad68ba426f3aead2f
|
|
| MD5 |
05da38c849eac09af8e4012e195ab951
|
|
| BLAKE2b-256 |
8d4861b414266f8dffa27de8eccdf92d03a62e6a85de302b494ac65b6428358f
|