Skip to main content

A VGI (Vector Gateway Interface) worker exposing scikit-bio sequence analysis, diversity, ordination, and phylogenetics to DuckDB/SQL

Project description

Vector Gateway Interface        scikit-bio        DuckDB

vgi-scikit-bio

scikit-bio for SQL — a VGI worker that exposes biological sequence analysis, pairwise alignment, community diversity (including phylogenetic), ordination, distance-matrix statistics, compositional transforms, and phylogenetics to DuckDB as ordinary scalar, aggregate, and table functions.

INSTALL vgi FROM community; LOAD vgi;

-- Attach with zero local install — from PyPI via uvx, or from ghcr via oci://
ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
-- ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');            -- 0.5
SELECT skbio.sequence.reverse_complement('ATGCGGATTACAGG');    -- CCTGTAATCCGCAT
SELECT skbio.sequence.translate('ATGCGGATTACAGGT');            -- MRITG

Everything runs in a separate Python process and streams Apache Arrow between DuckDB and scikit-bio, so you get scikit-bio's algorithms without leaving SQL.


Contents


Install & attach

First install the DuckDB vgi community extension, then attach the worker via a LOCATION. Pick whichever way suits you — you don't have to install anything locally; uvx and oci:// both fetch and run the published worker for you.

INSTALL vgi FROM community;
LOAD vgi;

Zero-install from PyPI (uvx) — recommended for a laptop with uv:

ATTACH 'skbio' (TYPE vgi, LOCATION 'uvx vgi-scikit-bio');
SELECT skbio.sequence.gc_content('ATGCGGATTACAGG');   -- 0.5

uvx downloads vgi-scikit-bio from PyPI into a cache and runs its stdio console script; the first attach resolves the package, later ones are instant.

Container from ghcr (oci://) — recommended when you'd rather not touch Python at all (just needs Docker or Podman):

ATTACH 'skbio' (TYPE vgi, LOCATION 'oci://ghcr.io/query-farm/vgi-scikit-bio:0.1.0');

The extension turns the oci:// address into a pooled docker run -i --rm … stdio and keeps the container warm across queries. Use :latest to track the newest release, or docker://… (an alias). A struct LOCATION adds options — {'image': 'oci://…', 'runtime': 'podman', 'volumes': ['skbio_state:/data']}.

Installed console script — if you pip install vgi-scikit-bio, the vgi-scikit-bio (stdio) and vgi-scikit-bio-http (HTTP) scripts are on PATH:

ATTACH 'skbio' (TYPE vgi, LOCATION 'vgi-scikit-bio');

From a source checkout (development):

ATTACH 'skbio' (TYPE vgi, LOCATION 'uv run scikit_bio_worker.py');

HTTP server — for a warm, shared, long-lived worker, run the container (or vgi-scikit-bio-http) and attach over its URL:

docker run -d -p 8000:8000 -e VGI_SIGNING_KEY=change-me ghcr.io/query-farm/vgi-scikit-bio:0.1.0
ATTACH 'skbio' (TYPE vgi, LOCATION 'http://localhost:8000');

Functions are organised into five schemassequence (the default), alignment, diversity, stats, and tree — so skbio.sequence.gc_content(...) also resolves as skbio.gc_content(...). There are ~90 functions; the tables below list them by area (grouped by the schema's category sections, which the worker also exposes for navigation).


What's inside

sequence — biological sequences

Per-sequence functions over VARCHAR columns of DNA/RNA/protein. Input is upper-cased and stripped; a NULL or invalid sequence yields NULL rather than failing the whole query.

  • Transformsgc_content, gc_frequency, reverse_complement, complement, transcribe, reverse_transcribe, translate, degap
  • Validationis_valid_dna, is_valid_protein, has_gaps, has_degenerates, is_reverse_complement
  • Distancehamming_distance, mismatch_count, match_count
  • Composition (table functions, long output) — kmer_frequencies, residue_frequencies, count_subsequence, translate_six_frames
-- Profile reads by GC content, keeping only valid DNA
SELECT id, skbio.sequence.gc_content(seq) AS gc
FROM reads WHERE skbio.sequence.is_valid_dna(seq);

-- 4-mer feature matrix (long → pivot for a wide matrix)
SELECT id, kmer, count
FROM skbio.sequence.kmer_frequencies((SELECT id, seq FROM reads), id := 'id', k := 4);

alignment — pairwise alignment

  • Scorealign_score_nucleotide(a, b), align_score_protein(a, b) → optimal global-alignment score
  • Pairwisepairwise_align_nucleotide / pairwise_align_protein → aligned strings + score + length; mode := 'global' (Needleman–Wunsch) or 'local' (Smith–Waterman)
SELECT aligned_1, aligned_2, score
FROM skbio.alignment.pairwise_align_nucleotide((SELECT id, ref, read FROM pairs),
     seq1 := 'ref', seq2 := 'read');

diversity — alpha, beta, phylogenetic, rarefaction

Community-ecology diversity over a long feature table — one row per (sample, feature, count).

  • Alpha (aggregates over count, GROUP BY sample) — the full scikit-bio metric family: shannon, simpson, inv_simpson, observed_features, chao1, pielou_evenness, dominance, ace, berger_parker_d, brillouin_d, fisher_alpha, gini_index, goods_coverage, heip_evenness, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, robbins, simpson_d, simpson_e, strong, singles, doubles, enspie; parameterized hill(count, q := …) / renyi / tsallis; and interval/triple metrics chao1_ci, esty_ci, osd (return DOUBLE[])
  • Betabeta_diversity(tbl, metric := …) → the full distance matrix long (braycurtis, jaccard, euclidean, canberra, cosine, jensenshannon, …)
  • Phylogenetic (given a tree := '<newick>') — faith_pd (per-sample PD), unifrac (weighted/unweighted distance matrix)
  • Preprocessingsubsample_counts(tbl, depth := N) (rarefaction)
SELECT sample_id,
       skbio.diversity.shannon(count) AS shannon,
       skbio.diversity.chao1(count)   AS chao1
FROM feature_table GROUP BY sample_id;

SELECT * FROM skbio.diversity.beta_diversity(
  (SELECT sample_id, feature_id, count FROM feature_table), metric := 'braycurtis');

stats — ordination, distance tests, composition, differential abundance

  • Ordinationpcoa(dm) (distance matrix), pca(tbl) / ca(tbl) (feature table) → (sample_id, <axis>_1 … k)
  • Hypothesis testspermanova / anosim (grouping in a 4th column), mantel (two distance columns)
  • Composition — transforms clr, ilr, alr, closure, centralize, rclr, multi_replace, power; inverses clr_inv, ilr_inv, alr_inv; association pairwise_vlr
  • Differential abundanceancom, dirmult_ttest (grouping in a 4th column)

pcoa/permanova/anosim/mantel read the long (id_1, id_2, distance) matrix from beta_diversity. pca/ca and the composition/ANCOM functions read a long feature table.

-- Embed samples in 2-D from a Bray–Curtis matrix
SELECT * FROM skbio.stats.pcoa(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))),
  n_components := 2);

tree — phylogenetics

  • Construction (distance matrix → one newick row) — neighbor_joining, upgma, gme, bme
  • Inspection (scalars over a Newick string) — tip_count, total_branch_length, tree_height
  • Comparison (two Newick trees → distance) — robinson_foulds, weighted_robinson_foulds, cophenetic_distance
SELECT newick
FROM skbio.tree.neighbor_joining(
  (SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table))));

Not exposed (they don't map to a single SQL function): constrained ordination cca/rda and bioenv/pwmantel (need a second matrix beyond the one subquery slot); permdisp (an upstream bug in scikit-bio 0.7.3); and stateful/IO surface (TabularMSA, BIOM Table, FASTA/FASTQ readers — DuckDB already reads files) and randomness-seeded estimators (lladser_*, michaelis_menten_fit).


A worked microbiome example

Starting from a long OTU/feature table feature_table(sample_id, feature_id, count) and a metadata(sample_id, group) table:

-- 1. Alpha diversity per sample
SELECT sample_id, skbio.diversity.shannon(count) AS shannon
FROM feature_table GROUP BY sample_id;

-- 2. Bray–Curtis distances → PCoA coordinates for an ordination plot
CREATE TABLE bc AS
  SELECT * FROM skbio.diversity.beta_diversity((SELECT * FROM feature_table));

SELECT * FROM skbio.stats.pcoa((SELECT * FROM bc), n_components := 2);

-- 3. Does the grouping explain the community differences? (PERMANOVA)
SELECT * FROM skbio.stats.permanova((
  SELECT bc.id_1, bc.id_2, bc.distance, m.group
  FROM bc JOIN metadata m ON bc.id_1 = m.sample_id));

-- 4. Neighbour-joining tree of the samples
SELECT newick FROM skbio.tree.neighbor_joining((SELECT * FROM bc));

Data-shape conventions

  • Feature tables are long: (sample_id, feature_id, count) (or value). This is the natural SQL shape and avoids a fixed, data-dependent column width. PIVOT if you need a wide matrix.
  • Distance matrices are long: (id_1, id_2, distance) — exactly what beta_diversity emits, and what pcoa/permanova/anosim/mantel/ neighbor_joining consume. The matrix is symmetrised on read, so a full square or a triangle both work (the grouped tests need every sample to appear as id_1, which the full square from beta_diversity guarantees).
  • Column names default positionally and can be overridden with named args (sample :=, feature :=, count :=, id_1 :=, …).
  • Bad rows degrade to NULL in the sequence scalars rather than raising.

Development

The repo is an installable package (pyproject.toml, hatchling, uv.lock).

uv sync                    # resolve deps (scikit-bio, numpy, pandas, scipy, vgi-python[http])
uv run pytest tests/ -q    # unit tests
uv run ruff check .        # lint
uv run ruff format --check .
uv run mypy vgi_scikit_bio/

make test-stdio            # DuckDB sqllogictest suite, worker as a subprocess (authoritative)
make test-http             # same suite against a local HTTP server

The unit tests exercise each function's logic in-process; the SQL suite (test/sql/*.test) is authoritative because it drives the real DuckDB → VGI → worker path. It needs a sqllogictest runner with the vgi extension — CI uses a prebuilt standalone haybarn-unittest plus INSTALL vgi FROM community (no C++ build). See ci/README.md.

To add a function: implement it in the relevant vgi_scikit_bio/*.py, export it from that module's *_FUNCTIONS list, and splice the list into _SCHEMA_FUNCTIONS in vgi_scikit_bio/worker.py. See CLAUDE.md for the framework patterns and sharp edges.

Deployment

One multi-arch container image (ghcr.io/query-farm/vgi-scikit-bio) serves both transports — HTTP (default) and stdio — via docker-entrypoint.sh. It is built, tested against the signed vgi extension, and signed by CI on every tag. The image mounts a /data volume for the shared framework BoundStorage. make deploy points Fly.io at the published image.

License & support

MIT — see LICENSE. Copyright 2026 Query Farm LLC.

Developed and maintained by Query.Farm. For bug reports and feature requests use the issue tracker; for commercial support and SLAs see SUPPORT.md or email hello@query.farm.

Built on scikit-bio — if you use it in research, please also cite scikit-bio.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vgi_scikit_bio-0.1.1.tar.gz (80.6 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

vgi_scikit_bio-0.1.1-py3-none-any.whl (72.4 kB view details)

Uploaded Python 3

File details

Details for the file vgi_scikit_bio-0.1.1.tar.gz.

File metadata

  • Download URL: vgi_scikit_bio-0.1.1.tar.gz
  • Upload date:
  • Size: 80.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for vgi_scikit_bio-0.1.1.tar.gz
Algorithm Hash digest
SHA256 5d93138d99775355eef40da7bd989447496b9f1ae5536cb0bb7728149170c19d
MD5 40fe503a0877e1edd3f65813bb2667f2
BLAKE2b-256 0c4d826d68b854d8160041be2574f9da7e04b9f774e2dab1f77289d037bd90b7

See more details on using hashes here.

File details

Details for the file vgi_scikit_bio-0.1.1-py3-none-any.whl.

File metadata

  • Download URL: vgi_scikit_bio-0.1.1-py3-none-any.whl
  • Upload date:
  • Size: 72.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.26 {"installer":{"name":"uv","version":"0.11.26","subcommand":["publish"]},"python":null,"implementation":{"name":null,"version":null},"distro":{"name":"Ubuntu","version":"24.04","id":"noble","libc":null},"system":{"name":null,"release":null},"cpu":null,"openssl_version":null,"setuptools_version":null,"rustc_version":null,"ci":true}

File hashes

Hashes for vgi_scikit_bio-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 5c5be850f7bfc1770d20fccf681ff5cc08708a8e57bf59877a40467d562e3000
MD5 da09af629de3b178a8d0f48e490dc272
BLAKE2b-256 25a7bab679444882514a0d77b7a6ca8e61b5a77921937a26ee12890c7b01c306

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page