Genomic allele frequency query engine with bitmap-encoded genotypes

AFQuery

Fast, capture-aware allele frequency queries on local genomic cohorts — without re-scanning VCFs.

AFQuery is a bitmap-indexed engine that recomputes AC/AN/AF for arbitrary subcohorts (by phenotype, sex, sequencing technology, or any combination) in tens of milliseconds, independently of cohort size. It accounts for capture-kit heterogeneity, ploidy on sex chromosomes, and FILTER/coverage evidence, and runs locally as a file-based system (Parquet + SQLite) with no server or cluster required.

Full documentation →

Headline results

~14 ms point queries, constant from 1,000 to 50,000 samples (O(1) scaling).
33× faster than bcftools for full-chromosome bulk export at 2,504 samples; R² > 0.99999 AF concordance over 1.1M common variants.
~25,000 variants/s VCF annotation on 4 cores — a typical exome annotates in ~1 second.
Up to 45× reduction in toward-pathogenic ACMG classification errors vs. naive AN on mixed-capture-kit cohorts.
9.5×–13.4× storage compression vs. input single-sample VCFs.

Why AFQuery

Local allele frequencies are a core input to ACMG/AMP variant classification (criteria BA1, BS1, PM2), and global resources like gnomAD systematically underrepresent local ancestries and disease-enriched institutional cohorts. Computing AF on your own cohort sounds simple until the cohort mixes WGS, WES, and panels, or several versions of the same capture kit. Successive Agilent SureSelect versions (v5/v6/v7), for example, share only ~57% of their targets on chromosome 22. Naive AN counting treats samples at positions outside their kit's targets as if they had been genotyped, which inflates AN and deflates AF. Because BA1 and BS1 are benign criteria triggered by a high AF, while PM2 is a pathogenic criterion triggered by a low one, the deflated AF systematically shifts variants toward "pathogenic" by ACMG criteria.

AFQuery solves this with per-position, per-technology capture-aware AN, ploidy-aware sex chromosome handling, and an explicit N_NO_COVERAGE channel that separates trusted hom-ref from "we cannot tell". Queries on subcohorts are answered by intersecting precomputed Roaring Bitmaps, so latency is constant in cohort size.
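
To make the bias concrete, here is a small worked example in Python. It is an illustration only: the technology names follow the manifest example in this README, but the sample counts, the carrier count, and the assumption that the position is autosomal and missing from one exome kit are all made up.

```python
# Hypothetical cohort: sample counts per technology (illustrative numbers only).
samples = {"wgs": 200, "wes_v6": 500, "wes_v7": 300}
# Assumption: this autosomal position lies outside the wes_v7 targets.
covers_position = {"wgs": True, "wes_v6": True, "wes_v7": False}
ac = 7  # alt alleles observed among samples that could be genotyped here

# Naive AN counts two alleles for every sample, covered or not.
naive_an = 2 * sum(samples.values())
# Capture-aware AN counts only samples whose technology covers the position.
aware_an = 2 * sum(n for tech, n in samples.items() if covers_position[tech])

naive_af = ac / naive_an
aware_af = ac / aware_an
print(f"naive:         AN={naive_an} AF={naive_af:.4f}")
print(f"capture-aware: AN={aware_an} AF={aware_af:.4f}")
```

With these numbers the naive AF (7/2000) is 30% lower than the capture-aware AF (7/1400), purely because 300 samples that could never have shown the variant were counted in the denominator.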

When to use AFQuery

You need allele frequencies for phenotype-defined or arbitrarily filtered subcohorts.
Your cohort mixes sequencing technologies (WGS, WES, panels, multiple capture-kit versions).
You want fast, repeated, interactive queries instead of one-off VCF re-scans.
You need a local, reproducible workflow — no cloud, no Spark cluster.

Features

Constant-time subcohort queries — bitmap intersections at query time; no per-query VCF re-scan.
Capture-aware AN — per-position eligibility from each technology's BED, eliminating systematic AF bias when mixing WGS / WES / panels and kit versions.
Ploidy-aware sex chromosomes — correct AN on chrX PAR / non-PAR, chrY, chrM, by sample sex.
Coverage evidence model — N_NO_COVERAGE separates trusted hom-ref from samples lacking sufficient evidence; query-time gates (--min-pass, --min-observed, --min-quality-evidence) keep AF conservative.
ACMG-compatible AC/AN/AF — per-standard definitions, exposed as both query output and VCF INFO fields.
Flexible metadata filtering — arbitrary phenotype labels (ICD-10, HPO, OMIM, custom tags), inclusion or exclusion (^ prefix), combined with sex and technology.
Parallel VCF annotation — multi-threaded; adds AFQUERY_AC/AN/AF/N_HET/N_HOM_ALT/N_HOM_REF/N_FAIL/N_NO_COVERAGE INFO fields.
Bulk CSV export — per-variant frequencies with optional disaggregation by sex, technology, or phenotype.
Incremental updates — add or remove samples, edit phenotype/sex metadata, compact storage, without full rebuilds.
Audit changelog — every database operation is logged with timestamps and operator notes.
Database validation — afquery check with scripted exit codes.
Serverless — Parquet + SQLite on disk; no daemon, no Java, no Spark.
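
As a rough illustration of what ploidy-aware AN means, the sketch below counts alleles per sample by chromosome and sex. It is not AFQuery's implementation: the function names are hypothetical, PAR handling on chrY is ignored, and treating chrM as one allele per sample is an assumption.

```python
def alleles_per_sample(chrom: str, sex: str, in_par: bool = False) -> int:
    """Alleles one sample contributes to AN at a position (simplified model)."""
    if chrom == "chrX":
        # Females are diploid on chrX; males are haploid outside the PARs.
        return 2 if sex == "female" or in_par else 1
    if chrom == "chrY":
        # Assumption: males contribute one allele, females none (PARs ignored).
        return 1 if sex == "male" else 0
    if chrom == "chrM":
        # Assumption: the mitochondrial genome is treated as haploid.
        return 1
    return 2  # autosomes


def ploidy_aware_an(chrom: str, n_female: int, n_male: int, in_par: bool = False) -> int:
    """AN for a cohort of eligible samples, split by sex."""
    return (n_female * alleles_per_sample(chrom, "female", in_par)
            + n_male * alleles_per_sample(chrom, "male", in_par))


# 60 females and 40 males at a non-PAR chrX position: 60*2 + 40*1 alleles.
print(ploidy_aware_an("chrX", n_female=60, n_male=40))
```

Counting two alleles for every sample at the same position would give AN = 200 instead of 160, deflating AF on chrX in the same way that ignoring capture targets does on autosomes.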

Installation

# PyPI
pip install afquery

# Bioconda
conda install -c bioconda afquery

# Docker (linux/amd64, linux/arm64)
docker pull ghcr.io/babelomics/afquery:latest

# From source
git clone https://github.com/babelomics/afquery.git
cd afquery
pip install -e .

Requires Python ≥ 3.10. Core dependencies: pyroaring, pyarrow, duckdb, pyranges, cyvcf2, click, tqdm.

Quickstart

1. Prepare a manifest

One row per sample (TSV, header required):

sample_name	vcf_path	sex	tech_name	phenotype_codes
SAMP_001	/data/vcfs/SAMP_001.vcf.gz	female	wgs	E11.9,I10
SAMP_002	/data/vcfs/SAMP_002.vcf.gz	male	wes_v6	E11.9
SAMP_003	/data/vcfs/SAMP_003.vcf.gz	female	panel_card	I42.0

For every non-WGS technology, place a 3-column BED in --bed-dir named <tech_name>.bed.
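
Before building, it can help to check the manifest against the BED directory. The helper below is a hypothetical pre-flight check, not part of AFQuery; it assumes the five column names shown above and that WGS samples use the technology name wgs, as in the example rows.

```python
import csv
from pathlib import Path

REQUIRED = ["sample_name", "vcf_path", "sex", "tech_name", "phenotype_codes"]


def check_manifest(manifest_path, bed_dir, wgs_names=("wgs",)):
    """Return a list of problems found in a tab-separated manifest."""
    problems = []
    with open(manifest_path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        seen = set()
        for row in reader:
            name, tech = row["sample_name"], row["tech_name"]
            if name in seen:
                problems.append(f"duplicate sample_name: {name}")
            seen.add(name)
            # Every non-WGS technology needs <tech_name>.bed in the BED directory.
            if tech not in wgs_names and not (Path(bed_dir) / f"{tech}.bed").is_file():
                problems.append(f"{name}: no BED file for technology {tech}")
    return problems
```

For example, check_manifest("manifest.tsv", "./beds/") returns an empty list when every non-WGS technology in the manifest has a matching BED file.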

2. Build the database

afquery create-db \
  --manifest manifest.tsv \
  --output-dir ./db/ \
  --genome-build GRCh38 \
  --bed-dir ./beds/

3. Query

# Single locus
afquery query --db ./db/ --locus chr1:925952

# Locus filtered to a phenotype-and-sex subcohort
afquery query --db ./db/ --locus chr1:925952 --phenotype E11.9 --sex female

# Genomic region
afquery query --db ./db/ --region chr1:900000-1000000

# Batch from file (chrom pos [ref [alt]] per line)
afquery query --db ./db/ --from-file variants.tsv
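
A batch file in the "chrom pos [ref [alt]]" layout can be generated from Python. This is a sketch under assumptions: the fields are taken to be tab-separated (suggested by the .tsv extension but not stated above), the helper name is hypothetical, and the G and A alleles are placeholders rather than real genotypes.

```python
def format_batch(variants):
    """Render (chrom, pos[, ref[, alt]]) tuples as one tab-separated line each."""
    lines = []
    for variant in variants:
        if not 2 <= len(variant) <= 4:
            raise ValueError(f"expected 2 to 4 fields, got {variant!r}")
        lines.append("\t".join(str(field) for field in variant))
    return "\n".join(lines) + "\n"


# Loci reused from the examples in this README; the alleles are placeholders.
text = format_batch([("chr1", 925952), ("chr17", 43093454, "G", "A")])
print(text, end="")
```

Writing the returned text to variants.tsv gives a file that can be passed to --from-file.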

4. Inspect carriers of a variant

afquery variant-info --db ./db/ --locus chr17:43093454

5. Annotate a VCF

afquery annotate \
  --db ./db/ \
  --input patient.vcf \
  --output patient.annotated.vcf \
  --threads 4

6. Export

# Region export, disaggregated by sex
afquery dump --db ./db/ --chrom chr17 --start 43044292 --end 43170327 \
  --output brca1.csv --by-sex

7. Update

afquery update-db --db ./db/ --add-samples new_batch.tsv
afquery update-db --db ./db/ --update-sample SAMP_007 --set-phenotype I42.0

Output fields

Every query and annotated VCF reports:

Field	Meaning
`AC`	Alt-allele count over eligible samples (FILTER=PASS)
`AN`	Total alleles considered, ploidy- and capture-aware
`AF`	`AC / AN`
`N_HET`	Heterozygous PASS carriers
`N_HOM_ALT`	Homozygous-alt PASS carriers
`N_HOM_REF`	Samples trusted as homozygous reference
`N_FAIL`	Carriers with FILTER ≠ PASS (excluded from AC/AN)
`N_NO_COVERAGE`	Non-carriers on a partially-covered tech without sufficient evidence to call hom-ref

VCF INFO field names use the AFQUERY_ prefix (e.g. AFQUERY_AF, AFQUERY_N_NO_COVERAGE).
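
The relationship between these fields can be checked on an annotated record. The sketch below parses AFQUERY_* entries from a VCF INFO string with plain Python. The example INFO string and its values are invented, one value per key is assumed (no multi-allelic, comma-separated values), and the identity AC = N_HET + 2 × N_HOM_ALT is assumed to hold only at diploid positions.

```python
def afquery_info(info: str) -> dict:
    """Extract AFQUERY_* key=value pairs from a VCF INFO string."""
    fields = {}
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key.startswith("AFQUERY_"):
            name = key.removeprefix("AFQUERY_")
            # AF is a fraction; every other field is a count.
            fields[name] = float(value) if name == "AF" else int(value)
    return fields


# Hypothetical INFO column from an annotated record at a diploid position.
info = ("DP=812;AFQUERY_AC=3;AFQUERY_AN=1400;AFQUERY_AF=0.00214;"
        "AFQUERY_N_HET=3;AFQUERY_N_HOM_ALT=0")
f = afquery_info(info)

# Each het carrier adds one alt allele, each hom-alt carrier adds two.
assert f["AC"] == f["N_HET"] + 2 * f["N_HOM_ALT"]
# AF is AC / AN, up to the rounding used when the value was written.
assert abs(f["AF"] - f["AC"] / f["AN"]) < 1e-5
```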

CLI commands

Command	Purpose
`create-db`	Build a database from a manifest of single-sample VCFs
`query`	Point / region / batch AF queries
`variant-info`	List samples carrying a variant, with metadata
`annotate`	Annotate a VCF with cohort `AFQUERY_*` INFO fields
`dump`	Bulk CSV export, optionally disaggregated
`update-db`	Add / remove samples, edit metadata, compact
`info`	Show database metadata, sample list, changelog
`check`	Validate database integrity (scripted exit code)
`version show` / `version set`	Inspect or set the database version label
`benchmark`	Run synthetic or on-database performance benchmarks

See the CLI reference for all options.
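
Because check reports its result through the exit code, it can gate a pipeline step. The wrapper below is a minimal sketch: the function name is hypothetical, a zero exit status is assumed to mean the database is valid (the usual convention, not confirmed here), and the --db flag for check is assumed by analogy with the other commands.

```python
import subprocess


def command_succeeded(cmd):
    """Run a command and return (exit status was 0, captured stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stderr.strip()


# Assumed invocation; consult the CLI reference for the actual options.
# ok, message = command_succeeded(["afquery", "check", "--db", "./db/"])
# if not ok:
#     raise SystemExit(f"database validation failed: {message}")
```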

How it works

AFQuery indexes each variant as three Roaring Bitmaps — heterozygous PASS carriers, homozygous-alt PASS carriers, and FILTER≠PASS carriers — stored in Apache Parquet, partitioned by chromosome and 1-Mbp positional buckets. Sample metadata (sex, technology, phenotype) is precomputed as bitmaps in SQLite. A query resolves its sample filter into a single candidate bitmap by intersection/difference in microseconds, then intersects it against each variant's genotype bitmaps to compute AC. AN is computed per position from the same candidate bitmap, restricted to samples whose technology actually covers the position (via BED-derived capture indices) and adjusted for ploidy on sex chromosomes. See the data model reference for details.
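
The query path described above can be sketched in a few lines. This is a toy model, not AFQuery's code: plain Python sets of sample IDs stand in for Roaring Bitmaps, the metadata keys, sample IDs and genotypes are invented, the position is assumed to be autosomal, and the N_NO_COVERAGE evidence model is left out.

```python
# Metadata bitmaps: one set of sample IDs per label (sets stand in for bitmaps).
metadata = {
    "sex:female": {0, 2, 4, 5},
    "pheno:E11.9": {0, 1, 4},
    "tech:wgs": {0, 1},
    "tech:wes_v6": {2, 3, 4, 5},
}
# Genotype bitmaps for one hypothetical variant.
variant = {"het": {1, 4}, "hom_alt": {0}, "fail": {5}}
# Technologies whose targets cover this position (derived from the BED files).
covering_techs = ["tech:wgs", "tech:wes_v6"]


def query(include, exclude=()):
    """Compute AC/AN/AF for the subcohort matching all include labels."""
    # Resolve the sample filter into a single candidate set.
    candidates = set.intersection(*(metadata[k] for k in include))
    for k in exclude:
        candidates -= metadata[k]
    # Restrict to samples whose technology covers the position,
    # and drop FILTER != PASS carriers from both AC and AN.
    covered = set().union(*(metadata[t] for t in covering_techs))
    eligible = (candidates & covered) - variant["fail"]
    n_het = len(eligible & variant["het"])
    n_hom_alt = len(eligible & variant["hom_alt"])
    ac = n_het + 2 * n_hom_alt
    an = 2 * len(eligible)  # autosomal: two alleles per eligible sample
    return {"AC": ac, "AN": an, "AF": ac / an if an else None}


print(query(["sex:female", "pheno:E11.9"]))
print(query(["pheno:E11.9"], exclude=["sex:female"]))
```

The work per variant is a handful of set intersections against one precomputed candidate set, which is why no VCF has to be re-read at query time.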

How AFQuery compares

Feature	AFQuery	bcftools	GATK GenomicsDB	Hail
Capture-aware AN	Yes	No	No	No
Metadata filtering	Arbitrary labels	No	No	Custom code
Ploidy-aware sex chromosomes	Yes	Manual	No	Manual
Dynamic subcohort queries	Yes	No	Limited	Requires code
FILTER / coverage tracking	Per variant	Manual	No	Manual
Incremental updates	Yes	No	Yes	No
Infrastructure required	None	None	Java/server	Spark cluster

Benchmarks vs. bcftools (1000 Genomes Phase 3, n = 2,504, chr22)

Workload	AFQuery	bcftools	Speedup
Full-chromosome AC/AN/AF export	~7.0 s	~3.8 min	~33×

AF concordance over 1,106,181 common variants: R² > 0.99999.

Point-query latency on AFQuery is ~14 ms and constant from 1K to 50K samples (median over 50 replicates, warm cache).

Documentation

Citation

If you use AFQuery in your work, please cite:

AFQuery: fast, capture-aware allele frequency queries on local genomic cohorts. (manuscript in preparation)

License

MIT

Release history

Version	Released
0.3.3 (this version)	May 14, 2026
0.3.2	May 13, 2026
0.3.1	May 12, 2026
0.3.0	May 7, 2026
0.2.2	Mar 24, 2026
0.2.1	Mar 23, 2026
0.2.0	Mar 23, 2026
0.1.4	Mar 16, 2026
0.1.3	Mar 16, 2026
0.1.2.2	Mar 18, 2026
0.1.2.1	Mar 16, 2026
0.1.2	Mar 16, 2026
0.1.1	Mar 16, 2026
0.1.0	Mar 16, 2026

Download files

Source Distribution

afquery-0.3.3.tar.gz (1.2 MB)

Uploaded May 14, 2026 Source

Built Distribution

afquery-0.3.3-py3-none-any.whl (74.1 kB)

Uploaded May 14, 2026 Python 3

File details

Details for the file afquery-0.3.3.tar.gz.

File metadata

Download URL: afquery-0.3.3.tar.gz
Upload date: May 14, 2026
Size: 1.2 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for afquery-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`c6e4824892e328cd31b1a582ff46a65a9fade096564ac710dadf84582c1b94c9`
MD5	`f53a707638f49ec3c40c12c177d34bf2`
BLAKE2b-256	`86ad3d01e09704d306862fa69bede19c074a95a799500378ea5554923b4df224`

Provenance

The following attestation bundles were made for afquery-0.3.3.tar.gz:

Publisher: release.yml on babelomics/afquery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: afquery-0.3.3.tar.gz
- Subject digest: c6e4824892e328cd31b1a582ff46a65a9fade096564ac710dadf84582c1b94c9
- Sigstore transparency entry: 1537998695
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: babelomics/afquery@df69af58cc9f6d759d3bd54c3a069962a143c442
- Branch / Tag: refs/tags/v0.3.3
- Owner: https://github.com/babelomics
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@df69af58cc9f6d759d3bd54c3a069962a143c442
- Trigger Event: push

File details

Details for the file afquery-0.3.3-py3-none-any.whl.

File metadata

Download URL: afquery-0.3.3-py3-none-any.whl
Upload date: May 14, 2026
Size: 74.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for afquery-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`869e12fd7ee550ef0537774d61081cd86f653fc3bb6e69aad1a2f38f16f524fc`
MD5	`ca77254e75c66326a31f06d5838b2235`
BLAKE2b-256	`503ee7b5526d83be72e1e5ccb2d07f307598b034ebb70dabd68a133fdb020a50`

Provenance

The following attestation bundles were made for afquery-0.3.3-py3-none-any.whl:

Publisher: release.yml on babelomics/afquery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: afquery-0.3.3-py3-none-any.whl
- Subject digest: 869e12fd7ee550ef0537774d61081cd86f653fc3bb6e69aad1a2f38f16f524fc
- Sigstore transparency entry: 1537998819
- Sigstore integration time: May 14, 2026
Source repository:
- Permalink: babelomics/afquery@df69af58cc9f6d759d3bd54c3a069962a143c442
- Branch / Tag: refs/tags/v0.3.3
- Owner: https://github.com/babelomics
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@df69af58cc9f6d759d3bd54c3a069962a143c442
- Trigger Event: push
