Genomic allele frequency query engine with bitmap-encoded genotypes

AFQuery

PyPI Bioconda Docker Python License: MIT

AFQuery enables fast allele frequency queries on user-defined subsets of local genomic cohorts, without rescanning VCFs.

AFQuery is a bitmap-indexed engine that efficiently recomputes AC/AN/AF for dynamically defined subcohorts (e.g., by phenotype, sex, or sequencing technology), a common requirement in ACMG/AMP variant classification. It stores per-variant genotype data as Roaring Bitmaps in Parquet files and resolves sample filters into bitmaps that can be intersected in microseconds, enabling sub-100 ms queries on large cohorts. The system accounts for ploidy in sex chromosomes, adjusts AN based on sequencing technology, supports incremental updates, and runs locally using a file-based setup (Parquet + SQLite) without requiring server or cloud infrastructure.

Full Documentation→

When to use AFQuery

  • You need allele frequencies for phenotype or user-defined subcohorts
  • You work with mixed sequencing technologies or capture kit versions (WGS, WES, targeted panels)
  • You require fast, repeated queries without rescanning VCFs
  • You want a local, reproducible workflow without cloud or cluster dependencies

Features

  • Dynamic subcohort queries (<100 ms) — bitmap intersections at query time; no VCF re-scan required
  • Technology-aware — avoids bias when mixing WGS, WES, and panels using different BED capture indexes
  • Ploidy-aware — correct handling of sex chromosomes (PAR/non-PAR, chrX, chrY)
  • ACMG-compatible allele counting — AC/AN/AF computed per standard definitions
  • Flexible metadata filtering — arbitrary labels (ICD-10, HPO, custom fields) with inclusion/exclusion rules
  • Incremental updates — add or remove samples and update metadata without rebuilding the database
  • VCF annotation — annotate variants using subcohort-specific frequencies
  • FILTER/call quality tracking — failed calls (FILTER!=PASS) tracked per variant and reported as N_FAIL
  • Batch and region queries — query a single locus, a genomic region, or a list of variants from a file
  • Bulk CSV export — export all variant frequencies with optional disaggregation by sex, technology, or phenotype
  • Audit changelog — all database operations logged with timestamps and operator notes
  • Database validation — integrity checks with scripted exit codes
  • Portable and serverless — file-based system, no infrastructure required

Performance

  • Query latency: <100 ms (tested up to 50,000 samples)
  • Storage: ~2 bytes/sample/variant
  • Scales to millions of variants per chromosome

Comparison with Alternative Tools

| Feature | AFQuery | bcftools | GATK GenomicsDB | Hail |
|---|---|---|---|---|
| Technology-aware AN | Yes | No | No | No |
| Metadata filtering | Arbitrary labels | No | No | Custom code |
| Ploidy-aware sex chromosomes | Yes | Manual | No | Manual |
| Dynamic subcohort queries | Yes | No | Limited | Requires code |
| FILTER/call quality tracking | Per variant | Manual | No | Manual |
| Incremental updates | Yes | No | Yes | No |
| Infrastructure required | None | None | Java/server | Spark cluster |
| Query latency (50K samples) | <100 ms | ~5 min | <1 min | 1–2 min |

Algorithm Overview

AFQuery pre-indexes per-variant genotype data as Roaring Bitmaps stored in Parquet files. Each variant row holds three bitmaps: heterozygous carriers, homozygous alt carriers, and samples with FILTER!=PASS. Sample metadata (sex, phenotype, technology) is pre-serialized as bitmaps in SQLite.
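
To make the three-bitmap layout concrete, here is a minimal sketch that encodes one variant's calls into heterozygous, homozygous-alt, and failed-filter bitmaps. It is an illustration only, not AFQuery's code: plain Python integers stand in for Roaring Bitmaps, the function name `encode_variant` is hypothetical, and the choice to leave failed calls out of the carrier bitmaps is an assumption the description above does not confirm.

```python
def encode_variant(alt_counts, filters):
    """Encode one variant as three bitmaps (Python ints used as bitsets).

    alt_counts: alt-allele count per sample index (0, 1, or 2)
    filters:    FILTER string per sample index
    Bit i of each returned integer corresponds to sample index i.
    """
    het = hom = fail = 0
    for idx, (alt_count, flt) in enumerate(zip(alt_counts, filters)):
        if flt != "PASS":
            fail |= 1 << idx
            # Assumption: a failed call is tracked only in the fail bitmap.
            continue
        if alt_count == 1:
            het |= 1 << idx
        elif alt_count == 2:
            hom |= 1 << idx
    return het, hom, fail


# Five samples: sample 1 is het, sample 2 is hom-alt, sample 3 failed FILTER.
het, hom, fail = encode_variant(
    [0, 1, 2, 1, 0],
    ["PASS", "PASS", "PASS", "LowQual", "PASS"],
)
```

Samples that are homozygous reference set no bit at all, which is why a sparse bitmap representation suits rare variants.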

At query time, the requested sample filter is resolved to a single candidate bitmap via bitmap intersections and differences — taking microseconds regardless of cohort size. For each variant, the candidate bitmap is intersected with the genotype bitmaps to compute AC/AN/AF. AN accounts for WES capture regions (via BED-indexed interval trees) and for ploidy on sex chromosomes (males are haploid on non-PAR chrX and chrY).
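
The query step can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than AFQuery's implementation: Python integers stand in for Roaring Bitmaps, the names `resolve_filter` and `allele_stats` and the metadata dictionary layout are hypothetical, the haploid branch covers only non-PAR chrX (on chrY females would contribute no alleles), and the technology-based AN adjustment and failed-call handling are omitted.

```python
def popcount(bitmap: int) -> int:
    """Number of set bits, i.e. number of samples in the bitmap."""
    return bin(bitmap).count("1")


def resolve_filter(metadata, include, exclude=()):
    """Intersect the included label bitmaps, then subtract excluded ones."""
    candidate = metadata["all"]
    for label in include:
        candidate &= metadata[label]
    for label in exclude:
        candidate &= ~metadata[label]
    return candidate


def allele_stats(het, hom, candidate, male, haploid_males=False):
    """Compute (AC, AN, AF) for one variant restricted to `candidate`."""
    het_c = het & candidate
    hom_c = hom & candidate
    if haploid_males:
        # Non-PAR chrX: males carry one allele, females two.
        an = popcount(candidate & male) + 2 * popcount(candidate & ~male)
        # Assumption: a male carrier counts once however it was encoded.
        ac = (popcount((het_c | hom_c) & male)
              + popcount(het_c & ~male)
              + 2 * popcount(hom_c & ~male))
    else:
        an = 2 * popcount(candidate)
        ac = popcount(het_c) + 2 * popcount(hom_c)
    af = ac / an if an else 0.0
    return ac, an, af


# Six samples: 0-2 female, 3-5 male; samples 0, 1, 3, 4 carry label E11.9.
metadata = {"all": 0b111111, "female": 0b000111,
            "male": 0b111000, "E11.9": 0b011011}
het, hom = 0b000001, 0b010000  # sample 0 het, sample 4 hom-alt

females_with_label = resolve_filter(metadata, ["E11.9", "female"])
autosomal = allele_stats(het, hom, females_with_label, metadata["male"])
chrx = allele_stats(het, hom, metadata["E11.9"], metadata["male"],
                    haploid_males=True)
```

Because the candidate bitmap is computed once per query and reused for every variant, the per-variant work reduces to a handful of bitwise operations and bit counts.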

Input Requirements

  • VCF files: normalized and consistent with the selected genome build (GRCh37 or GRCh38)
  • Sample metadata: must include sex, sequencing technology, and any fields used for filtering (e.g., phenotype)
  • BED files (optional): define capture regions for each sequencing technology
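
The BED capture regions matter because a sample should contribute to AN at a position only if its sequencing technology covers that position. The sketch below illustrates the idea under stated assumptions: it uses a binary search over sorted, non-overlapping intervals in place of the interval trees AFQuery describes, Python integers stand in for Roaring Bitmaps, and the names `covers`, `eligible_samples`, and `WES_kit` are hypothetical. Positions and intervals are 0-based and half-open, as in BED.

```python
from bisect import bisect_right


def covers(intervals, pos):
    """True if `pos` falls inside one of the sorted, non-overlapping
    half-open (start, end) intervals."""
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and intervals[i][0] <= pos < intervals[i][1]


def eligible_samples(candidate, tech_bitmaps, tech_intervals, pos):
    """Keep only candidate samples whose technology covers `pos`.

    A technology mapped to None is treated as covering the whole genome
    (for example WGS without a BED file).
    """
    eligible = 0
    for tech, bitmap in tech_bitmaps.items():
        intervals = tech_intervals.get(tech)
        if intervals is None or covers(intervals, pos):
            eligible |= bitmap
    return candidate & eligible


# Samples 0-1 are WGS; samples 2-3 were sequenced with a capture kit.
tech_bitmaps = {"WGS": 0b0011, "WES_kit": 0b1100}
tech_intervals = {"WGS": None, "WES_kit": [(100, 200), (500, 600)]}

inside = eligible_samples(0b1111, tech_bitmaps, tech_intervals, 150)
outside = eligible_samples(0b1111, tech_bitmaps, tech_intervals, 300)
```

At an uncovered position only the WGS samples remain, so a diploid AN would be computed from two samples rather than four, avoiding the dilution that would occur if uncovered exome samples were counted as reference.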

Quick Start

Example workflow from raw VCFs to query, export, and annotation:

pip install afquery
# Docker: see Installation docs for docker pull / run usage

# Build the database
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38

# Inspect the database
afquery info --db ./db/

# Query a single position, filtered by phenotype and sex
afquery query --db ./db/ --locus chr1:925952 --phenotype E11.9 --sex female

# Query a genomic region
afquery query --db ./db/ --region chr1:900000-1000000

# Export BRCA1 variant frequencies to CSV
afquery dump --db ./db/ --output all_variants.csv --chrom chr17 --start 43044292 --end 43170327

# Annotate a VCF with cohort frequencies
afquery annotate --db ./db/ --input patient.vcf --output annotated.vcf --threads 12

# Add new samples to an existing database
afquery update-db --db ./db/ --add-samples new_samples.tsv

Citation

If you use AFQuery, please cite:

AFQuery: fast, metadata-aware allele frequency queries on local genomic cohorts.
(manuscript in preparation)
