Genomic allele frequency query engine with bitmap-encoded genotypes

AFQuery

PyPI Bioconda Docker Python License: MIT

AFQuery enables fast allele frequency queries on user-defined subsets of local genomic cohorts, without rescanning VCFs.

AFQuery is a bitmap-indexed engine that efficiently recomputes AC/AN/AF for dynamically defined subcohorts (e.g., by phenotype, sex, or sequencing technology), a common requirement in ACMG/AMP variant classification. It stores per-variant genotype data as Roaring Bitmaps in Parquet files and resolves sample filters into bitmaps that can be intersected in microseconds, enabling sub-100 ms queries on large cohorts. The system accounts for ploidy in sex chromosomes, adjusts AN based on sequencing technology, supports incremental updates, and runs locally using a file-based setup (Parquet + SQLite) without requiring server or cloud infrastructure.

Full Documentation→

When to use AFQuery

  • You need allele frequencies for phenotype or user-defined subcohorts
  • You work with mixed sequencing technologies or capture-kit versions (WGS, WES, targeted panels)
  • You require fast, repeated queries without rescanning VCFs
  • You want a local, reproducible workflow without cloud or cluster dependencies

Features

  • Dynamic subcohort queries (<100 ms) — bitmap intersections at query time; no VCF re-scan required
  • Technology-aware — avoids bias when mixing WGS, WES, and panels using different BED capture indexes
  • Ploidy-aware — correct handling of sex chromosomes (PAR/non-PAR, chrX, chrY)
  • ACMG-compatible allele counting — AC/AN/AF computed per standard definitions
  • Flexible metadata filtering — arbitrary labels (ICD-10, HPO, custom fields) with inclusion/exclusion rules
  • Incremental updates — add or remove samples and update metadata without rebuilding the database
  • VCF annotation — annotate variants using subcohort-specific frequencies
  • FILTER/call quality tracking — failed calls (FILTER!=PASS) tracked per variant and reported as N_FAIL
  • Batch and region queries — query a single locus, a genomic region, or a list of variants from a file
  • Bulk CSV export — export all variant frequencies with optional disaggregation by sex, technology, or phenotype
  • Audit changelog — all database operations logged with timestamps and operator notes
  • Database validation — integrity checks with scripted exit codes
  • Portable and serverless — file-based system, no infrastructure required
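To illustrate the technology-aware bullet above, here is a minimal sketch of how a position can be checked against per-technology capture regions so that only samples actually assayed at that position contribute to AN. The technology names, coordinates, and function names are hypothetical, and a sorted-interval lookup with `bisect` stands in for the BED-indexed interval trees that AFQuery is described as using; this is not AFQuery's implementation.

```python
from bisect import bisect_right

# Hypothetical capture regions per technology: sorted, non-overlapping,
# 0-based half-open (start, end) intervals as in BED. None means whole genome.
CAPTURE = {
    "WGS": None,
    "WES_v1": [(100, 200), (500, 650)],
    "PANEL_A": [(520, 560)],
}

def covered(tech, pos):
    """Return True if 0-based position `pos` is assayed by `tech`."""
    regions = CAPTURE[tech]
    if regions is None:
        return True
    starts = [start for start, _ in regions]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    return i >= 0 and pos < regions[i][1]

def eligible_samples(sample_tech, pos):
    """Samples whose technology covers `pos`; only these would count toward AN."""
    return {s for s, tech in sample_tech.items() if covered(tech, pos)}

# Hypothetical cohort: sample name -> sequencing technology
sample_tech = {"s1": "WGS", "s2": "WES_v1", "s3": "PANEL_A", "s4": "WES_v1"}

print(sorted(eligible_samples(sample_tech, 530)))  # inside every capture set
print(sorted(eligible_samples(sample_tech, 150)))  # outside the panel
print(sorted(eligible_samples(sample_tech, 300)))  # covered by WGS only
```

Counting an uncovered sample as homozygous reference would inflate AN and deflate AF, which is the bias the technology-aware feature is meant to avoid.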

Performance

  • Query latency: <100 ms (tested up to 50,000 samples)
  • Storage: ~2 bytes/sample/variant
  • Scales to millions of variants per chromosome

Comparison with Alternative Tools

Feature                      | AFQuery          | bcftools | GATK GenomicsDB | Hail
-----------------------------+------------------+----------+-----------------+--------------
Technology-aware AN          | Yes              | No       | No              | No
Metadata filtering           | Arbitrary labels | No       | No              | Custom code
Ploidy-aware sex chromosomes | Yes              | Manual   | No              | Manual
Dynamic subcohort queries    | Yes              | No       | Limited         | Requires code
FILTER/call quality tracking | Per variant      | Manual   | No              | Manual
Incremental updates          | Yes              | No       | Yes             | No
Infrastructure required      | None             | None     | Java/server     | Spark cluster
Query latency (50K samples)  | <100 ms          | ~5 min   | <1 min          | 1–2 min

Algorithm Overview

AFQuery pre-indexes per-variant genotype data as Roaring Bitmaps stored in Parquet files. Each variant row holds three bitmaps: heterozygous carriers, homozygous alt carriers, and samples with FILTER!=PASS. Sample metadata (sex, phenotype, technology) is pre-serialized as bitmaps in SQLite.
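The three-bitmap layout can be sketched as follows. Python integers are used as stand-in bitsets rather than real Roaring Bitmaps, and the function name and input tuple format are hypothetical. The sketch also assumes that a failed call is recorded only in the FILTER bitmap and not as a carrier; the source does not say how AFQuery treats that case.

```python
def encode_variant(calls):
    """Encode one variant's diploid calls into (het, hom_alt, fail) bitsets.

    `calls` is a list of (sample_index, alt_allele_count, filter_pass) tuples.
    Bit i of each returned integer is set if sample i belongs to that set.
    """
    het = hom = fail = 0
    for idx, alt_count, passed in calls:
        bit = 1 << idx
        if not passed:
            fail |= bit      # assumption: failed calls are tracked separately
        elif alt_count == 1:
            het |= bit
        elif alt_count == 2:
            hom |= bit
        # alt_count == 0 (homozygous reference) is implicit: no bit is set
    return het, hom, fail

def members(bits):
    """List the sample indices whose bits are set."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]

het, hom, fail = encode_variant([
    (0, 1, True),   # heterozygous, PASS
    (1, 0, True),   # homozygous reference
    (2, 2, True),   # homozygous alt, PASS
    (3, 1, False),  # heterozygous call that failed FILTER
])
print(members(het), members(hom), members(fail))
```

Because most samples are homozygous reference at most sites, only the carriers need to be stored, which is what makes a sparse bitmap encoding compact.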

At query time, the requested sample filter is resolved to a single candidate bitmap via bitmap intersections and differences — taking microseconds regardless of cohort size. For each variant, the candidate bitmap is intersected with the genotype bitmaps to compute AC/AN/AF. AN accounts for WES capture regions (via BED-indexed interval trees) and for ploidy on sex chromosomes (males are haploid on non-PAR chrX and chrY).
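The query steps above can be sketched as below, again with Python integers standing in for Roaring Bitmaps. All names and the six-sample cohort are hypothetical. The sketch assumes that a haploid sample present in either genotype bitmap contributes one alternate allele, and it leaves out the capture-region adjustment to AN; neither detail is confirmed by the source.

```python
def popcount(bits):
    """Number of set bits, i.e. number of samples in the set."""
    return bin(bits).count("1")

def resolve_filter(all_samples, include=(), exclude=()):
    """Intersect the inclusion bitsets, then subtract the exclusion bitsets."""
    candidate = all_samples
    for bm in include:
        candidate &= bm
    for bm in exclude:
        candidate &= ~bm
    return candidate

def allele_stats(candidate, het, hom, haploid=0):
    """Return (AC, AN, AF) for the candidate samples at one variant.

    `haploid` marks samples with a single allele at this site,
    e.g. males on non-PAR chrX.
    """
    hap = candidate & haploid
    dip = candidate & ~hap
    ac = popcount(het & dip) + 2 * popcount(hom & dip) + popcount((het | hom) & hap)
    an = 2 * popcount(dip) + popcount(hap)
    return ac, an, (ac / an if an else None)

# Hypothetical cohort of six samples (bit i = sample i)
ALL = 0b111111
phenotype = 0b011110   # samples 1-4 carry the phenotype label
excluded = 0b010000    # sample 4 carries an exclusion label
male = 0b111000        # samples 3-5

candidate = resolve_filter(ALL, include=[phenotype], exclude=[excluded])
het, hom = 0b100010, 0b001000   # het: samples 1 and 5; hom alt: sample 3

print(allele_stats(candidate, het, hom))                # autosome
print(allele_stats(candidate, het, hom, haploid=male))  # non-PAR chrX
```

In the autosomal case the three candidates contribute six alleles; on non-PAR chrX the male candidate contributes only one allele, so both AC and AN drop.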

Input Requirements

  • VCF files: normalized and consistent with the selected genome build (GRCh37 or GRCh38)
  • Sample metadata: must include sex, sequencing technology, and any fields used for filtering (e.g., phenotype)
  • BED files (optional): define capture regions for each sequencing technology

Quick Start

Example workflow from raw VCFs to query, export, and annotation:

pip install afquery
# Docker: see Installation docs for docker pull / run usage

# Build the database
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38

# Inspect the database
afquery info --db ./db/

# Query a single position, filtered by phenotype and sex
afquery query --db ./db/ --locus chr1:925952 --phenotype E11.9 --sex female

# Query a genomic region
afquery query --db ./db/ --region chr1:900000-1000000

# Export BRCA1 variant frequencies to CSV
afquery dump --db ./db/ --output all_variants.csv --chrom chr17 --start 43044292 --end 43170327

# Annotate a VCF with cohort frequencies
afquery annotate --db ./db/ --input patient.vcf --output annotated.vcf --threads 12

# Add new samples to an existing database
afquery update-db --db ./db/ --add-samples new_samples.tsv
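For scripted use, the single-locus query above could be assembled from Python as in the following sketch. It uses only the subcommand and flags shown in the Quick Start; the helper name is hypothetical, and since the output format of `afquery query` is not described here, the sketch builds the command and leaves execution and parsing to the caller.

```python
import shlex

def build_query_command(db, locus, phenotype=None, sex=None):
    """Build the argument list for a single-locus `afquery query` call."""
    cmd = ["afquery", "query", "--db", db, "--locus", locus]
    if phenotype is not None:
        cmd += ["--phenotype", phenotype]
    if sex is not None:
        cmd += ["--sex", sex]
    return cmd

cmd = build_query_command("./db/", "chr1:925952", phenotype="E11.9", sex="female")
print(shlex.join(cmd))

# To run it (requires afquery installed and a built database):
#   import subprocess
#   result = subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the command as a list avoids shell quoting problems when labels contain spaces or special characters.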

Citation

If you use AFQuery, please cite:

AFQuery: fast, metadata-aware allele frequency queries on local genomic cohorts.
(manuscript in preparation)
