Genomic allele frequency query engine with bitmap-encoded genotypes
AFQuery
AFQuery enables fast allele frequency queries on user-defined subsets of local genomic cohorts, without rescanning VCFs.
AFQuery is a bitmap-indexed engine that efficiently recomputes AC/AN/AF for dynamically defined subcohorts (e.g., by phenotype, sex, or sequencing technology), a common requirement in ACMG/AMP variant classification. It stores per-variant genotype data as Roaring Bitmaps in Parquet files and resolves sample filters into bitmaps that can be intersected in microseconds, enabling sub-100 ms queries on large cohorts. The system accounts for ploidy in sex chromosomes, adjusts AN based on sequencing technology, supports incremental updates, and runs locally using a file-based setup (Parquet + SQLite) without requiring server or cloud infrastructure.
When to use AFQuery
- You need allele frequencies for phenotype or user-defined subcohorts
- You work with mixed sequencing technologies or capture kit versions (WGS, WES, targeted panels)
- You require fast, repeated queries without rescanning VCFs
- You want a local, reproducible workflow without cloud or cluster dependencies
Features
- Dynamic subcohort queries (<100 ms) — bitmap intersections at query time; no VCF re-scan required
- Technology-aware — avoids bias when mixing WGS, WES, and panels using different BED capture indexes
- Ploidy-aware — correct handling of sex chromosomes (PAR/non-PAR, chrX, chrY)
- ACMG-compatible allele counting — AC/AN/AF computed per standard definitions
- Flexible metadata filtering — arbitrary labels (ICD-10, HPO, custom fields) with inclusion/exclusion rules
- Incremental updates — add or remove samples and update metadata without rebuilding the database
- VCF annotation — annotate variants using subcohort-specific frequencies
- FILTER/call quality tracking — failed calls (FILTER!=PASS) tracked per variant and reported as N_FAIL
- Batch and region queries — query a single locus, a genomic region, or a list of variants from a file
- Bulk CSV export — export all variant frequencies with optional disaggregation by sex, technology, or phenotype
- Audit changelog — all database operations logged with timestamps and operator notes
- Database validation — integrity checks with scripted exit codes
- Portable and serverless — file-based system, no infrastructure required
Performance
- Query latency: <100 ms (tested up to 50,000 samples)
- Storage: ~2 bytes/sample/variant
- Scales to millions of variants per chromosome
Comparison with Alternative Tools
| Feature | AFQuery | bcftools | GATK GenomicsDB | Hail |
|---|---|---|---|---|
| Technology-aware AN | Yes | No | No | No |
| Metadata filtering | Arbitrary labels | No | No | Custom code |
| Ploidy-aware sex chromosomes | Yes | Manual | No | Manual |
| Dynamic subcohort queries | Yes | No | Limited | Requires code |
| FILTER/call quality tracking | Per variant | Manual | No | Manual |
| Incremental updates | Yes | No | Yes | No |
| Infrastructure required | None | None | Java/server | Spark cluster |
| Query latency (50K samples) | <100 ms | ~5 min | <1 min | 1–2 min |
Algorithm Overview
AFQuery pre-indexes per-variant genotype data as Roaring Bitmaps stored in Parquet files. Each variant row holds three bitmaps: heterozygous carriers, homozygous alt carriers, and samples with FILTER!=PASS. Sample metadata (sex, phenotype, technology) is pre-serialized as bitmaps in SQLite.
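To make the three-bitmap layout concrete, here is a minimal sketch that uses plain Python sets in place of Roaring Bitmaps. The `VariantRow` class, its field names, and the sample IDs are hypothetical and do not reflect AFQuery's actual schema. The counting rule (one alternate allele per heterozygous carrier, two per homozygous-alt carrier) is the standard diploid definition and is assumed here. How failed calls affect AC or AN is not specified above, so the sketch only counts them.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRow:
    """Hypothetical stand-in for one variant row; sets play the role of Roaring Bitmaps."""
    het: set[int] = field(default_factory=set)      # sample IDs with a heterozygous call
    hom_alt: set[int] = field(default_factory=set)  # sample IDs with a homozygous-alt call
    failed: set[int] = field(default_factory=set)   # sample IDs whose call has FILTER != PASS


def count_alleles(row: VariantRow, candidates: set[int]) -> tuple[int, int]:
    """Return (AC, N_FAIL) restricted to the candidate samples (diploid assumption)."""
    ac = len(row.het & candidates) + 2 * len(row.hom_alt & candidates)
    n_fail = len(row.failed & candidates)
    return ac, n_fail


row = VariantRow(het={1, 4, 7}, hom_alt={2}, failed={4, 9})
ac, n_fail = count_alleles(row, candidates={1, 2, 3, 4, 5})
print(ac, n_fail)  # 4 1
```

Because each count is the size of an intersection, the cost depends on the bitmaps involved rather than on rescanning genotype records.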
At query time, the requested sample filter is resolved to a single candidate bitmap via bitmap intersections and differences — taking microseconds regardless of cohort size. For each variant, the candidate bitmap is intersected with the genotype bitmaps to compute AC/AN/AF. AN accounts for WES capture regions (via BED-indexed interval trees) and for ploidy on sex chromosomes (males are haploid on non-PAR chrX and chrY).
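The query-time steps can be sketched with plain Python sets standing in for Roaring Bitmaps. The example resolves a filter through intersections and differences, then sums ploidy over the samples whose technology covers the position to obtain AN. All names here (`resolve_filter`, `allele_number`, the label strings) are hypothetical. The `covered` argument stands in for the BED interval lookup, PAR handling is reduced to a boolean flag on chrX, and the rule that females contribute nothing on chrY is an assumption.

```python
def resolve_filter(all_samples, label_index, include=(), exclude=()):
    """Intersect the sets for included labels, then subtract the excluded ones."""
    candidates = set(all_samples)
    for label in include:
        candidates &= label_index.get(label, set())
    for label in exclude:
        candidates -= label_index.get(label, set())
    return candidates


def allele_number(candidates, males, covered, chrom, in_par=False):
    """Sum ploidy over candidate samples whose technology covers the position.

    Assumed rules: chrY counts one allele per male and none per female;
    non-PAR chrX counts one per male and two per female; everything else
    (autosomes, PAR on chrX) counts two per sample.
    """
    eligible = candidates & covered
    n_male = len(eligible & males)
    n_female = len(eligible) - n_male
    if chrom == "chrY":
        return n_male
    if chrom == "chrX" and not in_par:
        return n_male + 2 * n_female
    return 2 * len(eligible)


# Toy metadata index: label -> set of sample IDs (hypothetical labels).
label_index = {
    "sex:male": {4, 5, 6},
    "tech:WGS": {1, 2, 4, 5},
    "tech:WES": {3, 6},
    "phenotype:E11.9": {2, 3, 5},
}
all_samples = {1, 2, 3, 4, 5, 6}
males = label_index["sex:male"]

candidates = resolve_filter(all_samples, label_index, include=["phenotype:E11.9"])
covered = label_index["tech:WGS"]  # pretend only WGS covers this position

an_chr1 = allele_number(candidates, males, covered, "chr1")
an_chrx = allele_number(candidates, males, covered, "chrX")
print(sorted(candidates), an_chr1, an_chrx)  # [2, 3, 5] 4 3
```

Sample 3 matches the phenotype filter but was sequenced with a technology that does not cover the position, so it is left out of AN instead of being counted as homozygous reference.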
Input Requirements
- VCF files: normalized and consistent with the selected genome build (GRCh37 or GRCh38)
- Sample metadata: must include sex, sequencing technology, and any fields used for filtering (e.g., phenotype)
- BED files (optional): define capture regions for each sequencing technology
Quick Start
Example workflow from raw VCFs to query, export, and annotation:
```shell
pip install afquery
# Docker: see Installation docs for docker pull / run usage

# Build the database
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38

# Inspect the database
afquery info --db ./db/

# Query a single position, filtered to a phenotype
afquery query --db ./db/ --locus chr1:925952 --phenotype E11.9 --sex female

# Query a genomic region
afquery query --db ./db/ --region chr1:900000-1000000

# Export BRCA1 variant frequencies to CSV
afquery dump --db ./db/ --output all_variants.csv --chrom chr17 --start 43044292 --end 43170327

# Annotate a VCF with cohort frequencies
afquery annotate --db ./db/ --input patient.vcf --output annotated.vcf --threads 12

# Add new samples to an existing database
afquery update-db --db ./db/ --add-samples new_samples.tsv
```
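For repeated queries from a script, the `afquery query` command can be wrapped with Python's standard `subprocess` module. This is a minimal sketch that uses only the flags shown in the Quick Start. The helper names are hypothetical, combining `--phenotype` or `--sex` with `--region` is an assumption, and the command's output format is not assumed, so stdout is returned unparsed.

```python
import subprocess


def build_query_command(db, locus=None, region=None, phenotype=None, sex=None):
    """Assemble an `afquery query` argument list from the flags shown in Quick Start."""
    if (locus is None) == (region is None):
        raise ValueError("pass exactly one of locus or region")
    cmd = ["afquery", "query", "--db", db]
    cmd += ["--locus", locus] if locus else ["--region", region]
    if phenotype:
        cmd += ["--phenotype", phenotype]
    if sex:
        cmd += ["--sex", sex]
    return cmd


def run_query(**kwargs):
    """Run the command and return raw stdout; raises CalledProcessError on a non-zero exit."""
    result = subprocess.run(
        build_query_command(**kwargs), capture_output=True, text=True, check=True
    )
    return result.stdout


print(build_query_command("./db/", locus="chr1:925952", phenotype="E11.9", sex="female"))
```

Passing the arguments as a list avoids shell quoting problems when labels contain spaces or special characters.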
Citation
If you use AFQuery, please cite:
AFQuery: fast, metadata-aware allele frequency queries on local genomic cohorts.
(manuscript in preparation)