# AFQuery

Genomic allele frequency query engine with bitmap-encoded genotypes.

Fast, file-based genomic allele frequency queries for large cohorts (10K–50K samples). No server, no cloud — just files. Sub-100ms point queries with flexible filtering by sex, phenotype (ICD codes), and sequencing technology.
## Quick Example

```bash
pip install afquery

# Build a database from your VCFs
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38

# Query an allele frequency
afquery query --db ./db/ --chrom chr1 --pos 123456 --phenotype E11.9 --sex female

# Annotate a VCF
afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf
```
## Features
- Sub-100ms point queries on 50K-sample cohorts
- Filter by sex, phenotype (ICD codes), and sequencing technology
- Bitmap-compressed storage (Roaring Bitmaps + Parquet)
- Incremental updates (add/remove samples without full rebuild)
- VCF annotation with custom sample subsets
- Ploidy-aware AN for sex chromosomes (X/Y/MT)
- Zero infrastructure — purely file-based
## Installation

```bash
pip install afquery
# or
conda install -c bioconda -c conda-forge afquery
```

Requires Python 3.10+.
## Quick Start

### 1. Create a Database

First, prepare a manifest TSV with sample metadata:

```
sample_name  sex     tech_name  vcf_path           phenotype_codes
sample_1     male    wgs        vcfs/sample_1.vcf  E11.9,I10
sample_2     female  wes_kit_a  vcfs/sample_2.vcf  E11.9
sample_3     male    wgs        vcfs/sample_3.vcf  I10
```
Key points about the manifest:

- `tech_name`: either `wgs` (case-insensitive) for whole genome, or a custom technology name
- `vcf_path`: path to the single-sample VCF file (relative to the manifest directory, or absolute)
- For WES/exome technologies, capture regions are loaded from `--bed-dir/{tech_name}.bed`
  - Example: `tech_name=wes_kit_a` loads `beds/wes_kit_a.bed`
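As a sketch of how a manifest row might be parsed and validated before ingestion (the column names come from the manifest format above; the helper itself is illustrative and not part of the afquery API):

```python
import csv
import io

REQUIRED = ["sample_name", "sex", "tech_name", "vcf_path", "phenotype_codes"]

def parse_manifest(text: str) -> list[dict]:
    """Parse a tab-separated manifest, validating columns and values."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    missing = set(REQUIRED) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        if row["sex"] not in ("male", "female"):
            raise ValueError(f"bad sex for {row['sample_name']}: {row['sex']}")
        # phenotype_codes is a comma-separated list of ICD codes
        row["phenotype_codes"] = [c for c in row["phenotype_codes"].split(",") if c]
        rows.append(row)
    return rows

manifest = (
    "sample_name\tsex\ttech_name\tvcf_path\tphenotype_codes\n"
    "sample_1\tmale\twgs\tvcfs/sample_1.vcf\tE11.9,I10\n"
)
print(parse_manifest(manifest)[0]["phenotype_codes"])  # ['E11.9', 'I10']
```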
Organize your files:

```
project/
├── manifest.tsv
├── vcfs/
│   ├── sample_1.vcf
│   ├── sample_2.vcf
│   └── sample_3.vcf
└── beds/              # Required for non-WGS technologies
    └── wes_kit_a.bed  # BED file for WES kit A
```
Then build your database:

```bash
afquery create-db \
  --manifest manifest.tsv \
  --bed-dir ./beds/ \
  --output-dir ./my_db/ \
  --genome-build GRCh38
```

This creates:

- `my_db/manifest.json`: database metadata
- `my_db/metadata.sqlite`: samples, technologies, phenotype codes, precomputed bitmaps
- `my_db/variants/{chrom}.parquet`: variant data with encoded genotypes
- `my_db/capture/`: capture regions for each technology
### 2. Query Allele Frequencies

```bash
# Point query
afquery query --db my_db --chrom chr1 --pos 1000 --alt G

# Batch query (from a file with columns: pos ref alt)
afquery query --db my_db --chrom chr1 --from-file positions.tsv --phenotype E11.9

# Region query
afquery query --db my_db --chrom chr1 --start 1000 --end 10000 --phenotype E11.9 --sex M
```
### 3. Annotate VCF Files

```bash
afquery annotate \
  --db my_db \
  --input input.vcf \
  --output annotated.vcf \
  --phenotype E11.9 \
  --tech WGS
```

Adds `AFQUERY_AC`, `AFQUERY_AN`, and `AFQUERY_AF` INFO fields, plus genotype fields, to your VCF.
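To illustrate what the added fields look like downstream, here is a small parser for the `AFQUERY_*` keys in a VCF INFO column (the field names come from the docs above; the helper is ours, not part of afquery):

```python
def parse_afquery_info(info: str) -> dict:
    """Extract the AFQUERY_* annotation fields from a VCF INFO string."""
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    return {
        "AC": int(fields["AFQUERY_AC"]),
        "AN": int(fields["AFQUERY_AN"]),
        "AF": float(fields["AFQUERY_AF"]),
    }

print(parse_afquery_info("DP=30;AFQUERY_AC=3;AFQUERY_AN=100;AFQUERY_AF=0.03"))
# {'AC': 3, 'AN': 100, 'AF': 0.03}
```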
## Python API

```python
from afquery import Database

db = Database("/path/to/db")

# Single-position query.
# Samples are automatically filtered by sex, phenotype codes, and capture coverage.
results = db.query(
    chrom="chr1",
    pos=1000,
    alt="G",
    phenotype=["E11.9"],
    sex="both",
)
for r in results:
    print(f"AC={r.AC}, AN={r.AN}, AF={r.AF}")

# Batch query (multiple variants)
results = db.query_batch(
    "chr1",
    variants=[(1500, "A", "T"), (3500, "G", "C")],
    phenotype=["E11.9"],
)

# Region query (genomic range)
results = db.query_region(
    chrom="chr1",
    start=1000,
    end=10000,
    phenotype=["E11.9", "I10"],
)

# Annotate a VCF with allele frequencies.
# Note: tech restricts annotation to samples of that technology.
db.annotate_vcf(
    input_vcf="input.vcf",
    output_vcf="annotated.vcf",
    phenotype=["E11.9"],
    tech=["wgs"],  # Only annotate using WGS samples
)
```
How samples are filtered in queries:

- Sex filter: `male`, `female`, or `both`
- Phenotype filter: all given codes must match
- Capture filter: automatic; only samples whose technology's BED covers the position are counted
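The filter logic above can be sketched with plain Python sets standing in for Roaring Bitmaps (afquery stores compressed bitmaps, but the intersection idea is the same):

```python
def eligible_samples(sex_sets, pheno_sets, covered, sex, phenotypes):
    """Return sample ids passing the sex, phenotype, and capture filters."""
    ids = set(covered)            # capture filter: the tech's BED covers the position
    if sex != "both":
        ids &= sex_sets[sex]      # sex filter
    for code in phenotypes:       # all phenotype codes must match
        ids &= pheno_sets[code]
    return ids

# Hypothetical cohort of three samples (ids 0-2)
sex_sets = {"male": {0, 2}, "female": {1}}
pheno_sets = {"E11.9": {0, 1}, "I10": {0, 2}}
covered = {0, 1, 2}
print(eligible_samples(sex_sets, pheno_sets, covered, "male", ["E11.9"]))  # {0}
```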
## Database Structure

```
my_db/
├── manifest.json     # Metadata: genome_build, sample_count, schema_version
├── metadata.sqlite   # SQLite: samples, technologies, phenotype codes, bitmaps
├── variants/
│   ├── chr1.parquet
│   ├── chr2.parquet
│   └── ...
└── capture/
    ├── tech_0.pickle # WGS capture region (always covered)
    └── tech_1.pickle # WES kit capture region
```
Each variant row contains:

- `pos`: 1-based genomic position
- `ref`: reference allele
- `alt`: alternate allele
- `het_bitmap`: Roaring Bitmap of heterozygous samples
- `hom_bitmap`: Roaring Bitmap of homozygous samples
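Given those two bitmaps, allele counts fall out of set arithmetic: each eligible het sample contributes one alt allele and each eligible hom sample two. A sketch, again with sets standing in for Roaring Bitmaps:

```python
def allele_count(het, hom, eligible):
    """AC over the eligible samples: het contributes 1, hom contributes 2."""
    return len(het & eligible) + 2 * len(hom & eligible)

het = {0, 3}              # heterozygous sample ids
hom = {1}                 # homozygous-alt sample ids
eligible = {0, 1, 2}      # samples passing sex/phenotype/capture filters
ac = allele_count(het, hom, eligible)
an = 2 * len(eligible)    # autosomal AN (see Ploidy Rules for sex chromosomes)
print(ac, an, ac / an)    # 3 6 0.5
```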
## How Capture BED Files Are Associated with Samples

Samples are linked to capture regions through their technology:

1. Manifest specifies technology: each sample lists a `tech_name` (e.g., `wgs`, `wes_kit_a`).
2. Technology maps to a BED file:
   - WGS: no BED file needed (always fully covered)
   - WES/custom: BED file loaded from `{bed_dir}/{tech_name}.bed`
3. Storage in the database:
   - `metadata.sqlite::technologies` stores `tech_id`, `tech_name`, and `bed_path`
   - `metadata.sqlite::samples` stores `sample_id`, `sample_name`, and `tech_id` (foreign key)
4. Query-time filtering: samples are filtered by:
   - sex (`male`/`female`/`both`)
   - phenotype diagnosis codes
   - capture region coverage (via the technology's BED file)
Example: if you have samples on two exome kits:

```
sample_name  sex     tech_name  vcf_path       phenotype_codes
S001         male    exome_v1   vcfs/S001.vcf  E11.9
S002         female  exome_v1   vcfs/S002.vcf  E11.9
S003         male    exome_v2   vcfs/S003.vcf  I10
```

Then provide:

```
beds/
├── exome_v1.bed # Coverage for samples S001, S002
└── exome_v2.bed # Coverage for sample S003
```
At query time, each sample's eligible regions are determined by its tech's BED file.
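A minimal sketch of such a coverage check, assuming the BED intervals for a chromosome are sorted and non-overlapping (the interval data here is hypothetical; afquery's own lookup structure may differ):

```python
import bisect

def covers(intervals, pos):
    """True if 1-based position pos falls in any half-open, 0-based
    (start, end) BED interval; intervals must be sorted."""
    i = bisect.bisect_right(intervals, (pos, float("inf"))) - 1
    if i < 0:
        return False
    start, end = intervals[i]
    return start < pos <= end

wes_kit_a = [(100, 200), (500, 800)]   # hypothetical capture regions
print(covers(wes_kit_a, 150), covers(wes_kit_a, 300))  # True False
```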
## Advanced Features

### Incremental Updates

```bash
afquery update-db \
  --db my_db \
  --add-samples new_samples.tsv
```

Adds new samples without rebuilding the entire database.

### Remove Samples and Compact

Remove samples and reclaim disk space:

```bash
afquery update-db --db my_db --remove-samples sample_1,sample_2 --compact
```

### Run Benchmarks

```bash
afquery benchmark --n-samples 5000 --n-variants 100000
```
## Ploidy Rules
AF computation respects chromosome-specific ploidy:
| Chromosome | Formula |
|---|---|
| Autosomes | AN = 2 × eligible_samples |
| chrM | AN = 1 × eligible_samples |
| chrY | AN = 1 × eligible_males |
| chrX (PAR) | AN = 2 × eligible_samples |
| chrX (non-PAR) | AN = 2 × eligible_females + 1 × eligible_males |
Where eligible = samples matching sex, phenotype, and technology capture filters.
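The table above translates directly into a small function; this is a sketch of the AN formulas only, with PAR membership left to the caller:

```python
def compute_an(chrom, n_males, n_females, par=False):
    """AN under the chromosome-specific ploidy rules in the table above."""
    n = n_males + n_females
    if chrom == "chrY":
        return n_males                    # haploid, males only
    if chrom == "chrM":
        return n                          # one copy per sample
    if chrom == "chrX" and not par:
        return 2 * n_females + n_males    # diploid females, haploid males
    return 2 * n                          # autosomes and chrX PAR

print(compute_an("chr1", 10, 10))            # 40
print(compute_an("chrX", 10, 10))            # 30
print(compute_an("chrX", 10, 10, par=True))  # 40
print(compute_an("chrY", 10, 10))            # 10
```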
## Performance Targets
- Point query (cold): <100 ms
- Point query (warm): ~10 ms
- Batch 100 positions: ~200 ms
- VCF annotation (5K variants): ~30 s
- VCF annotation (5M variants): ~30 min
## Command Reference

| Command | Description |
|---|---|
| `afquery query` | Query one position, region, or batch (`--from-file`) |
| `afquery annotate` | Annotate a VCF file with AF INFO fields |
| `afquery dump` | Export allele frequencies for all variants to CSV |
| `afquery info` | Show database metadata and sample list |
| `afquery check` | Validate database integrity |
| `afquery create-db` | Build a database from a VCF manifest |
| `afquery update-db` | Add/remove samples or compact the database |
| `afquery version` | Show or set the database version label |
| `afquery benchmark` | Run performance benchmarks |

Run `afquery --help` for full options.
## Development

### Running Tests

```bash
# All tests
python3 -m pytest --tb=short -q

# A specific test module
python3 -m pytest tests/test_query.py -v
```
### Key Modules

- `afquery.query`: QueryEngine; point/batch/region queries
- `afquery.annotate`: VCF annotation pipeline
- `afquery.database`: Database wrapper (public API)
- `afquery.preprocess`: manifest parsing, VCF ingestion, Parquet building
- `afquery.bitmaps`: Roaring Bitmap encoding/decoding
- `afquery.ploidy`: chromosome-specific ploidy rules
- `afquery.models`: data classes (QueryResult, ParsedSample, etc.)
## Genome Builds

- GRCh37 (hg19): PAR region chrX:1-2649520
- GRCh38 (hg38): PAR region chrX:1-3099677
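A PAR membership check using the chrX boundaries listed above (only the single listed region per build; a fuller implementation would also handle PAR2 and chrY):

```python
# End coordinate of the listed chrX PAR region per genome build (1-based, inclusive)
PAR_END = {"GRCh37": 2_649_520, "GRCh38": 3_099_677}

def in_chrx_par(build, pos):
    """True if a 1-based chrX position falls in the build's listed PAR region."""
    return 1 <= pos <= PAR_END[build]

print(in_chrx_par("GRCh38", 100))        # True
print(in_chrx_par("GRCh38", 5_000_000))  # False
```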
## Troubleshooting

### `ImportError: cyvcf2`

The `cyvcf2` import happens inside worker processes during preprocessing. Do not import it at module level.

### DuckDB Temp Files

AFQuery uses the Parquet format (not Arrow IPC) for compatibility. Set `DUCKDB_TEMP_DIRECTORY` if needed.
## License

(Add your license here)

## Citation

If you use AFQuery in research, please cite:

(Citation format to be determined)