afquery
Genomic allele frequency query engine with bitmap-encoded genotypes. Fast, file-based queries over 10K-50K samples at sub-100ms latency.
Features
- Fast point queries: <100ms cold start, ~10ms warm queries on single positions
- Batch queries: Multi-position queries via SQL IN clauses or temporary tables
- Region queries: Genomic range queries with automatic partitioning
- VCF annotation: Annotate VCF files with computed allele frequencies and sample genotypes
- Ploidy-aware: Correct AN/AC computation for autosomes, chrX, chrY, and chrM
- Bitmap compression: Roaring Bitmaps for efficient genotype storage
- In-process: No server process—queries run locally with DuckDB
- Incremental updates: Add new samples to existing databases
- Multiple genome builds: Support for GRCh37 and GRCh38
Installation
pip install afquery
Requires Python 3.10+.
Quick Start
1. Create a Database
First, prepare a manifest TSV with sample metadata:
sample_name sex tech_name vcf_path phenotype_codes
sample_1 male wgs vcfs/sample_1.vcf E11.9,I10
sample_2 female wes_kit_a vcfs/sample_2.vcf E11.9
sample_3 male wgs vcfs/sample_3.vcf I10
Key points about the manifest:
- tech_name: either wgs (case-insensitive) for whole genome, or a custom technology name
- vcf_path: path to the single-sample VCF file (relative to the manifest directory, or absolute)
- For WES/exome technologies, capture regions are loaded from --bed-dir/{tech_name}.bed; for example, tech_name=wes_kit_a loads beds/wes_kit_a.bed
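As a rough illustration of what preprocessing has to do with this file, here is a minimal sketch of manifest parsing (this is not afquery's actual parser; `parse_manifest` is a hypothetical helper):

```python
import csv
import io

# Tab-separated manifest as described above; phenotype_codes is a
# comma-separated list within a single column.
MANIFEST = """sample_name\tsex\ttech_name\tvcf_path\tphenotype_codes
sample_1\tmale\twgs\tvcfs/sample_1.vcf\tE11.9,I10
sample_2\tfemale\twes_kit_a\tvcfs/sample_2.vcf\tE11.9
"""

def parse_manifest(text):
    """Parse the manifest TSV, splitting phenotype codes into a list."""
    rows = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["phenotype_codes"] = row["phenotype_codes"].split(",")
        rows.append(row)
    return rows

samples = parse_manifest(MANIFEST)
print(samples[0]["phenotype_codes"])  # ['E11.9', 'I10']
```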
Organize your files:
project/
├── manifest.tsv
├── vcfs/
│ ├── sample_1.vcf
│ ├── sample_2.vcf
│ └── sample_3.vcf
└── beds/ # Required for non-WGS technologies
└── wes_kit_a.bed # BED file for WES kit A
Then preprocess your VCF files:
afquery preprocess \
--manifest manifest.tsv \
--bed-dir ./beds/ \
--output-dir ./my_db/ \
--genome-build GRCh38
This creates:
- my_db/manifest.json — database metadata
- my_db/metadata.sqlite — samples, technologies, phenotype codes, precomputed bitmaps
- my_db/variants/{chrom}.parquet — variant data with encoded genotypes
- my_db/capture/ — capture regions for each technology
2. Query Allele Frequencies
# Point query
afquery query --db my_db --chrom chr1 --pos 1000 --alt G
# Batch query (100 positions)
afquery query-batch --db my_db --positions positions.tsv --phenotype E11.9
# Region query
afquery query --db my_db --chrom chr1 --start 1000 --end 10000 --phenotype E11.9 --sex M
3. Annotate VCF Files
afquery annotate \
--db my_db \
--vcf input.vcf \
--output annotated.vcf \
--phenotype E11.9 \
--tech WGS
Adds AFQUERY_AC, AFQUERY_AN, AFQUERY_AF and genotype fields to your VCF.
Python API
from afquery import Database
db = Database("/path/to/db")
# Single position query
# Automatically filters samples by: sex + phenotype codes + capture coverage
result = db.query(
chrom="chr1",
pos=1000,
alt="G",
phenotype_codes=["E11.9"],
sex="both"
)
print(f"AC={result.ac}, AN={result.an}, AF={result.af}")
# Batch query (multi-variant)
results = db.query_batch(
"chr1",
variants=[(1500, "A", "T"), (3500, "G", "C")],
phenotype=["E11.9"],
)
# Region query (genomic range)
results = db.query_region(
chrom="chr1",
start=1000,
end=10000,
phenotype_codes=["E11.9", "I10"]
)
# Annotate VCF with allele frequencies
# Note: tech_name filters annotation to samples of that technology
db.annotate_vcf(
vcf_path="input.vcf",
output_path="annotated.vcf",
phenotype_codes=["E11.9"],
tech_name="wgs" # Only annotate using WGS samples
)
How samples are filtered in queries:
- Sex filter: male, female, or both
- Phenotype filter: all listed codes must match
- Capture filter: automatic; only samples whose technology's BED covers the position are counted
Database Structure
my_db/
├── manifest.json # Metadata: genome_build, sample_count, schema_version
├── metadata.sqlite # SQLite: samples, technologies, phenotype codes, bitmaps
├── variants/
│ ├── chr1.parquet
│ ├── chr2.parquet
│ └── ...
└── capture/
├── tech_0.pickle # WGS capture region (always covered)
└── tech_1.pickle # WES kit capture region
Each variant row contains:
- pos — 1-based genomic position
- ref — reference allele
- alt — alternate allele
- het_bitmap — Roaring Bitmap of heterozygous samples
- hom_bitmap — Roaring Bitmap of homozygous samples
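The het/hom split makes allele counts a pure set operation. A minimal sketch of that arithmetic, using plain Python sets in place of Roaring Bitmaps (the set algebra is the same; the real code operates on compressed bitmaps):

```python
# Sample-ID sets standing in for the per-variant bitmaps described above.
het_bitmap = {0, 3, 7}      # samples carrying one alt allele
hom_bitmap = {2, 5}         # samples carrying two alt alleles
eligible = {0, 1, 2, 3, 4}  # samples passing sex/phenotype/capture filters

# AC = 1 per eligible heterozygote + 2 per eligible homozygote.
ac = len(het_bitmap & eligible) + 2 * len(hom_bitmap & eligible)
an = 2 * len(eligible)  # autosomal case: two alleles per eligible sample
print(ac, an, ac / an)  # 4 10 0.4
```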
How Capture BED Files Are Associated with Samples
Samples are linked to capture regions through their technology:
- Manifest specifies technology: each sample lists a tech_name (e.g., wgs, wes_kit_a)
- Technology maps to a BED file:
  - WGS: no BED file needed (always fully covered)
  - WES/custom: BED file loaded from {bed_dir}/{tech_name}.bed
- Storage in database:
  - metadata.sqlite::technologies stores tech_id, tech_name, and bed_path
  - metadata.sqlite::samples stores sample_id, sample_name, and tech_id (foreign key)
- Query-time filtering: when querying, samples are filtered by:
  - Sex (male/female/both)
  - Phenotype diagnosis codes
  - Capture region coverage (via the tech's BED file)
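The sample-to-technology link can be pictured as two small SQLite tables. The sketch below follows the table and column names given above; the exact schema (types, constraints) is an assumption, not afquery's actual DDL:

```python
import sqlite3

# In-memory stand-in for metadata.sqlite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE technologies (
    tech_id   INTEGER PRIMARY KEY,
    tech_name TEXT NOT NULL UNIQUE,
    bed_path  TEXT                  -- NULL for WGS (no BED needed)
);
CREATE TABLE samples (
    sample_id   INTEGER PRIMARY KEY,
    sample_name TEXT NOT NULL UNIQUE,
    tech_id     INTEGER NOT NULL REFERENCES technologies(tech_id)
);
""")
con.execute("INSERT INTO technologies VALUES (0, 'wgs', NULL)")
con.execute("INSERT INTO technologies VALUES (1, 'wes_kit_a', 'beds/wes_kit_a.bed')")
con.execute("INSERT INTO samples VALUES (0, 'sample_1', 0)")
con.execute("INSERT INTO samples VALUES (1, 'sample_2', 1)")

# Resolve a sample's capture BED through its technology.
row = con.execute("""
    SELECT s.sample_name, t.bed_path
    FROM samples s JOIN technologies t USING (tech_id)
    WHERE s.sample_name = 'sample_2'
""").fetchone()
print(row)  # ('sample_2', 'beds/wes_kit_a.bed')
```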
Example: If you have samples on two exome kits:
sample_name sex tech_name vcf_path phenotype_codes
S001 male exome_v1 vcfs/S001.vcf E11.9
S002 female exome_v1 vcfs/S002.vcf E11.9
S003 male exome_v2 vcfs/S003.vcf I10
Then provide:
beds/
├── exome_v1.bed # Coverage for samples S001, S002
└── exome_v2.bed # Coverage for sample S003
At query time, each sample's eligible regions are determined by its tech's BED file.
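A coverage check against sorted BED intervals can be sketched with a binary search. BED is 0-based, half-open [start, end), so a 1-based position pos is covered when start <= pos - 1 < end. This is an illustrative stand-in, not afquery's capture implementation:

```python
import bisect

# Sorted, non-overlapping intervals, e.g. parsed from wes_kit_a.bed.
intervals = [(999, 2000), (5000, 5100)]
starts = [s for s, _ in intervals]

def is_covered(pos):
    """True if 1-based position pos falls inside a 0-based half-open interval."""
    i = bisect.bisect_right(starts, pos - 1) - 1  # rightmost start <= pos - 1
    return i >= 0 and pos - 1 < intervals[i][1]

print(is_covered(1000), is_covered(3000))  # True False
```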
Advanced Features
Incremental Updates (add_samples)
afquery add-samples \
--db my_db \
--manifest new_samples.tsv \
--vcf-dir ./new_vcfs/
Adds new samples without rebuilding the entire database.
Compact Database
Remove samples and reclaim disk space:
afquery compact --db my_db --samples-to-remove sample_1,sample_2
Run Benchmarks
afquery benchmark --db my_db --n-queries 1000 --query-type point
Generate Synthetic Data
afquery synth --output synthetic_db/ --n-samples 5000 --n-variants 100000
Ploidy Rules
AF computation respects chromosome-specific ploidy:
| Chromosome | Formula |
|---|---|
| Autosomes | AN = 2 × eligible_samples |
| chrM | AN = 1 × eligible_samples |
| chrY | AN = 1 × eligible_males |
| chrX (PAR) | AN = 2 × eligible_samples |
| chrX (non-PAR) | AN = 2 × eligible_females + 1 × eligible_males |
Where eligible = samples matching sex, phenotype, and technology capture filters.
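The table above translates directly into a small dispatch function. This is a sketch (`compute_an` is a hypothetical name, not afquery's API), using the GRCh38 PAR cutoff stated elsewhere in this README (chrX:1-3099677):

```python
PAR_END_GRCH38 = 3_099_677  # simplified single-PAR cutoff from this README

def compute_an(chrom, pos, n_males, n_females):
    """Allele number for a site, given counts of eligible males/females."""
    n = n_males + n_females
    if chrom == "chrM":
        return n                        # haploid for everyone
    if chrom == "chrY":
        return n_males                  # haploid, males only
    if chrom == "chrX":
        if pos <= PAR_END_GRCH38:       # pseudoautosomal: diploid for all
            return 2 * n
        return 2 * n_females + n_males  # non-PAR: males are hemizygous
    return 2 * n                        # autosomes: diploid

print(compute_an("chr1", 1000, 40, 60))       # 200
print(compute_an("chrX", 5_000_000, 40, 60))  # 160
```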
Performance Targets
- Point query (cold): <100 ms
- Point query (warm): ~10 ms
- Batch 100 positions: ~200 ms
- VCF annotation (5K variants): ~30 s
- VCF annotation (5M variants): ~30 min
Command Reference
afquery query Query single position
afquery query-batch Batch query multiple positions
afquery annotate Annotate VCF file
afquery info Show database info
afquery preprocess Build database from VCFs
afquery add-samples Add new samples to database
afquery compact Remove samples and reclaim space
afquery synth Generate synthetic test database
afquery benchmark Run performance benchmarks
Run afquery --help for full options.
Development
Running Tests
# All 190 tests
python3 -m pytest --tb=short -q
# Specific test module
python3 -m pytest tests/test_query.py -v
Key Modules
- afquery.query — QueryEngine, point/batch/region queries
- afquery.annotate — VCF annotation pipeline
- afquery.database — Database wrapper (public API)
- afquery.preprocess — manifest parsing, VCF ingestion, Parquet building
- afquery.bitmaps — Roaring Bitmap encoding/decoding
- afquery.ploidy — chromosome-specific ploidy rules
- afquery.models — data classes (QueryResult, ParsedSample, etc.)
Architecture
See brain/architecture.md for detailed system design, data flow, and query algorithm.
Genome Builds
- GRCh37 (hg19) — PAR regions: chrX:1-2649520
- GRCh38 (hg38) — PAR regions: chrX:1-3099677
Technologies Supported
- WGS — whole genome sequencing (always fully covered, no BED file needed)
  - Manifest: tech_name = wgs (case-insensitive)
  - Query-time: all positions in the genome are considered covered
- WES — whole exome sequencing (coverage defined by a BED file)
  - Manifest: tech_name = wes_kit_a (or any custom name)
  - Preprocessing: loads {bed_dir}/wes_kit_a.bed (0-based half-open BED format)
  - Query-time: only positions within BED intervals are considered covered
- Custom — any technology with a BED file (e.g., gene panels, targeted sequencing)
  - Manifest: use any tech_name
  - Preprocessing: loads {bed_dir}/{tech_name}.bed
  - Query-time: respects BED file coverage
Troubleshooting
ImportError: cyvcf2
cyvcf2 is imported inside worker processes during preprocessing; do not import it at module level.
DuckDB Temp Files
Uses Parquet format (not Arrow IPC) for compatibility. Set DUCKDB_TEMP_DIRECTORY if needed.
Sample IDs
Sample IDs are 0-indexed and monotonically increasing. Never reuse removed IDs—use compact to reclaim space.
Contributing
- Read brain/project_state.json for the current phase and test count
- Read brain/architecture.md for system design
- Follow code conventions in CLAUDE.md
- Update brain/ docs after architectural changes
- Run tests before submitting
License
(Add your license here)
Citation
If you use afquery in research, please cite:
(Citation format to be determined)
Status: Phase 5 complete (190 tests passing). Active development.