
afquery

Genomic allele frequency query engine with bitmap-encoded genotypes. Fast, file-based queries over 10K-50K samples at sub-100ms latency.

Features

  • Fast point queries: <100ms cold start, ~10ms warm queries on single positions
  • Batch queries: Multi-position queries via SQL IN clauses or temporary tables
  • Region queries: Genomic range queries with automatic partitioning
  • VCF annotation: Annotate VCF files with computed allele frequencies and sample genotypes
  • Ploidy-aware: Correct AN/AC computation for autosomes, chrX, chrY, and chrM
  • Bitmap compression: Roaring Bitmaps for efficient genotype storage
  • In-process: No server process—queries run locally with DuckDB
  • Incremental updates: Add new samples to existing databases
  • Multiple genome builds: Support for GRCh37 and GRCh38

Installation

pip install afquery

Or, for a development install from a local checkout: pip install -e .

Requires Python 3.10+.

Quick Start

1. Create a Database

First, prepare a manifest TSV with sample metadata:

sample_name	sex	tech_name	vcf_path	phenotype_codes
sample_1	male	wgs	vcfs/sample_1.vcf	E11.9,I10
sample_2	female	wes_kit_a	vcfs/sample_2.vcf	E11.9
sample_3	male	wgs	vcfs/sample_3.vcf	I10

Key points about the manifest:

  • tech_name: Either wgs (case-insensitive) for whole genome, or a custom technology name
  • vcf_path: Path to the single-sample VCF file (relative to manifest directory, or absolute)
  • For WES/exome technologies, capture regions are loaded from {bed_dir}/{tech_name}.bed, where the directory is supplied via --bed-dir
    • Example: tech_name=wes_kit_a → loads beds/wes_kit_a.bed
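
If you prefer to build the manifest programmatically, here is a minimal sketch using only the Python standard library, with the columns exactly as documented above (the sample rows are illustrative):

import csv

# Columns: sample_name, sex, tech_name, vcf_path, phenotype_codes
rows = [
    ("sample_1", "male",   "wgs",       "vcfs/sample_1.vcf", "E11.9,I10"),
    ("sample_2", "female", "wes_kit_a", "vcfs/sample_2.vcf", "E11.9"),
]
with open("manifest.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["sample_name", "sex", "tech_name", "vcf_path", "phenotype_codes"])
    writer.writerows(rows)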

Organize your files:

project/
├── manifest.tsv
├── vcfs/
│   ├── sample_1.vcf
│   ├── sample_2.vcf
│   └── sample_3.vcf
└── beds/              # Required for non-WGS technologies
    └── wes_kit_a.bed  # BED file for WES kit A

Then preprocess your VCF files:

afquery preprocess \
  --manifest manifest.tsv \
  --bed-dir ./beds/ \
  --output-dir ./my_db/ \
  --genome-build GRCh38

This creates:

  • my_db/manifest.json — database metadata
  • my_db/metadata.sqlite — samples, technologies, phenotype codes, precomputed bitmaps
  • my_db/variants/{chrom}.parquet — variant data with encoded genotypes
  • my_db/capture/ — capture regions for each technology
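
For a quick sanity check after preprocessing, you can inspect the generated layout directly. This sketch assumes only the manifest.json keys listed under Database Structure below (genome_build, sample_count, schema_version):

import json
from pathlib import Path

db = Path("my_db")

# Top-level metadata written by `afquery preprocess`
meta = json.loads((db / "manifest.json").read_text())
print(meta.get("genome_build"), meta.get("sample_count"), meta.get("schema_version"))

# Per-chromosome Parquet files holding the encoded variant data
print(sorted(p.name for p in (db / "variants").glob("*.parquet")))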

2. Query Allele Frequencies

# Point query
afquery query --db my_db --chrom chr1 --pos 1000 --alt G

# Batch query (100 positions)
afquery query-batch --db my_db --positions positions.tsv --phenotype E11.9

# Region query
afquery query --db my_db --chrom chr1 --start 1000 --end 10000 --phenotype E11.9 --sex male
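
The exact column layout expected by --positions is not spelled out here (see afquery query-batch --help). A plausible chrom/pos/ref/alt TSV, mirroring the (pos, ref, alt) tuples used by the Python batch API below, could be written like this:

import csv

variants = [("chr1", 1500, "A", "T"), ("chr1", 3500, "G", "C")]
with open("positions.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["chrom", "pos", "ref", "alt"])  # assumed header
    writer.writerows(variants)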

3. Annotate VCF Files

afquery annotate \
  --db my_db \
  --vcf input.vcf \
  --output annotated.vcf \
  --phenotype E11.9 \
  --tech WGS

Adds AFQUERY_AC, AFQUERY_AN, AFQUERY_AF and genotype fields to your VCF.
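
To spot-check the annotation, you can read the output back with cyvcf2 (already used by afquery during preprocessing). This sketch assumes AFQUERY_AF is written as an INFO field:

from cyvcf2 import VCF

for variant in VCF("annotated.vcf"):
    af = variant.INFO.get("AFQUERY_AF")
    if af is not None:
        print(variant.CHROM, variant.POS, variant.REF, variant.ALT, af)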

Python API

from afquery import Database

db = Database("/path/to/db")

# Single position query
# Automatically filters samples by: sex + phenotype codes + capture coverage
result = db.query(
    chrom="chr1",
    pos=1000,
    alt="G",
    phenotype_codes=["E11.9"],
    sex="both"
)
print(f"AC={result.ac}, AN={result.an}, AF={result.af}")

# Batch query (multi-variant)
results = db.query_batch(
    "chr1",
    variants=[(1500, "A", "T"), (3500, "G", "C")],
    phenotype_codes=["E11.9"],
)

# Region query (genomic range)
results = db.query_region(
    chrom="chr1",
    start=1000,
    end=10000,
    phenotype_codes=["E11.9", "I10"]
)

# Annotate VCF with allele frequencies
# Note: tech_name filters annotation to samples of that technology
db.annotate_vcf(
    vcf_path="input.vcf",
    output_path="annotated.vcf",
    phenotype_codes=["E11.9"],
    tech_name="wgs"  # Only annotate using WGS samples
)

How samples are filtered in queries:

  • Sex filter: male, female, or both
  • Phenotype filter: all requested codes must match
  • Capture filter: Automatic—only samples whose tech's BED covers the position

Database Structure

my_db/
├── manifest.json          # Metadata: genome_build, sample_count, schema_version
├── metadata.sqlite        # SQLite: samples, technologies, phenotype codes, bitmaps
├── variants/
│   ├── chr1.parquet
│   ├── chr2.parquet
│   └── ...
└── capture/
    ├── tech_0.pickle      # WGS capture region (always covered)
    └── tech_1.pickle      # WES kit capture region

Each variant row contains:

  • pos — 1-based genomic position
  • ref — reference allele
  • alt — alternate allele
  • het_bitmap — Roaring Bitmap of heterozygous samples
  • hom_bitmap — Roaring Bitmap of homozygous samples
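
Conceptually, AC for a variant is derived by intersecting these bitmaps with the set of eligible samples. A minimal sketch using pyroaring (afquery's own encoding lives in afquery.bitmaps; the sample IDs below are made up):

from pyroaring import BitMap

# Sample IDs (0-indexed) that pass the sex/phenotype/capture filters
eligible = BitMap([0, 1, 2, 5, 7])

# Bitmaps stored on one variant row
het_bitmap = BitMap([1, 5, 9])   # heterozygous carriers
hom_bitmap = BitMap([2])         # homozygous-alt carriers

ac = len(het_bitmap & eligible) + 2 * len(hom_bitmap & eligible)
an = 2 * len(eligible)           # autosomal ploidy; see Ploidy Rules below
af = ac / an if an else 0.0
print(f"AC={ac}, AN={an}, AF={af:.4f}")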

How Capture BED Files Are Associated with Samples

Samples are linked to capture regions through their technology:

  1. Manifest specifies technology: Each sample lists tech_name (e.g., wgs, wes_kit_a)
  2. Technology maps to BED file:
    • WGS: No BED file needed (always fully covered)
    • WES/Custom: BED file loaded from {bed_dir}/{tech_name}.bed
  3. Storage in database:
    • metadata.sqlite::technologies stores tech_id, tech_name, and bed_path
    • metadata.sqlite::samples stores sample_id, sample_name, and tech_id (foreign key)
  4. Query-time filtering: When querying, samples are filtered by:
    • Sex (male/female/both)
    • Phenotype diagnosis codes
    • Capture region coverage (via tech's BED file)

Example: If you have samples on two exome kits:

sample_name	sex	tech_name	vcf_path	phenotype_codes
S001	male	exome_v1	vcfs/S001.vcf	E11.9
S002	female	exome_v1	vcfs/S002.vcf	E11.9
S003	male	exome_v2	vcfs/S003.vcf	I10

Then provide:

beds/
├── exome_v1.bed    # Coverage for samples S001, S002
└── exome_v2.bed    # Coverage for sample S003

At query time, each sample's eligible regions are determined by its tech's BED file.
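
As an illustration of the capture filter, the sketch below checks whether a 1-based query position falls inside a technology's BED intervals (0-based, half-open). This is not afquery's internal implementation; it assumes non-overlapping intervals per chromosome:

from bisect import bisect_right
from collections import defaultdict

def load_bed(path):
    """Parse a BED file into {chrom: sorted (start, end)} with 0-based, half-open coords."""
    intervals = defaultdict(list)
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            intervals[chrom].append((int(start), int(end)))
    return {c: sorted(v) for c, v in intervals.items()}

def covers(bed, chrom, pos_1based):
    """True if the 1-based position lies inside any interval on chrom."""
    ivals = bed.get(chrom, [])
    p = pos_1based - 1
    i = bisect_right(ivals, (p, float("inf"))) - 1
    return i >= 0 and ivals[i][0] <= p < ivals[i][1]

bed = load_bed("beds/exome_v1.bed")
print(covers(bed, "chr1", 1000))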

Advanced Features

Incremental Updates (add_samples)

afquery add-samples \
  --db my_db \
  --manifest new_samples.tsv \
  --vcf-dir ./new_vcfs/

Adds new samples without rebuilding the entire database.

Compact Database

Remove samples and reclaim disk space:

afquery compact --db my_db --samples-to-remove sample_1,sample_2

Run Benchmarks

afquery benchmark --db my_db --n-queries 1000 --query-type point

Generate Synthetic Data

afquery synth --output synthetic_db/ --n-samples 5000 --n-variants 100000

Ploidy Rules

AF computation respects chromosome-specific ploidy:

Chromosome       Formula
Autosomes        AN = 2 × eligible_samples
chrM             AN = 1 × eligible_samples
chrY             AN = 1 × eligible_males
chrX (PAR)       AN = 2 × eligible_samples
chrX (non-PAR)   AN = 2 × eligible_females + 1 × eligible_males

Where eligible = samples matching sex, phenotype, and technology capture filters.
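
A minimal sketch of the AN rules in the table above (illustrative only; the authoritative logic lives in afquery.ploidy):

def compute_an(chrom: str, n_samples: int, n_males: int, n_females: int,
               in_par: bool = False) -> int:
    """AN for one position, given eligible sample counts after filtering."""
    if chrom in ("chrM", "MT"):
        return n_samples                      # haploid for everyone
    if chrom in ("chrY", "Y"):
        return n_males                        # males only, haploid
    if chrom in ("chrX", "X"):
        if in_par:
            return 2 * n_samples              # pseudoautosomal: diploid for all
        return 2 * n_females + n_males        # non-PAR: females diploid, males haploid
    return 2 * n_samples                      # autosomes

# Example: 100 eligible samples (40 male, 60 female), chrX outside the PAR
print(compute_an("chrX", 100, 40, 60))  # 2*60 + 40 = 160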

Performance Targets

  • Point query (cold): <100 ms
  • Point query (warm): ~10 ms
  • Batch 100 positions: ~200 ms
  • VCF annotation (5K variants): ~30 s
  • VCF annotation (5M variants): ~30 min

Command Reference

afquery query                Query single position
afquery query-batch          Batch query multiple positions
afquery annotate             Annotate VCF file
afquery info                 Show database info
afquery preprocess           Build database from VCFs
afquery add-samples          Add new samples to database
afquery compact              Remove samples and reclaim space
afquery synth                Generate synthetic test database
afquery benchmark            Run performance benchmarks

Run afquery --help for full options.

Development

Running Tests

# All 190 tests
python3 -m pytest --tb=short -q

# Specific test module
python3 -m pytest tests/test_query.py -v

Key Modules

  • afquery.query — QueryEngine, point/batch/region queries
  • afquery.annotate — VCF annotation pipeline
  • afquery.database — Database wrapper (public API)
  • afquery.preprocess — Manifest parsing, VCF ingestion, Parquet building
  • afquery.bitmaps — Roaring Bitmap encoding/decoding
  • afquery.ploidy — Chromosome-specific ploidy rules
  • afquery.models — Data classes (QueryResult, ParsedSample, etc.)

Architecture

See brain/architecture.md for detailed system design, data flow, and query algorithm.

Genome Builds

  • GRCh37 (hg19) — PAR regions: chrX:1-2649520
  • GRCh38 (hg38) — PAR regions: chrX:1-3099677
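
A tiny sketch of a PAR lookup using the boundaries listed above (illustrative; afquery.ploidy holds the build definitions actually used):

PAR_X_END = {"GRCh37": 2_649_520, "GRCh38": 3_099_677}

def in_par(build: str, chrom: str, pos: int) -> bool:
    """True if a 1-based chrX position falls inside the listed PAR interval."""
    return chrom in ("chrX", "X") and 1 <= pos <= PAR_X_END[build]

print(in_par("GRCh38", "chrX", 2_000_000))  # True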

Technologies Supported

  • WGS — Whole genome sequencing (always fully covered, no BED file needed)

    • Manifest: tech_name = wgs (case-insensitive)
    • Query-time: All positions in genome considered covered
  • WES — Whole exome sequencing (coverage defined by BED file)

    • Manifest: tech_name = wes_kit_a (or any custom name)
    • Preprocessing: Loads {bed_dir}/wes_kit_a.bed (0-based half-open BED format)
    • Query-time: Only positions within BED intervals considered covered
  • Custom — Any technology with a BED file (e.g., gene panels, targeted sequencing)

    • Manifest: Use any tech_name
    • Preprocessing: Loads {bed_dir}/{tech_name}.bed
    • Query-time: Respects BED file coverage

Troubleshooting

ImportError: cyvcf2

cyvcf2 is imported inside the worker processes during preprocessing; do not import it at module level.

DuckDB Temp Files

Variant data is stored as Parquet (not Arrow IPC) for compatibility. Set DUCKDB_TEMP_DIRECTORY if DuckDB's temporary files need to be written to a different location.

Sample IDs

Sample IDs are 0-indexed and monotonically increasing. Never reuse the IDs of removed samples; run afquery compact to reclaim disk space.

Contributing

  1. Read brain/project_state.json for current phase and test count
  2. Read brain/architecture.md for system design
  3. Follow code conventions in CLAUDE.md
  4. Update brain/ docs after architectural changes
  5. Run tests before submitting

License

(Add your license here)

Citation

If you use afquery in research, please cite:

(Citation format to be determined)

Status: Phase 5 complete (190 tests passing). Active development.
