AFQuery

Genomic allele frequency query engine with bitmap-encoded genotypes.

Fast, file-based genomic allele frequency queries for large cohorts (10K–50K samples). No server, no cloud — just files. Sub-100ms point queries with flexible filtering by sex, phenotype (ICD codes), and sequencing technology.

Quick Example

pip install afquery

# Build database from your VCFs
afquery create-db --manifest samples.tsv --output-dir ./db/ --genome-build GRCh38

# Query allele frequency
afquery query --db ./db/ --chrom chr1 --pos 123456 --phenotype E11.9 --sex female

# Annotate a VCF
afquery annotate --db ./db/ --input variants.vcf --output annotated.vcf

Full documentation →

Features

  • Sub-100ms point queries on 50K-sample cohorts
  • Filter by sex, phenotype (ICD codes), and sequencing technology
  • Bitmap-compressed storage (Roaring Bitmaps + Parquet)
  • Incremental updates (add/remove samples without full rebuild)
  • VCF annotation with custom sample subsets
  • Ploidy-aware AN for sex chromosomes (X/Y/MT)
  • Zero infrastructure — purely file-based

Installation

pip install afquery
# or
conda install -c bioconda -c conda-forge afquery

Requires Python 3.10+.

Quick Start

1. Create a Database

First, prepare a manifest TSV with sample metadata:

sample_name	sex	tech_name	vcf_path	phenotype_codes
sample_1	male	wgs	vcfs/sample_1.vcf	E11.9,I10
sample_2	female	wes_kit_a	vcfs/sample_2.vcf	E11.9
sample_3	male	wgs	vcfs/sample_3.vcf	I10

Key points about the manifest:

  • tech_name: Either wgs (case-insensitive) for whole genome, or a custom technology name
  • vcf_path: Path to the single-sample VCF file (relative to manifest directory, or absolute)
  • For WES/exome technologies, capture regions are loaded from --bed-dir/{tech_name}.bed
    • Example: tech_name=wes_kit_a → loads beds/wes_kit_a.bed

Organize your files:

project/
├── manifest.tsv
├── vcfs/
│   ├── sample_1.vcf
│   ├── sample_2.vcf
│   └── sample_3.vcf
└── beds/              # Required for non-WGS technologies
    └── wes_kit_a.bed  # BED file for WES kit A
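
Before building, it can help to sanity-check the manifest against this layout. The sketch below uses only the Python standard library and assumes the directory names shown above; it is not part of afquery itself.

# Check that every vcf_path exists and every non-WGS technology has a BED file.
import csv
from pathlib import Path

root = Path("project")
missing = []
with open(root / "manifest.tsv", newline="") as fh:
    for row in csv.DictReader(fh, delimiter="\t"):
        if not (root / row["vcf_path"]).exists():
            missing.append(row["vcf_path"])
        tech = row["tech_name"]
        if tech.lower() != "wgs" and not (root / "beds" / f"{tech}.bed").exists():
            missing.append(f"beds/{tech}.bed")
print("missing files:", missing or "none")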

Then build your database:

afquery create-db \
  --manifest manifest.tsv \
  --bed-dir ./beds/ \
  --output-dir ./my_db/ \
  --genome-build GRCh38

This creates:

  • my_db/manifest.json — database metadata
  • my_db/metadata.sqlite — samples, technologies, phenotype codes, precomputed bitmaps
  • my_db/variants/{chrom}.parquet — variant data with encoded genotypes
  • my_db/capture/ — capture regions for each technology
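
To confirm what was built, you can read the metadata file directly. A minimal sketch, assuming the manifest.json fields listed under Database Structure below (genome_build, sample_count, schema_version):

# Read the database metadata written by create-db.
import json

with open("my_db/manifest.json") as fh:
    meta = json.load(fh)
print(meta.get("genome_build"), meta.get("sample_count"), meta.get("schema_version"))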

2. Query Allele Frequencies

# Point query
afquery query --db my_db --chrom chr1 --pos 1000 --alt G

# Batch query (from file with columns: pos ref alt)
afquery query --db my_db --chrom chr1 --from-file positions.tsv --phenotype E11.9

# Region query
afquery query --db my_db --chrom chr1 --start 1000 --end 10000 --phenotype E11.9 --sex male
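
For the batch query, positions.tsv is a plain file with one variant per line and the pos, ref, alt columns noted in the comment above. A hypothetical example (whether a header row is expected is not specified here):

1000	A	G
1500	A	T
3500	G	C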

3. Annotate VCF Files

afquery annotate \
  --db my_db \
  --input input.vcf \
  --output annotated.vcf \
  --phenotype E11.9 \
  --tech WGS

Adds AFQUERY_AC, AFQUERY_AN, AFQUERY_AF and genotype fields to your VCF.
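
Downstream tools can consume these INFO fields with any VCF library. For example, a sketch that filters the annotated output on AFQUERY_AF using cyvcf2 (the 1% threshold and file names are arbitrary):

# Keep only variants whose AFQuery cohort frequency is below 1%.
from cyvcf2 import VCF, Writer

vcf = VCF("annotated.vcf")
out = Writer("rare.vcf", vcf)
for variant in vcf:
    af = variant.INFO.get("AFQUERY_AF")
    if af is not None and float(af) < 0.01:
        out.write_record(variant)
out.close()
vcf.close()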

Python API

from afquery import Database

db = Database("/path/to/db")

# Single position query
# Automatically filters samples by: sex + phenotype codes + capture coverage
results = db.query(
    chrom="chr1",
    pos=1000,
    alt="G",
    phenotype=["E11.9"],
    sex="both"
)
for r in results:
    print(f"AC={r.AC}, AN={r.AN}, AF={r.AF}")

# Batch query (multi-variant)
results = db.query_batch(
    "chr1",
    variants=[(1500, "A", "T"), (3500, "G", "C")],
    phenotype=["E11.9"],
)

# Region query (genomic range)
results = db.query_region(
    chrom="chr1",
    start=1000,
    end=10000,
    phenotype=["E11.9", "I10"]
)

# Annotate VCF with allele frequencies
# Note: tech filters annotation to samples of that technology
db.annotate_vcf(
    input_vcf="input.vcf",
    output_vcf="annotated.vcf",
    phenotype=["E11.9"],
    tech=["wgs"]  # Only annotate using WGS samples
)
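
Query results are plain Python objects, so they can be tabulated directly. A minimal sketch using pandas and only the AC, AN, and AF attributes shown above (other QueryResult fields are not assumed):

# Collect region-query results into a DataFrame for downstream analysis.
import pandas as pd

region = db.query_region(chrom="chr1", start=1000, end=10000, phenotype=["E11.9"])
df = pd.DataFrame({"AC": [r.AC for r in region],
                   "AN": [r.AN for r in region],
                   "AF": [r.AF for r in region]})
print(df.describe())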

How samples are filtered in queries:

  • Sex filter: male, female, or both
  • Phenotype filter: all specified codes must match
  • Capture filter: automatic; only samples whose technology's BED file covers the queried position are included

Database Structure

my_db/
├── manifest.json          # Metadata: genome_build, sample_count, schema_version
├── metadata.sqlite        # SQLite: samples, technologies, phenotype codes, bitmaps
├── variants/
│   ├── chr1.parquet
│   ├── chr2.parquet
│   └── ...
└── capture/
    ├── tech_0.pickle      # WGS capture region (always covered)
    └── tech_1.pickle      # WES kit capture region

Each variant row contains:

  • pos — 1-based genomic position
  • ref — reference allele
  • alt — alternate allele
  • het_bitmap — Roaring Bitmap of heterozygous samples
  • hom_bitmap — Roaring Bitmap of homozygous samples
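
For intuition, AC at an autosomal site follows directly from these two bitmaps intersected with the eligible samples. A sketch assuming the pyroaring package (the actual encoding/decoding lives in afquery.bitmaps):

# Derive AC from the per-variant bitmaps; sample IDs are illustrative.
from pyroaring import BitMap

het = BitMap([3, 17, 42])        # samples heterozygous at this site
hom = BitMap([5, 9])             # samples homozygous-alt at this site
eligible = BitMap(range(100))    # samples passing sex/phenotype/capture filters

# On autosomes each het sample contributes 1 alt allele, each hom sample 2.
ac = len(het & eligible) + 2 * len(hom & eligible)
print(ac)  # 7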

How Capture BED Files Are Associated with Samples

Samples are linked to capture regions through their technology:

  1. Manifest specifies technology: Each sample lists tech_name (e.g., wgs, wes_kit_a)
  2. Technology maps to BED file:
    • WGS: No BED file needed (always fully covered)
    • WES/Custom: BED file loaded from {bed_dir}/{tech_name}.bed
  3. Storage in database:
    • metadata.sqlite::technologies stores tech_id, tech_name, and bed_path
    • metadata.sqlite::samples stores sample_id, sample_name, and tech_id (foreign key)
  4. Query-time filtering: When querying, samples are filtered by:
    • Sex (male/female/both)
    • Phenotype diagnosis codes
    • Capture region coverage (via tech's BED file)

Example: If you have samples on two exome kits:

sample_name	sex	tech_name	vcf_path	phenotype_codes
S001	male	exome_v1	vcfs/S001.vcf	E11.9
S002	female	exome_v1	vcfs/S002.vcf	E11.9
S003	male	exome_v2	vcfs/S003.vcf	I10

Then provide:

beds/
├── exome_v1.bed    # Coverage for samples S001, S002
└── exome_v2.bed    # Coverage for sample S003

At query time, each sample's eligible regions are determined by its tech's BED file.
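
Conceptually, that per-sample eligibility check is an interval lookup against the technology's BED file. A simplified sketch (afquery stores prepared capture objects in capture/, so this is illustration only):

# Decide whether a 1-based position falls inside a technology's capture regions.
def load_bed(path):
    """Return {chrom: [(start, end), ...]} with 0-based, half-open BED intervals."""
    regions = {}
    with open(path) as fh:
        for line in fh:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions

def covers(regions, chrom, pos):
    """pos is 1-based; BED intervals are 0-based half-open."""
    return any(start < pos <= end for start, end in regions.get(chrom, []))

wes_kit = load_bed("beds/exome_v1.bed")
print(covers(wes_kit, "chr1", 1_000_000))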

Advanced Features

Incremental Updates

afquery update-db \
  --db my_db \
  --add-samples new_samples.tsv

Adds new samples without rebuilding the entire database.
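
The added-samples file presumably uses the same columns as the build manifest (an assumption; run afquery --help to confirm the exact format). A hypothetical new_samples.tsv:

sample_name	sex	tech_name	vcf_path	phenotype_codes
sample_4	female	wgs	vcfs/sample_4.vcf	I10
sample_5	male	wes_kit_a	vcfs/sample_5.vcf	E11.9,I10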

Remove Samples and Compact

Remove samples and reclaim disk space:

afquery update-db --db my_db --remove-samples sample_1,sample_2 --compact

Run Benchmarks

afquery benchmark --n-samples 5000 --n-variants 100000

Ploidy Rules

AF computation respects chromosome-specific ploidy:

Chromosome       Formula
Autosomes        AN = 2 × eligible_samples
chrM             AN = 1 × eligible_samples
chrY             AN = 1 × eligible_males
chrX (PAR)       AN = 2 × eligible_samples
chrX (non-PAR)   AN = 2 × eligible_females + 1 × eligible_males

Where eligible = samples matching sex, phenotype, and technology capture filters.
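
A minimal sketch of these rules in code, assuming the single PAR boundary listed under Genome Builds below for GRCh38; the function is hypothetical and not part of the afquery API:

# Illustrative AN computation; PAR boundary follows the Genome Builds section (GRCh38).
PAR_END_GRCH38 = 3_099_677  # chrX:1-3099677 treated as pseudoautosomal

def allele_number(chrom, pos, n_males, n_females):
    """AN for one site, given counts of eligible male and female samples."""
    n_samples = n_males + n_females
    if chrom in ("chrM", "MT"):
        return 1 * n_samples
    if chrom in ("chrY", "Y"):
        return 1 * n_males
    if chrom in ("chrX", "X"):
        if pos <= PAR_END_GRCH38:           # PAR: diploid in everyone
            return 2 * n_samples
        return 2 * n_females + 1 * n_males  # non-PAR X
    return 2 * n_samples                    # autosomes

# 10 eligible males + 15 eligible females at a non-PAR chrX site -> AN = 40
assert allele_number("chrX", 50_000_000, 10, 15) == 40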

Performance Targets

  • Point query (cold): <100 ms
  • Point query (warm): ~10 ms
  • Batch 100 positions: ~200 ms
  • VCF annotation (5K variants): ~30 s
  • VCF annotation (5M variants): ~30 min

Command Reference

afquery query                Query one position, region, or batch (--from-file)
afquery annotate             Annotate VCF file with AF info fields
afquery dump                 Export allele frequencies for all variants to CSV
afquery info                 Show database metadata and sample list
afquery check                Validate database integrity
afquery create-db            Build database from a VCF manifest
afquery update-db            Add/remove samples or compact the database
afquery version              Show or set the database version label
afquery benchmark            Run performance benchmarks

Run afquery --help for full options.

Development

Running Tests

# All tests
python3 -m pytest --tb=short -q

# Specific test module
python3 -m pytest tests/test_query.py -v

Key Modules

  • afquery.query — QueryEngine, point/batch/region queries
  • afquery.annotate — VCF annotation pipeline
  • afquery.database — Database wrapper (public API)
  • afquery.preprocess — Manifest parsing, VCF ingestion, Parquet building
  • afquery.bitmaps — Roaring Bitmap encoding/decoding
  • afquery.ploidy — Chromosome-specific ploidy rules
  • afquery.models — Data classes (QueryResult, ParsedSample, etc.)

Genome Builds

  • GRCh37 (hg19) — PAR regions: chrX:1-2649520
  • GRCh38 (hg38) — PAR regions: chrX:1-3099677

Troubleshooting

ImportError: cyvcf2

cyvcf2 is imported inside the worker processes during preprocessing; do not import it at module level.
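
In practice that means deferring the import to the function that runs inside each worker. A minimal sketch (the worker function name is hypothetical):

# Defer the cyvcf2 import to the worker function rather than module scope.
def parse_sample_vcf(vcf_path):
    from cyvcf2 import VCF  # imported per worker process, not at module level
    return [(v.CHROM, v.POS, v.REF, v.ALT) for v in VCF(vcf_path)]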

DuckDB Temp Files

AFQuery uses the Parquet format (not Arrow IPC) for compatibility. Set DUCKDB_TEMP_DIRECTORY if you need to control where DuckDB writes its temporary files.

License

(Add your license here)

Citation

If you use afquery in research, please cite:

(Citation format to be determined)
