yallHap

Modern, pipeline-friendly Y-chromosome haplogroup inference.

Features

YFull tree: Uses the most comprehensive Y-chromosome phylogeny (185,780+ SNPs)
Probabilistic scoring: Likelihood-based confidence scores, not just SNP counting
Ancient DNA support: Built-in damage filtering, transversions-only mode, quality rescaling
Multiple references: Supports GRCh37, GRCh38, and T2T-CHM13v2.0 with automatic liftover
Multi-threaded: Parallel sample processing with --threads N for population-scale studies
Batch processing: Classify thousands of samples efficiently with classify_batch()
Pipeline-friendly: Proper exit codes, JSON/TSV output, Nextflow/Snakemake examples
Bioconda/Docker: Easy installation and containerized execution

Accuracy

Validated against established datasets:

Dataset	Samples	Same Major Lineage	Reference	Notes
1000 Genomes Phase 3	1,233	99.8% (95% CI: 99.3-100%)	GRCh37	Modern WGS, heuristic mode
AADR Ancient DNA	7,333	90.7% Bayesian / 88.3% Heuristic	GRCh37	Full dataset, stratified by variant density
gnomAD HGDP/1KG	1,231	99.9% (95% CI: 99.5-100%)	GRCh38	High-coverage WGS

1000 Genomes details:

Only 3 misclassified samples (2 rare A0 haplogroups, 1 NO/K confusion)
Mean confidence: 0.994
Mean derived SNPs: 15.4

AADR Ancient DNA details (7,333 samples):

Overall: 90.7% accuracy with Bayesian ancient mode vs 88.3% with heuristic transversions-only
Stratified by variant density: <1% (33.7%), 1-4% (37.9%), 4-10% (71.7%), 10-50% (97.8%), ≥50% (99.0%)
At ≥10% variant density, both modes achieve 97-99% accuracy, comparable to modern WGS
Bayesian mode recommended for 4-10% variant density (+12-24 pp improvement)
Variant density = (called variants / total variants in chrY VCF) × 100%

gnomAD High-Coverage details:

200 samples randomly selected from 1,231 overlapping with 1000 Genomes
30× high-coverage whole-genome sequencing
Mean derived SNPs: 26.7
95% confidence interval: 98.17-100%
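The 98.17% lower bound above is what an exact (Clopper-Pearson) binomial interval gives when all 200 sampled calls match. The source does not state the interval method or the number of matches, so both are assumptions in this sketch, and the function name is illustrative only.

```python
def exact_lower_bound_all_correct(n: int, confidence: float = 0.95) -> float:
    """Two-sided Clopper-Pearson lower bound when all n trials succeed.

    With n successes out of n, the exact lower limit has the closed form
    (alpha / 2) ** (1 / n), so no statistics library is needed.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    alpha = 1.0 - confidence
    return (alpha / 2.0) ** (1.0 / n)


lower = exact_lower_bound_all_correct(200)
print(f"95% CI: {lower * 100:.2f}-100%")  # prints "95% CI: 98.17-100%"
```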

See VALIDATION_TESTING.md for reproducible validation protocols.

Installation

pip (recommended)

pip install yallhap

Conda

conda install -c bioconda yallhap

Docker

docker pull trianglegrrl/yallhap

From source

git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap
pip install -e ".[dev]"

Quick Start

1. Download reference data

yallhap download --output-dir data/

This downloads:

YFull tree JSON (~14 MB)
YBrowse SNP database for GRCh38 (~430 MB)
YBrowse SNP database for GRCh37 (~50 MB)

2. Classify a sample

Use the SNP database matching your VCF's reference genome:

# For GRCh38/hg38 VCFs
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --reference grch38 \
    --output result.json

# For GRCh37/hg19 VCFs
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch37.csv \
    --reference grch37 \
    --output result.json

3. View results

jq '.haplogroup, .confidence' result.json
# "R-L21"
# 0.97
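When jq is not available, the same two fields can be read with the Python standard library. This is a minimal sketch that relies only on the `haplogroup` and `confidence` keys shown in the JSON output format; the helper name `parse_result` is hypothetical and not part of yallHap.

```python
import json


def parse_result(text: str) -> tuple[str, float]:
    """Return (haplogroup, confidence) from the text of a yallhap JSON result."""
    data = json.loads(text)
    return data["haplogroup"], float(data["confidence"])


# Inline sample mirroring the documented output; in practice, pass the
# contents of result.json instead.
sample = '{"sample": "SAMPLE1", "haplogroup": "R-L21", "confidence": 0.97}'
haplogroup, confidence = parse_result(sample)
print(haplogroup, confidence)  # prints "R-L21 0.97"
```

For a real file, `parse_result(open("result.json").read())` would do the same thing.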

Usage

Single Sample Classification

yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --reference grch38 \
    --output result.json

Multi-Sample VCF

For VCFs containing multiple samples, specify which sample to classify:

yallhap classify multi_sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --sample NA12878 \
    --output result.json

Batch Processing

Process multiple VCF files into a single TSV:

yallhap batch sample1.vcf.gz sample2.vcf.gz sample3.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --output results.tsv

Parallel Processing

Use multiple threads for faster batch processing:

yallhap batch samples/*.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --threads 16 \
    --output results.tsv

With 16 threads, processing 1,000+ samples takes approximately 10 minutes.

TSV Output Format

Use --format tsv for tab-separated output (useful for pipelines):

yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --format tsv \
    --output result.tsv

Reference Genomes

yallHap supports three reference genomes. Use the SNP database matching your VCF's reference:

VCF Reference	SNP Database	`-r` flag
GRCh37/hg19	`ybrowse_snps_grch37.csv`	`grch37`
GRCh38/hg38	`ybrowse_snps_grch38.csv`	`grch38`
T2T-CHM13v2.0	`ybrowse_snps_grch38.csv`	`t2t`

# GRCh37 (hg19) - 1000 Genomes Phase 3, many ancient DNA datasets
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch37.csv -r grch37 ...

# GRCh38 (hg38) - current standard, gnomAD, most modern studies
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch38.csv -r grch38 ...

# T2T-CHM13v2.0 - complete Y chromosome (62 Mb)
yallhap classify sample.vcf.gz \
    -s data/ybrowse_snps_grch38.csv -r t2t ...

T2T Note: T2T coordinates are computed automatically via liftover from GRCh38 positions. Ensure liftover chain files are available (run python scripts/download_liftover_chains.py).

Ancient DNA Mode

yallHap includes specialized handling for ancient DNA samples with post-mortem damage.

Recommended: Bayesian Ancient Mode

For ancient DNA samples with moderate variant density (4–10%), Bayesian ancient mode is recommended, achieving +12–24 percentage point improvement over heuristic mode in this range:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --bayesian \
    --output result.json

Variant density is calculated as (called variants / total variants in chrY VCF) × 100% and can be computed directly from your VCF. At ≥10% variant density, both modes achieve comparable accuracy (97–99%). Below 4%, classification is unreliable regardless of mode.
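The density formula can be approximated with a few lines of plain Python. This sketch assumes that "called" means a non-missing GT value for the chosen sample and that "total" means every data record in the VCF. yallHap may count differently, and `variant_density` is a hypothetical helper, not part of its API.

```python
from typing import Iterable


def variant_density(lines: Iterable[str], sample: str) -> float:
    """Percent of VCF records with a non-missing genotype for `sample`."""
    sample_col = None
    total = called = 0
    for line in lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            sample_col = fields.index(sample)  # ValueError if sample is absent
            continue
        if sample_col is None:
            raise ValueError("VCF header line (#CHROM) not found")
        total += 1
        # Assumes GT is present; the VCF spec requires it to be the first key.
        gt = fields[sample_col].split(":")[0]
        # ".", "./." and ".|." all reduce to an empty string, i.e. missing.
        if gt.replace("/", "").replace("|", "").replace(".", ""):
            called += 1
    return 100.0 * called / total if total else 0.0


demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "chrY\t100\t.\tA\tG\t.\t.\t.\tGT\t1\n",
    "chrY\t200\t.\tC\tT\t.\t.\t.\tGT\t.\n",
    "chrY\t300\t.\tG\tA\t.\t.\t.\tGT\t0\n",
    "chrY\t400\t.\tT\tC\t.\t.\t.\tGT\t.\n",
]
print(variant_density(demo, "S1"))  # prints 50.0
```

For a compressed file, the same function accepts a handle opened with `gzip.open("ancient.vcf.gz", "rt")`.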

Basic Ancient Mode

Filters C>T and G>A transitions at read termini:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --min-depth 1 \
    --output result.json

Transversions-Only Mode

Strictest mode for heavily damaged samples (ignores all transitions):

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --transversions-only \
    --output result.json
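For readers unfamiliar with the distinction, a transversion swaps a purine (A, G) for a pyrimidine (C, T) or vice versa, while a transition stays within one class. The check below is a generic illustration of that rule, not yallHap's internal code, and `is_transversion` is a hypothetical name.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref: str, alt: str) -> bool:
    """True if the substitution changes base class (purine <-> pyrimidine)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    return (ref in PURINES) != (alt in PURINES)


print(is_transversion("C", "T"))  # prints False (transition, damage-prone)
print(is_transversion("C", "A"))  # prints True (transversion)
```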

Damage Rescaling

Downweight potentially damaged variants without excluding them:

yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --damage-rescale moderate \
    --output result.json

Options for --damage-rescale:

none (default): No rescaling
moderate: 50% weight reduction for damage-like transitions
aggressive: 80% weight reduction
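The arithmetic behind these options can be sketched as a lookup of retained weights. The source gives only the reduction percentages, so how yallHap applies them internally is an assumption here, as are the names `RETAINED_WEIGHT` and `rescaled_weight` and the choice to treat exactly C>T and G>A as damage-like.

```python
# Weight kept for a damage-like transition under each documented option:
# "moderate" is a 50% reduction, "aggressive" an 80% reduction.
RETAINED_WEIGHT = {"none": 1.0, "moderate": 0.5, "aggressive": 0.2}

# Substitutions typical of post-mortem deamination.
DAMAGE_LIKE = {("C", "T"), ("G", "A")}


def rescaled_weight(ref: str, alt: str, mode: str = "none") -> float:
    """Weight for one variant: reduced only for damage-like transitions."""
    if mode not in RETAINED_WEIGHT:
        raise ValueError(f"unknown damage-rescale mode: {mode}")
    if (ref.upper(), alt.upper()) in DAMAGE_LIKE:
        return RETAINED_WEIGHT[mode]
    return 1.0


print(rescaled_weight("C", "T", "moderate"))    # prints 0.5
print(rescaled_weight("C", "A", "aggressive"))  # prints 1.0
```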

Python API

Single Sample

from yallhap.tree import Tree
from yallhap.snps import SNPDatabase
from yallhap.classifier import HaplogroupClassifier

# Load resources
tree = Tree.from_json("data/yfull_tree.json")
snp_db = SNPDatabase.from_csv("data/ybrowse_snps_grch38.csv")

# Create classifier
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch38",
)

# Classify
result = classifier.classify("sample.vcf.gz")
print(f"{result.sample}: {result.haplogroup} (confidence: {result.confidence:.2f})")

Batch Classification (Multi-Sample VCF)

For multi-sample VCFs, classify_batch() is 10x faster than calling classify() repeatedly:

# Reuses `classifier` from the Single Sample example above
# Get list of sample names to classify
samples = ["NA12878", "NA12891", "NA12892"]

# Classify all samples in one pass
results = classifier.classify_batch("multi_sample.vcf.gz", samples)

for result in results:
    print(f"{result.sample}: {result.haplogroup}")

Ancient DNA Mode

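# These examples use reference="grch37", so load the matching SNP database
snp_db = SNPDatabase.from_csv("data/ybrowse_snps_grch37.csv")

# Recommended: Bayesian ancient mode for moderate variant density (4-10%)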
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    bayesian=True,  # Recommended for 4-10% variant density
)

# Alternative: Transversions-only mode (strictest filtering)
classifier = HaplogroupClassifier(
    tree=tree,
    snp_db=snp_db,
    reference="grch37",
    ancient_mode=True,
    transversions_only=True,
    damage_rescale="moderate",
)

Output Format

JSON (default)

{
  "sample": "SAMPLE1",
  "haplogroup": "R-L21",
  "confidence": 0.97,
  "reference": "grch38",
  "tree_version": "YFull (185780 SNPs, hash: a1b2c3d4)",
  "snp_stats": {
    "informative_tested": 1247,
    "derived": 145,
    "ancestral": 1089,
    "missing": 13,
    "filtered_damage": 0
  },
  "quality_scores": {
    "qc1_backbone": 0.98,
    "qc2_terminal": 1.0,
    "qc3_path": 0.95,
    "qc4_posterior": 0.97
  },
  "path": ["ROOT", "A0-T", "A1", "...", "R-L21"],
  "defining_snps": ["L21"]
}

Reproducibility

The tree_version field includes a hash of the tree file content, enabling exact reproducibility. When citing yallHap results, include the tree_version value to document the exact phylogeny version used. The format is:

YFull (<snp_count> SNPs, hash: <8-char SHA256>)

Example: "YFull (185780 SNPs, hash: a1b2c3d4)"
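To check that a local tree file matches a reported tree_version, a short digest can be recomputed independently. This sketch assumes the hash is the first 8 hex characters of a SHA-256 over the raw file bytes. The source says only "8-char SHA256", so yallHap may derive it differently, and both function names are hypothetical.

```python
import hashlib


def short_sha256(data: bytes, length: int = 8) -> str:
    """First `length` hex characters of the SHA-256 digest of `data`."""
    return hashlib.sha256(data).hexdigest()[:length]


def format_tree_version(data: bytes, snp_count: int) -> str:
    """Build a string in the documented tree_version format."""
    return f"YFull ({snp_count} SNPs, hash: {short_sha256(data)})"


# Demonstration on placeholder bytes; for a real check, read the tree
# file in binary mode and pass its contents.
print(format_tree_version(b"", 0))  # prints "YFull (0 SNPs, hash: e3b0c442)"
```

If the recomputed value differs from the one in a result file, that suggests either a different tree or a different hashing convention than assumed here.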

TSV (for batch processing)

sample	haplogroup	confidence	qc1	qc2	qc3	qc4	derived	ancestral	missing
SAMPLE1	R-L21	0.9700	0.9800	1.0000	0.9500	0.9700	145	1089	13
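Batch TSV output can be post-processed with the standard csv module. The sketch below lists samples under a confidence threshold, using the column names from the header above. The second data row is invented for illustration, and `low_confidence_samples` is a hypothetical helper.

```python
import csv
import io


def low_confidence_samples(tsv_text: str, threshold: float = 0.95) -> list[str]:
    """Names of samples whose confidence falls below `threshold`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["sample"] for row in reader if float(row["confidence"]) < threshold]


tsv = (
    "sample\thaplogroup\tconfidence\tqc1\tqc2\tqc3\tqc4\tderived\tancestral\tmissing\n"
    "SAMPLE1\tR-L21\t0.9700\t0.9800\t1.0000\t0.9500\t0.9700\t145\t1089\t13\n"
    # Hypothetical low-confidence row, not taken from real output.
    "SAMPLE2\tR-L21\t0.8000\t0.9000\t0.7000\t0.8000\t0.8000\t20\t300\t927\n"
)
print(low_confidence_samples(tsv))  # prints ['SAMPLE2']
```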

Exit Codes

Code	Meaning
0	Success (high confidence, ≥0.95)
1	Classification failed (no haplogroup)
2	Low confidence (<0.95)
10	File not found
11	Invalid input
99	Unexpected error
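A wrapper script can translate these codes into messages and decide whether a result is still usable. This is a sketch built only from the table above. The helper names are hypothetical, and treating code 2 as usable is a policy choice for illustration rather than something the source prescribes.

```python
import subprocess

# Meanings copied from the exit code table.
EXIT_MEANINGS = {
    0: "Success (high confidence, >=0.95)",
    1: "Classification failed (no haplogroup)",
    2: "Low confidence (<0.95)",
    10: "File not found",
    11: "Invalid input",
    99: "Unexpected error",
}


def describe_exit(code: int) -> str:
    """Human-readable meaning of a yallhap exit code."""
    return EXIT_MEANINGS.get(code, f"Undocumented exit code {code}")


def has_haplogroup(code: int) -> bool:
    """Codes 0 and 2 both mean a haplogroup was called."""
    return code in (0, 2)


def run_classify(args: list[str]) -> int:
    """Run `yallhap classify` with the given arguments; return its exit code."""
    return subprocess.run(["yallhap", "classify", *args], check=False).returncode


print(describe_exit(2))   # prints "Low confidence (<0.95)"
print(has_haplogroup(2))  # prints True
```

Workflow managers typically treat any nonzero exit status as a failed job, so a pipeline that wants to keep low-confidence calls may need to handle code 2 explicitly.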

Quality Scores

Score	Name	Description
QC1	Backbone	Intermediate markers on path to haplogroup match expected states
QC2	Terminal	Defining markers for called haplogroup are present
QC3	Path	Consistency within the called haplogroup branch
QC4	Posterior	Overall posterior probability from likelihood calculation

CLI Reference

`yallhap classify`

Classify a single VCF file.

Usage: yallhap classify [OPTIONS] VCF

Options:
  -t, --tree PATH          Path to YFull tree JSON [required]
  -s, --snp-db PATH        Path to SNP database CSV [required]
  -r, --reference TEXT     Reference genome: grch37, grch38, t2t [default: grch38]
  --sample TEXT            Sample name (for multi-sample VCFs)
  --ancient                Enable ancient DNA mode
  --bayesian               Enable Bayesian classification (experimental)
  --transversions-only     Only use transversions (strictest aDNA mode)
  --damage-rescale TEXT    Rescale quality: none, moderate, aggressive
  --min-depth INTEGER      Minimum read depth [default: 1]
  --min-quality INTEGER    Minimum base quality [default: 20]
  -o, --output PATH        Output file (stdout if omitted)
  --format TEXT            Output format: json, tsv [default: json]

`yallhap batch`

Batch process multiple VCF files.

Usage: yallhap batch [OPTIONS] VCF_FILES...

Options:
  -t, --tree PATH          Path to YFull tree JSON [required]
  -s, --snp-db PATH        Path to SNP database CSV [required]
  -r, --reference TEXT     Reference genome: grch37, grch38, t2t [default: grch38]
  --ancient                Enable ancient DNA mode
  --transversions-only     Only use transversions
  --damage-rescale TEXT    Rescale quality: none, moderate, aggressive
  -o, --output PATH        Output TSV file [required]
  --threads INTEGER        Parallel threads [default: 1]

`yallhap download`

Download reference data (YFull tree + SNP databases for GRCh38 and GRCh37; T2T uses the GRCh38 database via liftover).

Usage: yallhap download [OPTIONS]

Options:
  -o, --output-dir PATH    Output directory [default: data/]
  -f, --force              Overwrite existing files

Downloads:

yfull_tree.json - YFull phylogenetic tree (~14 MB)
ybrowse_snps_grch38.csv - SNP positions for GRCh38/hg38 (~430 MB)
ybrowse_snps_grch37.csv - SNP positions for GRCh37/hg19 (~50 MB)

Pipeline Integration

Nextflow

See pipelines/nextflow/ for a complete example.

process YALLHAP {
    input:
    path vcf

    output:
    path "*.json"

    script:
    """
    yallhap classify ${vcf} \
        --tree ${params.tree} \
        --snp-db ${params.snp_db} \
        --reference ${params.reference} \
        --output ${vcf.baseName}.json
    """
}

Snakemake

See pipelines/snakemake/ for a complete example.

rule yallhap:
    input:
        vcf="{sample}.vcf.gz"
    output:
        json="{sample}.haplogroup.json"
    params:
        tree=config["yallhap_tree"],
        snp_db=config["yallhap_snps"]
    shell:
        """
        yallhap classify {input.vcf} \
            --tree {params.tree} \
            --snp-db {params.snp_db} \
            --output {output.json}
        """

Experimental Features

Bayesian Mode

A Bayesian classification mode is available that computes posterior probabilities over tree paths using log-likelihood ratios:

# For modern samples
yallhap classify sample.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --bayesian \
    --output result.json

# For ancient DNA (recommended for 4-10% variant density)
yallhap classify ancient.vcf.gz \
    --tree data/yfull_tree.json \
    --snp-db data/ybrowse_snps_grch38.csv \
    --ancient \
    --bayesian \
    --output result.json

Performance: On modern high-coverage samples (1000 Genomes, gnomAD), Bayesian mode produces results identical to heuristic mode, so it brings no accuracy improvement there. For ancient DNA with moderate variant density (4–10%), Bayesian ancient mode achieves a +12–24 percentage point improvement over heuristic mode (71.7% vs 52.4% accuracy). On the full AADR ancient DNA dataset (7,333 samples), Bayesian ancient mode achieves 90.7% accuracy vs 88.3% for heuristic transversions-only mode.

This mode incorporates allelic depth (AD) information when available and uses adjusted error rates for ancient DNA damage modeling. For modern samples, heuristic mode is recommended for speed; for ancient DNA at 4–10% variant density, Bayesian mode is recommended for improved accuracy.
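The general idea of turning per-path log-likelihoods into posterior probabilities can be illustrated in a few lines. The source does not describe yallHap's actual model, so this is a generic sketch assuming a uniform prior over candidate paths. The log-likelihood values are made up and `path_posteriors` is a hypothetical name.

```python
import math


def path_posteriors(log_likelihoods: dict[str, float]) -> dict[str, float]:
    """Normalize log-likelihoods into posteriors under a uniform prior."""
    if not log_likelihoods:
        raise ValueError("at least one candidate path is required")
    # Subtract the maximum before exponentiating to avoid underflow.
    peak = max(log_likelihoods.values())
    weights = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


# Hypothetical log-likelihoods for three candidate haplogroups.
scores = {"R-L21": -10.0, "R-DF13": -12.0, "I-M253": -40.0}
for name, prob in path_posteriors(scores).items():
    print(f"{name}: {prob:.3f}")
```

Only differences between log-likelihoods matter here, which is why the result can equally be described in terms of log-likelihood ratios.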

Development

# Clone repository
git clone https://github.com/trianglegrrl/yallhap.git
cd yallhap

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run linters
black src/ tests/
ruff check src/ tests/
mypy src/

Citation

If you use yallHap in your research, please cite:

@software{yallhap,
  title = {yallHap: Modern Y-chromosome haplogroup inference},
  year = {2025},
  url = {https://github.com/trianglegrrl/yallhap}
}

License

PolyForm Noncommercial License 1.0.0 - see LICENSE for details.

This license allows use for noncommercial purposes, including research, education, and personal projects. For commercial use, please contact the maintainers.

Acknowledgments

YFull for maintaining the comprehensive Y-chromosome phylogeny
YBrowse for the SNP database
Yleaf and pathPhynder for algorithmic inspiration
1000 Genomes Project and AADR for validation data

Release history

1.0.1 (this version): Jan 21, 2026
1.0.0: Jan 21, 2026
0.5.0: Jan 19, 2026
0.4.0: Jan 19, 2026

