eveHap

Universal mtDNA Haplogroup Classifier

eveHap is a Python implementation of mtDNA haplogroup classification, inspired by and building upon Haplogrep 3. It brings haplogroup classification into the Python ecosystem with support for modern high-coverage sequencing data as well as ancient DNA with damage filtering.

Features

  • Pipeline-Friendly: Explicit file paths required - no hidden auto-downloads
  • Offline-Ready: Bundled rsrs and rcrs resources work without internet
  • Multiple Input Formats: BAM, CRAM, VCF, FASTA, HSD, and consumer genotyping files (23andMe, AncestryDNA)
  • Dual Classification Strategy:
    • Kulczynski scoring for high-coverage modern DNA
    • Tree traversal for low-coverage/ancient DNA
  • mitoLeaf Phylotree: Optionally download the complete mitoLeaf phylotree (6400+ haplogroups)
  • Ancient DNA Support: Damage pattern detection and filtering for C→T/G→A substitutions
  • Quality Metrics: Confidence scores, coverage statistics, and QC warnings
  • Flexible Output: TSV, JSON, or human-readable text

Installation

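# Install the released package from PyPI
pip install evehap

# Or install in editable mode from a source checkout
cd evehap
pip install -e .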

Quick Start

# Classify using bundled resources (offline-ready, no download required)
evehap classify sample.bam --tree rsrs --reference rcrs

# Classify a VCF file with specific sample
evehap classify samples.vcf.gz --tree rsrs --reference rcrs --sample-id HG00096

# Classify multiple files with output to TSV
evehap classify *.bam --tree rsrs --reference rcrs -o results.tsv

# Classify ancient DNA with damage filtering
evehap classify ancient.bam --tree rsrs --reference rcrs --damage-filter --method traversal

# Download mitoLeaf tree (more haplogroups) and use it
evehap download --outdir ./evehap_data/
evehap classify sample.bam --tree ./evehap_data/mitoleaf_tree.json --reference ./evehap_data/rCRS.fasta

# Estimate ancient DNA damage rates
evehap damage ancient.bam

# Show version and bundled resources
evehap version

Usage

classify

Classify mtDNA haplogroups from one or more input files.

evehap classify [OPTIONS] INPUT_FILES...

Options:
  --tree TEXT                     REQUIRED: 'rsrs', 'rcrs', or path to tree file
  --reference TEXT                REQUIRED: 'rcrs' or path to reference FASTA
  --format TEXT                   Input format (auto, bam, vcf, fasta, hsd, microarray)
  -o, --output PATH               Output file (default: stdout)
  --output-format [json|tsv|text] Output format
  --sample-id TEXT                Sample ID for multi-sample VCF/HSD files
  --damage-filter                 Apply ancient DNA damage filtering
  --method [auto|kulczynski|traversal]
                                  Classification method
  --top-n INTEGER                 Number of alternatives to report
  -q, --quiet                     Suppress progress output

Tree Options:

Name                Description                  Haplogroups  Requires Download
rsrs                Bundled RSRS-based XML tree  ~5400        No (bundled)
rcrs                Bundled rCRS-based XML tree  ~2400        No (bundled)
mitoleaf_tree.json  mitoLeaf JSON tree           ~6400        Yes

# Use bundled RSRS tree (offline-ready)
evehap classify sample.bam --tree rsrs --reference rcrs

# Use bundled rCRS tree
evehap classify sample.bam --tree rcrs --reference rcrs

# Download and use mitoLeaf tree (most comprehensive)
evehap download --outdir ./evehap_data/
evehap classify sample.bam --tree ./evehap_data/mitoleaf_tree.json --reference ./evehap_data/rCRS.fasta

download

Download phylotree and reference resources.

evehap download --outdir ./evehap_data/

Options:
  -o, --outdir PATH               REQUIRED: Directory to save downloaded files
  --category [all|reference|phylotree]
                                  Category to download (default: all)
  --resource TEXT                 Specific resource(s) to download
  --force                         Overwrite existing files
  --check-updates                 Check for updates without downloading
  --list                          List available resources

# Download all resources
evehap download --outdir ./evehap_data/

# List available resources
evehap download --outdir ./evehap_data/ --list

# Download only the mitoLeaf tree
evehap download --outdir ./evehap_data/ --resource mitoleaf-tree

# Check for updates
evehap download --outdir ./evehap_data/ --check-updates

info

Show information about an input file.

evehap info sample.bam

damage

Estimate ancient DNA damage rates from a BAM file.

evehap damage ancient.bam

download

Download reference sequences and phylotree resources.

evehap download [OPTIONS]

Options:
  -o, --outdir PATH                     REQUIRED: Directory to save downloaded files
  --category [all|reference|phylotree]  Category of resources (default: all)
  --resource TEXT                       Specific resource(s) to download
  --force                               Overwrite existing files
  --check-updates                       Check for updates without downloading
  --list                                List available resources

Available Resources:

Resource                  Source                      Description

Reference Sequences
rsrs                      phylotree.org               Reconstructed Sapiens Reference Sequence
rcrs                      phylotree.org               Revised Cambridge Reference Sequence

mitoLeaf Phylotree Data
mitoleaf-tree             forensicgenomics.github.io  Complete phylotree in JSON format (~15MB)
mitoleaf-motifs           forensicgenomics.github.io  Haplogroup defining mutations
mitoleaf-representatives  forensicgenomics.github.io  Representative sequences per haplogroup

Examples:

# Download all resources
evehap download --outdir ./evehap_data/

# Download only reference sequences
evehap download --outdir ./evehap_data/ --category reference

# Download only phylotree data
evehap download --outdir ./evehap_data/ --category phylotree

# Download specific resource
evehap download --outdir ./evehap_data/ --resource mitoleaf-tree

# Check for updates to installed resources
evehap download --outdir ./evehap_data/ --check-updates

# Force re-download (update all)
evehap download --outdir ./evehap_data/ --force

# List all available resources
evehap download --outdir ./evehap_data/ --list

version

Show version information and installed resources.

evehap version [OPTIONS]

Options:
  --check-updates    Check for resource updates

Output shows:

  • Package and data directory locations
  • Default tree and reference paths
  • Bundled phylotrees with sizes
  • Downloaded resources with sizes and dates

Example:

$ evehap version
eveHap v0.1.0
Package directory: /path/to/evehap
Data directory: /path/to/evehap/data
Default tree: tree-rsrs.xml
Default reference: rCRS.fasta

Bundled phylotrees:
   tree-rsrs.xml (1837.7 KB)
   tree.xml (2369.2 KB)

Downloaded resources:

  [reference]
     rsrs (16.7 KB, 2025-01-01)
     rcrs (16.5 KB, 2025-01-01)

  [phylotree]
     mitoleaf-tree (15293.6 KB, 2025-01-01)
    ...

Python API

from evehap.adapters.bam import BAMAdapter
from evehap.core.classifier import Classifier
from evehap.core.phylotree import Phylotree

# Load phylotree (mitoLeaf JSON or Haplogrep XML - auto-detected)
phylotree = Phylotree.load("data/phylotree/mitoleaf/tree.json")

# Extract profile from BAM
adapter = BAMAdapter(reference_path="data/reference/rCRS.fasta")
profile = adapter.extract_profile("sample.bam")

# Classify
classifier = Classifier(phylotree)
result = classifier.classify(profile)

print(f"Haplogroup: {result.haplogroup}")
print(f"Confidence: {result.confidence:.1%}")
print(f"Quality: {result.quality}")

Supported Input Formats

Format      Extensions         Description
BAM/CRAM    .bam, .cram        Aligned sequencing reads
VCF         .vcf, .vcf.gz      Variant call format
FASTA       .fasta, .fa, .fna  Consensus sequences
HSD         .hsd               Haplogrep polymorphism format
Microarray  .txt, .csv         23andMe, AncestryDNA raw data

Classification Methods

Kulczynski Scoring (default for high-coverage)

Uses the Kulczynski similarity measure to score each haplogroup based on the proportion of expected mutations present in the sample. Best for samples with >80% mtDNA coverage.
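
As a rough illustration of the measure, the unweighted Kulczynski similarity between the mutations a haplogroup expects and the mutations found in a sample is the mean of two ratios: the share of expected mutations that were found, and the share of sample mutations that the haplogroup explains. The sketch below assumes plain sets of mutation strings with equal weights. The function name and the toy candidates are hypothetical, and eveHap's own scoring may weight mutations differently.

```python
def kulczynski_score(expected, observed):
    """Unweighted Kulczynski similarity between two sets of mutations."""
    expected, observed = set(expected), set(observed)
    if not expected or not observed:
        return 0.0
    shared = len(expected & observed)
    # Mean of "expected mutations found" and "sample mutations explained"
    return 0.5 * (shared / len(expected) + shared / len(observed))


# Toy candidates (hypothetical mutation lists, for illustration only)
candidates = {
    "A": {"73G", "263G", "750G"},
    "B": {"73G", "263G", "750G", "1438G"},
}
sample = {"73G", "263G", "750G", "1438G"}

# Rank candidates from best to worst score
ranked = sorted(
    candidates,
    key=lambda hg: kulczynski_score(candidates[hg], sample),
    reverse=True,
)
for hg in ranked:
    print(hg, round(kulczynski_score(candidates[hg], sample), 3))
```

Candidate "A" matches everything it expects but leaves one sample mutation unexplained, so it scores below "B", which accounts for all four.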

Tree Traversal (default for low-coverage)

Traverses the phylogenetic tree from root, evaluating support for each branch based on derived/ancestral allele counts. Best for ancient DNA or samples with sparse coverage.
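
The idea can be sketched as a greedy walk over a toy tree: at each node, count how many covered defining positions of each child show the derived allele versus another allele, and descend only while some child has more derived than ancestral calls. The `Node` class, the stopping rule, and the toy tree below are assumptions made for illustration, not eveHap's actual data structures or thresholds.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mutations: dict = field(default_factory=dict)  # position -> derived base
    children: list = field(default_factory=list)


def branch_support(node, profile):
    """Return (derived, ancestral) counts over the covered defining positions."""
    derived = sum(1 for pos, base in node.mutations.items() if profile.get(pos) == base)
    covered = sum(1 for pos in node.mutations if pos in profile)
    return derived, covered - derived


def traverse(root, profile):
    """Descend from the root while a child has more derived than ancestral calls."""
    current = root
    while True:
        best, best_margin = None, 0
        for child in current.children:
            derived, ancestral = branch_support(child, profile)
            margin = derived - ancestral
            if margin > best_margin:
                best, best_margin = child, margin
        if best is None:
            return current.name
        current = best


# Toy tree with hypothetical names and positions
b1 = Node("B1", {300: "A", 400: "C"})
b2 = Node("B2", {500: "T"})
root = Node("root", children=[Node("A", {100: "T"}), Node("B", {200: "G"}, [b1, b2])])

# Sparse profile: position -> observed base; positions 100 and 400 are not covered
profile = {200: "G", 300: "A", 500: "C"}
print(traverse(root, profile))
```

Uncovered positions count neither for nor against a branch, which is why this style of walk tolerates sparse coverage.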

Reference Sequences

eveHap supports two mtDNA reference sequences:

RSRS (Reconstructed Sapiens Reference Sequence)

The ancestral human mtDNA sequence reconstructed from phylogenetic analysis. The RSRS-based phylotree has the complete haplogroup structure starting from mtMRCA (mitochondrial Most Recent Common Ancestor).

rCRS (Revised Cambridge Reference Sequence)

The standard reference for modern mtDNA sequencing. Most BAM/VCF files are aligned to rCRS. It represents haplogroup H2a2a1 (a European lineage).

Note: eveHap automatically handles the translation between rCRS-aligned input files and the RSRS-based phylotree.
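
To make the idea of translating between references concrete, the sketch below re-expresses SNP calls made against one reference relative to another. It assumes SNPs only, and that every position where the two references differ is covered in the sample. The function name and the toy positions are hypothetical and do not reflect the real rCRS/RSRS differences or eveHap's internal implementation.

```python
def rebase_variants(sample_vs_old, old_vs_new):
    """Re-express SNP calls relative to a different reference.

    sample_vs_old: position -> sample base where the sample differs from the old reference
    old_vs_new:    position -> (old_base, new_base) where the two references differ
    Returns position -> sample base where the sample differs from the new reference.
    """
    result = {}
    # Where the references agree, calls carry over unchanged.
    for pos, base in sample_vs_old.items():
        if pos not in old_vs_new:
            result[pos] = base
    # Where the references differ, compare the sample base to the new reference.
    for pos, (old_base, new_base) in old_vs_new.items():
        sample_base = sample_vs_old.get(pos, old_base)  # no call means "matches old"
        if sample_base != new_base:
            result[pos] = sample_base
    return result


# Toy data (hypothetical positions)
old_vs_new = {10: ("A", "G"), 20: ("C", "T")}
sample_vs_old = {10: "G", 30: "T"}

print(sorted(rebase_variants(sample_vs_old, old_vs_new).items()))
```

Position 10 drops out because the sample matches the new reference there, while position 20 appears as a new variant because the sample carries the old reference base.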

Ancient DNA Damage Filtering

Ancient DNA exhibits characteristic damage patterns:

  • C→T substitutions at 5' read ends
  • G→A substitutions at 3' read ends

Use --damage-filter to automatically detect and filter potentially damaged bases.
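
A position-based filter of this kind can be sketched as follows. The example assumes each observation is a tuple of reference base, read base, 0-based offset from the 5' end of the read, and read length, all expressed on the read's own strand. The 5-base window and the function names are hypothetical choices for illustration; the criteria eveHap actually applies are not described here.

```python
def is_possible_damage(ref_base, read_base, pos_in_read, read_length, window=5):
    """Flag C->T near the 5' end or G->A near the 3' end of a read."""
    near_5p = pos_in_read < window
    near_3p = pos_in_read >= read_length - window
    if ref_base == "C" and read_base == "T" and near_5p:
        return True
    if ref_base == "G" and read_base == "A" and near_3p:
        return True
    return False


def filter_observations(observations, window=5):
    """Keep only observations that do not match the damage pattern."""
    return [obs for obs in observations if not is_possible_damage(*obs, window=window)]


observations = [
    ("C", "T", 1, 50),   # C->T at the 5' end: dropped
    ("C", "T", 25, 50),  # C->T mid-read: kept
    ("G", "A", 48, 50),  # G->A at the 3' end: dropped
    ("G", "A", 10, 50),  # G->A away from the 3' end: kept
    ("A", "G", 0, 50),   # not a damage-type substitution: kept
]
kept = filter_observations(observations)
print(len(kept), "of", len(observations), "observations kept")
```

Filtering this way gives up some genuine C→T and G→A variants near read ends in exchange for fewer damage-induced false calls.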

Output Formats

TSV (default)

sample_id	haplogroup	confidence	quality	coverage	depth	method	warnings
HGDP00582	H2a2a1g	1.0000	high	1.0000	19004.1	kulczynski	HETEROPLASMY
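
For downstream scripts, the TSV can be read with Python's standard csv module. The sketch below parses the example row above from an in-memory string and converts the numeric columns; the helper name `read_results` is hypothetical, and the warnings column is kept as a raw string because its format for multiple warnings is not specified here.

```python
import csv
import io

# The header and example row shown above, as tab-separated text
tsv_text = (
    "sample_id\thaplogroup\tconfidence\tquality\tcoverage\tdepth\tmethod\twarnings\n"
    "HGDP00582\tH2a2a1g\t1.0000\thigh\t1.0000\t19004.1\tkulczynski\tHETEROPLASMY\n"
)


def read_results(handle):
    """Parse eveHap TSV output into dicts with numeric columns converted."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for column in ("confidence", "coverage", "depth"):
            row[column] = float(row[column])
        rows.append(row)
    return rows


results = read_results(io.StringIO(tsv_text))
# Samples that carry any QC warning
flagged = [r["sample_id"] for r in results if r["warnings"]]
print(results[0]["haplogroup"], flagged)
```

To read a real file, pass an open file handle, for example the results.tsv written with -o, instead of the StringIO object.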

JSON

{
  "sample_id": "HGDP00582",
  "haplogroup": "H2a2a1g",
  "confidence": 1.0,
  "quality": "high",
  ...
}

Text

Sample: HGDP00582
  Haplogroup: H2a2a1g
  Confidence: 100.0%
  Quality: high
  Method: kulczynski
  Coverage: 100.0%

Pipeline Integration

eveHap is designed for batch processing in pipelines on HPC and cloud environments. All resource paths must be specified explicitly; nothing is downloaded automatically.

Using Bundled Resources (Offline-Ready)

# Bundled resources work without internet access
evehap classify sample.bam --tree rsrs --reference rcrs

Using Downloaded Resources

# Download resources to a directory
evehap download --outdir ./evehap_data/

# Use downloaded files explicitly
evehap classify sample.bam \
    --tree ./evehap_data/mitoleaf_tree.json \
    --reference ./evehap_data/rCRS.fasta
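
When calling eveHap from a Python-based workflow, it can help to assemble the command line in one place so that the tree and reference are always passed explicitly. The helper below is a hypothetical sketch that uses only the classify options documented above; it builds the argument list and prints it rather than running anything.

```python
import shlex


def build_classify_command(inputs, tree, reference, output=None,
                           damage_filter=False, method=None):
    """Build an argument list for `evehap classify` with explicit resources."""
    cmd = ["evehap", "classify", *inputs, "--tree", tree, "--reference", reference]
    if output:
        cmd += ["-o", output]
    if damage_filter:
        cmd.append("--damage-filter")
    if method:
        cmd += ["--method", method]
    return cmd


cmd = build_classify_command(
    ["ancient.bam"],
    tree="rsrs",
    reference="rcrs",
    output="results.tsv",
    damage_filter=True,
    method="traversal",
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where evehap is installed.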

Nextflow

# Using bundled resources
nextflow run pipelines/nextflow/main.nf \
    --input 'samples/*.bam' \
    --tree rsrs \
    --reference rcrs \
    --outdir results/

# Using downloaded resources
nextflow run pipelines/nextflow/main.nf \
    --input 'samples/*.bam' \
    --tree ./evehap_data/mitoleaf_tree.json \
    --reference ./evehap_data/rCRS.fasta \
    --outdir results/

# SLURM cluster
nextflow run pipelines/nextflow/main.nf \
    --input 'samples/*.bam' \
    --tree rsrs \
    --reference rcrs \
    -profile slurm

Snakemake

cd pipelines/snakemake

# Using bundled resources
snakemake --cores 4 --config input_dir=../../samples tree=rsrs reference=rcrs

# SLURM cluster
snakemake --profile profiles/slurm --config tree=rsrs reference=rcrs

See pipelines/README.md for complete documentation.

Testing

# Run all tests
pytest tests/

# Run fast tests only (exclude slow benchmarks)
pytest tests/ -m "not slow"

# Run with verbose output
pytest tests/ -v

Project Structure

evehap/
├── evehap/
│   ├── adapters/       # Input format adapters
│   │   ├── bam.py      # BAM/CRAM adapter
│   │   ├── vcf.py      # VCF adapter
│   │   ├── fasta.py    # FASTA adapter
│   │   ├── hsd.py      # HSD adapter
│   │   └── microarray.py  # 23andMe/Ancestry adapter
│   ├── core/           # Core classification logic
│   │   ├── classifier.py  # Classifier algorithms
│   │   ├── damage.py   # Ancient DNA damage filtering
│   │   ├── phylotree.py   # Phylotree parser
│   │   └── profile.py  # AlleleProfile data structures
│   ├── output/         # Output formatting
│   │   ├── report.py   # Report generation
│   │   └── result.py   # Result data structures
│   └── cli.py          # Command-line interface
├── data/
│   ├── phylotree/      # Phylotree data files
│   │   ├── tree-rsrs.xml  # RSRS-based tree (default, complete phylogeny)
│   │   ├── tree.xml       # rCRS-based tree
│   │   └── mitoleaf/      # mitoLeaf data (downloaded)
│   │       ├── tree.json          # Complete phylotree (JSON)
│   │       ├── hgmotifs.json      # Haplogroup motifs
│   │       └── mito_representatives.csv
│   └── reference/      # Reference sequences (downloaded)
│       ├── RSRS.fasta  # Reconstructed Sapiens Reference Sequence
│       └── rCRS.fasta  # Revised Cambridge Reference Sequence
└── tests/              # Test suite

Requirements

  • Python 3.8+
  • pysam
  • click
  • numpy

License

PolyForm Noncommercial License 1.0.0

This software is free for noncommercial use, including academic research, personal projects, and use by nonprofit organizations. See LICENSE for full terms.

For commercial licensing inquiries, please contact the authors.

Citation

If you use eveHap in your research, please cite both eveHap and the foundational Haplogrep 3 work:

Haplogrep 3 (classification methodology and phylotree data):

Schönherr S, Weissensteiner H, Kronenberg F, Forer L. Haplogrep 3 - an interactive haplogroup classification and analysis platform. Nucleic Acids Res. 2023. https://doi.org/10.1093/nar/gkad284

eveHap (this implementation):

eveHap: Universal mtDNA Haplogroup Classifier https://doi.org/10.5281/zenodo.18305868

Acknowledgments

eveHap implements haplogroup classification algorithms originally developed by the Haplogrep team at the Medical University of Innsbruck. We are grateful for their foundational work in making mtDNA haplogroup classification accessible and rigorous.
